# a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development that improve development efficiency; this repository provides their reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

- one cube stage that produces a score tile
- vec logic that updates running row state and emits a delayed cube input
- a later cube stage that consumes that delayed tile
- a final vec stage that accumulates the delayed cube output

Typical target formula:

```python
score_j = q.float() @ k_j.float().t() * scale
curr_m = maximum(prev_m, rowmax(score_j))
expdiff_j = exp(prev_m - curr_m)
p_j = exp(score_j - curr_m).half()
pv_j = p_j.float() @ v_j.float()
out = out * expdiff_j + pv_j
```

This is **not** normalized online softmax. It keeps running max and a rescaled numerator only. There is no running sum or final divide. If you need running `row_sum` and a final `out / row_sum`, switch to `agent/references/patterns/a2-cube-vec-cube-vec-softmax.md`.

## Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

- cube -> vec cannot use `l0c_to_ub`
- vec -> cube cannot use `ub_to_l1_*`
- the delayed cube output must return to vec for the final accumulation

So the stable data path is:

```text
GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score)
-> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)
```

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

## Workspaces and ownership edges

Use three GM workspaces:

- `score_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `L0C(score) -> UB(score)`
- `p_ws`
  - dtype: `half`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `UB(p_j) -> L1(p_j)`
- `pv_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, D]`
  - purpose: `L0C(pv_j) -> UB(pv_j)`

Ownership edges:

- stage 1 cube -> vec: `CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`
- stage 1 vec -> stage 2 cube: `VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`
- stage 2 cube -> stage 3 vec: `CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`

## Stable schedule

Use
one-tile lookahead:

```python
for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        # stage 1: produce tile j = ni
        ...
    if ni > 0:
        # stage 2 + stage 3: consume tile j = ni - 1
        ...
```

This gives:

- warmup: first iteration only produces
- steady state: produce `j` while consuming `j - 1`
- drain: final iteration only consumes the last delayed tile

## Shared `L0C` rule

Reuse one physical `L0C` family across the two cube stages.

Why this is the stable a2 choice here:

- stage 1 writes a full float `[TILE_M, TILE_N]` score tile
- stage 2 writes a full float `[TILE_M, D]` `pv_j` tile with the same validated `D = 128`
- a2 only has `128 KB` of `L0C`, so a second full float family would be a misleading design target

Stable ownership story:

- keep one `l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)`
- let stage 1 publish `score_ws` before stage 2 reuses that slot
- let stage 2 publish `pv_ws` before the next stage-1 reuse
- advance one shared `l0c_cnt`

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical `L0C` family is shared. Other stage-owned lifetimes stay separate.

## Counter layout

Keep these lifetimes separate:

- `l1qk_cnt`: stage-1 `q/k` loads
- `l1pv_cnt`: stage-2 `p/v` loads
- `l0c_cnt`: shared physical `L0C` family across the two cube stages
- `stage1_cnt`: delayed slot rhythm for `score_ws`, `p_ws`, and `expdiff`
- `stage2_cnt`: delayed slot rhythm for `p_ws` consumption and `pv_ws`

Do not hide the delayed accumulator lifetime behind `stage1_cnt`.

## Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

- running row max: `[HALF_M, 1]`
- delayed `expdiff` slots: `DBuff(DT.float, [HALF_M, 1], Position.UB)`
- final numerator accumulation: `[HALF_M, D]`

Use `GetSubBlockIdx()` so each vec lane owns only its own `HALF_M` rows.

## Critical scalar-state rule on a2

Do **not** copy `[HALF_M, 1]` scalar-format state with `ub_to_ub`.

Reason:

- `ub_to_ub` infers burst length in units of `C0` blocks
- for `[64, 1]` float views, that means copying 8 elements per row
- this silently miscopies row-scalar state such as `prev_m`

Stable fix:

- keep scalar state in `[HALF_M, 1]`
- copy it with a vec binary op that respects the `[M,1]` stride model, for
example:

```python
dup(ub_zero_s, 0.0)
add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)
```

Then update or transform that copied buffer with more vec ops.

## Delayed `expdiff` handling

`expdiff_j` belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

1. stage 1 copies `prev_m` into the delayed `expdiff` slot
2. stage 1 updates running max
3. stage 1 overwrites the delayed slot with `exp(prev_m - curr_m)`
4. stage 3 later reads that same slot and broadcasts it before scaling `accum`

Use `stage1_cnt` parity for the write slot and `stage2_cnt` parity for the read slot.

## Final vec accumulation

After loading `pv_j` back into UB:

1. `brcb` the delayed `expdiff` slot to `[HALF_M, 8]`
2. scale `accum[:, 0:64]`
3. scale `accum[:, 64:128]`
4. `add(accum, accum, pv_j)`

Why sliced scaling is required:

- `accum` is wide (`[HALF_M, 128]`)
- the `expdiff` broadcast is narrow (`[HALF_M, 8]`)
- follow the same sliced-row rule used for row-max subtraction

## Validation target

Keep the first validated contract narrow:

- `D = 128`
- `S1 % 128 == 0`
- `S2 % 128 == 0`
- inputs `q/k/v` are `float16`
- output is `float32`

Suggested cases:

- `(1, 1, 256, 512, 128)`
- `(1, 3, 256, 512, 128)`
- `(1, 3, 2048, 4096, 128)`

## Files to study

- `agent/example/kernels/a2/flash_attn_score_iter.py`
- `agent/example/kernels/a2/flash_attn_score_pv.py`
- `agent/example/kernels/a2/flash_attn_unnorm.py`
- `agent/references/patterns/a2-cube-vec.md`
- `agent/references/patterns/a2-cube-vec-cube.md`
- `agent/references/constraints/a2-device.md`
- `agent/references/constraints/vec-reduction-a2.md`
- `agent/references/constraints/vec-stride.md`

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
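
As a host-side sanity check of the target formula and the one-tile lookahead schedule described above, the recurrence can be sketched in plain Python. This is not easyasc code: the helper names (`matmul`, `transpose`, `delayed_numerator`, `direct_numerator`) are hypothetical, the two-element lists only stand in for the double-buffered `p_ws` and `expdiff` slots, and the `float16` cast of `p_j` is omitted. Under those assumptions, the delayed accumulation should agree with a one-shot numerator computed against the global row max.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    # Plain row-by-column product; b is [K, N].
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def delayed_numerator(q, k_tiles, v_tiles, scale):
    """Running-max numerator with one-tile lookahead (no row_sum, no divide)."""
    rows, d = len(q), len(v_tiles[0][0])
    tiles_n = len(k_tiles)
    prev_m = [float("-inf")] * rows
    accum = [[0.0] * d for _ in range(rows)]
    p_slots = [None, None]        # stand-in for the two p_ws slots
    expdiff_slots = [None, None]  # stand-in for the delayed expdiff DBuff
    stage1_cnt = stage2_cnt = 0
    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni into the stage1_cnt parity slot
            score = [[s * scale for s in row]
                     for row in matmul(q, transpose(k_tiles[ni]))]
            curr_m = [max(pm, max(row)) for pm, row in zip(prev_m, score)]
            slot = stage1_cnt % 2
            expdiff_slots[slot] = [math.exp(pm - cm)
                                   for pm, cm in zip(prev_m, curr_m)]
            p_slots[slot] = [[math.exp(s - cm) for s in row]
                             for row, cm in zip(score, curr_m)]
            prev_m = curr_m
            stage1_cnt += 1
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1 from the
            # stage2_cnt parity slot, then rescale and accumulate
            slot = stage2_cnt % 2
            pv = matmul(p_slots[slot], v_tiles[ni - 1])
            accum = [[a * e + x for a, x in zip(arow, pvrow)]
                     for arow, pvrow, e in zip(accum, pv, expdiff_slots[slot])]
            stage2_cnt += 1
    return accum, prev_m


def direct_numerator(q, k_tiles, v_tiles, scale):
    """One-shot reference: exp(score - global row max) @ v."""
    k_all = [row for tile in k_tiles for row in tile]
    v_all = [row for tile in v_tiles for row in tile]
    score = [[s * scale for s in row] for row in matmul(q, transpose(k_all))]
    p = [[math.exp(s - max(row)) for s in row] for row in score]
    return matmul(p, v_all)


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q = rand_matrix(rng, 4, 3)
k_tiles = [rand_matrix(rng, 2, 3) for _ in range(3)]
v_tiles = [rand_matrix(rng, 2, 3) for _ in range(3)]
got, final_m = delayed_numerator(q, k_tiles, v_tiles, scale=0.5)
want = direct_numerator(q, k_tiles, v_tiles, scale=0.5)
max_err = max(abs(g - w) for grow, wrow in zip(got, want)
              for g, w in zip(grow, wrow))
print("max abs error vs direct numerator:", max_err)
```

Because tile `j` is written to slot `stage1_cnt % 2` in the same iteration that tile `j - 1` is read from slot `stage2_cnt % 2`, the producer and consumer never touch the same slot, which is the property the separate counters are meant to preserve.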