CANNBot Skills: A2 Triple Bridge Pattern


张建站

2026/5/9 15:18:53

10 min read

# a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

[Free download link] cannbot-skills: CANNBot is a series of agents for improving CANN development efficiency; this repository provides their reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

- one cube stage that produces a score tile
- vec logic that updates running row state and emits a delayed cube input
- a later cube stage that consumes that delayed tile
- a final vec stage that accumulates the delayed cube output

Typical target formula:

    score_j = q.float() @ k_j.float().t() * scale
    curr_m = maximum(prev_m, rowmax(score_j))
    expdiff_j = exp(prev_m - curr_m)
    p_j = exp(score_j - curr_m).half()
    pv_j = p_j.float() @ v_j.float()
    out = out * expdiff_j + pv_j

This is **not** normalized online softmax. It keeps only a running max and a rescaled numerator. There is no running sum and no final divide. If you need a running `row_sum` and a final `out / row_sum`, switch to `agent/references/patterns/a2-cube-vec-cube-vec-softmax.md`.

## Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

- cube -> vec cannot use `l0c_to_ub`
- vec -> cube cannot use `ub_to_l1_*`
- the delayed cube output must return to vec for the final accumulation

So the stable data path is:

    GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score)
    -> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

## Workspaces and ownership edges

Use three GM workspaces:

- `score_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `L0C(score) -> UB(score)`
- `p_ws`
  - dtype: `half`
  - shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
  - purpose: `UB(p_j) -> L1(p_j)`
- `pv_ws`
  - dtype: `float`
  - shape: `[GetCubeNum(), 2, TILE_M, D]`
  - purpose: `L0C(pv_j) -> UB(pv_j)`

Ownership edges:

- stage 1 cube -> vec: `CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`
- stage 1 vec -> stage 2 cube: `VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`
- stage 2 cube -> stage 3 vec: `CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`

## Stable schedule

Use one-tile lookahead:

    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1

This gives:

- warmup: the first iteration only produces
- steady state: produce `j` while consuming `j - 1`
- drain: the final iteration only consumes the last delayed tile

## Shared `L0C` rule

Reuse one physical `L0C` family across the two cube stages.

Why this is the stable a2 choice here:

- stage 1 writes a full float `[TILE_M, TILE_N]` score tile
- stage 2 writes a full float `[TILE_M, D]` `pv_j` tile with the same validated `D = 128`
- a2 only has `128 KB` of `L0C`, so a second full float family would be a misleading design target

Stable ownership story:

- keep one `l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)`
- let stage 1 publish `score_ws` before stage 2 reuses that slot
- let stage 2 publish `pv_ws` before the next stage-1 reuse
- advance one shared `l0c_cnt`

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical `L0C` family is shared. Other stage-owned lifetimes stay separate.

## Counter layout

Keep these lifetimes separate:

- `l1qk_cnt`: stage-1 `q/k` loads
- `l1pv_cnt`: stage-2 `p/v` loads
- `l0c_cnt`: shared physical `L0C` family across the two cube stages
- `stage1_cnt`: delayed slot rhythm for `score_ws`, `p_ws`, and `expdiff`
- `stage2_cnt`: delayed slot rhythm for `p_ws` consumption and `pv_ws`

Do not hide the delayed accumulator lifetime behind `stage1_cnt`.

## Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

- running row max: `[HALF_M, 1]`
- delayed `expdiff` slots: `DBuff(DT.float, [HALF_M, 1], Position.UB)`
- final numerator accumulation: `[HALF_M, D]`

Use `GetSubBlockIdx()` so each vec lane owns only its own `HALF_M` rows.

## Critical scalar-state rule on a2

Do **not** copy `[HALF_M, 1]` scalar-format state with `ub_to_ub`.

Reason:

- `ub_to_ub` infers burst length in units of `C0` blocks
- for `[64, 1]` float views, that means copying 8 elements per row
- this silently miscopies row-scalar state such as `prev_m`

Stable fix:

- keep scalar state in `[HALF_M, 1]`
- copy it with a vec binary op that respects the `[M,1]` stride model, for example:

      dup(ub_zero_s, 0.0)
      add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)

Then update or transform that copied buffer with more vec ops.

## Delayed `expdiff` handling

`expdiff_j` belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

1. stage 1 copies `prev_m` into the delayed `expdiff` slot
2. stage 1 updates the running max
3. stage 1 overwrites the delayed slot with `exp(prev_m - curr_m)`
4. stage 3 later reads that same slot and broadcasts it before scaling `accum`

Use `stage1_cnt` parity for the write slot and `stage2_cnt` parity for the read slot.

## Final vec accumulation

After loading `pv_j` back into UB:

1. `brcb` the delayed `expdiff` slot to `[HALF_M, 8]`
2. scale `accum[:, 0:64]`
3. scale `accum[:, 64:128]`
4. `add(accum, accum, pv_j)`

Why sliced scaling is required:

- `accum` is wide (`[HALF_M, 128]`)
- the `expdiff` broadcast is narrow (`[HALF_M, 8]`)
- follow the same sliced-row rule used for row-max subtraction

## Validation target

Keep the first validated contract narrow:

- `D = 128`
- `S1 % 128 == 0`
- `S2 % 128 == 0`
- inputs `q/k/v` are `float16`
- output is `float32`

Suggested cases:

- `(1, 1, 256, 512, 128)`
- `(1, 3, 256, 512, 128)`
- `(1, 3, 2048, 4096, 128)`

## Files to study

- `agent/example/kernels/a2/flash_attn_score_iter.py`
- `agent/example/kernels/a2/flash_attn_score_pv.py`
- `agent/example/kernels/a2/flash_attn_unnorm.py`
- `agent/references/patterns/a2-cube-vec.md`
- `agent/references/patterns/a2-cube-vec-cube.md`
- `agent/references/constraints/a2-device.md`
- `agent/references/constraints/vec-reduction-a2.md`
- `agent/references/constraints/vec-stride.md`

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
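The target formula, the one-tile lookahead schedule, and the `stage1_cnt` / `stage2_cnt` slot parity described above can be checked on the host before touching device code. The following is a minimal NumPy sketch, not the kernel itself: the function names `direct_unnormalized` and `tiled_delayed`, the `p_dtype` parameter, and the use of Python lists as stand-ins for the `p_ws` workspace and the delayed `expdiff` slots are all illustrative assumptions. It only models the arithmetic and slot rhythm for a single head, not the cube/vec split, mutexes, or memory hierarchy.

```python
import numpy as np


def direct_unnormalized(q, k, v, scale):
    """Un-tiled reference: exp(score - rowmax(score)) @ v, with no row-sum divide."""
    score = (q.astype(np.float32) @ k.astype(np.float32).T) * scale
    m = score.max(axis=1, keepdims=True)
    return np.exp(score - m) @ v.astype(np.float32)


def tiled_delayed(q, k, v, scale, tile_n=128, p_dtype=np.float16):
    """Host model of the delayed numerator accumulation (hypothetical helper).

    Stage 1 produces tile j = ni; stages 2 and 3 consume tile j = ni - 1.
    """
    s1, d = q.shape
    tiles_n = k.shape[0] // tile_n
    qf = q.astype(np.float32)

    run_max = np.full((s1, 1), -np.inf, dtype=np.float32)  # running row max
    out = np.zeros((s1, d), dtype=np.float32)               # numerator accumulator
    expdiff_slots = [None, None]  # stand-in for the delayed expdiff DBuff
    p_slots = [None, None]        # stand-in for the two p_ws slots
    stage1_cnt = 0
    stage2_cnt = 0

    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni
            kj = k[ni * tile_n:(ni + 1) * tile_n].astype(np.float32)
            score = (qf @ kj.T) * scale
            slot = stage1_cnt % 2                      # write-slot parity
            expdiff_slots[slot] = run_max.copy()       # copy prev_m into the slot
            run_max = np.maximum(run_max, score.max(axis=1, keepdims=True))
            expdiff_slots[slot] = np.exp(expdiff_slots[slot] - run_max)
            p_slots[slot] = np.exp(score - run_max).astype(p_dtype)
            stage1_cnt += 1
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1
            j = ni - 1
            slot = stage2_cnt % 2                      # read-slot parity
            vj = v[j * tile_n:(j + 1) * tile_n].astype(np.float32)
            pv = p_slots[slot].astype(np.float32) @ vj
            out = out * expdiff_slots[slot] + pv
            stage2_cnt += 1
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((256, 128)).astype(np.float16)
    k = rng.standard_normal((512, 128)).astype(np.float16)
    v = rng.standard_normal((512, 128)).astype(np.float16)
    ref = direct_unnormalized(q, k, v, 128 ** -0.5)
    got = tiled_delayed(q, k, v, 128 ** -0.5)
    print("max abs diff vs direct formula:", float(np.max(np.abs(got - ref))))
```

Within one steady-state iteration the producer writes slot `stage1_cnt % 2` while the consumer reads the other slot, which is why the delayed `expdiff` and `p` values for tile `j - 1` survive the production of tile `j`. Passing `p_dtype=np.float32` removes the half-precision rounding of `p_j`, which can help separate scheduling mistakes from expected precision loss when comparing against a device result.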
