# Attention Layer: MLA Absorb Path (No Indexer)

【Free download link】cannbot-skills — CANNBot is a series of agents for CANN development aimed at improving development efficiency; this repository provides its reusable Skills modules. Project URL: https://gitcode.com/cann/cannbot-skills

## Reference models

`cann-recipes-infer/models/deepseek_r1/`, `cann-recipes-infer/models/kimi-k2-thinking/`

## Core characteristics

- Q absorbs `W_uk`, reducing the attention dimension to `kv_lora_rank`.
- The KV cache stores only the latent vectors.
- Prefill and Decode take different operator paths.

## Prefill path

Reference: `deepseek-r1/forward_page_attention_normal`

```python
# ─── Pre-Norm ───
hidden_states, residual = npu_add_rms_norm(residual, hidden_states, weight, eps)

# ─── Q projection ───
q = q_b_proj(npu_rms_norm(q_a_proj(x)))
q_nope, q_pe = q.split([qk_nope_head_dim, qk_rope_head_dim])

# ─── KV low-rank projection ───
latent_cache = kv_a_proj_with_mqa(x)

# ─── KV RMSNorm + RoPE + cache write (three-in-one fusion) ───
k_rope, k_nope = npu_kv_rmsnorm_rope_cache_v2(
    latent_cache, kv_a_layernorm.weight, cos, sin, slot_mapping,
    rope_cache, nope_cache, cache_mode="PA_NZ", is_output_kv=True)

# ─── Q RoPE ───
q_pe = npu_interleave_rope(q_pe, cos, sin)

# ─── Expand K/V (back-project latent to full dim; non-absorb) ───
k_full = matmul(k_nope, kv_b_proj_w_k)
v_full = matmul(k_nope, kv_b_proj_w_v)

# ─── Flash Attention (standard, v1) ───
output = npu_fused_infer_attention_score(
    q_nope, k_full, v_full, query_rope=q_pe, key_rope=k_rope,
    input_layout="NTD_TND", sparse_mode=3)

# ─── O projection ───
output = o_proj(output)
```

## Decode path (recommended)

Reference: `deepseek-r1/forward_page_attention_mla_prolog`

```python
# ─── Pre-Norm ───
hidden_states, residual = npu_add_rms_norm(residual, hidden_states, weight, eps)

# ─── Optional: npu_dynamic_quant to quantize hidden_states (W8A8 scenario) ───

# ─── Super-fused prolog (QKV projection + LayerNorm + RoPE + Q absorb + KV cache write) ───
q_nope, q_pe, dequant_scale, _, _ = npu_mla_prolog_v3(
    token_x=hidden_states,
    weight_dq=q_a_proj.weight,
    weight_uq_qr=q_b_proj.weight,
    weight_uk=kv_b_proj_w_k,        # Q absorb: W_uk absorbed into Q
    weight_dkv_kr=kv_a_proj.weight,
    rmsnorm_gamma_cq=...,
    rmsnorm_gamma_ckv=...,
    kv_cache=nope_cache,
    kr_cache=rope_cache,
    cache_index=slot_mapping,
    cache_mode="PA_NZ",
    weight_quant_mode=2,            # 0: no quant  1: qb only  2: full int8
    kv_cache_quant_mode=1)          # 0: none  1: per-tensor

# ─── Flash Attention v2 (after absorb, key == value == latent cache) ───
output = npu_fused_infer_attention_score_v2(
    q_nope, k_nope_cache, k_nope_cache,   # key and value are both the latent
    query_rope=q_pe, key_rope=k_rope_cache,
    block_table=block_table,
    actual_seq_kvlen=actual_seq_lengths_kv,
    input_layout="TND_NTD", sparse_mode=0)

# ─── V absorb (back-project with W_v after FA) ───
output = npu_transpose_batchmatmul(output, kv_b_proj_w_v)

# ─── O projection ───
output = o_proj(output)
```

## Prefill vs Decode: key differences

| Stage | Prefill | Decode |
|---|---|---|
| QKV Norm + RoPE + cache write | Step-by-step manual projection + `npu_kv_rmsnorm_rope_cache_v2` | Super-fused `npu_mla_prolog_v3`, done in one step |
| K/V form | Expanded to full dim: `matmul(latent, W_k/W_v)` | Not expanded; kept as latent (absorb) |
| FA operator | `npu_fused_infer_attention_score` (v1) | `npu_fused_infer_attention_score_v2` |
| V absorb | None (Prefill already expanded V) | `npu_transpose_batchmatmul(output, W_v)` |
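Why key and value can both be the latent cache in the decode path follows from associativity: the score `q · (latent @ W_uk)` equals `(W_uk @ q) · latent`, and the output `attn @ (latent @ W_uv)` equals `(attn @ latent) @ W_uv`, so `W_uk` folds into Q before attention and `W_uv` is applied once after it. The equivalence can be checked with a small NumPy sketch (shapes and names are illustrative, not the CANN API; single head, RoPE part omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_nope, r, T = 128, 512, 16      # head dim, kv_lora_rank, cached tokens

q_nope = rng.standard_normal(d_nope)        # one query head, nope part only
latent = rng.standard_normal((T, r))        # the KV cache stores only this
W_uk   = rng.standard_normal((r, d_nope))   # latent -> key up-projection
W_uv   = rng.standard_normal((r, d_nope))   # latent -> value up-projection

# Non-absorb (prefill-style): expand latent to full K/V, then attend.
k_full = latent @ W_uk                      # (T, d_nope)
v_full = latent @ W_uv                      # (T, d_nope)
scores = k_full @ q_nope                    # (T,)
attn = np.exp(scores - scores.max()); attn /= attn.sum()
out_ref = attn @ v_full                     # (d_nope,)

# Absorb (decode-style): Q absorbs W_uk, attention runs in rank-r space,
# and W_uv ("V absorb") is applied once, after the weighted sum.
q_abs = W_uk @ q_nope                       # (r,)  W_uk absorbed into Q
scores2 = latent @ q_abs                    # (T,)  key == latent
attn2 = np.exp(scores2 - scores2.max()); attn2 /= attn2.sum()
out_abs = (attn2 @ latent) @ W_uv           # value == latent, W_uv after FA

assert np.allclose(out_ref, out_abs)
```

The payoff is that the per-token attention inner loop touches only `kv_lora_rank`-sized vectors, and the cache never stores full-dimension K/V.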
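The `slot_mapping` / `block_table` arguments in both paths come from paged KV-cache management. Conceptually, the fused cache-write step scatters each new token's latent vector into a physical block by a flat slot id, and `block_table` later maps a sequence's logical blocks back to physical ones at read time. A minimal sketch of the write side, assuming the common `slot = block_id * block_size + offset` convention (not the actual operator internals):

```python
import numpy as np

block_size, num_blocks, r = 4, 8, 6
nope_cache = np.zeros((num_blocks, block_size, r))   # paged latent cache

def write_latent(cache, latents, slot_mapping):
    # Scatter each token's latent into (block, offset) derived from its slot id.
    for latent, slot in zip(latents, slot_mapping):
        cache[slot // block_size, slot % block_size] = latent
    return cache

latents = np.arange(3 * r, dtype=float).reshape(3, r)
slot_mapping = np.array([5, 6, 9])   # block 1 offsets 1-2, block 2 offset 1
write_latent(nope_cache, latents, slot_mapping)

assert np.array_equal(nope_cache[1, 1], latents[0])
assert np.array_equal(nope_cache[2, 1], latents[2])
```

Because only the rank-`r` latent is written, cache capacity per token is `kv_lora_rank` (plus the small RoPE part in `rope_cache`) rather than full per-head K and V.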
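The Q RoPE step uses an interleaved layout: rotation is applied to adjacent (even, odd) element pairs. A sketch of that formulation in NumPy (an assumption about the layout for illustration; the exact semantics of `npu_interleave_rope` are defined by the operator documentation):

```python
import numpy as np

def interleave_rope(x, cos, sin):
    # Rotate adjacent (even, odd) pairs: the interleaved RoPE layout.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

d, pos, theta = 8, 3, 10000.0
inv_freq = theta ** (-np.arange(0, d, 2) / d)   # one frequency per pair
cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)

q_pe = np.random.default_rng(1).standard_normal(d)
q_rot = interleave_rope(q_pe, cos, sin)

# Each pair is rotated by a pure rotation, so the vector norm is preserved.
assert np.isclose(np.linalg.norm(q_rot), np.linalg.norm(q_pe))
```

In the MLA split, only the `qk_rope_head_dim` slice (`q_pe` / `k_rope`) carries position information; the `nope` slice stays position-free, which is what lets it be cached and absorbed freely.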