# CANN VGGT Model NPU Optimization
NPU inference optimization practice for the VGGT model.

【Free download link】cann-recipes-embodied-intelligence: this project provides CANN-based optimization samples for typical models and acceleration algorithms in embodied-intelligence workloads. Project address: https://gitcode.com/cann/cann-recipes-embodied-intelligence

This article describes NPU-based inference optimization strategies for the VGGT model, covering the following points:

- Cos/Sin operator input optimization
- Rotary embedding optimization: enabling the NPU `npu_rotary_mul` fused operator and eliminating redundant computation
- DPT head positional-embedding optimization
- Enabling the NPU `npu_add_layer_norm` fused operator
- Converting weights to BF16
- Pre-converting 2D convolution kernels to the NPU private format
- Enabling the NPU `npu_fused_infer_attention_score` fused operator
- Ulysses and Ring Attention sequence-parallel inference
- Performance results

## Cos/Sin operator optimization

**Why.** In the original network, the inputs to the Cos and Sin operators have dtype `double`, so both operators are dispatched to the AI CPU, where they execute slowly.

**How.** In `vggt/heads/utils.py`, change the dtype of the `omega` variable in `make_sincos_pos_embed` from `torch.double` to `torch.bfloat16`.

Original:

```python
def make_sincos_pos_embed(embed_dim: int, pos: torch.Tensor, omega_0: float = 100) -> torch.Tensor:
    assert embed_dim % 2 == 0
    device = pos.device
    omega = torch.arange(embed_dim // 2, dtype=torch.float32 if device.type == "mps" else torch.double, device=device)
    omega /= embed_dim / 2.0
    omega = 1.0 / omega_0**omega  # (D/2,)
    pos = pos.reshape(-1)  # (M,)
    out = torch.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
    emb_sin = torch.sin(out)  # (M, D/2)
    emb_cos = torch.cos(out)  # (M, D/2)
    emb = torch.cat([emb_sin, emb_cos], dim=1)  # (M, D)
    return emb.float()
```

Optimized:

```python
def make_sincos_pos_embed(embed_dim: int, pos: torch.Tensor, omega_0: float = 100) -> torch.Tensor:
    assert embed_dim % 2 == 0
    device = pos.device
    omega = torch.arange(embed_dim // 2, dtype=torch.bfloat16, device=device)
    omega /= embed_dim / 2.0
    omega = 1.0 / omega_0**omega  # (D/2,)
    pos = pos.reshape(-1)  # (M,)
    out = torch.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
    emb_sin = torch.sin(out)  # (M, D/2)
    emb_cos = torch.cos(out)  # (M, D/2)
    emb = torch.cat([emb_sin, emb_cos], dim=1)  # (M, D)
    return emb
```

## Rotary embedding optimization

### Enabling the npu_rotary_mul fused operator

**Why.** The original code implements the rotary operation as `(tokens * cos) + (self._rotate_features(tokens) * sin)`, which makes the host dispatch many small operators.

**How.** In `vggt/layers/rope.py`, rewrite `_apply_1d_rope` to use `npu_rotary_mul` in place of the original implementation.

Original:

```python
def _apply_1d_rope(
    self, tokens: torch.Tensor, positions: torch.Tensor, cos_comp: torch.Tensor, sin_comp: torch.Tensor
) -> torch.Tensor:
    # Embed positions with frequency components
    cos = F.embedding(positions, cos_comp)[:, None, :, :]
    sin = F.embedding(positions, sin_comp)[:, None, :, :]
    # Apply rotation
    return (tokens * cos) + (self._rotate_features(tokens) * sin)
```

Optimized:

```python
def _apply_1d_rope(
    self, tokens: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
) -> torch.Tensor:
    # Apply rotation
    return torch_npu.npu_rotary_mul(tokens, cos, sin, rotary_mode="half")
```

### Eliminating redundant computation

**Why.** In the rope module of VGGT's Attention layers, the cos and sin values are recomputed on every call:

- In `vggt/layers/rope.py`, `_apply_1d_rope` computes the cos and sin variables via `cos = F.embedding(positions, cos_comp)[:, None, :, :]` and `sin = F.embedding(positions, sin_comp)[:, None, :, :]`.
- In `vggt/layers/rope.py`, `forward` computes the maximum of the input `positions` variable via `max`.
- Every rope call on the q and k variables therefore recomputes cos and sin in both the vertical and horizontal dimensions, plus the maximum of `positions`.

**How.** The rotary computation depends on the `positions` variable, which is determined by the height and width of the input image. By caching the computed cos/sin values and the maximum of `positions` in a dictionary keyed by `(height, width)`, repeated inputs with same-sized images avoid the redundant computation.

Original:

```python
def _compute_frequency_components(
    self, dim: int, seq_len: int, device: torch.device, dtype: torch.dtype
) -> Tuple[torch.Tensor, torch.Tensor]:
    cache_key = (dim, seq_len, device, dtype)
    if cache_key not in self.frequency_cache:
        # Compute frequency bands
        exponents = torch.arange(0, dim, 2, device=device).float() / dim
        inv_freq = 1.0 / (self.base_frequency**exponents)
        # Generate position-dependent frequencies
        positions = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
        angles = torch.einsum("i,j->ij", positions, inv_freq)
        # Compute and cache frequency components
        angles = angles.to(dtype)
        angles = torch.cat((angles, angles), dim=-1)
        cos_components = angles.cos().to(dtype)
        sin_components = angles.sin().to(dtype)
        self.frequency_cache[cache_key] = (cos_components, sin_components)
    return self.frequency_cache[cache_key]
```

Optimized:

```python
def _compute_frequency_components(
    self, dim: int, input_positions: torch.Tensor, device: torch.device, dtype: torch.dtype,
    height=None, width=None, batch_size=None
) -> Tuple[torch.Tensor, torch.Tensor]:
    if height is None or width is None:
        seq_len = int(input_positions.max()) + 1
    else:
        hw_key = (height, width)
        if hw_key not in self.max_position_cache:
            self.max_position_cache[hw_key] = int(input_positions.max()) + 1
        seq_len = self.max_position_cache[hw_key]
    cache_key = (dim, seq_len, device, dtype)
    if cache_key not in self.frequency_cache:
        # Compute frequency bands
        exponents = torch.arange(0, dim, 2, device=device).float() / dim
        inv_freq = 1.0 / (self.base_frequency**exponents)
        # Generate position-dependent frequencies
        positions = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
        angles = torch.einsum("i,j->ij", positions, inv_freq)
        # Compute and cache frequency components
        angles = angles.to(dtype)
        angles = torch.cat((angles, angles), dim=-1)
        cos_components = angles.cos().to(dtype)
        sin_components = angles.sin().to(dtype)
        self.frequency_cache[cache_key] = (cos_components, sin_components)
    cos_components, sin_components = self.frequency_cache[cache_key]
    if height is None or width is None or batch_size is None:
        vertical_cos = F.embedding(input_positions[..., 0], cos_components)[:, None, :, :]
        vertical_sin = F.embedding(input_positions[..., 0], sin_components)[:, None, :, :]
        horizontal_cos = F.embedding(input_positions[..., 1], cos_components)[:, None, :, :]
        horizontal_sin = F.embedding(input_positions[..., 1], sin_components)[:, None, :, :]
        return vertical_cos, vertical_sin, horizontal_cos, horizontal_sin
    sub_cache_key = (height, width)
    if sub_cache_key not in self.cos_sin_cache:
        # Embed positions with frequency components
        vertical_cos = F.embedding(input_positions[..., 0], cos_components)[:, None, :, :]
        vertical_sin = F.embedding(input_positions[..., 0], sin_components)[:, None, :, :]
        horizontal_cos = F.embedding(input_positions[..., 1], cos_components)[:, None, :, :]
        horizontal_sin = F.embedding(input_positions[..., 1], sin_components)[:, None, :, :]
        self.cos_sin_cache[sub_cache_key] = (vertical_cos, vertical_sin, horizontal_cos, horizontal_sin)
    vertical_cos, vertical_sin, horizontal_cos, horizontal_sin = self.cos_sin_cache[sub_cache_key]
    return vertical_cos, vertical_sin, horizontal_cos, horizontal_sin
```

## DPT head positional-embedding optimization

**Why.** In VGGT's DPT head, the positional embedding of the input is recomputed on every call, which is redundant.

**How.** The positional-embedding result depends only on the input image size and the token length, so cache the result in a dictionary and reuse it, avoiding the redundant computation.

Original:

```python
def _apply_pos_embed(self, x: torch.Tensor, W: int, H: int, ratio: float = 0.1) -> torch.Tensor:
    patch_w = x.shape[-1]
    patch_h = x.shape[-2]
    pos_embed = create_uv_grid(patch_w, patch_h, aspect_ratio=W / H, dtype=x.dtype, device=x.device)
    pos_embed = position_grid_to_embed(pos_embed, x.shape[1])
    pos_embed = pos_embed * ratio
    pos_embed = pos_embed.permute(2, 0, 1)[None].expand(x.shape[0], -1, -1, -1)
    return x + pos_embed
```

Optimized:

```python
def _apply_pos_embed(self, x: torch.Tensor, W: int, H: int, ratio: float = 0.1) -> torch.Tensor:
    if (W, H, x.shape) not in self.pos_embed_cache:
        patch_w = x.shape[-1]
        patch_h = x.shape[-2]
        pos_embed = create_uv_grid(patch_w, patch_h, aspect_ratio=W / H, dtype=x.dtype, device=x.device)
        pos_embed = position_grid_to_embed(pos_embed, x.shape[1])
        pos_embed = pos_embed * ratio
        pos_embed = pos_embed.permute(2, 0, 1)[None].expand(x.shape[0], -1, -1, -1)
        self.pos_embed_cache[(W, H, x.shape)] = pos_embed
    pos_embed = self.pos_embed_cache[(W, H, x.shape)]
    return x + pos_embed
```

## AddLayerNorm fused operator

**Why.** Replacing several small operators with one fused operator improves performance.

**How.** Use the fused operator `npu_add_layer_norm` in place of the original implementation.

Optimized:

```python
def vggt_layernorm_forward(self, x: torch.Tensor, residual: Optional[torch.Tensor] = None):
    if residual is None:
        return torch_npu.npu_layer_norm_eval(x, self.normalized_shape, self.weight, self.bias, self.eps)
    else:
        y, _, _, residual = torch_npu.npu_add_layer_norm(
            residual, x, self.weight, self.bias, self.eps, additional_output=True
        )
        return y, residual

nn.LayerNorm.forward = vggt_layernorm_forward
```

## BF16 weights

**Why.** The VGGT network weights currently use the float dtype; converting them to bfloat16 halves the weight footprint.

**How.** After loading the model, cast its weights to bfloat16. This yields a 6.62% performance gain; on the camera pose estimation task, accuracy drops from 0.919 (FP32) to 0.911, an accuracy loss under 1%.

## INT8 weights

**Why.** Applying W8A8 quantization to the VGGT Linear layers keeps the accuracy loss controllable (under 1%), so the network weights can be converted offline to int8.

**How.** For selected Linear layers, quantize activations dynamically per token and weights statically per channel. Model sizes: FP32 (original) 4.9 GB, BF16 2.46 GB, INT8 2.16 GB.
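As an illustration of the W8A8 scheme just described (dynamic per-token quantization for activations, static per-channel quantization for weights), here is a minimal NumPy sketch. The function names and the symmetric-rounding details are mine, not the project's quantization code:

```python
import numpy as np

def quantize_weight_per_channel(w: np.ndarray):
    """Static symmetric int8 quantization of a [out_features, in_features] weight:
    one scale per output channel, computed once offline."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # [out, 1]
    scale = np.maximum(scale, 1e-8)                       # guard all-zero rows
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def quantize_activation_per_token(x: np.ndarray):
    """Dynamic symmetric int8 quantization of a [tokens, in_features] activation:
    one scale per token, computed on the fly at inference time."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # [tokens, 1]
    scale = np.maximum(scale, 1e-8)
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

def w8a8_linear(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    """int8 x int8 matmul accumulated in int32, then dequantized with both scales."""
    x_q, x_scale = quantize_activation_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # [tokens, out]
    return acc * x_scale * w_scale.T
```

The weight quantization runs once offline; only the cheap per-token activation quantization happens during inference.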
On the camera pose estimation task, accuracy drops from 0.911 (BF16) to 0.907, an accuracy loss within 0.5%.

**Enabling.** Off by default; set `enableW8A8` to `True` to enable it.

## Pre-converting convolution kernels to the private format

**Why.** Performing 2D convolution on the NPU requires a Transdata operator to first convert the kernel to the FRACTAL_Z format, so the data-format conversion cost is paid during inference.

**How.** After the model weights are loaded, convert the 2D convolution kernels to FRACTAL_Z ahead of time, avoiding the conversion overhead.

Optimized:

```python
def cast_model_weight(model):
    def _format_cast(module, class_name):
        if issubclass(class_name, torch.nn.Conv2d):
            if module.groups > 1:
                return module
            if hasattr(module, "weight") and module.weight is not None and \
                    "weight" in dict(module.named_parameters()):
                module.weight.data = torch_npu.npu_format_cast(module.weight.data, 4)
        return module

    def cast_weight(module):
        current_class = module.__class__
        module = _format_cast(module, current_class)
        if not module.children:
            return module
        for sub_module in module.children():
            if isinstance(sub_module, torch.nn.Module):
                sub_module = cast_weight(sub_module)
        return module

    for _, module in model.named_modules():
        module = cast_weight(module)
    return model


# At inference time
model = VGGT()
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint)
model = model.to(dtype)
model.to(device).eval()
model = cast_model_weight(model)  # convert the conv kernel data format before inference
predictions = model(images)
```

## Adapting the NPU fused_infer_attention_score operator

**Why.** The `npu_fused_infer_attention_score` fused operator is the FlashAttention operator provided by Ascend Extension for PyTorch for incremental and full (prefill) inference scenarios; on Ascend hardware it delivers better inference performance than SDPA.

**How.** Split the operator choice along VGGT's Frame-Global two-level architecture: Frame Attention handles short intra-frame sequences and keeps PyTorch SDPA; Global Attention handles long cross-frame sequences and switches to the NPU FIA fused operator.

```python
# Original: SDPA everywhere
x = F.scaled_dot_product_attention(q, k, v, dropout_p=...)

# Optimized: branch on is_global_attention
if self.is_global_attention:
    # Global attention uses FIA
    x = torch_npu.npu_fused_infer_attention_score(
        q, k, v,
        num_heads=num_heads,
        scale=scale,
        input_layout="BNSD",
        pre_tokens=65535,
        next_tokens=65535,
        inner_precise=0,
    )[0]
else:
    # Frame attention keeps SDPA
    x = F.scaled_dot_product_attention(q, k, v, ...)
```

## Sequence-parallel inference adaptation

**Why.** VGGT natively supports only single-card inference: all computation runs on one card, leaving the multi-card advantage of Atlas A2 unused. Sharding the sequence across ranks with multi-card parallelism effectively improves end-to-end inference performance.

**How.**

1. Ulysses parallelism: shard the `num_heads` dimension across cards and gather the sequence dimension via all-to-all communication, so each rank sees the full sequence but processes only a subset of attention heads.
2. Ring parallelism: shard the sequence across cards, hide the communication cost with an overlap strategy, and use the LSE (log-sum-exp) information returned by the NPU FIA operator to merge the chunked attention results in a numerically stable way.

```python
# Ulysses parallelism: shard the head dimension
# all-to-all communication swaps the head and sequence dimensions
# input:  [B, shard_s, H, D]  - each rank holds a sequence shard and all heads
# output: [B, S, shard_hc, D] - each rank holds the full sequence and a head shard
def _all_to_all_head_to_seq(self, input_):
    input_t = input_.reshape(bs, shard_s, world_size, shard_hc, d)
    input_t = input_t.permute(2, 1, 0, 3, 4).contiguous()
    dist.all_to_all_single(output, input_t, group=self.ulysses_pg)
```

```python
# Ring parallelism: shard the sequence dimension
# Ring Attention with overlap: communicate while computing
def _ring_attention_overlap(self, params):
    # Step 1: launch the K/V allgather asynchronously
    k_handle = dist.all_gather_into_tensor(k_gathered, k, async_op=True)
    v_handle = dist.all_gather_into_tensor(v_gathered, v, async_op=True)
    # Step 2: compute local attention (Q x local K/V)
    out_local, lse_local = self._compute_attention_with_lse(q, k, v)
    # Step 3: wait for communication, then compute cross attention (Q x other K/V)
    k_handle.wait()
    out_others, lse_others = self._compute_attention_with_lse(q, k_others, v_others)
    # Step 4: merge the results using the LSE values
    out_merged = self._merge_two_outputs(out_local, lse_local, out_others, lse_others)
```

**Accuracy impact.** End-to-end accuracy of VGGT after the sequence-parallel adaptation:

| Parallelism | AUC@30 |
|:---:|:---:|
| Single card | 91.10 |
| Ulysses ×2 | 91.12 |
| Ulysses ×4 | 91.10 |
| Ulysses ×8 | 90.86 |
| Ring ×2 | 91.12 |
| Ring ×4 | 91.09 |
| Ring ×8 | 90.88 |

## Performance results

Measured on an 8-card Atlas 800I A2 inference server, using the sample data shipped with VGGT (`examples/kitchen`, 25 images). Inference latency with each optimization enabled:

| Optimization enabled | Inference time (ms) |
|:---:|:---:|
| Cos/Sin operator optimization | 1324.83 |
| Rotary embedding optimization | 1239.55 |
| DPT head optimization | 1211.26 |
| npu_add_layer_norm fused operator | 1208.17 |
| BF16 weights | 1128.18 |
| Private-format pre-conversion | 1121.09 |

Speedup from sequence-parallel multi-card inference:

| Parallelism | Speedup |
|:---:|:---:|
| Ulysses ×2 | 1.75x |
| Ulysses ×4 | 3.43x |
| Ulysses ×8 | 6.47x |
| Ring ×2 | 1.82x |
| Ring ×4 | 3.48x |
| Ring ×8 | 6.42x |

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
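Step 4 of the ring-attention sketch merges two partial attention outputs using their log-sum-exp values. The project's `_merge_two_outputs` is not shown above, so here is a standalone NumPy sketch of the underlying math (the names and formulation are mine): each chunk output is that chunk's softmax-weighted value sum, and reweighting the two outputs by their exponentiated LSEs recovers the full-softmax result exactly, up to floating-point rounding.

```python
import numpy as np

def merge_attention_chunks(out_a, lse_a, out_b, lse_b):
    """Numerically stable merge of two partial attention results computed
    over disjoint key/value chunks.

    out_a, out_b: [..., S, D] partial attention outputs
    lse_a, lse_b: [..., S]    log-sum-exp of the attention logits per chunk
    """
    lse_max = np.maximum(lse_a, lse_b)
    w_a = np.exp(lse_a - lse_max)  # proportional to chunk A's softmax mass
    w_b = np.exp(lse_b - lse_max)
    denom = w_a + w_b
    out = out_a * (w_a / denom)[..., None] + out_b * (w_b / denom)[..., None]
    lse = lse_max + np.log(denom)  # LSE of the combined chunks
    return out, lse
```

Because the merge is exact rather than approximate, ring-parallel attention changes where the work happens but not the result, which is consistent with the small AUC@30 deltas in the table above.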