# CANN VGGT Model NPU Optimization
NPU inference optimization practice for the VGGT model.

【Free download link】cann-recipes-embodied-intelligence: this project provides CANN-based optimization samples for typical models and acceleration algorithms in embodied-intelligence workloads. Project address: https://gitcode.com/cann/cann-recipes-embodied-intelligence

This article describes NPU-based inference optimization strategies for the VGGT model, covering the following points:

- Cos/Sin operator input optimization
- Rotary embedding optimization: enabling the NPU `npu_rotary_mul` fused operator and eliminating redundant computation
- DPT head positional-embedding optimization
- Enabling the NPU `npu_add_layer_norm` fused operator
- Converting weights to BF16
- Pre-converting 2D convolution kernels to the NPU private format
- Enabling the NPU `npu_fused_infer_attention_score` fused operator
- Ulysses and Ring Attention sequence-parallel inference
- Performance results

## Cos/Sin operator optimization

**Why.** In the original network, the inputs to the Cos and Sin operators have dtype `double`, so both operators are dispatched to the AI CPU, where they execute slowly.

**How.** In `vggt/heads/utils.py`, change the dtype of the `omega` variable in `make_sincos_pos_embed` from `torch.double` to `torch.bfloat16`.

Original:

```python
def make_sincos_pos_embed(embed_dim: int, pos: torch.Tensor, omega_0: float = 100) -> torch.Tensor:
    assert embed_dim % 2 == 0
    device = pos.device
    omega = torch.arange(embed_dim // 2, dtype=torch.float32 if device.type == "mps" else torch.double, device=device)
    omega /= embed_dim / 2.0
    omega = 1.0 / omega_0**omega  # (D/2,)
    pos = pos.reshape(-1)  # (M,)
    out = torch.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
    emb_sin = torch.sin(out)  # (M, D/2)
    emb_cos = torch.cos(out)  # (M, D/2)
    emb = torch.cat([emb_sin, emb_cos], dim=1)  # (M, D)
    return emb.float()
```

Optimized:

```python
def make_sincos_pos_embed(embed_dim: int, pos: torch.Tensor, omega_0: float = 100) -> torch.Tensor:
    assert embed_dim % 2 == 0
    device = pos.device
    omega = torch.arange(embed_dim // 2, dtype=torch.bfloat16, device=device)
    omega /= embed_dim / 2.0
    omega = 1.0 / omega_0**omega  # (D/2,)
    pos = pos.reshape(-1)  # (M,)
    out = torch.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
    emb_sin = torch.sin(out)  # (M, D/2)
    emb_cos = torch.cos(out)  # (M, D/2)
    emb = torch.cat([emb_sin, emb_cos], dim=1)  # (M, D)
    return emb
```

## Rotary embedding optimization

### Enabling the npu_rotary_mul fused operator

**Why.** The original code implements the rotary operation as `(tokens * cos) + (self._rotate_features(tokens) * sin)`, which makes the host dispatch many small operators.

**How.** In `vggt/layers/rope.py`, rewrite `_apply_1d_rope` to use `npu_rotary_mul` in place of the original implementation.

Original:

```python
def _apply_1d_rope(
    self, tokens: torch.Tensor, positions: torch.Tensor, cos_comp: torch.Tensor, sin_comp: torch.Tensor
) -> torch.Tensor:
    # Embed positions with frequency components
    cos = F.embedding(positions, cos_comp)[:, None, :, :]
    sin = F.embedding(positions, sin_comp)[:, None, :, :]
    # Apply rotation
    return (tokens * cos) + (self._rotate_features(tokens) * sin)
```

Optimized:

```python
def _apply_1d_rope(
    self, tokens: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
) -> torch.Tensor:
    # Apply rotation
    return torch_npu.npu_rotary_mul(tokens, cos, sin, rotary_mode="half")
```

### Eliminating redundant computation

**Why.** In the rope module of VGGT's Attention layers, the cos and sin values are recomputed on every call:

- In `vggt/layers/rope.py`, `_apply_1d_rope` computes the cos and sin variables via `cos = F.embedding(positions, cos_comp)[:, None, :, :]` and `sin = F.embedding(positions, sin_comp)[:, None, :, :]`.
- In `vggt/layers/rope.py`, `forward` computes the maximum of the input `positions` variable via `max`.
- Every rope call on the q and k variables therefore recomputes cos and sin in both the vertical and horizontal dimensions, plus the maximum of `positions`.

**How.** The rotary computation depends on the `positions` variable, which is determined by the height and width of the input image. By caching the computed cos/sin values and the maximum of `positions` in a dictionary keyed by `(height, width)`, repeated inputs with same-sized images avoid the redundant computation.

Original:

```python
def _compute_frequency_components(
    self, dim: int, seq_len: int, device: torch.device, dtype: torch.dtype
) -> Tuple[torch.Tensor, torch.Tensor]:
    cache_key = (dim, seq_len, device, dtype)
    if cache_key not in self.frequency_cache:
        # Compute frequency bands
        exponents = torch.arange(0, dim, 2, device=device).float() / dim
        inv_freq = 1.0 / (self.base_frequency**exponents)
        # Generate position-dependent frequencies
        positions = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
        angles = torch.einsum("i,j->ij", positions, inv_freq)
        # Compute and cache frequency components
        angles = angles.to(dtype)
        angles = torch.cat((angles, angles), dim=-1)
        cos_components = angles.cos().to(dtype)
        sin_components = angles.sin().to(dtype)
        self.frequency_cache[cache_key] = (cos_components, sin_components)
    return self.frequency_cache[cache_key]
```

Optimized:

```python
def _compute_frequency_components(
    self, dim: int, input_positions: torch.Tensor, device: torch.device, dtype: torch.dtype,
    height=None, width=None, batch_size=None
) -> Tuple[torch.Tensor, torch.Tensor]:
    if height is None or width is None:
        seq_len = int(input_positions.max()) + 1
    else:
        hw_key = (height, width)
        if hw_key not in self.max_position_cache:
            self.max_position_cache[hw_key] = int(input_positions.max()) + 1
        seq_len = self.max_position_cache[hw_key]
    cache_key = (dim, seq_len, device, dtype)
    if cache_key not in self.frequency_cache:
        # Compute frequency bands
        exponents = torch.arange(0, dim, 2, device=device).float() / dim
        inv_freq = 1.0 / (self.base_frequency**exponents)
        # Generate position-dependent frequencies
        positions = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
        angles = torch.einsum("i,j->ij", positions, inv_freq)
        # Compute and cache frequency components
        angles = angles.to(dtype)
        angles = torch.cat((angles, angles), dim=-1)
        cos_components = angles.cos().to(dtype)
        sin_components = angles.sin().to(dtype)
        self.frequency_cache[cache_key] = (cos_components, sin_components)
    cos_components, sin_components = self.frequency_cache[cache_key]
    if height is None or width is None or batch_size is None:
        vertical_cos = F.embedding(input_positions[..., 0], cos_components)[:, None, :, :]
        vertical_sin = F.embedding(input_positions[..., 0], sin_components)[:, None, :, :]
        horizontal_cos = F.embedding(input_positions[..., 1], cos_components)[:, None, :, :]
        horizontal_sin = F.embedding(input_positions[..., 1], sin_components)[:, None, :, :]
        return vertical_cos, vertical_sin, horizontal_cos, horizontal_sin
    sub_cache_key = (height, width)
    if sub_cache_key not in self.cos_sin_cache:
        # Embed positions with frequency components
        vertical_cos = F.embedding(input_positions[..., 0], cos_components)[:, None, :, :]
        vertical_sin = F.embedding(input_positions[..., 0], sin_components)[:, None, :, :]
        horizontal_cos = F.embedding(input_positions[..., 1], cos_components)[:, None, :, :]
        horizontal_sin = F.embedding(input_positions[..., 1], sin_components)[:, None, :, :]
        self.cos_sin_cache[sub_cache_key] = (vertical_cos, vertical_sin, horizontal_cos, horizontal_sin)
    vertical_cos, vertical_sin, horizontal_cos, horizontal_sin = self.cos_sin_cache[sub_cache_key]
    return vertical_cos, vertical_sin, horizontal_cos, horizontal_sin
```

## DPT head positional-embedding optimization

**Why.** In VGGT's DPT head, the positional embedding of the input is recomputed on every call, which is redundant.

**How.** The positional-embedding result depends only on the input image size and the token length, so cache the result in a dictionary and reuse it, avoiding the redundant computation.

Original:

```python
def _apply_pos_embed(self, x: torch.Tensor, W: int, H: int, ratio: float = 0.1) -> torch.Tensor:
    patch_w = x.shape[-1]
    patch_h = x.shape[-2]
    pos_embed = create_uv_grid(patch_w, patch_h, aspect_ratio=W / H, dtype=x.dtype, device=x.device)
    pos_embed = position_grid_to_embed(pos_embed, x.shape[1])
    pos_embed = pos_embed * ratio
    pos_embed = pos_embed.permute(2, 0, 1)[None].expand(x.shape[0], -1, -1, -1)
    return x + pos_embed
```

Optimized:

```python
def _apply_pos_embed(self, x: torch.Tensor, W: int, H: int, ratio: float = 0.1) -> torch.Tensor:
    if (W, H, x.shape) not in self.pos_embed_cache:
        patch_w = x.shape[-1]
        patch_h = x.shape[-2]
        pos_embed = create_uv_grid(patch_w, patch_h, aspect_ratio=W / H, dtype=x.dtype, device=x.device)
        pos_embed = position_grid_to_embed(pos_embed, x.shape[1])
        pos_embed = pos_embed * ratio
        pos_embed = pos_embed.permute(2, 0, 1)[None].expand(x.shape[0], -1, -1, -1)
        self.pos_embed_cache[(W, H, x.shape)] = pos_embed
    pos_embed = self.pos_embed_cache[(W, H, x.shape)]
    return x + pos_embed
```

## AddLayerNorm fused operator

**Why.** Replacing several small operators with one fused operator improves performance.

**How.** Use the fused operator `npu_add_layer_norm` in place of the original implementation.

Optimized:

```python
def vggt_layernorm_forward(self, x: torch.Tensor, residual: Optional[torch.Tensor] = None):
    if residual is None:
        return torch_npu.npu_layer_norm_eval(x, self.normalized_shape, self.weight, self.bias, self.eps)
    else:
        y, _, _, residual = torch_npu.npu_add_layer_norm(
            residual, x, self.weight, self.bias, self.eps, additional_output=True
        )
        return y, residual

nn.LayerNorm.forward = vggt_layernorm_forward
```

## BF16 weights

**Why.** The VGGT network weights currently use the float dtype; converting them to bfloat16 halves the weight footprint.

**How.** After loading the model, cast its weights to bfloat16. This yields a 6.62% performance gain; on the camera pose estimation task, accuracy drops from 0.919 (FP32) to 0.911, an accuracy loss under 1%.

## INT8 weights

**Why.** Applying W8A8 quantization to the VGGT Linear layers keeps the accuracy loss controllable (under 1%), so the network weights can be converted offline to int8.

**How.** For selected Linear layers, quantize activations dynamically per token and weights statically per channel. Model sizes: FP32 (original) 4.9 GB, BF16 2.46 GB, INT8 2.16 GB.
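As an illustration of the W8A8 scheme just described (dynamic per-token quantization for activations, static per-channel quantization for weights), here is a minimal NumPy sketch. The function names and the symmetric-rounding details are mine, not the project's quantization code:

```python
import numpy as np

def quantize_weight_per_channel(w: np.ndarray):
    """Static symmetric int8 quantization of a [out_features, in_features] weight:
    one scale per output channel, computed once offline."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # [out, 1]
    scale = np.maximum(scale, 1e-8)                       # guard all-zero rows
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def quantize_activation_per_token(x: np.ndarray):
    """Dynamic symmetric int8 quantization of a [tokens, in_features] activation:
    one scale per token, computed on the fly at inference time."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # [tokens, 1]
    scale = np.maximum(scale, 1e-8)
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

def w8a8_linear(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    """int8 x int8 matmul accumulated in int32, then dequantized with both scales."""
    x_q, x_scale = quantize_activation_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # [tokens, out]
    return acc * x_scale * w_scale.T
```

The weight quantization runs once offline; only the cheap per-token activation quantization happens during inference.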
On the camera pose estimation task, accuracy drops from 0.911 (BF16) to 0.907, an accuracy loss within 0.5%.

**Enabling.** Off by default; set `enableW8A8` to `True` to enable it.

## Pre-converting convolution kernels to the private format

**Why.** Performing 2D convolution on the NPU requires a Transdata operator to first convert the kernel to the FRACTAL_Z format, so the data-format conversion cost is paid during inference.

**How.** After the model weights are loaded, convert the 2D convolution kernels to FRACTAL_Z ahead of time, avoiding the conversion overhead.

Optimized:

```python
def cast_model_weight(model):
    def _format_cast(module, class_name):
        if issubclass(class_name, torch.nn.Conv2d):
            if module.groups > 1:
                return module
            if hasattr(module, "weight") and module.weight is not None and \
                    "weight" in dict(module.named_parameters()):
                module.weight.data = torch_npu.npu_format_cast(module.weight.data, 4)
        return module

    def cast_weight(module):
        current_class = module.__class__
        module = _format_cast(module, current_class)
        if not module.children:
            return module
        for sub_module in module.children():
            if isinstance(sub_module, torch.nn.Module):
                sub_module = cast_weight(sub_module)
        return module

    for _, module in model.named_modules():
        module = cast_weight(module)
    return model


# At inference time
model = VGGT()
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint)
model = model.to(dtype)
model.to(device).eval()
model = cast_model_weight(model)  # convert the conv kernel data format before inference
predictions = model(images)
```

## Adapting the NPU fused_infer_attention_score operator

**Why.** The `npu_fused_infer_attention_score` fused operator is the FlashAttention operator provided by Ascend Extension for PyTorch for incremental and full (prefill) inference scenarios; on Ascend hardware it delivers better inference performance than SDPA.

**How.** Split the operator choice along VGGT's Frame-Global two-level architecture: Frame Attention handles short intra-frame sequences and keeps PyTorch SDPA; Global Attention handles long cross-frame sequences and switches to the NPU FIA fused operator.

```python
# Original: SDPA everywhere
x = F.scaled_dot_product_attention(q, k, v, dropout_p=...)

# Optimized: branch on is_global_attention
if self.is_global_attention:
    # Global attention uses FIA
    x = torch_npu.npu_fused_infer_attention_score(
        q, k, v,
        num_heads=num_heads,
        scale=scale,
        input_layout="BNSD",
        pre_tokens=65535,
        next_tokens=65535,
        inner_precise=0,
    )[0]
else:
    # Frame attention keeps SDPA
    x = F.scaled_dot_product_attention(q, k, v, ...)
```

## Sequence-parallel inference adaptation

**Why.** VGGT natively supports only single-card inference: all computation runs on one card, leaving the multi-card advantage of Atlas A2 unused. Sharding the sequence across ranks with multi-card parallelism effectively improves end-to-end inference performance.

**How.**

1. Ulysses parallelism: shard the `num_heads` dimension across cards and gather the sequence dimension via all-to-all communication, so each rank sees the full sequence but processes only a subset of attention heads.
2. Ring parallelism: shard the sequence across cards, hide the communication cost with an overlap strategy, and use the LSE (log-sum-exp) information returned by the NPU FIA operator to merge the chunked attention results in a numerically stable way.

```python
# Ulysses parallelism: shard the head dimension
# all-to-all communication swaps the head and sequence dimensions
# input:  [B, shard_s, H, D]  - each rank holds a sequence shard and all heads
# output: [B, S, shard_hc, D] - each rank holds the full sequence and a head shard
def _all_to_all_head_to_seq(self, input_):
    input_t = input_.reshape(bs, shard_s, world_size, shard_hc, d)
    input_t = input_t.permute(2, 1, 0, 3, 4).contiguous()
    dist.all_to_all_single(output, input_t, group=self.ulysses_pg)
```

```python
# Ring parallelism: shard the sequence dimension
# Ring Attention with overlap: communicate while computing
def _ring_attention_overlap(self, params):
    # Step 1: launch the K/V allgather asynchronously
    k_handle = dist.all_gather_into_tensor(k_gathered, k, async_op=True)
    v_handle = dist.all_gather_into_tensor(v_gathered, v, async_op=True)
    # Step 2: compute local attention (Q x local K/V)
    out_local, lse_local = self._compute_attention_with_lse(q, k, v)
    # Step 3: wait for communication, then compute cross attention (Q x other K/V)
    k_handle.wait()
    out_others, lse_others = self._compute_attention_with_lse(q, k_others, v_others)
    # Step 4: merge the results using the LSE values
    out_merged = self._merge_two_outputs(out_local, lse_local, out_others, lse_others)
```

**Accuracy impact.** End-to-end accuracy of VGGT after the sequence-parallel adaptation:

| Parallelism | AUC@30 |
|:---:|:---:|
| Single card | 91.10 |
| Ulysses ×2 | 91.12 |
| Ulysses ×4 | 91.10 |
| Ulysses ×8 | 90.86 |
| Ring ×2 | 91.12 |
| Ring ×4 | 91.09 |
| Ring ×8 | 90.88 |

## Performance results

Measured on an 8-card Atlas 800I A2 inference server, using the sample data shipped with VGGT (`examples/kitchen`, 25 images). Inference latency with each optimization enabled:

| Optimization enabled | Inference time (ms) |
|:---:|:---:|
| Cos/Sin operator optimization | 1324.83 |
| Rotary embedding optimization | 1239.55 |
| DPT head optimization | 1211.26 |
| npu_add_layer_norm fused operator | 1208.17 |
| BF16 weights | 1128.18 |
| Private-format pre-conversion | 1121.09 |

Speedup from sequence-parallel multi-card inference:

| Parallelism | Speedup |
|:---:|:---:|
| Ulysses ×2 | 1.75x |
| Ulysses ×4 | 3.43x |
| Ulysses ×8 | 6.47x |
| Ring ×2 | 1.82x |
| Ring ×4 | 3.48x |
| Ring ×8 | 6.42x |

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
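Step 4 of the ring-attention sketch merges two partial attention outputs using their log-sum-exp values. The project's `_merge_two_outputs` is not shown above, so here is a standalone NumPy sketch of the underlying math (the names and formulation are mine): each chunk output is that chunk's softmax-weighted value sum, and reweighting the two outputs by their exponentiated LSEs recovers the full-softmax result exactly, up to floating-point rounding.

```python
import numpy as np

def merge_attention_chunks(out_a, lse_a, out_b, lse_b):
    """Numerically stable merge of two partial attention results computed
    over disjoint key/value chunks.

    out_a, out_b: [..., S, D] partial attention outputs
    lse_a, lse_b: [..., S]    log-sum-exp of the attention logits per chunk
    """
    lse_max = np.maximum(lse_a, lse_b)
    w_a = np.exp(lse_a - lse_max)  # proportional to chunk A's softmax mass
    w_b = np.exp(lse_b - lse_max)
    denom = w_a + w_b
    out = out_a * (w_a / denom)[..., None] + out_b * (w_b / denom)[..., None]
    lse = lse_max + np.log(denom)  # LSE of the combined chunks
    return out, lse
```

Because the merge is exact rather than approximate, ring-parallel attention changes where the work happens but not the result, which is consistent with the small AUC@30 deltas in the table above.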