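[Free download link] cannbot-skills: CANNBot is a family of agents built to improve CANN development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.

---
name: triton-ascend-case-reduction-amin-medium
description: Optimization of a large-scale 2D amin reduction with a very large reduce axis. With filling the UB as the first priority, give the reduce axis a large tile size (BLOCK_SIZE_N = 16384 is best) to cut the loop count, while weighing the per-iteration load. Applies when the non-reduce axis is medium-sized and the reduce axis is very large (on the order of 500,000 elements).
category: example
version: 1.0.0
metadata:
  backend: ascend
  dsl: triton-ascend
  hardware: Atlas A2, Atlas A3
---

# Large-Scale 2D Amin Reduction Optimization

## Task characteristics

- Data shape: (2048, 262144). The non-reduce axis is medium-sized and the reduce axis is very large.

## Optimization: large tiles on the reduce axis

Inside the loop over the reduce axis, keep the accumulator in the same 2D shape as the loaded tile and combine tiles with an elementwise minimum. Reduce along the axis only once, after the loop. In both fragments below, `data_block` stands for the `(BLOCK_SIZE_M, BLOCK_SIZE_N)` tile loaded in the current iteration; the load itself is omitted.

    # Wrong: a full reduction inside the loop on every iteration
    row_min = float('inf')
    for n_start in range(0, N, BLOCK_SIZE_N):
        curr_min = tl.min(data_block, 1)
        row_min = tl.minimum(curr_min, row_min)

    # Right: keep the 2D tile structure in the loop, reduce once at the end
    curr_min = tl.full((BLOCK_SIZE_M, BLOCK_SIZE_N), float('inf'), dtype=tl.float32)
    for n_start in range(0, N, BLOCK_SIZE_N):
        curr_min = tl.minimum(data_block, curr_min)
    row_min = tl.min(curr_min, 1)

## Autotune configuration

    # 1. Fairly large reduce-axis tile, UB fully used - 2864.90 us
    triton.Config({'BLOCK_SIZE_M': 8, 'BLOCK_SIZE_N': 2048})
    # 2-4. Reduce-axis tile grows while the M tile shrinks accordingly - performance improves step by step
    triton.Config({'BLOCK_SIZE_M': 4, 'BLOCK_SIZE_N': 4096})   # 2840.48 us
    triton.Config({'BLOCK_SIZE_M': 2, 'BLOCK_SIZE_N': 8192})   # 2801.20 us
    triton.Config({'BLOCK_SIZE_M': 1, 'BLOCK_SIZE_N': 16384})  # 2779.78 us, best

## Summary

With filling the UB as the first priority, assign a larger tile size to the reduce axis to reduce the number of loop iterations, while weighing the compute load of each iteration.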
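The tiling trade-off and the two accumulation patterns described above can be sketched in plain Python, without Triton or Ascend hardware. This is only an illustrative model under stated assumptions: the helper names (`tile_stats`, `amin_reduce_each_step`, `amin_accumulate_then_reduce`) are hypothetical, the 4-byte element size assumes float32 data, and padding the tail tile with `inf` is assumed to stand in for a masked load. It says nothing about actual kernel timing.

```python
import math

INF = float("inf")


def tile_stats(n_cols, block_m, block_n, itemsize=4):
    """Return (loop iterations over the reduce axis, elements per tile, bytes per tile).

    itemsize=4 is an assumption (float32 data).
    """
    iterations = math.ceil(n_cols / block_n)
    elements = block_m * block_n
    return iterations, elements, elements * itemsize


def amin_reduce_each_step(rows, block_n):
    """Pattern 1: reduce every tile to one value per row inside the loop."""
    row_min = [INF] * len(rows)
    for n_start in range(0, len(rows[0]), block_n):
        for i, row in enumerate(rows):
            tile_min = min(row[n_start:n_start + block_n])
            row_min[i] = min(row_min[i], tile_min)
    return row_min


def amin_accumulate_then_reduce(rows, block_n):
    """Pattern 2: keep a (rows x block_n) accumulator, reduce once after the loop."""
    acc = [[INF] * block_n for _ in rows]
    for n_start in range(0, len(rows[0]), block_n):
        for i, row in enumerate(rows):
            tile = row[n_start:n_start + block_n]
            # Pad a short tail tile with +inf (assumed stand-in for a masked load).
            tile = tile + [INF] * (block_n - len(tile))
            # Elementwise minimum keeps the tile shape; no reduction in the loop.
            acc[i] = [min(a, x) for a, x in zip(acc[i], tile)]
    return [min(a) for a in acc]


# The four (BLOCK_SIZE_M, BLOCK_SIZE_N) pairs and the reduce-axis length from the text.
CONFIGS = [(8, 2048), (4, 4096), (2, 8192), (1, 16384)]
N = 262144
stats = {cfg: tile_stats(N, *cfg) for cfg in CONFIGS}

# Small synthetic input whose width (37) is not a multiple of the tile size.
rows = [[(7 * i + 13 * j) % 101 - 50 for j in range(37)] for i in range(3)]
expected = [min(r) for r in rows]
```

In this model every configuration keeps the tile at 16384 elements, so the per-tile footprint stays constant while the loop count over the reduce axis drops from 128 to 16 as `BLOCK_SIZE_N` grows from 2048 to 16384. Both accumulation patterns return the same row minima; they differ only in where the reduction happens.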