CANN/pyasc算子调试调优指南
Ascend C Python算子调试调优指南【免费下载链接】pyasc本项目为Python用户提供算子编程接口支持在昇腾AI处理器上加速计算接口与Ascend C一一对应并遵守Python原生语法。项目地址: https://gitcode.com/cann/pyasc本文档介绍了Ascend C Python工程支持的算子调试调优方法使用指导。算子功能调试为了方便开发者Model仿真、NPU实际运行时进行功能调试pyasc提供printf、dump_tensor接口供开发者调用其中printf主要用于打印标量和字符串信息dump_tensor用于打印指定Tensor的数据。使用方法printf示例import asc asc.jit def vadd_kernel(x: asc.GlobalAddress, y: asc.GlobalAddress, z: asc.GlobalAddress, BLOCK_LENGTH: asc.ConstExpr[int], BUFFER_NUM: asc.ConstExpr[int], TILE_LENGTH: asc.ConstExpr[int], TILE_NUM: asc.ConstExpr[int]): offset asc.get_block_idx() * BLOCK_LENGTH ... # 使用printf打印字符串 asc.printf(Before calculating.\\n) for i in range(TILE_NUM): # 使用printf打印当前循环次数 asc.printf(current index is %d.\\n, i) copy_in(i, x_gm, y_gm, in_queue_x, in_queue_y, TILE_LENGTH) compute(z_gm, in_queue_x, in_queue_y, out_queue_z, TILE_LENGTH) copy_out(i, z_gm, out_queue_z, TILE_LENGTH)执行结果样例如下opTypev, DumpHead: AIV-0, CoreTypeAIV, block dim16, total_block_num16, block_remain_len1048024, block_initial_space1048576, rsv0, magic5aa5bccd CANN Version: XX.XX, TimeStamp: XXXXXXXXXXXXXXXXX Before calculating. current index is 0. current index is 1. current index is 2. current index is 3. current index is 4. current index is 5. current index is 6. current index is 7.约束使用asc.printf若需换行需对\n进行转义asc.printf接口会对算子实际运行的性能带来一定影响通常在调测阶段使用。dump_tensor示例import asc asc.jit def vadd_kernel(x: asc.GlobalAddress, y: asc.GlobalAddress, z: asc.GlobalAddress, BLOCK_LENGTH: asc.ConstExpr[int], BUFFER_NUM: asc.ConstExpr[int], TILE_LENGTH: asc.ConstExpr[int], TILE_NUM: asc.ConstExpr[int]): offset asc.get_block_idx() * BLOCK_LENGTH x_gm asc.GlobalTensor() x_gm.set_global_buffer(x offset) ... # 使用dump_tensor打印x_gm输入 tmp_array asc.array(asc.uint32, [4, 16]) tmp_shape_info asc.ShapeInfo(tmp_array) asc.dump_tensor(x_gm, 0, 32, tmp_shape_info) for i in range(TILE_NUM): copy_in(i, x_gm, y_gm, in_queue_x, in_queue_y, TILE_LENGTH) compute(z_gm, in_queue_x, in_queue_y, out_queue_z, TILE_LENGTH) copy_out(i, z_gm, out_queue_z, TILE_LENGTH) asc.jit def copy_in(i: int, x_gm: asc.GlobalAddress, y_gm: asc.GlobalAddress, in_queue_x: asc.TQue, in_queue_y: asc.TQue, TILE_LENGTH: asc.ConstExpr[int]): x_local in_queue_x.alloc_tensor(x_gm.dtype) asc.data_copy(x_local, x_gm[i * TILE_LENGTH:], countTILE_LENGTH) # 使用dump_tensor打印Local Memory的Tensor if i 0: asc.dump_tensor(x_local, 1, 32) ...执行结果样例如下opTypev, DumpHead: AIV-0, CoreTypeAIV, block dim16, total_block_num16, block_remain_len1048024, block_initial_space1048576, rsv0, magic5aa5bccd CANN Version: XX.XX, TimeStamp: XXXXXXXXXXXXXXXXX DumpTensor: desc0, addr41200000, data_typefloat32, positionGM, dump_size32 [[19.000000, 4.000000, 38.000000, 50.000000, 39.000000, 67.000000, 84.000000, 98.000000, 21.000000, 36.000000, 18.000000, 46.000000, 10.000000, 92.000000, 26.000000, 38.000000], [39.000000, 9.000000, 82.000000, 37.000000, 35.000000, 65.000000, 97.000000, 59.000000, 89.000000, 63.000000, 70.000000, 57.000000, 35.000000, 3.000000, 16.000000, 42.000000], [-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-], [-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-]] DumpTensor: desc1, addr0, data_typefloat32, positionUB, dump_size32 [6.000000, 34.000000, 52.000000, 38.000000, 73.000000, 38.000000, 35.000000, 14.000000, 67.000000, 62.000000, 30.000000, 49.000000, 86.000000, 37.000000, 84.000000, 18.000000, 38.000000, 18.000000, 44.000000, 21.000000, 86.000000, 99.000000, 13.000000, 79.000000, 84.000000, 9.000000, 48.000000, 74.000000, 52.000000, 99.000000, 80.000000, 53.000000]约束asc.dump_tensor接口会对算子实际运行的性能带来一定影响通常在调测阶段使用。算子性能调优为了帮助开发者快速完成高性能算子开发pyasc支持通过算子性能调优工具msprof op采集profiling数据生成性能数据、内存热力图、仿真流水图等。本文档介绍了msprof op在算子开发过程中的应用。msprof op工具用于采集和分析运行在昇腾AI处理器上算子的关键性能指标开发者根据输出的profiling数据快速定位算子的软硬件性能瓶颈提升算子性能分析效率。该工具的详细介绍请参考《算子开发工具-算子调优》。环境准备工具使用前需完成环境准备详细参考quick_start.md。算子性能数据采集msprof op工具包含msprof op上板和msprof op simulator仿真两种使用方式协助用户定位算子内存、算子代码以及算子指令的异常实现全方位的算子调优。功能简介功能名称适用场景展示的图形msprof op适用于实际运行环境中的性能分析计算内存热力图、Roofline瓶颈分析图、Cache热力图、通算流水图、算子代码热点图msprof op simulator适用于开发和调试阶段进行详细仿真调优指令流水图、算子代码热点图、内存通路吞吐率波形图本文以add算子为例介绍如何在上板和仿真两种场景中使用msprof op算子性能调优工具。add算子实现代码参考02_add_framework.py,上板性能数据采集算子上板信息采集参考命令如下详细命令参数请参考msprof op工具文档的说明。msprof op --output./output python add_framework.py -r NPU成功执行后在output目录下生成如下文件OPPROF_{timestamp}_XXX ├── dump ├── ArithmeticUtilization.csv ├── L2Cache.csv ├── Memory.csv ├── MemoryL0.csv ├── MemoryUB.csv ├── OpBasicInfo.csv ├── PipeUtilization.csv ├── ResourceConflictRatio.csv ├── visualize_data.bin将visualize_data.bin文件导入MindStudio Insight后将会展示计算内存热力图、Roofline瓶颈分析图、Cache热力图、通算流水图和算子代码热点图。将trace.json文件导入Chrome浏览器或MindStudio Insight后将会展示通算流水图。上述展示的图示内容的说明参考算子调优的相关章节。仿真性能数据采集算子仿真以Ascend910B1 simulator为例运行信息采集参考命令如下各参数含义请参考msProf工具文档的说明。msprof op simulator --output./output python add_framework.py -r Model -v Ascend910B1成功执行后在output目录下生成如下文件OPPROF_{timestamp}_XXX ├── dump └── simulator ├── core0.veccore0 // 按照core*.veccore*或core*.cubecore*目录存放各核的数据文件 │ ├── core0.veccore0_code_exe.csv │ ├── core0.veccore0_instr_exe.csv │ └── trace.json // 该核的仿真指令流水图文件 ├── core0.veccore1 │ ├── core0.veccore1_code_exe.csv │ ├── core0.veccore1_instr_exe.csv │ └── trace.json ├── core1.veccore0 │ ├── core1.veccore0_code_exe.csv │ ├── core1.veccore0_instr_exe.csv │ └── trace.json ├── ... ├── visualize_data.bin └── trace.json // 全部核的仿真指令流水图文件将visualize_data.bin文件导入MindStudio Insight后将会展示指令流水图、算子代码热点图和内存通路吞吐率波形图。将trace.json文件导入Chrome浏览器或MindStudio Insight后将会展示指令流水图和内存通路吞吐率波形图。上述展示的图示内容的说明参考算子调优的相关章节。使用Ascend PyTorch Profiler采集性能数据pyasc支持针对PyTorch框架开发的性能分析工具Ascend PyTorch Profiler。PyTorch训练/在线推理场景下推荐通过Ascend PyTorch Profiler接口采集并解析性能数据用户可以根据结果自行分析和识别性能瓶颈请参考官方文档Ascend PyTorch Profiler性能数据采集和自动解析。环境准备pyasc的环境准备请参考quick_start.mdAscend PyTorch Profiler的环境准备请参考Ascend PyTorch Profiler性能数据采集和自动解析的前提条件小节。本文以add算子为例介绍如何通过Ascend PyTorch Profiler接口采集性能数据。示例代码通过with语句进行采集import torch try: import torch_npu except ModuleNotFoundError: pass import asc import asc.runtime.config as config import asc.lib.runtime as rt asc.jit def vadd_kernel(x: asc.GlobalAddress, y: asc.GlobalAddress, z: asc.GlobalAddress, block_length: int): offset asc.get_block_idx() * block_length x_gm asc.GlobalTensor() y_gm asc.GlobalTensor() z_gm asc.GlobalTensor() ...... def vadd_launch(x: torch.Tensor, y: torch.Tensor) - torch.Tensor: z torch.zeros_like(x) total_length z.numel() block_length total_length // USE_CORE_NUM vadd_kernel[USE_CORE_NUM, rt.current_stream()](https://link.gitcode.com/i/98193dbe9e3f5bea4a201aac7c70f2fd) return z def vadd_custom(backend: config.Backend, platform: config.Platform): config.set_platform(backend, platform) device npu if config.Backend(backend) config.Backend.NPU else cpu size 8 * 2048 x torch.rand(size, dtypetorch.float32, devicedevice) y torch.rand(size, dtypetorch.float32, devicedevice) z vadd_launch(x, y) assert torch.allclose(z, x y) experimental_config torch_npu.profiler._ExperimentalConfig( export_type[ torch_npu.profiler.ExportType.Text ], profiler_leveltorch_npu.profiler.ProfilerLevel.Level0, msprof_txFalse, aic_metricstorch_npu.profiler.AiCMetrics.AiCoreNone, l2_cacheFalse, op_attrFalse, data_simplificationFalse, record_op_argsFalse, gc_detect_thresholdNone ) with torch_npu.profiler.profile( activities[ torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU ], scheduletorch_npu.profiler.schedule(wait0, warmup1, active1, repeat1, skip_first0), on_trace_readytorch_npu.profiler.tensorboard_trace_handler(./result), record_shapesFalse, profile_memoryFalse, with_stackFalse, with_modulesFalse, with_flopsFalse, experimental_configexperimental_config) as prof: steps 2 for step in range(steps): vadd_custom(config.Backend.NPU) prof.step()NPU性能数据目录结构执行后生成数据目录结构示例如下├── ubuntu_239247_20251101101435_ascend_pt ├── ASCEND_PROFILER_OUTPUT ├── FRAMEWORK ├── logs ├── PROF_000001_20230628101435646_FKFLNPEPPRRCFCBA ├── profiler_info.json ├── profiler_metadata.json【免费下载链接】pyasc本项目为Python用户提供算子编程接口支持在昇腾AI处理器上加速计算接口与Ascend C一一对应并遵守Python原生语法。项目地址: https://gitcode.com/cann/pyasc创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考