CANN opbase算子数据Dump接口
aclDumpOpTensors

【免费下载链接】opbase：本项目是CANN算子库的基础框架库，为算子提供公共依赖文件和基础调度能力。项目地址: https://gitcode.com/cann/opbase

功能说明：模型执行过程中，支持Dump算子输入/输出Tensor数据，方便算子输入/输出异常数据的问题定位和分析。

函数原型：
aclnnStatus aclDumpOpTensors(const char *opType, const char *opName, aclTensor **tensors, size_t inputTensorNum, size_t outputTensorNum, aclrtStream stream)

参数说明：
- opType：输入。字符串，表示算子类型，例如"Add"。
- opName：输入。字符串，表示算子名称，例如"add_custom"。
- tensors：输入。一维张量，表示待Dump的输入/输出Tensor对象指针。注意Tensor顺序：输入Tensor在前，输出Tensor在后。
- inputTensorNum：输入。表示待Dump的输入Tensor个数。
- outputTensorNum：输入。表示待Dump的输出Tensor个数。
- stream：输入。指定执行任务的Stream。

返回值说明：返回0表示成功，返回其他值表示失败。返回码列表参见公共接口返回码。

约束说明：本接口仅在开启算子Dump功能时有效。您可以通过aclInit接口开启Dump，也可以通过aclmdlInitDump、aclmdlSetDump、aclmdlFinalizeDump系列接口开启Dump，接口介绍请参见《Runtime运行时 API》。

调用示例（关键代码示例如下，仅供参考，不支持直接拷贝运行）：

1. 通过aclInit接口开启算子Dump功能，关键代码如下：

// 资源初始化
aclInit("./acl.json");
aclrtSetDevice(0);
aclrtStream stream = nullptr;
aclrtCreateStream(&stream);

acl.json示例如下（具体参见aclInit接口文档中模型Dump配置、单算子Dump配置示例）：

{
    "dump": {
        "dump_path": "./",
        "dump_list": [],
        "dump_mode": "all",
        "dump_data": "tensor"
    }
}

2. 调用本接口，关键伪代码（以torch算子为例）如下：

#include <torch/extension.h>
#include <torch_npu/csrc/core/npu/NPUStream.h>
#include <torch_npu/csrc/core/npu/NPUFunctions.h>
#include <torch_npu/csrc/framework/OpCommand.h>
#include <torch_npu/csrc/framework/interface/AclOpCompileInterface.h>
#include <torch_npu/csrc/core/npu/register/OptionsManager.h>
#include <torch_npu/csrc/aten/NPUNativeFunctions.h>
#include <torch_npu/csrc/flopcount/FlopCount.h>
#include <torch_npu/csrc/flopcount/FlopCounter.h>
#include <torch_npu/csrc/core/npu/NpuVariables.h>
#include "kernel_operator.h"
#include <acl/acl_base.h>
#include <aclnn/acl_meta.h>

constexpr int32_t BUFFER_NUM = 2;
constexpr int64_t MAX_DIM_NUM = 5;
constexpr int64_t NCL_DIM_NUM = 3;
constexpr int64_t NCHW_DIM_NUM = 4;
constexpr int64_t NCDHW_DIM_NUM = 5;

// 生成待Dump算子的输入/输出Tensor对象指针（一维张量）。
#define INIT_ACL_TENSOR_ARRAY(tensors, ...) \
aclTensor* tensors[] {__VA_ARGS__} // at::Tensor对象转换成aclTensor对象。本函数简化了处理过程具体以实际算子为准。 aclTensor *ConvertTensor(const at::Tensor at_tensor) { aclDataType acl_data_type ACL_FLOAT16; c10::SmallVectorint64_t, MAX_DIM_NUM storageDims; const auto dimNum at_tensor.sizes().size(); aclFormat format ACL_FORMAT_ND; switch (dimNum) { case NCL_DIM_NUM: format ACL_FORMAT_NCL; break; case NCHW_DIM_NUM: format ACL_FORMAT_NCHW; break; case NCDHW_DIM_NUM: format ACL_FORMAT_NCDHW; break; default: format ACL_FORMAT_ND; } // if acl_data_type is ACL_STRING, storageDims is empty. if (acl_data_type ! ACL_STRING) { storageDims.push_back(at_tensor.storage().nbytes() / at_tensor.itemsize()); } auto acl_tensor aclCreateTensor(at_tensor.sizes().data(), at_tensor.sizes().size(), acl_data_type, at_tensor.strides().data(), at_tensor.storage_offset(), format, storageDims.data(), storageDims.size(), const_castvoid *(at_tensor.storage().data())); return acl_tensor; } // 自定义算子实现。具体以实际算子为准。 class KernelAdd { public: __aicore__ inline KernelAdd() {} __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength) { this-blockLength totalLength / AscendC::GetBlockNum(); this-tileNum 8; this-tileLength this-blockLength / this-tileNum / BUFFER_NUM; xGm.SetGlobalBuffer((__gm__ half *)x this-blockLength * AscendC::GetBlockIdx(), this-blockLength); yGm.SetGlobalBuffer((__gm__ half *)y this-blockLength * AscendC::GetBlockIdx(), this-blockLength); zGm.SetGlobalBuffer((__gm__ half *)z this-blockLength * AscendC::GetBlockIdx(), this-blockLength); pipe.InitBuffer(inQueueX, BUFFER_NUM, this-tileLength * sizeof(half)); pipe.InitBuffer(inQueueY, BUFFER_NUM, this-tileLength * sizeof(half)); pipe.InitBuffer(outQueueZ, BUFFER_NUM, this-tileLength * sizeof(half)); } __aicore__ inline void Process() { int32_t loopCount this-tileNum * BUFFER_NUM; for (int32_t i 0; i loopCount; i) { CopyIn(i); Compute(i); CopyOut(i); } } private: __aicore__ inline void CopyIn(int32_t progress) { AscendC::LocalTensorhalf 
xLocal inQueueX.AllocTensorhalf(); AscendC::LocalTensorhalf yLocal inQueueY.AllocTensorhalf(); AscendC::DataCopy(xLocal, xGm[progress * this-tileLength], this-tileLength); AscendC::DataCopy(yLocal, yGm[progress * this-tileLength], this-tileLength); inQueueX.EnQue(xLocal); inQueueY.EnQue(yLocal); } __aicore__ inline void Compute(int32_t progress) { AscendC::LocalTensorhalf xLocal inQueueX.DeQuehalf(); AscendC::LocalTensorhalf yLocal inQueueY.DeQuehalf(); AscendC::LocalTensorhalf zLocal outQueueZ.AllocTensorhalf(); AscendC::Add(zLocal, xLocal, yLocal, this-tileLength); outQueueZ.EnQuehalf(zLocal); inQueueX.FreeTensor(xLocal); inQueueY.FreeTensor(yLocal); } __aicore__ inline void CopyOut(int32_t progress) { AscendC::LocalTensorhalf zLocal outQueueZ.DeQuehalf(); AscendC::DataCopy(zGm[progress * this-tileLength], zLocal, this-tileLength); outQueueZ.FreeTensor(zLocal); } private: AscendC::TPipe pipe; AscendC::TQueAscendC::TPosition::VECIN, BUFFER_NUM inQueueX, inQueueY; AscendC::TQueAscendC::TPosition::VECOUT, BUFFER_NUM outQueueZ; AscendC::GlobalTensorhalf xGm; AscendC::GlobalTensorhalf yGm; AscendC::GlobalTensorhalf zGm; uint32_t blockLength; uint32_t tileNum; uint32_t tileLength; }; __global__ __vector__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength) { KernelAdd op; op.Init(x, y, z, totalLength); op.Process(); } namespace ascendc_ops { at::Tensor ascendc_add(const at::Tensor x, const at::Tensor y) { auto aclStream c10_npu::getCurrentNPUStream().stream(false); at::Tensor z at::empty_like(x); uint32_t numBlocks 8; uint32_t totalLength 1; for (uint32_t size : x.sizes()) { totalLength * size; } add_customnumBlocks, nullptr, aclStream((uint8_t*)(x.mutable_data_ptr()), (uint8_t*)(y.mutable_data_ptr()), (uint8_t*)(z.mutable_data_ptr()), totalLength); // Dump算子输入/输出Tensor数据。 INIT_ACL_TENSOR_ARRAY(tensors, ConvertTensor(x), ConvertTensor(y), ConvertTensor(z)); aclDumpOpTensors(Add, add_custom, tensors, 2, 1, aclStream); // 释放aclTensor对象。 for (size_t i 
0; i 3; i) { aclDestroyTensor(tensors[i]); } return z; } } // namespace ascendc_ops TORCH_LIBRARY(ascendc_ops, m) { m.def(ascendc_add(Tensor x, Tensor y) - Tensor); } TORCH_LIBRARY_IMPL(ascendc_ops, PrivateUse1, m) { m.impl(ascendc_add, TORCH_FN(ascendc_ops::ascendc_add)); }【免费下载链接】opbase本项目是CANN算子库的基础框架库为算子提供公共依赖文件和基础调度能力。项目地址: https://gitcode.com/cann/opbase创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考