# Copy L1 To L0A Module Overview

catlass is the operator template library of CANN. It provides template examples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass

Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.

[TOC]

## Overview

The `copy_l1_to_l0a` module provides template classes that move an A-matrix tile from L1 (Local Memory, A1 Buffer) to L0A (A2 Buffer), with support for conversions between several data layouts. The implementation is split by architecture into two sets:

- AtlasA2 (ARCH 2201): `atlasa2/copy_l1_to_l0a.hpp`
- Ascend950 (ARCH 3510): `ascend950/copy_l1_to_l0a.hpp`

The module exposes two sets of APIs: a non-TLA style that operates directly on `LocalTensor`, and a TLA style that wraps tensors in `tla::Tensor`.

## API List

| Component | Style | Supported hardware | Description |
| --- | --- | --- | --- |
| `CopyL1ToL0A` | Non-TLA | AtlasA2 / Ascend950 | Basic L1→L0A copy template; supports several layout conversions |
| `TileCopyTla` | TLA | AtlasA2 / Ascend950 | TLA-style L1→L0A copy; wrapping in `tla::Tensor` simplifies the call |
| `TileCopySparseTla` | TLA | AtlasA2 | Sparse GEMM L1→L0A copy (zN→zZ, LoadData3D v2) |

> Note: This module is usually not used directly. It serves as the member type `CopyL1ToL0A` of `TileCopy` and is managed automatically by `blockMmad`. Declare it explicitly only when you need to assemble a custom kernel template.

## Supported Hardware

| Hardware model | Architecture tag | ARCH macro | Supported non-TLA template | Supported TLA template |
| --- | --- | --- | --- | --- |
| Atlas A2 | `Arch::AtlasA2` | `CATLASS_ARCH` = 2201 | `CopyL1ToL0A` | `TileCopyTla` |
| Ascend 950 | `Arch::Ascend950` | `CATLASS_ARCH` = 3510 | `CopyL1ToL0A` | `TileCopyTla` |

## Architecture Differences

| Feature | AtlasA2 | Ascend950 |
| --- | --- | --- |
| Target L0A layout | zZ | zN |
| Basic copy instruction | `LoadData2D` | `LoadData2DParamsV2` |
| l0Batch batched copy | Not supported | Supported (`operator()` overload) |
| MX Scale floating-point quantization | Not supported | Supported (`operator()` overload) |
| Vector layout | Not supported | Supported |

## Usage Examples

### Non-TLA style (CopyL1ToL0A)

```cpp
#include "catlass/gemm/tile/copy_l1_to_l0a.hpp"

using namespace Catlass::Gemm::Tile;

using Element = half;
using L1Type = Gemm::GemmType<Element, layout::zN, AscendC::TPosition::A1>;
using L0Type = Gemm::GemmType<Element, layout::zZ, AscendC::TPosition::A2>;

uint32_t row = 256;
uint32_t col = 256;

// Build the zN layout on L1 and the zZ layout on L0A
auto layoutSrc = layout::zN::MakeLayout<Element>(row, col);
auto layoutDst = layout::zZ::MakeLayout<Element>(row, col);

AscendC::LocalTensor<Element> srcL1Tensor;
AscendC::LocalTensor<Element> dstL0ATensor;

// Instantiate and call
using CopyOp = CopyL1ToL0A<Arch::AtlasA2, L1Type, L0Type>;
CopyOp copyOp;
copyOp(dstL0ATensor, srcL1Tensor, layoutDst, layoutSrc);
```

### TLA style (TileCopyTla)

```cpp
#include "catlass/gemm/tile/copy_l1_to_l0a.hpp"
#include "tla/tensor.hpp"

using namespace Catlass::Gemm::Tile;

const uint32_t M = 256;
const uint32_t K = 256;

// Create the layouts with tla::MakeLayout
auto layoutSrc = tla::MakeLayout<half, layout::zN>(M, K);
auto layoutDst = tla::MakeLayout<half, layout::zZ>(M, K);

// Build the TLA tensors with tla::MakeTensor
AscendC::LocalTensor<half> srcL1Tensor;
AscendC::LocalTensor<half> dstL0ATensor;
auto srcTensor = tla::MakeTensor(srcL1Tensor, layoutSrc, Arch::PositionL1{});
auto dstTensor = tla::MakeTensor(dstL0ATensor, layoutDst, Arch::PositionL0A{});

// Instantiate and call; SFINAE selects the matching partial specialization
// from the src/dst layout traits
TileCopyTla<Arch::AtlasA2, decltype(srcTensor), decltype(dstTensor)> copyOp;
copyOp(dstTensor, srcTensor);
```

### TLA style: transposed copy (AtlasA2)

```cpp
auto layoutSrc = tla::MakeLayout<half, layout::nZ>(M, K);
auto layoutDst = tla::MakeLayout<half, layout::zZ>(M, K);
auto srcTensor = tla::MakeTensor(srcL1Tensor, layoutSrc, Arch::PositionL1{});
auto dstTensor = tla::MakeTensor(dstL0ATensor, layoutDst, Arch::PositionL0A{});

// With an nZ source layout and a zZ destination layout (isnZ / iszZ traits),
// the transposed partial specialization is matched automatically
TileCopyTla<Arch::AtlasA2, decltype(srcTensor), decltype(dstTensor)> copyOp;
copyOp(dstTensor, srcTensor);
```

### TLA style: Ascend950 basic copy

```cpp
// On Ascend950 the target layout is zN
auto layoutSrc = tla::MakeLayout<half, layout::zN>(M, K);
auto layoutDst = tla::MakeLayout<half, layout::zN>(M, K);
auto srcTensor = tla::MakeTensor(srcL1Tensor, layoutSrc, Arch::PositionL1{});
auto dstTensor = tla::MakeTensor(dstL0ATensor, layoutDst, Arch::PositionL0A{});

// Ascend950: zN L1 → zN L0A
TileCopyTla<Arch::Ascend950, decltype(srcTensor), decltype(dstTensor)> copyOp;
copyOp(dstTensor, srcTensor);
```

### TLA style: Ascend950 l0Batch batched copy

```cpp
uint32_t l0Batch = 4;

// l0Batch overload: copies multiple batches contiguously
copyOp(dstTensor, srcTensor, l0Batch);
```

### TLA style: Ascend950 MX Scale copy

```cpp
using ElementSrc = float8_e4m3_t;
using ElementDst = AscendC::mx_fp8_e4m3_t;
using ElementMxScale = float8_e8m0_t;

// K-direction extent of the MX scale: every MX_SCALE_GROUP_NUM (32) elements
// share one scale value
const uint32_t mxScaleK = CeilDiv<MX_SCALE_GROUP_NUM>(K);

// Source data layout: L1 zN
auto layoutSrc = tla::MakeLayout<ElementSrc, layout::zN>(M, K);
auto srcTensor = tla::MakeTensor(srcL1Tensor, layoutSrc, Arch::PositionL1{});

// Destination data layout: L0A zN, with element type mx_fp8
auto layoutDst = tla::MakeLayout<ElementDst, layout::zN>(M, K);
auto dstTensor = tla::MakeTensor(dstL0ATensor, layoutDst, Arch::PositionL0A{});

// MX Scale layout: L1 zZ, built with MakeMxScaleLayout
auto layoutScaleL1 = tla::MakeMxScaleLayout<ElementMxScale, layout::zZ, false>(M, mxScaleK);
AscendC::LocalTensor<ElementMxScale> scaleL1Tensor;
auto scaleTensor = tla::MakeTensor(scaleL1Tensor, layoutScaleL1, Arch::PositionL1{});

// MX Scale overload: L1 zN source data + L1 zZ scale → L0A zN mx data
TileCopyTla<Arch::Ascend950, decltype(srcTensor), decltype(dstTensor)> copyOp;
copyOp(dstTensor, srcTensor, scaleTensor);
```

## Template Selection Guide

| Scenario | Recommended template |
| --- | --- |
| General matrix-multiplication tile copy, L1→L0A | `CopyL1ToL0A` (non-TLA) or `TileCopyTla` (TLA) |
| Transposed copy (nZ → zZ / nZ → zN) | `CopyL1ToL0A` or `TileCopyTla` (matched automatically) |
| Ascend950 multi-batch copy | `TileCopyTla` (l0Batch overload) |
| Ascend950 MX floating-point quantization | `TileCopyTla` (MX Scale overload) |
| Convolution copy with NDC1HWC0 | `CopyL1ToL0A` (non-TLA, NDC1HWC0 partial specialization) |
| Code already written in the TLA programming paradigm | `TileCopyTla` (keeps the style consistent) |
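The TLA examples above rely on SFINAE to pick a partial specialization from the source and destination layout traits. The following is a minimal, self-contained sketch of that dispatch idea in plain C++. Everything in the `sketch` namespace (the tag structs, `Tensor`, `TileCopy`, `CopyPath`) is a hypothetical stand-in invented for illustration and is not the Catlass or TLA implementation. Only the three architecture/layout pairings are taken from the examples in this document.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for illustration only; NOT the Catlass/TLA types.
namespace sketch {

// Layout tags named after the layouts discussed in this document.
struct zN {};
struct zZ {};
struct nZ {};

// Architecture tags.
struct AtlasA2 {};
struct Ascend950 {};

// A tensor type that only carries its layout tag at compile time.
template <class Layout>
struct Tensor {
    using LayoutType = Layout;
};

enum class CopyPath { Plain, Transposed };

// The primary template is declared but never defined, so an unsupported
// architecture/layout combination fails at compile time.
template <class Arch, class Src, class Dst, class Enable = void>
struct TileCopy;

// AtlasA2: zN source -> zZ destination, plain copy.
template <class Src, class Dst>
struct TileCopy<AtlasA2, Src, Dst,
    typename std::enable_if<
        std::is_same<typename Src::LayoutType, zN>::value &&
        std::is_same<typename Dst::LayoutType, zZ>::value>::type> {
    static constexpr CopyPath Path() { return CopyPath::Plain; }
};

// AtlasA2: nZ source -> zZ destination, transposed copy.
template <class Src, class Dst>
struct TileCopy<AtlasA2, Src, Dst,
    typename std::enable_if<
        std::is_same<typename Src::LayoutType, nZ>::value &&
        std::is_same<typename Dst::LayoutType, zZ>::value>::type> {
    static constexpr CopyPath Path() { return CopyPath::Transposed; }
};

// Ascend950: zN source -> zN destination, plain copy.
template <class Src, class Dst>
struct TileCopy<Ascend950, Src, Dst,
    typename std::enable_if<
        std::is_same<typename Src::LayoutType, zN>::value &&
        std::is_same<typename Dst::LayoutType, zN>::value>::type> {
    static constexpr CopyPath Path() { return CopyPath::Plain; }
};

}  // namespace sketch

// Compile-time checks: each pairing resolves to exactly one specialization.
static_assert(sketch::TileCopy<sketch::AtlasA2, sketch::Tensor<sketch::zN>,
                               sketch::Tensor<sketch::zZ>>::Path() ==
                  sketch::CopyPath::Plain,
              "AtlasA2 zN -> zZ should select the plain path");
static_assert(sketch::TileCopy<sketch::AtlasA2, sketch::Tensor<sketch::nZ>,
                               sketch::Tensor<sketch::zZ>>::Path() ==
                  sketch::CopyPath::Transposed,
              "AtlasA2 nZ -> zZ should select the transposed path");
```

In this sketch the caller never names a specialization: the layout tags carried by the tensor types decide it, which mirrors how the examples pass `decltype(srcTensor)` and `decltype(dstTensor)`. Requesting a pairing with no matching specialization, such as an Ascend950 copy into a zZ destination, does not compile here because the primary template is left undefined.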