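Contents

1. Why separate Prefill and Decode?
2. Hardware and software environment
   Hardware configuration
   Software versions
3. Environment preparation
   3.1 NPU device check
   3.2 Create the Docker container
   3.3 Install software dependencies
       Install the CANN packages
       Install PyTorch and torch_npu
       Install vLLM and vLLM-ascend
4. Deployment
   4.1 Generate the ranktable configuration file
   4.2 Write the Prefill launch script
   4.3 Write the Decode launch script
   4.4 Start the services
   4.5 Start the load-balancing proxy
5. Functional verification
   5.1 API test
   5.2 Load test
6. Pitfalls
   Problem 1: torch-npu package not found
   Problem 2: torch and torch-npu dependency conflict
   Problem 3: NPU out of memory
   Problem 4: Quantization operator does not support float16
   Problem 5: HCCL communication initialization fails
Summary

1. Why separate Prefill and Decode?

Anyone who has worked on LLM inference optimization knows that the Prefill and Decode phases behave very differently:

- The Prefill phase is like reading a whole book and taking notes: the full user prompt has to be encoded into hidden states while the KV cache is generated. This is compute-heavy and bound by raw compute throughput.
- The Decode phase is more like producing words one at a time: each step generates a single token, and the step repeats until generation ends. The bottleneck here is device memory bandwidth, which makes it a typical memory-bound workload.

When the two phases run together on the same devices, resources are used unevenly. A PD-separated (Prefill/Decode disaggregated) deployment splits them so that each runs in the hardware setup that suits it best. Combined with vLLM's PagedAttention for KV cache management, this can substantially improve overall throughput.

This article records the full process of deploying the DeepSeek-V3-w8a8 quantized model on the Ascend NPU platform. I hope it helps anyone doing similar work.

2. Hardware and software environment

Hardware configuration

| Item | Configuration |
|---|---|
| Hardware platform | 2 × Atlas 800I A2 servers (16 NPUs × 64 GB each) |
| Operating system | Ubuntu 22.04 |
| Driver version | 25.2.0 |
| Python version | 3.11 |

Software versions

| Component | Version | Notes |
|---|---|---|
| CANN | 8.2.RC1 | Ascend compute architecture |
| torch | 2.5.1+cpu | CPU build |
| torch_npu | 2.5.1.post1.dev20250619 | NPU adaptation |
| torchvision | 0.20.1 | Installed together with torch |
| vLLM | 0.9.1 | Inference engine |
| vLLM-ascend | 0.9.1-dev | NPU adaptation |

3. Environment preparation

3.1 NPU device check

Before deploying, confirm that the NPU devices and the network are healthy. It is worth running every command below:

```bash
# Show NPU device status
npu-smi info

# Check the physical links
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done

# Confirm Ethernet port status (all should be up)
for i in {0..15}; do hccn_tool -i $i -link -g; done

# Check network health (should report success)
for i in {0..15}; do hccn_tool -i $i -net_health -g; done

# Show the IP detection configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g; done

# Show the gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g; done

# Check that the TLS settings are consistent
for i in {0..15}; do hccn_tool -i $i -tls -g; done | grep switch

# Disable TLS verification (this step is important)
for i in {0..15}; do hccn_tool -i $i -tls -s enable 0; done

# Get the IP address of each NPU
for i in {0..15}; do hccn_tool -i $i -ip -g; done

# Test cross-node connectivity (replace NPU-IP with the actual IP)
for i in {0..15}; do hccn_tool -i $i -ping -g address NPU-IP; done
```

3.2 Create the Docker container

A container avoids many dependency conflicts. If you are not root, switch to root first (enter your password when prompted), then create a working directory, replacing your_name with your own name:

```bash
sudo su - root
mkdir /home/your_name
```

Create the container (again, replace your_name in the paths):

```bash
docker run -it --privileged --name vllm_deepseek_pd \
  --net=host \
  --shm-size=500g \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  --device=/dev/devmm_svm \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci4 \
  --device=/dev/davinci5 \
  --device=/dev/davinci6 \
  --device=/dev/davinci7 \
  --device=/dev/davinci8 \
  --device=/dev/davinci9 \
  --device=/dev/davinci10 \
  --device=/dev/davinci11 \
  --device=/dev/davinci12 \
  --device=/dev/davinci13 \
  --device=/dev/davinci14 \
  --device=/dev/davinci15 \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
  -v /usr/local/sbin/:/usr/local/sbin/ \
  -v /var/log/npu/slog/:/var/log/npu/slog \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /var/log/npu/profiling/:/var/log/npu/profiling \
  -v /var/log/npu/dump/:/var/log/npu/dump \
  -v /var/log/npu/:/usr/slog \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /home/:/home \
  -w /home/your_name \
  mindie:2.1.RC1-800I-A2-py311-ubuntu22.04-x86_64 \
  /bin/bash
```

Once inside the container, configure the network proxy and run `curl www.baidu.com` to confirm that the network works.

3.3 Install software dependencies

Install the CANN packages

Download three installers from the Ascend community site:

- Ascend-cann-toolkit_*.run (development toolkit)
- Ascend-cann-kernels-*.run (operator package)
- Ascend-cann-nnal_*.run (neural network acceleration library)

Make sure the target directory has more than 10 GB of free space before installing:

```bash
chmod a+x Ascend-cann-kernels-910b_8.2.RC1_linux-x86_64.run
chmod a+x Ascend-cann-nnal_8.2.RC1_linux-x86_64.run
chmod a+x Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run

# Run the integrity check before each install
./Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run --check
./Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run --install --install-path=/home/your_name/cann_8.2.rc1

./Ascend-cann-kernels-910b_8.2.RC1_linux-x86_64.run --check
./Ascend-cann-kernels-910b_8.2.RC1_linux-x86_64.run --install --install-path=/home/your_name/cann_8.2.rc1

./Ascend-cann-nnal_8.2.RC1_linux-x86_64.run --check
./Ascend-cann-nnal_8.2.RC1_linux-x86_64.run --install --install-path=/home/your_name/cann_8.2.rc1

# Set the environment variables (required every time you enter the container)
source /home/your_name/cann_8.2.rc1/ascend-toolkit/set_env.sh
source /home/your_name/cann_8.2.rc1/nnal/atb/set_env.sh
```

Install PyTorch and torch_npu

Configure the pip index mirrors first, then install:

```bash
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"

pip install attrs cython numpy==1.26.4 decorator sympy==1.13.1 \
  cffi pyyaml pathlib2 psutil protobuf==6.31.1 scipy requests absl-py

pip install torchvision==0.20.1   # pulls in torch 2.5.1+cpu automatically
pip install torch-npu==2.5.1.post1.dev20250619
```

Install vLLM and vLLM-ascend

Clone both repositories from GitHub and install them in editable mode:

```bash
# Install vLLM
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout releases/v0.9.1
VLLM_TARGET_DEVICE=empty pip install -v -e .

# Install vLLM-ascend
cd ..
git clone https://github.com/vllm-project/vllm-ascend
cd vllm-ascend
git checkout v0.9.1-dev
pip install -v -e .
```

If you run into SSL certificate errors, run:

```bash
export GIT_SSL_NO_VERIFY=1
git config --global http.sslVerify false
```

4. Deployment

4.1 Generate the ranktable configuration file

Run this script on both machines to generate the cluster topology configuration:

```bash
cd /home/your_name/vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh \
  --ips 141.61.41.163 141.61.41.164 \
  --npus-per-node 16 \
  --network-card-name ens3f0 \
  --prefill-device-cnt 16 \
  --decode-device-cnt 16
```

Parameters:

- ips: IPs of the P node and the D node, P first and D second
- npus-per-node: number of NPUs on each machine
- network-card-name: network interface name (check with ifconfig)
- prefill-device-cnt: number of NPUs used for Prefill
- decode-device-cnt: number of NPUs used for Decode

4.2 Write the Prefill launch script

On the P node, create start_prefill.sh:

```bash
#!/bin/bash
# Environment variables
export HCCL_IF_IP=141.61.41.163
export GLOO_SOCKET_IFNAME=ens3f0
export TP_SOCKET_IFNAME=ens3f0
export HCCL_SOCKET_IFNAME=ens3f0
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/home/your_name/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=32
export VLLM_USE_V1=1
export VLLM_LLMDD_RPC_PORT=5559

# Start the Prefill service
vllm serve /home/models/DeepSeek-V3.1-w8a8-rot-mtp \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 1 \
  --data-parallel-size-local 1 \
  --api-server-count 1 \
  --data-parallel-address 141.61.41.163 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 64 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config '{
    "kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }'
```

4.3 Write the Decode launch script

On the D node, create start_decode.sh:

```bash
#!/bin/bash
# Environment variables
export HCCL_IF_IP=141.61.41.164
export GLOO_SOCKET_IFNAME=ens3f0
export TP_SOCKET_IFNAME=ens3f0
export HCCL_SOCKET_IFNAME=ens3f0
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/home/your_name/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=32
export VLLM_USE_V1=1
export VLLM_LLMDD_RPC_PORT=5659

# Start the Decode service
vllm serve /home/models/DeepSeek-V3.1-w8a8-rot-mtp \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 1 \
  --data-parallel-size-local 1 \
  --api-server-count 1 \
  --data-parallel-address 141.61.41.164 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 8192 \
  --max-num-batched-tokens 256 \
  --max-num-seqs 64 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config '{
    "kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
  }' \
  --additional-config '{"torchair_graph_config": {"enabled": true}}'
```

4.4 Start the services

Run the launch script on each node:

```bash
# P node
bash start_prefill.sh

# D node
bash start_decode.sh
```

When you see "Application startup complete." the service has started successfully.

4.5 Start the load-balancing proxy

Open a new terminal on the P node and set the environment variables first:

```bash
source /home/your_name/cann_8.2.rc1/ascend-toolkit/set_env.sh
source /home/your_name/cann_8.2.rc1/nnal/atb/set_env.sh
```

Then run the proxy service:

```bash
cd /home/your_name/vllm-ascend/examples/disaggregate_prefill_v1/
python load_balance_proxy_server_example.py \
  --host 141.61.41.163 \
  --port 1025 \
  --prefiller-hosts 141.61.41.163 \
  --prefiller-ports 20002 \
  --decoder-hosts 141.61.41.164 \
  --decoder-ports 20002
```

After a successful start, the proxy prints the number of prefill and decode clients it initialized.

5. Functional verification

5.1 API test

Open a new terminal and send a test request through the proxy:

```bash
curl http://141.61.41.163:1025/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "prompt": "how is it today",
    "max_tokens": 50,
    "temperature": 0
  }'
```

Add -v to see the full request and response exchange. The text generated by the model is in the text field of the choices entry in the response.

5.2 Load test

Run the benchmark script on the P node:

```bash
source /home/your_name/cann_8.2.rc1/ascend-toolkit/set_env.sh
source /home/your_name/cann_8.2.rc1/nnal/atb/set_env.sh

cd /home/your_name/vllm/benchmarks/
python benchmark_serving.py \
  --backend vllm \
  --dataset-name random \
  --random-input-len 10 \
  --random-output-len 100 \
  --num-prompts 10 \
  --ignore-eos \
  --model deepseek \
  --tokenizer /home/models/DeepSeek-V3.1-w8a8-rot-mtp \
  --host 141.61.41.163 \
  --port 1025 \
  --endpoint /v1/completions \
  --max-concurrency 4 \
  --request-rate 4
```

The results report throughput, latency, and other key metrics.

6. Pitfalls

Problem 1: torch-npu package not found

With pip 25.1.1, the install fails because no matching version can be found.

Fix: configure the correct index mirrors.

```bash
pip uninstall torch
pip config set global.trusted-host "download.pytorch.org mirrors.huaweicloud.com mirrors.aliyun.com"
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
pip install torchvision==0.20.1
pip install torch-npu==2.5.1.post1.dev20250619
```

Problem 2: torch and torch-npu dependency conflict

Installing vllm-ascend fails with:

```
Cannot install torch-npu==2.5.1.post1 and torch==2.5.1 because these package versions have conflicting dependencies.
```

Fix: clean the environment and reinstall.

```bash
pip uninstall -y torch torch-npu
pip cache purge
pip config unset global.extra-index-url
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/ https://mirrors.huaweicloud.com/ascend/repos/pypi"
pip install torch==2.5.1
pip install torch-npu==2.5.1.post1.dev20250619

# Verify the installation
pip show torch torch-npu | grep Version
```

Problem 3: NPU out of memory

Starting the service fails with:

```
RuntimeError: NPU out of memory. Tried to allocate 898.00 MiB...
```

Cause: the model was not loaded in quantized mode.

Fix: add --quantization ascend to the vllm serve command.

Problem 4: Quantization operator does not support float16

Error message:

```
Tensor scale not implemented for DT_FLOAT16, should be in dtype support list [DT_UINT64,DT_BFLOAT16,DT_INT64,DT_FLOAT,].
```

Cause: the aclnnQuantMatmulV4 operator does not support the float16 type.

Fix: edit the model's config.json and change torch_dtype from float16 to bfloat16.

Problem 5: HCCL communication initialization fails

Error:

```
RuntimeError: createHCCLComm:torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:2166 HCCL function error...
```

Cause: the HCCL_IF_IP environment variable is set incorrectly.

Fix: check and correct HCCL_IF_IP in the launch script so that it matches the IP of the local machine.

Summary

PD-separated deployment does deliver a clear improvement in LLM inference performance, but the setup is tedious and there are many details to get right. I hope this hands-on record helps you avoid some of the pitfalls. If you hit other problems during deployment, check the official documentation and community discussions first; many common issues already have solutions there.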
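
The manual fixes for Problems 4 and 5 above can be scripted as a pre-launch check. The sketch below is not part of vLLM or vLLM-ascend. The function names fix_torch_dtype and ip_in_list are hypothetical, and it assumes GNU sed (as shipped with Ubuntu 22.04) and a config.json that stores the field as "torch_dtype": "float16" on a single line.

```shell
#!/bin/bash
# Pre-launch sanity helpers for Problems 4 and 5 (illustrative sketch).
# Function names are hypothetical; they are not provided by any Ascend tool.

# Problem 4: rewrite "torch_dtype": "float16" to "bfloat16" in a config.json.
# Relies on GNU sed's in-place (-i) and extended-regex (-E) options.
fix_torch_dtype() {
    local config_file="$1"
    sed -i -E 's/("torch_dtype"[[:space:]]*:[[:space:]]*)"float16"/\1"bfloat16"/' "$config_file"
}

# Problem 5: return success only if the first argument (an IP address)
# equals one of the remaining arguments (the list of local addresses).
ip_in_list() {
    local wanted="$1" addr
    shift
    for addr in "$@"; do
        [ "$addr" = "$wanted" ] && return 0
    done
    return 1
}

# Example usage (commented out so that sourcing this file has no side effects):
#   fix_torch_dtype /home/models/DeepSeek-V3.1-w8a8-rot-mtp/config.json
#   ip_in_list "$HCCL_IF_IP" $(hostname -I) || echo "HCCL_IF_IP is not a local address" >&2
```

Both helpers are safe to rerun: fix_torch_dtype leaves a file that already says bfloat16 unchanged, and ip_in_list only reads its arguments. Calling them at the top of start_prefill.sh and start_decode.sh, after the export lines, would surface a wrong HCCL_IF_IP before the HCCL initialization error does.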