Last month our team decided to deploy DeepSeek V4 on our own GPU cluster for some internal code- and doc-generation workloads. Honestly, getting the model to run isn't the hard part; the hard part is keeping it stable in production. It took me about a week to get the whole chain working, from containerization through K8s orchestration to Prometheus monitoring. This post collects the pits I fell into along with the final setup, hopefully saving you some detours.

A production-grade DeepSeek V4 deployment needs three layers: containerized packaging (Dockerfile + vLLM), K8s orchestration (Deployment + HPA + Service), and end-to-end monitoring with Prometheus + Grafana. Below are complete config files for each layer, all tested in practice.

## TL;DR

| Stage | Approach | Core tools | Pitfall rating |
| --- | --- | --- | --- |
| Model inference | vLLM 0.8.x | vllm serve | ⭐⭐ |
| Containerization | Multi-stage Dockerfile | CUDA 12.4 + Python 3.11 | ⭐⭐⭐ |
| Orchestration | K8s Deployment + HPA | GPU resource limits + custom-metric scaling | ⭐⭐⭐⭐ |
| Monitoring & alerting | Prometheus + Grafana | Built-in vLLM metrics + custom exporter | ⭐⭐⭐ |
| Log collection | Loki + Promtail | Structured logging | ⭐⭐ |

## Environment

My test environment:

- GPU: 4 × NVIDIA A100 80GB (the full-precision DeepSeek V4 needs at least 2 × A100 80GB; a quantized version fits on a single card)
- K8s: v1.29, with the NVIDIA GPU Operator installed
- OS: Ubuntu 22.04
- Storage: model weights on an NFS mount, roughly 140GB for the BF16 full version

First confirm the GPU Operator is healthy:

```bash
# Check that the GPUs are recognized
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Should print your GPU count, e.g. "4"

# Check that nvidia-device-plugin is running
kubectl get pods -n gpu-operator | grep nvidia-device-plugin
```

## Step 1: Containerization with a Dockerfile

This Dockerfile cost me two days. The biggest pit was CUDA/vLLM compatibility: on CUDA 12.6 some of vLLM's kernels fail to compile, and rolling back to 12.4 fixed it.

```dockerfile
#
# DeepSeek V4 production-grade Dockerfile
# Based on the vLLM inference engine, CUDA 12.4
#

# Stage 1: base runtime environment
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3.11-venv python3-pip \
        curl wget git \
    && rm -rf /var/lib/apt/lists/*

RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Stage 2: install dependencies
FROM base AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 3: final image
FROM base AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/dist-packages /usr/local/lib/python3.11/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Health-check script
COPY healthcheck.sh /app/healthcheck.sh
RUN chmod +x /app/healthcheck.sh

# Expose the vLLM API port and the metrics port
EXPOSE 8000
EXPOSE 9090

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD /app/healthcheck.sh

# Launch the vLLM inference server
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/deepseek-v4", \
     "--tensor-parallel-size", "2", \
     "--max-model-len", "32768", \
     "--gpu-memory-utilization", "0.92", \
     "--enable-prefix-caching", \
     "--port", "8000", \
     "--served-model-name", "deepseek-v4", \
     "--trust-remote-code"]
```

**requirements.txt:**

```text
vllm==0.8.4
prometheus-client==0.21.0
```

**healthcheck.sh:**

```bash
#!/bin/bash
# Check that vLLM responds
curl -sf http://localhost:8000/health || exit 1
```

Build and a quick local test:

```bash
docker build -t deepseek-v4-vllm:latest .

# Local test run (assuming the weights live at /data/models/deepseek-v4)
docker run --gpus '"device=0,1"' \
    -v /data/models/deepseek-v4:/models/deepseek-v4 \
    -p 8000:8000 \
    deepseek-v4-vllm:latest
```
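Once the container is up, I like to fire one real request at it before touching K8s. Here's a minimal smoke-test sketch against the OpenAI-compatible endpoint; the model name matches the `--served-model-name` flag above, and the API key is a dummy since vLLM only validates it when `--api-key` is set:

```python
# smoke_test.py: one real request against the local container
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # dummy; vLLM ignores it unless --api-key is set
    base_url="http://localhost:8000/v1",
)

resp = client.chat.completions.create(
    model="deepseek-v4",  # must match --served-model-name
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```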
## Step 2: K8s orchestration

This is the most complex part of the whole setup. I split the config into several files for maintainability.

### 2.1 Namespace and PV (model weight storage)

```yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-serving
  labels:
    app.kubernetes.io/part-of: deepseek-v4
---
# pv-model.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: deepseek-v4-weights
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: 10.0.1.50
    path: /exports/models/deepseek-v4
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v4-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 200Gi
  volumeName: deepseek-v4-weights
```

### 2.2 Deployment (the core piece)

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4
  namespace: llm-serving
  labels:
    app: deepseek-v4
    version: v4-bf16
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # no unavailable replicas during rolling updates
      maxSurge: 1
  template:
    metadata:
      labels:
        app: deepseek-v4
        version: v4-bf16
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: /metrics
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: your-registry.com/deepseek-v4-vllm:latest
          ports:
            - containerPort: 8000
              name: api
              protocol: TCP
          resources:
            requests:
              cpu: 8
              memory: 64Gi
              nvidia.com/gpu: 2
            limits:
              cpu: 16
              memory: 128Gi
              nvidia.com/gpu: 2
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
            - name: VLLM_LOGGING_LEVEL
              value: INFO
            - name: NCCL_P2P_DISABLE
              value: "0"
          volumeMounts:
            - name: model-weights
              mountPath: /models/deepseek-v4
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120   # model loading takes a while
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30   # allow up to 5 minutes to start
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: deepseek-v4-weights
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi   # tensor-parallel communication needs a large shm
```

**Pitfall reminder:** do not forget that `/dev/shm` emptyDir. vLLM's tensor parallelism relies on shared memory for inter-GPU communication, and the default 64MB will OOM and crash outright. My first deployment went into CrashLoopBackOff over and over, and I dug through logs for ages before finding this.

### 2.3 Service + HPA

One caveat: scaling on a Pods metric like this only works if a custom-metrics adapter (e.g. prometheus-adapter) is exposing `vllm_num_requests_running` to the HPA API.

```yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v4-svc
  namespace: llm-serving
  labels:
    app: deepseek-v4
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      name: api
  selector:
    app: deepseek-v4
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-v4-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-v4
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20"   # scale up when a pod averages more than 20 in-flight requests
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120   # add at most 1 pod every 2 minutes; GPUs are expensive
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300   # scale down even more conservatively
```

The overall architecture: client requests flow through an Ingress/Gateway to the K8s Service (`deepseek-v4-svc:8000`), which fronts Pod 1 through Pod N, each a vLLM instance on 2×A100, all reading the 140GB model weights from the NFS PV. Prometheus scrapes `/metrics` from every pod and feeds the Grafana dashboard and AlertManager (DingTalk/Feishu alerts), while the HPA controller reads `vllm_num_requests_running` to scale the pods up and down.

## Step 3: Prometheus + Grafana monitoring

vLLM ships with a `/metrics` endpoint exposing a pile of Prometheus-format metrics out of the box, which is genuinely generous. The defaults aren't quite enough, though, so I added a few key pieces on top.

### 3.1 Prometheus scrape config

```yaml
# prometheus-scrape-config.yaml (append to Prometheus's scrape_configs)
scrape_configs:
  - job_name: deepseek-v4-vllm
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        separator: ":"
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: vllm_.*
        action: keep
```

### 3.2 Core monitoring metrics

These are the ones I consider must-watch in production:

| Metric | Type | Meaning | Suggested alert threshold |
| --- | --- | --- | --- |
| vllm_num_requests_running | Gauge | Requests currently being processed | > 50 |
| vllm_num_requests_waiting | Gauge | Waiting-queue length | > 20 |
| vllm_gpu_cache_usage_perc | Gauge | GPU KV cache utilization | > 95% |
| vllm_avg_generation_throughput_toks_per_s | Gauge | Generation throughput (tokens/s) | < 100 |
| vllm_request_success_total | Counter | Total successful requests | used to compute success rate |
| vllm_e2e_request_latency_seconds | Histogram | End-to-end request latency | P99 > 30s |
| vllm_time_to_first_token_seconds | Histogram | Time to first token (TTFT) | P99 > 5s |
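Before wiring these thresholds into AlertManager, it's worth eyeballing the live values. Here's a small sketch that polls a pod's `/metrics` endpoint and flags anything over threshold; it assumes you've port-forwarded a pod to `localhost:8000` and that the metric names match the table above:

```python
# poll_metrics.py: spot-check live vLLM metrics against the alert thresholds above.
# Assumes: kubectl port-forward deploy/deepseek-v4 8000:8000 -n llm-serving
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

# Thresholds copied from the table above
WATCHED = {
    "vllm_num_requests_waiting": 20,
    "vllm_gpu_cache_usage_perc": 0.95,
}

raw = urllib.request.urlopen("http://localhost:8000/metrics", timeout=5).read().decode()

for family in text_string_to_metric_families(raw):
    for sample in family.samples:
        threshold = WATCHED.get(sample.name)
        if threshold is not None:
            status = "ALERT" if sample.value > threshold else "ok"
            print(f"[{status}] {sample.name} = {sample.value} (threshold {threshold})")
```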
### 3.3 AlertManager alerting rules

```yaml
# alerting-rules.yaml
groups:
  - name: deepseek-v4-alerts
    rules:
      - alert: HighRequestQueueDepth
        expr: vllm_num_requests_waiting{job="deepseek-v4-vllm"} > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: DeepSeek V4 request queue backing up
          description: "{{ $value }} requests have been waiting for 2 minutes; consider scaling up"
      - alert: GPUCacheNearlyFull
        expr: vllm_gpu_cache_usage_perc{job="deepseek-v4-vllm"} > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: GPU KV cache usage above 95%
          description: "Pod {{ $labels.pod }} is about to exhaust its KV cache; new requests will be rejected"
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 30
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: DeepSeek V4 P99 latency above 30 seconds
      - alert: LowThroughput
        expr: vllm_avg_generation_throughput_toks_per_s{job="deepseek-v4-vllm"} < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Generation throughput too low; possible performance problem
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total{namespace="llm-serving", container="vllm"}[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: vLLM pod restarting repeatedly
```

## Pitfall log

These lessons were paid for in real money.

**Pit 1: model loading OOMKilled.** On the first deploy, the pod got killed within tens of seconds of starting; `kubectl describe pod` showed OOMKilled. The cause was a K8s memory limit set too small: DeepSeek V4's BF16 weights are 140GB, and they pass through CPU memory before landing on the GPUs. I had to raise the memory limit to 128Gi before things stabilized.

**Pit 2: startupProbe timeout killing the pod.** Big models load slowly; on A100s, DeepSeek V4 takes about 2-3 minutes to become fully ready. The default `initialDelaySeconds` is nowhere near enough: the kubelet marked the pod unhealthy before loading finished and restarted it. The fix is a `startupProbe` with `failureThreshold: 30`, which allows a full 5 minutes to start.

**Pit 3: cross-node tensor parallel is soul-crushingly slow.** At first I wanted TP=4 across four A100s, but two of the cards were on node A and two on node B. NCCL fell back to network communication and inference throughput dropped 3×. Conclusion: tensor parallelism must stay within a single node. In K8s you can guarantee that with a `nodeSelector` or topology constraints.

**Pit 4: /dev/shm defaults to 64MB.** Covered above, but it bears repeating: Docker gives you only 64MB of shm by default, so in K8s you must mount a large one yourself with an `emptyDir` of `medium: Memory`.

## Hybrid setup: local + cloud API

Honestly, the operational cost of self-hosting DeepSeek V4 is not low. Our team's split: high-frequency, latency-sensitive internal tasks go to the locally deployed V4, while low-frequency tasks, or ones that need closed-source models like GPT-5.5 or Claude Opus 4.6, go through an API. ofox.ai is an AI model aggregation platform: one API key reaches 50+ models including GPT-5.5, Claude Opus 4.6, Gemini 3, and DeepSeek V4, with low-latency direct connections, no proxy required, and Alipay payment support. We wrote a simple router in code:

```python
from openai import OpenAI

# Local DeepSeek V4 (K8s Service address)
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://deepseek-v4-svc.llm-serving:8000/v1",
)

# Cloud aggregation API (GPT-5.5 / Claude and other closed-source models)
cloud_client = OpenAI(
    api_key="your-ofox-key",
    base_url="https://api.ofox.ai/v1",
)

def smart_route(prompt: str, task_type: str = "general"):
    """Pick the local or cloud model based on task type."""
    if task_type in ("code_review", "doc_gen"):
        # High-frequency internal tasks → local DeepSeek V4
        return local_client.chat.completions.create(
            model="deepseek-v4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True,
        )
    else:
        # Tasks that need closed-source model capabilities → cloud API
        return cloud_client.chat.completions.create(
            model="gpt-5.5",  # or claude-opus-4.6
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True,
        )
```

With this, the local cluster absorbs about 80% of the request volume and the remaining 20% goes through the aggregation API, balancing cost against flexibility.

## Wrapping up

Having run this whole stack, the core lessons boil down to three:

1. shm and startupProbe are the pits everyone hits. Don't wait until pods are crash-looping to go look.
2. Never run tensor parallel across nodes; NCCL over the network costs an absurd amount of performance.
3. The metrics most worth watching are `vllm_gpu_cache_usage_perc` and `vllm_num_requests_waiting`. When either spikes, it's time to scale out.

Self-hosting suits teams with GPU capacity and stable request volume. If your workload involves switching between many models or highly variable traffic, an API aggregation platform is probably the better deal. The two aren't mutually exclusive; mix and match.

Questions? Hit the comments. I've stepped in a lot of K8s-plus-GPU pits in particular, and I'll help where I can.
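P.S. One extension of `smart_route` worth sketching: queue-aware fallback, where requests spill over to the cloud API once the local queue backs up. The code below is just a sketch reusing `smart_route` from above; the Prometheus service URL and the threshold of 15 are placeholder assumptions, not values from our cluster:

```python
# fallback_route.py: spill to the cloud API when the local queue backs up (sketch)
import json
import urllib.parse
import urllib.request

# Placeholder in-cluster Prometheus address; adjust for your monitoring stack
PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"
QUEUE_THRESHOLD = 15  # placeholder; tune against your own traffic

def local_queue_depth() -> float:
    """Average vLLM waiting-queue depth across pods, queried from Prometheus."""
    query = urllib.parse.quote("avg(vllm_num_requests_waiting)")
    with urllib.request.urlopen(f"{PROM_URL}?query={query}", timeout=2) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def route_with_fallback(prompt: str):
    """Prefer the local V4; fall back to the cloud path when the queue is deep."""
    try:
        if local_queue_depth() < QUEUE_THRESHOLD:
            return smart_route(prompt, task_type="code_review")  # local route
    except OSError:
        pass  # Prometheus unreachable: assume the worst and go cloud
    return smart_route(prompt, task_type="general")  # cloud route
```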