08 - MLOps and Engineering Practice: Metrics Monitoring with Prometheus + Grafana

张建站

2026/5/9 0:43:40

10 min read

Metrics monitoring with Prometheus + Grafana: system metrics collection, visualization dashboards, and alert configuration.

1. Prometheus Overview

1.1 What Is Prometheus

    import warnings

    import matplotlib.pyplot as plt

    warnings.filterwarnings('ignore')

    print('=' * 60)
    print('Prometheus + Grafana monitoring solution')
    print('=' * 60)

    # Prometheus architecture diagram
    fig, ax = plt.subplots(figsize=(14, 8))
    ax.axis('off')

    # Components and their positions on the canvas
    components = {
        'Service\ndiscovery': (0.15, 0.7),
        'Prometheus\nServer': (0.4, 0.7),
        'Alertmanager': (0.65, 0.7),
        'Grafana': (0.85, 0.7),
        'Exporter': (0.15, 0.4),
        'Pushgateway': (0.4, 0.4),
        'Alerts': (0.65, 0.4),
        'Visualization': (0.85, 0.4),
    }
    for name, (x, y) in components.items():
        # facecolor/edgecolor are set separately; passing color= would override the edge color
        circle = plt.Circle((x, y), 0.08, facecolor='lightblue', edgecolor='black')
        ax.add_patch(circle)
        ax.text(x, y, name, ha='center', va='center', fontsize=7)

    # Connections between the top-row components
    ax.annotate('', xy=(0.3, 0.7), xytext=(0.23, 0.7), arrowprops=dict(arrowstyle='->', lw=1))
    ax.annotate('', xy=(0.55, 0.7), xytext=(0.48, 0.7), arrowprops=dict(arrowstyle='->', lw=1))
    ax.annotate('', xy=(0.75, 0.7), xytext=(0.73, 0.7), arrowprops=dict(arrowstyle='->', lw=1))

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_title('Prometheus architecture', fontsize=14)
    plt.tight_layout()
    plt.show()

    print('\nPrometheus core features:')
    print('  - Multi-dimensional data model (metric name + labels)')
    print('  - Powerful query language (PromQL)')
    print('  - Pull-based metric collection')
    print('  - Service discovery for finding targets automatically')
    print('  - Seamless integration with Grafana')

2. Installing and Configuring Prometheus

2.1 Installation and Configuration

    def prometheus_setup():
        """Prometheus installation and configuration."""
        print('\n' + '=' * 60)
        print('Prometheus installation and configuration')
        print('=' * 60)

        code = """
    # 1. Install with Docker
    docker run -d \\
      -p 9090:9090 \\
      -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \\
      prom/prometheus

    # 2. prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        monitor: 'ml-monitor'

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']

    rule_files:
      - 'alerts.yml'

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'ml-api'
        static_configs:
          - targets: ['ml-api:8000']
        metrics_path: '/metrics'

      - job_name: 'model-service'
        static_configs:
          - targets: ['model-service:8080']

      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']

    # 3. Docker Compose deployment
    version: '3.8'
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
        restart: unless-stopped

      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin
        volumes:
          - grafana-data:/var/lib/grafana
        depends_on:
          - prometheus

    volumes:
      prometheus-data:
      grafana-data:
    """
        print(code)

    prometheus_setup()

3. Exposing Metrics

3.1 Python Application Metrics

    def metrics_exposure():
        """Exposing metrics from a Python application."""
        print('\n' + '=' * 60)
        print('Exposing metrics from a Python application')
        print('=' * 60)

        code = """
    from prometheus_client import Counter, Histogram, Gauge, Summary, Info
    from prometheus_client import start_http_server
    import random
    import time
    import threading

    # 1. Counter: only ever increases
    REQUEST_COUNT = Counter(
        'http_requests_total',
        'Total HTTP requests',
        ['method', 'endpoint', 'status']
    )

    # 2. Histogram: distribution of observations in buckets
    REQUEST_LATENCY = Histogram(
        'http_request_duration_seconds',
        'HTTP request latency',
        ['method', 'endpoint'],
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
    )

    # 3. Gauge: can go up and down
    ACTIVE_REQUESTS = Gauge(
        'http_active_requests',
        'Active HTTP requests',
        ['method']
    )
    MODEL_CONFIDENCE = Gauge(
        'model_prediction_confidence',
        'Model prediction confidence',
        ['model_version']
    )

    # 4. Summary: count and sum of observations
    PREDICTION_LATENCY = Summary(
        'model_prediction_latency_seconds',
        'Model prediction latency',
        ['model_name']
    )

    # 5. Info: static key/value information
    MODEL_INFO = Info('model_info', 'Model information')
    MODEL_INFO.info({'version': 'v1.0', 'framework': 'sklearn'})

    # 6. Custom metrics grouped in a class
    class ModelMetrics:
        def __init__(self):
            self.prediction_count = Counter('model_predictions_total', 'Total predictions')
            self.error_count = Counter('model_errors_total', 'Total errors', ['error_type'])
            self.latency = Histogram('model_inference_latency_seconds', 'Inference latency')

        def record_prediction(self, latency):
            self.prediction_count.inc()
            self.latency.observe(latency)

        def record_error(self, error_type):
            self.error_count.labels(error_type=error_type).inc()

    # 7. Start the metrics HTTP server
    def start_metrics_server(port=8001):
        start_http_server(port)
        print(f'Metrics server started on port {port}')

    # 8. Generate simulated metrics
    def generate_metrics():
        while True:
            # Simulated request
            REQUEST_COUNT.labels(method='GET', endpoint='/predict', status='200').inc()

            # Simulated latency
            latency = random.uniform(0.01, 0.5)
            REQUEST_LATENCY.labels(method='GET', endpoint='/predict').observe(latency)

            # Simulated in-flight request
            ACTIVE_REQUESTS.labels(method='GET').inc()
            time.sleep(random.uniform(0.01, 0.1))
            ACTIVE_REQUESTS.labels(method='GET').dec()

            # Simulated model confidence
            confidence = random.uniform(0.7, 0.99)
            MODEL_CONFIDENCE.labels(model_version='v1').set(confidence)

            time.sleep(1)

    # Start the server and the generator
    start_metrics_server(8001)
    worker = threading.Thread(target=generate_metrics, daemon=True)
    worker.start()
    worker.join()  # keep the process alive; a daemon thread alone would exit with the main thread
    """
        print(code)

    metrics_exposure()

3.2 ML Model Metrics

    def ml_model_metrics():
        """ML model metrics."""
        print('\n' + '=' * 60)
        print('ML model metrics')
        print('=' * 60)

        code = """
    from prometheus_client import Counter, Histogram, Gauge
    import time
    import random

    # Model performance metrics
    PREDICTION_COUNT = Counter(
        'model_predictions_total',
        'Total number of predictions',
        ['model_name', 'model_version']
    )
    PREDICTION_ERRORS = Counter(
        'model_prediction_errors_total',
        'Total prediction errors',
        ['model_name', 'error_type']
    )
    PREDICTION_LATENCY = Histogram(
        'model_prediction_latency_seconds',
        'Prediction latency',
        ['model_name'],
        buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5)
    )
    MODEL_ACCURACY = Gauge(
        'model_accuracy',
        'Model accuracy',
        ['model_name', 'dataset']
    )
    MODEL_DRIFT_SCORE = Gauge(
        'model_drift_score',
        'Model drift score',
        ['model_name', 'feature']
    )
    DATA_DRIFT_SCORE = Gauge(
        'data_drift_score',
        'Data drift score',
        ['feature']
    )

    # Business metrics
    BUSINESS_METRICS = Counter(
        'business_events_total',
        'Business events',
        ['event_type']
    )
    CONVERSION_RATE = Gauge(
        'conversion_rate',
        'Conversion rate',
        ['model_name']
    )

    # Resource metrics
    GPU_UTILIZATION = Gauge(
        'gpu_utilization',
        'GPU utilization percentage',
        ['gpu_id']
    )
    GPU_MEMORY_USED = Gauge(
        'gpu_memory_used_bytes',
        'GPU memory used in bytes',
        ['gpu_id']
    )

    # Simulated metric updates
    def update_model_metrics():
        while True:
            # Model accuracy (a ratio between 0 and 1)
            accuracy = random.uniform(0.85, 0.95)
            MODEL_ACCURACY.labels(model_name='classifier', dataset='test').set(accuracy)

            # Data drift per feature
            for i in range(5):
                drift = random.uniform(0, 0.3)
                DATA_DRIFT_SCORE.labels(feature=f'feature_{i}').set(drift)

            # GPU utilization (a percentage between 0 and 100)
            for i in range(2):
                util = random.uniform(0, 100)
                GPU_UTILIZATION.labels(gpu_id=f'gpu_{i}').set(util)

            time.sleep(10)
    """
        print(code)

    ml_model_metrics()

4. PromQL Queries

4.1 Common Queries

    def promql_queries():
        """PromQL queries."""
        print('\n' + '=' * 60)
        print('PromQL queries')
        print('=' * 60)

        code = """
    # 1. Basic queries
    # Select a single metric
    http_requests_total

    # Filter by label
    http_requests_total{method="GET", status="200"}

    # Regex match
    http_requests_total{method=~"GET|POST"}

    # 2. Aggregation
    # Sum
    sum(http_requests_total)

    # Aggregate by label
    sum(http_requests_total) by (method)

    # Average
    avg(http_requests_total) by (endpoint)

    # Maximum (use a gauge; a histogram has no bare series, only _bucket, _sum and _count)
    max(http_active_requests)

    # Quantile
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    # 3. Rates
    # Average per-second rate over the window (handles counter resets)
    rate(http_requests_total[1m])

    # Instantaneous per-second rate, based on the last two samples in the window
    irate(http_requests_total[5m])

    # 4. Prediction
    # Linear prediction one hour ahead (intended for gauges)
    predict_linear(gpu_memory_used_bytes[1h], 3600)

    # 5. Comparison and logical operators
    # Above a threshold
    model_accuracy > 0.9

    # Combined condition
    model_accuracy > 0.85 or model_accuracy < 0.7

    # 6. Aggregation over time
    # Average prediction latency over the past 5 minutes
    rate(model_prediction_latency_seconds_sum[5m]) / rate(model_prediction_latency_seconds_count[5m])

    # Maximum over the past hour
    max_over_time(gpu_utilization[1h])

    # 7. Vector matching
    # Many-to-one match
    http_requests_total * on(group) group_left group_requests_total

    # 8. Practical examples
    # API error rate
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

    # P99 latency
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    # Model accuracy trend
    avg_over_time(model_accuracy[1h])

    # Data drift detection
    data_drift_score > 0.2
    """
        print(code)

    promql_queries()

5. Grafana Visualization

5.1 Dashboard Configuration

    def grafana_dashboard():
        """Grafana dashboard."""
        print('\n' + '=' * 60)
        print('Grafana dashboard')
        print('=' * 60)

        code = """
    # 1. Add the data source
    # Configuration -> Data Sources -> Add data source
    # Type: Prometheus
    # URL: http://prometheus:9090

    # 2. Dashboard JSON
    {
      "dashboard": {
        "title": "ML Model Monitoring",
        "panels": [
          {
            "title": "Request Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(http_requests_total[1m])",
                "legendFormat": "{{method}} {{endpoint}}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0}
          },
          {
            "title": "P99 Latency",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
                "legendFormat": "p99"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
          },
          {
            "title": "Error Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\\"5..\\"}[5m])) / sum(rate(http_requests_total[5m]))"
              }
            ],
            "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8}
          },
          {
            "title": "Model Accuracy",
            "type": "gauge",
            "targets": [
              {"expr": "model_accuracy"}
            ],
            "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
            "fieldConfig": {
              "defaults": {"unit": "percentunit", "min": 0, "max": 1}
            }
          },
          {
            "title": "Data Drift",
            "type": "heatmap",
            "targets": [
              {"expr": "data_drift_score"}
            ],
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 12}
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "gpu_utilization",
                "legendFormat": "GPU {{gpu_id}}"
              }
            ],
            "gridPos": {"h": 8, "w": 12, "x": 12, "y": 12}
          }
        ],
        "templating": {
          "list": [
            {
              "name": "model_name",
              "type": "query",
              "query": "label_values(model_accuracy, model_name)"
            },
            {
              "name": "environment",
              "type": "custom",
              "query": "production,staging,development"
            }
          ]
        }
      }
    }

    # 3. Create the dashboard from Python
    import requests

    GRAFANA_URL = 'http://localhost:3000'
    API_KEY = 'your-api-key'

    headers = {
        'Authorization': f'Bearer {API_KEY}',
        'Content-Type': 'application/json'
    }

    dashboard_json = {
        'dashboard': {...},  # placeholder: put the dashboard definition from step 2 here
        'overwrite': True
    }

    response = requests.post(
        f'{GRAFANA_URL}/api/dashboards/db',
        headers=headers,
        json=dashboard_json
    )
    """
        print(code)

    grafana_dashboard()

6. Alert Configuration

6.1 Alertmanager Configuration

    def alertmanager_config():
        """Alertmanager configuration."""
        print('\n' + '=' * 60)
        print('Alertmanager alert configuration')
        print('=' * 60)

        code = """
    # 1. alertmanager.yml
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alerts@example.com'
      smtp_auth_username: 'alerts@example.com'
      smtp_auth_password: 'password'

    route:
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'team-alerts'

    receivers:
      - name: 'team-alerts'
        email_configs:
          - to: 'team@example.com'
            send_resolved: true
        slack_configs:
          - channel: '#alerts'
            api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

    # 2. alerts.yml (alerting rules)
    groups:
      - name: ml_alerts
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is {{ $value }} for the last 5 minutes"

          - alert: HighLatency
            expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency detected"
              description: "P99 latency is {{ $value }}s"

          - alert: ModelAccuracyDrop
            expr: model_accuracy < 0.85
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Model accuracy dropped"
              description: "Model accuracy is {{ $value }}"

          - alert: DataDriftDetected
            expr: data_drift_score > 0.2
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "Data drift detected"
              description: "Drift score for feature {{ $labels.feature }} is {{ $value }}"

          - alert: HighGPUUtilization
            expr: gpu_utilization > 95
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High GPU utilization"
              description: "GPU {{ $labels.gpu_id }} utilization is {{ $value }}%"

          - alert: PredictionVolumeDrop
            expr: rate(model_predictions_total[1h]) < 10
            for: 30m
            labels:
              severity: info
            annotations:
              summary: "Prediction volume dropped"
              description: "Prediction rate is {{ $value }} req/s"
    """
        print(code)

    alertmanager_config()

7. Complete Monitoring System

7.1 Deployment Configuration

    def complete_monitoring():
        """Complete monitoring system."""
        print('\n' + '=' * 60)
        print('Complete monitoring system deployment')
        print('=' * 60)

        code = """
    # docker-compose-monitoring.yml
    version: '3.8'

    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./alerts.yml:/etc/prometheus/alerts.yml
          - prometheus-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
        restart: unless-stopped

      alertmanager:
        image: prom/alertmanager:latest
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
        command:
          - '--config.file=/etc/alertmanager/alertmanager.yml'
        restart: unless-stopped

      grafana:
        image: grafana/grafana:latest
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin
          - GF_INSTALL_PLUGINS=grafana-piechart-panel
        volumes:
          - grafana-data:/var/lib/grafana
          - ./dashboards:/etc/grafana/provisioning/dashboards
          - ./datasources:/etc/grafana/provisioning/datasources
        depends_on:
          - prometheus
        restart: unless-stopped

      node-exporter:
        image: prom/node-exporter:latest
        ports:
          - "9100:9100"
        volumes:
          - /proc:/host/proc:ro
          - /sys:/host/sys:ro
          - /:/rootfs:ro
        command:
          - '--path.procfs=/host/proc'
          - '--path.sysfs=/host/sys'
          - '--path.rootfs=/rootfs'
        restart: unless-stopped

      cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest
        ports:
          - "8080:8080"
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
        restart: unless-stopped

      pushgateway:
        image: prom/pushgateway:latest
        ports:
          - "9091:9091"
        restart: unless-stopped

    volumes:
      prometheus-data:
      grafana-data:
    """
        print(code)

    complete_monitoring()

8. Summary

    Component       Function                          Port
    Prometheus      Metrics collection and storage    9090
    Alertmanager    Alert management                  9093
    Grafana         Visualization                     3000
    Node Exporter   Host metrics                      9100
    cAdvisor        Container metrics                 8080
    Pushgateway     Pushed metrics                    9091

Monitoring best practices

- Define SLO/SLI metrics
- Set reasonable alert thresholds
- Use labels for multi-dimensional analysis
- Review dashboards regularly
- Implement alert silencing and dependencies between alerts
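
The P95 and P99 latency queries above all go through histogram_quantile, which works from cumulative bucket counts rather than raw observations, so the result is only an estimate whose precision depends on the bucket boundaries. The following is a minimal pure-Python sketch of that idea, assuming linear interpolation inside the matching bucket and a final +Inf bucket. The function name estimate_quantile and the sample bucket counts are made up for illustration, and this is a simplification, not Prometheus's actual implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the `le` buckets of a histogram: each count includes
    all observations less than or equal to the bound, and the last bound
    must be +Inf. Hypothetical helper for illustration only.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("the last bucket must have an upper bound of +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    # Rank of the requested observation among all observations.
    rank = q * total
    # First bucket whose cumulative count reaches that rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]

    if upper == math.inf:
        # The quantile lies beyond the largest finite bound, so that
        # bound is the best available answer.
        return buckets[-2][0] if len(buckets) > 1 else math.nan

    # The first bucket is assumed to start at 0 (latencies are non-negative).
    lower, prev = (0.0, 0) if idx == 0 else buckets[idx - 1]
    if count == prev:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Made-up cumulative counts for request latency buckets (seconds).
sample = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.9, 0.995):
    print(q, estimate_quantile(q, sample))
```

Because the estimate can never be finer than the bucket it falls into, bucket boundaries are worth choosing around the latency thresholds that the alerts care about, for example the 1 second limit in the HighLatency rule.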
