Microservice Monitoring: Prometheus and Grafana in Practice
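Hi everyone, I'm 欧阳瑞 (Rich Own). Today I want to talk about an important topic: monitoring microservices. As a full-stack developer, I see monitoring as the key to keeping a system running reliably, so here is what I have learned from using Prometheus and Grafana in practice.

## Why Monitor?

| Scenario | What monitoring gives you |
| --- | --- |
| Troubleshooting | Locate problems quickly |
| Performance tuning | Find performance bottlenecks |
| Capacity planning | Forecast resource needs |
| Security auditing | Trace abnormal behavior |

## About Prometheus

Prometheus is an open-source monitoring system with the following characteristics:

- A multi-dimensional data model
- A flexible query language, PromQL
- An efficient time-series database
- Built-in alerting

### Installing Prometheus

```bash
# Install with Docker
docker run -d --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```

### Configuration file

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:3000']
    metrics_path: '/metrics'
```

### Metric types

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: a value that only ever increases
http_requests_total = Counter("http_requests_total", "Total HTTP requests")

# Gauge: a value that can go up and down
memory_usage = Gauge("memory_usage_bytes", "Memory usage in bytes")

# Histogram: observations counted into buckets
request_duration = Histogram("request_duration_seconds", "Request duration")

# Summary: count and sum of observations
response_size = Summary("response_size_bytes", "Response size")
```

## In Practice: Monitoring an API Service

```python
from flask import Flask
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "endpoint"])
DURATION = Histogram("request_duration_seconds", "Request duration")


@app.route("/")
@DURATION.time()  # record how long each call to index() takes
def index():
    REQUESTS.labels(method="GET", endpoint="/").inc()
    return "Hello World"


@app.route("/metrics")
def metrics():
    # Expose all registered metrics in the Prometheus text format
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}


if __name__ == "__main__":
    # Bind to all interfaces so Prometheus can scrape from another container
    app.run(host="0.0.0.0", port=3000)
```

## Configuring Grafana

```bash
# Install Grafana with Docker
docker run -d --name grafana \
  -p 3000:3000 \
  -v /path/to/grafana-data:/var/lib/grafana \
  grafana/grafana
```

Note that the Flask example above also listens on port 3000. If both run on the same host, publish Grafana on a different host port, for example `-p 3001:3000`.

### Configuring the data source

```yaml
# Add the Prometheus data source
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
```

### Creating a dashboard

```json
{
  "dashboard": {
    "id": null,
    "title": "API Monitoring",
    "panels": [
      {
        "type": "graph",
        "title": "Request rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "type": "graph",
        "title": "Request latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          }
        ]
      }
    ]
  }
}
```

## Configuring Alerts

Prometheus only loads this file if it is listed under `rule_files` in `prometheus.yml`. The rules also assume the service exports an `http_errors_total` counter, which the Flask example above does not define. Because the first expression is a per-second rate, the threshold of 0.1 means 0.1 errors per second, not 10%.

```yaml
# alerting_rules.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/s for API service"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"
```

## Best Practices

### 1. Metric naming conventions

```text
# metric_type_name_unit
http_requests_total
memory_usage_bytes
request_duration_seconds
```

### 2. Label management

```python
# Assumes REQUESTS was declared with the labels ["method", "endpoint", "status_code"]
REQUESTS.labels(
    method="GET",
    endpoint="/api/users",
    status_code="200",
).inc()
```

### 3. Visualization tips

A Histogram does not export a series named `request_duration_seconds`, only `_bucket`, `_sum` and `_count`, so the average latency is computed from the sum and the count. The `memory_total_bytes` metric is assumed to be exported alongside `memory_usage_bytes`.

```json
{
  "panels": [
    {
      "type": "stat",
      "title": "Average latency",
      "targets": [
        {
          "expr": "rate(request_duration_seconds_sum[5m]) / rate(request_duration_seconds_count[5m])"
        }
      ]
    },
    {
      "type": "gauge",
      "title": "Memory usage (%)",
      "targets": [
        {
          "expr": "memory_usage_bytes / memory_total_bytes * 100"
        }
      ]
    }
  ]
}
```

## Summary

Prometheus and Grafana are the golden pairing of the monitoring world. With well-designed metrics and sensible dashboards, you can keep a complete view of how your system is running.

My bearded dragon, Hash, has its own take on monitoring: it is always watching for changes in its surroundings. Maybe that is nature's own monitoring system.

If you are interested in monitoring, leave a comment and let's talk. I'm 欧阳瑞, and the geek's road never ends.

Tech stack: Prometheus · Grafana · Monitoring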
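
A footnote on the P95 expressions used in the dashboard and the latency alert above: `histogram_quantile` never sees individual request durations, only the cumulative bucket counters that a Histogram exports. The sketch below is a simplified illustration in plain Python of the interpolation idea, not Prometheus's actual implementation. The function name `estimate_quantile` and the sample bucket counts are made up for illustration, and it assumes non-negative bucket bounds with a final `+Inf` bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    ending with (math.inf, total_count). Hypothetical helper for illustration.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total  # position of the wanted observation
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: the best available answer
                # is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Made-up sample: 100 requests, cumulative counts per upper bound in seconds
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    # About 0.81 s with these sample counts
    print(estimate_quantile(0.95, SAMPLE_BUCKETS))
```

The practical takeaway is that the estimate can never be finer than the bucket boundaries: with the sample data, any quantile above the 98th simply returns 1.0. If you alert on a threshold such as 1 second, it helps to have bucket bounds close to that threshold, which `prometheus_client` lets you set through the `buckets` argument of `Histogram`.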