Common Prometheus Alerting Rules (rules.yml)

张建站

2026/6/10 7:41:31

10 min read

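Overview

This article collects alerting rules for services commonly found in enterprise operations: servers, Redis, RabbitMQ, Kafka, SSL certificates, and Elasticsearch. Every rule comes with its PromQL expression and is ready to copy into Prometheus.

Where rule files go

Prometheus alerting rules are stored as YAML files under the rules/ directory and referenced from prometheus.yml:

    rule_files:
      - rules/*.yml

After changing the configuration, remember to reload:

    systemctl reload prometheus

1. Basic server alerts (node_exporter)

File: node_exporter_rules.yml

Server down

    - alert: ServerDown
      expr: up == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} is down"
        description: "{{ $labels.instance }} has not responded for more than 3 minutes"

CPU usage above 90%

    - alert: HighCpuUsage
      expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} CPU usage is too high"
        description: "CPU usage is {{ $value }}%, above 90%"

How it is calculated: irate derives the per-second idle rate from the two most recent samples inside the 5-minute window, and usage is obtained by subtracting that idle percentage from 100. irate suits jittery metrics such as CPU because it reacts quickly to momentary changes.

Memory usage too high

    - alert: HighMemoryUsage
      expr: (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 < 10
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} is low on memory"
        description: "Available memory (including cache) is below 10%"

The calculation uses MemFree + Cached + Buffers to measure remaining memory. Because Linux can reclaim its cache, this reflects reality better than looking at MemFree alone.

Disk IO too high

    - alert: HighDiskIo
      expr: avg by (instance) (irate(node_disk_io_time_seconds_total[1m])) * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} disk IO is too high"
        description: "Disk IO utilization is {{ $value }}%"

Inbound network bandwidth above 100 Mbps

    - alert: HighNetworkInbound
      expr: sum by (instance) (rate(node_network_receive_bytes_total{device!~"tap.*|veth.*|br.*|docker.*|virbr.*|lo.*"}[5m])) * 8 / 1000 / 1000 > 100
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} inbound traffic is too high"
        description: "Inbound bandwidth is {{ $value }} Mbps, above 100 Mbps for 5 minutes"

Virtual interfaces (docker, veth, br, and so on) are filtered out so that only physical NICs are monitored. Multiplying by 8 converts bytes to bits, and dividing by 1,000,000 converts the result to Mbps, so the threshold of 100 corresponds to 100 Mbps.

Outbound network bandwidth above 100 Mbps

    - alert: HighNetworkOutbound
      expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!~"tap.*|veth.*|br.*|docker.*|virbr.*|lo.*"}[5m])) * 8 / 1000 / 1000 > 100
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} outbound traffic is too high"
        description: "Outbound bandwidth is {{ $value }} Mbps, above 100 Mbps for 5 minutes"

Too many TCP connections

    - alert: HighTcpConnections
      expr: node_netstat_Tcp_CurrEstab > 10000
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Server {{ $labels.instance }} has too many TCP connections"
        description: "Current TCP ESTABLISHED connections: {{ $value }}, above 10000"

Disk usage above 90%

    - alert: HighDiskUsage
      expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 90
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.mountpoint }} disk usage is too high"
        description: "Partition {{ $labels.mountpoint }} usage is {{ $value }}%"

Only ext4 and xfs filesystems are monitored; virtual filesystems such as tmpfs and overlay are skipped.

2. Redis alerts (redis_exporter)

File: redis_exporter_rules.yml

    - alert: RedisDown
      expr: redis_up == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Redis {{ $labels.instance }} is down"
        description: "The Redis service is unreachable"

    - alert: RedisConnectionsAbove80Percent
      expr: redis_connected_clients / redis_config_maxclients * 100 > 80
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Redis {{ $labels.instance }} has too many connections"
        description: "Connection usage is {{ $value }}%, above 80% of the maximum number of clients"

3. RabbitMQ alerts (rabbitmq_exporter)

File: rabbitmq_exporter_rules.yml

    - alert: RabbitMQDown
      expr: rabbitmq_up == 0
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ {{ $labels.instance }} is down"
        description: "RabbitMQ has stopped running"

    - alert: RabbitMQMemoryAbove2GB
      expr: rabbitmq_node_mem_used / 1024 / 1024 > 2048
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ {{ $labels.instance }} memory usage is too high"
        description: "Current memory usage is {{ $value }} MB, above 2048 MB"

4. Kafka alerts (kafka_exporter)

File: kafka_exporter_rules.yml

    - alert: KafkaConsumerGroupLag
      expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 50000
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Topic {{ $labels.topic }} consumption is lagging"
        description: "Consumer group {{ $labels.consumergroup }} has a backlog of more than 50000 messages"

    - alert: KafkaBrokersReduced
      expr: kafka_brokers < 3
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Kafka cluster has lost brokers"
        description: "Current broker count is {{ $value }}, fewer than 3"

    - alert: KafkaTopicNoIncomingMessages
      expr: sum by (topic) (rate(kafka_topic_partition_current_offset[5m])) == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Topic {{ $labels.topic }} has no incoming messages"
        description: "No new messages have been written to this topic for 5 minutes"

5. SSL certificate expiry monitoring (blackbox_exporter)

File: ssl_expiry.yml

    - alert: SslCertExpiringSoon
      expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
      for: 5m
      labels:
        severity: major
      annotations:
        summary: "SSL certificate expires within 30 days"
        description: "The certificate has {{ $value }} seconds left; renew it soon"

    - alert: SslCertExpired
      expr: probe_ssl_earliest_cert_expiry - time() <= 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SSL certificate has expired"
        description: "The certificate has expired; renew it immediately"

probe_ssl_earliest_cert_expiry returns the certificate's expiry time as a Unix timestamp in seconds. Subtracting the current time, time(), gives the remaining validity. 86400 is the number of seconds in a day.

6. Elasticsearch alerts (elasticsearch_exporter)

File: elasticsearch_exporter_rules.yml

    - alert: EsClusterNodesReduced
      expr: elasticsearch_cluster_health_number_of_nodes < 3
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "ES cluster node count has dropped"
        description: "Current node count is {{ $value }}, fewer than 3"

    - alert: EsJvmHeapUsageHigh
      expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "ES JVM memory usage is too high"
        description: "Heap usage is {{ $value }}%, above 90%"

Usage suggestions

Threshold tuning: the thresholds above are generic values and should be adjusted to your environment. For disk alerts, for example, a data disk can be allowed to reach 95%, while the root partition is better alerted on at 85%.

Cluster node counts: kafka_brokers < 3 and the Elasticsearch node count < 3 assume your cluster has at least 3 nodes. Change them if your architecture differs.

The for parameter: for: 5m means the condition must hold continuously for 5 minutes before the alert fires, which prevents false alarms. Adjust it to match what the business can tolerate.

A typo around 86400 * 30: the original text had 86400 * 300 (300 days). The intended value is 30 days, and it has been corrected here.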
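The snippets in this article list only individual alert entries. In a rule file that Prometheus loads, such entries sit under a `groups` list, each group having a `name` and a `rules` key. The following is a minimal sketch of a complete file built from the server-down rule; the group name `node_exporter_rules` and the English alert name are illustrative choices, not something the article prescribes.

```yaml
# rules/node_exporter_rules.yml
# Assumption: the group name below is hypothetical; any name unique
# within the file works.
groups:
  - name: node_exporter_rules
    rules:
      # Each "- alert:" entry from the article goes under "rules:".
      - alert: ServerDown            # illustrative alert name
        expr: up == 0                # target failed its last scrape
        for: 3m                      # must hold for 3 minutes before firing
        labels:
          severity: critical
        annotations:
          # Quoted because the values contain "{{ ... }}" templates.
          summary: "Server {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has not responded for more than 3 minutes"
```

Running `promtool check rules rules/node_exporter_rules.yml` before reloading should catch syntax errors in the file without touching the running server.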
