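Cloud-Native Monitoring in Practice, 2026: Building an Enterprise Monitoring Stack with Prometheus + Grafana

Build a complete monitoring setup from scratch that covers servers, databases, and middleware, with the goal of alert accuracy above 95%.

Why Does the Cloud-Native Era Need Prometheus + Grafana?

Anyone who has worked in operations knows that once the server count grows, traditional monitoring tools (Zabbix, Nagios) start to struggle: configuration is tedious, scaling is hard, and alert rules are cryptic to write. In cloud-native, containerized environments, Prometheus + Grafana has become the "golden pair" of monitoring.

Prometheus vs. Traditional Monitoring Tools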

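| Feature | Prometheus | Zabbix | Nagios |
| --- | --- | --- | --- |
| Data model | Time-series database (TSDB) | Relational database | RRD files |
| Service discovery | Native, automatic | Manual configuration | Manual configuration |
| Container monitoring | Native | Requires plugins | Essentially unsupported |
| Alerting | Alertmanager, feature-rich | Built in, adequate | Relies on plugins |
| Query language | PromQL, flexible and expressive | Simple expressions | Essentially none |
| Horizontal scaling | Federation | Proxy/node mode | Distributed architecture |
| Community ecosystem | Extremely rich (over a thousand exporters) | Mature and stable | Long established, slow to update |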

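In one sentence: for a traditional physical or virtual machine environment, Zabbix is still sufficient; for containerized, microservice, cloud-native environments, Prometheus is the clear choice.

Overall Architecture

Before getting hands-on, get the overall architecture straight, so you do not spend half a day building the wrong thing.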

┌─────────────────────────────────────────────────────┐
│               Grafana (visualization)               │
│      dashboards / charts / alert notifications      │
└──────────────────────┬──────────────────────────────┘
                       │ queries data
                       ▼
┌─────────────────────────────────────────────────────┐
│           Prometheus (scraping + storage)           │
│  pull-based scraping / PromQL / service discovery   │
└──────┬──────────┬──────────┬──────────┬─────────────┘
       │          │          │          │
       ▼          ▼          ▼          ▼
   Node      MySQL      Redis      Nginx
  Exporter  Exporter   Exporter   Exporter
  (hosts)   (database) (cache)    (web)
       │          │          │          │
       ▼          ▼          ▼          ▼
  Physical/   MySQL      Redis      Nginx
  cloud host  instance   instance   instance

┌─────────────────────────────────────────────────────┐
│           Alertmanager (alert management)           │
│   grouping / silencing / routing / notifications    │
│             (email / DingTalk / WeCom)              │
└─────────────────────────────────────────────────────┘

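Core components:

**Prometheus Server**: the core component; it scrapes (pulls) metrics from each target and stores them

**Exporter**: the per-service metrics collector, which exposes an HTTP `/metrics` endpoint

**Grafana**: the visualization layer, which turns Prometheus data into charts and dashboards

**Alertmanager**: alert management, with grouping, inhibition, silencing, and multi-channel notification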
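
To make the Exporter role concrete, here is a minimal sketch of the plain-text format that a `/metrics` endpoint returns and that Prometheus parses on every scrape. The metric name `demo_queue_depth` and its `queue` label are made up for illustration, and label-value escaping is left out; real exporters normally rely on an official client library instead of formatting lines by hand.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here, which a real exporter must do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gauge with one sample per queue.
    print(render_metric(
        "demo_queue_depth",
        "Jobs waiting in the queue.",
        "gauge",
        [({"queue": "email"}, 3), ({"queue": "sms"}, 0)],
    ))
```

Each scrape is just an HTTP GET that returns text like this, which is why adding a new kind of target mostly means running one more exporter.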

Step 1: One-Command Deployment with Docker Compose

Use Docker Compose to bring up Prometheus + Grafana + Alertmanager together, which avoids installing each one separately.

Directory Layout

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alert_rules.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── alertmanager/
    └── alertmanager.yml

docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.54.0
    container_name: prometheus
    restart: always
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=your_password_here
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - monitoring

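  node_exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node_exporter
    restart: always
    pid: host                        # see host processes, not only the container's
    ports:
      - "9100:9100"
    volumes:
      - /:/host:ro,rslave            # read-only view of the host filesystem
    command:
      - '--path.rootfs=/host'        # report host filesystems instead of the container's
      # The collectors below are enabled by default; they are listed for clarity.
      - '--collector.cpu'
      - '--collector.diskstats'
      - '--collector.filesystem'
      - '--collector.loadavg'
      - '--collector.meminfo'
      - '--collector.netdev'         # network interfaces (there is no "network" collector)
      - '--collector.netstat'        # needed for node_netstat_* metrics
    networks:
      - monitoring                   # on a bridge network, netdev reports the container's interfaces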

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

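prometheus.yml (Core Configuration)

global:
  scrape_interval: 15s         # scrape every 15 seconds
  evaluation_interval: 15s     # evaluate alert rules every 15 seconds
  scrape_timeout: 10s          # per-scrape timeout

# Alert rule files
rule_files:
  - 'alert_rules.yml'

# Alertmanager endpoints
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'alertmanager:9093'

# Scrape targets
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Basic host metrics (CPU / memory / disk / network)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']

  # MySQL (optional): uncomment once mysql_exporter from Step 5 is running.
  # A configured but absent target reports up == 0 and triggers InstanceDown.
  # - job_name: 'mysql'
  #   static_configs:
  #     - targets: ['mysql_exporter:9104']

  # Redis (optional): uncomment once redis_exporter from Step 5 is running.
  # - job_name: 'redis'
  #   static_configs:
  #     - targets: ['redis_exporter:9121']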

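Start the whole stack (use `docker-compose` instead if you still run the legacy v1 binary):

cd monitoring
docker compose up -d

Once it is up, open:

Prometheus: `http://YOUR_IP:9090`

Grafana: `http://YOUR_IP:3000` (log in as `admin` with the password set in `GF_SECURITY_ADMIN_PASSWORD`)

Alertmanager: `http://YOUR_IP:9093`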
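
Rather than opening three browser tabs, the same check can be scripted. The sketch below assumes the ports from the Compose file above and uses each service's health endpoint (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana); the function names are illustrative.

```python
import urllib.error
import urllib.request

# Service name -> (published port, health path); ports follow docker-compose.yml.
HEALTH_PATHS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "grafana": (3000, "/api/health"),
}


def health_urls(host):
    """Build the health-check URL for each service on the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in HEALTH_PATHS.items()
    }


def is_healthy(url, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_stack(host):
    """Map each service name to True/False depending on its health check."""
    return {name: is_healthy(url) for name, url in health_urls(host).items()}
```

Calling `check_stack("127.0.0.1")` on the Docker host a few seconds after startup should show which of the three containers are answering.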

Step 2: Configure the Core Alert Rules

Monitoring without alerting is no monitoring at all. Below are the core alert rules I have validated in production. They use `rate` rather than `irate`, because `irate` looks only at the last two samples and makes alerts flap on short spikes.

alert_rules.yml

groups:
  - name: Basic server alerts
    rules:
      # CPU usage above 85%
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage - {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%, sustained for 5 minutes"

      # Memory usage above 90%
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage - {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # Disk usage above 85%
      - alert: HighDiskUsage
        expr: (1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage - {{ $labels.instance }}"
          description: "Disk usage on {{ $labels.mountpoint }} is {{ $value | printf \"%.1f\" }}%"

      # Disk predicted to fill within 24 hours
      - alert: DiskWillFillIn24h
        expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[1h], 24*3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill within 24 hours - {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} is predicted to run out of space within 24 hours"

      # Target down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down - {{ $labels.instance }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"

  - name: Network alerts
    rules:
      # Abnormal outbound bandwidth
      - alert: HighNetworkOut
        expr: rate(node_network_transmit_bytes_total{device!~"lo"}[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal outbound bandwidth - {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} is sending more than 100 MiB/s"
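
The disk-fill rule depends on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it forward. The sketch below shows that idea with an ordinary least-squares fit. It is a simplification: it extrapolates from the last sample's timestamp, whereas PromQL extrapolates from the evaluation time, and the sample values are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return its value seconds_ahead after the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples need distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


if __name__ == "__main__":
    # Hypothetical free-space readings: 10 GB shrinking by 0.5 GB every 15 minutes.
    free_bytes = [(i * 900, 10e9 - i * 0.5e9) for i in range(5)]
    predicted = predict_linear(free_bytes, 24 * 3600)
    print("alert condition met:", predicted < 0)
```

A negative prediction means the fitted trend crosses zero free bytes inside the 24-hour window, which is the condition the rule's `< 0` comparison tests.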

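Alert Severity Levels

| Severity | Meaning | Response time |
| --- | --- | --- |
| critical | Serious failure that affects the business | Respond within 5 minutes |
| warning | Early warning that needs attention | Handle within 30 minutes |
| info | Informational; recording it is enough | Next business day |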

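Step 3: Configure Alert Notifications (DingTalk / WeCom)

Configuring Alertmanager with a WeCom Webhook

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s        # wait 30s so alerts in the same group go out together
  group_interval: 5m     # wait 5m before notifying about changes to an existing group
  repeat_interval: 4h    # re-send a still-firing alert every 4h
  receiver: 'wechat'
  routes:
    # Critical alerts go to their own receiver and repeat every 30m
    - matchers:
        - severity="critical"
      receiver: 'wechat-critical'
      repeat_interval: 30m

receivers:
  # Alertmanager POSTs its own JSON schema, which the WeCom group-robot
  # endpoint does not accept as-is. In practice a relay that reformats the
  # body into {"msgtype": ...} has to sit between Alertmanager and this URL.
  - name: 'wechat'
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY'
        send_resolved: true

  - name: 'wechat-critical'
    webhook_configs:
      - url: 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY'
        send_resolved: true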
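
Since the body that Alertmanager's generic webhook sends differs from the body a WeCom group robot expects, the conversion a relay has to perform can be sketched as follows. Only the reformatting step is shown; the HTTP server around it and the one-line message layout are assumptions of this example, not something the setup above prescribes.

```python
import json


def to_wecom_text(payload):
    """Convert an Alertmanager webhook payload (a dict) into the JSON body
    of a WeCom group-robot "text" message, one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}


if __name__ == "__main__":
    # Trimmed-down example of what Alertmanager posts to a webhook receiver.
    sample = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "severity": "critical"},
            "annotations": {"summary": "Instance down - node_exporter:9100"},
        }],
    }
    print(json.dumps(to_wecom_text(sample), ensure_ascii=False))
```

A relay would accept Alertmanager's POST, pass the parsed JSON through a function like this, and forward the result to the robot URL.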

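Practical advice:

Run with `severity: warning` for a week first, and enable `critical` only after confirming there are no false positives

Create a dedicated "monitoring alerts" WeCom group so that other colleagues are not disturbed

Do not set `repeat_interval` too short, or the resulting alert storm will wear you down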
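
To put numbers on the `repeat_interval` advice, the arithmetic below estimates how many repeat notifications a single alert group produces per day if it never resolves. It is a rough estimate that ignores `group_wait` and `group_interval`, and the function name is illustrative.

```python
from datetime import timedelta


def repeats_per_day(repeat_interval: timedelta) -> int:
    """Approximate number of repeat notifications in 24 hours for one
    alert group that stays firing the whole time."""
    if repeat_interval <= timedelta(0):
        raise ValueError("repeat_interval must be positive")
    return timedelta(days=1) // repeat_interval


if __name__ == "__main__":
    for label, interval in [
        ("default route (4h)", timedelta(hours=4)),
        ("critical route (30m)", timedelta(minutes=30)),
    ]:
        print(f"{label}: about {repeats_per_day(interval)} repeats per day")
```

With the intervals used in this article that works out to about 6 repeats a day on the default route and 48 on the critical route, per alert group, so a handful of stuck critical alerts already fills a chat channel.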

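Step 4: Import Ready-Made Grafana Dashboards

One of Grafana's biggest strengths is the large number of ready-made community dashboard templates, so you do not have to draw every panel from scratch.

Provisioning the Data Source Automatically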

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

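Recommended Dashboards to Import

| Dashboard ID | Name | Purpose |
| --- | --- | --- |
| 1860 | Node Exporter Full | Overall host monitoring (CPU / memory / disk / network) |
| 7362 | Prometheus Overview | Health of Prometheus itself |
| 14031 | MySQL Overview | MySQL database monitoring |
| 11835 | Redis Dashboard | Redis cache monitoring |
| 12740 | Nginx Monitoring | Nginx web server monitoring |

When importing by ID in the Grafana UI, confirm that the title Grafana shows for the ID matches the dashboard you expect before saving.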
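
For repeatable setups, the dashboard JSON can also be fetched by ID and kept in the repository. The sketch below assumes that grafana.com serves a dashboard's latest revision at the URL pattern shown; treat that pattern, the file naming, and the helper names as assumptions to verify rather than a documented contract.

```python
import urllib.request
from pathlib import Path


def dashboard_download_url(dashboard_id, revision="latest"):
    """Build the assumed grafana.com download URL for a dashboard revision."""
    return (
        f"https://grafana.com/api/dashboards/{dashboard_id}"
        f"/revisions/{revision}/download"
    )


def fetch_dashboard(dashboard_id, dest_dir, timeout=30.0):
    """Download one dashboard's JSON into dest_dir and return the file path."""
    dest = Path(dest_dir) / f"{dashboard_id}.json"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(dashboard_download_url(dashboard_id), timeout=timeout) as resp:
        dest.write_bytes(resp.read())
    return dest


if __name__ == "__main__":
    # Only prints the URLs; call fetch_dashboard() to actually download.
    for dashboard_id in (1860, 11835):
        print(dashboard_download_url(dashboard_id))
```

Downloaded files often contain a data-source placeholder that has to be pointed at the `Prometheus` data source before Grafana can load them from disk.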

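Step 5: Add MySQL and Redis Monitoring

MySQL Exporter

Add the following to docker-compose.yml:

  mysql_exporter:
    image: prom/mysqld-exporter:v0.15.0
    container_name: mysql_exporter
    restart: always
    ports:
      - "9104:9104"
    # mysqld_exporter 0.15.0 dropped the DATA_SOURCE_NAME variable:
    # pass the address and user as flags and the password via the environment.
    command:
      - '--mysqld.address=mysql_host:3306'
      - '--mysqld.username=root'       # a dedicated low-privilege account is safer than root
    environment:
      - MYSQLD_EXPORTER_PASSWORD=your_password
    networks:
      - monitoring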

Redis Exporter

  redis_exporter:
    image: oliver006/redis_exporter:v1.61.0
    container_name: redis_exporter
    restart: always
    ports:
      - "9121:9121"
    environment:
      - REDIS_ADDR=redis://redis_host:6379
      - REDIS_PASSWORD=your_password
    networks:
      - monitoring

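After adding them, run `docker compose up -d` to start the new containers, make sure the `mysql` and `redis` jobs are active under scrape_configs in prometheus.yml, and then run `docker compose restart prometheus`.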
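
Because the Compose file starts Prometheus with `--web.enable-lifecycle`, a configuration reload can also be triggered over HTTP instead of restarting the container. A minimal sketch, assuming Prometheus is reachable on `localhost:9090`; the helper names are illustrative.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for Prometheus's /-/reload lifecycle endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url="http://localhost:9090", timeout=10.0):
    """Ask Prometheus to re-read its configuration; True on HTTP 200.

    Raises urllib.error.HTTPError if Prometheus rejects the request,
    for example when the new configuration is invalid.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status == 200
```

A reload keeps the in-memory state and open TSDB intact, which makes it gentler than a restart when only scrape targets or rules changed.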

PromQL Cheat Sheet

Queries that come up constantly in day-to-day operations, worth bookmarking:

# CPU usage (%)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage (%), per mount point
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Network throughput (in / out, MiB/s)
irate(node_network_receive_bytes_total{device!="lo"}[5m]) / 1024 / 1024
irate(node_network_transmit_bytes_total{device!="lo"}[5m]) / 1024 / 1024

# Disk I/O (read / write, MiB/s)
irate(node_disk_read_bytes_total[5m]) / 1024 / 1024
irate(node_disk_written_bytes_total[5m]) / 1024 / 1024

# System load averages
node_load1
node_load5
node_load15

# Established TCP connections
node_netstat_Tcp_CurrEstab

# MySQL slow queries per second
rate(mysql_global_status_slow_queries[5m])

# Redis memory usage (%); instances without maxmemory report 0 and are skipped
redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100
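
Most of these queries wrap a counter in `rate` or `irate`. The sketch below shows the core idea, the per-second increase of a counter with resets taken into account, and how the CPU query builds on it. It is deliberately simplified: the real `rate` also extrapolates to the edges of the range window, and the sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the new value
    itself counts as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


def cpu_usage_percent(idle_samples):
    """CPU usage of one core from its idle-seconds counter, mirroring
    100 - rate(idle) * 100 in the query above."""
    return 100.0 - simple_rate(idle_samples) * 100.0


if __name__ == "__main__":
    # Hypothetical core that was idle for 12 of the last 60 seconds.
    print(round(cpu_usage_percent([(0, 1000.0), (60, 1012.0)]), 1))
```

The idle counter grows by at most one second per second per core, so its rate is the idle fraction and subtracting from 100 gives the busy percentage.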

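Lessons Learned and Tuning Advice

1. Data Retention

Prometheus keeps data for 15 days by default; for production, 30 days is a reasonable setting. In Prometheus 2.x, retention is set with command-line flags rather than in prometheus.yml:

# docker-compose.yml, under the prometheus service
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'    # cap on total storage
      - '--web.enable-lifecycle'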
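
Whether a 50GB cap is enough for 30 days can be estimated with the sizing rule from the Prometheus storage documentation: retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes on disk. The target and series counts below are hypothetical; substitute your own.

```python
def ingested_samples_per_second(targets, series_per_target, scrape_interval_s):
    """Samples ingested per second for evenly sized targets."""
    return targets * series_per_target / scrape_interval_s


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Disk estimate: retention seconds x samples per second x bytes per sample."""
    return retention_days * 86400 * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 50 targets, about 1500 series each, scraped every 15s.
    rate = ingested_samples_per_second(50, 1500, 15)
    size_gb = needed_disk_bytes(30, rate) / 1e9
    print(f"{rate:.0f} samples/s, about {size_gb:.1f} GB for 30 days")
```

If the estimate lands above the size cap, Prometheus deletes the oldest blocks early, so the effective retention ends up shorter than the configured 30 days.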

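2. Three Principles for Reducing Alert Noise

What I have learned about alert noise from running this in operations:

**Grouping (group_by)**: bundle alerts of the same kind into one group to avoid a barrage of messages

**Inhibition (inhibit_rules)**: when a server is down, there is no need to also send CPU or memory alerts for it

**Silencing (silence)**: silence alerts proactively during maintenance; muting them up front is easier than cleaning up afterwards

# Example inhibit rule: suppress other alerts while the server is down
inhibit_rules:
  - source_match:
      alertname: InstanceDown
    target_match_re:
      severity: (warning|info)
    equal: ['instance']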
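
How such a rule decides what to mute can be sketched in a few lines. This is a simplified model of the matching logic, not Alertmanager's implementation: it assumes regexes are fully anchored and that a missing label compares equal to an empty one, and it represents alerts as plain label dictionaries.

```python
import re


def is_inhibited(target, firing_alerts, source_match, target_match_re, equal):
    """Return True if target (a dict of labels) is muted by a firing source alert."""
    # The target must match every regex in target_match_re.
    for label, pattern in target_match_re.items():
        if not re.fullmatch(pattern, target.get(label, "")):
            return False
    for source in firing_alerts:
        if source is target:
            continue  # an alert does not inhibit itself
        # The source must carry every label/value pair in source_match.
        if any(source.get(k) != v for k, v in source_match.items()):
            continue
        # Source and target must agree on every label listed in equal.
        if all(source.get(l, "") == target.get(l, "") for l in equal):
            return True
    return False


if __name__ == "__main__":
    rule = {
        "source_match": {"alertname": "InstanceDown"},
        "target_match_re": {"severity": "(warning|info)"},
        "equal": ["instance"],
    }
    down = {"alertname": "InstanceDown", "severity": "critical", "instance": "node_exporter:9100"}
    cpu = {"alertname": "HighCPUUsage", "severity": "warning", "instance": "node_exporter:9100"}
    mem = {"alertname": "HighMemoryUsage", "severity": "critical", "instance": "node_exporter:9100"}
    print("cpu muted:", is_inhibited(cpu, [down], **rule))
    print("mem muted:", is_inhibited(mem, [down], **rule))
```

Note that the rule only targets `warning` and `info`, so a `critical` alert such as HighMemoryUsage on the same instance is still delivered.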

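3. Performance

Once there are more than about 500 targets, a single Prometheus instance may struggle. The real limit is the number of active series and the scrape interval, so treat the target counts below as a rough guide:

| Number of targets | Recommended approach |
| --- | --- |
| < 500 | A single instance is enough |
| 500-2000 | Thanos or VictoriaMetrics |
| > 2000 | Mimir or Cortex (distributed) |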

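Summary

Once the whole monitoring stack is in place, you get:

✅ **Visualization**: Grafana dashboards show the state of every server in real time

✅ **Automatic alerting**: CPU, memory, and disk anomalies are pushed to WeCom automatically

✅ **History**: 30 days of trends are available, so problems can be traced back

✅ **Extensibility**: adding a new server only requires setting up one exporter

Going from zero to a finished setup takes roughly 2 hours of hands-on time (assuming a good network connection). Once it is running, watch the alert data for a week and adjust thresholds to match reality, to avoid alert fatigue.

A monitoring system is not finished once it is deployed; continuous threshold tuning is what makes it genuinely useful.

About the Author

Long focused on putting large language models into production and on hands-on cloud server work, with an emphasis on applying technology in enterprise settings.

Personal blog: yunduancloud.icu, regularly updated with hands-on tutorials on cloud computing and large AI models. Visitors are welcome.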
