Alertmanager Tutorial

1. Core Concepts

Alertmanager is the component of the Prometheus ecosystem that handles and manages alerts sent by Prometheus servers. It provides alert routing, grouping, inhibition, and silencing, ensuring that alerts reach the right people promptly and accurately.

1.1 Basic Concepts

  • Alert: a notification generated by Prometheus from an alerting rule, carrying the alert name, labels, timestamps, and other metadata.
  • Routing: directing alerts to different receivers based on their labels.
  • Grouping: bundling similar alerts together to avoid alert storms.
  • Inhibition: suppressing related, lower-priority alerts while a more important alert is firing.
  • Silencing: temporarily muting notifications for specific alerts.
  • Receiver: the destination that receives alerts, such as email, Slack, or PagerDuty.

1.2 How It Works

  1. Receive alerts: accept alerts from Prometheus or other monitoring systems (a curl example follows this list).
  2. Process alerts: group, route, and inhibit alerts as configured.
  3. Send notifications: deliver the processed alerts to the relevant people through the configured receivers.
  4. Manage silences: temporarily mute notifications for specific alerts according to the configured silences.
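
Once Alertmanager is running (installation is covered in the next section), you can watch this flow end to end by pushing a test alert into it yourself. A minimal sketch, assuming it listens on localhost:9093; the alertname and severity labels are only examples:

# Manually push a test alert into Alertmanager's v2 API
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Manually injected test alert"}
      }]'

The alert should appear on the Alerts page within a few seconds and, once a matching route exists, be delivered to the configured receiver.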

2. Installation and Configuration

2.1 Binary Installation

  1. Download the binary

Download the binary for your platform from the GitHub Releases page:

# Download the release (v0.25.0 shown here)
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz

# Extract the archive
tar xvfz alertmanager-0.25.0.linux-amd64.tar.gz

# Change into the extracted directory
cd alertmanager-0.25.0.linux-amd64

# Run Alertmanager (it loads ./alertmanager.yml by default)
./alertmanager

2.2 Docker Installation

docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager:v0.25.0
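
Assuming the container started successfully, Alertmanager's built-in health and readiness endpoints can be used to verify that it is serving requests:

# Liveness and readiness checks built into Alertmanager
curl http://localhost:9093/-/healthy
curl http://localhost:9093/-/ready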

2.3 Configuration File

Create an alertmanager.yml configuration file:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty'
  - match:
      service: frontend
    receiver: 'slack'

receivers:
- name: 'email'
  email_configs:
  - to: 'team@example.com'

- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    send_resolved: true

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
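
Before loading this file, it is worth validating it offline. The release tarball ships with amtool, Alertmanager's command-line companion, which can check the configuration:

# Validate alertmanager.yml without starting Alertmanager
./amtool check-config alertmanager.yml

If the file parses cleanly, amtool prints a summary of the receivers, templates, and inhibit rules it found; otherwise it reports the parsing error.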

2.4 Prometheus Configuration

Add the Alertmanager address to the Prometheus configuration file:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "alerts/*.yml"

3. Basic Usage

3.1 Accessing the Web UI

Once Alertmanager is running, its web UI is available at http://localhost:9093.

3.2 Viewing Alerts

在 Web 界面的 "Alerts" 标签页中,可以查看当前的告警状态:

  • Active: alerts that are currently firing.
  • Suppressed: alerts suppressed by an inhibition rule.
  • Silenced: alerts muted by a silence.
  • Inactive: alerts that have been resolved.
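
The same list is available from the command line via amtool; the URL below assumes a local instance:

# Show all current alerts
amtool alert query --alertmanager.url=http://localhost:9093

# Only alerts matching a label, e.g. critical ones
amtool alert query severity=critical --alertmanager.url=http://localhost:9093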

3.3 Creating Silences

Silences can be created from the web UI to temporarily mute notifications for specific alerts (an amtool equivalent follows the steps):

  1. Open the "Silences" page in the web UI and click "New Silence".
  2. Enter matchers, e.g. alertname=HighCPUUsage.
  3. Set the start and end times of the silence.
  4. Add a comment explaining why the silence was created.
  5. Click "Create" to create the silence.

3.4 Checking Notification Delivery

在 Web 界面的 "Status" 标签页中,可以查看通知的发送状态和历史记录。

4. Advanced Features

4.1 Alert Routing

Alertmanager supports complex routing configurations that direct alerts to different receivers based on their labels:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'critical'
  - match_re:
      service: '^(frontend|backend)$'
    receiver: 'service-team'
    routes:
    - match:
        environment: production
      receiver: 'oncall'
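
When the routing tree gets this deep, it is easy to lose track of which receiver a given alert will end up at. amtool can answer that without sending anything; with the route above, the two commands below should resolve to 'critical' and 'oncall' respectively:

# Dry-run routing decisions against the configuration file
amtool config routes test --config.file=alertmanager.yml severity=critical
amtool config routes test --config.file=alertmanager.yml service=frontend environment=production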

4.2 Alert Grouping

Grouping bundles similar alerts together to avoid alert storms:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s  # wait 30s after the first alert of a group to collect more alerts before the first notification
  group_interval: 5m  # minimum time between notifications for an already-notified group
  repeat_interval: 1h  # how long to wait before re-sending a notification that is still firing

4.3 Alert Inhibition

Inhibition prevents a flood of related, lower-priority alerts from being sent while a more important alert is firing:

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']  # labels that must be equal on source and target alerts

4.4 Templates

Alertmanager supports Go templates for customizing the content of alert notifications:

templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
- name: 'email'
  email_configs:
  - to: 'team@example.com'
    html: '{{ template "email.default.html" . }}'
    text: '{{ template "email.default.text" . }}'

Example template file:

template "email.default.html" {
  <html>
  <body>
    <h1>{{ .CommonLabels.alertname }}</h1>
    <p>Cluster: {{ .CommonLabels.cluster }}</p>
    <p>Service: {{ .CommonLabels.service }}</p>
    <p>Status: {{ .Status }}</p>
    {{ range .Alerts }}
    <h2>Alert: {{ .Labels.alertname }}</h2>
    <p>{{ .Annotations.description }}</p>
    <p>Value: {{ .Annotations.value }}</p>
    {{ end }}
  </body>
  </html>
}
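
After adding or changing template files, Alertmanager needs to re-read its configuration. It reloads on SIGHUP or on an HTTP POST to its reload endpoint, so a full restart is not required:

# Trigger a configuration (and template) reload
curl -X POST http://localhost:9093/-/reload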

5. Best Practices

5.1 Configuration Best Practices

  • Design routes deliberately: structure routing rules around your business logic and team ownership.
  • Use grouping: group similar alerts together to avoid alert storms.
  • Configure inhibition: set sensible inhibition rules so that a critical alert does not trigger a flood of secondary notifications.
  • Use templates: customize notification content so that it is clear and actionable.
  • Configure receivers: route alerts of different severities and types to different receivers.

5.2 Deployment Best Practices

  • High availability: in production, run multiple Alertmanager instances in cluster mode and point every Prometheus at all of them rather than load balancing in front of them (see the sketch after this list).
  • Data persistence: persist Alertmanager's data directory so that silences and notification state survive restarts.
  • Configuration management: keep Alertmanager configuration files in version control.
  • Monitor Alertmanager itself: watch Alertmanager's own health to make sure it keeps running.
  • Security: configure TLS and authentication to protect the web UI and API.
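
A minimal high-availability setup runs two instances that gossip silences and notification state to each other over the cluster port (9094 by default); the hostnames below are placeholders:

# Instance 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-2.example.com:9094

# Instance 2
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.example.com:9094

Each Prometheus server should then list both instances under alerting.alertmanagers so that alerts reach every peer; the cluster deduplicates the resulting notifications.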

5.3 Alerting Best Practices

  • Severity levels: classify alerts by severity (e.g. critical, warning, info).
  • Thresholds: choose sensible thresholds to avoid false positives.
  • Descriptions: give every alert a clear, detailed description explaining the cause and possible remediation.
  • Testing: regularly test alert firing and notification delivery to make sure the pipeline works (a sketch with amtool follows this list).
  • Review: periodically review alert history and refine rules and configuration.
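
One lightweight way to test the pipeline is to fire a synthetic alert with amtool and confirm it arrives at the expected receiver; the labels below are examples and should be chosen so that they match one of your routes:

# Inject a test alert into Alertmanager
amtool alert add alertname=NotificationPipelineTest severity=warning team=backend \
  --annotation='summary=Test alert, please ignore' \
  --alertmanager.url=http://localhost:9093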

6. Practical Use Cases

6.1 Enterprise Alert Management

Scenario: manage alerts for many internal services and make sure each alert reaches the responsible team promptly and accurately.

Configuration

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      severity: critical
    receiver: 'oncall'
  - match:
      team: frontend
    receiver: 'frontend-team'
  - match:
      team: backend
    receiver: 'backend-team'
  - match:
      team: database
    receiver: 'database-team'

receivers:
- name: 'default'
  email_configs:
  - to: 'devops@company.com'

- name: 'oncall'
  email_configs:
  - to: 'oncall@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#oncall-alerts'
    send_resolved: true
  pagerduty_configs:
  - service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

- name: 'frontend-team'
  email_configs:
  - to: 'frontend-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#frontend-alerts'
    send_resolved: true

- name: 'backend-team'
  email_configs:
  - to: 'backend-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#backend-alerts'
    send_resolved: true

- name: 'database-team'
  email_configs:
  - to: 'database-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#database-alerts'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

Alert rules

groups:
- name: node_alerts
  rules:
  - alert: HighCPUUsage
    expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage on {{ $labels.instance }} is {{ $value }}% for more than 5 minutes."

  - alert: CriticalCPUUsage
    expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
    for: 2m
    labels:
      severity: critical
      team: backend
    annotations:
      summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage on {{ $labels.instance }} is {{ $value }}% for more than 2 minutes."

  - alert: HighMemoryUsage
    expr: (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)) > 80
    for: 5m
    labels:
      severity: warning
      team: backend
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage on {{ $labels.instance }} is {{ $value }}% for more than 5 minutes."

  - alert: CriticalMemoryUsage
    expr: (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)) > 90
    for: 2m
    labels:
      severity: critical
      team: backend
    annotations:
      summary: "Critical memory usage on {{ $labels.instance }}"
description: "Memory usage on {{ $labels.instance }} is {{ $value }}% for more than 2 minutes."

6.2 Cloud Service Monitoring Alerts

Scenario: monitor the health and performance of cloud services and send timely notifications when a service has problems.

Configuration

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'service', 'region']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      cloud: aws
    receiver: 'aws-team'
  - match:
      cloud: azure
    receiver: 'azure-team'
  - match:
      cloud: gcp
    receiver: 'gcp-team'

receivers:
- name: 'default'
  email_configs:
  - to: 'cloud-team@company.com'

- name: 'aws-team'
  email_configs:
  - to: 'aws-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#aws-alerts'
    send_resolved: true

- name: 'azure-team'
  email_configs:
  - to: 'azure-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#azure-alerts'
    send_resolved: true

- name: 'gcp-team'
  email_configs:
  - to: 'gcp-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#gcp-alerts'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service', 'region']

Alert rules

groups:
- name: cloud_alerts
  rules:
  - alert: AWSServiceDown
    expr: aws_service_health_status{status="down"} == 1
    for: 1m
    labels:
      severity: critical
      cloud: aws
    annotations:
      summary: "AWS service {{ $labels.service }} is down in {{ $labels.region }}"
description: "AWS service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."

  - alert: AzureServiceDown
    expr: azure_service_health_status{status="down"} == 1
    for: 1m
    labels:
      severity: critical
      cloud: azure
    annotations:
      summary: "Azure service {{ $labels.service }} is down in {{ $labels.region }}"
description: "Azure service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."

  - alert: GCPServiceDown
    expr: gcp_service_health_status{status="down"} == 1
    for: 1m
    labels:
      severity: critical
      cloud: gcp
    annotations:
      summary: "GCP service {{ $labels.service }} is down in {{ $labels.region }}"
description: "GCP service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."

6.3 Application Performance Monitoring Alerts

Scenario: monitor application performance metrics and send timely notifications when performance degrades.

Configuration

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'app', 'environment']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
  - match:
      environment: production
    receiver: 'production-alerts'
  - match:
      environment: staging
    receiver: 'staging-alerts'

receivers:
- name: 'default'
  email_configs:
  - to: 'dev-team@company.com'

- name: 'production-alerts'
  email_configs:
  - to: 'prod-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#prod-alerts'
    send_resolved: true
  pagerduty_configs:
  - service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

- name: 'staging-alerts'
  email_configs:
  - to: 'dev-team@company.com'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#staging-alerts'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'app', 'environment']

Alert rules

groups:
- name: app_alerts
  rules:
  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app, environment)) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High request latency for {{ $labels.app }} in {{ $labels.environment }}"
description: "95th percentile request latency for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} seconds for more than 5 minutes."

  - alert: CriticalRequestLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app, environment)) > 2
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical request latency for {{ $labels.app }} in {{ $labels.environment }}"
description: "95th percentile request latency for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} seconds for more than 2 minutes."

  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (app, environment) / sum(rate(http_requests_total[5m])) by (app, environment) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate for {{ $labels.app }} in {{ $labels.environment }}"
description: "Error rate for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} for more than 5 minutes."

  - alert: CriticalErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (app, environment) / sum(rate(http_requests_total[5m])) by (app, environment) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical error rate for {{ $labels.app }} in {{ $labels.environment }}"
description: "Error rate for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} for more than 2 minutes."

7. Summary

Alertmanager is a powerful alert-management tool in the Prometheus ecosystem. With routing, grouping, inhibition, and silencing, it ensures that alerts reach the right people promptly and accurately.

Configured and used well, Alertmanager keeps alerting under control, prevents alert storms, improves operational efficiency, and helps keep systems running reliably.

7.1 Key Strengths

  • Flexible routing: direct alerts to different receivers based on their labels.
  • Powerful grouping: bundle similar alerts together to avoid alert storms.
  • Smart inhibition: suppress related, lower-priority alerts while an important alert is firing.
  • Convenient silencing: temporarily mute specific alerts, for example during maintenance windows.
  • Rich receiver support: email, Slack, PagerDuty, and many other integrations.
  • Custom templates: tailor notification content with Go templates.

7.2 Outlook

With the spread of microservice architectures and cloud-native technology, the demands on monitoring and alerting keep growing. As a dedicated alert-management tool, Alertmanager is broadly applicable in scenarios such as:

  • Enterprise monitoring: managing alerts across many internal services and systems.
  • Cloud service monitoring: tracking the health and performance of cloud services and notifying on problems.
  • Microservice monitoring: watching every service in a microservice architecture to keep it running.
  • DevOps practice: automated alert management as part of the DevOps toolchain.
  • SRE practice: supporting SRE teams in improving system reliability and availability.

By continuing to explore Alertmanager's features and best practices, you can build a more complete and reliable alerting system that underpins stable business operations.
