Alertmanager Tutorial
1. Core Concepts
Alertmanager is the component in the Prometheus ecosystem that handles and manages alerts sent by Prometheus servers. It provides alert routing, grouping, inhibition, and silencing, ensuring that alerts reach the right people promptly and accurately.
1.1 Basic Concepts
- Alert: a notification generated by Prometheus from an alerting rule, carrying an alert name, labels, timestamps, and other metadata.
- Routing: directing alerts to different receivers based on their labels.
- Grouping: bundling similar alerts together to avoid alert storms.
- Inhibition: suppressing related, lower-priority alerts while a more important alert is firing.
- Silencing: temporarily muting notifications for specific alerts.
- Receiver: the destination that receives alerts, such as email, Slack, or PagerDuty.
1.2 How It Works
- Receive alerts: accept alerts from Prometheus or other monitoring systems (see the example below for sending an alert to the API by hand).
- Process alerts: group, route, and inhibit the incoming alerts.
- Send notifications: deliver the processed alerts to the relevant people through the configured receivers.
- Manage silences: temporarily mute notifications for specific alerts according to the configured silences.
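Prometheus delivers alerts to Alertmanager over an HTTP API, and you can exercise the same path manually, which is handy for understanding the receive/process/notify pipeline. A minimal sketch, assuming Alertmanager is listening on localhost:9093; the label values are made up for illustration:
# Push a synthetic firing alert into Alertmanager via the v2 API
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
        "labels": {"alertname": "TestAlert", "severity": "warning", "service": "demo"},
        "annotations": {"summary": "Manually injected test alert"},
        "generatorURL": "http://prometheus.example.com/graph"
      }]'
If the routing tree matches the labels, the alert will be grouped and delivered to the corresponding receiver just like one generated by Prometheus.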
2. Installation and Configuration
2.1 Binary Installation
- Download the binary
Download the binary that matches your system from the GitHub Releases page:
# Download the latest release
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
# Extract the archive
tar xvfz alertmanager-0.25.0.linux-amd64.tar.gz
# Enter the directory
cd alertmanager-0.25.0.linux-amd64
# Run it
./alertmanager
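In practice you usually point the binary at an explicit configuration file and data directory. A minimal sketch using Alertmanager's standard flags (--config.file, --storage.path, --web.listen-address); the paths shown are examples, not requirements:
# Run with an explicit config file, data directory, and listen address
./alertmanager \
  --config.file=alertmanager.yml \
  --storage.path=./data \
  --web.listen-address=:9093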
2.2 Docker Installation
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager:v0.25.0
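Silences and notification-log state live in Alertmanager's data directory, so when running in Docker it is worth mounting a volume for it as well. A sketch under the assumption that the official prom/alertmanager image keeps its data in /alertmanager (its default --storage.path); verify against the image you actually use:
# Same as above, plus a named volume so silences survive container restarts
docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v "$(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
  -v alertmanager-data:/alertmanager \
  prom/alertmanager:v0.25.0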
2.3 Configuration File
Create an alertmanager.yml configuration file:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'email'
routes:
- match:
severity: critical
receiver: 'pagerduty'
- match:
service: frontend
receiver: 'slack'
receivers:
- name: 'email'
email_configs:
- to: 'team@example.com'
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#alerts'
send_resolved: true
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
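Before loading a configuration, it is worth validating it. amtool, the command-line client that ships with Alertmanager, can check a configuration file and the templates it references; a quick sketch assuming amtool is on your PATH:
# Validate the configuration file and referenced templates
amtool check-config alertmanager.yml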
2.4 Prometheus Configuration
Add the Alertmanager address to the Prometheus configuration file:
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
rule_files:
- "alerts/*.yml"3. 基本使用
3. Basic Usage
3.1 Accessing the Web UI
Once Alertmanager is running, its web UI is available at http://localhost:9093.
3.2 Viewing Alerts
The "Alerts" tab of the web UI shows the current alert state (the same data can be queried from the command line, as sketched after this list):
- Active: alerts that are currently firing.
- Suppressed: alerts suppressed by inhibition rules.
- Silenced: alerts muted by silences.
- Inactive: alerts that have been resolved.
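A sketch using amtool's alert query subcommand; --alertmanager.url tells it which instance to talk to, and the trailing matcher is optional:
# List the alerts Alertmanager currently knows about
amtool alert query --alertmanager.url=http://localhost:9093
# Filter by a label matcher
amtool alert query --alertmanager.url=http://localhost:9093 alertname=HighCPUUsage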
3.3 Creating Silences
Silences can be created through the web UI to temporarily mute notifications for specific alerts (an equivalent command-line sketch follows the list):
- In the "Silences" tab of the web UI, click the "New Silence" button.
- Enter the matchers, e.g. alertname=HighCPUUsage.
- Set the silence's start and end times.
- Add a comment explaining why the silence was created.
- Click the "Create" button to create the silence.
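Silences can also be managed with amtool, which is convenient for scripted maintenance windows. A minimal sketch; the matchers, duration, and comment are illustrative values, not part of this tutorial's configuration:
# Create a 2-hour silence for HighCPUUsage alerts from a specific instance
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment="Planned maintenance on node01" \
  --duration=2h \
  alertname=HighCPUUsage instance=node01:9100
# List and expire silences
amtool silence query --alertmanager.url=http://localhost:9093
amtool silence expire <silence-id> --alertmanager.url=http://localhost:9093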
3.4 Checking Notification Status
The "Status" tab of the web UI shows Alertmanager's runtime status and the currently loaded configuration. To confirm whether notifications were actually delivered, check Alertmanager's logs and its notification metrics (for example alertmanager_notifications_total and alertmanager_notifications_failed_total).
4. Advanced Features
4.1 Alert Routing
Alertmanager supports complex routing trees that direct alerts to different receivers based on their labels:
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
- match_re:
service: '^(frontend|backend)$'
receiver: 'service-team'
routes:
- match:
environment: production
          receiver: 'oncall'
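To verify which receiver a given label set would reach, amtool can evaluate the routing tree offline. A sketch assuming the configuration above is saved as alertmanager.yml:
# Show which receiver an alert with these labels would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical
amtool config routes test --config.file=alertmanager.yml service=frontend environment=production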
4.2 Alert Grouping
Grouping bundles similar alerts together to avoid alert storms:
route:
group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # wait 30 seconds to collect more alerts for the same group
  group_interval: 5m    # minimum interval between notifications for an existing group
  repeat_interval: 1h   # how often to re-send notifications for alerts that are still firing
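You can inspect the groups Alertmanager has actually formed through its v2 API, which helps when tuning group_by and the timers above. A sketch assuming jq is installed for pretty-printing:
# List current alert groups as Alertmanager sees them
curl -s http://localhost:9093/api/v2/alerts/groups | jq .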
4.3 Alert Inhibition
Inhibition prevents a flood of related, lower-priority alerts from being sent while a more important alert is firing:
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
    equal: ['alertname', 'cluster', 'service']  # labels that must match between source and target
4.4 Templates
Alertmanager supports Go templates for customizing the content of alert notifications:
templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: 'email'
email_configs:
- to: 'team@example.com'
html: '{{ template "email.default.html" . }}'
        text: '{{ template "email.default.text" . }}'
Example template file:
template "email.default.html" {
<html>
<body>
<h1>{{ .CommonLabels.alertname }}</h1>
<p>Cluster: {{ .CommonLabels.cluster }}</p>
<p>Service: {{ .CommonLabels.service }}</p>
<p>Status: {{ .Status }}</p>
{{ range .Alerts }}
<h2>Alert: {{ .Labels.alertname }}</h2>
<p>{{ .Annotations.description }}</p>
<p>Value: {{ .Annotations.value }}</p>
{{ end }}
</body>
</html>
}5. 最佳实践
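Alertmanager only reads template files at startup or on a configuration reload, so after editing a .tmpl file tell the running process to reload. A sketch for a Linux host; Alertmanager reloads on SIGHUP or on a POST to its /-/reload endpoint:
# Either send SIGHUP to the process...
kill -HUP $(pidof alertmanager)
# ...or hit the reload endpoint
curl -X POST http://localhost:9093/-/reload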
5. Best Practices
5.1 Configuration Best Practices
- Design sensible routing rules: build the routing tree around your business logic and team structure.
- Use alert grouping: group similar alerts to avoid alert storms.
- Configure inhibition: set sensible inhibition rules so that a critical alert does not trigger a flood of secondary notifications.
- Use templates: customize notification content so that it is clear and actionable.
- Configure receivers: route to different receivers according to alert severity and type.
5.2 Deployment Best Practices
- High availability: for production, run multiple Alertmanager instances in cluster mode and point Prometheus at all of them (a sketch of the relevant flags follows this list).
- Data persistence: persist the data directory so that silences and notification state survive restarts.
- Configuration management: keep Alertmanager configuration files under version control.
- Monitor Alertmanager itself: watch Alertmanager's own health to make sure it is running correctly.
- Security: configure TLS and authentication to protect Alertmanager's web UI and API.
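Alertmanager's built-in clustering gossips silences and notification state between instances. A minimal two-node sketch using the --cluster.* flags; the hostnames and ports are examples:
# Node 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --web.listen-address=:9093

# Node 2, joining node 1
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.example.com:9094 \
  --web.listen-address=:9093
Prometheus should then list every instance under alerting.alertmanagers so each one receives all alerts; the cluster deduplicates notifications.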
5.3 Alerting Best Practices
- Severity levels: classify alerts by severity (e.g. critical, warning, info).
- Thresholds: choose alert thresholds carefully to avoid false positives.
- Descriptions: give each alert a clear, detailed description of the cause and likely remediation.
- Testing: regularly test that alerts fire and notify correctly (see the rule-test sketch below).
- Review: periodically review alert history and refine rules and configuration.
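Alerting rules can be unit-tested before they reach production. A sketch using promtool's rule-testing support; alerts_test.yml is a hypothetical test file written in promtool's test format, not something defined in this tutorial:
# Run unit tests for the alerting rules (alerts_test.yml is a hypothetical test file)
promtool test rules alerts_test.yml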
6. Practical Use Cases
6.1 Enterprise Alert Management
Scenario: manage alerts from multiple internal services and make sure they reach the right teams promptly and accurately.
Configuration:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alertmanager@company.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'oncall'
- match:
team: frontend
receiver: 'frontend-team'
- match:
team: backend
receiver: 'backend-team'
- match:
team: database
receiver: 'database-team'
receivers:
- name: 'default'
email_configs:
- to: 'devops@company.com'
- name: 'oncall'
email_configs:
- to: 'oncall@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#oncall-alerts'
send_resolved: true
pagerduty_configs:
- service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
- name: 'frontend-team'
email_configs:
- to: 'frontend-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#frontend-alerts'
send_resolved: true
- name: 'backend-team'
email_configs:
- to: 'backend-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#backend-alerts'
send_resolved: true
- name: 'database-team'
email_configs:
- to: 'database-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#database-alerts'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Alerting rules:
groups:
- name: node_alerts
rules:
- alert: HighCPUUsage
expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage on {{ $labels.instance }} is {{ $value }}% for more than 5 minutes."
- alert: CriticalCPUUsage
expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage on {{ $labels.instance }} is {{ $value }}% for more than 2 minutes."
- alert: HighMemoryUsage
expr: (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)) > 80
for: 5m
labels:
severity: warning
team: backend
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage on {{ $labels.instance }} is {{ $value }}% for more than 5 minutes."
- alert: CriticalMemoryUsage
expr: (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)) > 90
for: 2m
labels:
severity: critical
team: backend
annotations:
summary: "Critical memory usage on {{ $labels.instance }}"
description: "Memory usage on {{ $labels.instance }} is {{ $value }}% for more than 2 minutes."6.2 云服务监控告警
Scenario: monitor the health and performance of cloud services and notify promptly when a cloud service has problems.
Configuration:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alertmanager@company.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'service', 'region']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default'
routes:
- match:
cloud: aws
receiver: 'aws-team'
- match:
cloud: azure
receiver: 'azure-team'
- match:
cloud: gcp
receiver: 'gcp-team'
receivers:
- name: 'default'
email_configs:
- to: 'cloud-team@company.com'
- name: 'aws-team'
email_configs:
- to: 'aws-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#aws-alerts'
send_resolved: true
- name: 'azure-team'
email_configs:
- to: 'azure-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#azure-alerts'
send_resolved: true
- name: 'gcp-team'
email_configs:
- to: 'gcp-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#gcp-alerts'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
    equal: ['alertname', 'service', 'region']
Alerting rules:
groups:
- name: cloud_alerts
rules:
- alert: AWSServiceDown
expr: aws_service_health_status{status="down"} == 1
for: 1m
labels:
severity: critical
cloud: aws
annotations:
summary: "AWS service {{ $labels.service }} is down in {{ $labels.region }}"
description: "AWS service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."
- alert: AzureServiceDown
expr: azure_service_health_status{status="down"} == 1
for: 1m
labels:
severity: critical
cloud: azure
annotations:
summary: "Azure service {{ $labels.service }} is down in {{ $labels.region }}"
description: "Azure service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."
- alert: GCPServiceDown
expr: gcp_service_health_status{status="down"} == 1
for: 1m
labels:
severity: critical
cloud: gcp
annotations:
summary: "GCP service {{ $labels.service }} is down in {{ $labels.region }}"
description: "GCP service {{ $labels.service }} has been down in {{ $labels.region }} for more than 1 minute."6.3 应用性能监控告警
Scenario: monitor application performance metrics and notify promptly when performance degrades.
Configuration:
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.company.com:587'
smtp_from: 'alertmanager@company.com'
smtp_auth_username: 'alertmanager'
smtp_auth_password: 'password'
route:
group_by: ['alertname', 'app', 'environment']
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
receiver: 'default'
routes:
- match:
environment: production
receiver: 'production-alerts'
- match:
environment: staging
receiver: 'staging-alerts'
receivers:
- name: 'default'
email_configs:
- to: 'dev-team@company.com'
- name: 'production-alerts'
email_configs:
- to: 'prod-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#prod-alerts'
send_resolved: true
pagerduty_configs:
- service_key: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
- name: 'staging-alerts'
email_configs:
- to: 'dev-team@company.com'
slack_configs:
- api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
channel: '#staging-alerts'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
    equal: ['alertname', 'app', 'environment']
Alerting rules:
groups:
- name: app_alerts
rules:
- alert: HighRequestLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app, environment)) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High request latency for {{ $labels.app }} in {{ $labels.environment }}"
description: "95th percentile request latency for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} seconds for more than 5 minutes."
- alert: CriticalRequestLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, app, environment)) > 2
for: 2m
labels:
severity: critical
annotations:
summary: "Critical request latency for {{ $labels.app }} in {{ $labels.environment }}"
description: "95th percentile request latency for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} seconds for more than 2 minutes."
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (app, environment) / sum(rate(http_requests_total[5m])) by (app, environment) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate for {{ $labels.app }} in {{ $labels.environment }}"
description: "Error rate for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} for more than 5 minutes."
- alert: CriticalErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (app, environment) / sum(rate(http_requests_total[5m])) by (app, environment) > 0.1
for: 2m
labels:
severity: critical
annotations:
summary: "Critical error rate for {{ $labels.app }} in {{ $labels.environment }}"
description: "Error rate for {{ $labels.app }} in {{ $labels.environment }} is {{ $value }} for more than 2 minutes."6. 总结
Alertmanager 是 Prometheus 生态系统中一个强大的告警管理工具,通过提供告警路由、分组、抑制和静默等功能,确保告警能够及时、准确地传递给相关人员。
通过合理配置和使用 Alertmanager,可以实现对告警的有效管理,避免告警风暴,提高运维效率,确保系统的稳定运行。
7.1 Key Strengths
- Flexible routing: direct alerts to different receivers based on their labels.
- Powerful grouping: bundle similar alerts to avoid alert storms.
- Smart inhibition: suppress related secondary alerts while a critical alert is firing.
- Convenient silences: temporarily mute specific alerts, for example during maintenance windows.
- Rich receiver support: email, Slack, PagerDuty, and many other integrations.
- Custom templates: tailor notification content with Go templates.
7.2 Outlook
With the spread of microservice architectures and cloud-native technology, the demands on monitoring and alerting keep growing. As a dedicated alert management tool, Alertmanager fits naturally into scenarios such as:
- Enterprise monitoring: managing alerts across many internal services and systems.
- Cloud service monitoring: tracking the health and performance of cloud services and notifying promptly on problems.
- Microservice monitoring: watching each service in a microservice architecture to keep it running correctly.
- DevOps practice: automating alert management as part of the DevOps toolchain.
- SRE practice: supporting SRE teams in improving system reliability and availability.
Continuing to explore Alertmanager's features and best practices will help you build a more complete and reliable alerting system and provide a solid foundation for stable operations.