Prometheus Tutorial

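1. Project Overview

Prometheus is an open-source monitoring system and time series database designed to collect, store, and analyze metrics. It was originally developed at SoundCloud and is now a graduated project of the Cloud Native Computing Foundation (CNCF).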

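Key Features

Multi-dimensional data model: time series identified by a metric name and key-value label pairs
Flexible query language (PromQL): supports complex analysis and aggregation
Efficient storage: a local time series database; long-term retention is typically handled through remote storage integrations
HTTP-based pull model: metrics are scraped from target services
Automatic target discovery: supports multiple service discovery mechanisms
Integrated alerting: alerting rules based on PromQL expressions
Rich visualization integrations: works well with Grafana and similar tools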

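Technical Characteristics

Written in Go, with good performance
Simple to deploy as a single self-contained server; scales out through federation, sharding, or remote storage rather than built-in clustering
Active open-source community and a rich ecosystem
Deep integration with Kubernetes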

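Suitable Environments

Cloud-native environment monitoring
Microservice architecture monitoring
Container environment monitoring
Traditional server monitoring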

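2. Installation and Configuration

2.1 Binary Installation

Download the binary archive for your platform from the Prometheus website
Extract the archive
Run Prometheus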

# Download a release (v2.43.0 shown here; check the releases page for the current version)
wget https://github.com/prometheus/prometheus/releases/download/v2.43.0/prometheus-2.43.0.linux-amd64.tar.gz

# Extract
tar xvfz prometheus-*.tar.gz
cd prometheus-*

# Run Prometheus
./prometheus --config.file=prometheus.yml

2.2 Docker Installation

docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

2.3 Kubernetes Installation

Install with Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus

2.4 Basic Configuration

Prometheus is configured through a YAML file. The main sections are:

# prometheus.yml
global:
  scrape_interval: 15s  # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

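3. Basic Usage

3.1 Starting Prometheus

# Start with the default configuration (prometheus.yml in the current directory)
./prometheus

# Start with a custom configuration
./prometheus --config.file=my-config.yml

3.2 Accessing the Web UI

Once Prometheus is running, the Web UI is available at http://localhost:9090.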

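3.3 Basic Queries

In the query editor of the Web UI, you can run PromQL queries:

# Total HTTP requests served by Prometheus itself
prometheus_http_requests_total

# Sum of requests grouped by handler path
sum by (handler) (prometheus_http_requests_total)

# Per-second request rate over the last 5 minutes
rate(prometheus_http_requests_total[5m])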
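
The same expressions can also be evaluated outside the Web UI through the instant-query endpoint of the HTTP API, /api/v1/query. The following is a minimal sketch that only builds the request URL; it assumes the local server address from section 3.2, and the helper name buildQueryUrl is hypothetical.

```javascript
// Build the URL for an instant query against the Prometheus HTTP API.
// buildQueryUrl is a hypothetical helper; baseUrl assumes a local server.
function buildQueryUrl(baseUrl, promql) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql); // URL-encodes the PromQL expression
  return url.toString();
}

const queryUrl = buildQueryUrl(
  'http://localhost:9090',
  'rate(prometheus_http_requests_total[5m])'
);
console.log(queryUrl);
```

The resulting URL can be fetched with any HTTP client; the response is JSON with the query result under the data field.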

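3.4 Metric Types

Prometheus supports four metric types:

Counter: a monotonically increasing value that resets only on restart, e.g. request or error counts
Gauge: a value that can go up and down, e.g. temperature or memory usage
Histogram: counts observations in configurable buckets to capture a distribution, e.g. request latency; quantiles are estimated at query time
Summary: similar to a histogram, but computes quantiles on the client side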
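
To make the Histogram type more concrete, here is a minimal sketch of how observations accumulate into cumulative buckets. It is plain JavaScript for illustration, not the prom-client API; the class name SketchHistogram and the bucket bounds are assumptions.

```javascript
// Illustrative only: a tiny in-memory histogram with cumulative buckets,
// mirroring the _bucket, _sum, and _count series a Prometheus Histogram exposes.
class SketchHistogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = new Array(this.upperBounds.length).fill(0);
    this.count = 0; // corresponds to <name>_count (and the le="+Inf" bucket)
    this.sum = 0;   // corresponds to <name>_sum
  }

  observe(value) {
    this.count += 1;
    this.sum += value;
    // Buckets are cumulative: an observation increments every bucket
    // whose upper bound (le) is greater than or equal to the value.
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.bucketCounts[i] += 1;
    });
  }
}

const latency = new SketchHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((v) => latency.observe(v));
console.log(latency.bucketCounts, latency.count, latency.sum);
```

Because each bucket counts everything at or below its bound, quantiles can be estimated later from the bucket counts alone, which is what makes histograms aggregatable across instances.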

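4. Advanced Features

4.1 Service Discovery

Prometheus supports multiple service discovery mechanisms:

Static configuration: targets are listed by hand
DNS service discovery: targets are discovered from DNS records
File-based service discovery: targets are read from JSON or YAML files
Consul service discovery: targets are discovered from a Consul cluster
Kubernetes service discovery: targets are discovered from a Kubernetes cluster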
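
File-based service discovery is the easiest mechanism to drive from your own tooling. The sketch below writes a target file in the JSON shape file-based discovery reads (a list of target groups, each with targets and optional labels). The inventory, host names, label values, and output path are all hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical inventory; in practice this might come from a CMDB or deploy tool.
const inventory = [
  { host: 'web-1.example.internal', port: 9100, env: 'prod' },
  { host: 'web-2.example.internal', port: 9100, env: 'staging' },
];

// File-based discovery expects a list of target groups:
// [{ "targets": ["host:port", ...], "labels": { ... } }, ...]
function toTargetGroups(hosts) {
  return hosts.map(({ host, port, env }) => ({
    targets: [`${host}:${port}`],
    labels: { env },
  }));
}

const targetGroups = toTargetGroups(inventory);
const outFile = path.join(os.tmpdir(), 'node-targets.json');
fs.writeFileSync(outFile, JSON.stringify(targetGroups, null, 2));
console.log(`wrote ${targetGroups.length} target groups to ${outFile}`);
```

A scrape job would then reference the file through file_sd_configs with the path listed under files; Prometheus picks up changes to the file without a restart.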

4.2 Alerting Rules

Configure an alerting rule:

# alerting_rules.yml
groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }} on {{ $labels.instance }}"
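
The arithmetic behind the rule's threshold can be sketched as follows. The counter values are hypothetical, and the rate calculation is simplified: the real rate() function also handles counter resets and extrapolation at the window edges.

```javascript
// Hypothetical counter readings taken 5 minutes (300 s) apart.
const windowSeconds = 300;
const earlier = { total: 12000, errors: 240 };
const later = { total: 15000, errors: 450 };

// Simplified rate(): per-second increase of a counter over the window.
const perSecondRate = (start, end) => (end - start) / windowSeconds;

const errorRate = perSecondRate(earlier.errors, later.errors);
const totalRate = perSecondRate(earlier.total, later.total);
const errorRatio = errorRate / totalRate;

// The rule compares the ratio of error rate to total rate against 0.05.
const threshold = 0.05;
console.log({ errorRatio, breaches: errorRatio > threshold });
```

With `for: 5m`, a breach like this one only turns into a firing alert if the expression keeps returning a result for five consecutive minutes; until then the alert is pending.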

4.3 Remote Storage

Prometheus can write samples to a remote storage system and read them back:

# prometheus.yml
remote_write:
  - url: "http://remote-storage:9090/api/v1/write"

remote_read:
  - url: "http://remote-storage:9090/api/v1/read"

4.4 Recording Rules

Use recording rules to precompute expensive queries:

# recording_rules.yml
groups:
- name: example
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))

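5. Best Practices

5.1 Metric Design

Use meaningful metric names that follow the {namespace}_{subsystem}_{metric} pattern, with a unit suffix where applicable (e.g. _seconds, _total)
Add meaningful labels so that metrics are easy to query and aggregate
Avoid high-cardinality labels (user IDs, session IDs, and the like)
Choose the appropriate metric type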
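
The reason high-cardinality labels are a problem is that every distinct combination of label values becomes its own time series. A rough worst-case estimate, using hypothetical label cardinalities and a hypothetical helper name maxSeries:

```javascript
// Worst-case series count for one metric: the product of the number of
// distinct values each label can take.
function maxSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Hypothetical cardinalities for an HTTP request counter.
const bounded = maxSeries({ method: 5, path: 20, status: 10 });
const withUserId = maxSeries({ method: 5, path: 20, status: 10, user_id: 100000 });

console.log(bounded);    // 1000
console.log(withUserId); // 100000000
```

Real workloads rarely hit every combination, but the estimate shows how a single unbounded label multiplies the series Prometheus has to store and index.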

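5.2 Configuration Tuning

Adjust the scrape interval to match actual needs
Prefer service discovery over static configuration
Use separate scrape configurations for different kinds of targets
Use recording rules to precompute expensive queries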

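5.3 Storage Tuning

Set a sensible storage retention period
Consider a remote storage system for long-term retention
Monitor the storage usage of Prometheus itself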

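5.4 Alerting Best Practices

Alert severity should reflect how serious the problem is
Alert messages should carry enough context to act on
Avoid alert storms by using alert grouping and inhibition
Review and refine alerting rules regularly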

6. Practical Use Cases

6.1 Monitoring Server Resources

Use Node Exporter to monitor server resources:

# Add to prometheus.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

Query the server's CPU usage as a percentage:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
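
The query reads more easily once the arithmetic is spelled out. The sketch below reproduces it for a single core with two hypothetical samples; the real query additionally averages the idle fraction across all cores of an instance.

```javascript
// Two hypothetical samples of node_cpu_seconds_total{mode="idle"} for one core.
const samples = [
  { t: 0, idleSeconds: 1000.0 },
  { t: 15, idleSeconds: 1012.0 },
];

// irate() uses the last two samples in the range: the per-second increase
// of the idle counter, i.e. the fraction of time the core was idle.
const [prev, last] = samples.slice(-2);
const idleFraction = (last.idleSeconds - prev.idleSeconds) / (last.t - prev.t);

// Usage is whatever was not idle, expressed as a percentage.
const cpuUsagePercent = 100 - idleFraction * 100;
console.log(cpuUsagePercent.toFixed(1));
```

Here the core was idle for 12 of the 15 seconds between samples, so usage comes out at about 20%.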

6.2 Monitoring Application Services

Add a Prometheus client library to the application and expose its metrics:

Node.js example:

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create the metric
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});

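// Middleware: count each request once its response has finished
app.use((req, res, next) => {
  res.on('finish', () => {
    // Label by route pattern rather than raw URL path to keep label cardinality bounded
    const path = req.route ? req.route.path : 'unmatched';
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode });
  });
  next();
});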

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});

6.3 Monitoring a Kubernetes Cluster

Use the Prometheus Operator to monitor a Kubernetes cluster:

# Install the Prometheus Operator (via the kube-prometheus-stack chart)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-operator prometheus-community/kube-prometheus-stack

6.4 Integrating with Grafana

Install Grafana
Add Prometheus as a data source
Create a dashboard

A basic dashboard panel configuration:

{
  "datasource": "Prometheus",
  "id": 1,
  "title": "HTTP request rate",
  "type": "graph",
  "targets": [
    {
      "expr": "rate(http_requests_total[5m])",
      "legendFormat": "{{method}} {{path}}",
      "refId": "A"
    }
  ]
}

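7. Summary

Prometheus is a powerful, flexible, and extensible monitoring system that is particularly well suited to cloud-native environments and microservice architectures. This tutorial has covered its core concepts, installation and configuration, basic usage, and advanced features.

Key Advantages

Multi-dimensional data model that supports complex queries
Efficient storage and query performance
Flexible service discovery mechanisms
Powerful alerting
Rich ecosystem and integrations

Use Cases

System and service monitoring
Performance analysis and troubleshooting
Business metrics monitoring
Capacity planning and forecasting

Combined with tools such as Grafana, Prometheus provides a complete monitoring and visualization solution that helps you understand and manage your systems.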