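Chapter 10: Troubleshooting and Optimization
10.1 Troubleshooting Common Problems
10.1.1 Checking Configuration Syntax
1. Use the nginx -t command
sudo nginx -t
Expected output:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Example of a failed check:
nginx: [emerg] unexpected "}" in /etc/nginx/conf.d/example.conf:10
nginx: configuration file /etc/nginx/nginx.conf test failed
2. Check a specific configuration file
# The file passed to -c must be a complete top-level configuration;
# a fragment from conf.d cannot be tested on its own
sudo nginx -t -c /etc/nginx/nginx.conf
3. Find which files the configuration includes
grep -rn "include" /etc/nginx/
# Or dump the full configuration with every include resolved
sudo nginx -T
10.1.2 Diagnosing Connection Problems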
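1. Check listening ports
# Check whether Nginx is listening on ports 80 and 443
# (-p prints the owning process name, which is what grep matches on)
sudo netstat -tulnp | grep nginx
# Or use ss
sudo ss -tulnp | grep nginx
# Or use lsof
sudo lsof -i :80
sudo lsof -i :443
2. Check firewall settings
# Ubuntu/Debian
sudo ufw status
# CentOS/RHEL
sudo firewall-cmd --list-all
3. Test port reachability
# Local test
curl -I http://localhost
# Remote test
curl -I http://example.com
# Test a specific port
curl -I http://example.com:8080
4. Check SELinux settings (CentOS/RHEL)
getenforce
# If SELinux is enabled, check the security context of the nginx binary
ls -Z /usr/sbin/nginx
# Watch the SELinux audit log for entries involving nginx
sudo tail -f /var/log/audit/audit.log | grep nginx
10.1.3 Locating Performance Problems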
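1. Check Nginx status
# Query the stub_status endpoint (requires a location that enables the
# stub_status directive; /nginx_status is the path assumed here)
curl http://localhost/nginx_status
# Or monitor with nginx-amplify
# Or monitor with Prometheus and Grafana
2. Check system resource usage
# CPU and memory usage
top
# Disk I/O
iostat -x
# Capture HTTP and HTTPS traffic (replace eth0 with your interface name)
sudo tcpdump -i eth0 'port 80 or port 443'
3. Analyze log files
# Follow the error log
sudo tail -f /var/log/nginx/error.log
# Show the 10 slowest requests in the access log, slowest first
# (requires a log_format that writes rt=$request_time, as in 10.3.3)
sudo awk 'match($0, /rt=[0-9.]+/) { print substr($0, RSTART + 3, RLENGTH - 3), $0 }' /var/log/nginx/access.log | sort -nr | head -10
# Show the distribution of status codes (field 9 in the combined format)
sudo awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -nr
4. Trace system calls with strace
# Get the Nginx master process ID from the PID file
# (worker PIDs can be listed with: pgrep -f 'nginx: worker process')
nginx_pid=$(cat /var/run/nginx.pid)
# Attach to the process and trace its system calls
sudo strace -p "$nginx_pid"
10.2 Performance Tuning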
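Before changing any of the kernel parameters in this section, it can be useful to record their current values, so that each change can be compared against a known starting point or rolled back. The following is a minimal sketch that reads the values straight from /proc/sys. The function name sysctl_snapshot and its root argument (there mainly so the function can be exercised against a scratch directory) are illustrative, not part of any standard tool.

```shell
#!/bin/sh
# sysctl_snapshot ROOT KEY...: print "key = value" for each kernel parameter.
# ROOT is normally /proc/sys; keys use the dotted sysctl form.
sysctl_snapshot() {
    root=$1
    shift
    for key in "$@"; do
        # net.core.somaxconn -> net/core/somaxconn
        path="$root/$(echo "$key" | tr '.' '/')"
        if [ -r "$path" ]; then
            echo "$key = $(cat "$path")"
        else
            echo "$key = (not available on this kernel)"
        fi
    done
}

# Record a few of the parameters that this section goes on to change.
sysctl_snapshot /proc/sys \
    net.core.somaxconn \
    net.ipv4.tcp_fin_timeout \
    net.ipv4.tcp_tw_reuse \
    net.ipv4.tcp_tw_recycle
```

The output uses the same key = value form as /etc/sysctl.conf, so redirecting it to a file gives a record that is easy to diff after tuning.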
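10.2.1 Operating System Optimization
1. Raise the file descriptor limit
# Raise the limit for login sessions (takes effect at the next login)
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
# Raise the limit for the Nginx service. systemd does not read limits.conf
# for services, so run `sudo systemctl edit nginx` and add:
[Service]
LimitNOFILE=65536
# Reload systemd and restart Nginx
sudo systemctl daemon-reload
sudo systemctl restart nginx
2. Tune TCP parameters
# Edit /etc/sysctl.conf
sudo nano /etc/sysctl.conf
# Add the following
# Enable TCP Fast Open for both client and server sockets
net.ipv4.tcp_fastopen = 3
# Raise the upper limit on the listen backlog (accept queue)
net.core.somaxconn = 65535
# Enlarge the connection tracking table (requires the nf_conntrack module)
net.nf_conntrack_max = 655350
net.netfilter.nf_conntrack_max = 655350
# Reclaim closed connections sooner
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
# tcp_tw_recycle was removed in Linux 4.12; keep this line only on older
# kernels such as the one shipped with CentOS 7
# net.ipv4.tcp_tw_recycle = 0
# Raise the socket receive and send buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
# Apply the settings
sudo sysctl -p
10.2.2 Nginx Parameter Tuning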
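The descriptor limit raised in 10.2.1 and the worker_connections value used in this section depend on each other: every client connection uses one descriptor in a worker, and a proxied connection uses a second one for the upstream side. The sketch below is a rough budget check under that assumption. The function name fd_budget_ok and the default factor of two are illustrative choices, not something Nginx defines.

```shell
#!/bin/sh
# Rough check: can one worker open enough descriptors for its connections?
# Assumption: a proxied connection needs up to 2 descriptors
# (client side + upstream side); log files and static files need a few more.

# fd_budget_ok WORKER_CONNECTIONS NOFILE_LIMIT [FDS_PER_CONNECTION]
# Returns 0 when the limit covers the worst case, 1 otherwise.
fd_budget_ok() {
    connections=$1
    nofile=$2
    per_conn=${3:-2}
    needed=$((connections * per_conn))
    [ "$needed" -le "$nofile" ]
}

# Example with the values used in this chapter.
if fd_budget_ok 65536 65536; then
    echo "limit covers 65536 proxied connections per worker"
else
    echo "limit too low: need about $((65536 * 2)) descriptors per worker"
fi
```

If the check fails, either lower worker_connections or raise the limit. Besides LimitNOFILE, Nginx has a worker_rlimit_nofile directive in the main context that raises the per-worker limit.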
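1. Worker process configuration
worker_processes auto;
worker_cpu_affinity auto;
2. Events module tuning
events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
}
3. HTTP module tuning
http {
    # Hide the version number
    server_tokens off;
    # Enable sendfile
    sendfile on;
    # Enable TCP_NOPUSH
    tcp_nopush on;
    # Enable TCP_NODELAY
    tcp_nodelay on;
    # Keep-alive timeout and maximum requests per connection
    keepalive_timeout 65;
    keepalive_requests 100;
    # Limit on the client request body size
    client_max_body_size 10m;
    # Buffer sizes for client request headers
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
    # Timeouts for reading the client request header and body
    client_header_timeout 10;
    client_body_timeout 10;
    # Timeout for sending the response
    send_timeout 10;
    # Enable gzip compression
    gzip on;
    gzip_comp_level 6;
    gzip_min_length 256;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_proxied any;
    gzip_vary on;
    # Enable HTTP/2. The listen directive is only valid inside a server
    # block, which also needs ssl_certificate and ssl_certificate_key:
    # server {
    #     listen 443 ssl http2;
    # }
    # Logging: either turn the access log off
    access_log off;
    # or keep it and buffer the writes (use one of the two; the "main"
    # format must be defined with log_format)
    # access_log /var/log/nginx/access.log main buffer=32k;
}
4. Reverse proxy tuning
location / {
    proxy_pass http://backend;
    # Enlarge the proxy buffers
    proxy_buffers 8 32k;
    proxy_buffer_size 64k;
    # Proxy timeouts (60s is also the Nginx default)
    proxy_connect_timeout 60s;
    proxy_read_timeout 60s;
    proxy_send_timeout 60s;
    # Enable caching (the my_cache zone must be declared with
    # proxy_cache_path in the http block)
    proxy_cache my_cache;
    proxy_cache_valid 200 10m;
    # Serve stale cached responses when the backend fails or an entry is being refreshed
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
}
10.2.3 Load Testing Tools (ab, wrk)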
1. Using ab (Apache Bench)
Install ab:
# Ubuntu/Debian
sudo apt-get install -y apache2-utils
# CentOS/RHEL
sudo yum install -y httpd-tools
Usage examples:
# 1000 requests, 100 concurrent
ab -n 1000 -c 100 http://example.com/
# Save per-request timings to a gnuplot-format file
ab -n 1000 -c 100 -g result.tsv http://example.com/
# Plot the results with gnuplot
# Create plot.gp
cat > plot.gp << EOF
set terminal png
set output "result.png"
set title "HTTP Benchmark Result"
set xlabel "Request Number"
set ylabel "Response Time (ms)"
set grid
plot "result.tsv" using 9 with lines title "Response Time"
EOF
# Generate the image
gnuplot plot.gp
2. Using wrk
Install wrk:
# Build from source (requires git, make, and a C compiler)
git clone https://github.com/wg/wrk.git
cd wrk
make
sudo mv wrk /usr/local/bin/
Usage examples:
# 100 connections, 10 threads, for 30 seconds
wrk -c 100 -t 10 -d 30s http://example.com/
# Use a Lua script to build the request
echo 'function request()
    return wrk.format("GET", "/api/v1/users")
end' > request.lua
wrk -c 100 -t 10 -d 30s -s request.lua http://example.com/
3. Analyzing load test results
Sample ab output:
Server Software:        nginx/1.18.0
Server Hostname:        example.com
Server Port:            80
Document Path:          /
Document Length:        123 bytes
Concurrency Level:      100
Time taken for tests:   1.234 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      123000 bytes
HTML transferred:       123000 bytes
Requests per second:    810.49 [#/sec] (mean)
Time per request:       123.384 [ms] (mean)
Time per request:       1.234 [ms] (mean, across all concurrent requests)
Transfer rate:          97.65 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.5      1       3
Processing:     5  120  15.3    118     156
Waiting:        4  119  15.2    117     155
Total:          5  121  15.3    119     158
Percentage of the requests served within a certain time (ms)
  50%    119
  66%    125
  75%    130
  80%    133
  90%    140
  95%    145
  98%    150
  99%    155
 100%    158 (longest request)
Sample wrk output:
Running 30s test @ http://example.com/
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   123.45ms   15.67ms 158.90ms   89.00%
    Req/Sec    81.23      9.87   100.00     78.00%
  24312 requests in 30.00s, 2.98MB read
Requests/sec:    810.40
Transfer/sec:    101.73KB
10.3 Hands-on Project: Nginx Performance Tuning
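In this hands-on project we tune an Nginx server end to end: system parameters, Nginx configuration, and verification with load tests.
10.3.1 Project Preparation
1. Environment
- A Linux server; Ubuntu 20.04 or CentOS 7 is recommended
- Nginx 1.18.0 or later
- The ab and wrk load testing tools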
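Once Nginx is installed, the minimum version listed above can be verified from the command line. The sketch below compares dotted version strings with sort -V, which assumes GNU coreutils or a compatible sort. The helper names version_ge and extract_nginx_version are made up for this example. nginx -v prints a line such as "nginx version: nginx/1.18.0" on standard error.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), an assumption about the local sort.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# extract_nginx_version: read `nginx -v` output on stdin, print e.g. 1.18.0
extract_nginx_version() {
    sed -n 's|^nginx version: nginx/\([0-9.]*\).*|\1|p'
}

# Usage on a real host (nginx -v writes to stderr, hence 2>&1):
#   installed=$(nginx -v 2>&1 | extract_nginx_version)
#   version_ge "$installed" 1.18.0 || echo "Nginx $installed is too old"

# Demonstration with a captured line instead of a live nginx binary.
installed=$(echo 'nginx version: nginx/1.18.0 (Ubuntu)' | extract_nginx_version)
if version_ge "$installed" 1.18.0; then
    echo "nginx $installed meets the 1.18.0 minimum"
fi
```

A plain string comparison would get this wrong, because "1.9.15" sorts after "1.18.0" character by character; version sort compares the numeric components instead.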
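2. Install dependencies
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y nginx apache2-utils git make
# CentOS/RHEL (on CentOS 7 the nginx package comes from the EPEL repository)
sudo yum update -y
sudo yum install -y epel-release
sudo yum install -y nginx httpd-tools git make
3. Install wrk
git clone https://github.com/wg/wrk.git
cd wrk
make
sudo mv wrk /usr/local/bin/
10.3.2 System Parameter Tuning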
1. Raise the file descriptor limit
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
# Set the limit for the Nginx service in a systemd drop-in file
sudo mkdir -p /etc/systemd/system/nginx.service.d
printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/nginx.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart nginx
2. Tune TCP parameters
sudo nano /etc/sysctl.conf
Add the following:
net.ipv4.tcp_fastopen = 3
net.core.somaxconn = 65535
net.nf_conntrack_max = 655350
net.netfilter.nf_conntrack_max = 655350
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
# Removed in Linux 4.12; keep only on older kernels such as CentOS 7
# net.ipv4.tcp_tw_recycle = 0
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
Apply the settings:
sudo sysctl -p
10.3.3 Nginx Configuration Tuning
1. Back up the original configuration
sudo cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak
2. Tune the Nginx configuration
sudo nano /etc/nginx/nginx.conf
Replace the contents with:
worker_processes auto;
worker_cpu_affinity auto;
error_log /var/log/nginx/error.log warn;
pid /run/nginx.pid;
# Dynamic modules (CentOS/RHEL path; Ubuntu uses /etc/nginx/modules-enabled/*.conf)
include /usr/share/nginx/modules/*.conf;
events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
}
http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';
    # Access logging is off for the benchmark; the format above is kept for later use
    access_log off;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 100;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    # Gzip settings
    gzip on;
    gzip_comp_level 6;
    gzip_min_length 256;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_proxied any;
    gzip_vary on;
    # Proxy settings
    proxy_buffers 8 32k;
    proxy_buffer_size 64k;
    proxy_connect_timeout 60s;
    proxy_read_timeout 60s;
    proxy_send_timeout 60s;
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
3. Restart Nginx
sudo nginx -t
sudo systemctl restart nginx
10.3.4 Load Test Verification
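1. Prepare a test page
sudo mkdir -p /var/www/test
echo "<h1>Test Page</h1>" | sudo tee /var/www/test/index.html
# Create the test site configuration
sudo nano /etc/nginx/conf.d/test.conf
Add the following:
server {
    listen 80;
    server_name test.example.com;
    root /var/www/test;
    index index.html;
}
2. Run the load tests
# Validate and reload the configuration
sudo nginx -t
sudo systemctl reload nginx
# Baseline run with ab. For a real before/after comparison, run this step
# before applying the changes in 10.3.2 and 10.3.3. The Host header makes
# sure the request reaches the test server block rather than a default site.
echo "=== Initial Test ===" > benchmark.txt
ab -n 1000 -c 100 -H "Host: test.example.com" http://localhost/ >> benchmark.txt
# Run again after tuning
printf '\n=== Optimized Test ===\n' >> benchmark.txt
ab -n 1000 -c 100 -H "Host: test.example.com" http://localhost/ >> benchmark.txt
# Run with wrk
printf '\n=== wrk Test ===\n' >> benchmark.txt
wrk -c 100 -t 10 -d 30s -H "Host: test.example.com" http://localhost/ >> benchmark.txt
# View the results
cat benchmark.txt
3. Analyze the results
Compare the results before and after tuning, focusing on these metrics:
- Requests per second
- Time per request (mean time per request)
- 90th, 95th, and 99th percentile response times
- Failed requests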
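The metrics listed above can be pulled out of saved ab reports so that two runs are easy to compare side by side. The following is a minimal sketch, assuming each run was saved to its own file. The function name ab_metrics and the file names before.txt and after.txt are illustrative.

```shell
#!/bin/sh
# ab_metrics FILE: print the headline numbers from one saved ab report.
ab_metrics() {
    awk '
        /^Requests per second:/ { print "rps=" $4 }
        # ab prints two "Time per request" lines; keep only the first (per-request mean).
        /^Time per request:/ && !seen_tpr { print "time_per_request_ms=" $4; seen_tpr = 1 }
        /^Failed requests:/ { print "failed=" $3 }
        /^ *(90|95|99)%/ { sub(/%/, "", $1); print "p" $1 "_ms=" $2 }
    ' "$1"
}

# Usage with two saved reports:
#   ab_metrics before.txt
#   ab_metrics after.txt

# Demonstration on a shortened report (values from the sample in 10.2.3).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Failed requests:        0
Requests per second:    810.49 [#/sec] (mean)
Time per request:       123.384 [ms] (mean)
Time per request:       1.234 [ms] (mean, across all concurrent requests)
  90%    140
  95%    145
  99%    155
 100%    158 (longest request)
EOF
ab_metrics "$sample"
rm -f "$sample"
```

Printing one name=value pair per line keeps the output easy to diff or to feed into another script.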
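10.3.5 Common Problems and Solutions
Problem 1: CPU usage is too high during the load test
Solutions:
- Match the number of Nginx worker processes to the number of CPU cores
- Enable worker_cpu_affinity to pin worker processes to specific cores
- Optimize application code to reduce CPU-intensive work
- Consider caching to reduce CPU load on the backend servers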
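Whether the worker count actually matches the core count can be checked from the process list. The following is a minimal sketch. count_workers is a hypothetical helper that counts lines starting with "nginx: worker process", the process title Nginx sets for its workers.

```shell
#!/bin/sh
# count_workers: read `ps -eo args` output on stdin, print the number of
# Nginx worker processes. Anchoring at the start of the line keeps the
# grep process itself from being counted.
count_workers() {
    grep -c '^nginx: worker process' || true
}

# Usage on a live host:
#   workers=$(ps -eo args | count_workers)
#   cores=$(nproc)
#   [ "$workers" -eq "$cores" ] || echo "workers=$workers cores=$cores"

# Demonstration on captured process titles.
printf '%s\n' \
    'nginx: master process /usr/sbin/nginx' \
    'nginx: worker process' \
    'nginx: worker process' \
    'sshd: /usr/sbin/sshd -D' | count_workers
```

With worker_processes auto, the two numbers would normally agree; a mismatch usually means a hard-coded worker_processes value is still in effect.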
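Problem 2: Memory usage is too high during the load test
Solutions:
- Lower proxy_buffers and proxy_buffer_size to reduce per-connection memory
- Lower worker_connections to cap the number of concurrent connections
- Limit the memory available to Nginx with cgroups
- Consider a more efficient caching strategy, such as Redis or Memcached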
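To see why the buffer settings matter, it helps to estimate the worst case: each proxied connection can hold one buffer of proxy_buffer_size plus the full set of proxy_buffers. The sketch below does that arithmetic for the values used in this chapter. It is a rough upper bound under that assumption, not a measurement, and proxy_mem_kb is a name made up for the example.

```shell
#!/bin/sh
# proxy_mem_kb BUFFER_SIZE_KB BUFFERS_COUNT BUFFERS_SIZE_KB CONNECTIONS
# Prints an upper-bound estimate, in KB, of memory held in proxy buffers.
proxy_mem_kb() {
    per_conn=$(( $1 + $2 * $3 ))
    echo $(( per_conn * $4 ))
}

# proxy_buffer_size 64k; proxy_buffers 8 32k; 1000 concurrent proxied connections
# Per connection: 64 + 8 * 32 = 320 KB
total_kb=$(proxy_mem_kb 64 8 32 1000)
echo "about $((total_kb / 1024)) MB"   # prints "about 312 MB"
```

Halving the buffer count or size roughly halves this bound, which is why these two directives are the first place to look when memory grows with the number of proxied connections.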
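Problem 3: Network bandwidth becomes the bottleneck during the load test
Solutions:
- Enable gzip compression to reduce the amount of data transferred
- Serve static assets from a CDN to reduce bandwidth use at the origin server
- Optimize images and static assets to reduce file sizes
- Consider HTTP/2 to improve transfer efficiency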
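How much gzip helps depends on the content, so it can be worth measuring on a representative file before relying on it. The following is a minimal sketch that uses the gzip command-line tool at level 6 to mirror gzip_comp_level 6. The helper name gzip_ratio is illustrative, and the command-line result only approximates what Nginx sends.

```shell
#!/bin/sh
# gzip_ratio FILE: print "original_bytes compressed_bytes" for one file.
gzip_ratio() {
    # tr strips the padding that some wc implementations add
    original=$(wc -c < "$1" | tr -d ' ')
    compressed=$(gzip -6 -c "$1" | wc -c | tr -d ' ')
    echo "$original $compressed"
}

# Demonstration on generated markup. Repetitive text compresses very well;
# real pages will vary, and already-compressed images will barely shrink.
sample=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
    echo '<li class="item">Test Page entry</li>'
    i=$((i + 1))
done > "$sample"
gzip_ratio "$sample"
rm -f "$sample"
```

Files smaller than gzip_min_length (256 bytes in this chapter's configuration) are sent uncompressed, so very small responses see no benefit.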
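Chapter Summary
In this chapter we covered:
Troubleshooting common problems:
- Checking configuration syntax with nginx -t
- Diagnosing connection problems, including listening ports, firewall settings, and SELinux configuration
- Locating performance problems, including Nginx status checks, system resource monitoring, and log analysis
Performance tuning:
- Tuning operating system parameters, including the file descriptor limit and TCP parameters
- Tuning the Nginx configuration, including worker processes, the events module, and the HTTP module
- Load testing with ab and wrk and analyzing the results
Hands-on project:
- Tuning an Nginx server end to end
- Running load tests to verify the effect of the tuning
- Analyzing the load test results and solving common problems
Exercises
- Check the Nginx configuration syntax with nginx -t
- Diagnose an unreachable Nginx site, covering listening ports, firewall settings, and SELinux configuration
- Tune the Nginx configuration, including worker processes, the events module, and the HTTP module
- Run load tests with ab and wrk and compare performance before and after tuning
- Analyze the load test results, identify the bottleneck, and tune accordingly
Further Reading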
- Nginx Documentation
- Nginx Optimization
- Apache Bench Documentation
- wrk Documentation
- Linux Performance Optimization