从502到排障:Nginx常见故障分析案例
作为一名运维工程师,你是否曾在深夜被502错误的报警电话惊醒?是否因为神秘的Nginx故障而焦头烂额?本文将通过真实案例,带你深入Nginx故障排查的精髓,让你从运维小白进阶为故障排查专家。
引言:那些年我们踩过的Nginx坑
在互联网公司的运维生涯中,Nginx故障可以说是最常见也最让人头疼的问题之一。从简单的配置错误到复杂的性能瓶颈,从偶发的502到持续的高延迟,每一个故障背后都有其独特的原因和解决方案。
作为拥有8年运维经验的工程师,我见证了无数次午夜故障处理,也总结出了一套行之有效的故障排查方法论。今天,我将通过10个真实案例,手把手教你如何快速定位和解决Nginx常见故障。
案例一:经典502错误 - 上游服务不可达
故障现象
某电商网站在促销活动期间突然出现大量502错误,用户无法正常下单,业务损失严重。
故障排查过程
第一步:查看Nginx错误日志
# 查看最新的错误日志 tail -f /var/log/nginx/error.log # 典型502错误日志 2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"
第二步:检查上游服务状态
# 检查后端服务是否正常运行 netstat -tulpn | grep 8080 ps aux | grep java # 测试上游服务连通性 curl -I http://192.168.1.200:8080/health telnet 192.168.1.200 8080
第三步:分析Nginx配置
upstream backend_servers {
server192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
server192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}
server {
listen80;
server_name shop.example.com;
location /api/ {
proxy_pass http://backend_servers;
proxy_connect_timeout5s;
proxy_read_timeout60s;
proxy_send_timeout60s;
}
}
根因分析
通过排查发现,主服务器192.168.1.200由于负载过高导致Java应用崩溃,而备份服务器配置有误未能及时接管流量。
解决方案
# 1. 重启故障服务器的应用
systemctl restart tomcat
# 2. 修复备份服务器配置
# 将backup参数移除,让两台服务器同时处理请求
upstream backend_servers {
server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}
# 3. 重载Nginx配置
nginx -t && nginx -s reload
预防措施
• 配置健康检查机制
• 设置合理的负载均衡策略
• 建立完善的监控告警体系
案例二:SSL证书过期导致的服务中断
故障现象
某金融网站客户反馈无法访问,浏览器显示"您的连接不是私密连接"错误。
故障排查过程
检查SSL证书状态
# 查看证书到期时间 openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates # 使用openssl检查在线证书 echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates # 查看Nginx SSL配置 nginx -T | grep -A 10 -B 5 ssl_certificate
Nginx SSL配置示例
server {
listen443 ssl http2;
server_name finance.example.com;
ssl_certificate /etc/nginx/ssl/domain.crt;
ssl_certificate_key /etc/nginx/ssl/domain.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphersoff;
# HSTS设置
add_header Strict-Transport-Security "max-age=31536000" always;
}
解决方案
# 1. 生成新的SSL证书(以Let's Encrypt为例) certbot --nginx -d finance.example.com # 2. 手动更新证书配置 ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem; ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem; # 3. 测试并重载配置 nginx -t && nginx -s reload # 4. 验证SSL证书 curl -I https://finance.example.com
自动化解决方案
# 创建证书更新脚本 cat > /etc/cron.d/certbot << 'EOF' 0 12 * * * /usr/bin/certbot renew --quiet --post-hook "nginx -s reload" EOF # 添加证书监控脚本 cat > /usr/local/bin/ssl_check.sh << 'EOF' #!/bin/bash DOMAIN="finance.example.com" DAYS=30 EXPIRY_DATE=$(echo | openssl s_client -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2) EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s) CURRENT_EPOCH=$(date +%s) DAYS_LEFT=$(( ($EXPIRY_EPOCH - $CURRENT_EPOCH) / 86400 )) if [ $DAYS_LEFT -lt $DAYS ]; then echo"SSL certificate for $DOMAIN expires in $DAYS_LEFT days!" # 发送告警 fi EOF
案例三:高并发下的性能瓶颈
故障现象
某视频网站在晚高峰期间响应缓慢,部分用户反馈视频加载失败。
性能分析工具
# 查看Nginx连接状态 curl http://localhost/nginx_status # 使用htop查看系统负载 htop # 检查网络连接数 ss -tuln | wc -l netstat -an | grep :80 | wc -l
Nginx状态页配置
server {
listen80;
server_name localhost;
location /nginx_status {
stub_statuson;
access_logoff;
allow127.0.0.1;
deny all;
}
}
性能优化配置
# 主配置优化
worker_processes auto;
worker_connections65535;
worker_rlimit_nofile65535;
events {
useepoll;
multi_accepton;
worker_connections65535;
}
http {
# 开启gzip压缩
gzipon;
gzip_varyon;
gzip_min_length1000;
gzip_types text/plain text/css application/json application/javascript;
# 缓存优化
open_file_cache max=100000 inactive=20s;
open_file_cache_valid30s;
open_file_cache_min_uses2;
open_file_cache_errorson;
# 连接优化
keepalive_timeout65;
keepalive_requests100;
# 缓冲区优化
client_body_buffer_size128k;
client_max_body_size50m;
client_header_buffer_size1k;
large_client_header_buffers44k;
}
系统层面优化
# 优化系统参数 cat >> /etc/sysctl.conf << 'EOF' # 网络优化 net.core.somaxconn = 65535 net.core.netdev_max_backlog = 5000 net.ipv4.tcp_max_syn_backlog = 65535 net.ipv4.tcp_fin_timeout = 30 net.ipv4.tcp_keepalive_time = 1200 net.ipv4.tcp_max_tw_buckets = 5000 # 文件描述符优化 fs.file-max = 1000000 EOF # 应用配置 sysctl -p
案例四:缓存配置错误导致的问题
故障现象
某新闻网站更新内容后,用户仍然看到旧内容,清除浏览器缓存后问题依然存在。
缓存配置分析
server {
listen80;
server_name news.example.com;
# 静态资源缓存
location~* .(jpg|jpeg|png|gif|ico|css|js)$ {
expires1y;
add_header Cache-Control "public, immutable";
add_header Pragma public;
}
# 动态内容
location / {
proxy_pass http://backend;
# 错误的缓存配置
proxy_cache_valid20030210m;
proxy_cache_valid4041m;
add_header X-Cache-Status $upstream_cache_status;
}
}
问题排查
# 检查缓存目录 ls -la /var/cache/nginx/ # 查看缓存配置 nginx -T | grep -A 20 proxy_cache # 测试缓存状态 curl -I http://news.example.com/article/123 | grep X-Cache-Status
正确的缓存配置
http {
# 缓存路径配置
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;
server {
listen80;
server_name news.example.com;
# API接口不缓存
location /api/ {
proxy_pass http://backend;
proxy_cacheoff;
add_header Cache-Control "no-cache, no-store, must-revalidate";
}
# 新闻内容缓存
location /article/ {
proxy_pass http://backend;
proxy_cache my_cache;
proxy_cache_valid2005m;
proxy_cache_use_staleerror timeout updating;
add_header X-Cache-Status $upstream_cache_status;
}
# 静态资源长期缓存
location~* .(jpg|jpeg|png|gif|ico)$ {
expires1y;
add_header Cache-Control "public, immutable";
}
location~* .(css|js)$ {
expires1d;
add_header Cache-Control "public";
}
}
}
缓存管理工具
# 清除特定URL缓存 curl -X PURGE http://news.example.com/article/123 # 批量清除缓存 find /var/cache/nginx -type f -name "*.cache" -mtime +7 -delete # 缓存统计脚本 cat > /usr/local/bin/cache_stats.sh << 'EOF' #!/bin/bash CACHE_DIR="/var/cache/nginx" echo"Cache directory size: $(du -sh $CACHE_DIR)" echo"Cache files count: $(find $CACHE_DIR -type f | wc -l)" echo"Cache hit rate: $(grep -c HIT /var/log/nginx/access.log)" EOF
案例五:日志轮转异常导致磁盘空间耗尽
故障现象
服务器突然无法响应,检查发现磁盘空间100%占用,主要是Nginx日志文件过大。
问题诊断
# 检查磁盘空间 df -h # 找出大文件 du -h /var/log/nginx/ | sort -hr # 检查日志轮转配置 cat /etc/logrotate.d/nginx
修复和优化
# 紧急处理:截断当前日志 > /var/log/nginx/access.log > /var/log/nginx/error.log # 重启nginx以重新打开日志文件 nginx -s reopen
优化的日志轮转配置
# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 640 nginx nginx
sharedscripts
postrotate
if [ -f /var/run/nginx.pid ]; then
kill -USR1 `cat /var/run/nginx.pid`
fi
endscript
}
日志配置优化
http {
# 自定义日志格式
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status$body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'rt=$request_time uct="$upstream_connect_time" '
'uht="$upstream_header_time" urt="$upstream_response_time"';
# 条件日志记录
map$status$loggable {
~^[23] 0;
default1;
}
server {
# 只记录错误请求
access_log /var/log/nginx/access.log main if=$loggable;
# 静态资源不记录日志
location~* .(jpg|jpeg|png|gif|ico|css|js)$ {
access_logoff;
expires1y;
}
}
}
监控脚本
# 磁盘空间监控
cat > /usr/local/bin/disk_monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $USAGE -gt $THRESHOLD ]; then
echo"Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
# 自动清理老日志
find /var/log/nginx -name "*.log.*" -mtime +7 -delete
# 发送告警
fi
EOF
案例六:负载均衡配置错误
故障现象
某服务采用多台后端服务器,但发现流量分配不均,部分服务器负载过高而其他服务器闲置。
负载均衡策略对比
# 轮询(默认)
upstream backend_round_robin {
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
# 加权轮询
upstream backend_weighted {
server192.168.1.10:8080 weight=3;
server192.168.1.11:8080 weight=2;
server192.168.1.12:8080 weight=1;
}
# IP哈希
upstream backend_ip_hash {
ip_hash;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
# 最少连接
upstream backend_least_conn {
least_conn;
server192.168.1.10:8080;
server192.168.1.11:8080;
server192.168.1.12:8080;
}
健康检查配置
upstream backend_with_health {
server192.168.1.10:8080 max_fails=3 fail_timeout=30s;
server192.168.1.11:8080 max_fails=3 fail_timeout=30s;
server192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
# keepalive连接池
keepalive32;
}
server {
location / {
proxy_pass http://backend_with_health;
# 健康检查相关
proxy_next_upstreamerror timeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries2;
proxy_next_upstream_timeout5s;
# 连接复用
proxy_http_version1.1;
proxy_set_header Connection "";
}
}
监控脚本
# 后端服务器健康检查脚本
cat > /usr/local/bin/backend_health_check.sh << 'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080""192.168.1.11:8080""192.168.1.12:8080")
for server in"${SERVERS[@]}"; do
if curl -sf "http://$server/health" > /dev/null; then
echo"$server: OK"
else
echo"$server: FAILED"
# 发送告警
fi
done
EOF
案例七:安全配置漏洞
故障现象
网站被恶意扫描,发现存在多个安全漏洞,需要加强Nginx安全配置。
安全加固配置
server {
listen80;
server_name secure.example.com;
# 隐藏版本信息
server_tokensoff;
more_set_headers"Server: WebServer";
# 安全头设置
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "no-referrer-when-downgrade" always;
add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;
# 限制请求方法
if ($request_method !~ ^(GET|HEAD|POST)$ ) {
return405;
}
# 防止目录遍历
location~ /. {
deny all;
access_logoff;
log_not_foundoff;
}
# 限制文件上传大小
client_max_body_size10M;
# 限制请求频率
limit_req_zone$binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone$binary_remote_addr zone=login:10m rate=1r/s;
location /api/ {
limit_req zone=api burst=20 nodelay;
proxy_pass http://backend;
}
location /login {
limit_req zone=login burst=5 nodelay;
proxy_pass http://backend;
}
}
防护脚本
# fail2ban配置示例 cat > /etc/fail2ban/filter.d/nginx-4xx.conf << 'EOF' [Definition] failregex = ^-.*"(GET|POST).*" (404|403|400) .*$ ignoreregex = EOF cat > /etc/fail2ban/jail.local << 'EOF' [nginx-4xx] enabled = true port = http,https filter = nginx-4xx logpath = /var/log/nginx/access.log maxretry = 10 bantime = 3600 findtime = 60 EOF
案例八:反向代理配置问题
故障现象
使用Nginx作为反向代理时,客户端真实IP丢失,后端服务无法获取正确的客户端信息。
问题分析和解决
server {
listen80;
server_name api.example.com;
location / {
proxy_pass http://backend;
# 正确传递客户端IP
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# 处理重定向
proxy_redirectoff;
# 超时设置
proxy_connect_timeout30s;
proxy_send_timeout30s;
proxy_read_timeout30s;
# 缓冲设置
proxy_bufferingon;
proxy_buffer_size4k;
proxy_buffers84k;
proxy_busy_buffers_size8k;
}
}
WebSocket支持
map$http_upgrade$connection_upgrade {
default upgrade;
'' close;
}
server {
listen80;
server_name ws.example.com;
location /websocket {
proxy_pass http://backend;
proxy_http_version1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header Host $host;
proxy_cache_bypass$http_upgrade;
# WebSocket特殊配置
proxy_read_timeout86400;
}
}
案例九:URL重写规则冲突
故障现象
网站URL重写规则复杂,出现重定向循环和404错误。
重写规则优化
server {
listen80;
server_name example.com www.example.com;
# 强制跳转到主域名
if ($host != 'example.com') {
return301 https://example.com$request_uri;
}
# SEO友好的URL重写
location / {
try_files$uri$uri/ @rewrites;
}
location@rewrites {
rewrite ^/product/([0-9]+)$ /product.php?id=$1last;
rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1last;
rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1last;
return404;
}
# 防止重定向循环
location~ .php$ {
try_files$uri =404;
fastcgi_pass127.0.0.1:9000;
fastcgi_index index.php;
include fastcgi_params;
}
}
调试重写规则
# 开启重写日志
error_log /var/log/nginx/rewrite.log notice;
rewrite_logon;
# 测试重写规则
location /test {
rewrite ^/test/(.*)$ /debug?param=$1break;
return200"Rewrite test: $args
";
}
案例十:性能监控与调优
故障现象
需要建立完善的Nginx性能监控体系,及时发现和解决性能问题。
监控脚本
# Nginx性能监控脚本
cat > /usr/local/bin/nginx_monitor.sh << 'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"
# 获取状态信息
STATUS=$(curl -s $NGINX_STATUS_URL)
ACTIVE_CONN=$(echo"$STATUS" | grep "Active connections" | awk '{print $3}')
ACCEPTS=$(echo"$STATUS" | awk 'NR==2 {print $1}')
HANDLED=$(echo"$STATUS" | awk 'NR==2 {print $2}')
REQUESTS=$(echo"$STATUS" | awk 'NR==2 {print $3}')
READING=$(echo"$STATUS" | awk 'NR==3 {print $2}')
WRITING=$(echo"$STATUS" | awk 'NR==3 {print $4}')
WAITING=$(echo"$STATUS" | awk 'NR==3 {print $6}')
# 记录到日志
echo"$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >> $LOG_FILE
# 告警逻辑
if [ $ACTIVE_CONN -gt 1000 ]; then
echo"High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
EOF
综合调优配置
# 终极优化配置
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
useepoll;
worker_connections10240;
multi_accepton;
accept_mutexoff;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
# 日志格式
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status$body_bytes_sent "$http_referer" '
'"$http_user_agent" $request_time$upstream_response_time';
# 性能优化
sendfileon;
tcp_nopushon;
tcp_nodelayon;
keepalive_timeout65;
keepalive_requests1000;
# 压缩优化
gzipon;
gzip_varyon;
gzip_min_length1000;
gzip_comp_level6;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
# 缓存优化
open_file_cache max=100000 inactive=20s;
open_file_cache_valid30s;
open_file_cache_min_uses2;
open_file_cache_errorson;
# 安全优化
server_tokensoff;
client_header_timeout10;
client_body_timeout10;
reset_timedout_connectionon;
send_timeout10;
# 限流配置
limit_req_zone$binary_remote_addr zone=global:10m rate=100r/s;
limit_conn_zone$binary_remote_addr zone=addr:10m;
include /etc/nginx/conf.d/*.conf;
}
故障排查方法论总结
1. 标准化排查流程
1. 收集故障信息:确认故障现象、影响范围、发生时间
2. 查看日志文件:error.log、access.log、系统日志
3. 检查配置文件:语法检查、逻辑检查
4. 验证网络连通:端口状态、连通性测试
5. 分析性能指标:CPU、内存、网络、磁盘
6. 确定根本原因:深入分析,找出真正原因
7. 实施解决方案:临时修复、永久解决
8. 验证修复效果:功能测试、性能测试
9. 总结经验教训:文档记录、流程优化
2. 常用排查工具
• 日志分析:tail、grep、awk、sed
• 网络工具:curl、wget、telnet、netstat、ss
• 性能监控:htop、iotop、iftop、nginx-status
• 系统诊断:strace、lsof、tcpdump
3. 预防性措施
• 建立完善的监控告警体系
• 定期进行配置文件备份
• 实施自动化运维工具
• 制定标准化操作流程
• 定期进行故障演练
结语
Nginx故障排查是运维工程师必备的核心技能,需要扎实的理论基础和丰富的经验
全部0条评论
快来发表一下你的评论吧 !