A Summary of Common Nginx Failure Cases

马哥Linux运维 2025-09-18

From 502 to Root Cause: Case Studies in Common Nginx Failures

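As an operations engineer, have you ever been jolted awake in the middle of the night by an alert call about 502 errors? Have you ever been at your wits' end over an Nginx failure you could not explain? This article uses real cases to walk through the essentials of Nginx troubleshooting, so that you can grow from an operations beginner into a capable troubleshooter.

Introduction: The Nginx Pitfalls We Have All Stepped Into

In operations work at an internet company, Nginx failures are among the most common and most frustrating problems. From simple configuration mistakes to complex performance bottlenecks, from occasional 502s to sustained high latency, every failure has its own cause and its own fix.

As an engineer with 8 years of operations experience, I have been through countless midnight incidents and distilled a troubleshooting methodology that works in practice. Below, I use 10 real cases to show step by step how to locate and resolve common Nginx failures quickly.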

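Case 1: The Classic 502 Error - Upstream Service Unreachable

Symptoms

During a sales promotion, an e-commerce site suddenly began returning a large number of 502 errors. Users could not place orders, and the business impact was severe.

Troubleshooting

Step 1: Check the Nginx error log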

# Follow the error log in real time
tail -f /var/log/nginx/error.log

# A typical 502 entry in the error log
2024/09/15 1425 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"
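
When an incident produces thousands of entries like the one above, grouping them by upstream address shows at a glance which backend is refusing connections. The following is a minimal sketch, assuming the default error log format shown above; `count_upstream_failures` is a hypothetical helper name, not something Nginx provides.

```shell
#!/bin/sh
# count_upstream_failures: read an Nginx error log on stdin and print
# "<upstream host:port> <count>" for every "connect() failed" entry,
# most frequent first.
# Hypothetical helper; assumes the default Nginx error log format.
count_upstream_failures() {
    grep 'connect() failed' \
        | grep -oE 'upstream: "[^"]+"' \
        | sed -E 's#upstream: "[a-z]+://([^/"]+).*#\1#' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'
}

# Usage:
#   count_upstream_failures < /var/log/nginx/error.log
```

If a single address dominates the output, the problem is most likely on that backend rather than in Nginx itself.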

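Step 2: Check the upstream service

# On the backend host, check whether the service is running and listening
netstat -tulpn | grep 8080
ps aux | grep java

# From the Nginx host, test connectivity to the upstream service
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080

Step 3: Review the Nginx configuration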

upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}

server {
    listen 80;
    server_name shop.example.com;

    location /api/ {
        proxy_pass http://backend_servers;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
    }
}

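Root cause

The investigation showed that the Java application on the primary server, 192.168.1.200, had crashed under excessive load, and that the backup server was misconfigured and failed to take over the traffic in time.

Solution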

# 1. Restart the application on the failed server
systemctl restart tomcat

# 2. Fix the backup server configuration
# Remove the backup parameter so that both servers handle requests
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
    server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}

# 3. Test and reload the Nginx configuration
nginx -t && nginx -s reload

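Prevention

• Configure a health check mechanism

• Choose a load balancing strategy that fits the workload

• Build thorough monitoring and alerting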

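Case 2: Service Outage Caused by an Expired SSL Certificate

Symptoms

Customers of a financial website reported that they could not access the site; the browser showed a "Your connection is not private" error.

Troubleshooting

Check the SSL certificate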

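# Show the validity dates of the certificate file
openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates

# Check the certificate actually served online
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates

# Show the SSL settings in the effective Nginx configuration
nginx -T | grep -A 10 -B 5 ssl_certificate

Example Nginx SSL configuration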

server {
    listen 443 ssl http2;
    server_name finance.example.com;

    ssl_certificate /etc/nginx/ssl/domain.crt;
    ssl_certificate_key /etc/nginx/ssl/domain.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;

    # HSTS
    add_header Strict-Transport-Security "max-age=31536000" always;
}

Solution

# 1. Issue a new SSL certificate (using Let's Encrypt as an example)
certbot --nginx -d finance.example.com

# 2. Point the Nginx configuration at the new certificate files
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;

# 3. Test and reload the configuration
nginx -t && nginx -s reload

# 4. Verify the SSL certificate
curl -I https://finance.example.com

Automating renewal and monitoring

# Create the scheduled certificate renewal job
# (entries in /etc/cron.d need a user field, here root)
cat > /etc/cron.d/certbot << 'EOF'
0 12 * * * root /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF

# Add a certificate monitoring script
cat > /usr/local/bin/ssl_check.sh << 'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30

EXPIRY_DATE=$(echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))

if [ "$DAYS_LEFT" -lt "$DAYS" ]; then
    echo "SSL certificate for $DOMAIN expires in $DAYS_LEFT days!"
    # Send an alert here
fi
EOF
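
The days-remaining arithmetic in the monitoring script is easier to verify when it lives in its own function that can be fed a fixed "now". This is a minimal sketch under two assumptions: GNU `date` is available (the `-d` option, as in the script above), and the helper name `days_until` is hypothetical.

```shell
#!/bin/sh
# days_until: print the whole number of days between "now" and an expiry date.
#   $1  expiry date as printed by `openssl x509 -noout -enddate`
#       after the "notAfter=" prefix, e.g. "Jan  1 00:00:00 2030 GMT"
#   $2  optional current time as a Unix epoch; defaults to the system clock
# Hypothetical helper; assumes GNU date for the -d option.
days_until() {
    expiry_epoch=$(date -d "$1" +%s) || return 1
    now_epoch=${2:-$(date +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage:
#   days_until "Jan  1 00:00:00 2030 GMT"
```

A negative result means the certificate has already expired, which the alert condition above also catches because any negative number is below the threshold.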

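Case 3: Performance Bottlenecks Under High Concurrency

Symptoms

A video site responded slowly during the evening peak, and some users reported that videos failed to load.

Performance analysis tools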

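# Query the Nginx status page
curl http://localhost/nginx_status

# Inspect system load with htop
htop

# Count TCP sockets in all states (the ss output includes one header line)
ss -tan | wc -l
netstat -an | grep :80 | wc -l

Nginx status page configuration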

server {
    listen 80;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
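
The status page returns four lines of plain text: the active connection count, a header line, the three cumulative counters (accepts, handled, requests), and the Reading/Writing/Waiting breakdown. The sketch below turns that output into one `key=value` line that is easy to log or alert on. The helper name `parse_stub_status` is hypothetical.

```shell
#!/bin/sh
# parse_stub_status: read stub_status output on stdin and print all seven
# values on a single key=value line.
# Hypothetical helper; expects the standard four-line stub_status format.
parse_stub_status() {
    awk '
        /^Active connections:/          { active = $3 }
        /^ *[0-9]+ +[0-9]+ +[0-9]+/     { accepts = $1; handled = $2; requests = $3 }
        /^Reading:/                     { reading = $2; writing = $4; waiting = $6 }
        END {
            printf "active=%s accepts=%s handled=%s requests=%s reading=%s writing=%s waiting=%s\n",
                   active, accepts, handled, requests, reading, writing, waiting
        }'
}

# Usage:
#   curl -s http://localhost/nginx_status | parse_stub_status
```

Matching on line content rather than line number keeps the parser from silently reading the header line as data.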

Performance tuning configuration

# Main context tuning
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    use epoll;
    multi_accept on;
    # worker_connections is valid only inside the events block
    worker_connections 65535;
}

http {
    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_types text/plain text/css application/json application/javascript;

    # File descriptor cache
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    # Connection tuning
    keepalive_timeout 65;
    keepalive_requests 100;

    # Buffer tuning
    client_body_buffer_size 128k;
    client_max_body_size 50m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
}
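
A quick way to sanity-check these numbers is to estimate the theoretical connection ceiling: worker_processes multiplied by worker_connections. When Nginx acts as a reverse proxy, each client request also holds a connection to the backend, so the practical client ceiling is roughly half of that. The sketch below only does this arithmetic; `max_clients` is a hypothetical helper, and real limits also depend on file descriptors, memory, and the backends.

```shell
#!/bin/sh
# max_clients: rough upper bound on concurrent client connections.
#   $1  number of worker processes
#   $2  worker_connections value
#   $3  "proxy" to halve the result (one client + one upstream connection
#       per request); anything else, or omitted, leaves it unchanged
# Hypothetical helper; an estimate, not a guarantee.
max_clients() {
    workers=$1
    connections=$2
    mode=${3:-static}
    total=$(( workers * connections ))
    if [ "$mode" = "proxy" ]; then
        total=$(( total / 2 ))
    fi
    echo "$total"
}

# Usage:
#   max_clients 4 65535         # serving static files
#   max_clients 4 65535 proxy   # reverse proxy
```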

System-level tuning

# Tune kernel parameters
cat >> /etc/sysctl.conf << 'EOF'
# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000

# File descriptor limit
fs.file-max = 1000000
EOF

# Apply the settings
sysctl -p

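Case 4: Problems Caused by Misconfigured Caching

Symptoms

After a news site published updated content, users kept seeing the old version, and the problem persisted even after they cleared their browser cache.

Analyzing the cache configuration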

server {
    listen 80;
    server_name news.example.com;

    # Static asset caching
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Pragma public;
    }

    # Dynamic content
    location / {
        proxy_pass http://backend;

        # Faulty cache configuration
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        add_header X-Cache-Status $upstream_cache_status;
    }
}

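Troubleshooting

# Inspect the cache directory
ls -la /var/cache/nginx/

# Show the cache settings in the effective configuration
nginx -T | grep -A 20 proxy_cache

# Check the cache status of a response
curl -I http://news.example.com/article/123 | grep X-Cache-Status

Corrected cache configuration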

http {
    # Cache path
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;

    server {
        listen 80;
        server_name news.example.com;

        # Do not cache API endpoints
        location /api/ {
            proxy_pass http://backend;
            proxy_cache off;
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }

        # Cache news articles briefly
        location /article/ {
            proxy_pass http://backend;
            proxy_cache my_cache;
            proxy_cache_valid 200 5m;
            proxy_cache_use_stale error timeout updating;
            add_header X-Cache-Status $upstream_cache_status;
        }

        # Long-lived caching for images
        location ~* \.(jpg|jpeg|png|gif|ico)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
        }

        # Shorter caching for CSS and JavaScript
        location ~* \.(css|js)$ {
            expires 1d;
            add_header Cache-Control "public";
        }
    }
}

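Cache management tools

# Purge the cache for one URL
# (requires purge support, e.g. the third-party ngx_cache_purge module or NGINX Plus;
# the configuration above does not enable it)
curl -X PURGE http://news.example.com/article/123

# Bulk-delete cache files older than 7 days
# (Nginx names cache files by hash, without a .cache extension)
find /var/cache/nginx -type f -mtime +7 -delete

# Cache statistics script
cat > /usr/local/bin/cache_stats.sh << 'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo "Cache directory size: $(du -sh "$CACHE_DIR")"
echo "Cache files count: $(find "$CACHE_DIR" -type f | wc -l)"
# Counts log lines containing HIT; meaningful only if $upstream_cache_status is logged
echo "Cache hit count: $(grep -c HIT /var/log/nginx/access.log)"
EOF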
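
A raw HIT count says little without the total, so a ratio is usually more useful. The sketch below assumes the access log format has been extended with a `cache=$upstream_cache_status` field; that field name is an assumption for illustration and is not part of the log formats shown in this article. `cache_hit_ratio` is likewise a hypothetical helper.

```shell
#!/bin/sh
# cache_hit_ratio: read access log lines on stdin and print
# "<hits>/<cacheable requests> (<percent>%)".
# Assumes each line carries a field such as cache=HIT or cache=MISS
# (log_format extended with cache=$upstream_cache_status).
# Lines with "cache=-" (request never reached the cache) are skipped.
cache_hit_ratio() {
    awk '
        match($0, /cache=[A-Z]+/) {
            total++
            if (substr($0, RSTART + 6, RLENGTH - 6) == "HIT") hits++
        }
        END {
            if (total == 0) { print "no cacheable requests found"; exit }
            printf "%d/%d (%.1f%%)\n", hits, total, hits * 100 / total
        }'
}

# Usage:
#   cache_hit_ratio < /var/log/nginx/access.log
```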

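Case 5: Disk Exhaustion Caused by Broken Log Rotation

Symptoms

A server suddenly stopped responding. The disk turned out to be 100% full, mostly because the Nginx log files had grown too large.

Diagnosis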

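# Check disk usage
df -h

# Find the largest files (-a lists files as well as directories)
du -ah /var/log/nginx/ | sort -hr

# Review the log rotation configuration
cat /etc/logrotate.d/nginx

Fix and hardening

# Emergency fix: truncate the current logs
> /var/log/nginx/access.log
> /var/log/nginx/error.log

# Signal Nginx to reopen its log files (no restart needed)
nginx -s reopen

Improved log rotation configuration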

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 nginx nginx
    sharedscripts
    postrotate
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}

Improved logging configuration

http {
    # Custom log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    'rt=$request_time uct="$upstream_connect_time" '
                    'uht="$upstream_header_time" urt="$upstream_response_time"';

    # Conditional logging: skip 2xx and 3xx responses
    map $status $loggable {
        ~^[23]  0;
        default 1;
    }

    server {
        # Log only requests that did not return 2xx or 3xx
        access_log /var/log/nginx/access.log main if=$loggable;

        # Do not log static asset requests
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
            access_log off;
            expires 1y;
        }
    }
}

Monitoring script

# Disk usage monitor
cat > /usr/local/bin/disk_monitor.sh << 'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
    # Automatically remove rotated logs older than 7 days
    find /var/log/nginx -name "*.log.*" -mtime +7 -delete
    # Send an alert here
fi
EOF

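Case 6: Misconfigured Load Balancing

Symptoms

A service ran on several backend servers, but traffic was distributed unevenly: some servers were overloaded while others sat idle.

Load balancing strategies compared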

# Round robin (default)
upstream backend_round_robin {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

# Weighted round robin
upstream backend_weighted {
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=2;
    server 192.168.1.12:8080 weight=1;
}

# IP hash
upstream backend_ip_hash {
    ip_hash;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

# Least connections
upstream backend_least_conn {
    least_conn;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}
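
Before switching strategies, it helps to measure how requests are actually spread across the backends. The sketch below assumes the access log format has been extended with an `upstream=$upstream_addr` field; that field name is an assumption for illustration, and `upstream_distribution` is a hypothetical helper.

```shell
#!/bin/sh
# upstream_distribution: read access log lines on stdin and print
# "<backend> <requests> <share>%" per backend, busiest first.
# Assumes each line carries a field such as upstream=192.168.1.10:8080
# (log_format extended with upstream=$upstream_addr).
upstream_distribution() {
    awk '
        match($0, /upstream=[0-9.]+:[0-9]+/) {
            addr = substr($0, RSTART + 9, RLENGTH - 9)
            count[addr]++
            total++
        }
        END {
            for (addr in count)
                printf "%s %d %.1f%%\n", addr, count[addr], count[addr] * 100 / total
        }' | sort -k2,2nr
}

# Usage:
#   upstream_distribution < /var/log/nginx/access.log
```

With plain round robin and equal weights the shares should be close to even; a large skew points at weights, ip_hash with a few dominant client addresses, or servers being marked as failed.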

Health check configuration (passive)

upstream backend_with_health {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;

    # Keepalive connection pool to the backends
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend_with_health;

        # Retry the next upstream on these failures
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;

        # Connection reuse (required for upstream keepalive)
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}

Monitoring script

# Backend health check script
cat > /usr/local/bin/backend_health_check.sh << 'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080" "192.168.1.11:8080" "192.168.1.12:8080")

for server in "${SERVERS[@]}"; do
    if curl -sf "http://$server/health" > /dev/null; then
        echo "$server: OK"
    else
        echo "$server: FAILED"
        # Send an alert here
    fi
done
EOF

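Case 7: Security Configuration Weaknesses

Symptoms

A website was hit by malicious scanning that exposed several security weaknesses, so the Nginx security configuration needed hardening.

Security hardening configuration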

# Rate limit zones (limit_req_zone is valid only in the http context)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;

server {
    listen 80;
    server_name secure.example.com;

    # Hide version information
    server_tokens off;
    # more_set_headers requires the third-party headers-more module
    more_set_headers "Server: WebServer";

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;

    # Restrict request methods
    if ($request_method !~ ^(GET|HEAD|POST)$ ) {
        return 405;
    }

    # Deny access to hidden files (paths containing "/.")
    location ~ /\. {
        deny all;
        access_log off;
        log_not_found off;
    }

    # Limit upload size
    client_max_body_size 10M;

    # Apply the rate limits
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        proxy_pass http://backend;
    }

    location /login {
        limit_req zone=login burst=5 nodelay;
        proxy_pass http://backend;
    }
}

Protection scripts

# Example fail2ban configuration
cat > /etc/fail2ban/filter.d/nginx-4xx.conf << 'EOF'
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" (404|403|400) .*$
ignoreregex =
EOF

cat > /etc/fail2ban/jail.local << 'EOF'
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60
EOF
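
Before enabling a ban rule like this, it is worth checking which clients would actually trip it, so that legitimate crawlers or internal monitors are not locked out. The sketch below lists the clients with the most 4xx responses. It assumes the standard combined access log format, where the client address is field 1 and the status code is field 9; `top_4xx_clients` is a hypothetical helper.

```shell
#!/bin/sh
# top_4xx_clients: read an access log on stdin and print
# "<client ip> <number of 4xx responses>", highest count first.
#   $1  optional maximum number of clients to print (default 10)
# Assumes the combined log format: field 1 = client IP, field 9 = status.
top_4xx_clients() {
    limit=${1:-10}
    awk '$9 ~ /^4[0-9][0-9]$/ { print $1 }' \
        | sort | uniq -c | sort -rn | head -n "$limit" \
        | awk '{ print $2, $1 }'
}

# Usage:
#   top_4xx_clients 20 < /var/log/nginx/access.log
```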

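Case 8: Reverse Proxy Configuration Problems

Symptoms

With Nginx acting as a reverse proxy, the client's real IP address was lost, and the backend service could not obtain correct client information.

Analysis and fix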

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend;

        # Pass the client address through correctly
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Redirect handling
        proxy_redirect off;

        # Timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;
    }
}

WebSocket support

map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    listen 80;
    server_name ws.example.com;

    location /websocket {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # WebSocket-specific: keep long-lived idle connections open (seconds)
        proxy_read_timeout 86400;
    }
}

Case 9: Conflicting URL Rewrite Rules

Symptoms

A site with complex URL rewrite rules ran into redirect loops and 404 errors.

Cleaned-up rewrite rules

server {
    listen 80;
    server_name example.com www.example.com;

    # Redirect every other host name to the primary domain
    if ($host != 'example.com') {
        return 301 https://example.com$request_uri;
    }

    # SEO-friendly URL rewriting
    location / {
        try_files $uri $uri/ @rewrites;
    }

    location @rewrites {
        rewrite ^/product/([0-9]+)$ /product.php?id=$1 last;
        rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1 last;
        rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1 last;
        return 404;
    }

    # Avoid rewrite loops: only pass PHP files that exist to FastCGI
    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        include fastcgi_params;
    }
}
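
Rules like the three rewrites above can be previewed offline before a reload, which makes conflicts easier to spot. The sketch below mirrors them with `sed -E` and stops at the first rule that matches, falling back to 404 as the `@rewrites` location does. It is only an approximation: `sed` uses POSIX extended regular expressions rather than the PCRE engine Nginx uses, and `rewrite_preview` is a hypothetical helper.

```shell
#!/bin/sh
# rewrite_preview: print the target a URI would be rewritten to by the
# three rules in the @rewrites location, or 404 if none of them match.
# Hypothetical helper; approximates Nginx behaviour with sed -E.
# Each "-e t" jumps to the end of the script once a substitution succeeded,
# which mimics the "last" flag stopping further rewrite processing.
rewrite_preview() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^/product/([0-9]+)$#/product.php?id=\1#' -e t \
        -e 's#^/category/([a-zA-Z0-9-]+)$#/category.php?name=\1#' -e t \
        -e 's#^/user/([a-zA-Z0-9]+)$#/profile.php?username=\1#' -e t \
        -e 's#.*#404#'
}

# Usage:
#   rewrite_preview /product/42
#   rewrite_preview /user/bad_name
```

Running a list of real URLs from the access log through such a preview is a cheap way to find paths that unexpectedly fall through to 404.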

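Debugging rewrite rules

# Enable the rewrite log
error_log /var/log/nginx/rewrite.log notice;
rewrite_log on;

# Test a rewrite rule
location /test {
    # No flag on the rewrite, so processing continues to the return below
    # (a "break" flag would stop rewrite processing and skip the return)
    rewrite ^/test/(.*)$ /debug?param=$1;
    return 200 "Rewrite test: $args\n";
}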

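Case 10: Performance Monitoring and Tuning

Symptoms

A complete Nginx performance monitoring setup is needed so that performance problems are detected and resolved early.

Monitoring script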

# Nginx performance monitoring script
cat > /usr/local/bin/nginx_monitor.sh << 'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"

# Fetch the status page
# Line 1: active connections; line 2: header; line 3: counters; line 4: states
STATUS=$(curl -s "$NGINX_STATUS_URL")
ACTIVE_CONN=$(echo "$STATUS" | grep "Active connections" | awk '{print $3}')
ACCEPTS=$(echo "$STATUS" | awk 'NR==3 {print $1}')
HANDLED=$(echo "$STATUS" | awk 'NR==3 {print $2}')
REQUESTS=$(echo "$STATUS" | awk 'NR==3 {print $3}')
READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')

# Append to the log
echo "$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >> "$LOG_FILE"

# Alerting logic
if [ "$ACTIVE_CONN" -gt 1000 ]; then
    echo "High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
EOF
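
The accepts, handled, and requests counters are cumulative since Nginx started, so a single reading says little about current load. A rate needs the difference between two samples divided by the sampling interval. The sketch below shows that calculation; `req_rate` is a hypothetical helper, and treating a decreasing counter as a restart is an assumption of this sketch.

```shell
#!/bin/sh
# req_rate: requests per second between two samples of the cumulative
# "requests" counter from stub_status.
#   $1  previous counter value
#   $2  current counter value
#   $3  seconds between the two samples
# Prints 0 when the interval is not positive or the counter went backwards
# (assumed here to mean Nginx was restarted and the counter reset).
req_rate() {
    prev=$1
    cur=$2
    interval=$3
    if [ "$interval" -le 0 ] || [ "$cur" -lt "$prev" ]; then
        echo 0
        return
    fi
    echo $(( (cur - prev) / interval ))
}

# Usage (two readings taken 60 seconds apart):
#   req_rate 31070465 31076465 60
```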

Consolidated tuning configuration

# Consolidated tuning configuration
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 100000;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    use epoll;
    worker_connections 10240;
    multi_accept on;
    accept_mutex off;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Log format
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" $request_time $upstream_response_time';

    # Performance
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;

    # Compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;

    # File descriptor cache
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    # Security
    server_tokens off;
    client_header_timeout 10;
    client_body_timeout 10;
    reset_timedout_connection on;
    send_timeout 10;

    # Rate and connection limit zones
    limit_req_zone $binary_remote_addr zone=global:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=addr:10m;

    include /etc/nginx/conf.d/*.conf;
}

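Troubleshooting Methodology Summary

1. A standardized troubleshooting process

1. Collect incident information: confirm the symptoms, the scope of impact, and when it started

2. Check the log files: error.log, access.log, system logs

3. Check the configuration files: syntax checks and logic checks

4. Verify network connectivity: port status and connectivity tests

5. Analyze performance metrics: CPU, memory, network, disk

6. Determine the root cause: dig deeper until the real cause is found

7. Apply the fix: a temporary mitigation first, then a permanent solution

8. Verify the fix: functional tests and performance tests

9. Capture the lessons learned: document the incident and improve the process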
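
Step 2 of this process often starts with a single error log line, and a handful of message fragments cover many upstream failures. The sketch below maps such fragments to a rough first guess. The categories are heuristics based on commonly seen Nginx error messages, not an exhaustive or authoritative classification, and `classify_upstream_error` is a hypothetical helper.

```shell
#!/bin/sh
# classify_upstream_error: print a rough category for one error log line.
#   $1  a single line from the Nginx error log
# Heuristic mapping (assumption of this sketch):
#   backend_down      nothing is listening on the upstream address
#   backend_slow      the upstream did not answer within the proxy timeouts
#   no_live_upstreams every server in the upstream group is marked failed
#   backend_closed    the upstream closed the connection before responding
#   unknown           anything else; read the full entry
classify_upstream_error() {
    case "$1" in
        *"Connection refused"*)            echo "backend_down" ;;
        *"upstream timed out"*)            echo "backend_slow" ;;
        *"no live upstreams"*)             echo "no_live_upstreams" ;;
        *"prematurely closed connection"*) echo "backend_closed" ;;
        *)                                 echo "unknown" ;;
    esac
}

# Usage:
#   classify_upstream_error "$(tail -n 1 /var/log/nginx/error.log)"
```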

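2. Commonly used tools

• Log analysis: tail, grep, awk, sed

• Network tools: curl, wget, telnet, netstat, ss

• Performance monitoring: htop, iotop, iftop, the Nginx status page (stub_status)

• System diagnostics: strace, lsof, tcpdump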

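3. Preventive measures

• Build thorough monitoring and alerting

• Back up configuration files regularly

• Adopt automated operations tooling

• Define standard operating procedures

• Run failure drills regularly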

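Conclusion

Nginx troubleshooting is a core skill for every operations engineer, and it takes both a solid theoretical foundation and extensive hands-on experience.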
