Troubleshooting Three Classes of Errors Behind an Nginx Reverse Proxy

马哥Linux运维 2026-05-12 423

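How do you actually troubleshoot 502, 504, and Connection Reset errors on a production service?

Background

In production, 502, 504, and Connection Reset are the three most common classes of error behind an Nginx reverse proxy. They are often lumped together as "the backend is down", but each one points to a different kind of fault:

502 Bad Gateway: Nginx could not get a valid response from the backend

504 Gateway Timeout: the backend responded too slowly

Connection Reset: the connection was torn down abruptly by the backend or an intermediate layer

Each error has its own troubleshooting path and its own keywords in the error log. Approaching a 504 with 502 reasoning (for example, repeatedly checking whether the backend process is alive) wastes a lot of time.

This article assumes Nginx as the reverse-proxy layer and covers the whole method, from telling the errors apart to checking the request path segment by segment.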

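1. Telling the three errors apart

1.1 The communication path

From client request to backend response, data travels along this path:

Client → Nginx (reverse proxy) → upstream (backend service)
       ↑                       ↑
  fault on the             fault on the
  client→Nginx leg         Nginx→upstream leg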

1.2 Error signatures compared

Error	Log keyword	Nginx error.log example	Direct cause
502	connect() failed	connect() failed (111: Connection refused)	Nginx cannot connect to upstream
502	no live upstreams	no live upstreams while connecting to upstream	every upstream is marked unavailable
502	upstream prematurely closed	upstream prematurely closed connection	upstream closed the connection before finishing the response (Nginx reports this as 502, not 504)
504	upstream timed out	upstream timed out (110: Connection timed out)	upstream response timed out
Connection Reset	recv() failed	recv() failed (104: Connection reset by peer)	upstream reset the connection
Connection Reset	Connection reset by peer	readv() failed (104: Connection reset by peer)	the peer of the Nginx connection (upstream or client) sent an RST
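
To get a quick feel for which signature dominates, the keywords in the table can be tallied directly from error.log. The following is a minimal sketch, not a complete classifier: classify_nginx_error is a hypothetical helper name, the category labels are made up for illustration, and it matches only the keywords listed above.

```shell
#!/usr/bin/env bash
# classify_nginx_error (hypothetical helper): reads error.log lines on stdin
# and prints "<category> <count>" for each keyword family seen.
classify_nginx_error() {
    awk '
        /no live upstreams/            { c["no-live-upstreams"]++; next }
        /connect\(\) failed/           { c["connect-failed"]++;    next }
        /upstream prematurely closed/  { c["premature-close"]++;   next }
        /upstream timed out/           { c["upstream-timeout"]++;  next }
        /Connection reset by peer/     { c["reset"]++;             next }
        END { for (k in c) printf "%s %d\n", k, c[k] }
    ' | sort
}
```

A typical call would be `tail -n 5000 /var/log/nginx/error.log | classify_nginx_error`; lines that match none of the keywords are ignored.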

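1.3 A quick way to tell

In production, do not rely only on the error code shown by the browser or client, because the browser may cache the response or display it inaccurately. The right approach is:

# 1. Look at the actual status codes in the $status field of Nginx access.log
$ awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
   5234 200
    102 502
     56 504
     12 499

# 2. Combine with upstream_status to see what the upstream attempt produced
#    (requires a log format that records upstream_status=...)
$ tail -100 /var/log/nginx/access.log | grep "status=502" | head -5
# Note: upstream_status is "-" when no upstream was contacted; for a failed connect or a
# timeout it usually records the 502/504 that Nginx generated for that attempt

# 3. Look at the error types in Nginx error.log directly
#    (parentheses are escaped because grep -E treats "()" as an empty group)
$ grep -E "connect\(\) failed|upstream timed out|recv\(\) failed|Connection reset|no live upstreams" /var/log/nginx/error.log | tail -20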

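Note: $status in access.log is the response code Nginx returned to the client; $upstream_status is the code recorded for each upstream attempt. The two can differ. For example, with proxy_next_upstream retries $upstream_status may read "502, 200" (first node failed, second succeeded) while $status is 200. Likewise, the backend's own log may show 200 for a request that Nginx logged as 504, because the backend finished the response only after Nginx had already given up waiting.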
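
As a rough sketch of this comparison, the helper below tallies ($status, $upstream_status) pairs. It rests on two assumptions: the access log contains a literal upstream_status= field (as in the main_ext format in section 6.2), and the request line has three space-separated tokens so that $status is field 9. status_pairs is a hypothetical name.

```shell
#!/usr/bin/env bash
# status_pairs (hypothetical helper): reads access.log lines on stdin and prints
# "<count> <status> <upstream_status>", most frequent first.
# When several upstreams were tried ("502, 200"), only the first attempt is kept.
status_pairs() {
    awk '
    {
        us = "-"
        for (i = 1; i <= NF; i++)
            if (index($i, "upstream_status=") == 1)
                us = substr($i, 17)        # text after the 16-character prefix
        sub(/,$/, "", us)                  # drop the comma left by a multi-attempt value
        if (us == "") us = "-"
        c[$9 " " us]++
    }
    END { for (k in c) print c[k], k }
    ' | sort -k1,1nr -k2,2
}
```

Pairs where the two codes disagree are the ones worth reading in full.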

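2. Four-segment path isolation

Whatever the error, the approach is the same: confirm segment by segment and find the weakest link. Split the request path into four segments:

Segment 1: client → Nginx network
Segment 2: Nginx itself
Segment 3: Nginx → upstream network
Segment 4: upstream (backend service)

2.1 Local comparison

Run curl on the machine where Nginx is deployed to rule out the client-to-Nginx network:

# Send the request to the local Nginx (does not cross the external network)
$ curl -sS -o /dev/null -w 'http_code=%{http_code} time_total=%{time_total}s time_connect=%{time_connect}s time_starttransfer=%{time_starttransfer}s\n' http://127.0.0.1/health

Local curl works, external access fails → the problem is in segment 1

Local curl also fails → the problem is in segments 2 to 4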

2.2 Connecting to upstream directly

Bypass Nginx and hit the backend service directly:

# Connect straight to upstream (run from the Nginx machine)
$ curl -sS -o /dev/null -w "http_code=%{http_code} time_total=%{time_total}s\n" http://10.0.1.10:8080/health

# If there is more than one upstream, test them one by one
$ for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
    echo -n "$ip: "
    curl -sS -o /dev/null -w "code=%{http_code} total=%{time_total}s\n" --connect-timeout 3 --max-time 5 http://$ip:8080/health
done

Local Nginx fails and direct upstream access also fails → the problem is in segment 4

Local Nginx fails but direct upstream access works → the problem is in segments 2 to 3 (Nginx configuration or the Nginx→upstream network)
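
The two comparisons combine into a small decision table. The sketch below encodes it; probe and locate_segment are hypothetical helper names, and treating any 2xx/3xx answer as "ok" is an assumption made here for illustration rather than a rule from the article.

```shell
#!/usr/bin/env bash
# probe URL -> prints "ok" if curl gets a 2xx/3xx within the time limits, else "fail".
probe() {
    local code
    code=$(curl -sS -o /dev/null -w '%{http_code}' \
        --connect-timeout 3 --max-time 5 "$1" 2>/dev/null) || { echo fail; return; }
    case "$code" in
        2*|3*) echo ok ;;
        *)     echo fail ;;
    esac
}

# locate_segment EXTERNAL LOCAL DIRECT (each "ok" or "fail"):
#   EXTERNAL = result seen by the outside client
#   LOCAL    = curl against Nginx from the Nginx host
#   DIRECT   = curl against upstream from the Nginx host
locate_segment() {
    local external="$1" local_nginx="$2" direct="$3"
    if [ "$local_nginx" = ok ] && [ "$external" = fail ]; then
        echo "segment 1: client -> Nginx network"
    elif [ "$local_nginx" = fail ] && [ "$direct" = fail ]; then
        echo "segment 4: upstream"
    elif [ "$local_nginx" = fail ] && [ "$direct" = ok ]; then
        echo "segments 2-3: Nginx config or Nginx -> upstream network"
    else
        echo "not reproduced"
    fi
}
```

For example, `locate_segment fail "$(probe http://127.0.0.1/health)" "$(probe http://10.0.1.10:8080/health)"` would be run on the Nginx host after an external client has reported a failure.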

3. Troubleshooting 502 Bad Gateway

3.1 Locating it in the logs

# List the most recent 502 requests in access.log (client, time, URI)
$ grep " 502 " /var/log/nginx/access.log | tail -5 | awk '{print $1,$4,$7}'

# Look for the matching 502-related errors in error.log
$ grep "connect() failed" /var/log/nginx/error.log | tail -10
2025/08/15 1422 [error] 1234#0: *56789 connect() failed (111: Connection refused) while connecting to upstream, client: 1.2.3.4, server: api.example.com, upstream: "http://10.0.1.10:8080/api/", host: "api.example.com"

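3.2 Common causes and how to check them

Cause A: the backend process is not started or has crashed

# Confirm the backend process is alive
$ ps aux | grep -E "java|python|node|php-fpm" | grep -v grep

# Confirm the port is listening
$ ss -lntp | grep 8080

# Check whether the backend process was killed by the OOM killer
# (-E is required for "|" to mean alternation)
$ dmesg -T | grep -iE "oom|killed" | tail -5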

Cause B: the PHP-FPM process pool is exhausted

# PHP-FPM status (if the status page is enabled)
$ curl http://127.0.0.1/status

# Or check the PHP-FPM log
$ tail -50 /var/log/php-fpm/www-error.log
WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers)
WARNING: [pool www] server reached pm.max_children setting (50), consider raising it

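Cause C: blocked by a firewall or security group

# Confirm the upstream port is reachable from Nginx
$ telnet 10.0.1.10 8080
$ nc -zv 10.0.1.10 8080

# Check iptables rules
$ iptables -L -n | grep 8080

# Check the cloud provider's security group (confirm in the cloud console)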

Cause D: FastCGI buffers too small

# Keyword in error.log
grep "upstream sent too big header" /var/log/nginx/error.log

# Fix: enlarge the buffers
location ~ \.php$ {
    fastcgi_buffer_size 32k;
    fastcgi_buffers 8 32k;
    fastcgi_busy_buffers_size 64k;
    ...
}

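3.3 502 checklist

# 1. Is the backend process running?
ps aux | grep backend

# 2. Is the port listening?
ss -lntp

# 3. Is a firewall blocking it?
iptables -L -n

# 4. Is the proxy_pass address in the Nginx config correct?
grep -E "proxy_pass|fastcgi_pass" /etc/nginx/conf.d/default.conf

# 5. Does the upstream hostname resolve (if a domain name is used)?
nslookup backend.example.com

# 6. Does the backend's health-check endpoint respond?
curl -I http://127.0.0.1:8080/health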

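4. Troubleshooting 504 Gateway Timeout

4.1 What the timeout settings mean

An Nginx reverse proxy involves three timeout settings:

location /api/ {
    proxy_connect_timeout 5s;      # timeout for establishing the TCP connection to upstream
    proxy_send_timeout 10s;        # timeout between two successive writes while sending the request to upstream
    proxy_read_timeout 30s;        # timeout between two successive reads of the upstream response (hit most often)
    proxy_pass http://backend;
}

The most common cause of a 504 is proxy_read_timeout: Nginx has already sent the request to upstream, but upstream delivers nothing (typically not even the response header) within proxy_read_timeout. The timeout applies between two successive read operations, not to the transmission of the whole response.

Nginx defaults: proxy_connect_timeout 60s, proxy_send_timeout 60s, proxy_read_timeout 60s. In production these defaults are not necessarily appropriate.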

4.2 Steps

Step 1: measure the real upstream response time

# Use curl's -w option to extract the time spent in each phase
$ curl -sS -o /dev/null \
    -w 'time_namelookup=%{time_namelookup}s\ntime_connect=%{time_connect}s\ntime_starttransfer=%{time_starttransfer}s\ntime_total=%{time_total}s\n' \
    http://10.0.1.10:8080/api/slow-endpoint

time_namelookup=0.001s          # DNS resolution (cumulative from start)
time_connect=0.002s             # until the TCP connection was established
time_starttransfer=25.300s      # from start until the first byte was received
time_total=25.400s              # total time

If time_starttransfer is large (say > 30s), upstream is spending a long time processing the request, and Nginx's proxy_read_timeout must be larger than that value.

Step 2: check error.log for timeout entries

$ grep "upstream timed out" /var/log/nginx/error.log | tail -5
2025/08/15 1422 [error] 1234#0: *57123 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: api.example.com, upstream: "http://10.0.1.10:8080/api/slow", host: "api.example.com"

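Step 3: split the timeout settings by endpoint type

Do not apply one global timeout; different endpoints have different expected processing times:

# Ordinary API: fast responses, 5-second timeout
location /api/quick/ {
    proxy_read_timeout 5s;
    proxy_pass http://backend;
}

# Report export: may take a while, 120-second timeout
location /api/export/ {
    proxy_read_timeout 120s;
    proxy_pass http://backend;
}

# Long polling / SSE: use a long timeout
location /api/poll/ {
    proxy_read_timeout 3600s;
    proxy_buffering off;           # disable buffering for long polling
    proxy_pass http://backend;
}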

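4.3 Common root causes of a slow backend

When the backend fails to respond within proxy_read_timeout, the underlying reason is that it processes the request too slowly. Common causes:

Cause	How to check
Slow SQL	database slow query log
External dependency timing out	whether the backend's calls to external APIs have timeout protection
Thread pool queuing	backend thread pool metrics
Deadlock	thread dump analysis
Full GC	JVM GC log

There are two ways to address a 504: make the backend respond faster (the real fix), or raise proxy_read_timeout (a stopgap). If the backend cannot be optimized in the short term, raise the timeout first to keep the service usable, then schedule the backend performance work.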
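
To see which endpoints are actually slow before touching any timeout, the access log can be filtered on upstream response time. The sketch below assumes the log carries an upstream_response_time= field (as in the main_ext format in section 6.2) and that the request URI is field 7; slow_endpoints is a hypothetical helper name and the 5-second default threshold is arbitrary.

```shell
#!/usr/bin/env bash
# slow_endpoints [THRESHOLD_SECONDS] (hypothetical helper): reads access.log lines
# on stdin and prints, per URI, how many requests met or exceeded the threshold
# and the largest upstream_response_time seen among them.
slow_endpoints() {
    local threshold="${1:-5}"
    awk -v t="$threshold" '
    {
        rt = ""
        for (i = 1; i <= NF; i++)
            if (index($i, "upstream_response_time=") == 1)
                rt = substr($i, 24)        # text after the 23-character prefix
        sub(/,$/, "", rt)                  # multi-attempt values: keep the first
        if (rt == "" || rt == "-") next    # no upstream was contacted
        if (rt + 0 >= t + 0) {
            n[$7]++
            if (rt + 0 > max[$7]) max[$7] = rt + 0
        }
    }
    END { for (p in n) printf "%s count=%d max=%.3fs\n", p, n[p], max[p] }
    ' | sort
}
```

A call such as `slow_endpoints 10 < /var/log/nginx/access.log` gives a first list of candidates for a dedicated location block or for backend optimization.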

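5. Troubleshooting Connection Reset

5.1 What the error really is

Connection Reset (104: Connection reset by peer) differs from 502 and 504: one side of the connection sent an RST packet directly instead of completing the normal four-way FIN close.

Behind Nginx, this usually means one of the following:

Upstream closed the connection (most common): the backend is under heavy load and drops connections

Nginx closed the connection: a timeout on the Nginx side or connection-pool recycling

An intermediate network device closed it: the idle timeout on a firewall or load balancer is too short

5.2 Steps

Step 1: confirm where the error is reported

The context in Nginx error.log (for example "while reading response header from upstream") tells you which peer the reset came from:

# upstream reset the connection
$ grep "Connection reset by peer" /var/log/nginx/error.log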
2025/08/15 1422 [error] 1234#0: *57123 recv() failed (104: Connection reset by peer) while reading response header from upstream, ...

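# Nginx resetting the connection itself (less common)
# does not show up in error.log, but the client reports "Connection reset by peer"

Step 2: check file descriptors and thread pools on upstream

The most common root cause of Connection Reset is that upstream has run out of file descriptors (fd) or its thread pool is full, so new connections are refused or existing ones are forcibly closed.

# On the upstream server, check the fd limit and current usage
$ cat /proc/$(pidof java)/limits | grep "open files"
$ lsof -p $(pidof java) | wc -l

# Count half-open (SYN-RECV) and TIME-WAIT connections, then check listen queue overflows
$ ss -ant | grep -E 'SYN-RECV|TIME-WAIT' | wc -l
$ netstat -s | grep -i "listen"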

Step 3: check Nginx worker_connections

# Is Nginx running out of connections?
$ curl http://127.0.0.1/nginx_status  # requires stub_status to be enabled
Active connections: 65300
server accepts handled requests
 123456 123456 345678
Reading: 0 Writing: 128 Waiting: 45

If Active connections approaches worker_connections × worker_processes, Nginx itself is short on connections.
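
The comparison can be scripted from the stub_status output. This is only a sketch: conn_utilization is a hypothetical helper, and the worker settings are passed in by hand rather than read from nginx.conf. Keep in mind that stub_status counts client connections, while worker_connections also has to cover connections to upstream, so a proxy can run out of slots well before this ratio reaches 100%.

```shell
#!/usr/bin/env bash
# conn_utilization WORKER_PROCESSES WORKER_CONNECTIONS (hypothetical helper):
# reads stub_status output on stdin and prints "active/capacity (percent)".
conn_utilization() {
    local worker_processes="$1" worker_connections="$2"
    awk -v wp="$worker_processes" -v wc="$worker_connections" '
        /^Active connections:/ {
            cap = wp * wc
            if (cap <= 0) { print "invalid capacity"; exit 1 }
            printf "%d/%d (%.1f%%)\n", $3, cap, 100 * $3 / cap
        }
    '
}
```

For instance, `curl -s http://127.0.0.1/nginx_status | conn_utilization 4 16384` compares the live figure against four workers of 16384 connections each (both numbers are placeholders).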

Step 4: check whether the TCP backlog is overflowing

# Check listen queue overflows
$ netstat -s | grep -i "listen"
    18932 times the listen queue of a socket overflowed

# Current backlog size
$ ss -lntp | grep 80
LISTEN 0 511  ...

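# For a listening socket, the second number (Send-Q, 511 here) is the backlog size;
# the first number (Recv-Q, 0 here) is the current length of the accept queue

If the backlog overflows frequently, raise it in the configuration:

listen 8080 backlog=65535;

Together with the kernel parameters: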

$ sysctl -w net.core.somaxconn=65535
$ sysctl -w net.ipv4.tcp_max_syn_backlog=65535

6. Nginx configuration reference

6.1 Reasonable timeout and buffer settings

upstream backend {
    # Load-balancing method
    least_conn;

    # upstream nodes
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;

    # Keepalive connection pool
    keepalive 128;
    keepalive_requests 10000;
    keepalive_timeout 60s;
}

server {
    listen 80 backlog=65535;
    server_name api.example.com;

    # Timeouts
    proxy_connect_timeout 5s;
    proxy_send_timeout 10s;
    proxy_read_timeout 30s;

    # Buffers
    proxy_buffer_size 4k;
    proxy_buffers 8 4k;
    proxy_busy_buffers_size 8k;

    # Request body size
    client_max_body_size 10m;
    client_body_buffer_size 128k;

    location /api/ {
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://backend;
    }
}

6.2 The log format must include upstream information

log_format main_ext '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    'upstream_addr=$upstream_addr '
                    'upstream_status=$upstream_status '
                    'upstream_response_time=$upstream_response_time '
                    'request_time=$request_time';

access_log /var/log/nginx/access.log main_ext;

These upstream variables are the key to troubleshooting 502/504/Connection Reset:

$upstream_addr: the address of the upstream that actually handled the request

$upstream_status: the status code recorded for the upstream

$upstream_response_time: time spent receiving the response from upstream (seconds)

$request_time: total time Nginx spent on the request (including network transfer to the client)

Without these fields in the log, troubleshooting is guesswork.

6.3 Failover and degradation

upstream backend {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 backup;     # backup node
}

server {
    location /api/ {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;   # at most 2 attempts in total (the first try plus 1 retry)
        proxy_next_upstream_timeout 5s; # time limit for passing the request to the next server

        # Degraded fallback response
        error_page 502 504 = @fallback;
    }

    location @fallback {
        internal;
        default_type application/json;
        return 200 '{"status":"degraded","message":"Service temporarily unavailable, please retry later"}';
    }
}

Do not set proxy_next_upstream_tries too high (2 to 3 is a sensible range); otherwise the flood of retries during an outage adds load on upstream and can trigger a cascading failure.

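7. Quick reference for common scenarios

Symptom	Log signature	Most likely cause	Check first
Occasional 502	connect() failed (111: Connection refused)	backend restarting, port not ready yet	backend startup sequence and health checks
Sustained 502	no live upstreams	every upstream is down	status of all backend nodes
502 on a schedule	connect() failed at fixed intervals	a crontab job causing load spikes	crontab and system timers
504 on some requests	upstream timed out	specific endpoints are slow	per-endpoint P99 latency on the backend
504 on all requests	upstream timed out	backend overloaded as a whole / database saturated	backend CPU, connection pools, slow queries
Sporadic Connection Reset	recv() failed (104)	backend short on fds / thread pool full	backend fd and thread pool metrics
Bursts of Connection Reset	recv() failed (104)	backend OOM / crash and restart	dmesg, backend logs
502 and 504 alternating	both kinds of log entries	backend overloaded, some requests time out and some are refused	backend GC / thread pool / connection pool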

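8. Production notes

Always back up before changing the configuration:

$ cp -a /etc/nginx /etc/nginx.$(date +%F_%H%M%S).bak

Reload rather than restart: test the configuration syntax with nginx -t first, then apply it with nginx -s reload. Do not use systemctl restart nginx; it interrupts connections that are still being served.

Do not apply one timeout to everything: endpoints have different expected response times, so set proxy_read_timeout per endpoint according to its SLA.

When to turn off proxy_buffering: long polling and SSE (Server-Sent Events) need buffering disabled; otherwise Nginx holds data in its buffers instead of forwarding it as it arrives.

Mind the retry risk of proxy_next_upstream: retrying a POST can cause a duplicate submission. Since version 1.9.13 Nginx does not pass a non-idempotent request (POST, LOCK, PATCH) to the next server once it has been sent to an upstream, unless non_idempotent is specified. If the backend is not idempotent, leave that parameter off and prefer proxy_next_upstream error timeout over status-based retries such as proxy_next_upstream http_500.

Beyond 5xx errors: a 499 (Nginx's non-standard code for "client closed the connection") means the client gave up before Nginx responded, usually because of a browser timeout or a timeout set in the frontend.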

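9. Summary

Troubleshooting 502, 504, and Connection Reset is not a matter of "just restart it"; it is segment-by-segment isolation driven by logs and data:

Logs first: start with the error keywords in Nginx error.log to learn which error it is and in which segment it occurs

Verify layer by layer: local curl → direct upstream access → confirm, one layer at a time, which segment is at fault

Treat the actual cause:

502: check that the backend is alive and that no firewall is in the way

504: check backend response time and the timeout settings

Connection Reset: check backend fds and thread pools

Finally, all of this presupposes that the logs carry enough fields. If access.log lacks $upstream_addr, $upstream_status, and $upstream_response_time, troubleshooting becomes far slower. The first thing to do in any Nginx reverse-proxy log configuration is to add these fields.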
