Nginx Upstream 503 Service Unavailable Debugging

What this usually means

A 503 from Nginx means the request could not be served by any upstream: the upstream itself answered 503, a limit such as `limit_req` or `limit_conn` rejected the request, or (in ingress setups) the service had no ready endpoints. Stock Nginx reports its own connect failures as 502 and upstream timeouts as 504, but the same upstream problems sit behind all three codes, so the debugging path is shared. The root cause is almost never Nginx itself; it is the upstream being unreachable, overloaded, or misconfigured. 'Misconfigured' does include Nginx's own proxy settings: too few connections in the upstream pool, overly aggressive timeouts, or health check logic that incorrectly marks servers as down. The most common pattern I see is a 'zombie server': the process is running but not accepting new connections because of a file descriptor leak, thread pool exhaustion, or a deadlocked event loop.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

1. Check the Nginx error log: `tail -n 100 /var/log/nginx/error.log | grep 'upstream'`. Look for 'no live upstreams', 'upstream timed out', or 'Connection refused'.
2. Test the upstream directly from the Nginx host: `curl -v http://<upstream-ip>:<port>/health`. If this works, the problem is between Nginx and the upstream (network, connection limits).
3. Check the upstream server's active connections: `ss -tan | grep :<port> | wc -l`. If the count is near the maximum (e.g., 1024), you are hitting resource exhaustion.
4. Inspect the Nginx upstream configuration: `grep -A 10 'upstream ' /etc/nginx/conf.d/*.conf`. Look for the `max_conns`, `keepalive`, `max_fails`, and `fail_timeout` values.
5. Verify the health check endpoint on the upstream: `curl -s -o /dev/null -w '%{http_code}\n' http://<upstream>/health`. It should return HTTP 200, not 500 or a redirect.
6. Check system limits on the upstream: `grep 'open files' /proc/<pid>/limits` and `cat /proc/sys/net/core/somaxconn`. Low values throttle new connections. (`ulimit -n` only reports the limit of your current shell, not of the running process.)
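As a rough companion to step 1, the sketch below buckets error log lines by the three messages named there. The sample lines are illustrative stand-ins shaped like typical Nginx entries, with made-up addresses; the function names `classify` and `tally` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// classify maps one error.log line to a coarse failure bucket using the
// three substrings from step 1. Lines that match none return "".
func classify(line string) string {
	l := strings.ToLower(line)
	switch {
	case strings.Contains(l, "no live upstreams"):
		return "no live upstreams"
	case strings.Contains(l, "timed out"):
		return "timed out"
	case strings.Contains(l, "connection refused"):
		return "connection refused"
	}
	return ""
}

// tally counts buckets across a whole log excerpt.
func tally(log string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if b := classify(sc.Text()); b != "" {
			counts[b]++
		}
	}
	return counts
}

func main() {
	// Sample excerpt; timestamps and addresses are invented for illustration.
	sample := `2024/01/01 10:32:01 [error] 7#7: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, upstream: "http://10.0.0.5:8080/"
2024/01/01 10:32:02 [error] 7#7: *2 connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.0.6:8080/"
2024/01/01 10:32:03 [error] 7#7: *3 no live upstreams while connecting to upstream`
	for bucket, n := range tally(sample) {
		fmt.Printf("%-20s %d\n", bucket, n)
	}
}
```

Which bucket dominates tells you where to go next: timeouts point at a slow or stalled upstream, refusals at a process that is not listening, and 'no live upstreams' at the passive health check settings.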

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- /var/log/nginx/error.log: Nginx upstream errors with the exact upstream IP and reason.
- /var/log/nginx/access.log: look for 503 status codes and upstream response times via `$upstream_response_time` (the variable must be added to your `log_format`; the default format does not include it).
- /etc/nginx/conf.d/upstream.conf: upstream block with server definitions, `max_conns`, `keepalive`, `fail_timeout`.
- /proc/<nginx-pid>/fd/: count open file descriptors with `ls /proc/$(cat /var/run/nginx.pid)/fd/ | wc -l`. The PID file holds the master process; repeat for the worker PIDs, which carry the client and upstream connections.
- Upstream application logs (e.g., /var/log/app/error.log): check for worker pool exhaustion and database connection timeouts.
- Health check endpoint response headers: `curl -I http://<upstream>/health`. Look for an unexpected Content-Type or status.
- Prometheus metrics (if available): the `nginx_upstream_servers_down` gauge shows which upstreams are marked down.

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Connection limit reached: Nginx's `max_conns` for the server, or the upstream's own connection cap (e.g., `worker_connections 1024` if the upstream is another Nginx), is exhausted.
- Health check endpoint returns a non-200 status or is too slow (> 5s), causing Nginx to mark the server as down.
- `max_fails` too low (e.g., `max_fails=1` with `fail_timeout=5s`), so a single transient glitch takes an upstream out of rotation.
- Network path has a firewall or load balancer that drops idle connections before Nginx gives up on them (e.g., an AWS ELB between Nginx and the upstream with a 60s idle timeout, while Nginx keeps idle upstream connections for 65s).
- Upstream process is alive but not accepting connections: thread pool deadlock, file descriptor leak, or a full listen backlog.
- Nginx `keepalive` directive in the upstream block is missing or too small: with too few idle connections cached, Nginx opens and closes upstream connections constantly.
- DNS resolution failure for an upstream domain: Nginx resolves hostnames at startup and keeps the result until reload or restart, so if the upstream IP changes it still points to the old IP.

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

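- Increase upstream connection limits: raise `max_conns` on the `server` line if it is set (e.g., `max_conns=2048`; the default of 0 means no limit), add `keepalive 128;` to the upstream block, and tune `worker_connections`. Upstream keepalive only takes effect with `proxy_http_version 1.1;` and `proxy_set_header Connection "";` in the proxying location.
- Adjust the health check: ensure the endpoint returns 200 quickly; add `slow_start=30s` (Nginx Plus only) to avoid sudden traffic bursts on recovery.
- Raise `max_fails` (e.g., to 3) so that a single transient failure does not remove a server. Keep in mind that `fail_timeout` is both the window in which failures are counted and the time the server stays out of rotation, so a large value such as 30s also means a 30s absence.
- Align keepalive timeouts: set the upstream block's `keepalive_timeout` below the idle timeout of the upstream server and of any firewall or load balancer in between, so Nginx never reuses a connection the other side has already dropped. `proxy_read_timeout 60s;` and `proxy_send_timeout 60s;` bound in-flight requests, not idle connections.
- Fix DNS: configure a `resolver` and use a variable in `proxy_pass` (e.g., `set $upstream_endpoint http://example.com; proxy_pass $upstream_endpoint;`) so Nginx re-resolves the name at runtime, honoring the record's TTL, instead of only at startup.
- Raise capacity on the upstream itself: its connection cap (`worker_connections` if it is another Nginx), its file descriptor limit, and the listen backlog (`somaxconn`) must all cover peak traffic.
- Implement a circuit breaker pattern: use Nginx Plus or Lua scripting to degrade gracefully when the upstream is slow.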

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- After the fix, run 1000 requests at a concurrency of 100: `ab -n 1000 -c 100 http://yourdomain/`. Expect zero 503s (`ab` counts them under 'Non-2xx responses').
- Check the Nginx error log for upstream errors: `tail -f /var/log/nginx/error.log | grep -v 'health'`. It should show no new timeouts.
- Monitor the upstream connection count: `watch -n 2 'ss -tan | grep :8080 | wc -l'`. It should stay below limits.
- Trigger a failure intentionally (stop one upstream) and verify that Nginx marks it down after `max_fails` failed attempts, retries it once `fail_timeout` has passed, and recovers when it is back.
- Simulate network latency on one upstream host: `tc qdisc add dev eth0 root netem delay 200ms`. Requests should get slower but still succeed, failing over to other upstreams if the delay exceeds your proxy timeouts. Remove the rule afterwards with `tc qdisc del dev eth0 root`.
- Deploy a canary with the fix and compare its error rate to production under the same traffic pattern.

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Increasing Nginx timeouts as a first step: this masks the real problem and can cause resource exhaustion, because slow requests hold connections longer.
- Disabling health checks entirely: requests get sent to dead upstreams, causing long timeouts.
- Setting `max_conns` too high on Nginx: this can overwhelm an upstream that cannot handle that many connections.
- Not aligning keepalive settings between Nginx and the upstream: this creates connection resets and 'upstream prematurely closed connection' errors.
- Using `proxy_next_upstream` with too broad a set of conditions: every retried request adds load to the remaining servers and counts as a failed attempt, which can cascade until all upstreams are marked down.
- Forgetting to reload Nginx after config changes, or not noticing that a reload failed: if the new configuration is invalid, `nginx -s reload` leaves the old workers running with the old configuration. Run `nginx -t` first and check the error log afterwards.

( 07 ) War story

The 10:30 AM 503 Spike After Kubernetes Rolling Update

Platform SRE | Nginx (ingress-nginx), Kubernetes, Go (net/http), PostgreSQL, Prometheus + Grafana

Timeline

- 10:15 Rolling update of backend Go service from v1.2 to v1.3 (increased connection pool size from 100 to 500).
- 10:30 PagerDuty alert: 5% of requests returning HTTP 503, latency P99 goes from 200ms to 2s.
- 10:32 SRE checks nginx ingress logs: 'upstream timed out (110: Connection timed out)' for pod IPs.
- 10:35 Direct curl to pod IP on port 8080 succeeds (< 10ms). Pods are alive.
- 10:38 Check pod resource usage: CPU low, memory stable. But netstat shows 1024 established connections per pod (max of 1024).
- 10:40 Go service uses default net/http server with no ReadTimeout or MaxHeaderBytes; connections pile up from keepalive.
- 10:45 Fix: set ReadTimeout=30s, MaxHeaderBytes=1<<20, and reduce keepalive idle timeout to 60s in Go server.
- 10:48 New deployment: connections drop to 200, 503s vanish within 2 minutes.
- 10:50 Monitor confirms 0% 503, P99 latency back to 200ms.

At 10:15, we kicked off a rolling update to increase the Go backend's connection pool size. The new version raised the database connection pool from 100 to 500, and it also let idle HTTP keep-alive connections accumulate, because the default net/http server sets no timeouts. By 10:30, each pod hit the OS limit of 1024 open file descriptors (128 for HTTP connections, 896 for database connections and misc). Nginx kept trying to open new connections, but the Go server's `accept()` call started returning EMFILE (too many open files), so new connections sat unaccepted in the listen backlog. Nginx retried, timed out, and returned 503.

When I got the alert at 10:32, my first instinct was to check if the pods were crashing. They weren't: `kubectl get pods` showed all running. I curled a pod directly and got a fast response, which is the classic red herring: the process is alive, and a single request can slip through, but it is not keeping up with new connections. I checked `ss -tan | grep :8080 | wc -l` and saw exactly 1024 connections. That's the file descriptor limit. The Go server had no `ReadTimeout`, so idle keep-alive connections held file descriptors open forever.

The fix was simple: set `ReadTimeout: 30 * time.Second` and `MaxHeaderBytes: 1 << 20` in the Go `http.Server` struct, and redeploy. (`1 << 20` is already Go's default for `MaxHeaderBytes`, so the timeout is what made the difference.) Within minutes, connection counts dropped to 200, and 503s disappeared. The lesson: never trust default net/http settings in production. The zero-value server sets no timeouts at all, so it never closes idle connections on its own. Also, monitor `fd_utilization` per pod to catch this before it causes outages.

Root cause

Go HTTP server default settings caused file descriptor exhaustion under keep-alive connections, leaving new connections unaccepted until they timed out, which Nginx treated as upstream failure.

The fix

Added ReadTimeout=30s and MaxHeaderBytes=1MB to the Go http.Server config; also reduced keepalive idle timeout to 60s in both Nginx and Go.
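A minimal sketch of what that server configuration could look like. `ReadTimeout`, `MaxHeaderBytes`, and the 60s idle timeout come from the fix described above; `ReadHeaderTimeout`, `WriteTimeout`, the `/health` route, and the `newServer` helper are assumptions added for illustration, not part of the original incident.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer returns an http.Server with explicit limits instead of the
// zero-value defaults, where a zero timeout means "no timeout".
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // assumption: not part of the original fix
		ReadTimeout:       30 * time.Second, // from the incident fix
		WriteTimeout:      30 * time.Second, // assumption: not part of the original fix
		IdleTimeout:       60 * time.Second, // keep-alive idle timeout from the fix
		MaxHeaderBytes:    1 << 20,          // same value as http.DefaultMaxHeaderBytes
	}
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := newServer(":8080", mux)
	fmt.Printf("read=%s idle=%s\n", srv.ReadTimeout, srv.IdleTimeout)
	// srv.ListenAndServe() would start serving; it is left out so the sketch exits.
}
```

When `IdleTimeout` is zero, Go falls back to `ReadTimeout` for idle keep-alive connections, which is why setting `ReadTimeout` alone was enough to start reclaiming descriptors.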

The lesson

Always set explicit timeouts on upstream HTTP servers; default Go settings have no read timeout, allowing connections to accumulate indefinitely. Monitor open file descriptors per process as a leading indicator of this class of failure.
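One way to get the file descriptor signal mentioned above is to have the process report its own usage. The sketch below is Linux-only (it reads `/proc/self/fd`), and the `fdUtilization` helper is a hypothetical name; how the value is exported to your metrics system is left open.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fdUtilization returns the number of open file descriptors, the soft
// RLIMIT_NOFILE, and their ratio for the current process.
// The count includes the descriptor used to read /proc/self/fd itself.
func fdUtilization() (int, uint64, float64, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, 0, 0, err
	}
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, 0, err
	}
	open := len(entries)
	limit := uint64(rl.Cur)
	return open, limit, float64(open) / float64(limit), nil
}

func main() {
	open, limit, ratio, err := fdUtilization()
	if err != nil {
		fmt.Println("cannot read fd usage:", err)
		return
	}
	fmt.Printf("open=%d limit=%d utilization=%.4f\n", open, limit, ratio)
}
```

Alerting when the ratio stays above a threshold you choose (say 0.8) would have flagged the pods in this incident before they reached 1024.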

( 08 ) How Nginx Selects an Upstream and Decides It's Down

Nginx maintains a list of upstream servers, each with a 'weight' and a 'state' (up/down/checking). For each request, it picks a server based on the configured load balancing algorithm (round-robin by default, or `least_conn`, `ip_hash`). If that server has `max_conns` set and is at its limit, Nginx skips it and tries the next. If all servers are at their connection limit or marked down, Nginx fails the request immediately without contacting any server and logs 'no live upstreams'.

The 'down' state is set by two mechanisms: passive health checks (based on failures during `proxy_pass`) and active health checks (only in Nginx Plus, or via third-party modules). Passive health checks count failed attempts within `fail_timeout` seconds. Once the count reaches `max_fails`, the server is marked down for the duration of `fail_timeout`. After that, Nginx will try it again. Connection errors and timeouts always count as failures; HTTP statuses such as 500 count only if they are listed in `proxy_next_upstream`. In a group with a single server these parameters are ignored. A common mistake is setting `max_fails=1` and `fail_timeout=5s`: any transient glitch (e.g., a slow query that exceeds `proxy_read_timeout`) takes the server out of rotation for 5 seconds, shifting its load onto the remaining servers and causing cascading failures.
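The passive check rule can be sketched as a small state machine. This is a simplified model of the behavior described above, not Nginx's actual implementation; the `peer` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// peer models one upstream server's passive health check state.
type peer struct {
	maxFails    int
	failTimeout time.Duration
	fails       int
	windowStart time.Time // when the current failure window began
	downUntil   time.Time // zero value means the peer is in rotation
}

// available reports whether the peer may receive traffic at time now.
func (p *peer) available(now time.Time) bool {
	return !now.Before(p.downUntil)
}

// recordFailure counts one failed attempt and marks the peer down for
// failTimeout once maxFails failures land inside one window.
func (p *peer) recordFailure(now time.Time) {
	if now.Sub(p.windowStart) > p.failTimeout {
		p.fails = 0
		p.windowStart = now
	}
	p.fails++
	if p.fails >= p.maxFails {
		p.downUntil = now.Add(p.failTimeout)
		p.fails = 0
	}
}

// simulate applies a single transient failure to a strict and a lenient peer.
func simulate() (strictAt1s, lenientAt1s, strictAt5s bool) {
	t0 := time.Date(2024, 1, 1, 10, 0, 0, 0, time.UTC)
	strict := &peer{maxFails: 1, failTimeout: 5 * time.Second}
	lenient := &peer{maxFails: 3, failTimeout: 10 * time.Second}
	strict.recordFailure(t0)
	lenient.recordFailure(t0)
	return strict.available(t0.Add(time.Second)),
		lenient.available(t0.Add(time.Second)),
		strict.available(t0.Add(5 * time.Second))
}

func main() {
	s1, l1, s5 := simulate()
	fmt.Println("strict  (max_fails=1) available after 1s:", s1)
	fmt.Println("lenient (max_fails=3) available after 1s:", l1)
	fmt.Println("strict  (max_fails=1) available after 5s:", s5)
}
```

In this model a single failure removes the strict peer for the full 5 seconds while the lenient peer keeps serving, which is the difference the `max_fails` setting makes for transient glitches.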

( 09 ) The Zombie Connection Problem

A 'zombie connection' is a TCP connection that is established but not usable: the upstream process is alive but not reading from the socket. This happens when the upstream server's event loop is blocked (e.g., by a synchronous database call) or when the server is out of file descriptors. From Nginx's perspective, it can open a TCP connection (the kernel completes the handshake), but when it sends the HTTP request, it gets no response until `proxy_read_timeout` expires.

The symptom is `upstream timed out (110: Connection timed out)` in Nginx logs, even though `curl` to the upstream IP:port from the Nginx host succeeds. That is because `curl` opens a single new connection, which the upstream may accept if a slot happens to be free, while Nginx's many concurrent connections are queued. To confirm, inspect `ss -tan` on the upstream: many `ESTAB` connections with `Recv-Q` > 0 indicate the server is not reading. On the listening socket (`ss -ltn`), `Recv-Q` is the current accept queue length and `Send-Q` is its maximum, so a `Recv-Q` close to `Send-Q` means the backlog is full.
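To make the `Recv-Q` check repeatable, the sketch below filters `ss -tan` output for established connections with unread data. It assumes the usual column order (State, Recv-Q, Send-Q, Local, Peer); the sample capture and the `stalled` helper are invented for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// stalled returns the established connections whose Recv-Q is non-zero,
// meaning the kernel holds bytes that the application has not read yet.
func stalled(ssOutput string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(ssOutput))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 5 || f[0] != "ESTAB" {
			continue // skip the header, LISTEN sockets, and other states
		}
		recvQ, err := strconv.Atoi(f[1])
		if err != nil || recvQ == 0 {
			continue
		}
		out = append(out, fmt.Sprintf("%s <- %s recv-q=%d", f[3], f[4], recvQ))
	}
	return out
}

func main() {
	// Hypothetical capture; addresses and queue sizes are made up.
	sample := `State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 129    128    0.0.0.0:8080       0.0.0.0:*
ESTAB  0      0      10.0.0.5:8080      10.0.0.2:41000
ESTAB  517    0      10.0.0.5:8080      10.0.0.2:41002`
	for _, c := range stalled(sample) {
		fmt.Println(c)
	}
}
```

In practice you would pipe real `ss -tan` output into it; a growing list over successive runs suggests the server has stopped reading.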

( 10 ) Health Check Anti-Patterns That Cause 503

Many teams implement health checks that are too complex—returning 503 if any downstream dependency (database, cache) is slow. This creates a feedback loop: the database has a brief slowdown, the health check returns 503, Nginx takes the server out of rotation, and the remaining servers get more traffic, causing them to slow down too, triggering more health check failures. This cascades until all servers are down.

Best practice: health checks should be lightweight, returning 200 as long as the process is listening and can accept connections. Do not check database connectivity in the health check that Nginx uses; expose that through a separate readiness probe. Also, use `slow_start=30s` (available in Nginx Plus) to gradually reintroduce a server after recovery, preventing a thundering herd.
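A sketch of that split on the application side, assuming two hypothetical endpoints: `healthz` for the proxy and `readyz` for the orchestrator. The `dbHealthy` flag stands in for a real dependency check, and the `status` helper exists only to exercise the handlers in-process.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// dbHealthy is a stand-in for a real dependency check (hypothetical).
var dbHealthy atomic.Bool

// healthz is the lightweight check for Nginx: it returns 200 whenever the
// process can serve a request, regardless of downstream dependencies.
func healthz(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz is the separate readiness check that does look at dependencies.
func readyz(w http.ResponseWriter, r *http.Request) {
	if !dbHealthy.Load() {
		http.Error(w, "database unavailable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// status invokes a handler in-process and returns the status code it wrote.
func status(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	dbHealthy.Store(false) // simulate a brief database slowdown
	fmt.Println("healthz:", status(healthz))
	fmt.Println("readyz: ", status(readyz))
}
```

During the simulated database outage only the readiness endpoint fails, so a proxy watching `healthz` keeps the server in rotation and the feedback loop described above never starts.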

( 11 ) Connection Limits: A Deep Dive into max_conns and worker_connections

Nginx's `worker_connections` directive sets the maximum number of connections per worker. If your Nginx has 4 workers and `worker_connections 1024`, the total across all workers is 4096. That budget covers client and upstream connections alike, so each proxied request uses two. `max_conns` on a `server` line in the upstream block limits connections to that single upstream server. Unless the upstream block defines a shared memory `zone`, the limit is enforced per worker: with `max_conns=256` and 4 workers, each worker can open up to 256 connections to that upstream, so the total can reach 1024. This can overwhelm the upstream if it is not expecting that many.
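The arithmetic is small enough to encode as a sanity check when sizing the upstream. The helper name `worstCaseUpstreamConns` is hypothetical; the `sharedZone` flag reflects that `max_conns` is enforced per worker unless the upstream group lives in a shared memory `zone`.

```go
package main

import "fmt"

// worstCaseUpstreamConns returns the most connections Nginx could open to a
// single upstream server. Without a shared memory zone, max_conns applies
// per worker; with a zone, all workers share one limit.
func worstCaseUpstreamConns(workers, maxConns int, sharedZone bool) int {
	if sharedZone {
		return maxConns
	}
	return workers * maxConns
}

func main() {
	fmt.Println(worstCaseUpstreamConns(4, 256, false)) // 1024
	fmt.Println(worstCaseUpstreamConns(4, 256, true))  // 256
}
```

Compare the first number, not the configured `max_conns`, against what the upstream can actually accept.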

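To debug connection limits, use `strace` on the Nginx workers: `strace -f -e trace=network -p "$(pgrep -f 'nginx: worker')" 2>&1 | grep connect`. The quotes matter, because `pgrep` usually returns several worker PIDs. `EINPROGRESS` is normal for non-blocking sockets; many `ECONNREFUSED` or `EAGAIN` results suggest you are hitting limits. Also, check `netstat -s | grep -i listen` on the upstream host for listen queue overflows and dropped SYNs.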

Frequently asked questions

Why does curl to the upstream work but Nginx still returns 503?

This is the classic 'zombie server' scenario. The upstream process is alive but not accepting new connections because it has exhausted its file descriptor limit, worker pool, or event loop capacity. A single `curl` may get through when one slot happens to free up, while Nginx's many concurrent connections are all waiting. Compare the upstream's open descriptors (`ls /proc/<pid>/fd | wc -l`) with its limit (`grep 'open files' /proc/<pid>/limits`), and check the connection count with `ss -tan | wc -l`. Also look at `Recv-Q` in the `ss` output: if it is non-zero on established connections, the server is not reading data.

What's the difference between 'no live upstreams' and 'connection timed out'?

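'No live upstreams' means Nginx has already marked all servers as unavailable via passive health checks (`max_fails` reached). This is usually a configuration issue (`max_fails` and `fail_timeout` too aggressive) or a real upstream outage. 'Connection timed out' means a single attempt ran out of time: the log line ends with 'while connecting to upstream' when the TCP connect exceeded `proxy_connect_timeout`, or 'while reading response header from upstream' when the connection opened but no response arrived within `proxy_read_timeout`. The latter points to upstream slowness or a zombie connection.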

How do I check if my upstream health check is too aggressive?

Look at the health check frequency and failure threshold. If you check every 2 seconds and mark the server down after 2 failures (4 seconds total), any brief hiccup takes it out. Monitor how often servers go up/down in Nginx metrics (e.g., `nginx_upstream_servers_down`). A high flapping rate indicates your health check is too sensitive. With Nginx Plus active checks, increase `interval` to 10s and `fails` to 3; with passive checks in open source Nginx, raise `max_fails` on the `server` line instead.

Can DNS cause 503 errors even if upstream IPs are static?

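Yes, if you use a domain name in `proxy_pass` without a variable. Nginx resolves the name at startup and keeps the result until reload or restart. If the IP behind the name changes (e.g., a Kubernetes pod gets rescheduled), Nginx still sends requests to the old IP, causing connection refused errors or timeouts. Fix by configuring a `resolver` and using a variable: `set $backend http://service.namespace:8080; proxy_pass $backend;`. Nginx then resolves the name at request time and caches it for the record's TTL (or the resolver's `valid=` setting). Without a `resolver`, the variable form fails with 'no resolver defined'.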

What is the best way to simulate a 503 in development?

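Create a simple upstream that sleeps for 65 seconds (beyond `proxy_read_timeout`) and then returns 200. Configure Nginx with `proxy_read_timeout 60s;` and hit it. You'll see 'upstream timed out' in the logs. To test 'no live upstreams', define at least two servers (with a single server, `max_fails` is ignored), set `max_fails=1` and `fail_timeout=5s`, add `proxy_next_upstream error timeout http_500;` so that a 500 counts as a failure, and have every upstream return 500. The first request marks the servers down; the next one is rejected immediately with 'no live upstreams' in the error log.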
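A slow upstream for this experiment could be sketched as follows. The `slowHandler` and `timesOut` names are hypothetical, and the demo in `main` scales the 65s versus 60s setup down to milliseconds, using a plain HTTP client in place of Nginx so it finishes quickly.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowHandler waits for delay before answering 200, imitating an upstream
// that outlasts proxy_read_timeout.
func slowHandler(delay time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			w.WriteHeader(http.StatusOK)
			fmt.Fprintln(w, "finally")
		case <-r.Context().Done():
			// The caller gave up and closed the connection.
		}
	}
}

// timesOut reports whether a client with the given timeout gives up on a
// handler that takes delay to respond. Any client error counts as giving up.
func timesOut(delay, clientTimeout time.Duration) bool {
	srv := httptest.NewServer(slowHandler(delay))
	defer srv.Close()
	client := &http.Client{Timeout: clientTimeout}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return true
	}
	resp.Body.Close()
	return false
}

func main() {
	fmt.Println("slow upstream, short timeout:", timesOut(300*time.Millisecond, 100*time.Millisecond))
	fmt.Println("fast upstream, long timeout: ", timesOut(10*time.Millisecond, 500*time.Millisecond))
}
```

For the real experiment, serve the handler on a fixed port with `http.ListenAndServe(":8080", slowHandler(65*time.Second))` and point an Nginx upstream at it.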
