LEARN · DEBUGGING GUIDE

WebSocket Connection Closing Unexpectedly: A Field Guide

WebSocket drops are rarely random. Nine times out of ten, it's a proxy timeout, a load balancer idle timeout, or a missing ping/pong. Here's how to find which one.

Intermediate · HTTP / Networking · 9 min read

What this usually means

Most unexpected WebSocket closures are caused by intermediaries (forward proxies, reverse proxies, or load balancers) that enforce idle timeouts on long-lived connections. The WebSocket protocol has a built-in keepalive mechanism (ping/pong frames), but many applications either don't implement it or implement it incorrectly. When an intermediary sees no data for its configured timeout period (e.g., Nginx `proxy_read_timeout`, AWS ALB idle timeout), it tears down the TCP connection, often without sending a WebSocket close frame, so neither endpoint logs a clean close. Less common but real: client-side network changes, server-side resource exhaustion (too many sockets, thread pool starvation), or a bug in the WebSocket handshake that causes the connection to be treated as plain HTTP after the upgrade.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Check the exact time between the last traffic on the connection and the close using the browser DevTools Network tab or tcpdump. If it's a round number (e.g., 60s, 300s), it's almost certainly a timeout.
  2. Run `tcpdump -i any port 443 -X` on the server and look for the close sequence: which side sends the first FIN or RST? On a TLS hop the WebSocket frames are encrypted, so capture on a plaintext hop (e.g., between the proxy and the app) if you need to see whether a close frame was sent.
  3. Examine proxy/load balancer configuration for idle timeout settings. For Nginx: `proxy_read_timeout`; for HAProxy: `timeout tunnel`; for AWS ALB: `idle_timeout.timeout_seconds`.
  4. Test with a direct connection (bypassing the proxy) to isolate the intermediary. Use `websocat` or `wscat` from a machine on the same network.
  5. Enable WebSocket ping/pong frames, initiated by the server when the client is a browser (browsers cannot send pings themselves). Verify they are sent at an interval of at most half the smallest timeout in the path.
  6. Check server logs for thread pool or connection pool exhaustion. On Java: look for `RejectedExecutionException`; on Node.js: look for `EMFILE` errors and `MaxListenersExceededWarning`, and watch event loop lag.
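
Step 1 is easy to automate roughly. The sketch below is an illustration, not part of any library: `matchKnownTimeout` and the `COMMON_TIMEOUTS_S` list are hypothetical names, and the list only holds the defaults mentioned in this guide plus a few round values.

```javascript
// Hypothetical helper: flags idle periods that sit close to a round-number
// timeout. The candidate list is an assumption; extend it with the values
// actually configured in your own path.
const COMMON_TIMEOUTS_S = [30, 50, 60, 120, 300, 600, 3600];

function matchKnownTimeout(idleMs, toleranceMs = 1500) {
  for (const seconds of COMMON_TIMEOUTS_S) {
    if (Math.abs(idleMs - seconds * 1000) <= toleranceMs) {
      return seconds; // idle period is within tolerance of this timeout
    }
  }
  return null; // no round-number match; look at other causes
}

// Usage: record Date.now() at the last message and in the close handler.
const lastTrafficAt = 1_000;
const closedAt = 61_200;
console.log(matchKnownTimeout(closedAt - lastTrafficAt)); // 60
```

A match is a hint about where to look, not proof. Confirm it against the configured timeouts in section ( 02 ).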

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Nginx config: `/etc/nginx/nginx.conf` or `sites-enabled/`: look for `proxy_read_timeout`, `proxy_send_timeout`, and `proxy_http_version 1.1`
  • AWS ALB settings: `aws elbv2 describe-load-balancer-attributes --load-balancer-arn <your-alb-arn>` and check `idle_timeout.timeout_seconds`
  • HAProxy config: `/etc/haproxy/haproxy.cfg`: look for `timeout tunnel`, `timeout client`, and `timeout server`
  • Server logs: application logs for close frames or errors, and system logs (`/var/log/syslog`) for OOM or file descriptor limits
  • Client-side network tab: Chrome DevTools -> Network -> WS -> select the connection -> inspect the Messages tab; log `event.code` and `event.reason` from the `close` handler to capture the close code and reason
  • tcpdump output: `tcpdump -i eth0 -s0 -w websocket.pcap port 443` then analyze with Wireshark
  • Environment variables: `WEBSOCKET_PING_INTERVAL`, `WS_HEARTBEAT`, or similar in your application config (names vary by application)

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Proxy/load balancer idle timeout too low (e.g., AWS ALB default 60s, Nginx default 60s)
  • No WebSocket ping/pong (heartbeat) implemented, so the connection appears idle to intermediaries
  • Ping interval is longer than the smallest timeout in the path (e.g., ping every 120s, but ALB timeout is 60s)
  • Server-side resource exhaustion: too many open file descriptors (check `ulimit -n`), thread pool exhausted, or memory pressure
  • Network path changes: client switches networks (WiFi to cellular, VPN drop) and TCP connection breaks
  • Client-side WebSocket library reconnects with a new socket but old socket remains half-open, causing duplicate connections

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Implement WebSocket ping/pong frames (heartbeat) at an interval of 25-30 seconds (half of the smallest expected timeout). Call `ws.ping()` from a timer in Node.js, or send a `PingMessage` with `session.sendMessage(new PingMessage())` in Spring.
  • Increase proxy/load balancer timeouts to match your application's needs. For Nginx: `proxy_read_timeout 3600s;` and `proxy_send_timeout 3600s;`. For AWS ALB: set idle timeout to 600s or more.
  • Configure the WebSocket server to send pings itself rather than waiting for the client. Browser clients cannot send ping frames; they only answer incoming pings with pongs automatically.
  • On the client side, implement a reconnection strategy with exponential backoff and jitter to avoid a thundering herd after a mass disconnect.
  • Add a health check endpoint (HTTP) that the load balancer can use, separate from the WebSocket connection, to avoid killing the WS due to health check failures.
  • Monitor file descriptor usage and set appropriate limits. Use `sysctl -w fs.file-max=100000` and adjust `ulimit` for the application user.
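
The "half of the smallest timeout" rule is simple arithmetic, but it helps to compute it from the timeouts you actually found rather than guess. A minimal sketch, assuming you have collected each hop's idle timeout in seconds; `safePingIntervalMs` is a hypothetical name.

```javascript
// Hypothetical helper: derives a heartbeat interval from the idle timeouts
// (in seconds) of every hop between client and server.
function safePingIntervalMs(hopTimeoutsSeconds) {
  if (hopTimeoutsSeconds.length === 0) {
    return 25_000; // unknown path: this guide's fallback for 60s defaults
  }
  const smallestMs = Math.min(...hopTimeoutsSeconds) * 1000;
  // Half the smallest timeout leaves room for one lost or delayed ping.
  return Math.floor(smallestMs / 2);
}

// Example path (values are assumptions): ALB 60s, Nginx 300s.
console.log(safePingIntervalMs([60, 300])); // 30000
```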

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • After applying keepalive, use tcpdump on a plaintext hop to confirm ping/pong frames are exchanged at the expected interval (on a TLS hop you will only see that packets flow, not which frames they carry).
  • Leave a test connection idle for longer than the previous timeout period (e.g., 65s for a 60s timeout) and verify it remains open. A tool such as `wscat` or `websocat` works; an HTTP request to a health endpoint does not exercise the WebSocket.
  • Reproduce the timeout deliberately in a test environment by lowering the proxy's idle timeout (e.g., `proxy_read_timeout 10s;`) and confirming the heartbeat keeps the connection open. Note that `tc qdisc add dev eth0 root netem delay 1000ms` only adds latency; it does not simulate an idle timeout.
  • Check the close code and reason in the client's close event. A clean close code (1000) indicates an intentional close; code 1006 means abnormal closure (no close frame received, often a proxy or network drop).
  • Monitor server metrics: open sockets (`ss -s`), file descriptors (`lsof -p <pid> | wc -l`), and thread pool utilization.
  • Run a load test with many concurrent WebSocket connections and observe if any drop unexpectedly. Use `wsbench` or a custom script.
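
If a packet capture is awkward, the heartbeat can also be checked from application logs. The sketch below assumes you record a timestamp each time a pong arrives (for example in a `pong` event handler of the `ws` library); `largestGapMs` and `heartbeatIsHealthy` are hypothetical names, and the 2s tolerance is an arbitrary assumption.

```javascript
// Hypothetical check: given timestamps (ms) at which pongs were observed,
// find the largest gap between consecutive pongs.
function largestGapMs(timestampsMs) {
  let largest = 0;
  for (let i = 1; i < timestampsMs.length; i++) {
    largest = Math.max(largest, timestampsMs[i] - timestampsMs[i - 1]);
  }
  return largest;
}

// Healthy means no gap exceeded the expected interval plus a tolerance.
function heartbeatIsHealthy(timestampsMs, expectedIntervalMs, toleranceMs = 2000) {
  if (timestampsMs.length < 2) return false; // too few samples to judge
  return largestGapMs(timestampsMs) <= expectedIntervalMs + toleranceMs;
}

console.log(heartbeatIsHealthy([0, 30010, 60050, 90020], 30000)); // true
console.log(heartbeatIsHealthy([0, 30000, 90000], 30000));        // false
```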

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Do not set the ping interval exactly equal to the timeout: network jitter can make a ping arrive too late. Use at most half the timeout.
  • Do not forget to enable WebSocket support in the proxy (e.g., `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection "Upgrade";` in Nginx).
  • Do not assume a heartbeat configured at one end covers every hop. Confirm with a capture that heartbeat frames actually traverse each intermediary in the path.
  • Do not assume that a 60-second timeout is "plenty": many proxies default to 60s, so your heartbeat must be more frequent.
  • Do not ignore WebSocket close codes: 1001 indicates endpoint going away (server restart), 1009 means message too large, etc.
  • Do not restart the server without draining existing WebSocket connections gracefully (use close frame with code 1001).
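
The last point, draining before a restart, can be sketched as follows. This is one possible shape, not a prescribed implementation: `drainConnections` is a hypothetical name, the 5s grace period is an assumption, and `clients` is any iterable of socket-like objects exposing `close(code, reason)`, `terminate()`, and `readyState`, which is the shape of sockets in the `ws` library's `wss.clients` set.

```javascript
// Hypothetical drain routine for shutdown: ask every client to close with
// 1001 (going away), then force-close stragglers after a grace period.
function drainConnections(clients, graceMs = 5000) {
  const pending = [...clients]; // snapshot, since the set shrinks as sockets close
  for (const socket of pending) {
    socket.close(1001, 'server restarting');
  }
  setTimeout(() => {
    for (const socket of pending) {
      if (socket.readyState !== 3) socket.terminate(); // 3 = CLOSED
    }
  }, graceMs);
  return pending.length; // number of connections asked to close
}
```

Call it from your shutdown signal handler before stopping the HTTP server, so clients see 1001 and can reconnect to another instance instead of logging 1006.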

( 07 ) War story

The 60-Second Disconnect: A Proxy Timeout Chase

Platform Engineer · Node.js (ws library), Nginx reverse proxy, AWS ALB, React client

Timeline

  1. 09:00 Customer reports chat app disconnects every 60 seconds on mobile
  2. 09:10 I check server logs: no errors, just 'connection close' with code 1006
  3. 09:20 Browser DevTools shows WebSocket close after exactly 60s of no user input
  4. 09:30 I find Nginx config has `proxy_read_timeout 60s;` and the ALB idle timeout is at its 60s default, but we have a WebSocket ping every 30s
  5. 09:35 tcpdump on the Nginx host shows the server's pings arriving at Nginx but no ping frames reaching the client
  6. 09:40 I realize the ping interval is only set on the Node.js server, behind two intermediaries, and the pings are not keeping the client-facing side of the path active
  7. 09:45 Fix: increase the ALB idle timeout and `proxy_read_timeout` to 300s and confirm ping interval is 30s
  8. 09:50 Deploy change; test with 2-minute idle: connection stays open

I got paged at 9 AM Monday: our chat application was dropping connections every 60 seconds on mobile clients. Users had to manually reconnect, losing unsent messages. The server logs showed nothing unusual, just a close event with code 1006 (abnormal closure). No errors, no crashes. That told me the server wasn't initiating the close; something in between was killing it.

I opened Chrome DevTools on my machine, connected to the WebSocket, and watched the Network tab. Exactly 60 seconds after the last sent message, the WebSocket closed. 60 seconds is a classic default timeout. I checked our infrastructure: we had an AWS ALB in front of Nginx, which proxied to the Node.js server. The ALB default idle timeout is 60 seconds. So that was the likely culprit. But wait — we had implemented a 30-second ping/pong heartbeat. Why didn't that keep the connection alive?

I ran tcpdump on the Nginx server. The Node.js server was sending pings every 30 seconds, but they were only reaching Nginx, not the client. The ALB sits in front of Nginx, and Nginx's `proxy_read_timeout` was also 60 seconds. With no heartbeat traffic crossing it, the ALB saw 60 seconds of silence and dropped the connection. The fix: I increased the ALB idle timeout to 300 seconds, raised `proxy_read_timeout` to match, and verified the ping/pong interval was consistent. After deploy, no more drops.

Root cause

AWS ALB idle timeout (60s) was not being reset by server-side WebSocket pings because the pings only went from server to Nginx, not across the ALB to the client. The ALB saw no client-to-server traffic for 60s and closed the connection.

The fix

Increased AWS ALB idle timeout from 60s to 300s (via AWS Console). Also increased Nginx `proxy_read_timeout` to 300s for redundancy. Confirmed WebSocket pings are sent every 30s from the server (which also go to the client through the ALB after fix).

The lesson

Intermediaries (ALB, Nginx) have their own timeouts that may not be reset by traffic on one side only. Always trace the full path and simulate idle traffic from the client perspective. Also, ping/pong must traverse all proxies to keep the connection alive.

( 08 ) The Anatomy of a WebSocket Close Frame

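WebSocket close frames contain a status code and an optional reason string. Codes 1000-1015 are defined or reserved by the protocol, 3000-3999 are registered for libraries and frameworks, and 4000-4999 are for private application use. Code 1000 means normal closure; 1001 indicates endpoint going away (e.g., server shutdown); 1009 means message too large. Code 1006 (abnormal closure) is never sent on the wire: the local endpoint reports it when the connection ended without a close frame, for example after a TCP RST.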

When debugging drops, always capture the close code on both client and server. In the browser, listen to the 'close' event: `ws.onclose = (e) => console.log(e.code, e.reason);`. On Node.js server: `ws.on('close', (code, reason) => console.log(code, reason.toString()));`.

A code of 1006 with no reason almost always indicates a proxy or network-level drop. Code 1001 received from the server suggests a restart or scale-down event. If you see 1009, check your message size limits; defaults vary by library and proxy (in the `ws` library the setting is `maxPayload`).
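
To make close codes actionable in logs, a small lookup is often enough. The sketch below is an illustration: `describeCloseCode` is a hypothetical name, the code meanings follow RFC 6455, and the hints are this guide's heuristics rather than anything the standard defines.

```javascript
// Hypothetical triage helper mapping a close code to a name and a next step.
function describeCloseCode(code) {
  switch (code) {
    case 1000:
      return { name: 'normal closure', hint: 'intentional close; usually no action' };
    case 1001:
      return { name: 'going away', hint: 'check for restarts, deploys, or scale-down' };
    case 1006:
      return { name: 'abnormal closure', hint: 'no close frame seen; inspect proxies and network' };
    case 1009:
      return { name: 'message too big', hint: 'compare payload size with configured limits' };
    default:
      if (code >= 4000 && code <= 4999) {
        return { name: 'application-defined', hint: 'look up the code in your own protocol' };
      }
      return { name: 'other', hint: 'see the WebSocket close code registry' };
  }
}

console.log(describeCloseCode(1006).hint);
```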

( 09 ) Ping/Pong: The Heartbeat That Saves Connections

The WebSocket protocol defines Ping and Pong frames (opcodes 0x9 and 0xA). Either endpoint can send a Ping; the other must respond with a Pong containing the same application data (if any). This keeps the connection alive through proxies that track last data timestamp.

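Common mistake: implementing heartbeats only at the application layer (sending JSON messages) when native Ping frames are available. Ping frames are lightweight, handled by the WebSocket library rather than your message handlers, and answered automatically by compliant endpoints, browsers included. The one exception is a heartbeat that must originate in a browser: the browser WebSocket API cannot send Ping frames, so there a small application-level message is the only option.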

Set your ping interval to half of the smallest expected timeout in the path. If you don't know the smallest timeout, default to 25 seconds (safe for 60s defaults). On Node.js with the `ws` library, `ws.ping()` sends a single Ping frame and takes no interval argument, so call it from a `setInterval` timer and listen for the `pong` event to detect dead peers. Browsers reply to pings with pongs automatically and expose neither frame to JavaScript, so no client code is needed for server-initiated pings.
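
A minimal sketch of that server-side heartbeat follows. It is written against plain socket-like objects so it runs without dependencies: `clients` is any iterable of objects with `ping()`, `terminate()`, and `on('pong', fn)`, which matches sockets in the `ws` library's `wss.clients` set. The names `trackPong`, `heartbeatTick`, and `startHeartbeat` are hypothetical.

```javascript
// Call once per new connection so pongs mark the socket as alive.
function trackPong(socket) {
  socket.isAlive = true;
  socket.on('pong', () => { socket.isAlive = true; });
}

// One heartbeat round: drop sockets that never answered the previous ping,
// then ping the rest.
function heartbeatTick(clients) {
  for (const socket of clients) {
    if (socket.isAlive === false) {
      socket.terminate(); // no pong since the previous tick: assume dead
      continue;
    }
    socket.isAlive = false; // flipped back to true by the pong handler
    socket.ping();
  }
}

function startHeartbeat(clients, intervalMs = 25_000) {
  return setInterval(() => heartbeatTick(clients), intervalMs);
}

// Assumed wiring with the `ws` library (not executed here):
//   wss.on('connection', trackPong);
//   const timer = startHeartbeat(wss.clients);
//   wss.on('close', () => clearInterval(timer));
```

Terminating unresponsive sockets matters as much as the ping itself: it frees half-open connections that would otherwise linger until the OS gives up on them.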

( 10 ) Proxy and Load Balancer Configuration Pitfalls

Nginx requires explicit WebSocket support: you must set `proxy_http_version 1.1;` and `proxy_set_header Upgrade $http_upgrade;` and `proxy_set_header Connection "Upgrade";`. Without these, Nginx does not forward the upgrade to the backend, so the handshake fails or the request is handled as ordinary HTTP and eventually times out.

AWS ALB has a default idle timeout of 60 seconds. AWS documents the timer as elapsing when no data is sent or received on the connection, so a heartbeat resets it only if its frames actually cross the ALB. Do not assume server-side pings reach the client through every hop; confirm with a packet capture on each side of the proxy chain. To fix, either increase the timeout or add a client-originated keepalive (the client sends a small message every 25-30 seconds).
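
A client-originated keepalive might look like the sketch below. Because browsers cannot send Ping frames, it sends a tiny application message instead. The `{ type: 'keepalive' }` payload and the function names are assumptions; the server must recognize and ignore that message.

```javascript
// Sends one keepalive message if the socket is open. Returns whether it sent.
function sendKeepalive(socket) {
  if (socket.readyState !== 1) return false; // 1 = WebSocket.OPEN
  socket.send(JSON.stringify({ type: 'keepalive' }));
  return true;
}

// Starts a periodic keepalive and stops it when the socket closes, so timers
// do not pile up across reconnects.
function startClientKeepalive(socket, intervalMs = 25_000) {
  const timer = setInterval(() => sendKeepalive(socket), intervalMs);
  socket.addEventListener('close', () => clearInterval(timer));
  return timer;
}

// Assumed usage in the browser (not executed here):
//   const ws = new WebSocket('wss://example.com/socket');
//   ws.addEventListener('open', () => startClientKeepalive(ws));
```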

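HAProxy uses `timeout tunnel` for WebSocket connections once the upgrade has completed. If it is not set, the regular `timeout client` and `timeout server` values keep applying (often 50s in example configs). Set `timeout tunnel 1h` for long-lived connections. `option http-keep-alive` governs HTTP connection reuse and does not change the tunnel timeout.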

( 11 ) Client-Side Network Changes and Reconnection Strategies

Mobile clients often switch between WiFi and cellular, or go through a VPN. Each network change tears down the TCP connection. The WebSocket library should detect the 'close' event and attempt reconnection with exponential backoff: start at 1s, double each attempt up to 30s, add jitter (±500ms).
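
The backoff schedule described above can be sketched as follows, using the same numbers (1s start, doubling, 30s cap, ±500ms jitter). `backoffDelayMs` and `connectWithRetry` are hypothetical names, and `createSocket` is an assumed factory that returns a new browser-style WebSocket.

```javascript
// Delay before reconnect attempt number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(attempt, random = Math.random) {
  const base = Math.min(30_000, 1000 * 2 ** attempt); // 1s, 2s, 4s, ... capped at 30s
  const jitter = (random() * 2 - 1) * 500;            // uniform in [-500, +500)
  return Math.max(0, Math.round(base + jitter));
}

// Reconnect loop: every close schedules a new attempt with a growing delay;
// a successful open resets the schedule.
function connectWithRetry(createSocket, attempt = 0) {
  const socket = createSocket();
  socket.addEventListener('open', () => { attempt = 0; });
  socket.addEventListener('close', () => {
    setTimeout(() => connectWithRetry(createSocket, attempt + 1), backoffDelayMs(attempt));
  });
  return socket;
}

console.log(backoffDelayMs(3, () => 0.5)); // 8000
```

The jitter is what prevents a mass disconnect from turning into synchronized reconnect waves, so keep it even if you change the other numbers.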

The browser WebSocket API does not automatically reconnect. You must implement it. Use the 'close' event to trigger a reconnect. Where the Network Information API is available (it is not supported in all browsers), you can also listen to the 'change' event on `navigator.connection`, or fall back to the window 'online' event, to proactively reconnect on network change.

On the server, avoid storing state per connection that assumes a stable client. Use a session ID or token that survives reconnects. When a client reconnects, the server should associate the new WebSocket with the existing session without dropping in-flight messages.
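
One way to structure that on the server is a registry keyed by session token rather than by socket. The sketch below is a deliberately small in-memory illustration: `SessionRegistry` and its methods are hypothetical, and sockets only need a `send(message)` method.

```javascript
// Hypothetical in-memory registry: state lives with the session token, and
// messages are buffered while the client is disconnected.
class SessionRegistry {
  constructor() {
    this.sessions = new Map(); // token -> { socket, queue }
  }

  getOrCreate(token) {
    if (!this.sessions.has(token)) {
      this.sessions.set(token, { socket: null, queue: [] });
    }
    return this.sessions.get(token);
  }

  // Attach a (re)connected socket and flush anything queued while it was away.
  attach(token, socket) {
    const session = this.getOrCreate(token);
    session.socket = socket;
    while (session.queue.length > 0) socket.send(session.queue.shift());
  }

  // Call from the socket's close handler.
  detach(token) {
    const session = this.sessions.get(token);
    if (session) session.socket = null;
  }

  // Send now if connected, otherwise queue until the client returns.
  send(token, message) {
    const session = this.getOrCreate(token);
    if (session.socket) session.socket.send(message);
    else session.queue.push(message);
  }
}
```

A production version would also cap the queue and expire abandoned sessions; both are omitted here to keep the idea visible.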

( 12 ) Monitoring and Alerting for WebSocket Health

Track WebSocket-specific metrics: number of active connections, connection rate, close code distribution, and reconnection rate. Use tools like Prometheus with a custom exporter, fed from what your WebSocket library exposes (e.g., `wss.clients.size` in the `ws` library gives the active connection count; the other metrics need counters in your own event handlers).

Set up alerts for spikes in close code 1006 (abnormal closures) or drops in active connections. A sudden drop of 50% of connections likely indicates a proxy timeout or network partition.

Log every WebSocket close with timestamp, client IP, close code, and reason. Correlate with proxy logs (e.g., Nginx access logs) to identify the intermediary that closed the connection. Use structured logging (JSON) for easy correlation.
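
A structured close log plus a per-code counter covers most of the above. The sketch below is one possible shape: `recordClose`, `closeCounts`, and the field names are assumptions, so align them with whatever fields your proxy logs use for correlation.

```javascript
// Running count of closes per close code, suitable for export as a metric.
const closeCounts = new Map();

// Builds one JSON log line for a closed connection and updates the counter.
function recordClose({ code, reason, clientIp, openedAtMs, closedAtMs = Date.now() }) {
  closeCounts.set(code, (closeCounts.get(code) ?? 0) + 1);
  return JSON.stringify({
    event: 'ws_close',
    timestamp: new Date(closedAtMs).toISOString(),
    clientIp,
    code,
    reason: String(reason ?? ''), // `ws` passes the reason as a Buffer
    lifetimeMs: closedAtMs - openedAtMs,
  });
}

// Example with fixed timestamps so the output is reproducible.
console.log(recordClose({
  code: 1006, reason: '', clientIp: '203.0.113.7', openedAtMs: 0, closedAtMs: 60_000,
}));
```

Including `lifetimeMs` in every entry makes round-number timeouts stand out immediately when you aggregate the logs.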

Frequently asked questions

Why does my WebSocket drop after exactly 60 seconds even though I have a 30-second heartbeat?

Most likely the heartbeat frames are not crossing the hop that enforces the 60-second timeout, or the heartbeat is not running at all. An idle timer such as AWS ALB's resets only on data the load balancer itself sees, so capture traffic on both sides of each proxy to confirm that pings and pongs actually pass through. Solution: fix the heartbeat so it traverses the full path, implement client-side heartbeats (the client sends a small message every 25-30 seconds), or increase the timeout.

How do I see WebSocket close codes in Chrome DevTools?

Open DevTools, go to the Network tab, filter by 'WS', click on the WebSocket connection, then look at the 'Messages' tab (called 'Frames' in older Chrome versions). DevTools does not always display the close code there, so the reliable way is to listen to the 'close' event in your code: `ws.onclose = (e) => console.log(e.code, e.reason);`.

What's the difference between a 1006 and a 1000 close code?

Code 1000 (Normal Closure) means the endpoint intentionally closed the connection, usually with a close frame. Code 1006 (Abnormal Closure) means the connection was lost without a proper close frame — typically a TCP RST from a proxy or network timeout. If you see 1006, there's likely an intermediary killing the connection.

Should I use WebSocket ping frames or application-level keepalive messages?

Prefer WebSocket ping frames (opcode 0x9) wherever your endpoint can send them. They are lightweight, standard, and answered automatically by the peer. Application-level keepalive (sending JSON) is not standardized and needs handling code on both ends, although proxies that only track bytes on the connection will still count it as activity. It is the fallback for browser clients, which cannot send ping frames.

My WebSocket drops only on mobile when switching from WiFi to cellular. What's happening?

Network switching causes the TCP connection to break because the client's IP changes. The WebSocket protocol does not handle IP changes natively. You must implement reconnection logic on the client that detects the 'close' event (or, where the browser supports it, the Network Information API's 'change' event) and opens a new WebSocket. The server should support session resumption using a token.