What this usually means
Socket.IO's reconnection logic is a multi-step process: after a disconnection, the client waits for an exponential backoff delay with random jitter (by default starting at about 1s and capped at 5s), then attempts to reconnect by opening a new transport (polling or WebSocket). If the server rejects the new connection (e.g., due to expired tokens, CORS, or mismatched namespaces), or if the transport upgrade fails (e.g., WebSocket blocked by a proxy), the client backs off and retries, and gives up once it reaches `reconnectionAttempts` (default: Infinity, though many apps set a lower limit). The root cause is typically a silent failure in one of these stages: the client thinks it's trying, but the server never acknowledges the handshake.
The first ten minutes — establish facts before touching code.
1. Open the browser DevTools Network tab, filter by WS, and check the WebSocket handshake status (101 vs 4xx/5xx) and the polling XHR responses.
2. Use `io('http://host', { transports: ['websocket'] })` to force WebSocket only; if reconnection works, the issue is with polling or the transport upgrade.
3. Check server logs for `transport error` or `invalid namespace` during reconnection attempts.
4. On the client, listen to the `reconnect_attempt`, `reconnect_error`, and `reconnect_failed` events and log the error object. In Socket.IO v3 and later these are emitted by the Manager (`socket.io.on(...)`), not by the socket itself.
5. Simulate network loss by disabling Wi-Fi or using Chrome's 'Offline' mode; check whether reconnection events fire after the network is re-enabled.
6. Verify that the Socket.IO client and server versions are compatible (major versions should match; v2 and v3/v4 use different protocols).
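Step 4 can be sketched as a small helper. This is a minimal sketch, assuming the Socket.IO v3+ event layout, where reconnection events come from the Manager at `socket.io`; the name `attachReconnectLogging` and the `log` parameter are illustrative, not part of Socket.IO.

```javascript
// Attach diagnostic listeners to a Socket.IO client socket (v3+ event layout).
// `socket` is any object shaped like a client socket: it has `.on()` and a
// Manager at `socket.io` that also has `.on()`. `log` defaults to console.log.
function attachReconnectLogging(socket, log = console.log) {
  // Socket-level events: connection refused by the server, or connection lost.
  socket.on("connect_error", (err) => log("connect_error", err.message));
  socket.on("disconnect", (reason) => log("disconnect", reason));

  // Manager-level events: the reconnection loop itself.
  socket.io.on("reconnect_attempt", (attempt) => log("reconnect_attempt", attempt));
  socket.io.on("reconnect_error", (err) => log("reconnect_error", err.message));
  socket.io.on("reconnect_failed", () => log("reconnect_failed"));
  socket.io.on("reconnect", (attempt) => log("reconnect", attempt));
}
```

Typical use would be `attachReconnectLogging(io('http://host'))` right after creating the socket. If `disconnect` is logged but no `reconnect_attempt` follows, the reconnection loop never started.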
The specific files, logs, configs, and dashboards that usually own this bug.
- Client-side: `socket.io.on('reconnect_attempt', (attempt) => console.log(attempt))` (Socket.IO v3+; in v2 the event is emitted on the socket itself) – watch for attempts not firing
- Server-side: Socket.IO `connection` event handler – log `socket.handshake` to see auth data
- Reverse proxy config (nginx, HAProxy) – check WebSocket upgrade headers and timeout settings
- Server logs for `Error: Invalid namespace` or `Error: Token expired` during handshake
- Client network tab – look for polling XHRs returning 4xx or WebSocket upgrade returning 400/403
- Client console for `Uncaught TypeError: Cannot read property 'on' of undefined` (common when the socket variable is used before `io()` has been called and assigned)
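For the server-side handshake check, logging the whole `socket.handshake` object would also write the token to the logs. The following is a minimal sketch of a log-safe summary; `summarizeHandshake` is a hypothetical helper name, and it assumes the token is sent as `auth.token`, as elsewhere in this guide.

```javascript
// Build a log-safe summary of `socket.handshake` for the server's
// `connection` handler. Only standard handshake fields are read.
function summarizeHandshake(handshake) {
  const auth = handshake.auth || {};
  return {
    address: handshake.address,
    time: handshake.time,
    url: handshake.url,
    origin: handshake.headers ? handshake.headers.origin : undefined,
    // Record only whether a token was sent, never the token itself.
    hasToken: typeof auth.token === "string" && auth.token.length > 0,
  };
}
```

It could be called as `io.on('connection', (socket) => console.log(summarizeHandshake(socket.handshake)))`. A reconnecting client that shows `hasToken: false` points at the client's `auth` option rather than at the network.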
Practical causes, not theory. These are the things you will actually find.
- Server-side middleware rejects reconnection because the auth token expired (`handshake.auth.token` is checked on every connect)
- Fallback from WebSocket to polling fails because CORS headers are missing or wrong for the polling endpoint
- Client `reconnectionDelayMax` set too low (e.g., 1000ms), causing rapid retries that exhaust server rate limits
- Reverse proxy (nginx) closes idle connections or has a short `proxy_read_timeout` (default 60s), causing the WebSocket to drop
- Client and server Socket.IO version mismatch (e.g., client v2, server v4), leading to an incompatible handshake protocol
- Namespace mismatch: the client connects to `/chat` but the server never registers that namespace, so every attempt is rejected with an `Invalid namespace` error
Concrete fix directions. Pick the one that matches your root cause.
- Set `reconnectionDelayMax: 30000` and `reconnectionAttempts: Infinity` on the client to allow longer backoff windows
- In server middleware, distinguish between initial connection and reconnection via the presence of `socket.handshake.auth.token`; if it is missing, allow a grace period or force re-auth
- Configure nginx: `proxy_http_version 1.1; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection 'upgrade'; proxy_read_timeout 86400s;`
- Enable both polling and WebSocket transports on the client, but try WebSocket first: `transports: ['websocket', 'polling']`
- Implement a client-side heartbeat (ping/pong) to detect stale connections before the server times out; Socket.IO's built-in heartbeat is governed by the server's `pingInterval` and `pingTimeout` options
- Keep client and server on the same major line: both on v2, or both on v3/v4 (avoid v2-v4 cross-version pairs)
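The client-side options from this list can be collected in one place. This is a minimal sketch; `reconnectOptions` is a hypothetical helper name, while the option names themselves are standard Socket.IO client options and the values are the ones suggested above.

```javascript
// Options object for the Socket.IO client, e.g. io(url, reconnectOptions()).
// Pass `overrides` to change individual values for a specific environment.
function reconnectOptions(overrides = {}) {
  return {
    reconnection: true,                   // the default, stated explicitly
    reconnectionAttempts: Infinity,       // never give up
    reconnectionDelay: 1000,              // first retry after roughly 1s
    reconnectionDelayMax: 30000,          // cap the backoff at 30s
    randomizationFactor: 0.5,             // keep jitter to avoid a thundering herd
    transports: ["websocket", "polling"], // try WebSocket first, then polling
    ...overrides,
  };
}
```

For the transport isolation test described earlier, `reconnectOptions({ transports: ['websocket'] })` gives the WebSocket-only variant without touching the backoff settings.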
A fix you cannot prove is a guess. Close the loop.
- After the fix, disconnect the client (e.g., turn off Wi-Fi), wait 30s, reconnect; watch the `reconnect` event fire with a new socket ID
- Confirm in server logs that the same client session reconnects (check `socket.handshake.address` or a custom session ID)
- Check `io.engine.clientsCount` on the server before and after reconnection; expect it to stay approximately constant
- Use the Chrome DevTools 'Network' tab to capture a successful WebSocket upgrade (101) after reconnection
- Write an integration test: connect the client, drop the network via puppeteer, restore it, and assert that `socket.connected` becomes true within 10s
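The assertion step of such an integration test can be sketched as a helper that waits for the socket to come back. This is a sketch under stated assumptions: `waitForConnected` is a hypothetical name, it relies only on the client socket's `connected` flag and its `once`/`off` methods, and how the network is dropped and restored is left to the test harness.

```javascript
// Resolve with true once `socket` is connected, or with false if that
// has not happened within `timeoutMs` milliseconds.
function waitForConnected(socket, timeoutMs = 10000) {
  return new Promise((resolve) => {
    if (socket.connected) {
      resolve(true);
      return;
    }
    const timer = setTimeout(() => {
      socket.off("connect", onConnect); // stop listening after giving up
      resolve(false);
    }, timeoutMs);
    function onConnect() {
      clearTimeout(timer);
      resolve(true);
    }
    socket.once("connect", onConnect);
  });
}
```

In a test, the check after restoring the network would be along the lines of `assert(await waitForConnected(socket, 10000))`.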
Things that make this bug worse or harder to find.
- Do not set `reconnection: false` as a workaround – you lose automatic reconnection entirely
- Do not ignore the `connect_error` event; it often contains the exact reason (e.g., 'token expired')
- Do not assume reconnection is handled by the client alone – the server must also be stateless or handle session recovery
- Do not put the request path (e.g., `/socket.io`) in the URL passed to `io()` or `io.connect()`: the path part of the URL is interpreted as a namespace. Set a custom path with the `path` option and configure the same path on the server
- Do not forget to configure CORS on the server if your client is on a different origin; polling requests are ordinary cross-origin HTTP requests, so a setup that only works over WebSocket breaks as soon as the client falls back to polling
The Midnight Reconnection Failure
Timeline
- 00:00 Production alert: 'High number of disconnected clients' – dashboard shows 40% drop in active connections
- 00:05 Checked server logs: no errors, but saw many 'transport error' with no detail
- 00:10 Tailed client logs: 'reconnect_attempt' fires, but server never logs a new connection
- 00:15 Noticed client version 4.4.0, server 4.5.0 – minor mismatch, but still compatible
- 00:20 Reproduced locally with same versions – reconnection works fine
- 00:30 Inspected nginx config: `proxy_read_timeout 60s` causing WebSocket to drop after 1 minute of inactivity
- 00:35 Changed `proxy_read_timeout` to 3600s and added `proxy_http_version 1.1`
- 00:40 Rolled out config change; reconnection started working within 2 minutes
At midnight, our Socket.IO dashboard showed a massive drop in active connections. My first instinct was a server crash, but the Node.js process was healthy, memory and CPU normal. Server logs showed only 'transport error' messages with no stack traces. Clients were trying to reconnect, but the server never acknowledged them. I checked the Socket.IO version: server was 4.5.0, client was 4.4.0. A minor version mismatch shouldn't break reconnection, but I couldn't reproduce locally.
I dug into the nginx config and found the culprit: `proxy_read_timeout 60s`. After 60 seconds of inactivity, nginx would close the WebSocket connection. The client would detect the disconnection and attempt to reconnect, but because the proxy had already severed the TCP connection, the WebSocket handshake would fail silently. The client would fall back to polling, but the polling requests were also getting a 400 due to missing CORS headers (we only allowed WebSocket origin).
The fix was straightforward: increase `proxy_read_timeout` to 3600s and ensure `proxy_http_version 1.1` with proper upgrade headers. We also added CORS headers for polling. After the config rollout, reconnection recovered within a couple of minutes. The lesson: always check the proxy's idle timeout when dealing with WebSocket-heavy apps. Also, monitor both WebSocket and polling transports during troubleshooting.
Root cause
nginx `proxy_read_timeout` set to 60s caused WebSocket connections to be killed after 1 minute of inactivity; client reconnection attempts then failed because polling requests were blocked by missing CORS headers.
The fix
Set `proxy_read_timeout 3600s;` and added `proxy_http_version 1.1; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection 'upgrade';` in the nginx location block. Also added CORS headers for the polling transport.
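Put together, the location block might look like the sketch below. The directives are the ones listed above; the `/socket.io/` location and the `backend` upstream name are assumptions for illustration, as is the `Host` header line, which is commonly added but was not part of the incident fix.

```nginx
# Sketch only: adjust the location and upstream to your deployment.
location /socket.io/ {
    proxy_http_version 1.1;                      # required for the Upgrade handshake
    proxy_set_header Upgrade $http_upgrade;      # pass the WebSocket upgrade request through
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;                 # assumption: preserve the original Host header
    proxy_read_timeout 3600s;                    # idle timeout raised from the 60s default
    proxy_pass http://backend;                   # hypothetical upstream name
}
```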
The lesson
Always verify reverse proxy timeout settings for WebSocket connections. The default 60s is for HTTP, not WebSockets. Also, ensure both polling and WebSocket transports are correctly proxied with appropriate headers.
By default, Socket.IO starts with an HTTP long-polling request (transport 'polling') and then upgrades to WebSocket. If the upgrade fails, it stays on polling. Reconnection follows the same sequence: each attempt first opens a new transport via polling, then upgrades. If polling itself fails (e.g., because of CORS), every reconnection attempt fails at that first step and the client never gets as far as the upgrade.
To debug, force a single transport: `io(..., { transports: ['websocket'] })`. If reconnection works, the issue is with polling or upgrade. Conversely, `transports: ['polling']` isolates polling issues. Check the browser's Network tab for the upgrade request (101 status) or polling XHR status codes.
Socket.IO middleware runs on every connection, including reconnections. If you check `socket.handshake.auth.token` and reject if expired or missing, the reconnection will be denied. The client receives a `connect_error` event with the error message.
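Best practice: on the client, pass the token through the `auth` option of `io()`. A static object such as `auth: { token: getToken() }` is evaluated once, so reconnections keep sending the original token; the function form, `auth: (cb) => cb({ token: getToken() })`, is called on each connection attempt and can supply a fresh one. On the server, allow a grace period for expired tokens or implement a refresh mechanism. Alternatively, use a session cookie instead of a token that expires.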
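Both sides of this can be sketched together. This is a minimal sketch, not a definitive implementation: `makeAuth`, `makeAuthMiddleware`, `verifyToken`, and the 30-second grace window are all hypothetical, and only `auth`, `socket.handshake.auth`, and the `next(err)` middleware contract come from Socket.IO.

```javascript
// Client side: the function form of `auth` runs on every connection attempt,
// so each reconnection sends whatever `getToken` returns at that moment.
function makeAuth(getToken) {
  return (cb) => cb({ token: getToken() });
}

// Server side: middleware factory for io.use(...).
// `verifyToken(token)` is a hypothetical helper that returns { expiresAt }
// (milliseconds since the epoch) for a well-formed token, or null otherwise.
function makeAuthMiddleware(verifyToken, graceMs = 30000, now = Date.now) {
  return (socket, next) => {
    const token = socket.handshake.auth && socket.handshake.auth.token;
    if (!token) return next(new Error("token missing"));

    const claims = verifyToken(token);
    if (!claims) return next(new Error("token invalid"));

    // Accept tokens that expired within the grace window, so a client that
    // reconnects just after expiry is not locked out.
    if (now() > claims.expiresAt + graceMs) {
      return next(new Error("token expired"));
    }
    next();
  };
}
```

On the client this would be wired up as `io(url, { auth: makeAuth(getToken) })`, and on the server as `io.use(makeAuthMiddleware(verifyToken))`. The message passed to `next` is what the client sees as `err.message` in its `connect_error` handler.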
Socket.IO v2 used a different handshake than v3/v4. If client and server versions differ significantly, the handshake may fail without an obvious error. For example, a v2 client trying to connect to a v4 server typically gets an HTTP 400 ('Unsupported protocol version') because the underlying protocol changed.
Always match major versions. If a v2 client must talk to a v3/v4 server, enable the server's `allowEIO3: true` compatibility option; otherwise upgrade the client. Check the `socket.io` and `socket.io-client` package versions on both sides. The underlying Engine.IO protocol must also match (v3 for Socket.IO v2, v4 for Socket.IO v3/v4).
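The version rule can be written down as a small check. This is a sketch based on my understanding of the published compatibility between the JavaScript client and server, limited to major versions 2 through 4; `isCompatible` is a hypothetical helper, and `allowEIO3` mirrors the server option of the same name.

```javascript
// Whether a JavaScript client of major version `clientMajor` can talk to a
// server of major version `serverMajor`. Assumes majors 2, 3, or 4 only.
function isCompatible(clientMajor, serverMajor, allowEIO3 = false) {
  // Socket.IO v2 speaks Engine.IO protocol 3; v3 and v4 speak protocol 4.
  const engineProtocol = (major) => (major >= 3 ? 4 : 3);
  if (engineProtocol(clientMajor) === engineProtocol(serverMajor)) return true;
  // A v2 client works against a v3/v4 server only with the compatibility
  // option enabled; a v3/v4 client never works against a v2 server.
  return clientMajor === 2 && allowEIO3;
}
```

A check like this could run in CI against the two `package.json` files to catch a cross-version deploy before it reaches production.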
Apart from `proxy_read_timeout`, other nginx directives can affect reconnection: `proxy_buffering off;` is sometimes recommended for long-lived connections; a missing or incorrect `X-Forwarded-For` header can make IP-based rate limiting in your stack treat all clients as one address; and whether `proxy_pass` includes a URI part (such as a trailing slash) changes how the `/socket.io` path is forwarded to the backend.
For HAProxy, ensure `option http-server-close` is not used; instead use `option http-keep-alive`. Also set `timeout tunnel 3600s` to prevent WebSocket disconnection. Check if any security tool (ModSecurity, WAF) is inspecting WebSocket traffic and blocking upgrade requests.
The client's `reconnectionDelay` (default 1000ms) and `reconnectionDelayMax` (default 5000ms) control backoff. If you set `reconnectionDelayMax` too low, the client will retry too fast, potentially hitting server rate limits or exhausting itself. I've seen cases where clients with `reconnectionDelayMax: 1000` retried 10 times in 2 seconds and then stopped because they hit `reconnectionAttempts: 10`.
Also, the `randomizationFactor` (default 0.5) adds jitter to avoid a thundering herd. If you set it to 0, all clients that dropped at the same moment retry at the same moment, overwhelming the server. Keep the default unless you have a specific reason to change it. Set `reconnectionAttempts: Infinity` if you want indefinite retries, but combine it with a reasonable `reconnectionDelayMax` (e.g., 30s).
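To see how these three options interact, the range of possible delays per attempt can be computed directly. This is a sketch of exponential backoff with jitter, assuming the delay doubles on each attempt and the cap is applied after jitter; the client's exact internals may differ, and `backoffBounds` is a hypothetical helper.

```javascript
// Return [min, max] delay in ms before reconnection attempt number `attempt`
// (0-based), for the given client options.
function backoffBounds(attempt, opts = {}) {
  const {
    reconnectionDelay = 1000,
    reconnectionDelayMax = 5000,
    randomizationFactor = 0.5,
  } = opts;
  // Assumption: the base delay doubles with every failed attempt.
  const base = reconnectionDelay * Math.pow(2, attempt);
  // Jitter spreads the delay around the base; the cap applies afterwards.
  const lower = Math.min(base * (1 - randomizationFactor), reconnectionDelayMax);
  const upper = Math.min(base * (1 + randomizationFactor), reconnectionDelayMax);
  return [Math.floor(lower), Math.floor(upper)];
}
```

With the defaults, `backoffBounds(0)` gives `[500, 1500]` and `backoffBounds(3)` gives `[4000, 5000]`. With `reconnectionDelayMax: 1000` and `randomizationFactor: 0`, every attempt lands on exactly 1000 ms, which is the synchronized retry pattern described above.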
Frequently asked questions
Why does my Socket.IO client reconnect on localhost but not in production?
This is almost always a reverse proxy or firewall issue. Check nginx/HAProxy WebSocket upgrade headers, timeout settings, and whether the path `/socket.io` is correctly proxied. Also verify CORS headers for the polling transport, as production often has stricter CORS policies.
How do I see the exact error when reconnection fails?
Listen to the `connect_error` event: `socket.on('connect_error', (err) => console.log(err.message))`. Also listen to `reconnect_error` for transport-level errors; in Socket.IO v3 and later it is emitted by the Manager, so use `socket.io.on('reconnect_error', ...)`. The error message often contains the HTTP status code or 'websocket error'. Additionally, check the browser's console for WebSocket connection failures.
Does Socket.IO automatically reconnect after a server restart?
Yes, if the client is configured with `reconnection: true` (default). After server restart, the client's next reconnection attempt (based on backoff) will establish a new connection. However, if the server was down for a long time, the client may have exhausted its retry attempts. Increase `reconnectionAttempts` or set to `Infinity`.
What's the difference between 'disconnect' and 'reconnect_attempt' events?
`disconnect` fires when the connection is lost (client or server initiated). `reconnect_attempt` fires when the client decides to try reconnecting (after the backoff delay). If you see `disconnect` but not `reconnect_attempt`, either the `reconnection` option is false or the disconnect was deliberate: with the reasons `io server disconnect` and `io client disconnect` the client does not reconnect on its own and `socket.connect()` must be called manually. If `reconnect_attempt` fires but `reconnect` does not, the attempt failed (check `reconnect_error`).
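The `disconnect` reason string can be used to tell whether a `reconnect_attempt` should be expected at all. A minimal sketch, assuming the standard reason strings from the Socket.IO client; `willAutoReconnect` is a hypothetical helper.

```javascript
// Reasons for which the client does not reconnect on its own: the
// connection was closed deliberately by one side.
const MANUAL_DISCONNECT_REASONS = new Set([
  "io server disconnect",
  "io client disconnect",
]);

// True if the client is expected to start its reconnection loop after a
// `disconnect` event with the given reason.
function willAutoReconnect(reason, reconnectionEnabled = true) {
  return reconnectionEnabled && !MANUAL_DISCONNECT_REASONS.has(reason);
}
```

Inside a `disconnect` handler, a false result means waiting for `reconnect_attempt` is pointless and the application has to call `socket.connect()` itself.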
Can a token expiration cause reconnection to fail silently?
Yes, if your server middleware rejects connections with expired tokens, the client will receive a `connect_error` event with the message (e.g., 'token expired'). But if you don't listen to that event, it appears silent. To fix, either refresh the token before reconnection or allow a grace period on the server for existing sessions.