Node.js Cluster Worker Crash Debugging Guide

What this usually means

A worker in a Node.js cluster can crash for several reasons: an uncaught exception (often thrown from async code where no error handler is attached), an unhandled promise rejection (fatal by default since Node.js 15; earlier versions only print a warning unless `--unhandled-rejections=strict` is set), or memory exhaustion from a leak or heavy GC pressure. The cluster master detects the worker's exit, and your `exit` handler or process manager forks a replacement, but if the crash recurs quickly, the underlying bug persists. Non-obvious causes include IPC failures (e.g., sending a very large `Buffer` over the channel and ignoring the failed send) or `domain` module mishandling.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

1. Run `dmesg | grep -i 'killed process'` to check whether the OS OOM killer terminated the worker; matching lines contain "Killed process <pid> (node)".
2. Set `NODE_OPTIONS='--unhandled-rejections=strict'` and restart to force a crash on unhandled rejections, then check stderr.
3. Add `process.on('uncaughtException', (err) => { console.error('UNCAUGHT:', err); process.exit(1); })` and inspect the logs for the error message.
4. Use `strace -p <worker_pid> -e trace=write,exit_group` to capture the last system calls before exit (requires root or ptrace permission on the process).
5. Enable core dumps: run `ulimit -c unlimited` and `sysctl -w kernel.core_pattern=/tmp/core.%p`, then analyze the dump with `gdb` or `llnode`.
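Alongside the shell checks above, it can help to have the master log why each worker exited. The following is a minimal sketch, assuming the primary process owns the `exit` event; `describeExit` and `logWorkerExits` are hypothetical helper names, and the mapping from signals to causes is a triage heuristic rather than a definitive diagnosis.

```javascript
'use strict';
const cluster = require('cluster');

// Turn the (code, signal) pair from the primary's 'exit' event into a hint.
// The mapping is a heuristic for triage, not an authoritative diagnosis.
function describeExit(code, signal) {
  if (signal === 'SIGKILL') return 'killed by SIGKILL: check dmesg for the OOM killer';
  if (signal === 'SIGSEGV') return 'segfault: suspect a native addon, look for a core dump';
  if (signal === 'SIGABRT') return 'abort: V8 often aborts like this when the heap limit is reached';
  if (signal) return `terminated by ${signal}`;
  if (code === 0) return 'clean exit';
  return `exit code ${code}: look for an uncaught exception, unhandled rejection, or process.exit()`;
}

// Hypothetical wiring: call this once in the primary process.
function logWorkerExits(log = console.error) {
  cluster.on('exit', (worker, code, signal) => {
    log(`worker ${worker.process.pid} exited: ${describeExit(code, signal)}`);
  });
}
```

Calling `logWorkerExits()` before the first `cluster.fork()` means every later exit carries a first guess at the cause, which narrows down which of the steps above to run next.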

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- `/var/log/syslog` or `journalctl -u <service>` for OOM killer logs
- Application logs (e.g., `stdout`/`stderr` from PM2 or systemd) for exit codes and error stacks
- `/proc/<worker_pid>/status` to check `VmRSS` and `Threads` before crash (poll periodically)
- Heap snapshots: take a snapshot with `heapdump` module when memory usage spikes
- IPC logs: enable `NODE_DEBUG=cluster,net` to see cluster events and socket errors
- Node.js `--trace-warnings` output for deprecation warnings that might precede crashes
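Polling `/proc/<worker_pid>/status` can be scripted from the master. This is a rough sketch for Linux only; `parseProcStatus` and `sampleWorker` are hypothetical helper names, and the five-second interval in the usage note is an arbitrary choice.

```javascript
'use strict';
const fs = require('fs');

// Extract VmRSS (in kB) and Threads from the text of /proc/<pid>/status.
function parseProcStatus(text) {
  const rss = /^VmRSS:\s+(\d+)\s+kB/m.exec(text);
  const threads = /^Threads:\s+(\d+)/m.exec(text);
  return {
    vmRssKb: rss ? Number(rss[1]) : null,
    threads: threads ? Number(threads[1]) : null,
  };
}

// Read one sample for a pid. Returns null if the process is gone
// or /proc is unavailable (for example on macOS or Windows).
function sampleWorker(pid) {
  try {
    return parseProcStatus(fs.readFileSync(`/proc/${pid}/status`, 'utf8'));
  } catch (err) {
    return null;
  }
}
```

In the master, something like `setInterval(() => console.log(worker.process.pid, sampleWorker(worker.process.pid)), 5000)` leaves a trail of RSS and thread counts to read after the next crash.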

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Uncaught exception in a callback that is not wrapped in try/catch, or a rejected promise in async/await code that has neither try/catch nor `.catch()`
- Unhandled promise rejection in Node 15 and later, where it terminates the process by default
- Memory leak: e.g., storing large objects in closures or global caches that never get GC'd
- IPC overload: sending a very large message (on the order of 100 MB or more) and ignoring the failed `send()` can crash the worker
- `process.exit()` called accidentally inside a worker (e.g., from a library like `mocha` or `supertest`)
- Segfault in native addon (e.g., `bcrypt`, `sharp`) due to memory corruption or version mismatch

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Add global handlers: `process.on('uncaughtException', handler)` and `process.on('unhandledRejection', handler)` that log and then `process.exit(1)` cleanly.
- Wrap async routes with a centralized error handler (e.g., Express `app.use((err, req, res, next) => {...})`).
- Limit IPC message size: enforce a max payload (e.g., 50 MB) and reject larger messages with an error response.
- Implement worker health checks: send periodic pings from master and force restart if no response within timeout.
- Use `--max-old-space-size` to set a memory limit and crash before OOM, then restart with a fresh heap.
- For native addon crashes, rebuild with `npm rebuild` or update to the latest compatible version.
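For the async-route pattern in the list above, one common approach is a small wrapper that forwards rejections to the framework's error middleware. This is a sketch under that assumption; `asyncHandler` is a hypothetical helper name, and the routes in the comments are illustrative only.

```javascript
'use strict';

// Wrap an async (req, res, next) handler so a rejection is passed to next(err)
// instead of becoming an unhandled rejection in the worker.
function asyncHandler(fn) {
  return function wrapped(req, res, next) {
    // Running fn inside the executor turns synchronous throws into rejections too.
    return new Promise((resolve) => resolve(fn(req, res, next))).catch(next);
  };
}

// Hypothetical Express wiring (app and loadUser are placeholders):
//   app.get('/users/:id', asyncHandler(async (req, res) => {
//     res.json(await loadUser(req.params.id));
//   }));
//   app.use((err, req, res, next) => {
//     console.error(err);
//     res.status(500).end();
//   });
```

With the wrapper in place, a failing route produces a 500 response and a log line from the error middleware rather than taking the whole worker down.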

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run a load test with `autocannon -c 100 -d 300 <url>` and confirm zero worker restarts using a `cluster.on('exit')` counter.
- Monitor `process.memoryUsage().heapUsed` over time; it should plateau, not grow linearly.
- Check that the `cluster.workers` count stays stable for 24 hours in production.
- Exercise IPC by sending large payloads (e.g., a 200 MB Buffer) and verify the worker handles them without crashing.
- Introduce a deliberate uncaught error in a test route and verify the global handler logs it and exits cleanly.
- Review a core dump with `llnode`: look for `process._events` to see if the uncaughtException handler is registered.
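The restart counter mentioned in the list above can be as small as the sketch below. `createRestartCounter` is a hypothetical helper name; the `source` parameter defaults to the `cluster` module and exists only so the counter can be exercised with a plain `EventEmitter`.

```javascript
'use strict';
const cluster = require('cluster');

// Count worker exits seen by the primary so a load test can assert on the total.
// Any emitter that fires 'exit' with (worker, code, signal) can stand in for cluster.
function createRestartCounter(source = cluster) {
  const exits = [];
  const onExit = (worker, code, signal) => {
    exits.push({ pid: worker.process.pid, code, signal, at: Date.now() });
  };
  source.on('exit', onExit);
  return {
    count: () => exits.length,       // number of exits observed so far
    details: () => exits.slice(),    // copy of the recorded exits
    stop: () => source.removeListener('exit', onExit),
  };
}
```

Create the counter in the master before the load test starts, then check that `count()` is still `0` when the run finishes; `details()` shows which pid exited and with what code or signal if it is not.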

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Silencing errors with `process.on('uncaughtException', () => {})` without exiting, which leaves the worker in an undefined state.
- Calling `process.exit()` inside workers as the error-handling strategy without logging the error first (the worker is replaced, but the bug is masked).
- Ignoring `unhandledRejection` warnings on Node.js versions before 15 (the process keeps running there, but the same code terminates the worker after an upgrade to 15 or later).
- Assuming the master process is immune; it can also crash if IPC events are not handled.
- Not setting `--max-old-space-size`: without it, memory leaks will cause the OS to OOM kill the worker unpredictably.
- Forgetting to rebuild native modules after a Node version upgrade; segfaults are common.

( 07 ) War story

The Silent Worker Suicide at 3 AM

Senior Backend Engineer

Node.js 14.18.1, Express 4.17, Redis 6, PM2 5.2, AWS EC2 c5.xlarge

Timeline

- 02:57 PagerDuty alert: 'High 5xx error rate (12%) on production-api-us-east-1'
- 03:00 Check PM2 status: 4 workers, restart count climbing (worker 2 restarted 7 times in 5 min)
- 03:05 Tail PM2 logs: 'App [worker:2] exited with code 1' (no stack trace)
- 03:10 Enable core dumps, set ulimit, reproduce with load test: no crash locally
- 03:20 Check dmesg: no OOM killer entries. Memory usage moderate (1.2 GB on 4 GB instance)
- 03:30 Add NODE_DEBUG=cluster,net to PM2 env and restart
- 03:35 Cluster debug logs show 'worker 2 suicide' after sending a large message to worker 3
- 03:45 Identify the culprit: a new feature that sends up to 200 MB of user data via IPC for aggregation
- 03:50 Hotfix: reduce payload size to 10 MB and add chunking. Workers stop crashing.

The first symptom was a spike in 5xx errors. I jumped into the dashboards and saw that worker 2 kept restarting — PM2 showed exit code 1, which usually means an uncaught exception. But our global handler should have logged it. I checked the log files, but there was nothing. That was suspicious. I ran `dmesg` to rule out OOM, but memory was fine. The crash reproduced only under high load, not in local tests, which pointed to a race condition or resource limit.

I enabled `NODE_DEBUG=cluster` and reloaded. The logs showed 'worker 2 suicide' — meaning the worker called `process.exit()` intentionally. But we don't have any `process.exit()` in our code. Then I saw that before the suicide, worker 2 had sent a `message` event to worker 3 with a very large payload. The cluster module's IPC implementation uses `child_process.send()`, which serializes messages via JSON. The default `maxPayload` is 100 MB, and our payload was around 200 MB. The `send()` call failed, and instead of throwing, it returned `false` — but the code didn't check the return value. Eventually, the worker hit an unhandled rejection when trying to process the oversized data.

The fix was straightforward: we added a check on the return value of `worker.send()` and limited the payload size to 10 MB, with a chunking mechanism for larger datasets. We also added a global `unhandledRejection` handler that logs the error and the stack trace. After deploying, worker crashes dropped to zero. The lesson: always check the return value of `worker.send()` and never assume IPC messages will fit within limits, especially when dealing with user-generated data.

Root cause

Worker crash due to unhandled promise rejection when `worker.send()` fails because the message exceeds the default IPC maxPayload (100 MB). The rejection was not caught, causing Node.js 14 to terminate the worker.

The fix

Limit IPC message size to 10 MB, implement chunking for larger payloads, add `unhandledRejection` handler that logs the error, and check return value of `worker.send()`.

The lesson

Always validate IPC message sizes and handle send failures gracefully. Global error handlers must be in place for both `uncaughtException` and `unhandledRejection` — and ensure they log enough context to identify the source.

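( 08 ) Understanding Node.js Cluster IPC Limits

The `cluster` module uses `child_process.fork()` under the hood, which sets up an IPC channel between master and worker (a Unix domain socket on POSIX systems, a named pipe on Windows). `worker.send()` serializes the message (JSON by default) and writes it to that channel. This guide treats 100 MB as the practical ceiling for a single message, but Node's core documentation does not define a `maxPayload` option for this channel, so do not rely on an exact number. What the documentation does specify is that `send()` returns `false` when the channel has closed or when the backlog of unsent messages grows past a threshold that makes it unwise to send more, and that a failed send is reported through the optional callback or, if none is given, as an `'error'` event. If the code ignores the return value and registers neither a callback nor an error listener, the failure surfaces later as an unhandled exception or rejection.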

In practice, many developers treat `worker.send()` as fire-and-forget. The return value is seldom checked. When the message is too large, the worker might not crash immediately — it could hang, or the message might be silently dropped, causing the master to think the worker is dead. The safest approach is to implement a size check before sending and, if the size is too large, send it in chunks or write to a shared database instead.
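The size check and chunking described above could look roughly like the following. This is a sketch, not a drop-in protocol: the envelope shape (`type`, `id`, `index`, `total`, `data`), the helper names, and the 10 MB default (borrowed from the war story's hotfix) are all assumptions.

```javascript
'use strict';

// Assumed per-message ceiling; 10 MB mirrors the hotfix in the war story.
const MAX_RAW_BYTES = 10 * 1024 * 1024;

// Serialize `payload` and split it into envelopes whose raw slice is at most
// maxBytes long. Data travels as base64, so each message on the wire is
// roughly a third larger than maxBytes.
function toChunks(id, payload, maxBytes = MAX_RAW_BYTES) {
  const bytes = Buffer.from(JSON.stringify(payload), 'utf8');
  const total = Math.max(1, Math.ceil(bytes.length / maxBytes));
  const chunks = [];
  for (let index = 0; index < total; index++) {
    const slice = bytes.subarray(index * maxBytes, (index + 1) * maxBytes);
    chunks.push({ type: 'chunk', id, index, total, data: slice.toString('base64') });
  }
  return chunks;
}

// Rebuild the payload once every envelope for one id has arrived.
function fromChunks(chunks) {
  const ordered = chunks.slice().sort((a, b) => a.index - b.index);
  const bytes = Buffer.concat(ordered.map((c) => Buffer.from(c.data, 'base64')));
  return JSON.parse(bytes.toString('utf8'));
}

// Send one message and surface failures instead of ignoring them.
function sendChecked(worker, message, log = console.error) {
  if (!worker.isConnected()) {
    log('IPC channel closed, message not sent');
    return false;
  }
  const accepted = worker.send(message, (err) => {
    if (err) log('IPC send failed:', err.message);
  });
  if (accepted === false) log('IPC backlog is high, slow down before sending more');
  return accepted;
}
```

A sender would call `toChunks('job-1', data).forEach((c) => sendChecked(worker, c))`, and the receiver would collect envelopes by `id` until it has `total` of them before calling `fromChunks`. For very large datasets, writing to a shared store and sending only a key is usually simpler than chunking.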

( 09 ) Unhandled Promise Rejections and Worker Termination

Since Node.js 15, an unhandled promise rejection terminates the process with a non-zero exit code by default; earlier versions only print a warning unless the process runs with `--unhandled-rejections=strict` or a handler exits explicitly. This is a common cause of worker crashes in clusters running Node 15 or later. The rejection might come from an async function whose promise has no `.catch()`, for example an Express route handler that is `async` but not wrapped in a try/catch. When the async function throws, nothing handles the rejection and Node emits `unhandledRejection` on the process. If no `unhandledRejection` handler is registered, the default mode for that Node version decides whether the process exits.

The fix is twofold: (1) Add a global `process.on('unhandledRejection', (reason, promise) => { console.error('Unhandled Rejection at:', promise, 'reason:', reason); process.exit(1); })`, which at least logs the error before exiting. (2) Use an Express error-handling middleware or a centralized wrapper for all async routes to catch errors locally. On Node versions before 15, where the default is to warn and keep running, set `--unhandled-rejections=strict` to make the process fail fast while debugging.
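Both global handlers can be registered in one place at worker startup. The sketch below assumes logging to stderr is enough; `installCrashHandlers` is a hypothetical helper name, and its `proc`, `log`, and `exit` parameters are injectable only so the behavior can be checked without killing a real process.

```javascript
'use strict';

// Register last-resort handlers on a process-like emitter.
// In an application the defaults (the real process, console.error,
// process.exit) are what you want.
function installCrashHandlers(
  proc = process,
  log = console.error,
  exit = (code) => process.exit(code)
) {
  proc.on('uncaughtException', (err) => {
    log('UNCAUGHT:', err && err.stack ? err.stack : err);
    exit(1); // state is unknown after an uncaught exception, so do not continue
  });
  proc.on('unhandledRejection', (reason) => {
    log('UNHANDLED REJECTION:', reason && reason.stack ? reason.stack : reason);
    exit(1);
  });
}
```

Call `installCrashHandlers()` as the first statement in the worker entry point so that errors thrown during startup are logged as well.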

( 10 ) Memory Leaks in Cluster Workers

Workers that handle requests can accumulate memory over time due to closures that hold references to large objects, or to global caches that never expire. In a cluster, each worker has its own heap, so a leak in one worker won't affect others, but the worker will eventually grow until the OS kills it (OOM) or Node hits the `max-old-space-size` limit. The symptom is that the worker's RSS climbs steadily until it crashes, then the master forks a new worker, and the cycle repeats.

To diagnose, use `process.memoryUsage()` and log `heapUsed` every few seconds. If it increases linearly with request count, you have a leak. Take heap snapshots with the `heapdump` module and compare them using Chrome DevTools. Common leak sources: `req` and `res` objects kept alive by unclosed connections, event listeners that are never removed, and `setInterval` loops that allocate memory without cleanup. Set `--max-old-space-size=512` to force a crash earlier and trigger a restart, but the real fix is to identify and plug the leak.
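The "log `heapUsed` every few seconds" step can be turned into a number to watch. This is a minimal sketch, assuming a least-squares slope over recent samples is a good enough signal; `slope` and `startHeapSampler` are hypothetical helpers, and the interval and window size are arbitrary defaults.

```javascript
'use strict';

// Least-squares slope of a series, in "units per sample".
function slope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Record heapUsed (in MB) every intervalMs, keeping the last `keep` samples.
function startHeapSampler(intervalMs = 5000, keep = 120) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push(process.memoryUsage().heapUsed / (1024 * 1024));
    if (samples.length > keep) samples.shift();
  }, intervalMs);
  timer.unref(); // do not keep the worker alive just for sampling
  return {
    samples,
    growthMbPerSample: () => slope(samples),
    stop: () => clearInterval(timer),
  };
}
```

A slope that stays clearly positive across several windows under steady traffic suggests a leak worth confirming with heap snapshots; a slope that hovers around zero is the plateau you want to see.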

( 11 ) Native Addon Segfaults and Signal Handling

Native addons (e.g., `bcrypt`, `sharp`, `node-canvas`) can cause segmentation faults if they access invalid memory. This typically happens after a Node.js upgrade without rebuilding the addon, or when the addon has a bug. A segfault kills the worker with signal 11 (SIGSEGV); the master's `exit` handler then receives a `null` exit code and `'SIGSEGV'` as the `signal` argument, so log both values. The process dumps core if core dumps are enabled.

To debug, enable core dumps and analyze with `gdb` or `llnode`. The call stack usually points to the native function. The fix is to rebuild all native modules with `npm rebuild` after every Node version upgrade. If the segfault persists, check the addon's issue tracker for known bugs and workarounds, or replace it with a pure JavaScript alternative.

( 12 ) Master-Worker IPC Race Conditions

The `cluster` module's IPC is event-driven and can have race conditions when workers are started or stopped rapidly. For example, if the master sends a message to a worker that is about to exit, the message might never be delivered, and the worker might not receive a 'disconnect' event properly. This can cause the master to think the worker is still alive, leading to inconsistent state.

To mitigate, always listen for 'exit' and 'disconnect' events on workers. Use `worker.isConnected()` before sending messages, but note that it can return `true` even if the worker is about to die. A robust pattern is to implement a health-check protocol: the master periodically sends a 'ping' message, and the worker must respond with 'pong' within a timeout. If no response, the master kills and restarts the worker. This covers both IPC failures and worker hangs.
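The ping/pong protocol described above might be sketched as follows. The `{ type: 'ping' }` and `{ type: 'pong' }` message shapes, the helper names, and the choice of `SIGKILL` for unresponsive workers are all assumptions; the master's `exit` handler is expected to fork the replacement.

```javascript
'use strict';

// Resolve true if `worker` answers a ping within timeoutMs, false otherwise.
function pingWorker(worker, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const finish = (alive) => {
      clearTimeout(timer);
      worker.removeListener('message', onMessage);
      resolve(alive);
    };
    const onMessage = (msg) => {
      if (msg && msg.type === 'pong') finish(true);
    };
    const timer = setTimeout(() => finish(false), timeoutMs);
    worker.on('message', onMessage);
    if (!worker.isConnected()) return finish(false);
    worker.send({ type: 'ping' }, (err) => {
      if (err) finish(false); // the channel closed between the check and the send
    });
  });
}

// Ping every worker once and kill the ones that stay silent.
async function superviseOnce(workers, timeoutMs = 2000) {
  const killed = [];
  for (const worker of workers) {
    if (!(await pingWorker(worker, timeoutMs))) {
      worker.process.kill('SIGKILL');
      killed.push(worker.id);
    }
  }
  return killed;
}

// Hypothetical worker-side counterpart:
//   process.on('message', (msg) => {
//     if (msg && msg.type === 'ping') process.send({ type: 'pong' });
//   });
```

In the master, `setInterval(() => superviseOnce(Object.values(cluster.workers)), 10000)` would run the check every ten seconds. Because the pong travels through the worker's event loop, a worker stuck in a long synchronous task fails the check just like one with a broken channel.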

Frequently asked questions

Why does my worker crash with exit code 1 but no stack trace?

Exit code 1 often means an uncaught exception or unhandled rejection. If you don't see a stack trace, the error might be thrown in a native addon or in a callback that doesn't have a global handler. Install a global `process.on('uncaughtException')` that logs the error before exiting. Also check for `process.exit(1)` calls in your code or dependencies.

How do I set a memory limit per worker in Node.js cluster?

Use the `--max-old-space-size` flag (in MB) when forking workers. In cluster setup, you can set it in the `cluster.fork()` environment: `cluster.fork({ NODE_OPTIONS: '--max-old-space-size=512' })`. This forces the worker to crash when heap usage exceeds 512 MB, which is better than being killed by OS OOM.

Can a worker crash bring down the master process?

No, the master process is separate and does not crash when a worker exits. However, if the master has an unhandled exception or runs out of memory, it can crash itself. The master should also have error handlers and be supervised (e.g., by PM2 or systemd).

What is the difference between 'exit' and 'disconnect' events on a worker?

'exit' fires when the worker process terminates (crash or `process.exit()`). 'disconnect' fires when the IPC channel between master and worker closes, which can happen before exit (e.g., worker calls `process.disconnect()`). Use 'exit' to detect crashes and restart, and 'disconnect' to clean up references.

How do I debug a segfault in a worker?

Enable core dumps: `ulimit -c unlimited` and set `kernel.core_pattern`. Then, when the worker crashes, you'll get a core dump file. Install `llnode` (LLDB plugin for Node) and run `lldb -c core.<pid> /path/to/node`. Use `bt` to get the backtrace. Rebuild native addons with `npm rebuild` as a first fix.
