LEARN · DEBUGGING GUIDE

Node.js Heap Snapshot Analysis: Finding the Leak That Survived GC

Heap snapshots reveal objects that garbage collection won't touch. This guide shows you how to compare snapshots, trace retaining paths, and identify the exact leak source in production Node.js apps.

Advanced · Memory · 6 min read

What this usually means

A memory leak in Node.js means there are objects that the garbage collector cannot free because they are still referenced from the root set (global variables, closures, caches, event emitters, or timers). When you take a heap snapshot, you see all live objects. The key diagnostic is comparing two snapshots taken at different times after forcing GC: if a class of objects grows monotonically and those objects are not being freed, you've found your leak. The retaining path tells you exactly which variable or closure holds the reference.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run `node --inspect` (or `node --inspect-brk` to pause before the first line of user code), with `--max-old-space-size=4096` so the process has enough heap headroom to survive until the second snapshot
  2. Open `chrome://inspect` in Chrome and connect to the Node process
  3. Take two heap snapshots, forcing GC before each: one shortly after startup and another after a few hours of load
  4. Select the later snapshot, switch the view from 'Summary' to 'Comparison' against the first, and filter by your app's class names; look for classes whose object count grew significantly
  5. Click a suspect object and examine its retainers: follow the chain from the object toward the GC root until you reach something your code owns
  6. Check whether that retainer is a global variable, a module-level cache, or an event listener attached to a global emitter
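
When DevTools cannot be attached, the same two-snapshot routine can be scripted. The sketch below uses Node's built-in `v8.writeHeapSnapshot()`; the helper name `takeSnapshot`, the label argument, and the choice of the OS temp directory are assumptions made for illustration. `global.gc` only exists when the process is started with `--expose-gc`.

```javascript
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Hypothetical helper: force a GC when possible, then write a snapshot file.
function takeSnapshot(label, dir = os.tmpdir()) {
  // global.gc is only defined when node runs with --expose-gc.
  if (typeof global.gc === 'function') {
    global.gc();
  }
  const file = path.join(dir, `${label}-${process.pid}-${Date.now()}.heapsnapshot`);
  // Synchronous: the event loop is blocked until the file is written.
  return v8.writeHeapSnapshot(file);
}
```

Calling `takeSnapshot('baseline')` shortly after startup and `takeSnapshot('after-load')` a few hours later yields two files that can be loaded into the DevTools Memory tab and compared.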
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Heap snapshot comparison view in DevTools (Memory tab)
  • Retaining tree for suspected object types (e.g., 'Buffer', '(closure)', 'Array')
  • Node.js process memory: `process.memoryUsage()` logged every minute
  • Application code: global caches (Map, Set, arrays) that never clear entries
  • Event emitters: listeners registered on `process` or global singletons without removal
  • Third-party libraries: ORM result caches (e.g., Sequelize, Mongoose), Redis client, or HTTP agent keep-alive connections
  • Native bindings: if you use `node-ffi` or addons built with `node-gyp`, check for unmanaged memory in C++ objects
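
For the once-a-minute `process.memoryUsage()` log, a minimal sketch might look like the following. The function names and the log format are assumptions; the fields read from `process.memoryUsage()` are the ones Node reports.

```javascript
// Format one process.memoryUsage() sample as a single log line in MiB.
function formatMemoryUsage(usage = process.memoryUsage()) {
  const mib = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  return [
    `rss=${mib(usage.rss)}MiB`,
    `heapTotal=${mib(usage.heapTotal)}MiB`,
    `heapUsed=${mib(usage.heapUsed)}MiB`,
    `external=${mib(usage.external)}MiB`,
    `arrayBuffers=${mib(usage.arrayBuffers ?? 0)}MiB`,
  ].join(' ');
}

// Log on an interval; unref() so this timer alone never keeps the process alive.
// Returns a function that stops the logger.
function startMemoryLogger(intervalMs = 60_000, log = console.log) {
  const timer = setInterval(() => log(`[mem] ${formatMemoryUsage()}`), intervalMs);
  timer.unref();
  return () => clearInterval(timer);
}
```

Logging `rss` and `external` next to `heapUsed` makes it easier to tell later whether growth is inside or outside the V8 heap.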
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Module-level variable holding a growing collection (e.g., `const cache = {}` that never evicts)
  • Closures capturing large objects in async callbacks that don't release references
  • Event listeners attached to global emitters (like `process.on('message', ...)`) and never removed
  • Timers (setInterval) that reference large objects and are never cleared
  • Streams not properly destroyed: unclosed Readable/Writable streams holding buffers
  • Object pooling implemented incorrectly: the pool never releases objects for GC
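
The listener case is easy to reproduce in isolation. In this sketch the emitter `bus`, the event name `config-changed`, and both handler functions are hypothetical; `listenerCount` is used only to make the leak visible.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a long-lived, process-wide emitter.
const bus = new EventEmitter();
bus.setMaxListeners(0); // demo only: disables the max-listeners warning

// Leaky: every call adds a listener that captures `payload` and is never removed.
function handleRequestLeaky(payload) {
  bus.on('config-changed', () => payload.length);
}

// Safer: keep a reference to the listener and return a disposer for it.
function handleRequest(payload) {
  const onChange = () => payload.length;
  bus.on('config-changed', onChange);
  return () => bus.off('config-changed', onChange);
}

for (let i = 0; i < 100; i++) handleRequestLeaky(Buffer.alloc(1024));
const leaked = bus.listenerCount('config-changed');

for (let i = 0; i < 100; i++) {
  const dispose = handleRequest(Buffer.alloc(1024));
  dispose(); // called when the request is finished
}
const afterCleanup = bus.listenerCount('config-changed');
```

Outside a demo, leaving the default listener limit in place is useful: Node prints a `MaxListenersExceededWarning` once more than ten listeners are registered for one event, which is often the first visible hint of this kind of leak.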
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Replace unbounded caches with LRU caches (e.g., `lru-cache` npm package)
  • Explicitly remove event listeners when done: `emitter.removeListener(name, fn)` or use `once`
  • Clear timer handles: `clearInterval(handle)` when no longer needed
  • Use WeakMap or WeakSet for caches that should not prevent GC of keys
  • For stream leaks: call `stream.destroy()` on error or early exit, or use `stream.pipeline()`, which destroys every stream in the chain when one of them fails
  • To confirm a fix in development, force GC with `global.gc()` (requires the `--expose-gc` flag), then snapshot
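
To illustrate the bounded-cache idea without depending on a particular package version, here is a hand-rolled sketch that relies on `Map` preserving insertion order. It is not the `lru-cache` API; the class name `BoundedCache` and its methods are assumptions.

```javascript
// Minimal LRU cache built on Map's insertion order.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }

  get size() {
    return this.map.size;
  }
}

const cache = new BoundedCache(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // 'a' is now the most recently used entry
cache.set('c', 3); // exceeds the limit, so 'b' is evicted
```

The property that matters for leak hunting is that `size` can never exceed `maxEntries`, whatever the traffic pattern.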
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • After the fix, deploy and monitor `process.memoryUsage().heapUsed` over 48 hours under production load
  • Take heap snapshots at the same intervals as before; confirm the suspect class no longer grows
  • Run a stress test that previously caused OOM; measure RSS stability
  • Check GC stats: `node --trace-gc` output should show old space size plateauing
  • Use the `heapdump` module or the built-in `v8.writeHeapSnapshot()` to automate a snapshot when memory crosses a threshold
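
One way to turn "no longer grows" into a number is to fit a line through periodic `heapUsed` samples and look at the slope. The sketch below is an ordinary least-squares fit; the sample shape `{ t, heapUsed }` (milliseconds and bytes) and the function name are assumptions, and what counts as an acceptable slope depends on the application.

```javascript
// Least-squares slope of heapUsed (bytes) against time (ms), as bytes per hour.
function heapGrowthPerHour(samples) {
  const n = samples.length;
  if (n < 2) throw new Error('need at least two samples');
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanH = samples.reduce((sum, p) => sum + p.heapUsed, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.heapUsed - meanH);
    den += (p.t - meanT) ** 2;
  }
  if (den === 0) throw new Error('samples must span more than one timestamp');
  return (num / den) * 3_600_000;
}
```

A slope that stays near zero across the 48-hour window, measured on samples taken after GC, is stronger evidence than eyeballing a dashboard.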
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Taking only one snapshot: a single snapshot tells you nothing about growth
  • Forgetting to force GC before each snapshot: call `global.gc()` or press the trash icon in DevTools
  • Focusing on total heap size without looking at object count per class
  • Assuming all '(string)' objects are the leak: V8 internalizes many strings (such as property names and literals), so a large count is often shared, normal usage
  • Not checking native memory: the V8 heap may be flat while C++ bindings leak (compare `rss` and `external` from `process.memoryUsage()` against `heapUsed`)
  • Adding more memory instead of fixing the leak: this only delays the crash
( 07 ) War story

The Ever-Growing Set of Unsubscribed WebSocket Clients

Senior Backend Engineer · Node.js 18, Express, ws (WebSocket), Redis, Docker on Kubernetes

Timeline

  1. 08:00 Deploy v2.3 with new real-time collaboration feature
  2. 14:00 First OOM kill on pod with 2GB memory limit after 6h of uptime
  3. 14:15 Double memory limit to 4GB; pod crashes again after 8h
  4. 15:00 Enable --expose-gc and heapdump; restart with --trace-gc
  5. 15:30 Take first heap snapshot after GC
  6. 17:00 Take second heap snapshot after 1.5h of real traffic
  7. 17:05 Compare snapshots: found 50k 'WebSocket' objects in old space
  8. 17:10 Trace retaining path: all held by a Map in the CollaborationManager singleton
  9. 17:15 Identify bug: on 'close' event, client removed from Set but not from an internal Map
  10. 17:30 Deploy fix: ensure both Set and Map are cleaned up

We shipped a new real-time collaboration feature using WebSockets. Within hours of deploying v2.3, pods started getting OOMKilled. At first I thought it was just normal memory growth under load—our traffic had increased. But after doubling the memory limit and seeing the same crash pattern, I knew it was a leak.

I connected Chrome DevTools to the Node process and took two heap snapshots 90 minutes apart, forcing GC before each. The comparison view showed WebSocket objects accumulating steadily, with roughly 50k retained by the second snapshot. The retaining path pointed to a Map inside a global singleton called CollaborationManager, which also held the Set that was supposed to track active clients.

The code that removed a client from the Set on disconnect was executing, but there was also a Map storing client metadata. The 'close' event handler only removed the client from the Set, not from the Map. Over time, the Map accumulated dead entries with references to the WebSocket objects. The fix was one line: delete the Map entry too. Memory plateaued immediately.

Root cause

A Map in a global singleton retained references to disconnected WebSocket clients because the 'close' handler only removed them from a Set, not the Map.

The fix

Add `this.clientsById.delete(client.id)` in the same 'close' handler that removes the client from the Set.
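
A reduced sketch of that handler is shown below. Only `clientsById` and `client.id` come from the story; the class layout, the `activeClients` field, and the `addClient`/`removeClient` methods are assumptions, and a plain `EventEmitter` stands in for a `ws` socket.

```javascript
const { EventEmitter } = require('events');

class CollaborationManager {
  constructor() {
    this.activeClients = new Set(); // tracks connected clients
    this.clientsById = new Map();   // per-client metadata, keyed by id
  }

  addClient(client) {
    this.activeClients.add(client);
    this.clientsById.set(client.id, { socket: client.socket, joinedAt: Date.now() });
    // `once` removes the listener after it fires, so it is not left behind.
    client.socket.once('close', () => this.removeClient(client));
  }

  removeClient(client) {
    this.activeClients.delete(client);
    this.clientsById.delete(client.id); // the cleanup that was missing
  }
}

// Usage with a fake socket.
const manager = new CollaborationManager();
const client = { id: 'c1', socket: new EventEmitter() };
manager.addClient(client);
client.socket.emit('close');
```

Routing every removal through one `removeClient` method keeps the two collections from drifting apart when a third one is added later.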

The lesson

Always double-check that all collections are cleaned up in removal logic. Use heap snapshot comparison early when you see linear memory growth.

( 08 ) Reading the Comparison View: What to Look For

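When you open the Comparison view in Chrome DevTools, each row is a constructor with columns for # New, # Deleted, and # Delta, plus the matching size columns (Alloc. Size, Freed Size, Size Delta). Focus on # Delta: a positive number means more objects of that type were allocated than freed between the two snapshots. Sort by # Delta descending.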

Deprioritize built-in categories like (string), (number), and (compiled code): they grow with normal usage and are often shared. Look first for your own class names and for library objects such as 'Socket' or 'Buffer', then for closures and for generic 'Array' or 'Object' entries with large deltas.

If you see a large delta in 'Array' or 'Object', click on one and examine its retaining path. Often the retaining tree will show a closure or a global variable holding the reference.
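
The same delta can be approximated offline from two parsed `.heapsnapshot` files. The sketch below assumes the JSON layout V8 currently writes (a flat `nodes` array described by `snapshot.meta.node_fields`, with names stored as indexes into `strings`); that layout is an internal format and may change between versions. The function names are hypothetical, and very large snapshots may be too big to load with `JSON.parse`.

```javascript
// Count 'object'-type nodes per constructor name in a parsed heap snapshot.
function countByConstructor(snap) {
  const fields = snap.snapshot.meta.node_fields;
  const typeNames = snap.snapshot.meta.node_types[0];
  const stride = fields.length;
  const typeOffset = fields.indexOf('type');
  const nameOffset = fields.indexOf('name');
  const counts = new Map();
  for (let i = 0; i < snap.nodes.length; i += stride) {
    // Skip strings, code, and other non-object nodes.
    if (typeNames[snap.nodes[i + typeOffset]] !== 'object') continue;
    const name = snap.strings[snap.nodes[i + nameOffset]];
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return counts;
}

// Rows with a non-zero count delta, largest growth first.
function diffCounts(before, after, top = 20) {
  const a = countByConstructor(before);
  const b = countByConstructor(after);
  const names = new Set([...a.keys(), ...b.keys()]);
  return [...names]
    .map((name) => ({ name, delta: (b.get(name) || 0) - (a.get(name) || 0) }))
    .filter((row) => row.delta !== 0)
    .sort((x, y) => y.delta - x.delta)
    .slice(0, top);
}
```

This only reproduces the count delta; the retaining paths still need DevTools.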

( 09 ) Retaining Path: Following the Breadcrumb Trail

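The retaining path is the chain of references from a GC root (in Node.js, for example the global object) down to the object. Starting from the object, follow the retainers toward the root; the first retainer that your code owns is usually the culprit. Common culprits: 'system / Context' (the variables captured by a closure), '(closure)' (a function that is itself still referenced), or 'Object' (often a module-level object).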

Select an object in the upper pane and the Retainers pane below it lists every object that references it. This is useful if the object is held by multiple references: you need to break all of them.

If the retaining path shows a 'Timeout' (the object behind `setTimeout` and `setInterval`) or an 'EventEmitter', check that the corresponding timer is cleared or the listener removed when no longer needed.
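
As an illustration of the timer case, the sketch below shows an object whose interval callback closes over `this`, so the active timer retains the whole instance until `clearInterval` runs. The `MetricsBuffer` class and its methods are hypothetical.

```javascript
class MetricsBuffer {
  constructor(flush, intervalMs = 1000) {
    this.samples = [];
    this.flush = flush;
    // The arrow function captures `this`; while the timer is active,
    // the instance and everything in `samples` stay reachable.
    this.timer = setInterval(() => this.drain(), intervalMs);
    this.timer.unref(); // does not keep the process alive, but still retains the callback
  }

  record(value) {
    this.samples.push(value);
  }

  drain() {
    const batch = this.samples;
    this.samples = [];
    if (batch.length > 0) this.flush(batch);
  }

  close() {
    clearInterval(this.timer); // breaks the retaining path from the timer
    this.timer = null;
    this.drain();
  }
}

// Usage
const flushed = [];
const buffer = new MetricsBuffer((batch) => flushed.push(...batch));
buffer.record(1);
buffer.record(2);
buffer.close();
```

In a snapshot, an instance that was never closed would typically appear with a 'Timeout' among its retainers.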

( 10 ) Automating Snapshot Collection in Production

Use the `heapdump` npm module to write snapshots to disk on demand. You can trigger it via a signal or a memory threshold. Example: `process.on('SIGUSR2', () => heapdump.writeSnapshot('./snapshot.heapsnapshot'));`
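
The same trigger can be sketched with the built-in `v8` module instead of an extra dependency. The function name, its options object, and the injectable `write` parameter (there so the handler can be exercised without writing a real snapshot) are assumptions.

```javascript
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Write a heap snapshot each time `signal` is received.
// Returns a function that removes the handler again.
function installSnapshotSignal({
  signal = 'SIGUSR2',
  dir = os.tmpdir(),
  write = v8.writeHeapSnapshot,
} = {}) {
  const handler = () => {
    const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
    // Synchronous: requests are not served while the snapshot is being written.
    console.error(`[heap] wrote ${write(file)}`);
  };
  process.on(signal, handler);
  return () => process.off(signal, handler);
}
```

With this installed, `kill -USR2 <pid>` produces a snapshot on demand. Node also accepts a `--heapsnapshot-signal=SIGUSR2` command-line flag that does the same without any code.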

Combine with `process.memoryUsage()` logged every 60 seconds. Set an alert if `heapUsed` grows by more than 10% over 30 minutes.
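
The 10%-over-30-minutes rule can be expressed as a small sliding-window check. This is a sketch: the factory name, the option names, and the decision to compare against the oldest sample still inside the window are assumptions.

```javascript
// Returns a check(heapUsed, t) function that reports true when heapUsed has
// grown by more than `threshold` relative to the oldest sample in the window.
function createGrowthAlert({ windowMs = 30 * 60_000, threshold = 0.10 } = {}) {
  const samples = []; // { t, heapUsed }, oldest first
  return function check(heapUsed, t = Date.now()) {
    samples.push({ t, heapUsed });
    // Drop samples that fell out of the window; the newest one always stays.
    while (samples.length > 1 && t - samples[0].t > windowMs) {
      samples.shift();
    }
    const baseline = samples[0].heapUsed;
    return baseline > 0 && (heapUsed - baseline) / baseline > threshold;
  };
}
```

Feeding it `process.memoryUsage().heapUsed` once per minute keeps at most about thirty samples in memory, so the monitor itself stays bounded.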

For Kubernetes, deploy a sidecar that periodically takes snapshots and uploads them to object storage. This allows post-mortem analysis without DevTools connection.

( 11 ) Common Pitfall: The 'Closure' Leak That Hides in Async Callbacks

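A particularly nasty leak occurs when an async callback captures a large object in its closure and the operation never settles (e.g., a callback that is never called) while something long-lived, such as a pending-requests Map or an event emitter, still references that callback. The entire captured scope stays alive.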

To catch this, look in the comparison view for '(closure)' entries. Expand one and look at the variable names — they often reveal the captured data.

A fix: avoid capturing large objects in async callbacks. Use WeakRef or move the data to a cache that can be cleared.
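
A common shape of this leak is a registry of pending callbacks keyed by request id. The sketch below bounds how long an entry can live by pairing each one with a timeout; the `pending` Map and both function names are hypothetical.

```javascript
// Pending replies keyed by request id. Each stored function keeps its
// promise's resolve/reject, and whatever they capture, reachable.
const pending = new Map();

function waitForReply(id, timeoutMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      pending.delete(id); // release the closure if no reply ever arrives
      reject(new Error(`request ${id} timed out`));
    }, timeoutMs);
    pending.set(id, (reply) => {
      clearTimeout(timer);
      pending.delete(id);
      resolve(reply);
    });
  });
}

// Called by the transport layer when a reply comes in.
function onReply(id, reply) {
  const settle = pending.get(id);
  if (settle) settle(reply);
}
```

Without the timeout branch, every request whose reply is lost leaves one entry in `pending` for the lifetime of the process.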

Frequently asked questions

How many heap snapshots do I need to take to identify a leak?

At least two: one at a steady state (after forcing GC) and another after a period of activity. More snapshots at regular intervals (e.g., every hour) help distinguish a leak from normal fluctuation. Compare each pair.

Why do I see a lot of 'string' objects in my heap snapshot?

V8 internalizes many strings (for example property names and string literals), so identical values can share one object. A large number of string objects often indicates that you are dynamically generating many unique strings (e.g., template literals in a loop). This may be a leak if the strings are held by a collection, but often it's just normal usage. Focus on non-string objects first.

Can I use heap snapshots in production without killing performance?

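Writing a heap snapshot is synchronous and blocks the event loop, from hundreds of milliseconds to many seconds depending on heap size. It can also need roughly twice the heap's memory while it runs, which risks an OOM kill in a tightly limited container. For most production apps this is acceptable if done infrequently (e.g., once per hour) on an instance with headroom. A smaller heap, for example one capped with `--max-old-space-size`, snapshots faster. Avoid taking snapshots at peak traffic if possible.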

My leak is in a native addon — can heap snapshots help?

V8 heap snapshots only show JavaScript objects. Native memory allocated by C++ addons will not appear. Monitor `process.memoryUsage().external` for native memory tied to JavaScript objects and `rss` for everything else: `rss` climbing while `heapUsed` stays flat points outside the V8 heap. For deeper analysis, use tools like Valgrind or AddressSanitizer on the addon code.

What's the difference between heap snapshot and CPU profile?

A heap snapshot captures all live JavaScript objects and their references (memory state). A CPU profile measures function call times and frequency. For memory leaks, use heap snapshots. For performance bottlenecks, use CPU profiles.