LEARN · DEBUGGING GUIDE

MongoDB Change Stream Not Receiving Events: Diagnostic Walkthrough

Change streams that stop emitting events usually trace to one of three causes: a stale cursor, an undersized oplog, or a replica set member that has fallen behind. Here's exactly how to verify each.

Advanced · Database · 6 min read

What this usually means

Change streams read from the oplog (a capped collection in the local database) through the replication machinery. If the oplog no longer contains the stream's start point, usually because newer operations pushed it out while the consumer was paused, the server rejects the resume attempt with a ChangeStreamHistoryLost error. If the application swallows or never surfaces that error, the stream simply appears to stop delivering events. Alternatively, the cursor may have been killed by a network timeout or a primary stepdown. In sharded clusters, a missing or unresponsive shard can also stall the entire change stream without an obvious error.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run db.serverStatus().oplogTruncation and compare truncateCount across two samples to see how aggressively the oplog is being truncated
  2. Check rs.status() for any member whose 'stateStr' is something other than PRIMARY, SECONDARY, or ARBITER
  3. Use db.watch() in a minimal test script with a small batch size and log every event, to separate replica set problems from application problems
  4. Verify the resume token's clusterTime is within the oplog window by comparing it with the oldest oplog entry: db.getSiblingDB('local').oplog.rs.find().sort({$natural: 1}).limit(1).next().ts
  5. Enable command monitoring in the driver (e.g., monitorCommands: true in Node.js) and log commandFailed events to see whether getMore commands are failing
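
The rs.status() check in step 2 can be automated. The following is a minimal sketch, assuming `status` is the document returned by rs.status() with its usual `members`, `stateStr`, `health`, and `optimeDate` fields. The function name and the 10-second lag threshold are illustrative choices, not values from MongoDB.

```javascript
// Sketch: flag replica set members that could stall a change stream.
// `status` is assumed to be the document returned by rs.status().
const HEALTHY_STATES = new Set(["PRIMARY", "SECONDARY", "ARBITER"]);

function findUnhealthyMembers(status, maxLagSeconds = 10) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  const problems = [];

  for (const m of members) {
    if (m.health !== 1 || !HEALTHY_STATES.has(m.stateStr)) {
      problems.push({ name: m.name, reason: `state ${m.stateStr}` });
      continue;
    }
    // Arbiters hold no data, so replication lag only applies to secondaries.
    if (primary && m.stateStr === "SECONDARY") {
      const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000;
      if (lagSeconds > maxLagSeconds) {
        problems.push({ name: m.name, reason: `lagging ${lagSeconds}s` });
      }
    }
  }
  return problems;
}
```

In mongosh this could be called as findUnhealthyMembers(rs.status()); an empty array means every member is in an expected state and within the lag threshold.
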
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • MongoDB server log (mongod.log): lines containing 'ChangeStream', 'cursor', or 'getMore'
  • Application driver logs (set the driver log level to DEBUG)
  • The oplog.rs collection on the primary: db.getSiblingDB('local').oplog.rs.find({ts: {$gte: resumeTokenTimestamp}})
  • rs.status() output, to verify replica set health and last heartbeats
  • MongoDB Atlas metrics dashboard (if applicable): the 'Replication Oplog Window' metric
  • MongoDB server status: db.serverStatus().oplogTruncation and db.serverStatus().wiredTiger.cache
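
Scanning mongod.log for the keywords above is easy to script. This is a rough sketch that assumes the structured JSON log lines written by MongoDB 4.4 and later (fields `t`, `c`, `msg`) and falls back to the raw text for anything that does not parse. The function name is hypothetical.

```javascript
// Sketch: pull change-stream-related entries out of mongod log lines.
const KEYWORDS = ["changestream", "cursor", "getmore"];

function filterCursorLogLines(lines) {
  const hits = [];
  for (const line of lines) {
    const lower = line.toLowerCase();
    if (!KEYWORDS.some((k) => lower.includes(k))) continue;

    let entry = null;
    try {
      entry = JSON.parse(line); // structured log format (4.4+)
    } catch {
      entry = null; // plain-text line: keep the raw text only
    }
    hits.push({
      time: entry && entry.t ? entry.t.$date : null,
      component: entry ? entry.c : null,
      message: entry ? entry.msg : line,
    });
  }
  return hits;
}
```

Feeding it the lines of a log file (for example, the file contents split on newlines) returns only the entries worth reading first.
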
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Oplog is too small relative to write volume, so the entry the resume token points at gets truncated
  • Change stream cursor sits idle longer than the server cursor timeout (default 10 minutes) and is killed
  • Replica set election changes the primary; the new primary's oplog may no longer contain the resume token's position
  • Network partition leaves the cursor's getMore hanging on a half-open TCP connection, so it fails without a visible error
  • Application uses a stale connection pool that does not re-resolve the primary
  • Sharded cluster: one shard's change stream cursor is paused and that shard's oplog rolls past the resume token
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Increase oplog size: use rs.printReplicationInfo() to see the current size, then db.adminCommand({replSetResizeOplog: 1, size: 20000}) to set a larger size (in MB); run the command on each replica set member
  • Implement automatic cursor restart, persisting the resume token in a separate collection
  • Set maxAwaitTimeMS on the change stream (e.g., 5000ms) so each getMore returns within a bounded time, and handle the resulting empty batches
  • Add an application-level heartbeat: every N seconds, write a touch document to the watched collection to keep the stream active and its resume token recent
  • For sharded clusters, ensure all shards are reachable and the balancer is not causing excessive chunk migrations
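
The restart-with-persisted-token pattern can be sketched as a small loop. Everything injected here (`openStream`, `loadToken`, `saveToken`, `handle`) is a hypothetical callback you would supply; the sketch also assumes a driver version whose change stream is async-iterable. With older drivers the inner loop would use hasNext()/next() instead.

```javascript
// Sketch: consume a change stream and reopen it from the last persisted
// resume token whenever it fails. All dependencies are injected and
// hypothetical; adapt them to your driver and storage.
async function watchWithResume({ openStream, loadToken, saveToken, handle, maxRestarts = 5 }) {
  let consecutiveFailures = 0;
  while (true) {
    const token = await loadToken(); // null on the very first run
    // With the Node.js driver, openStream would typically be:
    //   (t) => collection.watch([], t ? { resumeAfter: t } : {})
    const stream = openStream(token);
    try {
      for await (const change of stream) {
        await handle(change);
        // Persist only after the event is processed, so a crash replays
        // the event instead of skipping it.
        await saveToken(change._id);
        consecutiveFailures = 0;
      }
      return; // stream ended normally
    } catch (err) {
      consecutiveFailures += 1;
      if (consecutiveFailures > maxRestarts) throw err;
    } finally {
      if (typeof stream.close === "function") await stream.close();
    }
  }
}
```

Because the token is saved after handling, the consumer gets at-least-once delivery, so `handle` should be idempotent. A production version would also back off between restarts and stop retrying when the server reports that the token has left the oplog.
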
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • After increasing the oplog, watch the change stream through a high-volume period and confirm events arrive continuously
  • Simulate a primary stepdown (rs.stepDown()) and verify the change stream resumes automatically within the timeout
  • Insert a test document and confirm the event appears in the application log within 100ms
  • Sample db.serverStatus().oplogTruncation.truncateCount over time and confirm the oplog window reported by rs.printReplicationInfo() stays above your target
  • Run a long-lived change stream test for 24 hours and verify there are no gaps in event timestamps
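
The 24-hour gap check is simple to automate if the consumer records the arrival time of each event. A minimal sketch, assuming timestamps in milliseconds and a heartbeat write frequent enough that any gap above `maxGapMs` is suspicious; the function name and threshold are illustrative.

```javascript
// Sketch: report every interval between consecutive events that exceeds
// the allowed gap. Input is an array of event times in milliseconds.
function findGaps(eventTimesMs, maxGapMs) {
  const sorted = [...eventTimesMs].sort((a, b) => a - b);
  const gaps = [];
  for (let i = 1; i < sorted.length; i++) {
    const gapMs = sorted[i] - sorted[i - 1];
    if (gapMs > maxGapMs) {
      gaps.push({ from: sorted[i - 1], to: sorted[i], gapMs });
    }
  }
  return gaps;
}

// Heartbeat every 5s, 15s tolerance: one 60s hole is reported.
console.log(findGaps([0, 5000, 10000, 70000, 75000], 15000));
// prints [ { from: 10000, to: 70000, gapMs: 60000 } ]
```
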
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Assuming the change stream is healthy because the cursor is open (it can be open but stalled)
  • Not persisting a resume token; starting fresh after a restart may cause missed events
  • Using a single change stream cursor for all collections; prefer separate cursors per collection
  • Ignoring the MongoDB driver version; older drivers have bugs in change stream recovery
  • Not monitoring change stream lag; use db.currentOp() to see whether a getMore is pending
  • Setting the oplog larger than the host's disk can comfortably hold (the oplog is stored on disk; only recently used pages sit in the WiredTiger cache)
( 07 ) War story

Change Stream Silence After Primary Election

Platform SRE · MongoDB 4.4 replica set (3 nodes), Node.js driver 3.6.10, Kafka connector as consumer

Timeline

  1. 14:02 PagerDuty alert: order processing pipeline stalled; no orders processed for 8 minutes
  2. 14:03 Check Kafka consumer lag: orders topic has 0 lag, but no new messages are being produced
  3. 14:05 Check change stream cursor in Node.js: cursor is not closed, but no events are emitted
  4. 14:07 Run rs.status(): primary changed to a different node 10 minutes ago
  5. 14:10 Check oplog on new primary: db.getSiblingDB('local').oplog.rs.find({ts: {$lte: resumeTokenTimestamp}}).count() returns 0, so nothing as old as the token remains
  6. 14:12 Confirm oplog size: only 1GB; write volume is ~500MB/min; retention ~2 minutes
  7. 14:15 Resize oplog to 5GB; the last known resume token is stale, so restart the change stream from 'now'
  8. 14:17 Events resume; orders start flowing again

At 14:02 our order pipeline went silent. The Kafka connector that reads change stream events from MongoDB had produced no new messages for 8 minutes. The MongoDB replica set had experienced a primary election about 10 minutes earlier due to a network hiccup. The change stream cursor was still open—it hadn't thrown an error—but it was returning no events.

I checked rs.status() and saw the primary had changed. The resume token stored in our connector was from the old primary. I queried the oplog on the new primary for entries at or before that token's timestamp and got zero results: everything that old had already been truncated. The oplog was only 1GB, and our write volume was about 500MB per minute, so the oplog had rolled past the token within 2 minutes.

We increased the oplog size to 5GB using replSetResizeOplog. Since the resume token was dead, we had to restart the change stream from 'now', accepting a small gap in events (which we backfilled from the primary's oplog archive). After the restart, events flowed again. The lesson: monitor oplog retention hours (rs.printReplicationInfo()) and set it to at least 2× the maximum expected change stream restart delay.

Root cause

Oplog size too small (1GB) relative to write rate (500MB/min), causing truncation of the resume token during a primary election. The change stream cursor had no valid resume point and silently stalled.

The fix

Resize oplog to 5GB (db.adminCommand({replSetResizeOplog: 1, size: 5000}), run on each member), restart the change stream from the current timestamp, and implement monitoring that alerts when oplog retention hours drop below a threshold.

The lesson

Always monitor oplog retention hours and set it to at least 2× the maximum expected time to restart a change stream. Implement automatic cursor health checks that verify the resume token still exists in the oplog.
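
The sizing rule is plain arithmetic. The sketch below treats the oplog growth rate as equal to the write rate, which is an approximation, and both function names are made up for illustration.

```javascript
// Sketch: how long an oplog of a given size retains history, and how large
// it must be to cover a restart delay with a safety factor.
function oplogWindowMinutes(oplogSizeMB, writeRateMBPerMin) {
  return oplogSizeMB / writeRateMBPerMin;
}

function requiredOplogSizeMB(writeRateMBPerMin, maxRestartDelayMin, safetyFactor = 2) {
  return Math.ceil(writeRateMBPerMin * maxRestartDelayMin * safetyFactor);
}

console.log(oplogWindowMinutes(1000, 500)); // 2  (the original 1GB oplog)
console.log(oplogWindowMinutes(5000, 500)); // 10 (after the resize to 5GB)
console.log(requiredOplogSizeMB(500, 10));  // 10000
```

At 500MB/min, covering a 10-minute restart delay with the 2× margin works out to roughly 10000MB, so the figures are worth recomputing whenever write volume changes.
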

( 08 ) Oplog Truncation Mechanics

The MongoDB oplog is a capped collection; when it reaches its configured size, the oldest entries are removed to make room for new ones. db.serverStatus().oplogTruncation reports truncation counters such as truncateCount and totalTimeTruncatingMicros; sampling them over time gives the truncation rate. If this rate is high, the window of available history is short.

Change streams depend on the oplog to replay events. If a resume token's timestamp is older than the oldest oplog entry, the change stream cannot be resumed: the server returns a ChangeStreamHistoryLost error, which looks like a silent stop if the application does not surface it. To check the oldest timestamp: db.getSiblingDB('local').oplog.rs.find().sort({$natural: 1}).limit(1).next().ts. Compare this to the clusterTime of your stored resume token.
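
The comparison itself can be sketched as follows, assuming both timestamps have been reduced to plain `{ t, i }` objects (seconds and increment) and that the consumer stores each event's clusterTime alongside its resume token. Converting a BSON Timestamp into that shape is left to the caller, and treating an exact match with the oldest entry as resumable is an assumption about the boundary case.

```javascript
// Sketch: decide whether a stored resume point is still inside the oplog.
// Timestamps are plain { t: seconds, i: increment } objects.
function compareTimestamps(a, b) {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}

function checkResumePoint(oldestOplogTs, tokenClusterTime) {
  const resumable = compareTimestamps(tokenClusterTime, oldestOplogTs) >= 0;
  return {
    resumable,
    // Seconds of history between the oldest entry and the token; negative
    // means the token has already fallen off the oplog.
    headroomSeconds: tokenClusterTime.t - oldestOplogTs.t,
  };
}
```

Alerting when headroomSeconds drops toward zero gives a warning before the token is actually lost, rather than after.
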

( 09 ) Cursor Lifecycle and Timeouts

A change stream cursor is a tailable, awaitData cursor. By default, MongoDB kills cursors after 10 minutes of inactivity. A healthy change stream avoids this because the driver keeps issuing getMore commands, each of which returns an empty batch after maxAwaitTimeMS when there is nothing new. If the driver stops issuing getMore commands, or the network drops them, the cursor may be killed.

To verify: run db.currentOp({op: 'getmore'}), find the entries whose originating command contains a $changeStream stage, and check the 'secs_running' field. If a getMore operation has been running for longer than maxAwaitTimeMS without returning, the cursor may be stalled.
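
That check can be scripted against the currentOp output. A sketch, assuming the result has an `inprog` array whose getMore entries carry `op: "getmore"`, `secs_running`, and the originating aggregate command under `cursor.originatingCommand` (the exact location of that field has varied between server versions, so the code checks two places). The function name is hypothetical.

```javascript
// Sketch: list change stream getMore operations that have been running
// longer than maxAwaitTimeMS. Field locations are assumptions; verify them
// against the currentOp output of your server version.
function findStalledChangeStreams(currentOpResult, maxAwaitTimeMS) {
  const limitSecs = maxAwaitTimeMS / 1000;
  return (currentOpResult.inprog || [])
    .filter((op) => {
      if (op.op !== "getmore") return false;
      const cmd = (op.cursor && op.cursor.originatingCommand) || op.originatingCommand || {};
      const pipeline = Array.isArray(cmd.pipeline) ? cmd.pipeline : [];
      const isChangeStream = pipeline.some(
        (stage) => stage !== null && typeof stage === "object" && "$changeStream" in stage
      );
      return isChangeStream && (op.secs_running || 0) > limitSecs;
    })
    .map((op) => ({ opid: op.opid, ns: op.ns, secsRunning: op.secs_running }));
}
```

In mongosh, findStalledChangeStreams(db.currentOp(), 5000) would list the suspects for a stream opened with maxAwaitTimeMS of 5000.
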

( 10 ) Replica Set Topology Changes

During a primary election, all cursors on the old primary are invalidated. The driver should automatically reconnect to the new primary and re-establish the change stream, but only if it can resolve the new primary and if the resume token still exists.

The driver's behavior varies by version. In older Node.js drivers (<3.6), reconnection may not happen automatically. Persist the latest resume token (the driver emits a 'resumeTokenChanged' event when it advances) and pass it back through the 'resumeAfter' option when reopening the stream. Always test with rs.stepDown() to validate recovery.
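
When the application handles recovery itself, it helps to separate errors that can be resumed from those that cannot. A sketch under stated assumptions: error code 286 is ChangeStreamHistoryLost, resumable server errors carry the ResumableChangeStreamError label, and network errors from the Node.js driver are named MongoNetworkError. The return values are illustrative labels, not driver constants.

```javascript
// Sketch: decide how to recover from a change stream error.
const CHANGE_STREAM_HISTORY_LOST = 286;
const RESUMABLE_LABEL = "ResumableChangeStreamError";

function classifyStreamError(err) {
  if (!err) return "fatal";
  // The resume point is gone from the oplog: a token cannot help.
  if (err.code === CHANGE_STREAM_HISTORY_LOST) return "restart-from-now";

  const labeled =
    typeof err.hasErrorLabel === "function"
      ? err.hasErrorLabel(RESUMABLE_LABEL)
      : Array.isArray(err.errorLabels) && err.errorLabels.includes(RESUMABLE_LABEL);
  if (labeled || err.name === "MongoNetworkError") return "resume-with-token";

  return "fatal";
}
```

A stepdown test with rs.stepDown() should only ever produce "resume-with-token"; seeing "restart-from-now" during that test means the oplog window is too short for the election.
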

( 11 ) Sharded Cluster Specifics

In a sharded cluster, change streams are opened on each shard and merged. If one shard is unreachable or its oplog truncated the resume token, the entire change stream stalls. The driver may not report which shard is problematic.

To diagnose: connect directly to each shard's replica set and open a change stream there, rather than going through mongos, to find the shard that does not respond. Check rs.status() on each shard. Also, ensure the balancer is not causing excessive migrations that can pause oplog application.

Frequently asked questions

How do I check if my oplog is large enough?

Run rs.printReplicationInfo(). Look at the 'oplog size' and 'time between first and last op'. Ensure the time window is at least 2-3 times the expected maximum downtime for your change stream consumers. For high-write systems, start with 10GB or more.

My change stream cursor is open but not emitting events. Is it stalled?

Possibly: an open cursor is not necessarily an active one. Check db.currentOp() for the stream's getMore operation. A getMore that waits for up to maxAwaitTimeMS is normal for an idle stream; one that has been running far longer is suspect. Also, try inserting a document and see whether the event arrives within a few seconds. If not, the cursor is likely stalled.

Can I resume a change stream after a primary election without losing events?

Yes, if your resume token is still in the oplog on the new primary. Save the resume token from the last event you processed, and when reconnecting, pass it as the 'resumeAfter' option. If the token is gone, you must start from 'now' and accept a gap.

What driver settings help prevent change stream stalls?

Set maxAwaitTimeMS to a value that forces regular empty batches (e.g., 5000ms). Enable TCP keepalive on connections. Use the latest driver version and set serverSelectionTimeoutMS high enough to ride out elections. Also, keep retryReads enabled (it defaults to true in current drivers).
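
As a starting point, these settings might look like the following option objects for the Node.js driver. The values are illustrative rather than recommendations, and TCP keepalive is omitted because how it is configured depends on the driver version.

```javascript
// Sketch: option objects for the Node.js driver. Values are illustrative.
const clientOptions = {
  serverSelectionTimeoutMS: 30000, // long enough to ride out an election
  retryReads: true,                // already the default in current drivers
  monitorCommands: true,           // emits commandFailed events for getMore
};

const watchOptions = {
  maxAwaitTimeMS: 5000, // each getMore returns within about 5s, even if empty
};

// Typical use, assuming `uri` and `collection` exist in your application:
//   const client = new MongoClient(uri, clientOptions);
//   const stream = collection.watch([], watchOptions);
```
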

Why does my change stream work in a replica set but not on a standalone?

Change streams require a replica set or sharded cluster (standalone does not have an oplog). Ensure your deployment is a replica set, even if single node. Also, check that the read concern is set to 'majority' (default) and write concern is appropriate.