
LEARN · DEBUGGING GUIDE

Stale lock bug: how locks that are never refreshed cause outages

Your application acquires a lock to process a critical section. The work finishes in 2 seconds, but the lock has a 60-second TTL. If the lock is never released, no other process can enter the critical section for the remaining 58 seconds. If the lock holder crashes before releasing, the critical section stays blocked until the TTL expires.

Advanced · JavaScript/Node runtime debugging

What this usually means

A stale lock is one that outlives its usefulness. The lock holder acquired it, did the work, and either forgot to release it or crashed before releasing. The lock stays until its TTL expires, blocking all other processes. The gap between work completion and lock release (or TTL expiry) is wasted time. The fix is to release the lock as soon as the work is done and to choose a TTL that is long enough to cover the work but short enough that a crashed holder does not block everyone else for long.
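
The release-as-soon-as-done idea can be sketched as follows. This is a minimal illustration, not a production lock: `FakeLockStore`, `setIfAbsent`, `deleteIfOwner`, and `withLock` are hypothetical names, and the in-memory store only stands in for Redis so the example runs on its own.

```javascript
const { randomUUID } = require("node:crypto");

// In-memory stand-in for Redis (an assumption for this sketch).
// setIfAbsent mimics `SET key owner NX PX ttlMs`; deleteIfOwner mimics a
// compare-and-delete that removes the key only if the stored owner matches.
class FakeLockStore {
  constructor() {
    this.locks = new Map(); // key -> { owner, expiresAt }
  }
  setIfAbsent(key, owner, ttlMs, now = Date.now()) {
    const current = this.locks.get(key);
    if (current && current.expiresAt > now) return false; // still held
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }
  deleteIfOwner(key, owner) {
    const current = this.locks.get(key);
    if (!current || current.owner !== owner) return false;
    this.locks.delete(key);
    return true;
  }
}

// Runs `work` under the lock and always attempts release, even if work throws.
async function withLock(store, key, ttlMs, work) {
  const owner = randomUUID(); // unique per acquisition
  if (!store.setIfAbsent(key, owner, ttlMs)) {
    throw new Error(`lock ${key} is held by another process`);
  }
  try {
    return await work();
  } finally {
    // A failed release is logged rather than swallowed.
    const released = store.deleteIfOwner(key, owner);
    if (!released) console.warn(`lock ${key} was not released by ${owner}`);
  }
}
```

With a real Redis client, the acquisition would map to SET with the NX and PX options, and the release to a server-side script that compares the stored value before deleting the key.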

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

  • 1. Check the lock TTL against the actual work duration. If the TTL is 60s but the work takes 2s, an unreleased lock is stale for 58s.
  • 2. Check whether the lock is released in a finally block. If release happens only on success, an error leaves the lock held.
  • 3. Check whether the release code can fail. If the Redis DEL fails silently, the lock stays until the TTL expires.
  • 4. Check for a lock refresh mechanism. Does the holder extend the TTL while it is still working?
  • 5. Check whether the lock is released at all. Review every code path between acquisition and release.
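
The arithmetic in the first check can be written down as a small helper. The function names `staleWindowMs` and `staleRatio` are made up for this sketch, and it assumes the lock is never released explicitly, so only the TTL removes it.

```javascript
// How long the lock stays held after the work finished, assuming it is
// never released explicitly and only the TTL removes it.
function staleWindowMs(ttlMs, workMs) {
  return Math.max(0, ttlMs - workMs);
}

// Fraction of the TTL during which the lock blocks others for no reason.
function staleRatio(ttlMs, workMs) {
  return ttlMs > 0 ? staleWindowMs(ttlMs, workMs) / ttlMs : 0;
}

console.log(staleWindowMs(60_000, 2_000)); // 58000
console.log(staleRatio(60_000, 2_000).toFixed(2)); // 0.97
```

A ratio close to 1 suggests the TTL is far larger than the work needs; a window of 0 means the work takes at least as long as the TTL, which points to the opposite problem of a lock expiring mid-work.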

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Lock acquisition code: the SET call with NX and the TTL value
  • Lock release code: is it in a finally block? Does it handle errors?
  • Lock TTL configuration: is it appropriate for the work duration?
  • Lock refresh or watchdog mechanism: does the holder extend the TTL?
  • Process monitoring: do lock holders crash frequently?
  • Redis: inspect lock keys and their TTL values
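
When inspecting lock keys in Redis, the reply of `PTTL <key>` already tells most of the story: -2 means the key does not exist, -1 means the key exists without an expiry, and any other value is the remaining lifetime in milliseconds. The helper below is a hedged sketch of how those replies might be read; `describeLockTtl` and the `maxReasonableTtlMs` threshold are hypothetical and would be chosen by the operator.

```javascript
// Classifies the integer reply of Redis `PTTL <key>` for a lock key.
// maxReasonableTtlMs is an operator-chosen bound (an assumption of this
// sketch), for example the maximum work duration plus a buffer.
function describeLockTtl(pttlMs, maxReasonableTtlMs) {
  if (pttlMs === -2) return "free: no lock key";
  if (pttlMs === -1) return "stuck: lock has no TTL and will never expire on its own";
  if (pttlMs > maxReasonableTtlMs) {
    return `oversized: ${pttlMs} ms left, above the ${maxReasonableTtlMs} ms bound`;
  }
  return `held: ${pttlMs} ms left`;
}

console.log(describeLockTtl(-2, 5000)); // free: no lock key
console.log(describeLockTtl(58000, 5000)); // oversized: 58000 ms left, above the 5000 ms bound
```

The "stuck" case deserves the most attention: a lock key without any expiry blocks the critical section until someone deletes it by hand.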

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Lock is released only on the success path, not in catch or finally
  • Lock release code fails silently and the error is not logged
  • Lock TTL is set much longer than the maximum work duration
  • Process crashes between acquiring the lock and releasing it
  • Lock release targets the wrong Redis instance or key
  • Lock refresh logic has a bug, so the TTL is not actually extended
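
The refresh bug in the last item is easy to miss when the result of the refresh is never checked. Below is a minimal sketch of a single refresh step that reports a lost lock instead of continuing quietly. `FakeLockStore`, `extendIfOwner`, and `refreshOrAbort` are hypothetical names, the store is an in-memory stand-in for Redis, and the clock is passed in as `now` so the example is deterministic.

```javascript
// In-memory stand-in for a Redis lock key (an assumption for this sketch).
// extendIfOwner mimics a script that extends the expiry only when the stored
// value still equals the caller's owner ID.
class FakeLockStore {
  constructor() {
    this.locks = new Map(); // key -> { owner, expiresAt }
  }
  acquire(key, owner, ttlMs, now) {
    const cur = this.locks.get(key);
    if (cur && cur.expiresAt > now) return false;
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }
  extendIfOwner(key, owner, ttlMs, now) {
    const cur = this.locks.get(key);
    if (!cur || cur.expiresAt <= now || cur.owner !== owner) return false;
    cur.expiresAt = now + ttlMs;
    return true;
  }
}

// One refresh step. Returns true if the lock was extended; on false the
// holder should stop working because it no longer owns the lock.
function refreshOrAbort(store, key, owner, ttlMs, now, onLost) {
  const extended = store.extendIfOwner(key, owner, ttlMs, now);
  if (!extended) onLost(new Error(`lost lock ${key} (owner ${owner})`));
  return extended;
}

const demoStore = new FakeLockStore();
demoStore.acquire("job", "worker-a", 3000, 0);
console.log(refreshOrAbort(demoStore, "job", "worker-a", 3000, 1000, console.error)); // true
console.log(refreshOrAbort(demoStore, "job", "worker-a", 3000, 9000, console.error)); // false
```

In a running service this step would typically be called from a timer at some fraction of the TTL, and the `onLost` callback would cancel the in-flight work.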

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Always release the lock in a finally block to guarantee cleanup on success and failure
  • Set the lock TTL to the maximum expected work duration plus a small buffer
  • Add a lock refresh or watchdog that extends the TTL while the holder is still alive and working
  • Log lock acquisition, refresh, and release events with timestamps and lock owner ID
  • Monitor lock hold time: if a lock is held longer than expected, alert immediately
  • Use a fencing token to prevent a stale lock holder from corrupting data after its lock has expired and been taken by another process
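
The fencing token in the last item can be illustrated with a short sketch. The idea is that every acquisition receives a strictly larger number and the protected resource rejects writes that carry an older one. `TokenIssuer` and `FencedResource` are hypothetical classes for illustration; with Redis, the token could plausibly come from an INCR on a counter key at acquisition time.

```javascript
// Hypothetical lock service: every successful acquisition gets a larger token.
class TokenIssuer {
  constructor() {
    this.last = 0;
  }
  next() {
    return ++this.last;
  }
}

// The protected resource remembers the highest token it has accepted and
// rejects writes carrying an older one.
class FencedResource {
  constructor() {
    this.highestToken = 0;
    this.value = null;
  }
  write(token, value) {
    if (token < this.highestToken) {
      throw new Error(`stale token ${token}, current is ${this.highestToken}`);
    }
    this.highestToken = token;
    this.value = value;
  }
}

const issuer = new TokenIssuer();
const resource = new FencedResource();
const tokenA = issuer.next(); // holder A acquires, then stalls past its TTL
const tokenB = issuer.next(); // the lock expires and holder B acquires
resource.write(tokenB, "from B"); // accepted
try {
  resource.write(tokenA, "from A"); // A wakes up and writes late
} catch (err) {
  console.log(err.message); // stale token 1, current is 2
}
```

The check has to live in the resource being protected (the database row, the file, the downstream API), because the stale holder itself cannot know that its lock has expired.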

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Run the critical section. The lock should be acquired before work and released immediately after.
  • Check Redis: the lock key should be absent after work completes, so a TTL query on it should report that the key does not exist.
  • Simulate a crash: kill the process during the critical section. The lock should expire once the TTL elapses, and another process should then be able to acquire it.
  • Run concurrent processes. They should take turns, with no process blocked for much longer than the work duration.
  • Monitor lock hold time percentiles. The p99 should be close to the work duration, not the TTL.
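
The percentile check in the last item can be sketched as below, assuming hold times are already collected in milliseconds. `percentile` uses the nearest-rank method, and `holdTimeVerdict` is a hypothetical helper that only asks whether the p99 sits nearer the TTL than the work duration.

```javascript
// Nearest-rank percentile of a list of durations in milliseconds.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Flags the lock as stale-prone when the p99 hold time is nearer the TTL
// than the expected work duration.
function holdTimeVerdict(holdTimesMs, workMs, ttlMs) {
  const p99 = percentile(holdTimesMs, 99);
  const closerToTtl = Math.abs(ttlMs - p99) < Math.abs(p99 - workMs);
  return { p99, ok: !closerToTtl };
}

console.log(holdTimeVerdict([1900, 2000, 2100, 2050], 2000, 60000)); // { p99: 2100, ok: true }
console.log(holdTimeVerdict([2000, 2100, 60000], 2000, 60000)); // { p99: 60000, ok: false }
```

In the second call a single hold that ran to the full 60-second TTL is enough to fail the check, which is the signal that some path is leaving the lock to expire instead of releasing it.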

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Not releasing the lock on error paths
  • Setting the lock TTL too long because 'it is safer'
  • Not logging lock lifecycle events: you cannot debug what you cannot see
  • Not monitoring lock hold time: you will not know the lock is stale
  • Using a lock when a simpler synchronisation mechanism would work