What this usually means
A distributed lock is a key in Redis with a TTL. The lock holder sets the key and deletes it when done. If the holder crashes before deleting the key, the lock stays until the TTL expires, so a TTL that is too long blocks the system for that duration. If the unlock logic is wrong (wrong key, wrong owner value, or a connection pointed at the wrong Redis instance or database), the lock is never released even when the process finishes normally. In a Redis Cluster or Sentinel setup, a lock acquired on a master might not be released if the master fails over before the release command reaches it or before the `DEL` is replicated; the key then stays on the promoted replica until its TTL expires.
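The acquire step described above can be sketched as follows. This is a minimal sketch, not a reference implementation: it assumes `client` is a redis-py style client whose `set` accepts `nx` and `px` keyword arguments, and the function name `acquire_lock` and the random owner token are illustrative choices.

```python
import uuid
from typing import Optional


def acquire_lock(client, key: str, ttl_ms: int) -> Optional[str]:
    """Try once to take the lock; return the owner token, or None if it is held.

    Assumption: `client` behaves like redis-py's `redis.Redis`, where
    `set(key, value, nx=True, px=ttl_ms)` maps to `SET key value NX PX ttl_ms`.
    """
    token = uuid.uuid4().hex  # unique per acquisition, identifies the owner
    # NX: only set if the key does not exist. PX: expire after ttl_ms.
    # Both happen in one command, so the key never exists without a TTL.
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None
```

The returned token is what a later release has to present, which is why it is generated per acquisition rather than per process.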
The first ten minutes: establish facts before touching code.
1. Check the lock key in Redis: `GET <lock-key>` and `TTL <lock-key>`. Does it exist? What is the value (lock owner ID)? How much TTL remains? A TTL of `-1` means the key has no expiry at all.
2. Check if the lock owner process is still running. If the process crashed, the lock is orphaned.
3. Check the unlock code. Is it deleting the exact same key, on the same Redis instance and database the lock was acquired on?
4. Check if the lock has a reasonable TTL. If the operation takes 5 seconds and the TTL is 60 seconds, a crash blocks for up to 60 seconds.
5. Check if the unlock uses a Lua script that deletes the key only when its value equals the owner ID. Deleting without checking ownership risks deleting someone else's lock.
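The first check can be scripted so the answers land in one place. The sketch below assumes a redis-py style `client` with `get` and `pttl`; the helper name `inspect_lock` and the shape of the returned dictionary are hypothetical.

```python
from typing import Any, Dict


def inspect_lock(client, key: str) -> Dict[str, Any]:
    """Summarise the state of one lock key for triage.

    Assumption: `client` is a redis-py style client. PTTL returns -2 when the
    key does not exist and -1 when it exists without an expiry.
    The two reads are not atomic, so treat the result as a snapshot.
    """
    owner = client.get(key)
    ttl_ms = client.pttl(key)
    if isinstance(owner, bytes):  # bytes unless decode_responses=True
        owner = owner.decode("utf-8", errors="replace")
    if ttl_ms == -2 or owner is None:
        state = "free"
    elif ttl_ms == -1:
        state = "held, no TTL (will never expire on its own)"
    else:
        state = "held"
    return {"key": key, "owner": owner, "ttl_ms": ttl_ms, "state": state}
```

Comparing the reported owner with the ID of the process that should hold the lock answers the second check as well.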
The specific files, logs, configs, and dashboards that usually own this bug.
- Redis: `SCAN 0 MATCH *lock*` (prefer it to `KEYS *lock*`, which blocks the server while it walks the whole keyspace), `GET <key>`, `TTL <key>`
- Lock acquisition code: `SET` with `NX` + `PX`, or a Redlock implementation
- Lock release code: Lua script or DEL-after-ownership-check
- Process monitoring: is the lock holder process alive?
- Lock TTL configuration: is it long enough for normal operation but short enough to recover from crashes?
- Redis topology: standalone, cluster, or sentinel? What is the failover behaviour?
Practical causes, not theory. These are the things you will actually find.
- Lock holder process crashes before releasing the lock
- Lock release sends the command to the wrong Redis instance (in cluster/sentinel)
- Lock release fails but the error is swallowed, so the code continues as if the lock were released
- Lock TTL is set too long relative to the operation duration
- Lock release script has a bug: wrong key format, wrong owner ID comparison
- Network partition: the lock holder is still running but cannot reach Redis to release
- Multiple processes use different lock key formats: one acquires with prefix A, another tries to release with prefix B
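Two of these causes, swallowed release errors and mismatched key prefixes, are structural and can be guarded against in code. The sketch below is illustrative only: `lock_key`, `checked_release`, and `LockReleaseError` are hypothetical names, and `release` stands for whatever function performs the ownership-checked delete, assumed to return 1 when it deleted the key and 0 otherwise.

```python
import logging
from typing import Callable

log = logging.getLogger("locks")


class LockReleaseError(RuntimeError):
    """Raised when a release did not remove the lock we thought we held."""


def lock_key(resource: str, prefix: str = "lock:") -> str:
    # Single place that builds key names, shared by acquire and release,
    # so the two paths cannot drift to different prefixes.
    return f"{prefix}{resource}"


def checked_release(release: Callable[[str, str], int], resource: str, token: str) -> None:
    """Release the lock and surface every failure instead of swallowing it.

    Assumption: `release(key, token)` returns 1 if it deleted the key and 0
    if the key was missing or owned by someone else.
    """
    key = lock_key(resource)
    try:
        deleted = release(key, token)
    except Exception:
        # Log with the key, then re-raise: the caller must not assume success.
        log.exception("lock release failed for %s", key)
        raise
    if deleted != 1:
        raise LockReleaseError(f"lock {key} was not released (expired or not ours)")
```

A return value of 0 is not always a bug, since the TTL may simply have expired first, but it is worth surfacing because it means the critical section ran longer than the lock lasted.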
Concrete fix directions. Pick the one that matches your root cause.
- Set a reasonable TTL on every lock: long enough for the operation but short enough for crash recovery.
- Use a Lua script for lock release that atomically checks ownership before deleting: `if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end`.
- Add a lock watchdog that extends the TTL while the holder is still active and processing.
- Implement a fencing token: include a monotonically increasing token with the lock so downstream systems can detect stale lock holders.
- Monitor lock acquisition wait time and release failures, and alert if locks are held beyond the expected duration.
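The ownership-checked release and the watchdog can be sketched together, since the watchdog needs the same ownership check before it extends the TTL. Assumptions: `client` is a redis-py style client with `eval(script, numkeys, *keys_and_args)`; the function names, the `PEXPIRE`-based extension script, and the choice to renew every third of the TTL are illustrative and not taken from any specific library.

```python
import threading

# Release script as given in the fix list: delete only if we still own the key.
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
else
    return 0
end
"""

# Same ownership check, but resets the TTL instead of deleting.
EXTEND_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
    return 0
end
"""


def release_lock(client, key: str, token: str) -> bool:
    """Return True only if the key was ours and has been deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1


def extend_lock(client, key: str, token: str, ttl_ms: int) -> bool:
    """Return True only if the key was ours and its TTL was reset."""
    return client.eval(EXTEND_SCRIPT, 1, key, token, ttl_ms) == 1


def start_watchdog(client, key: str, token: str, ttl_ms: int,
                   stop: threading.Event) -> threading.Thread:
    """Renew the lock every ttl_ms / 3 until `stop` is set or the lock is lost."""
    def run() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(ttl_ms / 3000.0):
            if not extend_lock(client, key, token, ttl_ms):
                break  # no longer the owner; stop renewing
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The holder would set `stop` before calling `release_lock`. Because the thread is a daemon, it dies with the process, so a crashed holder stops renewing and the TTL still bounds how long the lock can stay orphaned.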
A fix you cannot prove is a guess. Close the loop.
- Acquire the lock, kill the process, and verify the lock is released after the TTL expires.
- Acquire the lock, release it normally, and verify the key is deleted from Redis.
- Have two processes attempt to acquire the same lock; only one should succeed.
- Simulate a Redis failover and verify locks are handled correctly.
- Run a load test with lock contention and verify no deadlocks occur.
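The first three checks can be automated against a disposable test instance. The sketch below assumes a redis-py style `client` with `set`, `eval`, and `exists`. It simulates the crash by never releasing rather than by killing a process, and it does not cover the failover or load-test items. The name `verify_lock_contract` is hypothetical.

```python
import time
import uuid

RELEASE_SCRIPT = (
    "if redis.call('GET', KEYS[1]) == ARGV[1] then "
    "return redis.call('DEL', KEYS[1]) else return 0 end"
)


def _check(condition: bool, message: str) -> None:
    if not condition:
        raise AssertionError(message)


def verify_lock_contract(client, key: str, ttl_ms: int = 200) -> None:
    """Raise AssertionError if the lock misbehaves; return None if all checks pass."""
    a, b = uuid.uuid4().hex, uuid.uuid4().hex

    # Two contenders: only the first SET NX may succeed.
    _check(bool(client.set(key, a, nx=True, px=ttl_ms)), "first acquire failed")
    _check(not client.set(key, b, nx=True, px=ttl_ms), "second acquire succeeded")

    # A non-owner must not be able to release; the owner must.
    _check(client.eval(RELEASE_SCRIPT, 1, key, b) == 0, "non-owner released the lock")
    _check(client.eval(RELEASE_SCRIPT, 1, key, a) == 1, "owner could not release")
    _check(client.exists(key) == 0, "key still present after release")

    # Simulated crash: acquire, never release, and wait out the TTL.
    _check(bool(client.set(key, a, nx=True, px=ttl_ms)), "re-acquire failed")
    time.sleep(ttl_ms / 1000 + 0.05)
    _check(client.exists(key) == 0, "key survived its TTL")
```

Use a key name reserved for testing so the check cannot collide with a lock that real workers depend on.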
Things that make this bug worse or harder to find.
- Not setting a TTL on the lock at all: a crash locks the system forever
- Deleting the lock key without checking ownership: this could delete another process's lock
- Using a lock with an infinite or very long TTL as a 'permanent' lock
- Not handling the case where the lock acquire fails: the code should wait and retry or fail gracefully
- Assuming Redis `SET NX` is safe in a cluster without using Redlock or a similar algorithm
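Handling a failed acquire usually means a bounded retry with backoff, then giving up cleanly. A minimal sketch, assuming a redis-py style `client` whose `set` accepts `nx` and `px`; the function name, the default attempt count, and the jittered exponential backoff are illustrative choices, and `sleep` is injectable only to make the function easy to test.

```python
import random
import time
import uuid
from typing import Callable, Optional


def acquire_with_retry(client, key: str, ttl_ms: int, attempts: int = 5,
                       base_delay_s: float = 0.05,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Try to take the lock up to `attempts` times; return the token or None."""
    token = uuid.uuid4().hex
    for attempt in range(attempts):
        # SET key token NX PX ttl_ms: the TTL is always set with the key.
        if client.set(key, token, nx=True, px=ttl_ms):
            return token
        if attempt < attempts - 1:
            # Exponential backoff with jitter so contenders do not retry in lockstep.
            sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    return None  # caller decides: skip the work, queue it, or report an error
```

Returning `None` instead of blocking forever keeps a stuck lock from turning into a stuck caller.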