Golang sync.Mutex Deadlock Detection and Fix

What this usually means

A deadlock with sync.Mutex typically means two or more goroutines hold locks and wait for each other to release them, creating a cycle. The most common pattern is lock ordering inversion: goroutine A locks mutex1 then mutex2, while goroutine B locks mutex2 then mutex1. If both acquire their first lock simultaneously, neither can proceed. Another frequent cause is recursive locking: a goroutine tries to lock a mutex it already holds (sync.Mutex is not reentrant). Also, forgetting to unlock a mutex on all code paths, especially in error handling, can permanently block other goroutines. The Go scheduler's non-determinism often masks these bugs until specific timing conditions trigger them in production.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

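1. Dump full goroutine stacks and find the lock waiters: curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' | grep -F -B6 -A20 'sync.(*Mutex).Lock'. The bracketed header of each goroutine shows its wait reason (semacquire on older Go releases, sync.Mutex.Lock on newer ones) and how long it has been blocked
2. Summarize where goroutines are parked: go tool pprof -top http://localhost:6060/debug/pprof/goroutine | grep -i 'mutex'
3. Check whether the waits form a cycle: look for two goroutines each waiting on a lock held by the other
4. During development, put a time bound on lock acquisition to catch hangs. sync.Mutex.Lock cannot appear in a select statement, so either poll TryLock (Go 1.18+) against a deadline or guard the resource with a buffered channel that can be selected alongside ctx.Done()
5. Run go vet ./... to catch mutexes copied by value (the copylocks check)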
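The time-bounded acquisition from step 4 can be sketched with TryLock. This is a minimal development-only sketch: lockWithTimeout and demoStuckLock are hypothetical names, and the 1ms polling interval is an arbitrary assumption. It is not a replacement for mu.Lock() in production code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lockWithTimeout polls TryLock (available since Go 1.18) until it succeeds
// or the timeout elapses. Hypothetical debugging helper: a false result means
// "something held this lock for longer than expected".
func lockWithTimeout(mu *sync.Mutex, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if mu.TryLock() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(time.Millisecond) // back off briefly between attempts
	}
}

// demoStuckLock tries to acquire one mutex that is never released and one
// that is free, each with a 50ms bound.
func demoStuckLock() (stuck, free bool) {
	var held, idle sync.Mutex
	held.Lock() // simulate a goroutine that never unlocks
	stuck = lockWithTimeout(&held, 50*time.Millisecond)
	free = lockWithTimeout(&idle, 50*time.Millisecond)
	return stuck, free
}

func main() {
	stuck, free := demoStuckLock()
	fmt.Println("acquired held mutex:", stuck)
	fmt.Println("acquired idle mutex:", free)
}
```

When the helper returns false, log a goroutine dump at that point; the holder of the lock is usually visible in it.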

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- runtime/pprof output: /debug/pprof/goroutine?debug=2 (full goroutine stacks)
- Code locations where mutex.Lock() and Unlock() are called, especially in nested or recursive functions
- Error handling paths: check that Unlock() is deferred or called before every error return
- Lock ordering: list all mutexes and the order in which they are acquired across goroutines
- Third-party libraries that use sync.Mutex internally; check whether they expose the mutex or invoke callbacks while holding it
- Tests with the -race flag: go test -race ./... detects data races, not deadlocks, but the changed timing can surface hangs. A hung test is killed by the test timeout (-timeout, 10 minutes by default) and prints all goroutine stacks
- Application logs: look for repeated attempts to acquire a lock with no corresponding release

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Lock ordering inversion: goroutine A locks mutex1 then mutex2, goroutine B locks mutex2 then mutex1
- Recursive locking: a goroutine attempts to lock the same mutex twice (e.g., calling a function that locks while already holding the lock)
- Missing Unlock on an error path: early return without releasing the lock
- Copying a mutex by value (e.g., passing a struct that contains a mutex by value), so Lock and Unlock operate on different copies
- Lock never released: a goroutine exits, or a panic is recovered further up the stack, while the lock is still held because Unlock was not deferred
- Blocking work inside the critical section: network I/O, database queries, or channel operations that block indefinitely while the lock is held

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Establish a global lock ordering: all goroutines must acquire locks in the same order (e.g., alphabetical, or based on a hierarchy)
- Use sync.RWMutex when reads dominate to reduce contention. This shortens waits but does not fix ordering bugs, and recursive RLock can itself deadlock when a writer is queued
- Wrap Lock/Unlock in a helper that logs acquisition and release for debugging (but remove it in production)
- Always defer Unlock immediately after Lock so the lock is released even on panic: defer mu.Unlock()
- Avoid calling external callbacks or doing I/O while holding a lock; keep the critical section minimal
- Consider using channels instead of mutexes for communication; they are often simpler to reason about
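One way to keep callbacks out of the critical section is to copy the listener list under the lock and invoke the copy after unlocking. The sketch below assumes a hypothetical Feed type with Subscribe, Price, and SetPrice methods; none of these names come from a real library.

```go
package main

import (
	"fmt"
	"sync"
)

// Feed is a hypothetical publisher whose state is guarded by mu.
type Feed struct {
	mu        sync.Mutex
	price     float64
	listeners []func(float64)
}

// Subscribe registers a callback.
func (f *Feed) Subscribe(fn func(float64)) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.listeners = append(f.listeners, fn)
}

// Price returns the latest price.
func (f *Feed) Price() float64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.price
}

// SetPrice updates state under the lock, then notifies listeners only after
// the lock has been released.
func (f *Feed) SetPrice(p float64) {
	f.mu.Lock()
	f.price = p
	// Copy the slice so the callbacks can run without the lock held.
	snapshot := append([]func(float64){}, f.listeners...)
	f.mu.Unlock()

	for _, fn := range snapshot {
		fn(p) // a callback may call back into Feed without deadlocking
	}
}

// demoCallback registers a listener that re-enters the Feed from the callback.
func demoCallback() float64 {
	f := &Feed{}
	var seen float64
	f.Subscribe(func(float64) { seen = f.Price() })
	f.SetPrice(101.5)
	return seen
}

func main() {
	fmt.Println(demoCallback())
}
```

If SetPrice invoked the callbacks before unlocking, the listener's call to Price would try to lock a mutex the same goroutine already holds and hang.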

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run the application under stress testing (e.g., multiple goroutines, high concurrency) and confirm there is no hang
- Run the tests repeatedly, for example go test -race -count=100 ./..., with a -timeout short enough that a deadlocked test fails quickly and prints its goroutine stacks
- Monitor /debug/pprof/goroutine?debug=2 after the fix; ensure no goroutines stay blocked in sync.(*Mutex).Lock for long periods
- Add a watchdog timer that prints stack traces if a lock is held longer than a threshold (e.g., 10 seconds)
- Verify through code review that every Lock() has a matching Unlock() on all paths. go vet catches copied mutexes but does not check Lock/Unlock pairing
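The stress-test and watchdog ideas above can be combined in a small harness. This is a sketch under assumptions: finishesWithin, hammer, runStress, and runSelfDeadlock are hypothetical names, and the time limits are arbitrary. On timeout it writes all goroutine stacks to stderr through runtime/pprof.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"sync"
	"time"
)

// finishesWithin runs fn in its own goroutine and reports whether it returned
// within d. On timeout it dumps all goroutine stacks to stderr, the same text
// served at /debug/pprof/goroutine?debug=2.
func finishesWithin(d time.Duration, fn func()) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		fn()
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		return false
	}
}

// hammer increments a mutex-guarded counter from many goroutines.
func hammer(workers, iterations int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	total := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

// runStress reports whether the workload finished in time and the final count.
func runStress() (bool, int) {
	var total int
	ok := finishesWithin(10*time.Second, func() { total = hammer(100, 1000) })
	if !ok {
		return false, 0
	}
	return true, total
}

// runSelfDeadlock locks the same mutex twice on purpose; it should time out.
func runSelfDeadlock() bool {
	return finishesWithin(100*time.Millisecond, func() {
		var mu sync.Mutex
		mu.Lock()
		mu.Lock() // never returns: sync.Mutex is not reentrant
	})
}

func main() {
	ok, total := runStress()
	fmt.Println("stress finished:", ok, "total:", total)
	fmt.Println("self-deadlock finished:", runSelfDeadlock())
}
```

In a real test suite the same check would live in a _test.go file and call t.Fatal on timeout instead of returning false.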

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Don't use sync.Mutex as a reentrant lock; it is not one, and neither sync.RWMutex nor sync.Cond provides reentrancy. If a call path needs the lock twice, restructure it so the lock is taken once and inner helpers assume it is already held
- Don't copy a struct that contains a mutex by value; always use pointers to avoid copying the lock state
- Don't assume that a deadlock will always reproduce; timing-dependent bugs need stress testing
- Don't add sleep() to 'fix' the problem; it only changes the timing and hides the ordering bug
- Don't ignore go vet warnings about copying mutexes; they are serious
- Don't use global mutexes for unrelated resources; high contention can cause performance issues and obscure deadlocks

( 07 ) War story

The Price Tick Feed Deadlock

Backend SRE | Go 1.18, gRPC, Redis, Kubernetes (GKE)

Timeline

09:00 Deploy new version of price-feed service to production
09:15 Alerts fire: latency spikes to 30s for tick retrieval endpoint
09:20 Check /debug/pprof/goroutine?debug=2; see 500 goroutines blocked on sync.Mutex.Lock
09:25 Identify two goroutines: one holding 'subscriberMu' waiting for 'tickMu', another holding 'tickMu' waiting for 'subscriberMu'
09:30 Rollback to previous version; latency drops to normal
09:45 Code review: new code introduced a callback that locks subscriberMu while holding tickMu
10:00 Fix: reorder lock acquisition to always acquire subscriberMu first, then tickMu
10:15 Deploy fixed version; monitor for 30 minutes, no recurrence

We deployed a new version of the price-feed service that added a feature to notify subscribers when a tick price changes. Within minutes, the tick retrieval endpoint latency spiked from 5ms to 30s. I immediately checked the goroutine profile and saw over 500 goroutines blocked on sync.Mutex.Lock.

Looking at the stacks, I found a classic lock ordering inversion. The Tick() function locked tickMu then called Subscriber.Notify(), which locked subscriberMu. Meanwhile, Subscriber.Update() locked subscriberMu then called Tick(), which locked tickMu. Two goroutines executing these paths simultaneously caused a deadlock.

We rolled back immediately. The fix was straightforward: enforce a global lock order. I changed Tick() to lock subscriberMu before tickMu, ensuring consistent ordering. After deploying the fix, latency returned to normal and the deadlock never reappeared. I also added a stress test that runs 100 concurrent requests to catch ordering issues early.

Root cause

Lock ordering inversion: Tick() locked tickMu then subscriberMu, while Subscriber.Update() locked subscriberMu then tickMu.

The fix

Changed lock order in Tick() to lock subscriberMu first, then tickMu, consistent with Subscriber.Update(). Also added a defer Unlock pattern and a stress test.

The lesson

Always establish and document a global lock ordering for all mutexes. Use defer Unlock immediately after Lock. Stress test with high concurrency to catch deadlocks before production.

( 08 ) Detecting Deadlocks with Runtime Stack Traces

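The most reliable way to detect a deadlock in Go is to examine goroutine stack traces. Importing net/http/pprof for its side effects registers handlers under /debug/pprof/ on http.DefaultServeMux; the process then needs an HTTP server that serves that mux. Fetch /debug/pprof/goroutine?debug=2 to get full stack traces. Look for goroutines whose header shows a wait reason of semacquire (older Go releases) or sync.Mutex.Lock (newer releases) and whose top frames include sync.runtime_SemacquireMutex and sync.(*Mutex).Lock. The dump shows where each goroutine is waiting, not which locks it already holds, so read down the calling frames to work out which locks were acquired earlier on that path.

For example, a goroutine stack might show: goroutine 1 [semacquire, 1 minutes]: sync.runtime_SemacquireMutex(...) sync.(*Mutex).Lock(...) main.func1 ... main.func2. If a second goroutine is blocked acquiring a lock that the first goroutine's call path already holds, and the first is waiting on a lock the second holds, the two form a cycle. In production, you can also expose the pprof endpoint on a separate port or use a sidecar to collect profiles periodically.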
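A minimal sketch of both ways to get this dump: over HTTP and in process. The address localhost:6060 and the helper names startPprof, dumpGoroutines, and demoBlockedCount are assumptions made for illustration. Counting stack frames by substring is a rough heuristic, not a precise deadlock check.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"runtime/pprof"
	"strings"
	"sync"
	"time"
)

// startPprof serves the default mux, where pprof registered itself, on addr.
func startPprof(addr string) {
	go func() {
		// A nil handler means http.DefaultServeMux.
		log.Println(http.ListenAndServe(addr, nil))
	}()
}

// dumpGoroutines returns the same text as /debug/pprof/goroutine?debug=2.
func dumpGoroutines() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return ""
	}
	return buf.String()
}

// demoBlockedCount parks one goroutine on a mutex that is never released and
// counts the Lock frames that show up in the dump.
func demoBlockedCount() int {
	var mu sync.Mutex
	mu.Lock()
	go func() { mu.Lock() }()          // blocks forever: the lock is never released
	time.Sleep(100 * time.Millisecond) // give the goroutine time to block
	return strings.Count(dumpGoroutines(), "sync.(*Mutex).Lock")
}

func main() {
	startPprof("localhost:6060")
	fmt.Println("frames in sync.(*Mutex).Lock:", demoBlockedCount())
}
```

The in-process dump is useful when the HTTP endpoint is unreachable, for example from a signal handler or a watchdog goroutine.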

( 09 ) Lock Ordering and Hierarchies

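The canonical solution to lock ordering inversion is to impose a single, fixed acquisition order on all mutexes that can be held together. For example, assign a numeric ID to each mutex and always acquire locks in increasing ID order. Because no goroutine ever waits for a lower-ranked lock while holding a higher-ranked one, a cycle cannot form. In Go, you can embed an ordering field in the struct that owns the mutex or keep the ranks in a global map. Document the order clearly.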

Another approach is to use hierarchical locking: group related resources under a 'manager' mutex that is acquired first. For example, a database connection pool might have a pool mutex, and individual connection mutexes are only locked after the pool mutex. This reduces the chance of inversion.
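The ID-based ordering can be sketched as follows. Account, lockBoth, and transfer are hypothetical names, and the sketch assumes every Account has a unique id that doubles as its lock rank.

```go
package main

import (
	"fmt"
	"sync"
)

// Account is a hypothetical resource; id doubles as its lock rank and is
// assumed to be unique.
type Account struct {
	id      int
	mu      sync.Mutex
	balance int
}

// lockBoth acquires both mutexes in increasing id order, regardless of the
// argument order, and returns a function that releases them.
func lockBoth(a, b *Account) (unlock func()) {
	first, second := a, b
	if b.id < a.id {
		first, second = b, a
	}
	first.mu.Lock()
	second.mu.Lock()
	return func() {
		second.mu.Unlock()
		first.mu.Unlock()
	}
}

// transfer moves amount between two accounts while holding both locks.
func transfer(from, to *Account, amount int) {
	if from == to {
		return // locking the same mutex twice would self-deadlock
	}
	unlock := lockBoth(from, to)
	defer unlock()
	from.balance -= amount
	to.balance += amount
}

// demoTransfers runs opposing transfers concurrently; without the ordering in
// lockBoth this workload could deadlock.
func demoTransfers() (int, int) {
	a := &Account{id: 1, balance: 1000}
	b := &Account{id: 2, balance: 1000}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); transfer(a, b, 1) }()
		go func() { defer wg.Done(); transfer(b, a, 1) }()
	}
	wg.Wait()
	return a.balance, b.balance
}

func main() {
	fmt.Println(demoTransfers())
}
```

Centralizing the ordering in one helper means callers cannot get it wrong by accident, which is the main reason to prefer it over a comment that says "lock A before B".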

( 10 ) Using defer for Safe Unlock

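Always use defer mu.Unlock() immediately after mu.Lock(). This ensures the lock is released even if the function panics or returns early. The cost of defer itself is small in modern Go; the real trade-off is that a deferred Unlock runs only when the function returns, so the lock stays held across any slow work later in the function. In performance-critical paths, either move the critical section into its own small function or use explicit Unlock and make sure every code path releases the lock, including error returns.

A common mistake is to lock and then call a function that panics; if the panic is recovered further up the stack, the lock stays held indefinitely. defer Unlock prevents this. Also, a mutex must not be copied after first use: the copy carries its own lock state, so locking one copy does not protect against the other. Use pointers to share mutexes.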
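A short sketch of both cases, an early error return and a recovered panic, using a hypothetical Store type. TryLock (Go 1.18+) is used here only to check afterwards that the lock was released.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a hypothetical map guarded by a mutex.
type Store struct {
	mu   sync.Mutex
	data map[string]int
}

var errMissing = errors.New("key not found")

// Get defers Unlock, so the early error return still releases the lock.
func (s *Store) Get(key string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	if !ok {
		return 0, errMissing // the deferred Unlock runs here
	}
	return v, nil
}

// MustGet panics on a missing key; the deferred Unlock runs during the panic.
func (s *Store) MustGet(key string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	if !ok {
		panic(errMissing)
	}
	return v
}

// demoDefer exercises the error path and the panic path, then reports whether
// an error was returned and whether the mutex is free afterwards.
func demoDefer() (gotErr, lockFree bool) {
	s := &Store{data: map[string]int{}}
	_, err := s.Get("missing")
	gotErr = errors.Is(err, errMissing)

	func() {
		defer func() { _ = recover() }() // swallow the panic, as a server handler might
		s.MustGet("missing")
	}()

	if s.mu.TryLock() {
		s.mu.Unlock()
		lockFree = true
	}
	return gotErr, lockFree
}

func main() {
	fmt.Println(demoDefer())
}
```

Without the defer in MustGet, the recovered panic would leave s.mu locked and every later call to Get would block forever.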

( 11 ) Advanced: Deadlock Detection with go-deadlock

The go-deadlock package (github.com/sasha-s/go-deadlock) provides drop-in replacements for sync.Mutex and sync.RWMutex that detect potential deadlocks at runtime. It tracks lock acquisition order and also flags lock waits that exceed a configurable timeout. During development, replace sync.Mutex with deadlock.Mutex. If a potential deadlock is detected (e.g., a lock ordering violation), it prints the stack traces involved and, by default, terminates the process; the reaction is configurable through the package's options.

This tool is invaluable for testing but should not be used in production because of its performance overhead. Use build constraints (e.g., //go:build deadlock, or the older // +build deadlock form before Go 1.17) to include it conditionally. Combine it with go test -race for broader concurrency validation.

Frequently asked questions

Is sync.Mutex reentrant in Go?

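No, sync.Mutex is not reentrant. If a goroutine attempts to lock a mutex it already holds, it will deadlock. The standard library offers no reentrant mutex, and sync.RWMutex is not reentrant either: recursive read locking can deadlock when a writer is waiting. Needing reentrancy is usually a design smell; restructure the code so that exported methods take the lock and call unexported helpers that assume it is already held.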
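A common way to avoid needing a reentrant lock is to split each operation into an exported method that locks and an unexported helper that assumes the lock is held. A minimal sketch, with a hypothetical Counter type and the "Locked" suffix as a naming convention rather than anything the compiler enforces:

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a hypothetical type guarded by mu.
type Counter struct {
	mu sync.Mutex
	n  int
}

// Add is an exported entry point: it locks once and delegates.
func (c *Counter) Add(d int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.addLocked(d)
}

// AddTwice also locks once and reuses the helper. Calling c.Add from here
// would deadlock, because it would Lock a mutex this goroutine already holds.
func (c *Counter) AddTwice(d int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.addLocked(d)
	return c.addLocked(d)
}

// addLocked requires c.mu to be held by the caller.
func (c *Counter) addLocked(d int) int {
	c.n += d
	return c.n
}

// demoCounter adds 1 once and 2 twice.
func demoCounter() int {
	c := &Counter{}
	c.Add(1)
	return c.AddTwice(2)
}

func main() {
	fmt.Println(demoCounter()) // prints 5
}
```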

How do I find which goroutines are in a deadlock?

Enable the pprof HTTP server and fetch /debug/pprof/goroutine?debug=2. Look for goroutines whose header shows a wait reason of semacquire (older Go releases) or sync.Mutex.Lock (newer releases), with sync.(*Mutex).Lock near the top of the stack. Two or more goroutines waiting on each other's locks indicate a deadlock.

Can I use channels to avoid deadlocks with mutexes?

Channels can reduce the need for mutexes by communicating data between goroutines safely. However, channels can also deadlock if not used carefully (e.g., unbuffered channel send without receiver). Often, using channels with select and timeouts can help avoid deadlocks.
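As a sketch of the select-with-timeout idea, a buffered channel of capacity 1 can act as a lock whose acquisition is cancellable through a context. chanMutex and its methods are hypothetical names, not part of the standard library.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// chanMutex is a hypothetical lock built on a buffered channel of capacity 1:
// sending acquires the lock, receiving releases it.
type chanMutex chan struct{}

func newChanMutex() chanMutex { return make(chanMutex, 1) }

// Lock blocks until the lock is acquired or ctx is done, whichever is first.
func (m chanMutex) Lock(ctx context.Context) error {
	select {
	case m <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Unlock releases the lock. Calling it without holding the lock blocks.
func (m chanMutex) Unlock() { <-m }

// demoChanMutex acquires the lock, then tries again without releasing it.
func demoChanMutex() (first, second error) {
	m := newChanMutex()
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	first = m.Lock(ctx)  // succeeds immediately
	second = m.Lock(ctx) // already held: times out instead of hanging
	return first, second
}

func main() {
	first, second := demoChanMutex()
	fmt.Println("first:", first)
	fmt.Println("second:", second)
}
```

The timeout turns a silent hang into an error the caller can log, but it does not remove the underlying ordering bug; treat the error as a signal to investigate.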

What is the difference between a deadlock and a livelock?

In a deadlock, goroutines are blocked waiting for each other and make no progress. In a livelock, goroutines are actively trying to resolve the conflict but keep changing state without making progress (e.g., two goroutines repeatedly releasing and re-acquiring locks). Livelocks are rarer but can be detected by observing high CPU usage with no progress.

How can I prevent deadlocks in new Go code?

Establish a global lock ordering for all mutexes. Always use defer Unlock. Avoid calling external functions or callbacks while holding locks. Use go vet to check for mutex copy issues. Write stress tests with high concurrency. Consider using go-deadlock during development.
