What this usually means
At least one goroutine is blocked trying to send to or receive from a channel that will never be ready, or is stuck waiting on a sync.WaitGroup counter that never reaches zero. The Go runtime detects that no goroutine can make progress and aborts with 'fatal error: all goroutines are asleep - deadlock!'. This is a fatal error rather than a panic, so recover() cannot intercept it. Common causes: sending on an unbuffered channel with no receiver ready and no other goroutine that will receive; calling sync.WaitGroup.Wait() before all Add() calls are made; forgetting to close a channel that a range loop is reading from; circular channel dependencies where goroutine A waits on B and B waits on A.
The first ten minutes — establish facts before touching code.
1. Run the program under a timeout (timeout 5 ./myapp) to catch the hang quickly.
2. When it hangs, send SIGQUIT (kill -QUIT <pid>) to dump all goroutine stacks. Note that this also terminates the process.
3. Examine the stack dump: look for 'goroutine N [chan send]' or '[chan receive]'; those goroutines are blocked.
4. Identify the last line of user code each blocked goroutine is executing before the channel operation.
5. Check that every send has a corresponding receive on the same channel, and vice versa.
6. Verify that all sync.WaitGroup.Add() calls happen before Wait() and that Done() is called exactly once per Add.
The specific files, logs, configs, and dashboards that usually own this bug.
- Goroutine stack dump from SIGQUIT or GOTRACEBACK=crash.
- All channel send/receive statements in the codebase, especially in goroutines.
- sync.WaitGroup usage: every Add must precede Wait, and Done must be called in all exit paths.
- defer statements that might not execute (e.g., os.Exit, log.Fatal before defer).
- select statements with no default case: if all channels are blocked, the goroutine blocks forever.
- Channel close patterns: range over a channel that is never closed will block forever.
Practical causes, not theory. These are the things you will actually find.
- Unbuffered channel send without a corresponding receive in another goroutine.
- sync.WaitGroup.Add() called after Wait() has begun (race condition).
- Missing sync.WaitGroup.Done() in an error branch or early return.
- A channel with multiple senders and no single owner responsible for closing it, so it is never closed and every receiver ranging over it blocks forever.
- Circular wait: goroutine A blocks sending to ch1 while goroutine B blocks sending to ch2, and each would receive from the other's channel only after its own send completes.
- Using time.Sleep to 'wait' for goroutines, which is fragile and hides the real deadlock.
Concrete fix directions. Pick the one that matches your root cause.
- Ensure unbuffered channel sends and receives are paired in separate goroutines or use a buffered channel.
- Convert unbuffered channels to buffered channels with capacity=1 as a quick fix, but verify semantics.
- Use a select with default to make sends/receives non-blocking when appropriate.
- Add a separate 'done' or 'quit' channel that is closed to signal goroutines to exit, instead of relying on closing the data channel.
- For sync.WaitGroup, always do Add(1) before launching the goroutine, and defer Done() inside the goroutine.
- Use a context with cancel to propagate cancellation across goroutines and break deadlocks.
A fix you cannot prove is a guess. Close the loop.
- Run the program under a stress test (go test -count=100 -race) to catch intermittent deadlocks.
- Set the GOTRACEBACK=crash environment variable so that a crash prints stacks for all goroutines, including runtime-internal ones, and then aborts in an OS-specific way (for example, with a core dump).
- Use the race detector: it won't catch deadlocks but helps eliminate data races as a cause.
- Inject short timeouts in select statements to detect if a goroutine is stuck longer than expected.
- Write a test that exercises the specific code path that previously deadlocked and assert it completes within a timeout.
- Run with -v and log each goroutine's progress to trace the sequence of events.
Things that make this bug worse or harder to find.
- Do not add random time.Sleep calls to 'fix' the deadlock; it only masks the race and may fail under load.
- Do not ignore the stack dump; it tells you exactly which goroutines and lines are stuck.
- Do not close a channel from the receiver side; only senders should close a channel.
- Do not assume a channel operation will complete; always handle the case where it might block forever.
- Do not use sync.WaitGroup without ensuring Done is called exactly once per Add, even on panic (use defer).
The Silent Chat Server: All Goroutines Asleep
Timeline
- 09:15 Deploy new chat server to staging. Immediately after, the server becomes unresponsive.
- 09:17 Health check fails. SSH into the instance; 'curl localhost:8080/health' hangs.
- 09:20 Run 'kill -QUIT <pid>' to get goroutine dump. Dump shows 500 goroutines blocked on 'chan send' and 'chan receive'.
- 09:22 Analyze dump: all blocked goroutines are in the broadcast() function and in client readPump().
- 09:25 In broadcast(), it sends to every client's send channel; in readPump(), it reads from the same channel. But broadcast() is called in a single goroutine that iterates over clients.
- 09:30 Discover that when a client disconnects, its send channel is not removed from the broadcast list immediately, but its readPump goroutine exits. So broadcast tries to send to a channel with no receiver.
- 09:35 Fix: protect the client list with a mutex, remove client before closing its send channel, and set a maximum send buffer of 256.
- 09:40 Deploy fix. Server recovers and passes load test.
I had deployed a WebSocket chat server that worked fine in development but immediately deadlocked in staging under any load. The runtime error screamed 'all goroutines are asleep'. I had seen this before but always from textbook examples, never in a real service.
The goroutine dump showed hundreds of goroutines stuck in broadcast() waiting to send on a channel, and an equal number stuck in readPump() waiting to receive. But the numbers matched, so why would they deadlock? I traced the code: broadcast() iterated over a map of clients and sent a message to each client's send channel. That send channel was unbuffered, and the client's readPump() was supposed to receive from it. But when a client disconnected, its readPump() goroutine exited and we never removed the client from the broadcast map. My first thought was that broadcast was sending on a closed channel, but that would have panicked rather than hung. The channel was still open; it was just that its receiving goroutine had exited. So the send blocked forever.
The fix was simple: use a buffered channel with capacity 256 so that sends don't block if the receiver is temporarily slow, and ensure we remove the client from the map before closing the channel. Also protect the map with a mutex. After deploying, the server handled 1000 concurrent connections without a hitch.
Root cause
Unbuffered send channel with no active receiver because the client's read goroutine exited before the cleanup removed the client from the broadcast list.
The fix
Buffered sends (capacity 256) and proper cleanup: remove client from list, then close channel, all under a mutex.
The lesson
Never use unbuffered channels for broadcast patterns; always assume receivers may disappear. And always capture goroutine dumps on deadlock.
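The shape of that fix can be sketched as follows. The hub type and its method names are hypothetical stand-ins, not the server's actual code. The non-blocking select with default in broadcast is an extra safeguard added here for illustration; the incident fix itself relied on the 256-slot buffer, the mutex, and the remove-then-close ordering.

```go
package main

import (
	"fmt"
	"sync"
)

// hub is a hypothetical stand-in for the chat server's client registry.
type hub struct {
	mu      sync.Mutex
	clients map[int]chan string // client ID -> buffered send channel
}

func newHub() *hub { return &hub{clients: make(map[int]chan string)} }

// add registers a client and returns the channel it should read from.
func (h *hub) add(id int) <-chan string {
	ch := make(chan string, 256) // buffer size taken from the incident fix
	h.mu.Lock()
	h.clients[id] = ch
	h.mu.Unlock()
	return ch
}

// remove deletes the client from the map first and then closes its channel.
// Both steps happen under the mutex, so broadcast never sees a closed channel.
func (h *hub) remove(id int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if ch, ok := h.clients[id]; ok {
		delete(h.clients, id)
		close(ch)
	}
}

// broadcast does a non-blocking send to every client and returns how many
// clients were skipped because their buffer was full (an added assumption).
func (h *hub) broadcast(msg string) (dropped int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.clients {
		select {
		case ch <- msg:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	h := newHub()
	a := h.add(1)
	h.add(2)
	h.remove(2) // client 2 disconnects before the broadcast
	dropped := h.broadcast("hello")
	fmt.Println(<-a, dropped) // hello 0
}
```

Dropping messages for a slow client is a policy decision. A real server might instead disconnect that client, which is why the drop count is returned rather than ignored.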
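When the Go runtime aborts with the deadlock error, it prints the goroutine stacks. If the program hangs without triggering that error (for example, because some goroutine is still waiting on I/O), send SIGQUIT to get the dump; note that SIGQUIT also terminates the process. The format shows each goroutine's ID, state (e.g., 'chan send', 'chan receive', 'semacquire'; newer Go releases may use more specific labels such as 'sync.Mutex.Lock'), and the stack trace. Focus on the last line of your own code before the runtime call.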
Look for patterns: multiple goroutines in 'chan send' on the same channel address indicate a broadcast deadlock. A goroutine in 'semacquire' indicates a mutex or WaitGroup deadlock. If you see a goroutine in 'IO wait', it might be a network deadlock, not a channel issue. For channel deadlocks, the stack will show runtime.chansend or runtime.chanrecv.
A common cause is calling Wait() before all Add() calls complete. For example, launching a goroutine with go func() { wg.Add(1); ... }() — the Add might happen after Wait() if the goroutine is scheduled late. Always Add before the goroutine launch.
Another pitfall: forgetting to call Done() in an error path. Use defer wg.Done() at the start of the goroutine to ensure it's called even if the goroutine panics. But note: defer won't run on os.Exit or log.Fatal — those kill the process immediately.
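A minimal sketch of both rules together: Add before the go statement, and defer Done so the early return on error cannot skip it. The process function and its "odd input fails" rule are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// process is a hypothetical unit of work that fails for odd inputs.
func process(n int) error {
	if n%2 == 1 {
		return errors.New("odd input")
	}
	return nil
}

// runAll launches one goroutine per input and returns the number of failures.
func runAll(inputs []int) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)
	for _, n := range inputs {
		wg.Add(1) // Add before `go`, so Wait cannot run ahead of it
		go func(n int) {
			defer wg.Done() // runs on every return path, including the error branch
			if err := process(n); err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
				return
			}
		}(n)
	}
	wg.Wait()
	return failures
}

func main() {
	fmt.Println("failures:", runAll([]int{1, 2, 3, 4})) // failures: 2
}
```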
Two goroutines can deadlock if each is waiting on the other's channel. For example: goroutine A sends on ch1 and then receives on ch2; goroutine B sends on ch2 and then receives on ch1. With unbuffered channels, neither send can complete because neither goroutine ever reaches its receive, so both block forever regardless of timing. This is the classic 'deadly embrace'. The stack dump will show A in 'chan send' on ch1 and B in 'chan send' on ch2. In variants where one side receives first, you will instead see one goroutine in send and one in receive.
To fix, reorder operations so that one goroutine sends first, or use a buffered channel with capacity at least 1 to break the cycle. A common pattern is to use a 'request' and 'response' channel pair, but ensure the request sender doesn't block waiting for a response before the request is received.
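The buffered-channel variant of that fix might look like the sketch below, where both sides still send first and then receive. The function name exchange and the message strings are illustrative only.

```go
package main

import "fmt"

// exchange runs two parties that each send first and then receive.
// With unbuffered channels both sends would block forever; a buffer of 1
// lets each send complete without a receiver being ready.
func exchange() (gotA, gotB string) {
	ch1 := make(chan string, 1) // A -> B
	ch2 := make(chan string, 1) // B -> A
	done := make(chan struct{})

	go func() { // goroutine B
		defer close(done)
		ch2 <- "from B"
		gotB = <-ch1
	}()

	// goroutine A is the caller
	ch1 <- "from A"
	gotA = <-ch2
	<-done // wait for B so that reading gotB is race-free
	return gotA, gotB
}

func main() {
	a, b := exchange()
	fmt.Println(a, "|", b) // from B | from A
}
```

Changing either make call back to an unbuffered channel reproduces the deadlock, which makes this a convenient reproducer for practicing with the stack dump.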
The race detector (-race) does not detect deadlocks, but it helps rule out data races that might cause inconsistent state leading to deadlock. Use 'go tool trace' to visualize goroutine execution and see where goroutines block. Capture a trace by calling trace.Start(w) and trace.Stop() from the runtime/trace package around the problematic section, writing to a file that you then open with 'go tool trace'.
For production, consider using 'net/http/pprof' to get goroutine profiles on demand: /debug/pprof/goroutine?debug=2 gives stack traces of all goroutines. This is safer than SIGQUIT because it doesn't kill the process.
Unbuffered channels guarantee synchronous communication: the sender blocks until a receiver is ready. This is useful for signaling but dangerous for high-throughput paths or when receiver availability is uncertain. Buffered channels decouple sender and receiver, allowing the sender to proceed until the buffer is full. However, once the buffer is full, a send blocks until a receiver makes room, just as an unbuffered send would.
A common mistake is using a single channel to broadcast (one sender, many receivers that should each see every message). That won't work because each value is delivered to exactly one receiver. Use a channel per receiver or use sync.Cond. Distributing work items across workers on one shared channel is fine, since there each item should be handled only once. For fan-in (many senders, one receiver), unbuffered works but can cause contention. Always match the channel type to the concurrency pattern.
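To see the buffering behaviour concretely, the sketch below uses a non-blocking send to probe when a channel would block. The helper name trySend is made up for the example.

```go
package main

import "fmt"

// trySend (hypothetical helper) attempts a non-blocking send and reports
// whether the value was accepted.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // buffer full, or no receiver ready on an unbuffered channel
	}
}

func main() {
	buf := make(chan int, 2)
	fmt.Println(trySend(buf, 1), trySend(buf, 2), trySend(buf, 3)) // true true false
	fmt.Println(len(buf), cap(buf))                                // 2 2

	unbuf := make(chan int)
	fmt.Println(trySend(unbuf, 1)) // false
}
```

The third send fails because the buffer is full, and the unbuffered send fails because no receiver is waiting. A plain blocking send would hang in exactly those two situations.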
Frequently asked questions
Why does my program deadlock only sometimes but not always?
This typically indicates a race condition: the deadlock depends on goroutine scheduling order. For example, if a goroutine that adds to sync.WaitGroup is scheduled after Wait() is called, it deadlocks. But if the goroutine runs first, it works. Unbuffered channel deadlocks can also be timing-dependent. Use a stress test with -race and GOMAXPROCS=1 to increase reproducibility.
How do I get a goroutine dump without killing the process?
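Use the net/http/pprof package. Import _ "net/http/pprof" and start an HTTP server that serves http.DefaultServeMux (for example, on localhost:6060). Then curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' to get the stack dump. This is safe for production if the endpoint is not publicly exposed. Alternatively, sending SIGQUIT or SIGABRT makes the runtime print a stack dump, but both terminate the process; a core file is typically produced only when GOTRACEBACK=crash is set and the OS allows core dumps.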
Can a closed channel cause a deadlock?
Receiving from a closed channel returns immediately (any buffered values first, then the zero value), so it does not deadlock. Sending to a closed channel causes a panic (not a deadlock). However, ranging over a channel that is never closed will block forever, causing a deadlock if all goroutines are blocked on such ranges. Always close the channel from the sender side when done.
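A short sketch of these rules, with hypothetical produce and sum helpers: the single sender owns the close, the range loop ends because of it, and a later receive returns the zero value with ok set to false.

```go
package main

import "fmt"

// produce sends 1..n and then closes the channel. As the only sender,
// it is the one responsible for closing.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch)
		for i := 1; i <= n; i++ {
			ch <- i
		}
	}()
	return ch
}

// sum ranges over ch; the loop terminates only because produce closes it.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	ch := produce(3)
	fmt.Println(sum(ch)) // 6

	v, ok := <-ch      // receiving from a closed, drained channel
	fmt.Println(v, ok) // 0 false
}
```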
What is the difference between 'all goroutines are asleep' and 'fatal error: concurrent map writes'?
'All goroutines are asleep' is deadlock detection by the runtime: no goroutine is runnable. 'Concurrent map writes' is raised by the runtime when it detects unsynchronized map access. Both are fatal errors that recover() cannot catch, but they are different problems. A deadlock can occur even without data races, but a data race can cause a deadlock indirectly by corrupting state (e.g., a channel variable being overwritten). Always fix data races first.
I see 'goroutine 1 [chan receive]' in the dump — what does that mean?
Goroutine 1 is the main goroutine. If it's blocked on a channel receive, the program cannot proceed to exit. This often happens when main() launches goroutines and then waits on a channel, but no goroutine sends to that channel. The fix is either to close the channel when done or to use sync.WaitGroup to wait for goroutines to finish instead of channel receive.
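One common way to combine both suggestions is sketched below: workers send results, and a separate goroutine closes the results channel once the WaitGroup reaches zero, so the receiving loop in the caller terminates. The function name sumOfSquares is illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// sumOfSquares computes n*n for each input in its own goroutine and adds
// up the results received over a channel.
func sumOfSquares(inputs []int) int {
	results := make(chan int)
	var wg sync.WaitGroup

	for _, n := range inputs {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			results <- n * n
		}(n)
	}

	// Close results once every worker has finished, so the range below ends.
	// Without this goroutine, the caller would block in [chan receive] forever.
	go func() {
		wg.Wait()
		close(results)
	}()

	total := 0
	for v := range results {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumOfSquares([]int{1, 2, 3})) // 14
}
```

Calling wg.Wait() directly in the caller before the range loop would not work with an unbuffered results channel: the workers would block on their sends and the counter would never reach zero.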