LEARN · DEBUGGING GUIDE

Diagnosing and Fixing Goroutine Leaks in Go

Goroutine leaks silently degrade Go services until an OOM kill or a latency spike hits. This guide covers detection with pprof, runtime metrics, blocking analysis, and common leak patterns.

Advanced · Go · 7 min read

What this usually means

Goroutine leaks happen when a goroutine never returns, usually because it blocks forever: a channel send or receive that no other goroutine will ever complete, a `select` in which no case ever becomes ready, or a mutex that is never released. The runtime never reclaims a blocked goroutine, so it stays alive and keeps its stack and every variable it references or captures out of the GC's reach. Over time, thousands of leaked goroutines can exhaust available memory. The root cause is usually a channel that is never closed or never read, a missing context cancellation, or a `select` with no exit path (no `<-ctx.Done()` case and no timeout).

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run `curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'` and read the `goroutine profile: total N` line at the top (use `debug=2` for full per-goroutine stacks); if the total is more than twice your expected worker pool size, suspect a leak.
  2. Check `runtime.NumGoroutine()` in a health endpoint or via pprof; log it every minute and chart it. If it never plateaus, you have a leak.
  3. Examine the pprof goroutine profile for goroutines stuck in the same wait state (e.g., `chan send`, `chan receive`, `sync.Mutex.Lock`).
  4. Use `go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine` and look for large clusters of identical stack traces.
  5. Add `pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)` in a debug endpoint and diff snapshots over time.

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • `/debug/pprof/goroutine?debug=2`: full stack traces of all goroutines
  • `runtime.NumGoroutine()` in your metrics endpoint or logs
  • Channel send/receive sites in code, especially `select` statements with no `<-ctx.Done()` case or timeout
  • `sync.WaitGroup` usage: ensure `Done()` is called on every path, including panics
  • Context cancellation propagation: check that `select` cases include `<-ctx.Done()`
  • Third-party libraries that spawn goroutines without exposing shutdown hooks
  • `go tool trace` output if you can capture a short trace during growth

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Goroutine blocks on a channel send to an unbuffered channel with no receiver
  • Goroutine blocks on a channel receive from a channel that never gets a value and is never closed
  • `select` with no `<-ctx.Done()` or timeout case, so the goroutine blocks forever when no other case becomes ready
  • `sync.WaitGroup.Add` called before goroutine start, but `Done()` not called on panic or early return, so the goroutine in `wg.Wait()` blocks forever
  • Context not propagated: goroutine waits on `time.After` instead of `ctx.Done()`, so it cannot stop before the timer fires (and in a loop that starts a fresh timer on each iteration, it never stops)
  • HTTP handlers spawn goroutines without tracking their lifecycle; server shuts down but goroutines linger
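
The first cause in this list is easy to reproduce. The following is a minimal sketch with hypothetical helper names (`leakOne`, `leakMany`); it only demonstrates that the goroutine count grows and never comes back down.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakOne starts a goroutine that sends on an unbuffered channel nobody
// reads from. The send never completes, so the goroutine never returns.
func leakOne() {
	ch := make(chan int) // unbuffered, and no receiver is ever started
	go func() {
		ch <- 1 // blocks forever: this goroutine is now leaked
	}()
}

// leakMany calls leakOne n times and reports how much the goroutine
// count grew.
func leakMany(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		leakOne()
	}
	time.Sleep(50 * time.Millisecond) // let the goroutines start and block
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines leaked:", leakMany(100))
}
```

Dumping `/debug/pprof/goroutine?debug=2` in a process like this would show the leaked goroutines in the `chan send` state.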

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Always use `select` with `<-ctx.Done()` as one case for any blocking operation, and propagate the context from the caller.
  • Prefer buffered channels, or add a `default` case to make sends non-blocking, when it is acceptable to drop messages while the consumer is down.
  • Wrap goroutine bodies in a function that defers `wg.Done()` and recovers panics, logging the error.
  • Use a leak-detection test: call `runtime.NumGoroutine()` before and after a test; ensure no growth.
  • For long-lived goroutines, implement a shutdown channel or context that cancels upon SIGTERM.
  • Avoid `time.Sleep` for coordination; use `time.After` inside a `select` and make sure the timer is not the only exit path (pair it with `<-ctx.Done()`).

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • After the fix, monitor `runtime.NumGoroutine()` over 24 hours; it should stay flat under idle load.
  • Run a `pprof` goroutine profile after a stress test; verify all goroutines are in expected states (for example, idle workers waiting for jobs).
  • Write a unit test that spawns 100 goroutines with the pattern, then checks that the goroutine count returns to baseline.
  • Check that service memory usage (RSS) stabilizes after warm-up and doesn't grow unboundedly.
  • Run `go vet`; its `copylocks` and `lostcancel` analyzers (on by default, or selected with `-copylocks -lostcancel`) catch related issues.

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Calling `go func()` inside a loop without capturing the loop variable (pre-1.22): that's a data race, not a leak, but it can cause confusion.
  • Assuming `time.Sleep` is harmless; a sleeping goroutine cannot observe cancellation, and a leaked goroutine that sleeps in a loop keeps everything it references out of the GC's reach.
  • Not using the block and mutex profiles (`/debug/pprof/block`, `/debug/pprof/mutex`, which stay empty until `runtime.SetBlockProfileRate` and `runtime.SetMutexProfileFraction` are set) to differentiate between contention and leaks.
  • Adding more memory without investigating; it masks the leak and delays the fix.
  • Fixing one leak while ignoring the same pattern elsewhere in the codebase.

( 07 ) War story

The Silent OOM: A Goroutine Leak in a Go Order Service

Platform Engineer · Go 1.18, gRPC, Kubernetes, Prometheus

Timeline

  1. 00:00 Deploy v2.3 of order-service to staging; memory climbs 50 MB/hour.
  2. 08:00 Production deploy; within 4 hours memory hits 90% of pod limit.
  3. 12:30 Pager alert: OOMKilled on 2 out of 5 pods.
  4. 12:45 Check /debug/pprof/goroutine?debug=2; see 12,000 goroutines vs expected 200.
  5. 13:00 Heap profile shows large retained memory in stack frames; all stuck in same channel send.
  6. 13:15 Identify the leaking goroutine: all stuck at `pkg/order/notifier.go:42` sending to `msgChan`.
  7. 13:30 Code review reveals the consumer of that channel is in a select without `ctx.Done()` and blocks on a slow DB call.
  8. 14:00 Hotfix: add `ctx.Done()` to the consumer select and a default to the sender.
  9. 14:15 Deploy hotfix; goroutine count drops to 200 within minutes. Memory stabilizes.

I was on-call when our order-service pods started getting OOM-killed. The alerts showed memory climbing steadily after every deploy, but CPU was normal. I first checked the heap profile via pprof — memory was high but not obviously leaking. Then I looked at the goroutine profile: 12,000 goroutines, all waiting to send on a channel.

The stack trace pointed to `notifier.go:42`, a goroutine that sends a notification message to `msgChan`. The receiver was a select that processed messages and then made a slow database call. The sender had no timeout, so if the receiver was busy, the sender blocked forever. Over hours, every order created a new goroutine that never returned.

The fix was straightforward: add `ctx.Done()` to both sender and receiver selects, and a `default` case to the sender so it drops messages if the channel is full. We also changed the channel to buffered. After deployment, the goroutine count dropped to baseline and memory stayed flat. I added a test that checks goroutine count before and after a load simulation.

Root cause

Goroutine sending to an unbuffered channel blocked forever because the receiver was blocked on a slow database call and had no context cancellation.

The fix

Add context cancellation to both sender and receiver selects, and make the sender non-blocking with a default case. Increase channel buffer size.

The lesson

Every blocking channel operation must be interruptible via context, especially when the consumer might be slow. Monitor goroutine count as a standard metric.

( 08 ) Using pprof to Find Leaked Goroutines

The first tool to reach for is Go's pprof. Start by enabling the HTTP profiler: add `import _ "net/http/pprof"` to your main package and serve `http.DefaultServeMux` on a dedicated, non-public port (e.g., localhost:6060).

Fetch the goroutine profile with `curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutines.txt`. This gives you full stack traces, one block per goroutine. Count the blocks with `grep -c '^goroutine ' goroutines.txt`; if the count is far above your expected worker count, you have a leak. (With `debug=1` the total is printed on the first line instead.)

Visualize with `go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine`. The graph shows which functions are accumulating goroutines. Look for a large cluster of identical stacks; that's your leak. You can also use the `top` command to see the functions with the most goroutines.

For a programmatic approach, call `pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)` in a debug endpoint and compare snapshots. If the same stack appears in increasing numbers, you've pinpointed the leak.

( 09 ) Blocking Analysis with go tool trace

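When pprof shows many goroutines waiting on channels, but you need to know why they're not progressing, use execution tracing. Add `import "runtime/trace"` and start a trace in your main: `f, _ := os.Create("trace.out"); trace.Start(f); defer trace.Stop()` (check both errors in real code). Then run your service under load for a few seconds. For a service that already exposes `net/http/pprof`, `curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'` captures a trace without a restart.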

Analyze the trace with `go tool trace trace.out`. In the "Goroutine analysis" view, you can see each goroutine's lifecycle: when it was created, when it blocked, and how long it blocked. Look for goroutines that are "Runnable" but never get scheduled, or "Waiting" on a channel indefinitely.

The "Synchronization blocking profile" shows which operations caused the most blocking time. If you see a channel send or receive with a huge cumulative delay, check the code around that location. Combine this with the goroutine profile to confirm the leak.

( 10 ) Common Patterns That Cause Leaks

The most frequent pattern is a goroutine that reads from a channel in an infinite loop but the channel is never closed. The goroutine blocks forever on receive. Always close channels when the producer stops, or use a context to signal shutdown.

Another classic is a `select` with no exit path. For example, `select { case msg := <-ch: ... case <-time.After(time.Minute): ... }` inside a `for` loop: if the message never arrives and the timeout case just loops again, a fresh one-minute timer starts on every iteration and the goroutine never exits. Always include `<-ctx.Done()` as a case.

Third: `sync.WaitGroup` misuse. If you `Add(1)` before launching a goroutine but a panic or early return skips `Done()`, the counter never reaches zero and every goroutine blocked in `wg.Wait()` leaks (the worker itself may have exited; it is the waiter that is stuck). Make `defer wg.Done()` the first statement inside the goroutine.

( 11 ) Testing for Leaks in CI

Write a test that records the goroutine count before and after the code under test. Call `runtime.NumGoroutine()` at the start, run the test logic, then check again. If the count increased beyond a threshold (say 5), fail the test. Example: `before := runtime.NumGoroutine(); /* run test */ after := runtime.NumGoroutine(); if after > before+5 { t.Fatal("goroutine leak") }`.

Be careful with background goroutines from the test framework or runtime, and with goroutines that are still winding down when you take the second reading; this is why the check uses a tolerance instead of an exact match. Also consider Uber's `goleak` package (`go.uber.org/goleak`, source at github.com/uber-go/goleak), which checks for leaked goroutines at the end of a test. It's more robust because it ignores known runtime goroutines.

For integration tests, start a pprof HTTP server, run your scenario, then fetch the goroutine profile and look for unexpected stacks. Automate this with a script that greps for known leak patterns.

Frequently asked questions

How many goroutines is normal?

A healthy Go service typically has a few hundred to a few thousand goroutines at most. The runtime can handle tens of thousands, but if the number keeps growing without bound, you have a leak. Monitor the trend, not just the absolute number.

Can a goroutine leak cause a memory leak?

Yes. Each goroutine starts with a small stack (2 KB since Go 1.4) that grows as needed. A leaked goroutine also holds references to any objects on its stack or captured by closures, preventing GC from freeing them. Over time, this memory accumulates and can cause OOM.

What's the difference between a goroutine leak and a deadlock?

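A deadlock that blocks every goroutine is detected by the runtime, which aborts the program with `fatal error: all goroutines are asleep - deadlock!`. A leak leaves some goroutines blocked forever while the rest keep running, so the program continues but memory grows. Leaks are harder to detect because the service appears to run normally until it OOMs.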

Should I use buffered channels to prevent leaks?

Buffered channels can help by allowing sends to proceed without blocking if the buffer is not full, but they don't solve the root cause. If the consumer is permanently stuck, the buffer will eventually fill and sends will block. Always use context cancellation for proper cleanup.

How do I monitor goroutine count in production?

Expose `runtime.NumGoroutine()` as a Prometheus gauge or via your metrics library. Log it every minute and graph it. Set an alert if the count exceeds a threshold (e.g., 3x baseline for 5 minutes). Also expose the pprof endpoint internally for on-demand debugging.