LEARN · DEBUGGING GUIDE

Debugging Go Context Cancellation That Doesn't Propagate

When a parent context is canceled but goroutines keep running, the problem is usually a missing WithCancel, a leaked context, or a blocking operation that ignores the done channel. Here's exactly how to find and fix it.

Intermediate · Go · 7 min read

What this usually means

The cancellation signal from the parent context never reaches the child goroutines. This happens when a goroutine uses a context that was not derived from the parent (for example, one built on context.Background()), or when its code path never checks ctx.Done() before or during a long-running operation. A common variant is storing a context in a struct, or handing it to a long-lived worker that spawns its own goroutines without passing the parent's context along. Another culprit is relying on context.WithValue to make a context cancelable: it only attaches a value and inherits whatever cancellation the parent has, so if the parent isn't cancelable, nothing will propagate.
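
As a minimal sketch of the broken chain, the program below derives one child from context.Background() and one from the parent, then cancels the parent. The helper names (isCanceled, detachedAndLinked) are made up for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// isCanceled reports whether ctx is already done, without blocking.
func isCanceled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// detachedAndLinked cancels a parent and reports whether two children saw it:
// one derived from context.Background() (detached) and one derived from the
// parent (linked).
func detachedAndLinked() (detached, linked bool) {
	parent, cancelParent := context.WithCancel(context.Background())

	detachedCtx, cancelDetached := context.WithTimeout(context.Background(), time.Minute)
	defer cancelDetached()
	linkedCtx, cancelLinked := context.WithTimeout(parent, time.Minute)
	defer cancelLinked()

	cancelParent()
	return isCanceled(detachedCtx), isCanceled(linkedCtx)
}

func main() {
	detached, linked := detachedAndLinked()
	fmt.Println("detached child canceled:", detached) // false
	fmt.Println("linked child canceled:", linked)     // true
}
```

The detached child keeps running until its own one-minute timeout, which is the symptom this guide is about.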

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Take a goroutine profile: go tool pprof http://localhost:6060/debug/pprof/goroutine and look for goroutines stuck in your code
  2. Add a global counter for goroutine starts/stops and log it on every request to detect leaks
  3. Wrap your context with a custom type that logs when its Done() channel is closed, then check whether it fires
  4. Run with the race detector: go run -race . (it will not flag a missed cancellation, but it catches data races on shared state, such as a context or cancel variable reassigned across goroutines)
  5. Check whether any context.WithTimeout or WithCancel call derives from context.Background() instead of the parent context, which cuts it off from the parent's cancellation
  6. Insert a select { case <-ctx.Done(): log.Println("canceled"); default: } at the start of suspected goroutines
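
The start/stop counter from step 2 could look roughly like this. It is a sketch only: trackedGo, running, and runAndCount are hypothetical names, and a real service would export the counter as a metric or log it per request instead of returning it.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// running counts goroutines started through trackedGo that have not returned yet.
var running atomic.Int64

// trackedGo starts fn in a goroutine and keeps the running counter up to date.
func trackedGo(ctx context.Context, wg *sync.WaitGroup, fn func(context.Context)) {
	running.Add(1)
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer running.Add(-1) // runs before wg.Done, so Wait sees the final count
		fn(ctx)
	}()
}

// runAndCount starts n workers that block until ctx is canceled, cancels it,
// and returns the counter before cancellation and after all workers exited.
func runAndCount(n int) (before, after int64) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		trackedGo(ctx, &wg, func(ctx context.Context) { <-ctx.Done() })
	}
	before = running.Load()
	cancel()
	wg.Wait()
	return before, running.Load()
}

func main() {
	before, after := runAndCount(3)
	fmt.Printf("running before cancel: %d, after: %d\n", before, after)
}
```

If the counter keeps growing from request to request, some goroutine is not observing cancellation.
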
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • All places where you call context.WithCancel, WithTimeout, or WithDeadline: verify the returned cancel function is actually called
  • Goroutine creation sites: look for 'go func()' or 'go myFunc()' and trace the context argument
  • Third-party library calls that accept a context: check their documentation for cancellation support
  • HTTP server handlers: verify you're using r.Context() and not a background context
  • database/sql query calls: ensure they use the context-aware methods (such as QueryContext and ExecContext) with the passed context, not a stored one
  • Your struct fields: any struct holding a context should be suspect, since contexts should not be stored
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Calling context.WithCancel but forgetting to call the returned cancel() function
  • Deriving a child context from context.Background() instead of the parent request context
  • Storing a context in a struct field and using it later after parent cancellation
  • Using context.WithValue and assuming it adds cancellation of its own: it only inherits the parent's
  • Blocking on a channel read/network call without a select on ctx.Done()
  • Goroutine that loops without checking ctx.Done() between iterations
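
For the blocking-channel cause, a minimal sketch of the repair is a receive that also selects on ctx.Done(). The function names recv and canceledRecvReturns are illustrative only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// recv waits for a value from ch but gives up as soon as ctx is canceled.
func recv(ctx context.Context, ch <-chan int) (int, error) {
	select {
	case v := <-ch:
		return v, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// canceledRecvReturns reports whether recv on a channel that never delivers
// returns context.Canceled once its context has been canceled.
func canceledRecvReturns() bool {
	ctx, cancel := context.WithCancel(context.Background())
	never := make(chan int) // nothing is ever sent on this channel
	cancel()
	_, err := recv(ctx, never)
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println("recv unblocked by cancellation:", canceledRecvReturns())
}
```

A bare `<-never` in the same position would block forever, which is how such goroutines end up in the profile.
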
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Ensure every context.WithTimeout/WithCancel call's cancel function is deferred immediately or called at the right scope
  • Wrap every blocking operation (reads, writes, sleeps) in a select that also has a case <-ctx.Done() branch
  • Refactor long-lived goroutines to take a context parameter and check Done() at loop boundaries
  • Use context.WithTimeout wrapping the entire handler flow, and pass that derived context everywhere
  • Replace stored contexts with function parameters: pass the context explicitly to every function that needs it
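
Several of these patterns fit in one handler. The sketch below is an assumption-laden example, not a reference implementation: searchHandler and slowLookup are hypothetical, the 50 ms budget is arbitrary, and slowLookup stands in for a real downstream call.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// slowLookup simulates a one-second backend call that honors cancellation.
func slowLookup(ctx context.Context) error {
	select {
	case <-time.After(time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// searchHandler derives its timeout from r.Context(), defers cancel right
// away, and passes the derived context to everything it calls.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 50*time.Millisecond)
	defer cancel()

	if err := slowLookup(ctx); err != nil {
		http.Error(w, err.Error(), http.StatusGatewayTimeout)
		return
	}
	fmt.Fprintln(w, "ok")
}

// statusForSlowBackend runs the handler once in-process and returns its status.
func statusForSlowBackend() int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/search", nil)
	searchHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("status:", statusForSlowBackend()) // status: 504
}
```

Because the timeout context is derived from r.Context(), a client disconnect and the 50 ms budget both cancel the lookup.
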
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Run the scenario that caused the leak and check the pprof goroutine count before and after; the count should return to baseline
  • Write a unit test that cancels the context and asserts that all spawned goroutines terminate within a timeout
  • Add a deferred log statement to each goroutine that prints when it exits
  • Run a test under the race detector that cancels the context and waits for goroutines to finish
  • Monitor goroutine count in production via /debug/pprof/goroutine and set an alert on growth
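
The before/after comparison can also be done in-process with runtime.NumGoroutine. This is a rough sketch: waitForBaseline and leakCheck are hypothetical helpers, and the polling interval and one-second bound are arbitrary choices.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// waitForBaseline polls runtime.NumGoroutine until it is back at or below
// baseline, or until timeout elapses.
func waitForBaseline(baseline int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= baseline {
			return true
		}
		time.Sleep(5 * time.Millisecond)
	}
	return runtime.NumGoroutine() <= baseline
}

// leakCheck starts ten goroutines that wait on ctx, cancels ctx, and reports
// whether the goroutine count returned to its starting value.
func leakCheck() bool {
	baseline := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	for i := 0; i < 10; i++ {
		go func() { <-ctx.Done() }()
	}
	cancel()
	return waitForBaseline(baseline, time.Second)
}

func main() {
	fmt.Println("back to baseline:", leakCheck())
}
```

Polling is needed because goroutines exit shortly after cancellation, not at the instant cancel() returns.
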
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Don't use context.Background() inside an HTTP handler; use r.Context()
  • Don't store a context in a struct that outlives a single request
  • Don't ignore the cancel function returned by WithCancel: call it on every code path (calling it more than once is harmless)
  • Don't assume third-party libraries respect cancellation; verify with a quick test
  • Don't use context.WithValue to make a cancelable child; it doesn't work that way
  • Don't put a context in a global variable or long-lived cache
( 07 ) War story

Goroutine Leak from a Missed defer cancel() in a Search Aggregator

Backend Engineer · Go 1.21, PostgreSQL, gRPC, Prometheus, Kubernetes

Timeline

  1. 10:00 Deploy new search aggregator service to staging
  2. 10:15 Alert: P99 latency spikes from 200ms to 30s, memory climbing
  3. 10:20 Run pprof goroutine: 1500 goroutines (baseline ~50), most stuck in 'pgx.(*Conn).Query'
  4. 10:25 Check context usage: each HTTP handler creates a WithTimeout, but the cancel is not deferred
  5. 10:30 Found that in one path, the cancel function is only called on success, not on error
  6. 10:35 Add defer cancel() after WithTimeout, redeploy
  7. 10:40 Goroutine count drops to 60, latency normalizes
  8. 10:45 Retrospective: the code had 'ctx, cancel := context.WithTimeout(...)' but the cancel variable was shadowed later

I was on-call when the P99 latency alert fired. The search aggregator service, which fans out queries to three downstream services and a database, was suddenly taking 30 seconds for requests that should complete in 200ms. Memory was also climbing steadily. My first instinct was to look at goroutine profiles.

pprof showed 1500 goroutines, most stuck in pgx query calls. That told me the database queries were never being canceled. I traced the code: each HTTP handler created a context with WithTimeout, but the cancel function was stored in a local variable that got shadowed inside an error-handling block. The cancel was only called on success, never on error or timeout.

The fix was simple: defer cancel() right after the WithTimeout call. I also added a pattern where we always defer cancel in the same scope. After redeploy, goroutines dropped to baseline and latency recovered. The lesson: always defer cancel() immediately, and never let a context escape the function scope without a cancel path.

Root cause

The cancel function returned by context.WithTimeout was not deferred, and in a code path handling an error, the cancel variable was shadowed, so cancel() was never called.

The fix

Added `defer cancel()` immediately after `ctx, cancel := context.WithTimeout(...)` in all handler functions.

The lesson

Always defer cancel() right after creating a cancelable context; never store cancel in a variable that might be reassigned.

( 08 ) How Context Cancellation Actually Works

A context in Go is an interface whose Done() method returns a channel that is closed when the context is canceled or times out. Propagation works because each derived context (created by WithCancel, WithTimeout, or WithDeadline) is linked to its parent: it registers with the parent's cancellation or, for custom parent types, watches the parent's Done() channel. When the parent is canceled, every context derived from it has its own Done() channel closed as well. Cancellation flows only downward: canceling a child never cancels its parent.
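
A small sketch of that direction of flow, using a three-level chain and canceling the middle one. The function name propagation is made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// propagation builds parent -> child -> grandchild, cancels only the child,
// and reports which of the three contexts ended up canceled.
func propagation() (parentDone, childDone, grandchildDone bool) {
	parent, cancelParent := context.WithCancel(context.Background())
	defer cancelParent()
	child, cancelChild := context.WithCancel(parent)
	grandchild, cancelGrandchild := context.WithTimeout(child, time.Hour)
	defer cancelGrandchild()

	cancelChild() // cancel the middle of the chain

	return parent.Err() != nil, child.Err() != nil, grandchild.Err() != nil
}

func main() {
	p, c, g := propagation()
	fmt.Println(p, c, g) // false true true
}
```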

The key mistake is expecting WithValue to create a new cancelable branch. It does not: it only attaches a key-value pair, and the returned context shares its parent's cancellation. If the parent can never be canceled (for example, context.Background()), neither can the WithValue child. If you want a cancelable child, you must use WithCancel, WithTimeout, or WithDeadline.
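
To make the WithValue behavior concrete, here is a short sketch with two cases: a value context on a cancelable parent, and one on context.Background(). The key type and function names are invented for the example.

```go
package main

import (
	"context"
	"fmt"
)

// requestIDKey is an unexported key type, as the context package recommends.
type requestIDKey struct{}

// valueChildFollowsParent reports whether a WithValue child is canceled when
// its cancelable parent is canceled.
func valueChildFollowsParent() bool {
	parent, cancel := context.WithCancel(context.Background())
	child := context.WithValue(parent, requestIDKey{}, "req-42")
	cancel()
	return child.Err() != nil
}

// valueOnBackgroundHasNoDone reports whether a WithValue child of
// context.Background() has a nil Done channel, meaning it can never be canceled.
func valueOnBackgroundHasNoDone() bool {
	child := context.WithValue(context.Background(), requestIDKey{}, "req-42")
	return child.Done() == nil
}

func main() {
	fmt.Println("value child canceled with parent:", valueChildFollowsParent())
	fmt.Println("value child of Background is uncancelable:", valueOnBackgroundHasNoDone())
}
```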

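Another nuance: if you call WithTimeout on a context that is already expired, the returned context is immediately canceled. And if you never call the cancel function, the child context and its timer stay registered until the deadline passes or the parent is canceled, which holds on to resources for that whole period. Always defer cancel() to release them as soon as the work is done.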
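
The expired-parent case can be checked with a few lines. In this sketch the negative timeout is only a convenient way to produce a parent whose deadline has already passed; expiredParentCancelsChild is a hypothetical name.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// expiredParentCancelsChild derives a one-hour child from a parent whose
// deadline is already in the past and reports whether the child is born
// canceled with context.DeadlineExceeded.
func expiredParentCancelsChild() bool {
	parent, cancelParent := context.WithTimeout(context.Background(), -time.Second)
	defer cancelParent()

	child, cancelChild := context.WithTimeout(parent, time.Hour)
	defer cancelChild() // still required, even though the child is already done

	return errors.Is(child.Err(), context.DeadlineExceeded)
}

func main() {
	fmt.Println("child of expired parent is canceled:", expiredParentCancelsChild())
}
```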

( 09 ) Common Patterns That Break Propagation

Storing a context in a struct: This is the number one pattern I see in code reviews. A struct holds a context field that is set once and used later. If that field was set at construction time (often from context.Background()), it has no link to the request being served, so canceling the request never reaches it. The rule: don't store a context in a struct; pass it as a function parameter.
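
A side-by-side sketch of the two shapes, with hypothetical types storedWorker (the anti-pattern) and paramWorker (the fix). Both Do methods just report the context's error so the difference is easy to see.

```go
package main

import (
	"context"
	"fmt"
)

// storedWorker keeps the context it was constructed with (anti-pattern).
type storedWorker struct{ ctx context.Context }

// Do consults the stored context, which knows nothing about the current request.
func (w *storedWorker) Do() error { return w.ctx.Err() }

// paramWorker receives the caller's context on every call.
type paramWorker struct{}

// Do consults the context of the request being served.
func (paramWorker) Do(ctx context.Context) error { return ctx.Err() }

// compare cancels a request context and reports which worker noticed.
func compare() (storedSaw, paramSaw bool) {
	stored := &storedWorker{ctx: context.Background()} // set once at startup

	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // the request is canceled

	return stored.Do() != nil, paramWorker{}.Do(reqCtx) != nil
}

func main() {
	storedSaw, paramSaw := compare()
	fmt.Println("stored-context worker saw cancellation:", storedSaw)
	fmt.Println("parameter worker saw cancellation:", paramSaw)
}
```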

Goroutines that don't check ctx.Done(): If you have a goroutine that loops forever, make sure each iteration checks the context. Otherwise, the goroutine will keep running even after cancellation. Use a select statement or a non-blocking check.
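
One way to structure such a loop is a select over a ticker and ctx.Done(). The sketch below assumes periodic work; poll and pollStopsOnCancel are illustrative names, and the intervals are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// poll calls work on every tick and returns as soon as ctx is canceled.
func poll(ctx context.Context, interval time.Duration, work func()) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			work()
		}
	}
}

// pollStopsOnCancel runs poll under a 50 ms timeout and reports how many
// iterations ran and whether the loop ended because of the deadline.
func pollStopsOnCancel() (ticks int, stopped bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := poll(ctx, 5*time.Millisecond, func() { ticks++ })
	return ticks, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	ticks, stopped := pollStopsOnCancel()
	fmt.Printf("ran %d iterations, stopped by context: %v\n", ticks, stopped)
}
```

If work itself can block for a long time, it should take ctx as well; the select only protects the wait between iterations.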

Third-party libraries that ignore context: Some libraries accept a context but don't actually use it for cancellation. Always verify by reading the source or writing a quick test. If the library is broken, wrap calls with a timeout on your side. This bounds how long you wait, although the library's own work may keep running in the background until it returns.
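
A caller-side wrapper might be sketched as follows. callWithContext is a hypothetical helper and the sleeping function stands in for a library call that ignores its context. Note the limitation: the wrapped call is abandoned, not stopped, so its goroutine lives until the call returns on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithContext runs fn in a goroutine and returns when fn finishes or ctx
// is canceled, whichever comes first.
func callWithContext(ctx context.Context, fn func() error) error {
	done := make(chan error, 1) // buffered: fn's goroutine can always deliver and exit
	go func() { done <- fn() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// wrapperTimesOut wraps a 200 ms call in a 20 ms budget and reports whether
// the caller got context.DeadlineExceeded back.
func wrapperTimesOut() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	err := callWithContext(ctx, func() error {
		time.Sleep(200 * time.Millisecond) // stand-in for a call that ignores ctx
		return nil
	})
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("caller released by timeout:", wrapperTimesOut())
}
```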

( 10 ) Using pprof to Diagnose Leaks

The fastest way to confirm a context propagation issue is to take a goroutine profile. Run `curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'` (quote the URL so the shell leaves the `?` alone) and look for goroutines stuck in your code. The stack trace will show the exact line where they're blocked; `debug=2` prints one full stack per goroutine, including its wait state.

For an interactive view, use `go tool pprof http://localhost:6060/debug/pprof/goroutine` and then run the `traces` command to list the captured stacks. Filter by your package name.
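
The URLs above assume the service imports net/http/pprof and serves it on localhost:6060. Where that endpoint is not reachable, the same goroutine profile can be written in-process with runtime/pprof. The following is a sketch; dumpGoroutines and dumpShowsGoroutines are hypothetical helpers.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"runtime/pprof"
	"strings"
)

// dumpGoroutines writes one full stack per goroutine to w, the same format
// the HTTP endpoint returns for debug=2.
func dumpGoroutines(w io.Writer) error {
	return pprof.Lookup("goroutine").WriteTo(w, 2)
}

// dumpShowsGoroutines starts a goroutine that waits on ctx, takes a dump, and
// reports whether the dump contains goroutine stack headers.
func dumpShowsGoroutines() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	started := make(chan struct{})
	go func() {
		close(started)
		<-ctx.Done()
	}()
	<-started

	var buf bytes.Buffer
	if err := dumpGoroutines(&buf); err != nil {
		return false
	}
	return strings.Contains(buf.String(), "goroutine ")
}

func main() {
	fmt.Println("dump contains goroutine stacks:", dumpShowsGoroutines())
}
```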

Goroutines parked in a select that includes `ctx.Done()` are normal as long as their context is eventually canceled. Goroutines stuck in a query, a network call, or a channel send with no select on the context are the leak. Compare the number of goroutines before and after a test request.

( 11 ) Testing Cancellation Propagation

Write a unit test that spawns a goroutine with a context, cancels the context, and asserts that the goroutine exits within a reasonable timeout. Use `ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)` with `defer cancel()` so the test itself cannot hang, call `cancel()` explicitly at the point you want to test, and then check that the goroutine's done channel is closed within a bounded wait.

You can also use the race detector to catch concurrent misuse. Run `go test -race` and look for data races on variables that hold a context or a cancel function. Context values themselves are safe for concurrent use, so a race here usually means a variable is being reassigned in one goroutine while another goroutine reads it.

Another technique: inject a context that logs when its Done() channel is closed. Wrap the context with a custom implementation that prints a message (and, if useful, a stack trace) on cancellation. This helps identify where cancellation happens (or doesn't).
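
On Go 1.21 or newer, a lighter alternative to a custom context type is context.AfterFunc, sketched below. logOnCancel is a hypothetical helper. The callback runs in its own goroutine, so it shows that and why cancellation fired, not which caller triggered it.

```go
package main

import (
	"context"
	"log"
	"time"
)

// logOnCancel registers a callback that logs once ctx is canceled. The
// returned stop function unregisters the callback if it has not run yet.
func logOnCancel(ctx context.Context, name string, logf func(format string, args ...any)) (stop func() bool) {
	return context.AfterFunc(ctx, func() {
		logf("context %q canceled: %v", name, context.Cause(ctx))
	})
}

// cancellationObserved cancels a context and reports whether the callback
// fired within one second.
func cancellationObserved() bool {
	ctx, cancel := context.WithCancel(context.Background())
	fired := make(chan struct{})
	stop := logOnCancel(ctx, "search", func(string, ...any) { close(fired) })
	defer stop()

	cancel()
	select {
	case <-fired:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	logOnCancel(ctx, "demo", func(format string, args ...any) {
		log.Printf(format, args...)
		close(done)
	})
	cancel()
	<-done // wait for the log line before exiting
}
```

If the log line never appears for a request you know was canceled, the context being watched is not linked to that request.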

Frequently asked questions

Why does my context.WithValue not propagate cancellation?

context.WithValue does not create a new cancelable branch; it only adds a key-value pair. The returned context shares the same parent's Done() channel. If you want a cancelable child, use context.WithCancel, WithTimeout, or WithDeadline.

How do I know if a third-party library respects context cancellation?

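Read the documentation or source code. Look for a select on ctx.Done() inside the library's blocking calls. Alternatively, write a quick test: create a context with a short timeout, call the library function with work that would normally take longer, and verify it returns promptly once the timeout expires (typically with an error wrapping context.DeadlineExceeded) instead of running to completion.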

What's the best practice for passing context to goroutines?

Always pass context as the first parameter (by convention). Never store it in a struct. In the goroutine, check ctx.Done() at the beginning and in loops. Use a select to handle both the context and the work channels.

Can I reuse a context after it's canceled?

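No. Once a context is canceled, it stays canceled, and so does every context derived from it. For a new operation, create a new context from a parent that is still live (e.g., context.Background()). Any call that receives a canceled context sees Done() already closed and a non-nil ctx.Err(), so well-behaved code returns immediately.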