I've lost count of how many times I've been in a war room, staring at a flame graph, and heard someone say, "That red bar must be the problem." Flame graphs are one of the most powerful observability tools we have — but they're also one of the most commonly misinterpreted. The color isn't a severity indicator. The width isn't call count. And a tall stack doesn't mean you have a deep problem.
This post is about what the graph is actually telling you: where your CPU cycles are going, and more importantly, where they aren't going. I'll walk through a real flame graph from a production incident, show you the patterns that matter, and give you a repeatable process for turning profile data into a fix.
Anatomy of a Flame Graph (the Parts People Skip)
Every flame graph has a few key elements: the x-axis (width), the y-axis (depth/stack), the color, and the sample count. Most tutorials cover the basics — wider bars = more CPU time, top bars are what's executing. But here's what's often missing: the sample rate and the sampling mode.
Profiles captured with perf (Linux perf_events) are conventionally sampled at 99 Hz, set with -F 99. That is a convention, not perf's default rate, which is much higher. At 99 Hz the profiler takes 99 snapshots per second of the program counter and stack on each busy CPU. If a single-threaded app stays on-CPU for 60 seconds, you have ~5,940 samples. A function that appears in 100 of those samples accounts for about 1.7% of CPU time. A different profiler, such as async-profiler on Java, may sample at a different rate, and a higher rate gives the graph more detail. Always check the total sample count, which the root frame reports.
Quick sanity check: compare the total number of samples with (runtime in seconds × sample rate × number of busy threads). A CPU profiler only records a thread while it is on-CPU, so a large shortfall usually means your app spent that time sleeping or blocked. Samples dropped by the profiler are the less common cause.
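The arithmetic above is easy to script. This is a minimal sketch, assuming one thread profiled at a fixed rate; the function names and the 1,200-sample capture are illustrative and not part of perf or async-profiler:

```python
def expected_samples(rate_hz: float, duration_s: float, busy_threads: float = 1.0) -> float:
    """Samples a CPU profiler would collect if the threads never left the CPU."""
    return rate_hz * duration_s * busy_threads


def cpu_share(function_samples: int, total_samples: float) -> float:
    """Fraction of collected samples in which a function appears."""
    return function_samples / total_samples


def on_cpu_fraction(observed_samples: int, rate_hz: float, duration_s: float) -> float:
    """Rough share of wall-clock time that one thread spent on-CPU."""
    return observed_samples / expected_samples(rate_hz, duration_s)


budget = expected_samples(99, 60)      # 99 Hz for 60 seconds
share = cpu_share(100, budget)         # a function seen in 100 samples
print(f"sample budget: {budget:.0f}, function share: {share:.1%}")

# Hypothetical capture that collected only 1,200 samples in the same window.
print(f"on-CPU fraction: {on_cpu_fraction(1200, 99, 60):.0%}")
```

A low on-CPU fraction is a hint to look at blocking and I/O before spending time on the CPU profile.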
The Two Dimensions: Width vs. Height
Width represents the proportion of samples in which that frame was on the stack, which approximates its share of CPU time including everything it calls. A frame that is 30% wide was present in 30% of all samples. The left-to-right order says nothing about time: stacks are sorted alphabetically so that identical frames merge. Height shows the call chain depth. A tall stack means many nested calls, but not necessarily a problem; a deep framework like Spring Boot will naturally produce tall stacks.
The real signal is in the width of the leaf functions (the topmost frames). If a leaf function is wide (say, >5% of total width), that's a candidate for optimization. If a middle function is wide, it's just a container — you need to look at its children.
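Leaf widths can be read straight from folded-stack text. This is a minimal sketch, assuming input in the `frame;frame;leaf count` format that stackcollapse-perf.pl emits; the sample data and the name `leaf_widths` are made up for illustration:

```python
from collections import Counter

# Hypothetical folded stacks; the counts are invented for illustration.
FOLDED = """\
main;handle;lookup;hash 40
main;handle;lookup 5
main;handle;serialize;writeField 30
main;handle;parse;readToken 20
main;gc 5
"""


def leaf_widths(folded: str) -> dict[str, float]:
    """Return each leaf function's share of all samples (its self time)."""
    self_samples = Counter()
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)   # the count is the last field
        leaf = stack.split(";")[-1]          # topmost frame: the code on-CPU
        self_samples[leaf] += int(count)
    total = sum(self_samples.values())
    return {fn: n / total for fn, n in self_samples.most_common()}


widths = leaf_widths(FOLDED)
for fn, share in widths.items():
    flag = "  <-- candidate" if share > 0.05 else ""
    print(f"{share:6.1%}  {fn}{flag}")
```

Note that `lookup` appears as a leaf only for the samples where it was itself executing; the time spent in `hash` is credited to `hash`, which is why the leaf view separates containers from consumers.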
The Incident: A Production Slowdown Traced to a Flame Graph
A few months ago, our API latency jumped from 20ms p99 to 600ms p99 after a routine deployment. We pulled a CPU flame graph from one of the affected nodes using async-profiler. The graph looked like a mountain — tall, narrow peaks all over the place. The usual suspects (GC, serialization) were thin.
But one stack caught my eye: a broad plateau at the top labeled `hashCode()` inside `java.util.HashMap.get()`. The stack was: `HashMap.get -> hashCode -> String.hashCode`. This function alone accounted for 34% of CPU samples. The team initially thought it was a collision issue, but the flame graph showed that `String.hashCode` itself was the hot leaf — not the collision resolution.
The HashMap Flame Graph Incident
- 14:05 p99 latency spikes to 600ms after deploy.
- 14:12 CPU flame graph captured via async-profiler (60s, 99 Hz).
- 14:18 Identified HashMap.get() at 34% width, leaf is String.hashCode.
- 14:22 Code review reveals key objects used as map keys but hashCode() not cached; each lookup recomputes hash from a long string.
- 14:35 Fix: cache hashCode in the key object. Deploy.
- 14:50 p99 latency drops back to 25ms. Flame graph now shows HashMap.get < 2%.
Lesson
A wide leaf function is almost always the place to start. Don't assume the problem is in the framework — follow the width to the actual CPU consumer.
How to Read a Flame Graph in 3 Steps
1. Check the total sample count and sample rate. Ensure you have enough data (at least a few thousand samples).
2. Find the widest leaf functions (topmost frames). Focus on those that account for >5% of total width. Ignore intermediate frames; they're just callers.
3. Zoom into a wide leaf and look at its siblings. If multiple leaves under the same parent are wide, the parent might be doing too much. If only one leaf is wide, optimize that function.
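The three steps can be sketched as a small triage function over folded stacks. This is a rough illustration under stated assumptions: the 2,000-sample minimum stands in for "a few thousand", leaves are grouped by the name of their immediate parent only, and the data and the name `triage` are hypothetical:

```python
from collections import defaultdict

# Hypothetical folded stacks; the counts are invented for illustration.
FOLDED = """\
main;handle;render;escapeHtml 900
main;handle;render;formatDate 700
main;handle;lookup;hash 1200
main;handle;parse;readToken 100
main;gc 100
"""


def triage(folded, min_samples=2000, hot=0.05):
    """Return (enough_data, {parent: (advice, [(leaf, share), ...])})."""
    self_samples = defaultdict(int)  # (parent, leaf) -> samples with leaf on top
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        parent = frames[-2] if len(frames) > 1 else "(root)"
        self_samples[(parent, frames[-1])] += int(count)

    total = sum(self_samples.values())
    enough_data = total >= min_samples            # step 1: is there enough data?

    hot_leaves = defaultdict(list)                # step 2: leaves above the threshold
    for (parent, leaf), n in self_samples.items():
        if n / total > hot:
            hot_leaves[parent].append((leaf, n / total))

    findings = {}                                 # step 3: compare siblings
    for parent, leaves in hot_leaves.items():
        leaves.sort(key=lambda item: item[1], reverse=True)
        advice = "review parent" if len(leaves) > 1 else "optimize leaf"
        findings[parent] = (advice, leaves)
    return enough_data, findings


enough_data, findings = triage(FOLDED)
print("enough samples:", enough_data)
for parent, (advice, leaves) in findings.items():
    print(parent, "->", advice, [(leaf, f"{share:.0%}") for leaf, share in leaves])
```

With this input, `render` has two wide leaves and gets flagged as a parent worth reviewing, while `lookup` has a single hot leaf to optimize directly.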
Common Patterns and Misconceptions
- The "Mountain" pattern: Many tall, thin stacks with similar widths. Often indicates a framework overhead (e.g., Spring proxy calls). Not necessarily a problem unless total CPU is high.
- The "Table" pattern: One very wide stack dominating the graph. Classic hotspot: optimize that path.
- The "Frozen" pattern: A single stack that is both wide and tall, with no branching. Could be a tight loop or recursion. Check if it's expected (e.g., a busy-wait).
- Red does not mean fire. The default flamegraph.pl palette picks warm hues (reds, oranges, yellows) at random, purely to tell adjacent frames apart. Some tools instead derive the color from a hash of the function name or from the frame type. If you want color to carry meaning, use a differential flame graph (red = hotter, blue = cooler compared to baseline).
Use the search feature (Ctrl+F) in the SVG produced by flamegraph.pl to find a function by name. The graph highlights every occurrence and shows the total matched percentage. That's faster than scanning visually.
Practical Workflow with perf and FlameGraph.pl
To capture a CPU profile on a Linux system, you can use perf. Here's the command I use most often:
# Record 60 seconds of CPU samples at 99 Hz for a specific process
perf record -F 99 -p <PID> -g -- sleep 60
# Generate a flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg
The resulting SVG is interactive: click on a frame to zoom, hover to see details, and search with Ctrl+F. I usually start by clicking on the widest leaf and then looking at the callers to understand the context.
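For Java applications, async-profiler is a better choice because it understands JVM stack frames and can profile without the safepoint bias problem. The command is similar. It uses the profiler.sh launcher and SVG output from older async-profiler releases; newer releases write HTML flame graphs and ship the launcher as asprof, so check the version you have installed: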
# Profile a Java process for 60 seconds and output a flame graph
./profiler.sh -d 60 -f profile.svg <PID>
Differential Flame Graphs: Before vs. After
The most powerful variant is the differential (or 'delta') flame graph. You take two profiles — one from a baseline (e.g., before a deploy) and one from the suspect period — and subtract them. Red frames indicate increased CPU time, blue indicates decreased. This eliminates noise and highlights exactly what changed.
Brendan Gregg's FlameGraph tools include `difffolded.pl` for this. Here's the workflow:
# Generate folded stacks for both profiles
# (baseline.data and suspect.data are assumed names for two recordings saved with perf record -o)
perf script -i baseline.data | ./stackcollapse-perf.pl > baseline.folded
perf script -i suspect.data | ./stackcollapse-perf.pl > suspect.folded
# Create differential flame graph
./difffolded.pl baseline.folded suspect.folded | ./flamegraph.pl > diff.svg
In our HashMap incident, a differential flame graph between the previous build and the bad deploy would have immediately shown a new red plateau at `HashMap.get`. That would have saved us 10 minutes of manual inspection.
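The subtraction behind a differential view can be sketched in a few lines. This is a simplification, not how difffolded.pl is implemented: it aggregates by leaf function instead of by full stack, and it compares shares so that profiles with different sample totals stay comparable. The data and function names are hypothetical:

```python
from collections import Counter

# Hypothetical folded stacks for the two builds; the counts are invented.
BASELINE = """\
main;handle;HashMap.get;String.hashCode 2
main;handle;serialize 58
main;gc 40
"""
SUSPECT = """\
main;handle;HashMap.get;String.hashCode 170
main;handle;serialize 180
main;gc 150
"""


def leaf_shares(folded):
    """Map each leaf function to its share of the profile's samples."""
    samples = Counter()
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        samples[stack.split(";")[-1]] += int(count)
    total = sum(samples.values())
    return {leaf: n / total for leaf, n in samples.items()}


def leaf_deltas(baseline, suspect):
    """Change in share per leaf function, largest increase first."""
    before, after = leaf_shares(baseline), leaf_shares(suspect)
    deltas = {leaf: after.get(leaf, 0.0) - before.get(leaf, 0.0)
              for leaf in before.keys() | after.keys()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


deltas = leaf_deltas(BASELINE, SUSPECT)
for leaf, delta in deltas:
    color = "red" if delta > 0 else "blue"
    print(f"{delta * 100:+6.1f} pp  {leaf}  ({color})")
```

The function whose share grew the most sorts to the top, which is the same question the red frames in a differential flame graph answer visually.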
When Not to Trust the Flame Graph
- If the total CPU utilization of your process is low (e.g., <20%), a flame graph will show mostly thin stacks. Your problem is likely I/O or lock contention, not CPU. Use off-CPU flame graphs or tracing instead.
- If your sampling rate is too low (e.g., <50 Hz on a fast app), you might miss short-lived spikes. Increase rate or duration.
- If the profiler has a safepoint bias (as samplers built on jstack-style thread dumps do), the graph will overrepresent methods that are near safepoints. async-profiler avoids this.
of performance improvements I've made started with a CPU flame graph showing a single wide leaf function that wasn't obvious from code review alone.
Reading flame graphs is a skill that gets better with practice. The next time you're debugging a CPU issue, don't get distracted by the colors or the depth. Follow the width. It will almost always lead you to the code that's actually burning cycles.
Frequently asked questions
What's the difference between a flame graph and an icicle graph?
A flame graph shows stacks growing upward from the root (bottom). An icicle graph inverts the view so the root is at the top and leaves at the bottom. Icicle views help you see which functions are 'on top' (actually executing) vs. being called by others. Both show the same data, just oriented differently.
Why do some flame graphs have random gaps or missing samples?
A flame graph's x-axis is not a timeline, so a gap in the picture is not a gap in time. What people usually notice is a lower sample count than the runtime and sample rate would predict. A CPU profiler only captures samples while a thread is on-CPU. If your app is waiting on a network call, that time never appears, so a low sample count or a thin flame graph might indicate your app is I/O-bound, not CPU-bound.
How deep should a stack be before I investigate?
There's no hard rule. What matters is the percentage of CPU time a leaf function consumes. A stack that is 50 frames deep but only 0.1% wide is noise. Focus on functions that account for >5% of total samples. For performance-critical code, even 2-3% can be worth optimizing if it's in a hot loop.
Can flame graphs show memory or I/O issues?
No. Standard CPU flame graphs only show where CPU time is spent. For memory, use heap profiles or memory flame graphs. For I/O, use off-CPU flame graphs (e.g., tracing blocked stacks). However, a CPU flame graph can indirectly hint at an I/O issue when its sample count is far below what the runtime and sample rate would predict.