What this usually means
Long GC pauses happen when the JVM spends too much time in stop-the-world phases. For young collections (minor GC), the cause is often a too-large young generation or an excessive allocation rate. For old collections (major GC), it's usually a fragmented heap, an oversized tenured generation, or a misconfigured collector such as CMS or G1GC. In G1GC, the initial mark, remark, and cleanup phases can become long if the heap is too large or the remembered sets are bloated. The underlying mechanism is the same: the JVM must stop all application threads to safely move objects or compact regions, and if the work per pause exceeds the target pause time, you get latency spikes.
The first ten minutes — establish facts before touching code.
- 1. Enable GC logging: `-Xlog:gc*,gc+age*,gc+ergo*=trace:file=gc.log:utctime,level,tags` for Java 9+, or `-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log` for Java 8.
- 2. Get a baseline: run `jstat -gcutil <pid> 1s` and watch the YGCT, FGCT, and GCT columns (cumulative seconds). If FGCT is more than about 10% of GCT, you have a full GC problem.
- 3. Upload the gc.log to a GC log analyzer such as GCeasy – it gives instant cause classification.
- 4. Check heap sizing: `jmap -heap <pid>` on Java 8, or `jhsdb jmap --heap --pid <pid>` on Java 9+, and see if young/old ratios are reasonable. Look for 'Eden Space' capacity vs utilization.
- 5. Monitor at the OS level: `top -H -p <pid>` shows CPU per thread; during a GC pause, the GC threads (the VM thread and the collector's worker threads) should be the only busy ones. If not, there's interference.
- 6. Check if the JVM is swapping: `grep VmSwap /proc/<pid>/status`. Any swapped-out heap pages inflate GC pause times, because the collector has to touch them.
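The `jstat` baseline from step 2 can also be taken from inside the process. The following is a minimal sketch, assuming you can run code in the target JVM; the class name `GcOverhead` and its helper methods are hypothetical. It uses the standard `GarbageCollectorMXBean` API, whose collection time is approximate and, for concurrent collectors, is not purely pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reports what share of JVM uptime has been spent in garbage collection. */
final class GcOverhead {

    /** GC time as a fraction of uptime; 0.0 when uptime is not yet known. */
    static double fraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    /** Sums the accumulated collection time across all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("GC overhead: %.2f%% of uptime%n",
                100.0 * fraction(totalGcMillis(), uptime));
    }
}
```

Sampling these counters twice and diffing them gives a rate over the interval, which is usually more useful than the since-startup totals.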
The specific files, logs, configs, and dashboards that usually own this bug.
- GC log file (`gc.log` or `-Xlog:gc*` output).
- JVM native memory tracking via `jcmd <pid> VM.native_memory summary` (requires `-XX:NativeMemoryTracking=summary` at startup).
- /proc/<pid>/status for swap and memory stats.
- OS kernel log (`dmesg -T | grep -i oom`) for memory pressure.
- JFR (Java Flight Recorder) recording for GC phases: `jcmd <pid> JFR.start duration=60s filename=recording.jfr`.
- Application metrics from APM (Datadog, New Relic) showing GC pause duration.
- Thread dumps (`jstack <pid>`) during a long pause to see if application threads are stuck.
Practical causes, not theory. These are the things you will actually find.
- Young generation too large: excessive Eden space causes long copy time in young GC pauses.
- Old generation fragmentation with CMS: promotion failures lead to full GCs.
- G1GC humongous allocations: objects larger than 50% of the region size are allocated directly in the old generation as humongous regions; they fragment the heap and can end in a full GC.
- Too many threads allocating at once: a high allocation rate forces frequent young GCs.
- JVM not given enough memory: forced GCs from -Xmx being too small relative to live data.
- Misconfigured GC algorithm: using CMS on large heaps (>4GB) or G1 with a region size too small for the objects being allocated.
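To make the humongous-allocation cause concrete, here is a small sketch of the arithmetic. The class `HumongousCheck` is hypothetical and the "more than half a region" rule is taken from the list above; the exact treatment of an object that is precisely half a region is an assumption here.

```java
/** Illustrative arithmetic for G1 humongous allocations (hypothetical helper). */
final class HumongousCheck {

    /** True if an object of this size would be treated as humongous for the given region size. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        if (regionBytes <= 0) {
            throw new IllegalArgumentException("regionBytes must be positive");
        }
        // Assumption: strictly more than half a region counts as humongous.
        return objectBytes > regionBytes / 2;
    }

    /** Humongous objects occupy whole contiguous regions; this rounds up to the region count. */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long buffer = 600L * 1024L; // a 600KB byte[] buffer
        System.out.println("600KB with 1MB regions: " + isHumongous(buffer, mb));
        System.out.println("600KB with 4MB regions: " + isHumongous(buffer, 4 * mb));
        System.out.println("Regions for 2.5MB object at 1MB: " + regionsNeeded(5 * mb / 2, mb));
    }
}
```

The same buffer stops being humongous once the region size grows, which is why region size and typical large-object size need to be looked at together.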
Concrete fix directions. Pick the one that matches your root cause.
- Switch to G1GC with an explicit pause time goal: `-XX:+UseG1GC -XX:MaxGCPauseMillis=50`. If humongous allocations are the problem, raise `-XX:G1HeapRegionSize` (a power of two between 1MB and 32MB) so that fewer objects exceed half a region.
- For CMS, lower `-XX:CMSInitiatingOccupancyFraction` (for example to 70) to start the concurrent cycle earlier, and add `-XX:+UseCMSInitiatingOccupancyOnly` so the JVM honors that value. CMS was removed in JDK 14, so this applies to older JVMs only.
- Tune young generation size with `-Xmn` or `-XX:NewRatio`. Start with 25% of heap and watch young GC pause times. Avoid fixing the young size under G1, because that overrides the pause time goal.
- Enable string deduplication with `-XX:+UseStringDeduplication` under G1GC to reduce the live heap.
- For huge heaps (>32GB), consider ZGC or Shenandoah: `-XX:+UseZGC` or `-XX:+UseShenandoahGC`. Both keep pauses short regardless of heap size; ZGC targets sub-millisecond pauses on recent JDKs.
- Reduce allocation rate: optimize code to allocate fewer objects, use thread-local caching, or increase TLAB size with `-XX:TLABSize`.
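As an illustration of the "thread-local caching" direction, the sketch below reuses one `StringBuilder` per thread instead of allocating a new one per call. The class `ReusableBuffers` and the method `formatOrderLine` are hypothetical. Treat this as an assumption to measure rather than a guaranteed win: the JIT can already eliminate some short-lived allocations, and thread-local buffers hold memory for as long as the thread lives.

```java
/** Reuses a per-thread StringBuilder to cut short-lived allocations on a hot path. */
final class ReusableBuffers {

    // One buffer per thread; never shared, so no synchronization is needed.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    /** Formats e.g. ("ABC-1", 2, 1999) as "ABC-1 x2 @ 19.99". */
    static String formatOrderLine(String sku, int quantity, long priceCents) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset the buffer instead of allocating a new builder
        sb.append(sku).append(" x").append(quantity).append(" @ ")
          .append(priceCents / 100).append('.');
        long cents = priceCents % 100;
        if (cents < 10) {
            sb.append('0'); // keep two decimal digits
        }
        sb.append(cents);
        return sb.toString(); // the returned String is still a new allocation
    }

    public static void main(String[] args) {
        System.out.println(formatOrderLine("ABC-1", 2, 1999));
    }
}
```

Only the intermediate builder is saved here; confirm with an allocation profile (for example a JFR recording) that the builder was a meaningful share of the allocation rate before adopting the pattern.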
A fix you cannot prove is a guess. Close the loop.
- Check GC logs after the fix: young GC pauses should stay under the `MaxGCPauseMillis` target, and the full GC count should drop to zero.
- Run a load test and monitor `jstat -gcutil` – FGCT should be near 0% of total GC time (GCT).
- Verify with JFR: record GC phases and confirm no pause exceeds your pause target (for example 10ms with ZGC, or your `MaxGCPauseMillis` value with G1).
- Monitor application latency percentiles (p99) – they should no longer correlate with GC cycles.
- Heap dump analysis: ensure live data fits comfortably within the heap and no memory leak grows over time.
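The pause checks above come down to comparing a set of measured pause durations against a target. A minimal sketch, assuming the pauses have already been extracted (from the GC log or JFR) into an array of milliseconds; the class `PauseStats` and the sample numbers are hypothetical:

```java
import java.util.Arrays;

/** Summarizes GC pause samples against a pause-time target. */
final class PauseStats {

    /** Nearest-rank percentile; p must be in (0, 100]. */
    static double percentile(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Number of pauses strictly longer than the target. */
    static long countAbove(double[] values, double targetMillis) {
        return Arrays.stream(values).filter(v -> v > targetMillis).count();
    }

    public static void main(String[] args) {
        double[] pausesMillis = {12, 8, 45, 30, 9, 210, 15, 11, 10, 14}; // made-up samples
        System.out.println("p50 = " + percentile(pausesMillis, 50) + "ms");
        System.out.println("p99 = " + percentile(pausesMillis, 99) + "ms");
        System.out.println("over 50ms: " + countAbove(pausesMillis, 50));
    }
}
```

With only a handful of samples the p99 is simply the worst pause, so collect pauses over a full load cycle before drawing conclusions.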
Things that make this bug worse or harder to find.
- Setting -Xms equal to -Xmx on a memory-constrained host – the JVM never returns unused heap to the OS.
- Increasing heap size without adjusting GC algorithm – 64GB heap with CMS will cause massive fragmentation.
- Using G1GC on Java 8 without backport patches – early G1 had many bugs causing long pauses.
- Blindly copying GC flags from a different workload – always measure your own allocation patterns.
- Ignoring OS-level memory pressure: swap or kernel memory reclaim will destroy GC performance no matter what you tune.
- Not testing GC changes under realistic load – a fix that works in QA may fail in prod with different allocation rates.
The 8-Second GC Pause That Killed Our Checkout Latency
Timeline
- 14:01 PagerDuty alert: checkout service p99 latency > 5s for 3 minutes.
- 14:03 Checked APM: GC pause metric shows 8.2s max, 4.3s avg for last 5 minutes.
- 14:05 SSH into box, run `jstat -gcutil <pid> 1s` – FGCT is 45% of total GC time.
- 14:07 Look at GC log: many 'Full GC (Allocation Failure)' with pause times 6-9 seconds.
- 14:10 Heap usage: used=44GB, committed=48GB. Nearly full.
- 14:12 Run `jmap -histo:live <pid>` – char[] instances total 12GB! Suspect memory leak.
- 14:15 Take a heap dump via `jcmd <pid> GC.heap_dump /tmp/dump.hprof` (took 2 min).
- 14:20 Analyze dump with Eclipse MAT: top consumer is a HashMap holding session data that never expires.
- 14:25 Hotfix: clear session cache in code, trigger GC with `jcmd <pid> GC.run` – pause drops to 200ms.
- 14:30 Monitor: p99 latency back to 50ms. Full GCs stop. Incident resolved.
We were midway through Black Friday traffic. The checkout service started timing out. My first instinct was to check GC logs because we had seen long pauses before during load spikes. I saw 'Full GC (Allocation Failure)' with pause times exceeding 8 seconds. That's a dead giveaway that the heap is almost full and the JVM is struggling to find space.
Using jstat, I saw the heap was 44GB out of 48GB. That's too close to the limit. I suspected a memory leak because the live data shouldn't grow that fast. I forced a heap dump with jcmd and transferred it to my laptop for analysis. Eclipse MAT showed a single HashMap holding 12GB of char[] arrays – a session cache that was supposed to evict old entries but had a bug: the eviction thread was never started.
We deployed a fix to clear the cache and set a proper TTL. The GC pauses immediately dropped to under 200ms and full GCs vanished. Lesson: always rule out a memory leak before tuning GC flags – no amount of GC tuning will fix a leak. We also added alerts for heap usage crossing 80% and a JFR recording to capture GC phases in future incidents.
Root cause
Memory leak: unbounded session cache holding 12GB of char[] data, causing heap to fill and trigger frequent full GCs.
The fix
Fixed cache eviction logic: added a scheduled thread to expire old entries every 5 minutes. Also reduced session timeout from 24h to 1h.
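The shape of that fix can be sketched as follows. This is not the incident's actual code: the class `ExpiringSessionCache`, its constructor parameters, and the injected clock are hypothetical, chosen so that the sweeper is started in the constructor (and so cannot be forgotten) and so that expiry can be tested without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** A cache whose entries expire after a TTL and are swept by a scheduled task. */
final class ExpiringSessionCache<K, V> implements AutoCloseable {

    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;
    private final ScheduledExecutorService sweeper;

    ExpiringSessionCache(long ttlMillis, long sweepIntervalMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.sweeper = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "session-cache-sweeper");
            t.setDaemon(true); // do not keep the JVM alive for the sweeper
            return t;
        });
        // Scheduled here, so the eviction task always runs once the cache exists.
        sweeper.scheduleAtFixedRate(this::evictExpired,
                sweepIntervalMillis, sweepIntervalMillis, TimeUnit.MILLISECONDS);
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the value, or null if the key is absent or already expired. */
    V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (entry.expiresAtMillis <= clock.getAsLong()) {
            entries.remove(key, entry);
            return null;
        }
        return entry.value;
    }

    /** Removes expired entries and returns how many were removed. */
    int evictExpired() {
        long now = clock.getAsLong();
        int removed = 0;
        for (Map.Entry<K, Entry<V>> e : entries.entrySet()) {
            if (e.getValue().expiresAtMillis <= now && entries.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return entries.size();
    }

    @Override
    public void close() {
        sweeper.shutdownNow();
    }
}
```

With the values from the fix, construction would look like `new ExpiringSessionCache<>(TimeUnit.HOURS.toMillis(1), TimeUnit.MINUTES.toMillis(5), System::currentTimeMillis)`. A size cap on top of the TTL would bound the cache even under a burst of new sessions.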
The lesson
Before tuning GC flags, rule out memory leaks. Use heap dumps to find the largest objects. GC tuning is for efficient collection, not for plugging leaks.
G1GC divides the heap into equal-sized regions (between 1MB and 32MB by default, derived from the heap size). A young GC pauses all threads to copy live objects from Eden to Survivor regions. The pause time is roughly proportional to the amount of live data in Eden, not to the size of Eden itself. If you have a high allocation rate (e.g., 10GB/s), Eden fills quickly, young GCs run back to back, and pause times can exceed 100ms even with a small young generation.
The key tunable is `-XX:MaxGCPauseMillis`. G1 will adjust young generation size to meet this target. However, if the actual pause time is far above the target, it means G1 cannot shrink young enough – likely because the heap is too small or the allocation rate is too high. In that case, increase heap size or reduce allocation rate. Also check `-XX:G1NewSizePercent` and `-XX:G1MaxNewSizePercent` to constrain young gen; both are experimental flags and need `-XX:+UnlockExperimentalVMOptions`.
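A quick way to reason about allocation pressure is to estimate how often Eden fills. The sketch below is back-of-the-envelope arithmetic only; the class `YoungGcInterval` is hypothetical, and it assumes a constant allocation rate and a fixed Eden size, neither of which holds exactly under G1's adaptive sizing.

```java
/** Rough estimate of the time between young GCs for a given Eden size and allocation rate. */
final class YoungGcInterval {

    static double secondsBetweenYoungGcs(long edenBytes, long allocationBytesPerSecond) {
        if (allocationBytesPerSecond <= 0) {
            throw new IllegalArgumentException("allocation rate must be positive");
        }
        return (double) edenBytes / allocationBytesPerSecond;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        // 2GB of Eden at 10GB/s fills about five times per second.
        System.out.printf("%.2f s between young GCs%n",
                secondsBetweenYoungGcs(2 * gb, 10 * gb));
    }
}
```

If the interval is only a few multiples of the pause itself, the application spends a large share of its time stopped, and reducing the allocation rate tends to help more than further flag tuning.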
An evacuation failure (the G1 counterpart of a CMS concurrent mode failure) occurs when the old generation fills up while G1 is still running a concurrent marking cycle, so there is no free region left to copy objects into. This forces a stop-the-world full GC (serial before JDK 10, parallel since). In G1 logs, you'll see 'To-space exhausted' followed by a 'Full GC'; 'Concurrent Mode Failure' is the CMS wording for the same situation. This is a sign that the concurrent marking started too late or the heap is too small.
To fix, either increase heap size, or start marking earlier with `-XX:InitiatingHeapOccupancyPercent=30` (default 45). Also ensure that G1 has enough CPU to complete concurrent phases before the heap fills. Monitor the concurrent cycle duration with `-Xlog:gc*` and check if it consistently runs near the next cycle trigger.
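The trade-off behind the occupancy threshold can be sketched with simple arithmetic: marking has to finish before the remaining headroom is consumed. This is a deliberately crude model under stated assumptions (constant old-generation growth, no space reclaimed while marking runs, threshold measured against the whole heap); the class `MarkingHeadroom` is hypothetical.

```java
/** Crude model: does concurrent marking finish before the heap fills? */
final class MarkingHeadroom {

    /** Seconds until the heap is full, starting from the initiating occupancy. */
    static double secondsUntilFull(long heapBytes, int ihopPercent, long oldGrowthBytesPerSecond) {
        if (ihopPercent < 0 || ihopPercent > 100) {
            throw new IllegalArgumentException("ihopPercent must be 0..100");
        }
        if (oldGrowthBytesPerSecond <= 0) {
            throw new IllegalArgumentException("growth rate must be positive");
        }
        double headroomBytes = heapBytes * (100 - ihopPercent) / 100.0;
        return headroomBytes / oldGrowthBytesPerSecond;
    }

    static boolean markingFinishesInTime(long heapBytes, int ihopPercent,
                                         long oldGrowthBytesPerSecond, double markingSeconds) {
        return markingSeconds < secondsUntilFull(heapBytes, ihopPercent, oldGrowthBytesPerSecond);
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        // 48GB heap, old generation growing at 1GB/s, marking takes 30s (made-up numbers).
        System.out.println("IHOP 45: " + markingFinishesInTime(48 * gb, 45, gb, 30.0));
        System.out.println("IHOP 30: " + markingFinishesInTime(48 * gb, 30, gb, 30.0));
    }
}
```

In this example, lowering the threshold from 45 to 30 widens the window from about 26 seconds to about 34 seconds, which is enough for a 30-second marking cycle to complete.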
Java Flight Recorder (JFR) provides detailed events for each GC phase. Run `jcmd <pid> JFR.start duration=60s filename=gc.jfr` and then open the recording with JDK Mission Control. Look for the 'Garbage Collection' event – it shows individual stop-the-world times broken into sub-phases like 'Pause Young (Normal)', 'Pause Young (Concurrent Start)', 'Pause Remark', 'Pause Cleanup'.
If 'Pause Remark' is long (>500ms), it's often due to a large number of SATB buffers (snapshot-at-the-beginning) that need to be processed. You can mitigate by increasing `-XX:G1SATBBufferEnqueueingThresholdPercent` or reducing the number of application threads, since each thread keeps its own SATB buffer. If 'Pause Cleanup' is long, it's usually due to many regions to reclaim – consider lowering `-XX:G1MixedGCLiveThresholdPercent` (an experimental flag, default 85) so that regions with too many live objects are skipped.
If your application requires sub-10ms pause times and heap sizes are >16GB, ZGC (experimental in Java 11, production-ready since Java 15) or Shenandoah (experimental in Java 12, production-ready since Java 15) are better choices. ZGC uses colored pointers and load barriers to achieve pause times that do not scale with heap size. Shenandoah uses forwarding pointers (originally a separate Brooks pointer word per object) to move objects concurrently. Both do most of their work concurrently and stop the world only for short phases.
However, they come with trade-offs: ZGC can increase CPU usage by 15-20% for the load barriers, and Shenandoah can increase memory footprint by using a bit of extra heap for forwarding pointers. Test with your workload first. A quick way to check if your app is compatible: set `-XX:+UseZGC` (plus `-XX:+UnlockExperimentalVMOptions` on Java 11 to 14) and run a load test. If throughput drops more than 10%, stick with G1, or with CMS on JVMs that still ship it.
Frequently asked questions
How do I find the GC pause time for a specific JVM process?
The fastest way is to enable GC logging with `-Xlog:gc*:file=gc.log` (Java 9+) or `-XX:+PrintGCDetails -Xloggc:gc.log` (Java 8). Then search for 'pause' in the log, ignoring case because Java 9+ writes 'Pause': `grep -i pause gc.log`. The time is usually in milliseconds or seconds. You can also use `jstat -gcutil <pid> 1s` and look at the YGCT and FGCT columns for cumulative pause times.
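For more than a quick `grep`, the pause durations can be pulled out of the log programmatically. The sketch below assumes Java 9+ unified logging lines that end in a heap transition and a duration, such as `GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 12.345ms`; the class `GcLogPauses` is hypothetical, and Java 8 logs use a different layout that this pattern does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts stop-the-world pause durations from unified GC log lines. */
final class GcLogPauses {

    // Group 1: pause description, group 2: duration in milliseconds.
    private static final Pattern PAUSE = Pattern.compile(
            "(Pause .+?)\\s+\\S+->\\S+\\s+(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Returns the duration in milliseconds of every pause line, in log order. */
    static List<Double> pauseMillis(List<String> lines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(2)));
            }
        }
        return pauses;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Usage: java GcLogPauses.java gc.log
        List<String> lines = java.nio.file.Files.readAllLines(java.nio.file.Paths.get(args[0]));
        List<Double> pauses = pauseMillis(lines);
        double max = pauses.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        System.out.printf("%d pauses, longest %.3f ms%n", pauses.size(), max);
    }
}
```

Concurrent phases are deliberately not matched, since they do not stop application threads.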
What is a 'normal' GC pause time?
For young GC with G1GC, typical pause times are 10-50ms for heaps under 16GB. For full GC, anything above 200ms is a problem. With ZGC, pauses are typically under 1ms on recent JDKs. The acceptable pause time depends on your application latency SLA. If your p99 latency target is 100ms, GC pauses should stay under about 10ms so they do not dominate it.
Can increasing heap size reduce GC pauses?
Not directly. Increasing heap size makes GC less frequent but can make individual pauses longer because there's more memory to scan/compact. For G1GC, a larger heap means more regions, and the pause time can increase if the number of live objects is high. The correct approach is to set a pause time goal with `-XX:MaxGCPauseMillis` and let G1 adjust the young gen size. If pauses are still long, consider a different GC algorithm.
What causes 'Allocation Failure' in G1GC?
An allocation failure happens when a thread cannot allocate an object in its current TLAB (Thread Local Allocation Buffer) and the JVM cannot find a free region in the heap. This triggers a young GC. If the young GC cannot free enough space, a full GC occurs. Common causes: too small heap, too many large objects (humongous), or a memory leak. Check GC logs for 'GC pause (G1 Evacuation Pause)' on Java 8 or 'Pause Young ... (G1 Evacuation Pause)' on Java 9+, and for 'To-space exhausted' messages.
How do I know if my GC tuning is working?
Monitor the following metrics over a production load cycle: (1) GC pause time p99 – should be below your target, (2) Full GC frequency – ideally zero, (3) Heap occupancy after GC – should be stable, not growing, (4) Application latency p99 – should not correlate with GC cycles. Use `jstat -gcutil` and JFR to verify. If after tuning the metrics degrade, your workload may need a different GC algorithm.
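Metric (3), heap occupancy after GC, can be sampled in-process and checked for a trend. The sketch below is one possible approach, not a standard tool: the class `PostGcOccupancy` is hypothetical, it relies on `MemoryPoolMXBean.getCollectionUsage()` (which may be unsupported for some pools), and because pools are collected at different times the sum is only an approximation of live data.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Samples post-GC heap occupancy and estimates whether it is growing. */
final class PostGcOccupancy {

    /** Heap bytes still in use after the most recent collection of each heap pool. */
    static long usedAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    /** Least-squares slope of equally spaced samples, in bytes per sample interval. */
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long s : samples) {
            meanY += s;
        }
        meanY /= n;
        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = i - meanX;
            sxy += dx * (samples[i] - meanY);
            sxx += dx * dx;
        }
        return sxy / sxx;
    }

    public static void main(String[] args) {
        System.out.println("Used after last GC: " + usedAfterLastGc() + " bytes");
    }
}
```

A slope that stays clearly positive across a full load cycle points to a leak or an unbounded cache rather than a tuning problem.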