Java OutOfMemoryError Heap Space Debug Guide

What this usually means

The JVM tried to allocate memory for a new object, and the garbage collector could not free enough space even after a full collection. This is almost always a symptom of one of three things: a memory leak where objects are unintentionally retained, an allocation rate that outruns GC, or a heap that is too small for the application's live data set. Less obvious variants include fragmentation in the old generation that prevents allocation of a large contiguous object, and native memory pressure (direct buffers, JNI allocations, metaspace) that exists alongside the heap problem. Native and metaspace exhaustion normally report their own messages ('Direct buffer memory', 'Metaspace') or get the process killed by the OS, so read the exact error text before assuming the heap is at fault.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

1. Enable GC logging: -Xlog:gc*:file=gc.log:time,uptime,level,tags (Java 9+) or -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log (Java 8)
2. Collect a heap dump at OOM: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dump.hprof
3. Analyze the heap dump with Eclipse MAT or JProfiler: start with the 'Leak Suspects' report and the list of biggest objects
4. Check the GC logs for a pattern: if the heap usage left after each full collection keeps climbing and GC pauses grow longer, it is probably a leak
5. Monitor native memory with NMT (Native Memory Tracking): start the JVM with -XX:NativeMemoryTracking=summary, then run jcmd <pid> VM.native_memory summary
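
When attaching tools is awkward, the same first facts can be read from inside the process. The following is a minimal sketch using the standard java.lang.management MXBeans; the class name HeapSnapshot and its methods are hypothetical helpers, not part of any library.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports heap occupancy and cumulative GC activity.
class HeapSnapshot {

    /** Used heap as a percentage of max heap, or -1 if the JVM reports no defined maximum. */
    static double usedPercent(MemoryUsage heap) {
        long max = heap.getMax(); // -1 when the maximum is undefined
        return max <= 0 ? -1.0 : 100.0 * heap.getUsed() / max;
    }

    /** Builds a one-shot text report: one heap line, then one line per collector. */
    static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("heap used=%dMB committed=%dMB max=%dMB (%.1f%%)%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20,
                usedPercent(heap)));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(String.format("gc %s: count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this report on a timer gives a rough trend line, but it is no substitute for GC logs: the used figure includes garbage that has not been collected yet.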

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- GC log file (e.g., gc.log): look for 'Full GC' and 'OutOfMemoryError' timestamps
- Heap dump file (dump.hprof): analyze with Eclipse MAT or YourKit
- Application logs: the stack trace of the OOM might indicate which allocation failed
- Thread dumps (pstack or jstack): look for threads stuck in allocation or GC
- Monitoring system (Prometheus/Grafana, New Relic, Datadog): heap usage trend over hours/days
- JVM startup scripts: check -Xmx, -Xms, -XX:MaxMetaspaceSize, -XX:MaxDirectMemorySize

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Object references retained longer than needed (e.g., static collections, listeners, caches without eviction)
- Large objects or arrays allocated frequently (e.g., loading entire files into memory, large result sets)
- Heap too small for the application's live data set (e.g., -Xmx too low)
- Metaspace leak (classloader leak): normally reported as 'OutOfMemoryError: Metaspace', but leaked classloaders also keep heap objects alive through static fields and Class instances
- Direct buffer leak (ByteBuffer.allocateDirect): off-heap memory grows while the on-heap ByteBuffer objects stay reachable; usually reported as 'OutOfMemoryError: Direct buffer memory'
- GC tuning issues (e.g., young generation too small, concurrent mode failure in CMS) leading to premature promotion and promotion failures
- Memory fragmentation in the old generation preventing allocation of a contiguous large object

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Increase heap size (-Xmx) if the live data set is larger than the current heap, but only after verifying there is no leak
- Fix the leak: drop references that are no longer needed, bound or evict caches, use WeakHashMap where the keys are owned elsewhere, close streams with try-with-resources
- Reduce allocation rate: batch database queries, stream large files, use object pooling sparingly
- Tune GC: adjust young generation size, switch to G1GC, set -XX:G1HeapRegionSize for large objects
- For metaspace leaks: use -verbose:class (or -Xlog:class+load on Java 9+) to track class loading, then fix the classloader leak (e.g., release every reference to the old classloader on redeploy)
- For direct buffer leaks: cap direct memory with -XX:MaxDirectMemorySize and make sure buffers are released (drop references, or reuse pooled buffers)
- In containers: container support (-XX:+UseContainerSupport) is on by default since Java 10 and 8u191; size the heap with -XX:MaxRAMPercentage or an explicit -Xmx that leaves headroom below the cgroup limit
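
As an illustration of the "stream large files" and try-with-resources patterns above, here is a minimal sketch that scans a file without loading it whole. The class StreamingFileScan and the sample data are hypothetical; Files.lines is the standard JDK API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

class StreamingFileScan {

    /** Counts lines containing the needle; lines are read lazily, not loaded all at once. */
    static long countMatches(Path file, String needle) throws IOException {
        // try-with-resources closes the underlying file handle even if the scan fails
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle)).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("payments", ".log"); // made-up sample file
        try {
            Files.write(tmp, Arrays.asList("ok 1", "DECLINED 2", "ok 3", "DECLINED 4"));
            System.out.println(countMatches(tmp, "DECLINED"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The contrast is with Files.readAllLines or Files.readAllBytes, which materialize the entire file on the heap before any processing starts.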

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run the application under a load test after the fix and monitor heap usage for at least 24 hours; the post-GC baseline should be stable
- Take a heap dump after the fix and compare retained sizes of the suspect classes; they should be minimal
- Verify that GC logs show heap usage returning to the same post-GC baseline after each major collection
- Check via jmap -histo:live that the instance count of the leak suspect (e.g., a class) does not grow over time
- Run with -XX:+PrintClassHistogram (the histogram is printed together with the thread dump on kill -3) and compare histograms taken at different times
- Confirm no OOM after the fix by running a stress test with near-production data volume
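
Comparing two histograms by eye gets tedious, so the comparison can be scripted. The sketch below assumes the usual jmap -histo row layout (rank, instance count, bytes, class name); the class HistoDiff and the sample rows are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HistoDiff {
    // Assumed row shape: "   1:   12345   1234567  java.lang.String (java.base@11)"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Parses histogram text into class name -> instance count; non-row lines are skipped. */
    static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : histo.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                counts.put(m.group(3), Long.parseLong(m.group(1)));
            }
        }
        return counts;
    }

    /** Classes whose instance count grew between two snapshots, mapped to the increase. */
    static Map<String, Long> growth(String before, String after) {
        Map<String, Long> old = parse(before);
        Map<String, Long> grown = new LinkedHashMap<>();
        parse(after).forEach((cls, count) -> {
            long delta = count - old.getOrDefault(cls, 0L);
            if (delta > 0) grown.put(cls, delta);
        });
        return grown;
    }

    public static void main(String[] args) {
        // Made-up snapshots taken some time apart
        String t0 = "   1:  1000  32000  java.lang.String\n   2:  10  400  com.example.Tx\n";
        String t1 = "   1:  1000  32000  java.lang.String\n   2:  90  3600  com.example.Tx\n";
        System.out.println(growth(t0, t1)); // {com.example.Tx=80}
    }
}
```

A class that shows up in every diff across several intervals is a stronger leak signal than one large jump, which may just be a traffic burst.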

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Just increasing heap without investigating the leak: delays the crash but doesn't fix it
- Ignoring metaspace or native memory: they fail with different errors (or an OOMKilled container) that are easy to misread as heap OOM
- Taking a heap dump only after restart: capture during OOM or just before
- Using -XX:+HeapDumpOnOutOfMemoryError without -XX:HeapDumpPath: the dump lands in the working directory, which in a container is often ephemeral and lost on restart
- Assuming G1GC will fix everything: it needs tuning for large heaps and high throughput
- Forgetting to enable GC logging before the problem occurs: you need data from before the crash

( 07 ) War story

Production Payment Service OOM After 48 Hours Uptime

Senior Backend Engineer

Java 11, Spring Boot 2.3, PostgreSQL, Kubernetes (4GB container limit), G1GC

Timeline

00:00 Deployed v2.1 of payment service with new fraud detection feature
12:00 PagerDuty alert: payment service pod restarted (OOMKilled)
12:05 Checked pod logs: 'java.lang.OutOfMemoryError: Java heap space' at timestamp T-48h
12:10 Enabled GC logging and heap dump on next crash
18:00 Another crash; heap dump captured
18:30 Analyzed heap dump in Eclipse MAT: found about 1.8GB retained by a single ArrayList in FraudEvaluator
19:00 Looked at code: FraudEvaluator stored all transaction IDs in a static list for dedup, never cleared
19:30 Hotfixed: changed static list to a bounded LRU cache (Caffeine) with max size 10000
20:00 Redeployed; monitored heap usage, stable at 1.5GB baseline after 48h

We had a stable payment service running for months. After rolling out a new fraud detection feature, pods started dying every 48 hours with OOM. The first time I saw the alert, I just restarted and increased heap from 2GB to 3GB, thinking it was a traffic spike. That only bought us 12 more hours.

I enabled heap dump on OOM and waited. When the next dump came in, I loaded it into Eclipse MAT. The 'Leak Suspects' report pointed to a java.util.ArrayList held by com.example.FraudEvaluator. It had 8 million transaction IDs, consuming 1.8GB. The code had a static list that accumulated all processed transaction IDs as a naive dedup mechanism, never removing entries.

The fix was trivial: replace the static list with a bounded LRU cache using Caffeine, set max size to 10,000. After redeploy, heap usage stabilized. The lesson: never use unbounded collections in production, and always profile before tuning heap.

Root cause

Unbounded static ArrayList in fraud detection module accumulating all transaction IDs over time, never evicted.

The fix

Replaced static ArrayList with Caffeine cache bounded to 10,000 entries; also added TTL of 1 hour.

The lesson

Always profile memory before tuning heap; a leak masked by increasing heap only delays the crash.
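
The shape of the fix can be sketched without the Caffeine dependency. The example below swaps in a plain LinkedHashMap in LRU mode as a stand-in, so it is not the code that shipped: the class BoundedSeenIds is hypothetical, and it omits the 1-hour TTL that the real fix added.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the Caffeine cache from the story: a size-bounded, LRU-evicting dedup window.
class BoundedSeenIds {
    private final Map<String, Boolean> seen;

    BoundedSeenIds(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries; // evict instead of growing forever
            }
        };
    }

    /** Returns true if the id was not in the window yet, false for a duplicate. */
    synchronized boolean firstTimeSeen(String txId) {
        return seen.put(txId, Boolean.TRUE) == null;
    }

    synchronized int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        BoundedSeenIds ids = new BoundedSeenIds(10_000);
        for (int i = 0; i < 1_000_000; i++) {
            ids.firstTimeSeen("tx-" + i);
        }
        System.out.println(ids.size()); // 10000: memory stays at the bound
    }
}
```

The trade-off is explicit: an id older than the window can be accepted again, so the bound has to be chosen against how far apart real duplicates arrive.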

( 08 ) Beyond Heap: Native Memory and Metaspace

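Some errors that look like heap exhaustion are really native memory problems. The heap is committed from the same process memory as metaspace, thread stacks, direct buffers, and the code cache, so when those areas grow there is less room for the heap to expand (when -Xms is below -Xmx) and less headroom under a container's hard cgroup memory limit. More often the symptom is a different message ('Metaspace', 'Direct buffer memory', 'unable to create native thread') or an OOMKilled pod, which is why the exact error text matters.

Use Native Memory Tracking (NMT) to get a breakdown: start the JVM with -XX:NativeMemoryTracking=summary, then run jcmd <pid> VM.native_memory summary. Compare the 'reserved' and 'committed' values for each area, and use jcmd <pid> VM.native_memory baseline followed later by summary.diff to see what is growing. If metaspace or another native category keeps growing, that's your real problem. Direct buffers are capped by -XX:MaxDirectMemorySize, and their current usage is exposed through the 'direct' BufferPoolMXBean.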
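
Direct buffer usage can also be read from inside the JVM. A minimal sketch, assuming a HotSpot-style JVM that exposes a buffer pool named "direct"; the class DirectBufferUsage is a hypothetical helper.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectBufferUsage {

    /** Bytes currently used by the "direct" buffer pool, or -1 if no such pool is exposed. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8MB off-heap
        long after = directBytesUsed();
        System.out.printf("direct pool: before=%d after=%d capacity=%d%n",
                before, after, buf.capacity());
    }
}
```

Exporting this value as a metric next to heap usage makes it easy to see whether off-heap memory is the part that keeps growing.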

( 09 ) GC Logs: Reading the Tea Leaves

GC logs are the first place to look. Enable them with -Xlog:gc*:file=gc.log:time,uptime,level,tags (Java 9+). Key patterns: a healthy application shows a 'sawtooth' where heap usage drops back to the same baseline after each GC. A leak shows the baseline rising over time. Look for 'Full GC' or 'Pause Full': these mean the JVM fell back to a stop-the-world collection of the whole heap, and if they recur frequently while reclaiming little, it's a red flag.

Also check the young collections. 'Allocation Failure' is the normal trigger for a young GC in the Serial, Parallel, and CMS collectors, so it is not an error by itself; the warning sign is young GCs that run back to back and free little space, which means the allocation rate is too high. 'Concurrent Mode Failure' in CMS or an evacuation failure ('to-space exhausted') in G1 indicates that the old generation is too full to accept promoted objects. In G1, look for collections caused by 'Humongous Allocation': large objects (more than half a region) that can fragment the heap.
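
The "rising baseline" test can be made numeric by fitting a line through the heap usage recorded after each major collection. The sketch below is one way to do that; the class BaselineTrend and all sample values are made up for illustration, and extracting the numbers from gc.log is left out.

```java
class BaselineTrend {

    /** Least-squares slope of the samples, in MB per collection. */
    static double slope(long[] postGcUsedMb) {
        int n = postGcUsedMb.length;
        if (n < 2) return 0.0; // not enough points for a trend
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long y : postGcUsedMb) meanY += y;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (postGcUsedMb[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    public static void main(String[] args) {
        long[] healthy = {410, 405, 412, 408, 411, 407}; // sawtooth floor stays flat
        long[] leaking = {410, 450, 495, 540, 580, 625}; // floor climbs every cycle
        System.out.printf("healthy slope=%.1f MB/GC, leaking slope=%.1f MB/GC%n",
                slope(healthy), slope(leaking));
    }
}
```

A slope near zero matches the healthy sawtooth; a steadily positive slope over many collections is the leak signature, and dividing the remaining headroom by it gives a rough time to the next OOM.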

( 10 ) Heap Dump Analysis with Eclipse MAT

The heap dump is your best friend. Use Eclipse MAT (Memory Analyzer Tool). Run the 'Leak Suspects Report' — it automatically identifies the objects most likely causing retention. Then use the 'Histogram' to sort by retained heap size. Look for collections (HashMap, ArrayList) with disproportionately large sizes.

For each suspect, use 'Path to GC Roots' to find who is holding the reference. Common culprits: static fields, ThreadLocal variables, long-lived servlet contexts, or unclosed resources. Also check for duplicate strings (String.intern() leaks) and classloaders (especially in Java EE apps).

( 11 ) The Humongous Allocation Problem in G1

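G1GC divides the heap into equal-sized regions of 1MB to 32MB; the size is chosen at startup from the heap size unless -XX:G1HeapRegionSize is set. Objects larger than half a region are 'humongous' and are allocated directly in contiguous regions of the old generation. If you allocate many humongous objects, they can fragment the heap and cause OOM even when total free space is sufficient. Symptom: GC logs show collections with the cause 'G1 Humongous Allocation'.

Fix: increase the region size with -XX:G1HeapRegionSize, which takes a single power-of-two value from 1m to 32m (e.g., -XX:G1HeapRegionSize=16m), so that fewer objects cross the humongous threshold. Or redesign the application to avoid huge objects (e.g., split large byte arrays into chunks). You can also experiment with -XX:+UnlockExperimentalVMOptions -XX:G1MixedGCLiveThresholdPercent to change how many old regions qualify for mixed GCs; note that 85 is already the default in current JDKs, so setting 85 changes nothing.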
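
To make the threshold concrete, the sketch below checks a payload against half a region and shows the chunking alternative. It uses the "larger than half a region" rule as stated above and ignores array header overhead; the class HumongousCheck and the 4MB region size are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class HumongousCheck {

    /** True if an object of this size is larger than half a G1 region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Splits a payload into chunk sizes that each stay at or under maxChunkBytes. */
    static List<Integer> chunkSizes(long totalBytes, int maxChunkBytes) {
        List<Integer> sizes = new ArrayList<>();
        for (long left = totalBytes; left > 0; left -= maxChunkBytes) {
            sizes.add((int) Math.min(left, maxChunkBytes));
        }
        return sizes;
    }

    public static void main(String[] args) {
        long region = 4L << 20;   // assumed -XX:G1HeapRegionSize=4m
        long payload = 10L << 20; // a 10MB byte[]
        System.out.println(isHumongous(payload, region));        // true
        System.out.println(chunkSizes(payload, 1 << 20).size()); // 10
        System.out.println(isHumongous(1 << 20, region));        // false: 1MB chunks are regular objects
    }
}
```

Chunks below the threshold are allocated in the young generation like any other object, so they are reclaimed by ordinary young collections instead of occupying contiguous old regions.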

( 12 ) Container-Specific Pitfalls

In Docker/Kubernetes, older JVMs (before Java 10 and 8u191) do not see the container memory limit; newer ones do, because -XX:+UseContainerSupport is on by default. If the heap plus native memory exceeds the container limit, the kernel OOM-kills the container (OOMKilled). But if you set -Xmx too small, you get Java heap OOM. A workable approach: set -XX:MaxRAMPercentage (e.g., 70.0) so the JVM computes the heap as a percentage of container memory and leaves the rest for metaspace, thread stacks, and direct buffers.

Another trap: without explicit settings, the default maximum heap is 1/4 of the memory the JVM sees, which is the container limit on container-aware JVMs and the host memory on older ones. Neither is usually what you want, so always set the heap explicitly or use the percentage flags. Also, monitor RSS vs heap: if RSS is much larger than heap, suspect native leaks (thread stacks, direct buffers). Use ps -o rss,vsz -p <pid> and compare to jstat -gc <pid> output.
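
A quick way to confirm what the JVM actually resolved is to print it at startup and compare it with what the flags were meant to produce. A minimal sketch; the class ContainerHeapCheck is hypothetical, and the 4GB limit and 70% figure are taken from the examples in this guide.

```java
class ContainerHeapCheck {

    /** Heap size implied by a MaxRAMPercentage-style setting. */
    static long heapBytesFor(long containerLimitBytes, double percent) {
        return (long) (containerLimitBytes * percent / 100.0);
    }

    public static void main(String[] args) {
        long limit = 4L << 30; // assumed 4GB container limit
        System.out.printf("70%% of limit = %d MB%n", heapBytesFor(limit, 70.0) >> 20);
        // What this JVM actually resolved; compare with the value you expected.
        System.out.printf("Runtime.maxMemory() = %d MB, processors = %d%n",
                Runtime.getRuntime().maxMemory() >> 20,
                Runtime.getRuntime().availableProcessors());
    }
}
```

If Runtime.maxMemory() is far from the expected figure, the JVM is either not seeing the cgroup limit or an explicit -Xmx elsewhere in the startup script is overriding the percentage.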

Frequently asked questions

Why does 'java.lang.OutOfMemoryError: Java heap space' sometimes appear even when I have plenty of free heap?

This can happen due to memory fragmentation. The JVM needs a contiguous block of memory to allocate an object. If the heap is fragmented (many small free blocks), a large object may not find a contiguous block even though total free space is sufficient. This is common with CMS or G1 with many humongous objects. Solutions: increase heap, tune GC to compact more aggressively, or reduce object size.

Should I always increase -Xmx when I see heap OOM?

No. Increasing heap without diagnosing the root cause just delays the crash and can make GC pauses worse. Always investigate for leaks first: collect heap dump, analyze retained objects, check GC logs. If the live data set truly exceeds the current heap, then increase -Xmx, but also consider if you can reduce memory usage (e.g., stream data, use smaller caches).

What is the difference between 'Java heap space' and 'Metaspace' OOM?

'Java heap space' means the heap is full and objects cannot be allocated. 'Metaspace' ('PermGen space' in older Java) refers to a separate area for class metadata; Metaspace lives in native memory, and its exhaustion is reported as 'OutOfMemoryError: Metaspace'. The two are distinct failures with distinct fixes, although a classloader leak can inflate both, because leaked classes keep their static fields and Class objects alive on the heap. Always check the exact error message and monitor both areas.

How do I find the exact line of code causing the leak?

The heap dump shows object types and references, but not source lines directly. Use the 'Path to GC Roots' feature in MAT to see which object holds the reference. Then correlate that object type to your code. For more precision, use allocation profiling, which links allocations to stack traces: Java Flight Recorder, async-profiler in allocation mode, JProfiler, or YourKit. -XX:+PreserveFramePointer is only needed when a native profiler such as perf has to walk Java stacks.

Why does my application OOM only after 48 hours of uptime?

That's a classic leak pattern. The leak rate is slow enough that it takes time to exhaust heap. It's often due to a collection (e.g., a cache, a listener registry, a session store) that grows over time without eviction. Use jmap -histo:live to compare histograms at intervals — you'll see a class count increasing. A heap dump at the time of OOM will show the accumulation.
