LEARN · DEBUGGING GUIDE

Python Memory Leak Profiling: Diagnosing Unbounded Growth in Production

Python memory leaks are subtle—unbounded growth from forgotten references, caches, or circular references. This guide gives you the exact tools and patterns to find and fix them.

Advanced · Memory · 7 min read

What this usually means

A Python memory leak is almost always caused by objects that are still referenced but no longer needed, which prevents garbage collection. Common culprits: unclosed file handles or network connections, global caches without eviction, reference cycles that are collected late or not at all (cycles involving `__del__` were uncollectable before Python 3.4), or objects captured in closures that outlive their intended scope. In production, the most insidious leaks come from libraries that cache data internally (e.g., SQLAlchemy, Celery, or even standard library modules like `logging`) or from thread-local storage that accumulates per-request data.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run `tracemalloc.start(25)` at startup, then call `tracemalloc.take_snapshot()` at two points in time and diff the results: `diff = snapshot2.compare_to(snapshot1, 'traceback')`
  2. Track `len(gc.get_objects())` over time: from a small script or background thread, print the count and call `time.sleep(60)` inside a `while True:` loop
  3. Use `objgraph.show_growth(limit=10)` to see which object types are increasing
  4. Summarize live objects from inside the process with `pympler`: `from pympler import muppy, summary; all_objects = muppy.get_objects(); summary.print_(summary.summarize(all_objects))`
  5. Record allocations with a memory profiler like `memray` and render a flamegraph: `memray run --trace-python-allocators -o output.bin my_script.py`, then `memray flamegraph output.bin`
  6. If using Docker, check `/sys/fs/cgroup/memory/memory.usage_in_bytes` inside the container (cgroup v1) or `/sys/fs/cgroup/memory.current` (cgroup v2)
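
As a rough illustration of steps 2 and 3, the sketch below counts GC-tracked objects by type at two points and reports which types grew. It uses only the standard library as a stand-in for `objgraph.show_growth`; `type_counts`, `growth`, and the `Leaky` class are hypothetical names introduced for this example.

```python
import gc
from collections import Counter


def type_counts():
    """Count live GC-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type name, increase) pairs for the types that grew the most."""
    rising = [
        (name, after[name] - before.get(name, 0))
        for name in after
        if after[name] > before.get(name, 0)
    ]
    return sorted(rising, key=lambda item: item[1], reverse=True)[:limit]


class Leaky:
    """Hypothetical stand-in for a type that is leaking."""


_leak = []  # hypothetical global that keeps the instances alive

gc.collect()
baseline = type_counts()
_leak.extend(Leaky() for _ in range(500))
report = growth(baseline, type_counts())

for name, increase in report:
    print(f"{name:30} +{increase}")
```

In a real service the two `type_counts()` calls would be minutes apart, for example from a debug endpoint, rather than back to back.
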
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Application code: look for global lists/dicts that grow unbounded (e.g., request history, user session stores)
  • Third-party library caches: SQLAlchemy's compiled-statement cache (sized by `query_cache_size` on `create_engine`), Celery's result backend, and cookies accumulated on a long-lived `requests.Session`
  • Logging handlers: `logging.handlers.MemoryHandler` or custom handlers that accumulate log records
  • Thread-local storage: `threading.local()` objects that hold per-thread state without cleanup
  • Database connection pools: `SQLAlchemy` pool size vs. actual connections, especially with `pool_pre_ping=True`
  • Reference cycles: cycles with `__del__` methods were uncollectable before Python 3.4; on newer versions, check `gc.garbage` and whether `gc` has been disabled
  • Signal handlers: `signal.signal()` registering callbacks that reference large objects
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Unbounded per-request caching: storing request/response objects in a global dict keyed by request ID without eviction
  • Celery task result backend accumulation: results kept indefinitely because `result_expires` is set too high or disabled (`None`)
  • SQLAlchemy sessions not released: `db.session.remove()` (or `Session.close()`) is not called at the end of each request or background job
  • Memory leak in a C extension: e.g., native buffers behind `numpy` or `pandas` objects not released
  • asyncio tasks piling up: tasks that never finish accumulate in `asyncio.all_tasks()` (the older `asyncio.Task.all_tasks()` was removed in Python 3.9)
  • Static class variables holding large objects: e.g., class-level reference to a loaded ML model that gets replaced but old one not freed
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Replace unbounded caches with `functools.lru_cache(maxsize=N)` or `cachetools.TTLCache`
  • Add explicit cleanup in request/response lifecycle: `@app.teardown_request` in Flask, `request_finished` signal in Django
  • Use weak references with `weakref.WeakValueDictionary` for caches where appropriate
  • Limit library-level caches and pools: e.g., `create_engine(url, pool_size=10, max_overflow=0)` caps SQLAlchemy connections
  • Call `gc.collect()` periodically if you must, but it is better to break the reference cycle
  • For C extensions, check that every owned reference is released with `Py_DECREF`, and inspect `gc.garbage` for uncollectable cycles
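
The first and third patterns can be sketched with the standard library alone. This is a minimal illustration, not a drop-in fix: `load_profile` and `Model` are hypothetical names, and the immediate removal of the weak entry relies on CPython's reference counting.

```python
import functools
import weakref


@functools.lru_cache(maxsize=128)
def load_profile(user_id):
    """Hypothetical expensive lookup; at most 128 results stay cached."""
    return {"id": user_id}


for uid in range(1000):
    load_profile(uid)

info = load_profile.cache_info()
print(info.currsize)  # bounded by maxsize, no matter how many keys were seen


class Model:
    """Hypothetical heavy object, e.g. a loaded ML model."""

    def __init__(self, name):
        self.name = name


# The cache holds only weak references, so it never keeps a model alive.
models = weakref.WeakValueDictionary()
current = Model("v1")
models["current"] = current

del current  # last strong reference gone; CPython drops the entry right away
print("current" in models)
```

A `WeakValueDictionary` only helps when something else legitimately owns the object; if the cache is the sole owner, the entry disappears immediately and a bounded cache is the better fit.
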
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Run the app under load test (e.g., with `locust` or `wrk`) for 30 minutes and plot RSS vs. time; the slope should be flat
  • Use `tracemalloc` snapshot diff before and after a fix: the difference should show zero or bounded growth
  • Check the `len(gc.get_objects())` count before/after handling 10k requests; it should return to baseline
  • Monitor `/proc/<pid>/smaps` for heap growth; PSS that stays stable as requests are served indicates the fix worked
  • Run `valgrind --tool=memcheck` on Python if using C extensions (though slow); setting `PYTHONMALLOC=malloc` makes Python's allocations visible to it
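
To put a number on "the slope should be flat", one option is a least-squares fit over sampled RSS values. The helper below is a sketch; `slope_mib_per_hour` is a hypothetical name, and the samples are made-up `(seconds, rss_bytes)` pairs rather than measurements.

```python
def slope_mib_per_hour(samples):
    """Least-squares slope of (seconds, rss_bytes) samples, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return (num / den) * 3600 / 2**20


MIB = 2**20
# Illustrative data only: one sample every 30 minutes.
leaking = [(0, 500 * MIB), (1800, 600 * MIB), (3600, 700 * MIB)]
stable = [(0, 800 * MIB), (1800, 801 * MIB), (3600, 800 * MIB)]

print(round(slope_mib_per_hour(leaking), 1))
print(round(slope_mib_per_hour(stable), 1))
```

What counts as "flat" is a judgment call; a threshold of a few MiB per hour under steady load is one possible starting point.
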
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Don't leave `gc.set_debug(gc.DEBUG_LEAK)` enabled in production: it logs every collectable and uncollectable object and, because it includes `gc.DEBUG_SAVEALL`, keeps all unreachable objects alive in `gc.garbage`
  • Don't blindly call `gc.collect()` in a tight loop; it's a band-aid and hurts throughput
  • Don't assume `tracemalloc` is overhead-free; it traces the whole process, so in production enable it on a small subset of workers (e.g., 1%) rather than everywhere
  • Don't ignore third-party library versions; some leaks are fixed in later releases
  • Don't rely solely on `psutil` RSS; it includes shared memory and can be misleading
( 07 ) War story

The Celery Result Backend Leak That Took Down Production Every 48 Hours

Senior Backend Engineer · Python 3.9, Celery 5.1, Redis, Flask, PostgreSQL, Kubernetes (GKE)

Timeline

  1. 00:00 PagerDuty alert: 'Memory usage > 85%' on two of three Celery worker pods in production
  2. 00:05 Logged into Grafana; RSS of both pods climbing at ~200 MB/hour, no plateau
  3. 00:10 SSH into one pod, ran `top`; Python process using 2.3 GB RSS on a 4 GB limit
  4. 00:15 Executed `import gc; print(len(gc.get_objects()))`: 1.2 million objects, growing
  5. 00:20 Used `objgraph.show_growth(limit=10)`; saw 50k new `celery.backends.redis.RedisBackend` objects
  6. 00:30 Searched codebase: found an `app.Task` subclass with a `backend` attribute set at class level, holding a Redis connection per task
  7. 00:35 Applied hotfix: set `backend = None` in task class and used `app.conf.result_backend` instead; redeployed
  8. 01:00 Memory stabilized at 800 MB; new objects stopped accumulating
  9. 01:30 Root cause confirmed: each task instance held a reference to a Redis client, and tasks were cached in Celery's internal registry
  10. 02:00 Opened PR to refactor task classes and added memory regression test

It was 2 AM on a Tuesday. I was on call, and PagerDuty blew up with memory alerts on our Celery workers. The pods had been running for 48 hours, and the memory graph showed a steady climb from 500 MB to 2.3 GB. I knew this was a classic leak—no plateau, just linear growth. The OOM killer would have killed them within the hour if I didn't act.

I SSHed into a pod and ran `len(gc.get_objects())`—1.2 million objects, and I could see the count ticking up every few seconds. I then used `objgraph.show_growth()` and saw thousands of new `celery.backends.redis.RedisBackend` objects. That was the smoking gun. I remembered that we had recently refactored some tasks to inherit from a custom base class that stored the backend as a class attribute.

I quickly scanned the code: `class MyTask(app.Task): backend = RedisBackend(...)`. Every time a task was instantiated, it created a new Redis backend, and since Celery caches task classes, those objects were never freed. I hotfixed by removing the class-level attribute and relying on Celery's global config. Memory dropped immediately. The real fix was to never hold long-lived connections in task class attributes.

Root cause

Celery task class attribute storing a RedisBackend instance, causing unbounded accumulation of backend objects in the task registry.

The fix

Removed the backend attribute from the task class and configured `result_backend` globally in Celery config. Also added a `weakref` callback to clean up any remaining references.

The lesson

Class-level attributes in Celery tasks are global and persistent—never put per-instance resources like database connections there. Use `__init__` or dependency injection instead.

( 08 ) Using tracemalloc for Snapshot Diffing

`tracemalloc` is the most precise tool for Python memory profiling because it records a traceback for every allocation made through Python's memory allocators. Start by calling `tracemalloc.start(25)` early in your code (e.g., before importing libraries). The argument is the number of frames to capture (25 is usually enough).

Take a baseline snapshot after application startup but before handling requests: `snap1 = tracemalloc.take_snapshot()`. After many requests (e.g., 10k), take a second snapshot and compute the diff: `diff = snap2.compare_to(snap1, 'traceback')`. Sort by size: `diff.sort(key=lambda x: x.size_diff, reverse=True)`. The top entries show exactly where the memory is growing—file, line, and allocation count. This works even for C extension allocations if they use Python's allocator.
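
The workflow above can be sketched end to end as follows. The leak is simulated: `retained` and `handle_request` are hypothetical, and 1,000 iterations stand in for real traffic. The example groups by `'lineno'` for compact output; `'traceback'` works the same way and keeps the full call stack.

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

retained = []  # hypothetical leak: a module-level list that is never cleared


def handle_request(i):
    """Hypothetical handler that keeps 1 KiB alive per request."""
    retained.append(bytearray(1024))


snap1 = tracemalloc.take_snapshot()  # baseline, before any "requests"
for i in range(1000):
    handle_request(i)
snap2 = tracemalloc.take_snapshot()

# compare_to returns StatisticDiff entries, largest absolute change first.
diff = snap2.compare_to(snap1, "lineno")
top = diff[0]
for stat in diff[:5]:
    print(stat)

tracemalloc.stop()
```

The first entry should point at the `retained.append(...)` line, with a size difference of roughly 1 MiB.
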

( 09 ) objgraph: Visualizing Object References

`objgraph` is great for finding what holds references to leaked objects. Once you've identified a type that's accumulating (e.g., `RedisBackend`), call `objgraph.show_backrefs([obj], max_depth=5, filename='backrefs.png')` to generate a graph of reference chains. This quickly reveals unexpected references like closures, global variables, or thread-local storage.

I've used this to find that a Flask `g` object was holding onto large database result sets because a middleware was setting `g.db_results = ...` without clearing it. The graph showed a chain from the request object to the results to a list that grew per request.
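
When `objgraph` is not installed on the box, the first hop of a reference chain can be approximated with `gc.get_referrers`. This is a stdlib-only sketch, not what `objgraph` does internally; `Result`, `request_history`, and `find_instances` are hypothetical names.

```python
import gc


class Result:
    """Hypothetical type that was seen accumulating."""


request_history = []  # hypothetical global that keeps results alive
request_history.append(Result())


def find_instances(type_name):
    """Return live GC-tracked objects whose type has the given name."""
    return [obj for obj in gc.get_objects() if type(obj).__name__ == type_name]


leaked = find_instances("Result")
target = leaked[0]

# Everything that directly refers to the target, minus our own helper list.
holders = [ref for ref in gc.get_referrers(target) if ref is not leaked]
for holder in holders:
    print(type(holder).__name__)
```

The output also lists incidental holders such as the module's globals dict (because of the `target` variable); `objgraph.show_backrefs` filters and draws these chains for you.
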

( 10 ) Production Profiling with memray

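`memray` is a modern profiler that tracks allocations made by Python code, native extensions, and the interpreter itself. Its overhead is not zero, so in production run it on a single instance for a limited time. Record a capture file with `memray run --trace-python-allocators -o output.bin my_app.py`, then render it with `memray flamegraph output.bin`. To watch allocations while the process is running, use `memray run --live my_app.py` instead; live mode shows a terminal UI rather than writing a capture file. The flamegraph shows which call stacks account for the most allocated memory, and `memray flamegraph --leaks output.bin` restricts it to allocations that were never freed.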

In one case, I saw a massive allocation spike in `json.loads`—turns out a logging library was serializing the entire request body to a string for every log line. The fix was to truncate the log message. `memray` also supports `--follow-fork` for multiprocessing apps.

( 11 ) Detecting Leaks in Docker Containers

Inside a container, `psutil.virtual_memory()` reports the host's memory rather than the cgroup limit. Instead, read the cgroup accounting file directly: `/sys/fs/cgroup/memory/memory.usage_in_bytes` on cgroup v1, or `/sys/fs/cgroup/memory.current` on cgroup v2. A one-liner such as `cat /sys/fs/cgroup/memory/memory.usage_in_bytes` prints the current usage in bytes. For a rolling check, wrap it in a loop: `while true; do cat /sys/fs/cgroup/memory/memory.usage_in_bytes; sleep 10; done`.

You can also use `docker stats` from the host, but it reports container-level totals at roughly one-second granularity. For precise per-process tracking, use `grep VmRSS /proc/<pid>/status` inside the container. The OOM killer logs can be found with `dmesg | grep -i oom` on the node.
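
The same reading can be taken from inside the application. The sketch below tries the cgroup v2 file first and falls back to the v1 path named above; `container_memory_bytes` is a hypothetical helper, and it returns `None` when neither file is readable (for example outside Linux).

```python
from pathlib import Path

# Candidate accounting files, newest layout first.
CGROUP_FILES = (
    Path("/sys/fs/cgroup/memory.current"),                # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.usage_in_bytes"),  # cgroup v1
)


def container_memory_bytes(candidates=CGROUP_FILES):
    """Return current cgroup memory usage in bytes, or None if unavailable."""
    for path in candidates:
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # missing file, no permission, or unexpected content
    return None


usage = container_memory_bytes()
if usage is None:
    print("no cgroup memory file found")
else:
    print(f"{usage / 2**20:.1f} MiB in use")
```

Logging this value next to per-process RSS makes it easier to tell a Python-level leak from page cache or sibling processes counted against the same container.
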

( 12 ) Pattern: Per-Request Caching in Flask/Django

A common leak is caching per-request data in a global dictionary. For example, storing request objects in a list for debugging: `request_history.append(request)`. Over time, this list holds references to all requests, including their environment, form data, and database connections. The fix is to use a bounded deque or store only a summary.
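
A minimal sketch of the bounded-deque fix, storing a small summary tuple instead of the request object. The `record` helper and the limit of 100 entries are assumptions made for illustration.

```python
from collections import deque

# Keeps only the 100 most recent entries; older ones are dropped automatically.
request_history = deque(maxlen=100)


def record(method, path, status):
    """Store a small summary instead of the full request object."""
    request_history.append((method, path, status))


for i in range(1000):
    record("GET", f"/items/{i}", 200)

print(len(request_history))
print(request_history[0])
```

Because each entry is a tuple of plain values, nothing in the history keeps a request's environment, form data, or database connection alive.
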

Similarly, in Django, `threading.local()` used in middleware can accumulate data if not cleaned up. Always clear thread-local storage at the end of a request: `del local.user_data`. Use a `try`/`finally` block or a context manager (e.g., one built with `contextlib.contextmanager`) to ensure cleanup even on exceptions.
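
One way to guarantee that cleanup is a small context manager around the per-request work. This is a framework-agnostic sketch: `request_scope` and `handle` are hypothetical names, and in Django the `with` block would sit inside the middleware's call method.

```python
import threading
from contextlib import contextmanager

local = threading.local()


@contextmanager
def request_scope(user_data):
    """Attach per-request data to the thread and always remove it afterwards."""
    local.user_data = user_data
    try:
        yield local
    finally:
        # Runs on normal exit and when the handler raises.
        del local.user_data


def handle(user_data, fail=False):
    """Hypothetical request handler that reads the thread-local data."""
    with request_scope(user_data):
        if fail:
            raise RuntimeError("handler failed")
        return len(local.user_data)


print(handle([1, 2, 3]))
print(hasattr(local, "user_data"))
```

Worker threads are reused across requests, so anything left on `local` lives until that thread handles another request that overwrites it, or forever if none does.
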

Frequently asked questions

How do I differentiate a memory leak from a memory bloat (temporary spike)?

A leak shows monotonic growth over time without recovery. Bloat spikes and then levels off or drops back toward baseline. Plot RSS vs. time over hours under steady load: if it keeps climbing and never plateaus, it's a leak. Use `tracemalloc` to check whether traced memory and allocation counts keep increasing between snapshots; if they do, you have a leak.

Is it safe to use tracemalloc in production?

Yes, with caution. `tracemalloc.start(N)` adds overhead proportional to N (number of frames). Start with N=10 and only on a subset of servers. Monitor CPU usage; if it jumps >10%, reduce N or use sampling. Turn it off after diagnosis.

What's the difference between `gc.get_objects()` and `tracemalloc`?

`gc.get_objects()` returns the objects tracked by the cyclic garbage collector, which means container objects such as lists, dicts, and class instances. It does not include atomic objects like `str`, `int`, or `bytes`, and it tells you nothing about memory that C extensions allocate directly. `tracemalloc` traces allocations made through Python's allocators, including those from C extensions if they use `PyMem_Malloc`. Use both: `gc.get_objects()` for Python object count trends, `tracemalloc` for byte-level growth.
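
A quick way to see what `gc.get_objects()` covers is `gc.is_tracked`. In the sketch below (variable names are illustrative), a large string is invisible to the collector's object list while the list that holds it is not.

```python
import gc

payload = "x" * 1_000_000  # ~1 MB string: atomic, so the GC does not track it
holder = [payload]         # a list is a container, so the GC tracks it

tracked = {"str": gc.is_tracked(payload), "list": gc.is_tracked(holder)}
payload_seen = any(obj is payload for obj in gc.get_objects())
holder_seen = any(obj is holder for obj in gc.get_objects())

print(tracked)
print(payload_seen, holder_seen)
```

This is why a flat object count does not rule out a leak: a handful of containers can hold a growing amount of string or bytes data.
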

My memory grows but `gc.get_objects()` stays flat. What's happening?

That indicates a leak in C extensions or memory allocated outside Python's GC (e.g., via `malloc`). Use `memray` or `valgrind` to catch those. Common culprits: `numpy` arrays, `pandas` DataFrames, or OpenCV images that are not freed. Ensure you call `del` and `gc.collect()` if needed, but the real fix is to release references.

How do I set up a memory leak regression test?

Write a test that runs the suspect code in a loop for many iterations (e.g., 1000), and compare memory before and after using `tracemalloc`. Assert that the difference is below a threshold (e.g., <1 MB). Use `pytest` with a fixture that starts tracemalloc, runs the test, takes a snapshot, and checks growth. Note that `compare_to` requires a grouping key and that the total matters more than the single largest entry. Example: `assert sum(s.size_diff for s in snap2.compare_to(snap1, 'lineno')) < 1024*1024`.
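
A sketch of such a test, written as plain functions that `pytest` would collect. `measure_growth`, `clean_handler`, and `leaky_handler` are hypothetical, and the 1 MiB threshold is the example value from above rather than a recommendation.

```python
import tracemalloc


def measure_growth(func, iterations=1000, warmup=10):
    """Return net traced bytes still allocated after calling func repeatedly."""
    for _ in range(warmup):  # let one-time caches and lazy imports settle
        func()
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(iterations):
            func()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))


_leak = []  # hypothetical global that the leaky handler appends to


def leaky_handler():
    _leak.append(bytearray(4096))  # 4 KiB retained per call


def clean_handler():
    bytearray(4096)  # allocated and released on every call


def test_clean_handler_does_not_grow():
    assert measure_growth(clean_handler) < 1024 * 1024


def test_leaky_handler_is_detected():
    assert measure_growth(leaky_handler) > 1024 * 1024
```

The warm-up calls matter in practice: first-use caches and lazy imports would otherwise show up as growth and make the test flaky.
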