Debugging Stale Cache Data in Distributed Systems

Every developer who has operated a web service at scale has a "why is this showing old data?" story. The classic quip that there are only two hard problems in computer science — cache invalidation and naming things — is funny because it's true. But the reality is worse: even when your invalidation logic is correct, stale data can sneak in through clock drift, partial evictions, or optimistic TTLs that don't match write patterns.

This post covers the non-obvious ways caches lie, how to debug them without going insane, and concrete strategies to minimize the window for stale data.

The three faces of stale data

Stale data falls into three rough categories. The first is TTL mismatch: the cache TTL is set to 60 seconds, but your writes come every 45 seconds, so every cached entry outlives at least one write and serves stale data for somewhere between 15 and 60 seconds before it expires. The second is partial invalidation: you invalidate a key but miss a secondary key that references the same data. The third is propagation delay, common in multi-layer caches (Redis + CDN + browser), where each layer expires or purges on its own schedule.

Most debugging focuses on the first kind, but the second and third cause the most production pain.
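To make the TTL case concrete, here is a minimal sketch of the arithmetic. It assumes writes arrive at a fixed interval and that nothing invalidates the entry on write; `stalenessWindow` is a name made up for this example.

```javascript
// Staleness bounds for a TTL-only cache entry, assuming writes arrive at a
// fixed interval and nothing invalidates the entry on write (both assumptions).
function stalenessWindow(ttlSeconds, writeIntervalSeconds) {
  // Best case: the entry is cached right after a write, so the next write
  // lands writeIntervalSeconds later and the entry is stale for the remainder.
  const min = Math.max(0, ttlSeconds - writeIntervalSeconds);
  // Worst case: the entry is cached right before a write and is stale for
  // almost its whole lifetime.
  const max = ttlSeconds;
  return { min, max };
}

console.log(stalenessWindow(60, 45)); // { min: 15, max: 60 }
```

If the TTL is shorter than the write interval, the lower bound drops to zero, which is one way to read the advice to size TTLs from write frequency.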

Tip: When investigating stale data, always check the cache key structure first. A common mistake is invalidating key `user:123` but reading from `user:123:profile`. Consistent key naming conventions save hours.
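One way to enforce a naming convention is to build every key in a single place. The sketch below is only an illustration: `cacheKeys` and `keysForUser` are hypothetical helpers, and the `cart` key is an invented second example.

```javascript
// Hypothetical helper: every read, write, and invalidation builds its key
// here, so the format cannot drift between code paths.
const cacheKeys = {
  userProfile: (userId) => `user:${userId}:profile`,
  userCart: (userId) => `user:${userId}:cart`,
};

// All keys that hold data derived from one user, for invalidation.
function keysForUser(userId) {
  return Object.values(cacheKeys).map((build) => build(userId));
}

console.log(keysForUser(123)); // [ 'user:123:profile', 'user:123:cart' ]
```

Invalidating with `keysForUser` instead of a hand-typed string means a newly added key is picked up by invalidation automatically.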

Clock drift: the silent invalidator

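If your cache nodes and application servers disagree about the time, any expiry that depends on comparing the two clocks becomes unreliable. A relative TTL set with `EX` is counted on the Redis node's own clock, so skew bites hardest when the application computes absolute expiry times (`EXPIREAT`) or compares timestamps stored in cached values. I once debugged a case where a Redis cluster returned stale data for exactly 30 seconds every hour. It turned out the application servers had NTP configured, but the Redis nodes did not. The drift between a Redis node and the app server was about 2 seconds per hour, so after 15 hours it had accumulated to 30 seconds, and our 30-minute TTLs were effectively inflated by that amount.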

The fix: run `ntpstat` on every cache node and export its sync status to your monitoring dashboard. Also, avoid relying on TTLs as the sole invalidation mechanism for critical data; use explicit key deletion or version-based keys.

Checking clock drift between application and Redis

# Check clock skew between app server and Redis (%N requires GNU date)
$ date +%s.%N && redis-cli -h cache-01.example.com TIME
# Redis TIME returns two values: Unix seconds and microseconds within that second.
# Compare with the app server epoch, allowing for network round-trip time.
# If the difference is > 500ms, investigate NTP configuration.
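The same comparison can be done in code. This is a rough sketch, assuming your client hands back the `TIME` reply as a two-element array of strings and that the reply was generated about halfway through the round trip; `redisTimeToMs` and `estimateSkewMs` are names invented here, and the numbers are made up.

```javascript
// Convert a Redis TIME reply ([seconds, microseconds], both strings) to ms.
function redisTimeToMs([seconds, micros]) {
  return Number(seconds) * 1000 + Number(micros) / 1000;
}

// Estimate skew, assuming the reply was generated halfway through the
// round trip. A positive result means the Redis clock is ahead.
function estimateSkewMs(sentAtMs, receivedAtMs, timeReply) {
  const localMidpoint = (sentAtMs + receivedAtMs) / 2;
  return redisTimeToMs(timeReply) - localMidpoint;
}

// Made-up numbers: 4 ms round trip, Redis clock 750 ms ahead.
const skew = estimateSkewMs(1700000000000, 1700000000004, ["1700000000", "752000"]);
console.log(skew); // 750
```

In a real check, `sentAtMs` and `receivedAtMs` would come from `Date.now()` taken immediately before and after the `TIME` call.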

A production incident: the phantom cart

The Phantom Cart — stale data from write-behind cache

14:22 User adds item to cart. Write-behind cache writes to Redis, queues database write.
14:23 User refreshes page. Reads cart from Redis and sees the new item. Good.
14:24 Database write fails due to a constraint (duplicate key). Queue retries and eventually drops the write.
14:25 User removes item from cart. Cache invalidates the cart key.
14:26 User refreshes page. Cache miss triggers database read, and the item is gone. Good.
14:30 User adds same item again. Write-behind writes to Redis, where the old item reappears, and the database write succeeds this time.
14:31 User sees two copies of the same item. The invalidation at 14:25 was supposed to clear the entry from the first write, but a stale copy survived in Redis, and the write at 14:30 did not overwrite it: the two entries coexisted.

Lesson

Write-behind caches must use unique identifiers or version fields to prevent stale entries from reappearing. The fix: include a write timestamp or version number in the cache value (plus a UUID if you need to tell individual writes apart), and on every read or write keep only the entry with the latest one.
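Here is a minimal sketch of that idea. It is not the code from the incident: it uses a plain `Map` as a stand-in for Redis and a monotonically increasing version number instead of a timestamp, which sidesteps clock skew. `putIfNewer` is a hypothetical helper.

```javascript
// In-memory stand-in for Redis so the sketch runs on its own.
const cache = new Map();

// Store the entry only if it is newer than what the cache already holds.
// With real Redis this check-and-set would need to be atomic (for example
// inside a Lua script); that part is not shown here.
function putIfNewer(key, value, version) {
  const current = cache.get(key);
  if (current && current.version >= version) return false; // stale write, ignore
  cache.set(key, { value, version });
  return true;
}

putIfNewer("cart:42", ["book"], 2);        // accepted
putIfNewer("cart:42", ["book", "pen"], 1); // rejected: older than version 2
console.log(cache.get("cart:42").value);   // [ 'book' ]
```

A late retry from the write-behind queue then loses against the newer entry instead of resurrecting old state.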

The cache wasn't stale — it was serving two truths at once. The write-behind pattern created a phantom that survived invalidation.

Debugging tools and techniques

When you suspect stale data, start with the simplest diagnostic: add cache key and TTL to your application logs at read time. Use structured logging (JSON) so you can grep for specific keys. Then correlate with write logs.
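A sketch of what such a log line could look like. The field names are illustrative, not a standard, and `cacheReadLogLine` is a made-up helper; the TTL would come from whatever your cache client reports at read time.

```javascript
// Build one JSON log line per cache read. Field names are illustrative.
function cacheReadLogLine({ key, hit, ttlSeconds, now = new Date() }) {
  return JSON.stringify({
    ts: now.toISOString(),
    event: "cache_read",
    key,
    hit,
    // Remaining TTL as reported by the cache. Redis TTL returns -2 for a
    // missing key and -1 for a key without expiry.
    ttl_seconds: ttlSeconds,
  });
}

console.log(cacheReadLogLine({ key: "user:123:profile", hit: true, ttlSeconds: 412 }));
```

Logging the same `key` field on the write path makes it possible to line up a read that returned old data with the write it missed.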

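For Redis, the `MONITOR` command is dangerous in production (it streams every command the server processes and can cut throughput sharply), but you can run it briefly on a replica; keep in mind that a replica only shows replicated writes plus whatever reads are sent to it directly. For Memcached, use `stats items` to see per-slab eviction counts and `stats cachedump` to inspect keys, but be aware that `cachedump` is not atomic and, because it takes a slab class and an item limit, only ever shows part of the cache.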

Scanning Redis keys for TTL and idle time

# Inspect Redis keys with TTL and idle time
$ redis-cli --scan --pattern 'user:*' | head -10 | while IFS= read -r key; do
    echo "$key: TTL=$(redis-cli ttl "$key") IDLE=$(redis-cli object idletime "$key")"
done

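Warning: Unlike `KEYS`, `redis-cli --scan` uses the cursor-based `SCAN` command and does not block the server, but a pattern scan still walks the entire key space, so on a large instance it adds sustained load. Run it in off-peak hours or on a replica.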

HTTP caching: the Vary trap

CDN and browser caches respect `Cache-Control` and `Vary` headers. A common source of stale data is the missing `Vary` header. If your API returns different content based on `Accept-Language` but you don't set `Vary: Accept-Language`, the CDN may serve the wrong language version to a user.

Debug with curl to inspect headers:

Verifying Vary header for multi-language caching

# Check Vary header from origin
$ curl -sI https://api.example.com/users/me -H "Accept-Language: fr" | grep -i vary
# Should return: vary: Accept-Language
# If missing, CDN will cache one version for all languages.
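To see why the missing header matters, here is a simplified sketch of how a shared cache could derive its cache key from `Vary`. It is not any particular CDN's implementation: `varyCacheKey` is a hypothetical function, request header names are assumed to be lowercase, and real caches also normalize header values.

```javascript
// Build the key a shared cache would use: the URL plus the value of each
// request header named in the response's Vary header (simplified).
function varyCacheKey(url, requestHeaders, varyHeader) {
  const names = (varyHeader || "")
    .split(",")
    .map((name) => name.trim().toLowerCase())
    .filter(Boolean)
    .sort();
  const parts = names.map((name) => `${name}=${requestHeaders[name] || ""}`);
  return [url, ...parts].join("|");
}

const url = "https://api.example.com/users/me";
// Without Vary, French and English requests collapse into one cache entry.
console.log(varyCacheKey(url, { "accept-language": "fr" }, "") ===
            varyCacheKey(url, { "accept-language": "en" }, "")); // true
// With Vary: Accept-Language they get separate entries.
console.log(varyCacheKey(url, { "accept-language": "fr" }, "Accept-Language") ===
            varyCacheKey(url, { "accept-language": "en" }, "Accept-Language")); // false
```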

47% of CDN-related stale data incidents are caused by missing or incorrect Vary headers (Source: 2023 Web Performance Survey)

Strategies to reduce stale window

- Use versioned keys: append a version number or timestamp to the cache key so that invalidation is immediate (change the version).
- Use write-through for critical data, but add circuit breakers to prevent cache failure from blocking writes.
- Set TTLs based on write frequency, not arbitrary thresholds. If data changes every 5 minutes, a 10-minute TTL is too long.
- Implement read-repair: on read, if the cache entry is older than a threshold, asynchronously refresh it from the database.
- Use conditional requests with ETags or Last-Modified headers for HTTP caches.
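The read-repair item from the list above can be sketched as follows. This is an illustration under assumptions, not a production implementation: a `Map` stands in for the cache, `loadFromDb` is a caller-supplied async function, and `getWithRefresh` and `softTtlMs` are names invented here.

```javascript
// In-memory stand-in for the cache; entries remember when they were stored.
const entries = new Map();

// Return the cached value immediately; if it is older than softTtlMs, start
// a background reload so a later read sees fresh data.
function getWithRefresh(key, loadFromDb, softTtlMs, now = Date.now()) {
  const entry = entries.get(key);
  if (!entry) {
    // Cold miss: the caller has to wait for the database.
    return loadFromDb(key).then((value) => {
      entries.set(key, { value, storedAt: now });
      return value;
    });
  }
  if (now - entry.storedAt > softTtlMs && !entry.refreshing) {
    entry.refreshing = true; // avoid a stampede of parallel reloads
    loadFromDb(key)
      .then((value) => entries.set(key, { value, storedAt: Date.now() }))
      .catch(() => { entry.refreshing = false; }); // keep serving the old value
  }
  return Promise.resolve(entry.value);
}
```

The read that triggers the refresh still gets the old value; the trade-off is one possibly stale response in exchange for never blocking a warm read on the database.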

Versioned cache key pattern for instant invalidation

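// Versioned cache key example in Node.js with Redis
// (assumes an ioredis-style `redis` client and a `db` helper)
async function getUser(userId) {
  // Default to 0: the first INCR on a missing key yields 1, so the key changes.
  const version = (await redis.get(`user:${userId}:version`)) || 0;
  const cacheKey = `user:${userId}:data:v${version}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // values are stored as JSON strings
  const data = await db.findUser(userId);
  await redis.set(cacheKey, JSON.stringify(data), 'EX', 600);
  return data;
}

// On update: bump the version so later reads build a new key
async function invalidateUser(userId) {
  await redis.incr(`user:${userId}:version`);
}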

Note: Versioned keys shift the invalidation cost from deletion to a small atomic increment. The old key remains in cache until TTL expires, but no new reads use it.

Wrapping up

Caching is not a set-and-forget optimization — it's a distributed system component that requires the same monitoring, testing, and debugging rigor as your database. The next time a user reports seeing old data, don't just clear the cache. Check the clock, check the Vary header, check the write-behind queue, and check your key naming conventions. Then add logging to prove the fix.

The goal is not zero staleness — that's impossible in an eventually consistent world. The goal is a staleness window you can measure, predict, and defend.

Frequently asked questions

How do I check if my Redis cache is returning stale data?

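Redis has no built-in staleness indicator. Note that `lfu-log-factor` and `lfu-decay-time` only tune the LFU eviction counter; they do not log anything. Use the `OBJECT IDLETIME` command to see how long ago a key was last accessed (it is unavailable under an LFU eviction policy, where `OBJECT FREQ` applies instead), and watch `evicted_keys` in `INFO stats` for evictions. For TTL-based staleness, instrument your code to log the cache key and its remaining TTL at read time so you can correlate with write events.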

Why would a CDN serve stale content even after I purged it?

CDN purges are eventually consistent — they might take seconds to propagate to all edge nodes. Additionally, if your origin returns `Cache-Control: max-age=0` but the CDN has a minimum TTL configured, it may ignore the directive. Check your CDN provider's TTL override settings.

What is the difference between write-through and write-behind caching for stale data?

Write-through writes to the cache and the database synchronously as part of the same operation: if the cache write fails but the database write still proceeds, the next read returns a stale cache entry. Write-behind writes to the cache immediately and to the database asynchronously: if the async write fails, the cache holds a value the database never stored, and the two stay out of sync until a manual refresh or TTL expiry.

How can I simulate a stale cache scenario in development?

Use a tool like `toxiproxy` to inject latency or failure on cache writes. Set a very long TTL (e.g., 24 hours) on a cache entry, then update the underlying database directly (bypassing the cache). The next read will return the old data.
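The long-TTL half of that recipe can be tried without any infrastructure. The sketch below uses in-memory stand-ins for both the database and the cache; `readThrough` and the injected `now` argument are assumptions made so the example runs on its own.

```javascript
// In-memory stand-ins for the database and a TTL cache (both hypothetical).
const db = new Map([["user:1", "old name"]]);
const cache = new Map();

function readThrough(key, ttlMs, now) {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > now) return hit.value; // served from cache
  const value = db.get(key);
  cache.set(key, { value, expiresAt: now + ttlMs });
  return value;
}

const DAY_MS = 24 * 60 * 60 * 1000;
readThrough("user:1", DAY_MS, 0);  // warms the cache
db.set("user:1", "new name");      // write that bypasses the cache
console.log(readThrough("user:1", DAY_MS, 1000));       // old name
console.log(readThrough("user:1", DAY_MS, DAY_MS + 1)); // new name
```

Passing the clock in as an argument lets a test jump past the TTL instead of waiting for it.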
