
Debugging Cache Invalidation Failures: A Case Study with Redis and PostgreSQL

Cache invalidation sounds simple — write-through, TTLs, done. But in practice, silent failures hide in race conditions, connection pools, and stale read replicas. Here's a real debugging story with Redis and PostgreSQL that taught me how to find them.

Tags: cache invalidation, Redis, PostgreSQL, debugging, stale reads, connection pooling

I spent a Thursday afternoon convinced I'd lost my mind. A simple cache invalidation flow — update a product price in PostgreSQL, delete the Redis key, then read back — was returning old prices. I checked the logs: invalidation fired. I checked Redis: key was gone. But the next read? Stale data. This wasn't a missing invalidation; it was something deeper.

Cache invalidation is famously one of the two hard things in computer science. But the hard part isn't usually the logic — it's the silent failure modes that make you doubt your own code. Over the next few days, I traced the problem through three distinct layers: write ordering, connection pooling, and stale replicas. Each one is worth understanding because they all look the same from the outside: a cache that refuses to die.

The Setup That Should Have Worked

Standard cache-aside pattern — update DB, invalidate cache, lazy re-population on next read.
// Simplified product service with cache-aside
async function updatePrice(productId, newPrice) {
  await db.query('UPDATE products SET price = $1 WHERE id = $2', [newPrice, productId]);
  await redis.del(`product:${productId}`);
}

async function getProduct(productId) {
  let cached = await redis.get(`product:${productId}`);
  if (cached) return JSON.parse(cached);
  const result = await db.query('SELECT * FROM products WHERE id = $1', [productId]);
  const product = result.rows[0];
  if (!product) return null; // unknown id: nothing to cache
  await redis.set(`product:${productId}`, JSON.stringify(product));
  return product;
}

The code above is textbook. It's what you'd write in an interview. But in production, it failed intermittently. The issue wasn't the pattern — it was the assumption that the database write and cache invalidation are atomic from the reader's perspective. They aren't.

Race #1: The Concurrent Read-After-Write

The Stale Window

  1. T+0ms: Service A writes new price to PostgreSQL (commit occurs at T+2ms).
  2. T+1ms: Service B reads cache: key exists with old price, returns stale data.
  3. T+2ms: Service A deletes Redis key.
  4. T+3ms: Service B finishes request with old price. User sees wrong data.

Lesson

The read hit the cache before the invalidation, but after the database write had already started. The reader never knew a write was in flight. Fix: use a distributed lock or compare-and-swap with version numbers to coordinate write and read.

That window exists in any system where the write and the invalidation are not a single atomic step. Swapping the order (invalidate, then write) gives a similar race: a reader can hit the now-empty cache, read the old row from the database before the write commits, and re-populate the cache with stale data. A reliable way to close the gap is to make the cache and database agree on a version.
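The swapped-order race is easier to see when the interleaving is replayed step by step. The sketch below uses a plain object and a `Map` as stand-ins for the database row and Redis (both are assumptions made so the example runs on its own); the prices are made up.

```javascript
// Stand-ins: `db` plays the products row, `cache` plays Redis.
const db = { price: 10 };
const cache = new Map();
cache.set('product:1', 10);

// Writer, step 1: invalidate first.
cache.delete('product:1');

// Reader: cache miss, so it reads the database, which still has the old price.
const seenByReader = db.price;

// Writer, step 2: the database write lands.
db.price = 12;

// Reader: writes what it saw back into the cache.
cache.set('product:1', seenByReader);

console.log(cache.get('product:1'), db.price); // prints "10 12"
```

The cache now holds the old price with no invalidation left to remove it, which is why reordering the two steps does not help by itself.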

Tip

Use a monotonic version column in your database table. When updating, increment the version. Store the version in the cache alongside the data. On read, compare versions. If the cache version is behind the database version, discard and re-fetch. This turns a race condition into a detect-and-retry.
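Here is a minimal sketch of the version check described above. It assumes a `version` column on the row and uses in-memory stand-ins for Redis and PostgreSQL so it can run on its own; `readProduct`, `db.getVersion`, `db.getRow`, and `db.updatePrice` are illustrative names, not part of any library.

```javascript
// In-memory stand-ins (assumption): a real service would use a Redis
// client for `cache` and a PostgreSQL pool behind `db`.
const cache = new Map(); // key -> JSON string
const table = new Map(); // productId -> { id, price, version }

const db = {
  async getVersion(id) {
    return table.get(id)?.version ?? null;
  },
  async getRow(id) {
    const row = table.get(id);
    return row ? { ...row } : null;
  },
  async updatePrice(id, price) {
    // Stands in for: UPDATE products SET price = $1, version = version + 1
    const row = table.get(id);
    table.set(id, { ...row, price, version: row.version + 1 });
  },
};

async function readProduct(id) {
  const key = `product:${id}`;
  const currentVersion = await db.getVersion(id);
  if (currentVersion === null) return null; // no such product

  const raw = cache.get(key);
  if (raw) {
    const entry = JSON.parse(raw);
    // Serve from cache only if it is at least as new as the database row.
    if (entry.version >= currentVersion) return entry;
    cache.delete(key); // stale: discard and fall through to re-fetch
  }

  const row = await db.getRow(id);
  if (row) cache.set(key, JSON.stringify(row));
  return row;
}
```

The version lookup still costs a database round trip on every read, so this trades some of the cache's latency benefit for the ability to detect staleness. Running the check on a sample of reads is one way to soften that cost.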

The Connection Pool Phantom

After adding version checks, the stale reads dropped but didn't disappear. Some users still saw old prices minutes after an update. Redis keys were definitely being deleted — I verified with `MONITOR` — yet the cache hits returned old data. How?

The answer was in the client layer. We were running `ioredis` behind a connection pool. Redis executes commands on a single thread, and a DEL is visible to every connection as soon as it completes, so it doesn't matter that the invalidation (DEL) and the subsequent read (GET) may travel on different pooled connections. The pool was a red herring. The real culprit was a middleware that cached the result of each GET in a local in-memory store for 500ms to reduce Redis round trips. The invalidation cleared only the Redis key, not that local cache, so the local cache kept returning stale data.

Local cache was not cleared when the underlying Redis key was invalidated. The 500ms window allowed stale reads.
// Problem: local cache not invalidated on write
const localCache = new Map();

async function getProduct(productId) {
  if (localCache.has(productId)) {
    return localCache.get(productId); // stale after write
  }
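  const cached = await redis.get(`product:${productId}`);
  if (cached) {
    const product = JSON.parse(cached); // parse once, reuse below
    localCache.set(productId, product);
    // Entry expires after 500ms, but nothing clears it when the Redis key is deleted
    setTimeout(() => localCache.delete(productId), 500);
    return product;
  }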
  // ... fetch from DB
}

The fix was to either remove the local cache entirely or propagate invalidations to it. We chose the latter: each invalidation now publishes a message through Redis Pub/Sub, and all service instances listen and clear their local cache. That added complexity, but it eliminated the phantom stale reads.
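The shape of that fix can be sketched as follows. To keep the example runnable without a Redis server, a tiny in-process `Bus` class stands in for Redis Pub/Sub; `LocalProductCache`, `publishInvalidation`, and the channel name are hypothetical. With `ioredis`, the corresponding pieces would be `subscribe`, the `message` event, and `publish`, with subscribing done on its own dedicated connection.

```javascript
// In-process stand-in for Redis Pub/Sub (assumption for this sketch).
class Bus {
  constructor() {
    this.handlers = new Map(); // channel -> array of handlers
  }
  subscribe(channel, handler) {
    if (!this.handlers.has(channel)) this.handlers.set(channel, []);
    this.handlers.get(channel).push(handler);
  }
  publish(channel, message) {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }
}

const INVALIDATE_CHANNEL = 'product-invalidations'; // hypothetical name

// One instance of this per service process.
class LocalProductCache {
  constructor(bus, ttlMs = 500, now = Date.now) {
    this.entries = new Map(); // productId (string) -> { value, expiresAt }
    this.ttlMs = ttlMs;
    this.now = now;
    // Drop the local copy whenever any instance announces an invalidation.
    bus.subscribe(INVALIDATE_CHANNEL, (id) => this.entries.delete(id));
  }
  get(productId) {
    const entry = this.entries.get(String(productId));
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(String(productId));
      return undefined;
    }
    return entry.value;
  }
  set(productId, value) {
    this.entries.set(String(productId), {
      value,
      expiresAt: this.now() + this.ttlMs,
    });
  }
}

// Called by the writer after the database update and the Redis DEL.
function publishInvalidation(bus, productId) {
  bus.publish(INVALIDATE_CHANNEL, String(productId));
}
```

Redis Pub/Sub delivers each message at most once, so an instance that is disconnected at the wrong moment misses the invalidation. Keeping the short local TTL as a backstop covers that case.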

The Replica Lag Ambush

The last failure mode was the most humbling. After fixing the local cache, stale reads still occurred — but only for a subset of users, and always within the first 10 seconds after a write. The pattern: a write succeeded, cache was invalidated, next read missed cache, hit the database, and got an old price. How?

Our read queries were hitting a read replica with replication lag. The write had committed on the primary, but the replica hadn't caught up. So the cache miss triggered a database read that returned pre-write data, which then got written back into Redis as the 'fresh' value — cementing the stale data for the next TTL period.

Average replication lag during peak traffic: 10s, enough to cause persistent stale cache entries after every write.

Lag in seconds can vary wildly. We saw spikes up to 15s during write-heavy batch jobs.
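-- Run on the primary: lag per replica (replay_lag requires PostgreSQL 10+)
SELECT
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
  ROUND(EXTRACT(EPOCH FROM replay_lag)) AS lag_seconds
FROM pg_stat_replication;

-- Run on a replica: seconds since the last replayed transaction
SELECT ROUND(EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))) AS lag_seconds;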

The fix was to route cache-miss database reads to the primary for a short cooldown period after a write. We implemented a 'sticky write' mechanism: after updating a product, the service records the update's timestamp in a small in-memory map keyed by product ID. For the next 5 seconds, any read for that product goes directly to the primary; after that, reads return to the replicas. The window only helps while it is longer than the replication lag actually observed, so it should be sized against measured lag rather than treated as a constant.
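A minimal sketch of that routing decision, assuming the timestamps are kept per process. `StickyWrites` and `pickPool` are hypothetical names, and `primary` and `replica` stand for whatever handles the service uses for its two database pools. The clock is injectable so the window can be tested without waiting.

```javascript
const STICKY_MS = 5000; // the 5-second window from the text

class StickyWrites {
  constructor(windowMs = STICKY_MS, now = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
    this.lastWrite = new Map(); // productId -> time of last write (ms)
  }

  // Call right after a successful update on the primary.
  recordWrite(productId) {
    this.lastWrite.set(productId, this.now());
  }

  // True while reads for this product should go to the primary.
  shouldUsePrimary(productId) {
    const writtenAt = this.lastWrite.get(productId);
    if (writtenAt === undefined) return false;
    if (this.now() - writtenAt < this.windowMs) return true;
    this.lastWrite.delete(productId); // window over: stop tracking
    return false;
  }
}

// Chooses the pool for a cache-miss read.
function pickPool(sticky, productId, primary, replica) {
  return sticky.shouldUsePrimary(productId) ? primary : replica;
}
```

Because the map lives in one process, a read served by a different instance will not see the marker. Sharing it, for example as a short-lived Redis key, is one way around that, at the cost of an extra round trip.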

Warning

Don't assume your read replicas are always consistent. If your cache invalidation relies on reading fresh data from the DB, you must consider replication lag. The simplest solution: always read from the primary for cache misses, or use a strongly consistent read path for data that was recently written.

Building a Detection System for Stale Reads

After fixing all three issues, I wanted to make sure we could detect regressions quickly. We added two things: a version field to every cached object and a metrics endpoint that exposes cache staleness. The version is a `last_updated` timestamp from the database row. On every read, we log the version difference between cache and database. If the difference exceeds 2 seconds (our tolerable window), we fire an alert.
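The comparison itself is small. The sketch below assumes both values are `last_updated` timestamps (a `Date` or an ISO 8601 string) and that alerting keys off a structured log line; `stalenessMs`, `checkStaleness`, and the log field names are illustrative rather than taken from our codebase.

```javascript
const STALENESS_THRESHOLD_MS = 2000; // the 2-second tolerable window

// How far the cached copy is behind the database row, in milliseconds.
// Returns 0 when the cache is as new as (or newer than) the row.
function stalenessMs(cachedLastUpdated, dbLastUpdated) {
  const lag =
    new Date(dbLastUpdated).getTime() - new Date(cachedLastUpdated).getTime();
  return Math.max(0, lag);
}

// Logs one structured line per check and reports whether the
// threshold was exceeded. `log` is injectable for testing.
function checkStaleness(productId, cachedLastUpdated, dbLastUpdated, log = console.log) {
  const lagMs = stalenessMs(cachedLastUpdated, dbLastUpdated);
  const stale = lagMs > STALENESS_THRESHOLD_MS;
  log(JSON.stringify({ event: 'cache_staleness', productId, lagMs, stale }));
  return { lagMs, stale };
}
```

An alert rule can then fire on `stale: true`, and the `lagMs` values give a distribution to chart over time.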

We also wrote integration tests that specifically target race conditions. One test spawns two goroutines (we're in Go now) — one updates a record 100 times, another reads continuously. After all updates complete, we assert the final read returns the latest value. This test caught a new race when we switched to a different Redis client.

Simple but effective — runs in CI and catches invalidation races within seconds.
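// Integration test for cache staleness (needs the "testing" and "time" imports)
func TestCacheInvalidationRace(t *testing.T) {
  const lastPrice = 99 // final value written by the updater
  done := make(chan bool)
  stop := make(chan struct{})
  defer close(stop) // stops the reader even if it never sees the latest price
  // Writer: sets prices 0..99 in order.
  go func() {
    for i := 0; i <= lastPrice; i++ {
      updateProductPrice(1, float64(i))
      time.Sleep(10 * time.Millisecond)
    }
    done <- true
  }()
  // Reader: reads continuously so the cache keeps being re-populated mid-write.
  go func() {
    for {
      select {
      case <-stop:
        return
      default:
      }
      p, err := getProduct(1)
      if err == nil && p.Price >= lastPrice {
        return // we got the latest
      }
      time.Sleep(5 * time.Millisecond)
    }
  }()
  <-done
  time.Sleep(2 * time.Second) // wait for propagation
  p, err := getProduct(1)
  if err != nil {
    t.Fatalf("getProduct failed: %v", err)
  }
  if p.Price != lastPrice {
    t.Errorf("expected price %v, got %v", lastPrice, p.Price)
  }
}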

Key Takeaways

  • Cache invalidation is not a single operation; it's a distributed transaction between cache, database, and application state.
  • Connection pools and local caches introduce hidden state that can bypass explicit invalidation.
  • Replication lag is a first-class citizen in any cache invalidation design. Plan for it, don't ignore it.
  • Version your cached data. It's cheap and gives you a direct way to detect staleness.
  • Write integration tests that specifically target race conditions. They'll pay for themselves in the first incident they catch.
  • Use structured logging with correlation IDs to trace a single request's write, invalidation, and read paths across services.

The hardest part of cache invalidation isn't the algorithm — it's the assumptions you make about the environment. Once you start treating every network hop and every piece of middleware as a potential liar, you'll find the bugs faster. And when you do, you'll realize the problem wasn't that cache invalidation is hard — it's that we trust our infrastructure too much.

Frequently asked questions

Why does cache invalidation fail even with write-through caching?

Write-through caching writes to cache and database synchronously, but the two writes are still not atomic. If the database write succeeds and the cache update then fails (e.g., a network blip), subsequent reads keep hitting the stale cache entry. Race conditions between concurrent writes and invalidations can also leave old data in cache.

How can I detect stale cache reads in production?

Add a version field or last-updated timestamp to your cached objects. Log the version during reads and compare it against the database version. Set up alerts when version mismatch exceeds a threshold. Also, use cache hit/miss ratio monitoring with anomaly detection.

What role does connection pooling play in cache invalidation bugs?

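The pool itself is rarely the cause: Redis applies a DEL for every connection at once, so it doesn't matter which pooled connection carries the invalidation or the read. The risk is client-side state layered around the pool, such as a local in-memory cache in middleware, that isn't cleared when the key is deleted. This is especially sneaky because it's intermittent and load-dependent, and it looks like a pooling problem from the outside.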

Should I use cache TTLs as a safety net for invalidation?

Yes, TTLs are a safety net, not a primary invalidation strategy. They limit damage from leaked stale data but can't prevent inconsistencies within the TTL window. Combine short TTLs with explicit invalidation for critical data.