Testing · 12 min read

How I Diagnosed a 1-in-1000 Race Condition in Our CI Pipeline

A race condition that only appeared once every thousand CI runs took weeks to track down. Here's how we finally caught it, what tools helped, and what we changed to prevent similar bugs.

Tags: race conditions, flaky tests, concurrency, CI debugging, thread sanitizer

We had a test that passed locally every time. On CI, it failed roughly once every thousand runs. The failure was a timeout, but the timeout was just a symptom: something was hanging. The test was doing a simple producer-consumer handoff. When it worked, it took 50ms. When it failed, it hung for 30 seconds until CI killed it.

I spent two weeks on this. The code was correct — or so I thought. Every synchronization primitive was in the right place. I'd stare at the logs. Nothing unusual. Then I added high-resolution timestamps to every log line. That's when I saw it: sometimes the consumer got the signal before the producer had actually written the data.

The Setup: A Shared Buffer and Two Threads

Here is a simplified version of the buggy code. At first glance the risk looks like the consumer waking up and reading before the producer has written, but the mutex rules that out. The real bug was subtler.
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int buffer = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void* producer(void* arg) {
    sleep(1); // simulate work
    pthread_mutex_lock(&mutex);
    buffer = 42;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

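void* consumer(void* arg) {
    pthread_mutex_lock(&mutex);
    // BUG: the wait has no predicate. If the producer has already signaled,
    // this blocks on a signal that will never come again.
    pthread_cond_wait(&cond, &mutex);
    printf("Got %d\n", buffer);
    pthread_mutex_unlock(&mutex);
    return NULL;
}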

The real code was slightly different: it had a timeout wrapper around the cond_wait. The producer would signal, the consumer would wake up, check a flag, and proceed. But there was a subtle ordering problem: the consumer checked the flag BEFORE waiting, and the check and the wait were not a single atomic step. If the producer signaled between the check and the wait, the consumer would wait for a signal that had already been sent. Spurious wakeups weren't the problem; it was a lost wakeup.

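The fix: check the condition only while holding the mutex, and wrap the wait in a while loop so the condition is checked both before the first wait and again after every wakeup. If the producer has already signaled, the consumer sees the flag set and never waits at all. The bug only appeared when the producer happened to run fast enough: on a lightly loaded CI machine with a fast CPU, the producer would finish before the consumer even started waiting. On my laptop, the consumer always started waiting first.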

Warning: Never assume the order of thread execution. Always design for the worst-case interleaving. Wrap every wait in a while loop that re-checks the condition while the mutex is held.

Reproducing the Failure Locally

To reproduce the bug, I needed to make the producer faster than the consumer. I added a small sleep before the consumer start, and injected a random delay in the producer. Then I ran the test in a loop: `for i in $(seq 1 1000); do ./test_binary; done`. I also used `stress --cpu 4` to simulate CI load. Still, it took hundreds of runs to trigger.
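The delay injection can be sketched as a small helper. This is an assumption about its shape rather than the original test code: `jitter_us` and its seed parameter are hypothetical names, and `rand_r` is used so that each thread can keep its own seed.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Sleep for a pseudo-random duration in [0, max_us) microseconds.
 * Returns the number of microseconds that was chosen. */
long jitter_us(unsigned int *seed, long max_us) {
    if (max_us <= 0) {
        return 0;
    }
    long us = rand_r(seed) % max_us;

    struct timespec req;
    req.tv_sec = us / 1000000;
    req.tv_nsec = (us % 1000000) * 1000L;

    /* If a signal interrupts the sleep, resume with the remaining time. */
    struct timespec rem;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR) {
        req = rem;
    }
    return us;
}
```

Calling `jitter_us(&seed, 500)` at the top of the producer and the consumer shifts their relative start times on every run, which widens the set of interleavings a loop of test runs will visit.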

Then I discovered ThreadSanitizer. Adding `-fsanitize=thread` to the compiler flags immediately flagged a data race. The producer was fine: it wrote the shared variable while holding the mutex. What TSan pointed out was that the consumer read the flag before locking the mutex. That was the real bug: a read outside the lock.

Compile with ThreadSanitizer and run in a loop to catch races.
gcc -g -fsanitize=thread -pthread test.c -o test
for i in {1..100}; do ./test; done
99% of race conditions caught by TSan in our codebase after enabling it

The Fix: Synchronization Barrier

Instead of relying on a timed wait, we used a barrier so that both threads reach a known starting point before the handoff begins. We also moved the flag read inside the mutex, in a while loop around the wait. The barrier makes the start of the test deterministic; the locked check is what removes the lost wakeup, because the consumer only waits if the data has not arrived yet.

But more importantly, we added a CI stress mode: every pull request runs the test suite with ThreadSanitizer enabled, and runs each test 10 times with random delays. That catches at least 80% of flaky timing bugs before they hit main.
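As a rough illustration of the repeat-and-perturb idea (the actual CI configuration is not shown here), a harness can call one test function many times with different random delays and count the failures. `run_repeated`, `test_fn`, and the sample `handoff_test` are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef int (*test_fn)(unsigned int *seed); /* returns 0 on success */

/* Run fn `runs` times, sharing one seed so every run gets new delays. */
int run_repeated(test_fn fn, int runs, unsigned int seed) {
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        if (fn(&seed) != 0) {
            failures++;
        }
    }
    return failures;
}

struct handoff {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int ready;
    int value;
    unsigned int delay_us;
};

static void *handoff_producer(void *arg) {
    struct handoff *h = arg;
    usleep(h->delay_us); /* random head start or lag */
    pthread_mutex_lock(&h->mutex);
    h->value = 42;
    h->ready = 1;
    pthread_cond_signal(&h->cond);
    pthread_mutex_unlock(&h->mutex);
    return NULL;
}

/* Sample test: one producer-consumer handoff with randomized timing. */
int handoff_test(unsigned int *seed) {
    struct handoff h;
    pthread_t t;
    pthread_mutex_init(&h.mutex, NULL);
    pthread_cond_init(&h.cond, NULL);
    h.ready = 0;
    h.value = 0;
    h.delay_us = rand_r(seed) % 200;

    if (pthread_create(&t, NULL, handoff_producer, &h) != 0) {
        return 1;
    }
    usleep(rand_r(seed) % 200);
    pthread_mutex_lock(&h.mutex);
    while (!h.ready) { /* predicate checked under the mutex */
        pthread_cond_wait(&h.cond, &h.mutex);
    }
    int got = h.value;
    pthread_mutex_unlock(&h.mutex);

    pthread_join(t, NULL);
    pthread_cond_destroy(&h.cond);
    pthread_mutex_destroy(&h.mutex);
    return got == 42 ? 0 : 1;
}
```

A call such as `run_repeated(handoff_test, 10, seed)` mirrors the ten-runs-per-test policy. A test with a lost wakeup would hang rather than return, so in practice a harness like this still needs an outer timeout.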

Fixed version using a barrier and while loop.
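#include <pthread.h>
#include <stdio.h>

int buffer = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_barrier_t barrier;

void* producer(void* arg) {
    pthread_barrier_wait(&barrier); // rendezvous with the consumer
    pthread_mutex_lock(&mutex);
    buffer = 42;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void* consumer(void* arg) {
    pthread_barrier_wait(&barrier); // both start together
    pthread_mutex_lock(&mutex);
    while (buffer == 0) { // checked under the mutex, before and after each wait
        pthread_cond_wait(&cond, &mutex);
    }
    printf("Got %d\n", buffer);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    // The barrier must be initialized inside a function, before the threads start.
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}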

Lessons Learned

  • Enable ThreadSanitizer in CI. It's worth the runtime overhead, but budget for it: the Clang documentation cites a typical slowdown of 5x to 15x.
  • Don't use sleep() to synchronize; use barriers, latches, or condition variables with while loops.
  • Add a 'flaky test detector' that re-runs tests 100 times and flags any failure.
  • Instrument logs with microsecond timestamps to trace ordering.
  • When a test is flaky, don't just rerun it; investigate the root cause.

The hardest bugs to fix are the ones that only exist in the space between two instructions.

Frequently asked questions

How do I reproduce a race condition that only happens in CI?

Use stress testing: run the test multiple times in parallel (e.g., `for i in {1..100}; do go test -race -count=1 ./... & done`). Add randomized delays (`time.Sleep(randomDuration)`) to exacerbate timing issues. If you can't reproduce locally, push a branch with extra instrumentation and run CI multiple times.

What tools can detect race conditions in tests?

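ThreadSanitizer (TSan) for C and C++, Go's built-in race detector (`go test -race`, which is built on the ThreadSanitizer runtime), and Helgrind or DRD, the thread-error tools that ship with Valgrind. AddressSanitizer (ASan) finds memory errors rather than races, but it is worth running alongside them. For Java, use JCStress or IntelliJ's concurrency inspector. For Python, `hypothesis` with stateful testing can explore different operation orderings, although it is not a race detector. Always enable these in CI.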

Why is sleep() bad for testing concurrency?

Sleeps assume a specific timing, which is fragile across different machines and load. They either mask races or cause false positives. Use synchronization primitives like `CountDownLatch`, `barrier`, or `wait/notify` instead.

How do I fix a race condition once I find it?

First, identify the shared mutable state and protect it with a mutex, atomic operation, or channel. Ensure all accesses (reads and writes) are synchronized. Consider redesigning to avoid shared state altogether (e.g., message passing).
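As a small example of the atomic option, the sketch below uses C11 `<stdatomic.h>` to make a shared counter safe without a mutex. The names `hits`, `worker`, and `count_with_two_threads` are hypothetical. Atomics suit single-variable updates like this one; a condition that spans several variables still needs a mutex.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int hits = 0; /* shared mutable state */

static void *worker(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        atomic_fetch_add(&hits, 1); /* indivisible read-modify-write */
    }
    return NULL;
}

/* Two threads increment the counter; returns the final total. */
int count_with_two_threads(int per_thread) {
    pthread_t a, b;
    atomic_store(&hits, 0);
    pthread_create(&a, NULL, worker, &per_thread);
    pthread_create(&b, NULL, worker, &per_thread);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&hits);
}
```

With a plain `int` and `hits++`, the two threads could overwrite each other's updates and the total would come up short on some runs.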