We had a test that passed locally every time. On CI, it failed maybe once every few hundred runs. The failure was a timeout, but the timeout was just a symptom — something was hanging. The test was doing a simple producer-consumer handoff. When it worked, it took 50ms. When it failed, it took 30 seconds and then the CI killed it.
I spent two weeks on this. The code was correct, or so I thought. Every synchronization primitive was in the right place. I'd stare at the logs. Nothing unusual. Then I added high-resolution timestamps to every log line. That's when I saw it: sometimes the producer sent its signal before the consumer had actually started waiting for it.
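For readers who want to try the same trick, here is a minimal sketch of a timestamped log formatter. It is not the logging code from the project; `format_log_line` is a hypothetical helper, and the choice of `CLOCK_MONOTONIC` and microsecond resolution is an assumption.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: writes "[seconds.microseconds] message" into out.
 * Returns the number of characters written (excluding the terminator),
 * or -1 on error or if the prefix does not fit in cap bytes. */
int format_log_line(char *out, size_t cap, const char *fmt, ...) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) return -1;

    /* Monotonic time never jumps backwards, so the ordering of two
     * lines is meaningful even if the wall clock is adjusted. */
    int n = snprintf(out, cap, "[%lld.%06ld] ",
                     (long long)ts.tv_sec, ts.tv_nsec / 1000);
    if (n < 0 || (size_t)n >= cap) return -1;

    va_list ap;
    va_start(ap, fmt);
    int m = vsnprintf(out + n, cap - (size_t)n, fmt, ap);
    va_end(ap);
    return m < 0 ? -1 : n + m;
}
```

Calling it right before and right after each lock, wait, and signal gives a per-thread timeline that can be merged and sorted afterwards.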
The Setup: A Shared Buffer and Two Threads
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int buffer = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void* producer(void* arg) {
    sleep(1); // simulate work
    pthread_mutex_lock(&mutex);
    buffer = 42;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void* consumer(void* arg) {
    pthread_mutex_lock(&mutex);
    // Simplified: waits with no predicate check, so a signal sent
    // before this call would be lost.
    pthread_cond_wait(&cond, &mutex);
    printf("Got %d\n", buffer);
    pthread_mutex_unlock(&mutex);
    return NULL;
}
Actually, the real code had a timeout wrapper around the cond_wait, plus a flag. The producer would set the flag and signal; the consumer would check the flag and, if it was not set yet, wait. The subtle part was the ordering: the consumer checked the flag BEFORE taking the lock, so the check and the wait were not atomic. If the producer signaled between the check and the wait, nobody was waiting yet, the signal was dropped, and the consumer would wait forever (or, with the timeout wrapper, until the timeout expired). Spurious wakeups weren't the problem. It was a lost wakeup.
The fix: check the condition only while holding the mutex, and re-check it every time wait returns. Checking before the first wait is fine, and in fact necessary, as long as the check happens under the same lock the producer holds when it sets the flag and signals. The bug only appeared when the producer happened to run fast enough: on a lightly loaded CI machine with a fast CPU, the producer could signal before the consumer had started waiting. On my laptop, the consumer always started waiting first.
Never assume the order of thread execution. Always design for the worst-case interleaving. Wrap the wait in a while loop that re-checks the condition under the mutex.
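A minimal sketch of that pattern, including a timeout like the one the real code had. The names `publish`, `wait_for_value`, and `ready` are made up for this illustration and are not from the original test.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static bool ready = false;  /* the predicate; only touched under mtx */
static int  value = 0;

/* Producer side: publish the value and set the predicate under the lock. */
void publish(int v) {
    pthread_mutex_lock(&mtx);
    value = v;
    ready = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
}

/* Consumer side: wait up to timeout_sec seconds for the predicate.
 * Returns 0 and stores the value in *out on success, or an error
 * number such as ETIMEDOUT on failure. */
int wait_for_value(int timeout_sec, int *out) {
    struct timespec deadline;
    /* CLOCK_REALTIME is the default clock for condition variables. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0) return errno;
    deadline.tv_sec += timeout_sec;

    int rc = 0;
    pthread_mutex_lock(&mtx);
    /* The check and the wait both happen under mtx, so a signal cannot
     * slip in between them. The loop also absorbs spurious wakeups. */
    while (!ready && rc == 0) {
        rc = pthread_cond_timedwait(&cv, &mtx, &deadline);
    }
    if (ready) {
        *out = value;
        rc = 0;
    }
    pthread_mutex_unlock(&mtx);
    return rc;
}
```

Because the flag is the source of truth, it does not matter whether `publish` runs before or after the consumer reaches `wait_for_value`. If the signal was sent early, the consumer sees `ready` already set and never waits at all.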
Reproducing the Failure Locally
To reproduce the bug, I needed to make the producer faster than the consumer. I added a small sleep before the consumer started, and injected a random delay in the producer. Then I ran the test in a loop: `for i in $(seq 1 1000); do ./test_binary; done`. I also used `stress --cpu 4` to simulate CI load. Still, it took hundreds of runs to trigger.
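The random delay can be as simple as the sketch below. `inject_delay_us` is a hypothetical test-only helper, not the code from the original test, and taking the seed as a parameter is an assumption made so that a failing run can be replayed.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep and rand_r */
#include <stdlib.h>
#include <time.h>

/* Hypothetical test-only helper: sleeps for a pseudo-random duration in
 * [0, max_us) microseconds and returns the duration it picked.
 * *seed is caller-owned rand_r state, so each thread can keep its own. */
long inject_delay_us(unsigned int *seed, long max_us) {
    if (max_us <= 0) return 0;

    long us = rand_r(seed) % max_us;
    struct timespec d = {
        .tv_sec  = us / 1000000L,
        .tv_nsec = (us % 1000000L) * 1000L,
    };
    nanosleep(&d, NULL);
    return us;
}
```

Logging the seed at the start of each run makes it possible to reuse the same delay sequence when a run fails, although the scheduler still has the final say over the actual interleaving.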
Then I discovered ThreadSanitizer. Adding `-fsanitize=thread` to the compiler flags immediately flagged a data race. At first I read the report as the producer writing a shared variable without holding the mutex, but it was holding the mutex. What TSan was actually pointing at was the consumer reading the flag before locking the mutex. That was the real bug: a read outside the lock.
gcc -g -fsanitize=thread -pthread test.c -o test
for i in {1..100}; do ./test; done
The Fix: Synchronization Barrier
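Instead of relying on a timed wait, we used a barrier to ensure the consumer was ready before the producer started. We also moved the flag read inside the mutex and wrapped the wait in a while loop. The barrier makes the start of the test deterministic: both threads synchronize at it, then proceed. The locked read and the loop are what actually close the lost-wakeup window, because the consumer can no longer miss a signal sent between its check and its wait. No more lost wakeups.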
But more importantly, we added a CI stress mode: every pull request runs the test suite with ThreadSanitizer enabled, and runs each test 10 times with random delays. That catches at least 80% of flaky timing bugs before they hit main.
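// Reuses buffer, mutex, and cond from the setup code.
pthread_barrier_t barrier;

void* producer(void* arg) {
    pthread_barrier_wait(&barrier); // wait for consumer
    pthread_mutex_lock(&mutex);
    buffer = 42;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

void* consumer(void* arg) {
    pthread_barrier_wait(&barrier); // both start together
    pthread_mutex_lock(&mutex);
    while (buffer == 0) { // re-check the predicate under the lock
        pthread_cond_wait(&cond, &mutex);
    }
    printf("Got %d\n", buffer);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

// Minimal driver, added for completeness; the original snippet had none.
int main(void) {
    pthread_t p, c;
    // The init call has to run inside a function, before either thread starts.
    pthread_barrier_init(&barrier, NULL, 2);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
Lessons Learned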
- Enable ThreadSanitizer in CI. It's worth the runtime overhead (the Clang documentation cites a typical slowdown of 5x to 15x).
- Don't use sleep() to synchronize; use barriers, latches, or condition variables with while loops.
- Add a 'flaky test detector' that re-runs tests 100 times and flags any failure.
- Instrument logs with microsecond timestamps to trace ordering.
- When a test is flaky, don't just rerun it. Investigate the root cause.
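As a rough illustration of the 'flaky test detector' idea from the list above, here is a minimal in-process sketch. It is not our CI tooling: `count_failures`, `test_fn`, and the two stand-in tests are invented for the example, and a real detector would more likely run each test in its own process so that a hang can be killed by a timeout.

```c
#include <stdio.h>

/* A test is any function that returns 0 on success, non-zero on failure. */
typedef int (*test_fn)(void);

/* Runs `test` `runs` times and returns how many of those runs failed. */
int count_failures(const char *name, test_fn test, int runs) {
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        if (test() != 0) {
            failures++;
            fprintf(stderr, "%s: run %d of %d failed\n", name, i + 1, runs);
        }
    }
    return failures;
}

/* Stand-in tests, for illustration only. */
static int always_passes(void) { return 0; }

static int fails_every_third(void) {
    static int calls = 0;
    return (++calls % 3 == 0) ? 1 : 0;
}
```

Any non-zero result from `count_failures` is worth a bug report, even when a single run of the same test would almost always be green.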
The hardest bugs to fix are the ones that only exist in the space between two instructions.
Frequently asked questions
How do I reproduce a race condition that only happens in CI?
Use stress testing: run the test multiple times in parallel (e.g., `for i in {1..100}; do go test -race -count=1 ./... & done`). Add randomized delays (`time.Sleep(randomDuration)`) to exacerbate timing issues. If you can't reproduce locally, push a branch with extra instrumentation and run CI multiple times.
What tools can detect race conditions in tests?
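ThreadSanitizer (TSan) for C and C++, and Go's built-in race detector (`go test -race`), which is based on it. Helgrind, a Valgrind tool, covers pthreads programs. AddressSanitizer (ASan) catches memory errors rather than races, but it is a useful companion. For Java, use JCStress or IntelliJ's concurrency inspections. For Python, `hypothesis` with stateful testing can explore operation orderings, though it is not a race detector. Always enable these in CI.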
Why is sleep() bad for testing concurrency?
Sleeps assume a specific timing, which is fragile across different machines and load. They either mask races or cause false positives. Use synchronization primitives like `CountDownLatch`, `barrier`, or `wait/notify` instead.
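C has no standard equivalent of Java's `CountDownLatch`, but one can be sketched from a mutex and a condition variable. The `latch_t` type and its functions below are hypothetical names for this example, not a library API.

```c
#include <pthread.h>

/* Hypothetical one-shot countdown latch. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int             remaining;
} latch_t;

/* Returns 0 on success or a pthread error number. */
int latch_init(latch_t *l, int count) {
    l->remaining = count;
    int rc = pthread_mutex_init(&l->mtx, NULL);
    if (rc != 0) return rc;
    rc = pthread_cond_init(&l->cv, NULL);
    if (rc != 0) pthread_mutex_destroy(&l->mtx);
    return rc;
}

/* Called by each worker when its step is done. */
void latch_count_down(latch_t *l) {
    pthread_mutex_lock(&l->mtx);
    if (l->remaining > 0 && --l->remaining == 0) {
        pthread_cond_broadcast(&l->cv);  /* wake every waiter */
    }
    pthread_mutex_unlock(&l->mtx);
}

/* Blocks until the count reaches zero; returns at once if it already has. */
void latch_wait(latch_t *l) {
    pthread_mutex_lock(&l->mtx);
    while (l->remaining > 0) {
        pthread_cond_wait(&l->cv, &l->mtx);
    }
    pthread_mutex_unlock(&l->mtx);
}

void latch_destroy(latch_t *l) {
    pthread_cond_destroy(&l->cv);
    pthread_mutex_destroy(&l->mtx);
}

/* Example worker: does nothing except report that it has finished. */
static void *worker(void *arg) {
    latch_count_down((latch_t *)arg);
    return NULL;
}
```

A test that used to `sleep(1)` and hope the workers were done can call `latch_wait` instead, which returns as soon as the last worker counts down and no sooner.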
How do I fix a race condition once I find it?
First, identify the shared mutable state and protect it with a mutex, atomic operation, or channel. Ensure all accesses (reads and writes) are synchronized. Consider redesigning to avoid shared state altogether (e.g., message passing).
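For the atomic-operation option, a minimal sketch using C11 `<stdatomic.h>` might look like this. The counter, the thread count, and the function names are all invented for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N_THREADS    4
#define N_INCREMENTS 10000

/* Shared state: every access goes through an atomic operation. */
static atomic_int hits = 0;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < N_INCREMENTS; i++) {
        atomic_fetch_add(&hits, 1);  /* indivisible read-modify-write */
    }
    return NULL;
}

/* Starts the workers, waits for them, and returns the final count. */
int run_counter(void) {
    pthread_t t[N_THREADS];
    int started = 0;

    atomic_store(&hits, 0);
    for (; started < N_THREADS; started++) {
        if (pthread_create(&t[started], NULL, bump, NULL) != 0) break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
    }
    return atomic_load(&hits);
}
```

With a plain `int` and `hits++`, increments from different threads could overwrite each other and the total could come up short. Atomics suit a single counter or flag like this one; once several variables have to change together, a mutex is usually the simpler choice.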