I've spent years debugging flaky tests — those intermittent failures that pass 99 times out of 100 on your machine but fail on CI, only to pass again when you rerun. They erode trust in your test suite and waste developer hours. The core problem is reproducibility: if you can't make it fail on demand, you can't fix it.
This post covers the techniques I use to reproduce flaky tests systematically. Not the theory — the actual commands, tools, and code changes that turn a heisenbug into a repeatable failure.
First, Classify the Flake
Before you start debugging, figure out what kind of flake you're dealing with. In my experience, most fall into one of three categories:
1. Race conditions — two goroutines, threads, or async tasks access shared state without synchronization. These depend on scheduling and are sensitive to CPU load.
2. Timeout or ordering — a test expects an event to happen within a fixed duration, but under CI load it takes slightly longer, or a callback fires in unexpected order.
3. Environment mismatch — the test passes locally because your machine has a different locale, file system, timezone, or dependency version than CI.
For each category, the reproduction strategy differs. Let's go through them.
Loop Testing: The Universal First Step
Regardless of the category, the first thing I do is run the failing test in a tight loop. If the failure rate is 1%, running it 500 times gives you a ~99% chance of seeing it. Many test runners can repeat a test natively or through a plugin; for the rest, a shell loop does the job.
# pytest (--count comes from the pytest-repeat plugin)
pytest tests/test_flaky.py::test_something --count=1000 --tb=long
# Jest (no repeat flag that I know of, so loop in the shell)
for i in $(seq 1000); do npx jest -t 'test_something' --verbose || break; done
# Go (with -count flag)
go test -run TestSomething -count=1000 -v 2>&1 | tee flaky.log
# Rust (with cargo-nextest, looped in the shell)
for i in $(seq 1000); do cargo nextest run -E 'test(test_something)' || break; done
The same shell loop works for any runner: `for i in $(seq 1000); do pytest test_flaky.py || exit 1; done`. Be mindful of test isolation, though: a shell loop starts a fresh process on every iteration, while in-process repetition can reuse state between runs, so the two approaches may show different failure rates.
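The run count for the loop does not have to be a guess. As a rough sketch, assuming each run fails independently with the same probability (real flakes often cluster, so treat the result as an estimate), the arithmetic looks like this. The function names are mine, chosen for illustration.

```python
import math


def chance_of_failure(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest run count that shows at least one failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# A 1% flake over 500 runs, and the run count needed for 99% confidence.
print(f"{chance_of_failure(0.01, 500):.3f}")
print(runs_needed(0.01, 0.99))
```

For rarer flakes the required count grows quickly: a 0.1% failure rate needs several thousand runs for the same confidence, which is where raising the failure rate with load becomes more practical than looping longer.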
Reproducing Race Conditions: The Async Nightmare
Race conditions are the hardest to reproduce because they depend on thread interleaving. A single-threaded loop won't help if the race only happens under concurrency. Here's a real example from a Rust project I worked on.
The Disappearing Database Record
- 00:00 Test creates a user in DB, spawns a background worker that reads the user, then deletes it after processing.
- 00:01 Test asserts the user exists after worker finishes. Passes 99% of the time.
- 00:02 CI fails with 'user not found': the worker deletes before the test checks.
- 00:03 Local debugging: adding a small sleep before the assert makes it pass, but that's not a fix.
Lesson
The race was between the worker's delete and the test's read. The fix was to use a channel to signal completion instead of polling the DB.
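The idea of signaling completion can be sketched in a few lines. This is a Python illustration rather than the original Rust code, and every name in it (`worker`, `run_once`, the in-memory `db` dict) is a hypothetical stand-in: the worker hands its result back over a queue, so the test never has to guess when it is safe to look at the database.

```python
import queue
import threading


def worker(db: dict, user_id: int, done: queue.Queue) -> None:
    user = db[user_id]                  # read the record
    processed = user["name"].upper()    # stand-in for the real processing
    del db[user_id]                     # clean up after processing
    done.put(processed)                 # signal completion and hand back the result


def run_once():
    db = {1: {"name": "ada"}}           # in-memory stand-in for the database
    done = queue.Queue()
    thread = threading.Thread(target=worker, args=(db, 1, done))
    thread.start()
    result = done.get(timeout=5)        # block until the worker reports it is finished
    thread.join()
    return result, 1 in db


print(run_once())
```

Because the assertion is made on the value received from the queue, the outcome no longer depends on whether the delete happens before or after the test reads the database.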
To reproduce this race locally, I used two techniques: stress testing and thread sanitizers.
Stress Testing with `stress`
# Run the test under CPU and IO stress
sudo apt install stress
stress --cpu 8 --io 4 --hdd 2 & # background
cargo test test_flaky -- --test-threads=2
kill %1
This alone bumped the failure rate from ~1% to about 15%. Then I could bisect between the test and the worker to find the exact interleaving.
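To put numbers like these on a flake, a small script can run the test command repeatedly and report how often it fails. The following is a minimal sketch, assuming the test command signals failure through a non-zero exit code; `failure_rate` is a hypothetical helper name, not part of any tool mentioned here.

```python
import subprocess
import sys


def failure_rate(cmd: list[str], runs: int) -> float:
    """Run `cmd` `runs` times and return the fraction of non-zero exits."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if completed.returncode != 0:
            failures += 1
    return failures / runs


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python measure.py cargo test test_flaky -- --test-threads=2
    rate = failure_rate(sys.argv[1:], runs=100)
    print(f"failure rate: {rate:.1%}")
```

Running it once on an idle machine and once with `stress` in the background shows whether the extra load actually moves the failure rate.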
Thread Sanitizer (TSan)
For Rust, I enabled TSan on nightly:
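# Needs the nightly toolchain and its rust-src component
RUSTFLAGS="-Z sanitizer=thread" cargo +nightly test -Z build-std --target x86_64-unknown-linux-gnu
# TSan reports: "WARNING: ThreadSanitizer: data race (pid=...)"
TSan flagged a data race on state guarded by a shared `AtomicBool`. The flag was set in one thread and read in another with too weak a memory ordering (such as `Ordering::Relaxed`), which does not order the non-atomic accesses around it. After switching the flag to `Ordering::SeqCst`, the race disappeared.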
Reproducing Timeout Flakes: Deterministic Delays
Timeouts are easier: they happen when an operation takes longer than the test expects. To reproduce, I simulate the slow environment.
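When the slow step is a function call you control, the delay can be injected inside the test process. The sketch below is a Python illustration under that assumption; `with_delay`, `call_with_timeout`, and `fetch_config` are hypothetical names, and the thread pool is just one simple way to enforce a timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def with_delay(func, delay_s):
    """Return a wrapper that sleeps for delay_s seconds before calling func."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_s):
    """Run func in a worker thread; raise TimeoutError if it exceeds timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)


def fetch_config():
    # Hypothetical stand-in for the operation the test waits on.
    return {"ready": True}


# Fast path: finishes well inside the 0.5 s budget.
print(call_with_timeout(fetch_config, timeout_s=0.5))

# Slow path: a 1 s injected delay makes the timeout fire on every run.
try:
    call_with_timeout(with_delay(fetch_config, 1.0), timeout_s=0.5)
except TimeoutError:
    print("timed out, as it might on a loaded CI machine")
```

For slowness that comes from the disk or the network rather than from your own code, system-level tools do the same job: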
# Slow down disk I/O by keeping the disk busy with background writers
stress --io 4 --hdd 2 &
# Or use `tc` to add network latency (a round trip over lo pays the delay twice)
sudo tc qdisc add dev lo root netem delay 200ms
# Run the test
pytest tests/test_timeout_flaky.py
# Clean up
sudo tc qdisc del dev lo root
kill %1
Another trick: give the test a generous timeout and log exactly how long the operation takes. This tells you how long it actually takes under CI conditions.
import time
import pytest

@pytest.mark.timeout(10)  # provided by the pytest-timeout plugin
def test_slow_operation():
    start = time.monotonic()
    result = do_something()  # the operation under test
    elapsed = time.monotonic() - start
    # pytest hides stdout from passing tests; run with -s or -rP to get this into CI logs
    print(f"do_something took {elapsed:.2f}s")
    assert result is not None
Frequently asked questions
Why can't I reproduce a flaky test by running it once on my machine?
Flaky tests depend on timing, load, or random seeds that differ between CI and local runs. CI may have slower I/O, different CPU counts, or environment variables that affect behavior. Reproducing requires replicating those conditions exactly.
What is the fastest way to reproduce a flaky test?
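Wrap the test in a loop that runs it hundreds of times. For pytest, use `pytest --count=1000` (from the pytest-repeat plugin). For Go, use `go test -count=1000`. For runners without a repeat option, such as Jest, use a shell loop. If the failure is timing-related, run the loop while `stress --cpu 8` is running in the background to increase contention.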
How do I use thread sanitizer to find flaky tests?
Compile your code with `-fsanitize=thread` (Clang/GCC) and run the test suite. TSan will report data races even if they don't cause visible failures. This is effective for flaky tests caused by unsynchronized shared state.
What if the flaky test only fails in CI but never locally?
Capture all CI environment variables and replicate them locally. Use the same OS, CPU count, and memory limits. Tools like `act` can run GitHub Actions workflows locally. Also check for differences in file system timing or network dependencies.
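One way to make that comparison concrete is to print an environment fingerprint on CI and diff it against a local run. The following is a rough sketch: `fingerprint`, `diff`, and the `ENV_KEYS` selection are hypothetical names and choices, and the list of variables should be adapted to what your tests actually depend on.

```python
import json
import locale
import os
import platform
import time

# Hypothetical choice of variables worth comparing; extend for your project.
ENV_KEYS = ("CI", "LANG", "LC_ALL", "TZ")


def fingerprint(env_keys=ENV_KEYS) -> dict:
    """Collect settings that commonly differ between CI and a developer machine."""
    info = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "timezone": "/".join(time.tzname),
        "locale": str(locale.getlocale()),
    }
    for key in env_keys:
        info[f"env:{key}"] = os.environ.get(key)
    return info


def diff(ci: dict, local: dict) -> dict:
    """Map each differing key to its (ci_value, local_value) pair."""
    return {
        key: (ci.get(key), local.get(key))
        for key in sorted(set(ci) | set(local))
        if ci.get(key) != local.get(key)
    }


if __name__ == "__main__":
    # Print on CI, save the JSON, then diff it against the output of a local run.
    print(json.dumps(fingerprint(), indent=2))
```

The sketch records only a short list of variables instead of the whole environment, since CI environments usually hold secrets that should not end up in logs.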