Reproducing Flaky Tests: CI Logs to Reliable Failure

I've spent years debugging flaky tests — those intermittent failures that pass 99 times out of 100 on your machine but fail on CI, only to pass again when you rerun. They erode trust in your test suite and waste developer hours. The core problem is reproducibility: if you can't make it fail on demand, you can't fix it.

This post covers the techniques I use to reproduce flaky tests systematically. Not the theory — the actual commands, tools, and code changes that turn a heisenbug into a repeatable failure.

First, Classify the Flake

Before you start debugging, figure out what kind of flake you're dealing with. In my experience, most fall into one of three categories:

1. Race conditions — two goroutines, threads, or async tasks access shared state without synchronization. These depend on scheduling and are sensitive to CPU load.

2. Timeout or ordering — a test expects an event to happen within a fixed duration, but under CI load it takes slightly longer, or a callback fires in unexpected order.

3. Environment mismatch — the test passes locally because your machine has a different locale, file system, timezone, or dependency version than CI.

For each category, the reproduction strategy differs. Let's go through them.

Loop Testing: The Universal First Step

Regardless of the category, the first thing I do is run the failing test in a tight loop. If the failure rate is 1%, running it 500 times gives you a ~99% chance of seeing it. Most test frameworks have a built-in way to do this.

Loop testing commands for popular test frameworks.

# pytest
pytest tests/test_flaky.py::test_something --count=1000 --tb=long

# Jest
jest --testPathPattern='test_something' --repeatEach=1000 --verbose

# Go (with -count flag)
go test -run TestSomething -count=1000 -v 2>&1 | tee flaky.log

# Rust (with cargo-nextest)
cargo nextest run -E 'test(test_something)' --repeat 1000

lightbulb

If your framework doesn't support repetition, use a shell loop: `for i in $(seq 1000); do pytest test_flaky.py || exit 1; done`. But be mindful of test isolation — some frameworks reuse state between runs.

Reproducing Race Conditions: The Async Nightmare

Race conditions are the hardest to reproduce because they depend on thread interleaving. A single-threaded loop won't help if the race only happens under concurrency. Here's a real example from a Rust project I worked on.

The Disappearing Database Record

00:00Test creates a user in DB, spawns a background worker that reads the user, then deletes it after processing.
00:01Test asserts the user exists after worker finishes. Passes 99% of the time.
00:02CI fails with 'user not found' — the worker deletes before the test checks.
00:03Local debugging: adding a small sleep before the assert makes it pass, but that's not a fix.

Lesson

The race was between the worker's delete and the test's read. The fix was to use a channel to signal completion instead of polling the DB.

To reproduce this race locally, I used two techniques: stress testing and thread sanitizers.

Stress Testing with `stress` and `parallel`

Using `stress` to increase contention and trigger the race more frequently.

# Run the test under CPU and IO stress
sudo apt install stress
stress --cpu 8 --io 4 --hdd 2 &  # background
cargo test test_flaky -- --test-threads=2
kill %1

This alone bumped the failure rate from ~1% to about 15%. Then I could bisect between the test and the worker to find the exact interleaving.

Thread Sanitizer (TSan)

For Rust, I enabled TSan on nightly:

Running Rust tests with ThreadSanitizer to detect unsynchronized access.

RUSTFLAGS="-Z sanitizer=thread" cargo test -Z build-std --target x86_64-unknown-linux-gnu
# TSan reports: "ThreadSanitizer: data race on write at 0x..."

TSan caught a data race on a shared `AtomicBool` that was being set in one thread and read in another without proper ordering. After adding `Ordering::SeqCst`, the race disappeared.

Reproducing Timeout Flakes: Deterministic Delays

Timeouts are easier: they happen when an operation takes longer than the test expects. To reproduce, I simulate the slow environment.

Injecting disk and network delays to reproduce timeout-based flakiness.

# Slow down disk I/O with fault injection
sudo apt install -y linux-tools-common
echo '0 100% write' | sudo tee /sys/kernel/debug/fail_make_request/probability
# Or use `tc` to add network latency
sudo tc qdisc add dev lo root netem delay 200ms
# Run the test
pytest tests/test_timeout_flaky.py
# Clean up
sudo tc qdisc del dev lo root

Another trick: wrap the operation in a timeout that logs the exact duration. This tells you how long it actually takes under CI conditions.

Logging actual execution time in a test to identify timeout margins.

import time
import pytest

@pytest.mark.timeout(10)
def test_slow_operation():
    start = time.monotonic()
    result = do_something()
    elapsed = time.monotonic() - start
    print(f"do_something took {elapsed:.2f}s")  # Capture this in CI logs
    assert result is not None

Frequently asked questions

Why can't I reproduce a flaky test by running it once on my machine?

Flaky tests depend on timing, load, or random seeds that differ between CI and local runs. CI may have slower I/O, different CPU counts, or environment variables that affect behavior. Reproducing requires replicating those conditions exactly.

What is the fastest way to reproduce a flaky test?

Wrap the test in a loop that runs it hundreds of times. For pytest, use `pytest --count=1000`. For Jest, use `jest --repeatEach=1000`. If the failure is timing-related, run the loop under `stress --cpu 8` to increase contention.

How do I use thread sanitizer to find flaky tests?

Compile your code with `-fsanitize=thread` (Clang/GCC) and run the test suite. TSan will report data races even if they don't cause visible failures. This is effective for flaky tests caused by unsynchronized shared state.

What if the flaky test only fails in CI but never locally?

Capture all CI environment variables and replicate them locally. Use the same OS, CPU count, and memory limits. Tools like `act` can run GitHub Actions workflows locally. Also check for differences in file system timing or network dependencies.

How to Reproduce a Flaky Test: From CI Logs to Reliable Failure