End-to-end test flakiness is the silent killer of CI velocity. Every time a test fails and then passes on re-run, you lose at least 10 minutes of developer context-switching. Over a team of ten, that's hours per week burned on reruns and false alarms.
I maintain the test infrastructure for a SaaS platform that processes real-time payments. Our E2E suite runs 600+ tests on every pull request. Six months ago, 15% of those tests failed intermittently. After implementing a three-part strategy — deterministic seeding, isolated state, and structured retry budgets — we dropped flaky failures to 3%. Here's exactly what we did.
The Anatomy of a Flaky Test
Flaky tests have many faces: a timeout that fires 10ms too early, a database row left by a previous test causing a unique constraint violation, or an API response arriving in a different order. But they share a root cause: nondeterminism. The test suite assumes the world is consistent, but the world isn't.
Our worst offender was a test that created an invoice, then checked that it appeared in a list sorted by creation date. The test used `Date.now()` to generate timestamps. When two tests ran in parallel, their timestamps sometimes collided, and the sort order became unpredictable. The fix: replace `Date.now()` with a seeded counter.

    // Before: nondeterministic timestamp
    const createdAt = Date.now();

    // After: deterministic, seed-based timestamp
    let counter = 0;
    function deterministicTimestamp(seed) {
      // Each call returns seed, seed + 1, seed + 2, ... so values never collide
      return seed + counter++;
    }

Step 1: Deterministic Seeding for Data Generation
We wrote a small library that overrides `Math.random`, `Date.now`, and `crypto.randomUUID` with a seeded PRNG (we used `seedrandom`). At the start of each test suite, we derive a seed from the CI run ID and the test file path. Because the seed depends only on those two values, re-running a test file with the same seed reproduces exactly the same data.
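One way the seed derivation might look, as a minimal sketch rather than our actual library code: hash the run ID and file path into a 32-bit integer, then feed that integer to a seeded generator. The names `fnv1a`, `deriveSeed`, and `mulberry32` are hypothetical helpers chosen for illustration, the run ID and path are made up, and `mulberry32` stands in for `seedrandom` so the snippet has no dependencies.

```javascript
// Hypothetical helper: FNV-1a, a simple 32-bit string hash (not cryptographic).
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32 bits
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical helper: combine the CI run ID and test file path into one seed.
function deriveSeed(ciRunId, testFilePath) {
  return fnv1a(`${ciRunId}:${testFilePath}`);
}

// mulberry32: a small seeded PRNG, used here as a stand-in for seedrandom.
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Made-up inputs: the same pair always yields the same sequence.
const seed = deriveSeed('run-1234', 'tests/invoices.spec.js');
const rngA = mulberry32(seed);
const rngB = mulberry32(seed);
console.log(rngA() === rngB()); // true
```

Including the file path in the hash gives each test file its own sequence, so adding a test to one file does not shift the data seen by another.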
The library exports a `useSeed(seed)` function that patches the globals. We call it in the test runner's `beforeAll` hook. For Playwright, we inject it into the browser context so the frontend also uses the same seed.

    import seedrandom from 'seedrandom';

    export function useSeed(seed) {
      // Seeded PRNG: the same seed always yields the same sequence
      const rng = seedrandom(seed);
      Math.random = () => rng();

      // Fixed starting point; each call advances the clock by 1 ms
      let time = 1000000000000;
      Date.now = () => (time += 1);
    }

The snippet above leaves out `crypto.randomUUID`. Don't forget to patch it too, since many modern apps use it for IDs. Without the patch, those IDs stay unique but nondeterministic.
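As a rough sketch of what a deterministic replacement could return, the function below builds a version-4-shaped UUID from any `rng()` that yields floats in [0, 1). `seededUUID` and `demoRng` are hypothetical names, and `demoRng` is a tiny stand-in generator so the example runs on its own. Wiring the result in as the replacement for `crypto.randomUUID` is left out, because how that global can be overridden depends on the runtime.

```javascript
// Stand-in generator for the demo: a small linear congruential generator.
function demoRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // float in [0, 1)
  };
}

// Hypothetical helper: format 16 seeded bytes as a version-4-shaped UUID.
function seededUUID(rng) {
  const bytes = Array.from({ length: 16 }, () => Math.floor(rng() * 256));
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // set the version nibble to 4
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.map((b) => b.toString(16).padStart(2, '0')).join('');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

// Same seed, same ID.
console.log(seededUUID(demoRng(42)) === seededUUID(demoRng(42))); // true
```

Keeping the version and variant bits intact means code that validates UUID format should still accept the seeded values.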
Step 2: Isolate State Per Test
Deterministic seeding fixes data generation, but shared state still causes flakiness. A test that creates a user and another that searches for users will collide if they share the same database. We moved to a model where each test gets its own isolated environment: a fresh database, a fresh cache, and mock API responses.
For the database, we use Docker Compose with a per-test database snapshot. We take a snapshot after the initial migrations, then restore it before each test. It adds about 2 seconds per test, but it eliminated all state-related flakiness.
For the cache (Redis), we flush the entire cache between tests. For external API calls, we record and replay using a deterministic seed for the recorded responses.
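The record-and-replay idea for external API calls can be sketched roughly as follows. This is an illustration under assumptions, not our actual tooling: `createCassette` is a hypothetical helper, recordings live in an in-memory `Map` keyed by a request string, and the seeding of recorded responses is omitted.

```javascript
// Hypothetical cassette: stores responses by request key.
// In 'record' mode it calls the live function and saves the result;
// in 'replay' mode it only returns saved results and never calls out.
function createCassette(mode, recordings = new Map()) {
  return {
    recordings,
    async call(key, liveCall) {
      if (mode === 'replay') {
        if (!recordings.has(key)) {
          // Fail loudly instead of silently hitting the real service
          throw new Error(`No recording for "${key}"`);
        }
        return recordings.get(key);
      }
      const response = await liveCall();
      recordings.set(key, response);
      return response;
    },
  };
}
```

In replay mode a missing recording is treated as an error, so a test cannot quietly fall back to the live service and reintroduce nondeterminism.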

    # Restore the database snapshot before each test
    # (assumes clean.sql drops and recreates objects, e.g. a dump made with pg_dump --clean)
    docker exec -i db psql -U test -d test < /snapshots/clean.sql

    # Flush Redis
    docker exec cache redis-cli FLUSHALL

Step 3: Structured Retry Budget
Even with deterministic seeding and isolated state, some flakiness remains — network hiccups, resource contention on CI nodes, or timing issues with third-party services. We needed a way to handle those without hiding bugs.
We implemented a retry budget per test: maximum 2 retries, with exponential backoff starting at 5 seconds. But we also track retry counts. If a test requires retries more than 5% of the time, it's flagged for investigation. We export retry metrics to a dashboard so we can see trends.
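A retry budget along these lines might be sketched as below. The limits (2 retries, 5-second base delay, 5% threshold) come from the description above; everything else is an assumption for illustration, including the names `backoffDelayMs`, `RetryTracker`, and `runWithRetryBudget`, the in-memory stats, and the injectable `sleep`.

```javascript
// Delay before retry number `attempt` (1-based): 5s, 10s, doubling each time.
function backoffDelayMs(attempt, baseDelayMs = 5000) {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Hypothetical tracker: counts runs per test and how many needed a retry.
class RetryTracker {
  constructor(threshold = 0.05) {
    this.threshold = threshold;
    this.stats = new Map(); // test name -> { runs, retried }
  }
  record(name, retriesUsed) {
    const s = this.stats.get(name) ?? { runs: 0, retried: 0 };
    s.runs += 1;
    if (retriesUsed > 0) s.retried += 1;
    this.stats.set(name, s);
  }
  // True when the share of runs that needed a retry exceeds the threshold.
  needsInvestigation(name) {
    const s = this.stats.get(name);
    return Boolean(s) && s.retried / s.runs > this.threshold;
  }
}

// Run testFn with at most maxRetries retries; `sleep` is injectable for tests.
async function runWithRetryBudget(name, testFn, tracker, {
  maxRetries = 2,
  baseDelayMs = 5000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let retries = 0; ; retries++) {
    try {
      const result = await testFn();
      tracker.record(name, retries);
      return result;
    } catch (err) {
      if (retries >= maxRetries) {
        tracker.record(name, retries);
        throw err; // budget exhausted: surface the real failure
      }
      await sleep(backoffDelayMs(retries + 1, baseDelayMs));
    }
  }
}
```

Recording the retry count even when a test eventually passes is the important part: a pass that needed retries still counts against the 5% threshold, so retries cannot hide a worsening test.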
The Case of the Intermittent Login Test
- 09:15 CI reports login test failed on PR #3421.
- 09:17 Developer re-runs the test; it passes. They assume flaky test.
- 09:45 Same test fails again on another PR. This time, developer investigates.
- 10:00 Dev discovers the test relies on a session token that expires after 30 minutes. The test was running near the expiration boundary.
- 10:30 We fixed the test to refresh the token before each assertion. No more flaky failures.
Lesson
This failure wasn't truly flaky — it was a hidden dependency on token expiry. Deterministic seeding didn't help, but isolating the session state per test and using a fixed time window would have caught it earlier. We now mock the auth service to return a non-expiring token in test mode.
Figure: Reduction in flaky test failures after implementing the three strategies.
Results and Key Metrics
After rolling out these changes to our 600-test suite, we tracked flakiness for two months. The percentage of test runs with at least one flaky failure dropped from 15% to 3%. CI build time decreased by 12% because we no longer had to re-run failed jobs manually as often.
Developer satisfaction improved — the number of Slack messages asking 'Is this test flaky?' dropped significantly. We also introduced a flakiness score per test, displayed in our CI dashboard, so teams can quickly see which tests need attention.
1. Replace all random data generation with deterministic seeds (took us about 2 weeks).
2. Implement per-test state isolation using Docker snapshots and cache flushing (1 week).
3. Add structured retry budget with monitoring (3 days).
4. Set up flakiness dashboard and alerting (2 days).
You don't have to do everything at once. Start with deterministic seeding — it's the easiest win. Then add state isolation for your most flaky tests. Retry budgets are a safety net, not a cure.
Flaky tests are a symptom of technical debt in your test infrastructure. They erode trust in the CI pipeline and slow down every developer. The fix isn't magic — it's systematic elimination of nondeterminism. Start with seeding, isolate state, and use retries sparingly. Your team will thank you.
Frequently asked questions
What is the most common cause of flaky end-to-end tests?
In my experience, non-deterministic data and shared state top the list. Tests that rely on randomly generated IDs, timestamps, or leftover data from previous tests frequently fail on the second run. Using deterministic seeds and cleaning state between tests eliminates the majority of those failures.
How does deterministic seeding actually work in practice?
Instead of letting your code generate random values, you fix a seed at the start of the test suite. For example, in Playwright you can override `Math.random` with a seeded pseudo-random generator. Every test then sees the same 'random' data. If a test fails, you can re-run with the same seed to reproduce the exact state.
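Here is a minimal sketch of what that override could look like in the browser, assuming a Playwright `BrowserContext` and its `addInitScript(script, arg)` method. The helper names `installSeededRandom` and `seedBrowserContext` are hypothetical, and the generator is mulberry32 rather than `seedrandom`, since the injected function has to be self-contained.

```javascript
// Runs inside the page before any application script. Playwright serializes
// this function, so it must not reference variables from the outer scope.
function installSeededRandom(seed) {
  let state = seed >>> 0;
  // mulberry32, a small seeded PRNG used here as a stand-in for seedrandom
  Math.random = function seededRandom() {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// `context` is assumed to be a Playwright BrowserContext.
async function seedBrowserContext(context, seed) {
  await context.addInitScript(installSeededRandom, seed);
}
```

Because the init script runs on every page load in that context, each navigation restarts the sequence from the same seed, which is usually what you want when reproducing a failure.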
Should I just retry flaky tests automatically?
Blind retries mask the problem and waste CI time. A structured retry budget — e.g., retry up to 2 times with a 5-second backoff, but only if the failure is in a known flaky category — is better. Track retry counts per test; if a test requires retries more than 5% of the time, it needs a fix, not more retries.
What tools do you recommend for fighting flakiness?
I use Playwright for E2E tests, combined with a custom seed manager library and a CI wrapper that exports flakiness metrics. For database isolation, Docker Compose with per-test database snapshots works well. There's no silver bullet, but a structured approach with the right tooling cuts flakiness significantly.