LEARN · DEBUGGING GUIDE

Flaky tests in CI: how to debug and fix intermittent test failures

The test suite passes. You push a typo fix. CI fails with a test you never touched. You rerun it — it passes. Your test suite is flaky and the team is starting to ignore CI failures.

Advanced · CI/CD debugging

What this usually means

Flaky tests are tests that sometimes pass and sometimes fail without any code change. They are caused by non-deterministic behaviour: race conditions between async operations, reliance on wall-clock time, shared mutable state between tests, test execution order dependencies, or external service availability. CI environments make flakiness worse because they are slower, have different timing characteristics, and run tests in different orders than local machines.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

  1. Identify the flaky test. Run it in isolation 20 times. If it fails at least once, it is individually flaky. If it never fails alone, suspect order dependence or interference from other tests running in parallel.
  2. Check if the test involves time (`setTimeout`, `setInterval`, `Date.now()`). Time-based tests are among the most common sources of flakiness.
  3. Check if the test makes network calls. External dependencies (APIs, databases) introduce latency variance and transient failures.
  4. Check if the test shares state with other tests. Global variables, database rows, or file system state that is not cleaned up between tests cause order dependence.
  5. Check the CI timing. Flaky tests often fail more in CI because CI machines are slower, so race conditions that are invisible locally become visible.
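
Step 1 can be scripted. The sketch below reruns one command a fixed number of times and tallies the exit codes. `repeatRun` is a hypothetical helper, and the Jest path and test name in the commented example are placeholders, not values from a real project.

```javascript
const { spawnSync } = require("node:child_process");

// Runs `command args...` sequentially `times` times and counts the outcomes.
// A non-zero exit code, a signal, or a spawn error all count as a failure.
function repeatRun(command, args, times) {
  const tally = { passed: 0, failed: 0, failedRuns: [] };
  for (let run = 1; run <= times; run += 1) {
    const result = spawnSync(command, args, { encoding: "utf8" });
    if (result.status === 0) {
      tally.passed += 1;
    } else {
      tally.failed += 1;
      tally.failedRuns.push(run);
    }
  }
  return tally;
}

// Example with a hypothetical test file and test name:
// const tally = repeatRun("npx", ["jest", "src/cart.test.js", "-t", "applies discount"], 20);
// console.log(tally);
```

If the tally shows zero failures in isolation, move on to the order and parallelism checks before concluding the test is healthy.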

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • The flaky test file: read the test logic and look for async gaps, time dependencies, and shared state
  • Test framework configuration: random test ordering, parallel execution settings, timeout values
  • CI job logs: compare passing and failing runs of the same test, look for timing differences
  • Test setup and teardown (`beforeEach`/`afterEach`): is state properly reset between tests?
  • Mock configurations: are mocks reset between tests? Are they simulating time correctly?
  • CI runner specs: CPU, memory, and disk compared to your local development machine

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Race condition: the test asserts before an async operation completes
  • Time dependency: the test uses `new Date()` or `Date.now()` and expects a specific value
  • Order dependency: test B passes only if test A runs first and leaves the system in a specific state
  • Shared mutable state: a global variable or singleton is modified by one test and affects another
  • External service: an API call, database query, or file system operation fails intermittently
  • Resource exhaustion: CI runs tests in parallel and hits file descriptor or memory limits
  • Clock drift: the CI machine's clock (or time zone) differs slightly from your local machine's, causing time-based assertions to fail
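
Order dependency and shared mutable state usually appear together. The sketch below reproduces them without a test runner: two plain functions stand in for tests, and all names (`cache`, `getUser`, `testA`, `testB`, `runInOrder`) are hypothetical. Here the dependency runs in the failing direction: `testB` breaks whenever `testA` runs first.

```javascript
// A module-level cache shared by every "test" in this file.
const cache = new Map();

function getUser(id) {
  if (!cache.has(id)) cache.set(id, { id, name: `user-${id}` });
  return cache.get(id);
}

// "Test A" mutates the cached object and never cleans up.
function testA() {
  getUser(1).name = "renamed";
  return getUser(1).name === "renamed";
}

// "Test B" silently assumes a pristine cache.
function testB() {
  return getUser(1).name === "user-1";
}

// Runs the tests in the given order and returns each pass/fail result.
// With resetBetween = true, the cache is cleared before every test,
// which is what an afterEach/beforeEach cleanup would do.
function runInOrder(tests, resetBetween) {
  cache.clear();
  return tests.map((test) => {
    if (resetBetween) cache.clear();
    return test();
  });
}

console.log(runInOrder([testB, testA], false)); // [ true, true ]
console.log(runInOrder([testA, testB], false)); // [ true, false ]
console.log(runInOrder([testA, testB], true)); // [ true, true ]
```

Only the order changed between the first two runs, which is why a fixed test order can hide this class of bug for a long time.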

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Use fake timers (`jest.useFakeTimers()`, `vi.useFakeTimers()`) to control time in tests deterministically
  • Mock external services instead of calling real APIs in tests (use MSW, nock, or similar)
  • Run tests in random order locally to surface order dependencies: `jest --randomize` or `vitest --sequence.shuffle`
  • Clean up all shared state in `afterEach`: database rows, files, global variables, module caches
  • Wait for async operations properly: use `waitFor`, `findBy`, or an explicit `await` on promises
  • If you need retries, use only the test runner's built-in mechanism (e.g. Jest's `jest.retryTimes(2)`), not custom retry logic inside tests
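
Fake timers work by replacing the clock the code reads. The same idea can be shown without a test runner by passing the clock in as a parameter. This is a hand-rolled stand-in for illustration, not the Jest or Vitest API, and `createSession` and `createFakeClock` are hypothetical names.

```javascript
// Code under test reads time through an injected `now` function
// instead of calling Date.now() directly.
function createSession(ttlMs, now = Date.now) {
  const createdAt = now();
  return {
    isExpired: () => now() - createdAt >= ttlMs,
  };
}

// Hand-rolled fake clock: time moves only when the test advances it.
function createFakeClock(startMs = 0) {
  let current = startMs;
  return {
    now: () => current,
    advance: (ms) => {
      current += ms;
    },
  };
}

const clock = createFakeClock(1_000);
const session = createSession(5_000, clock.now);
console.log(session.isExpired()); // false
clock.advance(4_999);
console.log(session.isExpired()); // false
clock.advance(1);
console.log(session.isExpired()); // true
```

Because no real time passes, the boundary case (exactly 5,000 ms) is tested precisely and the result does not depend on how fast the machine is.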

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Run the test 100 times in a loop locally. It should pass all 100 times.
  • Run the full test suite in random order 5 times. No tests should fail.
  • Run tests in CI 3 consecutive times without code changes. All runs should pass.
  • Check that tests do not depend on system time by running them with a different system clock.
  • Monitor flaky test rate over the next week. It should trend to zero.

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Adding `await sleep(1000)` instead of waiting for the actual condition
  • Disabling or skipping the flaky test instead of fixing it
  • Running tests in a fixed order and relying on that order
  • Using real network calls in unit tests
  • Not investigating flaky tests because "it passed on retry": every flaky test hides a real bug
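
To make the first mistake concrete: a fixed `sleep` guesses how long the work takes, while a condition wait finishes as soon as the work is done and fails loudly when it never is. The sketch below is a simplified polling helper, assuming a synchronous predicate. `waitForCondition` and `startJob` are hypothetical names. Unlike Testing Library's `waitFor`, this helper does not retry when the predicate throws.

```javascript
// Polls `condition` until it returns a truthy value or `timeoutMs` elapses.
function waitForCondition(condition, { timeoutMs = 1000, intervalMs = 10 } = {}) {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const check = () => {
      let value;
      try {
        value = condition();
      } catch (error) {
        reject(error);
        return;
      }
      if (value) {
        resolve(value);
      } else if (Date.now() >= deadline) {
        reject(new Error(`condition not met within ${timeoutMs} ms`));
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}

// Simulated async work that finishes after an unpredictable delay (20-80 ms).
function startJob() {
  const job = { done: false };
  setTimeout(() => {
    job.done = true;
  }, 20 + Math.random() * 60);
  return job;
}

async function demo() {
  const job = startJob();
  // Returns as soon as the job finishes; rejects if it takes over 2 seconds.
  await waitForCondition(() => job.done, { timeoutMs: 2000 });
  return job.done;
}

demo().then((done) => console.log("job done:", done));
```

The timeout here is an upper bound, not a wait time, so making it generous costs nothing on a fast machine and prevents false failures on a slow CI runner.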