How to Detect Flaky Tests Before They Waste a Full CI Day

Flaky tests are expensive because they are rarely treated as a data problem. Teams usually notice them when a pipeline goes red, and the reflex is to rerun the job and hope it turns green. That approach hides the symptom and erases the clues you need to understand why the test failed in the first place. By the time the same instability has burned through a morning of retries, the damage is done: developer trust has dropped and the CI queue is full of noise.

A better way to detect flaky tests in CI is to make instability visible before it becomes a day-long distraction. That means collecting repeated signals, preserving metadata, clustering failures by pattern, and enforcing a rerun policy that separates true regressions from inconsistent tests. The goal is not perfect elimination on day one. The goal is to find likely flaky tests early, quarantine them intentionally when needed, and keep the mainline pipeline useful.

What makes a test flaky in practice

A flaky test is one that can pass or fail without the underlying product behavior changing in a meaningful way. In practice, flakiness usually comes from one of a few sources:

Timing issues, such as asserting on UI state before the page is ready
Data dependencies, where test data is shared or reused unexpectedly
Environment instability, including browser crashes, network hiccups, or throttled CI runners
Locator fragility, especially in UI tests where selectors are too specific or tied to layout details
Test ordering, where one test changes state that another test assumes is clean
Non-deterministic dependencies, such as timestamps, random values, async jobs, or third-party APIs

This matters because a good detection workflow should map failure signals back to these categories. If a test fails in different places with different stack traces, that is often a test smell. If it always fails in the same place after a specific release, that is more likely a product regression.

A flaky test is not just a nuisance. It is a signal that your suite is missing determinism, isolation, or observability somewhere.

The detection goal, not just the rerun habit

Most teams already have a rerun habit. What they lack is a system for interpreting the reruns.

A useful detection workflow answers four questions:

Did the test fail in a repeatable way, or only once?
Did the failure happen in the test body, the setup, or the infrastructure?
Has this test failed before under similar conditions?
Should the failure block the merge, quarantine the test, or escalate it as a product issue?

If you can answer those quickly, you can detect flaky tests in CI before they waste a full day. That requires both test design and CI observability, meaning the pipeline needs enough context to distinguish a broken test from a broken build.

Build detection around repetition, not intuition

One of the simplest ways to expose flakiness is repetition. If a test passes once and fails once under the same code revision, there is at least some instability to investigate. Repetition can happen at a few levels:

Re-run the same test case multiple times in a single job
Run the same suite across several CI agents or environments
Execute the same commit repeatedly on a schedule or on demand
Compare the test outcome across multiple branches using the same baseline build

You do not need to overdo it. Running every test five times on every push is usually too expensive. Instead, apply repetition selectively:

For high-value smoke tests, repeat them in PR validation on a small loop
For historically unstable tests, use a short repeat policy until the failure rate is understood
For new UI tests, run a brief burn-in period before trusting them in the gate

A practical pattern is a two-stage pipeline:

Fast gate: run the test once in PR validation
Confidence pass: rerun only failed tests one or two more times to classify instability

That gives you signal without turning the pipeline into a lab experiment.

A simple repeat strategy in CI

Here is an example that uses a job matrix in GitHub Actions to run a flaky-prone test shard three times on the same commit.

name: test
on: [pull_request]

jobs:
  ui-smoke:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        run: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:smoke

This is not a permanent design for the whole suite. It is a detection tool. If one of three repeats fails while the others pass, you have a candidate flaky test or an environment problem. If all three fail the same way, treat it as a stronger regression signal.
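The classification described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not a prescribed implementation: the names `Outcome`, `Verdict`, and `classifyAttempts` are hypothetical, and it assumes you have already collected the pass or fail outcome of each repeat for one test on one commit.

```typescript
type Outcome = "pass" | "fail";
type Verdict = "stable-pass" | "suspected-flaky" | "consistent-fail";

// Classify the repeated attempts of one test on the same commit.
// Mixed outcomes suggest a flaky test or an environment problem;
// uniform failures are a stronger regression signal.
function classifyAttempts(outcomes: Outcome[]): Verdict {
  if (outcomes.length === 0) {
    throw new Error("at least one attempt is required");
  }
  const failures = outcomes.filter((o) => o === "fail").length;
  if (failures === 0) return "stable-pass";
  if (failures === outcomes.length) return "consistent-fail";
  return "suspected-flaky";
}

// Example: three matrix repeats of the same shard.
console.log(classifyAttempts(["pass", "fail", "pass"]));
console.log(classifyAttempts(["fail", "fail", "fail"]));
```

Note that a single failed attempt is also reported as a consistent failure here, so the verdict is only meaningful once at least two attempts exist.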

Preserve metadata or lose the diagnosis

A failure without metadata is just a red badge. A failure with metadata is something you can reason about.

At minimum, capture the following for every test run:

Commit SHA and branch
Test name and file path
Suite name and shard
Environment, such as browser, OS, container image, or device type
Build number and job identifier
Retry count and whether the attempt was first run or rerun
Start and end timestamps
Failure message, stack trace, and assertion text
Screenshot, video, or trace artifact for UI tests
Network or console logs if available

With that information, you can see whether the same test fails more often on one browser, one runner type, one branch, or one time window. Without it, flakiness looks random even when it is not.
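One way to make that metadata queryable is to give every test run a fixed record shape. The sketch below is an assumption about what such a record could look like, not a standard format: the `TestRunRecord` fields, the `countFailuresBy` helper, and all sample values are hypothetical.

```typescript
// Hypothetical record shape covering the metadata listed above.
interface TestRunRecord {
  commitSha: string;
  branch: string;
  testName: string;
  filePath: string;
  shard: string;
  environment: { browser: string; os: string };
  jobId: string;
  attempt: number; // 1 = first run, 2 or more = rerun
  startedAt: string; // ISO 8601 timestamp
  finishedAt: string; // ISO 8601 timestamp
  status: "pass" | "fail";
  failureMessage?: string;
  artifactPaths: string[]; // screenshots, videos, traces, logs
}

// Count failures grouped by any dimension of the record.
function countFailuresBy(
  records: TestRunRecord[],
  keyOf: (r: TestRunRecord) => string,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    if (r.status !== "fail") continue;
    const key = keyOf(r);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Sample data (made up for illustration).
const base: TestRunRecord = {
  commitSha: "abc123",
  branch: "main",
  testName: "checkout button is visible",
  filePath: "tests/checkout.spec.ts",
  shard: "1/4",
  environment: { browser: "chromium", os: "ubuntu-22.04" },
  jobId: "job-1",
  attempt: 1,
  startedAt: "2024-01-01T10:00:00Z",
  finishedAt: "2024-01-01T10:00:05Z",
  status: "pass",
  artifactPaths: [],
};
const webkit = { browser: "webkit", os: "ubuntu-22.04" };
const records: TestRunRecord[] = [
  base,
  { ...base, jobId: "job-2", environment: webkit, status: "fail", failureMessage: "Timeout" },
  { ...base, jobId: "job-3", environment: webkit, status: "fail", failureMessage: "Timeout" },
];

const failuresByBrowser = countFailuresBy(records, (r) => r.environment.browser);
console.log(failuresByBrowser);
```

Because the grouping key is a function, the same helper can answer the browser, runner, branch, or time-window question without changing the record.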

If your CI provider does not store enough context, send test results to a central place. Some teams use a lightweight reporting layer, others use a vendor platform. For example, Endtest can centralize execution results in an agentic AI test automation workflow, which makes it easier to spot repeated locator-related failures and compare runs across changes. The value here is not the platform itself but the aggregation of structured evidence.

Failure clustering is where flakiness becomes obvious

Once metadata is preserved, group failures into clusters. A cluster is a set of failures that share a pattern, even if the test names differ slightly.

Useful clustering dimensions include:

Same test file or spec
Same error message prefix
Same stack trace top frame
Same failed locator or assertion text
Same browser or runner image
Same branch or feature flag state
Same time-to-failure window

For UI suites, locator failures are especially valuable to cluster. If ten tests fail because a selector no longer resolves, that is not ten separate problems. It may be one fragile selector pattern spreading through the suite.

A failure cluster often tells you more than individual reruns do. One test failing every tenth execution may be a flaky test. Twenty tests failing on the same step after a DOM change is probably a shared locator or app behavior issue.
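Clustering by error signature can start very small. The sketch below assumes that the first line of an error message, with numbers and hex addresses masked, is a good enough signature. That is a simplification, and the `Failure` shape, the function names, and the sample messages are all hypothetical.

```typescript
interface Failure {
  testName: string;
  message: string;
}

// Mask volatile details so equivalent errors share one signature.
function signatureOf(message: string): string {
  const firstLine = message.split("\n")[0] || "";
  return firstLine
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

// Group test names under the signature of their failure message.
function clusterBySignature(failures: Failure[]): Record<string, string[]> {
  const clusters: Record<string, string[]> = {};
  for (const f of failures) {
    const sig = signatureOf(f.message);
    (clusters[sig] = clusters[sig] || []).push(f.testName);
  }
  return clusters;
}

// Sample failures (made up for illustration).
const clusters = clusterBySignature([
  { testName: "adds item to cart", message: "Timeout 30000ms exceeded waiting for locator('#buy-42')" },
  { testName: "applies coupon", message: "Timeout 5000ms exceeded waiting for locator('#buy-7')" },
  { testName: "shows order total", message: "expect(received).toBe(expected)" },
]);
console.log(clusters);
```

In this sample, the two timeout failures collapse into one cluster even though the timeout value and element id differ, which is the behavior you want when one fragile selector pattern spreads across tests.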

Example of a useful clustering heuristic

A simple rule set can get you far:

If the same test fails with the same stack signature at least twice in 20 runs, mark it unstable
If the same file produces failures in multiple unrelated test names, inspect shared setup or fixtures
If the same failure happens only on one browser or one container image, inspect environment-specific dependencies
If reruns pass on the same commit but fail across multiple commits, suspect environment or data drift

You do not need a full data science pipeline. A spreadsheet or a small script can often identify the first cluster that matters.
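As an illustration of how small that script can be, here is a sketch of two of the rules above: the repeated-signature rule and the single-environment rule. The `RunResult` shape and both function names are hypothetical, and the defaults of 20 runs and 2 repeats simply mirror the example rule.

```typescript
interface RunResult {
  status: "pass" | "fail";
  environment: string; // for example a browser name or image tag
  signature?: string; // normalized error signature, present on failures
}

// Rule: the same signature at least `minRepeats` times in the last `windowSize` runs.
function isUnstable(history: RunResult[], windowSize = 20, minRepeats = 2): boolean {
  const counts: Record<string, number> = {};
  for (const run of history.slice(-windowSize)) {
    const sig = run.signature;
    if (run.status !== "fail" || !sig) continue;
    counts[sig] = (counts[sig] || 0) + 1;
    if (counts[sig] >= minRepeats) return true;
  }
  return false;
}

// Rule: failures confined to one environment while other environments also ran.
function singleFailingEnvironment(history: RunResult[]): string | null {
  const seen: Record<string, boolean> = {};
  const failing: Record<string, boolean> = {};
  for (const run of history) {
    seen[run.environment] = true;
    if (run.status === "fail") failing[run.environment] = true;
  }
  const failingEnvs = Object.keys(failing);
  return failingEnvs.length === 1 && Object.keys(seen).length > 1 ? failingEnvs[0] : null;
}

// Sample history (made up for illustration).
const pass = (environment: string): RunResult => ({ status: "pass", environment });
const fail = (environment: string, signature: string): RunResult =>
  ({ status: "fail", environment, signature });

const history: RunResult[] = [
  pass("chromium"), fail("webkit", "sig-a"), pass("chromium"),
  pass("webkit"), fail("webkit", "sig-a"),
];
console.log(isUnstable(history), singleFailingEnvironment(history));
```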

Separate test failure from infrastructure failure

A common mistake is treating every red build as a test problem. In reality, many failures are infra-related and should be handled differently.

Examples include:

Browser or driver crashes
Timeouts caused by overloaded CI agents
DNS or network flakes
Unavailable test data services
Artifact upload failures
Container startup issues

This distinction matters because rerunning a broken environment can produce misleading evidence. If the runner was under memory pressure, a test may fail in a way that looks like a UI timeout. If the API dependency was down, the UI test may just be a victim.

A good failure triage workflow tags the origin of failure quickly:

Test assertion failed
Locator did not resolve
Timeout waiting for app state
Test setup failed
External dependency unavailable
CI runtime error

Once you tag the origin, you can route it appropriately. Test failures go to the suite owner. Infrastructure failures go to the platform or DevOps owner. Shared failures go into a triage queue.
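A first-pass origin tagger can be a list of ordered patterns matched against the failure message. The sketch below is only a starting point: the regular expressions are illustrative guesses rather than an exhaustive or authoritative list, the `tagOrigin` name is hypothetical, and real suites will need patterns tuned to their own runner output.

```typescript
type FailureOrigin =
  | "assertion"
  | "locator"
  | "timeout"
  | "setup"
  | "external-dependency"
  | "ci-runtime"
  | "unknown";

// Ordered rules: the first matching pattern wins, so the more specific
// infrastructure patterns come before generic ones such as "timeout".
const ORIGIN_RULES: Array<{ origin: FailureOrigin; pattern: RegExp }> = [
  { origin: "ci-runtime", pattern: /out of memory|no space left on device/i },
  { origin: "external-dependency", pattern: /ECONNREFUSED|ENOTFOUND|service unavailable/i },
  { origin: "setup", pattern: /beforeAll|beforeEach|fixture/i },
  { origin: "locator", pattern: /locator|no such element|selector/i },
  { origin: "timeout", pattern: /timeout|timed out/i },
  { origin: "assertion", pattern: /expect\(|assert/i },
];

function tagOrigin(message: string): FailureOrigin {
  for (const rule of ORIGIN_RULES) {
    if (rule.pattern.test(message)) return rule.origin;
  }
  return "unknown";
}

// Sample messages (made up for illustration).
console.log(tagOrigin("Error: connect ECONNREFUSED 127.0.0.1:5432"));
console.log(tagOrigin("TimeoutError: waiting for locator('#pay') failed"));
console.log(tagOrigin("expect(received).toBe(expected)"));
```

Anything that falls through to `unknown` is a candidate for the shared triage queue, and each manual triage decision is a hint for a new rule.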

Define a rerun policy before you need one

Rerunning failed tests is reasonable, but only if the rerun policy is explicit. Otherwise, teams create a hidden tax where every failure gets retried until it disappears.

A practical rerun policy should define:

How many automatic reruns are allowed
Which test tiers qualify for reruns: smoke, critical path, or full regression
Whether reruns happen in the same job or a separate quarantine lane
What counts as a flaky pass: one pass after one fail, or multiple inconsistent outcomes
When a failure blocks the merge even if a rerun passes

A good default is to be strict on critical tests and more flexible on exploratory or non-blocking suites. For example:

Critical smoke test fails once: allow one rerun. If the rerun passes, mark the test as unstable but do not ignore it
Non-critical regression test fails once: rerun once. If the result changes, create a flaky ticket
Test fails twice on the same commit: treat it as a likely product or test issue and escalate

The policy should also record rerun provenance. A rerun that passes is not the same as an original pass. If your dashboard does not show that distinction, you are already losing signal.
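The example policy above is simple enough to encode directly, which also makes rerun provenance explicit. This is a sketch under the assumption of at most one automatic rerun. The `Tier`, `Action`, and `Decision` names are hypothetical and not part of any CI product.

```typescript
type Outcome = "pass" | "fail";
type Tier = "critical" | "non-critical";
type Action = "accept" | "rerun" | "mark-unstable" | "open-flaky-ticket" | "escalate";

interface Decision {
  action: Action;
  passedOnRerun: boolean; // provenance: a rerun pass is not an original pass
}

// Decide what to do given the attempts recorded so far for one test on one commit.
function decide(tier: Tier, attempts: Outcome[]): Decision {
  if (attempts.length === 0) throw new Error("no attempts recorded");
  const first = attempts[0];
  const second = attempts[1];
  if (first === "pass") return { action: "accept", passedOnRerun: false };
  if (second === undefined) return { action: "rerun", passedOnRerun: false };
  if (second === "fail") return { action: "escalate", passedOnRerun: false };
  // First attempt failed and the rerun passed: never report this as a clean pass.
  return {
    action: tier === "critical" ? "mark-unstable" : "open-flaky-ticket",
    passedOnRerun: true,
  };
}

console.log(decide("critical", ["fail", "pass"]));
console.log(decide("non-critical", ["fail", "fail"]));
```

Keeping `passedOnRerun` on the decision means a dashboard can show rerun passes separately from first-run passes.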

Quarantine strategy should reduce noise, not hide quality issues

Quarantine is often misunderstood. It is not a trash bin for unwanted failures. It is a temporary control mechanism that keeps a flaky test from blocking all delivery while the team investigates.

A responsible quarantine strategy has a few rules:

Quarantine is time-boxed, not permanent by default
Every quarantined test has an owner and a review date
Quarantined tests still run somewhere, so failures remain visible
Quarantined tests do not silently disappear from reporting
The team tracks whether quarantine rates are rising or falling

There are different quarantine models:

Hard quarantine: remove the test from merge-blocking jobs entirely
Soft quarantine: keep the test in a non-blocking lane and report the result
Conditional quarantine: skip only under known bad environments or specific browser versions

Soft quarantine is usually the safest starting point because it preserves signal. Hard quarantine can be useful for a test that is known to be noisy and blocking delivery, but it should not become a permanent avoidance strategy.

If a quarantined test is never revisited, your team has not reduced flakiness. It has just hidden it.
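Time-boxing is easier to enforce when each quarantine entry carries its owner and review date as data. The following sketch assumes a hand-maintained list of entries. The `QuarantineEntry` shape, the `overdueEntries` helper, and the sample values are hypothetical.

```typescript
interface QuarantineEntry {
  testId: string;
  owner: string;
  reason: string;
  mode: "hard" | "soft" | "conditional";
  reviewBy: string; // ISO 8601 date, for example "2024-05-01"
}

// Return entries whose review date has passed and that need a decision.
function overdueEntries(entries: QuarantineEntry[], today: Date): QuarantineEntry[] {
  return entries.filter((e) => new Date(e.reviewBy).getTime() < today.getTime());
}

// Sample entries (made up for illustration).
const quarantine: QuarantineEntry[] = [
  {
    testId: "checkout.spec.ts > checkout button is visible",
    owner: "payments-team",
    reason: "intermittent locator timeout on webkit",
    mode: "soft",
    reviewBy: "2024-05-01",
  },
  {
    testId: "search.spec.ts > shows suggestions",
    owner: "search-team",
    reason: "depends on a third-party API",
    mode: "conditional",
    reviewBy: "2024-07-01",
  },
];

const overdue = overdueEntries(quarantine, new Date("2024-06-01T00:00:00Z"));
console.log(overdue.map((e) => `${e.testId} (owner: ${e.owner})`));
```

Running a check like this on a schedule turns the review date into something the pipeline can nag about, rather than a field nobody reads.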

Use CI observability to find patterns earlier

CI observability means your pipeline emits enough structured data that failures can be queried, compared, and explained. For flaky tests, this is the difference between guessing and diagnosing.

Useful observability signals include:

Failure rate per test over time
Pass after fail rate on reruns
Failure concentration by branch or PR author
Failure concentration by browser, OS, or image tag
Median time to failure for a given test
Top recurring error signatures
Test duration drift, which can expose performance-sensitive instability

If your test dashboard only shows pass or fail, you are missing most of the story. A test that passes 99 times and fails once is not healthy just because the latest run is green.
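Two of those signals, the first-run failure rate and the pass-after-fail rate, can be derived from attempt records alone. The sketch below assumes each record carries a test id, commit SHA, attempt number, and status. The `Attempt` shape and the `statsFor` name are hypothetical.

```typescript
interface Attempt {
  testId: string;
  commitSha: string;
  attempt: number; // 1 = first run, 2 or more = rerun
  status: "pass" | "fail";
}

interface FlakinessStats {
  firstRunFailureRate: number; // failed first runs / all first runs
  passAfterFailRate: number; // failed first runs later rescued by a rerun / failed first runs
}

function statsFor(testId: string, attempts: Attempt[]): FlakinessStats {
  // Group this test's attempts by commit.
  const byCommit: Record<string, Attempt[]> = {};
  for (const a of attempts) {
    if (a.testId !== testId) continue;
    (byCommit[a.commitSha] = byCommit[a.commitSha] || []).push(a);
  }
  let firstRuns = 0;
  let firstRunFailures = 0;
  let recovered = 0;
  for (const sha of Object.keys(byCommit)) {
    const runs = byCommit[sha].slice().sort((x, y) => x.attempt - y.attempt);
    firstRuns++;
    if (runs[0].status === "pass") continue;
    firstRunFailures++;
    if (runs.slice(1).some((r) => r.status === "pass")) recovered++;
  }
  return {
    firstRunFailureRate: firstRuns === 0 ? 0 : firstRunFailures / firstRuns,
    passAfterFailRate: firstRunFailures === 0 ? 0 : recovered / firstRunFailures,
  };
}

// Sample attempts (made up for illustration).
const attempts: Attempt[] = [
  { testId: "t1", commitSha: "a", attempt: 1, status: "fail" },
  { testId: "t1", commitSha: "a", attempt: 2, status: "pass" },
  { testId: "t1", commitSha: "b", attempt: 1, status: "pass" },
  { testId: "t1", commitSha: "c", attempt: 1, status: "fail" },
  { testId: "t1", commitSha: "c", attempt: 2, status: "fail" },
  { testId: "t1", commitSha: "d", attempt: 1, status: "pass" },
];
const stats = statsFor("t1", attempts);
console.log(stats);
```

A high pass-after-fail rate points toward flakiness, while a high first-run failure rate with few recoveries points toward a deterministic problem.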

For teams using UI automation, some platforms document another aspect of observability: how locator changes are handled and recorded. Endtest's self-healing tests documentation is one example. Self-healing can reduce noise from brittle selectors, but the important part for detection is that healed events remain visible so you can tell whether the suite is stabilizing or just being patched over.

Practical workflow for detecting flaky tests in CI

Here is a workflow that works well for most teams without demanding a full platform rewrite.

1. Tag tests by criticality and volatility

Not every test deserves the same policy. Start with:

Critical path tests: login, checkout, deploy, billing, and core user flows
Medium-value regression tests
High-volatility tests: new UI flows, tests around changing components, and externally dependent checks

Critical path tests deserve stricter gating and better observability.

2. Capture structured failure output

Make sure your test runner emits machine-readable results, such as JUnit XML, JSON, or native CI artifacts. Add metadata fields where possible. For Playwright, for example, use traces and JSON reports. For Selenium or Cypress, ensure the logs are tied to test identifiers.

3. Run a short repeat pass on failures

If a test fails, rerun it once or twice in the same environment. If the result changes, classify it as unstable. If it fails consistently, classify it as likely deterministic.

4. Cluster by signature

Group failures by top stack frame, locator, assertion, or error message. If a single cluster appears across many runs, prioritize it over isolated one-off failures.

5. Route by failure origin

Use separate buckets for test logic, product defects, and infrastructure failures. Do not let all three live in the same backlog without labels.

6. Apply quarantine with expiry

If a flaky test blocks too much work, quarantine it temporarily. Assign an owner, a reason, and an expiration date.

7. Track recovery, not just failure

When a test becomes stable again, remove quarantine promptly. Measure how long instability lasts and whether the same pattern recurs.

A Playwright example for failure triage hooks

If your tests run in Playwright, you can attach metadata during failures and make triage easier later.

import { test, expect } from '@playwright/test';

test('checkout button is visible', async ({ page }, testInfo) => {
  await page.goto('https://example.com');
  try {
    await expect(page.getByRole('button', { name: 'Checkout' })).toBeVisible();
  } catch (error) {
    // Attach the current URL to the test report so failures can be compared later.
    await testInfo.attach('url', {
      body: page.url(),
      contentType: 'text/plain'
    });
    // Rethrow so the test is still reported as failed.
    throw error;
  }
});

That is not a full observability solution, but it helps preserve enough context to compare one failure with another. The more context you capture at the point of failure, the less time you spend reconstructing history later.

Triage questions that cut through noise

When a test fails, the first response should not be, “Rerun it.” It should be a short set of questions:

Did this fail before on the same commit or on nearby commits?
Does the failure repeat in the same place?
Is the error message identical, or does it vary?
Does the failure correlate with a specific environment, browser, or time of day?
Did any shared fixture, helper, or page object change recently?
Is there a known dependency issue in the build?

These questions guide the branch between flaky test, deterministic bug, and environment issue. They also help engineering managers decide whether a suite needs investment or just cleanup.

Common edge cases that confuse detection

New tests often look flaky before they are stable

Recently added tests fail for a few reasons that are not necessarily flakiness, such as unstable selectors, unfinished features, or underdeveloped data setup. Consider a burn-in period before judging new tests harshly.

Performance-sensitive tests can masquerade as flaky

A test that barely passes on a fast local machine may fail in CI when the app is under load. In that case, the problem may be a weak timeout threshold or an implicit performance dependency.

Shared test data can create false flake signatures

If multiple parallel jobs write to the same account, row, or resource, the test may appear random when it is actually colliding with other runs. Isolation fixes are often more effective than reruns.

Healing mechanisms reduce noise, but they can also mask selector drift

Self-healing tools can reduce breakage when locators change. That is useful, especially for UI suites with frequent DOM changes. Still, healing should be logged and reviewed, because a healed run is a clue that your selector strategy needs attention. If a platform like Endtest heals a locator, the run should still leave enough evidence for failure clustering and trend analysis.

What good looks like after a few weeks

A team that detects flaky tests early should start to see a few measurable changes, even if the suite is not perfect:

Fewer rerun storms on PRs
More failures classified on the first pass
Clearer separation between infra incidents and test issues
Smaller quarantine backlog, or at least a backlog with owners and deadlines
More stable critical path jobs
Better trust from developers who no longer assume every red build is meaningless

This is not about eliminating every inconsistent failure. It is about making the inconsistency visible quickly enough that it does not consume an entire CI day.

A lightweight implementation checklist

If you want a practical starting point, use this checklist:

Emit structured test results with stable test IDs
Record commit, environment, and retry metadata
Add one rerun for failed tests in a controlled lane
Group failures by error signature and stack trace
Track pass after fail rate for each test
Separate infrastructure failures from test failures
Introduce quarantine with ownership and expiry
Review top flaky clusters weekly
Remove or rewrite the tests that keep resurfacing

This checklist is intentionally small. The first win is not automation sophistication but disciplined visibility.

Choosing the right level of intervention

Not every unstable test deserves the same response.

If the locator is brittle, fix the selector or the page object
If the test data collides, isolate data creation and cleanup
If the environment fails, improve runner stability or dependency availability
If the product itself is changing, align test expectations with the real behavior
If the test is still too volatile after fixes, quarantine it temporarily and refactor it later

The most mature teams treat flakiness as a reliability issue with several possible owners, not as a moral failing of the test author.

Final takeaway

To detect flaky tests in CI before they waste a full day, you need more than ad hoc reruns. You need repetition with purpose, preserved metadata, failure clustering, and a clear rerun policy. Once those pieces are in place, CI observability starts to work for you instead of against you. The suite becomes easier to trust, triage gets faster, and noisy failures stop dominating the pipeline.

That is also why centralizing test results matters. Whether you build that layer yourself or use a platform such as Endtest, the point is the same: make the patterns visible enough that you can act before the day is gone.

For teams that want the shortest path to better signal, start with one question, not ten: which tests fail inconsistently, and what metadata do you already have to prove it?