June 1, 2026
How to Detect Flaky Tests Before They Waste a Full CI Day
A practical workflow for detecting flaky tests in CI using repetition, metadata, failure clustering, rerun policy, and CI observability, with examples and triage steps.
Flaky tests are expensive because they are rarely treated as a data problem. Teams usually notice them when a pipeline goes red, and the reflex is to rerun the job and hope it turns green. That approach hides the symptom and erases the clues you need to understand why the test failed in the first place. By the time the same instability has burned through a morning of retries, the damage is done: developer trust has dropped and the CI queue is full of noise.
A better way to detect flaky tests in CI is to make instability visible before it becomes a day-long distraction. That means collecting repeated signals, preserving metadata, clustering failures by pattern, and enforcing a rerun policy that separates true regressions from inconsistent tests. The goal is not perfect elimination on day one. The goal is to find likely flaky tests early, quarantine them intentionally when needed, and keep the mainline pipeline useful.
What makes a test flaky in practice
A flaky test is one that can pass or fail without the underlying product behavior changing in a meaningful way. In practice, flakiness usually comes from one of a few sources:
- Timing issues, such as waiting for UI state before the page is ready
- Data dependencies, where test data is shared or reused unexpectedly
- Environment instability, including browser crashes, network hiccups, or throttled CI runners
- Locator fragility, especially in UI tests where selectors are too specific or tied to layout details
- Test ordering, where one test changes state that another test assumes is clean
- Non-deterministic dependencies, such as timestamps, random values, async jobs, or third-party APIs
This matters because a good detection workflow should map failure signals back to these categories. If a test fails in different places with different stack traces, that is often a test smell. If it always fails in the same place after a specific release, that is more likely a product regression.
A flaky test is not just a nuisance; it is a signal that your suite is missing determinism, isolation, or observability somewhere.
The detection goal, not just the rerun habit
Most teams already have a rerun habit. What they lack is a system for interpreting the reruns.
A useful detection workflow answers four questions:
- Did the test fail in a repeatable way, or only once?
- Did the failure happen in the test body, the setup, or the infrastructure?
- Has this test failed before under similar conditions?
- Should the failure block the merge, quarantine the test, or escalate it as a product issue?
If you can answer those quickly, you can detect flaky tests in CI before they waste a full day. That requires both test design and CI observability, meaning the pipeline needs enough context to distinguish a broken test from a broken build.
Build detection around repetition, not intuition
One of the simplest ways to expose flakiness is repetition. If a test passes once and fails once under the same code revision, there is at least some instability to investigate. Repetition can happen at a few levels:
- Re-run the same test case multiple times in a single job
- Run the same suite across several CI agents or environments
- Execute the same commit repeatedly on a schedule or on demand
- Compare the test outcome across multiple branches using the same baseline build
You do not need to overdo it. Running every test five times on every push is usually too expensive. Instead, apply repetition selectively:
- For high-value smoke tests, repeat them in PR validation on a small loop
- For historically unstable tests, use a short repeat policy until the failure rate is understood
- For new UI tests, run a brief burn-in period before trusting them in the gate
A practical pattern is a two-stage pipeline:
- Fast gate: run the test once in PR validation
- Confidence pass: rerun only failed tests one or two more times to classify instability
That gives you signal without turning the pipeline into a lab experiment.
A simple repeat strategy in CI
Here is an example that uses a GitHub Actions job matrix to repeat a flake-prone test shard three times.
name: test
on: [pull_request]
jobs:
  ui-smoke:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        run: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:smoke
This is not a permanent design for the whole suite. It is a detection tool. If one of three repeats fails while the others pass, you have a candidate flaky test or an environment problem. If all three fail the same way, treat it as a stronger regression signal.
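The interpretation above can be sketched as a small classifier. This is a minimal sketch under assumptions, not a prescribed implementation: `RepeatOutcome`, `RepeatVerdict`, and `classifyRepeats` are hypothetical names, and it assumes you can collect one pass/fail result and a short error signature per repeat. The `inconsistent-failure` verdict is an added case for repeats that all fail, but in different ways.

```typescript
// Hypothetical shape for one repeat of the same test on the same commit.
interface RepeatOutcome {
  passed: boolean;
  // Short error signature, e.g. the first line of the failure message; empty when passed.
  signature: string;
}

type RepeatVerdict =
  | "stable"
  | "flaky-candidate"
  | "likely-regression"
  | "inconsistent-failure";

// Classify a set of repeats of one test under one code revision.
function classifyRepeats(outcomes: RepeatOutcome[]): RepeatVerdict {
  if (outcomes.length === 0) {
    throw new Error("classifyRepeats needs at least one outcome");
  }
  const failures = outcomes.filter((o) => !o.passed);
  if (failures.length === 0) return "stable";
  // Mixed results on the same commit: flaky test or environment problem.
  if (failures.length < outcomes.length) return "flaky-candidate";
  // Every repeat failed: an identical signature is the stronger regression signal.
  const firstSignature = failures[0].signature;
  const sameWay = failures.every((f) => f.signature === firstSignature);
  return sameWay ? "likely-regression" : "inconsistent-failure";
}

const sampleVerdict = classifyRepeats([
  { passed: true, signature: "" },
  { passed: false, signature: "TimeoutError: waiting for Checkout button" },
  { passed: true, signature: "" },
]);
console.log(sampleVerdict); // "flaky-candidate"
```

A `flaky-candidate` verdict does not say whether the test or the environment is at fault; it only says the result changed without a code change, which is the cue to look at metadata.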
Preserve metadata or lose the diagnosis
A failure without metadata is just a red badge. A failure with metadata is something you can reason about.
At minimum, capture the following for every test run:
- Commit SHA and branch
- Test name and file path
- Suite name and shard
- Environment, such as browser, OS, container image, or device type
- Build number and job identifier
- Retry count and whether the attempt was first run or rerun
- Start and end timestamps
- Failure message, stack trace, and assertion text
- Screenshot, video, or trace artifact for UI tests
- Network or console logs if available
With that information, you can see whether the same test fails more often on one browser, one runner type, one branch, or one time window. Without it, flakiness looks random even when it is not.
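One way to keep these fields together is a single record per test attempt. The sketch below is illustrative only: `TestRunRecord` and its field names are hypothetical, the sample values are placeholders, and writing one JSON object per line is just one convenient storage choice.

```typescript
// Hypothetical record shape covering the fields listed above.
interface TestRunRecord {
  commitSha: string;
  branch: string;
  testName: string;
  filePath: string;
  suite: string;
  shard: string;
  environment: { browser: string; os: string; image: string };
  buildNumber: string;
  jobId: string;
  attempt: number; // 0 = first run, 1 or more = rerun
  startedAt: string; // ISO 8601 timestamp
  endedAt: string; // ISO 8601 timestamp
  outcome: "passed" | "failed";
  failureMessage?: string;
  stackTrace?: string;
  artifactPaths: string[]; // screenshots, videos, traces, logs
}

// A rerun that passes is not the same as an original pass, so keep the distinction queryable.
function isRerun(record: TestRunRecord): boolean {
  return record.attempt > 0;
}

// Duration in milliseconds, derived from the two timestamps.
function durationMs(record: TestRunRecord): number {
  return Date.parse(record.endedAt) - Date.parse(record.startedAt);
}

// Placeholder values for illustration only.
const sampleRecord: TestRunRecord = {
  commitSha: "abc1234",
  branch: "feature/checkout",
  testName: "checkout button is visible",
  filePath: "tests/checkout.spec.ts",
  suite: "ui-smoke",
  shard: "1/3",
  environment: { browser: "chromium", os: "ubuntu-latest", image: "node:20" },
  buildNumber: "512",
  jobId: "ui-smoke-1",
  attempt: 1,
  startedAt: "2026-06-01T09:00:00.000Z",
  endedAt: "2026-06-01T09:00:04.500Z",
  outcome: "failed",
  failureMessage: "Timed out waiting for Checkout button",
  artifactPaths: ["test-results/checkout/trace.zip"],
};

// One JSON object per line is easy to append to a file and to query later.
console.log(JSON.stringify(sampleRecord));
```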
If your CI provider does not store enough context, send test results to a central place. Some teams use a lightweight reporting layer; others use a vendor platform. For example, Endtest can centralize execution results in an agentic AI test automation workflow, which makes it easier to spot repeated locator-related failures and compare runs across changes. The value here is not the platform itself but the aggregation of structured evidence.
Failure clustering is where flakiness becomes obvious
Once metadata is preserved, group failures into clusters. A cluster is a set of failures that share a pattern, even if the test names differ slightly.
Useful clustering dimensions include:
- Same test file or spec
- Same error message prefix
- Same stack trace top frame
- Same failed locator or assertion text
- Same browser or runner image
- Same branch or feature flag state
- Same time-to-failure window
For UI suites, locator failures are especially valuable to cluster. If ten tests fail because a selector no longer resolves, that is not ten separate problems. It may be one fragile selector pattern spreading through the suite.
A failure cluster often tells you more than individual reruns do. One test failing every tenth execution may be a flaky test. Twenty tests failing on the same step after a DOM change is probably a shared locator or app behavior issue.
Example of a useful clustering heuristic
A simple rule set can get you far:
- If the same test fails with the same stack signature at least twice in 20 runs, mark it unstable
- If the same file produces failures in multiple unrelated test names, inspect shared setup or fixtures
- If the same failure happens only on one browser or one container image, inspect environment-specific dependencies
- If reruns pass on the same commit but fail across multiple commits, suspect environment or data drift
You do not need a full data science pipeline. A spreadsheet or a small script can often identify the first cluster that matters.
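A "small script" for the first rule might look like the following. It is a sketch under stated assumptions: `FailureEvent`, `failureSignature`, `clusterFailures`, and `unstableTests` are hypothetical names, the normalization (replacing digits and truncating the message) is one simple choice among many, and the caller is assumed to pass only failures from the last 20 runs of each test.

```typescript
// Hypothetical shape for one recorded failure.
interface FailureEvent {
  testName: string;
  filePath: string;
  message: string;
  topFrame: string; // top stack frame, e.g. "checkout.spec.ts:12"
}

// Replace volatile numbers in the message so similar failures share one signature.
function failureSignature(f: FailureEvent): string {
  const normalized = f.message.replace(/\d+/g, "N").trim().slice(0, 80);
  return `${normalized} @ ${f.topFrame}`;
}

// Group failures by signature, regardless of which test produced them.
function clusterFailures(failures: FailureEvent[]): Map<string, FailureEvent[]> {
  const clusters = new Map<string, FailureEvent[]>();
  failures.forEach((f) => {
    const key = failureSignature(f);
    const members = clusters.get(key) ?? [];
    members.push(f);
    clusters.set(key, members);
  });
  return clusters;
}

// Tests that failed with the same signature at least `minRepeats` times.
function unstableTests(recentFailures: FailureEvent[], minRepeats: number = 2): string[] {
  const counts = new Map<string, { testName: string; count: number }>();
  recentFailures.forEach((f) => {
    const key = `${f.testName} :: ${failureSignature(f)}`;
    const entry = counts.get(key) ?? { testName: f.testName, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  });
  const names = new Set<string>();
  counts.forEach((entry) => {
    if (entry.count >= minRepeats) names.add(entry.testName);
  });
  return Array.from(names).sort();
}

const sampleFailures: FailureEvent[] = [
  { testName: "checkout shows total", filePath: "checkout.spec.ts", message: "Timeout 5000ms exceeded", topFrame: "checkout.spec.ts:12" },
  { testName: "checkout shows total", filePath: "checkout.spec.ts", message: "Timeout 7000ms exceeded", topFrame: "checkout.spec.ts:12" },
  { testName: "login works", filePath: "login.spec.ts", message: "expected 200, got 500", topFrame: "login.spec.ts:8" },
];
console.log(unstableTests(sampleFailures)); // ["checkout shows total"]
```

Normalizing numbers is what lets the two timeout failures land in one cluster even though the raw messages differ.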
Separate test failure from infrastructure failure
A common mistake is treating every red build as a test problem. In reality, many failures are infra-related and should be handled differently.
Examples include:
- Browser or driver crashes
- Timeouts caused by overloaded CI agents
- DNS or network flakes
- Unavailable test data services
- Artifact upload failures
- Container startup issues
This distinction matters because rerunning a broken environment can produce misleading evidence. If the runner was under memory pressure, a test may fail in a way that looks like a UI timeout. If the API dependency was down, the UI test may just be a victim.
A good failure triage workflow tags the origin of failure quickly:
- Test assertion failed
- Locator did not resolve
- Timeout waiting for app state
- Test setup failed
- External dependency unavailable
- CI runtime error
Once you tag the origin, you can route it appropriately. Test failures go to the suite owner. Infrastructure failures go to the platform or DevOps owner. Shared failures go into a triage queue.
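Tagging can start as a handful of ordered pattern rules. The sketch below is an assumption-heavy illustration: the regular expressions are examples rather than messages guaranteed by any particular runner, the owner names are placeholders, and sending unmatched failures to the triage queue is a design choice made here, not something the workflow requires.

```typescript
type FailureOrigin =
  | "assertion"
  | "locator"
  | "timeout"
  | "setup"
  | "external-dependency"
  | "ci-runtime"
  | "unknown";

interface OriginRule {
  origin: FailureOrigin;
  pattern: RegExp;
}

// Illustrative patterns only; tune them to the messages your runner actually emits.
// Order matters: the first matching rule wins, so infrastructure checks come first.
const originRules: OriginRule[] = [
  { origin: "ci-runtime", pattern: /out of memory|no space left on device|browser (has )?crashed/i },
  { origin: "external-dependency", pattern: /ECONNREFUSED|ENOTFOUND|ETIMEDOUT|503 service unavailable/i },
  { origin: "setup", pattern: /beforeAll|beforeEach|fixture/i },
  { origin: "locator", pattern: /no such element|element not found|resolved to 0 elements/i },
  { origin: "timeout", pattern: /timeout|timed out/i },
  { origin: "assertion", pattern: /expect\(|assert/i },
];

// Tag a failure message with its most likely origin.
function tagFailureOrigin(message: string): FailureOrigin {
  const rule = originRules.find((r) => r.pattern.test(message));
  return rule ? rule.origin : "unknown";
}

// Route by origin: test failures to the suite owner, infrastructure to the platform owner.
function routeFailure(origin: FailureOrigin): "suite-owner" | "platform-owner" | "triage-queue" {
  switch (origin) {
    case "assertion":
    case "locator":
    case "timeout":
    case "setup":
      return "suite-owner";
    case "external-dependency":
    case "ci-runtime":
      return "platform-owner";
    default:
      return "triage-queue";
  }
}

const taggedOrigin = tagFailureOrigin("connect ECONNREFUSED 127.0.0.1:5432");
console.log(taggedOrigin, routeFailure(taggedOrigin)); // external-dependency platform-owner
```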
Define a rerun policy before you need one
Rerunning failed tests is reasonable, but only if the rerun policy is explicit. Otherwise, teams create a hidden tax where every failure gets retried until it disappears.
A practical rerun policy should define:
- How many automatic reruns are allowed
- Which test tiers qualify for reruns: smoke, critical path, or full regression
- Whether reruns happen in the same job or a separate quarantine lane
- What counts as a flaky pass: one pass after one fail, or multiple inconsistent outcomes
- When a failure blocks the merge even if a rerun passes
A good default is to be strict on critical tests and more flexible on exploratory or non-blocking suites. For example:
- A critical smoke test fails once: allow one rerun. If the rerun passes, mark the test as unstable, but do not ignore it
- A non-critical regression test fails once: rerun it once. If the result changes, create a flaky ticket
- A test fails twice on the same commit: treat it as a likely product or test issue and escalate
The policy should also record rerun provenance. A rerun that passes is not the same as an original pass. If your dashboard does not show that distinction, you are already losing signal.
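An explicit policy can be as small as a lookup table and one decision function. This is a sketch of the example rules above, not a recommended standard: the tier names, the `RerunPolicy` fields, and the `rerunDecision` function are hypothetical, and the choice that a flaky pass does not block the merge is an assumption you may want to tighten for critical tests.

```typescript
type TestTier = "critical" | "regression";

interface RerunPolicy {
  maxAutoReruns: number;
  // Whether a consistent failure on this tier blocks the merge.
  consistentFailureBlocksMerge: boolean;
}

// Assumed defaults for illustration.
const rerunPolicies: Record<TestTier, RerunPolicy> = {
  critical: { maxAutoReruns: 1, consistentFailureBlocksMerge: true },
  regression: { maxAutoReruns: 1, consistentFailureBlocksMerge: false },
};

interface RerunDecision {
  // "flaky-pass" keeps rerun provenance visible: it is not reported as a clean pass.
  label: "clean-pass" | "flaky-pass" | "needs-rerun" | "escalate";
  blockMerge: boolean;
  fileFlakyTicket: boolean;
}

// attempts[0] is the first run; later entries are reruns, in order. true = passed.
function rerunDecision(tier: TestTier, attempts: boolean[]): RerunDecision {
  if (attempts.length === 0) {
    throw new Error("rerunDecision needs at least one attempt");
  }
  const policy = rerunPolicies[tier];
  const maxAttempts = 1 + policy.maxAutoReruns;
  // Ignore any attempts beyond what the policy allows.
  const counted = attempts.slice(0, maxAttempts);
  if (counted[0]) {
    return { label: "clean-pass", blockMerge: false, fileFlakyTicket: false };
  }
  if (counted.slice(1).some((passed) => passed)) {
    return { label: "flaky-pass", blockMerge: false, fileFlakyTicket: true };
  }
  if (counted.length < maxAttempts) {
    return { label: "needs-rerun", blockMerge: true, fileFlakyTicket: false };
  }
  return {
    label: "escalate",
    blockMerge: policy.consistentFailureBlocksMerge,
    fileFlakyTicket: false,
  };
}

console.log(rerunDecision("critical", [false, true]).label); // "flaky-pass"
```

Capping the counted attempts at the policy limit is what prevents the hidden tax of retrying until the failure disappears.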
Quarantine strategy should reduce noise, not hide quality issues
Quarantine is often misunderstood. It is not a trash bin for unwanted failures. It is a temporary control mechanism that keeps a flaky test from blocking all delivery while the team investigates.
A responsible quarantine strategy has a few rules:
- Quarantine is time-boxed, not permanent by default
- Every quarantined test has an owner and a review date
- Quarantined tests still run somewhere, so failures remain visible
- Quarantined tests do not silently disappear from reporting
- The team tracks whether quarantine rates are rising or falling
There are different quarantine models:
- Hard quarantine: remove the test from merge-blocking jobs entirely
- Soft quarantine: keep the test in a non-blocking lane and report the result
- Conditional quarantine: skip only under known bad environments or specific browser versions
Soft quarantine is usually the safest starting point because it preserves signal. Hard quarantine can be useful for a test that is known to be noisy and blocking delivery, but it should not become a permanent avoidance strategy.
If a quarantined test is never revisited, your team has not reduced flakiness; it has just hidden it.
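The quarantine rules above translate naturally into a small list with expiry checks. The sketch below makes several assumptions: `QuarantineEntry` and the helper names are hypothetical, the sample entries are placeholders, and letting an expired quarantine become merge-blocking again is one way to enforce the time box, not the only one.

```typescript
type QuarantineMode = "hard" | "soft" | "conditional";

interface QuarantineEntry {
  testId: string;
  mode: QuarantineMode;
  owner: string;
  reason: string;
  reviewBy: string; // ISO date after which the quarantine must be revisited
}

// A quarantine is active until its review date has passed.
function isQuarantineActive(entry: QuarantineEntry, now: Date): boolean {
  return now.getTime() <= Date.parse(entry.reviewBy);
}

// Entries past their review date, to be listed in the weekly review.
function overdueQuarantines(entries: QuarantineEntry[], now: Date): QuarantineEntry[] {
  return entries.filter((e) => !isQuarantineActive(e, now));
}

// Assumed rule: a test gates merges unless it has an active quarantine entry.
function blocksMerge(testId: string, entries: QuarantineEntry[], now: Date): boolean {
  const entry = entries.find((e) => e.testId === testId);
  return entry === undefined || !isQuarantineActive(entry, now);
}

// Placeholder entries for illustration.
const quarantineList: QuarantineEntry[] = [
  { testId: "checkout-total", mode: "soft", owner: "payments-team", reason: "intermittent timeout", reviewBy: "2026-06-15" },
  { testId: "search-filters", mode: "hard", owner: "search-team", reason: "shared test data collision", reviewBy: "2026-07-01" },
];

const reviewDate = new Date("2026-06-20T00:00:00Z");
console.log(overdueQuarantines(quarantineList, reviewDate).map((e) => e.testId)); // ["checkout-total"]
```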
Use CI observability to find patterns earlier
CI observability means your pipeline emits enough structured data that failures can be queried, compared, and explained. For flaky tests, this is the difference between guessing and diagnosing.
Useful observability signals include:
- Failure rate per test over time
- Pass after fail rate on reruns
- Failure concentration by branch or PR author
- Failure concentration by browser, OS, or image tag
- Median time to failure for a given test
- Top recurring error signatures
- Test duration drift, which can expose performance-sensitive instability
If your test dashboard only shows pass or fail, you are missing most of the story. A test that passes 99 runs in a row and still fails about once a week is not healthy just because the latest run is green.
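The first two signals can be computed from attempt-level records. This is a sketch with assumed definitions: here the failure rate is the share of runs whose first attempt failed, and the pass-after-fail rate is the share of those runs in which a later attempt passed. `AttemptRecord`, `TestHealth`, and `computeTestHealth` are hypothetical names.

```typescript
// Hypothetical attempt-level record: one row per attempt of a test in a CI run.
interface AttemptRecord {
  testId: string;
  runId: string;
  attempt: number; // 0 = first run, 1 or more = rerun
  passed: boolean;
}

interface TestHealth {
  testId: string;
  runs: number;
  failureRate: number; // runs whose first attempt failed / all runs
  passAfterFailRate: number; // of those failed runs, share where a rerun passed
}

function computeTestHealth(records: AttemptRecord[]): TestHealth[] {
  // testId -> runId -> attempts
  const byTest = new Map<string, Map<string, AttemptRecord[]>>();
  records.forEach((r) => {
    const runs = byTest.get(r.testId) ?? new Map<string, AttemptRecord[]>();
    const attempts = runs.get(r.runId) ?? [];
    attempts.push(r);
    runs.set(r.runId, attempts);
    byTest.set(r.testId, runs);
  });

  const result: TestHealth[] = [];
  byTest.forEach((runs, testId) => {
    let firstAttemptFailed = 0;
    let recovered = 0;
    runs.forEach((attempts) => {
      const ordered = attempts.slice().sort((a, b) => a.attempt - b.attempt);
      if (!ordered[0].passed) {
        firstAttemptFailed += 1;
        if (ordered.slice(1).some((a) => a.passed)) recovered += 1;
      }
    });
    result.push({
      testId,
      runs: runs.size,
      failureRate: firstAttemptFailed / runs.size,
      passAfterFailRate: firstAttemptFailed === 0 ? 0 : recovered / firstAttemptFailed,
    });
  });
  return result;
}

const sampleAttempts: AttemptRecord[] = [
  { testId: "checkout", runId: "r1", attempt: 0, passed: true },
  { testId: "checkout", runId: "r2", attempt: 0, passed: false },
  { testId: "checkout", runId: "r2", attempt: 1, passed: true },
  { testId: "checkout", runId: "r3", attempt: 0, passed: true },
  { testId: "checkout", runId: "r4", attempt: 0, passed: false },
  { testId: "checkout", runId: "r4", attempt: 1, passed: false },
];
console.log(computeTestHealth(sampleAttempts));
```

A high pass-after-fail rate points toward instability; a low one with a high failure rate points toward a deterministic problem.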
For teams using UI automation, locator handling is another part of the picture. Endtest’s self-healing tests documentation, for example, highlights another aspect of observability: how locator changes are handled and recorded. Self-healing can reduce noise from brittle selectors, but the important part for detection is that healed events remain visible so you can tell whether the suite is stabilizing or just being patched over.
Practical workflow for detecting flaky tests in CI
Here is a workflow that works well for most teams without demanding a full platform rewrite.
1. Tag tests by criticality and volatility
Not every test deserves the same policy. Start with:
- Critical path tests: login, checkout, deploy, billing, core user flows
- Medium-value regression tests
- High-volatility tests: new UI flows, tests around changing components, and externally dependent checks
Critical path tests deserve stricter gating and better observability.
2. Capture structured failure output
Make sure your test runner emits machine-readable results, such as JUnit XML, JSON, or native CI artifacts. Add metadata fields where possible. With Playwright, for example, enable traces and the JSON reporter. With Selenium or Cypress, make sure the logs are tied to test identifiers.
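For Playwright, a configuration along these lines produces both JSON and JUnit output and records a trace when a test is retried. Treat it as a starting sketch: the output paths and the single CI retry are assumptions, not requirements.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // One automatic retry on CI; a test that fails and then passes is reported as flaky.
  retries: process.env.CI ? 1 : 0,
  reporter: [
    ['list'],
    // Output paths are placeholders; point them at your CI artifact directory.
    ['json', { outputFile: 'test-results/results.json' }],
    ['junit', { outputFile: 'test-results/junit.xml' }],
  ],
  use: {
    // Record a trace only when a test is retried, to limit artifact size.
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
  },
});
```

Upload the `test-results` directory as a CI artifact so the reports and traces outlive the job.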
3. Run a short repeat pass on failures
If a test fails, rerun it once or twice in the same environment. If the result changes, classify it as unstable. If it fails consistently, classify it as likely deterministic.
4. Cluster by signature
Group failures by top stack frame, locator, assertion, or error message. If a single cluster appears across many runs, prioritize it over isolated one-off failures.
5. Route by failure origin
Use separate buckets for test logic, product defects, and infrastructure failures. Do not let all three live in the same backlog without labels.
6. Apply quarantine with expiry
If a flaky test blocks too much work, quarantine it temporarily. Assign an owner, a reason, and an expiration date.
7. Track recovery, not just failure
When a test becomes stable again, remove quarantine promptly. Measure how long instability lasts and whether the same pattern recurs.
A Playwright example for failure triage hooks
If your tests run in Playwright, you can attach metadata during failures and make triage easier later.
import { test, expect } from '@playwright/test';

test('checkout button is visible', async ({ page }, testInfo) => {
  await page.goto('https://example.com');
  try {
    await expect(page.getByRole('button', { name: 'Checkout' })).toBeVisible();
  } catch (error) {
    // Attach the URL at the moment of failure so it travels with the test report.
    await testInfo.attach('url', {
      body: page.url(),
      contentType: 'text/plain'
    });
    // Rethrow so the test still fails; the attachment only adds context.
    throw error;
  }
});
That is not a full observability solution, but it helps preserve enough context to compare one failure with another. The more context you capture at the point of failure, the less time you spend reconstructing history later.
Triage questions that cut through noise
When a test fails, the first response should not be, “Rerun it.” It should be a short set of questions:
- Did this fail before on the same commit or on nearby commits?
- Does the failure repeat in the same place?
- Is the error message identical, or does it vary?
- Does the failure correlate with a specific environment, browser, or time of day?
- Did any shared fixture, helper, or page object change recently?
- Is there a known dependency issue in the build?
These questions help you decide between a flaky test, a deterministic bug, and an environment issue. They also help engineering managers decide whether a suite needs investment or just cleanup.
Common edge cases that confuse detection
New tests often look flaky before they are stable
Recently added tests fail for a few reasons that are not necessarily flakiness, such as unstable selectors, unfinished features, or underdeveloped data setup. Consider a burn-in period before judging new tests harshly.
Performance-sensitive tests can masquerade as flaky
A test that barely passes on a fast local machine may fail in CI when the app is under load. In that case, the problem may be a weak timeout threshold or an implicit performance dependency.
Shared test data can create false flake signatures
If multiple parallel jobs write to the same account, row, or resource, the test may appear random when it is actually colliding with other runs. Isolation fixes are often more effective than reruns.
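One common isolation fix is to derive resource names from the run, the worker, and the test, so parallel jobs never touch the same account or row. The helper below is a hypothetical sketch: `IsolationContext` and `isolatedName` are illustrative names, and where the run ID and worker index come from depends on your CI and runner (in Playwright, `testInfo.workerIndex` is one possible source for the latter).

```typescript
// Hypothetical context identifying who is creating the test data.
interface IsolationContext {
  runId: string; // CI run or build identifier
  workerIndex: number; // index of the parallel worker
  testId: string; // stable test name or ID
}

// Build a resource name that is unique per CI run, worker, and test.
function isolatedName(prefix: string, ctx: IsolationContext): string {
  const slug = ctx.testId
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything non-alphanumeric into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `${prefix}-${ctx.runId}-w${ctx.workerIndex}-${slug}`;
}

const accountName = isolatedName("acct", {
  runId: "8841",
  workerIndex: 2,
  testId: "Checkout > applies coupon",
});
console.log(accountName); // "acct-8841-w2-checkout-applies-coupon"
```

Because the name encodes its origin, leftover data from a crashed run is also easy to find and clean up.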
Healing mechanisms reduce noise, but they can also mask selector drift
Self-healing tools can reduce breakage when locators change. That is useful, especially for UI suites with frequent DOM changes. Still, healing should be logged and reviewed, because a healed run is a clue that your selector strategy needs attention. If a platform like Endtest heals a locator, the run should still leave enough evidence for failure clustering and trend analysis.
What good looks like after a few weeks
A team that detects flaky tests early should start to see a few measurable changes, even if the suite is not perfect:
- Fewer rerun storms on PRs
- More failures classified on the first pass
- Clearer separation between infra incidents and test issues
- Smaller quarantine backlog, or at least a backlog with owners and deadlines
- More stable critical path jobs
- Better trust from developers who no longer assume every red build is meaningless
This is not about eliminating every inconsistent failure. It is about making the inconsistency visible quickly enough that it does not consume an entire CI day.
A lightweight implementation checklist
If you want a practical starting point, use this checklist:
- Emit structured test results with stable test IDs
- Record commit, environment, and retry metadata
- Add one rerun for failed tests in a controlled lane
- Group failures by error signature and stack trace
- Track pass after fail rate for each test
- Separate infrastructure failures from test failures
- Introduce quarantine with ownership and expiry
- Review top flaky clusters weekly
- Remove or rewrite the tests that keep resurfacing
This checklist is intentionally small. The first win is not automation sophistication; it is disciplined visibility.
Choosing the right level of intervention
Not every unstable test deserves the same response.
- If the locator is brittle, fix the selector or the page object
- If the test data collides, isolate data creation and cleanup
- If the environment fails, improve runner stability or dependency availability
- If the product itself is changing, align test expectations with the real behavior
- If the test is still too volatile after fixes, quarantine it temporarily and refactor it later
The most mature teams treat flakiness as a reliability issue with several possible owners, not as a moral failing of the test author.
Final takeaway
To detect flaky tests in CI before they waste a full day, you need more than ad hoc reruns. You need repetition with purpose, preserved metadata, failure clustering, and a clear rerun policy. Once those pieces are in place, CI observability starts to work for you instead of against you. The suite becomes easier to trust, triage gets faster, and noisy failures stop dominating the pipeline.
That is also why centralizing test results matters. Whether you build that layer yourself or use a platform such as Endtest, the point is the same: make the patterns visible enough that you can act before the day is gone.
For teams that want the shortest path to better signal, start with one question, not ten: which tests fail inconsistently, and what metadata do you already have to prove it?