May 25, 2026
Flaky Tests in CI: The Signals QA Teams Should Watch Before a Release Slips
An analyst-style breakdown of flaky test signals in CI, including rerun patterns, instability hotspots, and release risk indicators QA teams can use before a slip.
Flaky tests are easy to dismiss when the pipeline is still green enough to ship. A rerun passes, a developer shrugs, and the failing job gets filed under “environment noise.” That reaction is understandable, but it is also how teams let test debt accumulate until it becomes release risk. The useful question is not whether a test failed once. It is whether the failure pattern, timing, and recovery behavior are telling you something about the stability of the codebase, the CI environment, or the test suite itself.
For QA leaders, SDETs, and DevOps teams, the real task is to interpret flaky test signals in CI before they distort delivery decisions. A flaky failure is not just a nuisance; it is telemetry. It can reveal brittle selectors, race conditions, data contamination, unbounded dependencies, or infrastructure drift. More importantly, it can show up as a release-quality problem long before a business stakeholder sees a missed date.
A single red build is an event. A repeated pattern of red builds with fast reruns, inconsistent failures, and test-specific hotspots is a signal.
This article breaks down the operational signals that separate random noise from systemic test debt, and shows how to use CI telemetry to judge whether your release is stable, barely stable, or already slipping.
Why flaky tests matter more than they first appear
A flaky test is one that sometimes passes and sometimes fails without a meaningful code change. In practice, that definition is messy. Some failures are caused by real product defects, some by test design, some by slow infrastructure, and some by external dependencies that are beyond your control. The problem is not classification in the abstract; it is triage under delivery pressure.
Flakiness creates three kinds of damage:
- It erodes trust in signal quality. Teams stop believing failures, which leads to real defects being ignored.
- It hides release risk. If the same test passes on rerun, teams may accept a weak signal and keep shipping.
- It consumes operational time. Engineers spend cycles rerunning, investigating, and debating, instead of fixing root causes.
The most dangerous part is the feedback loop. As flake rates rise, developers rerun more often. As reruns rise, the pipeline produces more “eventual green” results. That makes the suite look healthier than it is, while the underlying instability worsens.
For background on the broader practice, see software testing, test automation, and continuous integration.
The signal stack: what CI data actually tells you
A useful flake investigation starts by separating raw failure events from behavioral signals. One failing run does not tell you much. A cluster of failures, reruns, and timing shifts does.
The signals worth watching are usually already present in your CI system, test runner, or observability stack:
- Failure frequency by test name
- Rerun count before pass
- Failure concentration by branch, environment, or time window
- Median and p95 execution time changes
- Correlation between test failures and specific dependencies
- Historical pass/fail alternation patterns
- Test order sensitivity
- Differences between local, CI, and scheduled runs
If your tooling records only “pass” or “fail,” you are operating with a very low-resolution view of release quality. The teams that catch systemic instability early tend to retain richer metadata, such as retries, environment labels, commit SHA, container image, browser version, test owner, and duration.
The most important distinction: noisy failure versus unstable system
A random one-off failure often looks like this:
- One test fails once in a specific job
- The rerun passes immediately
- No other tests are affected
- No related code or dependency changed
- The failure does not recur in the next several runs
A systemic instability looks different:
- The same tests fail repeatedly across multiple builds
- Failures cluster in one suite, service, or environment
- Reruns pass, but not consistently
- Duration starts to drift upward before failures appear
- Failures correlate with parallelism, load, or shared data
That distinction matters because random noise should be monitored, but systemic instability should be triaged as release risk.
The core flaky test signals in CI
1. Rerun patterns, especially “second run always green” behavior
Reruns are useful, but the pattern of rerun success tells you a lot. If a test fails, then passes on the first retry, and this happens repeatedly, you are likely looking at a nondeterministic dependency, a race condition, or a fragile assertion.
Watch for these forms of rerun behavior:
- Immediate rerun success: often points to timing issues, async waits, stale data, or environment lag
- Multiple reruns before success: suggests a deeper instability, not just an occasional timeout
- Rerun success only on isolated workers: points to shared-state problems or resource contention
- Success after suite restart, not test rerun: can indicate order dependency or fixture contamination
A healthy pipeline should not depend on repeated retries to create truth. If your release process is treating retry success as proof of stability, the release gate has weakened.
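The rerun forms above can be turned into a rough classifier. The following is a minimal sketch, assuming each test execution is stored as an ordered list of attempts; the `Attempt` shape, the `classifyRerun` name, and the pattern labels are hypothetical and not taken from any CI product.

```typescript
// Hypothetical record of one attempt at a test within a single pipeline run.
interface Attempt {
  passed: boolean;
  isolatedWorker: boolean; // true if the attempt ran alone on its worker
}

type RerunPattern =
  | "clean-pass"
  | "immediate-rerun-success"
  | "multiple-reruns-before-success"
  | "success-only-when-isolated"
  | "never-passed";

function classifyRerun(attempts: Attempt[]): RerunPattern {
  // Find the index of the first passing attempt, if any.
  let firstPass = -1;
  for (let i = 0; i < attempts.length; i++) {
    if (attempts[i].passed) {
      firstPass = i;
      break;
    }
  }
  if (firstPass === -1) return "never-passed"; // also covers an empty list
  if (firstPass === 0) return "clean-pass";

  // Every failure happened on a shared worker and the pass needed isolation.
  const failuresWereShared = attempts
    .slice(0, firstPass)
    .every((a) => !a.isolatedWorker);
  if (failuresWereShared && attempts[firstPass].isolatedWorker) {
    return "success-only-when-isolated";
  }

  return firstPass === 1
    ? "immediate-rerun-success"
    : "multiple-reruns-before-success";
}
```

Counting how often each label appears per test over a few weeks is usually more informative than any single classification.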
2. Alternating pass/fail histories
The classic flake signature is a test that toggles between pass and fail over time without a matching code change. That alternating pattern is often more informative than the failure itself.
For example:
- Build 1201, pass
- Build 1203, fail
- Build 1204, pass
- Build 1207, fail
- Build 1210, pass
This pattern usually means the test is sensitive to conditions not represented in the test name, such as:
- Clock timing
- API eventual consistency
- Browser rendering timing
- Data setup race conditions
- Shared test user collisions
A single alternation is not conclusive. Repeated alternation over several releases is. It tells you the test is not stable enough to serve as a release gate.
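One way to quantify alternation is a flip rate: the fraction of consecutive builds in which the result changed. This is a minimal sketch, assuming you have already filtered the history to builds without a relevant code change; `flipRate` is a hypothetical helper name.

```typescript
// Pass/fail history for one test, oldest build first (true = pass).
// Returns the share of adjacent build pairs whose outcome differs.
function flipRate(history: boolean[]): number {
  if (history.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips / (history.length - 1);
}

// The five builds listed above: pass, fail, pass, fail, pass.
const alternating = flipRate([true, false, true, false, true]);

// A test that was stable and then broke has a single flip.
const regression = flipRate([true, true, true, false, false, false]);
```

The alternating history flips on every transition, while the regression-style history flips once. A high flip rate sustained across releases is the pattern that should disqualify a test as a release gate.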
3. Failures that cluster by environment
The same test failing only in one CI runner pool, one browser version, one container image, or one region is a strong operational clue. That usually means the problem is not the application code alone, but the execution context.
Common environment-linked sources include:
- CPU starvation on overloaded runners
- Memory pressure in containers
- Browser or driver version mismatch
- Network latency to upstream dependencies
- Time zone or locale differences
- Missing fonts, certificates, or OS libraries
If the suite passes on a developer laptop but fails on Linux containers, do not assume the laptop is “correct.” More often, the CI environment is exposing a hidden dependency that the local run masks.
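To check whether a test clusters by environment, group its runs by an environment label and compare failure rates. This sketch assumes run records carry such a label (runner pool, image tag, browser version, or similar); the `RunRecord` shape and function name are hypothetical.

```typescript
interface RunRecord {
  testName: string;
  environment: string; // hypothetical label, e.g. runner pool or image tag
  passed: boolean;
}

interface EnvStats {
  runs: number;
  failures: number;
  failureRate: number;
}

// Aggregates runs of a single test per environment label.
function failureRateByEnvironment(
  records: RunRecord[],
  testName: string
): Map<string, EnvStats> {
  const stats = new Map<string, EnvStats>();
  for (const r of records) {
    if (r.testName !== testName) continue;
    const s = stats.get(r.environment) ?? { runs: 0, failures: 0, failureRate: 0 };
    s.runs += 1;
    if (!r.passed) s.failures += 1;
    s.failureRate = s.failures / s.runs;
    stats.set(r.environment, s);
  }
  return stats;
}
```

A test whose failure rate is near zero in most environments and clearly higher in one is a candidate for an execution-context investigation before anyone touches the test code.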
4. Duration drift before failure
Many teams watch for outright test failures and ignore timing changes. That is a mistake. Slowdown is often the earliest warning of instability.
Examples of duration drift:
- A UI test that normally completes in 20 seconds starts hovering around 35 to 40 seconds
- A suite that used to finish in a consistent window becomes more variable run to run
- Only tests touching a shared service become slower over time
This can indicate:
- Resource contention
- Growing test data volume
- A dependency under load
- A regression in the application, such as slower page loads or API responses
- A test that is waiting on the wrong condition and masking a race
Duration drift matters because a test can be functionally unstable before it becomes visibly red.
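Drift can be detected by comparing a recent window of durations against the earlier baseline. The sketch below uses the nearest-rank percentile method and a ratio threshold; the window size, the threshold, and the function names are assumptions chosen for illustration.

```typescript
// Nearest-rank percentile over a copy of the input (p between 0 and 100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile needs at least one value");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

interface DriftReport {
  baselineMedian: number;
  recentMedian: number;
  ratio: number;
  drifting: boolean;
}

// durations: seconds per run, oldest first. The last `recentWindow` runs are
// compared against everything before them.
function durationDrift(
  durations: number[],
  recentWindow: number,
  threshold: number
): DriftReport {
  if (recentWindow <= 0 || recentWindow >= durations.length) {
    throw new Error("recentWindow must leave at least one baseline run");
  }
  const split = durations.length - recentWindow;
  const baselineMedian = percentile(durations.slice(0, split), 50);
  const recentMedian = percentile(durations.slice(split), 50);
  const ratio = recentMedian / baselineMedian;
  return { baselineMedian, recentMedian, ratio, drifting: ratio >= threshold };
}

// A UI test that used to take about 20 seconds and now hovers near 35 to 40.
const report = durationDrift([20, 21, 19, 20, 22, 20, 35, 38, 40], 3, 1.5);
```

Here the baseline median is 20 seconds and the recent median is 38, so the test is flagged even though every run in the sample may still have passed.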
5. Failure concentration in a small subset of tests
A noisy suite often has a small set of repeat offenders. That is where the debt is.
A useful metric is the share of total failures produced by the top 10 failing tests. If a tiny subset produces most incidents, you likely have a focused remediation opportunity. If failures are spread evenly across the suite, the problem may be broader, such as environment instability or weak test architecture.
The goal is not to punish the tests. It is to determine whether the test suite is behaving like a precision instrument or a weather vane.
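The concentration metric is simple to compute once failures are counted per test. A minimal sketch, with hypothetical test names and counts used only to illustrate the arithmetic:

```typescript
// Share of all failures produced by the n most frequently failing tests.
function topNFailureShare(failuresByTest: Record<string, number>, n: number): number {
  const counts = Object.keys(failuresByTest)
    .map((name) => failuresByTest[name])
    .sort((a, b) => b - a);
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return 0;
  const top = counts.slice(0, n).reduce((sum, c) => sum + c, 0);
  return top / total;
}

// Illustrative counts, not real data.
const share = topNFailureShare(
  { checkout: 40, login: 30, search: 5, profile: 3, cart: 2 },
  2
);
```

In this made-up sample, two tests account for 70 of 80 failures (0.875), which would point to focused remediation rather than a suite-wide problem.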
6. Order dependence and suite coupling
Tests should be independent, but many are not. Order dependence appears when one test changes state that another test implicitly consumes.
Symptoms include:
- Test A passes alone, fails after Test B
- Running the suite in a different order changes results
- Parallel execution increases failure rate
- Cleanup steps appear to fix the issue temporarily
Order dependence is particularly common in end-to-end tests and shared fixtures. It is one of the strongest signs of regression instability, because it means the suite is not modeling isolated behavior reliably.
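A common way to expose order dependence is to run tests in a shuffled order that can be reproduced from a seed. If your runner does not already offer seeded random ordering, the idea can be sketched as below; the small generator is a mulberry32-style PRNG used only for repeatability, and both function names are hypothetical.

```typescript
// Small deterministic PRNG (mulberry32 style). Returns values in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy, driven by the seeded generator, so the
// same seed always yields the same order.
function shuffleWithSeed<T>(items: T[], seed: number): T[] {
  const rand = seededRandom(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}
```

Logging the seed with each run means a failing order can be replayed exactly, which turns "fails sometimes after Test B" into a reproducible case.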
7. Differences between local, PR, merge, and scheduled runs
A test that passes locally but fails in CI could be a legitimate environment mismatch. But the pattern matters.
Consider the following patterns:
- Local pass, PR fail: often depends on environment, seed data, or browser state
- PR pass, main branch fail: may indicate merge interactions or branch-specific dependencies
- PR pass, nightly fail: points to data drift, time-based dependencies, or external services
- Only scheduled runs fail: often means the suite is not fully reproducible under production-like conditions
These differences are useful because they tell you where to invest first. If nightly runs expose issues that PR validation misses, your gating strategy is too weak.
How to tell flake from legitimate defect
A common operational mistake is to classify every intermittent failure as “flaky” and move on. That can hide real defects. The better approach is to evaluate the failure against a set of discriminating questions.
Ask whether the failure is reproducible under the same conditions
If the test fails every time with the same inputs, same image, same commit, and same environment, it is probably not a flake. It may be a real regression or a deterministic test issue.
Ask whether the failure depends on timing or load
Failures that disappear when the system is slower, faster, or rerun later often point to race conditions, stale waits, or asynchronous behavior.
Ask whether the failure is isolated to the test or shared by others
When many tests fail because one dependency is down, the problem is environmental or service-level, not test-specific. When a single test fails repeatedly, the test itself may be the issue.
Ask whether the assertion is too strict
A test that checks exact text, exact ordering, exact pixel dimensions, or exact timing can fail for reasons that do not matter to product behavior. Overly brittle assertions are common flake sources.
Ask whether data setup is deterministic
If the test depends on random data, background jobs, or lingering state from previous runs, the suite will eventually drift into nondeterminism.
The goal is not to eliminate all variability; it is to make the variability observable and intentional.
Release risk indicators that matter to QA leadership
QA managers and engineering directors need a broader view than individual test failures. The question is whether flakiness is starting to distort release confidence.
Watch for these release-risk indicators
- Retry volume increasing over multiple releases
- The same failures appearing in release branches and mainline
- Manual sign-off increasingly relying on “known flaky” exclusions
- Longer pipeline times caused by extra reruns and quarantines
- More tests being marked unstable without owner assignment
- Post-merge defects rising in areas touched by flaky suites
A quarantine list can be useful, but only if it is treated as temporary and measured. If quarantined tests keep growing, you have created a shadow test suite that no longer protects the release.
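Keeping a quarantine temporary and measured is easier when overdue entries are flagged automatically. This sketch assumes a simple list with an owner and a quarantine date per test; the entry shape, the day limit, and the rule that unowned entries are always flagged are illustrative assumptions.

```typescript
interface QuarantineEntry {
  testName: string;
  owner: string | null;  // null means nobody is assigned
  quarantinedOn: string; // ISO date, e.g. "2026-05-01"
}

// Returns entries that have no owner or have been quarantined longer than
// maxDays as of the given ISO date.
function overdueQuarantines(
  entries: QuarantineEntry[],
  today: string,
  maxDays: number
): QuarantineEntry[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const now = Date.parse(today);
  return entries.filter((e) => {
    const ageDays = (now - Date.parse(e.quarantinedOn)) / msPerDay;
    return e.owner === null || ageDays > maxDays;
  });
}
```

Reporting the size of this overdue list per release gives the "measured" part; a list that only grows is the shadow suite described above.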
A practical release-readiness question
Instead of asking, “Did the pipeline pass?”, ask:
Did the pipeline pass with an acceptable retry rate, a stable failure profile, and no unexplained duration drift?
That phrasing changes the conversation. It turns flakiness into an explicit release-quality dimension, not a side problem.
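The readiness question can be expressed as an explicit check rather than a judgment call. The following is a sketch under stated assumptions: the summary fields, the threshold names, and any threshold values are hypothetical and would need to be tuned to your own history.

```typescript
interface PipelineSummary {
  testsRun: number;
  testsRetried: number;      // tests that needed at least one retry
  newFailingTests: string[]; // failures not seen in recent history
  durationRatio: number;     // recent median duration / baseline median
}

interface GateThresholds {
  maxRetryRate: number;
  maxDurationRatio: number;
}

// Returns whether the run is release-ready and, if not, why.
function releaseReadiness(
  summary: PipelineSummary,
  limits: GateThresholds
): { ready: boolean; reasons: string[] } {
  const reasons: string[] = [];
  const retryRate =
    summary.testsRun === 0 ? 0 : summary.testsRetried / summary.testsRun;

  if (retryRate > limits.maxRetryRate) {
    reasons.push(`retry rate ${retryRate} exceeds ${limits.maxRetryRate}`);
  }
  if (summary.newFailingTests.length > 0) {
    reasons.push(`unstable failure profile: ${summary.newFailingTests.join(", ")}`);
  }
  if (summary.durationRatio > limits.maxDurationRatio) {
    reasons.push(`duration ratio ${summary.durationRatio} exceeds ${limits.maxDurationRatio}`);
  }
  return { ready: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the verdict keeps the conversation about specific signals rather than about whether the build was green.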
What good telemetry looks like
To make flaky test signals useful, capture enough context to classify them later. This does not require a data platform rebuild, but it does require discipline.
Minimum useful fields:
- Test name and suite name
- Commit SHA
- Branch or pull request ID
- Environment identifier
- Browser, driver, or runtime version
- Retry count
- Execution duration
- Failure message and stack trace
- Timestamp and timezone
- Worker or node ID
- Artifact links, such as logs, screenshots, or traces
If you can, also retain:
- Test owner
- Service dependency version
- Seed or dataset identifier
- Parallelism level
- Resource limits for the job
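As a rough illustration, the fields above map onto a record type like the one below. The property names are hypothetical, and the small helper only reports which of the minimum fields are absent from a record.

```typescript
interface TestRunRecord {
  // Minimum useful fields
  testName: string;
  suiteName: string;
  commitSha: string;
  branchOrPr: string;
  environmentId: string;
  runtimeVersion: string; // browser, driver, or runtime version
  retryCount: number;
  durationMs: number;
  timestamp: string;      // ISO 8601 including the UTC offset
  workerId: string;
  artifactUrls: string[]; // logs, screenshots, traces
  failureMessage?: string; // only present on failure
  stackTrace?: string;
  // Retain if you can
  owner?: string;
  dependencyVersion?: string;
  datasetSeed?: string;
  parallelism?: number;
  resourceLimits?: string;
}

const REQUIRED_FIELDS: (keyof TestRunRecord)[] = [
  "testName", "suiteName", "commitSha", "branchOrPr", "environmentId",
  "runtimeVersion", "retryCount", "durationMs", "timestamp", "workerId",
  "artifactUrls",
];

// Lists the minimum fields that are undefined on a partially filled record.
function missingFields(record: Partial<TestRunRecord>): string[] {
  return REQUIRED_FIELDS.filter((key) => record[key] === undefined);
}
```

Running a check like this at ingestion time makes gaps visible early, before a trend view silently drops the incomplete rows.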
With those fields, you can create trend views like:
- Flake rate by suite
- Retry success rate by test owner
- Failure rate by environment image
- Duration drift by branch
- Failure concentration by dependency
That is the difference between reactive firefighting and operational intelligence.
A simple CI pattern for surfacing flaky signals
One practical step is to track retries as first-class telemetry in your pipeline. For example, if your CI system supports a flaky rerun policy, preserve both the original failure and the eventual outcome.
Here is a compact GitHub Actions example that separates the primary run from a retry step, which helps make retry behavior visible rather than hidden:
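    name: tests
    on: [pull_request]

    jobs:
      e2e:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with:
              node-version: 20
          - run: npm ci
          - name: Run tests
            id: tests
            run: npm run test:e2e
          # Runs only if an earlier step failed. The job still reports as
          # failed, so the original failure stays visible next to the retry.
          - name: Retry failed tests once
            if: failure()
            # Everything after "--" is passed to the test script. The flag name
            # depends on your runner (Playwright, for example, uses --retries).
            run: npm run test:e2e -- --retry=1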
The point here is not that retrying is always good. It is that retries should be measurable. If your pipeline silently retries until green, you lose the signal entirely.
For browser-based suites, Playwright can be configured to record retries and isolate flaky tests more clearly:
    import { defineConfig } from '@playwright/test';

    export default defineConfig({
      retries: 1,
      reporter: [['html'], ['json', { outputFile: 'test-results.json' }]],
    });
That JSON output becomes useful if you aggregate it over time and compare retry incidence by test.
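As a starting point for that aggregation, the sketch below walks a parsed report and lists tests marked as flaky, meaning they failed and then passed on retry. The interfaces cover only a small subset of the report and assume the field names used by recent Playwright versions (`suites`, `specs`, `tests`, `results`, and a test-level `status` of `"flaky"`); verify them against the output of your own version.

```typescript
// Minimal, assumed subset of the JSON report shape.
interface ReportResult { retry: number; status: string; duration: number }
interface ReportTest { projectName: string; status: string; results: ReportResult[] }
interface ReportSpec { title: string; file: string; tests: ReportTest[] }
interface ReportSuite { title: string; specs?: ReportSpec[]; suites?: ReportSuite[] }
interface Report { suites: ReportSuite[] }

interface FlakyEntry {
  file: string;
  title: string;
  project: string;
  attempts: number; // number of recorded results, including the passing one
}

// Recursively visits nested suites and collects tests whose status is "flaky".
function collectFlaky(report: Report): FlakyEntry[] {
  const found: FlakyEntry[] = [];
  const visit = (suite: ReportSuite): void => {
    for (const spec of suite.specs || []) {
      for (const test of spec.tests) {
        if (test.status === "flaky") {
          found.push({
            file: spec.file,
            title: spec.title,
            project: test.projectName,
            attempts: test.results.length,
          });
        }
      }
    }
    for (const child of suite.suites || []) visit(child);
  };
  for (const suite of report.suites) visit(suite);
  return found;
}
```

In practice you would parse `test-results.json` from each run, call a function like this, and append the entries to whatever store holds your history, keyed by commit and environment.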
Common root causes behind CI flakes
The signal is only useful if it points to action. Most flaky tests come from a fairly small set of causes.
Brittle selectors and UI timing
Selectors tied to CSS structure or dynamic layout often fail when the page changes slightly. Better selectors are based on user-facing semantics or stable test IDs.
Uncontrolled async behavior
Tests that click and immediately assert without waiting for the right condition often fail intermittently. Explicit waits should target application state, not arbitrary sleep intervals.
Shared mutable test data
If multiple tests use the same accounts, records, or queues, they can interfere with each other. Parallel runs expose this quickly.
External dependency volatility
Third-party services, email systems, payment gateways, and feature flags can all create nondeterministic results if they are not stubbed or isolated.
Resource exhaustion in CI
Thin runner pools, undersized containers, and overloaded shared services can change execution timing enough to cause flaky behavior.
Incomplete teardown
If tests leave behind sessions, records, files, or browser state, later tests pay the price.
How to prioritize fixes without boiling the ocean
Not every flaky test deserves immediate refactoring. Prioritize by business and operational impact.
A useful order is:
- Tests gating release branches
- Tests with high rerun volume
- Tests that fail in multiple environments
- Tests tied to critical user journeys
- Tests with rising duration variability
Then classify each test into one of four buckets:
- Real defect: fix the product
- Test defect: fix the automation
- Environment issue: fix CI or infrastructure
- Known external dependency: isolate or mock
This classification should be visible in your defect tracking, not trapped in Slack threads.
A decision framework for QA teams
When a failure lands, work through a short checklist:
- Did the same commit reproduce the failure?
- Did rerun succeed immediately?
- Is the issue tied to one environment?
- Are related tests failing too?
- Did duration drift precede the failure?
- Does the failure involve shared data or parallel execution?
If you answer “yes” to several of the instability questions, treat it as a flake signal. If you answer “yes” to reproducibility, treat it as a product defect until proven otherwise.
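The rule above can be written down so that triage is applied the same way by everyone. This is a sketch, not a definitive policy: the field names are hypothetical, "several" is interpreted as two or more instability answers, and the `needs-more-data` outcome is an added assumption for cases the rule does not cover.

```typescript
interface TriageAnswers {
  reproducesOnSameCommit: boolean;
  rerunPassedImmediately: boolean;
  tiedToOneEnvironment: boolean;
  relatedTestsFailing: boolean;
  durationDriftPreceded: boolean;
  involvesSharedDataOrParallelism: boolean;
}

type TriageOutcome =
  | "treat-as-product-defect"
  | "treat-as-flake-signal"
  | "needs-more-data";

function triage(answers: TriageAnswers): TriageOutcome {
  // Reproducibility wins: assume a product defect until proven otherwise.
  if (answers.reproducesOnSameCommit) return "treat-as-product-defect";

  // Count how many of the instability questions were answered "yes".
  const instabilitySignals = [
    answers.rerunPassedImmediately,
    answers.tiedToOneEnvironment,
    answers.relatedTestsFailing,
    answers.durationDriftPreceded,
    answers.involvesSharedDataOrParallelism,
  ].filter(Boolean).length;

  return instabilitySignals >= 2 ? "treat-as-flake-signal" : "needs-more-data";
}
```

Checking reproducibility first is deliberate: it prevents a reproducible failure from being relabeled as flaky just because it also happens to match a few instability signals.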
That stance keeps teams from using “flaky” as a convenient label for anything inconvenient.
What mature teams do differently
Teams that manage CI flakes well usually have a few habits in common:
- They track flaky test history, not just current build state
- They treat retry rate as an SLO-like operational metric
- They maintain ownership for unstable tests
- They reduce suite coupling and shared state
- They keep quarantines time-boxed and visible
- They use environment parity checks to compare local, PR, and scheduled behavior
Most importantly, they view test instability as a quality signal, not just a tooling problem. A flaky test can be the first observable symptom of broader release fragility.
Closing thought
The fastest way to lose confidence in a CI pipeline is to normalize flaky behavior. The better approach is to read the signals carefully (retry counts, duration drift, environment clustering, and alternating pass/fail patterns), then decide whether you are looking at random noise or a system that is quietly becoming harder to ship.
If your team is watching flaky test signals in CI with enough detail, you can usually spot release risk before the release slips. If you are not, the pipeline will still tell you the truth, just later, and at a higher cost.