Guides / Flaky tests cost real money

Reruns & flakiness

Flaky tests are costing you real money

By Keith Mazanec, Founder, CostOps · Updated February 17, 2026

A test fails. A developer clicks “Re-run failed jobs.” It passes. Nobody investigates. You just paid for two full CI runs to learn nothing new. Multiply that by every flaky test, every PR, every day, and you're burning CI minutes on non-determinism instead of on testing code. When 5–10% of a team's workflow runs require retries, the waste adds up to thousands of CI minutes per month. The root causes are well understood, the reruns are fixable, and you don't need to rewrite your test suite.

Symptoms

How to tell if flaky tests are draining your CI budget

Flaky tests hide in plain sight. They rarely cause outright build failures. Instead, they cause reruns. Open your Actions tab and look for these patterns:

  • Fail-then-pass patterns. A workflow run fails, gets re-run on the same commit, and passes. No code changed between runs. This is the canonical flaky test signal. If you see run_attempt: 2 or higher regularly, you have a flake problem. If 25%+ of multi-attempt runs show this pattern, flakiness is a significant cost driver.

  • Runs requiring 3+ attempts. Occasional single retries happen. But when 5%+ of runs need 3 or more attempts, you have a systemic problem. Each additional attempt costs the same minutes as the first, with zero new information. GitHub bills for every attempt. A 15-minute workflow re-run twice costs 45 minutes total.

  • Same tests fail across unrelated PRs. When the same test name appears in failure logs on PRs that touch completely different code, that test is flaky. It's not detecting regressions, just generating noise and reruns. Usually 2–3 job names account for the majority of rerun triggers.

  • Reruns cost more minutes than first attempts. Compare average minutes for attempt 1 vs. attempt 2+. If reruns consistently use more minutes, caches are being evicted between attempts, queues are longer during peak hours, or full re-execution is happening when only partial reruns are needed.

  • Developers silently re-running without reporting. Trunk's analysis of 20.2M CI jobs found that engineers typically debug a flaky failure for ~30 minutes, then quietly re-run without filing an issue. Google found that 84% of their post-submit CI failures involved a flaky test. Each investigation takes ~20 minutes of context-switching. The cost compounds silently.
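The first two signals above can be computed directly from run metadata. A minimal sketch, assuming each run is a dict with the `attempt` and `conclusion` fields that `gh run list --json attempt,conclusion` returns:

```python
# Classify workflow runs by the two flake symptoms described above.
def flake_signals(runs):
    multi = [r for r in runs if r["attempt"] >= 2]
    fail_then_pass = [r for r in multi if r["conclusion"] == "success"]
    three_plus = [r for r in runs if r["attempt"] >= 3]
    return {
        # share of multi-attempt runs that ultimately passed (canonical flake signal)
        "fail_then_pass_rate": len(fail_then_pass) / max(len(multi), 1),
        # share of all runs needing 3+ attempts (systemic above ~5%)
        "three_plus_share": len(three_plus) / max(len(runs), 1),
    }
```

Compare the two rates against the 25% and 5% thresholds above to decide whether flakiness is a material cost driver for a given workflow.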

Metrics

The retry tax: what flaky tests actually cost

Every rerun pays the full cost of the failed jobs again. Here's a typical scenario: a 10-developer team where 30% of PRs hit a flaky failure and get re-run once, on Linux runners:

Before vs. after: flake rate reduced from 30% to 5%

Metric Before After
PRs/day 20 20
Flaky failure rate 30% 5%
Minutes/run 15 15
Reruns/day 6 1
Wasted minutes/month 1,980 330
Monthly retry cost $12/mo $2/mo

At $0.006/min (Linux 2-core). Save $10/mo · $120/year · per workflow

That's raw compute cost on Linux. On macOS runners at $0.062/min, the same scenario costs $123/mo in wasted reruns, dropping to $20/mo after flake reduction. But the bigger cost is developer time. If each flaky failure wastes 20 minutes of a developer's time at $75/hr, those 6 daily reruns cost $150/day, which adds up to $3,300/mo in lost productivity. Google found that flaky tests consume over 2% of coding time across teams. For a 10-engineer team, that's roughly a person-week per quarter spent on non-determinism. The CI bill is the tax. Developer time is the real expense.
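The arithmetic above can be reproduced in a few lines. A sketch of the retry-tax model, assuming one rerun per flaky PR and a 22-working-day month:

```python
# Retry-tax model for the scenario above: Linux 2-core at $0.006/min,
# one full rerun per flaky PR, 22 working days per month.
def monthly_retry_cost(prs_per_day, flaky_rate, minutes_per_run,
                       price_per_min=0.006, workdays=22):
    reruns_per_day = prs_per_day * flaky_rate
    wasted_minutes = reruns_per_day * minutes_per_run * workdays
    return wasted_minutes, wasted_minutes * price_per_min

before_min, before_cost = monthly_retry_cost(20, 0.30, 15)  # 1,980 min, ~$12
after_min, after_cost = monthly_retry_cost(20, 0.05, 15)    # 330 min, ~$2
```

Swapping `price_per_min` for the macOS rate of $0.062 reproduces the $123/mo and $20/mo figures for the same scenario.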


Fix 1

Use automatic retries to detect flaky tests

Before you can fix flaky tests, you need to identify them. The most reliable detection method is automatic retries: if a test fails and then passes on retry with no code changes, it's flaky by definition. Start by identifying which workflows and job names trigger the most reruns. In most repositories, 2–3 workflows account for the majority of rerun minutes. Focus there first.

GitHub Actions has no built-in retry keyword for steps. Use the nick-fields/retry action to wrap your test command with automatic retries. This costs a few extra minutes on flaky runs but gives you a clear signal: any test that passes on retry is a confirmed flake.

.github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci

      - uses: nick-fields/retry@v3
        with:
          max_attempts: 3
          timeout_minutes: 15
          command: npm test

The max_attempts: 3 setting means a flaky test gets two extra chances before failing the workflow. If it passes on attempt 2 or 3, the step succeeds and the workflow continues. However, you now know that test is non-deterministic. Combine this with test framework features to make retries smarter:

Jest - retry only failed tests on subsequent attempts
- uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    timeout_minutes: 15
    command: npx jest
    new_command_on_retry: npx jest --onlyFailures

The new_command_on_retry parameter runs a different command on subsequent attempts. With Jest's --onlyFailures flag, retries only re-run the tests that actually failed, not the entire suite. This cuts retry time from the full suite duration down to just the flaky tests.

Important distinction: step-level retries like nick-fields/retry target a known-transient operation within the same job, costing fractions of a minute. A full workflow rerun re-executes entire jobs on fresh runners, costing 3–14+ minutes. Use step-level retries for detection and for genuinely transient operations (network calls, package downloads, Docker pulls). Use retries as a detection mechanism, not a permanent fix. Log every retry so you can track which tests are flaking and prioritize fixes.
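The retry-and-log pattern is simple to sketch outside of the action. This is a hypothetical helper, not part of nick-fields/retry; it shows why a pass-on-retry is a confirmed-flake signal worth recording:

```python
# Step-level retry with logging: run `fn` up to max_attempts times,
# recording every failure and every pass-on-retry so flakes can be
# tracked and prioritized later.
def retry_with_log(fn, max_attempts=3, log=None):
    log = log if log is not None else []
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            if attempt > 1:
                # passed only after retrying: a confirmed flake
                log.append(f"flaky: passed on attempt {attempt}")
            return result, log
        except Exception as exc:
            log.append(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"failed after {max_attempts} attempts")
```

In CI, the equivalent of `log` might be a line appended to the job summary or a metrics event, so every retry leaves a trace even when the workflow ends green.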

Fix 2

Quarantine flaky tests to stop paying for them

Quarantining means separating known-flaky tests from your critical CI path. Flaky tests still run, but their failures don't block merges or trigger expensive full-workflow reruns. The goal is to stop the retry tax immediately while you fix the underlying non-determinism. Disabling a flaky test removes it from CI entirely. Quarantining is better: you keep visibility into the flake rate without paying the retry tax.

The simplest approach is tag-based: mark flaky tests with a tag, then split your CI into two jobs. The main job runs everything except tagged tests and blocks merges. The quarantine job runs only the tagged tests and is allowed to fail. Tools like Trunk Flaky Tests and BuildPulse can automate quarantine detection and override exit codes for known-flaky tests.

RSpec - tag flaky tests
# spec/features/checkout_spec.rb
describe "Checkout", :flaky do
  it "submits the order" do
    # known flaky - race condition
    # tracking: JIRA-4521
  end
end
pytest - mark as flaky
# tests/test_payment.py
# requires the pytest-rerunfailures plugin
import pytest

@pytest.mark.flaky(
    reruns=3,
    reruns_delay=2
)
def test_payment_webhook():
    # known flaky - timing
    ...

Then split your CI workflow into two jobs:

.github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bundle install
      - run: bundle exec rspec --tag ~flaky
        name: Run stable tests

  quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - run: bundle install
      - run: bundle exec rspec --tag flaky
        name: Run quarantined tests

The continue-on-error: true on the quarantine job means its failures won't fail the overall workflow. The test job, which runs only stable tests, still blocks merges normally. Developers stop re-running workflows because the tests that were causing failures are isolated.

One caveat: quarantine is a pressure valve, not a fix. Without a process to actually repair quarantined tests, the quarantine list grows and your test coverage silently degrades. Set a rule: any test quarantined for more than 2 weeks gets either fixed or deleted. Track quarantined tests by count and age. If the quarantine list grows without tests being fixed, you're accumulating debt instead of managing it.
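The two-week rule can be enforced with a small audit script. A sketch, assuming a hypothetical quarantine list that maps each test name to the ISO date it was quarantined:

```python
import datetime

# Flag quarantined tests older than the two-week budget.
# `quarantine` maps test name -> ISO date quarantined (assumed format).
def overdue_quarantines(quarantine, today, max_days=14):
    overdue = []
    for test, iso_date in quarantine.items():
        quarantined_on = datetime.date.fromisoformat(iso_date)
        if (today - quarantined_on).days > max_days:
            overdue.append(test)  # fix or delete this test
    return sorted(overdue)
```

Running this weekly in CI and failing (or paging) on a non-empty result keeps the quarantine list from becoming a graveyard.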

Fix 3

Re-run only failed jobs, not the entire workflow

When a developer manually re-runs a failed workflow, the default “Re-run all jobs” button re-executes every job, including the ones that already passed. If your workflow has 8 jobs and only 1 failed, you're paying for all 8 again instead of 1. That single click wastes 79% of the rerun's billed minutes repeating passing work. GitHub Actions supports partial re-runs on all plans since March 2022, letting you re-run only the failed jobs and their downstream dependents.

Re-run all jobs
# lint job       → re-runs (3 min)
# test job       → re-runs (8 min)
# build job      → re-runs (4 min)
# Total billed   → 15 min
Re-run failed jobs only
# lint job       → skipped (passed)
# test job       → re-runs (8 min)
# build job      → re-runs (4 min)
# Total billed   → 12 min

The same option is available in the GitHub CLI with the --failed flag. If you automate reruns via bots or scripts, always use this flag:

Terminal
# Re-run only the failed jobs (not the entire workflow)
gh run rerun --failed RUN_ID

# Bad: re-runs every job, including those that passed
gh run rerun RUN_ID

Reduce the blast radius further by removing unnecessary needs dependencies between jobs. When your test job depends on a build job via needs: build, rerunning the test also reruns the build. If jobs don't actually need each other's outputs, remove the dependency. Lint doesn't need build artifacts. Unit tests often don't either if they can install dependencies independently. Each independent job becomes a separate rerunnability unit.
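For example, a lint job that only needs the source checkout can drop its dependency on build so it reruns on its own. A sketch with illustrative job names:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build

  # lint does not consume build artifacts, so it declares no `needs: build`.
  # Re-running a failed lint no longer re-executes build.
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint
```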

One caveat: partial re-runs reuse the artifacts and outputs from the original run's successful jobs. If your jobs have implicit dependencies on shared state (e.g., a database that gets modified), partial re-runs may produce different results. Keep jobs isolated and idempotent for partial re-runs to work correctly. The biggest win here is behavioral, not technical. Train your team to use “Re-run failed jobs” as the default and add it to your CI runbook.

Fix 4

Fix cache misses that make reruns slower than first attempts

Reruns should be at least as fast as first attempts. When they are slower, the most common cause is cache misses. GitHub Actions caches are scoped to the branch and fall back to the default branch. A rerun of the same workflow run should hit the same cache, but there are three common failure modes: caches evicted under the 10 GB per-repository limit, keys that change between attempts because they reference volatile inputs like github.run_id, and LRU eviction removing entries not accessed within 7 days.

.github/workflows/ci.yml
steps:
  - uses: actions/checkout@v4

  # Good: cache key is deterministic across reruns
  - uses: actions/cache@v4
    with:
      path: node_modules
      key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
      restore-keys: |
        ${{ runner.os }}-node-

  # Bad: cache key includes run_id or attempt, never reused
  # key: ${{ runner.os }}-node-${{ github.run_id }}

  - run: npm ci
  - run: npm test

The rules are simple. Cache keys should be based on lockfile hashes, not on run IDs, attempt numbers, or timestamps. Always include restore-keys as a fallback so partial cache hits still save time. And avoid caching too aggressively: if your total cache size approaches the 10 GB limit, GitHub evicts the least-recently-used entries, which can cause misses on reruns for less-active branches. You can audit your cache usage with gh cache list --sort size_in_bytes --order desc.

Fix 5

Track flake rate and make it visible

Spotify reduced their iOS test flakiness by 33% just by making flake data visible to the team, without any code changes or new tools. Reducing your overall CI failure rate starts with visibility like this. When engineers can see which tests are flaky and how often, social pressure and prioritization naturally follow.

You can build a basic flake tracker using GitHub's run_attempt context. Any workflow run where run_attempt > 1 and the conclusion is success represents a confirmed flake. Export this data to track your flake rate over time:

Query flaky runs via gh CLI
# List recent CI runs with their attempt number
gh run list \
  --workflow ci.yml \
  --limit 100 \
  --json databaseId,conclusion,attempt,startedAt \
  --jq '.[] | select(.attempt > 1 and .conclusion == "success")'

# Count: total runs vs. runs that needed a retry
TOTAL=$(gh run list --workflow ci.yml --limit 100 --json attempt --jq 'length')
FLAKY=$(gh run list --workflow ci.yml --limit 100 --json attempt,conclusion \
  --jq '[.[] | select(.attempt > 1 and .conclusion == "success")] | length')

echo "Flake rate: $FLAKY / $TOTAL"

Or skip the scripting. CostOps tracks run_attempt on every workflow run, groups rerun minutes by workflow name, and computes fail-then-pass rates automatically. You get flake rate trends and a ranked list of your flakiest workflows without writing custom queries.

Define a flake rate SLO and track it weekly. Here are reasonable starting targets:

Metric Target Why
Fail-then-pass rate < 10% Of multi-attempt runs
Runs needing 3+ attempts < 2% Of total workflow runs
Quarantined test count < 20 Prevent quarantine sprawl
Rerun minutes share < 5% Of total billable minutes

GitHub's internal engineering team tracked their flake rate as an SLO and reduced it from 9% (1 in 11 commits) to 0.5% (1 in 200), an 18x improvement. The key insight: 0.4% of their flaky tests caused 100+ failures each. Fixing the worst offenders first yields the most savings. Review these numbers weekly. When flake rate trends up, pause new test additions until the backlog is stabilized.


Reference

Root cause distribution of flaky tests

To actually fix flaky tests rather than just quarantining them, you need to identify why they're non-deterministic. Research across Google, Trunk, and Microsoft Research shows consistent root cause categories:

Root cause Share Symptom Fix
Async/timing waits ~45% Passes locally, fails in CI under load Replace sleep() with polling or event waits
Concurrency/shared state ~20% Fails only when run with other tests Isolate test data; reset state between tests
Test order dependency ~12% Passes in sequence, fails with random order Run with --order random; independent setup
External service calls ~10% Fails during API outages or rate limits Mock/stub external deps or step-level retry
Infra (OOM, network) ~13% Fails with timeout or resource errors Step-level retry with nick-fields/retry

The distinction matters for choosing the right fix. Code-level flakiness (async waits, shared state, test ordering) requires test fixes. Infrastructure flakiness (network timeouts, registry pulls, OOM) is better handled with targeted retries at the step level. Mixing the two strategies wastes effort. Over 70% of flaky tests exhibit flaky behavior when first introduced. The cheapest fix is catching them before they merge by running new tests multiple times in a pre-merge check.
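The most common code-level fix in the table, replacing a fixed sleep() with a polling wait, can be sketched as:

```python
import time

# Poll a condition until it holds instead of sleeping a fixed amount.
# Fixed sleeps fail when CI runners are slow; polling waits only as
# long as needed and tolerates load spikes up to the timeout.
def wait_until(predicate, timeout=5.0, interval=0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check before giving up
```

A test then writes `assert wait_until(lambda: order.status == "submitted")` instead of `time.sleep(2)` followed by the assertion.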

Reference

Rerun cost comparison by strategy

The cost difference between rerun strategies is significant. For a workflow with 8 jobs, 12 total minutes, and 1 failed job (3 minutes), here is the cost per rerun on Linux runners at $0.006/min:

Strategy Minutes Cost Waste
Re-run all jobs 14 $0.084 79%
Re-run failed jobs 3 $0.018 0%
Step-level retry <1 $0.006 0%

“Re-run all jobs” costs 14 minutes instead of 12 because caches expired between attempts, requiring full dependency reinstall. The 79% waste figure comes from the 11 minutes spent re-executing the 7 passing jobs. On macOS at $0.062/min, the same rerun costs $0.87 vs $0.19 for partial rerun. At 5 reruns per day, the difference adds up to $7.26/mo per workflow on Linux, and $75/mo per workflow on macOS.
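The table's arithmetic can be reproduced directly. A sketch using the Linux rate, with the 2 extra minutes modeling the cold-cache reinstall on a full rerun:

```python
# Cost per rerun for the 8-job scenario above: 12 total minutes,
# one 3-minute failed job, $0.006/min on Linux 2-core runners.
RATE = 0.006

full_rerun_minutes = 12 + 2               # expired caches add ~2 min of reinstall
failed_only_minutes = 3                   # just the failed job
wasted = full_rerun_minutes - failed_only_minutes  # minutes repeating passing jobs

full_cost = full_rerun_minutes * RATE     # cost of "Re-run all jobs"
failed_cost = failed_only_minutes * RATE  # cost of "Re-run failed jobs"
waste_share = wasted / full_rerun_minutes # fraction of the rerun that was waste
```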



See which workflows are burning money on reruns

CostOps tracks rerun rates, flaky failure patterns, and retry costs per workflow. Know exactly where your CI budget is leaking before you change a line of YAML.

Free for 1 repo. No credit card. No code access.

Built by engineers who've managed CI spend at scale.