Guides / Reduce CI failures

Reruns & flakiness

Reduce CI failures and rerun waste

By Keith Mazanec, Founder, CostOps · Updated February 17, 2026

A developer pushes to a PR. The workflow runs for 18 minutes, then fails on a transient network error during npm install. They click “Re-run all jobs.” Same result. Third attempt passes. You just paid for 54 minutes to get 18 minutes of useful signal. Across a team, a 15% failure rate quietly burns thousands of minutes per month on runs that deliver nothing, and the reruns triggered by those failures multiply the cost further. The fix starts with knowing which workflows fail, why they fail, and where rerun minutes concentrate.

Symptoms

How to tell if CI failures and reruns are costing you money

Open your repository's Actions tab and filter by status. These patterns indicate high failure and rerun waste:

  • High failure rate. More than 10% of workflow runs end in failure or timed_out. Every failed run consumed minutes without producing a usable result, leaving you with no green check, no merge signal, and no artifact. Those minutes are gone.

  • Failures cluster in a few workflows. One or two workflow names account for the majority of failed minutes. This is common and follows the CI equivalent of the Pareto principle. If your e2e-tests workflow fails 30% of the time while lint fails 1%, your effort should be concentrated on the former.

  • Same jobs fail repeatedly with the same error. Infra failures (network timeouts, registry pulls, runner provisioning) look different from test failures. If you see the same ETIMEDOUT or exit code 137 across runs, that's an infrastructure problem, not a code problem. These need a different fix than failing assertions.

  • Developers manually re-run workflows frequently. High run_attempt > 1 counts mean developers are clicking “Re-run all jobs” to work around failures. Each re-run bills for the full pipeline again. A 20-minute workflow re-run three times costs 60 minutes to produce one passing result.

  • Rerun minutes dominated by 1–3 workflows. Group reruns by workflow name. If the top 3 workflows account for 60%+ of all rerun minutes, you have hotspots, not a systemic flakiness problem. This concentration means targeted fixes will have outsized impact compared to broad retry policies.

Metrics

Quantify the failure and rerun waste

The cost of failures is failed runs per month × minutes per failed run × the per-minute runner rate. A team running 80 builds/day with a 15% failure rate on Linux runners:

Before optimization

Runs/day 80
Failure rate 15%
Failed runs/day 12
Minutes/failed run 18
Wasted minutes/month 4,752
Monthly waste $28.51/mo

12 fails × 18 min × 22 days = 4,752 min × $0.006/min

After optimization (failure rate → 5%)

Runs/day 80
Failure rate 5%
Failed runs/day 4
Minutes/failed run 18
Wasted minutes/month 1,584
Monthly waste $9.50/mo

Save $19/mo · $228/year · per workflow

That's one workflow on Linux. On macOS runners at $0.062/min, the same scenario burns $294.62/mo on failed runs before optimization, dropping to $98.21/mo after, saving $196.42/mo. And this doesn't account for the reruns that developers trigger after failures, which multiply the cost further. Every rerun is billed at the same per-minute rate as the original run. There is no discount for retries.
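The arithmetic above generalizes to any workflow. A minimal sketch of the calculation, plugging in the example's assumed inputs (runs per day, failure rate, minutes per failed run, workdays, per-minute rate):

```shell
# Monthly failure-waste calculator; all values are the example's assumptions
awk -v runs=80 -v fail_rate=0.15 -v mins=18 -v days=22 -v rate=0.006 'BEGIN {
  failed_per_day = runs * fail_rate            # 12 failed runs/day
  wasted = failed_per_day * mins * days        # 4752 wasted minutes/month
  printf "wasted minutes/month: %d\n", wasted
  printf "monthly waste: $%.2f\n", wasted * rate
}'
# prints: wasted minutes/month: 4752
#         monthly waste: $28.51
```

Swap in your own runs/day and failure rate to estimate a workflow's waste before and after a fix.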


Fix 1

Identify your top-failing workflows and jobs

Before fixing anything, you need to know where failures concentrate. The GitHub CLI can pull workflow run conclusions so you can rank workflows by failure count and wasted minutes. Focus on the top 2–3 workflows that account for the most failed minutes, since fixing everything at once is not the goal.

List failed runs by workflow (most recent 500)
# Count failures by workflow name across the 500 most recent failed runs
gh run list \
  --status failure \
  --limit 500 \
  --json workflowName,conclusion,createdAt \
  --jq 'group_by(.workflowName)
    | map({name: .[0].workflowName, count: length})
    | sort_by(-.count)
    | .[:5]
    | .[] | "\(.count)\t\(.name)"'

Once you know which workflows fail most, drill into individual jobs within those workflows. A workflow might have 8 jobs, but failures concentrate in one or two. Use the Actions tab to filter by the failing workflow and look at which job names turn red.

List failed jobs for a specific workflow run
# Show failed jobs from a specific run
gh run view <run-id> --json jobs \
  --jq '.jobs[]
    | select(.conclusion == "failure")
    | "\(.name)\t\(.conclusion)\t\(.steps
        | map(select(.conclusion == "failure"))
        | .[0].name)"'

Or skip the scripting. CostOps ingests every workflow run via webhook and breaks down failures by workflow name, job name, conclusion type, and step. You get a ranked list of your top-failing jobs without paginating through the API.

This gives you three layers of specificity: which workflow, which job, and which step. That's enough to categorize the failure and pick the right fix from the sections below.
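Failure counts alone can mislead: a workflow that fails often but finishes in a minute wastes less than one that fails rarely at 20 minutes a run. A sketch of ranking by failed minutes instead, run here against inline sample data standing in for `gh run list --status failure --json workflowName,startedAt,updatedAt` output:

```shell
# Sum (updatedAt - startedAt) per workflow and rank by wasted minutes
jq -r 'map(.mins = (((.updatedAt | fromdate) - (.startedAt | fromdate)) / 60))
  | group_by(.workflowName)
  | map({name: .[0].workflowName, mins: (map(.mins) | add)})
  | sort_by(-.mins)
  | .[] | "\(.mins)\t\(.name)"' <<'EOF'
[
  {"workflowName":"e2e-tests","startedAt":"2026-02-01T10:00:00Z","updatedAt":"2026-02-01T10:18:00Z"},
  {"workflowName":"e2e-tests","startedAt":"2026-02-01T11:00:00Z","updatedAt":"2026-02-01T11:20:00Z"},
  {"workflowName":"lint","startedAt":"2026-02-01T10:00:00Z","updatedAt":"2026-02-01T10:02:00Z"}
]
EOF
# prints: 38	e2e-tests
#         2	lint
```

Pipe the real gh output into the same jq filter to get your own ranking.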

Fix 2

Identify rerun hotspots by workflow

Failures cost you minutes. Reruns double the bill. The GitHub Actions REST API exposes a run_attempt field on every workflow run. Any run where run_attempt > 1 is a rerun. Query your runs, group by workflow name, and rank by total rerun minutes. The top 1–3 workflows are your hotspots.

Query reruns via GitHub API (bash + gh CLI)
# List recent workflow runs where run_attempt > 1
gh api repos/{owner}/{repo}/actions/runs \
  --paginate \
  -q '.workflow_runs[]
       | select(.run_attempt > 1)
       | {name, run_attempt, conclusion,
          run_started_at, updated_at}'

# For each rerun, the API also returns:
#   run_attempt  - attempt number (2 = first rerun)
#   conclusion   - success, failure, timed_out, cancelled
#   run_started_at / updated_at  - for duration calculation

Next, group failing runs by job name within the hotspot workflow. In most cases, a single job accounts for the majority of failures. GitHub records four conclusion types for failed runs: failure (test/build errors), timed_out (deadlocks, resource exhaustion), cancelled (superseded by newer push), and startup_failure (runner provisioning errors). Each points to a different root cause and a different fix.

List failed jobs in a specific run
# Get failed jobs for a specific run
gh api repos/{owner}/{repo}/actions/runs/{run_id}/jobs \
  -q '.jobs[]
       | select(.conclusion == "failure")
       | {name, conclusion, started_at, completed_at}'

# Track the pattern: if the same job name appears
# in 80%+ of reruns, that's your target.

CostOps does this automatically. For every workflow run and re-run, CostOps ingests job-level results via webhook and breaks down failures by conclusion type and job name. Instead of paginating through the API run by run, you get a single view: which jobs fail most, what conclusion type dominates, and how many rerun minutes each one costs. It works on all GitHub plans, including Free.
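The grouping step itself can be sketched with jq. The inline sample below stands in for one page of the /actions/runs payload; the duration estimate (updated_at minus run_started_at) is an approximation for the latest attempt's billed time:

```shell
# Extra minutes billed to reruns (attempt > 1), summed per workflow
jq -r '.workflow_runs
  | map(select(.run_attempt > 1))
  | group_by(.name)
  | map({name: .[0].name,
         mins: (map(((.updated_at | fromdate) - (.run_started_at | fromdate)) / 60) | add)})
  | sort_by(-.mins)
  | .[] | "\(.mins)\t\(.name)"' <<'EOF'
{"workflow_runs":[
  {"name":"e2e-tests","run_attempt":2,"run_started_at":"2026-02-01T10:30:00Z","updated_at":"2026-02-01T10:48:00Z"},
  {"name":"e2e-tests","run_attempt":3,"run_started_at":"2026-02-01T11:30:00Z","updated_at":"2026-02-01T11:48:00Z"},
  {"name":"lint","run_attempt":1,"run_started_at":"2026-02-01T10:00:00Z","updated_at":"2026-02-01T10:01:00Z"}
]}
EOF
# prints: 36	e2e-tests
```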

Fix 3

Separate infrastructure failures from test failures

Infrastructure failures and test failures have different root causes and different fixes. Treating them the same wastes effort. Infra failures are transient, including network timeouts, registry rate limits, runner provisioning errors, and OOM kills. They pass on retry without any code change. Test failures indicate actual code problems or flaky test logic.

For infrastructure failures, the right fix is targeted retry logic at the step level rather than blanket "re-run all jobs." Blanket re-runs waste minutes re-running jobs that already passed. Step-level retries isolate the transient failure and retry only the affected step.

Full workflow rerun

Jobs replayed 6 of 6
Minutes billed 18
Rerun cost $0.108

5 passing jobs re-run unnecessarily

Step-level retry

Jobs replayed 0
Minutes billed 3
Rerun cost $0.018

83% cheaper · no manual intervention

.github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Infra step: retry on transient failures
      - name: Install dependencies
        uses: nick-fields/retry@v3
        with:
          max_attempts: 3
          retry_wait_seconds: 15
          timeout_minutes: 5
          command: npm ci

      # Test step: no retry - failures here mean real problems
      - name: Run tests
        run: npm test

The nick-fields/retry action retries a single step up to N times. Wrapping npm ci in a retry handles the common case where a registry request times out without needing to re-run your entire test suite. The timeout_minutes prevents a hung install from burning runner minutes indefinitely.

One caveat: do not add retry logic to your test step. If tests fail, you want to know immediately. Retrying test failures masks flaky tests and delays the feedback loop. Only retry steps where transient infrastructure failures are the expected failure mode: dependency installs, Docker pulls, artifact downloads, and external API calls.

Fix 4

Add explicit timeouts to cap failure cost

GitHub Actions defaults to a 360-minute (6-hour) timeout per job. If a job hangs due to a deadlocked test, stuck Docker build, or unresponsive external service, it will burn runner minutes for up to 6 hours before GitHub kills it. See our dedicated guide on reducing CI timeouts for a comprehensive timeout strategy. On a macOS runner, one hung job at the default timeout costs $22.32.

Set timeout-minutes on every job to a value slightly above its normal duration. A job that usually takes 12 minutes should have a timeout of 20–25 minutes. This caps the damage from hangs while allowing headroom for legitimate variance.

Default: 6-hour timeout
jobs:
  test:
    runs-on: ubuntu-latest
    # No timeout-minutes set
    # Default: 360 minutes (6 hours)
    # A hung job costs up to $2.16
    # (Linux) or $22.32 (macOS)
    steps:
      - run: npm test
Explicit timeout: 20 min cap
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    # Caps damage to $0.12 (Linux)
    # or $1.24 (macOS)
    steps:
      - run: npm test
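timeout-minutes also applies at the step level, which helps when only one step is hang-prone. A sketch combining both levels (step names and values are illustrative):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 20        # job-level cap: whole job killed at 20 min
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: npm test
        timeout-minutes: 15    # step-level cap: a hung test suite fails
                               # here before the job-level limit is reached
```

The step-level cap gives a tighter bound on the riskiest step while the job-level cap still backstops everything else.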

Fix 5

Use fail-fast and “Re-run failed jobs” to minimize rerun cost

When a matrix job or multi-job workflow discovers a failure early, you want to stop the remaining jobs as soon as possible. For matrix strategies, GitHub Actions has a fail-fast setting (enabled by default) that cancels remaining matrix jobs when one fails. Make sure you haven't disabled it, and structure your workflow so that cheap checks run first.

.github/workflows/ci.yml
jobs:
  # Gate: cheap checks run first
  lint:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  # Expensive jobs depend on cheap ones passing
  test:
    needs: [lint]
    runs-on: ubuntu-latest
    timeout-minutes: 20
    strategy:
      fail-fast: true  # default, but explicit is clearer
      matrix:
        node: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test

This structure provides two layers of fail-fast behavior. First, the needs keyword ensures that test doesn't start until lint passes. If lint fails in 30 seconds, you skip the 20-minute test suite entirely. Second, fail-fast: true on the matrix cancels remaining Node versions as soon as one fails.

When a rerun is unavoidable, always use “Re-run failed jobs” instead of “Re-run all jobs.” GitHub’s “Re-run failed jobs” replays only the failed jobs plus their downstream dependents. For a workflow with 6 jobs where 1 fails, “Re-run failed jobs” costs roughly 1/6th of “Re-run all jobs.” This feature is available on all GitHub plans. To automate this:

Automated rerun of failed jobs only
# .github/workflows/rerun-failed.yml
# Separate workflow: GitHub rejects reruns of a run that is still
# in progress, so this triggers after the CI workflow completes.
name: Rerun failed jobs
on:
  workflow_run:
    workflows: ["CI"]   # match the name of your CI workflow
    types: [completed]
permissions:
  actions: write
jobs:
  retry:
    if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt < 3
    runs-on: ubuntu-latest
    steps:
      - name: Rerun failed jobs
        run: gh run rerun ${{ github.event.workflow_run.id }} --failed
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}

The run_attempt < 3 condition caps retries at two attempts beyond the original run. Without this guard, a genuinely broken test would rerun indefinitely. The --failed flag ensures only failed jobs replay rather than the entire workflow.

One caveat: automated reruns can hide real failures. If a test passes on attempt 3, the PR shows green, yet the underlying flakiness persists. Pair automated reruns with flake tracking: log every run_attempt > 1 pass so you can identify and fix the root cause.
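A minimal flake-tracking sketch along those lines: list every run that succeeded only on a retry. The sample JSON stands in for a page of the /actions/runs payload:

```shell
# "Flaky pass" detector: successes that needed more than one attempt
jq -r '.workflow_runs[]
  | select(.run_attempt > 1 and .conclusion == "success")
  | "\(.name)\tpassed on attempt \(.run_attempt)"' <<'EOF'
{"workflow_runs":[
  {"name":"e2e-tests","run_attempt":3,"conclusion":"success"},
  {"name":"lint","run_attempt":1,"conclusion":"success"},
  {"name":"e2e-tests","run_attempt":2,"conclusion":"failure"}
]}
EOF
# prints: e2e-tests	passed on attempt 3
```

Run this on a schedule and file the output as your flake backlog; every line is a test that is masking instability behind a green check.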


Reference

Common CI failure categories and fixes

Once you've identified your top-failing jobs, categorize the failures by root cause. Each category has a different fix:

Failure type        Symptoms                                    Fix
Network/Registry    ETIMEDOUT, 429, DNS errors                  Step-level retry with backoff
OOM / Resource      Exit code 137, killed processes             Larger runner or reduce concurrency
Flaky tests         Pass on re-run, random assertion failures   Quarantine, fix determinism
Timeout / Hang      timed_out conclusion                        Set explicit timeout-minutes
Startup failure     startup_failure conclusion                  Wait and auto-retry the run
Real code failures  Consistent failure, same test, same assertion   Fix the code (this is CI working correctly)
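To see which category dominates before reading any logs, tally failed runs by conclusion type. A sketch over inline sample data standing in for the /actions/runs payload:

```shell
# Count failed runs per conclusion type
jq -r '.workflow_runs
  | group_by(.conclusion)
  | map("\(length)\t\(.[0].conclusion)")
  | .[]' <<'EOF'
{"workflow_runs":[
  {"conclusion":"failure"},
  {"conclusion":"failure"},
  {"conclusion":"timed_out"},
  {"conclusion":"startup_failure"}
]}
EOF
# prints: 2	failure
#         1	startup_failure
#         1	timed_out
```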

Datadog's 2024 DevOps report found that 63% of pipeline failures are caused by resource exhaustion, not test failures. The common thread: most CI failure waste comes from infra-level problems that can be fixed independently of application code. Once you reduce transient failures, focus on stabilizing CI runtime variance to make remaining runs more predictable.

Reference

Prioritization framework for failure and rerun hotspots

Not all failures and reruns are equal. Use this framework to decide where to invest your time:

Factor                   Higher priority                  Lower priority
Runner type              macOS ($0.062/min)               Linux ($0.006/min)
Failure/rerun frequency  Daily failures or reruns         Weekly failures or reruns
Job count                Full workflow reruns (6+ jobs)   Single failed job reruns
Root cause               Flaky tests (fixable)            GitHub infra (transient)

Focus on the workflow that scores highest across all four factors. A daily-flaking macOS workflow with full reruns is worth 100x more to fix than a weekly Linux infra timeout that resolves on the second attempt.

Reference

Rerun cost by runner type

To estimate your own failure and retry tax, multiply your wasted minutes by these per-minute rates. The macOS multiplier makes rerun waste on Apple runners especially painful:

Runner            Rate          100 wasted min/mo
Linux 2-core      $0.006/min    $0.60
Windows 2-core    $0.010/min    $1.00
Linux 8-core      $0.022/min    $2.20
macOS (M1/Intel)  $0.062/min    $6.20

GitHub rounds each job to the nearest minute and bills reruns at the same rate as first runs. There is no rerun discount. Per-minute rounding means a 10-second retry job is billed as 1 full minute.


See which workflows waste minutes on failures and reruns

CostOps tracks failure rates, rerun minutes by workflow, and failure conclusions automatically. Find your hotspots before you change the YAML.

Free for 1 repo. No credit card. No code access.

Built by engineers who've managed CI spend at scale.