Reruns & flakiness
Reduce CI failures and rerun waste
By Keith Mazanec, Founder, CostOps · Updated February 17, 2026
A developer pushes to a PR. The workflow runs for 18 minutes, then fails on a transient network error during npm install. They click “Re-run all jobs.” Same result. Third attempt passes. You just paid for 54 minutes to get 18 minutes of useful signal. Across a team, a 15% failure rate quietly burns thousands of minutes per month on runs that deliver nothing, and the reruns triggered by those failures multiply the cost further. The fix starts with knowing which workflows fail, why they fail, and where rerun minutes concentrate.
Symptoms
How to tell if CI failures and reruns are costing you money
Open your repository's Actions tab and filter by status. These patterns indicate high failure and rerun waste:
- High failure rate. More than 10% of workflow runs end in failure or timed_out. Every failed run consumed minutes without producing a usable result, leaving you with no green check, no merge signal, and no artifact. Those minutes are gone.
- Failures cluster in a few workflows. One or two workflow names account for the majority of failed minutes. This is common and follows the CI equivalent of the Pareto principle. If your e2e-tests workflow fails 30% of the time while lint fails 1%, concentrate your effort on the former.
- Same jobs fail repeatedly with the same error. Infra failures (network timeouts, registry pulls, runner provisioning) look different from test failures. If you see the same ETIMEDOUT or exit code 137 across runs, that's an infrastructure problem, not a code problem. These need a different fix than failing assertions.
- Developers manually re-run workflows frequently. High run_attempt > 1 counts mean developers are clicking “Re-run all jobs” to work around failures. Each re-run bills for the full pipeline again. A 20-minute workflow re-run three times costs 60 minutes to produce one passing result.
- Rerun minutes dominated by 1–3 workflows. Group reruns by workflow name. If the top 3 workflows account for 60%+ of all rerun minutes, you have hotspots, not a systemic flakiness problem. This concentration means targeted fixes will have outsized impact compared to broad retry policies.
Metrics
Quantify the failure and rerun waste
The cost of failures is: failed runs × minutes per run × per-minute runner rate. A team running 80 builds/day with a 15% failure rate on Linux runners:

Before optimization: 12 fails/day × 18 min × 22 workdays = 4,752 min/mo × $0.006/min ≈ $28.51/mo

After optimization (failure rate → 5%): 4 fails/day × 18 min × 22 workdays = 1,584 min/mo × $0.006/min ≈ $9.50/mo

Savings: $19/mo · $228/year · per workflow
That's one workflow on Linux. On macOS runners at $0.062/min, the same scenario burns $294.62/mo on failed runs before optimization, dropping to $98.21/mo after, saving $196.42/mo. And this doesn't account for the reruns that developers trigger after failures, which multiply the cost further. Every rerun is billed at the same per-minute rate as the original run. There is no discount for retries.
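The arithmetic above is easy to script against your own numbers. A minimal sketch using this section's example figures (the variable names are arbitrary; swap in your own counts and runner rate):

```shell
# Failure tax = failed runs/day x minutes/run x workdays/mo x per-minute rate.
# The numbers below are this section's Linux example - substitute your own.
fails_per_day=12
mins_per_run=18
workdays=22
rate=0.006   # Linux 2-core; use 0.062 for macOS

awk -v f="$fails_per_day" -v m="$mins_per_run" -v d="$workdays" -v r="$rate" \
  'BEGIN { mins = f * m * d; printf "%d wasted min/mo = $%.2f/mo\n", mins, mins * r }'
# -> 4752 wasted min/mo = $28.51/mo
```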
Fix 1
Identify your top-failing workflows and jobs
Before fixing anything, you need to know where failures concentrate. The GitHub CLI can pull workflow run conclusions so you can rank workflows by failure count and wasted minutes. Focus on the top 2–3 workflows that account for the most failed minutes, since fixing everything at once is not the goal.
```shell
# Count failures by workflow name over the last 30 days
gh run list \
  --status failure \
  --limit 500 \
  --json workflowName,conclusion,createdAt \
  --jq 'group_by(.workflowName) | map({name: .[0].workflowName, count: length}) | sort_by(-.count) | .[:5] | .[] | "\(.count)\t\(.name)"'
```
Once you know which workflows fail most, drill into individual jobs within those workflows. A workflow might have 8 jobs, but failures concentrate in one or two. Use the Actions tab to filter by the failing workflow and look at which job names turn red.
```shell
# Show failed jobs from a specific run
gh run view <run-id> --json jobs \
  --jq '.jobs[] | select(.conclusion == "failure") | "\(.name)\t\(.conclusion)\t\(.steps | map(select(.conclusion == "failure")) | .[0].name)"'
```
Or skip the scripting. CostOps ingests every workflow run via webhook and breaks down failures by workflow name, job name, conclusion type, and step. You get a ranked list of your top-failing jobs without paginating through the API.
This gives you three layers of specificity: which workflow, which job, and which step. That's enough to categorize the failure and pick the right fix from the sections below.
Fix 2
Identify rerun hotspots by workflow
Failures cost you minutes. Reruns double the bill. The GitHub Actions REST API exposes a run_attempt field on every workflow run. Any run where run_attempt > 1 is a rerun. Query your runs, group by workflow name, and rank by total rerun minutes. The top 1–3 workflows are your hotspots.
```shell
# List recent workflow runs where run_attempt > 1
gh api repos/{owner}/{repo}/actions/runs \
  --paginate \
  -q '.workflow_runs[] | select(.run_attempt > 1) | {name, run_attempt, conclusion, created_at, updated_at}'

# For each rerun, the API also returns:
#   run_attempt    - attempt number (2 = first rerun)
#   conclusion     - success, failure, timed_out, cancelled
#   run_started_at / updated_at - for duration calculation
```
Next, group failing runs by job name within the hotspot workflow. In most cases, a single job accounts for the majority of failures. GitHub records four conclusion types for failed runs: failure (test/build errors), timed_out (deadlocks, resource exhaustion), cancelled (superseded by newer push), and startup_failure (runner provisioning errors). Each points to a different root cause and a different fix.
```shell
# Get failed jobs for a specific run
gh api repos/{owner}/{repo}/actions/runs/{run_id}/jobs \
  -q '.jobs[] | select(.conclusion == "failure") | {name, conclusion, started_at, completed_at}'

# Track the pattern: if the same job name appears
# in 80%+ of reruns, that's your target.
```
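To rank hotspots by cost rather than count, sum rerun minutes per workflow. A self-contained jq sketch over made-up sample data (in practice the input would be the `gh api .../actions/runs` output above; the field names match the REST API):

```shell
# Sample stand-in for the API response (runs.json and all rows are invented)
cat <<'EOF' > runs.json
[
  {"name": "e2e-tests", "run_attempt": 2, "run_started_at": "2026-02-01T10:00:00Z", "updated_at": "2026-02-01T10:20:00Z"},
  {"name": "e2e-tests", "run_attempt": 3, "run_started_at": "2026-02-01T11:00:00Z", "updated_at": "2026-02-01T11:18:00Z"},
  {"name": "lint",      "run_attempt": 1, "run_started_at": "2026-02-01T12:00:00Z", "updated_at": "2026-02-01T12:02:00Z"},
  {"name": "lint",      "run_attempt": 2, "run_started_at": "2026-02-01T13:00:00Z", "updated_at": "2026-02-01T13:02:00Z"}
]
EOF

# Rank workflows by total rerun minutes (reruns only, i.e. run_attempt > 1)
jq -r 'map(select(.run_attempt > 1)
           | .mins = (((.updated_at | fromdateiso8601) - (.run_started_at | fromdateiso8601)) / 60))
       | group_by(.name)
       | map({name: .[0].name, rerun_mins: (map(.mins) | add)})
       | sort_by(-.rerun_mins)[]
       | "\(.rerun_mins)\t\(.name)"' runs.json
# -> 38  e2e-tests
#    2   lint
```

Note that when you combine `--paginate` with a grouping filter, each page is filtered separately, so for large histories it is safer to save the raw output first and aggregate in one pass as shown here.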
CostOps does this automatically. For every workflow run and re-run, CostOps ingests job-level results via webhook and breaks down failures by conclusion type and job name. Instead of paginating through the API run by run, you get a single view: which jobs fail most, what conclusion type dominates, and how many rerun minutes each one costs. It works on all GitHub plans, including Free.
Fix 3
Separate infrastructure failures from test failures
Infrastructure failures and test failures have different root causes and different fixes, and treating them the same wastes effort. Infra failures are typically transient: network timeouts, registry rate limits, runner provisioning errors, and OOM kills. They pass on retry without any code change. Test failures indicate actual code problems or flaky test logic.
For infrastructure failures, the right fix is targeted retry logic at the step level rather than blanket "re-run all jobs." Blanket re-runs waste minutes re-running jobs that already passed. Step-level retries isolate the transient failure and retry only the affected step.
Full workflow rerun: 5 passing jobs re-run unnecessarily.

Step-level retry: 83% cheaper · no manual intervention.
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Infra step: retry on transient failures
      - name: Install dependencies
        uses: nick-fields/retry@v3
        with:
          max_attempts: 3
          retry_wait_seconds: 15
          timeout_minutes: 5
          command: npm ci

      # Test step: no retry - failures here mean real problems
      - name: Run tests
        run: npm test
```
The nick-fields/retry action retries a single step up to N times. Wrapping npm ci in a retry handles the common case where a registry request times out without needing to re-run your entire test suite. The timeout_minutes prevents a hung install from burning runner minutes indefinitely.
One caveat: do not add retry logic to your test step. If tests fail, you want to know immediately. Retrying test failures masks flaky tests and delays the feedback loop. Only retry steps where transient infrastructure failures are the expected failure mode: dependency installs, Docker pulls, artifact downloads, and external API calls.
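If you'd rather not depend on a third-party action, the same step-level retry can be a few lines of shell inside a `run:` step. A minimal sketch (the `retry` helper name, attempt count, and 2-second pause are arbitrary choices, not a standard):

```shell
# retry CMD...: run CMD up to 3 times, pausing briefly between attempts.
# Use it only for transient infra steps (installs, pulls, downloads) -
# never for the test step itself.
retry() {
  local attempts=3 pause=2 i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    [ "$i" -lt "$attempts" ] && sleep "$pause"
  done
  return 1
}

# Example usage in a workflow step: retry npm ci
```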
Fix 4
Add explicit timeouts to cap failure cost
GitHub Actions defaults to a 360-minute (6-hour) timeout per job. If a job hangs due to a deadlocked test, stuck Docker build, or unresponsive external service, it will burn runner minutes for up to 6 hours before GitHub kills it. See our dedicated guide on reducing CI timeouts for a comprehensive timeout strategy. On a macOS runner, one hung job at the default timeout costs $22.32.
Set timeout-minutes on every job to a value slightly above its normal duration. A job that usually takes 12 minutes should have a timeout of 20–25 minutes. This caps the damage from hangs while allowing headroom for legitimate variance.
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    # No timeout-minutes set
    # Default: 360 minutes (6 hours)
    # A hung job costs up to $2.16 (Linux) or $22.32 (macOS)
    steps:
      - run: npm test
```
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    # Caps damage to $0.12 (Linux) or $1.24 (macOS)
    steps:
      - run: npm test
```
Fix 5
Use fail-fast and “Re-run failed jobs” to minimize rerun cost
When a matrix job or multi-job workflow discovers a failure early, you want to stop the remaining jobs as soon as possible. For matrix strategies, GitHub Actions has a fail-fast setting (enabled by default) that cancels remaining matrix jobs when one fails. Make sure you haven't disabled it, and structure your workflow so that cheap checks run first.
```yaml
jobs:
  # Gate: cheap checks run first
  lint:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  # Expensive jobs depend on cheap ones passing
  test:
    needs: [lint]
    runs-on: ubuntu-latest
    timeout-minutes: 20
    strategy:
      fail-fast: true  # default, but explicit is clearer
      matrix:
        node: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test
```
This structure provides two layers of fail-fast behavior. First, the needs keyword ensures that test doesn't start until lint passes. If lint fails in 30 seconds, you skip the 20-minute test suite entirely. Second, fail-fast: true on the matrix cancels remaining Node versions as soon as one fails.
When a rerun is unavoidable, always use “Re-run failed jobs” instead of “Re-run all jobs.” GitHub’s “Re-run failed jobs” replays only the failed jobs plus their downstream dependents. For a workflow with 6 jobs where 1 fails, “Re-run failed jobs” costs roughly 1/6th of “Re-run all jobs.” This feature is available on all GitHub plans. To automate this:
```yaml
# In a retry job that runs after a failure:
retry:
  needs: [test]
  if: failure() && fromJSON(github.run_attempt) < 3
  runs-on: ubuntu-latest
  steps:
    - name: Rerun failed jobs
      run: gh run rerun ${{ github.run_id }} --failed
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
The fromJSON(github.run_attempt) < 3 condition caps retries at two attempts beyond the original run. Without this guard, a genuinely broken test would rerun indefinitely. The --failed flag ensures only failed jobs replay rather than the entire workflow. One wrinkle: GitHub generally refuses to rerun a workflow that is still in progress, and the retry job is part of the very run it is trying to restart. If the gh run rerun call errors for that reason, move the same step into a small follow-up workflow triggered by workflow_run on a failure conclusion.
One caveat: automated reruns can hide real failures. If a test passes on attempt 3, the PR shows green, yet the underlying flakiness persists. Pair automated reruns with flake tracking: log every run_attempt > 1 pass so you can identify and fix the root cause.
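One lightweight way to do that flake logging is a final workflow step that fires only when a rerun goes green. A sketch (the step names, log format, and artifact name are all arbitrary choices):

```yaml
# Append to the workflow's last job: record passes that needed a retry
- name: Record flaky pass
  if: success() && fromJSON(github.run_attempt) > 1
  run: echo "$(date -u +%F) ${{ github.workflow }} passed on attempt ${{ github.run_attempt }}" >> flaky-passes.log

- name: Upload flake log
  if: success() && fromJSON(github.run_attempt) > 1
  uses: actions/upload-artifact@v4
  with:
    name: flake-log
    path: flaky-passes.log
```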
Reference
Common CI failure categories and fixes
Once you've identified your top-failing jobs, categorize the failures by root cause. Each category has a different fix:
| Failure type | Symptoms | Fix |
|---|---|---|
| Network/Registry | ETIMEDOUT, 429, DNS errors | Step-level retry with backoff |
| OOM / Resource | exit code 137, killed | Larger runner or reduce concurrency |
| Flaky tests | Pass on re-run, random assertion failures | Quarantine, fix determinism |
| Timeout/Hang | timed_out conclusion | Set explicit timeout-minutes |
| Startup failure | startup_failure conclusion | Wait and auto-retry the run |
| Real code failures | Consistent failure, same test, same assertion | Fix the code (this is CI working correctly) |
Datadog's 2024 DevOps report found that 63% of pipeline failures are caused by resource exhaustion, not test failures. The common thread: most CI failure waste comes from infra-level problems that can be fixed independently of application code. Once you reduce transient failures, focus on stabilizing CI runtime variance to make remaining runs more predictable.
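The table's conclusion-to-fix mapping can serve as a first-pass triage script. A minimal sketch (the `triage` helper is hypothetical; it just mirrors the rows above):

```shell
# triage CONCLUSION: print the likely fix for a GitHub Actions run conclusion
triage() {
  case "$1" in
    timed_out)       echo "Set explicit timeout-minutes" ;;
    startup_failure) echo "Wait and auto-retry the run" ;;
    cancelled)       echo "Superseded by a newer push - usually fine" ;;
    failure)         echo "Read the logs: infra retry vs. real test failure" ;;
    *)               echo "Unknown conclusion: $1" ;;
  esac
}

triage timed_out   # -> Set explicit timeout-minutes
```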
Reference
Prioritization framework for failure and rerun hotspots
Not all failures and reruns are equal. Use this framework to decide where to invest your time:
| Factor | Higher priority | Lower priority |
|---|---|---|
| Runner type | macOS ($0.062/min) | Linux ($0.006/min) |
| Failure/rerun frequency | Daily failures or reruns | Weekly failures or reruns |
| Job count | Full workflow reruns (6+ jobs) | Single failed job reruns |
| Root cause | Flaky tests (fixable) | GitHub infra (transient) |
Focus on the workflow that scores highest across all four factors. A daily-flaking macOS workflow with full reruns is worth 100x more to fix than a weekly Linux infra timeout that resolves on the second attempt.
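To sanity-check that multiplier, multiply the three ratios from the example: macOS vs Linux per-minute rate, daily vs weekly frequency, and full-workflow vs single-job reruns. A quick sketch:

```shell
# Relative fix priority: rate ratio x frequency ratio x job-count ratio
awk 'BEGIN {
  rate = 0.062 / 0.006   # macOS vs Linux 2-core per-minute rate
  freq = 7               # daily vs weekly occurrence
  jobs = 6               # 6-job full rerun vs 1 failed job
  printf "~%.0fx\n", rate * freq * jobs
}'
# -> ~434x
```

which suggests the 100x figure in the text is, if anything, conservative.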
Reference
Rerun cost by runner type
To estimate your own failure and retry tax, multiply your wasted minutes by these per-minute rates. The macOS multiplier makes rerun waste on Apple runners especially painful:
| Runner | Rate | 100 wasted min/mo |
|---|---|---|
| Linux 2-core | $0.006/min | $0.60 |
| Windows 2-core | $0.010/min | $1.00 |
| Linux 8-core | $0.022/min | $2.20 |
| macOS (M1/Intel) | $0.062/min | $6.20 |
GitHub rounds each job's duration up to the next full minute and bills reruns at the same rate as first runs. There is no rerun discount. Per-minute rounding means a 10-second retry job is billed as 1 full minute.
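Because rounding happens per job, short jobs carry disproportionate overhead. A quick sketch of the round-up, using arbitrary example durations:

```shell
# Billed minutes = job duration rounded up to the next full minute
for secs in 10 61 120 305; do
  awk -v s="$secs" 'BEGIN { printf "%4ds -> billed %d min\n", s, int((s + 59) / 60) }'
done
# ->   10s -> billed 1 min
#      61s -> billed 2 min
#     120s -> billed 2 min
#     305s -> billed 6 min
```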
Related guides
Flaky Tests Cost Real Money
Quarantine non-deterministic tests and stop paying for false failures.
Reduce CI Timeouts
Set explicit timeout-minutes on every job to stop paying for hung processes.
Stabilize CI Runtime
Fix cache misses and queue spikes that cause unpredictable pipeline durations.
Canceled Runs Wasting Minutes
Use concurrency groups to auto-cancel superseded runs and stop wasting minutes.