Reruns & flakiness
Reduce CI failures and rerun waste
By Keith Mazanec, Founder, CostOps · Updated February 17, 2026
A developer pushes to a PR. The workflow runs for 18 minutes, then fails on a transient network error during npm install. They click “Re-run all jobs.” Same result. Third attempt passes. You just paid for 54 minutes to get 18 minutes of useful signal. Across a team, a 15% failure rate quietly burns thousands of minutes per month on runs that deliver nothing, and the reruns triggered by those failures multiply the cost further. The fix starts with knowing which workflows fail, why they fail, and where rerun minutes concentrate.
Symptoms
How to tell if CI failures and reruns are costing you money
Open your repository's Actions tab and filter by status. These patterns indicate high failure and rerun waste:
- High failure rate. More than 10% of workflow runs end in failure or timed_out. Every failed run consumed minutes without producing a usable result, leaving you with no green check, no merge signal, and no artifact. Those minutes are gone.
- Failures cluster in a few workflows. One or two workflow names account for the majority of failed minutes. This is common and follows the CI equivalent of the Pareto principle. If your e2e-tests workflow fails 30% of the time while lint fails 1%, concentrate your effort on the former.
- Same jobs fail repeatedly with the same error. Infra failures (network timeouts, registry pulls, runner provisioning) look different from test failures. If you see the same ETIMEDOUT or exit code 137 across runs, that's an infrastructure problem, not a code problem. These need a different fix than failing assertions.
- Developers manually re-run workflows frequently. High run_attempt > 1 counts mean developers are clicking “Re-run all jobs” to work around failures. Each re-run bills for the full pipeline again. A 20-minute workflow re-run three times costs 60 minutes to produce one passing result.
- Rerun minutes dominated by 1–3 workflows. Group reruns by workflow name. If the top 3 workflows account for 60%+ of all rerun minutes, you have hotspots, not a systemic flakiness problem. This concentration means targeted fixes will have outsized impact compared to broad retry policies.
Metrics
Quantify the failure and rerun waste
The cost of failures is: failed runs × minutes per run × per-minute runner rate. A team running 80 builds/day with a 15% failure rate on Linux runners:

Before optimization: 12 fails/day × 18 min × 22 workdays = 4,752 min/mo × $0.006/min ≈ $28.51/mo

After optimization (failure rate → 5%): 4 fails/day × 18 min × 22 workdays = 1,584 min/mo × $0.006/min ≈ $9.50/mo

Savings: $19/mo · $228/year · per workflow
That's one workflow on Linux. On macOS runners at $0.062/min, the same scenario burns $294.62/mo on failed runs before optimization, dropping to $98.21/mo after, saving $196.42/mo. And this doesn't account for the reruns that developers trigger after failures, which multiply the cost further. Every rerun is billed at the same per-minute rate as the original run. There is no discount for retries.
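The arithmetic above is easy to script against your own numbers. A minimal sketch using this section's example figures (the variable names are arbitrary; swap in your own counts and runner rate):

```shell
# Failure tax = failed runs/day x minutes/run x workdays/mo x per-minute rate.
# The numbers below are this section's Linux example - substitute your own.
fails_per_day=12
mins_per_run=18
workdays=22
rate=0.006   # Linux 2-core; use 0.062 for macOS

awk -v f="$fails_per_day" -v m="$mins_per_run" -v d="$workdays" -v r="$rate" \
  'BEGIN { mins = f * m * d; printf "%d wasted min/mo = $%.2f/mo\n", mins, mins * r }'
# -> 4752 wasted min/mo = $28.51/mo
```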
Fix 1
Identify your top-failing workflows and jobs
Before fixing anything, you need to know where failures concentrate. The GitHub CLI can pull workflow run conclusions so you can rank workflows by failure count and wasted minutes. Focus on the top 2–3 workflows that account for the most failed minutes, since fixing everything at once is not the goal.
```shell
# Count failures by workflow name over the last 30 days
gh run list \
  --status failure \
  --limit 500 \
  --json workflowName,conclusion,createdAt \
  --jq 'group_by(.workflowName) | map({name: .[0].workflowName, count: length}) | sort_by(-.count) | .[:5] | .[] | "\(.count)\t\(.name)"'
```
Once you know which workflows fail most, drill into individual jobs within those workflows. A workflow might have 8 jobs, but failures concentrate in one or two. Use the Actions tab to filter by the failing workflow and look at which job names turn red.
```shell
# Show failed jobs from a specific run
gh run view <run-id> --json jobs \
  --jq '.jobs[] | select(.conclusion == "failure") | "\(.name)\t\(.conclusion)\t\(.steps | map(select(.conclusion == "failure")) | .[0].name)"'
```
Or skip the scripting. CostOps ingests every workflow run via webhook and breaks down failures by workflow name, job name, conclusion type, and step. You get a ranked list of your top-failing jobs without paginating through the API.
This gives you three layers of specificity: which workflow, which job, and which step. That's enough to categorize the failure and pick the right fix from the sections below.
Fix 2
Identify rerun hotspots by workflow
Failures cost you minutes. Reruns double the bill. The GitHub Actions REST API exposes a run_attempt field on every workflow run. Any run where run_attempt > 1 is a rerun. Query your runs, group by workflow name, and rank by total rerun minutes. The top 1–3 workflows are your hotspots.
```shell
# List recent workflow runs where run_attempt > 1
gh api repos/{owner}/{repo}/actions/runs \
  --paginate \
  -q '.workflow_runs[] | select(.run_attempt > 1) | {name, run_attempt, conclusion, created_at, updated_at}'

# For each rerun, the API also returns:
#   run_attempt    - attempt number (2 = first rerun)
#   conclusion     - success, failure, timed_out, cancelled
#   run_started_at / updated_at - for duration calculation
```
Next, group failing runs by job name within the hotspot workflow. In most cases, a single job accounts for the majority of failures. GitHub records four conclusion types for failed runs: failure (test/build errors), timed_out (deadlocks, resource exhaustion), cancelled (superseded by newer push), and startup_failure (runner provisioning errors). Each points to a different root cause and a different fix.
```shell
# Get failed jobs for a specific run
gh api repos/{owner}/{repo}/actions/runs/{run_id}/jobs \
  -q '.jobs[] | select(.conclusion == "failure") | {name, conclusion, started_at, completed_at}'

# Track the pattern: if the same job name appears
# in 80%+ of reruns, that's your target.
```
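To rank hotspots by cost rather than count, sum rerun minutes per workflow. A self-contained jq sketch over made-up sample data (in practice the input would be the `gh api .../actions/runs` output above; the field names match the REST API):

```shell
# Sample stand-in for the API response (runs.json and all rows are invented)
cat <<'EOF' > runs.json
[
  {"name": "e2e-tests", "run_attempt": 2, "run_started_at": "2026-02-01T10:00:00Z", "updated_at": "2026-02-01T10:20:00Z"},
  {"name": "e2e-tests", "run_attempt": 3, "run_started_at": "2026-02-01T11:00:00Z", "updated_at": "2026-02-01T11:18:00Z"},
  {"name": "lint",      "run_attempt": 1, "run_started_at": "2026-02-01T12:00:00Z", "updated_at": "2026-02-01T12:02:00Z"},
  {"name": "lint",      "run_attempt": 2, "run_started_at": "2026-02-01T13:00:00Z", "updated_at": "2026-02-01T13:02:00Z"}
]
EOF

# Rank workflows by total rerun minutes (reruns only, i.e. run_attempt > 1)
jq -r 'map(select(.run_attempt > 1)
           | .mins = (((.updated_at | fromdateiso8601) - (.run_started_at | fromdateiso8601)) / 60))
       | group_by(.name)
       | map({name: .[0].name, rerun_mins: (map(.mins) | add)})
       | sort_by(-.rerun_mins)[]
       | "\(.rerun_mins)\t\(.name)"' runs.json
# -> 38  e2e-tests
#    2   lint
```

Note that when you combine `--paginate` with a grouping filter, each page is filtered separately, so for large histories it is safer to save the raw output first and aggregate in one pass as shown here.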
CostOps does this automatically. For every workflow run and re-run, CostOps ingests job-level results via webhook and breaks down failures by conclusion type and job name. Instead of paginating through the API run by run, you get a single view: which jobs fail most, what conclusion type dominates, and how many rerun minutes each one costs. It works on all GitHub plans, including Free.
Fix 3
Separate infrastructure failures from test failures
Infrastructure failures and test failures have different root causes and different fixes, and treating them the same wastes effort. Infra failures are typically transient: network timeouts, registry rate limits, runner provisioning errors, and OOM kills. They pass on retry without any code change. Test failures indicate actual code problems or flaky test logic.
For infrastructure failures, the right fix is targeted retry logic at the step level rather than blanket "re-run all jobs." Blanket re-runs waste minutes re-running jobs that already passed. Step-level retries isolate the transient failure and retry only the affected step.
Full workflow rerun: 5 passing jobs re-run unnecessarily.

Step-level retry: 83% cheaper · no manual intervention.
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Infra step: retry on transient failures
      - name: Install dependencies
        uses: nick-fields/retry@v3
        with:
          max_attempts: 3
          retry_wait_seconds: 15
          timeout_minutes: 5
          command: npm ci

      # Test step: no retry - failures here mean real problems
      - name: Run tests
        run: npm test
```
The nick-fields/retry action retries a single step up to N times. Wrapping npm ci in a retry handles the common case where a registry request times out without needing to re-run your entire test suite. The timeout_minutes prevents a hung install from burning runner minutes indefinitely.
One caveat: do not add retry logic to your test step. If tests fail, you want to know immediately. Retrying test failures masks flaky tests and delays the feedback loop. Only retry steps where transient infrastructure failures are the expected failure mode: dependency installs, Docker pulls, artifact downloads, and external API calls.
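If you'd rather not depend on a third-party action, the same step-level retry can be a few lines of shell inside a `run:` step. A minimal sketch (the `retry` helper name, attempt count, and 2-second pause are arbitrary choices, not a standard):

```shell
# retry CMD...: run CMD up to 3 times, pausing briefly between attempts.
# Use it only for transient infra steps (installs, pulls, downloads) -
# never for the test step itself.
retry() {
  local attempts=3 pause=2 i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    [ "$i" -lt "$attempts" ] && sleep "$pause"
  done
  return 1
}

# Example usage in a workflow step: retry npm ci
```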
Fix 4
Add explicit timeouts to cap failure cost
GitHub Actions defaults to a 360-minute (6-hour) timeout per job. If a job hangs due to a deadlocked test, stuck Docker build, or unresponsive external service, it will burn runner minutes for up to 6 hours before GitHub kills it. See our dedicated guide on reducing CI timeouts for a comprehensive timeout strategy. On a macOS runner, one hung job at the default timeout costs $22.32.
Set timeout-minutes on every job to a value slightly above its normal duration. A job that usually takes 12 minutes should have a timeout of 20–25 minutes. This caps the damage from hangs while allowing headroom for legitimate variance.
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    # No timeout-minutes set
    # Default: 360 minutes (6 hours)
    # A hung job costs up to $2.16 (Linux) or $22.32 (macOS)
    steps:
      - run: npm test
```
```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    # Caps damage to $0.12 (Linux) or $1.24 (macOS)
    steps:
      - run: npm test
```
Fix 5
Use fail-fast and “Re-run failed jobs” to minimize rerun cost
When a matrix job or multi-job workflow discovers a failure early, you want to stop the remaining jobs as soon as possible. For matrix strategies, GitHub Actions has a fail-fast setting (enabled by default) that cancels remaining matrix jobs when one fails. Make sure you haven't disabled it, and structure your workflow so that cheap checks run first.
```yaml
jobs:
  # Gate: cheap checks run first
  lint:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  # Expensive jobs depend on cheap ones passing
  test:
    needs: [lint]
    runs-on: ubuntu-latest
    timeout-minutes: 20
    strategy:
      fail-fast: true  # default, but explicit is clearer
      matrix:
        node: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm ci
      - run: npm test
```
This structure provides two layers of fail-fast behavior. First, the needs keyword ensures that test doesn't start until lint passes. If lint fails in 30 seconds, you skip the 20-minute test suite entirely. Second, fail-fast: true on the matrix cancels remaining Node versions as soon as one fails.
When a rerun is unavoidable, always use “Re-run failed jobs” instead of “Re-run all jobs.” GitHub’s “Re-run failed jobs” replays only the failed jobs plus their downstream dependents. For a workflow with 6 jobs where 1 fails, “Re-run failed jobs” costs roughly 1/6th of “Re-run all jobs.” This feature is available on all GitHub plans. To automate this:
```yaml
# In a retry job that runs after a failure:
retry:
  needs: [test]
  if: failure() && fromJSON(github.run_attempt) < 3
  runs-on: ubuntu-latest
  steps:
    - name: Rerun failed jobs
      run: gh run rerun ${{ github.run_id }} --failed
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
The fromJSON(github.run_attempt) < 3 condition caps retries at two attempts beyond the original run. Without this guard, a genuinely broken test would rerun indefinitely. The --failed flag ensures only failed jobs replay rather than the entire workflow. One wrinkle: GitHub generally refuses to rerun a workflow that is still in progress, and the retry job is part of the very run it is trying to restart. If the gh run rerun call errors for that reason, move the same step into a small follow-up workflow triggered by workflow_run on a failure conclusion.
One caveat: automated reruns can hide real failures. If a test passes on attempt 3, the PR shows green, yet the underlying flakiness persists. Pair automated reruns with flake tracking: log every run_attempt > 1 pass so you can identify and fix the root cause.
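One lightweight way to do that flake logging is a final workflow step that fires only when a rerun goes green. A sketch (the step names, log format, and artifact name are all arbitrary choices):

```yaml
# Append to the workflow's last job: record passes that needed a retry
- name: Record flaky pass
  if: success() && fromJSON(github.run_attempt) > 1
  run: echo "$(date -u +%F) ${{ github.workflow }} passed on attempt ${{ github.run_attempt }}" >> flaky-passes.log

- name: Upload flake log
  if: success() && fromJSON(github.run_attempt) > 1
  uses: actions/upload-artifact@v4
  with:
    name: flake-log
    path: flaky-passes.log
```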
Reference
Common CI failure categories and fixes
Once you've identified your top-failing jobs, categorize the failures by root cause. Each category has a different fix:
| Failure type | Symptoms | Fix |
|---|---|---|
| Network/Registry | ETIMEDOUT, 429, DNS errors | Step-level retry with backoff |
| OOM / Resource | exit code 137, killed | Larger runner or reduce concurrency |
| Flaky tests | Pass on re-run, random assertion failures | Quarantine, fix determinism |
| Timeout/Hang | timed_out conclusion | Set explicit timeout-minutes |
| Startup failure | startup_failure conclusion | Wait and auto-retry the run |
| Real code failures | Consistent failure, same test, same assertion | Fix the code (this is CI working correctly) |
Datadog's 2024 DevOps report found that 63% of pipeline failures are caused by resource exhaustion, not test failures. The common thread: most CI failure waste comes from infra-level problems that can be fixed independently of application code. Once you reduce transient failures, focus on stabilizing CI runtime variance to make remaining runs more predictable.
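The table's conclusion-to-fix mapping can serve as a first-pass triage script. A minimal sketch (the `triage` helper is hypothetical; it just mirrors the rows above):

```shell
# triage CONCLUSION: print the likely fix for a GitHub Actions run conclusion
triage() {
  case "$1" in
    timed_out)       echo "Set explicit timeout-minutes" ;;
    startup_failure) echo "Wait and auto-retry the run" ;;
    cancelled)       echo "Superseded by a newer push - usually fine" ;;
    failure)         echo "Read the logs: infra retry vs. real test failure" ;;
    *)               echo "Unknown conclusion: $1" ;;
  esac
}

triage timed_out   # -> Set explicit timeout-minutes
```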
Reference
Prioritization framework for failure and rerun hotspots
Not all failures and reruns are equal. Use this framework to decide where to invest your time:
| Factor | Higher priority | Lower priority |
|---|---|---|
| Runner type | macOS ($0.062/min) | Linux ($0.006/min) |
| Failure/rerun frequency | Daily failures or reruns | Weekly failures or reruns |
| Job count | Full workflow reruns (6+ jobs) | Single failed job reruns |
| Root cause | Flaky tests (fixable) | GitHub infra (transient) |
Focus on the workflow that scores highest across all four factors. A daily-flaking macOS workflow with full reruns is worth 100x more to fix than a weekly Linux infra timeout that resolves on the second attempt.
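To sanity-check that multiplier, multiply the three ratios from the example: macOS vs Linux per-minute rate, daily vs weekly frequency, and full-workflow vs single-job reruns. A quick sketch:

```shell
# Relative fix priority: rate ratio x frequency ratio x job-count ratio
awk 'BEGIN {
  rate = 0.062 / 0.006   # macOS vs Linux 2-core per-minute rate
  freq = 7               # daily vs weekly occurrence
  jobs = 6               # 6-job full rerun vs 1 failed job
  printf "~%.0fx\n", rate * freq * jobs
}'
# -> ~434x
```

which suggests the 100x figure in the text is, if anything, conservative.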
Reference
Rerun cost by runner type
To estimate your own failure and retry tax, multiply your wasted minutes by these per-minute rates. The macOS multiplier makes rerun waste on Apple runners especially painful:
| Runner | Rate | 100 wasted min/mo |
|---|---|---|
| Linux 2-core | $0.006/min | $0.60 |
| Windows 2-core | $0.010/min | $1.00 |
| Linux 8-core | $0.022/min | $2.20 |
| macOS (M1/Intel) | $0.062/min | $6.20 |
GitHub rounds each job's duration up to the next full minute and bills reruns at the same rate as first runs. There is no rerun discount. Per-minute rounding means a 10-second retry job is billed as 1 full minute.
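Because rounding happens per job, short jobs carry disproportionate overhead. A quick sketch of the round-up, using arbitrary example durations:

```shell
# Billed minutes = job duration rounded up to the next full minute
for secs in 10 61 120 305; do
  awk -v s="$secs" 'BEGIN { printf "%4ds -> billed %d min\n", s, int((s + 59) / 60) }'
done
# ->   10s -> billed 1 min
#      61s -> billed 2 min
#     120s -> billed 2 min
#     305s -> billed 6 min
```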
Related guides
Flaky Tests Cost Real Money
Quarantine non-deterministic tests and stop paying for false failures.
Reduce CI Timeouts
Set explicit timeout-minutes on every job to stop paying for hung processes.
Stabilize CI Runtime
Fix cache misses and queue spikes that cause unpredictable pipeline durations.
Canceled Runs Wasting Minutes
Use concurrency groups to auto-cancel superseded runs and stop wasting minutes.