Reruns & flakiness
Stabilize CI runtime variance
By Keith Mazanec, Founder, CostOps · Updated January 31, 2026
A developer opens a PR. CI finishes in 8 minutes. They push a follow-up commit. This time it takes 34 minutes with the same code, the same workflow, and the same branch. The pipeline didn't get harder; something in the environment changed. You're billed for every minute of that 34-minute run, and the developer just lost half an hour waiting for feedback. When your p90 duration is 3–4× your p50, the problem isn't your code. It's your CI infrastructure, caching, or queue behavior.
Symptoms
How to tell if CI runtime variance is costing you
Runtime variance is harder to spot than outright failures because each individual run looks fine. The problem only shows up in aggregate. Look for these patterns:
- **High p90/p50 duration ratio.** If your p90 pipeline duration is at least 2.5× your p50, a significant share of runs take far longer than the median. A ratio of 1.3–1.5 is normal. Above 2.5 means something non-deterministic is inflating tail runs.
- **Unpredictable queue-time spikes.** Jobs sit queued for 5–15 minutes on some runs but start instantly on others. This happens when runner demand exceeds capacity, when self-hosted runners have scale-up lag, or when GitHub-hosted runner pools are under pressure in your region.
- **Cache hit/miss lottery.** The same workflow takes 6 minutes with a warm cache and 25 minutes without. If your cache keys are unstable or your branch isolation prevents cache sharing, every new branch starts cold, and you pay for the full dependency install.
- **Developer re-triggers on slow runs.** When CI is unpredictably slow, developers cancel and re-push to "try again." This doubles the run count for those PRs, inflating both minutes and cost while giving the impression of a busier pipeline.
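As a minimal sketch of the first symptom check, the p90/p50 ratio can be computed from a list of run durations. The durations below are hypothetical; the index math mirrors the jq query used later in Fix 1.

```python
# Hypothetical durations (minutes) for recent successful runs of one workflow
durations = [7.1, 6.5, 24.9, 7.0, 6.2, 7.3, 21.4, 7.5, 8.0, 6.8]

vals = sorted(durations)
p50 = vals[len(vals) // 2]        # median-ish value: 7.3
p90 = vals[len(vals) * 9 // 10]   # tail value: 24.9
ratio = p90 / p50                 # ~3.4, well above the 2.5 alarm threshold
```

Two warm-cache tail runs are enough to push the ratio past 2.5 even though eight of ten runs look perfectly healthy, which is why the median alone hides the problem.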
Metrics
What unstable runtime actually costs
The cost of variance comes from two places: the inflated tail runs themselves, and the behavioral tax that follows when developers re-trigger runs, lose context while waiting, and push smaller changes to get faster feedback. Here's a typical scenario for a team running 40 CI runs/day on Linux:
| Scenario | Monthly cost |
|---|---|
| Unstable (p90/p50 = 3.5×) | 40 runs/day × 22 days × 18 min avg × $0.006/min ≈ $95/mo |
| Stabilized (p90/p50 = 1.4×) | ≈ $58/mo, saving $37/mo ($444/year) per workflow |
That's one workflow on Linux at $0.006/min. On macOS runners at $0.062/min, the same variance inflates to $983/mo versus $601/mo stabilized, saving $382/mo from reducing variance alone. This doesn't account for the extra re-triggered runs developers create when CI is unpredictably slow.
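The arithmetic behind these figures can be reproduced directly. The ~11-minute stabilized average duration is an assumption consistent with the quoted savings, not a number from billing data:

```python
RUNS_PER_DAY, DAYS = 40, 22
UNSTABLE_AVG_MIN = 18      # average run length when tail runs inflate the mean
STABILIZED_AVG_MIN = 11    # assumed average after variance is removed

def monthly_cost(rate_per_min, avg_min):
    # Total billed minutes per month times the per-minute rate
    return RUNS_PER_DAY * DAYS * avg_min * rate_per_min

linux_unstable = monthly_cost(0.006, UNSTABLE_AVG_MIN)                     # ~$95/mo
linux_saved = linux_unstable - monthly_cost(0.006, STABILIZED_AVG_MIN)     # ~$37/mo
macos_saved = (monthly_cost(0.062, UNSTABLE_AVG_MIN)
               - monthly_cost(0.062, STABILIZED_AVG_MIN))                  # ~$382/mo
```

The savings scale linearly with the per-minute rate, which is why the identical variance pattern costs roughly 10× more on macOS runners than on Linux.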
Fix 1
Measure duration by workflow and identify outliers
Before fixing anything, identify which workflows have the highest p90/p50 ratio. The variance almost always concentrates in one or two workflows. Use the GitHub API to pull run durations and compare percentiles per workflow. Workflow context variables like github.run_id and timestamps from the API make this straightforward. A quick gh CLI query gets you the data:
```sh
# Pull the last 100 runs for a workflow and compute duration stats
gh run list --workflow ci.yml --limit 100 \
    --json databaseId,createdAt,updatedAt,conclusion \
  | jq '[.[] | select(.conclusion == "success")
              | {id: .databaseId,
                 duration_min: ((.updatedAt | fromdateiso8601)
                                - (.createdAt | fromdateiso8601)) / 60}]
        | sort_by(.duration_min)
        | {p50: .[(length / 2 | floor)].duration_min,
           p90: .[(length * 9 / 10 | floor)].duration_min}
        | . + {ratio: (.p90 / .p50)}'
```
Or skip the scripting. CostOps computes p50, p90, and p90/p50 ratios per workflow automatically, so you can spot unstable pipelines without parsing API timestamps yourself.
A ratio above 2.5 warrants investigation. Once you know which workflow is unstable, dig into the individual runs with the highest durations. Check whether the slow runs correlate with cache misses, specific runner types, or time-of-day patterns (which suggest queue congestion).
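One quick way to test the time-of-day hypothesis is to bucket run durations by the hour they started. This is a sketch over hypothetical `(created_at, duration)` pairs pulled from the runs API; a single hot hour with an inflated average points at queue congestion rather than cache or hardware effects:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, duration_min) pairs from the workflow runs API
runs = [
    ("2026-01-05T09:12:00Z", 7.0), ("2026-01-05T10:03:00Z", 8.1),
    ("2026-01-05T10:41:00Z", 26.0), ("2026-01-05T10:55:00Z", 24.5),
    ("2026-01-05T14:20:00Z", 7.4), ("2026-01-05T15:02:00Z", 6.9),
]

by_hour = defaultdict(list)
for created_at, duration in runs:
    hour = datetime.fromisoformat(created_at.replace("Z", "+00:00")).hour
    by_hour[hour].append(duration)

# Average duration per start hour; the worst hour is the congestion suspect
avg_by_hour = {h: sum(v) / len(v) for h, v in by_hour.items()}
worst_hour = max(avg_by_hour, key=avg_by_hour.get)
```

If durations are uniformly bad regardless of hour, look at cache hit rates or runner hardware instead.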
Fix 2
Stabilize cache hit rates across branches
Cache misses are the most common cause of runtime variance. GitHub Actions restricts cache access by branch: a feature branch can read caches from the base branch (main), but not from other feature branches. If your main cache is stale or your cache key changes on every commit, feature branches start cold every time.
The fix is a two-part cache strategy: use stable keys based on lockfiles, and add restore-keys fallbacks so branches always find a usable cache even when the exact key misses.
```yaml
# Bad: key changes on every commit, so every run starts cold
- uses: actions/cache@v4
  with:
    path: node_modules
    key: deps-${{ github.sha }}   # misses every commit
    # no restore-keys fallback
```

```yaml
# Good: stable key from the lockfile; hits on any lockfile match
- uses: actions/cache@v4
  with:
    path: node_modules
    key: deps-${{ hashFiles('package-lock.json') }}
    # Falls back to the most recent cache with a deps- prefix
    restore-keys: |
      deps-
```
For workflows where the main branch cache may be stale (e.g., long-running feature branches), add a scheduled or post-merge job that warms the cache on main. This ensures every feature branch has a recent cache to restore from:
```yaml
name: Warm dependency cache
on:
  push:
    branches: [main]
    paths:
      - 'package-lock.json'
      - 'yarn.lock'
jobs:
  warm-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: node_modules
          key: deps-${{ hashFiles('package-lock.json') }}
      - run: npm ci
```
One caveat: GitHub evicts caches not accessed in 7 days, and the total cache storage per repository is capped at 10 GB. If you have many workflows competing for cache space, less-used caches may be evicted mid-week, reintroducing variance. Split caches by purpose (dependencies vs. build artifacts) and keep keys stable to avoid needless eviction.
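To spot caches approaching the 7-day idle window before they vanish, you can inspect the repository cache list from the REST API (`GET /repos/{owner}/{repo}/actions/caches`). The response shape below follows that endpoint, but the sample entries are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical entries mirroring the actions_caches array from the REST API
caches = [
    {"key": "deps-abc123", "last_accessed_at": "2026-01-29T08:00:00Z"},
    {"key": "build-old",   "last_accessed_at": "2026-01-24T08:00:00Z"},
]

def eviction_risk(caches, now, horizon_days=2):
    # Flag caches that will cross GitHub's 7-day idle eviction window
    # within the next horizon_days
    at_risk = []
    for c in caches:
        last = datetime.fromisoformat(c["last_accessed_at"].replace("Z", "+00:00"))
        if now - last >= timedelta(days=7 - horizon_days):
            at_risk.append(c["key"])
    return at_risk

now = datetime(2026, 1, 31, 8, 0, tzinfo=timezone.utc)
at_risk = eviction_risk(caches, now)   # caches a warm-cache job should touch
```

Running a check like this weekly (or re-warming the flagged keys) prevents the mid-week cold starts that reintroduce variance.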
Fix 3
Reduce queue-time spikes
Queue time, the gap between a job being requested and a runner picking it up, is one of the least visible causes of variance. On GitHub-hosted runners, queue time depends on runner pool availability in your region and how many concurrent jobs you're requesting. On self-hosted runners, it depends on your autoscaler's scale-up latency.
Queue time isn't directly billable (billing starts when the runner picks up the job), but it inflates wall-clock duration, frustrates developers, and correlates with re-triggers, which are billable. Adding CI timeouts ensures slow runs don't spiral into hours of waste. Two changes reduce queue spikes:
```yaml
# 1. Auto-cancel superseded runs to free up runner capacity
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

jobs:
  test:
    runs-on: ubuntu-latest
    # 2. Reduce matrix dimensions on PRs to lower concurrent job demand
    strategy:
      matrix:
        node: ${{ github.event_name == 'pull_request' && fromJSON('[18]') || fromJSON('[16, 18, 20]') }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
```
Concurrency groups free up runner slots by cancelling stale jobs. Reducing matrix dimensions on PRs lowers the number of concurrent jobs competing for runners. Together, these reduce the peak runner demand that causes queue spikes. For self-hosted runners, ensure your autoscaler provisions instances before demand peaks. Most Kubernetes-based runners (like actions-runner-controller) support pre-warming a minimum replica count during business hours.
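Queue time itself is easy to measure: the workflow runs API exposes both `created_at` and `run_started_at`, and the difference is the time a run sat waiting for a runner. A sketch over hypothetical run objects:

```python
from datetime import datetime

# Hypothetical run objects; both timestamp fields come from
# GET /repos/{owner}/{repo}/actions/runs
runs = [
    {"created_at": "2026-01-30T10:00:00Z", "run_started_at": "2026-01-30T10:00:05Z"},
    {"created_at": "2026-01-30T10:05:00Z", "run_started_at": "2026-01-30T10:17:00Z"},
    {"created_at": "2026-01-30T11:00:00Z", "run_started_at": "2026-01-30T11:00:08Z"},
]

def queue_seconds(run):
    # Time between the run being requested and a runner picking it up
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    return (parse(run["run_started_at"]) - parse(run["created_at"])).total_seconds()

queues = sorted(queue_seconds(r) for r in runs)
worst_minutes = queues[-1] / 60   # the 12-minute outlier in this sample
```

If the worst queue times cluster at specific hours, the concurrency and matrix changes above are the right lever; if they are random, look at autoscaler scale-up latency instead.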
Fix 4
Pin runner images and tool versions
GitHub-hosted runner images are updated weekly. When a new image rolls out, pre-installed tools may change versions, cached layers may become invalid, and performance characteristics can shift. A benchmark study by Andrey Akinshin found that GitHub Actions runner performance can vary by 10–30% between runs on the same nominal runner type, partly due to the underlying hardware mix in GitHub's pool (different CPU generations, varying clock speeds).
You can't control which physical machine you get, but you can reduce the environmental variance by pinning your runner image and tool versions:
```yaml
jobs:
  test:
    # Pin to a specific Ubuntu version instead of -latest
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Pin tool versions explicitly; the setup action's built-in
      # caching handles key generation and restore from the lockfile
      - uses: actions/setup-node@v4
        with:
          node-version: '20.11.0'   # exact version, not '20' or '20.x'
          cache: 'npm'
      - run: npm ci
      - run: npm test
```
Pinning ubuntu-24.04 instead of ubuntu-latest prevents unexpected image changes. Pinning exact tool versions (e.g., 20.11.0 instead of 20) avoids the scenario where a minor version bump changes build output or cache compatibility. Using built-in caching from actions/setup-node (or equivalent for Python, Ruby, Go) handles cache key generation based on your lockfile, so you don't need to manage cache keys manually.
One caveat: pinning means you need to update versions deliberately. Set a monthly reminder or use Dependabot to keep runner images and action versions current. The tradeoff is worth it because you trade surprise variance for planned, testable updates.
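For the Dependabot route, a config along these lines keeps pinned action versions current on a predictable cadence (a sketch; Dependabot's `github-actions` ecosystem updates action refs, while runner image pins like `ubuntu-24.04` still need the manual reminder):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "monthly"   # matches the monthly review cadence above
```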
Reference
CI variance diagnostic checklist
Use this checklist to systematically identify and address the sources of duration variance in your pipelines. Each row maps a variance source to its observable signal and the corresponding fix.
| Variance source | Signal | Fix |
|---|---|---|
| Cache misses | Slow runs correlate with cache-hit: false | Stable keys + restore-keys fallbacks |
| Queue time spikes | High gap between created_at and started_at | Concurrency groups + smaller matrices on PRs |
| Runner image drift | Variance spikes after weekly image update | Pin ubuntu-24.04 + exact tool versions |
| Hardware lottery | Same job ±30% between runs, no other changes | Use larger runners for CPU-bound jobs (less variance per vCPU) |
| Flaky tests | Fail → pass on retry adds rerun minutes | Quarantine + fix non-deterministic tests |
| Network variability | Dependency download times vary 2–10× | Cache dependencies; avoid network fetches in hot paths |
Start with the highest-impact source. In most repositories, cache behavior and queue time account for 60–80% of duration variance. Runner image drift and hardware lottery are secondary but become significant for CPU-intensive builds or macOS runners (where the hardware pool is smaller and more heterogeneous).
Related guides
- Flaky Tests Cost Real Money: Detect flake patterns, quarantine non-deterministic tests, and cut retry waste 50–80%.
- Set Timeout-Minutes to Stop Paying for Hung Jobs: Cap wasted CI spend with explicit timeout-minutes on every job and step.
- Fix Dependency Cache Misses: Fix actions/cache key patterns and restore strategies to cut CI install time 40–70%.
- Find and Fix CI Rerun Hotspots: Identify workflows that drive most rerun spend and add targeted retries to cut waste 40–60%.