Reruns & flakiness
Stabilize CI runtime variance
By Keith Mazanec, Founder, CostOps · Updated January 31, 2026
A developer opens a PR. CI finishes in 8 minutes. They push a follow-up commit. This time it takes 34 minutes with the same code, the same workflow, and the same branch. The pipeline didn't get harder; something in the environment changed. You're billed for every minute of that 34-minute run, and the developer just lost half an hour waiting for feedback. When your p90 duration is 3–4× your p50, the problem isn't your code. It's your CI infrastructure, caching, or queue behavior.
Symptoms
How to tell if CI runtime variance is costing you
Runtime variance is harder to spot than outright failures because each individual run looks fine. The problem only shows up in aggregate. Look for these patterns:
- **High p90/p50 duration ratio.** If your p90 pipeline duration is at least 2.5× your p50, a significant share of runs take far longer than the median. A ratio of 1.3–1.5 is normal. Above 2.5 means something non-deterministic is inflating tail runs.
- **Unpredictable queue-time spikes.** Jobs sit queued for 5–15 minutes on some runs but start instantly on others. This happens when runner demand exceeds capacity, when self-hosted runners have scale-up lag, or when GitHub-hosted runner pools are under pressure in your region.
- **Cache hit/miss lottery.** The same workflow takes 6 minutes with a warm cache and 25 minutes without. If your cache keys are unstable or your branch isolation prevents cache sharing, every new branch starts cold, and you pay for the full dependency install.
- **Developer re-triggers on slow runs.** When CI is unpredictably slow, developers cancel and re-push to "try again." This doubles the run count for those PRs, inflating both minutes and cost while giving the impression of a busier pipeline.
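As a minimal sketch of the first symptom check, the p90/p50 ratio can be computed from a list of run durations. The durations below are hypothetical; the index math mirrors the jq query used later in Fix 1.

```python
# Hypothetical durations (minutes) for recent successful runs of one workflow
durations = [7.1, 6.5, 24.9, 7.0, 6.2, 7.3, 21.4, 7.5, 8.0, 6.8]

vals = sorted(durations)
p50 = vals[len(vals) // 2]        # median-ish value: 7.3
p90 = vals[len(vals) * 9 // 10]   # tail value: 24.9
ratio = p90 / p50                 # ~3.4, well above the 2.5 alarm threshold
```

Two warm-cache tail runs are enough to push the ratio past 2.5 even though eight of ten runs look perfectly healthy, which is why the median alone hides the problem.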
Metrics
What unstable runtime actually costs
The cost of variance comes from two places: the inflated tail runs themselves, and the behavioral tax that follows when developers re-trigger runs, lose context while waiting, and push smaller changes to get faster feedback. Here's a typical scenario for a team running 40 CI runs/day on Linux:
| Scenario | Monthly cost |
|---|---|
| Unstable (p90/p50 = 3.5×) | 40 runs/day × 22 days × 18 min avg × $0.006/min ≈ $95/mo |
| Stabilized (p90/p50 = 1.4×) | ≈ $58/mo, saving $37/mo ($444/year) per workflow |
That's one workflow on Linux at $0.006/min. On macOS runners at $0.062/min, the same variance inflates to $983/mo versus $601/mo stabilized, saving $382/mo from reducing variance alone. This doesn't account for the extra re-triggered runs developers create when CI is unpredictably slow.
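The arithmetic behind these figures can be reproduced directly. The ~11-minute stabilized average duration is an assumption consistent with the quoted savings, not a number from billing data:

```python
RUNS_PER_DAY, DAYS = 40, 22
UNSTABLE_AVG_MIN = 18      # average run length when tail runs inflate the mean
STABILIZED_AVG_MIN = 11    # assumed average after variance is removed

def monthly_cost(rate_per_min, avg_min):
    # Total billed minutes per month times the per-minute rate
    return RUNS_PER_DAY * DAYS * avg_min * rate_per_min

linux_unstable = monthly_cost(0.006, UNSTABLE_AVG_MIN)                     # ~$95/mo
linux_saved = linux_unstable - monthly_cost(0.006, STABILIZED_AVG_MIN)     # ~$37/mo
macos_saved = (monthly_cost(0.062, UNSTABLE_AVG_MIN)
               - monthly_cost(0.062, STABILIZED_AVG_MIN))                  # ~$382/mo
```

The savings scale linearly with the per-minute rate, which is why the identical variance pattern costs roughly 10× more on macOS runners than on Linux.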
Fix 1
Measure duration by workflow and identify outliers
Before fixing anything, identify which workflows have the highest p90/p50 ratio. The variance almost always concentrates in one or two workflows. Use the GitHub API to pull run durations and compare percentiles per workflow. Workflow context variables like github.run_id and timestamps from the API make this straightforward. A quick gh CLI query gets you the data:
```sh
# Pull the last 100 runs for a workflow and compute duration stats
gh run list --workflow ci.yml --limit 100 \
    --json databaseId,createdAt,updatedAt,conclusion \
  | jq '[.[] | select(.conclusion == "success")
              | {id: .databaseId,
                 duration_min: ((.updatedAt | fromdateiso8601)
                                - (.createdAt | fromdateiso8601)) / 60}]
        | sort_by(.duration_min)
        | {p50: .[(length / 2 | floor)].duration_min,
           p90: .[(length * 9 / 10 | floor)].duration_min}
        | . + {ratio: (.p90 / .p50)}'
```
Or skip the scripting. CostOps computes p50, p90, and p90/p50 ratios per workflow automatically, so you can spot unstable pipelines without parsing API timestamps yourself.
A ratio above 2.5 warrants investigation. Once you know which workflow is unstable, dig into the individual runs with the highest durations. Check whether the slow runs correlate with cache misses, specific runner types, or time-of-day patterns (which suggest queue congestion).
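One quick way to test the time-of-day hypothesis is to bucket run durations by the hour they started. This is a sketch over hypothetical `(created_at, duration)` pairs pulled from the runs API; a single hot hour with an inflated average points at queue congestion rather than cache or hardware effects:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, duration_min) pairs from the workflow runs API
runs = [
    ("2026-01-05T09:12:00Z", 7.0), ("2026-01-05T10:03:00Z", 8.1),
    ("2026-01-05T10:41:00Z", 26.0), ("2026-01-05T10:55:00Z", 24.5),
    ("2026-01-05T14:20:00Z", 7.4), ("2026-01-05T15:02:00Z", 6.9),
]

by_hour = defaultdict(list)
for created_at, duration in runs:
    hour = datetime.fromisoformat(created_at.replace("Z", "+00:00")).hour
    by_hour[hour].append(duration)

# Average duration per start hour; the worst hour is the congestion suspect
avg_by_hour = {h: sum(v) / len(v) for h, v in by_hour.items()}
worst_hour = max(avg_by_hour, key=avg_by_hour.get)
```

If durations are uniformly bad regardless of hour, look at cache hit rates or runner hardware instead.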
Fix 2
Stabilize cache hit rates across branches
Cache misses are the most common cause of runtime variance. GitHub Actions restricts cache access by branch: a feature branch can read caches from the base branch (main), but not from other feature branches. If your main cache is stale or your cache key changes on every commit, feature branches start cold every time.
The fix is a two-part cache strategy: use stable keys based on lockfiles, and add restore-keys fallbacks so branches always find a usable cache even when the exact key misses.
```yaml
# Bad: key changes on every commit, so every run starts cold
- uses: actions/cache@v4
  with:
    path: node_modules
    key: deps-${{ github.sha }}   # misses every commit
    # no restore-keys fallback
```

```yaml
# Good: stable key from the lockfile; hits on any lockfile match
- uses: actions/cache@v4
  with:
    path: node_modules
    key: deps-${{ hashFiles('package-lock.json') }}
    # Falls back to the most recent cache with a deps- prefix
    restore-keys: |
      deps-
```
For workflows where the main branch cache may be stale (e.g., long-running feature branches), add a scheduled or post-merge job that warms the cache on main. This ensures every feature branch has a recent cache to restore from:
```yaml
name: Warm dependency cache
on:
  push:
    branches: [main]
    paths:
      - 'package-lock.json'
      - 'yarn.lock'
jobs:
  warm-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: node_modules
          key: deps-${{ hashFiles('package-lock.json') }}
      - run: npm ci
```
One caveat: GitHub evicts caches not accessed in 7 days, and the total cache storage per repository is capped at 10 GB. If you have many workflows competing for cache space, less-used caches may be evicted mid-week, reintroducing variance. Split caches by purpose (dependencies vs. build artifacts) and keep keys stable to avoid needless eviction.
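To spot caches approaching the 7-day idle window before they vanish, you can inspect the repository cache list from the REST API (`GET /repos/{owner}/{repo}/actions/caches`). The response shape below follows that endpoint, but the sample entries are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical entries mirroring the actions_caches array from the REST API
caches = [
    {"key": "deps-abc123", "last_accessed_at": "2026-01-29T08:00:00Z"},
    {"key": "build-old",   "last_accessed_at": "2026-01-24T08:00:00Z"},
]

def eviction_risk(caches, now, horizon_days=2):
    # Flag caches that will cross GitHub's 7-day idle eviction window
    # within the next horizon_days
    at_risk = []
    for c in caches:
        last = datetime.fromisoformat(c["last_accessed_at"].replace("Z", "+00:00"))
        if now - last >= timedelta(days=7 - horizon_days):
            at_risk.append(c["key"])
    return at_risk

now = datetime(2026, 1, 31, 8, 0, tzinfo=timezone.utc)
at_risk = eviction_risk(caches, now)   # caches a warm-cache job should touch
```

Running a check like this weekly (or re-warming the flagged keys) prevents the mid-week cold starts that reintroduce variance.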
Fix 3
Reduce queue-time spikes
Queue time, the gap between a job being requested and a runner picking it up, is one of the least visible causes of variance. On GitHub-hosted runners, queue time depends on runner pool availability in your region and how many concurrent jobs you're requesting. On self-hosted runners, it depends on your autoscaler's scale-up latency.
Queue time isn't directly billable (billing starts when the runner picks up the job), but it inflates wall-clock duration, frustrates developers, and correlates with re-triggers, which are billable. Adding CI timeouts ensures slow runs don't spiral into hours of waste. Two changes reduce queue spikes:
```yaml
# 1. Auto-cancel superseded runs to free up runner capacity
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

jobs:
  test:
    runs-on: ubuntu-latest
    # 2. Reduce matrix dimensions on PRs to lower concurrent job demand
    strategy:
      matrix:
        node: ${{ github.event_name == 'pull_request' && fromJSON('[18]') || fromJSON('[16, 18, 20]') }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
```
Concurrency groups free up runner slots by cancelling stale jobs. Reducing matrix dimensions on PRs lowers the number of concurrent jobs competing for runners. Together, these reduce the peak runner demand that causes queue spikes. For self-hosted runners, ensure your autoscaler provisions instances before demand peaks. Most Kubernetes-based runners (like actions-runner-controller) support pre-warming a minimum replica count during business hours.
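Queue time itself is easy to measure: the workflow runs API exposes both `created_at` and `run_started_at`, and the difference is the time a run sat waiting for a runner. A sketch over hypothetical run objects:

```python
from datetime import datetime

# Hypothetical run objects; both timestamp fields come from
# GET /repos/{owner}/{repo}/actions/runs
runs = [
    {"created_at": "2026-01-30T10:00:00Z", "run_started_at": "2026-01-30T10:00:05Z"},
    {"created_at": "2026-01-30T10:05:00Z", "run_started_at": "2026-01-30T10:17:00Z"},
    {"created_at": "2026-01-30T11:00:00Z", "run_started_at": "2026-01-30T11:00:08Z"},
]

def queue_seconds(run):
    # Time between the run being requested and a runner picking it up
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    return (parse(run["run_started_at"]) - parse(run["created_at"])).total_seconds()

queues = sorted(queue_seconds(r) for r in runs)
worst_minutes = queues[-1] / 60   # the 12-minute outlier in this sample
```

If the worst queue times cluster at specific hours, the concurrency and matrix changes above are the right lever; if they are random, look at autoscaler scale-up latency instead.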
Fix 4
Pin runner images and tool versions
GitHub-hosted runner images are updated weekly. When a new image rolls out, pre-installed tools may change versions, cached layers may become invalid, and performance characteristics can shift. A benchmark study by Andrey Akinshin found that GitHub Actions runner performance can vary by 10–30% between runs on the same nominal runner type, partly due to the underlying hardware mix in GitHub's pool (different CPU generations, varying clock speeds).
You can't control which physical machine you get, but you can reduce the environmental variance by pinning your runner image and tool versions:
```yaml
jobs:
  test:
    # Pin to a specific Ubuntu version instead of -latest
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Pin tool versions explicitly; the setup action's built-in
      # caching handles key generation and restore from the lockfile
      - uses: actions/setup-node@v4
        with:
          node-version: '20.11.0'   # exact version, not '20' or '20.x'
          cache: 'npm'
      - run: npm ci
      - run: npm test
```
Pinning ubuntu-24.04 instead of ubuntu-latest prevents unexpected image changes. Pinning exact tool versions (e.g., 20.11.0 instead of 20) avoids the scenario where a minor version bump changes build output or cache compatibility. Using built-in caching from actions/setup-node (or equivalent for Python, Ruby, Go) handles cache key generation based on your lockfile, so you don't need to manage cache keys manually.
One caveat: pinning means you need to update versions deliberately. Set a monthly reminder or use Dependabot to keep runner images and action versions current. The tradeoff is worth it because you trade surprise variance for planned, testable updates.
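For the Dependabot route, a config along these lines keeps pinned action versions current on a predictable cadence (a sketch; Dependabot's `github-actions` ecosystem updates action refs, while runner image pins like `ubuntu-24.04` still need the manual reminder):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "monthly"   # matches the monthly review cadence above
```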
Reference
CI variance diagnostic checklist
Use this checklist to systematically identify and address the sources of duration variance in your pipelines. Each row maps a variance source to its observable signal and the corresponding fix.
| Variance source | Signal | Fix |
|---|---|---|
| Cache misses | Slow runs correlate with cache-hit: false | Stable keys + restore-keys fallbacks |
| Queue time spikes | High gap between created_at and started_at | Concurrency groups + smaller matrices on PRs |
| Runner image drift | Variance spikes after weekly image update | Pin ubuntu-24.04 + exact tool versions |
| Hardware lottery | Same job ±30% between runs, no other changes | Use larger runners for CPU-bound jobs (less variance per vCPU) |
| Flaky tests | Fail → pass on retry adds rerun minutes | Quarantine + fix non-deterministic tests |
| Network variability | Dependency download times vary 2–10× | Cache dependencies; avoid network fetches in hot paths |
Start with the highest-impact source. In most repositories, cache behavior and queue time account for 60–80% of duration variance. Runner image drift and hardware lottery are secondary but become significant for CPU-intensive builds or macOS runners (where the hardware pool is smaller and more heterogeneous).
Related guides
- Flaky Tests Cost Real Money: Detect flake patterns, quarantine non-deterministic tests, and cut retry waste 50–80%.
- Set Timeout-Minutes to Stop Paying for Hung Jobs: Cap wasted CI spend with explicit timeout-minutes on every job and step.
- Fix Dependency Cache Misses: Fix actions/cache key patterns and restore strategies to cut CI install time 40–70%.
- Find and Fix CI Rerun Hotspots: Identify workflows that drive most rerun spend and add targeted retries to cut waste 40–60%.