A practitioner’s guide to calculating your team’s CFR without a vendor platform — the DORA formula, the SQL, and the AI-assisted vs human-authored split nobody is publishing yet.


Cortex’s 2026 benchmark says change failure rate has risen about 30% industry-wide since AI coding adoption accelerated. The number has been quoted in every engineering newsletter I read. It keeps showing up in LinkedIn posts. I cited it myself in my last post.

Here’s the uncomfortable follow-up question: what’s yours?

Most platform teams I’ve worked with couldn’t give me a number. They could estimate a direction — “it feels worse lately” — but the actual percentage wasn’t written down anywhere. And without a number, the 30% headline is just other people’s data. You can’t improve what you haven’t measured.

This post walks through how to compute your team’s CFR in an afternoon using data you already have, and how to split it in a way nobody is doing yet: AI-assisted PRs vs. human-authored. You don’t need a vendor platform for any of this.

What CFR actually measures

DORA’s definition, lifted from the source:

The percentage of changes to production or releases to users that result in degraded service and subsequently require remediation — a hotfix, rollback, fix forward, or patch.

That’s the whole thing. Three details matter, and they’re the ones most vendor posts get slightly wrong.

Only production counts. A test that fails in CI isn’t a change failure. A canary that catches a bad deploy before it reaches real users isn’t one either. If your release engineering is working, a lot of would-be failures never count — which is the point.

Remediation has to happen. A deployment that’s merely suboptimal isn’t a failure. The question is whether it needed a rollback, hotfix, fix-forward, or patch after the fact. “We wrote a Jira ticket” isn’t remediation; “we pushed another deploy to fix the first one” is.

The denominator is changes, not deployments. If you push three deploys and two of them are fix-only remediations of the first, you made one change, not three. Fix-only deploys come out of both the numerator and the denominator — they are neither new changes nor new failures in the sense CFR measures.

So:

Number of changes   = Production deployments − Fix-only deployments
CFR                 = Failed changes ÷ Number of changes
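To make the arithmetic concrete, here is a worked example with made-up numbers: 120 production deploys over the quarter, 20 of them fix-only remediations, and 8 of the remaining changes needing remediation of their own.

Number of changes   = 120 − 20 = 100
CFR                 = 8 ÷ 100  = 8%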

DORA’s 2025 report found that about 16.7% of teams maintain CFR at 4% or below — that’s the elite band. Most teams sit well above it.

The 90-minute version

You need three things, all of which you already have somewhere.

1. A list of production deployments. From your CI (GitHub Actions, GitLab CI, Jenkins, CircleCI, Argo), filtered to production environment only, successful runs only. Most of these systems have an API or a database you can query. If you can get deployment_id, service, deployed_at, and commit_sha, you’re set.

2. A list of production incidents. From PagerDuty, Opsgenie, Incident.io, your internal spreadsheet — wherever your on-call logs live. Filter to anything that required an engineering response. You want incident_id, service, started_at, and ideally the SHA or deployment that was identified as the root cause.

3. A rule for joining them. The simplest rule that works: an incident “belongs to” a deployment if the incident started within some window after the deployment, on the same service. A 24-hour window is standard; some teams use 48 hours for services with slow-burn failure modes. This isn’t causal attribution — it’s a proxy, and it’s close enough.

Here’s the shape of the query once both datasets are in the same place:

WITH changes AS (
  SELECT
    deployment_id,
    service,
    deployed_at,
    commit_sha
  FROM deployments
  WHERE environment = 'production'
    AND result = 'success'
    AND deployed_at >= '2026-01-01'
    AND NOT is_fix_only  -- exclude rollbacks/hotfixes
),
failed_changes AS (
  SELECT DISTINCT c.deployment_id
  FROM changes c
  JOIN incidents i
    ON i.service = c.service
    AND i.started_at BETWEEN c.deployed_at AND c.deployed_at + INTERVAL '24 hours'
)
SELECT
  COUNT(DISTINCT c.deployment_id)                       AS total_changes,
  COUNT(DISTINCT fc.deployment_id)                      AS failed_changes,
  COUNT(DISTINCT fc.deployment_id) * 1.0
    / COUNT(DISTINCT c.deployment_id)                   AS change_failure_rate
FROM changes c
LEFT JOIN failed_changes fc USING (deployment_id);

If your deployment tool doesn’t track is_fix_only, the practical workaround is a convention — require engineers to prefix fix-only PRs with fix: or tag them with a fix-only label, and filter on that. The data gets better once you start asking for it.
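As a sketch of what that filter looks like once the convention exists: assuming the labels are mirrored into a pr_labels table keyed by commit SHA (the table and its columns are illustrative, not something your CI gives you out of the box), the AND NOT is_fix_only line in the changes CTE becomes a subquery:

    -- replaces `AND NOT is_fix_only`; pr_labels is an assumed mirror table
    AND commit_sha NOT IN (
      SELECT commit_sha
      FROM pr_labels
      WHERE label = 'fix-only'
    )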

Run the query over the last 90 days. That’s your CFR. Shorter windows are too volatile to trust; much longer ones are slow to reflect changes in how you ship.
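If you’d rather not hard-code a start date, the date filter in the changes CTE can roll with the calendar instead (Postgres-style interval arithmetic shown; adjust for your warehouse):

    -- replaces `AND deployed_at >= '2026-01-01'` in the changes CTE
    AND deployed_at >= CURRENT_DATE - INTERVAL '90 days'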

The cut nobody is making yet

Here’s where it gets interesting. The Cortex 30% number is an aggregate. It tells you the industry has gotten worse. It doesn’t tell you which of your PRs are driving your team’s number.

You can find out.

Tag your PRs. There are several reasonable ways:

  • PR label. Add an ai-assisted label manually at review time. Lowest overhead, most honest, relies on the author.
  • PR template checkbox. “Did you use AI coding tools in this PR?” as a checkbox that a small bot reads and labels accordingly. Works well for teams with a review culture that already uses templates.
  • Commit trailer. AI-Assisted: yes or a Co-authored-by: ... line pointing at a bot account. Survives rebases and is machine-readable.
  • Tool-reported attribution. Some tooling (Git AI’s open standard on Git Notes is a good example) can record which ranges of a diff were model-authored at the source, before the PR is even opened. Heavier setup, higher fidelity.

Any of these is fine. The worst option is to defer tagging until you “find the right platform.” Pick a convention, write it down, roll it out on Monday.
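Whichever convention you pick, the split below assumes the tags end up somewhere queryable. One minimal sketch, assuming the commit-trailer convention and a commits table in your warehouse that carries the full message text (both are assumptions about your setup, not a given):

-- ai_assisted_prs is the table the split queries below expect;
-- commits is an assumed warehouse table with the full commit message.
CREATE VIEW ai_assisted_prs AS
SELECT commit_sha AS sha
FROM commits
WHERE message LIKE '%AI-Assisted: yes%';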

Once PRs are tagged, split the CFR query two ways:

-- AI-assisted PRs
SELECT ... FROM changes c
WHERE c.commit_sha IN (SELECT sha FROM ai_assisted_prs)
-- Human-authored PRs
SELECT ... FROM changes c
WHERE c.commit_sha NOT IN (SELECT sha FROM ai_assisted_prs)
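If you’d rather compute both numbers in one pass, here’s a minimal sketch that adds the tag to the original query and groups by it, again assuming an ai_assisted_prs table or view with a sha column:

WITH changes AS (
  SELECT
    deployment_id,
    service,
    deployed_at,
    commit_sha,
    -- tag each change using whatever convention feeds ai_assisted_prs
    commit_sha IN (SELECT sha FROM ai_assisted_prs) AS ai_assisted
  FROM deployments
  WHERE environment = 'production'
    AND result = 'success'
    AND deployed_at >= CURRENT_DATE - INTERVAL '90 days'
    AND NOT is_fix_only
),
failed_changes AS (
  SELECT DISTINCT c.deployment_id
  FROM changes c
  JOIN incidents i
    ON i.service = c.service
    AND i.started_at BETWEEN c.deployed_at AND c.deployed_at + INTERVAL '24 hours'
)
SELECT
  c.ai_assisted,
  COUNT(DISTINCT c.deployment_id)                       AS total_changes,
  COUNT(DISTINCT fc.deployment_id)                      AS failed_changes,
  COUNT(DISTINCT fc.deployment_id) * 1.0
    / COUNT(DISTINCT c.deployment_id)                   AS change_failure_rate
FROM changes c
LEFT JOIN failed_changes fc USING (deployment_id)
GROUP BY c.ai_assisted;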

Now you have two CFRs. Compare them.

If your AI-assisted CFR is meaningfully higher than your human-authored CFR — and, based on every public benchmark from the last six months, it probably is — you have your own version of the 30% number. Not an industry aggregate. Your team’s aggregate, on your codebase, for your definition of failure. That number is the one that actually motivates change.

It’s also a fair number in a way the industry stat isn’t. If your AI-assisted CFR is lower than your human CFR, that tells you something real too — your team has figured out how to use these tools well, and the finding is worth internal publicity.

What to do with the number

I wrote most of this in the previous post, so I’ll keep it brief.

The patterns that reduce CFR at teams I’ve seen up close are the boring ones. Smaller PRs. Trunk-based development with feature flags instead of long-lived branches. Canary deploys with automatic rollback. Strong ownership over shared infrastructure artifacts. And — the one most teams skip — visibility into the cross-repo blast radius of a change before it merges, so that the review can ask the right question rather than a generic one.

What doesn’t work is adding process layers that slow every change without discriminating by risk. The goal isn’t to slow the agents down; it’s to route high-blast-radius changes through more scrutiny than low-blast-radius ones.

FAQ

How is change failure rate calculated?

CFR equals the number of failed changes divided by the number of changes, over a given time window. A failed change is a production deployment that required remediation — a rollback, hotfix, fix-forward, or patch. Fix-only deployments are excluded from both sides of the ratio because they aren’t new changes.

What is a good change failure rate?

DORA’s 2025 data suggests that about 16.7% of teams achieve a CFR of 4% or lower, which is the elite band. A CFR in the 0–15% range generally indicates a mature delivery process. Above 30% typically points to gaps in testing, release safety, or ownership clarity.

Should I include staging or pre-production failures in CFR?

No. CFR is a production-only metric by DORA’s definition. A canary that catches a bad deploy before it reaches real users is a win, not a failure — counting it penalises the very controls you want teams to invest in.

How do I track AI-assisted code for CFR purposes?

The simplest approach is a PR label or commit trailer that engineers apply at authoring or review time. More sophisticated options include PR templates with a checkbox, bot-applied labels based on known AI-tool user accounts, and tools like the Git AI open standard that record AI-authored diff ranges in Git Notes. Perfect attribution is not required — a consistent convention used by the team is enough to split the metric meaningfully.

How long should my CFR measurement window be?

Ninety days is the usual default. Shorter windows (two to four weeks) are too noisy for most teams — a single rough week swings the number. Longer windows (six months or more) smooth out recent changes in your delivery practices and are slow to react to regressions.

Closing

A week of work, most of which is data plumbing you probably already have, gets you an honest CFR number and a split between AI-assisted and human-authored changes. That’s a better starting point than any aggregate benchmark from any vendor report.

I’m building Riftmap to solve the other half of this — giving teams visibility into the cross-repo blast radius of a change before the CFR number moves. Auto-discovery across Terraform, Docker, CI templates, Helm, Go, npm, Python, Ansible, Kubernetes, and Kustomize. One read-only token. No YAML to maintain.

If this is familiar territory, reach me at [email protected], or try a free scan at app.riftmap.dev.


Sources referenced

  • DORA, Software delivery performance metrics (dora.dev)
  • Google Cloud / DORA, 2025 State of AI-assisted Software Development Report (dora.dev)
  • Cortex, Engineering in the Age of AI: 2026 Benchmark Report (cortex.io)
  • Swarmia, DORA change failure rate — what, why, and how (swarmia.com)
  • Git AI, open standard for AI authorship attribution via Git Notes (github.com/git-ai-project)