What is the best metric for AI coding productivity?

There is no single metric, and any tool selling you one is the tool to distrust. The best read comes from four layers together: whether the work shipped and still works (outcome), what it really cost (calibrated effort), how much output you got per unit of effort (leverage), and what you drove versus delegated (attribution). Leverage is the closest thing to a headline number, but it only means something when the effort data underneath it is observed rather than guessed.

Can you measure productivity by counting AI tokens used?

No. Token counts tell you how much the model generated and roughly what it cost, which is a useful operations and budget number but not a productivity one. High token burn often signals thrashing rather than output, so optimising for it rewards your least productive sessions. Track tokens for cost, never as a proxy for value.

Do lines of code or commit counts still mean anything in 2026?

Almost nothing, for productivity. An agent can produce a thousand-line commit in seconds, so volume now measures the tool's verbosity rather than your effort. Lines and commits are fine as raw activity signals, but they become actively misleading the moment you treat them as productivity.

How do you measure productivity when an AI agent writes most of the code?

By shifting from artifacts produced to outcomes shipped per unit of effort, and by capturing the human-versus-agent split explicitly. Your contribution is the direction, judgment, and hardening, so the measurement has to credit those rather than the typing. Attribution data captured as the work happens is what makes this honest instead of a guess.

Is DORA still useful for AI-assisted teams?

Yes, more than volume metrics. DORA's deployment frequency, lead time, change failure rate, and time to restore are outcome and flow oriented, so AI writing more code does not improve them unless the work is genuinely good. DORA is strong for team delivery health but does not, on its own, attribute individual human-versus-agent contribution, which is a separate measurement.

How to Measure AI Coding Productivity in 2026 (a Framework That Holds Up)

Why the old metrics broke the day agents got good

If you have ever watched Claude Code or Cursor emit a complete feature in one pass and then looked at your "productivity dashboard" celebrating the spike in lines and commits, you have already seen the problem. The dashboard is measuring the agent, not you, and it cannot tell the difference. This is the core thing that changed, and every honest answer to measuring AI coding productivity starts here.

For most of software's history, code volume was a rough proxy for effort. More lines, bigger diffs, and busier commit graphs loosely tracked more work, because a human had to type all of it. Agents severed that link cleanly. An AI now produces in seconds what used to take a day, which means a huge commit can represent thirty seconds of accepting a suggestion, and a single-line change can follow an hour of you reasoning about a race condition the agent kept getting wrong. Volume and effort no longer move together, so any metric built on volume is now measuring noise with confidence.

This is why the instinct to count tokens fails too, even though it feels more AI-native. Token throughput tells you how much the model generated and roughly what it cost, which is a real operations number. It says nothing about whether the generated code was good, whether you kept it, or whether shipping it moved anything. In practice, high token burn often correlates with thrashing: the agent and the developer going in circles. Measuring productivity by tokens consumed rewards exactly the sessions you should be worried about.
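One way to make the thrashing point concrete is to relate token spend to what survived review. The sketch below is illustrative only: the `Session` fields, the `lines_kept` measure, and the 2,000-tokens-per-line threshold are all assumptions, not figures from any tool. It produces a triage signal for which sessions deserve a second look, not a productivity score.

```python
from dataclasses import dataclass


@dataclass
class Session:
    name: str
    tokens: int      # total tokens the agent generated in the session
    lines_kept: int  # agent-written lines still present after review (hypothetical field)


def tokens_per_kept_line(s: Session) -> float:
    """Tokens spent for each line that survived; infinite if nothing survived."""
    return float("inf") if s.lines_kept == 0 else s.tokens / s.lines_kept


def flag_thrashing(sessions, threshold=2000.0):
    """Names of sessions whose token cost per kept line exceeds the (assumed) threshold."""
    return [s.name for s in sessions if tokens_per_kept_line(s) > threshold]


# Made-up sessions for illustration.
sessions = [
    Session("scaffold-crud", tokens=40_000, lines_kept=400),    # 100 tokens per kept line
    Session("retry-loop", tokens=300_000, lines_kept=12),       # 25,000 tokens per kept line
    Session("abandoned-refactor", tokens=120_000, lines_kept=0),
]

print(flag_thrashing(sessions))  # ['retry-loop', 'abandoned-refactor']
```

Note that the highest-burn sessions are the ones flagged, which is the opposite of what a tokens-as-output metric would reward.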

What "productivity" should mean when an agent writes most of the code

The redefinition is simple to state and changes everything downstream: productivity is shipped outcomes per unit of effort, not artifacts produced. When the agent does the typing, your contribution is the judgment, the direction, and the decisions that turn raw generation into something that works and survives contact with users. That is the thing worth measuring, and it is the thing the volume metrics are structurally blind to.

In practice, this reframes the question from "how much did I produce?" to "how much did I produce that mattered, and how hard was it to get there?" A developer who shipped one hardened payments integration this week was likely more productive than one who merged eight agent-generated CRUD scaffolds that still need rework, even though the second person's contribution graph looks far busier. Leverage is the word for this: output relative to the effort it cost, with credit for the difficulty and durability of the work rather than its raw size.
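A rough sketch of that comparison, under loudly stated assumptions: the `difficulty` and `durability` weights and every hour figure below are invented for illustration, and the formula (weighted output divided by observed hours) is one plausible reading of "output relative to effort", not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class ShippedItem:
    title: str
    difficulty: float    # hypothetical 1-5 weight for how hard the problem was
    durability: float    # hypothetical 0-1 share of the work that needed no rework
    effort_hours: float  # observed (not self-reported) hours


def leverage(items):
    """Difficulty- and durability-weighted output divided by total effort hours."""
    output = sum(i.difficulty * i.durability for i in items)
    effort = sum(i.effort_hours for i in items)
    if effort <= 0:
        raise ValueError("effort must be positive")
    return output / effort


# One hardened integration versus eight scaffolds that still need rework.
dev_a = [ShippedItem("payments integration", difficulty=5.0, durability=0.9, effort_hours=20.0)]
dev_b = [
    ShippedItem(f"crud scaffold {n}", difficulty=1.0, durability=0.4, effort_hours=3.0)
    for n in range(8)
]

print(round(leverage(dev_a), 3))  # 0.225
print(round(leverage(dev_b), 3))  # 0.133
```

With these made-up weights the single hard item scores higher despite eight times fewer merged items. The result is only as trustworthy as the effort hours fed into it.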

This is also where the AI-era proof problem quietly sits inside the productivity problem. A busy graph used to be a believable claim about effort. Now it is just a claim, because the work behind it could have been thirty minutes of accepting suggestions. Measuring AI coding productivity honestly means capturing the evidence of real effort, not the appearance of it, which is the same shift that separates a number you can defend from a number you can only assert.

A framework: the four things actually worth measuring

The practical move is to measure across four layers and treat any single-layer metric as suspect. Each one answers a question the others cannot, and a productivity picture built from all four is far harder to game than any vanity number. Here is what each layer captures and how to read it.

Layer	What it measures	Good signal	Failure mode if used alone
Outcome	Did the work ship, function, and survive	Features merged and still working weeks later, bugs closed, incidents avoided	Slow to read; easy to claim without evidence
Effort (calibrated)	Real human and agent time the work took	Focused work blocks, session cadence, time-to-ship	Self-reported time lies; needs passive capture
Leverage	Output per unit of effort	A hard problem solved fast, durable code, low rework	Meaningless without trustworthy effort data
Attribution	Human-driven versus agent-generated share	Honest split of what you steered vs delegated	Cannot be reconstructed from memory after the fact

Start with outcome, because it is the only layer that is unambiguously about value. A feature that shipped and still works is productive almost by definition, regardless of how many lines or tokens it took. The catch is that outcomes are slow to observe and easy to assert without proof, which is why the other three layers exist to ground them.

Effort is the layer most people get wrong, and it has to be calibrated rather than guessed. You cannot measure leverage without knowing what the work actually cost, and self-reported time is among the least reliable data in software. The effort number has to come from observed activity, captured as it happens, including agent sessions in the terminal, not from a timesheet filled in on Friday. Once you have trustworthy effort, leverage falls out of it: the same outcome achieved with less real effort is higher productivity, full stop.
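As a minimal sketch of what "observed activity" could look like in code, the function below groups activity timestamps into focused blocks and splits a block whenever the idle gap is too long. The 15-minute gap and the event data are assumptions for illustration; this is a generic sessionisation technique, not a description of how any particular product computes effort.

```python
from datetime import datetime, timedelta


def work_blocks(timestamps, max_gap=timedelta(minutes=15)):
    """Group activity timestamps into (start, end) blocks, splitting on idle gaps."""
    blocks = []
    for ts in sorted(timestamps):
        if blocks and ts - blocks[-1][1] <= max_gap:
            blocks[-1] = (blocks[-1][0], ts)  # extend the current block
        else:
            blocks.append((ts, ts))           # idle gap too long: start a new block
    return blocks


def total_effort(blocks):
    """Sum of block durations as a timedelta."""
    return sum((end - start for start, end in blocks), timedelta())


# Hypothetical editor and terminal events from one morning.
events = [
    datetime(2026, 1, 5, 9, 0),
    datetime(2026, 1, 5, 9, 10),
    datetime(2026, 1, 5, 9, 20),
    datetime(2026, 1, 5, 11, 0),
    datetime(2026, 1, 5, 11, 5),
]

blocks = work_blocks(events)
print(len(blocks))           # 2
print(total_effort(blocks))  # 0:25:00
```

A block built from a single event has zero duration here, so a real implementation would likely add a minimum credit per block; that choice is left out to keep the sketch short.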

Attribution is the layer that is genuinely new. When an agent writes a large share of the code, "how productive was the developer?" requires knowing what the developer contributed versus what was delegated. In practice this means being able to say "the agent scaffolded the CRUD, I designed the data model and hardened the edge cases" and have a record back it up. This is where measuring AI coding productivity overlaps with proving the code is human-driven, not just AI-generated, because the same attribution data answers both questions.
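If such a record exists, the split itself is simple arithmetic. The sketch below assumes a hypothetical log of `(task, driver, observed_minutes)` tuples captured as the work happened; the tasks and minutes are invented, and weighting by time rather than by lines is a deliberate assumption in keeping with the argument above.

```python
from collections import defaultdict

# Hypothetical records: (task, driver, observed_minutes); driver is "human" or "agent".
records = [
    ("crud scaffold", "agent", 30),
    ("data model", "human", 90),
    ("edge-case hardening", "human", 60),
    ("test generation", "agent", 20),
]


def attribution_split(records):
    """Share of observed time per driver, as fractions that sum to 1."""
    totals = defaultdict(float)
    for _task, driver, minutes in records:
        totals[driver] += minutes
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {driver: minutes / grand for driver, minutes in totals.items()}


split = attribution_split(records)
print({k: round(v, 2) for k, v in split.items()})  # {'agent': 0.25, 'human': 0.75}
```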

Team metrics: where DORA and engineering analytics fit

If you have ever sat in a leadership meeting where someone presented commit counts per developer as a productivity ranking, you already know why team-level AI productivity measurement is fraught. At the team level the established frame is DORA (deployment frequency, lead time for changes, change failure rate, time to restore), and it survives the AI era better than volume metrics because it is outcome and flow oriented rather than output oriented. An agent writing more code does not improve your change failure rate or your lead time unless the work is actually good.
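For readers who have not computed the four DORA metrics by hand, here is a small sketch over a list of deployment records. The `Deployment` shape and the sample data are assumptions; real pipelines define "failure" and "restored" in their own ways, and using medians is a choice made here rather than a requirement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deploys, period_days):
    """Compute the four DORA metrics for a non-empty list of deployments."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }


# Made-up week of deployments.
day = timedelta(days=1)
hour = timedelta(hours=1)
t0 = datetime(2026, 1, 5, 9, 0)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour, failed=True,
               restored_at=t0 + 2 * day + 7 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 8 * hour),
]

summary = dora_summary(deploys, period_days=7)
print(summary["median_lead_time"])        # 5:00:00
print(summary["change_failure_rate"])     # 0.25
print(summary["median_time_to_restore"])  # 1:00:00
```

Nothing in this calculation counts lines, commits, or tokens, which is why agent-generated volume cannot move these numbers on its own.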

Tools in this space measure different slices, and it is worth being precise about what each one can and cannot tell you. Engineering-analytics platforms like Waydev focus on DORA and managerial throughput metrics for teams, which is useful for delivery health but not built around individual AI-versus-human attribution. Code-research tools like GitClear measure diff quality and the rising share of AI-generated and churned code, which is a real and growing signal about codebase health. Agent-cost tools like Tokscale track token spend across coding agents, which answers the cost question and only the cost question. Each is measuring something true; none of them, alone, tells you output per unit of effort with the human share attributed, which is the actual definition of productivity.

The honest caveat for teams: do not let any of these become a stack-ranking weapon. The fastest way to corrupt a productivity metric is to tie it to individual performance reviews, because people optimise the number instead of the work. Measure to understand where time goes and what slows shipping, not to rank humans against a leaderboard the agents are quietly inflating.

Common mistakes that make the numbers worthless

A few patterns reliably produce AI productivity numbers you cannot trust, and naming them is the fastest way to avoid them. Each one looks like measurement and is really just counting.

Counting lines or commits as productivity is the original sin, and AI made it actively misleading rather than merely crude. As the gap between commit volume and actual work shows, a busy graph in 2026 is evidence of activity at most, and often not even that.

Treating token spend as productivity rewards thrashing. High burn frequently means the agent and the developer looping without converging, so optimising for tokens-as-output points the team at its worst sessions.

Trusting self-reported time quietly poisons every leverage calculation downstream. If the effort number is a guess, the productivity number built on it is a guess wearing a decimal point. This is why passive, observed capture is not a nice-to-have but the precondition for the whole framework.

Ignoring attribution flatters everyone equally and tells you nothing. A dashboard that cannot separate what you drove from what the agent generated is measuring the team and the tools as one blurred unit, which is fine for billing and useless for understanding productivity.

Where DevClocked fits

Full disclosure: I build DevClocked, so treat this as the founder explaining what it is for rather than a neutral ranking. DevClocked is built around exactly the framework above. Its core metric is a Leverage Score, output relative to the real effort it took. It tracks Work Blocks (focused sessions of genuine activity) and treats AI-agent work as first-class, including Claude Code, Cursor, and Codex, with token and cost tracking alongside. The point is to measure what you and your agents actually shipped, audited to source, instead of how much code moved. The mechanism is a git baseline crossed with lightweight telemetry from an editor extension and an editor-agnostic CLI that also sees terminal and agent sessions, with a model learning the relationship between the two so the effort number is calibrated rather than self-reported. That calibrated effort is what makes leverage and attribution measurable at all; the same measurement foundation is explained in detail in "Track coding time from git", listed below.

It would be dishonest to pretend you always need this. If you are a solo developer who just wants a rough sense of where your week went, the four-layer framework in your head and an honest look at what shipped will carry you a long way with no tooling. For a team whose only question is delivery health, a focused DORA setup may be all you need. And if you have not shipped anything yet, there is no productivity to measure: build first. DevClocked earns its place when the AI share of the work is non-trivial, the effort behind the output genuinely matters, and you are tired of productivity being a number you can only assert rather than show. That is the same claims-versus-proof problem the whole product sits on, pointed at productivity instead of authorship.

How to prove your code is human-written, not AI: the attribution layer of this framework, taken to its sharpest point.
Why GitHub green squares do not prove the work: why volume metrics stopped meaning anything.
Track coding time from git: the calibrated-effort foundation the whole framework depends on.
Prove what you shipped: the case for audited developer work: the claims-versus-proof argument underneath productivity and authorship alike.