What Would Good Agent Productivity Metrics Look Like?
People have been clamoring for AI impact metrics over the past year.
Yet, according to a recent deep dive by The Pragmatic Engineer, “none” of the metrics work. The article describes one principal engineer’s frustration (emphasis mine):
“I talked with DX and one of the other vendors, they are just DORA+Velocity metrics combined with anything they can get from APIs of Cursor, Claude etc.”
“How can we make effective use of our AI agent subscriptions? So far, in my experience, there is no answer to this — not even the hint of one.”
I recently spoke with an EVP who oversees several portfolio companies. Her take on AI was to use traditional productivity metrics:
“Measure it the same way as measuring a team not doing AI – how much are they getting done, how good is the work?”
The industry’s best attempt so far has been to segment existing output metrics by AI usage and see if they go up.
The problem with output metrics is that they’re like a scoreboard: they tell you if you’re successful, but not how to improve.
Accountability for results is important, but the principal engineer who has to deliver those results gets nothing out of staring at a velocity report.
To actually get better, you need metrics that provide targeted, actionable guidance – like a coach, not a scoreboard.
Such metrics don’t exist yet for agents, but we can begin to see what they should look like by reasoning from first principles of agentic engineering.
Principle #1: Humans are far more expensive than agents
If agents can perform a task instead of a person, that is almost always a win.
There may be edge cases where AI is extremely slow or expensive, but in general an agent either can or cannot perform a task at an acceptable level of quality, and when it can, it costs far less than a person.
As a result, the main component of agent productivity metrics should be human effort. There will be costs for tokens and other things, but human effort is what really matters.
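To make the asymmetry concrete, here is a back-of-envelope comparison. Every figure below is a hypothetical assumption (a fully loaded engineer rate, a plausible token bill for a medium task), not vendor data:

```python
# Back-of-envelope cost comparison. All figures are hypothetical
# assumptions for illustration, not measured or vendor-quoted numbers.

HUMAN_COST_PER_HOUR = 100.0  # assumed fully loaded engineer cost, $/hour
AGENT_COST_PER_TASK = 3.0    # assumed token spend for one medium task, $

def human_cost(hours_of_effort: float) -> float:
    """Dollar cost of a person doing the task by hand."""
    return hours_of_effort * HUMAN_COST_PER_HOUR

# A task that would take an engineer two hours by hand:
print(human_cost(2.0))                        # 200.0 dollars
print(human_cost(2.0) / AGENT_COST_PER_TASK)  # ~67x cost advantage
```

Under these assumed numbers, even an agent that needs several retries still comes out far ahead, which is why the metrics that follow focus on human effort rather than token spend.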
Principle #2: Agents can usually figure things out given enough context
One senior engineer I work with described his experience using Claude Code:
“It still does really dumb stuff, like it couldn’t get half the tests to pass, so it just deleted them instead of fixing the real problem.”
Anyone who’s spent time with AI has experienced its sometimes shocking lack of common sense.
Yet it will happily stop deleting your tests – you just have to tell it not to!
Its initial mistakes may be uncanny, but so too is its ability to find the right answer given enough context.
There are some limits, but human/AI interactions with the latest models largely entail the human providing feedback and additional instructions, not taking over from agents that are “stuck.” (This is a significant change from older models before December 2025, which were more prone to getting stuck.)
This means metrics can focus on the primary workflow – humans providing feedback to agents until task completion – rather than separately handling cases where humans take over.
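For example, the test-deletion behavior described above can often be ruled out with one standing instruction. A hypothetical excerpt from a repository’s agent instructions file (such as a CLAUDE.md) might read:

```
## Testing rules
- Never delete or skip failing tests to make the suite pass.
- If a test fails, fix the underlying problem; if you believe the
  test itself is wrong, stop and ask for confirmation first.
```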
Principle #3: Context switching kills human productivity
People have limited short-term memory. They have to spend significant time orienting themselves on a new task, and small distractions can knock them out of this productive state.
Every time an agent needs help from a person who’s doing something else, the cost can easily exceed the time it takes to provide that help by an order of magnitude.
Good productivity metrics must take this into consideration.
Principle #4: As agents become more autonomous, people multi-task
The longer agents can run without human feedback, the more likely people are to switch to other tasks or run multiple agents in parallel.
If you’re actively using one agent at the terminal, your productivity is a function of the total session length (AI execution + human response time) because you are focused on a single task. In this scenario, AI execution time matters a lot and is a primary component of productivity.
As agents become autonomous, however, agent execution time matters less because you can run more in parallel. Instead, the constraint becomes human attention.
Principle #5: As people multi-task, context switching overhead dominates task time
If you’re just running two agents, context switching overhead might not be that bad, especially if they are working on related tasks.
As agents run longer on their own and work across more tasks in parallel, each human response carries more context-switching overhead, since the person must first re-familiarize themselves with what that agent is doing.
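Principles #3 through #5 can be made concrete with a toy throughput model. Every duration below is a hypothetical assumption, but the structure shows why reducing how often agents need input matters more than how fast they run once you work in parallel:

```python
# Toy model of agent-assisted throughput. All durations are
# hypothetical assumptions, expressed in minutes.

AGENT_RUN = 30        # agent execution time between human inputs
RESPOND = 2           # time to type a response once oriented
CONTEXT_SWITCH = 15   # time to re-orient after working on something else
INPUTS_PER_TASK = 5   # human inputs needed to finish one task

# Single agent, human at the terminal: the human is pinned to the
# session, so cost per task is the full wall-clock session length.
single_agent_minutes = INPUTS_PER_TASK * (AGENT_RUN + RESPOND)

# Many agents in parallel: agent runtime overlaps other work, so the
# cost per task is human attention only (respond + re-orient each time).
parallel_human_minutes = INPUTS_PER_TASK * (RESPOND + CONTEXT_SWITCH)

print(single_agent_minutes)    # 160: dominated by agent execution time
print(parallel_human_minutes)  # 85: dominated by context switching

# In the parallel case, halving INPUTS_PER_TASK halves the human cost,
# while halving AGENT_RUN changes it not at all.
```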
Core agent productivity metric: Input Frequency
With these principles, we can derive a core agent productivity metric:
Input Frequency: The total number of human inputs required per task
With many agents running in parallel and the human cost dominated by context switching, you can get the most out of agents by reducing the number of times they need human feedback on the path to completing a task.
This metric covers all the different reasons agents may need input, such as making a mistake, lacking instructions, or not having access to necessary information.
Other metrics – total agent runtime, token use, code complexity, and so on – may each tell you something small about agent productivity, but input frequency targets the primary bottleneck: human attention.
Reducing agent input frequency
You can start to reduce input frequency by analyzing human inputs and classifying what actions would make each input unnecessary. (The prompts in each session are not currently available in vendor APIs, but can be extracted from local log files.)
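As a rough sketch of that analysis, the script below counts human inputs per session. It assumes Claude Code keeps session transcripts as JSONL files under ~/.claude/projects/ with a type field marking user entries; that location and schema are undocumented assumptions, so verify them against your own installation:

```python
import json
from collections import Counter
from pathlib import Path

# Assumed (undocumented) location of Claude Code session transcripts:
# one JSONL file per session, one JSON object per line.
LOG_DIR = Path.home() / ".claude" / "projects"

def inputs_per_session(log_dir: Path = LOG_DIR) -> Counter:
    """Tally human inputs per session by counting 'user' entries."""
    counts: Counter = Counter()
    for log_file in log_dir.rglob("*.jsonl"):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            # Assumed schema; some versions also log tool results as
            # 'user' entries, which you may need to filter out.
            if entry.get("type") == "user":
                counts[log_file.stem] += 1
    return counts

if __name__ == "__main__":
    for session, n in inputs_per_session().most_common(10):
        print(f"{session}: {n} human inputs")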
Your exact categories may vary, but here are some common ways to make agents more autonomous (a tagging sketch follows the list):
Better Planning – Clearer plans and requirements would have let the agent get further on its own. The action here is to improve processes for creating better up-front plans.
Best-Practice Violations – Certain types of mistakes (e.g., deleting tests) can be categorically fixed by improving your agent instructions file (e.g., CLAUDE.md) in each repository.
Tool Access – Agents may lack access to important tools and systems like your observability platform, CRM, design tools, database, debuggers, etc.
Test Automation – Automated tests serve double duty for agents. Not only do they help verify correctness, they also document your software’s functional requirements that would otherwise live in a developer’s head.
Security/Permissions – If agents require input for approving resource access, you can unblock them with better sandboxing and isolation.
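As noted above, once the inputs are extracted, tagging can start out as a simple hand-labeled list before you automate anything. A minimal sketch, with hypothetical labels and sample inputs:

```python
from collections import Counter

# Hypothetical hand-labeled sample: each human input from a session,
# tagged with the action that would have made it unnecessary.
labeled_inputs = [
    ("don't delete failing tests", "best-practice-violation"),
    ("the spec also requires pagination", "better-planning"),
    ("here's the error from our observability tool", "tool-access"),
    ("yes, you may write to the staging database", "security-permissions"),
    ("that broke checkout; see the failing e2e test", "test-automation"),
]

category_counts = Counter(category for _, category in labeled_inputs)

# The most frequent category is the highest-leverage fix.
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```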
A note about quality
Selecting input frequency as the primary agent productivity metric assumes that developers uphold quality standards rather than letting them slip.
While input frequency is a metric you want to move, it’s equally important to treat your existing quality metrics, like defect rates, as guardrails that must hold steady as you adopt agents. Otherwise, it’s possible to reduce input frequency simply by accepting whatever the agent produces with less scrutiny.
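In practice that means reporting the two side by side. A minimal sketch, assuming you already compute both numbers per reporting period; the 10% tolerance band is an arbitrary assumption:

```python
def check_adoption_health(
    input_frequency: float,       # avg human inputs per task this period
    defect_rate: float,           # defects per release this period
    baseline_defect_rate: float,  # defect rate before agent adoption
    tolerance: float = 0.10,      # assumed 10% guardrail band
) -> str:
    """Treat input frequency as the metric to move and defect rate
    as a guardrail that must hold still."""
    if defect_rate > baseline_defect_rate * (1 + tolerance):
        return "quality regression: input frequency gains don't count"
    return f"healthy: {input_frequency:.1f} inputs/task at stable quality"

print(check_adoption_health(3.2, 0.9, 1.0))  # healthy
print(check_adoption_health(2.1, 1.5, 1.0))  # quality regression
```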
The way forward
In this article, we looked at why input frequency is a good metric for improving agent productivity, alongside guardrail metrics that uphold quality.
During the adoption phase of agentic development, the critical path is cutting out unnecessary human involvement. Input frequency makes that the priority.
However, this relies on a few assumptions that may no longer hold as agents improve.
If agents start one-shotting tasks and your input frequency drops to two (one input to start the task, one to approve the result), we will have to look for other metrics to continue pushing the envelope.
Once agents become extremely autonomous, the assumption that humans are more expensive could also turn on its head, with agent bills that exceed the entire team’s salary. In this world, it may become more important to offload agent tasks onto traditional software rather than further increase agent autonomy.
But, whatever the future holds, one thing seems certain: agents will be a big part of it.
Teams that succeed will need actionable metrics to make good prioritization decisions in a world of ever-growing complexity.


