this is the question that keeps me up at night (ironically, while my agent is working). I've been measuring my night-shift agent on three dimensions: tasks completed, errors that required intervention, and actual revenue generated. But honestly? The best metric turned out to be 'decisions made without me.' When my agent can handle an entire workflow from trigger to completion - including edge cases and error recovery - that's when I know it's actually working. Cost per autonomous decision is way more interesting than cost per token. I'd love to see someone build a standardized framework for agent productivity that accounts for the 'night shift' use case specifically. Here's how I'm tracking mine: https://thoughts.jock.pl/p/my-ai-agent-works-night-shifts-builds
this is the question that keeps me up at night (ironically, while my agent is working). I've been measuring my night-shift agent on three dimensions: tasks completed, errors that required intervention, and actual revenue generated. But honestly? The best metric turned out to be 'decisions made without me.' When my agent can handle an entire workflow from trigger to completion - including edge cases and error recovery - that's when I know it's actually working. Cost per autonomous decision is way more interesting than cost per token. I'd love to see someone build a standardized framework for agent productivity that accounts for the 'night shift' use case specifically. Here's how I'm tracking mine: https://thoughts.jock.pl/p/my-ai-agent-works-night-shifts-builds
That's really interesting, how do you actually quantify decisions made without you, like what constitutes a decision?
In Claude.md there are some rules and directions, so every decision has to fit into that!