minimal engineering

Agent Efficiency Beats Autonomy

Kevin Borders — Thu, 25 Jun 2026 16:46:25 GMT

The biggest misconception about agent productivity is that it’s about maximizing agent autonomy.

The idea is that AI adoption should focus on having agents work longer with less help, measuring progress by the quantity and quality of agent output relative to human involvement. More skills, more instruction files, more context.

This is like saying the key to building more software prior to AI was just hiring as many good engineers as possible, then spending all your budget on training without investing in DevOps.

The issue is that both humans and LLMs are expensive and non-deterministic. Just throwing people or LLMs at a problem leads to massive inefficiency and ultimately poor quality.

The key to agent productivity is the same as the key to human productivity: improve efficiency by simplifying and automating workflows with deterministic processes.

Throwing LLMs at a problem is actually more dangerous because of one fundamental difference: people think for themselves. If you hire good engineers, they tend to use their judgment to do the right thing.

Model vendors have a conflict of interest. They want you to use LLMs as much as possible. Vendor agent harnesses like Claude Code are great for producing deterministic code to automate workflows, but are slower, more expensive, and less reliable than conventional code at executing workflow steps that can be automated.

Many companies are patching over this lack of reliability with even more non-deterministic LLM workflow steps in the name of being “AI native.” Much like organizations that missed the boat on DevOps, those who focus on autonomy without paying attention to agent efficiency are going to be in for a rude awakening.

This article demonstrates by example how we built Thoreau at minware, our AI pipeline for generating product documentation (you can see the end result here). Thoreau follows the principle of efficiency first, using agents to develop a workflow that minimizes the role of LLMs in the workflow itself, resulting in higher quality and lower costs.

Principle #1: Information movement and transformation should be automated

The first guiding principle of our documentation workflow is that information should have a single home. If it is needed in another place or format, the movement and transformation should be done by deterministic code that automates the process.

Our product at minware has a variety of metrics and reports about developer productivity. Names and descriptions of those metrics are represented as structured data in the application, along with reports that have text describing how to use and interpret those metrics.

The product documentation on minware.com has its own content management system in a separate repository.

The autonomy-first approach

One trap companies fall into even without AI is to have a person use the product and then write documentation in an independent system, moving and transforming information manually. (“Hey engineer: please type pages of documentation into the CMS.”)

We didn’t want to do this with Thoreau. It would have been easy to create skills that told Claude Code to fire up a browser, go through the reports, and write out documentation markdown files describing all the metrics. (“Hey Claude: please produce pages of documentation and put it in the CMS.”)

This autonomy-first approach would have been faster to build as well. However, every word of documentation would be liable to change with each future iteration and be subject to potential hallucinations (just like the manual workflow is subject to human error), requiring further review.

Of course, we could have doubled down on agent autonomy and created even more skills and steps to verify the output, burning tokens along the way.

This would be in effect automating an inefficient human process with an inefficient AI process.

The efficiency-first approach

Instead, we told Claude Code to build a series of automated scripts to extract and transform information from its canonical source in our application into the form needed for generating documentation.

Because these scripts were simple, for internal use with known inputs, and we could fully validate the final output (static documentation pages), we didn’t have to read the code like we would for customer-facing workflows. We had Claude write unit tests of course, but development was very fast.

The first task in this process fetches all of the report and metric data from our API and saves it to a JSON file, which is not checked into version control.

Then, we gave Claude Code instructions for writing a script to transform raw API data, organizing report contents based on their menu structure and creating a list of metrics associated with each report.

The output contains the relevant data from our application, arranged roughly how it will appear in the final documentation, with irrelevant fields dropped.
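A minimal sketch of what such a transform script might look like, assuming a raw payload with `reports`, `charts`, and `metrics` lists. Every field name here (including `menu` and `internal_flags`) is hypothetical; the article does not show the actual API schema.

```python
from collections import defaultdict

# Hypothetical raw payload; real field names are not shown in the article.
RAW = {
    "reports": [
        {"id": "r1", "title": "Sprint Insights", "menu": ["Planning"], "internal_flags": 7},
        {"id": "r2", "title": "Bug Load", "menu": ["Quality"], "internal_flags": 0},
    ],
    "charts": [
        {"report_id": "r1", "title": "Velocity", "metric_id": "m1"},
        {"report_id": "r1", "title": "Velocity by Team", "metric_id": "m1"},
        {"report_id": "r1", "title": "Carryover", "metric_id": "m2"},
        {"report_id": "r2", "title": "Bug Time", "metric_id": "m3"},
    ],
    "metrics": [
        {"id": "m1", "name": "Story Points Completed"},
        {"id": "m2", "name": "Carryover Points"},
        {"id": "m3", "name": "Bug Fix Time"},
    ],
}


def organize(raw):
    """Group reports by menu path and attach each report's metric names."""
    metric_names = {m["id"]: m["name"] for m in raw["metrics"]}

    # Collect metric IDs per report in first-seen order, without duplicates.
    ids_by_report = defaultdict(list)
    for chart in raw["charts"]:
        ids = ids_by_report[chart["report_id"]]
        if chart["metric_id"] not in ids:
            ids.append(chart["metric_id"])

    docs = [
        {
            "menu": " / ".join(report["menu"]),
            "title": report["title"],
            "metrics": [metric_names[i] for i in ids_by_report[report["id"]]],
        }
        for report in raw["reports"]
    ]
    # Only the three keys above survive, so fields like internal_flags are dropped.
    return sorted(docs, key=lambda d: (d["menu"], d["title"]))
```

Because the output is small and ordered the way the documentation will be, it can be reviewed by eye without reading the script.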

Principle #2: Automate in small steps

We built the movement and transformation scripts using Claude Code by only looking at the script output and not having to read the code.

This wouldn’t have been feasible by just handing Claude the API and desired output format.

Instead, we broke down the process into several distinct scripts and validated the output at each step.

The trap people often fall into with agents (and, to be fair, with engineers as well) is trying to do too much at once. This raises the complexity above the level where you can efficiently get to the right result just by providing feedback on the output.

First, we focused only on extracting the raw API data and writing it directly to a file. Then, we did basic preprocessing to filter the raw data down to only what was needed. We then linked and rearranged records so that the raw fields for each metric and report were adjacent to one another. Finally, we applied transformation logic to produce the final fields in the correct format for output.

At each step, Claude made some mistakes that we told it to correct. If we had tried to do it all at once, it would have been a lot more difficult for us and for Claude to isolate and fix each of the errors.
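One way to make each step reviewable on its own is to have a small runner write every stage's output to a numbered file. The sketch below is an illustration under assumed names (`run_stages`, the stage functions, and the record fields are all hypothetical), not the scripts Thoreau actually uses.

```python
import json
import tempfile
from pathlib import Path


def run_stages(data, stages, out_dir):
    """Run each stage in order, writing its output to a numbered JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (name, step) in enumerate(stages, start=1):
        data = step(data)
        path = out_dir / f"{index:02d}_{name}.json"
        path.write_text(json.dumps(data, indent=2, sort_keys=True))
        written.append(path.name)
    return data, written


def keep_needed(records):
    # Preprocess: filter raw records down to the fields the docs need.
    return [{"name": r["name"], "report": r["report"]} for r in records]


def link_by_report(records):
    # Link: put the metric names for each report next to one another.
    grouped = {}
    for r in records:
        grouped.setdefault(r["report"], []).append(r["name"])
    return grouped


def to_output(grouped):
    # Transform: produce the final fields in a stable order.
    return [{"report": k, "metrics": sorted(v)} for k, v in sorted(grouped.items())]


# Stand-in for the raw API dump from the first step (hypothetical fields).
RAW_RECORDS = [
    {"name": "Velocity", "report": "Sprints", "debug": True},
    {"name": "Bug Time", "report": "Quality", "debug": False},
    {"name": "Carryover", "report": "Sprints", "debug": False},
]
STAGES = [("filter", keep_needed), ("link", link_by_report), ("transform", to_output)]

with tempfile.TemporaryDirectory() as tmp:
    result, written = run_stages(RAW_RECORDS, STAGES, tmp)
```

When a stage's file looks wrong, the feedback to the agent can point at that one file instead of the whole pipeline.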

Principle #3: Isolate human context from agent-generated code

Working with real data, you inevitably encounter things that are incorrect or inconsistent. For us, certain metric descriptions were written differently from the others, and certain charts were set up in a way that would have led to repetitive or nonsensical documentation.

To address this issue, we had Claude flag everything that didn’t fit the expected format in a list of exceptions generated by the automated script. This caught things like two charts with the same title using different metrics on different reports, or orphan metrics that didn’t show up on any report.
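The two checks mentioned above could be sketched roughly as follows. The record shapes and the `kind` labels are assumptions made for illustration.

```python
from collections import defaultdict


def find_exceptions(charts, metrics):
    """Return a list of data issues that need a human decision."""
    exceptions = []

    # Same chart title backed by different metrics.
    metric_ids_by_title = defaultdict(set)
    for chart in charts:
        metric_ids_by_title[chart["title"]].add(chart["metric_id"])
    for title, metric_ids in sorted(metric_ids_by_title.items()):
        if len(metric_ids) > 1:
            exceptions.append(
                {"kind": "title_conflict", "title": title, "metric_ids": sorted(metric_ids)}
            )

    # Metrics that no chart references.
    used = {chart["metric_id"] for chart in charts}
    for metric in metrics:
        if metric["id"] not in used:
            exceptions.append({"kind": "orphan_metric", "metric_id": metric["id"]})
    return exceptions
```

The script only reports what it finds; deciding what to do about each item stays with a person.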

Resolving each of these exceptions required human instructions to say how to handle the issue.

The wrong way to do this is to tell Claude Code how to deal with issues one at a time by embedding data-dependent logic inside of the code that it generated.

If you mix AI-generated code and human context together in this way, it makes maintenance a nightmare because you have to sift through thousands of lines to understand the embedded human context.

Instead, we created a separate notes.yml file to hold all of the human-provided context and told Claude to have the scripts read these notes when processing the data. This file also served as a concise to-do list for cleaning up the source data at a later time. The notes file is read and reviewed by a person, but the scripts that use it are not.
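To illustrate the separation, here is a sketch in which the script holds only generic logic and every data-dependent decision lives in the notes. The notes are shown as an already-parsed dictionary to keep the example dependency-free; the keys, actions, and reasons are hypothetical, since the article does not show the real notes.yml layout.

```python
# Assumed shape of notes.yml after parsing with a YAML library.
# Keys, actions, and reasons are hypothetical examples.
NOTES = {
    "charts": {
        "r2:Throughput": {"rename": "Ticket Throughput", "reason": "Title clashes with r1"},
    },
    "metrics": {
        "m9": {"skip": True, "reason": "Orphan; remove from the application"},
    },
}


def apply_notes(charts, metrics, notes):
    """Apply human decisions from the notes without hard-coding them here."""
    chart_notes = notes.get("charts", {})
    metric_notes = notes.get("metrics", {})

    fixed_charts = []
    for chart in charts:
        note = chart_notes.get(f"{chart['report_id']}:{chart['title']}", {})
        fixed_charts.append({**chart, "title": note.get("rename", chart["title"])})

    kept_metrics = [m for m in metrics if not metric_notes.get(m["id"], {}).get("skip")]
    return fixed_charts, kept_metrics
```

Nothing in `apply_notes` mentions a specific chart or metric, so a reviewer only needs to read the notes to see every human decision.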

Principle #4: You usually don’t need an LLM

Once we extracted and organized all of the text fields from their source in our application, there were a series of transformations that involved manipulating text.

For example, some metrics were written like “Average XYZ” while others were “XYZ (Avg.)”. We also had to combine different fields like “Metric X by Dimension Y” where the metric and dimension names were in separate fields.

Most text transformations can be achieved with regular expressions. It’s worth it to ask AI to implement the transformation with conventional code first, because there may be text processing methods you aren’t aware of that are deterministic and cheaper than using an LLM.

As a general guideline, if the transformations you’re doing are purely syntactic, then you probably don’t need an LLM. To be concrete: it is more effective to have an LLM generate a pipeline that deterministically transforms text than it is to provide an LLM with unstructured inputs, describe your desired outputs, and ask for the final output to be written to disk. Even if you have a skill that guides the LLM, the deterministic process will be faster, easier to maintain, and 100% correct for behavior that is covered by automated tests.
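As an illustration of a purely syntactic transformation, the sketch below normalizes the "XYZ (Avg.)" form to "Average XYZ" and joins a metric and dimension into a title. The choice of "Average XYZ" as the canonical form and the function names are assumptions, not details from our pipeline.

```python
import re

# Matches names such as "Cycle Time (Avg.)" or "Cycle Time (avg)".
_SUFFIX_AVG = re.compile(r"^(?P<name>.+?)\s*\(Avg\.?\)$", re.IGNORECASE)


def normalize_metric_name(name):
    """Rewrite 'XYZ (Avg.)' as 'Average XYZ'; leave other names unchanged."""
    name = name.strip()
    match = _SUFFIX_AVG.match(name)
    if match:
        return f"Average {match.group('name')}"
    return name


def chart_title(metric, dimension=None):
    """Combine separate metric and dimension fields into one title."""
    metric = normalize_metric_name(metric)
    return f"{metric} by {dimension.strip()}" if dimension else metric
```

Rules like these are cheap to cover with unit tests, which is what makes the deterministic route dependable.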

LLMs only come into play when you need to embed a judgment that depends on the meaning of words or the broader context.

Doing all the transformations you can with deterministic code first lets you focus the LLM where it is really needed.

Principle #5: Isolate and constrain LLM steps

This brings us to where we really needed to use an LLM. Some of our reports already had human-written summaries of their purpose and what the report contained. Others did not.

We also have shorter length constraints in our documentation pages than in the application, so some of the existing text needed to be shortened intelligently.

There was also certain information that didn’t exist in our application, like an appropriate magnitude to show for demo charts of things like commit or story point counts.

When it came time to use an LLM, we first had the automated script extract a minimal list of the items that needed attention.

We also instructed Claude to have the script insert other information that would be relevant context for LLM processing into this to-do list. This included the titles of all the charts, names of the reports, and text content on the reports other than the report description.
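A sketch of how a script might build that to-do list, under assumed field names and an assumed length limit (`MAX_SUMMARY_CHARS` is hypothetical; the article does not state the real constraint):

```python
MAX_SUMMARY_CHARS = 160  # hypothetical documentation length limit


def build_llm_todo(reports, max_chars=MAX_SUMMARY_CHARS):
    """List only the reports that need LLM attention, with relevant context."""
    todo = []
    for report in reports:
        summary = (report.get("summary") or "").strip()
        if not summary:
            task = "write_summary"
        elif len(summary) > max_chars:
            task = "shorten_summary"
        else:
            continue  # Existing text is fine; keep the LLM out of it.
        todo.append(
            {
                "id": report["id"],
                "task": task,
                "context": {
                    "report_name": report["name"],
                    "chart_titles": [c["title"] for c in report.get("charts", [])],
                    "other_text": report.get("other_text", []),
                    "current_summary": summary,
                },
            }
        )
    return todo
```

Reports whose existing text already fits never reach the LLM, which keeps both cost and review effort proportional to the real gaps.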

Then, we gave Claude specific instructions to create an llm.yml file that just had the IDs of the items and the LLM-generated data for each field. This resulted in a concise list that was easy to review, which we checked into version control.

For the things that were wrong and needed extra context, we added that to the human notes.yml file and asked Claude to re-run the LLM generation step using that context.

While we used Claude Code for this LLM step (by creating a skill to have it run the pre-LLM script, update llm.yml, then run the post-LLM script), you’d want to do it with raw LLM calls in a pipeline with more data to control the exact LLM input, which would make the results more deterministic and reduce costs.

After running the LLM generation step, the final piece was to have an automated script combine the LLM output with everything else and push it to the final documentation markdown output format, which you can see at this link.
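The combining step can stay fully deterministic. The sketch below assumes the LLM output has been parsed into a dictionary keyed by report ID, and it uses a made-up markdown layout; the actual llm.yml fields and page format are not shown in this article.

```python
def merge_llm_fields(reports, llm_fields):
    """Overlay LLM-generated fields onto reports, keyed by report ID."""
    known = {report["id"] for report in reports}
    stale = sorted(set(llm_fields) - known)
    if stale:
        # Fail loudly so leftover entries for removed reports get cleaned up.
        raise KeyError(f"LLM output refers to unknown report IDs: {stale}")
    return [{**report, **llm_fields.get(report["id"], {})} for report in reports]


def render_markdown(report):
    """Render one report as a markdown section (hypothetical layout)."""
    lines = [f"## {report['name']}", "", report["summary"], ""]
    lines += [f"- {title}" for title in report["chart_titles"]]
    return "\n".join(lines) + "\n"
```

Since the merge only copies fields by ID, any change in the rendered pages traces back to either the upstream data or a specific LLM-generated entry.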

The whole pipeline runs the scripts in a few seconds and the LLM processing in several seconds more. The diff is easy to review because we can see which changes come from an LLM and which are just copying upstream changes that have already been reviewed.

Putting agent efficiency into practice

It’s one thing to understand the principles of efficient AI automation. Managing an organization to apply these principles at scale is a far greater challenge. Here are a few practical things you can do to set your team on the right path.

Document and enforce AI workflow automation guidelines

The first step to implementing agent efficiency is to establish a shared written document with technical leaders that outlines how engineers should automate workflows using AI.

This should include things like how to structure files (i.e., separating human notes from agent code and LLM processing outputs), what types of processing should happen with deterministic code, which artifacts need human review, etc.

It’s also helpful to specify when workflows should have more or less stringent guidelines based on how frequently they run, how critical it is to have correct output, and maintenance expectations. This ensures the guidelines don’t get in the way of engineers applying common sense.

The guidelines also should encourage the initial creation of text-based skills for repeated AI tasks, even if they should be made more efficient later. Otherwise, people may use agents directly without creating skills, which makes deterministic automation harder to see and manage.

Finally, for guidelines to be effective, it’s essential that technical leaders understand them, buy into them, and have ample time to review the work of less experienced engineers to ensure that it conforms.

Keep an inventory of your AI workflows

With the push toward AI adoption and the power of agents, people are creating more automated workflows than ever, including for one-off personal tasks.

Given the explosion of AI workflows, it’s impractical to optimize all of them. This is why an inventory is so valuable: it helps you identify which workflows are the most important and prioritize their efficiency.

The best way right now to gain visibility into AI workflows is to heavily encourage the consolidation of repetitive AI tasks into skills, MCPs, and tools, as mentioned in the previous section.

With common named capabilities in place, you can look at usage traces from your AI tools to see how much each skill/MCP/tool is being used, how long it takes, its success rate, and its token cost.
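As a rough sketch of that rollup, assume the traces have already been flattened into one record per invocation. The field names (`capability`, `duration_s`, `success`, `tokens`) are hypothetical and do not correspond to any vendor's schema.

```python
from collections import defaultdict


def summarize_usage(traces):
    """Aggregate runs, success rate, average duration, and tokens per capability."""
    totals = defaultdict(lambda: {"runs": 0, "successes": 0, "seconds": 0.0, "tokens": 0})
    for trace in traces:
        row = totals[trace["capability"]]
        row["runs"] += 1
        row["successes"] += 1 if trace["success"] else 0
        row["seconds"] += trace["duration_s"]
        row["tokens"] += trace["tokens"]
    return {
        name: {
            "runs": row["runs"],
            "success_rate": row["successes"] / row["runs"],
            "avg_seconds": row["seconds"] / row["runs"],
            "tokens": row["tokens"],
        }
        for name, row in totals.items()
    }
```

Sorting the result by token cost or run count gives a first-pass inventory of which workflows deserve optimization.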

The major coding agents (Claude Code, Codex, Gemini CLI, OpenCode, etc.) emit OpenTelemetry data, which is the most complete source for this type of information. Other tools, like Cursor, have enterprise usage APIs that include some of this data as well.

It’s also helpful if you link this data to commits, pull requests, and tickets by ingesting and joining it with data from your version control and project management system. (We’ve built minware to do this for you if you’re not keen on using your tokens to manage a data pipeline.)

All together, this gives you full observability over each AI workflow, showing which types of tickets, projects, repositories, and tasks use it. Understanding the impact of each workflow on higher-level outcomes lets you effectively manage agent efficiency across the organization.

Agent efficiency is the future

Agents are still a very new technology. Most organizations are closer to the start of their transformation than to the end.

The mass euphoria around AI has shone a spotlight on autonomy, with free-flowing budgets and a race to use AI for everything.

Autonomy is not bad per se. Having LLMs do tasks that humans used to do may still be a big win.

However, many of the success stories I see look like automating inefficient human processes.

LLMs may be cheaper than people, but even if their cost drops to zero, they are still inescapably non-deterministic.

The agent efficiency wave is coming. Unless you get ahead of it, you risk landing back where you started before adopting AI, just with a bigger monthly bill.

What Would Good Agent Productivity Metrics Look Like?

Kevin Borders — Tue, 10 Feb 2026 14:39:05 GMT

People have been clamoring for AI impact metrics over the past year.

Yet, according to a recent deep dive by The Pragmatic Engineer, “none” of the metrics work. The article describes one principal engineer’s frustration (emphasis mine):

“I talked with DX and one of the other vendors, they are just DORA+Velocity metrics combined with anything they can get from APIs of Cursor, Claude etc.”
“How can we make effective use of our AI agent subscriptions? So far, in my experience, there is no answer to this — not even the hint of one.”

I recently spoke with an EVP who oversees several portfolio companies. Her take on AI was to use traditional productivity metrics:

“Measure it the same way as measuring a team not doing AI – how much are they getting done, how good is the work?”

The industry’s best attempt so far has been to segment existing output metrics by AI usage and see if they go up.

The problem with output metrics is they’re like a scoreboard: they tell you if you’re successful, but not how to improve.

Accountability for results is important, but the principal engineer who has to deliver those results gets nothing out of staring at a velocity report.

To actually get better, you need metrics that provide targeted, actionable guidance – like a coach, not a scoreboard.

Such metrics don’t exist yet for agents, but we can start to see what they should look like starting from first principles of agentic engineering.

Principle #1: Humans are far more expensive than agents

If agents can perform a task instead of a person, that is almost always a win.

There may be some edge cases where AI is extremely slow or expensive, but in general it either can or cannot perform a task with an acceptable level of quality, and costs far less than a person to do it.

As a result, the main component of agent productivity metrics should be human effort. There will be costs for tokens and other things, but human effort is what really matters.

Principle #2: Agents can usually figure things out given enough context

One senior engineer I work with described his experience using Claude Code:

“It still does really dumb stuff, like it couldn’t get half the tests to pass, so it just deleted them instead of fixing the real problem.”

Anyone who’s spent time with AI has experienced its sometimes shocking lack of common sense.

Yet it will happily stop deleting your tests; you just have to tell it not to!

Its initial mistakes may be uncanny, but so too is its ability to find the right answer given enough context.

There are some limits, but human/AI interactions with the latest models largely entail the human providing feedback and additional instructions, not taking over from agents that are “stuck.” (This is a significant change from older models before December 2025, which were more prone to getting stuck.)

This means metrics can focus on the primary workflow where humans provide feedback to agents until task completion rather than separately handling cases where humans take over.

Principle #3: Context switching kills human productivity

People have limited short-term memory. They have to spend significant time orienting themselves on a new task, and small distractions can knock them out of this productive state.

Every time an agent needs help from a person who’s doing something else, the cost can easily exceed the time it takes to provide that help by an order of magnitude.

Good productivity metrics must take this into consideration.

Principle #4: As agents become more autonomous, people multi-task

The longer agents can run without human feedback, the more likely people are to switch to other tasks or run multiple agents in parallel.

If you’re actively using one agent at the terminal, your productivity would be a function of the total session length (AI execution + human response time) because you are focused on a single task. In this scenario, AI execution time matters a lot and is a primary component of productivity.

As agents become autonomous, however, agent execution time matters less because you can run more in parallel. Instead, the constraint becomes human attention.

Principle #5: As people multi-task, context switching overhead dominates task time

If you’re just running two agents, context switching overhead might not be that bad, especially if they are working on related tasks.

As agents run longer on their own and work on more tasks in parallel, the human effort to provide a response requires an increasing amount of context switching overhead to familiarize oneself with what the agent is doing.

Core agent productivity metric: Input Frequency

With these principles, we can derive a core agent productivity metric:

Input Frequency: The total human inputs required per task

With many agents running in parallel and the human cost dominated by context switching, you can get the most out of agents by reducing the number of times they need human feedback on the path to completing a task.

This metric covers all the different reasons agents may need input, such as making a mistake, lacking instructions, or not having access to necessary information.

Other metrics like total agent runtime, token use, code complexity, etc. may tell you small things about agent productivity, but input frequency focuses on the primary bottleneck: human attention.
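A minimal sketch of computing the metric, assuming session events have been extracted into records with a `task_id` and a `role`. This event format is hypothetical; real session logs will need their own parsing.

```python
from collections import Counter
from statistics import mean


def inputs_per_task(events):
    """Count human inputs for each task, including the initial prompt."""
    counts = Counter()
    for event in events:
        if event["role"] == "human":
            counts[event["task_id"]] += 1
    return dict(counts)


def input_frequency(events):
    """Average number of human inputs required per task."""
    per_task = inputs_per_task(events)
    return mean(per_task.values()) if per_task else 0.0
```

Looking at the per-task counts alongside the average helps spot the few tasks that consumed most of the human attention.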

Reducing agent input frequency

You can start to reduce input frequency by analyzing human inputs and classifying what actions would make each input unnecessary. (The prompts in each session are not currently available in vendor APIs, but can be extracted from local log files.)

Your exact categories may vary, but here are some common ways to make agents more autonomous:

Better Planning – Better planning and requirements would have enabled the agent to get further on its own. The action here is to improve processes to create better up-front plans.
Best-Practice Violations – Certain types of mistakes (e.g., deleting tests) can be categorically fixed by improving your agent instructions file (e.g., CLAUDE.md) in each repository.
Tool Access – Agents may lack access to important tools and systems like your observability platform, CRM, design tools, database, debuggers, etc.
Test Automation – Automated tests serve double duty for agents. Not only do they help verify correctness, they also serve as documentation of your software’s functional requirements that would otherwise live in the developer’s head.
Security/Permissions – If agents require input for approving resource access, you can unblock them with better sandboxing and isolation.
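Once each human input has been labeled with one of these categories (by a person or in a separate classification step), ranking them is simple counting. The sketch below assumes a hypothetical `category` field and an `initial_prompt` label for the message that starts a task.

```python
from collections import Counter


def rank_categories(labeled_inputs):
    """Rank avoidable-input categories from most to least frequent."""
    counts = Counter(
        item["category"]
        for item in labeled_inputs
        if item["category"] != "initial_prompt"  # Starting a task is unavoidable.
    )
    return counts.most_common()
```

The top entry suggests where the next improvement (better plans, instruction files, tool access, and so on) is likely to remove the most interruptions.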

A note about quality

Selecting input frequency as the primary agent productivity metric assumes that developers uphold quality standards rather than letting them slip.

While input frequency is a metric you want to move, it’s equally important to use your existing quality metrics like defect rates as guardrails that do not move as you adopt agents. Otherwise, it’s possible to reduce input frequency by just accepting whatever the agent gives you with less scrutiny.

The way forward

In this article, we looked at why input frequency is a good metric for improving agent productivity along with guardrail metrics to uphold quality.

During the adoption phase of agentic development, the critical path is cutting out unnecessary human involvement. Input frequency makes that the priority.

This relies on a few assumptions, however, that may no longer hold as agents improve:

If agents start one-shotting tasks and your input frequency becomes two (one start, one to approve the result), we will have to look for other metrics to continue pushing the envelope.

Once agents become extremely autonomous, the assumption that humans are more expensive could also turn on its head with bills that exceed the entire team’s salary. In this world, it may become more important to offload agent tasks onto traditional software rather than further increase agent autonomy.

But, whatever the future holds, one thing seems certain: agents will be a big part of it.

Teams that succeed will need actionable metrics to make good prioritization decisions in the world of ever-growing complexity.

Good Ticket Hygiene Helps Engineers Too

Kevin Borders — Thu, 20 Nov 2025 19:27:37 GMT

Every team strives to deliver on their commitments and create value for the organization. The reason I started minware is I believe tracking and improving metrics is an essential part of doing this at scale.

The first layer of metrics we show to people covers five basic ticket hygiene practices: (1) representing work in tickets, (2) estimating tickets, (3) working on tickets in sprints, (4) adding tickets to epics, and (5) setting due dates on epics.

But the question we sometimes get from engineers is: what’s in it for me?

Not in a selfish sense, but engineers want to spend as much of their time as possible creating value for customers. Keeping Jira up-to-date may seem like busy work.

It may also make engineers nervous to share what they’re doing and attach an estimate to it, especially if they don’t feel supported by their manager.

It’s natural for individuals to feel like ticket hygiene does not benefit them. However, it really does if you look at the big picture, which is what we explore in this article.

Estimates expose problems early

Perhaps the biggest reason every software engineer should estimate their work (even if they are working alone on a side project!) is that estimation forces you to think through a detailed plan. Putting story point or time estimates on every ticket and organizing tickets in sprints and epics means you have to consider everything that’s involved with achieving your goal.

One of the most common mistakes software teams make is starting work with only vague back-of-the-napkin plans and continually discovering scope as they go along.

This obviously increases the risk of missing deadlines if you have them, but even if you don’t, not estimating can cause major inefficiency.

Teams that don’t plan well often put too much effort into polishing earlier tasks. Then, as time goes on and they discover how much work is left, there is tremendous pressure to cut scope and ship something valuable as soon as possible.

However, they can’t cut scope on the earlier tasks because those are already done.

Estimating work enables you to address resource limitations up front when you have the most flexibility.

Ticket hygiene communicates clear expectations

For individual engineers, estimates are a critical tool for communicating and managing expectations. By estimating tickets, committing to them in sprints, and setting due dates on epics, your manager and outside stakeholders know what to expect. They can ask you to change your plans if they don’t like what they see, but should otherwise leave you in peace while you’re working.

In contrast, teams without good ticket hygiene tend to be dominated by chaos. Managers and stakeholders always need things done by a certain time. When there aren’t reliable ticket estimates, sprints, and epics, they’ll ping people constantly on Slack, which further interrupts engineers and delays work.

If you’re an engineer and people send you messages every day asking when things will be done (outside of a stand-up), the first step of the solution is better ticket hygiene.

Predictable teams are rewarded

The ultimate goal of good ticket hygiene is to make it easier for teams to deliver on their commitments and create value for the organization.

At first glance, this may not seem like it benefits the individual that much.

However, the team’s success or failure often has more influence on career trajectory than individual performance.

Software teams that deliver on their promises make customers happy and win deals.

Salespeople with dependable software teams promise more and grow market share.

Companies with more revenue pay higher bonuses, expand the team, promote people, and have a good reputation.

Even without counting equity, profit sharing, or bonuses, being on a strong team gives you a major advantage with future employment and compensation.

Imagine these two hypothetical applicants:

Someone with outstanding recommendations from managers at a company that had poor software quality and went bankrupt, or…
An applicant who didn’t get promoted quickly but was on the core product team at a hot tech company that recently had a big IPO

I would rather hire the applicant from the hot tech company whose team shipped features that users wanted. Wouldn’t you?

Everyone should optimize their ticket hygiene metrics

It’s in everyone’s interest – even on the smallest and newest teams – to consistently follow basic ticket hygiene best practices and keep track of them with metrics.

Here’s what you should be doing to estimate work and make sure those estimates are visible to others:

Representing work with tickets – If significant work isn’t represented in your ticketing system, no one will be able to tell when it will be done or if it is complete without communicating out-of-band. Also, under-the-radar work impacts the predictability of ticketed work.
Setting estimates on tickets – Every ticket that you commit to starting (e.g., by adding it to a sprint) should have an estimate of some form. Otherwise, you don’t know how much capacity you’re committing to complete, and you won’t be able to accurately project your velocity in the future.
Adding tickets to sprints/iterations – Any team that does planned work or has to balance stakeholder requests over a longer time frame (i.e., more than just an IT service team with response time SLAs) should add all the tickets that their team members work on to a sprint so that they can estimate when a certain scope of work will be done. Consistently completing tickets that are scheduled in the current sprint tremendously eases the burden of stakeholder communication.
Adding tickets to epics – When there are larger tasks or broader outcomes, it is important to add all the related tickets to a parent epic so that stakeholders can clearly see the status of the high-level work they care about in one place. Without epics, it’s hard to see the status or projected completion date of important business deliverables.
Setting due dates on epics – Whether you do this automatically by sizing epics and lining them up on a timeline or setting each due date individually, specifying estimated due dates is critical for communicating with stakeholders and retrospectively assessing planning accuracy to identify misses and continuously improve.
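Four of these practices can be tracked as simple coverage ratios over ticket and epic records. The sketch below assumes hypothetical field names (`estimate`, `sprint`, `epic`, `due_date`); the first practice, representing work with tickets, is left out because measuring it needs data from outside the ticketing system.

```python
def hygiene_metrics(tickets, epics):
    """Return the share of tickets and epics that follow each hygiene practice."""

    def share(items, predicate):
        return sum(1 for item in items if predicate(item)) / len(items) if items else 0.0

    return {
        "estimated": share(tickets, lambda t: t.get("estimate") is not None),
        "in_sprint": share(tickets, lambda t: t.get("sprint") is not None),
        "in_epic": share(tickets, lambda t: t.get("epic") is not None),
        "epics_with_due_date": share(epics, lambda e: e.get("due_date") is not None),
    }
```

Tracking these ratios over time shows whether hygiene is improving or slipping before it affects delivery.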

Without ticket hygiene to provide essential visibility, it’s nearly impossible for teams to achieve the higher-level goal of consistently delivering high-quality software.

Driving Revenue with Strategic Tech Debt Management

Kevin Borders — Sun, 26 Oct 2025 21:10:18 GMT

I recently hosted a roundtable on technical debt at LeadingEng with Dan Na.

It was interesting hearing from a variety of leaders (albeit biased toward caring about tech debt) how they manage their organizations.

With tech debt, some teams are underwater. They’ve made bad decisions in the past, and tech debt has already cost them more than it would have to fix early on.

For leaders in this situation, it’s straightforward to get buy-in for fixes because you can point to real ongoing business impact (project delays, outages, attrition, etc.).

The roundtable attendees were actually not in this situation. They stayed on top of tech debt by monitoring metrics like time spent fixing bugs and addressing the root cause.

However, they were not fully satisfied with that approach because it still felt fundamentally reactive.

The question arose: What does it look like to manage tech debt strategically and shift from reducing costs to driving revenue?

A major hidden cost of tech debt comes from missed opportunities the organization cannot pursue due to limiting technology.

Organizations that employ a predictive approach to model the impact of tech debt in likely future scenarios can avoid this opportunity cost and enable growth.

The final million dollar question is: how do you prepare for unpredictable events? No one could have foreseen COVID or the rapid ascendance of generative AI, yet some companies won big while others floundered.

The highest form of tech debt management is an anti-fragile approach, which focuses on general capabilities to help organizations thrive in a chaotic, uncertain future.

This article shows how to elevate tech debt management from reactive to predictive and then anti-fragile, transforming technology from cost savings into a strategic driver of growth.

Reactive tech debt management

The first step in effective tech debt management is tracking its impact on your organization today before it gets out of hand.

Leaders who do this well deploy a variety of metrics to monitor the various ways in which debt incurs real costs. They then link those costs to specific areas of tech debt to successfully advocate for improvements with business stakeholders.

Bugs

The first important area to track is bugs. Change Failure Rate is a key DORA metric that looks at how often deployments result in rollbacks, outages, or hotfixes. However, non-critical bugs can be a significant drag on development too. You should also look at total effort devoted to bugs using ticket counts, story points, time logs, or an effort model like the one in minware. Time allocated to bugs is a particularly helpful metric because you can arrive at dollars by multiplying it by engineering salaries.

To use bug cost for tech debt management, you also need to attribute bugs to their cause so that you can say “If we fix this tech debt, these bugs will go away.”

There’s no one-size-fits-all approach for this, but some organizations link bugs to the original changes where they were introduced. Others may use a coarse-grained approach of linking bugs to particular services, repositories, or areas of the code to approximate the cost of tech debt in each.
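As a rough illustration of the coarse-grained approach, the sketch below totals bug effort by service and converts it to dollars. The ticket records, service names, and the $100 hourly rate are all made-up assumptions for the example, not data from a real system.

```python
from collections import defaultdict

# Hypothetical bug tickets: (service the bug is attributed to, hours spent fixing it).
bug_tickets = [
    ("checkout", 12.0),
    ("checkout", 8.0),
    ("search", 3.0),
    ("billing", 5.5),
]

# Assumed fully loaded cost of one engineering hour, in dollars.
HOURLY_COST = 100.0


def bug_cost_by_area(tickets, hourly_cost):
    """Sum bug-fixing hours per area and convert them to dollars."""
    hours = defaultdict(float)
    for area, spent in tickets:
        hours[area] += spent
    return {area: total * hourly_cost for area, total in hours.items()}


costs = bug_cost_by_area(bug_tickets, HOURLY_COST)

# Print the most expensive areas first.
for area, cost in sorted(costs.items(), key=lambda item: item[1], reverse=True):
    print(f"{area}: ${cost:,.0f}")
```

Sorting by cost puts the areas where a tech debt fix would eliminate the most bug effort at the top of the list.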

Slower development

Another effect of tech debt is missed estimates and slower development. One leader I’ve spoken to asked engineers to record an “actual story points” field after completing tasks. He then used this to compute a missed estimate cost by code area to demonstrate the value of refactoring a particular service to his CEO. You can also do something similar with more granular time logs or effort modeling.
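Here is a minimal sketch of how a missed estimate cost by code area could be computed from an “actual story points” field. The task data, the cost per point, and the choice to count only overruns are assumptions for illustration, not the method that leader actually used.

```python
# Hypothetical completed tasks: (code area, estimated points, actual points).
tasks = [
    ("payments", 3, 8),
    ("payments", 5, 13),
    ("catalog", 2, 2),
    ("catalog", 3, 5),
]

# Assumed average cost of delivering one story point, in dollars.
COST_PER_POINT = 1_000


def missed_estimate_cost(tasks, cost_per_point):
    """Return the dollar cost of points spent beyond the estimate, per area."""
    overrun = {}
    for area, estimated, actual in tasks:
        # Only count overruns; finishing under the estimate does not offset them.
        overrun[area] = overrun.get(area, 0) + max(actual - estimated, 0)
    return {area: points * cost_per_point for area, points in overrun.items()}


overrun_costs = missed_estimate_cost(tasks, COST_PER_POINT)
print(overrun_costs)
```

In this made-up data, the payments area accounts for most of the overrun, which is the kind of signal that can support a refactoring proposal.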

At my previous company, we asked engineers to write down a percentage of time lost to tech debt during each sprint retrospective and note the cause. This gave us a clear picture of where we needed to invest in paying down tech debt.

Morale

The most insidious impact of tech debt is bad morale. It’s easy to overlook in the short term, but losing your best people is incredibly detrimental.

It’s essential to keep a pulse on how people feel about tech debt. You and your managers should ask about it in one-on-ones, surveys, and exit interviews.

Depending on the culture of your organization and whether it pays top-of-market for engineers, management may not be receptive to morale as a business justification for fixing tech debt.

Regardless of management attitude, however, you should know when tech debt is putting your best people at risk and push harder on the other reasons for fixing it.

The flip side is that engineers who are most bothered by tech debt are often the most energized by fixing it. If they would otherwise leave, putting them on these projects can be a win/win.

Predictive tech debt management

After you have a healthy program for reactive tech debt management, the next step is to look at the impact of tech debt on the predictable future. This section looks at two ways to do this.

Immediate roadmap impact

Before thinking about far off events, you should look at the opportunities you would pursue today if not for tech debt.

The opportunity cost of projects you don’t do is often far greater than the tech debt impact on those you do. A predictive tech debt management program should take inventory of areas currently held back by technology and look at the value that better technology would unlock.

One approach is to ask product managers to list the problematic areas where they avoid making changes, and also look at top projects that didn’t make the roadmap primarily because of high engineering estimates.

The next question is: what would it look like if we did this work?

There are different approaches here, but one is to copy the current roadmap and create a hypothetical version where you fix an area of tech debt and pursue the initiatives it blocks instead of doing less valuable work.

You can then compare the total value of the hypothetical roadmap to the original one (using whatever value estimates are already in place) to calculate the incremental value created. Finally, you can combine this with estimated cost savings to bolster the case for fixing tech debt.
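The roadmap comparison boils down to simple arithmetic. The sketch below uses invented initiative names and dollar values purely as placeholders; in practice you would plug in whatever value estimates your organization already has.

```python
# Hypothetical current roadmap: initiative -> estimated value in dollars.
current_roadmap = {
    "Reporting refresh": 200_000,
    "Minor UI polish": 50_000,
    "Partner integration": 150_000,
}

# Hypothetical alternative: fix the debt, drop the least valuable item,
# and pursue an initiative that the debt currently blocks.
hypothetical_roadmap = {
    "Reporting refresh": 200_000,
    "Partner integration": 150_000,
    "Fix order-system tech debt": 0,      # no direct value on its own
    "Multi-currency checkout": 400_000,   # blocked by tech debt today
}

# Assumed cost savings from the fix (fewer bugs, faster development).
estimated_cost_savings = 60_000


def roadmap_value(roadmap):
    """Total estimated value of all initiatives on a roadmap."""
    return sum(roadmap.values())


incremental_value = roadmap_value(hypothetical_roadmap) - roadmap_value(current_roadmap)
total_case = incremental_value + estimated_cost_savings

print(f"Incremental roadmap value: ${incremental_value:,}")
print(f"Total case for the fix:    ${total_case:,}")
```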

Modeling the future

After you’ve assessed the impact of tech debt on your immediate roadmap, the next step is to consider what is likely to happen to your organization over the next few years. This is particularly important for organizations experiencing a lot of change.

The exact way you do this will depend on your organization, but you should take time to think through plausible future scenarios with input from business stakeholders. Here’s a list of things to consider as a starting point:

Pace of Hiring - How many engineers may join the team in the future? New employees will be more impacted by tech debt than tenured engineers.
Downsizing/Offshoring - Is it likely you’ll lose some of the engineers you have today and need to get by with fewer? High tech debt can make this a lot more painful.
Customer Volume - If you had more of the same customers, how would this stress the system, especially in non-linear ways with components that have resource caps?
New Customer Requirements - What are prospective and current customers asking for today that you may need to build in the future to capture market share?
New Markets - Is the business likely to pursue new markets with different requirements like languages, currencies, pricing models, on-premise deployment, etc.?
New Distribution Channels - Will the business pursue distribution channels like resellers, affiliates, or partners that come with new requirements?
Platform APIs and Services - Is the business likely to start selling internal capabilities as services? (Tech debt can have a major impact on the ability to pull this off.)
New Technology - Which up-and-coming technologies are likely to require support or compatibility in the future?

Once you have this list of possibilities, I recommend adding two columns: likelihood and technical readiness.

Because the future holds a lot of uncertainty, you don’t have to be super precise. Three categories for each column will probably suffice as a starting point: (1) unlikely / (2) maybe / (3) probably for likelihood, and (1) high / (2) medium / (3) low for technical readiness.

What you’re looking for here are rows where the two scores add up to 5 or 6, meaning a reasonably likely scenario that your technology is poorly prepared for. These represent conversations that need to happen between engineering and business leadership. The business either needs to scale back its ambitions or invest in fixing tech debt now so that its goals are achievable in the future.
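The scoring exercise can be sketched in a few lines. The scenario names come from the list above, but the scores assigned to them here are invented examples, and the function name is hypothetical.

```python
# Likelihood: 1 = unlikely, 2 = maybe, 3 = probably.
# Technical readiness: 1 = high, 2 = medium, 3 = low.
# Each entry: (scenario, likelihood, technical readiness). Scores are made up.
scenarios = [
    ("Pace of Hiring", 3, 2),
    ("Customer Volume", 2, 1),
    ("New Markets", 3, 3),
    ("Platform APIs and Services", 1, 3),
]


def needs_conversation(scenarios, threshold=5):
    """Return scenarios whose combined score is at or above the threshold."""
    flagged = []
    for name, likelihood, readiness in scenarios:
        total = likelihood + readiness
        if total >= threshold:
            flagged.append((name, total))
    # Highest totals first, since those are the most urgent conversations.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = needs_conversation(scenarios)
for name, total in flagged:
    print(f"{name}: total {total}")
```

A spreadsheet works just as well; the point is only that the two columns are summed and the rows totaling 5 or 6 are pulled out for discussion.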

The great thing about this exercise is that it transforms the conversation from engineering costs to engineering as a primary creator of value.

Now, moving your API to GraphQL isn’t just about making engineers more productive; it’s also about gaining new services revenue and selling the product on more platforms.

Anti-fragile tech debt management

Nobody could have anticipated the pandemic, which was a black swan event. Yet, many businesses thrived – not because they predicted it, but because they were flexible and ready for anything.

The same thing is happening now with generative AI. Companies witnessing explosive growth weren’t more prescient, they were more prepared.

Nassim Taleb describes this as being anti-fragile. That is: putting yourself in a position to benefit from chaos and uncertainty.

When chaos erupts, moving quickly is essential for capitalizing on opportunities. Technical debt is like an anchor that weighs you down and ties you to the status quo. Shrewdly managing it is essential for reducing reaction time in sudden, extreme circumstances.

The fundamental difference between anti-fragile tech debt management and predictive management is that you have no list of probable events. Instead, you must look at how engineering organizations generally respond under extreme conditions (a concept Taleb calls convexity) and position yourself to benefit when they occur.

Let’s see what that looks like and how you can prepare for an uncertain future.

How Collage.com won during COVID with low tech debt

At the start of COVID, demand for photo blankets, puzzles, and other gifts skyrocketed. Supply chains were long, so every company that offered photo products was out of stock.

At the same time, no one could raise prices directly due to outstanding gift vouchers from sites like Groupon.

Collage.com (my former company) was able to quickly introduce a new “priority” shipping method between standard and expedited that was still standard shipping, but with a separate priority production queue. We then adjusted the price dynamically based on demand so “standard” might take a few months, but “priority” would always arrive quickly – in effect creating surge pricing.

We implemented this change from concept to launch in a week. Meanwhile, our biggest competitors only released major updates a few times per year.

As a result, we were the only place where you could buy photo products and receive them on time during much of the pandemic, and we were able to charge higher prices.

This ultimately led to an acquisition at around 3x the company’s pre-COVID value.

We could do this because we had invested in things prior to the pandemic like CI/CD, trunk-based development, and systems for pricing and delivery estimation.

If COVID had been predictable, the framework from the earlier section would have shown high technical readiness on our side for an overnight 5x demand surge, while our competitors would have scored low.

Maintenance overhead/OpEx

The first metric to look at for assessing your ability to weather extreme events is total maintenance overhead, sometimes called “keeping the lights on (KTLO).” This is the amount of resources you need to keep your software running if you were to stop all new feature development.

In finance terminology, these are operating expenses (OpEx), while resources that go into new development are capital expenses (CapEx).

You can calculate maintenance overhead by tagging maintenance tasks in your ticketing system. It can be tricky to get this right because certain bugs are caused by new feature launches. Some people handle this by tagging those bugs differently or associating them with a capitalizable project. However you do it, the goal is to identify tasks related to fixing or configuring the current software, as well as necessary patches and updates.

Once you’ve classified maintenance tasks, you can assess how many full-time engineers you would need to run your software in maintenance mode.

Finally, you can link tech debt fixes to maintenance overhead they would eliminate, like improvements to system stability or making certain engineering tasks self-service.
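As a rough sketch of the calculation, assuming tickets carry a maintenance tag and an effort figure in hours, you could estimate maintenance overhead like this. The ticket data and the 480 productive hours per engineer per quarter are assumptions for the example.

```python
# Hypothetical tickets closed last quarter: (is_maintenance, hours of effort).
tickets = [
    (True, 120.0),   # dependency and security patches
    (True, 200.0),   # bug fixes in existing features
    (False, 640.0),  # new feature development
    (True, 160.0),   # manual configuration requests
]

# Assumed productive hours for one full-time engineer per quarter.
HOURS_PER_ENGINEER_PER_QUARTER = 480.0


def maintenance_overhead(tickets, hours_per_engineer):
    """Return the maintenance share of effort and the equivalent headcount."""
    maintenance_hours = sum(hours for is_maint, hours in tickets if is_maint)
    total_hours = sum(hours for _, hours in tickets)
    return {
        "maintenance_share": maintenance_hours / total_hours,
        "maintenance_ftes": maintenance_hours / hours_per_engineer,
    }


overhead = maintenance_overhead(tickets, HOURS_PER_ENGINEER_PER_QUARTER)
print(f"Maintenance share of effort: {overhead['maintenance_share']:.0%}")
print(f"Engineers needed in maintenance mode: {overhead['maintenance_ftes']:.1f}")
```

Re-running the same calculation with the tickets a tech debt fix would eliminate removed from the list gives an estimate of the overhead that fix would save.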

The value of reducing OpEx

Every engineer that you shift from maintenance (OpEx) to new development (CapEx) directly contributes to EBITDA (earnings before interest, taxes, depreciation, and amortization).

Depending on the stage and ownership profile of your business, its total value may be primarily driven by an EBITDA multiple rather than growth rate, revenue, or profit. In this case, the direct value of reducing OpEx is the cost savings multiplied by the EBITDA multiple, which can be quite high (e.g., over 20x for SaaS businesses).

In any case, fixing tech debt that reduces OpEx can be quite valuable even under normal circumstances.

When you consider extreme events, however, the value is even greater.

If there is a sudden liquidity crunch caused by severe financial duress, being able to slim down to a low burn rate can mean the difference between survival and insolvency.

On the flip side, having low OpEx when there is a major new opportunity like generative AI frees up more people to quickly shift toward new initiatives.

For an effective anti-fragile strategy, engineering and business leadership should agree on a value multiple to use for OpEx reduction afforded by tech debt fixes. This multiple should err on the high side to account for the benefit of optionality provided by low OpEx under extreme scenarios. Tech debt fixes that cost less (in terms of engineering time) than multiplying this number by their projected OpEx savings are positive ROI and worth pursuing.
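The decision rule above can be written as a one-line comparison. In this sketch the dollar figures and the multiple of 5 are invented, and treating savings as an annual figure is an assumption; use whatever period and multiple your leadership agrees on.

```python
def opex_fix_is_worthwhile(fix_cost, annual_opex_savings, value_multiple):
    """Positive ROI when the fix costs less than savings times the agreed multiple."""
    return fix_cost < annual_opex_savings * value_multiple


# Hypothetical agreed-upon multiple for OpEx reduction.
VALUE_MULTIPLE = 5

# A fix costing $150k of engineering time that saves $40k per year in maintenance.
cheap_fix = opex_fix_is_worthwhile(150_000, 40_000, VALUE_MULTIPLE)

# A fix costing $300k with the same projected savings.
expensive_fix = opex_fix_is_worthwhile(300_000, 40_000, VALUE_MULTIPLE)

print(cheap_fix, expensive_fix)
```

With these numbers, the savings are valued at $200k, so the first fix clears the bar and the second does not.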

Lead time for changes

So far the metrics we’ve looked at relate to amounts of effort, not latency. A key property of black swan events (natural disasters, financial crises, political upheaval, major new technologies, etc.) is that they are often sudden.

When major unpredictable changes occur, being the first mover is a massive advantage.

The most important metric for gauging responsiveness is lead time for changes, which looks at the total elapsed time between starting work and delivering value.

As defined in traditional DORA metrics, lead time is the time between first code commit and production deployment.

However, at the business level, the lead time that really matters is the time between deciding to pursue a new initiative and delivering value to customers.

While the code lead time matters and you should measure it, you should also measure lead time at the ticket/feature level, and at the project/epic level for value delivery.

In the COVID photo product example, the crucial lead time was the time between deciding to add a new shipping method and having it live in the product. This included time for planning, design, implementation, and testing.

Tech debt can inflate lead times at all stages of the software development lifecycle (SDLC). To manage it effectively, you should estimate the impact that fixing tech debt would have on the end-to-end lead time from project conception to completion.
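To make the different levels of lead time concrete, here is a small sketch that measures all three from one set of timestamps. The event names and dates are hypothetical, and it assumes production deployment is a reasonable proxy for delivering value.

```python
from datetime import datetime

# Hypothetical timestamps for a single initiative.
events = {
    "decision_made": datetime(2024, 3, 1),   # business decides to pursue it
    "ticket_started": datetime(2024, 3, 4),  # work begins on the ticket
    "first_commit": datetime(2024, 3, 5),    # first code commit
    "deployed": datetime(2024, 3, 8),        # live in production
}


def days_between(start, end):
    """Whole days elapsed between two timestamps."""
    return (end - start).days


lead_times = {
    # DORA-style code lead time: first commit to production deployment.
    "code": days_between(events["first_commit"], events["deployed"]),
    # Ticket-level lead time: starting the ticket to deployment.
    "ticket": days_between(events["ticket_started"], events["deployed"]),
    # Business-level lead time: decision to value delivered.
    "initiative": days_between(events["decision_made"], events["deployed"]),
}

for level, days in lead_times.items():
    print(f"{level} lead time: {days} days")
```

The gap between the code-level and initiative-level numbers is time spent in planning, design, and waiting, which is where tech debt outside the codebase itself tends to hide.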

How much is lower lead time worth?

During normal circumstances, lower lead times drive value by allowing organizations to be more responsive to customers, thus winning sales from competitors and more rapidly iterating on customer feedback.

In the COVID photo product example, the value was similar in nature, but greatly amplified. Every week that the new shipping method was live before competitors had an equivalent drove several hundred thousand dollars in revenue.

Lead time is a revenue multiplier, just like the length of the sales cycle for B2B companies. Every day you maintain a first mover advantage, you sell more. Every day a competitor has it, you lose market share.

It’s impossible to know the exact value of lead time for unpredictable future events, but it’s important to put some number on it for the purposes of deciding to fix tech debt.

Similar to OpEx reduction, engineering and business leaders should establish a lead time value multiple that is a portion of revenue. For example, if the multiple is 2, then fixing tech debt that reduces lead time by one week would be worth two weeks of revenue.

If your lead times are slower than the competition, you may want to pick an even higher number to reflect the compounding disadvantage of repeatedly losing market share over time.
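As a worked example of the lead time value multiple, the sketch below values a fix that shortens lead time by one week. The weekly revenue and fix cost are invented numbers, and the function name is hypothetical.

```python
def lead_time_fix_value(weeks_saved, weekly_revenue, value_multiple):
    """Value of a lead time reduction, expressed as a multiple of revenue."""
    return weeks_saved * weekly_revenue * value_multiple


# Hypothetical inputs: $100k weekly revenue and an agreed multiple of 2.
value = lead_time_fix_value(weeks_saved=1, weekly_revenue=100_000, value_multiple=2)

# Assumed engineering cost of the tech debt fix.
fix_cost = 120_000
worth_it = fix_cost < value

print(f"Value of one week less lead time: ${value:,}")
print(f"Worth fixing: {worth_it}")
```

With a multiple of 2, one week of lead time is worth two weeks of revenue, or $200k here, so a $120k fix would be justified under these assumptions.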

End-to-end strategic tech debt management

In this article, we’ve looked at different methods for estimating the value of fixing tech debt. Beyond traditional cost analysis, we’ve shown how to account for the impact of technical prowess on revenue under both predictable and unpredictable circumstances.

By combining reactive, predictive, and anti-fragile methods together, you can assess the full value of fixing tech debt and help technology lead the organization into the future rather than follow.

The Danger of Doing What You Love

Kevin Borders — Tue, 24 Jun 2025 19:04:35 GMT

When I left the NSA back in 2013, people thought I was crazy – not for starting a company, but for leaving behind security to do photo products.

Many of my peers were security people. Their entire identities were wrapped up in being hackers. They even dressed the part.

They loved their work so much that it made considering other career opportunities unthinkable.

It’s good to like your job, but too much emotional attachment can sabotage rational decision making.

Here are eight reasons why you should think twice before choosing a career based on your passion and be wary of those in your organization who do.

#1: Growth Comes from Failure, Not Love

Working on something you love is rewarding, but you may not learn much.

When I look back at all the things I had to do in my career, the ones that most made me a better person were those where I had to overcome tremendous anxiety and difficulty.

I had no initial passion for hiring, and nobody enjoys having to fire underperforming employees. But, in seeing the negative impact that bad hiring decisions had on my friends and colleagues, I found motivation to overcome my fears and get a lot better at building cohesive, high-performing teams.

If you only focus on things you love, life and all of its rewarding challenges will pass you by.

#2: Love Lets You Work Inefficiently

It is important to find satisfaction in your work, which is essential for motivation.

However, it is also important to work toward a goal and reach that goal so that you can produce something of value for other people.

When motivation comes from loving what you’re doing rather than being done with it, it’s easy to get carried away and spend way more time than you should.

I have worked with software engineers who struggle to complete tasks on time because they simply enjoy the process too much and over-perfect their work.

Those who take pride in their work but see coding as a means to an end are better engineers.

#3: No One Loves Grunt Work

Regardless of what you pursue, success requires grunt work that no one enjoys. You may have to deal with office politics, clean toilets, or do odd jobs to make ends meet in order to achieve your goals.

The problem is, if you’re only in it for love, then you might have a hard time finding motivation for the unglamorous but essential tasks.

I have worked with passion-driven people who didn’t want to do things like writing automated tests, even though those tasks were essential for long-term success.

#4: You May Need to Kill the Sacred Cow

When you love something, you have an emotional attachment that goes beyond achieving an end result and may transcend practicality.

In a real-world organization, resource constraints force you to compromise. If you are too attached to your work, you may have a hard time stopping when it’s time to be done.

I have seen this take many forms, from reluctance to raise prices on customers because you feel like one of them, to not wanting to ship software until it’s perfect or not cutting a marginal feature because you like it yourself.

Change and loss aversion are already tough pills to swallow. Love makes it even harder.

#5: Opportunities Get You There Faster

If I had prioritized my passion (computer science research) right after finishing school, I’d have missed all the opportunities to learn, grow, and step outside my comfort zone at a fast-paced start-up. I also wouldn’t have built the relationships, skills, and financial resources to do whatever I wanted next.

Doing what you love instead of following the best opportunity is like putting “shortest route” into your GPS instead of following the highway. Good opportunities will take you anywhere you ultimately want to go much faster, especially earlier in your career.

#6: Love is Fleeting

Another hazard of doing what you love is that love can fade over time.

It’s always exciting to explore a new area and encounter fresh ideas on a daily basis. During this honeymoon period, positive emotions flourish.

Once you’ve walked every path, however, things start to get dull. You become aware of the roadblocks that hinder progress, and realize that other people have already thought of all your good ideas.

If love was your main motivation and that love goes away, you have nothing left.

I have worked with people who started out with a lot of excitement, but lost interest after 3-6 months. True success takes years of dedication, and relying on love alone probably won’t be enough.

#7: Doing What You’re Good at Is Better for Others

Unless you’re a rare person whose talent and passion align perfectly, following your passion instead of your talent takes an economic toll on society.

If you are a brilliant accountant but pursue medicine because you like to help people, then another less talented person (who might be better at medicine) will have to do the accounting anyway, making everyone less well off.

#8: Love Is Hard to Quit

People who choose the path that they’re on out of love have a strong desire to stay on that path. While perseverance is admirable, sometimes you need to quit.

After I sold my last company, things went south quickly. It was a bad environment and a lot of people left within a few months, myself included.

However, a lot of people stuck around because they loved photography, were photographers themselves, and cared deeply about customers.

These people put up with poor treatment out of passion for photography, which made the problem worse because managers took advantage of the fact that they’d stay regardless of the circumstances.

Being detached enough to walk away from your work is important to avoid getting stuck in a place that no longer offers good opportunities.

How I Chose My Career

After selling my last company, I was lucky enough to have the freedom to do what I wanted next. This was incredibly fortunate, but it also made the decision difficult.

The Love Option: Education

I am passionate about education. Looking back at my own experience, I see a lot of slow growth and missed opportunities.

I have young children who will face these same difficulties soon.

I believe society could be a lot better if the education system empowered everyone to reach their full potential. Now is also the time for disruption due to worldwide internet connectivity.

And yet, actually starting a school or education platform is fraught with problems.

I don’t know that much about education since I’ve spent my career building software. While it’s easy to see problems with the education system, I probably take for granted a lot of things that it does well, all of which I’d have to learn.

Educators also face many challenges working with kids and parents – dealing with harassment and bullying, managing parents with unreasonable expectations, etc.

Then there’s the matter of financing. Students most in need of education have no money. I have no experience with non-profit fundraising, but I imagine it is much harder than raising venture capital, which is already a challenge.

The Opportunity Option: Software Engineering Analytics

On the other hand, taking an objective look at the intersection of my skills, experience, and opportunities clearly pointed in one direction: software engineering analytics.

I have been building software professionally for 20 years and managing engineers for 10. I have already gone through the Dunning-Kruger cycle of initial overconfidence followed by disillusionment and developing true expertise.

I enjoy teaching, but am probably much better at building software at this point.

The time is also right with engineering analytics due to the confluence of new technology like AI and large data platforms with the growing desire to improve productivity. There is no established leader, which leaves an opening for a new company to win the market.

While engineering analytics won’t directly give my children a head start in life like education would, I have seen the pain that bad management inflicts on people. If I can help millions of software engineers feel less stress and have more time to spend with their children, then that is probably the best thing I can do for the world, and the right decision for me.

Discipline Is the Foundation of Innovation

Kevin Borders — Tue, 29 Apr 2025 19:38:35 GMT

“I’m actually as proud of the things we haven’t done as the things I have done. Innovation is saying ‘no’ to 1,000 things.” – Steve Jobs


Books about famous innovators like Steve Jobs or Jeff Bezos are filled with tales of revolutionary new ideas that disrupt the status quo.

What you don’t often hear about, however, is the extreme focus and discipline it takes behind the scenes to make innovation possible.

If you want to revolutionize your industry, your mental effort must go toward your biggest challenges, and you should execute everything else by the book.

The goal of minimal engineering is to be this book for people who build software – the missing chapters of every success story detailing the battles innovators chose not to fight, which are just as important as the battles that they won.

This article highlights a few areas where I’ve seen people (including my former self) waste the most time struggling against immovable laws of software engineering, with the hope that you can steer clear of them and take a shorter road to innovation.

Long-term roadmaps are important

Agile software development emerged in the 90s as an antidote to the inefficiencies of the waterfall model.

The core problem with waterfall is that extensive planning happens up front, followed by a long development process. By the time the software ships, it is often out of date because the requirements were finalized months or years earlier.

Long software development cycles and changing requirements are a real issue, but some agile proponents have thrown the baby out with the bathwater by eschewing long-term roadmaps entirely, claiming they are harmful because “you can’t predict the future.”

You should be highly skeptical of practices that are based on broad generalizations like “you can’t predict the future.”

The obvious truth is that sometimes you can predict the future, and sometimes you can’t.

Hindsight bias causes people to think the future is more predictable than it actually is, which is a real challenge.

However, not planning for things that are predictable is foolish. Conversely, a good long-term roadmap can be extremely beneficial.

Jeff Bezos is famous for basing Amazon’s strategy around things that won’t change, like that customers will always want low prices and fast delivery.

On a software team, you may not be able to predict what features customers will want or even what line of business you’ll be in next year. But if you stop and think, there may be more invariants than you realize.

If you sell software to businesses, for example, you will need to handle data. You will need security. You will need a reliable billing system. You will probably need role-based access control for larger customers, and audit logging.

It’s possible you’ll stop building business software or go bankrupt. However, if you’re 95% sure you’ll have to do something, long-term roadmap planning dramatically improves your chances of success. It helps you make better decisions now about resources, architecture, and systems that you will need in the future.

Disciplined roadmap planning requires flexibility in adapting to future uncertainty, but also diligently assessing things that won’t change.

Smaller tasks are better

The reason people fall into the trap of avoiding long-term planning is that it can be dangerous. Biting off too much at once to achieve a big vision dramatically increases the chances of failure.

A key tenet of lean methodology’s approach for reducing waste is to keep task sizes small.

The scrum methodology involves “sprints” that are usually two weeks, and other frameworks like Shape Up involve a maximum iteration size of six weeks.

The efficiency of small tasks is fairly common knowledge, but people still struggle to put it into practice.

The reason is that when you first plan a new piece of functionality, that plan describes the end state. It often contains dependencies that prevent it from really working until the whole thing is finished.

By default, projects tend to start out large.

It usually takes significant effort beyond initial planning to break dependencies and deliver code, features, and user value in smaller pieces.

This extra effort might not seem worthwhile at the start when you intend to complete the whole project. Why spend an extra day to break up a 4-week project into two 2-week milestones when you could use that day to start working?

The issue, of course, is that things rarely go as planned. The larger the plan, the more likely this is to happen.

Discipline with small task sizes means assessing whether people have devoted enough effort to breaking down tasks up front, and then looking back at task sizes in project retrospectives to continuously improve.

What Every CEO Should Know About Software Planning covers this topic in more detail.

Some tasks are inherently large

Sometimes teams recognize the importance of long-term roadmap planning and small task sizes, but still experience massive project cost overruns.

The issue is that some initiatives really do require months or more of well-executed work to realize their full value, no matter how hard you try to break them down.

Examples of large tasks include migrating to a new platform, overhauling a major system, or making changes to core software architecture.

Most work doesn’t fall into this category, but I’ve never seen a real business that doesn’t encounter major technical initiatives from time to time.

If you commit to such an initiative and start working on small pieces without carefully planning the full scope and how everything will fit together, you’re asking for trouble.

When my last company was acquired and we merged engineering teams, the buyer was nine months into a three-month effort to replace their payment system. I quickly learned that the culprit was refusal to plan more than two weeks ahead because doing so “wasn’t agile.”

There’s a difference between implementing software in small batches and incomplete planning.

Disciplined project execution requires working on tasks in small iterations, but also planning each iteration in detail to avoid unpleasant surprises.

If people say this “isn’t agile”, then too bad. Neither is working on multi-month projects in the first place, but sometimes that’s the reality.

Standardized processes are more efficient

Some big and successful tech companies like Facebook are known for giving their teams freedom to work in whatever way they want.

This approach resonates with engineers who like doing things their own way.

It also avoids overly restrictive processes that can emerge at large organizations and stifle innovation.

While this freedom may be good for teams in the short term, it comes at a significant cost.

In the long run, process fragmentation makes things like training, switching teams, reporting on activity, and managing multiple teams a lot harder.

It also adds cognitive burden as people debate low-value process choices rather than focusing on bigger challenges.

(Keep in mind that Facebook is flush with cash and is able to hire the most talented engineers in the world, so their teams’ capacity to self-manage may mitigate these costs more so than at other companies.)

The truth is that much like the argument of tabs vs. spaces, most process choices don’t have a major impact one way or another, but inconsistency does.

Organizations are generally better off just standardizing things like version control systems, ticket tracking systems, and even sprint processes.

Disciplined process standardization necessitates weighing the global, long-term cost of fragmentation against the potential benefit of flexibility, and is a microcosm of overall disciplined innovation.

Quality is important

Underinvesting in quality didn’t use to be as common a problem, but The Lean Startup movement and Facebook’s mantra of “move fast and break things” changed that.

Modern innovators are under tremendous pressure to discover customer needs and build valuable products as quickly as possible.

Like with agile, some practitioners have taken the idea of a minimum viable product (MVP) too far and saddled their business with major quality problems.

The heart of the issue is the distinction between a prototype and production software.

This distinction is muddied by having early users pay for prototypes, which can shift their perspective and make their feedback more valuable.

Once customers are paying for a prototype, it’s also tempting for start-ups with limited runway to keep selling the prototype rather than switching modes and investing in production software.

There’s no clear line between prototype and production, but you should ask yourself: if there’s a medium-severity bug, will customers expect you to fix it?

If the answer is yes, then you need production quality, which involves observability, automated testing, on-call rotations, and prioritization practices like fixing all your bugs.

Failure to enact quality discipline will lead to engineers spending all of their time dealing with urgent interruptions rather than building new innovative functionality.

Fixed-scope deadlines hurt quality

All software businesses face pressure to deliver on time and on budget.

Inexperienced leaders often make the mistake of believing it is possible to do so without sacrificing quality.

In reality, when leadership asks a team to complete a fixed scope of work by a deadline without compromising quality, the team is forced to cut corners in ways that are not immediately apparent.

This might involve shipping sloppy code that is difficult to read and maintain, or forgoing automation of important tests.

Quality will ultimately suffer down the road, making future development slower and creating a downward spiral if management fails to ease up on their expectations.

Teams that repeatedly cut quality are not fun places to work.

The better approach is to give engineering teams autonomy over quality by making project scope flexible.

As projects evolve, engineers will discover that some features are easier to implement than expected, and some are much more difficult. Empowering them to actively discuss scope reduction with product managers throughout the project greatly improves your overall return on investment and helps maintain quality standards.

Your architecture and development system is a product

When you first create new software, you don’t really have your own architecture or development environment. Instead, it is based on third-party systems.

As you create more specialized functionality for your business, those third-party systems become increasingly inadequate for building the software that you need.

It is important to recognize from day one that the software you use to build your software (your “platform”) is a product itself, and is critical to your competitive advantage.

To manage your platform well, you need clear ownership and resources.

A dedicated platform team may not make sense for smaller organizations, but inattention to the platform and core architecture can lead to spiraling technical debt and inability to get anything done.

Some entrepreneurs take a cavalier approach to tech debt, claiming there will be more resources to fix it later and all that matters is product demand.

Sure, some highly sticky businesses like Twitter have pulled out of a tech debt spiral, but others with fewer resources may not be able to do so. Also, even though Twitter survived, fixing their tech debt was extremely costly and letting it accumulate was probably not a good decision.

A disciplined software team should be able to identify who owns decisions about platform investments, measure how much effort is being dedicated to them, and assess whether the level of investment is appropriate for their stage of growth.

Data analysis is important

Many organizations claim to be data driven, but struggle to use data effectively for making decisions.

The most important thing that people fail to understand about data is that they’re already using it every day. Making a decision “without data” is impossible – this actually just means relying on your memory of information you’ve gathered over time.

The human mind is incredibly powerful and can arrive at insights that are quite hard to derive from quantitative analysis.

However, the human mind is also incredibly biased.

The availability heuristic can dramatically skew one’s sense of the frequency and severity of events, especially when drawing from a limited sample size like things you’ve heard in conversations.

The double-whammy is layering on confirmation bias and only focusing on data that confirms your pre-existing beliefs.

If you want to build software with discipline following the principles outlined above, you must also instill discipline around data analysis so you can accurately assess your progress.

Effectively using data requires analysis by someone who is trained and has the proper context to interpret the results.

This is not to say that people without “data” in their job shouldn’t be allowed to analyze data, but they do need proper training for their domain.

Part of this training involves statistical literacy to avoid common pitfalls like mistaking correlation for causation or concluding that a number “changed” when the change is within the normal range of random variation.
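
To make the "normal range of random variation" point concrete, here is a minimal Python sketch. It assumes a simple rule of thumb (flag a value only if it is more than two sample standard deviations from the historical mean). The function name, the threshold, and the sample numbers are all made up for illustration, and a real analysis may call for a more careful test.

```python
from statistics import mean, stdev

def is_outside_normal_range(history, new_value, threshold=2.0):
    """Return True if new_value is more than `threshold` sample standard
    deviations away from the mean of `history` (a rough rule of thumb)."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # No historical variation at all: any different value is a change.
        return new_value != center
    return abs(new_value - center) / spread > threshold

# Made-up example: tickets closed per week over the past eight weeks.
weekly_tickets_closed = [21, 18, 24, 20, 19, 23, 22, 20]

print(is_outside_normal_range(weekly_tickets_closed, 18))  # False: ordinary noise
print(is_outside_normal_range(weekly_tickets_closed, 9))   # True: worth investigating
```

A week with 18 closed tickets looks like a drop next to the average of about 21, but it sits well inside the historical spread, so it should not be reported as a change.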

A more subtle error that even trained data analysts make is failing to understand the idiosyncrasies of their source data. Real data often contains significant errors or gaps. If you don’t work with the data regularly and have a solid understanding of how it was collected, it is easy to overlook these issues.

For example, if you’re looking at code commit activity to assess task size, you can miss data if you don’t properly handle squash and rebase merges.

Using a third-party vendor (like minware) can help fill in a lot of this context and make data more self-service.

However, no vendor will know the full context of your business, so it’s important to have an in-house expert to configure vendor tools and curate accurate reports.

For example, when we create reports about development activity with minware, we provide expertise on interpreting Git commit data to avoid the problems with squash and rebase merges mentioned above. However, we can’t know (without having someone embedded in the company) whether a person with lower output is part-time, an intern, or has other non-development responsibilities that make their level of contribution in line with expectations.

It is particularly important for leaders to demonstrate discipline in this area, because they rarely have the context to analyze data themselves. They should solicit analysis when consuming data and foster skill development so that their organization can successfully use data to make better decisions.

Conclusion

To innovate, you need to be aggressively revolutionary in your business, but also maintain focus to avoid distraction from your core purpose.

At the same time, software engineering is a complex endeavor filled with many choices and pitfalls.

I and many others have navigated these pitfalls the hard way, wasting a lot of time on things that weren’t on the critical path to innovation.

The goal of minimal engineering is to share timeless, hard-won knowledge and provide a framework for software engineering discipline to maximize your chances of revolutionizing your industry.

We have covered a few core tenets of minimal engineering in this article, and plan to expand its breadth and depth in future articles.

Reducing Cycle Times With Design As Code

Kevin Borders — Mon, 03 Mar 2025 21:03:22 GMT

As a small, bootstrapped company, we’re always looking for ways to simplify our internal workflows.

Code changes that add new features or fix bugs tend to get a lot of attention, with lead time being one of the four DORA metrics.

However, people often overlook workflows that cross department boundaries, like updating the public website for a SaaS product.

We want our website to have high-quality graphics that accurately depict our product.

In the past, we’ve made each one manually with Figma. However, this has become increasingly time-consuming and slowed down the cycle times for marketing tasks.

To address this constraint, we’ve adopted Design as Code (DaC) – making visual asset production self-service for the marketing team.

DaC drastically sped up the workflow for updating site graphics in exchange for up-front engineering work. This graphic – which I created in 60 seconds with our new DaC system – shows the effect:

This article offers tips for rolling out Design as Code on your team to speed up visual asset production.

What is Design as Code?

Previous discussions of Design as Code have focused on applying version control and review processes to design artifacts. While a good first step, this falls short of actually defining design artifacts with code, which is what we’re talking about here.

Design as Code also shouldn’t be confused with Design to Code, which involves converting design files into CSS, HTML, etc. This goes in the opposite direction and is more like “Code as Design.”

I actually tried one of these systems – anima – and it was a mess. The code seemed okay at first, until I put it in version control and made a tiny change to the design. The diff was enormous and completely impossible to merge with other changes. These systems only work if you use the code as-is, which isn’t really feasible for production web applications.

Instead, Design as Code means defining design elements with code and then programmatically generating the visual assets you use in your application or website.

What is the designer’s role with Design as Code?

Design as Code is actually a boon for designers because it lets them spend more time on design and less time on repetitive image creation.

To understand what the designer does with Design as Code, consider the following graphic, which we’re now generating with DaC on minware’s website:

There’s a lot of important work for the designer to do here, such as specifying the fonts, color scheme, icon system, element spacing, corner rounding, etc.

We still created an original version of this graphic in Figma to experiment with these parameters and produce the initial design concept.

However, with Design as Code, the designer no longer has to build and export different versions of this graphic every time we want to show different chart contents.

What is the engineer’s role with Design as Code?

The front-end engineer plays a much larger role with Design as Code than in a traditional design implementation.

In a traditional workflow, the engineer takes the design files from the designer and implements the component structure and styles for the page layout.

However, for complicated static graphics (like the chart above), the designer usually provides rendered images.

With Design as Code, the engineer implements code that will generate rendered graphics from a simple specification, which can be kept in a content management system (CMS).

We specify the chart above entirely in JSON, which is editable with live preview in a CMS. To make things real, here is the actual JSON for that chart:

Our Design-as-Code system then converts that JSON into an SVG image that can be used anywhere. The end result is similar to what a designer would export, but it’s fully automated.
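
As a rough illustration of the JSON-to-SVG idea, here is a minimal Python sketch. It is not minware's implementation (which is built with React) and it does not reproduce the actual chart specification. The spec fields (`title`, `width`, `height`, `barColor`, `bars`) and the `render_bar_chart` function are hypothetical, chosen only to show how a small declarative input can drive fully computed positioning.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical spec: the kind of small JSON document a marketer might edit in a CMS.
SPEC_JSON = """
{
  "title": "Hypothetical chart",
  "width": 320,
  "height": 200,
  "barColor": "#4a90d9",
  "bars": [
    {"label": "Q1", "value": 3},
    {"label": "Q2", "value": 6},
    {"label": "Q3", "value": 4}
  ]
}
"""

def render_bar_chart(spec):
    """Render a simple bar chart spec to an SVG string.

    All positions are computed here because SVG has no CSS-style layout.
    Assumes a non-empty list of bars and trusted color values.
    """
    width, height = spec["width"], spec["height"]
    bars = spec["bars"]
    margin, title_height, label_height = 16, 24, 20
    plot_top = margin + title_height
    plot_height = height - plot_top - label_height - margin
    slot = (width - 2 * margin) / len(bars)   # horizontal space per bar
    bar_width = slot * 0.6
    max_value = max(bar["value"] for bar in bars) or 1

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">',
        f'<text x="{margin}" y="{margin + 12}" font-size="14">'
        f'{escape(spec["title"])}</text>',
    ]
    for i, bar in enumerate(bars):
        bar_height = plot_height * bar["value"] / max_value
        x = margin + i * slot + (slot - bar_width) / 2
        y = plot_top + plot_height - bar_height
        parts.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_width:.1f}" '
            f'height="{bar_height:.1f}" fill="{spec["barColor"]}" rx="3"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2:.1f}" y="{height - margin}" '
            f'font-size="11" text-anchor="middle">{escape(bar["label"])}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

svg = render_bar_chart(json.loads(SPEC_JSON))
print(svg)
```

The person editing the spec only chooses labels, values, and a color. Spacing, scaling, and corner rounding stay in code, where the designer's decisions are encoded once.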

Why not just use screenshots?

Instead of exporting files from a design tool, you could take screenshots of your application.

Some people do this, especially if they don’t have a full-time designer.

Before DaC, we did this at minware for many of our graphics.

The biggest issue with this approach is that it’s not actually as easy as you think.

You have to make sure the browser window is the exact right size to get screenshots of a particular dimension. If you want screenshots that are big enough that they look good when downsized to different resolutions on different pixel-density monitors, then you also have to use browser zoom, which further complicates things.

Getting components like charts into the right configuration within your application can be tricky too. It often requires a test environment where you can wire up specific sample data. Otherwise, you have to edit the content in a browser debugger to do things like replace real customer information with example text.

Then there’s the issue of updating images when your application changes. For every visual update, you have to regenerate all of your screenshots. (Or not, and have an out-of-date marketing site.)

Finally, compared with SVGs (which are vector-based), static images are a pain to work with. To keep them from being blurry, they need to be high-resolution, and then you need a complex system for downsizing them on the back end based on how large they will appear on the page. (If you don’t, they’ll be blurry, or load slowly.) Also, static images will always be larger and take longer to transfer over the network.

Why not just use application components?

Another thing you may be wondering is: “If you’re showing this chart on your website, don’t you have an application component that displays the same thing?”

We do, and this is a good question. Making application components flexible enough to use statically is a legitimate approach for implementing Design as Code. It can also save time by eliminating the need for a second DaC implementation for each component.

However, using real application components presents a few challenges.

The first is performance. The actual component we use to render charts in our application is complex because it handles all sorts of scenarios, like different data ranges, axis labels, and legend items. It also handles hovering over points, clicking to drill down, animation, etc.

Using the application component to render the figures on our home page would make our load time unacceptably slow.

Our Design-as-Code rendering function is much faster and smaller than our application component because it omits features and guardrails that static examples don’t need, like handling text overflow or paging large lists of legend series.

For landing pages, you may also want to do things visually that you don’t do in the application, like scale, rotate, transform, or blur the graphics. You might also want to simplify the elements (e.g., our real charts have an extra line of detailed text that we cut from the DaC component), or expand and add a drop shadow to “pop out” a sub-component.

In the end, making your components do double duty for the application and for static pages will leave them less than optimal for both.

How can you implement Design as Code?

Your approach for adopting Design as Code will vary depending on your particular application and the types of graphics that you want to create.

For minware, we just needed to produce SVG graphics for charts in our application UI. You can easily output SVG instead of HTML elements using React. The key difference is that you become responsible for positioning everything (including measuring text) instead of being able to rely on CSS.

So, we just created components that progressively rendered the other components inside of them – eventually accumulating and outputting the nodes as an SVG.

SVGs seemed easiest because they were the most portable, but you could also use a canvas-based rendering system like react-canvas or react-konva.

However you implement rendering, the most important decision is how you structure the input. It should give internal users (e.g., people on the marketing team) flexibility while abstracting the details.

Obviously, your input could just be a list of exact line positions, but that wouldn’t really make things easier.

The JSON specification from earlier provides a good example of how to strike a good balance.

Takeaways

Hopefully you learned something and can save time with Design as Code in your organization.

However, the true message is larger than that: workflow optimization isn’t just for engineering; it’s for the whole organization (though help may be needed from engineering).

Engineering leaders who adopt lean process metrics like DORA to optimize software delivery should look beyond their team as well – there may be significant opportunities to help the whole organization move faster.

You Need Data to Write a Fair Engineering Performance Review

Kevin Borders — Thu, 16 Jan 2025 22:01:23 GMT

There is a lot of skepticism about using data for engineering performance reviews.

Bad managers have a long history of abusing data, such as stack ranking and firing people based on lines of code.

However, these abuses are not an indictment of data itself. When applied properly, data can fill critical knowledge gaps and provide clarity for managers, rather than dehumanizing them by trying to replace their judgment.

There’s no such thing as a data-free review either. Managers who “don’t use data” are really just relying on qualitative data from their own memory.

Unfortunately, human memory is severely biased, even with the best of intentions.

The performance appraisal Wikipedia page has a good list of reasons why managers misjudge performance that should cause concern, such as Recency Bias (overweighting more recent events because they’re easier to remember) and Halo Effect/Horn Effect. Here’s a quotation about the halo effect in particular:

"In the work setting, the halo effect is most likely to show up in a supervisor's appraisal of a subordinate's job performance. In fact, the halo effect is probably the most common bias in performance appraisal. Think about what happens when a supervisor evaluates the performance of a subordinate. The supervisor may give prominence to a single characteristic of the employee, such as enthusiasm, and allow the entire evaluation to be colored by how he or she judges the employee on that one characteristic." (Schneider, F.W., Gruman, J. A., & Coutts, L. M., Applied Social Psychology, 2012)

If you think experienced managers are immune, then you are mistaken (and also possibly falling victim to the bias blind spot – a bias that makes you think you’re less biased than other people).

I have been managing engineers for over a decade, and still ran into my own bias during a recent review. Before digging into the data, I had an overly negative impression of an engineer’s performance due to recent struggles with a big project.

After exploring his metrics more deeply, however, I found that his performance was only worse during the month he was working on that project. The rest of the year, his metrics were similar to others on the team.

If you want your performance reviews to be fair, you need to use data to cover your blind spots.

This article shares several metrics that you can use to get a more holistic view of engineering performance in three key areas: recognizing contributions, pace of work, and quality.

Recognizing contributions

The first thing a performance review should do is recognize an employee’s meaningful contributions. This affirms that the manager both notices each contribution and recognizes its value.

Managers should expect any contribution that they fail to recognize to not happen in the future.

It’s particularly important to include things that go above and beyond the normal call of duty and are less visible, like working extra hours, helping out others, handling behind-the-scenes grunt work, or having a positive attitude under difficult circumstances.

Managers should also make an extra effort to understand the accomplishments of people who don’t like to brag about their work. Failing to do so creates a toxic culture of brown-nosing that drives away humble performers.

Below we’ll look at several ways data can highlight easy-to-miss engineering contributions.

Development effort by project, work type

The first metric you should look at is the total amount of engineering effort dedicated to each project during the performance review period based on assigned tickets.

This can take the form of a pivot table in a spreadsheet that aggregates story points, ticket counts, or time logs by epic for each assignee. minware’s individual contributions report can also tell you more precisely how people spent their time based on their commit activity.
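
If your ticketing system can export tickets, the pivot table can also be built in a few lines of code. The following Python sketch assumes a hypothetical export format where each ticket has `assignee`, `epic`, and `points` fields, with `epic` set to `None` for tickets outside any project. The names and numbers are made up.

```python
from collections import defaultdict

# Hypothetical ticket export; field names and values are illustrative only.
tickets = [
    {"assignee": "alice", "epic": "Billing revamp", "points": 5},
    {"assignee": "alice", "epic": None, "points": 2},
    {"assignee": "bob", "epic": "Billing revamp", "points": 3},
    {"assignee": "bob", "epic": "Search", "points": 8},
    {"assignee": "alice", "epic": "Billing revamp", "points": 3},
]

def points_by_assignee_and_epic(tickets):
    """Sum story points per assignee and epic, like a spreadsheet pivot table."""
    pivot = defaultdict(lambda: defaultdict(int))
    for ticket in tickets:
        # Keep non-project work visible in an explicit "none" bucket.
        epic = ticket["epic"] or "none"
        pivot[ticket["assignee"]][epic] += ticket["points"]
    return {assignee: dict(epics) for assignee, epics in pivot.items()}

pivot = points_by_assignee_and_epic(tickets)
for assignee, epics in pivot.items():
    print(assignee, epics)
# alice {'Billing revamp': 8, 'none': 2}
# bob {'Billing revamp': 3, 'Search': 8}
```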

It’s especially important here to break down the “none” bucket for tickets that aren’t part of a project. Many engineers spend a significant portion of time on miscellaneous non-project tasks like bug fixes and maintenance.

If you don’t have consistent issue types, parent tickets, or labels for non-project work types, then you should sample some of those tickets and categorize them to better understand where time went.

Failing to recognize people for bug fixes, maintenance, and stakeholder requests will ultimately lead to that work being neglected, which you probably don’t want.

Unticketed work

It’s also essential to look at people’s contributions beyond their assigned tickets. There are many ways that people help out on a team that are important to recognize.

As a manager, you should dig into each system where people operate to sample their activity during the evaluation period. Ways to do this include searching through email, Slack channels, or documentation systems by a person’s name.

You can also run reports in your ticketing system to see how many tickets the person created or commented on without being assigned.

Similarly, you can see their code reviews by searching your version control system.

This may be a bit difficult to do depending on your Git provider, but you should also look at commits and merges on PRs opened by others to see how often developers helped out on tickets assigned to someone else.

minware’s individual contributions report covers some of this by including code reviews and ticket creation. (Its work time metrics naturally count time spent on commits associated with other people’s tickets.)

Finally, there is the human factor. All this data will tell you what someone did, but it’s up to you to interpret the data and decide what had the biggest impact. You should also solicit feedback from others on the team and combine everything you’ve learned into a comprehensive assessment of your employee’s contributions throughout the year.

Assessing the pace of work

After recognizing contributions, the next thing you may want to assess is how much effort it took to deliver those results, and whether that level of effort was appropriate. This section covers a few metrics that can help with assessing the pace of work. (Though this list is not meant to be comprehensive and you may want to look at other metrics depending on your situation.)

Velocity Metrics

Here, the classic story point velocity metrics can be helpful. Velocity has its limitations, but can work well when comparing one person’s pace of work to previous time periods (keeping in mind that sometimes velocity fluctuations are caused by inaccurate estimates rather than less work getting done).

Looking at points or other estimate units completed every sprint (or every few weeks if you’re not using sprints) can identify hot spots where the engineer completed less estimated work than usual during that period.

If you have time logs or you’re using a system like minware, you can also create a report that divides story points by the time spent on each ticket, which gives you precise visibility into which projects and issue types had lower than normal estimated velocity.
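
As a sketch of that calculation, the Python below divides total story points by total logged time for each group of tickets. The field names (`issue_type`, `points`, `days_spent`) and the data are hypothetical. Summing before dividing is a deliberate choice here, so that one tiny ticket with an extreme ratio does not dominate the result.

```python
from collections import defaultdict

# Hypothetical tickets with logged time; all values are illustrative.
tickets = [
    {"issue_type": "feature", "points": 5, "days_spent": 4.0},
    {"issue_type": "feature", "points": 3, "days_spent": 2.0},
    {"issue_type": "bug", "points": 2, "days_spent": 4.0},
]

def velocity_by_group(tickets, key):
    """Return story points completed per day of logged time for each group."""
    points = defaultdict(float)
    days = defaultdict(float)
    for ticket in tickets:
        points[ticket[key]] += ticket["points"]
        days[ticket[key]] += ticket["days_spent"]
    # Skip groups with no logged time to avoid dividing by zero.
    return {group: points[group] / days[group] for group in points if days[group] > 0}

velocity = velocity_by_group(tickets, "issue_type")
print(velocity)  # feature is about 1.33 points/day, bug is 0.5 points/day
```

Passing a different `key` (for example, an epic or month field, if your export has one) gives the same breakdown by project or time period.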

Here’s a chart I looked at showing points completed per day for a particular person, broken down by ticket point estimate. You can see here that July had much higher velocity and November’s was lower:

Once you’ve identified projects, issue types, or time periods with lower-than-average velocity, the next step is to figure out why the velocity was lower by looking at specific tickets and pull requests.

Lower-than-usual velocity can have many different causes, some of which are outside of the engineer’s control. Here is a list of common causes, but there are many more:

The engineer had to redo work because functional or technical expectations were not clearly defined, or because they failed to meet established expectations.
Tasks were overly large or complex and should have been broken down into smaller units.
The engineer went down a rabbit hole pursuing a bad solution when they should have discussed it sooner with others.
The task was too difficult for the engineer, or they were learning a new area of the software.
The type of work was fundamentally difficult to estimate (e.g., fixing a set of bugs with unknown size, such as when upgrading major dependency versions).
The task was underestimated due to lack of experience.
There was insufficient planning and the engineer encountered problems that could have been anticipated.
The task had scope creep or shifting requirements.
There were interruptions or other unticketed responsibilities that took time away from completing assigned tickets.
Product management failed to provide clear direction about valuable work to do.

However, determining the root cause by looking at raw data can be cumbersome.

Next, we explore metrics that can help with understanding the root cause of velocity fluctuations.

Work in progress (WIP)

Workflow metrics can show you why there are changes in the underlying velocity.

The first one I always look at is work in progress (WIP), a classic metric for lean process efficiency. There are various Jira plug-ins that can calculate this metric for tickets, and minware can do it for both tickets and code branches in the lead/cycle time and workflow report. Here is a chart for one person’s work-in-progress on our team, which has been increasing over the past few years and has had some spikes in recent months:

High work in progress can significantly reduce the pace of work due to the overhead of context switching.
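
If you don't have a tool that calculates WIP, a rough version is straightforward to compute from ticket start and finish dates. The Python sketch below assumes hypothetical `started` and `finished` fields (with `finished` set to `None` for tickets that are still open) and counts a ticket as in progress from its start date up to, but not including, its finish date. Other tools may define the boundaries differently.

```python
from datetime import date, timedelta

# Hypothetical tickets for one person; dates are illustrative.
tickets = [
    {"id": "A", "started": date(2024, 1, 1), "finished": date(2024, 1, 3)},
    {"id": "B", "started": date(2024, 1, 2), "finished": None},  # still open
    {"id": "C", "started": date(2024, 1, 2), "finished": date(2024, 1, 4)},
]

def daily_wip(tickets, start, end):
    """Count tickets in progress on each day from start to end (inclusive)."""
    wip = {}
    day = start
    while day <= end:
        wip[day] = sum(
            1
            for ticket in tickets
            if ticket["started"] <= day
            and (ticket["finished"] is None or ticket["finished"] > day)
        )
        day += timedelta(days=1)
    return wip

wip = daily_wip(tickets, date(2024, 1, 1), date(2024, 1, 4))
print(list(wip.values()))  # [1, 3, 2, 1]
```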

When you see high work-in-progress, you should drill down into the data to see which tickets are blocked/waiting and why the person is unable to make progress on them.

This can happen for many reasons. In the context of a performance review, you should assess whether the engineer could have done anything differently to reduce WIP, like break work into smaller tasks, ask for help or review sooner, better plan to avoid blocking dependencies, etc.

PR lead time for changes by stage (PRLT)

The next metric that can be illuminating is pull request lead time (PRLT) by stage (e.g., from first commit to opening the PR, open to receiving review, review to merge).

Here is a chart that compares median PR lead times (in days) for two engineers on our team:

The median lead times for these two engineers were not that different this past year. However, the lead times by stage were very different. The first engineer spent most of that time waiting for review, while the second engineer spent most of that time after receiving a review.
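
For reference, here is a minimal Python sketch of how lead time by stage could be computed from pull request timestamps. The field names (`first_commit`, `opened`, `first_review`, `merged`) and the data are hypothetical, and the sketch assumes every PR has all four timestamps. Real data would also need to handle PRs that were never reviewed or merged.

```python
from datetime import datetime
from statistics import median

# Hypothetical PR timestamps; all values are illustrative.
prs = [
    {"first_commit": datetime(2024, 1, 1, 9), "opened": datetime(2024, 1, 2, 9),
     "first_review": datetime(2024, 1, 4, 9), "merged": datetime(2024, 1, 4, 21)},
    {"first_commit": datetime(2024, 1, 8, 9), "opened": datetime(2024, 1, 8, 21),
     "first_review": datetime(2024, 1, 9, 21), "merged": datetime(2024, 1, 12, 21)},
    {"first_commit": datetime(2024, 1, 15, 9), "opened": datetime(2024, 1, 17, 9),
     "first_review": datetime(2024, 1, 20, 9), "merged": datetime(2024, 1, 21, 9)},
]

def days_between(earlier, later):
    return (later - earlier).total_seconds() / 86400

def stage_days(pr):
    """Split one PR's lead time into three stages, in days."""
    return {
        "commit_to_open": days_between(pr["first_commit"], pr["opened"]),
        "open_to_review": days_between(pr["opened"], pr["first_review"]),
        "review_to_merge": days_between(pr["first_review"], pr["merged"]),
    }

def median_stage_days(prs):
    """Median duration of each stage across a non-empty list of PRs."""
    stages = [stage_days(pr) for pr in prs]
    return {name: median(stage[name] for stage in stages) for name in stages[0]}

medians = median_stage_days(prs)
print(medians)
# {'commit_to_open': 1.0, 'open_to_review': 2.0, 'review_to_merge': 1.0}
```

Running this separately for each engineer's PRs gives the kind of per-stage comparison described above.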

One underlying issue I discussed with the second engineer was having too much work in progress. In particular, he was opening a pull request for review, starting on the next task, but not coming back to the pull request review until the next task was complete. This was causing pull requests to sit around for longer than necessary.

(On the left side of the top chart, the engineer had just joined the company. Also, in May 2023, he shifted responsibilities from application development to operations, which has smaller task sizes. On the left side of the bottom chart, the engineer was also newer to the company.)

Development Time by PR Status (DTPR)

In addition to PR lead times, active development time by pull request status (a minware-specific metric) is also helpful. It provides deeper visibility than PR lead time because PR lead time doesn’t tell you whether active work is happening or if the task is just sitting around.

These charts show that not only was there a difference in total lead time between these two engineers that could be explained by pull requests sitting around post-review, but also a difference in the amount of active development effort that went into pull requests before and after receiving a review. The 2024 average was around 50% post-review dev time for the second engineer and 30% for the first one:

The underlying cause in this case was that the second engineer’s pull requests received more requests for changes during code reviews due to a difference in task difficulty vs. skill set.

I shared feedback with the second engineer about things he could do to improve code quality prior to opening pull requests so that fewer changes were needed. The feedback called out coding anti-patterns, but the metrics helped show the size of the opportunity to improve (i.e., reducing the share of development time spent post-review from 50% to around 30%, in line with others on the team).

Other pace-of-work metrics

You may have noticed that the metrics shown above don’t cover tasks that took longer than expected without any context switching or rework.

This can happen in theory. In my experience, however, almost all dips in velocity are associated with significant rework or context switching. Tasks that are done correctly the first time in one contiguous interval usually hit their estimates.

Your situation may vary though and you should look for metrics that help you trace back drops in velocity on your team to their root cause.

Evaluating work quality

Looking at PR lead times and PR development times by status may provide insights about quality, but only for issues that are detected before merging code.

To get a holistic view of engineering work quality, you should also look at how code fares over time in production. This section highlights a few helpful metrics that can give you a better view of quality.

Bug Creation Rate

People often pay attention to severe issues like outages, such as with the change failure rate (CFR) metric, which is part of DORA.

If someone regularly ships bugs that cause outages, that is definitely something to address.

However, a majority of bugs are usually low or medium priority. Someone who creates a lot of them still needs guidance about getting better because those bugs can have a significant impact in aggregate.

The metric I like to look at is new bugs (of any priority level) per active work day of development (NBD), which you can see in the CFR/Bug Creation minware report.

For this metric to be relevant to an individual, you also need a way of attributing bugs to the person who created them. If the team has a practice of assigning bugs to that person, then you can group by assignee.

However, you may need to manually assess a sample of bugs to mark which people or code areas generated them.
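
Once bugs have been attributed, the rate itself is a simple ratio. The Python sketch below assumes you already have a list naming the person each bug was attributed to and a count of active development days per person. Both inputs are hypothetical, and how "active work day" is measured is left to whatever system you use.

```python
from collections import Counter

# Hypothetical inputs: one entry per attributed bug, and active days per person.
bug_attributions = ["alice", "bob", "bob", "bob"]
active_days = {"alice": 40, "bob": 30, "carol": 20}

def new_bugs_per_active_day(bug_attributions, active_days):
    """Return attributed bugs divided by active development days, per person."""
    counts = Counter(bug_attributions)
    return {
        person: counts[person] / days
        for person, days in active_days.items()
        if days > 0  # skip people with no recorded activity
    }

rates = new_bugs_per_active_day(bug_attributions, active_days)
print(rates)  # {'alice': 0.025, 'bob': 0.1, 'carol': 0.0}
```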

When you’re looking at the bug creation rate, it’s also important to consider that someone who works in repositories with fewer tests and lower code quality will create more bugs through no fault of their own.

One thing that can help is to go through specific bugs and determine what would have been necessary to catch each one before launch.

If you do this enough, patterns will emerge that help you identify specific areas of improvement for people whose modifications cause bugs, as well as for teams or code owners to make their code easier to modify.

Non-bug load

Sometimes, velocity will look great, but the team mysteriously struggles to get anything done.

This can occur when people cut corners on quality to “finish” their tickets, knowing that they will get credit for completing more story points when bugs pop up in the future.

If you think this might be happening, then you should look at how much time the team is spending overall on bugs and whether it is at a healthy level. This is a screenshot of our numbers for minware from the CFR/Bug Creation report:

Our numbers have been fairly consistent until this month, as we’ve recently fixed a lot of bugs that were discovered by improvements to our testing infrastructure.

This metric shows you the impact of quality problems on velocity, which helps you decide how much to emphasize quality improvements as part of the performance review process.

Bug fix vs. find rate (BFFR) and average bug backlog size (ABBS)

Finally, a trap that teams sometimes fall into is sweeping bugs under the rug, so to speak. If bugs don’t seem like a problem for you, you should double-check that they are actually being discovered, filed, and fixed.

First, you should verify that customers are using the software and that there are good mechanisms to detect and report bugs. If the customer service team doesn’t have access to your bug tracking system, for example, you could have a lot of issues that you’re not seeing.

Assuming bugs are getting filed, you should look at your fix vs. find rate and average bug backlog size (also available in the CFR/Bug Creation report).

If the numbers are significant, you should assess which bugs aren’t getting fixed, and whether they are being created by a different person or part of the code than the ones that are being fixed. (This may be difficult to do if the bug isn’t fixed yet though, which is why I strongly recommend fixing all your bugs.)
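For readers who want to compute these two numbers from raw ticket data, here is a minimal sketch. It assumes the fix vs. find rate is bugs fixed divided by bugs found in a period, and that backlog size is the number of open bugs sampled once per day. Those definitions, the `created`/`resolved` field names, and the sample dates are assumptions for illustration, and the report may define the metrics differently.

```python
from datetime import date, timedelta


def fix_vs_find_rate(bugs, start, end):
    """Bugs fixed divided by bugs found within [start, end] (None if none found)."""
    found = sum(1 for b in bugs if start <= b["created"] <= end)
    fixed = sum(
        1 for b in bugs
        if b["resolved"] is not None and start <= b["resolved"] <= end
    )
    return fixed / found if found else None


def average_backlog_size(bugs, start, end):
    """Mean number of open bugs, sampled once per day in [start, end]."""
    days = (end - start).days + 1
    total_open = 0
    for offset in range(days):
        day = start + timedelta(days=offset)
        # A bug counts as open on a day if it was created on or before that
        # day and has not been resolved by the end of it.
        total_open += sum(
            1 for b in bugs
            if b["created"] <= day and (b["resolved"] is None or b["resolved"] > day)
        )
    return total_open / days


# Made-up sample data for illustration only.
bugs = [
    {"created": date(2024, 1, 1), "resolved": date(2024, 1, 3)},
    {"created": date(2024, 1, 2), "resolved": None},
    {"created": date(2024, 1, 4), "resolved": date(2024, 1, 5)},
]
start, end = date(2024, 1, 1), date(2024, 1, 5)

print(f"Fix vs. find rate: {fix_vs_find_rate(bugs, start, end):.2f}")
print(f"Average backlog size: {average_backlog_size(bugs, start, end):.1f}")
```

A fix vs. find rate that stays below 1 means the backlog is growing, which is the situation the average backlog size makes visible over time.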

Failing to fix bugs is probably more of an issue for the product manager than for engineers, but you should be extra diligent in your assessment of work quality if the team doesn’t even know the impact of bugs because they’re deferring them until later.

Caveat on quality improvements

One important caveat with judging the quality of work done by an engineer in a performance review is that you need to account for management pressure to cut corners on quality to get projects out the door.

If you want people to improve quality, then you have to let them either cut scope or take more time. If people are told by management to hold scope and time constant, then it’s not their fault when quality suffers.

Similarly, if you ask someone to improve quality, you should specify whether to cut scope or to extend timelines to make it happen, because they will have to do one of the two.

Conclusion

An engineer’s job is extremely complex, and using data in performance assessments is a deep topic. We’ve covered several metrics here that can help build a more objective and complete understanding of engineering performance, but this really just scratches the surface.

The most important thing to take away is the role that data should play in performance reviews.

Quantitative metrics should assist and enrich the manager’s understanding of performance in conjunction with expert analysis and other data sources like qualitative feedback. They should not replace or supersede the manager’s judgement.

Engineering managers who use data wisely will write better performance reviews that guide employees toward the highest-impact growth opportunities and ultimately accomplish more with the members of their team.

Startup CEO’s Productivity Hack: Extreme Single-Tasking

Kevin Borders — Mon, 02 Dec 2024 23:05:59 GMT

Being a startup CEO is one of the most multi-faceted jobs.

Unlike big-company CEOs, early startup CEOs have to manage people in every role and personally fill in all the skill-set gaps until an executive team is in place.

A startup CEO may have to speak with a customer, interview a job candidate, close the monthly books, and review a new product feature specification – things that would normally span a variety of roles – all in one day.

The CEO’s personal productivity is critically important for startups with limited resources, yet it’s one of the hardest jobs to do productively. You have to juggle a wide variety of responsibilities, and context switching between them carries significant overhead.

After doing this job for over a decade, the single greatest productivity hack I have found is this: only work on one thing at a time, and take it to an insane level. I call this extreme single-tasking.

With regular single-tasking methods like the Pomodoro Technique (working in uninterrupted 25-minute bursts), you block out distractions to work in focused chunks.

While helpful, these methods don’t address blockers or interruptions at the organization level, which are typically beyond an individual’s control.

Extreme single-tasking involves structuring the entire organization around single-task work as the highest priority so that the CEO and everyone else can be as productive as possible.

Very few companies prioritize workflow efficiency to this degree. It takes a lot of effort and that effort doesn’t directly yield revenue.

However, long-term commitment to extreme single-tasking has enabled me and my colleagues to outpace competitors several times our size.

This article looks at the most common impediments to single-task work and shares tips for adopting extreme single-tasking in your organization.

Build a single-tasking culture

The cornerstone of successful single-tasking is building a culture that values it.

Some leaders make the mistake of optimizing for their own workflow while disregarding the impact on others.

If you reach out to people directly with questions or small tasks and expect an immediate response, you might get your current task done faster, but it will have a ripple effect on other work that ultimately leads to a net decrease in single-tasking across the organization.

This also creates a culture where each level of management amplifies inefficiencies beneath them to make their own lives easier. Once you get down to the individual level, the work environment becomes toxic.

At the end of the day, this approach circles back to leaders and ultimately prevents them from efficiently accomplishing their goals.

Another anti-pattern occurs when leaders invest in process and system improvements to support their own single-task workflow, but don’t empower others with time and budget to do the same.

To create an organization that prioritizes single-tasking, leaders need to send the message that everyone should make it a priority, and that they should strive to reduce context switching across the whole organization.

This means that everyone must respect each other’s workflow and not interrupt or block other people just to help themselves, starting from the top.

Leaders also need to make everyone responsible for productivity optimization as a main part of their job. Managers must give people the resources they need and hold them accountable during one-on-ones and performance evaluations.

A healthy culture is a necessary precursor for all of the tactics that follow.

Reduce task sizes

The number one way to work on one task at a time is to make each task smaller. The longer and more complex a task, the greater the likelihood is that something else comes up requiring your attention before it’s finished.

This is a deep topic and there are multiple levels of tasks from individual work items to higher-level deliverables and projects.

For more in-depth guidance on reducing task sizes, see What Every CEO Should Know About Software Planning (which is relevant for non-development work as well).

Leaders should continuously strive to reduce task sizes for themselves and the larger organization. They can do this by supporting improvements that reduce fixed costs for tasks and by incentivizing everyone to create plans that minimize task sizes.

Personally, whenever I have the instinct to bundle tasks together that could be separate, I stop and ask why. Is it because I’d spend time waiting on a slow build process or review? Whatever the cause, fixing it becomes a top priority.

Cut engineering out of the process

In a software company, tasks that require code changes usually take an order of magnitude longer than those that only require configuration updates.

Chris Espinosa famously built an application that let Steve Jobs design the first Apple calculator app without having to go back and forth with engineering for each change.

Investing in domain-specific languages, content management systems, and admin interfaces that enable new functionality without deploying code is one of the highest-impact decisions that leaders can make. It greatly increases the set of tasks that people can complete in one sitting without having to wait on engineering.

At minware, since we created the minQL formula language, I can now build reports myself in a few hours that used to take the whole engineering team a week. This has been incredibly helpful for handling customer requests the same day rather than scheduling them as part of the regular development process.

For more in-depth advice on how to cut engineering out of the process, see Everyone Needs Their Own Programming Language.

Establish SLAs for all time-sensitive work

When people hear about service level agreements (SLAs), they usually think of response times for urgent customer tickets. However, SLAs are important for every task that has a time expectation, including Slack messages and emails.

Without an explicit SLA in place, people don’t know how quickly to handle urgent requests. They may respond faster or slower than they should, either needlessly interrupting their own work or delaying someone else.

SLAs empower everyone to minimize the impact of urgent issues on single-tasking.

To establish SLAs for day-to-day communication, every organization should create a working agreement that lays out how quickly people should expect a response in each channel. This lets senders choose how to communicate based on when they need an answer and minimize unnecessary interruptions.

Larger urgent tasks that go beyond a simple email response should all go into a ticketing system like Jira, and they should have an explicit priority field that is tied to an SLA.

Want To Ship Features Faster? Fix All Your Bugs goes into more detail about how to implement a ticket SLA and shares the prioritization scheme that we use at minware.

Essentially, we have priority levels for “stop everything and do this now,” “start on this by the end of the day but otherwise don’t interrupt your current work,” and “do this by the end of the next week.” These priorities come with SLAs of 1, 3, and 14 days.

This makes it easy for me as CEO to set a priority label and know when something will wrap while minimizing the impact on other people.
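To make the idea of tying a priority field to an SLA concrete, here is a minimal sketch. The priority names are hypothetical labels invented for this example, and counting the 1, 3, and 14 days as calendar days from ticket creation is an assumption. A real ticketing system would typically handle this through its own SLA configuration.

```python
from datetime import datetime, timedelta

# Hypothetical priority names; the SLA lengths of 1, 3, and 14 days come from the text.
SLA_DAYS = {
    "stop-everything": 1,
    "start-today": 3,
    "next-week": 14,
}


def sla_deadline(priority, created_at):
    """Return the time by which a ticket with this priority should be done.

    Assumes calendar days counted from ticket creation.
    """
    return created_at + timedelta(days=SLA_DAYS[priority])


def is_breached(priority, created_at, now):
    """True if a still-open ticket is past its SLA deadline."""
    return now > sla_deadline(priority, created_at)


created = datetime(2024, 12, 2, 9, 0)
print(sla_deadline("start-today", created))  # three calendar days after creation
print(is_breached("stop-everything", created, datetime(2024, 12, 3, 10, 0)))
```

Keeping the mapping in one place means the person setting the priority and the person doing the work read the same deadline from the same label.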

Don’t tolerate bad planning

The type of interruption I find most frustrating is when someone makes an urgent request, but knew about the requirement weeks earlier.

Strong leadership is important in this scenario. You don’t want the person being interrupted to feel the urge to say “no” just to avoid creating a moral hazard and protect their time. The interruptee should feel confident that the bad planner will be held accountable and discouraged from repeating the problem in the future, even if the right decision for the business is to complete the urgent task today.

Leaders should also keep a close eye on how work progresses on projects that don’t involve external dependencies. Whenever there is a lapse of activity on a project, it’s important to look at why that lapse occurred and whether the dependency could have been identified with better up-front planning.

Everyone should know that single-tasking is a priority at the level of larger projects, not just individual tickets or deliverables.

Kill all non-essential blocking workflow steps

As organizations grow, the number of blocking steps in workflows will naturally expand without deliberate effort to keep them under control.

These steps are usually related to quality control. They are often put in place to make sure that work doesn’t proceed if there is a serious problem. They may include things like specification approval, code review, QA review, automated build processes, product manager validation, etc.

While well-intended and sometimes necessary, these blocking workflow steps create interruptions that prevent people from single-tasking.

The first way to mitigate blocking steps is to provide an escape hatch so that someone can skip the step if they judge it to be unnecessary.

We have done this with all code reviews at minware. If the author believes that a change is sufficiently low-risk, they can tag a pull request as “low risk” and merge it before receiving a review. Code review is an important step for many changes but making it block trivial things like button color or text updates isn’t worth the workflow impact.

Another thing you should consider is whether it’s absolutely essential for blocking steps to be blocking, or whether they can happen after the fact.

Going back to minware’s code review process, we still require code reviews for low-risk changes, but they can happen after merging the change. There have been a few cases where bugs that went into production were caught in post-merge review. However, this minor cost has been well worth the efficiency gain.

For each of your blocking workflow steps, you should seriously assess whether an escaped problem will truly cause irreversible damage. If you can back out of it with minimal impact (e.g., pushing a hotfix or canceling work on a project after a few days), then you will likely benefit from making it non-blocking.

Even if problems don’t cause irreversible damage, people sometimes impose blocking steps when the rate of escaped problems is too high. This is often the reason for making development tasks go through QA approval.

In this scenario, you should strive to eliminate the blocking steps through better automated mistake-proofing. Test automation is one way to do this for development tasks, but there are a lot of other tools that will automatically catch mistakes in various types of work (e.g., spell check even counts).

Checklists and templates can also help people catch enough of their own mistakes that it’s no longer necessary to subject their work to a blocking review step.

At my previous company, listing all of the common gotchas in our technical plan template (e.g., performance, security, etc.) allowed us to eliminate the requirement for engineering leadership approval before starting work.

Finally, it’s important for managers to trust employees and address repeated problems with individuals. Too often, blocking workflow steps exist that slow everyone down because of a few people, when the better solution is to help those under-performers operate better within a flexible process or move them out of their role.

Eliminate missing information

One common source of delay that can lead to context switching is waiting on a critical piece of information.

There are a few possible reasons for this, but it generally means that information is either stuck in someone’s head or not easy to find.

The default inclination people have when encountering missing information is to reach out and ask someone for an answer. As a CEO, it might be tempting to just require everyone to be available to answer your questions, even if that means messaging people when they’re off work.

To complete the immediate task, you might need to do this. However, doing it repeatedly destroys everyone else’s productivity and creates a toxic always-working culture.

Instead, whenever a leader encounters missing information, they should ask: why wasn’t it documented in the first place?

Leaders can permanently mitigate this type of delay by establishing processes that ensure every important artifact is documented at the time of creation. Common ways of doing this include recording and sharing meeting minutes, writing down plans in shared documents that include comments, tracking every piece of work in a ticketing system like Jira, and making sure artifacts are linked to a project or sprint plan.

Furthermore, you should foster a culture where information is made public by default and fight against private Slack channels and documents unless absolutely necessary.

At minware, we make everything public within the company by default in Jira, Slack, and Google Docs. There is only one private folder for sensitive HR/legal documents, another location for sensitive customer data, and a password manager for sensitive credential access.

Finally, it’s important to prioritize improvements to information sharing based on their long-term impact. One short delay may be small, but missing information can be a substantial hidden burden over time.

Optimize blocking systems

Another common source of delay that can lead to context switching is waiting on a system to do something on the critical path to completing a task.

For example, you may need to wait on software to deploy before letting a customer know that a bug has been fixed.

People tend to have an intuitive sense that waiting on systems is bad, but often underestimate the impact because switching to work on something else may make the time spent waiting not feel like a big loss.

It’s important for leaders to recognize this tendency to normalize wait time and aggressively optimize slow systems.

Whenever I wait on a system, I treat the full wait time multiplied by my hourly rate as the cost, even if I do something else in the interim. This may seem like it would overprioritize system optimizations, but it could be an underestimate if you account for the impact of context switching. In any case, it’s a simple heuristic that ensures sufficient investment in optimizing systems on critical workflow paths.
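The heuristic is simple enough to write down directly. Here is a minimal sketch, where the `occurrences` parameter (for adding up repeated waits) and all of the numbers are assumptions made up for illustration:

```python
def wait_cost(wait_minutes, hourly_rate, occurrences=1):
    """Cost of waiting on a system: full wait time multiplied by hourly rate.

    The full wait is counted even if other work happens in the interim.
    """
    return wait_minutes / 60 * hourly_rate * occurrences


# Hypothetical numbers: a 15-minute deploy, 4 times a week, at $150/hour.
weekly = wait_cost(wait_minutes=15, hourly_rate=150, occurrences=4)
print(f"Weekly cost of waiting: ${weekly:.2f}")
```

Comparing this recurring cost against a one-time estimate for speeding up the system gives a quick way to decide whether the optimization is worth prioritizing.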

Conclusion

This article covered the most common organizational impediments to single-tasking and discussed how to eliminate them.

The most important thing on your journey to extreme single-tasking, however, is adopting the right mindset.

Having coached many people on workflow efficiency, I have found that the biggest hurdle is complacency. If you’re too focused on staying busy, you’ll never develop an intrinsic intolerance for multitasking.

It’s that intolerance that is essential for leaders to instill throughout their organization to drive change and fully realize the potential of single-task work.

To Sprint, or Not To Sprint

Kevin Borders — Fri, 08 Nov 2024 21:41:37 GMT

If there are two universal truths of process maturity, they are that new companies never start out with sprints, and that almost everyone ends up there eventually.

A corollary is that each team must decide if and when to start doing sprints.

Spoiler alert: like leaving a bad job, almost everyone waits too long.

I’ve worked with multiple teams in just the past week who waited too long to adopt sprints and needlessly suffered.

As with most issues, people hate change and tend to avoid it until things get so bad that there’s no other option.

With software planning, “so bad” looks like:

Completely unreliable project estimates, often taking 2-5x as long as originally planned.
Stakeholders have no idea when specific tasks will be done, with small tickets often taking weeks.
So, stakeholders start following up daily with engineers when something is important.
Engineers end up having to juggle several tasks and act as mini project managers, deciding which stakeholder request is most important.

At the same time, newly formed close-knit teams usually do just fine without sprints, and some mature teams operate successfully without them too.

How do you know if you need sprints, and when the right time is to adopt different sprint practices to avoid things getting bad? This article provides the answers.

What is a sprint process anyway?

Before getting into the signs that you might want a sprint process, it’s important to define what that is.

If you do things strictly by the book of scrum and agile, this means a lot of things, including:

Fixed-length planning increments (e.g., 2 weeks)
Daily stand-ups
Sprint planning / backlog refinement meeting
Sprint review meeting
Sprint retrospective meeting
Story point estimates
Sprint goals
Avoid adding new work to the sprint after it starts
Clearly defined scrum master and product owner roles

If you actually do all of these things, and you have a small, experienced team that’s newly formed, it is almost certain to slow you down.

Beware of the bloated-process straw man

Developers who don’t want processes often cite the overhead of all the stuff listed above as a reason for not making any changes to the way they work.

News flash: sprint processes aren’t all or nothing. It’s not like the Agile Police are going to show up and arrest you for starting bi-weekly planning meetings without doing daily stand-ups.

You should view “sprints” as a distinct thing from “scrum”. Scrum is a specific set of processes for implementing sprints, but you can implement sprints without implementing all of scrum.

A definition of “sprint”

At its core, a sprint is a short, fixed-length planning iteration. This is different from other planning processes where the iterations are long or have a variable end date.

The key benefit of a short, fixed-length planning iteration is that all the tasks acquire an implicit completion estimate of the iteration’s end date, which is some time soon.

That’s it.

That one benefit is extremely powerful because it lets everyone outside the team know when they can expect tasks to wrap up and sets ground rules about urgent requests mid-iteration.

It also improves estimation for large, multi-iteration projects because you can look at the team’s track record for completing planned work (i.e., velocity not counting scope additions).

How small, new teams survive without sprints

When you’re first starting a new project or organization, things will probably go fine without sprints. A lot of teams get by with a simple shared to-do list in a place like Trello.

There are a few reasons for this, and understanding them will help you spot the early signs that it is time to start doing sprints.

There aren’t any users yet

Most interrupting work is the result of people using the product having some sort of unmet need or experiencing a problem with the product. If a product is pre-launch, you aren’t going to have customer-driven interruptions, so you don’t need a process to triage, manage, and measure their impact on velocity.

There aren’t any bugs or tech debt yet

While estimating a greenfield project comes with its own challenges, it’s much easier than a mature product, which is more like a minefield.

A process for recording estimates, analyzing their accuracy, and prioritizing improvements to drive future predictability is a lot more important once software has grown older and more unwieldy.

All the stakeholders sit in the same room

One big benefit of sprints is that they serve as a central written agreement about the priority of work. This is important for larger organizations where, for example, developers might not understand why the sales team needs something done at a certain time, and the CEO might not understand the impact of an urgent request on other priorities.

At smaller organizations where everyone is sitting in the same room, everyone shares roughly the same knowledge and it’s easier to form a consensus about what is important.

As things change, developers can easily switch priorities on their to-do list and still be working on the most important thing for the company, while everyone else in the room can see what’s happening and adjust their expectations.

When should you start following sprint processes?

Before getting into the details of when to start sprint processes, it’s important to understand that if you make a mistake and add a process too early, the impact is probably a small percentage of wasted effort and minor annoyance. If you introduce processes too late, the cost of chaos can be quite substantial.

When in doubt, a good approach is to fall back on how a process will affect morale. If stakeholders are reasonably happy and the team really doesn’t want a process, then it’s probably fine to skip. If either the team or their stakeholders are demoralized with the current state of affairs, then it’s time to take action.

When adopting sprint processes, keep in mind that there are several different practices to consider, each of which may be appropriate at different times.

In the rest of this section, we look at each component of sprint processes, roughly in order of when they become beneficial.

1. Filing tickets for work

Whether you are using physical post-it notes, bullet points in a document, Trello cards, or full-blown Jira tickets, this first practice looks at whether you keep a written list of tasks, whatever their form may be.

As you might guess, I recommend doing this from day one. If you don’t, you’ll have a few problems, even for a brand new project with a team of one person.

First, you’ll forget to do important things like testing key functionality or fixing bugs. Second, without writing a detailed list of tasks to complete a project, your estimates are going to be really bad.

You might be able to get by without tickets for a small school project, but any professional software developer should be working off of a to-do list in some form.

2. Estimating tasks

One of the biggest surprises for me working with customers at minware is how many engineers at decent-sized companies (20+ developers) don’t regularly estimate their tickets.

This is another practice that is usually beneficial from day one.

Assigning an estimate to each task forces you to think more carefully about the scope of work and whether the task is well-defined. I have done story point estimates even when working alone, and they have been very helpful for me.

Whenever I give a task a large estimate, I’m instantly reminded of the previous tasks that I have given large estimates and how those have turned out (not well). And if I give something a small estimate and it turns out to take a lot of time, I’ll be much less likely to underestimate similar tasks in the future.

By writing down a time commitment, estimates force developers to be self-critical and double check their plans, which helps avoid significant problems down the line.

Additionally, writing down estimates for individual tasks has the aggregate benefit of exposing project scope issues early and provides a crucial data point for learning how to estimate better.
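One way to turn recorded estimates into that kind of data point is to compare them with how tasks actually turned out, grouped by estimate size. Here is a minimal sketch. It assumes estimates and actuals are recorded in the same unit (for example, days), which sidesteps the question of how story points map to time, and the history is made-up sample data.

```python
from collections import defaultdict


def overrun_by_estimate(tasks):
    """Average actual/estimate ratio, grouped by estimate size.

    tasks: list of (estimate, actual) pairs in the same unit.
    A ratio above 1.0 means tasks of that size tend to run over.
    """
    ratios = defaultdict(list)
    for estimate, actual in tasks:
        ratios[estimate].append(actual / estimate)
    return {est: sum(r) / len(r) for est, r in sorted(ratios.items())}


# Made-up history of (estimate, actual) pairs for illustration only.
history = [(1, 1), (1, 2), (3, 3), (3, 6), (8, 24), (8, 16)]

for estimate, ratio in overrun_by_estimate(history).items():
    print(f"Estimate {estimate}: actual is {ratio:.1f}x the estimate on average")
```

In this invented history the largest estimates overrun the most, which is the same pattern described above for tasks given large estimates.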

A common anti-pattern that occurs without estimates is that developers work on marginal features at the beginning of a project. When the deadline approaches and the project is far from done, they then have to scramble to get the project out the door or push the deadline. If there were good estimates up front, then teams could have cut scope early and avoided this late-project strife.

This issue affects greenfield projects and small teams just as much as big ones, so the best time to start estimation is right away.

When estimates aren’t as important

Estimates are important if you’re building software in the traditional sense. However, there are other scenarios where estimates aren’t as critical.

The main distinction is whether teams are performing a large number of repetitive tasks (e.g., handling alerts or provisioning resources on some DevOps or IT teams), or whether each task reflects unique work that has never been done before.

In the repetitive task scenario, estimates aren’t as important because everyone knows about how long tasks will take based on previous experience.

3. Iteration retrospectives

Retrospectives are the most powerful practice for any type of engineering. They are a key part of Kaizen, as made famous in Toyota’s production system of lean manufacturing.

Regular retrospectives are probably the most important process in software engineering, and something you should strive to do from the start, even if you are working by yourself.

If you don’t pause and take time to reflect on how things could go better, it’s easy to normalize problems and miss opportunities to improve.

4. Short (1-4 week) planning iterations

One of the strongest indicators of development productivity is work batch size, which I wrote about in What Every CEO Should Know About Software Planning.

Even if you are working on a pre-release product that won’t ship for months (like a game), shorter planning iterations are valuable from the start.

Short iterations ensure that you are delivering value in a short time frame. This gives you the flexibility to adjust plans and cut scope in future iterations.

If instead you work on everything in parallel in one big batch and run up against timeline pressure, you’re stuck throwing out work or pushing out the deadline – one of the classic downfalls of waterfall-based development.

Note here that we’re not necessarily talking about fixed-length iterations like in a full sprint process, but just about keeping iterations short as a first step.

5. Fixed-length sprints

The next step up in process rigor is fixing the length of planning iterations, which turns them into sprints.

This is the first practice that may not be beneficial with a small team working on a pre-release product.

One downside of sprints is that they may not fit the natural cadence of work, so you could end up starting a sprint when a milestone is almost finished and have to plan out work for the next milestone when you don’t know, for example, how many post-launch bugs you’ll have to fix.

On the other hand, fixed-length iterations make development easier in a number of ways.

Having a predictable schedule is helpful for the team because they know what to expect and can get good at estimating what they’ll be able to get done in one sprint.

A fixed end date can also be your friend when you’re having a bad iteration because it forces you to stop and look back on how things are going, which lets you course-correct earlier when things are going off the rails.

I recommend starting sprints once a product is live or there are customer, marketing, and sales stakeholders. You may also start them earlier if you find the planning cadence helpful. Personally, I used them from day one working alone, but I’m also a process nerd.

Sprints help everyone get on the same page about exactly when things will be done so that people outside the team can make reliable plans.

They also help with balancing bug fixes, maintenance, and stakeholder requests against new feature development.

A common complaint that small, nimble teams have about adopting sprints is that they’re too long. Two weeks may be an eternity to wait for a fast-moving team, and a lot can happen that affects priorities in that time frame.

If this is a concern, I strongly recommend one-week sprints. We’ve been doing them at minware on our small team, and they pretty much mitigate this concern.

Also, if new information arises, you can always change what’s in a sprint. If you do this all the time then it defeats the purpose of a sprint. However, sprints can still add value if you regularly substitute 25% of the work because you gain predictability for the rest.

6. Planned velocity metrics

It is helpful to look at net velocity metrics when adopting retrospectives, but this section looks at measuring the velocity of planned work (that is, non-bug tickets that were in the sprint when it started).

The benefit of measuring planned velocity is that it lets you see how much capacity is available for new value-adding functionality.

This is something you should start looking at when the amount of time spent fixing bugs becomes non-trivial (>10%).

Without this metric, it’s easy for quality problems and technical debt to get out of hand and start significantly slowing down development. If you’re only looking at net velocity, this problem is less apparent because the team may still be completing the same amount of work.

As you adopt longer time-horizon roadmap planning, a lack of visibility into planned velocity can cause significant misses.
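To show the difference between the two numbers, here is a minimal sketch that computes net and planned velocity for one sprint. It uses the definition given earlier in this section (completed non-bug tickets that were in the sprint when it started). The ticket field names and sample data are hypothetical, not the schema of any particular tool.

```python
def velocities(tickets):
    """Return (net_velocity, planned_velocity) in story points for one sprint.

    Each ticket is a dict with hypothetical keys:
      points          - story point estimate
      done            - True if completed during the sprint
      is_bug          - True for bug tickets
      in_sprint_start - True if the ticket was in the sprint when it started
    """
    done = [t for t in tickets if t["done"]]
    net = sum(t["points"] for t in done)
    planned = sum(
        t["points"] for t in done
        if t["in_sprint_start"] and not t["is_bug"]
    )
    return net, planned


# Made-up sprint for illustration only.
sprint = [
    {"points": 5, "done": True, "is_bug": False, "in_sprint_start": True},
    {"points": 3, "done": True, "is_bug": False, "in_sprint_start": True},
    {"points": 2, "done": True, "is_bug": True, "in_sprint_start": True},
    {"points": 3, "done": True, "is_bug": True, "in_sprint_start": False},
    {"points": 2, "done": True, "is_bug": False, "in_sprint_start": False},
    {"points": 5, "done": False, "is_bug": False, "in_sprint_start": True},
]

net, planned = velocities(sprint)
print(f"Net velocity: {net} points, planned velocity: {planned} points")
```

In this example the team completes 15 points overall but only 8 points of planned, non-bug work, which is the gap that net velocity alone would hide.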

7. Stand-ups

Stand-up meetings are a key part of by-the-book scrum processes. The common recommendation is to have them daily.

On this front, I have found it reasonable to be flexible and adjust stand-up calls within a sprint based on the needs of the team.

The primary benefit of a stand-up call is that it helps the team leader identify when a developer is stuck and they don’t know it. If they do know it, then they can just reach out and ask for help.

However, if someone is going down a bad path and doesn’t know to ask for help, a daily stand-up ensures that it stops after at most one day.

If your team only has experienced engineers, you don’t need stand-ups as frequently.

If your team is remote, and if they are working in different time zones, stand-ups become more important.

My recommendation for stand-ups is to start a few times a week, then adjust based on the frequency of wasted effort due to lack of communication. I have been on teams where I switched from weekly to daily stand-ups, and the need was pretty self-evident due to junior developers who regularly struggled for too long.

8. Advanced sprint metrics

In a previous article – Advanced Sprint Metrics Are for Everyone – I went into detail about several more advanced metrics that can improve sprint performance, such as under-the-radar work, which measures the amount of time spent on tasks outside of the sprint.

Like with the steps here, some advanced sprint metrics are more important at earlier stages than others, and some can be helpful at the start.

My general recommendation is that more advanced sprint metrics are very important once you oversee multiple teams and can’t be in every sprint meeting. They help teams better understand workflow efficiency and self-manage when those with the most knowledge become less involved in day-to-day work.

When do mature teams not need sprints?

Working with minware customers, I sometimes encounter mature teams at larger companies using a kanban process instead of sprints, and they are doing just fine!

More often than not though, teams would do better with sprints.

How do you know which scenario applies to you?

My rough rule of thumb is that only teams doing <25% roadmap work can be efficient without sprints.

This usually means teams that handle repetitive operational tasks and aren’t doing significant new feature development, such as some DevOps teams.

These teams can survive without sprints because they don’t do much planned work. Interruptions aren’t a factor because nearly all of their tasks are on-demand and, in a sense, interrupting.

Planned velocity doesn’t matter because they don’t have many roadmap commitments.

Be careful, though: even DevOps teams often have important long-term projects. In this case, it might be better to use underfilled (e.g., 25%) sprints where you add on-demand tasks to the sprint and handle them kanban style to make sure projects aren’t starved for resources.

Conclusion

The question of whether to use sprints might at first seem like a simple binary decision, but it is actually deep and multi-faceted.

With the guidance here, you can better plan ahead and roll out sprint processes progressively as they benefit your team, rather than waiting until things get bad and attempting a big-bang agile transformation.

As always, my view of the world is biased by my experience, so please comment or reach out and share your story – I’d love to hear it!

Yes, You Can Measure Technical Debt

Kevin Borders — Fri, 01 Nov 2024 20:27:36 GMT

Managing technical debt is perhaps the CTO’s most important responsibility.

Fix too much and end up like Netscape.

Fix too little and end up in a tech-debt death spiral.

Disagreements about whether to fix tech debt are also a common point of conflict between engineers and non-technical leaders.

Engineers feel the pain firsthand, with an innate sense of how much easier their lives would be given a clean, well-structured code base and 100% test coverage.

But is it worth it?

This is the million-dollar question.

The default response from CEOs is often: “If you can’t prove the value, you can’t put it on the roadmap.”

A recent VP of engineering candidate told me about how his CEO repeatedly denied resources for a critical refactoring project. What finally got him to change his mind? A projected cost savings analysis denominated in dollars.

And that is not unreasonable.

Engineers are paid in dollars and customers pay dollars. It’s hard to rationalize doing something with an uncertain benefit when the alternative is shipping valuable new functionality that increases revenue. Also, if a CTO can’t estimate the monetary benefit of refactoring, how can the CEO be expected to do it?

The problem is that measuring the impact of tech debt is difficult. It’s not like you have two software versions side-by-side where you can make the same changes to each and see how much harder it is with legacy code.

Even worse, tech debt compounds. As legacy software gains dependencies, it becomes harder and harder to fix.

The real tragedy occurs when engineers are right but fail to convince leadership that fixing tech debt is important, leading to a death spiral where everyone loses.

The key to staying on top of tech debt is measuring its cost and fixing the important pieces before they get out of hand.

This article first looks at measurement strategy, covers different approaches for manually measuring tech debt, and then shows how to automate the process.

Tech debt measurement goals and strategy

Before getting into the details of technical debt measurement, it’s important to understand the goals and strategy.

The reason for measuring technical debt is to calculate the value of fixing it as compared to adding new functionality, and ultimately decide which projects have the highest return on investment.

The primary benefit of fixing tech debt is usually time savings, though there may be other benefits like improved security, performance, quality, or morale.

These savings are also realized in the future, so they are subject to uncertainty depending on how much the software will be modified.

For the purposes of prioritization, the measurements will be divided by the estimated implementation cost, further amplifying uncertainty.

There are other factors too, like how not fixing certain tech debt may increase the cost of fixing it later, and allowing tech debt to grow too much can cause talented people to leave or cut off strategic business options.

In the end, we will use tech debt measurements in a context where there are a lot of unknowns. We therefore can’t expect a high level of precision.

The goal should be a ballpark estimate within 2-3x of the true cost to at least make sure that we don’t ignore any major ticking time bombs. You can then put your finger on the scale a bit during roadmap planning if there are other significant factors like compounding dependencies.

As an approximation, the current portion of time spent on tech debt (extrapolated over a time horizon like 2 years) is a reasonable estimate for the benefit of fixing tech debt for the purposes of roadmap planning.
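As a back-of-the-envelope sketch of this approximation, the functions below extrapolate the current share of time lost to tech debt over a planning horizon, divide by an estimated implementation cost, and widen the result into a ballpark range. The function names and the example inputs are illustrative assumptions, not figures from any real team.

```python
def tech_debt_benefit_days(team_dev_days_per_year, debt_time_fraction, horizon_years=2.0):
    """Estimated dev days saved by fixing the debt: current share of time lost
    to it, extrapolated over the planning horizon."""
    return team_dev_days_per_year * debt_time_fraction * horizon_years


def return_on_investment(benefit_days, implementation_cost_days):
    """Benefit divided by estimated cost to fix, for ranking against other projects."""
    if implementation_cost_days <= 0:
        raise ValueError("implementation cost must be positive")
    return benefit_days / implementation_cost_days


def ballpark_range(estimate, factor=3.0):
    """Treat the estimate as accurate only to within a multiplicative factor."""
    return estimate / factor, estimate * factor


# Hypothetical inputs: 1,000 dev days per year, 7% of time lost, 40 days to fix.
benefit = tech_debt_benefit_days(1000, 0.07)
roi = return_on_investment(benefit, 40)
low, high = ballpark_range(roi)
```

With these placeholder inputs the benefit is roughly 140 dev days against a 40-day fix. Because the estimate is only good to within 2-3x, the resulting ratio is best used to rank projects rather than quoted as a precise number.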

How can you measure time spent on tech debt?

This is where it gets tricky. Unless you’ve already fixed the tech debt and are also maintaining a legacy version, you aren’t going to have ground truth about its cost.

Ask engineers

One approach is to rely on expert opinion. That is, ask engineers how much time they wasted on issues caused by technical debt.

I did this in the past with a spreadsheet that teams would fill out during each sprint retrospective. Each person would enter the percentage of their time they felt they lost to tech debt “interest,” as well as time they spent resolving tech debt “principal.”

Over a span of about two years with four teams, the average percent of time engineers reported losing to tech debt was 7%. However, as you can see in the histogram below, most sprints had under 5%, while a minority of bad ones were over 20%.

As a manager, I focused on the bad sprints where tech debt went beyond minor annoyance and was a major problem. I then dug into the specific root causes to make sure they were addressed.

The downside of this approach is that people are sometimes biased in either direction. Engineers may report a higher amount of technical debt if they experienced a small but particularly frustrating issue.

On the other hand, each person’s definition of tech debt is different. Junior engineers in particular might not have a good reference point for what developing software in a high-quality code base looks like, and can easily normalize and under-report real issues.

Though we were pretty on top of fixing technical debt, the overall 7% number feels low, so it’s worth looking at things from another angle.

Measure bugs

Beyond slowing down feature work, another major way that tech debt manifests is by creating bugs.

There’s no such thing as bug-free software, but code with a lot of tech debt also tends to generate a lot more bugs.

Part of the reason the overall tech debt number in the previous section is only 7% is that the self-reported number only covered non-bug tasks and bugs were counted separately. Bugs represented an additional 19% of all story points completed, for a total of 26% bugs and non-bug tech debt combined.

This histogram shows a percentage of each sprint’s story points spent on bugs, which is much more substantial than points attributed to non-bug tech debt:

The challenge with bugs, however, is deciding what they mean. They’re small and there are a lot of them. It’s usually not easy to attribute a given bug to a particular piece of tech debt.

In theory, any bug could have been prevented with better test coverage and/or static checking. However, knowing in advance how many future bugs will be prevented by particular test coverage or refactoring is hard.

Going back to the VP of engineering candidate I mentioned earlier, he required developers to record the module that was the root cause of each bug when closing it. This showed that one module in particular was responsible for an outsized portion of bugs, justifying an overhaul.

When you’re considering the bug cost of tech debt, a good approach is to start from the tech debt you know exists, and then find ways to identify bugs that would go away if it were fixed.

For example, you can look at particular repositories, folders, file extensions, or bug classes. If you’re considering the priority of moving from JavaScript to TypeScript, you can look at undefined reference errors in .js files and be reasonably certain those bugs would disappear.
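The filtering described above can be sketched as follows, assuming you can export closed bugs with a root-cause file path and an error class. The record layout and the `error_class` labels are hypothetical, so substitute whatever fields your tracker actually provides.

```python
from collections import Counter

# Hypothetical export of closed bugs; all field names and values are illustrative.
bugs = [
    {"id": 1, "file": "web/cart.js", "error_class": "undefined_reference", "points": 2},
    {"id": 2, "file": "web/cart.js", "error_class": "logic", "points": 3},
    {"id": 3, "file": "api/orders.ts", "error_class": "logic", "points": 1},
    {"id": 4, "file": "web/checkout.js", "error_class": "undefined_reference", "points": 1},
]


def is_js_undefined_reference(bug):
    """Bugs a JavaScript-to-TypeScript migration would plausibly eliminate."""
    return bug["file"].endswith(".js") and bug["error_class"] == "undefined_reference"


def points_matching(bugs, predicate):
    """Story points spent on bugs that match a given tech debt hypothesis."""
    return sum(b["points"] for b in bugs if predicate(b))


def points_by_module(bugs):
    """Bug points grouped by top-level folder, to spot outsized modules."""
    totals = Counter()
    for b in bugs:
        totals[b["file"].split("/")[0]] += b["points"]
    return totals


avoidable = points_matching(bugs, is_js_undefined_reference)
by_module = points_by_module(bugs)
```

Swapping in a different predicate (a repository name, a folder prefix, another bug class) lets you test other tech debt hypotheses against the same bug history.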

Quantify estimate misses

Asking engineers to record tech debt adds substantial overhead, and the data is not reliable unless recorded in real-time because people forget.

Another proxy for the impact of tech debt on non-bug issues is missed estimates.

The VP of engineering candidate mentioned previously also required engineers to fill out an “actual story points” field when closing tickets to collect information about estimate misses. He then added this to time spent on bugs to get an overall tech debt estimate by module.

You can also get this information with time logs or a system like minware. If you want a DIY solution, you can import pull request commit data from GitHub and look at the count of days with active commits compared to story points on tickets linked from pull requests.
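A minimal sketch of the DIY calculation, assuming commit data has already been exported and each commit is matched to a ticket through its pull request. The tuple layout and ticket keys are made up for illustration, and counting one "dev day" per distinct author and date is my assumption about how to define an active day.

```python
from collections import defaultdict
from datetime import date

# Hypothetical export: (ticket key, commit author, commit date).
commits = [
    ("PROJ-1", "ana", date(2024, 10, 1)),
    ("PROJ-1", "ana", date(2024, 10, 1)),  # same author and day: counted once
    ("PROJ-1", "ana", date(2024, 10, 2)),
    ("PROJ-2", "ben", date(2024, 10, 1)),
    ("PROJ-2", "ben", date(2024, 10, 3)),
    ("PROJ-2", "ana", date(2024, 10, 3)),
    ("PROJ-2", "ben", date(2024, 10, 4)),
]
story_points = {"PROJ-1": 2, "PROJ-2": 1}


def active_dev_days(commits):
    """Count distinct (author, day) pairs with commits for each ticket."""
    seen = defaultdict(set)
    for ticket, author, day in commits:
        seen[ticket].add((author, day))
    return {ticket: len(pairs) for ticket, pairs in seen.items()}


def days_per_point(dev_days, story_points):
    """Active dev days divided by estimate; unestimated tickets are skipped."""
    return {
        ticket: days / story_points[ticket]
        for ticket, days in dev_days.items()
        if story_points.get(ticket)
    }


dev_days = active_dev_days(commits)
ratios = days_per_point(dev_days, story_points)
```

Here PROJ-2 took four active dev days for a one-point estimate, which makes it a candidate estimate miss compared with PROJ-1 at one day per point.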

Estimate misses aren’t a perfect metric. There are many problems besides tech debt that can cause estimate misses, and engineers may buffer their estimates based on the presence of tech debt.

Nevertheless, if you look at estimate misses in aggregate and group them by system component (such as by filling it out as a Jira ticket field as mentioned earlier), then you can approximate tech debt overhead by looking at the difference in estimate misses between problematic components and those with less tech debt.
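One way to sketch this comparison is to aggregate estimated and actual story points by component and measure how much worse a problematic component is than a baseline one. The component names and numbers below are made up, and using the ratio of the two miss ratios as the overhead is an assumption on my part rather than a standard formula.

```python
from collections import defaultdict

# Hypothetical closed tickets: (component, estimated points, actual points).
tickets = [
    ("billing", 3, 6),
    ("billing", 2, 5),
    ("search", 3, 3),
    ("search", 5, 6),
]


def miss_ratio_by_component(tickets):
    """Total actual points divided by total estimated points per component."""
    estimated = defaultdict(float)
    actual = defaultdict(float)
    for component, est, act in tickets:
        estimated[component] += est
        actual[component] += act
    return {c: actual[c] / estimated[c] for c in estimated if estimated[c] > 0}


def relative_overhead(ratios, problem, baseline):
    """Extra effort in the problem component relative to the baseline component."""
    return ratios[problem] / ratios[baseline] - 1.0


ratios = miss_ratio_by_component(tickets)
overhead = relative_overhead(ratios, problem="billing", baseline="search")
```

Comparing against a baseline component rather than against the raw estimates helps cancel out estimate misses that have nothing to do with tech debt, since those should show up in both components.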

The advantage of estimate misses over asking engineers about tech debt is you’re no longer subject to personal biases from normalizing problems, overestimating the impact of frustrating issues, or the fallibility of human memory. The truth lies in the numbers.

How to automatically track tech debt

Manually tracking tech debt takes a lot of time. Many people, including myself, have found the trade-off worthwhile to make better decisions about managing tech debt.

However, it’s better to automate the process, particularly because automation gives you historical data, which is useful but rarely worth going back to label by hand.

We’ve created a tech debt cost report in minware that lets you quickly set up an automated solution:

The report uses minware’s time model to calculate the amount of effort spent on each ticket based on commit data and links between pull requests and tickets.

What you consider an estimate miss might vary depending on your team and organization.

The report shows the average number of dev days spent per story point, which you can then use to set a days per point threshold above which you count time as over the estimate. The default is double the average, which leaves room for story point estimates being approximate while still counting time that is significantly above what is expected for a given story point level.
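The thresholding idea can be illustrated with a short sketch. This is not minware’s implementation: the ticket tuples are invented, and computing the average as total dev days over total story points is an assumption.

```python
def average_days_per_point(tickets):
    """Total dev days divided by total story points across all tickets."""
    total_points = sum(points for points, _ in tickets)
    total_days = sum(days for _, days in tickets)
    if total_points == 0:
        raise ValueError("no story points recorded")
    return total_days / total_points


def overage_days(tickets, multiplier=2.0):
    """Dev days beyond the threshold (multiplier x average days per point)."""
    threshold = multiplier * average_days_per_point(tickets)
    return [max(0.0, days - threshold * points) for points, days in tickets]


# Hypothetical tickets: (story points, dev days spent).
tickets = [(2, 2.0), (3, 3.0), (1, 1.0), (2, 10.0)]
overages = overage_days(tickets)
```

In this example the average is 2 dev days per point, so the default threshold is 4 days per point. Only the last ticket exceeds its 8-day allowance, contributing 2 days of overage.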

The next part is deciding how to slice the cost of bugs and estimate overages. The report defaults to doing it by team, but you can also edit the report to aggregate by repository or custom ticket field to look at tech debt overhead in different ways.

Finally, this report shows the numbers in time rather than dollars. If you want to report on total personnel cost, you can update the values to multiply them by average engineering salaries in your organization.
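Converting time to dollars is simple arithmetic. A sketch, where the salary and working-days figures are placeholders to replace with your own:

```python
def personnel_cost(dev_days, average_annual_salary, working_days_per_year=250):
    """Convert dev days into an approximate dollar cost.

    Both the salary and the working-days figure are inputs you supply;
    the default of 250 working days per year is an assumption.
    """
    if working_days_per_year <= 0:
        raise ValueError("working days per year must be positive")
    daily_rate = average_annual_salary / working_days_per_year
    return dev_days * daily_rate


# Hypothetical example: 2 dev days of overage at a $150,000 average salary.
cost = personnel_cost(2.0, 150_000)
```

This uses salary alone. If you want a fully loaded figure, you would substitute total compensation including benefits and overhead.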

This report requires a lot of org-specific configuration since everyone’s tech debt situation is different. If you want to chat more about tech debt, just send me a message – I’d love to hear from you!

Conclusion

For many engineering leaders, managing tech debt is the most challenging and critical part of their job.

It’s challenging because the costs are elusive and decisions often default to gut feel.

Measuring technical debt isn’t a cakewalk, but it is possible with the approaches outlined in this article.

The silver lining is that most companies struggle badly with tech debt management. If you do it well, you can gain a significant leg up on your competitors.

Hire the Most Expensive Engineers You Can Find

Kevin Borders — Fri, 04 Oct 2024 15:16:39 GMT

A disagreement about employee compensation almost killed the sale of Collage.com to private equity back in 2021.

A top employee wanted a raise.

But, the buyer was concerned because he was already being paid above the 90th percentile for his job role.

However, he was a 99th percentile employee, not a 90th percentile employee. We knew that he could get a job at Google for the salary he was requesting.

Moreover, having a top person leave because we refused to pay him top-of-market right after selling to private equity could trigger an exodus of other top-of-market people.

When we finally “won” this argument, someone on their side commented to my business partner “I’m glad we could work this out, I know you really like him.”

Well, if you ever want to offend a labor economist (like my former partner), imply that they make compensation decisions based on personal relationships.

They essentially gave in to get the deal done even though they disagreed with our position and assumed we were doling out favors like the mafia rather than trying to run a rational business.

Top talent is a lemon market

Our story wasn’t actually that bad. The PE firm went through with the deal and didn’t try to cut any salaries – just block one raise.

Other buyout firms like Vista are notorious for aggressively driving down engineering salaries (someone I know was being forced to keep the average salary below $90k annually!)

When I first heard this, I thought it was crazy.

After thinking more about my experience, however, it starts to make sense.

The fundamental issue in our disagreement while selling Collage.com wasn’t that the PE firm believed top-of-market employees are overpaid generally, but that they had no way of validating our assessment that a particular person was top-of-market.

In short, there was information asymmetry. Just like a mechanic trying to sell a used car that’s actually good in a lemon market, the buyer has less knowledge and can’t be sure whether that’s true, so they don’t want to pay top price.

The people who run Vista have been successful and are surely savvy enough not to believe Google overpays their top engineers.

When they’re dealing with random small companies, however, they can’t know whether engineers with top-of-market salaries are actually top-of-market talent, or whether the founders are just overpaying buddies of theirs who could never get a job at Google.

It’s easy to see how true 99th percentile employees lose out.

The technical assessment chicken-and-egg problem

The lemon market issue for top talent extends more generally, not just at PE-owned firms.

Evaluating technical skills at the high end is very difficult. To do it independently (i.e., not just hiring someone who worked at another company with good evaluations), you need someone who has those skills or close to them to structure the evaluation.

This creates a chicken-and-egg problem because the CEO or CFO (who is rarely technical) needs to have confidence that the engineering leader can do the evaluation, or identify someone who can, before approving top-rate salaries.

I can attest after talking to many VP of engineering candidates with nice-looking resumes that many of them aren’t up to the task, and CEOs are right to be skeptical.

Another approach is to just hire people who passed technical evaluations at other reputable companies.

This can work, but it’s more expensive and risky than evaluating people yourself. Far fewer people have worked at places like FAANG companies than exist in the broader talent pool, so they’ll cost more. Also, big companies make mistakes and have a range of performers too. Without your own evaluation, you’re liable to end up with the lowest performers who cleared the bar elsewhere, which puts you partially back in the lemon market situation.

Ultimately, the difficulty of independently differentiating between someone who could be a staff engineer vs. a principal engineer at Google leads to widespread information asymmetry at the high end of the market, especially for people who don’t want to work at big-name companies with widely reputable technical assessment practices.

Following lemon market theory, this drives down prices for hiring top talent.

It’s hard to measure the financial impact of engineering

An article from the Economic Policy Institute about CEO pay shows that it has risen 1,322% since 1978.

You may be wondering: why 1978? Well, in the early 80s, there was a run-up of CEO salaries based on analysis of their impact on stock price and a shift to stock-based compensation. Essentially, people started paying CEOs based on their estimated share price impact.

Some CEOs might be overpaid, but certainly some of them do produce massively more value than an average worker (for example, Steve Jobs, Elon Musk).

It’s unlikely that a top engineer can add as much value as a top CEO, but with the extremely high leverage of software, the impact can still be large. Top companies like Google have recognized this. According to levels.fyi, Google pays distinguished engineers $2.6M.

However, it’s a lot harder to directly tie engineering contributions to business results than those of a CEO, and most companies probably aren’t as good at it as Google.

At lower senior levels like staff and principal engineer, it is even more difficult because there are fewer big patents or innovations directly linking specific people to revenue.

Ultimately, companies aren’t going to pay people more than they believe that they will add to the bottom line, which is difficult to demonstrate for individual top engineers, even if it’s true.

The high-salary political factor

The CEO compensation article in the last section goes on to mention: “Exorbitant CEO pay is a major contributor to rising inequality that we could safely do away with.”

People who add above-average value for companies have to fight against tremendous political pressure if they want to earn a commensurate amount of that value.

At a local level, it’s awkward for someone to make 13 times more than their colleagues (levels.fyi shows $204k for an entry-level engineering salary at Google, 1/13 of the $2.6M for distinguished engineers).

Also, to get paid more, you have to want to fight for it. This is just a hunch based on my personal experience, but of the people I know, engineers tend to be more egalitarian than CEOs. I wouldn’t be surprised if engineers are generally less aggressive in pursuing higher salaries than CEOs.

On top of the lemon market issues, it’s important to consider that top engineers may be further underpaid due to broad political sentiment, as well as their own personal beliefs.

The first catch: identifying top talent

So far we’ve focused on the factors that make top engineers the most underpaid. The obvious conclusion is that if you can find one and convince them to work for you, you should hire them.

The first catch, however, is that “if.”

The driving force of lemon markets is information asymmetry. To actually realize a bargain, you need to break that asymmetry and differentiate people who are actually good from people who just look good.

There’s no magic solution here and it is an extremely difficult problem with too much nuance to cover here in depth, but there are a few things specific to engineering you can do to increase your chances.

Make engineers do engineering

An engineer’s job is to do engineering, not talk about doing engineering.

I was surprised to encounter this at first, but some people who are great at talking about engineering can’t actually do it to save their lives and fail basic coding exercises.

On the other hand, mediocre candidates can quickly get good at solving LeetCode-style interview questions. So just giving out algorithm and data structure problems on a whiteboard can lead you to pass over people who would do a good job or, worse, to hire people who can’t actually do the job at all.

You should have a brief (2-3 hour) technical assignment as part of any engineering hiring process, and the assignment should reflect real work as closely as possible. Why 2-3 hours? I have found that you can learn a lot about a person’s abilities from a problem that can be solved in this amount of time. It is a commitment that most (but certainly not all) candidates can make to an interview process about which they are serious.

This means working with existing code, debugging, writing tests, and clarifying ambiguous requirements. For more senior engineers, this means drafting and reviewing architectural plans.

Penalize people who are charismatic

Doing a technical assignment can help you be more objective, but I have found the most challenging thing about hiring to be overcoming the immense bias people have to favor those they like.

Of course, if a candidate presents as untrustworthy or dishonest, is a very poor communicator, or for some other appropriate reason would be a negative addition to your team, you should stop the interview process.

But, if someone is unassuming and not very animated or exciting, you should get excited!

You may wonder, if someone is good at selling, isn’t that a positive thing? All else being equal, yes. They will be able to better communicate internally and get things done.

However, if they are better at selling, then there is a much higher risk that interviewers will assess their skill set to be better than it actually is, especially if the candidate takes credit for accomplishments that were a team effort.

As if that isn’t bad enough, previous interviewers at other companies are more likely to have over-assessed the candidate, so the reliability of signals from prior resume experience is also much lower.

Actually putting this adjustment into practice is difficult. You need someone to play devil’s advocate in the hiring process and question the judgment of other interviewers.

One thing that can make it easier is to instead give a bonus to unassuming “diamond in the rough” candidates who you think interview below their ability level. This makes the conversation more positive and less contentious (“we should give this person a chance” vs. “the person you like is actually unqualified”).

Beware of people who only care about money

With the market for top engineers being underpriced, one way candidates level the playing field is by seeking other non-monetary benefits.

The best people tend to be very particular about their working environment and value things like low bureaucracy, low politics, low technical debt, the ability to have a direct impact on customers, talented colleagues, and supportive management.

This isn’t foolproof of course, but you should take it as a negative signal if candidates are not heavily focused on the quality of their working environment and the talent of their peers, because this may mean they are not top-of-market and don’t have the luxury of worrying about these intangibles.

The second catch: convincing top talent to work for you

One of the biggest things that gets in the way of successful engineering hiring is management ego.

If you want to succeed at bringing in top-of-market candidates, it’s important for everyone in management to understand that by working for you at the prevailing salary, candidates are doing you an incredible financial favor. Engineers aren’t being “divas” or “snobs” – they’re merely trying to recoup some of the value they create in the form of enjoyment and career development.

You’re also up against Google, Microsoft, and others who invest heavily in employee experience, so you need to find an edge that big tech companies can’t offer.

For the people I’ve been lucky enough to hire at my companies, the thing I have been able to offer them that Google can’t is direct access to top management and real influence on the direction of the company.

If a talented engineer believes that fixing a major piece of tech debt is worth it, let them do it without a complicated approval process. (Though, of course, you should ask them for a plan.)

If you’re talking to investors and considering whether to raise an equity round, discuss it one-on-one with individual engineers. Teach them about the nuances of finance and ask for their opinion.

If you want to land a top engineer, the CEO should speak with them during the interview process to understand what they value beyond money, and personally guarantee that they’ll get it with the full support of management.

Conclusion

Bending over backwards to recruit top engineers and pay them top-of-market may seem counterintuitive, particularly for small businesses concerned with efficiency.

This has been my strategy while bootstrapping two start-ups, and it’s worked very well. Top people get so much more done with high quality and less management overhead, plus their insight drastically improves decision making at top levels of the company.

As a final parting thought, when something feels counterintuitive, that means it is counterintuitive for others too, and may just be a great opportunity.

Don’t Fall for the Return-to-Office Hype

Kevin Borders — Mon, 30 Sep 2024 17:40:08 GMT

Return-to-office is in full swing. YCombinator – the world’s top start-up accelerator and bellwether of tech trends – has moved back to San Francisco. I recently received this reply from an investor in YC start-ups, which echoes the prevailing sentiment:

But should you jump on the return-to-office bandwagon?

Maybe, but be careful.

The book Remote has a lot of good arguments in favor of remote work, and I’d recommend reading it if you haven’t.

In this article, we look at some of the less obvious things that you should consider before recalling your employees to the office, especially if you care about efficiency and are not a VC-backed startup.

What’s best for YCombinator might not be best for you

The #1 mistake I’ve seen people make with big decisions is assuming that something that works for others will work for them without considering how their situation is different.

If we assume for the sake of argument that return-to-office is good for YCombinator, let’s take a look at how their interests conflict with the vast majority of companies and employees.

YCombinator and most VCs only care about minting unicorns

$1+ billion outcomes are all that matter for YCombinator, and for most venture capitalists more broadly. This means insane growth at all costs. What’s best for them may not be best for all the individual companies in their portfolio, let alone companies without VC backing.

The main way this manifests is the question of growth vs. efficiency. For normal companies, efficiency and profitability are essential for survival. In the strange land of venture capital, efficiency is someone else’s problem later down the road.

If VCs believe remote work is more efficient but doesn’t accelerate growth, then they won’t care about it, especially at the seed stage.

Start-up employees are younger

According to Paul Graham in 2007, “the average YC founder is about 25.” According to a study by Radix, the average age of a software developer in the US is 39.8 years old.

This relates to remote work in a few ways. First, older people are more likely to have children. Remote work is more beneficial to people who have to juggle schedules with school or child care. Strict schedules also make it harder to put in as many hours when you have to commute.

Second, older people generally have more money and better options for a home office. Younger people are more likely to have small apartments or roommates, which can make remote work difficult.

Finally, younger people have different social needs. For many, work is an important place to establish friendships. People with families tend to be less inclined to stay until 8 PM on a Friday for happy hour.

People at seed-stage start-ups may prefer an office while people at other companies may not, simply due to age.

YC companies can easily attract top talent to an office

Hot start-ups in major tech hubs like San Francisco and New York can easily recruit and pay for the best talent in their own backyard.

For others, hiring great engineers is a major challenge.

Perhaps the biggest benefit of remote work is the ability to hire from a larger talent pool. This simply doesn’t matter much for small VC-backed startups.

Commuting isn’t free for employers

A shockingly common fallacy that a lot of people engage in is thinking that just because they don’t pay for something directly, the cost is zero.

Take sales tax, for example. When we had to start collecting sales tax everywhere at Collage.com following changes to the law for ecommerce companies, I heard comments from various lawyers and accountants that it wasn’t a big deal because our customers would just pay the tax.

My business partner, who is actually an economist, estimated that we effectively paid 80% of the tax. Why? A price is a price. Some people might not consider sales tax. However, if someone only has $50 to spend (and doesn’t intend to file a use tax return, which very few people do), then that is it and the company will get less money.

For return-to-office, the thinking goes that if you still require the same number of hours and employees commute on their own time (and dime), then it doesn’t cost anything to the company.

Wrong. If commuting is a burden rather than a leisure activity (I don’t personally know anyone who feels otherwise), then employees will factor that into their decision about which job to take. With all the remote job options, they will demand higher pay to go into an office.

Moreover, even if you demand the same number of hours, you are just making the work day longer. People may have less energy and lower productivity at the start and end of the day after a long commute, and you remove people’s ability to choose their work hours based on when they are at their best.

You may still decide that being in an office justifies the cost, but don’t fool yourself into thinking it’s free.

Trends are dangerous

Remember open office spaces? Despite significant evidence they hurt productivity, they were all the rage before COVID.

Why? It’s hard to answer that question definitively, but they were certainly trendy, with top companies like Facebook (before it was Meta) leading the pack.

It is safe to say that returning to the office is now a trend.

When things are trendy, people do them because other people are doing them, not because they make sense.

In the case of ditching remote work, there is some validity to following others because now there are fewer competing remote job options for your soon-to-be-disgruntled employees. However, there are still a lot, and it will be harder to attract people who prefer to work remotely.

The point here is not about whether remote work is good, but that anyone making such a decision should try to ignore what other people are doing and focus on what is best for their company.

Office work has significant hidden overhead

Everyone knows how much they pay for rent and office furniture, but in-office work has other large unknown costs.

For better or worse, being around other people takes time. You have to converse with your colleagues to maintain social relationships, worry about how you’re dressed, and generally spend time thinking about others’ social perception of you while you’re in the office. On top of this, there’s all the office politics. (This is all worse with open office spaces, of course.)

Sure, there is some of this in a remote setting, but you interact with other people on your own terms and choose how much effort to put into social interaction.

In addition, I have personally found that being around other people increases groupthink. It’s easier for people to think in a bubble when they are literally enclosed by the same four walls.

Of course, there are benefits to socialization and collaboration in an office as well, and remote companies often have in-person meet-ups to build trust and foster connection.

But, if you’re analyzing the costs and benefits of office work, don’t just write down office space on one side and better communication on the other. It’s more complicated than that.

It’s all about the talent

I’ve already mentioned that recruiting from a larger talent pool is a major benefit of remote work, but it bears repeating.

If you’re doing a cost-benefit analysis of remote work, your short-term costs and benefits will entirely depend on your current workforce.

However, even if an office would be better for the team you have today, remote work could still win in the end by helping you attract and retain a better future team.

Now that everyone has had a taste of remote work, those who like it and have the freedom to choose will keep working remotely.

Who has the most freedom to choose? The best people. They have the most financial independence because they have the highest salaries, and they can get a job wherever they want.

I was just talking to a friend who’s a top-tier data scientist looking for a job, and he’s not even considering Amazon because they’d make him come to an office five days a week.

Other people may prefer an office, but labor is a market governed by supply and demand. With so many companies going back to the office, it seems more likely now than before that remote employers will have better employees available at a lower cost for some time to come.

Remote work isn’t for everyone

I love remote work personally and have been successful with it.

However, I’ve also seen it go poorly, like when Collage.com (my last company) went out of business a year after selling and being run by executives who didn't understand remote work. (There were other problems too, but I believe remote work issues were a significant factor.)

From what I have seen personally, there are two cultural issues that mix very badly with remote work.

Remote work is bad with low candor

The first is what I would call a low-candor culture, which is common in American companies. In a low-candor culture, people tend to not say what they think when it’s uncomfortable.

If you’re remote, this leads to people not sharing any critical opinions and keeping everything bottled up until it explodes.

Low-candor cultures can survive in an office because it’s easier to pick up on non-verbal cues in person. It’s also easier to open up when you’re physically in a room with someone.

If this is how people in your company are accustomed to discussing uncomfortable topics, you probably need an office, though hybrid work may suffice.

Remote work is bad with low accountability

I have also seen companies that suffer from low accountability, where people fundamentally don’t want to work any more than they have to.

For example, if you ask someone a question and they give you an answer they are unsure of because finding the real answer would be more effort, then you have low accountability.

When you’re in an office, it’s easier to pin people down and actually get things done in this type of environment.

You can say “Hey, did you actually follow up with person X to see if the new proposal would work for them or are you just assuming? No? Let’s go find them right now and get an answer.” Or, maybe you put everyone in a meeting to make sure there are no gaps in communication.

I have encountered this type of thing most commonly with outsourcing where people have a mercenary attitude.

It’s hard for low-accountability teams to be as productive outside of an office, even in a hybrid setting.

The bottom line: is remote work right for you?

This is a difficult question that is not going to have the same answer for every organization, nor should it.

When making a decision that has a major impact on the lives of every employee, the important thing is to do so carefully, with full consideration of your unique situation, and to resist pressure based on what others are doing.

Going against the grain also has its benefits. As for the comment earlier, here was my response:

What Every CEO Should Know About Software Planning

Kevin Borders — Mon, 23 Sep 2024 19:08:58 GMT

For many CEOs, software engineering is a black box. A roadmap comes out, money goes in, and then software comes out… maybe.

As a result, software project failures are rampant at companies both big and small.

Large companies might be able to absorb such failures, but even one could mean the end for a smaller bootstrapped business.

When I sold Collage.com and first took over as CTO at the parent company, the CEO told me: “Project X was supposed to take three months. We’ve been working on it for a year and I have no idea why.”

I hear different versions of this story every day, and they share one thing in common: bad planning.

It doesn’t have to be this way.

By instituting a few key practices and metrics, business leaders can dramatically decrease the risk of software project failures without micromanaging or getting into the weeds on technical details.

Why do software projects fail?

The simple answer is: poor planning.

There can be other reasons like unforeseeable technical risk or a key person leaving, but nine times out of ten, the project would have succeeded (or not started in the first place) with a better planning process.

Why do software plans fail?

The answer here is also simple: because they’re too big.

The larger a plan is, the more ways there are for it to go wrong, and the greater the impact if it goes off the rails.

Some projects are inherently big, like building a car. However, if you look at failed software projects, most of them are a lot larger than they need to be.

Also, even if the overall plan isn’t too big, large code changes or tasks within that plan can still derail a project.

Myth: Agile/scrum will save you

If you plan work in two-week sprints, then each one will be small and therefore more likely to succeed.

Or so the thinking goes.

The first problem is that some projects take longer than one sprint from business planning to customer value. You can plan sprints all you want, but if the value delivery cycle is longer than a sprint, then sprints won’t prevent project failures.

Going back to my experience as CTO, the team working on the project that was 12 months into a 3-month estimate had been diligently planning sprints the whole time!

The second problem with sprints is that individual code changes should be much smaller than the sprint duration. Two weeks may be short for solving a customer problem, but if you put all the team’s work for two weeks into a single pull request, you’re asking for trouble.

Finally, sprints are blind to bugs. When you create bugs, those bugs are really part of the original feature cost. With sprints, however, they just show up as new tickets in a future sprint.

Sprint metrics on their own don’t drive or hold people accountable for quality. By creating more tickets and points to fix bugs, they can in fact do the opposite because it’s easier to meet your sprint commitment if you cut corners on software quality.

Levels of planning

To truly reduce the risk of project failures, it’s important to understand the different levels of planning. A whole project or sprint may be small – let’s say a few weeks – but if the individual tasks are bundled into large chunks, it can still blow up.

Good planning should seek to minimize work batch sizes at each of the following levels.

Project level: Value live

The project itself should represent work that delivers value to the customer by solving a problem.

Typically this is represented by an “epic” parent ticket in Jira, but any equivalent field that groups tickets will suffice.

The important thing is that once the project is complete, you can validate whether it delivered customer value.

Ticket level: Feature live

Below the project level, you have individual features that work together to deliver value.

Each feature is typically represented by a ticket in your project management system. You may have subtasks or a checklist, but the ticket should represent working functionality.

For a ticket to be complete, the customer (or a representative of the customer if it’s in a staging environment) must be able to use the functionality so you can validate that it works as intended.

Pull request level: Code live

Below each feature is an individual code change, typically consisting of a pull request in your version control system.

For a unit of work to be complete at this level, it must be deployed – ideally to production but possibly in a staging environment.

With work units at this level, you want to validate that the code itself doesn’t break once introduced into the full environment.

Planning pitfall #1: Projects are too big

Each project or epic should be the minimal size to solve a problem for the customer.

At the same time, once an epic is “done,” it should be possible to validate the customer solution. Splitting up epics just for the sake of it only hides larger value-delivery batches.

An epic isn’t truly done until the customer is able to realize value.

People run into trouble when a lot of solutions are bundled together. If there is a larger initiative, you should use a distinct epic for each milestone that creates value. This way, you can incrementally verify that value and still have something to show for your effort if the project stops before completing all the milestones.

Planning pitfall #2: Tickets are too big

It’s not enough for projects to be small. Each ticket should be small as well.

Each ticket should be code-live – that is, you should not bundle multiple tickets into a single pull request. Otherwise, you may encounter integration problems and have to do more work after tickets have been marked done.

Each ticket should also be feature-live, meaning that someone can use the functionality of the ticket before marking it as done.

A ticket/feature is not done until users have had the opportunity to break it.

Large tickets that encompass multiple pieces of functionality are a lot more likely to go over estimate. The reason is that a large ticket indicates the developer hasn’t carefully thought through each step and risk associated with the task.

Planning pitfall #3: Code deployments are too big

Ideally, each code change should be done in a small pull request and deployed to production when that pull request is merged using a continuous integration/continuous deployment (CI/CD) system. This means multiple deployments per day.

Sometimes this is not practical or possible, such as if you have to submit to an app store that only allows weekly updates.

If you can’t deploy to production daily, you should at least deploy to a staging environment that's as realistic as possible each day.

A code change is not done until it is deployed in a realistic environment.

Until that point, you can never be sure what will happen when it’s integrated with existing code and exposed to production workloads that can trigger subtle performance issues and other problems.

Furthermore, as the size of a deployment grows, the risk of it causing problems and the cost of fixing those problems both increase, so the expected cost grows multiplicatively.

Really big deployments can require weeks of post-launch hotfixes, delaying the true project completion time and tarnishing the company’s reputation in the process.

How to avoid failures with technical planning

For the 3-month project I mentioned earlier that I took over 12 months in, I eventually learned that the 3-month estimate came from a product leader who arrived at the number without consulting engineers. He just wanted it to take that long.

The number one reason this project blew up was the lack of a planning step between roadmap commitment and sprint implementation. I call this step technical planning.

When you’re first discussing projects during roadmap planning, you may not have small tasks because the scope may not be defined.

The technical planning stage involves identifying technical risks and defining the individual tickets you will need to complete to finish the epic and deliver value for the customer.

It should happen after roadmap planning but before implementation so that there is an opportunity to change scope or cancel the project. Why? Because until technical planning is complete, you don’t know the true cost of the project.

But planning tickets ahead of time isn’t agile!

Too bad. Make your epic smaller, but if you can’t fit it into one sprint, then you need to plan more than a sprint’s worth of tickets up front.

If you’re not able to deliver and validate a customer solution (the purpose of an epic) in one sprint, then it’s not really “agile” anyway.

Incidentally, when I introduced the concept of technical planning to some people on the team that was nine months behind schedule, I was met with fierce resistance. I was told that it was a waste of time, and while my team was busy with technical planning, their team was shipping software. (While it was tempting to point out that they hadn’t actually done so for a year, it didn’t seem like it would help my argument, so I kept quiet.)

Instead of holistic technical planning, this team was effectively doing technical planning for the next component of the project each sprint. This meant that the true project cost was being discovered in two-week increments. It was always “almost done,” but nobody could say how much longer it would take.

If there’s one thing you should take away from this article, it is to never start on an open-ended project. This is like agreeing to buy something without knowing the price.

A project can be open-ended if the epic does not have all of its tickets, or if those tickets are big and vaguely defined, which we’ll discuss next.

What makes a good technical plan

A good technical plan does two things. First, it mitigates technical risks. Second, it provides a precise estimate of the overall cost by breaking down work into small tasks.

A technical plan should be completed by the team who will do implementation and reviewed by leadership before implementation begins.

One pitfall people encounter is thinking that all code is implementation. As a result, they fail to account for technical risks like system interoperability or performance during technical planning and end up having to redo large amounts of work.

During technical planning, engineers should be encouraged to build small prototypes to validate the architectural design and better anticipate the scope of implementation work.

Conversely, not all specification of functionality constitutes planning. Figuring out the exact button color is probably an implementation activity because it is not likely to have dependencies or impact the overall project estimate.

Doing functional specification in too much detail during the technical planning phase can bloat the process and waste time if the project does not move forward.

Identifying risks in a technical plan

My favorite way to explain risk mitigation during planning is with a peanut butter and jelly sandwich analogy.

A bad technical plan will have one task that says “Make PB&J sandwich by putting peanut butter and jelly between two pieces of bread.”

This task could easily blow up for many reasons. A good technical plan will explore all of the risks and details, such as:

What if you are out of an ingredient, or don’t have a knife or plate?
Who’s going to be eating the sandwich, and do they have any allergies or gluten intolerance?
Does the consumer prefer more or less of any ingredients, or have any quality preferences? When and how will they communicate those preferences?
When will you have to make the sandwich? How long does it take to get to the store that time of day and replace an ingredient? Will you have transportation available?
How soon is the sandwich expected to be ready? If you can’t produce the full sandwich, is an on-time peanut-butter-only sandwich better than a late PB&J sandwich?

As you can see, even a simple task becomes complicated if you want it to succeed predictably under a variety of circumstances.

When it comes to software, a good technical plan should contain a checklist of risks. Many of these may be business-specific, but you should consider things like performance, security, backward compatibility, localization, accessibility, legal issues, etc.

At my former company, Collage.com, we had a list of about ten items like this and it saved us on many occasions.

Breaking down work into small tasks

A good technical plan should also break down work into small tasks. The main benefit of this is that it forces you to think through each step. If you don’t do it, then you’re liable to overlook things and underestimate the tickets.

If you’re using story points and one point is approximately a day, then small means 1-2 points, medium is 3, and large is 5 or more.

Any tasks estimated at more than a few days are a red flag.

I can’t tell you how many times I’ve seen a 5-day estimate (e.g., implement user profile editing) turn into four weeks after I asked a developer to list each specific one-day task.

How CEOs should review plans to prevent project failures

With a solid technical planning process in place, CEOs have a good way of reviewing plans to prevent failures without micromanaging.

Note that this applies to CEOs of small companies. At larger companies, a lower-level leader might fill this role, but whoever is in charge should do the following:

Roadmap Planning - CEOs should review and approve the roadmap plan including rough initial estimates.
Technical Planning - CEOs should review and approve the technical plan prior to implementation. The executive’s role here is to verify that the project still makes sense given the more precise cost estimate, and that the plan is not open-ended by failing to address risks or break down work into small enough tasks.
Sprint Planning - At the end of each sprint, the CEO should review progress on the project to decide if it should continue by looking at two things: (1) how much work has been completed, and (2) how much new work has been added to the epic (with an understanding that small amounts of discovered work during implementation are normal). This helps identify external impediments to velocity or problems with the technical plan before they derail the project timeline.

The nice thing about this structure is that it gives the CEO necessary visibility to prevent project failures without making the team feel micromanaged or not trusted. With this in place, you can avoid randomly asking the team “Why isn’t the project done yet?” which can be frustrating and disruptive.
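The sprint-level review described above comes down to two numbers per epic: work completed and work added. Here is a minimal sketch in Python, assuming you can export each ticket’s estimate, the date it was added to the epic, and its completion date. The Ticket fields, the function name, and the 20% threshold for “too much discovered work” are illustrative assumptions, not values from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ticket:
    points: float              # estimate in story points (hypothetical field)
    added: date                # when the ticket was added to the epic
    completed: Optional[date]  # None if the ticket is still open


def sprint_review(tickets, sprint_start, sprint_end, max_added_ratio=0.2):
    """Summarize completed vs. newly discovered work for one sprint."""

    def in_sprint(d):
        return d is not None and sprint_start <= d <= sprint_end

    completed = sum(t.points for t in tickets if in_sprint(t.completed))
    added = sum(t.points for t in tickets if in_sprint(t.added))
    # Scope that already existed when the sprint started.
    baseline = sum(t.points for t in tickets if t.added < sprint_start)
    # Work not finished by the end of the sprint.
    remaining = sum(
        t.points for t in tickets
        if t.completed is None or t.completed > sprint_end
    )
    added_ratio = added / baseline if baseline else 0.0
    return {
        "completed": completed,
        "added": added,
        "remaining": remaining,
        "added_ratio": added_ratio,
        # The threshold is an assumption; pick one that fits your team.
        "needs_discussion": added_ratio > max_added_ratio,
    }
```

If the added ratio stays small sprint after sprint, the technical plan is holding up. If it keeps growing, the project may be open-ended and worth revisiting.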

Core project planning metrics

In addition to reviewing individual project plans, CEOs should look at core project planning metrics to make sure the plans are accurate and reliable.

Non-negotiable bookkeeping for traceability

For leaders to have any visibility, individuals need to record their work in a way that is not hidden.

This generally means the following:

Nearly all coding work should use version control and pull requests
Nearly all work done by developers should be documented in a ticket
Nearly all pull requests should be linked to tickets
All project tickets should be in an epic (or have an equivalent project field)
All tickets should have estimates

The overhead of these things is negligible and they are essential for visibility, so you may have the occasional slip-up, but there’s no excuse for not doing them >95% of the time.

To track these things, you can manually export lists of main branch commits, pull requests, tickets, and epics to verify that they have the appropriate fields set. When looking at tickets, you can filter by those that have started work.
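If you go the manual export route, the check itself is simple to script. The sketch below is one rough way to do it in Python, assuming the exports have been loaded as lists of dictionaries. The keys ("ticket", "epic", "estimate", "started") are hypothetical and will differ depending on your version control and project management systems.

```python
def traceability(pull_requests, tickets):
    """Return the share of items that satisfy each bookkeeping rule."""

    def share(items, ok):
        items = list(items)
        if not items:
            return 1.0  # nothing to check, so nothing is untraceable
        return sum(1 for item in items if ok(item)) / len(items)

    # Only consider tickets where work has actually started.
    started = [t for t in tickets if t.get("started")]
    return {
        "prs_linked_to_ticket": share(
            pull_requests, lambda p: bool(p.get("ticket"))
        ),
        "tickets_in_epic": share(started, lambda t: bool(t.get("epic"))),
        "tickets_estimated": share(
            started, lambda t: t.get("estimate") is not None
        ),
    }


def below_target(report, target=0.95):
    """List the rules that are not followed more than `target` of the time."""
    return sorted(name for name, value in report.items() if value <= target)
```

Note that this counts items rather than work time, so one huge untracked pull request weighs the same as a trivial one. Treat the output as a starting point for a conversation, not a precise measurement.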

minware provides a Code/Ticket Traceability report that automatically tracks all of these things and rolls them up into a target percentage based on work time so you can spot large amounts of untraceable work.

Work batch sizes

In addition to reviewing individual project plans, it is also helpful to review work batch sizes in aggregate at each level – pull request, ticket, and epic.

When looking at these metrics, you should assess the number of days spent on each pull request, ticket, and epic. Pull requests and tickets with more than five days of work are a red flag.

With these metrics, it’s important to not just look at the average, but also dig into the largest outliers, because those will have the biggest impact on productivity.

There are different ways to gather these numbers yourself. Some companies use time logs, which are precise but impose overhead on each developer.

You can also look at story point estimates, which might suffice if ticket estimates are reliable, or total duration that pull requests are open.
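Whichever source you use, the analysis is the same: compute the average, flag anything over five days, and list the largest outliers. Here is a small Python sketch of that idea, assuming you already have a mapping from item name to days of work. The function name and the input shape are assumptions made for illustration.

```python
from statistics import mean


def batch_size_report(days_by_item, red_flag_days=5, top_n=3):
    """Summarize batch sizes and surface the largest outliers."""
    # Sort from largest to smallest so outliers come first.
    ranked = sorted(days_by_item.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "average_days": mean(days_by_item.values()),
        "red_flags": [name for name, days in ranked if days > red_flag_days],
        "largest": ranked[:top_n],
    }


# Hypothetical data: the average looks fine, but one item is a clear outlier.
report = batch_size_report({"PR-1": 1, "PR-2": 2, "PR-3": 12, "PR-4": 1})
```

In this made-up example the average is four days, which is under the five-day threshold, yet one pull request took twelve. That is why the average alone is not enough.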

minware has a Work Batch Sizes report that shows you all this information in a single dashboard so that you can spot areas where the work batch sizes are larger than you desire.

Rate of new and resolved bugs

Because one way to make work batches smaller is to cut corners on testing, you should also look at the number of newly created bugs. This gives you an indication of whether project work has issues with quality and whether underlying technical debt may be slowing teams down.

The rate of completed bugs is also important to make sure that teams are fixing the bugs they create. I’m a strong advocate of fixing all your bugs as a way to improve your development velocity, which you can read about in this article.

You can easily export a list of bugs from your project management system. minware also offers an automated Bug Management report that tracks fix vs. find rate and bug load by team.
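For a manual export, a few lines of Python are enough to compare how many bugs are found and fixed in each period. This is only a sketch: it assumes each bug is a dictionary with "created" and "resolved" period labels (for example, ISO week strings that sort chronologically), which are hypothetical field names rather than the format of any specific tool.

```python
from collections import Counter


def bug_rates(bugs):
    """Count bugs created and resolved per period, plus the running backlog."""
    created = Counter(b["created"] for b in bugs)
    resolved = Counter(b["resolved"] for b in bugs if b.get("resolved"))
    open_bugs = 0
    rows = []
    # Period labels must sort chronologically (e.g. "2024-W36", "2024-W37").
    for period in sorted(set(created) | set(resolved)):
        open_bugs += created[period] - resolved[period]
        rows.append((period, created[period], resolved[period], open_bugs))
    return rows
```

Each row is (period, found, fixed, open backlog). If the last number keeps climbing, the team is creating bugs faster than it fixes them.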

Pull request, ticket, and epic scope

Earlier, we talked about how it was important that each pull request, ticket, and epic correspond to live code, live functionality, and live value.

To ensure the trustworthiness of your other metrics, you should audit items at each level to verify they actually correspond to work batches at the right level.

Because the meaning of live code, functionality, and value will depend on your environment and business, it’s hard to automate these metrics.

So, I recommend randomly auditing a sample of each item and looking at whether it represents less than or greater than the appropriate unit of work. One way to do this with tickets is to look at their acceptance criteria and see if they involve using the functionality. Similarly, with epics, you can look for validation steps that check whether the epic provides value for the customer.

Unlaunched code size

Ideally, you want to know how much undeployed code you’re sitting on for “done” tickets. This gives you a pulse on the risk of unexpected failures and rework later in a project.

Deployment frequency may be an okay proxy for this. It is one of the DORA metrics, which you can find in minware’s Dora Metrics report.

The problem with deployment frequency is that it can mask big change sets that sit around for a long time while many smaller change sets go out the door. Those stale change sets pose a big risk of merge conflicts.

If you don’t track deployment frequency, it may suffice to simply review your CI/CD practices and pull request sizes, especially if you merge every pull request to the main branch and launch it automatically (a.k.a. trunk-based development).

Another approach is to put tickets in an “awaiting deployment” status in your project management system. This takes extra work to track, but may be necessary if you use a feature flagging system where code can be technically deployed but turned off until you enable it with a feature flag.

Conclusion

Many CEOs don’t get involved in software development planning. This can work out fine if you have a strong engineering leader, but this often isn’t the case in small companies, which leads to major cost overruns.

The processes and metrics outlined here provide a simple way to gain visibility into project planning and greatly reduce the risk of project failures.

As CEO, you’re ultimately responsible for software development. It’s important to delegate work and not micromanage, but accountability is essential.

As a CEO myself, I follow all of these practices, and they have helped me avoid a lot of mistakes. Before I learned these things, I wasted orders of magnitude more time than the overhead of planning and collecting metrics, even with a small team.

My hope is that by sharing these previous failures, I can help others succeed and do more with less.

Want To Ship Features Faster? Fix All Your Bugs

Kevin Borders — Thu, 12 Sep 2024 20:40:32 GMT

For many, launching new features can mean the difference between survival and insolvency. Doing anything outside the critical path to revenue can put you out of business.

In this scenario, the natural instinct is to only fix bugs that are critical right now.

However, doing this continually is like taking out a payday loan each week to repay the last; the cost will quickly dwarf whatever benefit came from having money early.

The problem is that bugs very quickly get a lot more expensive to fix, easily doubling or more within weeks.

There is an alternative: Fix all bugs within one development iteration (i.e., sprint) or mark them as “won’t fix.”

I do this now. At first it felt wrong to work on lower-priority fixes before important new features. However, the benefits of bug-free development soon emerged:

Roadmap planning becomes easier with fewer interruptions and not having to allocate time for backlogged bugs.
You stop having to make daily decisions about when to fix each bug.
You don’t have to worry about customer emergencies caused by bugs, which reduces stress for everyone and improves the customer experience.
Ultimately, the rate of new bugs goes down as developers invest in better test automation.

Fixing all your bugs quickly is like making your bed in the morning. If you’re going to do it, it’s best to get it done right away so there is one less task hanging over your head.

In the rest of this article, we look at why bug cost escalates so quickly. We then share actionable strategies for classifying, prioritizing, and tracking bugs with an SLA to minimize their total cost and increase feature velocity.

Why is waiting to fix bugs so expensive?

Emergencies are costly

If you only fix bugs that are having an immediate impact, then every bug fix will be urgent. If a key customer is calling you about a bug, then they expect a quick resolution.

In the previous article on Calculating Your Interruption Tax, we explored why increasing levels of urgency amplify the cost of a task. The same bug may cost 4x as much to fix if someone is paged in the middle of the night, or 2x as much if it has to be done the same business day.

When you defer bug fixes until they become urgent, you make each one a lot more painful.

People lose knowledge over time

If you’re writing code, the best type of bug is one where your editor underlines it in red the second you finish typing. You know what you’re trying to accomplish, and you can quickly address the issue without interrupting your flow.

As time goes by, it becomes increasingly difficult to isolate and fix bugs. First, you lose your working memory and have to re-familiarize yourself with the surrounding code to understand the problem. Then, you forget more and more as weeks go by.

Once enough time elapses, it can be hard to even know who the best person is to fix a bug, or the person with the best knowledge may not be around any longer, which can turn a bug fix that would have taken ten minutes into a week-long affair.

Even bugs gain dependencies

For some bugs, the size of the fix might be the same whether you fix it now or later. For others, you may build a lot of other functionality on top of a flawed architecture that you have to rip out to fix the bug.

The problem is that it’s hard to tell which is which until you fix the bug.

If you wait a long time, then some bugs become really nasty. This creates a risk that if a bad bug becomes an emergency, you won’t be able to fix it fast enough and might lose a customer.

If you fix bugs right away, not only is each fix easier, but you eliminate this deeper business risk.

Customers suffer

While we’re mainly concerned with the impact on feature development, even minor bugs chip away at the customer experience.

It’s frustrating, but I’ve heard customers say that my software was “buggy” after encountering only a few issues I considered minor, like unusually long text overflowing off the screen.

The more small bugs you have, the lower the overall perception of quality, which can have a real but difficult-to-measure effect on revenue.

Though it’s hard to quantify the customer impact of having many minor bugs, the cost of zero bugs is easy to calculate: it’s zero.

Employees suffer

Having open bugs also places a burden on employees. At the very least it increases time spent receiving bug reports and triaging them to determine if they are duplicates.

Certain bugs also waste time for developers and others in the organization. Anything that causes alarm noise, internal system failures, manual workarounds, etc. takes a toll on people and ultimately slows down value-adding work.

I once interviewed an engineer whose 100-person company had an entire team of developers just working on scripts to patch over data corruption issues caused by unfixed bugs. Don’t let this be you!

Backlog management overhead is substantial

If you regularly defer bugs, then you have to spend time managing the bug backlog. The more issues in that backlog, the longer it takes.

You pay this cost each time you look at the backlog, and it gets even harder as bugs age and you lose context.

Not having a bug backlog avoids this.

Deciding whether to fix each bug takes time

If you fix (or decide not to fix) each bug immediately, then your decisions are easy.

If you don’t fix your bugs right away, then you have to decide when to fix each one. To do this, you have to analyze the impact and compare it to the value of new feature work.

This cross-comparison between bug and feature value adds another layer of planning complexity for product managers. Not doing it frees them up to spend more time solving problems for customers.

Not fixing bugs creates a moral hazard for developers and managers

In addition to the direct cost of deferring bugs, the indirect cost further compounds by reducing incentives to test code well in the first place.

If developers know they have to fix bugs right away, it’s easy for them to decide how much effort to invest in test automation and manual QA.

On the other hand, if the cost of fixing bugs won’t occur until some time in the future, it’s harder to decide what testing is worthwhile right now because you don’t have regular feedback about the cost.

This moral hazard falls just as much or more on management. If they are used to timelines not including bug fixes, they’re liable to pressure developers to sacrifice test automation, which has even less visibility than bugs.

What to consider when prioritizing bugs

So far we’ve talked about bugs in an abstract sense, but concrete guidelines are necessary for putting abstract ideas into practice.

In reality, bugs have different severity levels, and fixing something that’s hurting customers right now is more important than addressing a latent issue.

Also, even if you accept the general idea of fixing bugs within one sprint, there will be varying priority levels within that time window.

And, you still have to draw the line between which bugs you fix, and which ones you mark as won’t fix.

With this in mind, an effective strategy for prioritizing bugs should consider the following principles.

Urgency avoidance

The current customer impact of a bug is usually pretty clear. What people often don’t think about, however, is the risk of a bug becoming urgent in the future if circumstances change.

A useful thought exercise is to think about what would happen if the bug came up during an important customer demo. What would that customer think? Would the demo be totally derailed? Would it give the customer a bad impression? Or, would they not care even if they noticed?

This is similar to making other ethical decisions. A lawyer friend of mine always advises his clients by asking: what would this look like if it were on the front page of the New York Times?

By the urgency avoidance principle, you should prioritize latent bugs just below how you’d prioritize them if they were actively affecting important customers.

This minimizes the number of bugs that become urgent in the future or resurface after being marked won’t fix.

Would you ship new code with this bug?

To avoid the moral hazard problem, you should ask yourself whether new code having the same bug would fail your organization’s quality standards. Is it something you’d fix if you knew about it before launching a new feature?

If the answer is yes, then you should fix it. Otherwise, whatever quality standard you claim to have will deteriorate because you are applying a double standard to new and old code.

Consider internal costs

People usually account for customer impact when prioritizing bugs, but it’s important to look at impact on employees too.

Issues with internal systems like alarm noise, user tracking inaccuracy, or build system failures often take a back seat to customer problems because they don’t affect revenue.

However, internal problems can have a major impact on velocity, and development teams should be empowered to prioritize them accordingly.

Don’t mark a bug “won’t fix” unless you really won’t fix it

One way to follow the approach suggested here by the letter but not in spirit is to just mark bugs that aren’t having an immediate impact as won’t fix and wait for them to pop up again.

This defeats the purpose of a “fix all bugs” strategy. When deciding not to fix a bug, it’s important to think about whether it will ever be something you want to fix without a fundamental change to your standards or resources, and to use the won’t-fix option judiciously.

Bug priority levels

So far we’ve focused on the decision about whether to fix a bug or not, but it’s also important to have different priority levels for bugs that you do decide to fix.

When establishing priority level guidelines, the goal is to balance urgency avoidance (since urgently fixing a bug is more disruptive) with addressing important issues quickly.

It’s also important to have clear and simple guidelines for priority levels so that everyone agrees about what qualifies for each priority level and how to handle each one.

I’ve had success with the following levels. You might use different names, but the important thing is what the priority levels mean.

Critical - Someone will be paged and start working on the bug immediately. Example: site outage.
Urgent - Stop working on whatever else you’re doing and fix it right away, but during business hours. Example: one customer is locked out of their account.
High - Start on it next after your current task, but within one business day at the latest. Example: a small group of customers can’t use a minor function of the software.
Medium - Complete it within one development iteration (i.e., sprint). Example: everything else you plan to ever fix.
Low - This is the won’t-fix status and you may want to close bugs with this priority level. If you do leave them open, everyone should have the understanding that they will not be fixed unless the opportunity arises to address them easily as part of another change, or if there is a major change in quality standards, resources, or business strategy.

The key thing to notice here is that there’s no priority level between “fix it within one sprint” and “won’t fix.”

There is no “gee, this could really bite us but maybe we can get away with putting it off for a few months” priority level, which is in line with the strategy of fixing all bugs.

Measuring results with a bug SLA

It’s one thing to talk about fixing all bugs, but it’s another to put the strategy into practice.

Reality is never absolute, nor should it be.

Processes are designed to handle the common case well, but there are always exceptions where processes don’t make sense. People need latitude to bend the rules sometimes. This provides the benefit of the process without imposing excessive rigidity.

Things also aren’t going to change overnight if you’re adopting a new process like fixing all bugs. Instead, you want to see consistent progress toward a goal.

A helpful metric for tracking bug-fix performance is an SLA (service level agreement). With an SLA, you define what portion of the time (e.g., 95%) you plan to meet the SLA target (e.g., fix a bug within 3 days).

You may choose different SLAs, but here is what we use for minware:

Critical and Urgent - 95% of bugs fixed within 24 hours
High - 95% of bugs fixed within 3 days
Medium - 95% of bugs fixed within 2 weeks

Calculating bug SLA resolution (BSLAR) metric in a spreadsheet

Once you have established SLA levels, it’s time to start tracking your bug SLA resolution (BSLAR) metrics.

If you’re using Jira, you can export all of your bug issues, including their created and resolved timestamps.

Once you have this data, you can create a pivot table that shows the percentage of bugs for each priority level that met the SLA target (the bug SLA resolution metric) and compare this to your goal ratio.
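If you’d rather script this than build a pivot table, here is a minimal Python sketch of the same calculation. It assumes each exported bug has been parsed into a dict with hypothetical priority, created, and resolved keys, and it measures calendar time rather than business days. Bugs that are still open are skipped here, though you could just as reasonably count open bugs past their target as misses.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# SLA targets taken from the list above; Critical and Urgent share one target.
SLA_TARGETS = {
    "Critical": timedelta(hours=24),
    "Urgent": timedelta(hours=24),
    "High": timedelta(days=3),
    "Medium": timedelta(weeks=2),
}

def bug_sla_resolution(bugs):
    """Return {priority: fraction of resolved bugs fixed within the SLA target}."""
    met = defaultdict(int)
    total = defaultdict(int)
    for bug in bugs:
        target = SLA_TARGETS.get(bug["priority"])
        if target is None or bug["resolved"] is None:
            continue  # skip priorities without an SLA and bugs that are still open
        total[bug["priority"]] += 1
        if bug["resolved"] - bug["created"] <= target:
            met[bug["priority"]] += 1
    return {priority: met[priority] / total[priority] for priority in total}
```

You would then compare each value to your goal ratio (0.95 in our case).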

Another useful way to look at the data is to display percentiles, like median, 75%, 90%, 95%, 99%. This provides more detailed insight into how well you’re meeting the SLA and which actions you should take.

For example, if the 90th percentile is way under but the 95th misses your SLA, then you may want to focus on outliers that take the most time. On the other hand, if the median is really close to your SLA target, then perhaps there are broader issues like having too many bugs assigned to one person.
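As a rough illustration, the percentile view can be computed with a few lines of Python. This sketch uses the simple nearest-rank definition of a percentile (spreadsheet functions often interpolate instead, so numbers may differ slightly), and the function names are my own.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def resolution_percentiles(days_to_fix, pcts=(50, 75, 90, 95, 99)):
    """Return {percentile: days to fix} for one priority level (expects a non-empty list)."""
    return {p: percentile(days_to_fix, p) for p in pcts}
```

If the value at 95 is above your SLA target while the value at 90 is well below it, the misses are concentrated in a handful of outliers.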

Additionally, you may want to create a board showing open bugs with swimlanes for each priority level. Looking at this board regularly (e.g., during each stand up) helps stay on top of open bugs rather than waiting until they show up as SLA misses in the report.

Automating the bug SLA resolution (BSLAR) metric

Calculating bug SLA metrics from exported Jira data in a spreadsheet has some limitations and, of course, takes time.

In particular, the set of fields is limited so you can only see when the bug is created or resolved.

However, you may want to use different criteria for the start and end of the “open” time window. For example, bugs may sit in a post mortem status prior to being officially resolved. Or, you might want to start the clock when an issue is escalated to a high priority, not when it is filed.

It also might look bad if there’s a regression and a high-priority bug is reopened weeks after originally being fixed, so you may want to reset the timer for reopened issues.

We’ve created a bug SLA report in minware to automatically compute bug SLA resolution metrics. It looks at each time window when a bug was both open and set to a high priority, and treats those windows separately within a single bug to avoid inaccuracies.
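To make the idea of separate time windows concrete, here is a small Python sketch. It is not minware’s implementation, just one way to express the concept, and it assumes a hypothetical input: a time-ordered list of (timestamp, is_open, priority) snapshots reconstructed from a ticket’s change history.

```python
from datetime import datetime, timedelta

def sla_windows(events, end, tracked=("Critical", "Urgent", "High")):
    """Return the duration of each stretch when one bug was open at a tracked priority.

    events: time-ordered (timestamp, is_open, priority) snapshots for a single bug.
    end: end of the reporting period, used if the bug is still active.
    A reopen or re-escalation starts a new window, which resets the clock.
    """
    windows = []
    start = None
    for timestamp, is_open, priority in events:
        active = is_open and priority in tracked
        if active and start is None:
            start = timestamp
        elif not active and start is not None:
            windows.append(timestamp - start)
            start = None
    if start is not None:
        windows.append(end - start)  # still active at the end of the period
    return windows
```

Each window can then be compared to the SLA target on its own, so a bug that was fixed quickly and reopened weeks later doesn’t show up as a single weeks-long miss.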

Conclusion

It may seem crazy at first for a team with limited resources to fix all their bugs before working on new features.

I have done it and it can be painful at first, but now I wouldn’t work any other way. All the headaches I experienced balancing priorities and planning work with a bug backlog have just gone away, and sometimes it’s easy to forget what it was like before.

If you have these headaches too, I recommend giving bug-free development a try for a few months, even if only for new bugs – I’d love to hear how it goes.

Advanced Sprint Metrics Are for Everyone

Kevin Borders — Fri, 06 Sep 2024 15:25:28 GMT

I often see new teams “cowboy coding” – that is, developing software without any semblance of a planning process.

Sometimes people do this because they don’t know any better and lack guidance. Other times, however, it’s a deliberate choice – one which I believe is misguided.

Those who adopt sprint metrics often stop at out-of-the-box reports without thinking about their blind spots. In extreme cases, the metrics can look great while the team is utterly failing to complete meaningful work.

Advanced sprint metrics can help teams of any size do more with less by exposing common sources of inefficiency and driving continuous improvement during retrospectives.

In this article, we look at why that is, what advanced metrics real teams use, and how to measure them.

Bad arguments against planning

“Planning is for managers”

Some people believe that agile processes are only for larger, more mature companies. This is a common misunderstanding about the purpose of planning. Sprint processes aren’t there to benefit management; they are mainly there to help teams meet their goals.

By not planning, you might upset your manager (if you have one), but you’re really just hurting yourself.

“I’ve got too many bigger problems to worry about planning”

Another common argument I hear from new teams is that there are so many larger risks – like customers not wanting the product or not having a viable way to sell it – that it’s not worth the effort to worry about predictable delivery.

While it’s true that other things are more likely to kill a start-up, this is a misguided and fatalistic attitude. By this reasoning, would your business also skip paying taxes or following employment laws? I hope not, because you’re going to be in for a world of hurt if you do have initial success.

When there’s a risk of building the wrong thing, you should still try to do it as efficiently as possible.

“Planning takes away my autonomy”

One side-effect of good planning is that it can make implementation boring. Instead of solving problems during implementation, engineers are forced to think about and solve those problems up front.

Sometimes engineers aren’t happy about this and claim a loss of autonomy, especially if they are more senior.

If you feel this way, you need to shift your mindset. Good planning actually empowers senior engineers by letting them focus on hard problems during the planning stage and allowing them to delegate tasks to less-skilled engineers that otherwise would have been too difficult.

If you’re doing it right, good sprint processes should give everyone more autonomy by organizing work into tasks of the right difficulty level for each person.

“Up-front estimation is impossible, and doing it locks in bad plans”

Estimation is a perennial challenge in software engineering. The book Shape Up by the creators of Basecamp takes the position that estimation is impossible and argues that you should instead try to just get as much done as you can within a 2- or 6-week window.

The argument is that you know the least about what to build at the start of a project, so trying to plan and lock in the design up front will lead to worse results.

There is some truth to the premise, but I disagree with the conclusion that estimating and planning work in sprints is not helpful.

It all comes down to a matter of scale. Of course it’s bad to rigidly define the scope of a 6-week project and stick to it regardless of what happens. However, it is beneficial to define and estimate small tasks like “Add this button to the UI.”

Just because there is ambiguity in the project scope doesn’t mean you should have ambiguity in the task scope.

In a good sprint process with short sprints, you estimate well-defined tasks while still having flexibility at the project level by regularly adjusting the scope of tasks in each sprint iteration.

Aren’t sprint reports good enough?

If your team is new, you might think that advanced sprint metrics aren’t for you, and that reports in Jira (or another tool) are good enough to start.

While default reports will uncover many issues and are better than nothing, you can still get value out of advanced metrics even if you haven’t mastered the basics.

It all comes down to incremental visibility. Default reports will show you problems of type A, B, and C, while advanced metrics highlight issues of type D, E, and F. New teams will still want to address low-hanging fruit first, but some of that low-hanging fruit may be in those later D, E, and F categories, so looking at advanced metrics from the beginning can give you a head start.

Which advanced sprint metrics do real teams use?

Basic sprint reports will tell you things like the number of story points committed at the start, number completed, and how many were added or removed from the sprint. These are good high-level metrics that can uncover bigger problems with estimation and interruptions, but there’s a lot they don’t catch.

Working on minware, I've been fortunate to see what real teams of all shapes and sizes do to improve their efficiency.

One thing that’s surprised me is that some small companies (<10 developers) use almost all of these metrics and are extremely efficient even though calculating the metrics can be a lot of work. On the flip side, some teams in high-profile public companies are quite inefficient and only look at basic sprint reports without digging deeper.

The overall trend I see is that small teams who do these things well have a big competitive advantage and are running circles around the competition.

This section introduces advanced metrics in use by real teams, talks about why they are important, and describes how to measure them.

Bug load (BL)

Why it’s important

One sure-fire way to juice your sprint metrics is to skip testing and launch whatever you have at the end of the sprint. All the bugs you create will then show up as new tickets with more story points in the next sprint, so you can always meet your commitment!

I have seen cases where people threw such garbage over the finish line that they regularly deleted and rewrote 80% of their code each sprint.

You can mitigate this anti-pattern by tracking how much time you spend each sprint fixing old code vs. actually completing new work. The bug load should ideally be low and consistent.

How to measure it

This metric is straightforward to compute with an exported spreadsheet. If you use Jira, you can download a CSV of all ticket fields right from the issue search screen to get all of the tickets in a particular sprint. Once you have tickets along with their story point estimates in a spreadsheet, you can make a pivot table by issue type to see how much effort went into fixing existing code vs. completing new tasks.

Additionally, you should look at each of the non-bug tasks to make sure that they aren’t bugs in disguise. If a non-bug task is primarily related to fixing existing code, you can reclassify it as a bug in the spreadsheet before making the pivot table to get a true read on bug load.
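The same calculation is easy to script. Here is a minimal Python sketch, assuming each ticket is a dict with hypothetical issue_type and points keys, plus an optional bug_in_disguise flag that you set by hand after reviewing the non-bug tasks.

```python
def bug_load(tickets, bug_types=("Bug",)):
    """Return the fraction of a sprint's story points that went to fixing existing code."""
    total = sum(t["points"] for t in tickets)
    if total == 0:
        return 0.0
    bug_points = sum(
        t["points"]
        for t in tickets
        # Count real bugs plus tasks manually reclassified as bugs.
        if t["issue_type"] in bug_types or t.get("bug_in_disguise", False)
    )
    return bug_points / total
```

For example, a sprint with a 5-point story, a 3-point bug, and a 2-point “task” that was really a fix has a bug load of 50%.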

In minware, you can see the inverse of this metric – non-bug load – in the Bug Management report. This metric further looks at dev time spent on each of the bug tickets so that it isn’t biased by inaccurate story point estimates. It further looks at bug fix vs. find rate to guard against the scenario of shipping bad code and then not even fixing it in a later sprint.

Large task sizes (LTS)

Why it’s important

One limitation of basic sprint metrics is that they focus on total points completed without emphasizing the size of each task.

This enables a common anti-pattern where people represent all their work in a few large tickets. This defeats the purpose of sprint planning and effectively means that there is no plan.

I have literally seen people put all their work in a single 20-point ticket and mark that done at the end of each sprint. They always completed exactly 100% of their sprint commitment!

It’s important to keep an eye on this by looking at how much work goes into tickets spanning more than a few days.

How to measure it

Fortunately, large task sizes are easy to see using data from Jira. Starting with tickets exported from a sprint, you can create a pivot table for each person to show the number of tickets by story point estimate.

Once you have this information, start with large tasks that had a large estimate. You can find these by filtering the story point field by value (e.g., > 5). The goal is to make sure larger tasks are rare and, when they do occur, could not have been easily broken down.
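A sketch of that filter in Python, assuming each ticket is a dict with hypothetical key, assignee, and points fields. The threshold of 5 matches the example above, but the right cutoff depends on how your team uses story points.

```python
from collections import defaultdict

def large_tickets_by_person(tickets, threshold=5):
    """Return {assignee: [ticket keys]} for tickets estimated above the threshold."""
    large = defaultdict(list)
    for ticket in tickets:
        if ticket["points"] > threshold:
            large[ticket["assignee"]].append(ticket["key"])
    return dict(large)
```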

However, you will also want to make sure there aren’t any large tasks with small estimates. To identify these, you should look at cases where a person’s total completed points were unusually low, and then click into each of the tickets with small estimates to see when each one started and stopped to identify the underestimated task.

minware’s Work Batch Sizes report rolls up all this information for you. It uses the amount of dev time spent on each task rather than the point estimate so that it can identify large tasks regardless of their original estimate.

Under-the-radar (UTR) work

Why it’s important

Jira can’t know what it doesn’t see. If people do work that isn’t tied to a ticket, then it won’t show up in a sprint report.

The first question you should ask whenever you don’t meet your sprint commitment is: were people actually working on the sprint?

If unmonitored, under-the-radar work can have a major impact on sprint completion. In extreme cases, it can hide the fact that people are really cowboy coding and the sprint metrics are a lie.

How to measure it

This one can be tricky to do on your own because the work is inherently not in Jira (or whatever project management system you’re using).

There are a few sources of information outside of Jira that you can look at to manually compute how many hours a week are lost to under-the-radar work:

Calendars - Look at each person’s calendar on the team, adding up how much time they spend on meetings not related to sprint work, and also factoring in disruption time before/after and between closely-scheduled meetings where it’s difficult to do focused work.
Main branch commits - For each of the repositories where your team works, look at how many direct commits there are on the main branch (that is, commits not tied to a pull request) and whether those commits are related to a sprint task, or are related to other overhead.
Unlinked pull requests and branches - Most teams have a practice of including the ticket key (e.g., DEV-123) in the branch name or pull request title so that you can tell which ticket it’s associated with. If you review the recently active branches and pull requests in each of your repositories, you can look for those that are unticketed and see approximately how much time was spent on them.
Work on off-sprint tickets - Here you can look at both pull request activity and query Jira for recently updated tickets that are assigned to people on the sprint team, but not in a sprint. This will give you a sense of how much time went into non-sprint tickets.
Ask people - It can be hard to remember how much time you spent on under-the-radar work and people can sometimes have an incentive to be dishonest (e.g., if their manager said not to work on something that they think is important), but if you complete the previous steps and have a list of those activities, you’re more likely to get an accurate estimate from people of how much time they lost to under-the-radar work.
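The unlinked branch and pull request check is the easiest of these to automate. Here is a minimal Python sketch that flags names without a ticket key. The pattern assumes uppercase Jira-style keys like DEV-123, so teams that write keys in lowercase would need to adjust it.

```python
import re

# An uppercase project key of two or more characters, a dash, then digits (e.g., DEV-123).
TICKET_KEY = re.compile(r"(?<![A-Za-z0-9])[A-Z][A-Z0-9]+-\d+")

def unticketed(names):
    """Return the branch names or pull request titles that contain no ticket key."""
    return [name for name in names if not TICKET_KEY.search(name)]
```

You would feed this a list of recently active branch names or pull request titles, then review the ones it returns by hand.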

This can be a lot of effort to compile manually for every single person and sprint retrospective, so what you might want to do is sample it from time to time (e.g., every four sprints) to see how big of an issue it is, or if there are particular areas that would benefit from closer monitoring.

If you want to see these metrics in minware, you can look at the Code/Ticket Traceability report and the On-Sprint Work metric in the Sprint Best Practices report.

Work in progress (WIP)

Why it’s important

Whether you complete each task one-at-a-time or start everything on day one and finish it at the end, the metrics will look the same.

However, the reality is that working on a lot of things at once will cause them all to take longer, but it will be less obvious why. The individual task estimates may all be correct, but you can lose a substantial amount of time to context switching.

If the team makes a habit of high work-in-progress, its capacity may just look lower than its true potential, which you will never know from looking at sprint reports.

While normally reserved for a kanban process rather than sprints, looking at your average work-in-progress can identify lost capacity due to context switching.

How to measure it

This one is a bit trickier to calculate yourself in a spreadsheet. To do it, you have to manually add the time when each ticket started in a new column. Once you’ve done that, you can subtract the start from the resolved time to get the in-progress duration.

Finally, you can add up the in-progress durations for each person’s tickets in a pivot table and divide that by the length of the sprint to get the total average work in progress for each person and for the team.
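Here is the same arithmetic as a minimal Python sketch. It assumes each ticket is a (started, resolved) pair of datetimes, with resolved set to None for tickets that are still open. Clipping each ticket to the sprint window is my own addition so that work outside the sprint doesn’t inflate the number.

```python
from datetime import datetime, timedelta

def average_wip(tickets, sprint_start, sprint_end):
    """Return the average number of tickets in progress at once during the sprint."""
    sprint_length = sprint_end - sprint_start
    in_progress = timedelta()
    for started, resolved in tickets:
        # Clip to the sprint window; unresolved tickets run until the sprint ends.
        begin = max(started, sprint_start)
        finish = min(resolved or sprint_end, sprint_end)
        if finish > begin:
            in_progress += finish - begin
    return in_progress / sprint_length
```

Run it once with one person’s tickets and once with the whole team’s to get both views. A person with one ticket open for the entire sprint and a second open for half of it has an average work in progress of 1.5.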

You can find the work-in-progress metric in minware’s Kanban Essentials report.

Bug SLA resolution (BSLAR)

Why it’s important

Teams that support production software inevitably encounter high-priority bugs, which can disrupt sprint plans.

While sprint plans are important, they’re not as important as customers being able to use existing software.

Looking at whether bugs are resolved within a predetermined SLA based on their priority helps ensure that there is no incentive to neglect important but disruptive work for the sake of improving sprint metrics.

How to measure it

To compute bug SLA resolution, you can use a spreadsheet exported from Jira. You can get both “Created” and “Resolved” as exported fields.

Then, you can sort the tickets by priority level and compare the time from creation to close to a predetermined SLA duration to flag bugs that exceed the SLA.

minware’s Uptime Dashboard report shows time from creation to close for bugs of a particular priority level at the bottom. You can copy this for each priority level with a different SLA to get a count of the bugs above and below the threshold.

As its name implies, this report also shows your total uptime and downtime for highest-priority bugs, which is probably a better metric than bug SLA resolution for most teams because it accounts for the number of high-priority bugs too. This is just hard to calculate on your own without a system like minware, so you might prefer the simpler bug SLA resolution.

Capacity utilization rate (CUR)

Why it’s important

The easiest way to always meet your commitment is to undercommit.

Because sprint metrics often use story points, it’s not always obvious when the commitment is far below the team’s capacity.

The capacity utilization rate looks at how much time people actually spent on sprint tasks.

Note that having some slack time is good for a well-functioning team. There’s an entire book about it, which I highly recommend.

The purpose of this metric is not to ensure that everyone is working at 100% capacity and burning out, but instead for managers to ensure that they are looking at other metrics fairly between teams and that they themselves are not inadvertently incentivizing people to significantly undercommit by overfocusing on other metrics like sprint completion.

How to measure it

There are a few different ways to measure capacity utilization. If you’re already doing time logs, then you can look at the amount of time logged against tickets in each sprint and see if it stays at a consistent level or varies a lot, meaning that people are regularly finishing their sprint work in less time.

If you don’t have time logs, you can also look at a spreadsheet showing when each ticket starts and finishes, then add up the gaps to see if there are significant holes.
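Adding up the gaps by hand gets tedious when tickets overlap, so here is a rough Python sketch of the gap calculation. It assumes one person’s tickets are given as (started, finished) datetime pairs and counts calendar time, so nights and weekends show up as gaps unless you adjust for them.

```python
from datetime import datetime, timedelta

def uncovered_time(intervals, sprint_start, sprint_end):
    """Return total time in the sprint when none of a person's tickets were in progress."""
    gap = timedelta()
    cursor = sprint_start  # everything before the cursor is already accounted for
    for begin, finish in sorted(intervals):
        begin, finish = max(begin, sprint_start), min(finish, sprint_end)
        if finish <= begin:
            continue  # ticket falls entirely outside the sprint
        if begin > cursor:
            gap += begin - cursor
        cursor = max(cursor, finish)
    if sprint_end > cursor:
        gap += sprint_end - cursor
    return gap
```

Dividing the result by the sprint length gives the share of the sprint with no ticket in progress, which is more useful as a trend across sprints than as an absolute number.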

In minware, you can see the capacity utilization rate in the Focused Dev Time / Work Time chart in the Inspired by Uber's Dashboard report, which compares active dev effort to total number of developers with assigned work. Keep in mind here though that off-sprint work is also counted, so you’ll need to look at under-the-radar work as well to determine if capacity is going toward off-sprint work and underutilized in the sprint itself.

Rollover rate (RR)

Why it’s important

Sprint reports will tell you how many points were completed in the current sprint, but they don’t tell you how many of those tickets rolled over from previous sprints.

If a lot of tickets span multiple sprints, that can significantly degrade the value of sprint planning and not be obvious from looking at completed points alone.

I have seen pathological cases where tickets regularly spanned several “sprints” before finally wrapping up.

It’s important to keep an eye on not just throughput, but also total latency/lead time of individual tasks.

How to measure it

To calculate your rollover rate, you can start from the spreadsheet of tickets exported for each sprint. Then, the easiest way to compute rollover is by adding multiple tabs to the same spreadsheet and performing a lookup on each ticket to see if it exists in the tab for the previous sprint.

Alternatively, you can append the tickets for each sprint to a single sheet, and then add a column that sums the number of rows that have the same ticket from higher up in the sheet to arrive at a rollover count, which provides more detailed information than only looking at whether it rolled over from the last sprint.

In either case, you can create a pivot table that shows the total number of points that rolled over vs. the points for tickets that first appeared in the current sprint.
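If you prefer a script to spreadsheet lookups, here is a minimal Python sketch of the same split. It assumes you have loaded each sprint’s export into a dict mapping ticket key to story points, with the sprints listed oldest first.

```python
def rollover_points(sprints):
    """Return (rolled_over_points, new_points) for the most recent sprint.

    sprints: list of {ticket_key: points} dicts, one per sprint, oldest first.
    A ticket counts as rolled over if it appeared in any earlier sprint.
    """
    current = sprints[-1]
    seen_before = set().union(*sprints[:-1])
    rolled = sum(points for key, points in current.items() if key in seen_before)
    new = sum(points for key, points in current.items() if key not in seen_before)
    return rolled, new
```

The rollover rate is then the rolled-over points divided by the total of both numbers.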

minware’s Sprint Completion Trends report displays the point value of tickets that rolled over once, and separately shows points with >= 2 rollovers so that you can see the rollover rate trend over time.

Delivery adjustment rate (DAR)

Why it’s important

An important benefit of sprint planning is that it gives stakeholders an up-to-date picture of when tasks will finish.

However, for this to work, people need to actually remove tasks from the sprint when they add new ones. Otherwise, the sprint plan is meaningless to outsiders and you’re back in the situation of “you’ll know it’s done when it’s done” without the benefits of planning.

If you have a bad sprint and don’t get as much done as planned then that’s one thing, but the delivery adjustment rate looks at how much was added to the sprint without corresponding removals to make sure people are keeping the sprint plan up-to-date.

How to measure it

The delivery adjustment rate is relatively straightforward to compute. You can export a spreadsheet with all the tickets that were in the sprint at the end, sum up their point values, and compare this to the original sprint commitment to see how much the scope expanded net of removed tickets.
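As a worked example, here is that comparison in Python. Expressing the result as a fraction of the original commitment is my assumption about how to turn the difference into a rate.

```python
def delivery_adjustment_rate(final_points, committed_points):
    """Return net scope growth as a fraction of the original commitment.

    final_points: point values of all tickets in the sprint at its end.
    committed_points: total points committed at sprint start.
    """
    if committed_points <= 0:
        raise ValueError("committed_points must be positive")
    return (sum(final_points) - committed_points) / committed_points
```

A sprint that started with 16 points and ended with tickets worth 5, 3, 8, and 4 points has a delivery adjustment rate of 25%, meaning a quarter more work was added than was removed.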

The delivery adjustment rate is also easy to see in minware’s Sprint Completion Trends report because removed tickets are broken out in gray and you can see the remaining work compared to the original commitment in the dotted black line.

Scope adjustment rate (SAR)

Why it’s important

Another sprint anti-pattern I’ve seen is removing most of the original tickets and replacing them with new ones.

While some adjustment is good because you don’t want to lock yourself into bad plans if new information emerges, large amounts of scope adjustment can hide poor up-front planning.

By looking at how much effort goes into tickets added after the start of the sprint, you make sure that scope adjustments are truly the result of previously unknown information rather than lax planning.

How to measure it

This one requires a little more manual work to put together. One approach is to look at Jira’s burndown report and tally up all of the scope change rows that increment the scope to arrive at a total scope adjustment. You can then divide the scope adjustment total by the original commitment.
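Once you have tallied the scope change rows, the calculation itself is short. This Python sketch assumes you have copied the point change from each row into a list, with additions as positive numbers and removals as negative ones.

```python
def scope_adjustment_rate(scope_changes, committed_points):
    """Return points added after sprint start as a fraction of the original commitment."""
    if committed_points <= 0:
        raise ValueError("committed_points must be positive")
    added = sum(change for change in scope_changes if change > 0)  # ignore removals
    return added / committed_points
```

With changes of +3, -2, +5, and -1 against a 20-point commitment, the scope adjustment rate is 40%, even though the net change in scope is only 5 points.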

You can see the trend in scope adjustment rate in minware’s Sprint Completion Trends report as well, where completed issues that were added after sprint start are shown in purple along with an overall rate.

Summary

Advanced metrics might sound unnecessary if your team is really lean, but they can provide value for everyone building software.

Calculating the metrics described here isn’t that hard, and can help you get a lot more done with the team you already have.

I have seen small companies with fewer than ten developers compute nearly all of these metrics by hand and still come out way ahead after their time savings.

With systems like minware and others, it’s even easier to get started. Anyone building software today should have strong metrics in place to get the most out of their development processes.

Calculating Your Interruption Tax

Kevin Borders — Thu, 29 Aug 2024 13:07:01 GMT

We’ve all seen someone drowning in urgent requests. Whenever a new task comes in, they stop and work on it – hoping to finish before the next interruption.

Everyone knows they’re overloaded, so they keep following up on Slack: “Is it done yet? This is really important, it’s for XYZ!”

So much time is wasted communicating and context switching that barely any is left for the work itself, which further exacerbates the problem.

This overhead that would disappear if you just worked on the same tasks one-at-a-time in order is your interruption tax.

Why quantify interruptions?

The cost of interruptions can be quite substantial. Organizations that can’t afford to just throw more engineers at the problem need to keep a close eye on interruptions and carefully manage their overhead.

In the case where someone is completely underwater with interruptions, you probably don’t need metrics to know that you have a serious problem.

However, if your team is not blindingly dysfunctional, quantifying interruption cost can give you a sense of whether you’re performing at an A, B, or C level. It can also help you decide how much effort to put into further reducing interruptions by illuminating their impact.

As organizations scale to multiple layers of management, it can also be difficult to prevent even the obvious case of total interruption overload from happening in dark corners.

Metrics are critical for maintaining best practices at a scale when you can no longer depend on hearing about all the important issues around the water cooler.

How will we use an interruption cost metric?

Whenever you’re measuring something, it’s important to first consider how you’re going to use the data.

In our case, one way we’ll use the data is for deciding which improvements are worthwhile. For example, if some flaky deployment process is regularly causing P1 issues but will take a week to fix, we can compare that to the cost of the interruptions to see how long it would take for the investment to pay off.

For the purposes of calculating return on investment (ROI), the interruption costs don’t need to be super precise. When looking at ROI, most things you examine will be so positive or negative that the interruption costs being off by +/- 50% wouldn’t change the result.

Also keep in mind that the other side of the equation is estimated time for the improvement. We all know time estimates often blow up by 2-4x, so more precision than that on the cost side of the equation won’t have a significant impact.
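To make the payoff comparison concrete, here is a minimal sketch that brackets the answer instead of chasing precision. Every number in it is a hypothetical input, and the function name is mine, not part of any tool:

```python
def payback_weeks(fix_cost_days, interruption_days_per_week):
    """Weeks until time saved on interruptions equals time spent on the fix."""
    return fix_cost_days / interruption_days_per_week


# Hypothetical inputs: a fix estimated at 5 days (one week), and P1 issues
# from the flaky deployment costing the team about 1 day per week.
estimated_fix_days = 5
weekly_cost_days = 1.0

# Bracket the result: the fix estimate may blow up by as much as 4x, and
# the interruption cost may be off by +/- 50%.
best_case = payback_weeks(estimated_fix_days, weekly_cost_days * 1.5)
worst_case = payback_weeks(estimated_fix_days * 4, weekly_cost_days * 0.5)

print(f"Payback between {best_case:.1f} and {worst_case:.1f} weeks")
```

If even the worst case pays off within a window you are comfortable with, the decision is clear and more precise interruption costs would not change it.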

Another way we might use interruption costs is to identify specific teams or people who are struggling so that we can help them, or to see whether things are changing over time. In this case, the absolute cost of interruptions doesn’t matter as long as the relative difference between teams and time periods reflects a true difference in overhead.

Overall, if our metrics overestimate or underestimate the true cost of interruptions by 2x, that is okay, as long as it’s consistent.

How do you measure the cost of an interruption?

The impact of each interruption depends on a lot of things, including:

How deeply the person was working on another task
The size of the interrupting task
Whether the person starts on the interruption immediately or wraps up the previous task
Whether stakeholder communication is required
The impact on downstream deadlines
How much it impacts morale
Etc.

Because we are only looking for an approximation, however, we can group interrupting tasks into a few high-level buckets based on impact:

Highest - An after-hours page. This creates a lot of disruption. It can cause the person to come into work later the next day. Too many of these will lead to burnout and attrition.
High - Interrupting the current task. This is an interruption that causes someone to stop an in-progress task during work hours. The impact is high because it adds context switching overhead and therefore will delay the in-progress task by more than the time it takes to resolve the interruption. This can also cause stress and burnout.
Medium - Interrupting the current sprint plan. This is an interruption where the person wraps up their current task but works on the interruption next instead of their originally planned task. It is moderately disruptive because effort that went into planning the original task may be lost or have to be redone, which might impact deadlines and require additional stakeholder communication.

Next, we have to decide what heuristic to use for measuring overall interruption overhead. A good approximation is multiplying the size of the interrupting task by a constant factor based on severity.

Other things may influence the interruption cost like importance or size of the task it’s interrupting, but again we are just looking for a rough estimate, so a simple heuristic should suffice.

The question then becomes: what is the average overhead of an interruption at each severity level?

One way to answer this is to consider approximately how much time would be wasted if all tasks were at that severity level, and then use that fraction as the multiplier applied to the size of each interrupting ticket.

These are the numbers we came up with for minware (my company) through this thought exercise, though it may make sense to use different numbers in your organization:

Highest - 75% overhead. This means that if all tasks were assigned with emergency priority by pages at all hours of the night and day, you’d probably only get 25% of the work done compared to a regular schedule.
High - 50% overhead. This is the pathological case described earlier where you always stop what you’re doing when you get a new task, leading to 50% efficiency vs. a normal schedule.
Medium - 25% overhead. In this case, you work on tasks from start to finish without context switching, but plans are always changing and you never know what you’re going to work on next, so you have to spend 25% of your time re-planning and communicating.

Finally, you can list all of the issues completed recently by your team, group them by interruption severity, multiply the story points by the interruption overhead factor, and divide that total by all story points completed to arrive at your interruption tax rate.
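The calculation can be sketched in a few lines. The overhead factors are the ones listed above; the sample sprint and the "none" label for uninterrupted work are hypothetical:

```python
# Overhead factors from the thought exercise above; adjust for your organization.
OVERHEAD = {"highest": 0.75, "high": 0.50, "medium": 0.25, "none": 0.0}


def interruption_tax_rate(issues):
    """issues: list of (story_points, severity) for recently completed work."""
    total_points = sum(points for points, _ in issues)
    if total_points == 0:
        return 0.0
    overhead_points = sum(points * OVERHEAD[severity] for points, severity in issues)
    return overhead_points / total_points


# Hypothetical sprint: 40 points completed, 10 of them interruptions.
completed = [(30, "none"), (2, "highest"), (4, "high"), (4, "medium")]
rate = interruption_tax_rate(completed)
print(f"Interruption tax rate: {rate:.2%}")
```

In this example the weighted overhead is 1.5 + 2 + 1 = 4.5 points out of 40 completed.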

Automating the interruption tax metric

Manually labeling each interruption by severity may work for a proof of concept, but is probably too much effort on an ongoing basis.

One thing that can help is looking at sprint reports, which will typically list tickets that were added after the sprint started, and are therefore a medium-level interruption or higher.

You can also configure your paging system to label tickets that it creates so that you can easily identify those in a spreadsheet.

If you have a reliable scheme for setting ticket priorities, then you may also be able to use ticket priority as a proxy for interruption severity and compute your interruption tax rate using a spreadsheet pivot table.

The problem, of course, is that people who have a lot of interruptions also tend to be disorganized, and may not adhere to a prioritization process. Or, stakeholders might file tickets with overly escalated priorities just to accomplish regular tasks.

To address these issues, I created a minware report template that automatically identifies tickets that are medium- and high-level interruptions (at the sprint and ticket level, respectively). It doesn’t cover the highest level for now, but that is easy to add with labels from a paging system.

Sprint interruptions are defined as follows, which is pretty standard:

The ticket is added to a sprint after it starts.
That ticket is completed prior to the end of the same sprint.

Ticket interruptions are kind of tricky to define, but I was able to get it working with the following logic:

The ticket is added to a sprint after it starts.
No other tickets are completed in the same sprint by the same assignee until…
That ticket is completed prior to the end of the same sprint.

This effectively means that the person stopped what they were doing to complete the interrupting ticket.
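The two definitions above can be sketched as follows. This is my reading of the rules expressed in plain Python, not the report template itself; the Ticket fields and sample data are hypothetical, and I am assuming a ticket that matches both definitions is counted once at the higher level:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Ticket:
    key: str
    assignee: str
    added_to_sprint: datetime
    completed: Optional[datetime]  # None if not finished


def is_sprint_interruption(t, sprint_start, sprint_end):
    """Added after the sprint started and completed before it ended."""
    return (
        t.completed is not None
        and t.added_to_sprint > sprint_start
        and t.completed <= sprint_end
    )


def is_ticket_interruption(t, sprint_tickets, sprint_start, sprint_end):
    """A sprint interruption where the assignee finished nothing else in between."""
    if not is_sprint_interruption(t, sprint_start, sprint_end):
        return False
    return not any(
        other is not t
        and other.assignee == t.assignee
        and other.completed is not None
        and t.added_to_sprint < other.completed < t.completed
        for other in sprint_tickets
    )


def classify(t, sprint_tickets, sprint_start, sprint_end):
    """Return 'high', 'medium', or 'none' (assumption: higher level wins)."""
    if is_ticket_interruption(t, sprint_tickets, sprint_start, sprint_end):
        return "high"
    if is_sprint_interruption(t, sprint_start, sprint_end):
        return "medium"
    return "none"


start, end = datetime(2024, 8, 5), datetime(2024, 8, 19)
tickets = [
    Ticket("T-1", "ana", datetime(2024, 8, 5), datetime(2024, 8, 9)),   # planned
    Ticket("T-2", "ana", datetime(2024, 8, 6), datetime(2024, 8, 7)),   # done right away
    Ticket("T-3", "ana", datetime(2024, 8, 8), datetime(2024, 8, 12)),  # done after T-1
]
levels = {t.key: classify(t, tickets, start, end) for t in tickets}
print(levels)
```

T-2 is a ticket-level interruption because nothing else by the same assignee finished between its arrival and completion, while T-3 is only a sprint-level interruption because T-1 was wrapped up first.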

The report multiplies ticket interruption story points by 0.5 and sprint interruption points by 0.25, then divides the sum by total completed points to show a chart with the overall interruption tax rate by team over time.

As an added bonus, it shows how many tickets are sprint- or ticket-level interruptions by priority level so you can see whether the manually specified priority field aligns with how people are actually treating tickets.

What’s a healthy interruption tax rate?

The following chart is from the demo org in minware, which uses the data from my former company, Collage.com.

This org has four teams. Their rates over a three-month period are 30%, 25%, 9%, and 8%. Right now at minware, our rate for the past three months is 10%, which is similar to those lower teams.

I plan to do a broader survey in the future, but teams that I know first-hand have a healthy interruption workload are around 10%, which seems reasonable for a team that is both supporting production software and working on new projects.

On the other hand, teams that struggle with planning and interruptions may have a rate of 20-30%, which means that sprints are highly unreliable and the team loses a sizable amount of capacity to context switching.

How to manage your interruption rate

Calculating your interruption rate can be a helpful one-time exercise to see where you stand, but mature organizations should establish a consistent process for managing interruption overhead.

Just looking at the numbers alone is not enough, because there may be important additional context that influences the impact of interruptions. Or, people may be doing things in the ticketing system that make the numbers misleading, like removing the original tickets and replacing them with new ones before starting work.

To consistently manage interruptions as an organization, I recommend that each team conduct a regular review focused on interruptions and prioritization. The cadence of this review can decrease as the team matures, but it could range anywhere from every few weeks to quarterly. If you make this a standing agenda item in sprint retrospectives, it is also helpful to have a dedicated review every few months to make sure you devote sufficient attention to the topic and look at long-term trends.

As part of this interruption and prioritization review, you should calculate your recent interruption tax rate to facilitate discussion, and prepare a list of interrupting tickets. Here are some questions you may want to consider during this review, or add your own:

How have the interruption levels changed since last time?
Did we complete the action items we identified in the previous review, and are improvements to interruptions being prioritized appropriately?
Are the interruption levels for each ticket accurate, or do the heuristics need adjustment due to unexpected work patterns?
What happiness/frustration level does each person feel about interruptions?
Is the interruption load spread fairly across the team?
To what degree is the interruption load interfering with the team’s ability to meet SLA obligations for resolving high-priority issues?
How much are interruptions interfering with the ability to deliver on roadmap commitments?
Is the ticket priority field being set correctly for interrupting tasks?
Are teams working on high-priority tickets right away and stopping other work when they should?

It is also helpful to dig into specific interruptions to better understand their root cause. When looking at individual interruptions, here are questions to consider:

Was it correct to work on the ticket right away, or should we have waited because it was actually a lower priority? Was the priority field set to something overly high?
If the ticket was a stakeholder request, did the person who filed the ticket know that the work needed to be done earlier and neglect to file it until the last minute? If so, stakeholders may need guidance about the impact they are having on the team.
For stakeholder requests, if the ticket creator did not know about the need earlier, could they have known with better planning? In this case, the solution may be guiding stakeholders to improve their planning practices.
If the ticket was a bug, did the bug exist for a substantial amount of time before the ticket was filed? If so, what observability and system instrumentation improvements would have detected it earlier?
If the ticket was a bug, how did it escape each step of the QA process, including unit tests, integration tests, end-to-end tests, manual tests, and code review?
If the ticket was a bug, would it have been prevented with improvements to code or system architecture?

At the end of each review, you should assess and record specific action items to mitigate interruptions, and discuss those at the next review.

Finally, organizations that have more than a few teams will benefit from conducting an org-wide interruption and prioritization review on a less frequent cadence, such as quarterly. The goal of this review is to ensure that individual team reviews are effective and senior leadership is providing teams with the resources they need.

In this org-wide review, consider the following questions:

Are team-level reviews happening consistently?
Are those reviews thorough and do they result in meaningful action items?
What is the long-term interruption rate trend for each team?
Are any teams struggling to achieve a healthy rate?
Are teams empowered to reduce interruptions caused by poor planning or communication from outside groups like marketing and sales, or is top-level executive involvement needed?
Is the organization enabling teams to prioritize the important action items they identify in their reviews?
Are responsibilities like urgent bug fixes and support escalations distributed appropriately across teams, and do teams have the resources they need to strike a good balance between planned and interrupting work?
Do the organization’s current guidelines and processes related to communication, ticketing, and prioritization support healthy work patterns, or do they need adjustment?

Summary

Everyone knows interruptions are bad, but measuring them reliably to make better decisions can be difficult.

Here, we introduced a method you can use to measure interruptions. By combining this metric with systematic review processes, it’s possible for organizations of any size to keep interruptions under control and stay lean.

I’d love to hear about how you manage interruptions in your organization or any additional tips you have, so please comment below!

Scaling Software Engineering Discipline

Kevin Borders — Fri, 23 Aug 2024 21:09:39 GMT

Software engineering is wasteful and undisciplined compared to other industries like security, especially at scale. Google’s CEO can have confidence in their low-level firewall settings, but what about their day-to-day engineering practices?

The security industry has solved this best-practice control problem, but engineering hasn’t caught on yet. This creates a huge opportunity for companies that figure it out to run circles around their competitors.

Why is security more disciplined?

Well, security teams can’t afford otherwise. The cost of an incident is huge, and it only takes a few small mistakes to let in a hacker.

In software engineering, mistakes lead to death by a thousand cuts. When you lose market share, it’s impossible to trace it back to one source, which makes individual mistakes easier to hide and downplay.

Death by poor software engineering also takes longer than getting hacked. By the time it occurs, the underlying mistakes are often long forgotten.

CEOs don’t take engineering discipline as seriously as security or give it as much budget because it’s harder to see the business impact. However, the impact is there, and those who learn to identify it in their data will be at a major advantage.

How do security teams prevent mistakes?

Or, the better question is: how can Google’s CEO actually be confident in their low-level firewall settings?

The answer is hierarchical recurring controls.

This is a fancy way of saying that they have a process for changing firewall rules. Then, they audit that process to make sure it is running effectively, audit the audit, audit that audit, and so on. Eventually, there is a top-level leadership review of the entire security program.

In each of these audits, you review what processes are in place, what information you’re collecting for observability, how well everything is working, and how to get better. You do this separately for each area so you don’t miss anything, and you do it at a cadence where the audits are valuable and not repetitive.

Industries outside of security do something similar as well. Operations teams use structured controls for infrastructure reliability, and finance teams use them for financial reporting.

This is how you manage important details at scale.

What do software engineering teams do?

Well, it varies, but the most common pattern is a disjointed combination of sprint retrospectives, random “review” meetings, manager one-on-ones, and performance reviews.

Or, people just look at things randomly whenever they think to do so, which usually means waiting too long until something has become a major problem.

Then, once you realize you should keep an eye on something – let’s say a team’s meeting load for example – a common anti-pattern is trying to shoehorn it into an existing activity like sprint retrospectives.

You might have a good conversation about the topic the first time it comes up, but then gradually stop paying attention to it even though it’s on the agenda because talking about it every two weeks doesn’t make sense.

Because everything is ad hoc, no higher-level audits occur, and teams fall back on bad habits as the company grows, only maintaining good practices in areas that are the main focus of retrospectives and other recurring activities.

I’ve personally experienced these issues on several occasions while growing the team at Collage.com, and they caused me and others a lot of anxiety.

If you’re doing things this way, it’s impossible to scale efficiently and you end up dropping a lot of balls.

What should software engineering teams do?

Things really turned a corner at Collage.com when we introduced a centralized and hierarchical recurring control structure for engineering.

It doesn’t have to be anything fancy (ours was a spreadsheet), but you essentially need a list of review activities with frequencies, and another list with instances of those activities so you can see when they should happen and view the results.

Most importantly, each group of controls needs an audit activity where you review the controls to see whether they are effective and if any should be added or removed.

As the company grows, the set of controls will expand and you may introduce more levels of hierarchy, but every control should always roll up to CEO-level review.
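To illustrate the shape of such a structure, here is a minimal sketch of what the spreadsheet captures: a list of controls with frequencies, a link from each control to the audit that covers it, and a check that everything rolls up to the top-level review. The control names, frequencies, and function names are all hypothetical:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Control:
    name: str
    every_days: int                   # review frequency
    audited_by: Optional[str] = None  # parent audit control; None at the top


def rolls_up_to(controls, name, top):
    """Follow audited_by links and report whether `name` reaches the `top` review."""
    seen = set()
    while name is not None and name not in seen:
        if name == top:
            return True
        seen.add(name)
        name = controls[name].audited_by
    return False


def next_due(last_done, control):
    """Date of the next instance of a control, given when it last happened."""
    return last_done + timedelta(days=control.every_days)


TOP = "CEO engineering review"
controls = {c.name: c for c in [
    Control(TOP, 180),
    Control("Quality controls audit", 90, audited_by=TOP),
    Control("Code review effectiveness", 30, audited_by="Quality controls audit"),
    Control("Meeting load", 60),  # no audit covers this one
]}

orphans = [name for name in controls if not rolls_up_to(controls, name, TOP)]
due = next_due(date(2024, 8, 1), controls["Code review effectiveness"])
print("Controls with no path to the top-level review:", orphans)
print("Next code review effectiveness review due:", due)
```

The orphan check is the useful part: any control that no audit covers is one that can silently stop happening as the company grows.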

What controls should a software company have?

This varies a bit by industry, but here are some examples of things you might consider reviewing on a regular basis in different areas beyond security. This list isn’t meant to be comprehensive, and there are a lot more things you’ll need that aren’t listed here, but this should give you an idea of what a control structure looks like.

As part of each activity, it’s important to discuss what data you’re collecting to inform the review, and if you should make changes to collect more information in the future. Ideally, each activity should have a dashboard that presents all the relevant information in one place (my company minware does this for many of these activities). Or, someone should at least prepare a single report that gathers all information in one place in advance of the review.

Workflow

Meeting Load and Distribution – How much time do people spend in meetings and are the meetings scheduled well to provide time for focused work? Are the right team and one-on-one meetings happening with the right cadence?
Task Workflow by Status – What steps does each type of task follow, how long does each one take, and how often do tasks bounce back to a previous status? Are all the steps really essential, or can some be cut out?
Context Switching/Interruptions – How much work-in-progress is there on average, both at a ticket/project and pull request level? How many high-priority tasks come up that interrupt planned work? How often do people switch to another task because something is blocking them, like an answer to a question or a review?
Communication – Are there clear expectations for response times for different channels like email and Slack? Are people meeting those expectations and receiving responses quickly enough? Are people communicating excessively outside of business hours?

Agile and Ticketing

Work Tracking Hygiene – Is work being done in tickets, are those tickets added to a sprint, are code changes linked to the tickets, and do those tickets have an estimate before work starts?
Ticket Quality – Do tickets have appropriate acceptance criteria set? Do bugs have adequate reproduction steps?
Agile Retrospective – Are tickets following your agile processes as expected? Are estimates accurate? What issues are getting in the way of predictably meeting commitments?

Quality

Code Reviews – Are code reviews happening and are they effective, or are they rubber-stamp reviews? Is the review load balanced appropriately? Are people completing reviews quickly enough and meeting established SLAs?
Bug/Incident SLA – Are tickets appropriately labeled with priority? Do you have SLAs for resolving issues of different priorities? What is the mean time to restore for each one? Are bugs routed appropriately so that they are being fixed by the right person?
Post Mortems – Do major issues have effective post mortems that identify the root cause and have good action items? Are those action items actually being prioritized?
Automated Testing – Are there areas where bugs are popping up frequently that may be lacking test coverage? Do you have code coverage metrics in place for each of your repositories? How are code coverage metrics trending? Are there problems with test flakiness, or are certain tests taking more time than they should to keep up-to-date?

Performance

API Response Times – Which requests take the most time in aggregate, and which individual requests are taking too long (i.e., look at median, p95, p99)? What is driving slow response times and what opportunities are there for optimization?
Database Query Times – Which database queries take the most time in aggregate, and which individual queries are outliers? Do those queries have appropriate indexes and can they be further optimized? Can query results be cached that aren’t being cached right now?
Cache Hit Rates – What are the hit rates of caching layers like CDNs and memcache servers? Are they what you would expect? Are there opportunities for improvement?
Page Load Times – For websites, what is the page speed insights score in different areas and how is it trending over time? Is there automated testing for issues that can impact this score? Are optimization tasks ticketed and prioritized effectively?
Application Action Times – Do actions inside of applications that are not instantaneous have performance instrumentation so you can see how long they take? Do long actions have an appropriate waiting spinner or progress bar?
Server Load – Do you have appropriate load balancing in place? Is performance degrading under higher load, or is there excessive cost when load is lower? Are you prepared to handle unexpected surges?
Cost – Is there granular instrumentation about the drivers of infrastructure cost? What are the biggest costs in different areas, and what are the biggest opportunities to reduce costs? Are cost reduction tasks recorded properly in tickets and appropriately prioritized alongside other work? Do you have reservations and contracts in place to optimize cloud resource costs?
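For the API response time review in the list above, the two questions (most time in aggregate versus slowest individual requests) call for different views of the same data. A minimal sketch, assuming you can export request timings as (endpoint, duration in ms) pairs; the endpoints and numbers are made up:

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize(samples):
    """samples: list of (endpoint, duration_ms). Rows are sorted by total time."""
    by_endpoint = defaultdict(list)
    for endpoint, ms in samples:
        by_endpoint[endpoint].append(ms)

    rows = []
    for endpoint, durations in by_endpoint.items():
        # 99 cut points; index 94 is p95 and index 98 is p99.
        # Needs at least two samples per endpoint.
        cuts = quantiles(durations, n=100, method="inclusive")
        rows.append({
            "endpoint": endpoint,
            "total_ms": sum(durations),
            "median_ms": median(durations),
            "p95_ms": cuts[94],
            "p99_ms": cuts[98],
        })
    return sorted(rows, key=lambda r: r["total_ms"], reverse=True)


# Hypothetical data: many fast /search calls, a couple of slow /report calls.
samples = [("/search", ms) for ms in range(100, 200)]
samples += [("/report", 5000), ("/report", 7000)]
rows = summarize(samples)
for row in rows:
    print(row)
```

In this made-up data, /search dominates total time even though each call is fast, while /report has by far the worst percentiles. Both can be worth optimizing, for different reasons.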

Observability

Error Instrumentation – Are all meaningful errors being logged in a place where they are accessible? Is it easy to debug and reproduce errors given the contextual information that’s recorded in the logs? Is there alerting in place for new high-priority errors?
User Tracking – Do product managers, engineers, and customer service reps have good visibility into how specific people and relevant cohorts are using the software to support their work? Are events being tracked as expected? Is there testing in place for user tracking?
Alarm Coverage – What alarms are in place for the infrastructure? Are there gaps where certain types of failures would not trigger an alarm? Are there too many false positives causing noise? Are the notification policies for your paging system in alignment with the severity of different alarms?

Technical Debt / Architecture

Dependency Versions – What versions are you using of operating systems, languages, frameworks and other dependencies? Are you keeping up-to-date with upgrading versions before they reach end-of-life?
Tech Debt Backlog and Effort Allocation – Are tech debt items being recorded and prioritized in a backlog, and is the team devoting an appropriate amount of effort to fixing tech debt?
Architecture – Are you staying up-to-date with best practices for system architecture and frameworks? In what areas is the architecture struggling to meet current demands, and where is it likely to encounter scalability issues next?

DevOps

Development Environment – How long does it take to install the development environment and how frequently does setup fail? How often does it break? Does the development environment have sufficient parity with production, or are people frequently finding errors that don’t happen locally?
CI/CD Performance – How fast is each pipeline and how are the build times trending? Which jobs are the slowest and can they be optimized? Are the pipelines doing appropriate things at each level, or is too much running in pre-release pipelines? Are there any new technologies or best practices you should adopt to speed up build times?
CI/CD Reliability – How often do builds fail and what are the causes of build failures? What is the mean time to restore (MTTR) of build failures, and are people addressing the causes with sufficient priority?
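As one way to put numbers on the CI/CD reliability questions, here is a rough sketch that computes a failure rate and MTTR from a pipeline's build history. The definition of a restore (first failing build of a broken streak to the next passing build) is my assumption, and the build data is hypothetical:

```python
from datetime import datetime, timedelta


def build_failure_stats(builds):
    """builds: list of (finished_at, passed) for one pipeline, in time order.

    Returns (failure_rate, mean_time_to_restore). A restore is measured from
    the first failing build of a broken streak to the next passing build.
    """
    failures = sum(1 for _, passed in builds if not passed)
    failure_rate = failures / len(builds) if builds else 0.0

    restores = []
    broken_since = None
    for finished_at, passed in builds:
        if not passed and broken_since is None:
            broken_since = finished_at  # streak starts
        elif passed and broken_since is not None:
            restores.append(finished_at - broken_since)  # streak ends
            broken_since = None

    mttr = sum(restores, timedelta()) / len(restores) if restores else None
    return failure_rate, mttr


builds = [
    (datetime(2024, 8, 1, 9), True),
    (datetime(2024, 8, 1, 10), False),
    (datetime(2024, 8, 1, 11), False),
    (datetime(2024, 8, 1, 13), True),
    (datetime(2024, 8, 2, 9), False),
    (datetime(2024, 8, 2, 10), True),
]
failure_rate, mttr = build_failure_stats(builds)
print(f"Failure rate: {failure_rate:.0%}, MTTR: {mttr}")
```

Here the pipeline was broken twice, for three hours and for one hour, so the MTTR is two hours.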

Final Thoughts

Structuring reviews outside of security as control tasks was an aha moment for me as an engineering leader, and hopefully it is for you too (or you’re already doing it!).

My view has been limited to a few smaller organizations and my time at the Department of Defense, so I’d also be very curious to hear what controls other people have in place to manage software engineering at scale. If you’re willing to share, please comment below!