Calculating Your Interruption Tax
How to quantify and reduce the impact of interruptions on software development
We’ve all seen someone drowning in urgent requests. Whenever a new task comes in, they stop and work on it – hoping to finish before the next interruption.
Everyone knows this person is overloaded, so stakeholders keep following up on Slack: “Is it done yet? This is really important, it’s for XYZ!”
So much time is wasted communicating and context switching that barely any is left for the work itself, which further exacerbates the problem.
Your interruption tax is the overhead that would disappear if you simply worked on the same tasks one at a time, in order.
Why quantify interruptions?
The cost of interruptions can be quite substantial. Organizations that can’t afford to just throw more engineers at the problem need to keep a close eye on interruptions and carefully manage their overhead.
In the case where someone is completely underwater with interruptions, you probably don’t need metrics to know that you have a serious problem.
However, if your team is not blindingly dysfunctional, quantifying interruption cost can give you a sense of whether you’re performing at an A, B, or C level. It can also help you decide how much effort to put into further reducing interruptions by illuminating their impact.
As organizations scale to multiple layers of management, it can also be difficult to prevent even the obvious case of total interruption overload from happening in dark corners.
Metrics are critical for maintaining best practices at a scale where you can no longer depend on hearing about all the important issues around the water cooler.
How will we use an interruption cost metric?
Whenever you’re measuring something, it’s important to first consider how you’re going to use the data.
In our case, one way we’ll use the data is for deciding which improvements are worthwhile. For example, if some flaky deployment process is regularly causing P1 issues but will take a week to fix, we can compare that to the cost of the interruptions to see how long it would take for the investment to pay off.
For the purposes of calculating return on investment (ROI), the interruption costs don’t need to be super precise. When looking at ROI, most things you examine will be so positive or negative that the interruption costs being off by +/- 50% wouldn’t change the result.
Also keep in mind that the other side of the equation is estimated time for the improvement. We all know time estimates often blow up by 2-4x, so more precision than that on the cost side of the equation won’t have a significant impact.
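To make the payback comparison concrete, here is a minimal sketch. All numbers and the `payback_weeks` helper are hypothetical, chosen only to illustrate the arithmetic: it divides the estimated cost of a fix by the weekly interruption cost, then checks how far the answer moves when the interruption cost is off by 50% and the fix takes 4x longer than estimated.

```python
def payback_weeks(fix_cost_days: float, interruption_days_per_week: float) -> float:
    """Weeks until the time saved equals the time invested in the fix."""
    if interruption_days_per_week <= 0:
        raise ValueError("interruption cost must be positive")
    return fix_cost_days / interruption_days_per_week


# Hypothetical inputs: a 5-day fix for a flaky deployment process that is
# measured as costing about 1 engineer-day per week in interruptions.
estimate = payback_weeks(fix_cost_days=5, interruption_days_per_week=1.0)

# Sensitivity check on both sides of the equation.
best = payback_weeks(5, 1.5)       # true interruption cost is 50% higher than measured
worst = payback_weeks(5 * 4, 0.5)  # fix takes 4x longer, true cost is 50% lower

print(f"estimate: {estimate:.1f} weeks, range: {best:.1f} to {worst:.1f} weeks")
```

With these made-up inputs, the fix pays for itself within a year even in the worst case, so a more precise interruption cost would not change the decision.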
Another way we might use interruption costs is to identify specific teams or people who are struggling so that we can help them, or to see whether things are changing over time. In this case, the absolute cost of interruptions doesn’t matter as long as the relative difference between teams and time periods reflects a true difference in overhead.
Overall, if our metrics overestimate or underestimate the true cost of interruptions by 2x, that is okay, as long as it’s consistent.
How do you measure the cost of an interruption?
The impact of each interruption depends on a lot of things, including:
How deeply the person was working on another task
The size of the interrupting task
Whether the person starts on the interruption immediately or wraps up the previous task
Whether stakeholder communication is required
The impact on downstream deadlines
How much it impacts morale
Etc.
Because we are only looking for an approximation, however, we can group interrupting tasks into a few high-level buckets based on impact:
Highest - An after-hours page. This creates a lot of disruption. It can cause the person to come into work later the next day. Too many of these will lead to burnout and attrition.
High - Interrupting the current task. This is an interruption that causes someone to stop an in-progress task during work hours. The impact is high because it adds context switching overhead and therefore will delay the in-progress task by more than the time it takes to resolve the interruption. This can also cause stress and burnout.
Medium - Interrupting the current sprint plan. This is an interruption where the person wraps up their current task but works on the interruption next instead of their originally planned task. It is moderately disruptive because effort that went into planning the original task may be lost or have to be redone, which might impact deadlines and require additional stakeholder communication.
Next, we have to decide what heuristic to use for measuring overall interruption overhead. A good approximation is multiplying the size of the interrupting task by a constant factor based on severity.
Other things may influence the interruption cost like importance or size of the task it’s interrupting, but again we are just looking for a rough estimate, so a simple heuristic should suffice.
The question then becomes: what is the average overhead of an interruption at each severity level?
One way to answer this is to consider approximately how much time would be wasted if all tasks were at that severity level, and then use that fraction as the multiplier applied to the size of each interrupting ticket.
These are the numbers we came up with for minware (my company) through this thought exercise, though it may make sense to use different numbers in your organization:
Highest - 75% overhead. This means that if all tasks were assigned with emergency priority by pages at all hours of the night and day, you’d probably only get 25% of the work done compared to a regular schedule.
High - 50% overhead. This is the pathological case described earlier where you always stop what you’re doing when you get a new task, leading to 50% efficiency vs. a normal schedule.
Medium - 25% overhead. In this case, you work on tasks from start to finish without context switching, but plans are always changing and you never know what you’re going to work on next, so you have to spend 25% of your time re-planning and communicating.
Finally, you can list all of the issues completed recently by your team, group them by interruption severity, multiply the story points by the interruption overhead factor, and divide that total by all story points completed to arrive at your interruption tax rate.
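The calculation above can be sketched in a few lines of Python. The overhead factors are the ones listed earlier. The ticket data is made up, and the "none" label for planned work is an assumption added here so that every ticket has a severity.

```python
# Overhead factors from the text; adjust them for your organization.
# "none" (planned, uninterrupted work) is an assumed label for this sketch.
OVERHEAD = {"highest": 0.75, "high": 0.50, "medium": 0.25, "none": 0.0}


def interruption_tax_rate(tickets):
    """tickets: iterable of (story_points, severity) pairs for completed issues."""
    tickets = list(tickets)
    total_points = sum(points for points, _ in tickets)
    if total_points == 0:
        return 0.0
    taxed_points = sum(points * OVERHEAD[severity] for points, severity in tickets)
    return taxed_points / total_points


# Hypothetical completed tickets for one team.
completed = [
    (5, "none"), (8, "none"), (13, "none"), (5, "none"),
    (4, "medium"), (3, "high"), (2, "highest"),
]

rate = interruption_tax_rate(completed)
print(f"interruption tax rate: {rate:.0%}")
```

Here 9 of 40 points were interruptions, contributing 4 points of overhead, for a tax rate of 10%.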
Automating the interruption tax metric
Manually labeling each interruption by severity may work for a proof of concept, but is probably too much effort on an ongoing basis.
One thing that can help is looking at sprint reports, which will typically list tickets that were added after the sprint started and are therefore medium-severity interruptions or higher.
You can also configure your paging system to label tickets that it creates so that you can easily identify those in a spreadsheet.
If you have a reliable scheme for setting ticket priorities, then you may also be able to use ticket priority as a proxy for interruption severity and compute your interruption tax rate using a spreadsheet pivot table.
The problem, of course, is that people who have a lot of interruptions also tend to be disorganized, and may not adhere to a prioritization process. Or, stakeholders might file tickets with overly escalated priorities just to accomplish regular tasks.
To address these issues, I created a minware report template that automatically identifies tickets that are medium- and high-level interruptions (at the sprint and ticket level, respectively). It doesn’t cover the highest level for now, but that is easy to add with labels from a paging system.
Sprint interruptions are defined as follows, which is pretty standard:
The ticket is added to a sprint after it starts.
That ticket is completed prior to the end of the same sprint.
Ticket interruptions are kind of tricky to define, but I was able to get it working with the following logic:
The ticket is added to a sprint after it starts.
No other tickets are completed in the same sprint by the same assignee until…
That ticket is completed prior to the end of the same sprint.
This effectively means that the person stopped what they were doing to complete the interrupting ticket.
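As an illustration only (this is not the report template's actual implementation), the two definitions could be expressed roughly as follows. The `Ticket` fields and the `classify` function are hypothetical. The sketch also assumes that "until" means no other ticket by the same assignee was completed between the time the interrupting ticket was added and the time it was completed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Ticket:
    key: str
    assignee: str
    added_at: datetime                # when the ticket entered the sprint
    completed_at: Optional[datetime]  # None if still open


def classify(ticket, sprint_start, sprint_end, sprint_tickets):
    """Return "high" (ticket interruption), "medium" (sprint interruption), or "none"."""
    done = ticket.completed_at
    added_mid_sprint = ticket.added_at > sprint_start
    if not (added_mid_sprint and done is not None and done < sprint_end):
        return "none"
    # Sprint interruption so far. Downgrade from ticket-level if the assignee
    # finished something else between this ticket's arrival and its completion.
    for other in sprint_tickets:
        if other is ticket or other.assignee != ticket.assignee:
            continue
        if other.completed_at is not None and ticket.added_at < other.completed_at < done:
            return "medium"
    return "high"


# Hypothetical two-week sprint.
start, end = datetime(2024, 1, 1), datetime(2024, 1, 15)
tickets = [
    Ticket("A", "ana", datetime(2024, 1, 1), datetime(2024, 1, 5)),   # planned
    Ticket("B", "ana", datetime(2024, 1, 3), datetime(2024, 1, 4)),   # dropped everything
    Ticket("C", "ana", datetime(2024, 1, 2), datetime(2024, 1, 10)),  # finished A and B first
    Ticket("D", "ben", datetime(2024, 1, 1), None),                   # planned, still open
]
levels = {t.key: classify(t, start, end, tickets) for t in tickets}
print(levels)
```

In this example, B is a ticket-level interruption because nothing else by the same assignee was finished while it was open, whereas C only counts as a sprint-level interruption.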
The report multiplies ticket-interruption story points by 0.5 and sprint-interruption story points by 0.25, then divides the sum by total completed points to chart the overall interruption tax rate by team over time.
As an added bonus, it shows how many tickets are sprint- or ticket-level interruptions by priority level so you can see whether the manually specified priority field aligns with how people are actually treating tickets.
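A comparison like this amounts to a small cross-tabulation. The sketch below uses made-up data and a hypothetical `crosstab` helper. It counts tickets for each value of the priority field, broken down by the interruption level detected from how the ticket was actually worked.

```python
from collections import Counter, defaultdict

# Hypothetical (priority, detected_level) pairs for completed tickets.
rows = [
    ("P1", "high"), ("P1", "high"), ("P1", "none"),
    ("P2", "medium"), ("P3", "high"), ("P3", "none"),
]


def crosstab(pairs):
    """Count tickets for each priority, broken down by detected level."""
    table = defaultdict(Counter)
    for priority, level in pairs:
        table[priority][level] += 1
    return table


table = crosstab(rows)
for priority in sorted(table):
    print(priority, dict(table[priority]))
```

In this made-up data, one P3 ticket was handled as a drop-everything interruption and one P1 ticket was not treated as an interruption at all, which are the kinds of mismatches worth raising in a review.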
What’s a healthy interruption tax rate?
The following chart is from the demo org in minware, which uses the data from my former company, Collage.com.
This org has four teams. Their rates over a three-month period are 30%, 25%, 9%, and 8%. Right now at minware, our rate for the past three months is 10%, which is similar to those lower teams.
I plan to do a broader survey in the future, but teams that I know first-hand have a healthy interruption workload are around 10%, which seems reasonable for a team that is both supporting production software and working on new projects.
On the other hand, teams that struggle with planning and interruptions may have a rate of 20-30%, which means that sprints are highly unreliable and the team loses a sizable amount of capacity to context switching.
How to manage your interruption rate
Calculating your interruption rate can be a helpful one-time exercise to see where you stand, but mature organizations should establish a consistent process for managing interruption overhead.
Looking at the numbers alone is not enough, because there may be important additional context that influences the impact of interruptions. People may also be doing things in the ticketing system that make the numbers misleading, such as removing the original tickets and replacing them with new ones before starting work.
To consistently manage interruptions as an organization, I recommend that each team conduct a regular review focused on interruptions and prioritization. The cadence of this review can decrease as the team matures, but it could range anywhere from every few weeks to quarterly. If you make this a standing agenda item in sprint retrospectives, it is also helpful to have a dedicated review every few months to make sure you devote sufficient attention to the topic and look at long-term trends.
As part of this interruption and prioritization review, you should calculate your recent interruption tax rate to facilitate discussion, and prepare a list of interrupting tickets. Here are some questions you may want to consider during this review, or add your own:
How have the interruption levels changed since last time?
Did we complete the action items we identified in the previous review, and are improvements to interruptions being prioritized appropriately?
Are the interruption levels for each ticket accurate, or do the heuristics need adjustment due to unexpected work patterns?
What happiness/frustration level does each person feel about interruptions?
Is the interruption load spread fairly across the team?
To what degree is the interruption load interfering with the team’s ability to meet SLA obligations for resolving high-priority issues?
How much are interruptions interfering with the ability to deliver on roadmap commitments?
Is the ticket priority field being set correctly for interrupting tasks?
Are teams working on high-priority tickets right away and stopping other work when they should?
It is also helpful to dig into specific interruptions to better understand their root cause. When looking at individual interruptions, here are questions to consider:
Was it correct to work on the ticket right away, or should we have waited because it was actually a lower priority? Was the priority field set to something overly high?
If the ticket was a stakeholder request, did the person who filed the ticket know that the work needed to be done earlier and neglect to file it until the last minute? If so, stakeholders may need guidance about the impact they are having on the team.
For stakeholder requests, if the ticket creator did not know about the need earlier, could they have known with better planning? In this case, the solution may be guiding stakeholders to improve their planning practices.
If the ticket was a bug, did the bug exist for a substantial amount of time before the ticket was filed? If so, what observability and system instrumentation improvements would have detected it earlier?
If the ticket was a bug, how did it escape each step of the QA process, including unit tests, integration tests, end-to-end tests, manual tests, and code review?
If the ticket was a bug, would it have been prevented with improvements to code or system architecture?
At the end of each review, you should assess and record specific action items to mitigate interruptions, and discuss those at the next review.
Finally, organizations that have more than a few teams will benefit from conducting an org-wide interruption and prioritization review on a less frequent cadence, such as quarterly. The goal of this review is to ensure that individual team reviews are effective and senior leadership is providing teams with the resources they need.
In this org-wide review, consider the following questions:
Are team-level reviews happening consistently?
Are those reviews thorough and do they result in meaningful action items?
What is the long-term interruption rate trend for each team?
Are any teams struggling to achieve a healthy rate?
Are teams empowered to reduce interruptions caused by poor planning or communication from outside groups like marketing and sales, or is top-level executive involvement needed?
Is the organization enabling teams to prioritize the important action items they identify in their reviews?
Are responsibilities like urgent bug fixes and support escalations distributed appropriately across teams, and do teams have the resources they need to strike a good balance between planned and interrupting work?
Do the organization’s current guidelines and processes related to communication, ticketing, and prioritization support healthy work patterns, or do they need adjustment?
Summary
Everyone knows interruptions are bad, but measuring them reliably to make better decisions can be difficult.
Here, we introduced a method you can use to measure interruptions. By combining this metric with systematic review processes, it’s possible for organizations of any size to keep interruptions under control and stay lean.
I’d love to hear about how you manage interruptions in your organization or any additional tips you have, so please comment below!
This looks great for measuring some of these things that have been very difficult to measure. Too often, the best and most senior engineers end up absorbing the interruption taxes and burn out too quickly. You mentioned some of the others that are even more difficult to measure, such as I.M.s and shoulder taps. Some things I've tried for these are blocking out focus time each day and encouraging juniors to try to figure things out for themselves for a limited timeframe (at a certain point there are diminishing returns if someone is still blocked after 30 minutes or an hour).