Software engineering is wasteful and undisciplined compared to other industries like security, especially at scale. Google’s CEO can have confidence in their low-level firewall settings, but what about their day-to-day engineering practices?
The security industry has solved this best-practice control problem, but engineering hasn’t caught on yet. This creates a huge opportunity for companies that figure it out to run circles around their competitors.
Why is security more disciplined?
Well, security teams can’t afford to be otherwise. The cost of an incident is huge, and it only takes a few small mistakes to let in a hacker.
In software engineering, mistakes lead to death by a thousand cuts. When you lose market share, it’s impossible to trace it back to one source, which makes individual mistakes easier to hide and downplay.
Death by poor software engineering also takes longer than getting hacked. By the time it occurs, the underlying mistakes are often long forgotten.
CEOs don’t take engineering discipline as seriously as security or give it as much budget because it’s harder to see the business impact. However, the impact is there, and those who learn to identify it in their data will be at a major advantage.
How do security teams prevent mistakes?
Or, the better question is: how can Google’s CEO actually be confident in their low-level firewall settings?
The answer is hierarchical recurring controls.
This is a fancy way of saying that they have a process for changing firewall rules. Then, they audit that process to make sure it is running effectively, audit the audit, audit that audit, and so on. Eventually, there is a top-level leadership review of the entire security program.
In each of these audits, you review what processes are in place, what information you’re collecting for observability, how well everything is working, and how to get better. You do this separately for each area so you don’t miss anything, and you do it at a cadence where the audits are valuable and not repetitive.
Industries outside of security do something similar as well. Operations teams use structured controls for infrastructure reliability, and finance teams use them for financial reporting.
This is how you manage important details at scale.
What do software engineering teams do?
Well, it varies, but the most common pattern is a disjointed combination of sprint retrospectives, random “review” meetings, manager one-on-ones, and performance reviews.
Or, people just look at things whenever they happen to think of it, which usually means waiting until something has already become a major problem.
Then, once you realize you should keep an eye on something – let’s say a team’s meeting load for example – a common anti-pattern is trying to shoehorn it into an existing activity like sprint retrospectives.
You might have a good conversation about the topic the first time it comes up, but then gradually stop paying attention to it, even though it’s on the agenda, because talking about it every two weeks doesn’t make sense.
Because everything is ad hoc, no higher-level audits occur. As the company grows, teams fall back on bad habits, maintaining good practices only in the areas that are the main focus of retrospectives and other recurring activities.
I’ve personally experienced these issues on several occasions while growing the team at Collage.com, and they caused me and others a lot of anxiety.
If you’re doing things this way, it’s impossible to scale efficiently and you end up dropping a lot of balls.
What should software engineering teams do?
Things really turned a corner at Collage.com when we introduced a centralized and hierarchical recurring control structure for engineering.
It doesn’t have to be anything fancy (ours was a spreadsheet), but you essentially need a list of review activities with frequencies, and another list with instances of those activities so you can see when they should happen and view the results.
Most importantly, each group of controls needs an audit activity where you review the controls to see whether they are effective and if any should be added or removed.
As the company grows, the set of controls will expand and you may introduce more levels of hierarchy, but every control should always roll up to CEO-level review.
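To make that concrete, here is a minimal sketch of what such a control register could look like as structured data. The control names, frequencies, and owners are hypothetical placeholders; in practice a spreadsheet with the same columns works just as well.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Control:
    """A recurring review activity, e.g. a quarterly meeting-load review."""
    name: str
    area: str            # e.g. "Workflow", "Quality", "Security"
    frequency_days: int  # how often the review should happen
    owner: str           # who runs it; every control rolls up to a higher-level audit

@dataclass
class ControlInstance:
    """One scheduled occurrence of a control, along with its outcome."""
    control: Control
    due: date
    completed_on: date | None = None
    findings: str = ""

def overdue(instances: list[ControlInstance], today: date) -> list[ControlInstance]:
    """Controls that should have run by now but have no recorded result."""
    return [i for i in instances if i.completed_on is None and i.due <= today]

# Hypothetical example: a team-level control plus the audit that reviews it.
meeting_review = Control("Meeting load review", "Workflow", frequency_days=90, owner="Eng manager")
workflow_audit = Control("Workflow controls audit", "Workflow", frequency_days=365, owner="VP Engineering")

instances = [
    ControlInstance(meeting_review, due=date(2023, 1, 15)),
    ControlInstance(workflow_audit, due=date(2023, 6, 1)),
]
print([i.control.name for i in overdue(instances, today=date(2023, 2, 1))])
```

The important part is not the tooling but the fact that every activity has a defined cadence, an owner, a recorded result, and a higher-level audit it rolls up to.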
What controls should a software company have?
This varies a bit by industry, but here are some examples of things you might consider reviewing on a regular basis in different areas beyond security. This list isn’t meant to be comprehensive, but it should give you an idea of what a control structure looks like.
As part of each activity, it’s important to discuss what data you’re collecting to inform the review, and whether you should collect more information in the future. Ideally, each activity should have a dashboard that presents all the relevant information in one place (my company minware does this for many of these activities). Or, at a minimum, someone should prepare a single report that gathers all of the information in one place in advance of the review.
Workflow
Meeting Load and Distribution – How much time do people spend in meetings and are the meetings scheduled well to provide time for focused work? Are the right team and one-on-one meetings happening with the right cadence?
Task Workflow by Status – What steps does each type of task follow, how long does each one take, and how often do tasks bounce back to a previous status? Are all the steps really essential, or can some be cut out? (See the sketch after this list for one way to measure this.)
Context Switching/Interruptions – How much work-in-progress is there on average, both at a ticket/project and pull request level? How many high-priority tasks come up that interrupt planned work? How often do people switch to another task because something is blocking them, like an answer to a question or a review?
Communication – Are there clear expectations for response times for different channels like email and Slack? Are people meeting those expectations and receiving responses quickly enough? Are people communicating excessively outside of business hours?
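To give one concrete example from the list above, here is a rough sketch of how time in each status and bounce-backs could be measured from ticket status-change events. The event format and the sample data are made up; real numbers would come from your issue tracker’s change history (or a tool that already aggregates it).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical export of ticket status-change events: (ticket_id, new_status, timestamp).
events = [
    ("ENG-1", "In Progress", datetime(2023, 3, 1, 9)),
    ("ENG-1", "In Review",   datetime(2023, 3, 2, 15)),
    ("ENG-1", "Done",        datetime(2023, 3, 3, 11)),
    ("ENG-2", "In Progress", datetime(2023, 3, 1, 10)),
    ("ENG-2", "In Review",   datetime(2023, 3, 6, 10)),
    ("ENG-2", "In Progress", datetime(2023, 3, 7, 10)),  # bounced back from review
    ("ENG-2", "Done",        datetime(2023, 3, 8, 10)),
]

# Group events by ticket so each ticket's transitions can be walked in order.
by_ticket: dict[str, list[tuple[str, datetime]]] = defaultdict(list)
for ticket, status, ts in events:
    by_ticket[ticket].append((status, ts))

hours_in_status: dict[str, float] = defaultdict(float)
bounce_backs = 0
for transitions in by_ticket.values():
    transitions.sort(key=lambda t: t[1])
    visited = {transitions[0][0]}
    for (status, start), (next_status, end) in zip(transitions, transitions[1:]):
        hours_in_status[status] += (end - start).total_seconds() / 3600
        if next_status in visited:  # the ticket returned to a status it had already left
            bounce_backs += 1
        visited.add(next_status)

for status, hours in sorted(hours_in_status.items()):
    print(f"{status}: {hours:.0f} ticket-hours")
print("bounce-backs:", bounce_backs)
```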
Agile and Ticketing
Work Tracking Hygiene – Is work being done in tickets, are those tickets added to a sprint, are code changes linked to the tickets, and do those tickets have an estimate before work starts?
Ticket Quality – Do tickets have appropriate acceptance criteria set? Do bugs have adequate reproduction steps?
Agile Retrospective – Are tickets following your agile processes as expected? Are estimates accurate? What issues are getting in the way of predictably meeting commitments?
Quality
Code Reviews – Are code reviews happening and are they effective, or are they rubber-stamp reviews? Is the review load balanced appropriately? Are people completing reviews quickly enough and meeting established SLAs?
Bug/Incident SLA – Are tickets appropriately labeled with priority? Do you have SLAs for resolving issues of different priorities? What is the mean time to restore for each one? Are bugs routed appropriately so that they are being fixed by the right person? (See the sketch after this list for a simple way to compute this.)
Post Mortems – Do major issues have effective post mortems that identify the root cause and have good action items? Are those action items actually being prioritized?
Automated Testing – Are there areas where bugs are popping up frequently that may be lacking test coverage? Do you have code coverage metrics in place for each of your repositories? How are code coverage metrics trending? Are there problems with test flakiness, or do certain tests take more effort than they should to keep up to date?
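As an example of how the SLA and mean-time-to-restore questions above can be backed by data, here is a small sketch. The incident records and SLA thresholds are placeholders, not recommendations; in practice they would come from your ticketing or incident management system.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (priority, opened_at, resolved_at).
incidents = [
    ("P1", datetime(2023, 4, 1, 8, 0),  datetime(2023, 4, 1, 9, 30)),
    ("P1", datetime(2023, 4, 9, 22, 0), datetime(2023, 4, 10, 1, 0)),
    ("P2", datetime(2023, 4, 3, 10, 0), datetime(2023, 4, 5, 16, 0)),
]

# Example SLAs per priority (placeholder values).
slas = {"P1": timedelta(hours=4), "P2": timedelta(days=3)}

by_priority: dict[str, list[timedelta]] = {}
for priority, opened, resolved in incidents:
    by_priority.setdefault(priority, []).append(resolved - opened)

for priority, durations in sorted(by_priority.items()):
    mttr = sum(durations, timedelta()) / len(durations)  # mean time to restore
    breaches = sum(d > slas[priority] for d in durations)
    print(f"{priority}: MTTR {mttr}, SLA breaches {breaches} of {len(durations)}")
```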
Performance
API Response Times – Which requests take the most time in aggregate, and which individual requests are taking too long (i.e., look at median, p95, p99)? What is driving slow response times, and what opportunities are there for optimization? (See the sketch after this list.)
Database Query Times – Which database queries take the most time in aggregate, and which individual queries are outliers? Do those queries have appropriate indexes and can they be further optimized? Can query results be cached that aren’t being cached right now?
Cache Hit Rates – What are the hit rates of caching layers like CDNs and memcache servers? Are they what you would expect? Are there opportunities for improvement?
Page Load Times – For websites, what is the PageSpeed Insights score in different areas and how is it trending over time? Is there automated testing for issues that can impact this score? Are optimization tasks ticketed and prioritized effectively?
Application Action Times – Do actions inside of applications that are not instantaneous have performance instrumentation so you can see how long they take? Do long actions have an appropriate waiting spinner or progress bar?
Server Load – Do you have appropriate load balancing in place? Is performance degrading under higher load, or is there excessive cost when load is lower? Are you prepared to handle unexpected surges?
Cost – Is there granular instrumentation about the drivers of infrastructure cost? What are the biggest costs in different areas, and what are the biggest opportunities to reduce costs? Are cost reduction tasks recorded properly in tickets and appropriately prioritized alongside other work? Do you have reservations and contracts in place to optimize cloud resource costs?
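For the response-time and cache-hit-rate questions above, here is a dependency-free sketch of the underlying math. The latency values and cache counters are made up; in practice you would pull them from your APM tool, CDN, or cache server statistics.

```python
import math

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [42, 51, 38, 47, 95, 40, 44, 300, 39, 48, 52, 41, 46, 1200, 43, 45]

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: a small approximation that is fine for a review dashboard."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))

# Cache hit rate from raw counters (e.g. CDN or memcached stats).
cache_hits, cache_misses = 9_450, 550
print(f"hit rate: {cache_hits / (cache_hits + cache_misses):.1%}")
```

The point of computing aggregates like p95 and hit rate ahead of time is that the review can focus on what changed and why, rather than on gathering numbers during the meeting.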
Observability
Error Instrumentation – Are all meaningful errors being logged in a place where they are accessible? Is it easy to debug and reproduce errors given the contextual information that’s recorded in the logs? Is there alerting in place for new high-priority errors?
User Tracking – Do product managers, engineers, and customer service reps have good visibility into how specific people and relevant cohorts are using the software to support their work? Are events being tracked as expected? Is there testing in place for user tracking?
Alarm Coverage – What alarms are in place for the infrastructure? Are there gaps where certain types of failures would not trigger an alarm? Are there too many false positives causing noise? Are the notification policies for your paging system in alignment with the severity of different alarms?
Technical Debt / Architecture
Dependency Versions – What versions are you using of operating systems, languages, frameworks, and other dependencies? Are you keeping up-to-date with upgrading versions before they reach end-of-life? (See the sketch after this list.)
Tech Debt Backlog and Effort Allocation – Are tech debt items being recorded and prioritized in a backlog, and is the team devoting an appropriate amount of effort to fixing tech debt?
Architecture – Are you staying up-to-date with best practices for system architecture and frameworks? In what areas is the architecture struggling to meet current demands, and where is it likely to encounter scalability issues next?
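One lightweight way to back the dependency-version review with data is a script that checks what you run against a hand-maintained end-of-life table. The products, versions, and dates below are placeholders and should not be treated as authoritative EOL information.

```python
from datetime import date

# Hand-maintained end-of-life table for key dependencies (placeholder values).
eol_dates = {
    ("python", "3.8"): date(2024, 10, 7),
    ("ubuntu", "20.04"): date(2025, 5, 31),
    ("node", "16"): date(2023, 9, 11),
}

# What production actually runs (also placeholder values).
in_use = [("python", "3.8"), ("node", "16"), ("postgres", "14")]

today = date.today()
for name, version in in_use:
    eol = eol_dates.get((name, version))
    if eol is None:
        print(f"{name} {version}: no EOL date recorded, add it to the table")
    elif eol < today:
        print(f"{name} {version}: past end-of-life ({eol}), upgrade needed")
    else:
        print(f"{name} {version}: supported for {(eol - today).days} more days")
```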
DevOps
Development Environment – How long does it take to install the development environment and how frequently does setup fail? How often does it break? Does the development environment have sufficient parity with production, or are people frequently finding errors that don’t happen locally?
CI/CD Performance – How fast is each pipeline and how are the build times trending? Which jobs are the slowest and can they be optimized? Are the pipelines doing appropriate things at each level, or is too much running in pre-release pipelines? Are there any new technologies or best practices you should adopt to speed up build times?
CI/CD Reliability – How often do builds fail and what are the causes of build failures? What is the mean time to restore (MTTR) of build failures, and are people addressing the causes with sufficient priority?
Final Thoughts
Structuring reviews outside of security as control tasks was an aha moment for me as an engineering leader, and hopefully it is for you too (or you’re already doing it!).
My view has been limited to a few smaller organizations and my time at the Department of Defense, so I’d also be very curious to hear what controls other people have in place to manage software engineering at scale. If you’re willing to share, please comment below!