Tuesday, February 15, 2011

Zero Bug Tolerance Intolerance

It sounds good to say that you shall not and will not release code with bugs – that your team has “zero bug tolerance”. It makes for a nice sound bite. It sounds responsible. And it sounds right. But let’s look carefully at what this really means.

First, there are the logical arguments as to whether it is possible to build a perfect system of any level of complexity, and whether you can prove that a piece of nontrivial code is bug free. These are interesting questions, but I think the more important question is whether you should really try to.

I was guilty earlier in my career of pushing feature-set and schedule over quality, leaving too many bugs too late, and then having to deal with the aftermath. When I first started managing development projects a long time ago, I didn’t understand how “slowing down” to fix bugs would help get the project done faster. But I have learned, and I know (there is a difference between learning something and knowing something, really knowing it deep down) that fixing bugs helps keep costs down, and that it is possible to build good software quickly.

In Software Quality at Top Speed, Steve McConnell makes the case that short-changing design and writing bad code is stupid and will bite you in the ass, and that doing a responsible job on design and writing good code gets you to the end faster. Somewhere around a 90% defect removal rate you reach an optimal point:
“the point at which projects achieve the shortest schedules, least effort, and highest levels of user satisfaction” (Capers Jones, Applied Software Measurement: Assuring Productivity and Quality, 1991).
Most teams don’t get close to the optimal point. But aiming beyond this, towards 100% perfection, causes costs to skyrocket; you quickly reach the point of diminishing returns. Diminishing returns in the pursuit of perfect software is explored further by Andy Boothe in The Economics of Perfect Software:
“For example, imagine a program has 100 bugs, and we know it will take 100 units of effort to find and fix all 100 of those bugs. The Law of Diminishing Returns tells us that the first 40 units of effort would find the first 70 bugs, the next 30 units of effort would find the next 20 bugs, and the next 30 units of effort would find the last 10 bugs. This means that the first 70 bugs (the shallow bugs) are cheap to find and squash at only 40 / 70 = 0.571 units of work per bug (on average). The next 20 bugs (the deep bugs) are significantly more expensive at 30 / 20 = 1.5 units of effort per bug, and the final 10 bugs (the really deep bugs) are astronomically expensive at 30 / 10 = 3 units of effort per bug. The last 10 bugs are more than 5 times more time- and capital-intensive to eliminate per bug than the first 70 bugs. In terms of effort, the difference between eliminating most bugs (say 70%-90%) and all bugs is huge, to the tune of a 2x difference in effort and cost.

And in real life it’s actually worse than that. Because you don’t know when you’ve killed the last bug — there’s no countdown sign, like we had in our example — you have to keep looking for more bugs even when they’re all dead just to make sure they’re all dead. If you really want to kill all the bugs, you have to plan for that cost too.”
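The arithmetic in that example is simple enough to replay directly. Here is a small Python sketch that does just that; the effort and bug-count tiers are the illustrative numbers from the quote above, not measurements from a real project:

    # Replay the diminishing-returns arithmetic from the example above.
    # The (effort, bugs_found) tiers are the illustrative numbers from the
    # quote, not data from a real project.
    tiers = [
        (40, 70),  # shallow bugs: cheap to find and fix
        (30, 20),  # deep bugs: noticeably more expensive
        (30, 10),  # really deep bugs: astronomically expensive
    ]

    total_effort = 0
    total_bugs = 0
    for effort, bugs in tiers:
        total_effort += effort
        total_bugs += bugs
        print(f"{bugs} bugs for {effort} units of effort "
              f"({effort / bugs:.3f} units per bug); "
              f"running total: {total_bugs} bugs, {total_effort} units")

The exact numbers don’t matter. What matters is the shape of the curve: the cost per bug climbs from roughly 0.57 units to 1.5 to 3 as you push toward 100%, and the last 30 bugs cost more than the first 70 did.
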
There's a cost to building good software

There’s a cost to putting in place the necessary controls and practices, the checks and balances; to building the team’s focus, commitment and discipline, and keeping this up over time; to building the right culture, the right skills, and the right level of oversight. And there’s a cost to saying no to the customer: to cutting back on features, or asking for more time upfront, or delaying a release because of technical risks.

You also have to account for opportunity costs. Kent Beck and Martin Fowler have built their careers on writing high quality software and teaching other people how to do this. In Planning Extreme Programming they make it clear that it is important to write good software, but:
“For most software, however, we don’t actually want zero bugs. Any defect, once it is in there, takes time and effort to remove. That time and effort will take away from effort spent putting in features. So you have to decide what to do.”
That’s why I am concerned by right-sounding technical demands for zero bug tolerance. This isn’t a technical decision that can be made by developers or testers or project managers…or consultants. It’s bigger than all of them. It’s not just a technical decision – it’s also a business decision.

Long Tail of Bugs

Like many other problem spaces, the idea of the Long Tail also applies to bugs. There are bugs that need to be fixed and can be fixed right now. But there are other bugs that may never need to be fixed, or bugs that the customer may never see: minor bugs in parts of the system that aren’t used often, or problems that occur in unlikely configurations or unusual strings of events, or only under extreme stress testing. Bugs in code that is being rewritten anyway, or in a part of the system that is going to be decommissioned soon. Small cosmetic issues: if there were nothing better to do, sure, you would clean them up, but you do have something better to do, so you do that instead.

There are bugs that you aren’t sure are actually bugs, where you can’t agree on what the proper behavior should be. And there are bugs that can be expensive and time consuming to track down and reproduce and fix. Race conditions that take a long time to find – and may take a redesign to really fix. Intermittent heisenbugs that disappear when you look at them, non-deterministic problems that you don’t have the information or time to fix right now. And WTF bugs that you don’t understand and don’t know how to fix yet, or the only fix you can think of is scarier than the situation that you are already in.

Then there are bugs that you can’t fix yourself, bugs in underlying technology or third party libraries or partner systems that you have to live with or work around for now. And there are bugs that you haven’t found yet and won’t find unless you keep looking. And at some point you have to stop looking.

All of these bugs are sources of risk. But that doesn’t necessarily mean that they have to be fixed right now. As Eric Sink explains in My Life as a Code Economist, there are different questions that need to be answered to determine whether a bug should be fixed (a rough sketch of weighing these questions together follows the list):

First there are the customer questions, basic questions about the importance of a bug to the business:
  • Severity: when this bug happens, how bad is the impact? Is it visible to a large number of customers? Could we lose data, or lose service to important customers or partners? What are the downstream effects: what other systems or partners could be impacted, and how quickly could the problem be contained or repaired? Could this violate service levels, or regulatory or compliance requirements?

  • Frequency: how often could this bug happen in production?
Then there are the developer questions, the technical questions about what needs to be done to fix the bug:
  • Cost: how much work is required to reproduce, fix and test this bug, including regression testing? And what about root cause analysis: how much more work should we do to dig deeper and fix the underlying cause?

  • Risk: what is the technical risk of making things worse by trying to fix it? How well do I understand the code? How much refactoring do I need to do to make sure that the code, and the fix, is clear? Is the code protected by a good set of tests?
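To make that trade-off concrete, here is one rough way those four questions might be weighed against each other. The scoring scale, the weights and the threshold are all made up for illustration; Eric Sink doesn’t prescribe a formula, and this sketch shouldn’t be read as one:

    from dataclasses import dataclass

    @dataclass
    class BugReport:
        # Each answer is scored 1 (low) to 5 (high). The scales are
        # hypothetical; they are not taken from Eric Sink's article.
        severity: int    # 1 = cosmetic .. 5 = data loss, SLA or compliance breach
        frequency: int   # 1 = rare edge case .. 5 = hits most customers daily
        fix_cost: int    # 1 = trivial, well understood .. 5 = needs a redesign
        fix_risk: int    # 1 = clear, well-tested code .. 5 = fragile, untested code

    def should_fix_now(bug: BugReport, threshold: float = 1.0) -> bool:
        # The customer questions argue for fixing now; the developer
        # questions argue for waiting. Fix only if the value outweighs the cost.
        value = bug.severity * bug.frequency
        cost = bug.fix_cost * bug.fix_risk
        return value / cost >= threshold

    # A data-corrupting race condition that hits many customers, in fragile code:
    print(should_fix_now(BugReport(severity=5, frequency=4, fix_cost=4, fix_risk=4)))  # True
    # A cosmetic glitch on a rarely used screen that needs risky refactoring:
    print(should_fix_now(BugReport(severity=1, frequency=2, fix_cost=3, fix_risk=4)))  # False

The point isn’t the formula. The point is that the customer questions and the developer questions pull in different directions, and somebody has to weigh them explicitly instead of pretending that every bug is automatically worth fixing.
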
For some bugs, the decision is dead easy: simple, stupid mistakes that you or somebody else just made as part of a change or another fix, mistakes that are found right away in testing or review, and should be fixed right away. You know what to do, you don’t waste time: you fix it and you move on.

But for other bugs, especially bugs discovered in existing code, it’s sometimes not so easy. Zero bug tolerance naively assumes that it is always good, it’s always right, to fix a bug. But fixing a bug is not always the right thing to do, because with any fix you run the risk of introducing new problems:
“the bugs you know are better than introducing new bugs”
Fred Brooks first pointed out the regression problem in The Mythical Man-Month:
“…fixing a defect has a substantial (20-50%) chance of introducing another. So the whole process is two steps forward and one step back.”
The risk of introducing a bug when trying to fix another one may have gone down since Fred Brooks’ time. In Geriatric Issues of Aging Software, Capers Jones provides some less frightening numbers:
“Roughly 7 percent of all defect repairs will contain a new defect that was not there before. For very complex and poorly structured applications, these bad fix injections have topped 20 percent”.
But the cost and risks are still real, and need to be accounted for.
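As a back-of-the-envelope illustration of that cost: if you assume that every bad fix is itself eventually found and fixed (a simplifying assumption, purely for the arithmetic), the total number of fixes needed to clear a backlog forms a geometric series:

    # Back-of-the-envelope: if each fix has probability p of injecting a new
    # defect, and every injected defect is itself found and fixed, then
    # clearing n bugs takes n + n*p + n*p^2 + ... = n / (1 - p) fixes in total.
    # The 7% and 20% rates are Capers Jones' figures; 50% is the high end of
    # Fred Brooks' range. The "every bad fix gets fixed" assumption is mine.

    def total_fixes(initial_bugs: int, bad_fix_rate: float) -> float:
        return initial_bugs / (1.0 - bad_fix_rate)

    print(total_fixes(100, 0.07))  # ~107.5 fixes to clear 100 bugs
    print(total_fixes(100, 0.20))  # ~125 fixes for complex, poorly structured code
    print(total_fixes(100, 0.50))  # ~200 fixes at the high end of Brooks' range

Even at 7 percent, roughly one fix in fourteen creates new work; at the high end of Brooks’ range, every other fix does.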

Bars and Broken Windows

You, the team, and the customer all need to agree on how high to set the bar: what kinds of bugs and risks can be accepted. Then you have to figure out what controls, checks, practices, tools and skills, and how much more time, you need to consistently hit that bar, and how much it is going to cost. If you’re building safety-critical systems like the command software for the space shuttle (what are we going to use as an example when the space shuttles stop flying?), the bar is extremely high, and so of course are the costs. If you’re a social media Internet startup, short on cash and time and big on opportunity, then the bar is as low as you can afford. For the rest of us the bar is somewhere in between.

I get what’s behind Zero Bug Tolerance. I understand the “No Broken Windows” principle in software development: if we let one bug through, what’s to stop the next one, and the next one, and the one after that? But it’s not that simple. No Broken Windows is about having the discipline to do a professional job. It’s about not being sloppy or careless or irresponsible. And there is nothing irresponsible about making tough and informed decisions about what can and should be fixed now.

Knowing when to stop fixing bugs, when you’ve reached the point of diminishing returns, when you should focus on more important work, isn’t easy. Knowing which bugs to fix and which ones not to, or which ones you can’t or shouldn’t fix now, isn’t easy either. And you will be wrong sometimes. Some small problem that you didn’t think was important enough to look into further, some bug that you couldn’t justify the time to chase down, may come back and bite you. You’ll learn, and hopefully make better decisions in the future.

That’s real life. In real life we have to make these kinds of hard decisions all of the time. Unfortunately, we can't rely on simple, 100% answers to real problems.
