Wednesday, August 8, 2012

Fixing Bugs that can’t be Reproduced

There are bugs that can’t be reproduced, or at least not easily: intermittent and transient errors; bugs that disappear when you try to look for them; bugs that occur as the result of a long chain of independent operations or cross-request timing. Some of these bugs are only found in high-scale production systems that have been running for a long time under heavy load.

Capers Jones calls these bugs “abeyant defects” and estimates that in big systems, as much as 10% of bugs cannot be reproduced, or are too expensive to try reproducing. These bugs can cost 100x more to fix than a simple defect – an “average” bug like this can take more than a week for somebody to find (if they can be found at all) by walking through the design and code, and up to another week or two to fix.

Heisenbugs

One class of bugs that can’t be reproduced are Heisenbugs: bugs that disappear when you attempt to trace or isolate them. When you add tracing code, or step through the problem in a debugger, the problem goes away.

In Debug It!, Paul Butcher offers some hope for dealing with these bugs. He says Heisenbugs are caused by non-deterministic behavior which in turn can only be caused by:

  1. Unpredictable initial state – a common problem in C/C++ code
  2. Interaction with external systems – which can be isolated and stubbed out, although this is not always easy
  3. Deliberate randomness – random factors can also be stubbed out in testing
  4. Concurrency – the most common cause of Heisenbugs today, at least in Java.

Knowing this (or at least making yourself believe it while you are trying to find a problem like this) can help you decide where to start looking for a cause, and how to go forward. But unfortunately, it doesn’t mean that you will find the bug, at least not soon.
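For example, deliberate randomness can usually be taken out of the picture by injecting the source of randomness instead of creating it inside the code under test. A minimal sketch of the idea, with made-up class names:

    import java.util.Random;
    import java.util.function.IntSupplier;

    // Hypothetical example: a retry delay that depends on random jitter.
    // Injecting the random source lets a test replace it with a fixed value,
    // turning non-deterministic behaviour into something repeatable.
    class RetryPolicy {
        private final IntSupplier jitterMillis;

        RetryPolicy(IntSupplier jitterMillis) {
            this.jitterMillis = jitterMillis;
        }

        int nextDelayMillis(int attempt) {
            return (1 << attempt) * 100 + jitterMillis.getAsInt();
        }

        public static void main(String[] args) {
            Random random = new Random();
            RetryPolicy production = new RetryPolicy(() -> random.nextInt(50)); // real randomness
            RetryPolicy underTest  = new RetryPolicy(() -> 0);                  // stubbed out in a test

            System.out.println("production delay: " + production.nextDelayMillis(2));
            System.out.println("test delay:       " + underTest.nextDelayMillis(2)); // always 400
        }
    }

The same injection trick works for clocks and for calls to external systems: the production wiring stays non-deterministic, but a test can pin everything down.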

Race Conditions

Race conditions used to be problems only for systems programmers and people writing communications handlers. But almost everybody runs into race conditions today – is anybody writing code that isn’t multi-threaded anymore? Races are synchronization errors that occur when two or more threads or processes access the same data or resource without a consistent locking approach.

Races can result in corrupted data – memory getting stomped on (especially in C/C++), changes applied more than once (balances going negative) or changes being lost (credits without debits) – as well as inconsistent UI behaviour, random crashes due to null pointer problems (a thread references an object that has already been freed by another thread), intermittent timeouts, and actions executed out of sequence, including time-of-check/time-of-use security violations.
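Here's a tiny sketch (not from any real system) of the "changes being lost" case: two threads incrementing a shared counter without synchronization. Run it a few times – the total is usually wrong, and wrong by a different amount on each run, which is exactly what makes races so hard to pin down:

    // Classic lost-update race: count++ is a read-modify-write, not atomic.
    // The result is usually less than 2,000,000, and different on every run.
    public class LostUpdates {
        static int count = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    count++;   // unsynchronized access from two threads
                }
            };
            Thread t1 = new Thread(work);
            Thread t2 = new Thread(work);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            System.out.println("expected 2000000, got " + count);
        }
    }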

The results depend on which thread or process wins the race at a particular time. Because races are the result of unlucky timing, and because the problem may not be visible right away (e.g., something gets stepped on, but you don’t know until much later when something else tries to use it), they’re usually hard to understand and hard to make happen.

You can’t fix (or probably even find) a race condition without understanding concurrency. And the fact that you have a race condition is a good sign that whoever wrote this code didn’t understand concurrency, so you’re probably not dealing with only one mistake. You’ll have to be careful in diagnosing, and especially in fixing, the bug, to make sure that you don’t change the problem from a race into a stall, thread starvation, livelock or deadlock by getting the synchronization approach wrong.
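For example, the obvious fix for a race over two shared resources is to lock both of them – but locking them in whatever order the caller happens to pass them in is a classic way to trade a race for a deadlock. A sketch of one common way to stay out of trouble, using a consistent lock ordering (the Account class here is made up for illustration):

    // Hypothetical example: transferring between two accounts needs both locks.
    // Acquiring them in a globally consistent order (here, by account id)
    // fixes the race without opening the door to deadlock.
    class Account {
        final long id;
        private long balance;

        Account(long id, long balance) { this.id = id; this.balance = balance; }

        void add(long amount) { balance += amount; }

        static void transfer(Account from, Account to, long amount) {
            Account first  = from.id < to.id ? from : to;
            Account second = from.id < to.id ? to : from;
            synchronized (first) {
                synchronized (second) {
                    from.add(-amount);
                    to.add(amount);
                }
            }
        }
    }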

Fixing bugs that can’t be reproduced

If you think you’re dealing with a concurrency problem, a race condition or a timing-related bug, try introducing long pauses between different threads – this will expand the window for races and timing-related problems to occur, which should make the problem more obvious. This is what IBM Research’s ConTest tool does (or did – unfortunately, ConTest seems to have disappeared from the IBM alphaWorks site): messing with thread scheduling to make deadlocks and races occur more often.
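You can hand-roll a crude version of the same idea: sprinkle short random sleeps or yields around the suspect shared state, behind a test-only flag, so the scheduler interleaves threads differently on every run – a race that shows up once a month can start showing up every few minutes. A rough sketch, with a made-up system property:

    import java.util.concurrent.ThreadLocalRandom;

    // Test-only noise injection: widen the window in which another thread can
    // sneak in between a read and a write. Guarded by a flag so it can be left
    // in place and switched on only when chasing a suspected race.
    final class RaceNoise {
        static final boolean ENABLED = Boolean.getBoolean("race.noise"); // hypothetical property

        static void maybePause() {
            if (!ENABLED) return;
            try {
                if (ThreadLocalRandom.current().nextBoolean()) {
                    Thread.sleep(ThreadLocalRandom.current().nextInt(5)); // 0-4 ms
                } else {
                    Thread.yield();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Usage inside suspect code:
    //   int current = sharedCounter;
    //   RaceNoise.maybePause();       // widen the race window here
    //   sharedCounter = current + 1;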

If you can’t reproduce the bug, that doesn’t mean you should give up. There are still some things to look at and try.

It’s often faster to find concurrency bugs, timing problems and other hard problems by working back from the error and stepping through the code – in a debugger or by hand – to build up your own model of how the code is supposed to work.

“Even if you can’t easily reproduce the bug in the lab, use the debugger to understand the affected code. Judiciously stepping into or over functions based on your level of ‘need to know’ about that code path. Examine live data and stack traces to augment your knowledge of the code paths.” – Jeff Vroom, Debugging Hard Problems

I’ve worked with brilliant programmers who can work back through even the nastiest code, tracing through what’s going on until they see the error. For the rest of us, debugging is a perfect time to pair up. While I'm not sure that pair programming makes a lot of sense as an everyday practice for everyday work, I haven’t seen too many ugly bugs solved without two smart people stepping through the code and logs together.

It’s also important to check compiler and static analysis warnings – these tools can help point to easily-overlooked coding mistakes that could be the cause of your bug, and maybe even other bugs. FindBugs, for example, has checkers for common concurrency bug patterns, as well as for lots of other coding mistakes that you can miss finding on your own.
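Broken double-checked locking is a good example of the kind of concurrency bug pattern these checkers flag, and one that is easy to stare straight past in a review:

    // Broken double-checked locking: without volatile, a thread can see a
    // non-null "instance" reference before the constructor's writes are visible.
    // Static analysis tools such as FindBugs flag this pattern; the usual fix
    // is to declare the field volatile (or use a holder class / enum singleton).
    public class ConfigCache {
        private static ConfigCache instance;   // should be volatile

        private ConfigCache() { /* load configuration */ }

        public static ConfigCache get() {
            if (instance == null) {                      // first check, unsynchronized
                synchronized (ConfigCache.class) {
                    if (instance == null) {              // second check, under the lock
                        instance = new ConfigCache();
                    }
                }
            }
            return instance;
        }
    }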

There are also dynamic analysis tools that are supposed to help find race conditions and other kinds of concurrency bugs at run-time, but I haven’t seen any of them actually work. If anybody has had success with tools like this in real-world applications I would like to hear about it.

Shotgun Debugging, Offensive Programming, Brute-Force Debugging and other ideas

Debug It! recommends that if you are desperate, just take a copy of the code, and try changing something, anything, and see what happens. Sometimes the new information that results from your change may point you in a new direction. This kind of undirected “Shotgun Debugging” is basically wishful thinking, relying on luck and accidental success, what Andy Hunt and Dave Thomas call “Programming by Coincidence”. It isn’t something that you want to rely on, or be proud of.

But there are some changes that do make a lot of sense to try. In Code Complete, Steve McConnell recommends making “offensive coding” changes: adding asserts and other run-time debugging checks that will cause the code to fail if something “impossible” happens, because something “impossible” apparently is happening.
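In Java that can be as simple as a few asserts and explicit checks (keeping in mind that assert is only active when the JVM runs with -ea, so an explicit throw may be the better “offensive” option for production code). A small sketch with made-up names:

    // "Offensive" checks: fail loudly the moment the impossible happens,
    // close to the cause, instead of corrupting state and failing much later.
    public class Ledger {
        private long balanceCents;

        public void withdraw(long amountCents) {
            // this is "impossible" by design, so fail fast if it happens anyway
            if (amountCents <= 0) {
                throw new IllegalArgumentException("impossible withdrawal amount: " + amountCents);
            }
            balanceCents -= amountCents;

            // assert is only active when the JVM is started with -ea
            assert balanceCents >= 0 : "impossible negative balance: " + balanceCents;
        }
    }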

Jeff Vroom suggests writing your own debugging code:

“For certain types of complex code, I will write debugging code, which I put in temporarily just to isolate a specific code path where a simple breakpoint won’t do. I’ve found using the debugger’s conditional breakpoints is usually too slow when the code path you are testing is complicated. You may hit a specific method 1000s of times before the one that causes the failure. The only way to stop in the right iteration is to add specific code to test for values of input parameters… Once you stop at the interesting point, you can examine all of the relevant state and use that to understand more about the program.”
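A minimal sketch of that kind of temporary trap code, with made-up names and values – test for the suspect input in plain Java, and give the debugger something cheap to stop on only in the interesting iteration:

    // Temporary debugging code: cheaper than a conditional breakpoint that the
    // debugger has to evaluate thousands of times. Put an ordinary breakpoint
    // on the line inside the if, or just dump the interesting state.
    public class OrderProcessor {
        // hypothetical values for the one case that fails
        private static final long SUSPECT_ORDER_ID = 1_234_567L;

        static void process(long orderId, int retryCount) {
            if (orderId == SUSPECT_ORDER_ID && retryCount > 3) {
                System.err.println("TRAP: order " + orderId + " retries=" + retryCount); // breakpoint here
            }
            // ... normal processing ...
        }

        public static void main(String[] args) {
            for (int i = 0; i < 10_000; i++) {
                process(i == 9_999 ? SUSPECT_ORDER_ID : i, i % 5);
            }
        }
    }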

Paul Butcher suggests that before you give up trying to reproduce and fix a problem, see if there are any other bugs reported in the same area and try to fix them – even if they aren’t serious bugs. The rationale: fixing other bugs may clear up the situation (your bug was being masked by another one); and by working on something else in the same area, you may learn more about the code and get some new ideas about your original problem.

Refactoring or even quickly rewriting some of the code where you think the problem might be can sometimes help you to see the problem more clearly, especially if the code is difficult to follow. Or it could possibly move the problem somewhere else, making it easier to find.

If you can’t find it and fix it, at least add defensive code: tighten up error handling and input checking, and add some logging – if something is happening that can’t happen or won’t happen for you, you need more information on where to look when it happens again.
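A short sketch of what “defensive plus logging” can look like, again with made-up names – the point is to record enough context that the next occurrence tells you where to look:

    import java.util.logging.Logger;

    // Defensive handling of input that "can't" be bad, plus enough logged
    // context (who, what, which thread) to narrow the search when it happens again.
    public class PaymentHandler {
        private static final Logger LOG = Logger.getLogger(PaymentHandler.class.getName());

        public void credit(String accountId, long amountCents) {
            if (accountId == null || accountId.isEmpty()) {
                LOG.severe("credit called with missing accountId, amount=" + amountCents
                        + ", thread=" + Thread.currentThread().getName());
                throw new IllegalArgumentException("accountId is required");
            }
            if (amountCents <= 0) {
                LOG.warning("suspicious credit amount " + amountCents + " for account " + accountId);
            }
            // ... apply the credit ...
        }
    }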

Question Everything

Finally, The Pragmatic Programmer tells us to remember that nothing is impossible. If you aren’t getting closer to the answer, question your assumptions. Code that has always worked might not work in this particular case. Unless you have a complete trace of the problem in the original report, it’s possible that the report is incomplete or misleading – that the problem you need to solve is not the problem that you are looking at.

If you're trapped in a real debugging nightmare, look outside of your own code. The bug may not be in your code, but in underlying third party code in your service stack or the OS. In big systems under heavy load, I’ve run into problems in the operating system kernel, web servers, messaging middleware, virtual machines, and the DBMS. Your job when debugging a problem like this is to make sure that you know where the bug isn’t (in your code), and to come up with as simple a test case as possible that shows this – when you’re working with a vendor, they’re not going to be able to set up and run a full-scale enterprise app in order to reproduce your problem.

Hard bugs that can't be reproduced can blow schedules and service levels to hell, and wear out even your best people. Taking all of this into account, let’s get back to the original problem of how or whether to estimate bug fixes, in the next post.

1 comment:

Akash Deep said...

I am a Java developer working on a live application in production. If an issue or bug comes up in the application, what are the best practices to reproduce the same issue, so that I can fix it?
