
Thursday, August 20, 2015

How to Prevent Catastrophic Failures in Complex Distributed Systems

In his now famous paper How Complex Systems Fail, Dr. Richard Cook explains how and why failures happen in complex systems:

Some Rules of Failure in Complex Systems

4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.

3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.

The net of this: Complex systems are essentially and unavoidably fragile. We can try, but we can’t stop them from failing – there are too many moving pieces, too many variables and too many combinations to understand and to test. And even the smallest change or mistake can trigger a catastrophic failure.

A New Hope

But new research at the University of Toronto on catastrophic failures in complex distributed systems offers some hope – a potentially simple way to reduce the risk and impact of these failures.

The researchers looked at distributed online systems that had been extensively reviewed and tested, but still failed in spectacular ways.

They found that most catastrophic failures were initially triggered by minor, non-fatal errors: mistakes in configuration, small bugs, hardware failures that should have been tolerated. Then, following rule #3 above, a specific and unusual sequence of events had to occur for the catastrophe to unfold.

The bad news is that this sequence of events can’t be predicted – or tested for – in advance.

The good news is that catastrophic failures in complex, distributed systems may actually be easier to fix than anyone previously thought. Looking closer, the researchers found that almost all (92%) catastrophic failures are the result of incorrect handling of non-fatal errors. These mistakes in error handling caused the system to behave unpredictably, causing other errors, which weren’t always handled correctly or predictably, creating a domino effect.

More than half (58%) of catastrophic failures could be prevented by careful review and testing of error handling code. In 35% of the cases, the faults in error handling code were trivial: the error handler was empty or only logged a failure, or the logic was clearly incomplete. Easy mistakes to find and fix. So easy that the researchers built a freely available static analysis checker for Java byte code, Aspirator, to catch many of these problems.
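To make those trivial mistakes concrete, here is a minimal sketch in Java. It is not code from the study or from Aspirator; the class, method, and type names are invented for illustration, but each handler shows one of the patterns described above: an empty handler, a log-and-continue handler, and clearly incomplete logic. These are exactly the kinds of shapes a bytecode checker can flag mechanically, which is what makes them such cheap wins.

    import java.io.IOException;
    import java.util.logging.Logger;

    // Hypothetical code, not from the study: the names are invented to
    // illustrate the three trivial error-handling mistakes described above.
    public class ReplicaSync {

        private static final Logger LOG = Logger.getLogger(ReplicaSync.class.getName());

        // 1. Empty handler: the error is silently swallowed, so callers carry on
        //    with a replica they believe is up to date.
        void syncFromPeer(Peer peer) {
            try {
                peer.fetchSnapshot();
            } catch (IOException e) {
            }
        }

        // 2. Log-and-continue: the failure is recorded, but nothing recovers from
        //    it, and later code assumes the checkpoint was written.
        void persistCheckpoint(Checkpoint checkpoint) {
            try {
                checkpoint.writeToDisk();
            } catch (IOException e) {
                LOG.warning("checkpoint write failed: " + e.getMessage());
            }
        }

        // 3. Clearly incomplete logic: the handler was started but never finished.
        void applyConfigUpdate(Config update) {
            try {
                update.validate();
            } catch (ConfigException e) {
                // TODO: roll back to the previous configuration
            }
        }

        // Minimal placeholder types so the sketch is self-contained.
        interface Peer { void fetchSnapshot() throws IOException; }
        interface Checkpoint { void writeToDisk() throws IOException; }
        interface Config { void validate() throws ConfigException; }
        static class ConfigException extends Exception { }
    }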

In another 23% of the cases, the error handling logic for a non-fatal error was so wrong that basic statement coverage testing or careful code reviews would have caught the mistakes.
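The fix that finding points to is mundane but effective: write at least one test that forces each non-fatal error to actually happen, so the handler’s statements get executed. A hedged sketch using JUnit 5 and the made-up ReplicaSync types from the previous example:

    import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

    import java.io.IOException;
    import org.junit.jupiter.api.Test;

    // A sketch only: it reuses the invented ReplicaSync types from the previous
    // example and assumes they live in the same package as this test.
    class ReplicaSyncErrorHandlingTest {

        @Test
        void syncSurvivesPeerFailure() {
            ReplicaSync sync = new ReplicaSync();

            // A peer that always fails, forcing execution into the catch block,
            // which is all that basic statement coverage of the handler requires.
            ReplicaSync.Peer failingPeer = () -> {
                throw new IOException("simulated network fault");
            };

            // This minimal assertion only checks that the failure is tolerated;
            // a real test should also assert that the system recovers, for
            // example by retrying or by marking the replica as stale.
            assertDoesNotThrow(() -> sync.syncFromPeer(failingPeer));
        }
    }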

The next challenge that the researchers encountered was convincing developers to take these mistakes seriously. They had to walk developers through why small bugs in error handling – bugs that “would never realistically happen” – needed to be fixed, and why careful error handling is so important.

This is a challenge that we all need to take up – if we hope to prevent catastrophic failure in complex distributed systems.

Thursday, March 19, 2015

Making Refactoring Work

A recent academic study raises some questions about how useful and how important refactoring really is.

The researchers found that refactoring didn’t seem to make code measurably easier to understand or change, or even measurably cleaner (measured by cyclomatic complexity, depth of inheritance, class coupling or lines of code).

But as other people have discussed, this study is deeply flawed. It appears to have been designed by people who didn’t understand how to do refactoring properly:

  1. The researchers chose 10 “high impact” refactoring techniques (from a 2011 study by Shatnawi and Li) based on a model of OO code quality which measures reusability, flexibility, extendibility and effectiveness (“the degree to which a design is able to achieve the desired functionality and behavior using OO design concepts and techniques” – whatever that means), but which specifically did not include understandability. And then they found that the refactored code was not measurably easier to understand or fix. Umm, should this have been a surprise?

    The refactorings were intended to make the code more extensible and reusable and flexible. In many cases this would have actually made the code less simple and harder to understand. Flexibility and extendibility and reusability often come at the expense of simplicity, requiring additional scaffolding and abstraction. These are long-term investments that are intended to pay back over the life of a system – something that could not be measured in the couple of hours that the study allowed.

    The list of techniques did not include common and obviously useful refactorings which would have made the code simpler and easier to understand, such as Extract Class and Extract Method (which are the two most impactful refactorings, according to research by Alshehri and Benedicenti, 2014), Extract Variable, Move Method, Change Method Signature, Rename anything, … [insert your own shortlist of other useful refactorings here]. (A small sketch of this kind of refactoring follows this list.)

  2. There is no evidence – and no reason to believe – that the refactoring work that was done was done properly. Presumably somebody entered some refactoring commands in Visual Studio and the code came out “refactored properly”.

  3. The study set out to measure whether refactoring made code easier to change. But they attempted to do this by assessing whether students were able to find and fix bugs that had been inserted into the code – which is much more about understanding the code than it is about changing it.

  4. The code base (4500 lines) and the study size (two groups of 10 students) were both too small to be meaningful, and students were not given enough time to do meaningful work: 5 minutes to read the code, 30 minutes to answer some questions about it, 90 minutes to try to find and fix some bugs.

  5. And as the researchers point out, the developers who were trying to understand the code were inexperienced. It’s not clear that they would have been able to understand the code and work with it even if it had been refactored properly.
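Here is the sketch promised in point 1: a tiny, entirely made-up Java example (not code from the study) of how Rename and Extract Method, applied with no cleverness at all, leave the behaviour unchanged but make the logic easier to understand.

    // Before: the intent is buried in abbreviations and an inline condition.
    class OrderCalcBefore {
        double calc(double a, int q, boolean f) {
            double t = a * q;
            if (f && t > 100) {
                t = t * 0.9;
            }
            return t;
        }
    }

    // After: Rename (calc -> orderTotal, f -> isPreferredCustomer) and Extract
    // Method (the discount rule gets a name). The names and numbers are invented.
    class OrderCalcAfter {
        private static final double BULK_DISCOUNT = 0.9;
        private static final double DISCOUNT_THRESHOLD = 100;

        double orderTotal(double unitPrice, int quantity, boolean isPreferredCustomer) {
            double total = unitPrice * quantity;
            if (qualifiesForDiscount(total, isPreferredCustomer)) {
                total *= BULK_DISCOUNT;
            }
            return total;
        }

        private boolean qualifiesForDiscount(double total, boolean isPreferredCustomer) {
            return isPreferredCustomer && total > DISCOUNT_THRESHOLD;
        }
    }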

But the study does point to some important limitations to refactoring and how it needs to be done.

Good Refactoring takes Time

Refactoring code properly takes experience and time. Time to understand the code. Time to understand which refactorings should be used in what context. Time to learn how to use the refactoring tools properly. Time to learn how much refactoring is enough. And of course time to get the job done right.

Someone who isn’t familiar with the language or the design and the problem domain, and who hasn’t worked through refactoring before won’t do a good job of it.

Refactoring is Selfish

When you refactor, it’s all about you. You refactor the code in ways that make it easier for YOU to understand and that should make it easier for YOU to change in the future. But this doesn’t necessarily mean that the code will be easier for someone else to understand and change.

It’s hard to go wrong doing some basic, practical refactoring. But deeper and wider structural changes, like Refactoring to Patterns or other “Big Refactoring” or “Large Scale Refactoring” changes that make some programmers happy, can also make the code much harder for other programmers to understand and work with – especially if the work only gets done part way (which often happens with well-intentioned, ambitious root canal refactoring work).
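To see why, compare a direct implementation with a pattern-heavy version of the same thing. This is a contrived sketch, not code from any real project: the behaviour is identical, but a reader now has to follow an interface and a factory to find one line of logic, which only pays off if the extra flexibility is ever actually needed.

    // Direct version: everything a reader needs is in one place.
    class DirectTax {
        double taxFor(double amount) {
            return amount * 0.13;
        }
    }

    // "Refactored to patterns" version: more extensible, but harder to follow
    // until a second policy actually exists.
    interface TaxPolicy {
        double taxFor(double amount);
    }

    class FlatTaxPolicy implements TaxPolicy {
        @Override
        public double taxFor(double amount) {
            return amount * 0.13;
        }
    }

    class TaxPolicyFactory {
        static TaxPolicy defaultPolicy() {
            return new FlatTaxPolicy();
        }
    }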

In the study, the researchers thought that they were making the code better, by trying to make it more extensible, reusable and flexible. But they didn’t take the needs of the students into consideration. And they didn’t follow the prime directive of refactoring:

Always start by refactoring to understand. If you aren’t making the code simpler and easier to understand, you’re doing it wrong.

Ironically, what the students in the study should have done – with the original code, as well as the “refactored code” – was to refactor it on their own first so that they could understand it. That would have made for a more interesting, and much more useful, study.

Refactoring Works

There’s no doubt that refactoring – done properly – will make code more understandable, more maintainable, and easier to change. But you need to do it right.
