Thursday, August 20, 2015

How to Prevent Catastrophic Failures in Complex Distributed Systems

In his now-famous paper "How Complex Systems Fail", Dr. Richard Cook explains how and why failures happen in complex systems:

Some Rules of failure in Complex Systems

4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.

3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.

The net of this: Complex systems are essentially and unavoidably fragile. We can try, but we can’t stop them from failing – there are too many moving pieces, too many variables and too many combinations to understand and to test. And even the smallest change or mistake can trigger a catastrophic failure.

A New Hope

But new research at the University of Toronto on catastrophic failures in complex distributed systems offers some hope – a potentially simple way to reduce the risk and impact of these failures.

The researchers looked at distributed online systems that had been extensively reviewed and tested, but still failed in spectacular ways.

They found that most catastrophic failures were initially triggered by minor, non-fatal errors: mistakes in configuration, small bugs, hardware failures that should have been tolerated. Then, following rule #3 above, a specific and unusual sequence of events had to occur for the catastrophe to unfold.

The bad news is that this sequence of events can’t be predicted – or tested for – in advance.

The good news is that catastrophic failures in complex, distributed systems may actually be easier to prevent than anyone previously thought. Looking closer, the researchers found that almost all (92%) catastrophic failures are the result of incorrect handling of non-fatal errors. These mistakes in error handling caused the system to behave unpredictably, causing other errors, which weren’t always handled correctly or predictably, creating a domino effect.

More than half (58%) of catastrophic failures could be prevented by careful review and testing of error handling code. In 35% of the cases, the faults in error handling code were trivial: the error handler was empty or only logged a failure, or the logic was clearly incomplete. These are easy mistakes to find and fix. So easy that the researchers built a freely available static analysis checker for Java bytecode, Aspirator, to catch many of these problems.
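To make the "trivial" cases concrete, here is a minimal Java sketch of an empty handler, a log-only handler, and one possible alternative that tells the caller about the failure. Every name in it (ErrorHandlingSketch, ReplicaStore, flush, and the three methods) is hypothetical and chosen for illustration; none of it comes from the systems in the study, and the right fix in a real system depends on what the caller can do about the error.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: all names here are illustrative assumptions.
class ErrorHandlingSketch {

    /** Stand-in for any operation that can fail in a non-fatal way. */
    interface ReplicaStore {
        void flush() throws IOException;
    }

    /** In-memory log so the sketch needs no logging library. */
    static final List<String> log = new ArrayList<>();

    // Trivial fault 1: the handler is empty, so the failure vanishes.
    static boolean flushIgnoringErrors(ReplicaStore store) {
        try {
            store.flush();
        } catch (IOException e) {
            // nothing here: the caller never learns that flush() failed
        }
        return true; // reports success even when the flush failed
    }

    // Trivial fault 2: the handler only logs, then carries on as if all is well.
    static boolean flushLoggingOnly(ReplicaStore store) {
        try {
            store.flush();
        } catch (IOException e) {
            log.add("flush failed: " + e.getMessage());
        }
        return true; // still reports success
    }

    // One possible alternative: log and report the failure to the caller,
    // which can then retry, fail over, or stop.
    static boolean flushReportingFailure(ReplicaStore store) {
        try {
            store.flush();
            return true;
        } catch (IOException e) {
            log.add("flush failed: " + e.getMessage());
            return false;
        }
    }
}
```

With a store whose flush() always throws, the first two methods still return true, so the rest of the system keeps running on a false assumption. Only the third gives the caller a chance to react.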

In another 23% of the cases, the error handling logic of a non-fatal error was so wrong that basic statement coverage testing or careful code reviews would have caught the mistakes.
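Statement coverage of an error handler means the handler has to actually run in a test, which usually means injecting the fault on purpose. The following is a rough sketch of that idea in plain Java, with no test framework. HeartbeatSender, Node, and FlakyNode are hypothetical names invented for this example, and the retry-once policy is an assumption, not something taken from the study.

```java
import java.io.IOException;

// Hypothetical example: all names and the retry policy are illustrative assumptions.
class ErrorPathCoverageSketch {

    interface Node {
        void ping() throws IOException;
    }

    /** Code under test: tries a ping twice, then marks the node as down. */
    static class HeartbeatSender {
        boolean nodeDown = false;
        int attempts = 0;
        IOException lastError = null;

        void heartbeat(Node node) {
            for (int i = 0; i < 2; i++) {
                attempts++;
                try {
                    node.ping();
                    nodeDown = false;
                    return;
                } catch (IOException e) {
                    // Error path: only executes when ping() fails,
                    // so it is only covered if a test forces a failure.
                    lastError = e;
                }
            }
            nodeDown = true;
        }
    }

    /** Test double that fails a fixed number of calls, then succeeds. */
    static class FlakyNode implements Node {
        private int remainingFailures;

        FlakyNode(int failures) {
            this.remainingFailures = failures;
        }

        @Override
        public void ping() throws IOException {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new IOException("injected failure");
            }
        }
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        // One injected failure: the handler runs once and the retry succeeds.
        HeartbeatSender recovered = new HeartbeatSender();
        recovered.heartbeat(new FlakyNode(1));
        check(!recovered.nodeDown && recovered.attempts == 2, "single failure should be retried");

        // Two injected failures: the handler runs twice and the node is marked down.
        HeartbeatSender down = new HeartbeatSender();
        down.heartbeat(new FlakyNode(2));
        check(down.nodeDown && down.lastError != null, "repeated failure should mark node down");
    }
}
```

A test suite that only ever passes in a healthy Node would never execute the catch block or the final nodeDown assignment, which is exactly the kind of gap that basic statement coverage makes visible.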

The next challenge that the researchers encountered was convincing developers to take these mistakes seriously. They had to walk developers through understanding why small bugs in error handling, bugs that “would never realistically happen”, needed to be fixed – and why careful error handling is so important.

This is a challenge that we all need to take up – if we hope to prevent catastrophic failure in complex distributed systems.