With so many moving parts in high-scale systems, failures of one kind or another are common: disks (especially), individual servers and whole racks, networks, data center outages and geographic disasters, database failures, middleware and application software failures, and of course human error – mistakes in configuration and operations. As you scale up, you will also encounter multiple simultaneous failures and failure combinations, normal accidents, more heisenbugs and mandelbugs, and data corruption and other silent failures.
These challenges are taken to extremes in petascale and exascale HPC platforms. The Mean Time to Failure (MTTF) on petascale systems (the largest supercomputers running today) can be as low as 1 day. According to Michael Heroux at Sandia National Laboratories, exascale systems, consisting of millions of processors and capable of handling 1 million trillion calculations per second (these computers don’t exist yet, but are expected in the next 10 years),
“will have very high fault rates and will in fact be in a constant state of decay. ‘All nodes up and running’, our current sense of a well-functioning scalable system, will not be feasible. Instead we will always have a portion of the machine that is dead, a portion that is dying and perhaps producing faulty results, another that is coming back to life and a final, hopefully large, portion that is computing fast and accurate results.”

For enterprise systems, of course, rates of component or combination failures will be much lower, but the same risks exist, and the principles still hold. Recognizing that failures can and will happen, it is important to:
    Identify failures as quickly as possible.
    Minimize and contain the impact of failures.
    Recover as quickly as possible.
At QCon SF 2009, Jason McHugh described how Amazon’s S3 high-scale cloud computing service is architected for resiliency in the face of so many failure conditions. He lists 7 key principles for system survivability at high scale:
1. Decouple upstream and downstream processes, and protect yourself from dependencies in both directions when problems occur: overload and spikes from upstream, failures and slow-downs downstream (see the bounded hand-off sketch after this list).
2. Design for large failures.
3. Don’t trust data on the wire or on disk.
4. Elasticity – resources can be brought online at any time.
5. Monitor, extrapolate and react: instrument, and create feedback loops.
6. Design for frequent single system failures.
7. Hold “Game Days”: shut down a set of servers or a data center in production, and prove that your recovery strategy works in the real world.
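To make the first principle concrete, here is a minimal sketch, assuming a Java service; the class and method names and the queue sizes are illustrative, not anything from Amazon. A bounded queue decouples an upstream producer from a downstream consumer: when the downstream slows, the queue fills and the upstream is told to shed or defer work instead of blocking forever.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Bounded hand-off between an upstream producer and a downstream consumer.
    public class BoundedHandoff {
        private final BlockingQueue<Runnable> work = new ArrayBlockingQueue<>(1000);

        // Called by the upstream: returns false (shed or defer the work)
        // rather than blocking indefinitely when the downstream falls behind.
        public boolean submit(Runnable task) throws InterruptedException {
            return work.offer(task, 50, TimeUnit.MILLISECONDS);
        }

        // Called by the downstream worker loop: waits briefly, then returns null
        // so the caller can check shutdown and health flags instead of hanging.
        public Runnable take() throws InterruptedException {
            return work.poll(500, TimeUnit.MILLISECONDS);
        }
    }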
Failure management needs to be architected in, and handled at the implementation level: you have to expect, and handle, failures for every API call. This is one of the key ideas in Michael Nygard’s Release It!, that developers of data center-ready, large-scale systems cannot afford to be satisfied dealing with abstractions, that they must understand the environment that the system runs in, and that they have to learn not to trust it. This book offers some effective advice and ideas (“stability patterns”) to protect against horizontal failures (Chain Reactions) and vertical failures (Cascading Failures).
For vertical failures between tiers, use timeouts and delayed retry mechanisms to prevent hangs and cascading slow-downs – protect request handling threads on all remote calls (and resource pool checkouts), and never wait forever. Release It! introduces the Circuit Breaker pattern to manage this: after too many failures, trip the breaker and stop making the call, back away, then try again – if the problem has not been corrected, trip it and back away again. And remember to fail fast – don’t get stuck blocking or deadlocked if a resource is not available or a remote call cannot be completed.
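As a rough sketch of these ideas – not the code from Release It!, and with illustrative names, thresholds and timeouts – a breaker that bounds every remote call with a timeout, counts failures, and fails fast while open might look like this:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReference;

    // Simplified circuit breaker: never wait forever on a remote call, and
    // fail fast while the breaker is open instead of tying up request threads.
    public class CircuitBreaker {
        private final int failureThreshold = 5;
        private final Duration openInterval = Duration.ofSeconds(30);
        private final ExecutorService pool = Executors.newCachedThreadPool();
        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private final AtomicReference<Instant> openedAt = new AtomicReference<>();

        public <T> T call(Callable<T> remoteCall, long timeoutMillis) throws Exception {
            Instant opened = openedAt.get();
            if (opened != null && Instant.now().isBefore(opened.plus(openInterval))) {
                throw new IllegalStateException("circuit open - failing fast");
            }
            Future<T> future = pool.submit(remoteCall);
            try {
                T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS); // never wait forever
                consecutiveFailures.set(0);
                openedAt.set(null);
                return result;
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true);
                if (consecutiveFailures.incrementAndGet() >= failureThreshold) {
                    openedAt.set(Instant.now()); // trip the breaker and back away
                }
                throw e;
            }
        }
    }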
In horizontal Chain Reactions within a tier, the failure of one component raises the possibility of failure in its peers, as the rebalanced workload threatens to overwhelm the remaining services. Protect the system from Chain Reactions with partitions and Bulkheads – build internal firewalls to isolate workloads on different servers, services or resource pools. Virtual servers are one way to partition and isolate workloads, although we chose not to use virtual servers in production because of the extra complexity in management: we have had a lot of success with virtualization for provisioning test environments and management services, but for high-volume, low-latency transaction processing we have found that separate physical servers and application partitioning are faster and more reliable.
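A minimal Bulkhead sketch, assuming a Java service with two hypothetical downstream dependencies (pricing and reporting): give each dependency its own small, fixed-size thread pool, so that a slow or failing dependency can exhaust only its own threads and not the ones serving everything else.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Separate, fixed-size pools act as internal firewalls between workloads.
    public class Bulkheads {
        private final ExecutorService pricingPool = Executors.newFixedThreadPool(10);
        private final ExecutorService reportingPool = Executors.newFixedThreadPool(4);

        public <T> Future<T> priceOrder(Callable<T> call) { return pricingPool.submit(call); }

        public <T> Future<T> runReport(Callable<T> call) { return reportingPool.submit(call); }
    }

A production version would also bound each pool’s work queue and reject excess work, so that a backlog behind one bulkhead cannot grow without limit.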
Application and data partitioning is a fundamental design idea in many systems – for example, exchange trading engines, where the market is broken down into many different trading products or groups of products, isolated in different partitions for scalability and workload balancing. James Hamilton, formerly with Microsoft and now a Distinguished Engineer on the Amazon Web Services team, highlights the importance of partitioning for application scaling as well as fault isolation and multi-tenancy in On Designing and Deploying Internet-Scale Services, an analysis of key issues in the operations management and architecture of systems at high scale, with a focus on efficiency and resiliency. He talks about the importance of minimizing cross-partition operations and of managing partitions intelligently, at a fine-grained level (a simplified routing sketch follows below). Some of the other valuable ideas here include:
    Design for failure
    Zero trust of underlying components
    At scale, even unlikely, unusual combinations can become commonplace
    Version everything
    Minimize dependencies
    Practice your failover operations regularly - in production
Read this paper.
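As a simplified illustration of fine-grained partitioning – the class name and the hash-based routing here are assumptions for this sketch, not taken from Hamilton’s paper – each key (a product, customer or account) is routed to exactly one partition, so its workload and any failure stay contained there:

    import java.util.List;

    // Route each key to one partition; each partition is served by its own
    // service instance (or group of servers), isolating load and failures.
    public class PartitionRouter {
        private final List<String> partitionEndpoints;

        public PartitionRouter(List<String> partitionEndpoints) {
            this.partitionEndpoints = partitionEndpoints;
        }

        public String endpointFor(String key) {
            int idx = Math.floorMod(key.hashCode(), partitionEndpoints.size());
            return partitionEndpoints.get(idx);
        }
    }

Cross-partition operations then require fan-out across endpoints, which is exactly why minimizing them matters; a real design would typically use consistent hashing so that partitions can be added or removed without reshuffling most keys.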
All of these ideas build on research work in Recovery Oriented Computing at Berkeley and Stanford, which set out to solve reliability problems in systems delivered “in Internet time” during the dot com boom. The basics of Recovery Oriented Computing are:
Failures will happen, and you cannot predict or avoid them. Confront this fact and accept it.
Avoiding failures and maximizing Mean Time to Failure – through careful architecture, engineering and maintenance, and by minimizing opportunities for human error in operations with training, procedures and tools – is necessary, but it is not enough. For “always-on” availability, you also need to minimize the Mean Time to Recovery (MTTR).
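As a rough illustration: steady-state availability is approximately MTTF / (MTTF + MTTR). With an MTTF of 1,000 hours, an MTTR of 10 hours gives about 99% availability, while cutting MTTR to around 6 minutes pushes availability to roughly 99.99% – the same failure rate, but a dramatically better outcome for users.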
Recovery consists of 3 steps:
1. Detect the problem
2. Diagnose the cause of the problem
3. Repair the problem and restore service.
Identify failures as quickly as possible. Minimize and contain the impact of failures. Provide tested and reliable tools for system administrators to recover the system quickly and safely.
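A minimal sketch of the first step – detecting failures quickly – assuming a simple heartbeat scheme (the class name and the 10-second timeout are hypothetical):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Each node reports a heartbeat; a node not heard from within the timeout
    // is flagged as suspect so that diagnosis and recovery can start quickly.
    public class HeartbeatMonitor {
        private final Duration timeout = Duration.ofSeconds(10);
        private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

        public void heartbeat(String nodeId) {
            lastSeen.put(nodeId, Instant.now());
        }

        public boolean isSuspect(String nodeId) {
            Instant seen = lastSeen.get(nodeId);
            return seen == null || Instant.now().isAfter(seen.plus(timeout));
        }
    }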
One of the key enablers of Recovery Oriented Computing is, again, partitioning: to isolate faults and contain damage; to simplify diagnosis; to enable rapid online repair, recovery and restart at the component level; and to support dynamic provisioning – elastic, or “on demand” capacity upgrades.
As continuing reports of cloud computing failures and other large-scale SaaS problems show, these problems have not been solved, and as the sheer scale of these systems continues to increase, they may never be fully solved. But there is still a lot that we can learn from the work that has been done so far.