Monday, June 27, 2011

Moving forward from Failure

System failures at scale are inescapable, as I have talked about before in the context of designing systems for failure in high-scale computing and how to apply these ideas to enterprise systems. Failures are wasted if you don’t learn enough from them; if the way that you design and deliver application systems, and the way that you deploy and run these systems, don’t get better as a result. You have to find lessons in each failure and constantly move forward, and make the system, and the team, more resilient.

I’ve read and heard a lot of good stuff over the last year or so from the DevOps community on handling system failures: how to minimize the risk of failures through deployment planning, how complex systems fail, how to communicate failures effectively to stakeholders so that you don’t destroy trust, and how to understand failures through postmortem analysis. This includes Jacob Loomis’ essay “How to Make Failure Beautiful” in the Web Operations book and John Allspaw’s keynote on Advanced Postmortem Fu and Human Error 101 at this year’s Velocity conference.

Many of these ideas fit with my own experience, and fill in some important gaps – they extend beyond the concerns of operating high-volume Web sites to broader, general problems in system operations and systems engineering, and deserve a wider audience outside of the Web operations and Web startup worlds.

Postmortems and Root Cause Analysis

The keys to successful postmortems and Root Cause Analysis are deceptively simple, and therefore difficult to get right:
  • get the right people together in a room.

  • create the right environment for blameless problem solving: make it safe for people to be open and honest, don’t point fingers, focus on facts and solving problems and how to get better.

  • postmortems are expensive and painful, and people want to get out of them as soon as possible. Don’t stop too soon, before the team has really understood the problems, and the solutions.

  • don’t be satisfied with a single root cause – complex system failures usually have multiple causes if you dig deep enough.

  • human error isn’t a root cause – it’s a symptom of something else that you’ve done wrong.

You can’t stop with Root Cause Analysis

Root Cause Analysis is important, but it is just the first step. Once you’ve reviewed a failure as a team and found your way to the root causes or at least identified some real problems, now you have to do something about it, beyond the immediate fixes and workarounds that the team made recovering from the incident.

There are straightforward things that you can do now – detective and corrective actions, quick fixes, low cost and low risk – so you do them. Fix a bug, plug a hole, add defensive coding, add some tests. Better logging and diagnostics and alerting, better error handling, better metrics – you can always do this stuff better – so that ops can see the problem coming or at least recognize it when it happens again. Better troubleshooting tools and training for ops.

Don’t stop with corrective actions either

But it’s still not enough to recover from a failure and patch the holes and tighten things up, or even build a better incident management capability. You have to make sure to find the lessons in each failure: really learn why the failure happened, how it relates to other failures or problems that you have, how you dealt with the failure when it happened, and what you need to do better the next time. Then you have to act to reduce the risk and cost of failures, and take steps to make sure that failures like this one, and even failures that aren’t like this one, don’t happen again. And you have to make sure that when the next failure does happen, the team is more prepared for it, and that you can recover faster and with less stress and impact to the business.

Some of these preventative actions are straightforward: training for developers and testers and ops, and better communications in and especially between teams, checklists, health checks and dependency checks, more disciplined change controls and deployment controls, and other safeties.

But most preventative changes, the ones that make a real difference, are deeper, more fundamental: fixing architecture, culture, organization, or management weaknesses. Fixing these kinds of problems takes longer, involves more people, and costs more. Making changes to the architecture – switching out a core technology or implementing a partitioning strategy to contain failures and to scale up – can take months to work out and implement. Organizational and management changes are hard because this directly impacts people’s jobs. Changing culture is even harder and takes longer, especially to make the changes stick.

You can’t improve what you can’t see – you need data

These problems are also hard to see and understand. It’s naïve to think that you can recognize or prove the need for fundamental changes to architecture or organization or management controls or culture from a postmortem meeting – it takes time and perspective and experience with more than one failure to see this kind of problem. And because the fixes are fundamental and expensive, they can be hard to justify, and it can be hard to get management and the business and your own people to buy in.

You need data to help you understand what you’re doing wrong, what you need to change or what you need to stop doing – and what you’re doing right, what you need to keep doing, and to build a case for change. At last year’s Velocity conference, John Allspaw explained how to use metrics to find patterns and trends in failures, to understand what’s hurting you the most, or hurting you the most often, what’s working well and what’s not. To find out what changes are safe, and what changes aren’t; what failures are cheap and easy to recover from, and what failures can take the company down.

Metrics are important to help you see problems, and to help show if you are learning and getting better. Track the types and severity of failures, frequency of failures, your response to failures – time to detect and time to recover from failures; and the frequency and type and size of changes, and the correlation between changes and failures (by type and size of changes, by type and severity of failures). And make sure to track regressions – the number of times that you take a step backward.
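As a rough illustration of the tracking described above, here is a minimal Python sketch. The record layout (`Incident`, `Change`), the field names, the change categories and the sample data are all assumptions made up for this example; they do not come from any particular tool. Time to recover is measured here from detection, which is one of several reasonable conventions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Change:
    change_id: str
    kind: str  # hypothetical categories, e.g. "code" or "config"


@dataclass
class Incident:
    severity: str
    started: datetime
    detected: datetime
    recovered: datetime
    caused_by: Optional[str] = None  # change_id that triggered it, if known
    regression: bool = False         # True if this problem was "fixed" before


def minutes(delta):
    return delta.total_seconds() / 60


def summarize(incidents, changes):
    """Response times, regression count, and failure rate per kind of change."""
    failed = {i.caused_by for i in incidents if i.caused_by}
    totals, failures = {}, {}
    for c in changes:
        totals[c.kind] = totals.get(c.kind, 0) + 1
        failures[c.kind] = failures.get(c.kind, 0) + (c.change_id in failed)
    return {
        "mean_minutes_to_detect": mean(minutes(i.detected - i.started) for i in incidents),
        # Assumption: recovery time is counted from detection, not from start.
        "mean_minutes_to_recover": mean(minutes(i.recovered - i.detected) for i in incidents),
        "regressions": sum(i.regression for i in incidents),
        "failure_rate_by_change_kind": {k: failures[k] / totals[k] for k in totals},
    }


# Invented sample data, for illustration only.
day = datetime(2011, 6, 1)
incidents = [
    Incident("high", day.replace(hour=10), day.replace(hour=10, minute=10),
             day.replace(hour=11), caused_by="c1"),
    Incident("low", day.replace(hour=12), day.replace(hour=12, minute=30),
             day.replace(hour=12, minute=50), regression=True),
]
changes = [Change("c1", "code"), Change("c2", "code"), Change("c3", "config")]
report = summarize(incidents, changes)
```

Even a small table like this, kept up to date after every incident, gives you something to trend over time and to bring to a discussion about where changes are hurting you.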

Deciding what to fix, and what not to

Metrics can also help you to decide what you can fix and change, and what’s not worth fixing or changing – at least not for now. You need to recognize when a problem is a true outlier, an isolated case that is unlikely to happen again and that you can accept and move forward. It’s important that you can see the difference between a problem that is a one-of, and a first-of – an indicator of something deeply wrong, a fundamental weakness in the way that the system was built or the way that you work. You don’t want to over-react to outliers, but you also can’t treat every problem as unique and miss seeing the patterns and connections, the underlying root causes.
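One simple way to look for patterns is to tag each postmortem with its contributing causes and count how often each tag comes back. The sketch below assumes a hypothetical mapping from incident name to cause tags; the names, tags and the threshold of two are invented for illustration, and a real threshold is a judgment call.

```python
from collections import Counter


def recurring_causes(postmortems, min_count=2):
    """Return cause tags that appear in at least min_count postmortems."""
    counts = Counter(
        tag for tags in postmortems.values() for tag in set(tags)
    )
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Invented sample data: each incident can have several contributing causes.
postmortems = {
    "incident-1": ["config drift", "missing alert"],
    "incident-2": ["capacity", "missing alert"],
    "incident-3": ["expired certificate"],
}
patterns = recurring_causes(postmortems)
```

A tag that shows up only once may still turn out to be a first-of, so a count like this supports the team's judgment rather than replacing it.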

There are diminishing returns in preventing (or even reducing the probability of) some problems, in trying to stop the unstoppable. Some problems are too rare, or too expensive or difficult to prevent – and trying to prevent these problems can introduce new complexities and new risks into the system. I agree with John Allspaw that there are situations when it makes more sense to focus on how to react and recover faster, on improving MTTR rather than MTBF, but you need to know when to make that case.
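To make the MTTR versus MTBF trade-off concrete, the sketch below uses the standard steady-state availability formula, MTBF / (MTBF + MTTR). The figures are invented for illustration, and the formula ignores everything about cost and effort, which is usually what decides the question.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails every 500 hours on average, takes 4 hours to recover.
baseline = availability(500, 4)

# Option A: prevent half of the failures (double MTBF).
doubled_mtbf = availability(1000, 4)

# Option B: recover twice as fast (halve MTTR).
halved_mttr = availability(500, 2)
```

Under this simple model, both options improve availability by the same amount, so the case for one over the other comes down to which is cheaper and less risky to achieve for the kind of failure in question.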

Moving forward from Failures is real work and needs to be managed

People don’t want to think about failures for long – they want to put the mistakes and confusion and badness behind them and move on. And the business and management want priorities back on day-to-day delivery as soon as possible. When failures happen, you need to act fast, get the most that you can out of the moment, make decisions and commitments and reinforce changes while the pain is still fresh.

But fixing real problems in the system or the way that people are organized or the way that they work or the way that they think, can’t be done right away. Building reliability and resilience and operability into how you work and how you think and how you build and test software and how you solve problems takes time and commitment and continuous reinforcement. Management, and the team, have to be made accountable for longer-term preventative actions, for making bigger changes and improvements.

This work needs to be recognized by the business and management and the team, built into the backlog, and actively managed. You need to remind people of the risks when you see them missing steps or trying to cut corners, or falling back into old patterns of thinking and acting. You’ll need to use metrics and cost data to drive behaviour and to drive change, and to decide how much to push and how often: are you changing too much too often, running too loose; or is change costing you too much, are you overcompensating?

A Resilient system is a DevOps problem

Development can’t solve all of these problems on its own, and neither can Operations. It needs the commitment of development (and by development, I mean everyone responsible for designing, developing, testing, securing and releasing the software) and operations (everyone responsible for building and securing and running the infrastructure, deploying the code, and running and monitoring the system). This is a real DevOps problem: development and operations have to be aligned and work together, communicating and collaborating, sharing responsibilities and technology. This is one of the places where I see DevOps adding real value in any organization.
