Monday, June 27, 2011

Moving forward from Failure

System failures at scale are inescapable, as I have talked about before in the context of designing systems for failure in high-scale computing and how to apply these ideas to enterprise systems. Failures are wasted if you don’t learn enough from them; if the way that you design and deliver application systems, and the way that you deploy and run these systems, don’t get better as a result. You have to find lessons in each failure and constantly move forward, and make the system, and the team, more resilient.

I’ve read and heard a lot of good stuff over the last year or so from the DevOps community on handling system failures: how to minimize the risk of failures through deployment planning, how complex systems fail, how to communicate failures effectively to stakeholders so that you don’t destroy trust, and how to understand failures through postmortem analysis, including Jacob Loomis’ essay “How to Make Failure Beautiful” in the Web Operations book and John Allspaw’s keynote on Advanced Postmortem Fu and Human Error 101 at this year’s Velocity conference.

Many of these ideas fit with my own experience, and fill in some important gaps – they extend beyond the concerns of operating high-volume Web sites to broader, general problems in system operations and systems engineering, and deserve a wider audience outside of the Web operations and Web startup worlds.

Postmortems and Root Cause Analysis

The keys to successful postmortems and Root Cause Analysis are deceptively simple, and therefore difficult to get right:
  • get the right people together in a room.

  • create the right environment for blameless problem solving: make it safe for people to be open and honest, don’t point fingers, focus on facts and solving problems and how to get better.

  • postmortems are expensive and painful, and people want to get out of them as soon as possible. Don’t stop too soon, before the team has really understood the problems, and the solutions.

  • don’t be satisfied with a single root cause – complex system failures usually have multiple causes if you dig deep enough.

  • human error isn’t a root cause – it’s a symptom of something else that you’ve done wrong.

You can’t stop with Root Cause Analysis

Root Cause Analysis is important, but it is just the first step. Once you’ve reviewed a failure as a team and found your way to the root causes or at least identified some real problems, now you have to do something about it, beyond the immediate fixes and workarounds that the team made recovering from the incident.

There are straightforward things that you can do now – detective and corrective actions, quick fixes, low cost and low risk – so you do them. Fix a bug, plug a hole, add defensive coding, add some tests. Better logging and diagnostics and alerting, better error handling, better metrics – you can always do this stuff better – so that ops can see the problem coming or at least recognize it when it happens again. Better troubleshooting tools and training for ops.

Don’t stop with corrective actions either

But it’s still not enough to recover from a failure and patch the holes and tighten things up, or even build a better incident management capability. You have to make sure to find the lessons in each failure: really learn why the failure happened and how it relates to other failures or problems that you have, how you dealt with the failure when it happened, and what you need to do to do a better job the next time. Then you have to act to reduce the risk and cost of failures: take steps to make sure that failures like this one, and even failures that aren’t like this one, don’t happen again, and that when the next failure does happen, the team is more prepared for it and can recover faster, with less stress and less impact to the business.

Some of these preventative actions are straightforward: training for developers and testers and ops, and better communications in and especially between teams, checklists, health checks and dependency checks, more disciplined change controls and deployment controls, and other safeties.

But most preventative changes, the ones that make a real difference, are deeper, more fundamental: fixing architecture, culture, organization, or management weaknesses. Fixing these kinds of problems takes longer, involves more people, and costs more. Making changes to the architecture – switching out a core technology or implementing a partitioning strategy to contain failures and to scale up – can take months to work out and implement. Organizational and management changes are hard because this directly impacts people’s jobs. Changing culture is even harder and takes longer, especially to make the changes stick.

You can’t improve what you can’t see – you need data

These problems are also hard to see and understand. It’s naïve to think that you can recognize or prove the need for fundamental changes to architecture or organization or management controls or culture from a postmortem meeting – it takes time and perspective and experience with more than one failure to see this kind of problem. And because the fixes are fundamental and expensive, they can be hard to justify, and it can be hard to get management and the business and your own people to buy in.

You need data to help you understand what you’re doing wrong, what you need to change or what you need to stop doing – and what you’re doing right, what you need to keep doing, and to build a case for change. At last year’s Velocity conference, John Allspaw explained how to use metrics to find patterns and trends in failures, to understand what’s hurting you the most, or hurting you the most often, what’s working well and what’s not. To find out what changes are safe, and what changes aren’t; what failures are cheap and easy to recover from, and what failures can take the company down.

Metrics are important to help you see problems, and to help show if you are learning and getting better. Track the types and severity of failures, frequency of failures, your response to failures – time to detect and time to recover from failures; and the frequency and type and size of changes, and the correlation between changes and failures (by type and size of changes, by type and severity of failures). And make sure to track regressions – the number of times that you take a step backward.
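As a rough illustration of the tracking described above, here is a minimal Python sketch that computes mean time to detect, mean time to recover, and the share of changes that led to a failure. The `Incident` record, its fields, the severity scale and the sample data are all hypothetical; a real incident tracker will have its own shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout, for illustration only.
    started: datetime        # when the failure began
    detected: datetime       # when someone (or something) noticed
    recovered: datetime      # when service was restored
    severity: int            # 1 = worst (the scale is an assumption)
    caused_by_change: bool   # was a deployment or config change the trigger?


def summarize(incidents, changes_deployed):
    """Summarize detection time, recovery time and change failure rate.

    Both times are measured from the start of the failure, in minutes.
    Assumes at least one incident and at least one deployed change.
    """
    minutes_to_detect = [
        (i.detected - i.started).total_seconds() / 60 for i in incidents
    ]
    minutes_to_recover = [
        (i.recovered - i.started).total_seconds() / 60 for i in incidents
    ]
    change_failures = sum(1 for i in incidents if i.caused_by_change)
    return {
        "mean_time_to_detect_min": mean(minutes_to_detect),
        "mean_time_to_recover_min": mean(minutes_to_recover),
        "change_failure_rate": change_failures / changes_deployed,
    }


# Made-up sample data: two incidents in a period with 20 deployed changes.
t0 = datetime(2011, 6, 1, 9, 0)
t1 = t0 + timedelta(days=1)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70), 2, True),
    Incident(t1, t1 + timedelta(minutes=30), t1 + timedelta(minutes=90), 1, False),
]
summary = summarize(incidents, changes_deployed=20)
print(summary)
```

Running the same summary per month, per severity or per type of change is one way to start seeing trends and regressions rather than single data points.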

Deciding what to fix, and what not to

Metrics can also help you to decide what you can fix and change, and what’s not worth fixing or changing – at least not for now. You need to recognize when a problem is a true outlier, an isolated case that is unlikely to happen again and that you can accept and move forward. It’s important that you can see the difference between a problem that is a one-off and one that is a first-of – an indicator of something deeply wrong, a fundamental weakness in the way that the system was built or the way that you work. You don’t want to over-react to outliers, but you also can’t treat every problem as unique and miss seeing the patterns and connections, the underlying root causes.

There are diminishing returns in preventing (or even reducing the probability of) some problems, in trying to stop the unstoppable. Some problems are too rare, or too expensive or difficult to prevent – and trying to prevent these problems can introduce new complexities and new risks into the system. I agree with John Allspaw that there are situations when it makes more sense to focus on how to react and recover faster, on improving MTTR rather than MTBF, but you need to know when to make that case.
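One way to make the MTTR versus MTBF case with numbers is the standard steady-state availability formula, MTBF / (MTBF + MTTR). This formula is not from the talk referenced above, and the hours used below are made up; treat it as a sketch for framing the trade-off, not a model of any real system.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 500 hours, 4 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=4)

# Option A: recover twice as fast. Option B: fail half as often.
halve_mttr = availability(mtbf_hours=500, mttr_hours=2)
double_mtbf = availability(mtbf_hours=1000, mttr_hours=4)

print(f"baseline:    {baseline:.4%}")
print(f"halve MTTR:  {halve_mttr:.4%}")
print(f"double MTBF: {double_mtbf:.4%}")
```

Under this formula, halving recovery time and doubling the time between failures give the same availability, so the question becomes which of the two is cheaper and less risky to achieve for a given class of failure.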

Moving forward from Failures is real work and needs to be managed

People don’t want to think about failures for long – they want to put the mistakes and confusion and badness behind them and move on. And the business and management want priorities back on day-to-day delivery as soon as possible. When failures happen, you need to act fast, get the most that you can out of the moment, make decisions and commitments and reinforce changes while the pain is still fresh.

But fixing real problems in the system or the way that people are organized or the way that they work or the way that they think, can’t be done right away. Building reliability and resilience and operability into how you work and how you think and how you build and test software and how you solve problems takes time and commitment and continuous reinforcement. Management, and the team, have to be made accountable for longer-term preventative actions, for making bigger changes and improvements.

This work needs to be recognized by the business and management and the team and built in to the backlog, and actively managed. You need to remind people of the risks when you see them missing steps or trying to cut corners, or falling back into old patterns of thinking and acting. You’ll need to use metrics and cost data to drive behaviour and to drive change, and to decide how much to push and how often: are you changing too much too often, running too loose; or is change costing you too much, are you overcompensating?

A Resilient system is a DevOps problem

Development can’t solve all of these problems on its own, and neither can Operations. Building a resilient system needs the commitment of both development (and by development, I mean everyone responsible for designing, developing, testing, securing and releasing the software) and operations (everyone responsible for building and securing and running the infrastructure, deploying the code, and running and monitoring the system). This is a real DevOps problem: development and operations have to be aligned and work together, communicating and collaborating, sharing responsibilities and technology. This is one of the places where I see DevOps adding real value in any organization.

Tuesday, June 21, 2011

Estimation is not a Black Art

Construx Software is one of the few companies that take the problems of estimation in software development seriously. Earlier this year I attended Steve McConnell’s Software Estimation in Depth course based on his book Software Estimation: Demystifying the Black Art. Two days reviewing factors in estimation and the causes of errors in estimating, the differences between estimates and targets and commitments, how to present estimates to the customer and to management, the importance of feedback and data, estimating techniques that work and why they work, and when and how to apply these techniques to different estimating problems.

You can get a lot out of reading Steve’s book, but the course setting gives you a chance to work with other people and solve real problems in estimation and planning, dig deep into questions with the man himself, and it introduces new material and research not included in the book. The key takeaways from this course for me were:

The Basics

Even simple techniques work much better than “expert judgment” (aka wild ass guesses). With training and using straightforward tools and simple mathematical models and following a bit of structure, you can get to 70-75% accuracy in estimating, which is good enough for most of us.

Never estimate on the spot – it is professionally irresponsible, and you’re setting yourself, your team, and your customer up to fail. Always give the person who is doing the estimate time to think. Everyone knows this, but it’s important to be reminded of it, especially under pressure.

People’s systemic error tendency is to underestimate. Many organizations underestimate by a factor of at least 2x. This also applies to comparing back to actual costs: a common mistake is to remember how much you estimated something would take, not how long it actually took.

Estimates are always better if you can use real data to calibrate them: to compare estimates against evidence of how your organization has performed in the past. I knew this. But what I didn’t know is that you don’t need a lot of data points: even a few weeks of data can be useful, especially if it contradicts your judgment and forces you to question your assumptions.
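A minimal Python sketch of this kind of calibration, assuming you have recorded both the original estimate and the actual effort for a handful of finished tasks. The history below is made up, and using the median ratio is my own simplification rather than a technique taken from the course.

```python
from statistics import median

# Hypothetical history: (estimated_days, actual_days) for finished tasks.
history = [(5, 8), (3, 6), (10, 15), (2, 5)]


def calibration_factor(history):
    """Median ratio of actual effort to estimated effort."""
    return median(actual / estimate for estimate, actual in history)


def calibrated(raw_estimate, history):
    """Scale a raw estimate by how far off past estimates have been."""
    return raw_estimate * calibration_factor(history)


factor = calibration_factor(history)
adjusted = calibrated(10, history)
print(f"historical factor: {factor:.2f}")
print(f"raw estimate of 10 days, calibrated: {adjusted:.1f} days")
```

Even with only four data points, a factor well above 1 is evidence worth putting in front of the team before the next round of estimates.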

Use different techniques at different stages as you move along a project’s Cone of Uncertainty, depending on how much information you have available at the time, and how high the stakes are – what the cost of estimation errors could be to the business. If you need higher confidence or higher quality estimates, use multiple techniques and compare the results.

I like T-Shirt sizing to help in planning and making a business case. Developers come up with a rough order of magnitude estimate on cost or effort (small, medium, large, extra-large) while the Customer/Product Owner/Product Manager does the same for the expected business value of a feature or change request. Then you match them up: Extra-Large business value and Small development cost gets a big thumbs-up. Small business value and Extra-Large cost gets a big thumbs down.
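The matching step can be sketched in a few lines of Python. Mapping sizes to the numbers 1 through 4 and ranking by the difference between value and cost are assumptions made for illustration, and the feature names are invented; any scheme that puts high-value, low-cost work at the top would do.

```python
# Assumed ordinal scale for T-shirt sizes.
SIZES = {"S": 1, "M": 2, "L": 3, "XL": 4}


def rank(features):
    """Sort features by (business value size - development cost size), best first."""
    return sorted(
        features,
        key=lambda f: SIZES[f["value"]] - SIZES[f["cost"]],
        reverse=True,
    )


# Hypothetical backlog: value sized by the product owner, cost by developers.
features = [
    {"name": "export report", "value": "S", "cost": "XL"},
    {"name": "single sign-on", "value": "XL", "cost": "S"},
    {"name": "new dashboard", "value": "L", "cost": "M"},
]

ranked = [f["name"] for f in rank(features)]
print(ranked)
```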

Estimating by analogy – comparing the work that you need to estimate to something similar that you’ve done before – is a simple technique that we use a lot. It works well if you are adding “another screen”, writing “another report”, or building “another interface”. It’s a good fit for maintenance, if you’ve been working with the same code for a while and know most of the gotchas.

Intact teams (like mine) tend to view problems in a similar way – ideas and experience converge. This is good if you work on similar problems, in maintenance or a long-running project, because you can make decisions quickly and confidently. But this is bad if you are confronted with new problems – you need techniques like Wideband Delphi to challenge people’s thinking patterns and let new ideas in.

Agile Estimating and Story Points and Velocity

We spent more time than I expected exploring issues in estimating Agile development work. Incremental and iterative planning in Agile development poses problems for a lot of organizations. The business/customer needs to know when the system will be ready and how much it will cost so that they can make their own commitments, but the team ideally wants to work this out iteratively, as they go. This means instead that they have to define the scope and cost as much as they can upfront, and then work out the details in each sprint – more like staged or spiral development methods. Once they have the backlog sized, they can plan out sprints based on forecast velocity or historical velocity – if they can figure this out.

I’m still not convinced that Agile story point estimating is better than (or as good as) other techniques, or that programmers sitting around a table playing Planning Poker is really an effective alternative to thoughtful estimating. Story points create some problems in planning new project development, because they are purposefully abstract – too abstract to be useful in helping to make commitments to the business. You might have an idea of how many story points give or take you have to deliver, but what’s a story point in the real world? People who use story points can’t seem to agree on how to define a story point, what a story point means and how or when to use them in estimating.

More fundamentally, you can’t know what a story point costs until the team starts to deliver, by measuring the team’s velocity (the actual story points completed in an iteration). This leaves you with a bootstrapping problem: you can’t estimate how long it is going to take to do something until you do it. You can look at data from other projects (if you’ve tracked this data), but you’re not supposed to compare story points across projects because each project and each team is unique, and their idea of story points may be different. So you have to make an educated guess as to how long it is going to take you to deliver what’s in front of you, and now we’re back to the “expert judgment” problem.

The good news is that it won’t take long to calibrate Story Point estimates against evidence from the team’s actual velocity. Mike Cohn in Succeeding with Agile says that you need at least 5 iterations to establish confidence in a team’s velocity. But Construx has found that you can have a good understanding of a team’s velocity in as little as 2 iterations. That’s not a long time to wait for some kind of answer on how long it should take to get the work that you know that you have in front of you done.
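Once you have a couple of iterations of velocity data, turning a sized backlog into a rough schedule is simple arithmetic. The sketch below is my own illustration, not a method from the course or from Mike Cohn’s book: it divides the backlog by the best, average and worst observed velocity to get a range. The backlog size and velocities are made up.

```python
import math


def forecast_iterations(backlog_points, velocities):
    """Return (best, expected, worst) iterations needed to finish the backlog.

    best uses the highest observed velocity, expected the average,
    and worst the lowest. Assumes at least one non-zero velocity sample.
    """
    average = sum(velocities) / len(velocities)
    return (
        math.ceil(backlog_points / max(velocities)),
        math.ceil(backlog_points / average),
        math.ceil(backlog_points / min(velocities)),
    )


# Hypothetical: 200 points of sized backlog, two completed iterations.
forecast = forecast_iterations(backlog_points=200, velocities=[18, 22])
print(forecast)
```

With only two samples the range is a starting point to refine after each iteration, not a commitment.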

There is more to estimating

There is a lot more to estimating, and to this estimation course: details on different techniques and rules of thumb, models and software tools, how to measure estimation error, how to turn estimates into schedules, how to handle schedule compression and how to stay out of the Impossible Zone. To get a taste of the course, check out this webinar on the 10 Deadly Sins of Software Estimation.

Software estimation, like other problems in managing software development, doesn’t require a lot of advanced technical knowledge. It comes down to relatively simple ideas and practices that need to be consistently applied; to fundamentals and discipline. That doesn’t mean that it is easy to do it right, but it’s not dark magic either.

Monday, June 6, 2011

Safer Software through Secure Frameworks

We have to make it easier for developers to build secure apps, especially Web apps. We can't keep forcing everybody who builds an application to understand and plug all of the stupid holes in how the Web works on their own — and to do this perfectly right every time. What we need is implementation-level security issues taken care of at the language and framework level. So that developers can focus on their real jobs: solving design problems and writing code that works.

Go to the SANS Application Security Street Fighter for my latest post on how to write safer software using secure frameworks, and application frameworks that are secure, and to read more about the OWASP Developer Outreach.