Monday, March 22, 2010

Failure Isolation and Recovery: Learning from High-Scale and Extreme-Scale Computing

While I have been building business-critical enterprise systems for a long time, I haven't worked on high-scale cloud computing or Internet-scale architectures with tens of thousands or hundreds of thousands of servers. There are some fascinating, hard problems to solve in engineering systems at this scale, but the most interesting to me are the problems in deployment and operations management, and especially how to deal with failure.

With so many moving parts in high-scale systems, failures of one kind or another are common: disks (especially), individual servers and groups of servers, racks, networks, data center outages and geographic disasters; database, middleware and application software failures; and of course human error – mistakes in configuration and operations. As you scale up, you will also encounter multiple simultaneous failures and failure combinations, normal accidents, more heisenbugs and mandelbugs, and data corruption and other silent failures.

These challenges are taken to extremes in petascale and exascale HPC platforms. The Mean Time to Failure (MTTF) on petascale systems (the largest supercomputers running today) can be as low as 1 day. According to Michael Heroux at Sandia National Laboratories, exascale systems – machines with millions of processors, capable of a million trillion calculations per second, which don’t exist yet but are expected within the next 10 years –
“will have very high fault rates and will in fact be in a constant state of decay. ‘All nodes up and running’, our current sense of a well-functioning scalable system, will not be feasible. Instead we will always have a portion of the machine that is dead, a portion that is dying and perhaps producing faulty results, another that is coming back to life and a final, hopefully large, portion that is computing fast and accurate results.”
For enterprise systems, of course, rates of component or combination failures will be much lower, but the same risks exist, and the principles still hold. Recognizing that failures can and will happen, it is important to:

    Identify failures as quickly as possible.

    Minimize and contain the impact of failures.

    Recover as quickly as possible.

At QCon SF 2009, Jason McHugh described how Amazon’s S3 high-scale cloud computing service is architected for resiliency in the face of so many failure conditions. He lists 7 key principles for system survivability at high-scale:

1. Decouple upstream and downstream processes, and protect yourself from problems in dependencies upstream and downstream: overload and spikes from upstream, failures and slow-downs downstream.

2. Design for large failures.

3. Don’t trust data on the wire or on disk.

4. Elasticity – resources can be brought online at any time.

5. Monitor, extrapolate and react: instrument, and create feedback loops.

6. Design for frequent single system failures.

7. Hold “Game Days”: shut down a set of servers or a data center in production, and prove that your recovery strategy works in the real world.

Failure management needs to be architected in, and handled at the implementation level: you have to expect, and handle, failures for every API call. This is one of the key ideas in Michael Nygard’s Release It!: developers of data center-ready, large-scale systems cannot afford to be satisfied dealing with abstractions; they must understand the environment that the system runs in, and they have to learn not to trust it. The book offers some effective advice and ideas (“stability patterns”) to protect against horizontal failures (Chain Reactions) and vertical failures (Cascading Failures).

For vertical failures between tiers, use timeouts and delayed retry mechanisms to prevent hangs – protect request handling threads on all remote calls (and resource pool checkouts), and never wait forever. Release It! introduces the Circuit Breaker pattern to manage this: after too many failures, stop calling the troubled resource and back away, then let a trial call through – if the problem has not been corrected, back away again. And remember to fail fast – don’t block or deadlock if a resource is not available or a remote call cannot be completed.
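As a rough illustration, here is a minimal sketch of the pattern (not Nygard’s reference implementation – the failure threshold, cool-down period and synchronization strategy are arbitrary assumptions):

    // Minimal circuit breaker sketch: after too many consecutive failures the
    // breaker trips and callers fail fast instead of waiting on a sick resource.
    // After a cool-down period one trial call is allowed through; if it fails,
    // the breaker trips again. Threshold and cool-down values are illustrative.
    public class CircuitBreaker {
        private static final int FAILURE_THRESHOLD = 5;        // assumption
        private static final long COOL_DOWN_MILLIS = 30000L;   // assumption

        private int consecutiveFailures = 0;
        private long trippedAt = 0L;

        public synchronized <T> T call(java.util.concurrent.Callable<T> remoteCall) throws Exception {
            if (isOpen()) {
                // Fail fast: don't tie up a request-handling thread on a known-bad dependency.
                throw new IllegalStateException("circuit open - failing fast");
            }
            try {
                T result = remoteCall.call();
                consecutiveFailures = 0;                        // success closes the breaker
                return result;
            } catch (Exception e) {
                if (++consecutiveFailures >= FAILURE_THRESHOLD) {
                    trippedAt = System.currentTimeMillis();     // trip and back away
                }
                throw e;
            }
        }

        private boolean isOpen() {
            if (consecutiveFailures < FAILURE_THRESHOLD) {
                return false;
            }
            // Once the cool-down has elapsed, allow a trial call through.
            return System.currentTimeMillis() - trippedAt < COOL_DOWN_MILLIS;
        }
    }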

In horizontal Chain Reactions within a tier, the failure of one component raises the possibility of failure in its peers, as the workload rebalances and can overwhelm the remaining services. Protect the system from Chain Reactions with partitions and Bulkheads – build internal firewalls to isolate workloads on different servers or services or resource pools. Virtual servers are one way to partition and isolate workloads, although we chose not to use virtual servers in production because of the extra complexity in management – we have had a lot of success with virtualization for provisioning test environments and management services, but for high-volume, low-latency transaction processing we’ve found that separate physical servers and application partitioning are faster and more reliable.
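The Bulkhead idea can be sketched with something as simple as separate, bounded thread pools per workload – a minimal illustration, where the workload names, pool sizes and queue sizes are assumptions:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Bulkhead sketch: each workload gets its own small, bounded thread pool.
    // If reporting work hangs or backs up, order processing still has its own
    // threads and queue - the failure is contained behind its bulkhead.
    public class Bulkheads {
        private static ExecutorService boundedPool(int threads, int queueSize) {
            return new ThreadPoolExecutor(
                    threads, threads, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(queueSize),
                    new ThreadPoolExecutor.AbortPolicy());  // reject rather than block when full
        }

        // Separate, isolated pools - one bulkhead per workload (sizes are illustrative).
        private final ExecutorService orderPool     = boundedPool(8, 100);
        private final ExecutorService reportingPool = boundedPool(2, 10);

        public void submitOrderWork(Runnable work)     { orderPool.submit(work); }
        public void submitReportingWork(Runnable work) { reportingPool.submit(work); }
    }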

Application and data partitioning is a fundamental design idea in many systems, for example exchange trading engines, where the market is broken down into many different trading products or groups of products, isolated in different partitions for scalability and workload balancing purposes. James Hamilton, formerly with Microsoft and now a Distinguished Engineer on the Amazon Web Services team, highlights the importance of partitioning for application scaling as well as fault isolation and multi-tenancy in On Designing and Deploying Internet-Scale Services, an analysis of key issues in the operations management and architecture of systems at high-scale, with a focus on efficiency and resiliency. He talks about the importance of minimizing cross-partition operations, and of managing partitions intelligently, at a fine-grained level. Some of the other valuable ideas here include:

    Design for failure

    Zero trust of underlying components

    At scale, even unlikely, unusual combinations can become commonplace

    Version everything

    Minimize dependencies

    Practice your failover operations regularly - in production

Read this paper.

All of these ideas build on research work in Recovery Oriented Computing at Berkeley and Stanford, which set out to solve reliability problems in systems delivered “in Internet time” during the dot com boom. The basics of Recovery Oriented Computing are:

Failures will happen, and you cannot predict or avoid failures. Confront this fact and accept it.

Avoiding failures and maximizing Mean Time to Failure by taking careful steps in architecture and engineering and maintenance, and minimizing opportunities for human error in operations through training, procedures and tools – all of this is necessary, but it is not enough. For “always-on” availability, you also need to minimize the Mean Time to Recovery (MTTR).

Recovery consists of 3 steps:

1. Detect the problem
2. Diagnose the cause of the problem
3. Repair the problem and restore service.

Identify failures as quickly as possible. Minimize and contain the impact of failures. Provide tested and reliable tools for system administrators to recover the system quickly and safely.

One of the key enablers of Recovery Oriented Computing is, again, partitioning: to isolate faults and contain damage; to simplify diagnosis; to enable rapid online repair, recovery and restart at the component level; and to support dynamic provisioning – elastic, or “on demand” capacity upgrades.
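To make the partitioning idea a little more concrete, here is a minimal sketch of key-based routing to isolated partitions, where a sick partition can be taken offline and recovered without touching its peers. The partition count, the notion of a “product” key, and the health flag are assumptions for illustration only:

    import java.util.ArrayList;
    import java.util.List;

    // Partition routing sketch: work is routed by key (e.g. a trading product) to a
    // fixed partition, so a failure in one partition is contained there and that
    // partition can be repaired and restarted on its own.
    public class PartitionRouter {
        static class Partition {
            final int id;
            volatile boolean healthy = true;   // flipped by monitoring when the partition is sick
            Partition(int id) { this.id = id; }
            void process(String productKey) { /* handle the message for this product */ }
        }

        private final List<Partition> partitions = new ArrayList<Partition>();

        public PartitionRouter(int partitionCount) {
            for (int i = 0; i < partitionCount; i++) {
                partitions.add(new Partition(i));
            }
        }

        public void route(String productKey) {
            int index = (productKey.hashCode() & 0x7fffffff) % partitions.size();
            Partition p = partitions.get(index);
            if (!p.healthy) {
                // Fail fast for this partition only; the rest of the system keeps running.
                throw new IllegalStateException("partition " + p.id + " is offline for recovery");
            }
            p.process(productKey);
        }
    }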

As continuing reports of cloud computing failures and other large scale SaaS problems show, these problems have not been solved, and as the sheer scale of these systems continues to increase, they may never be solved. But there is still a lot that we can learn from the work that has been done so far.

Saturday, March 13, 2010

Secure Software Development: Awareness Training

I’ve been looking for a good basic awareness course on software security as an introduction for new developers and as a refresher for the rest of the team. A course that covers the important bases, and reinforces fundamental issues in building secure software. I prefer online training for something like this: if you have the discipline, online and on-demand training works because it is easier to schedule, people can go as fast or as slow as they like, and of course it is less expensive.

Stanford Software Security Foundations

Last year I took the Software Security Foundations course, the first course in Stanford University’s Advanced Computer Security certificate program: 6 courses, all offered online, on-demand.

Foundations is a basic introduction to software security for developers and technical managers. This course was designed by (and is principally delivered by) Neil Daswani, a former security program manager at Google, a Stanford alumnus, and now a principal at Dasient, a web security startup.

The course is based on Mr. Daswani’s book Foundations of Security: What Every Programmer Needs to Know: it is useful to have a copy of the book handy while going through the course. Foundations is a day’s worth of lectures made available as online videos for a 3-month period, with slides that can be downloaded for printing, and an exam at the end to ensure that you aren’t lazy about following the material. It covers the fundamentals of software security, including:

Security Principles

Authentication, authorization, access control, confidentiality, data integrity, non-repudiation, risk management, and secure system design principles including least privilege, fail-safe and defense-in-depth. Good, but no real surprises here.

Secure Programming

Buffer overflows, SQL injection, password management, cross-site request forgery, cross-site scripting, and common mistakes in crypto. There was good coverage of SQL injection and cross-domain security threats, particularly XSRF. However, I found the explanation of Cross Site Scripting confusing, even after reading through the section on cross-domain security in the book – it’s not a straightforward problem, granted, but it can be explained in a simpler way. The examples and problems are biased slightly towards C/C++ programmers, but this should not present a problem for programmers working in other environments. The course covers most of the important bases, with the exception of input validation, which needs more attention.

Introduction to Cryptography

A walkthrough of symmetric encryption and public key cryptography, and then a brief discussion of advanced research in cryptography from Stanford Professor Dan Boneh, an expert in applied cryptography. The explanation of cryptographic primitives was especially lucid, and, well, beautiful: I didn’t know that cipher block chaining could be beautiful, but it is. Some of the more advanced issues in cryptography are covered at a cursory level only (not a lot of it will stick with you if you aren’t already familiar with the subject), and you are referred to other courses offered in the program if you want to get a good, basic understanding of how to use cryptography and secure protocols.

SANS Software Security Awareness

In January I checked out the On Demand Software Security Awareness course from the SANS Institute.

I have had success with other courses from SANS before, classroom and On Demand. In particular the course SANS Security Leadership Essentials for Managers is excellent: a challenging and exhaustive 6-day course covering pretty much everything a technical manager would need to know about IT security from technical, project management, risk management, strategic, and legal and ethical perspectives.

Software Security Awareness is a short, 3-hour survey course, offered online as recorded audio lectures and a slide show. SANS prints and binds a copy of the slides and couriers a copy to you shortly after you register for the course. The course covers:

- vulnerability/patch cycle
- security architecture
- principles of software security
- security design
- implementation (coding and deployment)
- input issues
- code review and security testing.

Unfortunately, the course does not start out strong: the walkthrough of the vulnerability/patch cycle is supposed to help build a case for secure software development, but it takes too long to make too fine a point, and it’s hard to stay interested.

The next sections on architecture, principles of software security, and design are confused by a weakness in organization. It’s not clear where architecture ends and design begins, and there is unnecessary repetition between the sections. The material would flow better if the discussion started with foundational principles of software security (which are well covered in the course), then proceeded through architecture and then design. Architecture should cover security requirements, risk analysis and threat modeling to determine “how much security is enough”, defense-in-depth, attack surface analysis and minimizing attack surface, complexity and economy of mechanism, cost considerations, and layering and defining trust zones. Some of these issues are covered in the discussion on architecture, some in design, some in both, and some not at all.

The risk assessment discussion surveys different risk modeling techniques, including Microsoft’s STRIDE and DREAD, OCTAVE and others. It includes a recommendation for SALSA, an initiative which SANS seems to have started together with another vendor back in 2007 and which has gone nowhere since. SALSA doesn’t belong with the other established models.

The section on design covers security requirements (again) and some general motherhood on good design practices, then briefly touches on authentication, authorization and permissioning, non-repudiation and auditing, and data confidentiality, but doesn’t explore any of these important areas further in the context of design. This is one area where the Stanford course is much stronger.

The discussion of implementation is good, starting with an explanation of Gary McGraw’s Seven Pernicious Kingdoms taxonomy and then a long list of implementation-specific issues, which may or may not apply to your system, but which offers some concrete guidance that may catch the attention of programmers.

Then an excellent, brief exploration of input handling problems, recognizing the importance of not trusting data. And a (too) brief discussion of cryptography in the context of storing confidential data.

Finally, one of the strongest parts of the course, on security code review practices and security testing. Starting with an explanation of why and how to do code reviews (including formal inspections and tool-assisted code reviews), and references to language-specific checklists. Then a discussion of static analysis and a survey of different tools: a bit out of date, missing Microsoft’s SDL tools for example, but a decent overview of available static analysis tools. Then a good survey of testing techniques from a security perspective, including fuzzing and pen testing and risk-based testing, and finally a discussion of attack surface analysis (which more properly belongs in the architecture section).

You have 4 months to complete the 3-hour course, which is more than adequate. Because this was a beta of a new course and a new online delivery model, there was no assessment included.

Comparison of the courses

It was interesting to see that, except for a common focus on basic secure software principles, the courses took different, almost complementary approaches to introducing secure software development.

The SANS course was much broader and more current, with more references to resources like OWASP and to frameworks and tools, and more on secure SDLC practices. There were some notable gaps in the SANS course, in architecture and especially design, and it suffers from some structural weaknesses, but it covers many of the bases and important concepts such as secure error handling, failing safe, input handling, and risk management and threat modeling techniques. With a few improvements in structure and in the content around architecture and design, this would be an excellent course. As it stands, if I can get the developers past the introduction on the vulnerability cycle, they should learn some useful things.

The Stanford course is more than twice as long and more focused, but doesn’t touch on secure SDLC practices such as reviews and security testing, or risk assessment and threat modeling. It digs deeper into specific security implementation problems such as XSRF, XSS, password management, and SQL injection; and it is also stronger in its coverage of basic secure design, and especially cryptography. This course would be of special interest to developers who want to go on with the full Advanced Computer Security certificate program at Stanford.

Neither of these courses is perfect, but they are both professionally delivered, and offer good value for the money.

Tuesday, March 2, 2010

Continuously Putting Your Customers at Risk

Over the past year or so there has been some buzz around Continuous Deployment: immediately and constantly pushing out new features directly to customers. This practice is followed by companies such as Facebook, Flickr and IMVU, where apparently programmers push changes out to production up to 50 times per day.

Continuous integration and rapid deployment have a number of advantages, and continuous deployment appears to maximize value from these ideas: immediate and continuous feedback from customers, and cutting deployment and release costs way down, eliminating overheads and manual checks.

A profile on Extreme Agility at Facebook describes how the company’s small development team deploys rapid updates to production:
“…Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of the system and surely fixing any bugs that would result from these frequent small changes.”
But there are some fundamental and serious challenges with continuous deployment. Success depends on a few key factors:
  1. a comprehensive and fast automated test suite to catch mistakes, especially regressions;
  2. customers that are willing to let you test in production;
  3. an architecture that catches and isolates failures, preventing problems from chaining or cascading across the cluster;
  4. a disciplined, proven, robust deployment model.

Automated Testing


At IMVU, all changes are run through an automated test suite, which executes on a cluster of test servers, before deploying to production.
"So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness.”
The author goes on to say that
“We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day.”
Ummm, actually, no it isn’t. That’s 15k test cases a day, run 70 times [with no explanation of why the tests are run 70 times per day, since we’re talking about pushing out 50 changes per day, but anyways…]. You could run 15k test cases a million times and it would still be 15k test cases, in the same way that running 1 test case a million times is still only 1 test case.

A regression suite of 15,000 automated unit and functional tests sounds impressive – of course what is much more important than the number of tests is the quality of tests. We’re a small shop, and we run more than 15,000 automated tests as part of our continuous integration environment, and we also run static analysis checks on all code, and we do peer code reviews, and manual testing including operations tests, exploratory and destructive tests, system trials, stress and performance testing, and end-to-end integration tests before we push to production.

From subsequent statements, it seems clear that most of the changes deployed this way, at IMVU at least, are trivial – bug fixes and minor modifications; schema changes, for example, are made out of band (they take 2 days to roll out to production at IMVU). So we can assume that the scope, and therefore the risk, of any one change can be contained. This is backed up by a comment made by the author at 50 Deployments A Day and the Perpetual Beta:
“when working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks, etc), we’ll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual rollout and A/B testing.”
I’m not sure why you wouldn’t have a controlled roll-out plan for solving scalability bottlenecks, but let’s assume that he was referring to minor tweaks to the code or configuration, say increasing the size of a resource pool or something.

Testing in Production


After hearing about the approach followed by IMVU last year, a couple of exploratory testing experts, Michael Bolton and James Bach, spent a few minutes trying out IMVU’s system. They, not surprisingly, found a lot of problems without making much of an effort:
“Yes folks, you can deploy 50 times a day. If you don’t care about the quality of what you’re deploying…”
The writer from IMVU admits that they have a lot of bugs:
“continuous deployment lets you write software *regression free*, it sure doesn’t gift you high quality software.”
Ignore the “*regression free*” claim which assumes that the test suite will always catch *all* regressions. Continuous Deployment essentially concedes the job of testing to your customers: you do some superficial reviews and regression tests, and leave the real work of finding problems to your customers. I can appreciate that this might be acceptable to some customers, for example people participating in online communities or online games, as they trade off the inconvenience of occasional glitches against the chance to try out cool new features quickly.

There’s nothing wrong with running experiments, trying out new ideas with part of your customer base through A/B split testing, seeing what sticks and what doesn’t. But this doesn’t mean you need to, or should, deploy every change directly to production. Again, if your customers are trusting you with financial transactions or sensitive personal information, you would be irresponsible if you took this approach, even if, like Facebook, you only push out changes incrementally to a small number of clients at a time.

Failure Isolation and Fail Safe


If you are going to continually roll out changes, and anticipate that some of these changes will fail, the architecture of the system needs to isolate and contain failures, using internal firewalling, fail-fast techniques, timeouts and retries and so on, to reduce the likelihood of a failure chaining through layers or cascading across servers and taking the cluster down.
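For example, every remote call can be bounded by a timeout so that a request-handling thread fails fast instead of waiting forever on a sick dependency – a minimal sketch, where the pool size and timeout value are arbitrary assumptions:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Fail-fast sketch: bound every remote call with a timeout so a request
    // thread never blocks indefinitely on a slow or dead dependency.
    public class BoundedRemoteCall {
        private final ExecutorService executor = Executors.newFixedThreadPool(4);  // size is illustrative

        public <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis) throws Exception {
            Future<T> future = executor.submit(remoteCall);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);   // give up and fail fast rather than hang the request thread
                throw e;
            }
        }
    }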

Unfortunately, this doesn’t seem to be the case, at least in the example described here, where Alex, a programmer, is preparing to deploy code containing a 1-character typo which can cause a failure cascade and take out the site:
“Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal."
So the development team is not only conceding that they cannot write good code, and that they are incapable of doing a decent job testing their work, but also that they cannot take the steps to fail safe at an architectural level and minimize the scope of any failures that they cause. This is not the same as conceding, as Google does, that at massive scale, failures will inevitably happen, and you will have to learn how to deal with them. This is simply giving up, and pushing risk out to customers, again.

A Deployment Model that Works


The deployment model at IMVU is cool: of course, if you are going to do something 50 times per day, you should be pretty good at it.
“The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed.”
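In code, the push-and-sample loop described above might look roughly like this – a sketch only; the metric, thresholds, canary size and deploy/rollback steps are assumptions, not IMVU’s actual tooling:

    // Canary-style push sketch: deploy to a small subset, compare sampled metrics
    // against a baseline, roll back on a significant regression, otherwise push to
    // the full cluster and keep monitoring.
    public class CanaryPush {
        interface Cluster {
            double sampleErrorRate();               // e.g. errors per request across sampled hosts
            void deployTo(double fractionOfHosts);  // switch the new revision live on a subset
            void rollback();                        // switch back to the previous revision
        }

        public void push(Cluster cluster) throws InterruptedException {
            double baseline = cluster.sampleErrorRate();

            cluster.deployTo(0.05);                 // canary: ~5% of hosts (assumption)
            Thread.sleep(60 * 1000L);               // let it take traffic for a minute
            if (isRegression(baseline, cluster.sampleErrorRate())) {
                cluster.rollback();
                return;
            }

            cluster.deployTo(1.0);                  // full cluster
            Thread.sleep(5 * 60 * 1000L);           // monitor for another five minutes
            if (isRegression(baseline, cluster.sampleErrorRate())) {
                cluster.rollback();
            }
        }

        private boolean isRegression(double baseline, double current) {
            // Stand-in for a real statistical test: flag anything well above the baseline.
            return current > baseline * 1.5 + 0.001;
        }
    }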
This rollback model assumes that problems will be found in the first minute, or first few minutes, of operation. It does not account for race conditions, deadlocks and other synchronization faults; time-dependent, statistical or intermittent problems; or downstream integration conflicts – problems which might not show up for hours or even days, by which time dozens or hundreds of other changes have been applied. Good luck finding out where the problem came from by then.

The rollback approach also assumes that all that needs to be done to fix the problem is to rollback the code. It does not account for what needs to be done to track down and repair broken transactions and corrupt data, which might be ok in an online gaming environment, but would be CTO-suicide in a real system.

And you wonder why some web sites have serious software security issues?


Let’s put aside the arguments above about reliability and quality and responsibility, and just look at the problem of security: building secure software in an environment where developers push each change immediately to production.

First, try to imagine someone detecting unauthorized changes with all of this going on. Was every one of those changes intended and authorized? Would you be able to detect an attack in the middle of all of this noise?

Then there’s the problem of building secure software, which is hard enough even if you follow good practice, especially if you are moving fast. There has been a lot of work over the past couple of years defining how developers can build secure software in an agile way. Microsoft’s SDL for Agile maps secure software design and development practices to agile development models like Scrum, and cuts secure software development controls and practices down to fit rapid incremental development.

But with Continuous Deployment, at least as described so far, there is no time or opportunity to do even a minimal set of security checks and reviews before software changes are pushed out by developers.

It’s bad enough to build insecure software out of ignorance. But by following continuous deployment, you are consciously choosing to push out software before it is ready, before you have done even the minimum to make sure it is safe. You are putting business agility and cost savings ahead of protecting the integrity or privacy of customer data.

Continuous deployment sounds cool. In a world where safety and reliability and privacy and security aren’t important, it would be fun to try. But like a lot of other developers, I live in the real world. And I need to build real software.