Tuesday, March 2, 2010

Continuously Putting Your Customers at Risk

Over the past year or so there has been some buzz around Continuous Deployment: immediately and constantly pushing out new features directly to customers. This practice is followed by companies such as Facebook, Flickr and IMVU, where apparently programmers push changes out to production up to 50 times per day.

Continuous integration and rapid deployment have a number of advantages, and continuous deployment appears to maximize value from these ideas: immediate and continuous feedback from customers, and cutting deployment and release costs way down, eliminating overheads and manual checks.

A profile on Extreme Agility at Facebook describes how the company’s small development team deploys rapid updates to production:
“…Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of the system and surely fixing any bugs that would result from these frequent small changes.”
But there are some fundamental and serious challenges with continuous deployment. Success depends on a few key factors:
  1. a comprehensive and fast automated test suite to catch mistakes, especially regressions;
  2. customers that are willing to let you test in production;
  3. an architecture that catches and isolates failures, preventing problems from chaining or cascading across the cluster;
  4. a disciplined, proven, robust deployment model.

Automated Testing


At IMVU, all changes are run through an automated test suite, which executes on a cluster of test servers, before deploying to production.
"So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness.”
The author goes on to say that
“We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day.”
Ummm, actually, no it isn't. That’s 15k test cases a day. Run 70 times [and no explanation why the tests are 70 times per day, since we’re talking about pushing out 50 changes per day, but anyways…]. You could run 15k test cases a million times and it would still be 15k test cases, in the same way that running 1 test case a million times is still only 1 test case.

A regression suite of 15,000 automated unit and functional tests sounds impressive – of course what is much more important than the number of tests is the quality of tests. We’re a small shop, and we run more than 15,000 automated tests as part of our continuous integration environment, and we also run static analysis checks on all code, and we do peer code reviews, and manual testing including operations tests, exploratory and destructive tests, system trials, stress and performance testing, and end-to-end integration tests before we push to production.

From subsequent statements, it seems clear that most of the changes deployed this way at IMVU at least are trivial, bug fixes and minor modifications: schema changes, for example, are made out of band (they take 2 days to rollout to production at IMVU). So we can assume that the scope, and therefore, risk of any one change can be contained. This is backed up by a comment made by the author at 50 Deployments A Day and the Perpetual Beta:
“when working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks, etc), we’ll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll0out and A/B testing.”
I’m not sure why you wouldn’t have a controlled roll-out plan for solving scalability bottlenecks, but let’s assume that he was referring to minor tweaks to the code or configuration, say increasing the size of a resource pool or something.

Testing in Production


After hearing about the approach followed by IMVU last year, a couple of exploratory testing experts, Michael Bolton and James Bach, spent a few minutes trying out IMVU’s system. They, not surprisingly, found a lot of problems without making much of an effort:
“Yes folks, you can deploy 50 times a day. If you don’t care bout the quality of what you’re deploying…”
The writer from IMVU admits that they have a lot of bugs:
“continuous deployment lets you write software *regression free*, it sure doesn’t gift you high quality software.”
Ignore the “*regression free*” claim which assumes that the test suite will always catch *all* regressions. Continuous Deployment essentially concedes the job of testing to your customers: you do some superficial reviews and regression tests, and leave the real work of finding problems to your customers. I can appreciate that this might be acceptable to some customers, for example people participating in online communities or online games, as they trade off the inconvenience of occasional glitches against the chance to try out cool new features quickly.

There’s nothing wrong with running experiments, trying out new ideas with part of your customer base through A/B split testing, seeing what sticks and what doesn’t. But this doesn’t mean you need to, or should, deploy every change directly to production. Again, if your customers are trusting you with financial transactions or sensitive personal information, you would be irresponsible if you took this approach, even if, like Facebook, you only push out changes incrementally to a small number of clients at a time.

Failure Isolation and Fail Safe


If you are going to continually roll changes, and anticipate that some of these changes will fail, the architecture of the system needs to isolate and contain failures, using internal firewalling, fast-fail techniques, timeouts and retries and so on to reduce the likelihood of a failure chaining through layers or cascading across servers and taking the cluster down.

Unfortunately, this doesn’t seem to be the case, at least in the example described here, where Alex, a programmer, is preparing to deploy code containing a 1-character typo which can cause a failure cascade and take out the site:
“Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal."
So the development team is not only conceding that they cannot write good code, and that they are incapable of doing a decent job testing their work, but also that they cannot take the steps to fail safe at an architectural level and minimize the scope of any failures that they cause. This is not the same as conceding, as Google does, that at massive scale, failures will inevitably happen, and you will have to learn how to deal with them. This is simply giving up, and pushing risk out to customers, again.

A Deployment Model that Works


The deployment model at IMVU is cool: of course, if you are going to do something 50 times per day, you should be pretty good at it.
“The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed.”
This rollback model assumes that problems will be found in the first minute, or few minutes of operation. It does not account for race conditions, deadlocks and other synchronization faults, or time-dependent problems or statistical bugs or downstream integration conflicts or intermittent problems which might not show up for hours or even days, by which time dozens or hundreds of other changes have been applied. Good luck finding out where the problem came from by then.

The rollback approach also assumes that all that needs to be done to fix the problem is to rollback the code. It does not account for what needs to be done to track down and repair broken transactions and corrupt data, which might be ok in an online gaming environment, but would be CTO-suicide in a real system.

And you wonder why some web sites have serious software security issues?


Let’s put aside the arguments above about reliability and quality and responsibility, and just look at the problem of security: building secure software in an environment where developers push each change immediately to production.

First, try to imagine someone detecting unauthorized changes with all of this going on. Was every one of those changes intended and authorized? Would you be able to detect an attack in the middle of all of this noise?

Then there’s the problem of building secure software, which is hard enough even if you follow good practice, especially if you are moving fast. There has been a lot of work over the past couple of years defining how developers can build secure software in an agile way. Microsoft’s SDL for Agile maps secure software design and development practices to agile development models like Scrum, and cuts secure software development controls and practices down to fit rapid incremental development.

But with Continuous Deployment, at least as described so far, there is no time or opportunity to do even a minimal set of security checks and reviews before software changes are pushed out by developers.

It’s bad enough to build insecure software out of ignorance. But by following continuous deployment, you are consciously choosing to push out software before it is ready, before you have done even the minimum to make sure it is safe. You are putting business agility and cost savings ahead of protecting the integrity or privacy of customer data.

Continuous deployment sounds cool. In a world where safety and reliability and privacy and security aren’t important, it would be fun to try. But like a lot of other developers, I live in the real world. And I need to build real software.

5 comments:

  1. I disagree with your thesis that Facebook, IMVU, Flickr and other companies that do CD don't live in the real world. As the beacon debacle showed, privacy is a very real concern for Facebook users and must be something to consider when bring out new features. A breach in security or privacy can be deadly for a social networking site. There's also nothing saying that you can't/wouldn't have a security be a concern in both automated testing and deployment (we do these things where I work).

    Also, when you put down other people's very successful approaches to deployments with statements like "I work in real software," it doesn't help you make your (valid and thoughtful) points; instead it makes me want to ignore everything you wrote. I work writing airline reservation systems and we embrace a lot of CD/CI principles, does that mean airline reservations systems aren't real software?

    ReplyDelete
  2. Seems like the "real world" is full of sour grapes.

    ReplyDelete
  3. @adamfblahblah: Your criticism is fair. I should not have come across as condemning Continuous Deployment as a principle, as it is possible to rapidly deploy changes to production without all of the downside that I explored here. From comments made at

    http://lastinfirstout.blogspot.com/2009/03/continuous-deployment-debate.html

    it is clear that Flickr at least takes a thoughtful and responsible approach to rapid deployment, as I am sure that your firm does. I am interested in understanding more about how the approach you take for an airline reservations system differs from the ones described so far, how you are able to make continuous deployment robust.

    My concern is that some of the implementations of this method appear to be irresponsible. If privacy and security concerns are important to a company, why would they take an approach that short cuts controls to such an extent? If changes are never delayed and pushed directly to production to get feedback (as described in the posts I referred to), then I don't see how changes can be made safe (or at least safe enough). Maybe the methods weren't described completely, and there are more controls and checks being done somewhere, and my concerns are not justified?

    @CR: sour grapes. No, I don't see it that way. Fundamentally concerned about where this might be taking us all, down a path where we could all expect continuous problems, yes definitely.

    ReplyDelete
  4. One of the problem has been that there is a tendency to overhype the positives and underscore the negatives. CI/CD has some very nice pros but as Jim has pointed out it has some very serious flaws. It really all depends on your particular set of circumstances and how your application works.

    @JimBird I don't believe you concerns are unjustified. I was personally extremely skeptical of the concept and even though I have not used it/implemented it I can now see that it has value.

    ReplyDelete
  5. @jim,


    Using your post as a springboard, I've written up a post on some of the CI/CD we do where I work: http://www.thesimplelogic.com/2010/03/04/continuous-integration-deployment-in-the-airline-industry/

    Hopefully you're okay with me using the quote I took issue with to start my discussion. I'd love to hear your comments.

    -Adam

    ReplyDelete

Note: Only a member of this blog may post a comment.