Wednesday, June 30, 2010

Still getting my head around Continuous Deployment

I attended a long webinar earlier today, sponsored by SD Times: Kent Beck’s Principles of Agility. The other speakers were Jez Humble from ThoughtWorks, a proponent of Continuous Delivery; and Timothy Fitz at IMVU, the leading evangelist for Continuous Deployment.

The arguments in support of Continuous Deployment

Kent Beck explored a fundamental mismatch between rapid cycling in design and construction, and then getting stuck when we are ready to deploy. He argues that queuing theory and experience show that there is more value in a system when all of the pipes are the same size, and follow the same cycle times. Ideally, there should be a smooth flow from ideas to design and development and to deployment, and then information from real use fed back as soon as possible to ideas. Instead we have a choke point at deployment.

Then there is the ROI argument that we can get faster return on money spent if we deploy something that we have done as soon as it is ready.

Kent Beck also explained that based on his experience at one company the constraints of deploying immediately make people more careful and thoughtful: that the practice becomes self-reinforcing, that developers stop taking risks because they don’t have time to. Essentially problems become simpler because they have to be.

Timothy Fitz presented a Deployment Equation:

If Information Value + Direct Value > Deployment Risk then Deploy
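Written as code, the equation is a one-line comparison. This is only a sketch: the function name and the idea of putting all three quantities on one numeric scale are my assumptions, not something Fitz presented.

```python
def should_deploy(information_value: float,
                  direct_value: float,
                  deployment_risk: float) -> bool:
    """Deploy only when the combined value strictly exceeds the risk.

    All three inputs are assumed to be estimates on the same
    (hypothetical) scale, e.g. expected dollars.
    """
    return information_value + direct_value > deployment_risk


# Value of 3 + 2 outweighs a risk of 4, so this change ships.
print(should_deploy(3.0, 2.0, 4.0))  # True
```

The hard part, of course, is not the comparison but estimating the three numbers.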

The idea is that Continuous Deployment increases information value by giving us information earlier. He talked about ways to reduce risk:

- Rolling out larger changes slowly to customers, through dark launching (hiding the changes from the front-end until ready: not exactly a new idea) and enabling features for different sets of users.
- Extensive automated testing, supplemented with manual exploratory testing before exposing dark-launched features.
- Ensuring that you can detect problems quickly and correct them through production monitoring, looking for leading indicators of problems, and instant production roll back.
- An architecture that supports stability through isolation. Follow the patterns in Release It! to minimize the chance of “stupid take the cluster out” errors.
- Locking down core infrastructure, preventing changes from certain parts of the system without additional checks.
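The first item, enabling features for different sets of users, can be sketched as a percentage rollout. This is a minimal illustration under my own assumptions (the function names, the 100-bucket scheme and the feature name are hypothetical), not IMVU's implementation.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Enable the feature for roughly rollout_percent of users.

    A user who is enabled at 20% stays enabled at 50%, because the
    bucket depends only on the user and the feature name.
    """
    return bucket(user_id, feature) < rollout_percent


# Dark launch: code is deployed, but nobody sees it yet.
print(is_enabled("user-42", "new-checkout", 0))    # False
# Full rollout: everybody sees it.
print(is_enabled("user-42", "new-checkout", 100))  # True
```

Setting the percentage back to 0 turns the feature off without a redeploy, which is the kind of cheap roll back the list above depends on.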

Jez Humble at ThoughtWorks presented on Continuous Delivery: building on top of Continuous Integration to automate and optimize further downstream packaging and deployment activities. Continuous Deployment is effectively an extension of Continuous Delivery. It was mostly a re-hash of another presentation that I had already seen from ThoughtWorks, and of course there will be a book coming out soon on all of this.

Some questions on Continuous Delivery and Continuous Deployment

Me: Continuous Delivery is based on the assumption that you can get immediate feedback: from automated tests, from post-deployment checks, from customers. How do you account for problems that don't show up immediately, by which time you have deployed 50 or 100 or more changes?

Answer from Timothy Fitz: The first time, you revert and re-push. Then you post-mortem and figure out how to catch faster by looking for a leading indicator. Performance issues can be caught by dark launching, in which case turning off or reverting the functionality will have 0 visible effect. Frontend issues are usually caught by A/B tests, where you can mitigate risk by not running them at 100% of all traffic (have 80% control, 20% hypothesis, etc)

Me: Follow-up on my question about handling problems that show after 50 or 100 changes. The answer was to revert and re-push - but revert what? A problem may not show itself immediately. How do you know which change or changes to roll back?

Answer from Timothy Fitz: If it took 50-100 changes, then you'll be finding the change manually. It turns out to be fairly easy even if it's been 48-96 hours, you're only looking through a few hundred very small commits most of which are in isolated areas unrelated to your problem.
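Fitz described a manual search. A different technique, binary search over the ordered list of changes, is one way to narrow it down, but only under assumptions that may not hold for a slow-burning production problem: that you have a repeatable check for the problem, and that once the problem is introduced every later change still shows it. The names below are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the earliest change for which is_bad() is True.

    Assumes the last change is bad and that every change after the
    culprit is also bad (the problem does not come and go).
    """
    if not changes:
        raise ValueError("no changes to search")
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return changes[lo]


# 100 changes, where the problem (hypothetically) arrived with change 37.
changes = [f"c{i}" for i in range(100)]
print(first_bad_change(changes, lambda c: int(c[1:]) >= 37))  # c37
```

With 100 changes this needs at most 7 checks, but each check means deploying and observing a specific version, which is exactly what is hard when the problem takes days to appear.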

Me: How do you handle changes to data (contents and/or schema) on a continuous basis?

Answer: not answered. Jez Humble talked about writing code that could work with multiple different database versions (which would make design and testing nasty of course), and how to automate some database migration tasks with tools like DBDeploy, but admitted that “databases were not optimized for Continuous Delivery”. There were no good answers on how to handle expensive data conversions.
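To make the "code that works with multiple database versions" idea concrete, here is a minimal sketch. The schema is entirely hypothetical: an old version with a single name column and a new version that splits it in two.

```python
def display_name(row: dict) -> str:
    """Read a customer name from either schema version.

    New (hypothetical) schema: first_name + last_name columns.
    Old (hypothetical) schema: a single name column.
    """
    if "first_name" in row and "last_name" in row:
        return f"{row['first_name']} {row['last_name']}".strip()
    return row["name"]


print(display_name({"name": "Ada Lovelace"}))                        # Ada Lovelace
print(display_name({"first_name": "Ada", "last_name": "Lovelace"}))  # Ada Lovelace
```

Even in a toy like this, every read path needs two branches and two sets of tests, which is why I call it nasty.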

Me: My team has obligations to ensure that the software we deliver is secure, so we follow secure SDLC checks and controls before we release. In Continuous Delivery I can see how this can be done along the pipeline. But secure Continuous Delivery?

Answer from Jez Humble: Ideally you'd want to run those checks against every version. If you can't do that, do it as often as you can.
[I didn’t expect a meaningful answer on this one, and I didn’t get one]

Somebody else’s question: Do you find users struggling to keep up and adapt to the constant changes?

Answer from Kent Beck: In practice it doesn't seem to be a problem usually because each change is small--a new widget, a new menu item, a new property page that's similar to existing pages. A wholesale change to the UI would be a different story. I would try to use social processes to support such a change--have a few leaders try the new UI first, then teach others.

Somebody else’s question: Without solid continuous testing in place, CD is [a] fast track to continuous complaints from end users

Answer from Timothy Fitz: Not always, but usually. For the cases where it makes sense (small startup, or isolated segment that opts-in to alpha) you can find user segments who value features 100% over stability, and will gladly sign up for Continuous Deployment.

So what do I really think about Continuous Deployment?

OK I can see how Continuous Deployment can work,

If: your architecture supports isolation, that it is horizontal and shallow, offering features that are clearly independent;

If: you don’t follow the all-or-none approach – that you recognize that some kinds of changes can be deployed continuously and some parts of the system are too important and require additional checks, tests, reviews, and more time;

If: you build up enough trust across the company;

If: your customers are willing to put up with more mistakes in return for faster delivery, if at least some of them are willing to help you do your testing for you;

If: you invest enough in tools and technology for automated layered testing and deployment and post-deployment checking and roll-back capabilities.

Continuous Deployment is still an immature approach and there are too many holes in it. And as Kent Beck has pointed out, there aren’t enough tools yet to support a lot of the ideas and requirements: you have to roll your own, which comes with its own costs and risks.

And finally, I have to question the fundamental importance of immediate feedback to a company. I can see that waiting a year, or even a month, for feedback can be too long. I fully understand and agree that sometimes changes need to be made quickly, that sometimes the windows of opportunity are small and we need to be ready immediately. And there’s first mover advantage, of course. But I have a hard time believing that any kind of change needs to be made continuously, 50 times per day: that there are any changes that can be made that quickly that will make any real difference to customers or to the business. And I will go further and say that such rapid changes are not in the interests of customers, that they don’t need or even want this much change this fast. And I don’t believe that it’s really about reducing waste, or maximizing velocity or increasing information value.

No, I suspect it is more about a need for immediate satisfaction – for programmers, and the people who drive them. Their desire to see what they’ve done get into production, and to see it right away, to get that little rush. The simple inability to delay gratification. And that’s not a good reason to adopt a model for change.

Monday, June 28, 2010

Velocity 2010 Conference Take-Aways

I spent an interesting few days last week at the Velocity 2010 conference in Santa Clara. The focus of the conference was on performance and application operations for large-scale web apps. Here are my take-aways:


Fundamentally a problem of scale-out, of handling online communities of millions of users and the massive amounts of information that they want at hand. As Theo Schlossnagle pointed out in an excellent workshop on Scalable Internet Architectures (or you can read the book…), the players in this space approach performance problems with similar technologies (LAMP or something similar like Ruby on Rails as the principal stack, and commodity servers) and architectural strategies:

1. Data partitioning – sharding datasets across commodity servers, required because MySQL does not scale vertically. Theo’s advice on sharding: “Avoid sharding as long as possible, it is painful. If you have to shard, follow these steps. Step 1: Shard your data. Step 2: Shoot yourself”. Consider duplicating data if you need the same information available in different partitioning schemes.

2. Non-ACID key-value data stores and NoSQL distributed data managers like Cassandra, MongoDB, Voldemort, Redis or CouchDB for handling high volumes of write-intensive data. These are fast and simple, but the technologies are still immature: they are not hardened or reliable, and they lack the kinds of management capabilities and tools that Oracle DBAs have been accustomed to for years.

3. Strategies for effective caching of high-volume data, basically ways of extending and optimizing the use of memcached, and different schemes for effective cache consistency and cache coherency.
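The data partitioning in item 1 can be sketched as hash-based shard routing. This is a naive illustration of my own (the function names and the modulo scheme are assumptions, not something from Theo's workshop), but it also shows one reason sharding is painful: changing the number of shards moves most of the data.

```python
import hashlib
from typing import Iterable


def shard_for(key: str, n_shards: int) -> int:
    """Pick a shard for a key using a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def fraction_moved(keys: Iterable[str], old_n: int, new_n: int) -> float:
    """Fraction of keys that land on a different shard after resharding."""
    keys = list(keys)
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(1000)]
# Going from 4 to 5 shards relocates the large majority of keys.
print(fraction_moved(keys, 4, 5))
```

With simple modulo routing, adding a fifth shard to four relocates roughly four keys out of five, so every resharding is close to a full data migration.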

Some other advice from Theo: Planning for more than a 10-fold increase in workload is a waste of time – you won’t understand the type of problems that you are going to face until you get closer. On architecture and design: don’t simplify simple problems.

Coming from a financial trading background, I was surprised to see that the argument still needed to be made that performance was an important business factor: that speed could improve business opportunities. Seems obvious.

According to one of the keynote speakers, Urs Hölzle at Google, the average time for a page to load is 4.9 seconds, while the goal should be around 100 ms – the time that it takes a reader to turn a page in a book. Google presented some interesting research work that they are leading to improve the front-end response time of the web experience, including proposals to improve DNS and TCP, work done in Chrome to improve browser performance, and advanced performance profiling tools made available to the community.

Operations and DevOps

Provisioning and deployment (a real management problem when you need to deploy to thousands or tens of thousands of servers); change management and the rate of change; version control and other disciplines; instrumentation and logging; metrics and more metrics; and failure handling and incident management.

Log and measure as much as you can about the application and infrastructure – establish baselines, understand Normal, understand how the system looks when it is working correctly and is healthy.
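As a rough sketch of what "understand Normal" can mean in practice: record a baseline for a metric while the system is healthy, then flag readings that fall far outside it. The three-standard-deviation threshold and the function names are my assumptions, not a recommendation from the conference.

```python
import statistics
from typing import Sequence, Tuple


def baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of healthy-state samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_abnormal(value: float, mean: float, stdev: float,
                threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    return abs(value - mean) > threshold * stdev


# Hypothetical response times (ms) collected while the system was healthy.
healthy = [100, 102, 98, 101, 99]
mean, stdev = baseline(healthy)
print(is_abnormal(101, mean, stdev))  # False
print(is_abnormal(150, mean, stdev))  # True
```

Real metrics are rarely this well behaved (daily cycles, long tails), so treat this as the idea rather than a monitoring design.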

Configuration management and deployment. Advice from Theo: version control everything – not just code and application configuration, and server configs, but also the configs for firewalls and load balancers and switches/routers and the database schemas and…

Several companies were using Chef or Puppet for managing configuration changes. Facebook and Twitter were both using BitTorrent to stream code updates across thousands of servers.

Change management. The consensus is that ITIL is very uncool – it is all about being slow and bureaucratic. This is a shame – I think that everyone in an operations role could learn from the basics of ITIL and Visible Ops, the disciplines and frameworks.

The emphasis was on how to effect rapid change, how to get feedback as quickly as possible, time to market, continuous prototyping, A/B split testing to understand customer needs, the need to make decisions quickly and see results quickly. At the same time, different speakers stressed the need for discipline and responsibility and accountability: that the person who is responsible for making a change should make sure it gets deployed properly, and that it works.

Continuous Deployment came up several times, although “Continuous” means different things to different people. For Facebook this means pushing out small changes and patches every day and features once per week.

You can’t make changes without taking on the risk of failure. This was especially clear to an audience where so many people had experience in startups.

Lenny Rachitsky’s session, The Upside of Downtime, covered the need for transparency in the event of failure, and showed how being transparent and honest in the event of a failure can help build customer confidence. His blog, Transparent Uptime, includes an interesting collection of Public Health Dashboards for web communities.

To succeed you need to learn from failures of course – use postmortems and Root Cause Analysis to understand what happened and implement changes so that you don’t keep making the same mistakes. Another quote from Theo: “Good judgment comes from experience. Experience comes from bad judgment. Allow people to make mistakes – but limit the liability. Measure the poise and integrity with which someone deals with the problem and its remediation.”

So failure can scorch you, make you afraid, and this fear can affect your decision making, slow you down, stop you from taking on necessary and manageable risks. You need to know how much risk you can take on, whether you are going too slow or too fast, and how to move forward.

John Allspaw at Etsy, one of the rock stars of the devops community, made a clear and compelling (and entertaining) case for meta-metrics, data to provide confidence in your operational decisions: “How do we get confidence in [what we are doing]? We make f*&^ing graphs!”

First track all changes: who made the change, when, what type, and how much was changed. Track all incidents: severity, time started, time to detect, time to recover/resolve, and the cause (determined by RCA). Then correlate changes with incidents: by type, size, frequency. With this you can answer questions like: What type of incidents have high Time to Recover? What types of changes have high / low success rates?
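Since the slides are not available, here is my own minimal sketch of the correlation step. The record fields and function names are hypothetical; the only idea taken from the talk is linking each incident back to the change that caused it, as determined by RCA.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    change_id: str
    change_type: str      # e.g. "code" or "config"
    author: str
    lines_changed: int


@dataclass
class Incident:
    severity: int
    minutes_to_detect: float
    minutes_to_recover: float
    cause_change_id: Optional[str]  # from RCA; None if not caused by a change


def success_rate_by_type(changes: List[Change],
                         incidents: List[Incident]) -> Dict[str, float]:
    """Share of changes of each type that did not cause an incident."""
    bad_ids = {i.cause_change_id for i in incidents if i.cause_change_id}
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for c in changes:
        totals[c.change_type] += 1
        if c.change_id in bad_ids:
            failures[c.change_type] += 1
    return {t: 1 - failures[t] / totals[t] for t in totals}


def mean_recovery_by_type(changes: List[Change],
                          incidents: List[Incident]) -> Dict[str, float]:
    """Mean time to recover, grouped by the type of the causing change."""
    type_of = {c.change_id: c.change_type for c in changes}
    buckets: Dict[str, List[float]] = defaultdict(list)
    for i in incidents:
        buckets[type_of.get(i.cause_change_id, "unknown")].append(i.minutes_to_recover)
    return {t: sum(v) / len(v) for t, v in buckets.items()}


changes = [
    Change("c1", "code", "ann", 12), Change("c2", "code", "bob", 3),
    Change("c3", "code", "ann", 40), Change("c4", "code", "cy", 7),
    Change("c5", "config", "bob", 1), Change("c6", "config", "cy", 2),
]
incidents = [
    Incident(2, 5, 30, "c1"), Incident(1, 20, 90, "c5"), Incident(3, 2, 10, None),
]
print(success_rate_by_type(changes, incidents))   # {'code': 0.75, 'config': 0.5}
print(mean_recovery_by_type(changes, incidents))  # {'code': 30.0, 'config': 90.0, 'unknown': 10.0}
```

In this made-up data, config changes fail more often and take longer to recover from than code changes, which is the kind of answer the meta-metrics are meant to give you.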

Unfortunately the video and slide deck for this presentation are not available on the Velocity site yet.

There was some macho bullshit from one of the speakers about “failing forward” – that essentially rolling back was for cowards. I think this statement was made tongue-in-cheek and I hope that it was taken as such by the audience.

The Rest

I also followed up some more on Cloud Computing. Sure, the Cloud gives you cheap access to massive resources but the consensus at the conference was that it is still not reliable and it is definitely not safe, and it doesn’t look like it will get that way soon. Any data that you need to be safe or confidential needs to be kept out of the Cloud or at minimum encrypted and signed with the keys and other secrets stored out of the Cloud, following a public/private data architecture.

The conference was fun and thought-provoking, and I met a lot of smart and thoughtful people. The crowd was mostly young and attention-deficit: iPhones, iPads, notebooks and laptops in constant use throughout the sessions.

Maybe it was the California sunshine, but the atmosphere was more open, more sharing, and less proprietary than I am used to – there was a refreshing amount of transparency into the technology and operations at many of the companies. The vendor representation was small and low key, but recruitment was blatant and pervasive: everyone was hunting for talent.

I am an uptight enterprise guy. It would be fun to work on large-scale consumer problems, with more freedom to make changes. I regret missing the follow-on DevOps Days event last Friday, but I had to get home. And finally, I am looking forward to getting my copy of the new WebOps book, which premiered at the conference, and to next year’s Velocity conference.