Thursday, October 24, 2013

Making Devops work outside of Webops

I've spent the last 3 years or so learning more about devops. I went to Velocity and Devopsdays and a bunch of other conferences that included devops stuff (like the last couple of OWASP USA conferences and this year's Agile conference). I've been following the devops forums and news, reading devops books, trying out devops tools and Continuous Delivery, talking to smart people who work at successful devops shops, and encouraging people in my organization to adopt some of these ideas and tools where they make sense. I've been looking for practices, patterns and technology that we can take and apply to the work that we do, which is in an enterprise, B2B environment.

A problem for us is that devops today is still mostly where it started, rooted in Web Ops with a focus on building and scaling online shops and communities and Cloud services – except maybe where some enterprise technology vendors have jumped on the devops bandwagon to re-brand their tools.

Is there really that much that a well-run highly-regulated enterprise IT organization hooked into hundreds or thousands of other enterprises can learn from a technology startup trying to launch a new online social community or a multi-player online game, or even from larger, more mature devops shops like Etsy or Netflix? Do the same rules and ideas apply?

The answer is: yes sometimes, but no, not always.

There are some important factors that separate enterprises from most devops shops today.

Platform heterogeneity and the need to support legacy systems and all of the operational inter-dependencies between systems is one – you can’t pretend that you can take care of your configuration management problems by using Puppet or Chef in an enterprise that has been built up over many years through mergers and acquisitions and that has to support thousands of different applications on dozens of different technology platforms. Many of those apps are third party apps that you don’t have control over. Some of those platforms are legacy systems that aren't supported any more. Some of those configs are one-off snowflakes because that’s the only way that somebody could get things to work.

Governance and regulatory compliance (and all the paperwork and hassle that goes with this) is another. Even devops shops don’t handle their highly-regulated core business functions the same as they do the rest of their code (a good example is how Etsy meets PCI compliance).

There are two other important factors that separate many enterprises such as large financial institutions from the way that a devops shop works: the need for speed in change, and the cost of failure.

The Need for Speed

If “How can I change things faster?” is the question, devops looks like the answer.

Devops enables – and emphasizes – rapid, continuous change, through relentless automation, breaking down walls between developers and operations, and through practices like Continuous Deployment, where developers push out code changes to production several times a day.

Being able to move this quickly is important in early stages of iterative design and development, for online startups that need to build a critical mass of customers before they run out of money, and other organizations experiencing hyper growth. Every organization has some systems that need to be changed often, and can be changed with minimal impact: CRM systems, analytics and internal management reporting for example. And as James Urquhart explains, optimizing for change makes sense when you need to change technology often.

But there are other systems that you don’t need to or you can’t change every day or every week or every month: ERP and accounting systems, payment handling, B2B transactional systems, industrial control. Where you don’t need to and don’t want to run experiments on your customers to try out new features or constantly refine the details of their user experience because the system already works and lots of people are depending on it to work a certain way and what’s really important is keeping the system working properly and keeping operational costs down. Or where change is risky and expensive because of regulatory and compliance requirements and operational interdependencies with other systems and other organizations. And where the cost of failure is high.

Change, even when you optimize for it, always comes with the risk of failure. The 2013 State of Devops Report found that high performing devops shops deploy code 30x more frequently, with “double the change success rate”. By themselves these figures are impressive. Taken together – they aren’t good enough. Changing more often still means failing more often than an organization which moves more slowly and more cautiously, and not every organization can afford to fail more often.
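
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that “double the change success rate” means the share of changes that fail is cut in half, which is one reading of the phrase and not a figure taken from the report. The baseline numbers (12 deployments a year, 10% of them failing) are made up purely for illustration.

```python
# Back-of-the-envelope comparison of how many changes fail in a year.
# All inputs are hypothetical; only the 30x multiplier comes from the post.

def expected_failed_changes(deploys_per_year: float, failure_rate: float) -> float:
    """Expected number of failed changes = number of changes x share that fail."""
    return deploys_per_year * failure_rate

# Hypothetical cautious shop: one release a month, 10% of releases fail.
baseline_deploys = 12
baseline_failure_rate = 0.10

# High performer: 30x the deployments (the figure quoted from the report).
# ASSUMPTION: "double the change success rate" is read as half the failure rate.
fast_deploys = baseline_deploys * 30
fast_failure_rate = baseline_failure_rate / 2

baseline_failures = expected_failed_changes(baseline_deploys, baseline_failure_rate)
fast_failures = expected_failed_changes(fast_deploys, fast_failure_rate)
failure_ratio = fast_failures / baseline_failures

print(f"cautious shop: about {baseline_failures:.1f} failed changes a year")
print(f"fast shop: about {fast_failures:.1f} failed changes a year")
print(f"ratio: about {failure_ratio:.0f}x")
```

Under these assumptions the ratio works out to 30 / 2 = 15 no matter which baseline you pick: each individual change is safer, but there are many more failed changes in total.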

The Cost of Failure

Most online businesses exist in a simpler, more innocent world where change is straightforward – it’s your code and your Cloud so you can make a change and push it out without worrying about dependencies on other systems and the people who use them or how to coordinate a roll-out globally across different companies and business lines – and where the consequences of failure are really not that high.

If you’re not charging anything (Facebook, Twitter) or next to nothing (Netflix) for customers to use your service, and if the cost of failure to customers is not that much (they have to wait a little bit to tell people that their kitty just sneezed or to post a picture of their goldfish or watch a movie) then nobody has the right to expect too much when something goes wrong.

It’s a completely different world for financial services companies, where failure is not always an option – and “fail fast, fail often” works in early stage development, but not once customers are using your technology.

I’ve been told by a tech exec at a bank that Etsy (or any web company) wasn't a “serious” endeavor, that his bank works with “serious money” which means that they can’t “screw around” like web companies do. I've also seen web companies poo-poo the enterprise because they're "spoiled" with their small user base and non-24x7 working environments.

Until there is a shared understanding between those groups, the healthy and mature swapping of ideas and concepts is going to be slow.

John Allspaw, interview, Is the Enterprise Ready for DevOps?

That bank exec, though he wasn't diplomatic about it, is right.

The cost and risk involved in a failure is several orders of magnitude different between a bank and an online consumer web business, even something as large as Etsy. In a presentation at the end of 2012, Etsy's CTO boasted that they are now handling “real money”, as much as “$1k per minute” at that time. While that’s real money to real customers at Etsy (and I am sure that this number is higher by now), it’s negligible compared to the value of transactions that any major financial institution handles.

There aren’t any mistakes that a company like Etsy or even Facebook could make that could compare with the impact of a big system failure at a major bank, or a credit card processor or a major stock exchange or financial clearing house or brokerage, or some other large financial institution.

This is not just because of the high value of transactions that are moving through these systems. It is also because of the chain reaction that such failures have on other institutions in the national and international system-of-systems that these organizations operate in – the impact on partners’ and customers’ systems, and on their customers and partners and so on. The costs of these failures can run into the millions or hundreds of millions of dollars, much more if you include lost opportunity costs and the downstream costs of responding to the failure (including IT costs for upgrading or replacing the entire system, which is a common response to a major failure), never mind the follow-on costs of increased regulatory oversight that is often demanded across an entire industry after a high-profile system failure.

Problems of this scale just don’t happen when an online e-business or social network fails, even if it fails spectacularly. It’s an inconvenience to customers, and it is a real cost to the online business, but failures don’t cascade down to other companies and industries, unless maybe Amazon’s AWS infrastructure fails big time and the online companies that depend on it are left hanging, which seems to happen a couple of times each year.

It’s not enough for many enterprises and even smaller B2B platforms to optimize for MTTR and try harder next time or to accept that roll-back is a myth and that “real men only roll forward” – and from the continuing stories of high-profile failures at online organizations this isn't enough for devops organizations once they reach a certain size either.

But you can still learn a lot from Devops

It’s not that devops won’t work in the enterprise. Just not devops as it has mostly been described until now. Devops overplays the “everything needs to change faster and more often” card, and oversimplifies some of the other problems that many organizations face if they don’t or can’t run everything in the Cloud. But there is still a lot to learn from devops leaders, even if their stories and their priorities and constraints and their business situations don’t match up.

We can certainly learn from them about how to run a scalable Web presence and about how to move stuff to the Cloud.

We can learn how to take more of the risk and cost out of change management by simplifying and standardizing configurations where possible and simplifying and automating testing and deployment steps as much as possible – even if we aren't going to change things every day.

But for now, probably the most valuable thing that devops brings to those of us who don't work in online Web shops isn't tools or practices. It’s that devops is creating new reasons and more opportunities for dev and Ops and management to engage with each other, as they try to figure out what this devops thing is and whether and how it makes sense in our organizations.

To get developers, managers and Ops together talking about configuration management and how to improve release and deployment and run-time alerting and application monitoring and run-time health checks and how developers and Ops can learn from failures together. Getting developers and Ops thinking about these problems and trying to solve them together, in collaborative and constructive ways that ITIL certainly never did. Spending more time on problem solving and less time on expensive bureaucracy and buck passing. That’s gotta be a good thing.

6 comments:

Josh Meier said...

I disagree with a lot of what's said here. Let's start with what I do agree with.

1) Large enterprises that have legacy systems to support and maintain definitely make it a challenge and Puppet/Chef aren't going to fix everything for you magically.
2) Not everyone needs to deploy to production on a daily/weekly basis. That's a fair assessment but has nothing (or very little) to do with devops.

Now, for what I disagree with:

1) Devops doesn't emphasize speed over any other value. I'm a little surprised that is the interpretation you got after spending so much time researching it. Here is one of my favorite quotes about what DevOps is:

To me, DevOps is a label that describes a situation where there are no walls, no gates, no transitions, and no ceremony between Development and Operations. They are seamlessly integrated (when viewed from “above”) into a single, value delivering, IT entity. From within, there may be individuals who specialize on “one side or the other,” but even those individuals interface seamlessly and directly without the need for an intermediary. Most importantly (in my humble opinion) DevOps means that everyone—from Jr Analyst, to Mid Dev, to Sr. Test, to Director of IT—is equally responsible and accountable for the product from inception through retirement, meaning the Devs are just as likely to be maintaining the system in Prod as the Sys Admins are, to be doing configuration testing in the Dev Environment.

I'd also like to add that DevOps is about putting the customer first at all stages of the game. Where I think you may have gotten confused is the "fail fast, fail often" mantra you often hear when talking about devops. When you fail you had better be willing, able and ready to fix it fast. In some shops that may mean rolling back (though I hope that's becoming less common) but in most cases it means that the delta of changes from the previous deployment is small enough that it's easy to identify, fix and push the fixed code to deployment. You don't have to deploy often, but you better be reducing the changelist size and doing continuous delivery (aka continuous integration) of some sort.

Taking that description of DevOps do you still believe it's not a near-perfect fit for enterprise environments?

2) Are you really trying to convince me that "web" companies don't deal in real money? Amazon.com is a very early adopter of the devops mentality (including rapid deployments by the way) and when their main site goes down they lose millions of dollars a minute.

Sure, they aren't dealing with the level of transactions that some financial institutes are, but they are dealing with "real money." Again, devops does not have to mean pushing code to production every 5 minutes.



It sure looks like your biggest argument against devops in the enterprise is the "need for speed" and that's just silly. DevOps isn't law, it isn't a set-in-stone commandment. It's a set of guidelines and recommended practices to allow companies to work more efficiently, more effectively and provide better products to their customers. Why would any organization say that DevOps isn't right for them?

Jim Bird said...

Josh, you know what you are talking about and I appreciate your position. I agree that improving collaboration and communication between dev and ops is important and valuable and this is something that Devops has helped make clearer to everyone.

But speed of change continues to dominate the conversation. Look at the sessions at the recent Flowcon conference for example (a conference that was specifically aimed at Devops for the Enterprise): "Velocity and Volume (or Speed Wins)", "The Virtuous Cycle of Velocity: What I Learned About Going Fast at eBay and Google", "Cloud Operations at Netflix: Optimizing Innovation Speed While Supporting Availability", ...

Sure there's more to Devops than Continuous Deployment. But where do you find people talking about Devops where they aren't talking about how many times they are deploying per day, and minimizing the risk of change?

I don't want to dismiss the cost to online businesses when a big failure occurs. My point about "real money" is really the system-of-systems issue where the risk of change cascades across organizations and industries. If Amazon goes down, they lose out, and anyone using their service loses out too. If a major bank or financial clearing house fails (and I only use these examples because this is a world that I am familiar with) the failures cascade and chain throughout all of the other organizations that work with them. This is a much much more expensive and ugly problem, which makes the risk of change much higher (hence the increased governance and regulatory oversight involved).

What I am asking for is more conversations about Devops that don't emphasize speed and rapid innovation/time-to-market... and are more realistic about the costs and risks of change. Something that more people can actually use and follow. The only ones I have run across that are doing this are the handful of ThoughtWorks people behind Continuous Delivery. Everyone else seems mesmerized by how many changes Amazon or some other organization pushes out every day.

Josh Meier said...

Thanks for the well thought out response, I appreciate it. If speed to production is what people are becoming mesmerized by then they aren't truly grasping the real intentions of devops. Speed is a byproduct of devops but is in no way a requirement (unless you are talking about the speed of feedback loops, in which case it is critical, but that's not devops, that's agile.)

Continuous Delivery I do believe is actually key to devops, but delivery does not mean deployment. Code should be continually built and shipped to test environments for early (and frequent) feedback. Without it you can break down all the barriers you want, but everyone but the developers will be bored for long periods of time.

I would encourage everyone who thinks that speed to production is the primary goal/driver of the devops movement to reeducate themselves; something got lost somewhere. Some constructive feedback for this article itself would be to actually address the primary concern and educate the readers on what devops actually means. You could kill two birds with one stone that way.

Attila-Mihaly Balazs said...

I too disagree strongly.

First off, DevOps in my mind is about automating as much as possible. Asking people to be "more careful" doesn't work (just ask the Knight Capital folks about it). Automation works and is truly the only solution to ensure that you make every mistake at most once.

Second, in big financial systems there always are alternatives. I see from your bio that you work in this sector and I'm surprised that you didn't mention this. Complex system doesn't work? Shut it down and just use the basic part. System fails entirely? Call up the broker and ask them to cancel your orders by phone. Brokers' system down? For one, they will be footing the bill. Second, call up the exchange and ask to cancel your trades (or the broker is already doing that). Financial systems are very loosely coupled and a problem at any one company won't bring it down. Not to mention that big companies are not "one company". They are more like many small companies cooperating with separate systems which work separately.

So DevOps doesn't increase risks - it actually reduces them.

Jim Bird said...

Attila-Mihaly,

I do agree with you about the importance of automation, replacing manual steps and meetings with automated steps and built in controls - which means getting dev and ops together to build these tools.

Unfortunately, financial systems aren't as loosely coupled as you would like them to be. Yes failures can be managed and you can fail back. And yes failures are isolated to one part of a system or one part of a company - these systems are not monolithic. But there are a lot of interdependencies between systems and between different organizations.

If a consumer website fails (even a big one), users wait and try again, and if whatever they were working on was lost they shrug "something must have gone wrong on the Internet" and keep going. If a core financial system fails, all of the organizations that work with it are busy reconciling and unravelling commitments, making sure that they understand their positions and that their customers are not impacted and so on. There are tools and protocols to do this, but it still takes time. It's not that failures can't be fixed - it's that it costs a lot more. And if for example the biggest stock exchange in the country fails on the last day of tax filing (which happened at the London Stock Exchange in 2000) the entire economy of a country can be impacted.

I am not saying that devops can't be used to help reduce risk - making smaller changes more often is a good way to reduce risk up to a certain point. I know this is effective because this is how we work today at my shop. There are things that devops can help with outside of Webops, as long as you understand the differences.

Peter Spung said...

Very nice post Jim. I appreciate your perspective, time invested in writing about it, and the 'bibliography' of links throughout with additional references and points of view. Thank you. Peter
