Thursday, January 9, 2014

Developers working in Production. Of course! Maybe, sometimes. What, are you nuts?

One of the basic ideas in Devops is that developers and operations should share responsibility for designing systems, for implementing them and keeping them running. Developers should be on call in case something goes wrong, and be the ones to fix whatever breaks. Because the person who wrote the code is often the only one who knows how it really works. And because of the moral hazard argument: if programmers are held fully accountable for the work that they do, they will be incented to do a better job, instead of writing garbage and handing it off to somebody else.

But this means that developers need some kind of access to production. How much access developers need, how often, and how this can be made safe, are important questions that have to be answered.

Hire wicked smart people and give them all access to root.
Unnamed devops evangelist, Is Devops Subversive?

If you ask whether developers should have access to production you’ll find that people fall into one of three camps:

Yeah, sure, of course – who else is going to support the system?

This is a simple decision for online startups, where there’s often nobody else to install, configure and support the application anyway.

As these organizations grow, developers often continue to stay closely involved in deployment, support and application operations, and in some cases, still play a primary role, especially in shops that heavily leverage cloud infrastructure (think Netflix).

Read my lips: Never Ever! Are you out of your freakin’ mind?

Question: Should developers have access to production?

Answer: Not only no, but hell no.

kperrier, Slashdot: Should Developers Have Access to Production

The situation is much different in large enterprises and government organizations, where walls have been built up between development and operations for many different reasons. It’s not just mergers and acquisitions and inertia and internal politics and protectionism that made this happen. It’s also SOX and PCI and HIPAA and GLBA and other overlapping regulations and privacy rules, and ITIL and COBIT and ISOxxx and CMMI and other IT governance frameworks, and internal and external auditors enforcing separation of duties and need-to-know access limitations in order to ensure the integrity and confidentiality of system data.

The same rules also apply to leaner-and-meaner Devops shops. For example, at Etsy (a Devops leader), PCI DSS compliant functions are managed and supported by a different team in a different way from the rest of their online systems: while developers have R/O access to a lot of production “data porn” (metrics and graphs and logs), they do not have access to production databases; there are more requirements for activity logging; a push to QA is handled in a clearly different way than a push to production; and all changes to production must be tracked and approved through a ticketing system.

And there’s also the problem of shared infrastructure: the same networks and servers and databases and other parts of the stack may be used by many different applications and different business units. Developers of course only understand the applications that they are working on and are only familiar with the simplified test configurations that they use day-to-day – they may not know about other systems and their shared dependencies, and could easily make changes that break these systems without being aware of the risks.

In case of emergency, break glass

Most organizations fall somewhere in between a NoOps web startup in the cloud and a legacy-bound enterprise weighed down by too much governance and management politics. Operations is usually run separately, management is still accountable to regulators and auditors, but most people understand and recognize the need for developers to help out, especially when something goes wrong.

When the shit has indeed and truly hit the fan, developers – although usually only senior developers or team leads – are brought in to help troubleshoot and recover. Their access is temporary, maybe using a “fire id” extracted from a vault, then locked down again as soon as they are done. Developers are often paired up with an operations buddy who does most of the driving, or at least watches them carefully when the developer has to take the wheel.

Question: Should developers have access to production?

Answer: Everyone agrees that developers should never have access to production… Unless they’re the developer, in which case it’s different.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Problems in production can be fixed much faster if developers can see the logs, stack traces and core dumps and look at production data when something goes wrong. Giving at least some developers read access to production logs and alerts and monitors – enough to recognize that something has gone wrong and to figure out what needs to be fixed – makes sense.

Sometimes really bad things happen and all that matters is getting the system back up and running as quickly as possible. You want the best people you can find working on the problem, and this includes developers. You’ll need their help with diagnosis and deciding what options are safest to take for roll back or roll forward, putting in an emergency fix or workaround, and data repair and reconciliation. Everyone will need to check later to make sure that any temporary fixes or workarounds are implemented properly, checked-in and redeployed.

When you run incident management fire drills, make sure that developers are included. And developers should also be included in incident postmortem reviews, even if they weren't part of the incident management team, because this is an important opportunity to learn more about the system and to improve it. But if you have developers firefighting in production more than almost never, then you’re doing almost everything wrong.

Debugging in production?

Some problems, intermittent failures and timing-related problems and heisenbugs, only happen in production and can’t be reproduced in test – or at least not without a lot of time, expense and luck. To debug these problems a developer may need to examine the run-time state of the system when the problem happens. But these problems should be the exception, not the rule. Debugging in production opens up security problems (exposing private data in memory) and run-time risks that developers and Ops both need to be aware of.

Question: Should developers have access to production?

Answer: Whenever an error occurs that I can’t replicate in a dev environment, I'm always SO tempted to hop into prod and start adding in some output statements... Yeah, it’s probably a good thing I don’t have access to prod.

Enderjsy, Slashdot: Should Developers Have Access to Production

Deploying to production?

Auditors will tell you that the people who write the code cannot be the same people who deploy it in production. But some developers will tell you that they need to take care of deployment, because Ops won’t understand all the steps involved, or at least that they need to manually check that all of the config changes were made correctly, and to run the data conversion and check that it worked, and to make sure that the right code was installed in the right places. If this is how your deployment is done, you’re doing it wrong.

And you’re doing a lot of things wrong if Ops won’t trust development enough to push changes out at all:

Most times, when I see devs screwing with production it's either a "hero" coder who is way too good to use best practices, or a situation in which the environment is so hostile that the "best" solution seems to be breaking the rules.

I once did some contract work for a company where the QA and testing process took a minimum of two weeks for the most trivial changes, and where the admins on the production servers refused to deploy things like security patches without a testing period that ran close to a month. The devs there had a hundred tricks for sneaking their code into production, and linking production code to the development servers in an attempt to meet their productivity goals.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Testing in production?

The only testing that has to be done in production is A/B split testing to see what features customers like or don’t like. You should not need to test in production to see if something works – that’s what test environments are for – except maybe when you are deploying and launching a system for the first time, or some limited integration test cases with other systems that can’t be reached from a test environment. Or load testing done with Ops – a lot of shops can’t afford to have a test environment sized big enough for real load testing.

Making Production Safe for (and from) Developers

Whether developers should have production access (and how much access you can allow them) also depends on how much developers can be trusted to be careful and responsible with the systems and with customer data. It’s inconsistent that while organizations will trust developers to write the software that runs in production, they won’t trust them with the production system. But development and production are different worlds.

Most developers lack the necessary situational awareness. They are used to experimenting and trying things to see what happens. I've seen smart, experienced developers do dangerous things in production without realizing it while they are deep into problem solving. Developers should be scared of working in production. Not too scared to think, but scared enough to think before they act. They need to understand the risks, and be held to the same duties of care as anyone in Ops.

You can spend a lot of time breaking down the wall between development and Ops, only to see it built back up overnight (much thicker and higher too) the first time that a developer blows away a production database when they thought they were in test, or kills the wrong process or hot deploys the wrong version of code or deletes the wrong config file and causes a widespread outage. Make sure that test and development environments are firewalled from production so that it isn’t possible for anything running in test to touch production through hard-wired links. Make it clear to developers when they are in production. Force them to make a jump: open a tunnel, sign on with a different id and password, see a different prompt.
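One way to build that jump into your own tooling is a guard that treats an unknown environment as production, changes the prompt, and refuses destructive actions in production unless the operator types the environment name back. This is only a minimal sketch: the APP_ENV variable and all of the function names are assumptions made up for illustration, not part of any real tool.

```python
import os


class ProductionGuardError(RuntimeError):
    """Raised when a destructive action is attempted without confirmation."""


def current_environment(env=None):
    # APP_ENV is an assumed variable name. If it is missing, assume the
    # worst case ("production") instead of silently assuming test.
    env = os.environ if env is None else env
    return env.get("APP_ENV", "production").lower()


def shell_prompt(environment):
    # Make production visually unmistakable.
    if environment == "production":
        return "[!!! PRODUCTION !!!] $ "
    return f"[{environment}] $ "


def confirm_destructive(action, environment, typed_confirmation):
    """Allow the action outside production; in production, require the
    operator to type the word 'production' back before proceeding."""
    if environment != "production":
        return True
    if typed_confirmation != "production":
        raise ProductionGuardError(
            f"refusing to {action!r}: production not confirmed"
        )
    return True
```

Defaulting to "production" when the variable is missing is deliberate: the expensive mistake is thinking you are in test when you are not.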

With great power comes great responsibility

Nobody supporting an app should need – or even want – root access for day-to-day support and troubleshooting. Developers should only be granted the access that they need and no more, so that they can’t do things they shouldn't do, they can’t see things that they shouldn't see, and so that they can’t cause more damage than you can afford.

At the Velocity Conference in 2009, John Allspaw and Paul Hammond explained how important and useful it is for developers to have access to production - but that most of this access can be and should be limited:

Allspaw: “I believe that ops people should make sure that developers can see what’s happening on the systems without going through operations… There’s nothing worse than playing phone tag with shell commands. It’s just dumb.”

“Giving someone [i.e., a developer] a read-only shell account on production hardware is really low risk. Solving problems without it is too difficult.”

Hammond: “We’re not saying that every developer should have root access on every production box.”

Developers who need access to the system should be given a read-only account that allows them to monitor the run-time – logs and metrics. Then force them to make another jump to gain whatever command or write access they need to do admin functions or help with repair and recovery.
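The shape of this policy (read-only by default, a deliberate and time-limited jump for anything else, everything recorded) can be sketched in a few lines. This is an in-memory illustration only; in practice the enforcement would live in sudo rules, a credential vault or the application's own admin layer. The action names, the 15 minute default and the Session class are all assumptions for the example.

```python
import time
from dataclasses import dataclass, field

# Hypothetical action names, for illustration only.
READ_ONLY = {"view_logs", "view_metrics"}
ELEVATED = {"restart_service", "repair_data"}


@dataclass
class Session:
    user: str
    elevated_until: float = 0.0
    audit: list = field(default_factory=list)

    def elevate(self, reason, ttl_seconds=900, now=None):
        """Grant temporary elevated access; a reason (e.g. a ticket id) is mandatory."""
        now = time.time() if now is None else now
        if not reason.strip():
            raise ValueError("elevation requires a reason (e.g. a ticket id)")
        self.elevated_until = now + ttl_seconds
        self.audit.append((now, self.user, "elevate", reason))

    def allowed(self, action, now=None):
        """Read-only actions always pass; elevated ones only inside the window."""
        now = time.time() if now is None else now
        if action in READ_ONLY:
            ok = True
        elif action in ELEVATED:
            ok = now < self.elevated_until
        else:
            ok = False  # anything not explicitly listed is denied
        self.audit.append((now, self.user, action, "allowed" if ok else "denied"))
        return ok
```

Denied attempts are logged as well as successful ones, since an auditor will usually want to see both.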

One problem is that a lot of systems aren’t designed with fine-grained access control at the admin level: there’s an admin user (that owns the application and can see and do everything needed to set up and run the system) and there’s everybody else. It can be painful to break out the application and the environment ownership structure and permissioning scheme and separate read-only monitoring access from support and control functions, to set up sudo privilege escalation rules, and to track and manage all of the user accounts properly.

And none of this works if you aren’t properly protecting confidential and private data or other data that somebody could use for their own benefit. That means tokenizing or masking or encrypting data so it can’t be read, hashing critical data to make sure that it hasn’t been tampered with, and making sure that confidential data isn’t written to logs or temporary files.
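Two of these safeguards are easy to sketch with the Python standard library: masking card-like numbers before they reach a log, and using a keyed hash (HMAC, which is one way to do the tamper check, not necessarily the one you would pick) to detect changes to a critical record. The card pattern, the filter class and the helper names are assumptions for illustration; a real pattern would need to match your own data formats.

```python
import hashlib
import hmac
import logging
import re

# Assumed pattern: runs of 13 to 16 digits that look like card numbers.
# The last four digits are captured so they can be left visible.
CARD_RE = re.compile(r"\b\d{9,12}(\d{4})\b")


def mask_cards(text):
    """Replace all but the last four digits of card-like numbers with '*'."""
    return CARD_RE.sub(
        lambda m: "*" * (len(m.group(0)) - 4) + m.group(1), text
    )


class RedactingFilter(logging.Filter):
    """Logging filter that masks card-like numbers in the formatted message."""

    def filter(self, record):
        record.msg = mask_cards(record.getMessage())
        record.args = ()  # message is already formatted
        return True


def sign(record_bytes, key):
    """Keyed SHA-256 digest of a record, stored alongside it."""
    return hmac.new(key, record_bytes, hashlib.sha256).hexdigest()


def verify(record_bytes, key, signature):
    """True if the record still matches the signature taken earlier."""
    return hmac.compare_digest(sign(record_bytes, key), signature)
```

The filter would be attached with `handler.addFilter(RedactingFilter())` on whichever handlers write to disk. Masking at the logging layer is a backstop, not a substitute for keeping the data out of log calls in the first place.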

You also have to make sure that you can track what everyone in production does, what they looked at and what they changed through auditing in the application, database and OS; and track changes to important files (including the code) using a detective change control tool like OSSEC.
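The idea behind detective change control is simple enough to sketch: take a baseline of file hashes, take another snapshot later, and report what was added, removed or modified. This is not how OSSEC itself is implemented or configured, just a simplified illustration of the technique using the standard library; the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file, read in chunks so large files are not loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(paths):
    """Map each existing file path to its current digest."""
    return {str(p): file_digest(p) for p in paths if Path(p).is_file()}


def detect_changes(baseline, current):
    """Compare two snapshots and classify every path that differs."""
    changes = {}
    for path in sorted(baseline.keys() | current.keys()):
        before, after = baseline.get(path), current.get(path)
        if before is None:
            changes[path] = "added"
        elif after is None:
            changes[path] = "removed"
        elif before != after:
            changes[path] = "modified"
    return changes
```

The baseline has to be stored somewhere that the people being monitored cannot write to, otherwise the check proves nothing.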

All of these checks and safeties also make it safer for developers, as well as for Ops, and will hopefully be enough to keep the auditors satisfied.

Try to make it work

There are advantages to having developers working in production besides getting their help with support and troubleshooting.

The more time that developers spend working in production on operations issues with operations staff, the more that they will learn about what it takes to design and build a real-world system. Hopefully they will take this and design better, more resilient systems with more transparency and more consideration for support and admin requirements.

And having developers share responsibility for making the software work and support it, proving that they care and helping out, will go a long way to breaking down the wall of confusion between operations and development.

It’s not a simple thing to do. It might not even be possible in your organization – at least not in your lifetime. You need to understand and balance the risks and advantages. You need to understand the political and governance constraints and how to deal with them. You need to put in the proper safeguards. And you need to make sure that you stay onside of compliance and regulations. But you’re leaving too much on the table if you don’t try.

4 comments:

TiTi said...

Excellent article!
Very good description of what's happening, visions according to the person role.
Interesting recommandations.
Thank you!

Jessica Dodson said...

"Whenever an error occurs that I can’t replicate in a dev environment, I'm always SO tempted to hop into prod and start adding in some output statements... Yeah, it’s probably a good thing I don’t have access to prod"

Ha. I think it depends on how quickly the developer can get in, fix what's wrong, and get out. You don't want to get super side tracked, but if the developer can get in and resolve it in one go it's worth it.

Anonymous said...

Great article! Nicely cuts across all aspects of real-world questions that we ask ourselfs when trying to find a way to make devops work.

Anonymous said...

Another obstacle - data analysts.

With enough manpower and process automation, you could conceivably have analysts with no access to the internet, no local admin rights, no code deployment rights, and only obfuscated data, but the cost would be prohibitive.
