
Thursday, January 9, 2014

Developers working in Production. Of course! Maybe, sometimes. What, are you nuts?

One of the basic ideas in Devops is that developers and operations should share responsibility for designing systems, for implementing them and for keeping them running. Developers should be on call in case something goes wrong, and be the ones to fix whatever breaks. Because the person who wrote the code is often the only one who knows how it really works. And because of the moral hazard argument: if programmers are held fully accountable for the work that they do, they will have an incentive to do a better job, instead of writing garbage and handing it off to somebody else.

But this means that developers need some kind of access to production. How much access developers need, how often, and how this can be made safe, are important questions that have to be answered.

Hire wicked smart people and give them all access to root.
Unnamed devops evangelist, Is Devops Subversive?

If you ask whether developers should have access to production you’ll find that people fall into one of 3 camps:

Yeah, sure, of course – who else is going to support the system?

This is a simple decision for online startups, where there’s often nobody else to install, configure and support the application anyway.

As these organizations grow, developers often continue to stay closely involved in deployment, support and application operations, and in some cases, still play a primary role, especially in shops that heavily leverage cloud infrastructure (think Netflix).

Read my lips: Never Ever! Are you out of your freakin’ mind?

Question: Should developers have access to production?

Answer: Not only no, but hell no.

kperrier, Slashdot: Should Developers Have Access to Production

The situation is much different in large enterprises and government organizations, where walls have been built up between development and operations for many different reasons. It’s not just mergers and acquisitions and inertia and internal politics and protectionism that made this happen. It’s also SOX and PCI and HIPAA and GLBA and other overlapping regulations and privacy rules, and ITIL and COBIT and ISOxxx and CMMI and other IT governance frameworks, and internal and external auditors enforcing separation of duties and need-to-know access limitations in order to ensure the integrity and confidentiality of system data.

The same rules also apply to leaner-and-meaner Devops shops. For example, at Etsy (a Devops leader), PCI DSS compliant functions are managed and supported by a different team in a different way from the rest of their online systems: while developers have R/O access to a lot of production “data porn” (metrics and graphs and logs), they do not have access to production databases; there are more requirements for activity logging; a push to QA is handled in a clearly different way than a push to production; and all changes to production must be tracked and approved through a ticketing system.

And there’s also the problem of shared infrastructure: the same networks and servers and databases and other parts of the stack may be used by many different applications and different business units. Developers of course only understand the applications that they are working on and are only familiar with the simplified test configurations that they use day-to-day – they may not know about other systems and their shared dependencies, and could easily make changes that break these systems without being aware of the risks.

In case of emergency, break glass

Most organizations fall somewhere in between a NoOps web startup in the cloud and a legacy-bound enterprise weighted down by too much governance and management politics. Operations is usually run separately, management is still accountable to regulators and auditors, but most people understand and recognize the need for developers to help out, especially when something goes wrong.

When the shit has indeed and truly hit the fan, developers – although usually only senior developers or team leads – are brought in to help troubleshoot and recover. Their access is temporary, maybe using a “fire id” extracted from a vault, then locked down again as soon as they are done. Developers are often paired up with an operations buddy who does most of the driving, or at least watches them carefully when the developer has to take the wheel.

Question: Should developers have access to production?

Answer: Everyone agrees that developers should never have access to production… Unless they’re the developer, in which case it’s different.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Problems in production can be fixed much faster if developers can see the logs, stack traces and core dumps and look at production data when something goes wrong. Giving at least some developers read access to production logs and alerts and monitors – enough to recognize that something has gone wrong and to figure out what needs to be fixed – makes sense.

Sometimes really bad things happen and all that matters is getting the system back up and running as quickly as possible. You want the best people you can find working on the problem, and this includes developers. You’ll need their help with diagnosis, with deciding which options are safest for rolling back or rolling forward, with putting in an emergency fix or workaround, and with data repair and reconciliation. Everyone will need to check later to make sure that any temporary fixes or workarounds are implemented properly, checked in and redeployed.

When you run incident management fire drills, make sure that developers are included. And developers should also be included in incident postmortem reviews, even if they weren't part of the incident management team, because this is an important opportunity to learn more about the system and to improve it. But if you have developers firefighting in production more than almost never, then you’re doing almost everything wrong.

Debugging in production?

Some problems (intermittent failures, timing-related problems, heisenbugs) only happen in production and can’t be reproduced in test – or at least not without a lot of time, expense and luck. To debug these problems a developer may need to examine the run-time state of the system when the problem happens. But these problems should be the exception, not the rule. Debugging in production opens up security problems (exposing private data in memory) and run-time risks that developers and Ops both need to be aware of.

Question: Should developers have access to production?

Answer: Whenever an error occurs that I can’t replicate in a dev environment, I'm always SO tempted to hop into prod and start adding in some output statements... Yeah, it’s probably a good thing I don’t have access to prod.

Enderjsy, Slashdot: Should Developers Have Access to Production

Deploying to production?

Auditors will tell you that the people who write the code cannot be the same people who deploy it in production. But some developers will tell you that they need to take care of deployment, because Ops won’t understand all the steps involved, or at least that they need to manually check that all of the config changes were made correctly, and to run the data conversion and check that it worked, and to make sure that the right code was installed in the right places. If this is how your deployment is done, you’re doing it wrong.

And you’re doing a lot of things wrong if Ops won’t trust development enough to push changes out at all:

Most times, when I see devs screwing with production it's either a "hero" coder who is way too good to use best practices, or a situation in which the environment is so hostile that the "best" solution seems to be breaking the rules.

I once did some contract work for a company where the QA and testing process took a minimum of two weeks for the most trivial changes, and where the admins on the production servers refused to deploy things like security patches without a testing period that ran close to a month. The devs there had a hundred tricks for sneaking their code into production, and linking production code to the development servers in an attempt to meet their productivity goals.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Testing in production?

The only testing that has to be done in production is A/B split testing to see what features customers like or don’t like. You should not need to test in production to see if something works – that’s what test environments are for – except maybe when you are deploying and launching a system for the first time, or some limited integration test cases with other systems that can’t be reached from a test environment. Or load testing done with Ops – a lot of shops can’t afford to have a test environment sized big enough for real load testing.
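
The mechanics of a split test are simple enough to sketch. Here’s a minimal illustration (mine, not from any particular shop) of deterministic bucket assignment: hash the user id together with the experiment name, so the same user always lands in the same variant and there is no assignment state to store anywhere.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str,
              variants=("control", "treatment")) -> str:
    """Deterministically assign a user to an experiment variant.

    Hashing the user id with the experiment name keeps assignments
    stable across sessions, and different experiments split the same
    users differently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always sees the same variant of the same experiment.
print(ab_bucket("user-12345", "new-checkout-flow"))
```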

Making Production Safe for (and from) Developers

Whether developers should have production access (and how much access you can allow them) also depends on how much developers can be trusted to be careful and responsible with the systems and with customer data. It’s inconsistent that while organizations will trust developers to write the software that runs in production, they won’t trust them with the production system. But development and production are different worlds.

Most developers lack the necessary situational awareness. They are used to experimenting and trying things to see what happens. I've seen smart, experienced developers do dangerous things in production without realizing it while they are deep into problem solving. Developers should be scared of working in production. Not too scared to think, but scared enough to think before they act. They need to understand the risks, and be held to the same duties of care as anyone in Ops.

You can spend a lot of time breaking down the wall between development and Ops, only to see it built back up overnight (much thicker and higher too) the first time that a developer blows away a production database when they thought they were in test, or kills the wrong process or hot deploys the wrong version of code or deletes the wrong config file and causes a widespread outage. Make sure that test and development environments are firewalled from production so that it isn’t possible for anything running in test to touch production through hard-wired links. Make it clear to developers when they are in production. Force them to make a jump: open a tunnel, sign on with a different id and password, see a different prompt.
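
One cheap way to force that jump, sketched below with a made-up APP_ENV environment variable: make anyone running a destructive operation in production type the environment name back before the operation will run. It’s a speed bump, not a security control, but it makes it very hard to blow away a database while thinking you’re in test.

```python
import os
import sys

def require_production_confirmation(action: str) -> None:
    """Refuse to run a risky action in production until the operator
    types the environment name back. A speed bump, not a security
    control: it exists to make the environment unmistakable."""
    env = os.environ.get("APP_ENV", "unknown")  # APP_ENV is illustrative
    if env != "production":
        return  # keep test and dev friction-free
    answer = input(f"You are in PRODUCTION. Type 'production' to {action}: ")
    if answer.strip().lower() != "production":
        sys.exit(f"Aborted: {action}")

require_production_confirmation("truncate and reload the orders table")
```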

With great power comes great responsibility

Nobody supporting an app should need – or even want – root access for day-to-day support and troubleshooting. Developers should only be granted the access that they need and no more, so that they can’t do things they shouldn't do, they can’t see things that they shouldn't see, and so that they can’t cause more damage than you can afford.

At the Velocity Conference in 2009, John Allspaw and Paul Hammond explained how important and useful it is for developers to have access to production – but also that most of this access can and should be limited:

Allspaw: “I believe that ops people should make sure that developers can see what’s happening on the systems without going through operations… There’s nothing worse than playing phone tag with shell commands. It’s just dumb.”

“Giving someone [i.e., a developer] a read-only shell account on production hardware is really low risk. Solving problems without it is too difficult.”

Hammond: “We’re not saying that every developer should have root access on every production box.”

Developers who need access to the system should be given a read-only account that allows them to monitor the run-time – logs and metrics. Then force them to make another jump to gain whatever command or write access they need to do admin functions or help with repair and recovery.

One problem is that a lot of systems aren’t designed with fine-grained access control at the admin level: there’s an admin user (that owns the application and can see and do everything needed to set up and run the system) and there’s everybody else. It can be painful to break out the application and environment ownership structure and permissioning scheme, to separate read-only monitoring access from support and control functions, to set up sudo privilege escalation rules, and to track and manage all of the user accounts properly.
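
To make the idea concrete, here is a toy sketch of the kind of role separation this implies. The role names and permissions are made up for illustration; a real system would map them onto OS groups, sudo rules and database grants.

```python
# Illustrative roles only: the point is that read-only monitoring
# access is a separate, smaller grant than support or admin access.
ROLES = {
    "dev-readonly": {"read_logs", "read_metrics"},
    "dev-support":  {"read_logs", "read_metrics", "restart_service"},
    "ops-admin":    {"read_logs", "read_metrics", "restart_service",
                     "deploy", "edit_config"},
}

def can(role: str, permission: str) -> bool:
    return permission in ROLES.get(role, set())

assert can("dev-readonly", "read_logs")        # monitoring: yes
assert not can("dev-readonly", "edit_config")  # control: no
```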

And none of this works if you aren’t properly protecting confidential and private data or other data that somebody could use for their own benefit. That means tokenizing or masking or encrypting data so that it can’t be read, hashing critical data to make sure that it hasn’t been tampered with, and making sure that confidential data isn’t written to logs or temporary files.
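
A minimal sketch of the last point, using Python’s standard logging module: a filter that redacts anything that looks like a card number or an SSN before it reaches a log file. The patterns are illustrative; a real system needs patterns matched to its own data.

```python
import logging
import re

class MaskingFilter(logging.Filter):
    """Redact sensitive-looking values before they are written out."""
    PATTERNS = [
        (re.compile(r"\b\d{13,16}\b"), "[REDACTED-PAN]"),          # card numbers
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),  # SSNs
    ]

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in self.PATTERNS:
            msg = pattern.sub(replacement, msg)
        record.msg, record.args = msg, None
        return True

log = logging.getLogger("payments")
log.addHandler(logging.StreamHandler())
log.addFilter(MaskingFilter())
log.warning("card 4111111111111111 declined")  # logs: card [REDACTED-PAN] declined
```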

You also have to make sure that you can track what everyone in production does, what they looked at and what they changed through auditing in the application, database and OS; and track changes to important files (including the code) using a detective change control tool like OSSEC.
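
At the application level, the simplest version of this looks something like the sketch below: wrap every admin function so that who ran it, when, and with what arguments gets recorded somewhere operators and auditors can see. The function names are hypothetical, and this complements – it does not replace – OS- and database-level auditing and tools like OSSEC.

```python
import functools
import getpass
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

def audited(action: str):
    """Record who ran an admin function, when, and with what arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            audit_log.info("%s user=%s action=%s args=%r kwargs=%r",
                           datetime.now(timezone.utc).isoformat(),
                           getpass.getuser(), action, args, kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@audited("restart service")  # hypothetical admin function
def restart_service(name: str) -> None:
    ...

restart_service("order-matching")
```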

All of these checks and safeties also make it safer for developers, as well as for Ops, and will hopefully be enough to keep the auditors satisfied.

Try to make it work

There are advantages to having developers working in production besides getting their help with support and troubleshooting.

The more time that developers spend working in production on operations issues with operations staff, the more that they will learn about what it takes to design and build a real-world system. Hopefully they will take this and design better, more resilient systems with more transparency and more consideration for support and admin requirements.

And having developers share responsibility for making the software work and supporting it, proving that they care and helping out, will go a long way to breaking down the wall of confusion between operations and development.

It’s not a simple thing to do. It might not even be possible in your organization – at least not in your lifetime. You need to understand and balance the risks and advantages. You need to understand the political and governance constraints and how to deal with them. You need to put in the proper safeguards. And you need to make sure that you stay on the right side of compliance and regulations. But you’re leaving too much on the table if you don’t try.

Thursday, October 10, 2013

Don't You Know that Support is the Most Important Part of a Developer’s Job?

Agile development – because you are building working software faster and delivering it incrementally – forces development teams to face a common, fundamental problem: how to balance the work of developing new software with the need to support a system that is already being used in production, whether it’s the legacy system that you’re replacing, or the system that you are still building – and sometimes both.

This is especially a problem for Agile teams following Scrum. On the one hand, in order for the team to meet Sprint goals and commitments and to establish a velocity for future planning, the team is not supposed to be interrupted while they are doing their work. On the other hand, the point of working iteratively and incrementally in Scrum is to deliver working software early and frequently to the customer, who will want to use this software as soon as they can, and who will then need support and help using the software – help and support that needs to come from the people who wrote the software.

At some point, often still early in developing a system, these teams have to stop working in a bubble with their pretend Customer, and start working in the real world with real customers who have real demands.

Supporting Customers and Still Building New Software

This means that teams have to find a way to juggle support and maintenance work with design and development, to deal with rapidly changing priorities and interruptions and complaints and questions and the stress of fire fighting when things break, while still trying to deliver good quality software and hit deadlines.

It’s not easy to balance two completely different kinds of work with directly opposed goals and incentives and metrics. As Don Schueler explains in “The Fragile Balance between Agile Development and Customer Support”, development teams – even Agile teams working closely with their Customer – are mostly inward-looking, internally focused on delivery and velocity and cost and code quality and technical concerns. Support teams are outward-looking, focused on customer relationships and customer experience and completeness and minimizing operational risk.

Development is about being predictable and efficient: deliver to schedule and keep development costs down. Support is about being responsive and effective: listen to the customer, answer questions, fit in unplanned work, figure out problems and fix things right away. Development work is about flow, continuity, predictability, velocity, and, if managed correctly, is mostly under control of the team. Support and maintenance work is interrupt-driven, immediate, inconsistent and unpredictable – a completely different way of working and thinking. Development work requires the team to be drawn together so that they can collaborate on common goals and the design. Most maintenance and support work is disjointed and disconnected, smaller tasks that can be done by people working independently. Development, even in high pressure projects, is measured in weeks or months. Support and maintenance work needs to be done in days or hours or sometimes minutes.

Agile Support Models: Maintenance Victims

One way that teams try to handle support and maintenance is by sacrificing someone from the team: offering up a “maintenance victim” who takes on the support burden for the rest of the team temporarily, letting the others focus on design and development work. This includes taking calls from Ops or directly from customers, looking at logs, solving problems and fixing bugs. It could also mean staying after hours to help troubleshoot, repair a production problem or put out a fix, and being on call after hours and on weekends.

The rest of the team tries to pretend that this victim doesn’t exist. If the victim isn’t busy working on support issues or fixing bugs found in production, they might work on fixing other bugs or maybe some other low-priority development work, but they are subtracted from the team’s velocity – nobody depends on them to deliver anything important.

Teams generally rotate someone through support and triage responsibilities for one or two Sprints. This way everyone at some point “shares the pain” and gets some familiarity with support problems and operational issues. There are also positive sides to being sacrificed to support. Developers get a chance to learn more about the system and stretch some of their technical skills, and get off of the hamster wheel of Sprint-after-Sprint delivery for a bit. And they get a chance to play the hero, step in and fix something important and make the customer happy.

Kent Beck and Martin Fowler in Planning Extreme Programming extend this idea to larger organizations by creating a small production support team: 2-4 developers who volunteer to focus on fixing bugs and dealing with production problems. Developers spend a couple of Sprints in production support, then rotate back to development work. Beck and Fowler recommend staggering these rotations so that the whole support team never turns over at once – at least one member always knows what is going on and what problems are being worked on.

Sacrificing a maintenance victim or a team makes it possible for most of the rest of the team to move forward on development, while still meeting support commitments. This approach assumes that anyone on the team is capable of figuring out and fixing any problem in the system – that everyone is a cross-functional generalist. And this means that whoever is on this support rotation has to be good enough and experienced enough that they can deal with most issues without bringing in the rest of the team - you can’t rotate newbies through support and maintenance work, at least not without someone senior backing them up.

And you also have to be prepared for problems that are too big or too urgent for your maintenance victim to take care of on their own. Even with a dedicated team you may still need to build in some kind of slack or buffer to deal with emergencies and general helping out, so that you don’t keep blowing up Sprints. You can come up with a reasonable allowance based on “yesterday’s weather”: how much support work the team has had to do over the last few weeks or months. If you can’t make this work – if the entire team is spending too much time on support and fire fighting and pushing hot fixes – then you are doing something wrong, and you have to get things under control before you build more software anyway.
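
The arithmetic behind a “yesterday’s weather” allowance is as simple as it sounds. A sketch, with made-up numbers:

```python
def support_allowance(recent_support_hours: list[float],
                      sprint_capacity_hours: float) -> float:
    """Reserve sprint capacity for support based on the average support
    load of the last few sprints ("yesterday's weather")."""
    average = sum(recent_support_hours) / len(recent_support_hours)
    return min(average, sprint_capacity_hours)

# If the last four sprints cost 30, 45, 25 and 40 hours of support work,
# hold back 35 hours of a 400-hour sprint for interruptions.
print(support_allowance([30.0, 45.0, 25.0, 40.0], 400.0))
```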

Kanban instead of – or inside of – Scrum

Rather than trying to shoehorn maintenance and support into time boxes, some teams have found that Kanban is much better suited than Scrum or XP to balancing support, maintenance and operations with new development work.

Kanban’s queuing model and use of task boards makes it easy to see what work needs to be done, what work is being done, who is doing it, what’s getting in the way, and when anything changes.

Kanban makes it easier to track and manage different kinds of work that require different kinds of skills and that don’t always fit nicely into a 1-week or 2-week time-box.

Kanban doesn’t pretend that you won’t be or can’t be interrupted – instead it helps you to manage interruptions and minimize their impact on the team. First, in Kanban you set limits on how much of different kinds of work the team can deal with at a time. This lets the team get control over work coming in, and stay focused on getting things done. Kanban’s queue-and-task model allows emergencies to pre-empt whatever work is in progress through escalation/priority lanes. And priorities can keep changing right up until the last minute – team members just pull the highest priority work item from the ready queue when they are free to take on more work, whether this is designing and developing a new feature, or fixing a bug, or dealing with a support issue.
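
A toy model of that pull policy, with made-up classes of service and WIP limits, might look like this: team members pull the highest-priority ready item whose WIP limit still has room, and an emergency jumps the queue simply by carrying a more urgent priority.

```python
import heapq

class KanbanBoard:
    """A toy Kanban pull model: WIP limits per class of service and a
    priority-ordered ready queue. Names and limits are illustrative."""
    WIP_LIMITS = {"feature": 3, "bug": 2, "support": 2}

    def __init__(self):
        self.ready = []  # heap of (priority, item, kind); lower = more urgent
        self.in_progress = {kind: 0 for kind in self.WIP_LIMITS}

    def add(self, priority: int, item: str, kind: str) -> None:
        heapq.heappush(self.ready, (priority, item, kind))

    def pull(self):
        """Pull the most urgent ready item whose WIP limit has room."""
        deferred, pulled = [], None
        while self.ready:
            priority, item, kind = heapq.heappop(self.ready)
            if self.in_progress[kind] < self.WIP_LIMITS[kind]:
                self.in_progress[kind] += 1
                pulled = (item, kind)
                break
            deferred.append((priority, item, kind))
        for entry in deferred:  # put back anything we skipped over
            heapq.heappush(self.ready, entry)
        return pulled

board = KanbanBoard()
board.add(5, "new reporting screen", "feature")
board.add(1, "production outage: login failures", "support")  # expedite
print(board.pull())  # -> ('production outage: login failures', 'support')
```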

Kanban helps teams focus more on immediate, tactical issues. It’s a better model to follow when you have more maintenance and support work than new design and development, or when you have to assert control over a major problem or manage something with a lot of moving pieces like the launch of a new system.

Devops Changes Everything

Devops, as followed by organizations like Etsy and Facebook and Netflix (where they go so far as to call it NoOps) tries to completely break down the boundaries between development, maintenance, support and operations. Devops engages developers directly and closely into the support, maintenance and operations of the systems that they build. Developers who work in these organizations are not just writing code – they are part of a team running an online service-based business, which means that support work is as important, and sometimes more important, than designing and writing more software.

In these organizations, developers are held personally responsible for their software, for getting it into production and making sure that it works. They are on call for problems with software that they worked on. They are actively involved in operations of the system, providing insight into how the system works and how it is running, in testing and configuring it and tuning it and troubleshooting problems.

Devops changes what developers work on and how they do it. They move away from project work and more towards fast feature development, fixing, tuning and hardening. Availability and reliability and performance and security and other operational factors become as important – or more important – than delivery schedules and velocity. Developers spend more time thinking about how to make the system work, how to simplify deployment and setup and about the information that people need to understand what’s going on inside the system, what metrics and tools might be useful, how to handle bad data and infrastructure failures, what could go wrong when they make a change and who they need to check with and what they need to test for.

Maintenance and Support – Responsibility and Feedback

Whether developers need to – or even should – take first line support calls from users, they at least need to be part of second level and third level support, where problems are investigated and solved.

This is not just because they are usually the only people who can actually figure out and fix many problems.

Putting aside moral hazard arguments about whether it’s ethically acceptable for developers not to take full responsibility for the consequences for their decisions and the quality of their work, there are compelling advantages to developers being directly involved in supporting and maintaining the software that they work on.

The most important is the quality of the feedback that developers get from supporting a real system – feedback that is too valuable for them to ignore.

Real feedback on what you did right in building the system, and what you got wrong. Feedback on what you thought the customer needed vs. what they really need. What features customers really find useful (and what they don’t). Where the design is weak. Where most of your problems are coming from (the 20% of the code where 80% of the bugs are hiding), where the weaknesses are in your testing and reviews, where you need to focus and where you need to improve. Valuable information about what you’re building and how you’re building it, how you plan and prioritize, and how you can get better.

When developers are called into fire fighting production incidents and Root Cause Analysis reviews they can learn enormous amounts about what it takes to build software for the real world. Thinking seriously about how problems happened and how to prevent them can change how you plan, design, build, test and deploy software; and how people work together as a team.

Farming all of this off to someone else, filtering it through a help desk or an offshore maintenance team, breaks these valuable feedback loops, with negative effects for everyone involved.

Peter Gillard-Moss explains how this happens:

In a startup, developers take care of problems themselves, well, because there isn’t anybody else to do it. But at some point things change:

“…managers decided that we were spending far too long investigating users’ problems and not long enough building the new features the business wanted. Developers needed to be more productive, and more productive meant developers developing more new features. To get developers to develop they need to be ‘in the zone’. They need headphones and big screens to glue their eyes to. They did not need petty interruptions like stupid users ringing up because they got a pop up saying their details will be resent when they tried to refresh.”

But by doing this, the development team became disconnected from the results of their work, and from their customers…

“A systems thinker would tell you this is wrong. You’ve gone from a system that connected a user to the team responsible with one degree of separation, to one that has three degrees of separation. Or think of it another way: the team producing the product, and responsible for improvements and fixes used to be one degree away from their end users, who use the product and are feeding back the product’s shortcomings and issues, but are now three degrees. And not even three degrees all of the time. The majority of the time the team won’t ever hear about most of the support issues. And most of the time the team won’t even have that much interaction with the team that does hear about most of the support issues.”

The result: Customers don’t get the support that they need. Developers don’t get the information that they need to understand how to make the system work better. A support team stuck in the middle with people just trying to keep things from getting worse and hoping to find a better job someday. It’s a self-reinforcing, negative spiral.

In our shop, support takes priority over development – always. Our senior developers work with operations to support the system, and are on call when we put new software in and on call if something goes wrong after hours. They can bring in anyone else from any team that they need for help. As a result, we have very few serious problems and these problems get fixed fast and fixed right. The experience that everyone gets from working in support helps them to design and write better, safer code. This has made the system more resilient and easier and less expensive to support and safer to setup and run and easier and safer to change. And it has made our organization better too. It’s brought developers and operations closer together, and closer to what’s important to the business.

Whether you call it “Agile” or not, there’s nothing more agile than a team that is working directly with customers, responding immediately to problems and changing requirements in a live system. While some developers and managers think of this as overhead, “sustaining engineering”, and try to push it off to somebody else so that they can focus on “more strategic” work, others recognize that this is really the leading edge of software development, the only way to run a successful software organization, and the only way to make software, and developers, better.
