I’ve been going over a couple of posts by Steve Kilner that question whether Agile methods can be used effectively in software maintenance. It’s a surprising question really. There are a lot of maintenance teams who have had success following Agile methods like Scrum and Extreme Programming (XP) for some time now. We’ve been doing it for almost 5 years, enhancing and maintaining and supporting enterprise systems, and I know that it works.
Agile development naturally leads into maintenance – the goal of incremental Agile development is to get working software out to customers as soon as possible, and get customers using it. At some point, when customers are relying on the software to get real business done and need support and help to keep the system running, teams cross from development over to maintenance. But there’s no reason for Agile development teams to fundamentally change the way that they work when this happens.
It is harder to introduce Agile practices into a legacy maintenance team – there are a lot of technical requirements and some cultural changes that need to be made. But most maintenance teams have little to lose and lots to gain from borrowing from what Agile development teams are doing. Agile methods are designed to help small teams deal with a lot of change and uncertainty, and to deliver software quickly – all things that are at least as important in maintenance as they are in development. Technical practices in Extreme Programming especially help ensure that the code is always working – which is even more important in maintenance than it is in development, because the code has to work the first time in production.
Agile methods have to be adapted to maintenance, but most teams have found it necessary to adapt these methods to fit their situations anyways. Let’s look at what works and what has to be changed to make Agile methods like Scrum and XP work in maintenance.
What works well and what doesn’t
Managing maintenance isn’t the same as managing a development project – even an Agile development project. Although Agile development teams expect to deal with ambiguity and constant change, maintenance teams need to be even more flexible and responsive, to manage conflicts and unpredictable resourcing problems. Work has to be continuously reviewed and prioritized as it comes in – the customer can’t wait for 2 weeks for you to look at a production bug. The team needs a fast path for urgent changes and especially for hot fixes.
You have to be prepared for support demands and interruptions. Structure the team so that some people can take care of second-level support, firefighting and emergency bug fixing and the rest of the team can keep moving forward and get something done. Build slack into schedules to allow for last-minute changes and support escalation.
You will also have to be more careful in planning out maintenance work, to take into account technical and operational dependencies and constraints and risks. You’re working in the real world now, not the virtual reality of a project.
Standups play an important role in Agile projects to help teams come up to speed and bond. But most maintenance teams work fine without standups – since a lot of maintenance work can be done by one person working on their own, team members don’t need to listen to each other each morning talking about what they did yesterday and what they’re going to do – unless the team is working together on major changes. If someone has a question or runs into a problem, they can ask for help without waiting until the next day.
Most changes and fixes that maintenance teams need to make are small, and there is almost always pressure from the business to get the code out as soon as it is ready, so an Agile approach with small and frequent releases makes a lot of sense. If the time boxes are short enough, the customer is less likely to interrupt and re-prioritize work in progress – most businesses can wait a few days or a couple of weeks to get something changed.
Time boxing gives teams a way to control and structure their work, an opportunity to batch up related work to reduce development and testing costs, and natural opportunities to add in security controls and reviews and other gates. It also makes maintenance work more like a project, giving the team a chance to set goals and to see something get done. But time boxing comes with overhead – the planning and setup at the start, then deployment and reviews at the end – all of which adds up over time. Maintenance teams need to be ruthless with ceremonies and meetings, pare them down, keep only what’s necessary and what works.
It’s even more important in maintenance than in development to remember that the goal is to deliver working code at the end of each time box. If some code is not working, or you’re not sure if it is working, then extend the deadline, back some of the changes out, or pull the plug on this release and start over. Don’t risk a production failure in order to hit an arbitrary deadline. If the team is having problems fitting work into time boxes, then stop and figure out what you’re doing wrong – the team is trying to do too much too fast, or the code is too unstable, or people don’t understand the code enough – and fix it and move on.
Reviews and Retrospectives
Retrospectives are important in maintenance to keep the team moving forward, to find better ways of working, and to solve problems. But like many practices, regular reviews reach a point of diminishing returns over time – people end up going through the motions. Once the team is setup, reviews don’t need to be done in each iteration unless the team runs into problems.
Schedule reviews when you or the team need them. Collect data on how the team is working, on cycle time and bug report/fix ratios, correlate problems in production with changes, and get the team together to review if the numbers move off track. If the team runs into a serious problem like a major production failure, then get to the bottom of it through Root Cause Analysis.
Sustainable pace / 40-hour week
It’s not always possible to work a 40-hour week in maintenance. There are times when the team will be pushed to make urgent changes, spend late nights firefighting, releasing after hours and testing on weekends. But if this happens too often or goes on too long the team will burn out. It’s critical to establish a sustainable pace over the long term, to treat people fairly and give them a chance to do a good job.
Pairing is hard to do in small teams where people are working on many different things. Pairing does make sense in some cases – people naturally pair-up when trying to debug a nasty problem or walking through a complicated change – but it’s not necessary to force it on people, and there are good reasons not to.
Some teams (like mine) rely more on code reviews instead of pairing, or try to get developers to pair when first looking at a problem or change, and at the end again to review the code and tests. The important thing is to ensure that changes get looked at by at least one other person if possible, however this gets done.
Collective Code Ownership
Because maintenance teams are usually small and have to deal with a lot of different kinds of work, sooner or later different people will end up working on different parts of the code. It’s necessary, and it’s a good thing because people get a chance to learn more about the system and work with different technologies and on different problems.
But there’s still a place for specialists in maintenance. You want the people who know the code the best to make emergency fixes or high-risk changes – or at least have them review the changes – because it has to work the first time. And sometimes you have no choice – sometimes there is only one person who understands a framework or language or technical problem well enough to get something done.
Coding Guidelines – follow the rules
Getting the team to follow coding guidelines is important in maintenance to help ensure the consistency and integrity of the code base over time – and to help ensure software security. Of course teams may have to compromise on coding standards and style conventions, depending on what they have inherited in the code base; and teams that maintain multiple systems will have to follow different guidelines for each system.
In XP, teams are supposed to share a Metaphor: a simple high-level expression of the system architecture (the system is a production line, or a bill of materials) and common names and patterns that can be used to describe the system. It’s a fuzzy concept at best, a weak substitute for more detailed architecture or design, and it’s not of much practical value in maintenance. Maintenance teams have to work with the architecture and patterns that are already in place in the system.
What is important is making sure that the team has a common understanding of these patterns and the basic architecture so that the integrity isn’t lost – if it hasn’t been lost already. Getting the team together and reviewing the architecture, or reverse-engineering it, making sure that they all agree on it and documenting it in a simple way is important especially when taking over maintenance of a new system and when you are planning major changes.
Agile development teams start with simple designs and try to keep them simple. Maintenance teams have to work with whatever design and architecture that they inherit, which can be overwhelmingly complex, especially in bigger and older systems. But the driving principle should still be to design changes and new features as simple as the existing system lets you – and to simplify the system’s design further whenever you can.
Especially when making small changes, simple, just-enough design is good – it means less documentation and less time and less cost. But maintenance teams need to be more risk adverse than development teams – even small mistakes can break compatibility or cause a run-time failure or open a security hole. This means that maintainers can’t be as iterative and free to take chances, and they need to spend more time upfront doing analysis, understanding the existing design and working through dependencies, as well as reviewing and testing their changes for regressions afterwards.
Refactoring takes on a lot of importance in maintenance. Every time a developer makes a change or fix they should consider how much refactoring work they should do and can do to make the code and design clearer and simpler, and to pay off technical debt. What and how much to refactor depends on what kind of work they are doing (making a well-thought-out isolated change, or doing shotgun surgery, or pushing out an emergency hot fix) and the time and risks involved, how well they understand the code, how good their tools are (development IDEs for Java and .NET at least have good built-in tools that make many refactorings simple and safe) and what kind of safety net they have in place to catch mistakes – automated tests, code reviews, static analysis.
Some maintenance teams don’t refactor because they are too afraid of making mistakes. It’s a vicious circle – over time the code will get harder and harder to understand and change, and they will have more reasons to be more afraid. Others claim that a maintenance team is not working correctly if they don’t spend at least 50% of their time refactoring.
The real answer is somewhere in between – enough refactoring to make changes and fixes safe. There are cases where extensive refactoring, restructuring or rewriting code is the right thing to do. Some code is too dangerous to change or too full of bugs to leave the way it is – studies show that in most systems, especially big systems, 80% of the bugs can cluster in 20% of the code. Restructuring or rewriting this code can pay off quickly, reducing problems in production, and significantly reducing the time needed to make changes and test them as you go forward.
Testing is even more important and necessary in maintenance than it is in development. And it’s a major part of maintenance costs. Most maintenance teams rely on developers to test their own changes and fixes by hand to make sure that the change worked and that they didn’t break anything as a side effect. Of course this makes testing expensive and inefficient and it limits how much work the team can do. In order to move fast, to make incremental changes and refactoring safe, the team needs a better safety net, by automating unit and functional tests and acceptance tests.
It can take a long time to put in test scaffolding and tools and write a good set of automated tests. But even a simple test framework and a small set of core fat tests can pay back quickly in maintenance, because a lot changes (and bugs) tend to be concentrated in the same parts of the code – the same features, framework code and APIs get changed over and over again, and will need to be tested over and over again. You can start small, get these tests running quickly and reliably and get the team to rely on them, fill in the gaps with manual tests and reviews, and then fill out the tests over time. Once you have a basic test framework in place, developers can take advantage of TFD/TDD especially for bug fixes – the fix has to be tested anyways, so why not write the test first and make sure that you fixed what you were supposed to?
To get Continuous Testing to work, you need a Continuous Integration environment. Understanding, automating and streamlining the build and getting the CI server up and running and wiring in tests and static analysis checks and reporting can take a lot of work in an enterprise system, especially if you have to deal with multiple languages and platforms and dependencies between systems. But doing this work is also the foundation for simplifying release and deployment – frequent short releases means that release and deployment has to be made as simple as possible.
Onsite Customer / Product Owner
Working closely with the customer to make sure that the team is delivering what the customer needs when the customer needs it is as important in maintenance as it is in developing a new system. Getting a talented and committed Customer engaged is hard enough on a high-profile development project – but it’s even harder in maintenance. You may end up with too many customers with conflicting agendas competing for the team’s attention, or nobody who has the time or ability to answer questions and make decisions. Maintenance teams often have to make compromises and help fill in this role on their own.
But it doesn’t all fit….
Kilner’s main point of concern isn’t really with Agile methods in maintenance. It’s with incremental design and development in general – that some work doesn’t fit nicely into short time boxes. Short iterations might work ok for bug fixes and small enhancements (they do), but sometimes you need to make bigger changes that have lots of dependencies. He argues that while Agile teams building new systems can stub out incomplete work and keep going in steps, maintenance teams have to get everything working all at once – it’s all or nothing.
It’s not easy to see how big changes can be broken down into small steps that can be fit into short time boxes. I agree that this is harder in maintenance because you have to be more careful in understanding and untangling dependencies before you make changes, and you have to be more careful not to break things. The code and design will sometimes fight the kinds of changes that you need to make, because you need to do something that was never anticipated in the original design, or whatever design there was has been lost over time and any kind of change is hard to make.
It’s not easy – but teams solve these problems all the time. You can use tools to figure out how much of a dependency mess you have in the code and what kind of changes you need to make to get out of this mess. If you are going to spend “weeks, months, or even years” to make changes to a system, then it makes sense to take time upfront to understand and break down build dependencies and isolate run-time dependencies, and put in test scaffolding and tests to protect the team from making mistakes as they go along. All of this can be done in time boxed steps. Just because you are following time boxes and simple, incremental design doesn’t mean that you start making changes without thinking them through.
Read Working With Legacy Code – Michael Feathers walks through how to deal with these problems in detail, in both object oriented and procedural languages. What to do if it takes forever to make a change. How to break dependencies. How to find interception points and pinch points. How to find structure in the design and the code. What tests to write and how to get automated tests to work.
Changing data in a production system, especially data shared with other systems, isn’t easy either. You need to plan out API changes and data structure changes as carefully as possible, but you can still make data and database changes in small, structured steps.
To make code changes in steps you can use Branching by Abstraction where it makes sense (like making back-end changes) and you can protect customers from changes through Feature Flags and Dark Launching like Facebook and Twitter and Flickr do to continuously roll out changes – although you need to be careful, because if taken too far these practices can make code more fragile and harder to work with.
Agile development teams follow incremental design and development to help them discover an optimal solution through trial-and-error. Maintenance teams work this way for a different reason – to manage technical risks by breaking big changes down and making small bets instead of big ones.
Working this way means that you have to put in scaffolding (and remember to take it out afterwards) and plan out intermediate steps and review and test everything as you make each change. Sometimes it might feel like you are running in place, that it is taking longer and costing more. But getting there in small steps is much safer, and gives you a lot more control.
Teams working on large legacy code bases and old technology platforms will have a harder time taking on these ideas and succeeding with them. But that doesn’t mean that they won’t work. Yes, you can be Agile in maintenance.