Building Real Software: iterative development


Wednesday, May 23, 2012

The pursuit of protection: How much testing is “enough”?

I’m definitely not a testing expert. I’m a manager who wants to know when the software that we are building is finished, safe and ready to ship.

Large-scale enterprise systems – the kinds of systems that I work on – are essentially hard to test. They have lots of rules and exceptions and lots of interfaces and lots of customization for different customers and partners, and lots of operational dependencies, and they deal with lots of data. We can’t test everything – there are tens of thousands or hundreds of thousands of different scenarios and different paths to follow.

This gets both easier and harder if you are working with Agile methods, building and releasing small pieces of work at a time. Most changes or new features are easy enough to understand and test by themselves. The bigger problem is in understanding the impact of each change on the rest of the system that has already been built: what side-effects the change may have, and what might have broken. This gets harder if a change is introduced in small steps over several releases, so that some parts are incomplete or even invisible to the test team for a while.

People who write flight control software or medical device controllers need to do exhaustive testing, but the rest of us can’t afford to, and there are clearly diminishing returns. So if you can’t or aren’t going to “test everything”, how do you know when you’re done testing? One answer is that you’re done testing when you run out of time to do any more testing. But that’s not good enough.

You’re done testing when your testers say they’re done

Another answer is that you’re done when the test team says they’re done. When all of the static analysis findings have been reviewed and corrected. When all of the automated tests pass. When the testers have made sure that all features that are supposed to be complete were completed and secure, finished their test checklists, made sure that the software is usable and checked for fit-and-finish, tested for performance and stability, made sure that the deployment and rollback steps work, and completed enough exploratory testing that they’ve stopped finding interesting bugs, and the bugs that they have found (the important ones at least) have all been fixed and re-checked.

This of course assumes that they tested the right things – that they understood the business requirements and priorities, and found most of the interesting and important bugs in the system. But how do you know that they’ve done a good job?

What a lot of testers do is black box testing, which falls into two different forms:

- Scripted functional and acceptance testing, manual and automated – how good the testing is depends on how complete and clear the requirements are (which is a challenge for small Agile teams working through informal requirements that keep changing), and how much time the testers have to plan out and run their tests.
- Unscripted behavioural or exploratory manual testing – depends on the experience and skill of the tester, and on their familiarity with the system and their understanding of the domain.

With black box testing, you have to trust in the capabilities and care of the people doing the testing work. Even if they have taken a structured, methodical approach to defining and running tests they are still going to miss something. The question is – what, and how much?

Using Code Coverage

To know when you’ve tested enough, you have to stop testing in the dark. You have to look inside the code, using white box structural testing techniques to understand what code has been tested, and then look closer at the code to figure out how to test the code that wasn’t.

A study at Microsoft over 5 years involving thousands of testers found that with scripted, structured functional testing, testers could cover as much as 83% of the code. With exploratory testing they could raise this a few percentage points, to as high as 86%. Then, by looking at code coverage and walking through what was tested and what wasn’t, they were able to come up with tests that brought coverage up above 90%.

Using code coverage this way means instrumenting the code under test, then looking into the code to review and improve the tests that have already been written and to figure out what new tests to write. It requires testers and developers to work together even more closely.

How much code coverage is enough?

If you’re measuring code coverage, the question that comes up is how much coverage is enough?

“What percentage of your code should be covered before you can ship? 100%? 90%? 80%? You will find a lot of different numbers in the literature and I have yet to find solid evidence showing that any given number is better than another.” (Cedric Beust, Breaking Away from the Unit Test Group Think)

In Continuous Delivery, Jez Humble and David Farley set 80% coverage as a target for each of automated unit testing, functional testing and acceptance testing. Based on their experience, this should provide comprehensive testing.

Some TDD and XP advocates argue for 100% automated test coverage, which is a target to aim for if you are starting off from scratch and want to maintain high standards, especially for smaller systems. But 100% is unnecessarily expensive, and it’s a hopeless target for a large legacy system that doesn’t have extensive automated tests already in place. You’ll reach a point of diminishing returns as you continue to add tests, where each test costs more to write and finds less. The more tests that you write, the more tests will be bad tests – duplicate tests that seem to test different things but don’t, tests that don’t test anything important (even if they help make the code coverage numbers look a little better), tests that don’t work but look like they do. All of these tests, good or bad, need to run continuously, need to be maintained, and get in the way of making changes. The costs keep going up. How many shops can afford to achieve this level of coverage, and sustain it over a long period of time, or even want to?

Making Code Coverage Work for You

On the team that I manage now, we rely on automated unit and functional testing at around 70% (statement) coverage – higher in high-risk areas, lower in others. Obviously, automated coverage is also higher in areas that are easier to test with automated tools. We hit this level of coverage more than 3 years ago and it has held steady since then. There hasn’t been a good reason to push it higher – it gives us enough of a safety net for developers to make most changes safely, and it frees the test team up to focus on risks and exceptions.
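As a rough illustration of "higher in high-risk areas, lower in others", here is a minimal Python sketch that checks statement coverage per area against a risk-based target. Only the 70% figure comes from the paragraph above; the 85% and 50% targets, the area names and the statement counts are hypothetical, chosen just to show the idea.

```python
# Target statement coverage per risk level. Only 0.70 comes from the post;
# the "high" and "low" targets are assumptions for illustration.
THRESHOLDS = {"high": 0.85, "normal": 0.70, "low": 0.50}


def coverage_gaps(areas):
    """Return the areas whose statement coverage is below their risk target.

    `areas` maps an area name to (risk_level, covered_statements, total_statements).
    The result maps each failing area to (actual_ratio, target_ratio).
    """
    gaps = {}
    for name, (risk, covered, total) in areas.items():
        ratio = covered / total if total else 1.0  # empty area: nothing to cover
        target = THRESHOLDS[risk]
        if ratio < target:
            gaps[name] = (ratio, target)
    return gaps


# Hypothetical areas and counts, as they might come out of a coverage report.
areas = {
    "order_matching": ("high", 900, 1000),
    "settlement": ("high", 800, 1000),
    "reporting": ("low", 300, 500),
}

for name, (ratio, target) in coverage_gaps(areas).items():
    print(f"{name}: {ratio:.0%} covered, target {target:.0%}")
```

A report like this only says where to look next. It says nothing about whether the tests that produce the coverage are any good.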

Of course with the other kinds of testing that we do, manual functional testing and exploratory testing and multi-player war games, semi-automated integration testing and performance testing, and operational system testing, coverage in the end is much higher than 70% for each release. We’ve instrumented some of our manual testing work, to see what code we are covering in our smoke tests and integration testing and exploratory testing work, but it hasn’t been practical so far to instrument all of the testing to get a final sum in a release.

Defect Density, Defect Seeding and Capture/Recapture – Does anybody really do this?

In an article in IEEE Software Best Practices from 1997, Steve McConnell talks about using statistical defect data to understand when you have done enough testing.

The first approach is to use Defect Density data (# of defects per KLOC or some other common definition of size) from previous releases of the system, or even other systems that you have worked on. Add up how many defects were found in testing (assuming that you track this data – some Lean/Agile teams don’t, we do) and how many were found in production. Then measure the size of the change set for each of these releases to calculate the defect density. Do the same for the release that you are working on now, and compare the results. Assuming that your development approach hasn’t changed significantly, you should be able to predict how many more bugs still need to be found and fixed. The more data, of course, the better your predictions.
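The defect density calculation can be sketched in a few lines of Python. This is a simplified reading of the approach described above, not McConnell's exact procedure: it pools past releases into one historical density and applies it to the size of the current change set. The release names and all of the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Release:
    name: str
    kloc_changed: float   # size of the change set, in thousands of lines
    defects_in_test: int  # defects found in testing, before release
    defects_in_prod: int  # defects found later, in production


def expected_remaining(history, kloc_changed, defects_found_so_far):
    """Estimate how many defects are still to be found in the current release."""
    total_defects = sum(r.defects_in_test + r.defects_in_prod for r in history)
    total_kloc = sum(r.kloc_changed for r in history)
    historical_density = total_defects / total_kloc  # defects per KLOC
    expected_total = historical_density * kloc_changed
    # Never report a negative number of remaining defects.
    return max(0.0, expected_total - defects_found_so_far)


# Hypothetical history: two past releases.
history = [
    Release("r1", kloc_changed=10.0, defects_in_test=40, defects_in_prod=10),
    Release("r2", kloc_changed=5.0, defects_in_test=22, defects_in_prod=3),
]

# Current release: 8 KLOC changed, 30 defects found so far.
print(expected_remaining(history, kloc_changed=8.0, defects_found_so_far=30))
```

With these numbers the historical density is 75 defects over 15 KLOC, or 5 per KLOC, so an 8 KLOC change set would be expected to contain about 40 defects, leaving roughly 10 still to find.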

Defect Seeding, also known as bebugging, is where someone inserts bugs on purpose and then you see how many of these bugs are found by other people in reviews and testing. The percentage of the known [seeded] bugs not found gives an indication of the real bugs that remain. Apparently some teams at IBM, HP and Motorola have used Defect Seeding, and it must come up a lot in interviews for software testing labs (Google “What is Defect Seeding?”), but it doesn’t look like a practical or safe way to estimate test coverage. First, you need to know that you’ve seeded the “right” kind of bugs, across enough of the code to be representative – you have to be good at making bugs on purpose, which isn’t as easy as it sounds. If you do a Mickey Mouse job of seeding the defects and make them too easy to find, you will get a false sense of confidence in your reviews and testing – if the team finds most or all of the seeded bugs, that doesn’t mean that they’ve found most or all of the real bugs. Bugs tend to be simple and obvious, or subtle and hard to find, and bugs tend to cluster in code that was badly designed or badly written, so the seeded bugs need to somehow represent this. And I don’t like the idea of putting bugs into code on purpose. As McConnell points out, you have to be careful in removing the seeded bugs and then do still more testing to make sure that you didn’t break anything.
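For reference, the usual proportional estimate behind defect seeding looks like this in Python. It is a minimal sketch that assumes seeded bugs are exactly as easy or as hard to find as real ones, which is the very assumption questioned above. The function name and the sample numbers are illustrative only.

```python
def estimate_from_seeding(seeded_planted, seeded_found, real_found):
    """Estimate total and remaining real defects from seeding results.

    Assumes the fraction of seeded bugs found equals the fraction of
    real bugs found. Returns (estimated_total_real, estimated_remaining_real).
    """
    if seeded_found <= 0:
        raise ValueError("no seeded defects found, cannot estimate")
    if seeded_found > seeded_planted:
        raise ValueError("found more seeded defects than were planted")
    estimated_total = real_found * seeded_planted / seeded_found
    return estimated_total, estimated_total - real_found


# Made-up numbers: 20 bugs seeded, 15 of them found, along with 30 real bugs.
total, remaining = estimate_from_seeding(20, 15, 30)
print(total, remaining)
```

Here the team found 75% of the seeded bugs, so the 30 real bugs found are taken to be 75% of the real total: about 40 real bugs, with about 10 still hiding.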

And finally, there is Capture/Re-Capture, an approach used to estimate wildlife populations (catch and tag fish in a lake, then see how many of the tagged fish you catch again later), which Watts Humphrey introduced to software engineering as part of TSP to estimate remaining defects from the results of testing or reviews. According to Michael Howard, this approach is sometimes used at Microsoft for security code reviews, so let’s explore this context. You have two reviewers. Both review the same code for the same kinds of problems. Add up the number of problems found by the first reviewer (A), the number found by the second reviewer (B), and separately count the common problems that both reviewers found, where they overlap (C). The total number of estimated defects: A*B/C. The total number of defects found: A+B-C. The total number of defects remaining: A*B/C – (A+B-C).

Using Michael Howard’s example, if Reviewer A found 10 problems, and Reviewer B found 12 problems, and 4 of these problems were found by both reviewers in common, the total number of estimated defects is 10*12/4=30. The total number of defects found so far: 18. So there are 12 more defects still to be found.
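The same arithmetic as a small Python function, for anyone who wants to plug in their own review results. This is only a sketch of the A*B/C estimate described above; the function and parameter names are illustrative.

```python
def capture_recapture(found_by_a, found_by_b, found_by_both):
    """Estimate defects from two independent reviews of the same code.

    Returns (estimated_total, found_so_far, estimated_remaining).
    """
    if found_by_both <= 0:
        # With no overlap, A*B/C is undefined and no estimate is possible.
        raise ValueError("reviewers found no problems in common")
    estimated_total = found_by_a * found_by_b / found_by_both
    found_so_far = found_by_a + found_by_b - found_by_both
    return estimated_total, found_so_far, estimated_total - found_so_far


# The example above: A found 10, B found 12, 4 in common.
print(capture_recapture(10, 12, 4))
```

Note that the estimate blows up as the overlap shrinks: two reviewers who find almost nothing in common imply a very large number of undiscovered problems, and no overlap at all gives no estimate.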

I’m not a statistician either, so this seems like magic to me, and to others. But like the other statistical techniques, I don’t see it scaling down effectively. You need enough people doing enough work over enough time to get useful stats. It works better for large teams working in Waterfall-style, with a long test-and-fix cycle before release. With a small number of people working in small, incremental batches, you get too much variability – a good reviewer or tester could find most or all of the problems that the other reviewers or testers found. But this doesn’t mean that you’ve found all of the bugs in the system.

Your testing is good enough until a problem shows that it is not good enough

In the end, as Martin Fowler points out, you won’t really know if your testing was good enough until you see what happens in production:

“The reason, of course, why people focus on coverage numbers is because they want to know if they are testing enough. Certainly low coverage numbers, say below half, are a sign of trouble. But high numbers don't necessarily mean much, and lead to ‘ignorance-promoting dashboards’. Sufficiency of testing is a much more complicated attribute than coverage can answer. I would say you are doing enough testing if the following is true:

- You rarely get bugs that escape into production, and

- You are rarely hesitant to change some code for fear it will cause production bugs.”

Test everything that you can afford to. Release the code. When problems happen in production, fix them, then use Root Cause Analysis to find out why they happened and to figure out how you’re going to prevent problems in the future, how to improve the code and how to improve the way you write it and how you test it. Keep learning and keep going.

Sunday, September 27, 2009

Risk Management - You Don't Have to Waltz with Bears

I recently finished reading Waltzing with Bears: Managing Risk on Software Projects by Tom DeMarco and Tim Lister, both recognized experts in software development and risk management. The material in this book covers much of the same territory as courses that I attended several years ago in software project management and risk management presented by these authors through The Atlantic Systems Guild.

This work is based on their experience as consultants and as expert witnesses in contract disputes and litigation over software project failures. The authors make a strong case that effective risk management is essential to the success of any software project worth doing; that you must be prepared to face failure and deal with uncertainty; that you must actively manage risks, by containing risks through schedule and budget buffers, or proactively mitigating risks by taking steps to reduce the probability and/or impact of a problem; that you must consider alternatives for any critical activities or work items; and that managing for success, attempting to evade risks, is the path to failure.

Waltzing with Bears focuses on the kinds of problems faced by (and effectively created by) large waterfall projects: trying to commit to scope and schedule and cost up front when there isn’t enough information to do so; and trying to account for and manage the unknown and unaccountable. It’s an almost hopeless situation, but the authors provide ideas and disciplines and tools that at least offer a better chance at success.

What’s necessary is to change the rules of the game, to consider other ways of building software.

In an earlier post, I explored how risk management can and should be burned into the way that you develop software; how schedule and scope and quality risks and other risks can be managed through the development lifecycle you choose and the engineering practices that your team follows. Johanna Rothman, in a paper titled “What Lifecycle: Selecting the Right Model for your Project” explores some of the same ideas, how to manage schedule risks and other risks through lifecycle models, in particular incremental and iterative development approaches.

It is clear to me that following incremental, iterative, timeboxed development, as in Extreme Programming and Scrum, will effectively mitigate many of the common risks and issues that concern the authors of Waltzing with Bears. To some extent, the authors agree, when they conclude that

“The best bang-per-buck risk mitigation strategy we know is incremental delivery” (by which they mean Staged Delivery), “development of a full or nearly full design, and then the implementation of that design in subsets, where each successive subset incorporates the ones that preceded it”.

While Scrum (interestingly) does not explicitly address risk management, it does mitigate scope, schedule, quality, customer and personnel risks through its driving principles and management practices: incremental timeboxed delivery (sprints), close collaboration within a self-managing team and with the customer, managing the backlog (scope) together with the customer (Product Owner), and daily standup meetings and retrospectives which allow the team to continuously adjust to issues and changes.

Extreme Programming (XP) recognizes and confronts risk directly and fundamentally – chapter 1 of Extreme Programming Explained: Embrace Change begins:

“The basic problem of software development is risk”.

In The Case for XP, Chris Morris explains that XP is resilient to risk: it inherently accepts change and uncertainty, rather than attempting to anticipate risk and to predict and manage dangers up front. His argument builds on a risk management model developed by political scientist Aaron Wildavsky.

XP addresses risk through:

- short iterations with fine-grained feedback

- keeping the customer close

- test-driven development to maintain a quality baseline

- refactoring and pair programming to ensure code quality

- continuous integration

- simple design

While you can manage most types of risks effectively through your SDLC, especially by following incremental, iterative techniques and disciplined engineering practices, there are two general classes of risk that require active and continuous risk discovery and explicit risk management, using tools and ideas such as the ones detailed in Waltzing with Bears, Steve McConnell’s Top 10 Risk List, and the formal methods and techniques taught by the Project Management Institute (this is where my PMP comes in handy):

1. Project risks outside of your team’s work in software development, but which directly impact your success. These include: sponsor and stakeholder issues and other political risks, larger business issues outside of your control, regulatory changes, reliance on delivery from partners and sub-contractors, implementation and integration with partners and customers.

2. Technical risks in the platform, architecture or design. This is especially important if you are building high-reliability enterprise systems such as telco or banking systems, large e-commerce sites, or financial trading systems. Some lifecycle and SDLC factors help to mitigate technical risks, such as prioritizing work that is technically difficult in XP, or using exploratory prototyping not only for customer feedback but for technical proof of concept work. But to ensure that your product works in the real world, the team constantly needs to consider technical risk: difficult problems to solve; fragile or complex parts of the system; areas where you are pushing beyond your technical knowledge and experience. Using this information you can determine where to focus your attention in technical reviews and testing; what to try early and prove out; what to let soak; what you should have your best people work on.

If you are building software incrementally and carefully, you won’t have to “waltz with bears”, but you still need to continuously look for, and actively manage, risks inside and outside of the work that your team is doing.

Wednesday, June 17, 2009

How long can this go on?

Our team delivers software iteratively and incrementally, and over the past 3 years we have experimented with longer (1-2 months) and shorter (1-2 week) iterations, adjusting to circumstances, looking for the proper balance between cost and control.

There are obvious costs in managing an iteration: startup activities (planning, prioritization, kickoff, securing the team's commitment to the goals of the release), technical overheads like source code and build management (branching and merging and associated controls), status reporting to stakeholders, end-to-end system and integration testing, and closure activities (retrospectives, resetting). We don’t just deliver “ship-quality” software at the end of an iteration: in almost every case we go all the way to releasing the code to production, so our costs also include packaging, change control, release management, security and operations reviews, documentation updates and release notes and training, certifications with partners, data conversion, rollback testing, and pre- and post-implementation operations support. Yep, that’s a lot of work.

All of these costs are balanced against control: our ability to manage and contain risks to the project, to the product, and to the organization. I explored how to manage risks through iterative, incremental development in an earlier post on risk management.

We’ve found that if an iteration is too long (a month or more), it is hard to defend the team from changes to priorities, to prevent new requirements from coming in and disrupting the team’s focus. And in a longer cycle, there are too many changes and fixes that need to be reviewed and tested, increasing the chance of mistakes or oversights or regressions.

Shorter releases are easier to manage because, of course, they are necessarily smaller. We can manage the pressure from the business-side for changes because of the fast delivery cycle (except for emergency hot fixes, we are usually able to convince our product owner, and ourselves, to wait for the next sprint since it is only a couple of weeks away) and it is easier for everyone to get their heads around what was changed in a release and how to verify it. And shorter cycles keep us closer to our customers, not only giving us faster feedback, but demonstrating constant value. I like to think of it as a “value pipeline”, continuously streaming business value to customers.

One of my favorite books on software project management, Johanna Rothman’s Manage It!, recommends making increments smaller to get faster feedback – feedback not just on the product, but on how you build it and how you can improve. The smaller the iteration, the easier to look at it from beginning-to-end and see where time is wasted, what works, what doesn’t, where time is being spent that isn’t expected.

“Shorter timeboxes will make the problems more obvious so you can solve them.”

Ms. Rothman recommends using the “Divide-by-Two Approach to Reduce Iteration Size”: if the iterations aren’t succeeding, divide the length in half, so 6 weeks becomes 3 weeks, and so on. Smaller iterations provide feedback - longer ones mask the problems.

Ms. Rothman also says that it is difficult to establish a rhythm for the team if iterations are too long. In “Selecting the Right Iteration Length for Your Software Development Process”, Mike Cohn of Mountain Goat Software examines the importance of establishing a rhythm in incremental development. He talks about the need for a sense of urgency: if an iteration is too long, it takes too much time for the team to “warm up” and take things seriously. Of course, this needs to be balanced against keeping the team in a constant state of emergency, and burning everyone out.

Some of the other factors that Mr. Cohn finds important in choosing an iteration length:
- how long can you go without introducing change – avoiding requirements churn during an iteration.
- if cycles are too short (for example, a week) small issues, like a key team member coming down with a cold, can throw the team’s rhythm off and impact delivery.

All of this supports our experience: shorter (but not too-short) cycles help establish a rhythm and build the team’s focus and commitment, constantly driving to delivering customer value. And shorter cycles help manage change and risk.

Now we are experimenting with an aggressive, fast-tracked delivery model: a 3-week end-to-end cycle, with software delivered to production every 2 weeks. The team starts work on designing and building the next release while the current release is in integration, packaging and rollout, overlapping development and release activities. Fast-tracking is difficult to manage, and can add risk if not done properly. But it does allow us to respond quickly to changing business demands and priorities, while giving us time for an intensive but efficient testing and release management process.
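To make the overlap concrete, here is a small Python sketch of one reading of this model: each release takes 3 weeks end to end, and a new release starts every 2 weeks, so the last week of one release overlaps the first week of the next. The start date and the number of releases are arbitrary, and the function name is illustrative.

```python
from datetime import date, timedelta


def release_schedule(first_start, releases, cycle_weeks=3, cadence_weeks=2):
    """Return (release_number, start_date, ship_date) for each release.

    A new release starts every `cadence_weeks`; each one ships
    `cycle_weeks` after it starts.
    """
    schedule = []
    for n in range(releases):
        start = first_start + timedelta(weeks=n * cadence_weeks)
        ship = start + timedelta(weeks=cycle_weeks)
        schedule.append((n + 1, start, ship))
    return schedule


# Arbitrary start date, three releases.
for number, start, ship in release_schedule(date(2009, 6, 1), 3):
    print(f"release {number}: start {start}, ship {ship}")
```

In this reading, release 2 starts one week before release 1 ships, which is the week where development of the next release overlaps with integration, packaging and rollout of the current one.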

We'll review how this approach works over the next few months and change it as necessary, but we intend to continue with short increments. However, I am concerned about the longer-term risks, the potential future downsides to our rapid delivery model.

In The Decline and Fall of Agile, James Shore argues that rapid cycling cuts up-front design short:

“Up-front design doesn't work when you're using short cycles, and Scrum doesn't provide a replacement. Without continuous, incremental design, Scrum teams quickly dig themselves a gigantic hole of technical debt. Two or three years later, I get a call--or one of my colleagues does. "Changes take too long and cost too much!" I hear. "Teach us about test-driven development, or pairing, or acceptance testing!" By that time, fixing the real problems requires paying back a lot of technical debt, and could take years.”

While Mr. Shore is specifically concerned about loose implementations of Scrum, and its lack of engineering practices compared with other approaches like XP (see also Martin Fowler of ThoughtWorks on the risks of incremental development without strong engineering discipline), the problem is a general one for teams working quickly, in short iterations: even with good engineering discipline, rapid cycling does not leave a lot of time for architecture, design and design reviews, test planning, security reviews... all of those quality gating activities that waterfall methods support. This is a challenge for secure software development, as there is little guidance available on effectively scaling software security SDLC practices to incremental, agile development methods, something that I will explore more later.

Trying to account for architecture and platform decisions and tooling and training in an upfront “iteration zero” isn’t enough, especially if your project is still going strong after 2 or 3 years. What I worry about (and I worry about a lot of things) is that, moving rapidly from sprint to sprint, the team cannot stop and look at the big picture, to properly re-assess architecture and platform technology decisions made earlier. Instead all the team has a chance to do is make incremental, smaller-scale improvements (tighten up the code here, clean up an interface there, upgrade some of the technology stack), which may leave fundamental questions unanswered, trading off short-term goals (deliver value, minimize the cost and risk of change) with longer-term costs and uncertainties.

One of the other factors that could affect quality in the longer term is the pressure on the team to deliver in a timebox. In Technical Debt: Warning Signs, Catherine Powell raises the concern that developers committing to a date may put schedule ahead of quality:

“Once you've committed to a release date and a feature set, it can be hard to change. And to change it because you really want to put a button on one more screen? Not likely. The "we have to ship on X because X is the date" mentality is very common (and rightly so - you can't be late forever because you're chasing perfection). However, to meet that date you're likely to cut corners, especially if you've underestimated how much time the feature really takes, or how much other stuff is going on.”

Finally, I am concerned that rapid cycling does not give the team sufficient opportunities to pause, to take a breath, to properly reset. If they are constantly moving heads down from one iteration to another, do team members really have a chance to reflect, understand and learn? One of the reasons that I maintain this blog is exactly for this: to explore problems and questions that my team and I face; to research, to look far back and far ahead, without having to focus on the goals and priorities of the next sprint.

These concerns, and others, are explored in Traps & Pitfalls of Agile Development - a Non-Contrarian View:

"Agile teams may be prone to rapid accumulation of technical debt. The accrual of technical debt can occur in a variety of ways. In a rush to completion, Iterative development is left out. Pieces get built (Incremental development) but rarely reworked. Design gets left out, possibly as a backlash to BDUF. In a rush to get started building software, sometimes preliminary design work is insufficient. Possibly too much hope is placed in refactoring. Refactoring gets left out. Refactoring is another form of rework that often is ignored in the rush to complete. In summary, the team may move too fast for it's own good."

Our team’s challenge is not just to deliver software quickly: like other teams that follow these practices, we’ve proven that we can do that. Our challenge is to deliver value consistently, at an extremely high level of quality and reliability, on a continual and sustainable basis. Each design and implementation decision has to be made carefully: if your customer's business depends on you making changes quickly and perfectly, without impacting their day-to-day operations, how much risk can you afford to take on to make changes today so that the system may be simpler and easier to change tomorrow, especially in today's business environment? It is a high stakes game we're playing. I recognize that this is a problem of debt management, and I'll explore the problems of technical debt and design debt more later.

The practices that we have followed have worked well for us so far. But is there a point where rapid development cycles, even when following good engineering practices, provide diminishing returns? When does the accumulation of design decisions made under time pressure, and conscious decisions to minimize the risk of change, add up to bigger problems? Does developing software in short cycles, with a short decision-making horizon, necessarily result in long-term debt?

Monday, March 23, 2009

What's Wrong with Sucking Less?

At the Agile 2008 conference in Toronto, David Douglas & Robin Dymond discussed their concerns that the majority of companies who adopt agile practices (and by “agile”, effectively meaning Scrum) were falling short of complete adoption. The companies that they were working with were satisfied with 1.5-2x performance improvements in quality and time-to-market gained by effectively cherry-picking from the key agile, incremental software development practices. The authors were concerned that most, if not all, companies adopting agile were content simply “to suck less” rather than transforming their businesses.

This was further explored in December of last year in a StickyMinds column “Little Scrum Pigs and the Big Bad Wolf” by Michelle Sliger, who expressed her concern that

“Indeed, many companies are refusing to view agile as anything other than a set of engineering practices.”

That surprised me. I thought that this was, in fact, the point of agile software development: for people to adopt more effective software engineering practices.

What is especially confusing is that Scrum, in particular, is really a project management approach and provides very little in the way of software engineering practices. It is intuitive and obvious and easy to implement, maybe too easy, which is why most agile projects today are based on Scrum: its strength, and weakness, is that it provides an effective framework for organizing and managing software projects by breaking them down into time boxed increments, but does not force the team to adopt specific engineering practices and disciplines, unlike other methods, and especially XP.

Martin Fowler of ThoughtWorks, one of the leading thinkers in the agile (in this case, XP, however, rather than Scrum) community, raises the concern that this lack of software engineering discipline can lead to Scrum teams building a lot of sloppy software, however quickly.

Back to the “Three Little Pigs” – the author goes on to express her dismay that

“They have not adopted the value system that is the underlying infrastructure of all agile approaches.”

and that companies who are simply interested in adopting good practices from Scrum and XP, but who don’t buy into the complete philosophy and value set, therefore lack vision and commitment, and are at risk of failure.

I find the argument to be both elitist and dogmatic, and awfully unclear. The author suggests that there is something hidden in the Agile Manifesto, that by surrendering to this mystery you will find the one true agile path to success in software development, and that anything less is conceding defeat, or at the least condemning yourself and your organization to mediocrity.

But what is this mystery, this ineffable something or somethings that organizations refuse to, or are somehow unable to, accept? Perhaps it is delivering working software incrementally, following a timeboxed approach – no, this can’t be it, this is a well understood engineering practice, one of those mere disciplines that the author suggests is insufficient for success. Maybe it is “continuous attention to technical excellence and good design”. That can’t be it: I don’t see why any organization would not accept this as axiomatic when building software. Or is it the emphasis on simplicity? Or that the team should have a good working environment and the trust and support of management? Or that developers and customers should work together? Or maybe that we need to create self-organizing teams? Just what is it that these companies, who are “teetering at the edge between mediocrity and high performance”, are failing to do?

Douglas and Dymond concede that there are too few real agile success stories: they point to Nokia and BMC Software and PatientKeeper of course; a small number of companies (very small, after 10 or so years of evangelism) who have had noted success in adopting Scrum in a fundamental way. But I would argue that there are a lot of success stories – all of those companies who are “sucking less”, who have started on a path to building better and better software: every day, working hard to suck less and still less and less and so on.

While there is a negative connotation to the term, I don't see what is wrong with "sucking less". With being practical, goal-focused, and incremental in improving software development practices. With delivering good, working software in time-boxed, iterative releases. With building a stronger development team, and a better development environment. With following good engineering practices and management methods, and stopping bad ones. With constantly reviewing your failures and successes and finding new ways to improve. I don’t see this as building a “house of straw” – I see this as what we all have to do to succeed: constantly, ruthlessly get better and better together.

Sunday, August 3, 2008

Safety Net

When I was younger, lighter and more of an adrenaline junkie, I took up rock climbing. It was an excellent way to push myself both mentally and physically, and it forced me to focus absolutely, completely on the moment. A successful day of climbing required thorough preparation, training and conditioning, the right equipment, a good sense of balance and timing (which often eluded me), and smooth teamwork with my climbing partners. One of the first things I learned was that unless you were a maniac or a supremely gifted superhero, you had to put in protection as you climbed, to ensure your safety and the safety of your climbing partners.

Building real software, software that is good and useful and meant to last, is equally challenging. I am not talking about one-off custom development work or small web sites, but real software products that companies run their businesses on. There is no one best practice or methodology, no perfect language or cool tool that will ensure that you will write good code. Instead, real software development demands discipline and focus and balance, and an intelligent defense-in-depth approach: consistently following good practices and avoiding stupid ones, hiring the best possible people and making sure they are well trained and have good tools, and carefully and conscientiously managing the team and your projects.

Having said this, if you need one place to start, if you had to choose one practice that could make the difference between a chance at success and the almost complete certainty of failure, start with building yourself a safety net: a strong regression test capability that you can run quickly and automatically before you release any new code. Without this safety net, every change, every fix that you make is dangerous.

I am surprised to find software product development shops with large code bases that do not have automated regression testing in place, and rely on black box testers (or their customers!) to find problems. Relying on your test team to manually catch regressions is error-prone - sooner or later a tester will run out of time, make a mistake or miss (or misunderstand) a test - and awfully expensive - it takes too many people too long to test even a moderately complex system. Regression testing is important, but you will get a lot more value from the test team if you free them up to do things like deep exploratory testing, destructive testing, stress testing and system testing, simulations and war games, and reviews and pair testing with the developers.

If you don’t have an automated test safety net today, then start building one tomorrow. Find someone in your team who understands automated unit testing, or bring a consultant in. Start writing unit tests for new code, and add unit tests as you change existing code. Run the test suite for each build as part of your continuous integration environment (ok: if you don’t have CI set up already, you will have to do this too).
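To make the starting point concrete, here is a minimal sketch of what a first unit test for a piece of new code might look like, using Python's standard unittest module. The late_fee rule, its rates and the run_suite helper are invented for illustration; they are assumptions, not code from any real system.

```python
import unittest


def late_fee(days_overdue: int, daily_rate: float = 0.5, cap: float = 10.0) -> float:
    """Hypothetical business rule: charge a daily fee, never more than the cap."""
    if days_overdue <= 0:
        return 0.0
    return min(days_overdue * daily_rate, cap)


class LateFeeTest(unittest.TestCase):
    """One test per distinct behaviour of the rule."""

    def test_no_fee_when_not_overdue(self):
        self.assertEqual(late_fee(0), 0.0)

    def test_fee_grows_with_days_overdue(self):
        self.assertEqual(late_fee(4), 2.0)

    def test_fee_is_capped(self):
        self.assertEqual(late_fee(365), 10.0)


def run_suite() -> bool:
    """Run the tests and report success, the way a build step would."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LateFeeTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


print("all tests passed:", run_suite())
```

In a real build the tests would normally live in their own files and be run by the build server on every check-in, for example with a command such as python -m unittest, so that a failing test fails the build.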

Starting from nothing, there’s no point in trying to measure test coverage. So begin by counting the number of tests the team adds each week and track the team’s progress. As bugs are found in production or by the test team, make sure to write tests for the code that needs to be corrected, and spend some time to review whether other tests should be added in the rest of the code base to catch this type of failure. Monitor the trend to ensure that new tests continue to be added, that the team isn't taking a step backwards. Just like building software, take an incremental approach in adopting unit testing.
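Tracking the trend does not need special tooling. As a rough sketch, assuming you record the total number of tests at the end of each week, a few lines are enough to flag the weeks where the team stood still or went backwards. The week labels and counts below are made up, and weeks_without_growth is a hypothetical helper, not part of any tool.

```python
from typing import List, Tuple


def weeks_without_growth(weekly_totals: List[Tuple[str, int]]) -> List[str]:
    """Return the labels of weeks whose total test count did not
    increase compared with the week before."""
    flagged = []
    # Pair each week with the one that follows it.
    for (_, previous), (label, current) in zip(weekly_totals, weekly_totals[1:]):
        if current <= previous:
            flagged.append(label)
    return flagged


# Made-up history: (week label, total number of tests in the suite).
history = [
    ("2008-W27", 120),
    ("2008-W28", 134),
    ("2008-W29", 134),  # nothing added
    ("2008-W30", 129),  # tests were removed or disabled
    ("2008-W31", 150),
]

print(weeks_without_growth(history))
```

A flagged week is a prompt for a conversation, not a verdict: tests may have been removed for good reasons, but the team should know why.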

Once you have a good base of tests in place, use a code coverage tool to identify important areas of code that are not tested. Take a risk-based approach: if a piece of code is not important, don’t bother with writing tests to attain a mandated code coverage % goal. If a critical piece of code or a core business rule is not covered by a test, write some more tests as soon as you can. Then use a mutation testing tool like Jumble or Jester to validate the effectiveness of your test suite.
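The idea behind mutation testing is simple: make a small change to the code and check whether any test fails. If nothing fails, the tests are running that code without really checking it. The sketch below does by hand what a mutation testing tool automates. It only illustrates the idea and is not how Jumble or Jester are invoked; the bulk-order rule and all function names are invented.

```python
from typing import Callable


def is_bulk_order(quantity: int) -> bool:
    """Hypothetical rule: orders of 100 units or more are bulk orders."""
    return quantity >= 100


def is_bulk_order_mutant(quantity: int) -> bool:
    """Mutant: '>=' changed to '>', a typical small mutation."""
    return quantity > 100


def weak_suite(rule: Callable[[int], bool]) -> bool:
    """Executes every line of the rule but never checks the boundary."""
    return rule(500) is True and rule(1) is False


def strong_suite(rule: Callable[[int], bool]) -> bool:
    """Adds the boundary case, which is where the mutant differs."""
    return weak_suite(rule) and rule(100) is True


# True means "all checks passed" for the given version of the rule.
print("weak suite:  ", weak_suite(is_bulk_order), weak_suite(is_bulk_order_mutant))
print("strong suite:", strong_suite(is_bulk_order), strong_suite(is_bulk_order_mutant))
```

The weak suite passes against both the original and the mutant, so the mutant survives even though every line of the rule is covered. The strong suite passes against the original and fails against the mutant, which is what you want from a test on a core business rule.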

Writing tests that do not materially reduce risk simply adds to the cost of building and maintaining the system. Some fundamentalists will disagree and demand 100% coverage (although with the failure of Agitar the hype on this subject has subsided somewhat). A recent post on achieving good ROI on Unit Testing explores how much unit testing is really necessary and valuable. Rather than writing unit tests on unimportant code, you could spend that time reviewing design and code, implementing static analysis tools to catch programming errors, helping with functional and integration testing, or, hey why not, designing and writing more code, which is the point after all. For most projects, achieving 100% code coverage is not practical or even all that useful. But using good judgment to test high-risk areas of the system is.

Whether you work test-first or not, work with developers to write useful unit tests, and review the tests to make sure you don’t end up with a dozen tests that check variations on the same boundary condition or something else equally silly. Writing and maintaining tests is expensive, so make each test count.

While a good set of unit tests is extremely valuable, I don’t agree with some XP extremists who believe that once you have a comprehensive set of unit tests in place you are done with testing. Remember defense-in-depth: unit tests are an important part, but only a part, of what is needed to achieve quality. No one type of test or review is thorough enough to catch every error in a big system.

A good unit testing program requires investment in both the short-term and the long-term. Initial investments are needed to create the necessary infrastructure, train developers on the practice and tools, and of course there’s the actual time for the team to write and review the tests, and to integrate the tests into your build environment. In the longer-term, you will need to work with developers to continually reinforce the value of developer testing, especially as new people join the team, and ensure that the discipline of writing good tests is kept up; and the test suite will need to be constantly updated as code is changed or as new bugs are found.

Designing and writing good unit tests isn’t easy, especially if you are following a test-driven development approach. And not only do you have to sell the development team on the value of unit testing; you also need to convince management and your customers to give you the time and resources necessary to do a proper job. But without a solid testing safety net, changing code is like climbing without protection: sooner or later you or your buddy is going to make a mistake, and somebody’s going to get hurt.