Monday, May 24, 2010

Fast or Secure Software Development - but you can't have both

There is a lot of excitement in the software development community around Lean Software Development: applying Japanese manufacturing principles to software development to reduce waste and eliminate overhead, to streamline production, to get software out to customers and get their feedback as quickly as possible.

Some people are going so far as to eliminate review gates and release overhead as waste: “iterations are muda”. The idea of Continuous Deployment takes this to the extreme: developers push software changes and fixes out immediately to production, with all the good and bad that you would expect from doing this.

Continuous Deployment has been used successfully at IMVU and WordPress and Facebook and other Web 2.0 companies. The CEO of Automattic, the company behind WordPress.com, recently bragged about their success in following Continuous Deployment:
“The other day we passed product release number 25,000 for WordPress.com. That means we’ve averaged about 16 product releases a day, every day for the last four and a half years!”
I am sure that he is not proud of their history of security problems however, which have been widely reported.

And Facebook? You can read about how they use Continuous Deployment practices to push code out to production several times a day. As for their security posture, Facebook has "faced" a series of severe security and privacy problems and continues to run into them, as recently as last week.

I’ve ranted before about the risks that Continuous Deployment forces on customers. Continuous Deployment is based on the rather naïve assumption that if something is broken, if you did something wrong, you will know right away: either through your automated tests or by monitoring the state of production, errors and changes in usage patterns, or from direct customer feedback. If it doesn’t look like it’s working, you roll it back as soon as you can, before the next change is pushed out. It’s all tied to a direct feedback loop.
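To make that loop concrete, here is a rough sketch in Python of what the cycle usually amounts to. This is only an illustration of the idea: the function names, the error-rate threshold and the five-minute monitoring window are invented, not taken from IMVU, WordPress.com or Facebook.

# A sketch of a Continuous Deployment feedback loop: push one small change,
# watch production, roll it back if things look worse. All names and
# thresholds here are hypothetical.
import time

ERROR_RATE_THRESHOLD = 0.01    # e.g. more than 1% of requests failing
MONITORING_WINDOW_SECS = 300   # e.g. watch production for 5 minutes


def deploy(revision, run_smoke_tests, push_to_production, error_rate, rollback):
    """Deploy one change and keep it only if production stays healthy."""
    if not run_smoke_tests(revision):
        return False                     # failed checks never reach production

    push_to_production(revision)

    deadline = time.time() + MONITORING_WINDOW_SECS
    while time.time() < deadline:
        if error_rate() > ERROR_RATE_THRESHOLD:
            rollback(revision)           # undo before the next change ships
            return False
        time.sleep(10)

    # Note: a security hole that passes the tests and doesn't move the error
    # rate sails straight through this loop.
    return True

The whole model depends on problems showing up in that short window, either in the automated checks or in the production numbers you happen to be watching.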

Of course it’s not always that simple. Security problems don’t show up like that, they show up later as successful exploits and attacks and bad press and a damaged brand and upset customers and the kind of mess that Facebook is in again. I can’t believe that the CEO of Facebook appreciates getting this kind of feedback on his company's latest security and privacy problems.

Now, maybe the rules about how to build secure and reliable software don’t, and shouldn’t, all apply to Web 2.0, as Dr. Boaz Gelbord proposes:
“Facebook are not fools of course. You don't build a business that engages every tenth adult on the planet without honing a pretty good sense for which way the wind is blowing. The company realizes that it is under no obligation to provide any real security controls to its users.”
Maybe to be truly open and collaborative, you are obliged to make compromises on security and data integrity and confidentiality. Some of these Web 2.0 sites like Facebook are phenomenally successful, and it seems that most of their customers don’t care that much about security and privacy, and as long as you haven’t been foolish enough to use tools like Facebook to support your business in a major way, maybe that’s fine.

And I also don’t care how a startup manages to get software out the door. If Continuous Deployment helps you get software out faster to your customers, and your customers are willing to help you test and put up with whatever problems they find, if it gives you a higher chance of getting your business launched, then by all means consider giving it a try.

Just keep in mind that some day you may need to grow up and take a serious look at how you build and release software – that the approach that served you well as a startup may not cut it any more.

But let’s not pretend that this approach can be used for online mission-critical or business-critical enterprise or B2B systems, where your system may be hooked up to dozens or hundreds of other systems, where you are managing critical business transactions. Enterprise systems are not a game:
“I understand why people would think that a consumer internet service like IMVU isn't really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a sixteen-year-old girl complaining that your new release ruined their birthday party. That's where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder.”
This is a joke, right?

But seriously, I get concerned when thoughtful people in the development community, people like Kent Beck and Michael Feathers, start to explore Continuous Deployment Immersion and zero-length iterations. These aren’t kids looking to launch a Web 2.0 site; they are leaders who the development community looks to for insight, for what is important and right in building software.

There is a clear risk here of widening the already wide disconnect between the software development community and the security community.

On one side we have Lean and Continuous Deployment evangelists pushing us to get software out faster and cheaper, reducing the batch size, eliminating overhead, optimizing for speed, optimizing the feedback loop.

On the other side we have the security community pleading with us to do more upfront, to be more careful and disciplined and thoughtful, to invest more in training and tools and design and reviews and testing and good engineering, all of which adds to the cost and time of building software.

Our job in software development is to balance these two opposing pressures: to find a way to build software securely and efficiently, to take the good ideas from Lean, and from Continuous Deployment (yes, there are some good ideas there in how to make deployment more automated and streamlined and reliable), and marry them with disciplined secure development and engineering practices. There is an answer to be found, but we need to start working on it together.

Thursday, May 20, 2010

Code quality, refactoring and the risk of change

When you are working on a live, business-critical production system, deciding what work needs to be done, and how to do it, you need to consider different factors:
  1. business value: the demand and urgency for new business features and changes requested by customers or needed to get new customers onboard.

  2. compliance: ensuring that you stay onside of regulatory and security requirements.

  3. operational risk and safety: the risks of injecting bugs into the live system as a side-effect of your change, the likelihood and impact of errors on the stability or usability of the system.

  4. cost: immediate development costs, longer-term maintenance and support costs, operational costs and other downstream costs, and opportunity costs of choosing one work item over another or one design approach over another.

  5. technical: investments needed to upgrade the technology stack, managing the complexity and quality of the design and code.
These factors also come into play in refactoring: deciding how much code to clean up when making a change or fix. The decision of what, and how much, to refactor isn’t simple. It isn’t about a developer’s idea of what is beautiful and their need for personal satisfaction. It isn’t about being forced to compromise between doing the job the right way vs. putting in a hack. It’s much more difficult than that. It’s about balancing technical and operational risks, technical debt, and cost factors, and trading off short-term advantages against longer-term costs and risks.

Let’s look at the longer term considerations first. A live system is going to see a lot of changes. There will be regulatory changes, fixes and new features, upgrades to the technology platform. There will also be false starts and back-tracking as you iterate, and changes in direction, and short-sighted design and implementation decisions made with insufficient time or information. Sometimes you will need to put in a quick fix, or cut-and-paste a solution. You will need to code-in exceptions, and exceptions to the exceptions, especially if you are working on an enterprise system integrated with tens or hundreds of other systems. People will leave the team and new people will join, and everyone’s understanding of the domain and the design and the technology will change over time. People will learn newer and better ways of solving problems and how to use the power of the language and their other tools; they will learn more about how the business works; or they might forget or misunderstand the intentions of the design and wander off course.

These factors, the accumulation of decisions made over time, will impact the quality, the complexity, and the clarity of the system design and code. This is system entropy, as described by Fred Brooks back in The Mythical Man-Month:
“All repairs tend to destroy the structure, to increase the entropy and disorder of the system. Less and less effort is spent on fixing the original design flaws: more and more is spent on fixing flaws introduced by earlier fixes… Sooner or later the fixing ceases to gain any ground. Each forward step is matched by a backward one.”
So, the system will become more difficult and expensive to operate and maintain, and you will end up with more bugs and security vulnerabilities – and these bugs and security holes will be harder to find and fix. At the same time you will have a harder time keeping together a good team because nobody wants to wade knee deep in garbage if they don’t have to. At some point you will be forced to throw away everything that you have learned and all the money that you have spent, and build a new system. And start the cycle all over again.

The solution to this, of course, is to be proactive: to maintain the integrity of the design by continuously refactoring and improving the code as you learn, filling in short-cuts, eliminating duplication, cleaning up dead-ends, simplifying as much as you can. In doing this, you need to balance the technical risks and costs of change in the short term against the longer-term costs and risks of letting the system slowly go to hell.

In the short term, we need to understand and overcome the risk of making changes. Michael Feathers, in his excellent book Working Effectively with Legacy Code, talks about the fear and risk of change that some teams face:
“Most of the teams that I’ve worked with have tried to manage risk in a very conservative way. They minimize the number of changes they make to the code base. Sometimes this is a team policy: ‘if it’s not broke, don’t fix it’…. ‘What? Create another method for that? No, I’ll just put the lines of code right here in the method, where I can see them and the rest of the code. It involves less editing, and it’s safer.’

It’s tempting to think we can minimize software problems by avoiding them, but, unfortunately, it always catches up with us. When we avoid creating new classes and methods, the existing ones grow larger and harder to understand. When you make changes in any large system, you can expect to take a little time to get familiar with the area you are working with. The difference between good systems and bad ones is that, in the good ones, you feel pretty calm after you’ve done that learning, and you are confident in the change you are about to make. In poorly structured code, the move from figuring things out to making changes feels like jumping off a cliff to avoid a tiger. You hesitate and hesitate.

Avoiding change has other bad consequences. When people don’t make changes often they get rusty at it…The last consequence of avoiding change is fear. Unfortunately, many teams live with incredible fear of change and it gets worse every day. Often they aren’t aware of how much fear they have until they learn better techniques and the fear starts to fade away.”
It’s clear that avoiding changes won’t work. We need to get and keep control over the situation by making careful and disciplined changes. And we need to protect ourselves from making mistakes.

Back to Mr. Feathers:
“Most of the fear involved in making changes to large code bases is fear of introducing subtle bugs; fear of changing things inadvertently”.
The answer is to ensure that you have a good testing safety net in place (from Michael Feathers one more time):
“With tests, you can make things better with impunity… With tests, you can make things better. Without them, you just don’t know whether things are getting better or worse.”
You need enough tests to ensure that you understand what the code does, and you need to target tests that will detect changes in behavior in the areas that you want to change.

Put in a good set of tests. Refactor. Review and verify your refactoring work. Then make your changes and review and verify again. Don’t change implementation and behavior at the same time.
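One way to put that safety net in place is what Feathers calls characterization tests: before you touch anything, write tests that pin down what the code actually does today, bugs and all. Here is a small sketch, assuming a made-up calculate_fee function in a made-up billing module.

# A sketch of characterization (pinning) tests: capture the current behavior
# of legacy code before refactoring it. The billing module and calculate_fee
# function are invented for illustration; the expected values come from
# running the existing code and recording what it returns, not from a spec.
import unittest

from billing import calculate_fee   # hypothetical legacy module


class CharacterizeFeeCalculation(unittest.TestCase):

    def test_standard_retail_order(self):
        self.assertEqual(calculate_fee(amount=100.00, client_type="retail"), 2.50)

    def test_zero_amount_keeps_its_odd_behavior(self):
        # Even behavior that looks like a bug gets pinned down, so that the
        # refactoring doesn't silently change what users already depend on.
        self.assertEqual(calculate_fee(amount=0.00, client_type="retail"), 0.25)


if __name__ == "__main__":
    unittest.main()

Run these before and after each small refactoring step; if one fails, you changed behavior, not just structure.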

But there are still risks – you are making changes, and there are limits to how much protection you can get from developer testing, even if you have a high level of coverage in your automated tests and checks. Ryan Lowe reinforces this in “Be Mindful of Code Entropy”:
“The reason you don’t want to refactor established code: this ugly working code has stood the test of time. It’s been beaten down by your users, it’s been tested to hell in the UI by manual testers and QA people determined to break it. A lot of work has gone into stabilizing that mess of code you hate.

As much as it might pain you to admit it, if you refactor it you’ll throw away all of the manual testing effort you put into it except the unit tests. The unit tests… can be buggy and aren’t nearly as comprehensive as dozens/hundreds/thousands of real man hours bashing against the application.”
As with any change, code reviews will help find mistakes, and so will static analysis. We also ask our developers to make sure that the QA team understands the scope of their refactoring work so that they can include additional functional and system regression testing work.

Beyond the engineering discipline, there is the decision of how much to refactor. So as a developer, how do you make this decision: how much is right, how much is necessary? There is a big difference between minor, in-phase restructuring of code (which is just plain good coding) and fundamental re-design work, what Martin Fowler and Kent Beck call “Big Refactorings”, which clearly needs to be broken out and done as separate pieces of work. The answer lies somewhere in between these points.

I recently returned from a trip to the Buddhist kingdom of Bhutan, where I was reminded of the value of finding balance in what we do, the value of following the Middle Way. It seems to me that the answer is to do “just enough”. To refactor only as much as you need to make the problem clear, to understand better how the code works, to simplify the change or fix… and no more.

By doing this, we still abide by Bob Martin’s Boy Scout Rule and leave the code cleaner than when we checked it out. We help protect the future value of the software. At the same time we minimize the risk of change by being careful and disciplined and patient and humble. Just enough.
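To make “just enough” a little more concrete, here is a small before-and-after sketch. The domain (a credit-limit check) and every name in it are invented; the point is only that the refactoring goes exactly as far as the change at hand needs, and no further.

from dataclasses import dataclass


class OrderRejected(Exception):
    pass


@dataclass
class Account:
    credit_limit: float
    exposure: float


# Before: the rule we need to change is buried inline in a longer method.
def process_order(quantity, price, account):
    if quantity * price > account.credit_limit - account.exposure:
        raise OrderRejected("over limit")
    # ... booking, logging and notification elided ...


# After "just enough" refactoring: the rule is pulled out and named, which
# makes the fix obvious and easy to test. The rest of the method, however
# ugly, is left alone for another day.
def available_credit(account):
    return account.credit_limit - account.exposure


def process_order_refactored(quantity, price, account):
    if quantity * price > available_credit(account):
        raise OrderRejected("over limit")
    # ... booking, logging and notification elided ...

Nothing more than that: no renamed packages, no new abstraction layers, just enough structure to make the change safe and clear.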

Friday, May 7, 2010

Why isn't Risk Management included in Scrum?

There are a lot of fundamentally good ideas in Scrum, and a lot of teams seem to be having success with it, on more than trivial projects. But I’m still concerned that there are some serious weaknesses in Scrum, especially “out of the box”, where Scrum leaves you to come up with your own engineering framework, your own way to build good software.

I am happy to see that recent definitions (or reinterpretations) of Scrum, for example in Mike Cohn’s book Succeeding with Agile: Software Development Using Scrum, incorporate many of the ideas from Extreme Programming and other basic good management and engineering practices to fill in some of the gaps.

But I am not happy to see that risk management is still missing from Scrum. There is no discussion of risk management in the original definition of Scrum, or in the CSM training course (at least when I took it), or in the latest Scrum Guide (although with all of the political infighting in the Scrum community, I am not sure where to find the definitive definition, if such a thing exists any more), not even in Mike Cohn’s new book.

I just don’t understand this.

I manage an experienced development and delivery team, and we follow an incremental, iterative development approach based on Scrum and XP. These are smart, senior people who understand what they are doing, they are talented and careful and disciplined, and we have good tools and we have a lot of controls built in to our SDLC and release processes. We work closely with our customer, we deliver software often (every 2-3 weeks to production, sometimes faster), we continuously review and learn and get better. But I still spend a lot of my time watching out for risks, trying to contain and plan for business and technical and operational risks and uncertainties.

Maybe I worry too much (some of my colleagues think so). But not worrying about the risks won’t make them go away. And neither will following Scrum by itself. As Ken Schwaber, one of the creators of Scrum, admits:
“Scrum purposefully has many gaps, holes, and bare spots where you are required to use best practices – such as risk management.”
My concern is that this should be made much clearer to people trying to follow the method. The contrast between Extreme Programming and Scrum is stark: XP confronts risk from the beginning, and the idea and importance of managing project and technical risks is carried throughout XP. Kent Beck, and later others including Martin Fowler, examine the problems of managing risk in building software, prioritize work to take into account both business value and technical risk (dealing with the hard problems and most important features as early as possible), and outline a set of disciplines and controls that need to be followed.

In Scrum, some management of risks is implicit, essentially burned into the approach of using short time-boxed sprints, regular meetings and reviews, more efficient communications, and a self-managing team working closely with the customer, inspecting and adapting to changes and new information. I definitely buy into this, and I’ve seen how much risk can be managed intrinsically by a good team following incremental, iterative development.

As I pointed out earlier, Scrum helps to minimize scope, schedule, quality, customer and personnel risks in software development through its driving principles and management practices.

But this is not enough, not even after you include good engineering practices from XP and Microsoft’s SDL and other sources. Some leaders in the Agile community have begun to recognize this, but there is an alarming lack of consensus within the community on how risks should be managed:

- by the team, intrinsically: like any other requirement or issue, members of the self-managing team will naturally recognize and take care of risks as part of their work. I’m not convinced that a development team is prepared to naturally manage risks, even an experienced team – software development is essentially an optimistic effort: you need to believe that you can create something out of nothing, and most of the team’s attention will naturally be on what they need to do and how best to get it done, rather than on trying to forecast what could go wrong and how to prepare for it;

- by the team, explicitly: the team owns risk management, and is responsible for identifying and managing risks using a risk management framework (risk register, risk assessment and impact analysis, and so on; a minimal sketch of a risk register follows this list);

- by the ScrumMaster, as another set of issues to consider when coaching and supporting the team, although how the ScrumMaster then ensures that risks are actually addressed is not clear, since they don’t control what work gets done and when;

- by the Product Owner: this idea has a lot of backing, supposedly because the Product Owner, acting for the customer, understands the risk of failure to the business best. The real reason is that the team can absolve themselves of responsibility for risk management and hand it off to the customer. This makes me angry. It is unacceptable to suggest that someone representing the customer should be made responsible for assuming the risk for the project. The customer is paying your organization to do a job, and naturally expects you, as the expert, to do this job professionally and competently. And yet you are asking this same customer to take responsibility for the risk that the project won’t be delivered successfully? Try selling that to any of the enterprise customers that I have worked with over the past 15 years…
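Whichever of these options a team lands on, the mechanics of explicit risk management don’t need to be heavy. Here is a minimal sketch of a risk register, scoring exposure as probability times impact; the fields, scales and example entries are all invented for illustration.

# A sketch of a lightweight risk register. Each risk gets a rough probability
# and impact, exposure = probability * impact, and the list is reviewed
# regularly (for example at sprint planning). Everything here is illustrative.
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    probability: float   # 0.0 - 1.0, best-guess likelihood
    impact: int          # 1 (minor) to 5 (project-threatening)
    owner: str
    mitigation: str

    @property
    def exposure(self) -> float:
        return self.probability * self.impact


register = [
    Risk("Subcontractor misses the integration date", 0.3, 4,
         "project manager", "agree on a fallback stub interface now"),
    Risk("Product Owner loses internal sponsorship", 0.2, 5,
         "project manager", "brief the executive sponsor directly each month"),
    Risk("Regression in the downstream settlement feed", 0.4, 3,
         "dev lead", "add regression tests around the feed before changing it"),
]

# Review the highest-exposure risks first.
for risk in sorted(register, key=lambda r: r.exposure, reverse=True):
    print(f"{risk.exposure:.1f}  {risk.description} -> {risk.mitigation}")

The hard part isn’t the register itself; it’s deciding who owns it and making sure someone actually works the mitigations, which is exactly the question the list above leaves open.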

Risk management for a Scrum project needs to take all of this into account, along with all of the risks and issues that are external to the project team:

- the political risks involved in the team’s relationship with the Product Owner, and the Product Owner’s position and influence within their own organization. Scrum places too much responsibility on the Product Owner: it requires that this person is competent, is committed to the project, deeply understands the needs of the business, and always puts the interests of the business ahead of their own; and of course it assumes that they are not in conflict with other important customer stakeholders. They also need to understand the importance of technical risk, and be willing to trade off technical risk against business value in planning and scheduling work in the backlog. I am lucky in that I have the opportunity to work with a Product Owner who actually meets this profile. But I know how rare and precious this is;

- the political risks of managing executive sponsors and other key stakeholders within the customer and within your own organization - it's naive to think that delivering software regularly and keeping the Product Owner happy is enough to keep the support of key sponsors;

- financial and legal risks – making sure that the project satisfies the legal and business conditions of the contract, dealing with intellectual property and confidentiality issues, and financial governance;

- subcontractors and partners – making sure that they are not going to disappoint or surprise you, and that you are meeting your commitments to them;

- infrastructure, tooling, environmental factors – making sure that the team has everything that they need to get the job done properly;

- integration risks with other projects - requirements for program management, coordinating and communicating with other teams, managing interdependencies and resource conflicts, handling delays and changes between projects;

- rollout and implementation, packaging, training, marketing – managing the downstream requirements for distribution and deployment of the final product.

So don’t throw out that PMP.

It’s clear that you need real and active project management and risk management on Scrum projects, as on any project: focused not so much inwards, but outwards. So much of what you read about Scrum, and Agile project management generally, is naive in that it assumes that the development work is being done in a vacuum, that there is nothing that concerns the project outside the immediate requirements of the development team. Even assuming that the team is dealing effectively with technical risks, and that other immediate delivery risks are contained by being burned into doing the work incrementally and following engineering best practices, you still have to monitor and manage the issues outside of the development work, the larger problems and issues that extend beyond the team and the next sprint, the issues that can make the difference between success and failure.