Friday, May 7, 2010

Why isn't Risk Management included in Scrum?

There are a lot of fundamentally good ideas in Scrum, and a lot of teams seem to be having success with it, on more than trivial projects. But I’m still concerned that there are some serious weaknesses in Scrum, especially “out of the box”, where Scrum leaves you to come up with your own engineering framework – your own way to build good software.

I am happy to see that recent definitions (or reinterpretations) of Scrum, for example in Mike Cohn’s book Succeeding with Agile Software Development Using Scrum incorporate many of the ideas from Extreme Programming and other basic good management and engineering practices to fill in some of the gaps.

But I am not happy to see that risk management is still missing in Scrum. There is no discussion of risk management in the original definition of Scrum, in the CSM training course (at least when I took it), or in the latest Scrum Guide (although with all of the political infighting in the Scrum community, I am not sure where to find the definitive definition, if such a thing exists any more) – not even in Mike Cohn’s new book.

I just don’t understand this.

I manage an experienced development and delivery team, and we follow an incremental, iterative development approach based on Scrum and XP. These are smart, senior people who understand what they are doing; they are talented and careful and disciplined; and we have good tools and a lot of controls built into our SDLC and release processes. We work closely with our customer, we deliver software often (every 2-3 weeks to production, sometimes faster), and we continuously review and learn and get better. But I still spend a lot of my time watching out for risks, trying to contain and plan for business and technical and operational risks and uncertainties.

Maybe I worry too much (some of my colleagues think so). But not worrying about the risks won’t make them go away. And neither will following Scrum by itself. As Ken Schwaber, one of the creators of Scrum, admits:
“Scrum purposefully has many gaps, holes, and bare spots where you are required to use best practices – such as risk management.”
My concern is that this should be made much more clear to people trying to follow the method. The contrast between Extreme Programming and Scrum is stark: XP confronts risk from the beginning, and the idea and importance of managing project and technical risks is carried throughout XP. Kent Beck, and later others including Martin Fowler, examine the problems of managing risk in building software, prioritizing work to take into account both business value and technical risk (dealing with the hard problems and most important features as early as possible), and outlining a set of disciplines and controls that need to be followed.

In Scrum, some management of risks is implicit, essentially burned-in to the approach of using short time-boxed sprints, regular meetings and reviews, more efficient communications, and a self-managing team working closely with the customer, inspecting and adapting to changes and new information. I definitely buy into this, and I’ve seen how much risk can be managed intrinsically by a good team following incremental, iterative development.

As I pointed out earlier, Scrum helps to minimize scope, schedule, quality, customer and personnel risks in software development through its driving principles and management practices.

But this is not enough, not even after you include good engineering practices from XP and Microsoft’s SDL and other sources. Some leaders in the Agile community have begun to recognize this, but there is an alarming lack of consensus within the community on how risks should be managed:

- by the team, intrinsically: like any other requirement or issue, members of the self-managing team will naturally recognize and take care of risks as part of their work. I’m not convinced that a development team is prepared to naturally manage risks, even an experienced team – software development is essentially an optimistic effort, you need to believe that you can create something out of nothing, and most of the team’s attention will naturally be on what they need to do and how best to get it done, rather than on forecasting what could go wrong and preparing for it;

- by the team, explicitly: the team owns risk management, and is responsible for identifying and managing risks using a risk management framework – risk register, risk assessment and impact analysis, and so on (see the sketch after this list);

- by the ScrumMaster, as another set of issues to consider when coaching and supporting the team, although how the ScrumMaster then ensures that risks are actually addressed is not clear, since they don’t control what work gets done and when;

- by the Product Owner: this idea has a lot of backing, supposedly because the Product Owner, acting for the customer, best understands the risk of failure to the business. The real reason is that the team can absolve themselves of responsibility for risk management and hand it off to the customer. This makes me angry. It is unacceptable to suggest that someone representing the customer should be made responsible for assuming the risk for the project. The customer is paying your organization to do a job, and naturally expects you, as the expert, to do this job professionally and competently. And yet you are asking this same customer to take responsibility for the risk that the project won’t be delivered successfully? Try selling that to any of the enterprise customers that I have worked with over the past 15 years…
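If the team does own risk management explicitly, the mechanics don’t have to be heavyweight. Here is a minimal sketch of the kind of risk register described above – the scoring scheme (exposure = probability × impact) and the sample risks are my own illustration, not part of Scrum or of any particular framework:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A minimal risk register: each risk is scored by probability and impact,
// exposure = probability x impact, and the register is reviewed and
// re-scored at a regular checkpoint (for example, sprint planning).
public class RiskRegister {

    enum Level {
        LOW(1), MEDIUM(2), HIGH(3);
        final int score;
        Level(int score) { this.score = score; }
    }

    record Risk(String description, String owner,
                Level probability, Level impact, String mitigation) {
        int exposure() { return probability.score * impact.score; }
    }

    private final List<Risk> risks = new ArrayList<>();

    void add(Risk risk) { risks.add(risk); }

    // Highest-exposure risks first: these are the ones to talk about this sprint.
    List<Risk> byExposure() {
        return risks.stream()
                .sorted(Comparator.comparingInt(Risk::exposure).reversed())
                .toList();
    }

    public static void main(String[] args) {
        RiskRegister register = new RiskRegister();
        register.add(new Risk("Key subcontractor may miss delivery", "PM",
                Level.MEDIUM, Level.HIGH, "Weekly checkpoint; line up fallback vendor"));
        register.add(new Risk("Middleware upgrade untested at volume", "Tech lead",
                Level.HIGH, Level.HIGH, "Stress test in staging before rollout"));
        register.byExposure().forEach(r ->
                System.out.println(r.exposure() + "  " + r.description()));
    }
}
```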

Risk management for a Scrum project needs to take all of this into account, including the risks and issues that are external to the project team:

- the political risks involved in the team’s relationship with the Product Owner, and the Product Owner’s position and influence within their own organization. Scrum places too much responsibility on the Product Owner: it requires that this person is not only competent, committed to the project, and deeply familiar with the needs of the business, but that they also always put the interests of the business ahead of their own – and of course it assumes that they are not in conflict with other important customer stakeholders over important issues. They also need to understand the importance of technical risk, and be willing to trade off technical risk against business value in planning and scheduling work in the backlog. I am lucky in that I have the opportunity to work with a Product Owner who actually meets this profile. But I know how rare and precious this is;

- the political risks of managing executive sponsors and other key stakeholders within the customer and within your own organization - it's naive to think that delivering software regularly and keeping the Product Owner happy is enough to keep the support of key sponsors;

- financial and legal risks – making sure that the project satisfies the legal and business conditions of the contract, dealing with intellectual property and confidentiality issues, and financial governance;

- subcontractors and partners – making sure that they are not going to disappoint or surprise you, and that you are meeting your commitments to them;

- infrastructure, tooling, environmental factors – making sure that the team has everything that they need to get the job done properly;

- integration risks with other projects - requirements for program management, coordinating and communicating with other teams, managing interdependencies and resource conflicts, handling delays and changes between projects;

- rollout and implementation, packaging, training, marketing – managing the downstream requirements for distribution and deployment of the final product.

So don’t throw out that PMP.

It’s clear that you need real and active project management and risk management on Scrum projects, like any project: focused not so much inwards, but outwards. So much of what you read about Scrum, and Agile project management generally, is naive in that it assumes that the development work is being done in a vacuum, that there is nothing that concerns the project outside the immediate requirements of the development team. Even assuming that the team is dealing effectively with technical risks, and that other immediate delivery risks are contained by being burned-in to doing the work incrementally and following engineering best practices, you still have to monitor and manage the issues outside of the development work, the larger problems and issues that extend beyond the team and the next sprint, the issues that can make the difference between success and failure.

Thursday, April 22, 2010

XP and the Art of Software Maintenance

Most of us in software development will spend most of our careers maintaining and supporting software, not just working on new, greenfield projects. Maintenance is brownfield work, it’s hard work under difficult and sometimes unfair conditions and strict constraints. It’s not glamorous, and it takes discipline and skill and commitment to succeed over the long term.

There are a number of challenges in successfully maintaining a system, especially a big business-critical system, a system that has been around for a while, that represents the work of many people over many years.

There are the technical challenges in safely managing change: minimizing technical and operational risk, recognizing and containing technical debt, understanding and working with code that you did not have a hand in writing (and whose author has long since left the company), testing and installing upgrades to ensure that the technology stack does not become obsolete, and keeping up with the changing security threat landscape. You have to set and maintain a high level of quality, continuously review and refactor the design and implementation, and make sure that you don’t let entropy set in and fall into the trap of having to re-write the system, losing all of the work that has already been done – what Joel Spolsky calls “the single worst strategic mistake that any software company can make”.

And there are challenges in managing the team, keeping the strongest possible team together for as long as possible, sustaining momentum and commitment and engagement over time, making the work interesting and worthwhile and important and fun.

As I described in a previous post Everything I needed to know about Maintenance, I learned how a company could be successful over a long period of time maintaining and supporting the same software, continuously focusing on delivering new value to customers and pushing for technical excellence, all done by a small senior team following an incremental development approach. I have tried to apply these ideas at my current firm, where we have been supporting and maintaining a business-critical financial application for more than 3 years now.

Many of these same ideas and practices, and more, are captured in today’s agile development methods: XP, in particular, is an engineering and management framework that is especially well-suited for software maintenance. XP provides a foundation for maintenance through:

Frequent, small releases to production: always be delivering, creating a sense of accomplishment for the team and continuously delivering value to the customer, responding to change, and learning from feedback.

A constant focus on quality: “No defect is acceptable, each is an opportunity for the team to learn and improve”.

Making change safe through automated developer testing, building a regression safety net.

Continuous integration: always knowing that the code works.

Close contact with the customer (the business side and operations both).

Lightweight, continuous communications, with just enough documentation.

Refactoring: constantly improving code and design – critical in recognizing and reducing technical debt.

Slack and sustainable pace: giving people time to do their work, preventing burnout, and ensuring that you can fit in other work demands, especially important in maintenance because you can’t control requirements for support and firefighting.

Collective code ownership, skilling-up and skilling-out the team: in maintenance, by necessity sooner or later just about everyone is going to touch just about every piece of code.

Just enough design, breaking problems down and finding the simplest possible solution – this is a controversial aspect of XP when building enterprise systems from scratch, but it is the right approach in maintenance where you are dealing with incremental problems and the constraints of existing architecture and technology.

Transparency and respect – creating and maintaining an open, trusting and respectful environment within the team, with operations and with the customer.

Now, in our case, I don’t mean 100%, full-on, hardcore, literally by-the-book XP: I mean a dialed-down implementation of Extreme Programming, a less-Extreme Extreme Programming. As Kent Beck states in Extreme Programming Explained: Embrace Change:
"The values, principles and practices are there to provide guidance, challenge and accountability… The goal is successful and satisfying relationships and projects, not membership in the ‘XP Club’."
We have followed his guidance, to
“Experiment with XP using these practices as your hypotheses”
and we have adapted the ideas and principles of XP to our situation, our experience and way of working.

XP, by Kent Beck’s admission, is an integrated set of good engineering and management practices, dialed up to 10. In adapting and integrating the practices in XP for maintenance, we have dialed back in specific areas:

Pair Programming

All of our code changes are reviewed before being released. The idea behind pairing is that if code reviews are good, continuous code reviews are better. Like my friend Pete McBreen in his post Still Questioning Extreme Programming, I think that pairing makes sense in specific cases, especially in troubleshooting and in helping people new to the team, but it is not necessary all the time – our developers pair up when it makes sense.

Test First Development

We rely extensively on automated testing, testing early and testing often. Our developers choose to follow TFD or TDD practices, or write tests after the code is written, as they see fit, following the principle that
“code and testing can be written in either order… write tests in advance when possible”.
We put a lot of emphasis on testing, on automating developer testing – this is another area where XP is in agreement:
“in XP testing is as important as programming”.
This is a critical idea in maintenance, where even small changes to an existing system can have significant consequences.
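To show what this looks like in practice, here is a small test-first sketch in JUnit – the fee calculator is invented for illustration, not code from our system: the test pins down the intended behaviour first, the simplest production code is written to make it pass, and the test then stays in the regression safety net.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// The tests are written first: they document the intended behaviour of a
// (hypothetical) fee calculation before the production code exists.
public class SettlementFeeTest {

    @Test
    public void feeIsBasisPointsOfNotional() {
        // 2 basis points on a notional of 1,000,000.00 = 200.00
        assertEquals(200_00L, SettlementFee.inCents(1_000_000_00L, 2));
    }

    @Test(expected = IllegalArgumentException.class)
    public void negativeNotionalIsRejected() {
        SettlementFee.inCents(-1L, 2);
    }
}

// The simplest production code that makes the tests pass.
class SettlementFee {
    static long inCents(long notionalCents, int basisPoints) {
        if (notionalCents < 0) throw new IllegalArgumentException("negative notional");
        return notionalCents * basisPoints / 10_000;
    }
}
```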

Unlike some “pure” XP teams, we don’t rely only on the automated test suites: we have a senior team of testing specialists who conduct exploratory and destructive testing, system and integration testing, operational testing; we schedule regular application penetration tests; we run system trials and simulations with the team involved in interactive, loosely structured “war games”; and we take advantage of technologies such as static analysis in our Continuous Integration environment. Our automated regression testing safety net is an important asset, but we find valuable, higher-risk problems in exploratory testing, reviews and in the war games simulations.

Incremental Delivery

We follow a model closer to the first edition description of XP, releasing to production every 2-3 weeks rather than holding to a 1-week cycle – extremely short, rapid cycling isn’t sustainable, and doesn’t allow enough time for the reviews and other checkpoints that we have found necessary and valuable.

The Development/Maintenance Problem

In many organizations there are “developers” and “maintainers”: a team of hotshots is hired to design and build the initial system, and once the “hard work” is done, they hand it off to the maintenance and support or “sustained engineering” crew: kids and old-timers and other misfits who don’t have what it takes (yet, or any more) to do “real software development”; or the work of maintenance and support is offshored to a team in India or Eastern Europe to save costs.

But this is not the case in XP, as Pete McBreen points out in Questioning Extreme Programming
"The interesting thing about XP, however, is that it assumes that applications are never really going to be handed off to a separate maintenance team. The assumption is that after each incremental release, the customer will want more functionality and keep funding the development team… As such, there is never a need to hand the application over to a maintenance team; the original development team can continue to support the application indefinitely”.
This idea of keeping the team together, preserving the knowledge that has been built up over time, the team’s deep and shared understanding of the domain, their proven ability to deliver, is critical and fundamental. This is the real intellectual property, the real value that you have created. Eric Brechner at Microsoft explores this in his post on Sustained engineering idiocy, where he shows that keeping the development team engaged in maintenance and support builds accountability, creates a deeper understanding of the system and of the customer’s needs, and informs future development, using the feedback from support to improve the quality and reliability of future releases or future products. Eric provides some useful ideas on how to balance the requirements of maintenance and support against future development: structuring your team around a core with an evaluation team that investigates and triages issues, and using backlog management to feed fixes into the incremental development schedule.

Most of the work that will be done on a piece of software, up to 70% in some studies, is done during maintenance. It just makes sense to have your best people working on what is most important: protecting the investment that you and your customer made in building the software in the first place; supporting the ongoing business operations of your customers; and ensuring that you and your customers will continue to succeed in the future.

Monday, March 22, 2010

Failure Isolation and Recovery: Learning from High-Scale and Extreme-Scale Computing

While I have been building business-critical enterprise systems for a long time, I haven't worked on high-scale cloud computing or Internet-scale architectures, with tens of thousands or hundreds of thousands of servers. There are some fascinating, hard problems that need to be solved in engineering systems at high-scale, but the most interesting to me are problems in deployment and operations management, and especially how to deal with failure.

With so many moving parts in high-scale systems, failures of one kind or another are common: disks (especially), servers, racks, networks, data center outages and geographic disasters, database failures, middleware and application software failures, and of course human error – mistakes in configuration and operations. As you scale up, you will also encounter multiple simultaneous failures and failure combinations, normal accidents, more heisenbugs and mandelbugs, and data corruption problems and other silent failures.

These challenges are taken to extremes in petascale and exascale HPC platforms. The Mean Time to Failure (MTTF) on petascale systems (the largest supercomputers running today) can be as low as 1 day. According to Michael Heroux at Sandia National Laboratories, exascale systems, consisting of millions of processors and capable of handling 1 million trillion calculations per second (these computers don’t exist yet, but are expected in the next 10 years),
“will have very high fault rates and will in fact be in a constant state of decay. ‘All nodes up and running’, our current sense of a well-functioning scalable system, will not be feasible. Instead we will always have a portion of the machine that is dead, a portion that is dying and perhaps producing faulty results, another that is coming back to life and a final, hopefully large, portion that is computing fast and accurate results.”
For enterprise systems, of course, rates of component or combination failures will be much lower, but the same risks exist, and the principles still hold. Recognizing that failures can and will happen, it is important to:

    Identify failures as quickly as possible.

    Minimize and contain the impact of failures.

    Recover as quickly as possible.

At QCON SF 2009, Jason McHugh described how Amazon’s S3 high-scale cloud computing service is architected for resiliency in the face of so many failure conditions. He lists 7 key principles for system survivability at high-scale:

1. Decouple upstream and downstream processes, and protect yourself from dependencies upstream and downstream when problems occur: overload and spikes from upstream, failures and slow-downs downstream (see the sketch after this list).

2. Design for large failures.

3. Don’t trust data on the wire or on disk.

4. Elasticity – resources can be brought online at any time.

5. Monitor, extrapolate and react: instrument, and create feedback loops.

6. Design for frequent single system failures.

7. Hold “Game Days”: shut down a set of servers or a data center in production, and prove that your recovery strategy works in the real world.
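To make the first principle concrete: one common way to decouple an upstream producer from downstream processing is a bounded queue that sheds load instead of passing a spike through. Here is a minimal sketch – the names and limits are invented for illustration, and this shows the general technique, not Amazon’s implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Decoupling upstream from downstream with a bounded queue: a spike from
// upstream fills the queue and the excess is rejected immediately (load
// shedding), instead of queueing unbounded work and dragging the whole
// service down when a downstream dependency slows.
public class BoundedIntake {

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

    // Upstream-facing side. Never blocks: fail fast on overload.
    public boolean accept(String request) {
        return queue.offer(request); // false = shed this request, caller gets an error
    }

    // Downstream-facing side, run by one or more worker threads.
    public void drainLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String request = queue.take();
            process(request); // a slowdown here backs up into the queue, not the caller
        }
    }

    private void process(String request) {
        // hand the request off to the downstream dependency here
    }
}
```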

Failure management needs to be architected in, and handled at the implementation level: you have to expect, and handle, failures for every API call. This is one of the key ideas in Michael Nygard’s Release It!, that developers of data center-ready, large-scale systems cannot afford to be satisfied dealing with abstractions, that they must understand the environment that the system runs in, and that they have to learn not to trust it. This book offers some effective advice and ideas (“stability patterns”) to protect against horizontal failures (Chain Reactions) and vertical failures (Cascading Failures).

For vertical failures between tiers, use timeouts and delayed retry mechanisms to prevent hangs and blocked threads – protect request handling threads on all remote calls (and resource pool checkouts), and never wait forever. Release It! introduces a Circuit Breaker pattern to manage this: after too many failures, close off the connection, back away, then try again – if the problem has not been corrected, close off and back away again. And remember to fail fast – don’t get stuck blocking or deadlocked if a resource is not available or a remote call cannot be completed.
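A minimal sketch of the Circuit Breaker idea – a simplified illustration of the pattern, not Nygard’s code, with the thresholds invented (synchronization is kept deliberately coarse for brevity):

```java
import java.util.concurrent.Callable;

// Simplified circuit breaker: after a run of consecutive failures, fail fast
// for a cool-down period instead of hammering (and waiting on) a sick dependency.
public class CircuitBreaker {

    private final int failureThreshold;
    private final long coolDownMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    public synchronized <T> T call(Callable<T> remoteCall) throws Exception {
        if (isOpen()) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        try {
            T result = remoteCall.call(); // in real use, wrap this in a timeout too
            consecutiveFailures = 0;      // success closes the breaker
            return result;
        } catch (Exception e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip the breaker
            }
            throw e;
        }
    }

    private boolean isOpen() {
        if (consecutiveFailures < failureThreshold) return false;
        if (System.currentTimeMillis() - openedAt > coolDownMillis) {
            consecutiveFailures = failureThreshold - 1; // half-open: allow one trial call
            return false;
        }
        return true;
    }
}
```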

In horizontal Chain Reactions within a tier, the failure of one component raises the possibility of failure in its peers, as the workload rebalances to overwhelm the remaining services. Protect the system from Chain Reactions with partitions and Bulkheads – build internal firewalls to isolate workloads on different servers or services or resource pools. Virtual servers are one way to partition and isolate workloads, although we chose not to use virtual servers in production because of the extra complexity in management – we have had a lot of success with virtualization for provisioning test environments and management services, but for high-volume, low-latency transaction processing we’ve found that separate physical servers and application partitioning is faster and more reliable.
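In code, a Bulkhead can be as simple as giving each workload its own bounded thread pool, so that one sick or flooded workload exhausts only its own threads – a sketch, with invented workload names and pool sizes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Bulkheads via separate, bounded thread pools: if reporting hangs or floods,
// it exhausts only its own pool; order processing keeps its threads and keeps
// running. The same partitioning can be done with semaphores or separate hosts.
public class Bulkheads {

    private final ExecutorService orderPool = Executors.newFixedThreadPool(20);
    private final ExecutorService reportingPool = Executors.newFixedThreadPool(5);

    public void submitOrder(Runnable work) {
        orderPool.submit(work); // critical workload, isolated from reporting
    }

    public void submitReport(Runnable work) {
        reportingPool.submit(work); // best-effort workload, capped at 5 threads
    }
}
```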

Application and data partitioning is a fundamental design idea in many systems, for example exchange trading engines, where the market is broken down into many different trading products or groups of products, isolated in different partitions for scalability and workload balancing purposes. James Hamilton, formerly with Microsoft and now a Distinguished Engineer on the Amazon Web Services team, highlights the importance of partitioning for application scaling as well as fault isolation and multi-tenancy in On Designing and Deploying Internet-Scale Services, an analysis of key issues in operations management and architecture of systems at high-scale, with a focus on efficiency and resiliency. He talks about the importance of minimizing cross-partition operations, and managing partitions intelligently, at a fine-grained level. Some of the other valuable ideas here include:

    Design for failure

    Zero trust of underlying components

    At scale, even unlikely, unusual combinations can become commonplace

    Version everything

    Minimize dependencies

    Practice your failover operations regularly - in production

Read this paper.

All of these ideas build on research work in Recovery Oriented Computing at Berkeley and Stanford, which set out to solve reliability problems in systems delivered “in Internet time” during the dot com boom. The basics of Recovery Oriented Computing are:

Failures will happen, and you cannot predict or avoid failures. Confront this fact and accept it.

Avoiding failures and maximizing Mean Time to Failure by taking careful steps in architecture and engineering and maintenance, and minimizing opportunities for human error in operations through training, procedures and tools – all of this is necessary, but it is not enough. For “always-on” availability, you also need to minimize the Mean Time to Recovery (MTTR).

Recovery consists of 3 steps:

1. Detect the problem
2. Diagnose the cause of the problem
3. Repair the problem and restore service.

Identify failures as quickly as possible. Minimize and contain the impact of failures. Provide tested and reliable tools for system administrators to recover the system quickly and safely.

One of the key enablers of Recovery Oriented Computing is, again, partitioning: to isolate faults and contain damage; to simplify diagnosis; to enable rapid online repair, recovery and restart at the component level; and to support dynamic provisioning – elastic, or “on demand” capacity upgrades.
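To illustrate the routing side of this kind of partitioning (the product-to-partition scheme here is my own sketch, loosely modelled on the exchange example earlier, not code from any of the cited papers): each partition owns a disjoint slice of the workload, so a failed partition takes out only its own slice, can be diagnosed in isolation, and can be repaired and restarted on its own.

```java
// Fine-grained partitioning: each product symbol maps deterministically to one
// partition. A failed partition affects only its slice of the workload and can
// be restarted on its own; adding partitions supports elastic capacity upgrades.
public class PartitionRouter {

    private final int partitionCount;

    public PartitionRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    public int partitionFor(String productSymbol) {
        // floorMod keeps the result non-negative for any hashCode value
        return Math.floorMod(productSymbol.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(8);
        for (String symbol : new String[] {"ACME", "GLOBEX", "INITECH"}) {
            System.out.println(symbol + " -> partition " + router.partitionFor(symbol));
        }
    }
}
```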

As continuing reports of cloud computing failures and other large scale SaaS problems show, these problems have not been solved, and as the sheer scale of these systems continues to increase, they may never be solved. But there is still a lot that we can learn from the work that has been done so far.

Saturday, March 13, 2010

Secure Software Development: Awareness Training

I’ve been looking for a good basic awareness course on software security as an introduction for new developers and as a refresher for the rest of the team. A course that covers the important bases, and reinforces fundamental issues in building secure software. I prefer online training for something like this: if you have the discipline, online and on-demand training works because it is easier to schedule, people can go as fast or as slow as they like, and of course it is less expensive.

Stanford Software Security Foundations

Last year I took the Software Security Foundations course, the first course in Stanford University’s Advanced Computer Security certificate program: 6 courses, all offered online, on-demand.

Foundations is a basic introduction to software security for developers and technical managers. The course was designed (and is principally delivered) by Neil Daswani, a former security program manager at Google, a Stanford alumnus, and now a principal at Dasient, a web security startup.

The course is based on Mr. Daswani’s book Foundations of Computer Security: What Every Programmer Needs to Know: it is useful to have a copy of the book handy while going through the course. Foundations is a day’s worth of lectures made available as online videos for a 3-month period, with slides that can be downloaded for printing, and an exam at the end to ensure that you aren’t lazy about following the material. It covers the fundamentals of software security, including:

Security Principles

Authentication, authorization, access control, confidentiality, data integrity, non-repudiation, risk management, and secure system design principles including least privilege, fail-safe and defense-in-depth. Good, but no real surprises here.

Secure Programming

Buffer overflows, SQL injection, password management, cross site request forgery, cross site scripting, and common mistakes in crypto. There was good coverage of SQL injection and cross domain security threats, particularly XSRF. However, I found the explanation of Cross Site Scripting confusing, even after reading through the section on cross domain security in the book – it’s not a straightforward problem, ok, but it can be explained in a simpler way. The examples and problems are biased slightly towards C/C++ programmers, but this should not present a problem for programmers working in other environments. It covers most of the important bases with the exception of input validation, which needs more attention.
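The SQL injection lesson, at least, boils down to a few lines of code. A minimal illustration – the table and query are invented, but the pattern is the standard one: bind user input as a parameter, never concatenate it into the SQL string.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountLookup {

    // Vulnerable version (don't do this): input like "' OR '1'='1" rewrites the query.
    //   String sql = "SELECT balance FROM accounts WHERE name = '" + accountName + "'";

    // Safe version: the driver binds accountName as data, never as SQL.
    public static long balanceFor(Connection conn, String accountName) throws SQLException {
        String sql = "SELECT balance FROM accounts WHERE name = ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, accountName);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getLong("balance") : 0L;
            }
        }
    }
}
```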

Introduction to Cryptography

A walkthrough of symmetric encryption and public key cryptography, and then a brief discussion of advanced research in cryptography from Stanford Professor Dan Boneh, an expert in applied cryptography. The explanation of cryptographic primitives was especially lucid, and, well, beautiful: I didn’t know that cipher block chaining could be beautiful, but it is. Some of the more advanced issues in cryptography are covered at a cursory level only (not a lot of it will stick with you if you aren’t already familiar with this subject), and you are referred to other courses offered in the program if you want to get a good, basic understanding of how to use cryptography and secure protocols.
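To give a flavour of the material: cipher block chaining XORs each plaintext block with the previous ciphertext block before encrypting it, which is why it needs a fresh random IV for every message. Here is a hedged sketch using the standard Java crypto API – and note that CBC by itself provides confidentiality only, not integrity, so in practice you would pair it with a MAC:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class CbcExample {
    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();

        // A fresh random IV per message: reusing an IV under CBC leaks
        // whether two messages share a common prefix.
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ciphertext = cipher.doFinal("attack at dawn".getBytes("UTF-8"));

        // The IV is not secret; it is sent alongside the ciphertext.
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        System.out.println(new String(cipher.doFinal(ciphertext), "UTF-8"));
    }
}
```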

SANS Software Security Awareness

In January I checked out the On Demand Software Security Awareness course from the SANS Institute.

I have had success with other courses from SANS before, classroom and On Demand. In particular the course SANS Security Leadership Essentials for Managers is excellent: a challenging and exhaustive 6-day course covering pretty much everything a technical manager would need to know about IT security from technical, project management, risk management, strategic, and legal and ethical perspectives.

Software Security Awareness is a short, 3-hour survey course, offered online as recorded audio lectures and a slide show. SANS prints and binds a copy of the slides and couriers a copy to you shortly after you register for the course. The course covers:

- vulnerability/patch cycle
- security architecture
- principles of software security
- security design
- implementation (coding and deployment)
- input issues
- code review and security testing.

Unfortunately, the course does not start out strong: the walkthrough of the vulnerability/patch cycle is supposed to help build a case for secure software development, but it takes too long to make too fine a point, and it’s hard to stay interested.

The next sections on architecture, principles of software security, and design are confused by a weakness in organization. It’s not clear where architecture ends and design begins, and there is unnecessary repetition between the sections. It would flow better if the discussion started with foundational principles of software security (which are well covered in the course), then proceeded through architecture and then design. Architecture should cover security requirements, risk analysis and threat modeling to determine “how much security is enough”, defense-in-depth, attack surface analysis and minimizing attack surface, complexity and economy of mechanism, cost considerations, and layering and defining trust zones. Some of these issues are covered in the discussion on architecture, some in design, some in both, and some not at all.

The risk assessment discussion surveys different risk modeling techniques, including Microsoft’s STRIDE and DREAD, OCTAVE and others. It includes a recommendation for SALSA, an initiative which SANS seems to have started together with another vendor back in 2007 and which has gone nowhere since. SALSA doesn’t belong with the other established models.

The section on design covers security requirements (again) and some general motherhood on good design practices, then briefly touches on authentication, authorization and permissioning, non-repudiation and auditing, and data confidentiality, but doesn’t explore any of these important areas further in the context of design. This is one area where the Stanford course is much stronger.

The discussion of implementation is good, starting with an explanation of Gary McGraw’s 7 Kingdoms and then a long list of implementation-specific issues, which may or may not apply to your system, but which anyway offers some concrete guidance that may catch the attention of programmers.

Then an excellent, brief exploration of input handling problems, recognizing the importance of not trusting data. And a (too) brief discussion of cryptography in the context of storing confidential data.

Finally, one of the strongest parts of the course, on security code review practices and security testing. Starting with an explanation of why and how to do code reviews (including formal inspections and tool-assisted code reviews), and references to language-specific checklists. Then a discussion of static analysis and a survey of different tools: a bit out of date, missing Microsoft’s SDL tools for example, but a decent overview of available static analysis tools. Then a good survey of testing techniques from a security perspective, including fuzzing and pen testing and risk-based testing, and finally a discussion of attack surface analysis (which more properly belongs in the architecture section).

You have 4 months to complete the 3 hour course, which is more than adequate. Because this was a beta of a new course and a new online delivery model, there was no assessment included.

Comparison of the courses

It was interesting to see that, except for a common focus on basic secure software principles, the courses took different, almost complementary approaches to introducing secure software development.

The SANS course was much more broad and current, with more references to resources like OWASP and frameworks and tools, and more on secure SDLC practices. There were some notable gaps in the SANS course, in architecture and especially design, and it suffers from some structural weaknesses, but it covers many of the bases, and important concepts such as secure error handling, failing safe, input handling, and risk management and threat modeling techniques. With a few improvements in structure and content around architecture and design, this would be an excellent course. As it stands, if I can get the developers past the introduction on the vulnerability cycle, they should learn some useful things.

The Stanford course is more than twice as long and more focused, but doesn’t touch on secure SDLC practices such as reviews and security testing, or risk assessment and threat modeling. It digs deeper into specific security implementation problems such as XSRF, XSS, password management, and SQL injection; and it is also stronger in its coverage of basic secure design, and especially cryptography. This course would be of special interest to developers who want to go on with the full Advanced Security Program at Stanford.

Neither of these courses is perfect, but they are both professionally delivered, and offer good value for the money.

Tuesday, March 2, 2010

Continuously Putting Your Customers at Risk

Over the past year or so there has been some buzz around Continuous Deployment: immediately and constantly pushing out new features directly to customers. This practice is followed by companies such as Facebook, Flickr and IMVU, where apparently programmers push changes out to production up to 50 times per day.

Continuous integration and rapid deployment have a number of advantages, and continuous deployment appears to maximize value from these ideas: immediate and continuous feedback from customers, and cutting deployment and release costs way down, eliminating overheads and manual checks.

A profile on Extreme Agility at Facebook describes how the company’s small development team deploys rapid updates to production:
“…Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of the system and surely fixing any bugs that would result from these frequent small changes.”
But there are some fundamental and serious challenges with continuous deployment. Success depends on a few key factors:
  1. a comprehensive and fast automated test suite to catch mistakes, especially regressions;
  2. customers that are willing to let you test in production;
  3. an architecture that catches and isolates failures, preventing problems from chaining or cascading across the cluster;
  4. a disciplined, proven, robust deployment model.

Automated Testing


At IMVU, all changes are run through an automated test suite, which executes on a cluster of test servers, before deploying to production.
"So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness.”
The author goes on to say that
“We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day.”
Ummm, actually, no it isn't. That’s 15k test cases a day. Run 70 times [with no explanation of why the tests are run 70 times per day, since we’re talking about pushing out 50 changes per day, but anyway…]. You could run 15k test cases a million times and it would still be 15k test cases, in the same way that running 1 test case a million times is still only 1 test case.

A regression suite of 15,000 automated unit and functional tests sounds impressive – of course what is much more important than the number of tests is the quality of tests. We’re a small shop, and we run more than 15,000 automated tests as part of our continuous integration environment, and we also run static analysis checks on all code, and we do peer code reviews, and manual testing including operations tests, exploratory and destructive tests, system trials, stress and performance testing, and end-to-end integration tests before we push to production.

From subsequent statements, it seems clear that most of the changes deployed this way at IMVU at least are trivial, bug fixes and minor modifications: schema changes, for example, are made out of band (they take 2 days to rollout to production at IMVU). So we can assume that the scope, and therefore, risk of any one change can be contained. This is backed up by a comment made by the author at 50 Deployments A Day and the Perpetual Beta:
“when working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks, etc), we’ll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll-out and A/B testing.”
I’m not sure why you wouldn’t have a controlled roll-out plan for solving scalability bottlenecks, but let’s assume that he was referring to minor tweaks to the code or configuration, say increasing the size of a resource pool or something.

Testing in Production


After hearing about the approach followed by IMVU last year, a couple of exploratory testing experts, Michael Bolton and James Bach, spent a few minutes trying out IMVU’s system. They, not surprisingly, found a lot of problems without making much of an effort:
“Yes folks, you can deploy 50 times a day. If you don’t care about the quality of what you’re deploying…”
The writer from IMVU admits that they have a lot of bugs:
“continuous deployment lets you write software *regression free*, it sure doesn’t gift you high quality software.”
Ignore the “*regression free*” claim which assumes that the test suite will always catch *all* regressions. Continuous Deployment essentially concedes the job of testing to your customers: you do some superficial reviews and regression tests, and leave the real work of finding problems to your customers. I can appreciate that this might be acceptable to some customers, for example people participating in online communities or online games, as they trade off the inconvenience of occasional glitches against the chance to try out cool new features quickly.

There’s nothing wrong with running experiments, trying out new ideas with part of your customer base through A/B split testing, seeing what sticks and what doesn’t. But this doesn’t mean you need to, or should, deploy every change directly to production. Again, if your customers are trusting you with financial transactions or sensitive personal information, you would be irresponsible if you took this approach, even if, like Facebook, you only push out changes incrementally to a small number of clients at a time.

Failure Isolation and Fail Safe


If you are going to continually roll out changes, and anticipate that some of these changes will fail, the architecture of the system needs to isolate and contain failures, using internal firewalling, fast-fail techniques, timeouts and retries and so on to reduce the likelihood of a failure chaining through layers or cascading across servers and taking the cluster down.

Unfortunately, this doesn’t seem to be the case, at least in the example described here, where Alex, a programmer, is preparing to deploy code containing a 1-character typo which can cause a failure cascade and take out the site:
“Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal."
So the development team is not only conceding that they cannot write good code, and that they are incapable of doing a decent job testing their work, but also that they cannot take the steps to fail safe at an architectural level and minimize the scope of any failures that they cause. This is not the same as conceding, as Google does, that at massive scale, failures will inevitably happen, and you will have to learn how to deal with them. This is simply giving up, and pushing risk out to customers, again.

A Deployment Model that Works


The deployment model at IMVU is cool: of course, if you are going to do something 50 times per day, you should be pretty good at it.
“The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed.”
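Stripped of the rsync and symlink details, what’s described here is a canary deployment with an automated statistical gate. A skeleton of that logic – the thresholds, metric names and stubs are all invented for illustration, not IMVU’s code:

```java
// Skeleton of the canary gate described above: deploy to a small subset of
// hosts, compare sampled error rates against the pre-push baseline, and roll
// back automatically on a statistically significant regression.
public class CanaryGate {

    // Invented threshold: more than 3 standard deviations above baseline = regression.
    private static final double SIGMA_LIMIT = 3.0;

    static boolean shouldRollBack(double baselineRate, double baselineStdDev,
                                  double canaryRate) {
        return canaryRate > baselineRate + SIGMA_LIMIT * baselineStdDev;
    }

    public static void main(String[] args) {
        double baseline = sampleErrorRate("all-hosts"); // the "basis line" before the push
        double stdDev = 0.002;                          // from historical samples
        deployToSubset();                               // flip the symlink on a few hosts
        double canary = sampleErrorRate("canary-hosts");

        if (shouldRollBack(baseline, stdDev, canary)) {
            rollBack();
        } else {
            deployToAll();                              // then keep monitoring for a while
        }
    }

    // Stubs standing in for the push script's sampling and rsync/symlink steps.
    private static double sampleErrorRate(String hostGroup) { return 0.001; }
    private static void deployToSubset() {}
    private static void deployToAll() {}
    private static void rollBack() {}
}
```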
This rollback model assumes that problems will be found in the first minute, or few minutes of operation. It does not account for race conditions, deadlocks and other synchronization faults, or time-dependent problems or statistical bugs or downstream integration conflicts or intermittent problems which might not show up for hours or even days, by which time dozens or hundreds of other changes have been applied. Good luck finding out where the problem came from by then.

The rollback approach also assumes that all that needs to be done to fix the problem is to rollback the code. It does not account for what needs to be done to track down and repair broken transactions and corrupt data, which might be ok in an online gaming environment, but would be CTO-suicide in a real system.

And you wonder why some web sites have serious software security issues?


Let’s put aside the arguments above about reliability and quality and responsibility, and just look at the problem of security: building secure software in an environment where developers push each change immediately to production.

First, try to imagine someone detecting unauthorized changes with all of this going on. Was every one of those changes intended and authorized? Would you be able to detect an attack in the middle of all of this noise?

Then there’s the problem of building secure software, which is hard enough even if you follow good practice, especially if you are moving fast. There has been a lot of work over the past couple of years defining how developers can build secure software in an agile way. Microsoft’s SDL for Agile maps secure software design and development practices to agile development models like Scrum, and cuts secure software development controls and practices down to fit rapid incremental development.

But with Continuous Deployment, at least as described so far, there is no time or opportunity to do even a minimal set of security checks and reviews before software changes are pushed out by developers.

It’s bad enough to build insecure software out of ignorance. But by following continuous deployment, you are consciously choosing to push out software before it is ready, before you have done even the minimum to make sure it is safe. You are putting business agility and cost savings ahead of protecting the integrity or privacy of customer data.

Continuous deployment sounds cool. In a world where safety and reliability and privacy and security aren’t important, it would be fun to try. But like a lot of other developers, I live in the real world. And I need to build real software.

Wednesday, February 10, 2010

And now we need to be "Rugged"

A new initiative for secure software development, Rugged Software Development, was announced this week at a SANS conference. Rugged Software is
a value system for writing secure software
defined by some smart people in the application security industry.

Presumably the Rugged Software initiative is attempting to duplicate the success of the agile software movement, coming with its own Rugged Software Manifesto:
I am rugged… and more importantly, my code is rugged.
and so on.

The agile development movement was successful because it was driven by and for the people who actually build software: by programmers, for programmers. By smart, experienced programmers, people like Kent Beck and Ward Cunningham who built software for a living and were really good at it, and who were searching together for ways to solve the problems that programmers face in software development, problems that mattered to programmers. It came from inside the software development community, and set out to put programmers effectively back in charge of building software, to make better software, to make the making of software better.

And agile development, at least at the beginning, was cool, counter-culture: agile developers were sticking it to the man, doing what was right, subverting big upfront design and top-down planning and by-the-book project management and so on. It was certain to create a following…. and unfortunately, eventually to become an institutionalized Methodology subsidized by tool vendors and consultants, but that’s another story for another day.

According to one of the founders of the Rugged Software initiative
Getting the secure software development message to the masses won't be easy, and the plan is to get some initial support and momentum from the application security industry.
However well-intentioned and necessary, it looks like another set of ideas and values being imposed from outside on people who are busy building software. We already have other application security initiatives: Cigital's Build Security In and its maturity model for the enterprise, Microsoft’s SDL for the Microsoft community at least, OpenSAMM and other initiatives from OWASP, and half-baked ideas from the InfoSec community like SALSA.

And now we have Rugged Software Development.

To succeed, the initiative needs support and momentum not just from the application security community, but more importantly from the software development community – from the people who actually build software.

Fair enough, these smart and well-intentioned and hard working InfoSec guys are asking for input and participation from the development community. So after being challenged to “walk the walk" I signed up for the Rugged Software forums, blogs, lists and…. Well, there’s the announcement and some trade press coverage. And that Manifesto about ruggedness, and an empty blog and an empty forum. That’s it, that's all I have been able to find so far.

So, I guess I was walking too fast. I will wait and see if there is a real opportunity here, a chance for an initiative that speaks to, and for, the software development community, something that has a real chance to succeed.

Saturday, February 6, 2010

Real Resources for Software Development Managers

With all of the blogs and books and training programs and even a few magazines still available on software development, agile methods and software project management, there is surprisingly little material that is of real value to a software development manager: information that is thoughtful, current and grounded in real experience.

A lot of what you can find is noise. Blog postings by enthusiastic kids who have finished a project or two using Scrum and are now self-made experts on team dynamics and Agile development. Someone pushing a Project 2.0 collaboration tool or yet another “Agile Software Development using…” or “Agile Project Management…” book, or blogs going over (and over) the same ground, or tiresome evangelism for “Big A” Agile training and consulting.

It’s difficult to find useful information in all of this for software development managers who want to dig deeper. I’ve put together a list here of the resources that I have found useful, and find myself going back to.

Construx Software Builders


Construx is a consultancy specializing in software development management, founded by Steve McConnell, a leading thinker on software engineering and best practices in software development, and author of some of the definitive books on software development: Rapid Development and Code Complete. I learned a lot about good software engineering and software development management from these books. While Code Complete, a guide to writing good, clean code, was significantly revised in 2004, Rapid Development is showing its age, and needs to be updated to take into account XP and Scrum and other new ideas in software development. But it is still the best overview available of SDLCs and the risks and success factors in software projects.

Construx offers a wide range of consulting services including organizational reviews and project reviews and software due diligence reviews for acquisitions. They also offer an excellent set of training programs on software project management and software engineering. Some of the courses that my team and I have attended include:

Code Complete Essentials on the basics of good software development

Master Class on Estimating Software Projects

Developer Testing Bootcamp

Professional Tester Bootcamp

and 10x Software Engineering an excellent course on improving software development results for experienced managers.

You can also get access to white papers and posters including their famous list of Classic Mistakes in software development; and CXOne, a lightweight framework for managing software projects, with templates, samples and checklists.

Once each year or so, Construx holds an Executive Software Summit for experienced software development managers, CTOs and other senior people interested in improving how software is built. This is head and shoulders above the other software management conferences that I have attended.

Agile Development Resources


Scrum seems to have won the Agile development methodology wars over XP and DSDM and Crystal and whatnot, fundamentally because it is much easier to understand and follow (and also, unfortunately, easier for teams to build sloppy software faster). So, until Lean Software Development or some other new idea establishes the next wave you should make sure to understand Scrum, even if you don’t swallow it whole.

Certified Scrum Master training is not expensive, it doesn’t take long, and you’ll leave with a decent understanding of the method and its values and driving principles. Make sure to get a good teacher: I went straight to the source, Ken Schwaber. His book Agile Project Management with Scrum summarizes what you’ll learn from the course and is a good resource for follow-up.

Of the Agile / Scrum community blogs, the most valuable that I have found is Mike Cohn’s Succeeding with Agile and I would recommend his book of the same name if you are serious about understanding and implementing Scrum.

Pragmatic Programmer’s Bookshelf


There are a lot of good books on software development in the Pragmatic Programmer’s Bookshelf including a set of special interest for software development managers:

Manage It! by Johanna Rothman, is a simple but excellent book on incremental and iterative development practices, scheduling and estimating and risk management for small and mid-size projects and teams. This is the best, most practical book I have found on applying lightweight, agile development methods, and while it leans towards Scrum it is not dependent on any single methodology. This book also introduces program management and project portfolio management, which is explored in more detail in Manage Your Project Portfolio: useful if you’re new to the problems of managing small programs using lightweight techniques.

If you are serious about program and portfolio management, and you have the time and money, I strongly recommend the professional program on Advanced Project Management at Stanford University which you can attend online or on campus. This is a world-class program on aligning strategy with execution, managing customers, understanding and exercising power and influence, and coordinating and planning integrated programs and project portfolios.

Johanna Rothman is also the co-author (together with Esther Derby) of Behind Closed Doors, an introductory but good book on coaching and mentoring developers and managing teams. It offers practical advice and good reminders on issues like managing priorities, the value of 1-on-1 meetings, how to deal with technical people, how to give feedback – a leadership resource, but written specifically for software development managers and team leads.

Ship It! by Jared Richardson and Will Gwaltney, Jr., outlines the basics of building and deploying software in an agile context: the use of tools for source code control and continuous integration, useful strategies for adopting automated unit testing, basic engineering practices, small team leadership techniques, fundamentals of incremental development, and common problems faced.
Of the books in this series, this is the most basic / introductory, but it is worth a quick read for a framework for iterative, incremental development.

Release It! by Michael T. Nygard, is an excellent resource on technical architecture for distributed, web-based (especially Java) systems – a high-level view of the challenges that your team will face building real systems, patterns and anti-patterns for stability and scalability, and how to engineer a system for real world operations. It is the “hardest” of these books, and will be useful to your architects and Dev/Ops team. You can follow Michael’s current work at his blog, Wide Awake Developers.

Although not part of the Pragmatic Programmer’s suite, you also need to read The Visible Ops Handbook, a simple, practical introduction to IT systems change management and release management.

Software Project Management Resources


I wrote earlier about Scott Berkun’s Making Things Happen, a reissue of The Art of Project Management. This is a practical, focused and well-written book on basic issues in software project management, on leadership and communications and especially on execution, based on his experience managing programs at Microsoft.

The Mythical Man Month, by Fred Brooks – yes, it really is worth reading and re-reading if you haven’t read it in a while.

As a counterbalance to all of the agile and small team collaboration stuff, I enjoy following Herding Cats. This blog is all about large-scale, high-risk, high-complexity, safety-critical projects and large-scale programs like manned space flight and nuclear power and weapons systems. It offers a completely different and provocative perspective on project management problems, and is a good resource on complexity and risk management.

Leadership Resources for Software Development Managers


Cornell University offers an excellent online program on High Performance Leadership covering change management, leadership, negotiation and coaching.

Another excellent resource on leadership is the Center for Creative Leadership, which offers inexpensive live or pre-recorded webinars.

The American Management Association and its Canadian equivalent, the Canadian Management Center, provide excellent training programs on management and leadership.

The only leadership blog that I follow consistently is Great Leadership. It’s a bit cheerleady at times, but it is thoughtful and has good links to other leadership resources.

The general leadership books that I have found useful and worth going back to are Difficult Conversations and Getting to Yes, both developed out of the Program on Negotiation at Harvard Law School. Its Program on Negotiation for Senior Executives is an excellent course, and definitely worth taking for development managers (don’t be put off by the “for Senior Executives” label).

The Best of the Software Development Manager Blogs


Of the many other software development management blogs and forums, there are only a couple that I follow regularly:

Joel on Software of course: it's well-written and provocative, and focuses on small-scale software development and on how to run a software business. Unfortunately, for the last few months the focus has been on one of Joel’s latest ventures, Stackoverflow: a useful and free resource for problem solving for developers. While this has been good for Stackoverflow, it has not made for especially interesting reading. Here’s hoping that Joel gets back to subjects of wider interest soon.

Hard Code by Eric Brechner, the head of the best practices group at Microsoft. While this is written for Microsoft’s internal developers, most of the issues and problems that he explores apply generally to the software development community, and it is entertaining and real.

I’ll add to this resource list as I find other useful information.