Building Real Software

Wednesday, December 23, 2009

New Communities for Project Managers and Security

The other day I happened across a new Q&A community forum for project managers called AskAboutProjects.com. This site is built using the Stack Overflow Knowledge Exchange Engine, the same platform that is used to host the popular software development Q&A site Stack Overflow and Server Fault, a similar resource for IT system administrators.

The Stack Overflow engine is an effective and low cost platform for quickly building communities. It has some quirks, many of them around its security model, which make it awkward to use at times: for example it is difficult to enter links in answers or profiles (sometimes it works well, sometimes it doesn’t). I haven’t taken the time to figure out why, and I shouldn’t need to: the UI should be more seamless. And Firefox’s NoScript plug-in (I never leave home without it) occasionally catches XSS problems on some of the sites.

But the community experience is addictive – I find myself spending way too much time scanning the boards and offering help where I can. Some of the programmers on my team have found Stack Overflow handy when working with new technology or debugging obscure technical problems.

There is something of a gold rush going on, with people hurrying to set up new communities using this engine: there are communities being launched for gamers, amateur radio, technology support forums, sports bettors, dating, industrial robots, the iPhone, travel, diving, professional stock traders, musicians, real estate, organizational psychology, aerospace engineering, startups, world cup soccer, natural living, electronics, climate change, mountain biking, money, moms, spirituality… you name it.

Another one of these sites that I am following is SecurityCrunch, a new community focused on IT security issues.

Of course there is no guarantee which communities will catch on. AskAboutProjects is new and the community is still small. Many of the forum questions so far are either homework assignments (which plague Stack Overflow as well) or seed questions from the founders of the community. Although it appears to be intended as a general resource for project managers, it is clearly focused at the moment on IT, and more specifically software development, projects and related issues, reflecting the founders’ backgrounds.

It will be worth keeping an eye on these communities over the next few months, to see which of them, if any, can replicate the success of Stack Overflow.

Friday, December 11, 2009

Much ado about... nothing much (Agile Vancouver 2009)

In early November I attended Much Ado About Agile, the Agile Vancouver interest group’s annual conference. I was looking for a short break, and this conference offered a chance to get away from daily responsibilities, reflect, and learn more about the state of the art in software development.

I’ve decided to look back to see what stayed with me, what I learned that was worth taking forward.

First off, it was grand being back in Vancouver – I lived in Vancouver for a couple of years and always enjoy going back: the mountains and the water, the parks and markets and the sea shore, dining at some of the city’s excellent restaurants, and of course snacking at wonderful, quirky Japadog.

The conference agenda was a mixed bag: a handful of Agile community rock stars re-playing old hits or pushing their latest books, including Martin Fowler, Johanna Rothman, and Mary Poppendieck; some consultants from ThoughtWorks and wherever else presenting commercials in the guise of case studies; and some earnest hands-on real developers telling war stories, from which you could hope to learn something.

I was surprised by the number of (mostly young) well-intentioned, enthusiastic people at the sessions. There was sincere interest in the rooms; you could feel the intensity, the heat from so many questing minds. We were looking for answers, for insight, for research and experience.

But unfortunately, what we got wasn’t much.

The rock stars were polished and confident, but mostly kept to safe, introductory stuff. I remember attending Martin Fowler’s keynote. Martin is indisputably a smart guy and worth listening to: I had the pleasure of spending a few days with him on round tables at last year’s Construx Software Executive Summit where we explored some interesting problems in software development. To be honest, I had to go back to my notes to remember what he spoke about in Vancouver: a couple of short talks, one on agile fundamentals and something smart about technical debt and simple design. If you’ve read Martin’s books and follow his posts, there was nothing new here, nothing to take back. Maybe I expected too much.

I decided to avoid the professional entertainment for a while and see what I could learn from some less polished, real-life practitioners. I stuck to the “hard” track, avoiding the soft presentations on team work, building trust and such.

One talk, “Agile vs the Iron Triangle”, was about using lightweight methods to deliver large projects on a fixed-cost, fixed-schedule basis: how to make commitments, freezing the schedule and then managing scope, following incremental, build-to-schedule methods. Most of the challenges here of course are in estimating the size of work that needs to be done, understanding the team’s capacity to deliver the work, and making trade-offs with the customer: accepting but managing change, trading changes in scope in order to adhere to the schedule. This lecture was interesting because it was real, representing the efforts of an organization trying to reconcile plan-driven and agile practices, working with customers with real demands, under real constraints.
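The build-to-schedule idea from that talk can be sketched roughly as follows. This is my own minimal illustration, not anything the speakers showed: the `Feature` fields, the team-day unit, and both function names are hypothetical. The schedule (capacity) is frozen, and change is absorbed by re-planning scope in priority order.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    priority: int      # 1 = most important to the customer
    estimate: float    # estimated size in team-days (hypothetical unit)


def plan_scope(backlog, capacity):
    """Split the backlog into (committed, deferred) for a fixed schedule.

    Walk the backlog in priority order and commit to each feature that
    still fits in the remaining capacity; everything else is deferred.
    """
    committed, deferred = [], []
    remaining = capacity
    for feature in sorted(backlog, key=lambda f: f.priority):
        if feature.estimate <= remaining:
            committed.append(feature)
            remaining -= feature.estimate
        else:
            deferred.append(feature)
    return committed, deferred


def swap_in(committed, deferred, change, capacity):
    """Accept a change request by re-planning: scope moves, the date does not."""
    return plan_scope(committed + deferred + [change], capacity)
```

With a capacity of 10 team-days and features of size 5, 4 and 3, the first two are committed. Swapping in a new top-priority change of size 4 pushes the size-4 feature out rather than moving the delivery date.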

Another session was on operations at a small Internet startup where the development team was also responsible for operations. The focus here was on lightweight, open source operations tooling: essential tools for availability checks, log monitoring, performance and capacity analysis, system configuration using technology like Puppet. Nothing new here, but it was fun to see a developer so excited and focused on technical operations issues, and committed to keeping the developers and operations staff working closely together as the company continued to grow.

There were some more talks about the basics of performance tuning, an advertisement for ThoughtWorks’ Cruise continuous integration platform, and some other sessions that weren’t worth remembering. I had the most fun at Philippe Kruchten’s lecture on backlog management: recognizing and managing not only work for business features, but also architecture / plumbing and technical debt, “making invisible work visible”. Dr. Kruchten is an entertaining speaker: he clearly enjoys performing in front of a crowd and he enjoys his work, and his enthusiasm was infectious.

And finally a technical session on Error Processing and Error Handling as First Class Considerations in Design by Michael Feathers, who bucked the trend, playing the cool professor who could not care less if half the class was left behind. His focus was on idioms and patterns for improving error handling in code, in particular the idea of creating “safe zones”, where you only need to worry about construction problems if you are at, or outside, the edge of the zone, making for cleaner and more robust code in the safe core. Definitely the hardest, geekiest of the talks that I attended. And like several of the sessions I attended, it had little to do directly with agile development methods – instead it challenged the audience to think about ways to write good code, which is what it all comes down to in the end.
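As I understood the “safe zone” idea, it looks something like the sketch below. This is my own reconstruction under assumptions, not code from the talk: the `Order` type, `parse_order` and `notional` are hypothetical names. All checking happens where an object is constructed at the edge of the zone, so code inside the zone can assume that it only ever sees valid objects.

```python
import math
from dataclasses import dataclass


class InvalidOrder(ValueError):
    """Raised only at the edge of the safe zone."""


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int
    price: float


def parse_order(raw):
    """Edge of the zone: turn untrusted input into a valid Order, or raise."""
    try:
        symbol = str(raw["symbol"]).strip().upper()
        quantity = int(raw["quantity"])
        price = float(raw["price"])
    except (KeyError, TypeError, ValueError) as exc:
        raise InvalidOrder(f"malformed order: {raw!r}") from exc
    if not symbol or quantity <= 0 or not math.isfinite(price) or price <= 0:
        raise InvalidOrder(f"out-of-range order: {raw!r}")
    return Order(symbol, quantity, price)


def notional(order):
    """Inside the zone: no defensive checks, an Order is always well formed."""
    return order.quantity * order.price
```

The design choice is that `InvalidOrder` can only come out of `parse_order`. Everything downstream takes an `Order` and stays free of error-handling clutter.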

Michael Feathers aside, most of the speakers underestimated their audiences – at least I hope that they did – and spoke down, spoon feeding the newbies in the audience. It made for dull stuff much of the time – as earnest, or entertaining as the speaker might be, there wasn’t much to chew on. There could have been much more to learn with so many smart people there, and I wasn’t the only one looking for more meat, less bun. The conference wasn’t expensive, it was well managed, but it didn’t offer an effective forum to dig deep, to find new ways to build better software, or software better. For me, at least, there wasn't much ado.

Tuesday, December 8, 2009

Reliability and the Risks of Using Enterprise Middleware

If you are building systems with high requirements for performance and reliability, it is important to be careful and selective, of course, but even more important to be sparing, in your use of general-purpose middleware solutions to solve your technical problems.

There are strong, obvious arguments in favor of using proven middleware solutions, whether commercial off the shelf software (COTS) or open source solutions - arguments that are based on time-to-market, risk mitigation, and cost leveraging:

Time-to-market
In most cases, it will take much less time to evaluate, acquire, install, configure and understand a commercial product or open source solution than to build your own plumbing. This is especially important early in the project when your focus should be on understanding and solving important business problems, delivering value early, getting something working in the customer’s hands as soon as possible for feedback and validation.

Risk mitigation
Somebody has already gone down this path, taken the time to understand a complex technical problem space, made some mistakes and learned from them. The results are in front of you. You can take advantage of what they have already learned, and focus on solving your customer’s business problems, rather than risking falling into a technical black hole.

Of course you take on a different set of risks: that the solution may not be of high quality, that you may not get adequate support (from the vendor or the community), that you may be buying into a dead end.

Cost leverage
For open source solutions, the cost argument is obvious: you can take advantage of the time and knowledge invested by the community for close to nothing.

In the case of enterprise middleware, companies like Oracle and IBM have spent an awful lot of money hiring smart people, or buying companies that were created by smart people, and have invested millions of dollars into R&D and millions more into their support infrastructures. You get to take advantage of all of this through comparatively modest license and support fees.

The do-it-yourself, not-invented-here arguments for building instead of buying are essentially that your company is so different, your needs are unique: that most of the money and time invested by Oracle and IBM, or the code built up by an open source community, does not apply to your situation, that you need something that nobody else has anticipated, nobody else has built.

I can safely say that this is almost always bullshit: naïve arguments put forward by people who might be smart, but are too intellectually lazy or inexperienced to properly understand and frame the problem, to bother to look at the choice of solutions available, to appreciate the risks and costs involved in taking a proprietary path. But, when you are pushing the limits in performance and reliability, it may actually be true.

A fascinating study on software complexity by NASA’s Office of the Chief Engineer Technical Excellence Program examines a number of factors that contribute to complexity and risk in high reliability / safety critical software systems (in this case flight systems), and success factors in delivery of these systems. One of the factors that NASA examined was the risks and benefits of using commercial off the shelf software (COTS) solutions:

Finding:

Commercial off-the-shelf (COTS) software can provide valuable and well-tested functionality, but sometimes comes bundled with additional features that are not needed and cannot easily be separated. Since the unneeded features might interact with the needed features, they must be tested too, creating extra work.

Also, COTS software sometimes embodies assumptions about the operating environment that don’t apply well to [specific] applications. If the assumptions are not apparent or well documented, they will take time to discover. This creates extra work in testing; in some cases, a lot of extra work.

Recommendation:

Make-versus-buy decisions about COTS software should include an analysis of the COTS software to: (a) determine how well the desired components or features can be separated from everything else, and (b) quantify the effect on testing complexity. In that way, projects will have a better basis for make/buy and fewer surprises.

The costs and risks involved with using off the shelf solutions can be much greater than this, especially when working with enterprise middleware solutions. Enterprise solutions offer considerable promise: power and scale, configuration to handle different environments, extensive management capabilities, interface plug-and-play… all backed up by deep support capabilities. But you must factor in the costs and complexities of properly setting up and working with these products, and the costs and complexities in understanding the software and its limits: how much time and money must you invest in a technology before you know if it is a good fit, if it fulfills its promise?

Let’s use the example of an enterprise middleware database management system, Oracle’s Real Application Clusters (RAC) maximum availability database cluster solution.

Disclaimer: I am not an Oracle DBA, and I am not going to argue fine technical details here. I chose RAC because of recent and extensive experience working with this product, and because it is representative of the problems that teams can have working with enterprise middleware. I could have chosen other technologies from other projects, say WebLogic Suite or WebSphere Application Server and so on, but I didn’t.

The promise of RAC is to solve many of the problems of managing data and ensuring reliability in high-volume, high-availability distributed systems. RAC shares and manages data across multiple servers, masks failures and provides instant failover in an active-active cluster, and allows you to scale the system horizontally, adding more servers to the cluster as needed to handle increasing demands. RAC is a powerful data management solution, involving many software layers, including clustering and storage management and data management and operations management, designed to solve a set of complex problems.

In particular, one of these technical problems is maintaining cache fusion across the cluster: fusing the in-memory data on each server together into a global, cluster-wide cache so that each server node in the cluster can access information locally as it changes on any other node.

As you would expect, there are limits to the speed and scaling of cluster-wide cache fusion, especially at high transaction rates. And this power and complexity comes with costs. You need to invest both in infrastructure, in a highly reliable and performant network interconnect fabric and shared storage subsystem, and in making fundamental application changes, to carefully and consistently partition data within the database and carefully design your indexes in order to minimize the overhead costs of maintaining global cache state consistency. As the number of server nodes in the cluster increases (for scaling purposes or for higher availability), the overhead costs and the costs involved in managing this overhead increase.
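The application-side half of that advice, keeping all work on a given slice of data on the same node so that blocks rarely have to move between caches, can be sketched in a few lines. This illustrates the idea only. It is not Oracle's API or a RAC configuration, and the node names, the `account` key and both functions are hypothetical.

```python
import zlib


def node_for(partition_key, nodes):
    """Map a partition key to one node, deterministically."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str
    return nodes[zlib.crc32(partition_key.encode("utf-8")) % len(nodes)]


def route(transactions, nodes):
    """Group transactions by the node that owns their partition key.

    Every transaction for a given account lands on the same node, so that
    account's data stays hot in one cache instead of moving across the cluster.
    """
    batches = {node: [] for node in nodes}
    for txn in transactions:
        batches[node_for(txn["account"], nodes)].append(txn)
    return batches
```

Note that simple modulo hashing like this reshuffles most keys whenever the number of nodes changes. That is one small example of why adding nodes to a cluster is never free.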

RAC is difficult to set up, tune and manage in production conditions: this is to be expected – the software does a lot for you. But it is especially difficult to set up, tune and manage effectively in high-volume environments with low tolerance for variability and latency, where predictable performance under sustained load, and predictable behavior in failure situations, is required. It requires a significant investment in time to understand the trade-offs in setup and operations of RAC, to balance reliability and integrity factors against performance; choosing between automated and manual management options, testing and measuring system behavior, setting up and testing failover scenarios, carefully managing and monitoring system operations. To do all of this will require you to invest in setting up and maintaining test and certification labs, in training for your operations staff and DBAs, in expert consulting and additional support from Oracle.

To effectively work with enterprise technology like this, at or close to the limits of its design capabilities, you need to understand it in depth: this understanding comes from months of testing and tuning your system, working through support issues and fixing problems in the software, modifying your application and re-testing. The result is like a race car engine: highly optimized and efficient, running hot and fast, highly sensitive to change. Upgrades to your application or to the Oracle software must be reviewed carefully and extensively tested, including planning and testing rollback scenarios: you must be prepared to manage the very real risk that a software upgrade can affect the behavior of the database engine or cluster manager or operations manager or other layers, impacting the reliability or performance of the system.

Clearly one of the major risks of working with enterprise software is that it is difficult, if not impossible, to learn enough about the costs and limits of this technology early enough in the project – especially if you are pushing these limits. Hiring experienced specialists, bringing in expert consultants, investing in training, testing in the lab: all of this might not be enough. While you can get up and running much faster and cheaper than you would trying to solve so many technical problems yourself from the start, you face the risk that you may not understand the technology well enough, the design points and real limits, how to make the necessary balances and trade-offs – and whether these trade-offs will be acceptable to you or your customers. The danger is that you become over-invested in the solution, that you run out of time or resources to explore alternatives, that you give yourself no choice.

You are making a big bet when working with enterprise products. The alternative is to avoid making big bets, avoid having to find big solutions to big problems. Break your problems down, and find narrow, specific answers to these smaller, well-bounded problems. Look for lightweight, single-purpose solutions, and design the simplest possible solution to the problem if you have to build it yourself. Spread the risks out, attack your problems iteratively and incrementally.

In order to do this you need to understand the problem well – but whether you break the problem down or try to solve it with an enterprise product, you can’t avoid the need to understand the problem. Look (carefully) at the options available, at open source and commercial products, look for the smallest, simplest approach that fits. Don’t over-specify, or design, yourself into a corner. Don’t force yourself to over-commit. And think twice, or three or four times, before looking at an enterprise solution as the answer.

Sunday, November 22, 2009

Small Teams, Big Results

How big a team do you need to deliver big results?

When my partners and I created this startup a few years ago, we made the decision to staff the development team with people that we knew, strong developers and test engineers who we had worked with before, people who we trusted and who trusted us. There were a lot of unknowns in what we were trying to achieve: could we secure enough funding, would our business model succeed, did we choose the right technologies, did we have the right design, could we handle all of the details and problems in launching a new market, could we deal with all of the integration and service needs, the regulatory and compliance requirements… and all of the other challenges that startups with aggressive goals face? So we wanted to take the risks out of development as much as possible, to assure stakeholders that we could deliver what we needed to, when we needed it.

We were lucky to put together a strong team of senior people: many of them we had worked with for several years, and some of them on only one project. But they were known quantities: we knew what to expect, and so did they. They understood the problem domain well and came up to speed on our design, and, of course, they came together as a team quickly, re-establishing relationships and norms – so we could hit the ground running. And we’re even more fortunate in that the core of the team has stayed in place from the start, and that we have been able to carefully add a few more senior people to the team, so that we continue to leverage the experience and knowledge that everyone has built up.

There are tremendous advantages in working with a small group of experienced people who know what they are doing and care about doing a good job, people who enjoy challenges and who work well together.

In 10x Software Engineering, Construx Software examines the key factors that contribute to exceptional performance in software development: the factors and good engineering practices that drive some individuals and teams to outperform others by up to 10x. Some of these key success factors are keeping teams small, keeping teams together, and leveraging experience: that small teams of senior people, with a strong sense of identity and high levels of trust, staying together through projects, can significantly outperform the norm.

The value of small, experience-heavy teams, and especially of senior people who are deeply committed to doing a good job, committed to their craft, is explored in Pete McBreen’s excellent book Software Craftsmanship: the New Imperative. Pete shows that developers who have worked together in the past are more productive than teams created from scratch: that it is an important success factor for teams to be able to choose who to work with, to choose people who they know they can depend on, and who they feel comfortable working with. He especially emphasizes the importance of experience: that a jelled team of experienced people, working in an open and trusting way, can amplify each other’s strengths and work at hyper-productive levels, and that in a hyper-productive team of experienced developers who are playing at the top of their game, there is little space for beginners or “warm bodies”.

Pete also looks at the issue of team size in Questioning Extreme Programming, a skeptical but balanced review of XP practices which deserves more attention. Pete suggested at the time that

"XP is best suited to projects with a narrow range of team sizes, probably 4 to 12 programmers. Outside that range, other processes are probably more appropriate. The good news, however, is that a great many projects fall into the range of applicability. Indeed, there is some evidence that XP-size projects are predominant in the industry.”

Although my focus here is not on XP practices, the idea that most problems that the industry faces can be managed by small teams, following lightweight but disciplined practices, is an important one.

Back in March 2008, Steve McConnell asked a question on the Construx Forum about how to scale up a development team quickly. My answer would be to keep the core team as small as possible, add people that other people have worked with before and know and trust, and add fewer, more senior, experienced, technically strong staff.

I have worked a lot with large technology companies like IBM and HP, and I was surprised to find out how small the core engineering teams are in those big companies. A company like IBM might have a big distributed first-level and second-level support organization, trainers, offshore testing labs, product managers, marketing teams, technical writers, pre-sales support engineers, sales teams, vertical industry specialists, integration specialists, project managers and other consultants, but all of these people leverage the IP created by small, senior teams of engineers and researchers. These engineering teams, even at a company like IBM, have a different culture than the customer-facing parts of the organization – less formal and more inward-focused on technical excellence, on “alpha geekdom” – and are more free to come up with new ideas. Google, of course, is an extreme example of this: a large company where lots of software is created by very small, very senior teams driven to technical excellence.

It makes sense to follow the same model in scaling out a team: start with a small, senior core, be careful to add a few senior people, space out your hires, and scale out primarily in supporting roles, allowing the small core engineering team to continue to focus on results, on excellence.

One of the many advantages of small teams is that they spend less time and make fewer mistakes communicating with each other, so you can use less formal (and less expensive) and more efficient communication methods. This lets you move faster, and adapt faster to change. If the team is made up mostly of experienced, senior staff, they can get maximum value out of lightweight “small a” agile methods, take on less unintentional technical debt, again reducing cost and time, and by making fewer mistakes and writing better code in the first place, create a higher quality product, further accelerating results in a virtuous circle.
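One common way to put a number on the communication cost (the pairwise-channel count often attributed to Fred Brooks, not something measured on our team) is that a team of n people has n(n-1)/2 possible one-to-one communication paths. A quick sketch, with a hypothetical function name:

```python
def channels(team_size):
    """Number of distinct person-to-person communication paths in a team."""
    return team_size * (team_size - 1) // 2


for size in (5, 15, 50):
    print(size, "people:", channels(size), "paths")
```

Going from 5 people to 15 triples the head count but multiplies the number of paths by more than ten (10 versus 105), which is why informal communication stops working as teams grow.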

The key here is to have enough discipline, to follow enough good engineering practices, without weighing the team down too much.

In Nailing the Nominals, Eric Brechner, in charge of engineering excellence at Microsoft, sets the limit at 100,000 lines of code and 15 people. Below this line,

"you can…use emergent design, have a loose upfront design bar, rewrite and refactor the code endlessly while the customer looks over your shoulder. When your code base and your project is bigger, it's solid design and disciplined execution or it's broken code, broken teams, and broken schedules."

In another related post, Green Fields are Full of Maggots, I.M. Wright, er, I mean Eric, goes on to say that

"the regular refactoring and rework needed for emergent design isn't a problem for small code bases. However, when you get above 100,000 lines of code serious rework becomes extraordinarily expensive. That's why customer-focused, up-front architecture is essential for large projects.

This was researched by Dr. Barry Boehm, et al, in May 2004. They found that rework costs more than up-front design on projects larger than 100 KLOC. The larger the project, the more up-front design saves you."

What’s of particular interest to me is that we work right on the edge of these limits, in terms of size of code base (although we also have a lot of test code and other supporting code) and total team size. Our extra focus on discipline and controls is necessary because of the high standards of reliability that we need to stand up to, and the complexity of the problems that we solve. While we could move even faster, the risk and cost of making mistakes is too high. So the challenge is to achieve the right balance, between speed and efficiency and discipline.

Saturday, October 17, 2009

The real cost of software security

There has been a lot of discussion in the blogosphere over the last few months on costs and ROI justifications for building secure software. Back in July, I responded to a post by Jeremiah Grossman, CTO at WhiteHat Security, which examined the end-to-end costs of software security, whether and how upfront investments in a secure SDLC mitigate downstream security costs and risks: a classic “pay me now or pay me (much more) later” problem. In my response to Jeremiah’s analysis, I tried to break out the costs of building secure software: what I now think of as direct, “hard” pure security costs, compared to indirect, “soft” supporting costs, the costs of building software properly in the first place. From a budgeting and ROI perspective, it is important to break out the costs of building software correctly in the first place, the foundational practices, from real security costs.

An effective software security program has to rely on a foundation of software quality management practices and governance. Your quality management program can be lightweight and agile, but if you are hacking out code without care for planning, risk management, design, code reviews, testing, and incident management, then what you are trying to build is not going to be secure. Period. Coming from a software development background, I feel that the security community is trying to take on too much of the responsibility, and too much of the costs, for ensuring that basic good practices are in place.

But it’s more than that. If you are doing risk management, design reviews, code reviews, testing, change and release management, then adding security requirements, perspectives, controls to these practices can be done simply, incrementally, and at a modest cost:

Risk management: add security risk scenarios, threat assessment to your technical risk management program.
Design reviews: add threat modeling where it applies, just as you should add failure mode analysis for reliability.
Code reviews: first you can argue, as I did earlier in my response, that a significant number of security coding problems are basic quality problems, simply a matter of writing crappy code: poor or missing input validation (a reliability problem, not just a security problem), lousy error handling (same), race conditions and deadlocks (same), and so on. If making these kinds of mistakes a security problem (and a security cost) helps get programmers to take them seriously, then go ahead, but it shouldn’t be necessary. So what you are left with are the costs of dealing with “hard” security coding problems, like improper use of crypto, secure APIs, authentication and session management, and so on.
Static analysis: as I argued in an earlier post on the value of static analysis, static analysis checks should be part of your build anyways. If the tools that you use, like Coverity or Klocwork, check for both quality and security mistakes, then your incremental costs are in properly configuring the tools and understanding, and dealing with, the results for security defects.
Testing: security requirements need to be tested along with other requirements, on a risk basis. Regression tests, boundary and edge tests, stress tests and soak tests, negative destructive tests (thinking like an attacker), even fuzzing should be part of your testing practices for building robust and reliable software anyways.
Change management and release management: ensure that someone responsible for security has a voice on your change advisory board. Add secure deployment checks and secure operations instructions to your release and configuration management controls.
Incident management: security vulnerabilities should be tracked and managed as you would other defects. Handle security incidents as “level 1” issues in your incident management, escalation, and problem management services.
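To show how small the step from ordinary negative testing to fuzzing can be, here is a minimal sketch of a random-input fuzz loop. It is an illustration under assumptions, not our actual test harness: `parse_quantity` is a hypothetical function under test, and real fuzzing tools are far smarter about generating inputs.

```python
import random
import string


def parse_quantity(text):
    """Hypothetical function under test: parse a positive order quantity."""
    value = int(text.strip())
    if not 0 < value <= 1_000_000:
        raise ValueError("quantity out of range")
    return value


def fuzz(target, runs=1000, seed=0):
    """Throw random strings at target; return the inputs that fail unexpectedly.

    ValueError is treated as the documented, clean rejection.
    Any other exception is recorded as a defect to triage.
    """
    rng = random.Random(seed)  # seeded so that failures are reproducible
    alphabet = string.printable
    failures = []
    for _ in range(runs):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        try:
            target(text)
        except ValueError:
            pass  # rejected cleanly
        except Exception as exc:  # unexpected crash
            failures.append((text, exc))
    return failures
```

The same loop serves reliability and security: a function that crashes on garbage input is a quality defect first, and a potential vulnerability second.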

So what are the direct costs of a software security program? Looking over my own budget, these are the major cost items that I can find:

Training managers, architects, developers, testers on security concepts, issues, threats, practices. You need to invest in training upfront, and then refresh the team: Microsoft’s SDL requires that technical staff be retrained annually, to make sure that the team is aware of changes to the threat landscape, to attack bad habits, to reinforce good ones.
As I described in an earlier post on our software security roadmap, we hired expert consultants to conduct upfront secure architecture and code reviews, to help define secure coding practices, and to work with our architects and development manager to plan out a roadmap for software security. Paying for a couple of consulting engagements was worthwhile and necessary to kickstart our software security program and to get senior technical staff and management engaged.
Buying “black box” vulnerability scanning tools like IBM Rational AppScan and Nessus, and the costs of understanding and using these tools to check for common vulnerabilities in the application and the infrastructure.
Penetration testing: hiring experts to conduct penetration tests of the application a few times per year, to check that we haven’t got sloppy and missed something in our internal checks, tests and reviews, and to learn more about new attacks.
Some extra management and administrative oversight of the security program, and of suppliers and partners.

The other incremental costs of building secure software, like the costs for building robust and reliable software, are now effectively burned into our SDLC, into how we plan, design, build, test, deploy and support software. I could break out the incremental cost burden of these security practices and controls, but the costs would be modest - most of the cost, and the work, is in building software properly. And by following an incremental, optimizing approach, starting small and continuously reviewing and improving, not only are upfront costs for a software security program reduced, but the ROI is realized much faster. If you set your quality bar high enough, the real costs of secure software are surprisingly low.

Saturday, October 10, 2009

Dreaming in Code - A Failure in Leadership

Reading Scott Rosenberg’s Dreaming in Code gives you a sick feeling, the same sick feeling that you have watching a movie where the hero’s life is coming unraveled, or when you are involved in a project that is going nowhere fast, facing certain failure, and there is nothing that you can do to change the outcome. I made myself read it twice – there are some hard lessons for managing software development in this book.

Dreaming in Code tells the story of the failed Chandler project started by Mitch Kapor, software industry visionary and founder of Lotus Development Corp, currently Chairman of the Mozilla Foundation. Chandler began as an ambitious, “change the world” project to design and build a radically different kind of personal information manager, an Outlook-and-Exchange-killer that would flatten silos between types of data and offer people new ways of thinking about information; provide programmers a cross-platform, extensible open source platform for new development; and create new ways to share data safely, securely, cheaply and democratically, in a peer-to-peer model.

The project started in the spring of 2001. Because of Mitch Kapor’s reputation and the project’s ambitions, he was able to assemble an impressive team of alpha geeks. Smart people, led by a business visionary who had experienced success, interesting problems to solve, lots of money, lots of time to play with new technology and chart a new course: a “dream project”.

But Dreaming in Code is a story of wasted opportunities. Scott Rosenberg, the book’s author, followed the project for 3 years starting in 2003, but he gave up before the team had built something that worked – he had a book to publish, whether the story was finished or not. By January 2008, Mitch Kapor had called it quits and left the company that he founded. Finally, in August 2008, the surviving team released version 1.0, scaled back to a shadow of its original goals and of course no longer relevant to the market.

It is interesting, but sad, to map this project against Construx’s Classic Mistakes list, to see the mistakes that were made:

Unclear project vision. Lack of project sponsorship. Requirements gold-plating. Feature creep. Research-oriented development. Developer gold-plating. Silver bullet syndrome. Insufficient planning. Adding people to a late project. Overly optimistic schedules. Wishful thinking. Unrealistic expectations. Insufficient risk management. Wasted time in the “fuzzy front end”: the team spent years wondering and wandering around, playing with technology, building tools, exploring candidate designs, trying to figure out what the requirements were - and never understood the priorities. Shortchanged quality assurance… hold on, quality was important to this project. It wasn’t going to play out like other Silicon Valley startups. Then why did they wait almost 3 years before hiring a test team (of one person), while they faced continual problems from the start with unstable releases, unacceptable performance, and developers spending days or weeks bug hunting?

The book begins with a meeting in July 2003, when the team should be well into design and development, where the project manager announces that the team is doomed, that they cannot hope to deliver on their commitments – and not for the first time. This is met with… well, nothing. Nobody, not even Mitch Kapor, takes action or makes any decisions. It doesn't get any better from there.

This project was going to be different: it was going to be done on a “design-first” basis. But months, even years into the project, the team is still struggling to come up with a useful design. The team makes one attempt at time-boxed development, a single iteration, and then gives up. Senior team members leave because nothing is getting done. Volunteers stop showing up. People in the community stop downloading the software because it doesn’t work, or it does less than earlier versions – the team is not just standing still, they are moving backwards.

The project manager, after only a few months, quits. Mitch Kapor takes over. A few months later, the “fun draining out of his job”, he asks the team to redesign his job, and come up with a new management structure. OK, to be fair, he is rich and doesn’t have to work hard on this, but why start it all in the first place, why put everyone involved, even those of us who are going to read the book, through all of this?

The new management team works through a real planning exercise for the first time, making real trade-offs, real decisions based on data. It doesn’t help. They fail to deliver – again. And again with an alpha release. They then spend another 21 months putting together a beta release, significantly cutting back on features. By the time they deliver 1.0 nobody cares.

It’s a sad story. It’s also boring at times – the team spends hours, days in fruitless, philosophical bull sessions; in meaningless, abstract, circular arguments; asking the same questions, confronting the same problems, facing (or avoiding) the same decisions again and again. As the book says, “it’s Groundhog Day, again”. I hope that the writer may not have fully captured what happened in the design sessions and planning meetings – that, like the Survivor TV show, we only see what the camera wants us to, or happens to, see: an incomplete picture. That these people were actually more focused, more serious, more responsible than they come across.

The project failed on so many levels. A failure to set understandable, achievable goals. A failure to understand, or even attempt to articulate, requirements. A failure to take advantage of talent. A failure to engage, to establish and maintain momentum. A failure to manage risks. A failure to learn from mistakes - their own, or others’.

Most fundamentally, it was a failure of leadership. Mitch Kapor failed to hold himself and the team accountable. He failed to provide the team with meaningful direction, failed to understand and explain what was important. He failed to create an organization where people understood what it took to deliver something real and useful, where people cared about results, about the people who would, hopefully, someday, use what they were supposed to be building. And he gave up: he gave up well before 2008 when he left the company; he gave up almost from the start, he gave up when the hard, real work of building out his vision had actually begun.

Wednesday, October 7, 2009

A Joel Test for Software Security

Back in 2000, Joel Spolsky, software developer, entrepreneur, co-founder of Stack Overflow and popular blogger on the business of building software, proposed a “highly irresponsible, sloppy test to rate the quality of a software team”, known as The Joel Test.

The Joel Test is a crude but effective tool for checking the maturity of a software development team, using simple, concrete questions to determine whether a team is following core best practices. The test could use a little sprucing up, to reflect improvements in the state of the practice over the last 10 years, to take into account some of the better ideas introduced with XP and Scrum. For example, “Do you make daily builds?” (question 3) should be updated to ask whether the team is following Continuous Integration. And you can argue that “Do you do hallway usability testing?” (question 12) should be replaced with a question that asks whether the team works closely and collaboratively with the customer (or customer proxy) on requirements, product planning and prioritization, usability, and acceptance. And one of the questions should ask whether the team conducts technical (design and code) reviews (or pair programming).

A number of other people have considered how to improve and update the Joel Test. But all in all, the Joel Test has proved useful and has stood the test of time. It is simple, easy to remember, easy to understand and apply, it is directly relevant to programmers and test engineers (the people who actually do the work), it is provocative and it is fun. It makes you think about how software should be built, and how you measure up.

How does the Joel Test work?

It consists of 12 concrete yes/no questions that tell a lot about how the team works, how it builds software, how disciplined it is. A yes-score of 12 is perfect (of course), 11 is tolerable, and a score of 10 or less indicates serious weaknesses. The test can be used by developers, or by managers, to rate their own organization; by developers who are trying to decide whether to take a job at a company (ask the prospective employer the questions, to see how challenging or frustrating your job will be); or for due diligence as a quick “smoke test”.
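
The scoring rule is simple enough to write down. The following is only a rough sketch in Python: the function name joel_rating and the verdict strings are made up for illustration, and the thresholds are the ones described above.

```python
def joel_rating(answers):
    """Rate a team from its 12 yes/no answers (True means "yes").

    Hypothetical helper: the name and verdict strings are illustrative only.
    """
    if len(answers) != 12:
        raise ValueError("the Joel Test has exactly 12 questions")

    # The score is simply the number of "yes" answers.
    score = sum(1 for answer in answers if answer)

    if score == 12:
        verdict = "perfect"
    elif score == 11:
        verdict = "tolerable"
    else:
        # 10 or less
        verdict = "serious weaknesses"
    return score, verdict


if __name__ == "__main__":
    # A team that answers "no" to two of the twelve questions.
    example = [True] * 10 + [False] * 2
    print(joel_rating(example))
```

Note how unforgiving the scale is: two "no" answers out of twelve already put a team in the bottom category.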

A recent post by Thomas Ptacek of Matasano Security explores how to apply the Joel Test to network security and IT management. In the same spirit, I am proposing a “Joel Test” for software security: a simple, concrete, informal way to assess a team’s ability to build secure software. This is a thought experiment, a fun way of thinking about software security and what it takes to build secure software, following the example of the Joel Test, its principles and its arbitrary 12-question framework. It is not, of course, an alternative to comprehensive maturity frameworks like SAMM or BSIMM, which I used as references in preparing this post, but I think a simple test like this can still provide useful information.

So, here is my attempt at the 12 questions that should be asked in a Software Security “Joel Test”:

1. Do you have clear and current security policies in place so developers know what they should be doing, and what they should not be doing? Realistic, concrete expectations, not legalese or boilerplate. Guidelines that programmers can follow and do follow in building secure software.

2. Do you have someone (person or a team) who is clearly responsible for software security? Someone who helps review design and code from a security perspective, who can coach and mentor developers and test engineers, provide guidelines and oversight, make risk-based decisions regarding security issues. If everybody is accountable for security, then nobody is accountable for security. You need to have someone who acts as coach and cop, who has knowledge and authority.

3. Do you conduct threat modeling, as part of, or in addition to, your design reviews? This could be lightweight or formal, but some kind of structured security review needs to be done, especially for new interfaces and major changes.

4. Do your code reviews include checks for security and safety issues? If you have to ask, “ummm, what code reviews?”, then you have a lot of work ahead of you.

5. Do you use static analysis checking for security (as well as general quality) problems as part of your build?

6. Do you perform risk-based security testing? Does this include destructive testing, regular penetration testing by expert pen testers, and fuzz testing?

7. Have you had an expert, security-focused review of your product’s architecture and design? To ensure that you have a secure baseline, or to catch fundamental flaws in your design that need to be corrected.

8. Do product requirements include security issues and needs? Are you, and your customers, thinking about security needs up front?

9. Does the team get regular training in secure development and defensive coding? Microsoft’s SDL recommends that team members get training in secure design, development and testing at least once per year to reinforce good practices and to stay current with changes in the threat landscape.

10. Does your team have an incident response capability for handling security incidents? Are you prepared to deal with security incidents? Do you know how to escalate, contain and recover from security breaches, respond to security problems found outside of development, and communicate with customers and partners?

11. Do you record security issues and risks in your bug database / product backlog for tracking and follow-up? Are security issues made visible to team members for remediation?

12. Do you provide secure configuration and deployment and/or secure operations guidelines for your operations team or customers?

These are the 12 basic, important questions that come to my mind. It would be interesting to see alternative lists, to find out what I may have missed or misunderstood.
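
If you want to try the questions on your own team, a small self-assessment script is one way to keep track of the answers. What follows is a minimal sketch, not a tool I actually use: the short labels are my paraphrases of the 12 questions above, and the names SECURITY_QUESTIONS and security_gaps are hypothetical.

```python
# Short labels paraphrasing the 12 questions, in the same order.
SECURITY_QUESTIONS = [
    "Clear and current security policies",
    "Someone clearly responsible for software security",
    "Threat modeling as part of design reviews",
    "Code reviews check for security and safety issues",
    "Static analysis for security in the build",
    "Risk-based security testing",
    "Expert security review of architecture and design",
    "Security needs included in product requirements",
    "Regular training in secure development",
    "Incident response capability",
    "Security issues tracked in the bug database or backlog",
    "Secure configuration, deployment and operations guidelines",
]


def security_gaps(answers):
    """Return (yes_count, gaps) for 12 yes/no answers (True means "yes").

    gaps lists (question number, label) for every "no" answer.
    """
    if len(answers) != len(SECURITY_QUESTIONS):
        raise ValueError("expected one answer per question")

    gaps = [
        (number, label)
        for number, (label, answer) in enumerate(
            zip(SECURITY_QUESTIONS, answers), start=1
        )
        if not answer
    ]
    return len(SECURITY_QUESTIONS) - len(gaps), gaps


if __name__ == "__main__":
    # Example: a team that does everything except security-aware code reviews.
    answers = [True] * 12
    answers[3] = False
    score, gaps = security_gaps(answers)
    print(f"{score} of 12")
    for number, label in gaps:
        print(f"  missing {number}. {label}")
```

The list of gaps is more useful than the score itself: each "no" points to a specific practice that the team can decide to add next.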