Thursday, December 15, 2011

2011: The State of Software Security and Quality

It’s the end of the year. Time to look back on what you’ve done, your successes and mistakes, and what you learned from them. I also like to look at the big picture: not just my team and the projects that I manage, or even the company that I work for, but software development in general. How are we doing as an industry? Are we getting better? Where are we falling behind? What are the main drivers for developers and development managers?

Besides the usual analysis from InformationWeek, and from Forrester and Gartner (if you can afford them), there’s some interesting data available from software quality and software security vendors.

CAST Software, a vendor of code structural analysis tools, publishes an annual Report on Application Software Health based on an analysis of 745 applications from 160 companies (365 million lines of code) using CAST's static analysis platform. A 25-page executive summary of the report is available free with registration – registration also gives you access to some interesting research from Gartner and Forrester Research, including a pretty good Forrester paper on application development metrics.

The code that CAST analyzed is in different languages, approximately half of it in Java. Their findings are kinda interesting:
  • They found no strong correlation between the size of the system and structural quality except for COBOL apps where bigger apps have more structural quality problems.
  • COBOL apps also tended to have more high-complexity code modules.
  • I thought that the COBOL findings were interesting, although I haven’t worked with COBOL code in a long long time. Maybe too interesting – CAST decided that the findings for COBOL were so far outside of the norm that “consequently we do not believe that COBOL applications should be directly benchmarked against other technologies”.
  • For 204 applications, information was included on what kind of application development method was followed. Applications developed with Agile methods and Waterfall methods had similar profiles for business risk factors (robustness, performance, security) but Agile methods were not as effective when it comes to cost factors (transferability, changeability), which would seem counter-intuitive, given that Agile methods are intended to reduce the cost of change.
  • The more releases per year, the less robust, less secure and less changeable the code was. Even the people at CAST don’t believe this finding.
CAST also attempts to calculate an average technical debt cost for applications, using the following formula:
(10% of low severity findings + 25% of medium severity findings + 50% of high severity findings) * # of hours to fix a problem * cost per hour for development time
The idea is that not all findings from the static analysis tool need to be fixed (maybe 10% of low severity findings and 25% of medium severity findings), but at least half of the high severity issues found need to be fixed. They assume that on average each of these fixes can be made in 1 hour, and estimate that the average development cost for doing this work is $75 per hour. This results in an average technical debt cost of $3.61 per LOC. For Java apps, the cost is much higher, at $5.42 per LOC. So your average 100 KLOC Java system carries about $500,000 worth of technical debt....
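
To make the numbers concrete, here is the same calculation with made-up finding counts; only the 1-hour average fix time and the $75/hour rate come from CAST's assumptions, everything else below is hypothetical:

```java
// CAST's technical debt formula with hypothetical inputs.
// Only the 1 hour per fix and $75/hour figures come from the report;
// the finding counts and LOC are invented for illustration.
public class TechnicalDebtExample {
    public static void main(String[] args) {
        int lowFindings = 4000;
        int mediumFindings = 1500;
        int highFindings = 600;
        int linesOfCode = 20000;

        double findingsToFix = 0.10 * lowFindings      // fix 10% of low severity
                             + 0.25 * mediumFindings   // fix 25% of medium severity
                             + 0.50 * highFindings;    // fix 50% of high severity

        double hoursPerFix = 1.0;   // CAST's average assumption
        double costPerHour = 75.0;  // CAST's average development cost

        double debt = findingsToFix * hoursPerFix * costPerHour;
        System.out.printf("Technical debt: $%,.0f ($%.2f per LOC)%n",
                debt, debt / linesOfCode);
    }
}
```

With these inputs the debt works out to a little over $80,000, or about $4 per LOC – in the same range as CAST's averages.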

It’s difficult to say how real or meaningful these findings are. The analysis depends a lot on the extensiveness of CAST’s static analysis checkers (with different checking done for different languages, making it difficult to compare findings across languages) and the limited size of their customer base. As their tools improve and their customer base grows, this analysis will become more interesting, more accurate and more useful.

The same goes for the State of Software Security Report from Veracode, a company that supplies static and dynamic software security testing services. Like the CAST study, this report attempts to draw wide-ranging conclusions from a limited data set – in this case, the analysis of almost 10,000 “application builds” over the last 18 months (which is a lot less than 10,000 applications, since the same application may be analyzed more than once in this window). Three quarters of the applications reviewed were web apps. Approximately half of the code was in Java, one quarter in .NET, and the rest in C/C++, PHP and other languages.

Their key findings:
  • 8 out of 10 apps fail Veracode’s security tests on the first pass – the app contains at least one high-risk vulnerability.
  • For web apps, the top vulnerability is still XSS. More than half of the vulnerabilities found are XSS, and 68% of web apps are vulnerable to XSS attacks.
  • 32% of web apps were vulnerable to SQL Injection, even though only 5% of all vulnerabilities found were SQL Injection issues (the standard defense, a parameterized query, is sketched after this list).
  • For other apps, the most common problems were in error handling (19% of all issues found) and cryptographic mistakes (more than 46% of these apps had issues with crypto).
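
SQL Injection in particular is one of the cheapest of these problems to prevent at the code level. A minimal sketch of the standard defense, a parameterized query; the class and table names are made up for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerLookup {

    // Vulnerable pattern: building the SQL string by concatenating user input
    //   "SELECT 1 FROM customers WHERE email = '" + email + "'"
    // lets an attacker rewrite the query.

    // Safer pattern: the driver binds the value as data, never as SQL.
    public boolean emailExists(Connection conn, String email) throws SQLException {
        String sql = "SELECT 1 FROM customers WHERE email = ?";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, email); // bound parameter
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```
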
Another interesting analysis is available from SmartBear Software, a vendor of software quality tools, which recently sponsored a webinar by Capers Jones on The State of Software Quality in 2011. Capers Jones draws from a much bigger database of analysis over a much longer period of time. He provides a whirlwind tour of facts and findings in this webinar, summarizing and updating some of the material that you can find in his books.

SmartBear provides some other useful free resources, including a good white paper on best practices for code review. This is explored further in the chapter on Modern Code Review by SmartBear’s Jason Cohen in the book Making Software: What Really Works, and Why We Believe It, a good collection of essays on software development practices and the state of research in software engineering.

All of these reports are free – you need to sign up, but you will get minimal hassle – at least I haven’t been hassled much so far. I’m not a customer of any of these companies (we use competing or alternative solutions that we are happy with for now) but I am impressed with the work that these companies do to make this information available to the community.

There were no real surprises from this analysis. We already know what it takes to build good software, we just have to make sure that we actually do it.

Tuesday, December 6, 2011

Devops has made Release and Deployment Cool

Back 10 years or so, when Extreme Programming came out, it began to change the way that programmers thought about testing. XP made software developers accountable for testing their own code, and gave them practices like Test-First Development and simple, free, community-based automated testing tools like xUnit, FIT and FitNesse.

XP made testing cool. Programmers started to care about writing good automated tests, achieving high levels of test coverage, and optimizing feedback loops from testing and continuous integration. Instead of throwing code over the wall to a test team, programmers began to take responsibility for reviewing and testing their own code and making sure that it really worked. It’s taken some time, but most of these ideas have gone mainstream, and the impact has been positive for software development and software quality.
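
A trivial example of the habit XP pushed, in JUnit: write the failing test first, then write just enough code to make it pass. The OrderTotal class here is invented for illustration:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// The test is written first and describes the behavior we want;
// the class below it is the "just enough" code to make the test pass.
public class OrderTotalTest {

    @Test
    public void totalIsSumOfLinePrices() {
        OrderTotal order = new OrderTotal();
        order.addLine(2, 10.00); // 2 items at $10.00
        order.addLine(1, 5.50);  // 1 item at $5.50
        assertEquals(25.50, order.total(), 0.001);
    }

    @Test
    public void emptyOrderTotalsToZero() {
        assertEquals(0.0, new OrderTotal().total(), 0.001);
    }
}

class OrderTotal {
    private double total;

    void addLine(int quantity, double unitPrice) {
        total += quantity * unitPrice;
    }

    double total() {
        return total;
    }
}
```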

Now Devops is doing the same thing with release and deployment. People are finding new ways to make it simpler and easier to release and deploy software, using better tools and getting developers and ops staff together to do this.

And this is a good thing. Because release and deployment, maybe even more than testing, has been neglected by developers. It’s left to the end because it’s the last thing that you have to do – on some big, serial life cycle projects, you can spend months designing and developing something before you get around to releasing the code. Release and deployment is hard – it involves all sorts of finicky technical details and checks. To do it you have to understand how all the pieces of the system are laid out, the technology platform and operational environment for the system, how Ops needs the system to be set up, monitored and wired in, what tools they use and how these tools work; and you have to work through operational dependencies and compliance and governance requirements. You have to talk to different people in a different language, learn and care about their wants and needs and pain points. It’s hard to get all of this right, it’s hard to test it, and you’re under pressure to get the system out. Why not just give Ops the JARs and EARs and WARs and ZIPs (and your phone number in case anything goes wrong) and let them figure it out? We’re back to throwing the code over a wall again.

Devops, by getting Developers and Operations staff working together, sharing technology and solving problems together, is changing this. It’s making developers, and development managers like me, pay more attention to release and deployment (and post-deployment) requirements – not just to getting the code done. It’s getting developers and QA and Operations staff to think together about how to make release and deployment and configuration simpler and faster, about what could go wrong and then making sure that it doesn’t go wrong, for every release – not just when there is a problem or when Ops complains. Replacing checklists and instructions with automated steps. Adding post-release health checks. Building on Continuous Integration toward Continuous Delivery, making it easier and safer and less expensive to release to test as well as to production. This is all practical, concrete work, and a natural next step for teams that are trying to design and build software faster.
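
As a small example of what an automated post-release health check can look like, here is a minimal sketch that polls a health endpoint after a deploy and fails the pipeline step if the system never answers; the URL, timeouts and retry counts are placeholders, not anything from a real system:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal post-deployment smoke check: call a health endpoint a few times
// and fail the release step if it never answers with HTTP 200.
public class PostReleaseHealthCheck {
    public static void main(String[] args) throws Exception {
        URL health = new URL("https://example.internal/app/health"); // placeholder URL
        int attempts = 5;

        for (int i = 1; i <= attempts; i++) {
            try {
                HttpURLConnection conn = (HttpURLConnection) health.openConnection();
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                if (conn.getResponseCode() == 200) {
                    System.out.println("Health check passed on attempt " + i);
                    return;
                }
            } catch (Exception e) {
                System.out.println("Attempt " + i + " failed: " + e.getMessage());
            }
            Thread.sleep(3000); // give the app a few seconds to come up
        }
        System.err.println("Health check failed - consider rolling back");
        System.exit(1); // non-zero exit fails the deployment pipeline step
    }
}
```

Even something this small, run automatically on every release, can catch a bad deploy before a customer does.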

One difference between the XP and Devops stories is that there’s a lot more vendor support in Devops than there was in the early days of Agile development. Commercial vendors behind products like Chef and Puppet, UrbanCode (which has rebranded its Anthill Pro build and release toolset as the DevOps Platform), ThoughtWorks Studios with Go, and even IBM and HP, are involved in Devops and pushing Devops ideas forward.

This is good and bad. Good – because this means that there are tools that people can use and people who can help them understand how to use them. And there’s somebody to help sponsor the conferences and other events that bring people together to explore Devops problems. Bad, because in order to understand and appreciate what’s going on and what’s really useful in Devops you have to wade through a growing amount of marketing noise. It’s too soon to say yet whether the real thought leaders and evangelists will be drowned out by vendor product managers and consultants – much like the problem that the Agile development community faces today.

Wednesday, November 30, 2011

Incorrect bug fixes

Whenever developers fix a bug there is of course a chance of making a mistake – not fixing the bug correctly or completely, or introducing a regression, an unexpected side effect. This is a serious problem in system maintenance. In Geriatric Issues of Aging Software, Capers Jones says that
Roughly 7 percent of all defect repairs will contain a new defect that was not there before. For very complex and poorly structured applications, these bad fix injections have topped 20 percent.
An interesting new study How do Fixes Become Bugs? looks at mistakes made by developers trying to fix bugs. The study was done on bug fixes made to large operating system code bases. Their findings aren’t surprising, but they are interesting:
  • Somewhere between 15% and 25% of bug fixes turn out to be incorrect in the field. Almost half of these mistakes are serious (they can cause crashes, hangs, data corruption or security problems).
  • Concurrency bugs are the most difficult to fix correctly: 39% of concurrency bug fixes are wrong, and fixes for data race bugs can easily introduce deadlocks or reveal other bugs that were previously hidden. Not surprising given that the analysis was of operating system code, but it still highlights the risks in trying to fix concurrent code.
  • The risk of making mistakes is magnified if the person making the fix is not familiar with the code. More than 25% of incorrect fixes are made by developers who had never touched that part of the code before.
The main reasons for incorrect bug fixes:
  • Bug fixes are usually done under tight timelines – bug fixers don’t have the chance to think about potential side-effects and the interaction with the rest of the system, and testers don’t have enough time to thoroughly regression test the fix.
  • Bug fixing has a narrow focus – the developer is focused on understanding and fixing the bug, and doesn’t bother to understand the wider context of the system, often doesn’t even check for other places where the same fix needs to be made. Testers are also narrowly focused on proving that the fix works, and don’t look outside of the specific problem.
  • Lack of understanding of the code base – fixes, especially high-risk fixes (like concurrency changes) should be made by whoever understands the code the best.

So: don't let people who don't know the code well try to fix high-risk problems like concurrency bugs. But you knew that already, didn't you.

Tuesday, November 29, 2011

Iterationless Development – the latest New New Thing

Thanks to the Lean Startup movement, Iterationless Development and Continuous Deployment have become the New New Thing in software development methods. Apparently this has gone so far that “there are venture firms in Silicon Valley that won’t even fund a company unless they employ Lean startup methodologies”.

Although most of us don’t work in a Web 2.0 social media startup, or anything like one, it’s important to cut through the hype and see what we can learn from these ideas. One of the most comprehensive descriptions I’ve seen so far of Iterationless Development is a (good, but buzzword-heavy) presentation by Erik Huddleston that explains how development is done at Dachis Group, which builds online social communities. The development team’s backlog is updated on a just-in-time basis, and includes customer business requirements (defined as minimum features), feedback from Operations (data from analytics and results of Devops retrospectives), and minimally required technical architecture.

Work is managed using Kanban WIP limits and queues. Developers create tests for each change or fix up front. Every check-in kicks off automated tests and static analysis checks for complexity and code duplication as part of Continuous Integration. If it passes these steps, the change is promoted to a test environment, and the code must then be reviewed for architectural oversight (they use Atlassian’s Crucible online code review tool to do this).

Once all of the associated change sets have been reviewed, the code changes are deployed to staging for acceptance testing and review by product management, before being promoted to production. All production changes (code change sets, environment changes and database migration sets) are packaged into Chef recipes and progressively rolled out online. It’s a disciplined and well-structured approach that depends a lot on automation and a good tool set.
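
To make the static analysis gate in that pipeline concrete, here is a rough sketch of the kind of check a CI build can run before promoting a change. The metric names and thresholds are invented, not Dachis Group's actual rules; the point is only that the build fails automatically when complexity or duplication drifts past an agreed limit:

```java
import java.util.Collections;
import java.util.Map;

// Illustrative CI quality gate: fail the build when static analysis
// metrics cross agreed thresholds. Metric names and limits are examples.
public class QualityGate {

    private static final int MAX_CYCLOMATIC_COMPLEXITY = 15;
    private static final double MAX_DUPLICATION_PERCENT = 5.0;

    public static void check(Map<String, Integer> complexityByMethod,
                             double duplicationPercent) {
        for (Map.Entry<String, Integer> entry : complexityByMethod.entrySet()) {
            if (entry.getValue() > MAX_CYCLOMATIC_COMPLEXITY) {
                throw new IllegalStateException("Build rejected: " + entry.getKey()
                        + " has complexity " + entry.getValue());
            }
        }
        if (duplicationPercent > MAX_DUPLICATION_PERCENT) {
            throw new IllegalStateException(
                    "Build rejected: duplication at " + duplicationPercent + "%");
        }
    }

    public static void main(String[] args) {
        // Example: one method over the complexity limit fails the build.
        check(Collections.singletonMap("OrderService.calculateFees", 22), 3.2);
    }
}
```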

Death to Time Boxing

What makes Iterationless Development different is obviously the lack of time boxing – instead of being structured in sprints or spikes, work is done in a continuous flow. According to Huddleston, iterationless Kanban is “here to stay” and is “much more productive than artificial time boxing”.

In a separate blog post, he talks about the death of iterations. While he agrees that iterations have benefits – providing a fixed and consistent routine for the team to follow, a forcing function to drive work to conclusion (nothing focuses the mind like a deadline), and logical points for the team to synch up with the rest of the business – Huddleston asserts that working in time boxes is unnatural and unnecessary. That the artificial and arbitrary boundaries defined by time boxes force people to compromise on solutions, and force them to cut corners in order to meet deadlines.

I agree that time boxes are arbitrary – but no more arbitrary than a work day or work week, or a month or a financial quarter; all cycles that businesses follow. In business we are always working towards a deadline, whether it is hard and real or soft and arbitrary. This is how work gets done. And this doesn’t change if we are working in time boxes or without them.

In iterationless Kanban, the pressure to meet periodic time box deadlines is replaced with the constant pressure to deliver work as fast as possible, to meet individual task deadlines. Rapid cycling in short time boxes is hard enough on teams over a long period of time. Continuous, interrupt-driven development with a tight focus on optimizing cycle time is even harder. The dials are set to on and they stay that way. Kanban makes this easy, giving the team, and the customer and management, the tools to continuously visualize work in progress, identify bottlenecks and delays and squeeze out waste – to maximize efficiency. This is a manufacturing process model remember. The emphasis on tactical optimization and fast-feedback loops, and the “myopic focus on eliminating waste” is just that – short-sighted and difficult to sustain.

With time boxes there are at least built-in synch points, chances for the team to review and reset, so that people can reflect on what they have done, look for ways to improve, look ahead at what they need to do next, and then build up again to an optimal pace. This isn’t waste. Cycling up and down is important and necessary to keep people from getting burnt out and to give them a chance to think and to get better at what they do.

Risk is managed in the same tactical, short-sighted way. Teams working on one issue at a time end up managing risk one issue at a time, relying heavily on automated testing and in-stream controls like code reviews. This is good, but not good enough for many environments: security and reliability risks need to be managed in a more comprehensive, systemic way. Even integrating feedback from Ops isn’t enough to find and prevent deep problems. Working in Agile time boxes is already trading technical risks for speed and efficiency. Iterationless Development and Continuous Deployment, focused on eliminating waste and on accelerating cycle time, pushes these tradeoffs even further, into the danger zone.

Huddleston is also critical of “boxcaring” – batching different pieces of work together in a time box – because it interferes with simple prioritization and introduces unnecessary delays. But batching together work that makes sense to do together can be a useful way to reduce risk and cost. Take a simple example. The team is working on feature 1a. Once it’s done, they move on to feature 1b, then 1c. All of this work requires changing the same parts of the code, the same or similar testing and reviews, and has a similar impact on operations. By batching this work together, you might deliver it more slowly, but you reduce waste and minimize risk by delivering it once, rather than three times.

Iterationless Development Makes Sense…

Iterationless Development using Kanban as a control structure is an effective way to deal with excessive pressure and uncertainty – like in an early-stage startup, or a firefighting support team. It’s good for rapid innovation and experimental prototyping, building on continuous feedback from customers and from Operations – situations where speed and responsiveness to the business and customers is critical, more important than minimizing technical and operational risks. It formalizes the way that most successful web startups work – come up with a cool idea, build a prototype as quickly as possible, and then put it out and find out what customers actually want before you run out of cash. But it’s not a one-size-fits-all solution to software development problems.

All software development methods are compromises – imperfect attempts at managing risks and uncertainty. Sequential or serial development methods attempt to specify and fix the solution space upfront, and then manage to this fixed scope. Iterative, time-boxed development helps teams deal with uncertainty by breaking business needs down into small, concrete problems and delivering a working solution in regular steps. And iterationless, continuous-flow development lets teams rapidly test ideas and alternatives when the problem isn’t clear and nobody is sure yet what direction to go in.

There’s no one right answer. What approach you follow depends on what your priorities and circumstances are, and what kind of problems and risks you need to solve today.

Tuesday, November 15, 2011

Diminishing Returns in software development and maintenance

Everyone knows from reading The Mythical Man Month that as you add more people to a software development project you will see diminishing marginal returns.

When you add a person to a team, there’s a short-term hit as the rest of the team slows down to bring the new team member up to speed and adjusts to working with another person, making sure that they fit in and can contribute. There’s also a long-term cost. More people means more people who need to talk to each other (n x (n - 1) / 2 communication paths), which means more opportunities for misunderstandings and mistakes and misdirections and missed handoffs, more chances for disagreements and conflicts, more bottleneck points.

As you continue to add people, the team needs to spend more time getting each new person up to speed and more time keeping everyone on the team in synch. Adding more people means that the team speeds up less and less, while people costs and communications costs and overhead costs keep going up. At some point negative returns set in – if you add more people, the team’s performance will decline and you will get less work done, not more.
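
A quick worked example of how fast those communication paths multiply:

```java
// Communication paths between team members grow quadratically: n * (n - 1) / 2.
public class CommunicationPaths {
    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 8, 12, 20}) {
            System.out.printf("%2d people -> %3d paths%n", n, n * (n - 1) / 2);
        }
    }
}
```

Going from 5 people to 20 takes you from 10 paths to 190 – most of the added capacity gets eaten by keeping everyone in synch.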

Diminishing Returns from any One Practice

But adding too many people to a project isn’t the only case of diminishing returns in software development. If you work on a big enough project, or if you work in maintenance for long enough, you will run into problems of diminishing returns everywhere that you look.

Pushing too hard in one direction, depending too much on any tool or practice, will eventually yield diminishing returns. This applies to:
  • Manual functional and acceptance testing
  • Test automation
  • Any single testing technique
  • Code reviews
  • Static analysis bug finding tools
  • Penetration tests and other security reviews

Aiming for 100% code coverage on unit tests is a good example. Building a good automated regression safety net is important – as you wire in tests for key areas of the system, programmers get more confidence and can make more changes faster.

How many tests are enough? In Continuous Delivery, Jez Humble and David Farley set 80% coverage as a target for each of automated unit testing, functional testing and acceptance testing. You could get by with lower coverage in many areas, higher coverage in core areas. You need enough tests to catch common and important mistakes. But beyond this point, more tests get more difficult to write, and find fewer problems.

Unit testing can only find so many problems in the first place. In Code Complete, Steve McConnell explains that unit testing can only find between 15% and 50% (on average 30%) of the defects in your code. Rather than writing more unit tests, people’s time would be better spent on other approaches like exploratory system testing and code reviews or stress testing or fuzzing to find different kinds of errors.
Too much of anything is bad, but too much whiskey is enough.
Mark Twain, as quoted in Code Complete
Refactoring is important for maintaining and improving the structure and readability of code over time. It is intended to be a supporting practice – to help make changes and fixes simpler and clearer and safer. When refactoring becomes an end in itself or turns into Obsessive Refactoring Disorder, it not only adds unnecessary cost as programmers waste time on trivial details and style issues, but can also add unnecessary risk and create conflict in a team.

Make sure that refactoring is done in a disciplined way, and focus it on the areas that need it the most: code that is frequently changed, routines that are too big, too hard to read, too complex and error-prone. Putting most of your refactoring attention on this code (or, if necessary, rewriting it) will get you the highest returns.

Less and Less over Time

Diminishing returns also set in over time. The longer you spend working the same way and with the same tools, the less benefit you will see. Even core practices that you’ve grown to depend on pay back less over time, and at some point may cost more than they are worth.

It’s time again for New Year’s resolutions – time to sign up at a gym and start lifting weights. If you stick with the same routine for a couple of months, you will start to see good results. But after a while your body will get used to the work – if you keep doing the same things the same way your performance will plateau and you will stop seeing gains. You will get bored and stop going to the gym, which will leave more room for people like me. If you do keep going, trying to push harder for returns, you will overtrain and injure yourself.

The same thing happens to software teams following the same practices, using the same tools. Some of this is due to inertia. Teams and organizations reach an equilibrium point and they want to stay there. Because it is comfortable, and it works – or at least they understand it. And because the better the team is working, the harder it is to get better – all the low-hanging fruit has been picked. People keep doing what worked for them in the past. They stop looking beyond their established routines, stop looking for new ideas. Competence and control lead to complacency and acceptance. Instead of trying to be as good as possible, they settle for being good enough.

This is the point of inspect-and-adapt in Scrum and other time boxed methods – asking the team to regularly re-evaluate what they are doing and how they are doing it, what’s going well and what isn’t, what they should do more of or less of, challenging the status quo and finding new ways to move forward. But even the act of assessing and improving is subject to diminishing returns. If you are building software in 2-week time boxes, and you’ve been doing this for 3, 4 or 5 years, then how much meaningful feedback should you really expect from so many superficial reviews? After a while the team finds themselves going over the same issues and problems and coming up with the same results. Reviews become an unnecessary and empty ritual, another waste of time.

The same thing happens with tools. When you first start using a static analysis bug checking tool for example, there’s a good chance that you will find some interesting problems that you didn’t know were in the code – maybe even more problems than you can deal with. But once you triage this and fix up the code and use the tool for a while, the tool will find fewer and fewer problems until it gets to the point where you are paying for insurance – it isn’t finding problems any more, but it might someday.

In "Has secure software development reached its limits?” William Jackson argues that SDLCs – all of them – eventually reach a point of diminishing returns from a quality and security standpoint, and that Microsoft and Oracle and other big shops are already seeing diminishing returns from their SDLCs. Their software won’t get any better – all they can do is to keep spending time and money to stay where they are. The same thing happens with Agile methods like Scrum or XP – at some point you’ve squeezed everything that you can from this way or working, and the team’s performance will plateau.

What can you do about diminishing returns?

First, understand and expect returns to diminish over time. Watch for the signs, and factor this into your expectations – that even if you maintain discipline and keep spending on tools, you will get less and less return for your time and money. Watch for the team’s velocity to plateau or decline.

Expect this to happen and be prepared to make changes, even force fundamental changes on the team. If the tools that you are using aren’t giving returns any more, then find new ones, or stop using them and see what happens.

Keep reviewing how the team is working, but do these reviews differently: review less often, make the reviews more focused on specific problems, involve different people from inside and outside of the team. Use problems or mistakes as an opportunity to shake things up and challenge the status quo. Dig deep using Root Cause Analysis and challenge the team’s way of thinking and working, look for something better. Don’t settle for simple answers or incremental improvements.

Remember the 80/20 rule. Most of your problems will happen in the same small number of areas, from a small number of common causes. And most of your gains will come from a few initiatives.

Change the team’s driving focus and key metrics, set new bars. Use Lean methods and Lean Thinking to identify and eliminate bottlenecks, delays and inefficiencies. Look at the controls and tests and checks that you have added over time, question whether you still need them, or find steps and checks that can be combined or automated or simplified. Focus on reducing cycle time and eliminating waste until you have squeezed out what you can. Then change your focus to quality and eliminating bugs, or to simplifying the release and deployment pipeline, or some other new focus that will push the team to improve in a meaningful way. And keep doing this and pushing until you see the team slowing down and results declining. Then start again, and push the team to improve again along another dimension. Keep watching, keep changing, keep moving ahead.

Thursday, November 3, 2011

Real, useful security help for software developers

There's lots of advice on designing and building secure software. All you need to do is: Think like an attacker. Minimize the Attack Surface. Apply the principles of Least Privilege and Defense in Depth and Economy of Mechanism. Canonicalize and validate all input. Encode and escape output within the correct context. Use encryption properly. Manage sessions in a secure way....

But how are development teams actually supposed to do all of this? How do they know what's important, and what's not? What frameworks and libraries should they use? Where are code samples that they can review and follow? How can they test the software to see if they did everything correctly?

Read my latest post at the SANS Appsec Street Fighter blog for the best of the tools, cheat sheets and programming books that I've found to help development teams deal with the details of building secure software.

Monday, October 24, 2011

Rolling Forward and other Deployment Myths

There is more and more writing on Devops lately, which is good and bad. There still remains a small core of thoughtful people that are worth listening to and learning from. There’s more and more marketing from vendors and consultants jumping on the Devops bandwagon. There’s some naïve silliness (“Hire wicked smart people and give them all access to root.”) which can probably be safely ignored. And then there’s stuff that is half-right and half-wrong, too dangerous to be ignored. Like this recent post on Rollbacks and Other Deployment Myths, in which John Vincent lists 5 “myths” about system deployment, which I want to take some time to respond to here.

Change is change?

The author tries to make the point that “Change is neither good or bad. It’s just change.” Therefore we do not need to be afraid of making changes.

I don’t agree. This attitude to change ignores the fact that once a system is up and running and customers are relying on it to conduct their business, whatever change you are making is almost never as important as making sure that the system keeps running properly. Unfortunately, changes often lead to problems. We know from Visible Ops that, based on studies of hundreds of companies, 80% of operational failures are caused by mistakes made during changes. This is where heavyweight process control frameworks like ITIL and COBIT, and detective change control tools like Tripwire, came from: to help companies get control over IT change, because people had to find some way to stop shit from breaking.

Yes, I get the point that in IT we tend to over-compensate, and I agree that calling in the Release Police and trying to put up a wall around all changes isn’t sustainable. People don’t have to work this way. But trivializing change, pretending that changes don't lead to problems, is dangerous.

Deploys are not risky?

You can be smart and careful and break changes down into small steps and try to automate code pushes and configuration changes, and plan ahead and stage and review and test all your changes, and after all of this you can still mess up the deploy. Even if you make frequent small changes and simplify the work and practice it a lot.

For systems like Facebook and online games and a small number of other cases, maybe deployment really is a non-issue. I don’t care if Facebook deploys in the middle of the day – I can usually tell when they are doing a “zero downtime” deploy (or maybe they are “transparently” recovering from a failure) because data disappears temporarily or shows up in the wrong order, functions aren’t accessible for a while, forms don’t resolve properly, and other weird shit happens, and then things come back later or they don’t. As a customer, do I care? No. It’s an inconvenience, and it’s occasionally unsettling ("WTF just happened?"), but I get used to it and so do millions of others. That’s because most of us don’t use Facebook or systems like this for anything important.

For business-critical systems handling thousands of transactions a second that are tied into hundreds of other companies’ systems (the world that I work in), this doesn’t cut it. Maybe I spend too much time at this extreme, where even small compatibility problems that only affect a small number of customers, or slight and temporary performance slowdowns, are a big deal. But most people I work with and talk to in software development, maintenance and system operations agree that deployment is a big deal and needs to be done with care and attention, no matter how simple and small the changes are and no matter how clean and simple and automated the deployment process is.

Rollbacks are a myth?

Vincent wants us to “understand that it’s typically more risky to rollback than rolling forward. Always be rolling forward.”

Not even the Continuous Deployment advocates (who are often some of the most radical – and I think some of the most irresponsible – voices in the Devops community) agree with this – they still roll back if they find problems with changes.

“Rollbacks are a myth” is an echo of the “real men fail forward” crap I heard at Velocity last year and it is where I draw the line. It's one thing to state an extreme position for argument's sake or put up a straw man – but this is just plain wrong.

If you're going to deploy, you have to anticipate rolling back, think about it when you make changes, and test rolling back to make sure that it works. All of this is hard. But without a working rollback you have no choice other than to fail forward (whatever that means, because nobody who talks about it actually explains how to do it), and this puts your customers and their business at unnecessary risk. It’s not another valid way of thinking. It’s irresponsible.

James Hamilton wrote an excellent paper on Designing and Delivering Internet-Scale Services when he was at Microsoft (he is now an executive and Distinguished Engineer at Amazon). Hamilton’s paper remains one of the smartest things that anyone has written about how to deal with deployment and operational problems at scale. Everyone who designs, builds, maintains or operates an online system should read it. His position on roll back is simple and obvious and right:
Reverting to the previous version is a rip cord that should always be available on any deployment.
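
A minimal sketch of keeping that rip cord within reach, assuming a simple versioned-release layout on disk with a "current" symlink pointing at the live version; the paths here are placeholders:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Keep the previous release on disk and make rollback a one-line switch.
// Assumed directory layout (placeholder paths):
//   /opt/app/releases/2011-12-14/   <- previous version
//   /opt/app/releases/2011-12-15/   <- version just deployed
//   /opt/app/current                -> symlink to the live release
public class Rollback {

    public static void pointCurrentAt(Path release) throws Exception {
        Path current = Paths.get("/opt/app/current");
        Files.deleteIfExists(current);              // removes the old link, not the release
        Files.createSymbolicLink(current, release); // switch the live version
        // restart or reload the app here so it picks up the switched link
    }

    public static void main(String[] args) throws Exception {
        // Something went wrong with the new release: pull the rip cord.
        pointCurrentAt(Paths.get("/opt/app/releases/2011-12-14"));
    }
}
```

The details vary (symlinks, load balancer pools, package versions), but the principle is the same: the previous version stays deployable, and switching back is one small, well-rehearsed step.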
Everything fails. Embrace failure.

I agree that everything can and will fail some day, and that we can’t pretend that we can prevent failures in any system. But I don’t agree with embracing failure, at least in business-critical enterprise systems, where recovering from a failure means lost business and requires unraveling chains of transactions between different upstream and downstream systems and different companies, messing up other companies’ businesses as well as your own and dealing with the follow-on compliance problems. Failures in these kinds of systems, and in a lot of other systems, are ugly and serious, and they should be treated seriously.

We do whatever we can to make sure that failures are controlled and isolated, and we make sure that we can recover quickly if something goes wrong (which includes being able to roll back!). But we also do everything that we can to prevent failures. Embracing failure is fine for online consumer web site startups – let’s leave it to them.

SLAs

I wanted to respond to the points about SLAs, but it’s not clear to me what the author was trying to say. SLAs are not about servers. Umm, yes that’s right of course...

SLAs are important to set business expectations with your customers (the people who are using the system) and with your partners and suppliers. So that your partners and suppliers know what you need from them and what you are paying them for, and so that you know if you can depend on them when you have to. So that your customers know what they are paying you for. And SLAs (not just high-level uptime SLAs, but SLAs for Recovery Time and Recovery Point goals and incident response and escalation) are important so that your team understands the constraints that they need to work under, what trade-offs to make in design and implementation and in operations.

Under-compensating is worse than Over-compensating

I spent more time than I thought I would responding to this post, because some of what the author says is right – especially in the second part of his post, Deploy All the Things, where he provides some good practical advice on how to reduce risk in deployment. He’s right that Operations’ main purpose isn’t to stop change – it can’t be. We have to be able to keep changing, and developers and operations have to work together to do this in safe and efficient ways. But trivializing the problems and risks of change, and over-simplifying how to deal with these risks and how to deal with failures, isn’t the way to do this. There has to be a middle way between the ITIL and COBIT world of controls and paper and process, and cool Web startups failing forward – a way that can really work for the rest of us.

Wednesday, October 12, 2011

You can’t be Agile in Maintenance?

I’ve been going over a couple of posts by Steve Kilner that question whether Agile methods can be used effectively in software maintenance. It’s a surprising question really. There are a lot of maintenance teams who have had success following Agile methods like Scrum and Extreme Programming (XP) for some time now. We’ve been doing it for almost 5 years, enhancing and maintaining and supporting enterprise systems, and I know that it works.

Agile development naturally leads into maintenance – the goal of incremental Agile development is to get working software out to customers as soon as possible, and get customers using it. At some point, when customers are relying on the software to get real business done and need support and help to keep the system running, teams cross from development over to maintenance. But there’s no reason for Agile development teams to fundamentally change the way that they work when this happens.

It is harder to introduce Agile practices into a legacy maintenance team – there are a lot of technical requirements and some cultural changes that need to be made. But most maintenance teams have little to lose and lots to gain from borrowing from what Agile development teams are doing. Agile methods are designed to help small teams deal with a lot of change and uncertainty, and to deliver software quickly – all things that are at least as important in maintenance as they are in development. Technical practices in Extreme Programming especially help ensure that the code is always working – which is even more important in maintenance than it is in development, because the code has to work the first time in production.

Agile methods have to be adapted to maintenance, but most teams have found it necessary to adapt these methods to fit their situations anyways. Let’s look at what works and what has to be changed to make Agile methods like Scrum and XP work in maintenance.

What works well and what doesn’t

Planning Game

Managing maintenance isn’t the same as managing a development project – even an Agile development project. Although Agile development teams expect to deal with ambiguity and constant change, maintenance teams need to be even more flexible and responsive, to manage conflicts and unpredictable resourcing problems. Work has to be continuously reviewed and prioritized as it comes in – the customer can’t wait for 2 weeks for you to look at a production bug. The team needs a fast path for urgent changes and especially for hot fixes.

You have to be prepared for support demands and interruptions. Structure the team so that some people can take care of second-level support, firefighting and emergency bug fixing and the rest of the team can keep moving forward and get something done. Build slack into schedules to allow for last-minute changes and support escalation.

You will also have to be more careful in planning out maintenance work, to take into account technical and operational dependencies and constraints and risks. You’re working in the real world now, not the virtual reality of a project.

Standups

Standups play an important role in Agile projects to help teams come up to speed and bond. But most maintenance teams work fine without standups – since a lot of maintenance work can be done by one person working on their own, team members don’t need to listen to each other each morning talking about what they did yesterday and what they’re going to do – unless the team is working together on major changes. If someone has a question or runs into a problem, they can ask for help without waiting until the next day.

Small releases

Most changes and fixes that maintenance teams need to make are small, and there is almost always pressure from the business to get the code out as soon as it is ready, so an Agile approach with small and frequent releases makes a lot of sense. If the time boxes are short enough, the customer is less likely to interrupt and re-prioritize work in progress – most businesses can wait a few days or a couple of weeks to get something changed.

Time boxing gives teams a way to control and structure their work, an opportunity to batch up related work to reduce development and testing costs, and natural opportunities to add in security controls and reviews and other gates. It also makes maintenance work more like a project, giving the team a chance to set goals and to see something get done. But time boxing comes with overhead – the planning and setup at the start, then deployment and reviews at the end – all of which adds up over time. Maintenance teams need to be ruthless with ceremonies and meetings, pare them down, keep only what’s necessary and what works.

It’s even more important in maintenance than in development to remember that the goal is to deliver working code at the end of each time box. If some code is not working, or you’re not sure if it is working, then extend the deadline, back some of the changes out, or pull the plug on this release and start over. Don’t risk a production failure in order to hit an arbitrary deadline. If the team is having problems fitting work into time boxes, then stop and figure out what you’re doing wrong – the team is trying to do too much too fast, or the code is too unstable, or people don’t understand the code enough – and fix it and move on.

Reviews and Retrospectives

Retrospectives are important in maintenance to keep the team moving forward, to find better ways of working, and to solve problems. But like many practices, regular reviews reach a point of diminishing returns over time – people end up going through the motions. Once the team is setup, reviews don’t need to be done in each iteration unless the team runs into problems.

Schedule reviews when you or the team need them. Collect data on how the team is working, on cycle time and bug report/fix ratios, correlate problems in production with changes, and get the team together to review if the numbers move off track. If the team runs into a serious problem like a major production failure, then get to the bottom of it through Root Cause Analysis.

Sustainable pace / 40-hour week

It’s not always possible to work a 40-hour week in maintenance. There are times when the team will be pushed to make urgent changes, spend late nights firefighting, releasing after hours and testing on weekends. But if this happens too often or goes on too long the team will burn out. It’s critical to establish a sustainable pace over the long term, to treat people fairly and give them a chance to do a good job.

Pairing

Pairing is hard to do in small teams where people are working on many different things. Pairing does make sense in some cases – people naturally pair-up when trying to debug a nasty problem or walking through a complicated change – but it’s not necessary to force it on people, and there are good reasons not to.

Some teams (like mine) rely more on code reviews instead of pairing, or try to get developers to pair when first looking at a problem or change, and at the end again to review the code and tests. The important thing is to ensure that changes get looked at by at least one other person if possible, however this gets done.

Collective Code Ownership

Because maintenance teams are usually small and have to deal with a lot of different kinds of work, sooner or later different people will end up working on different parts of the code. It’s necessary, and it’s a good thing because people get a chance to learn more about the system and work with different technologies and on different problems.

But there’s still a place for specialists in maintenance. You want the people who know the code the best to make emergency fixes or high-risk changes – or at least have them review the changes – because it has to work the first time. And sometimes you have no choice – sometimes there is only one person who understands a framework or language or technical problem well enough to get something done.

Coding Guidelines – follow the rules

Getting the team to follow coding guidelines is important in maintenance to help ensure the consistency and integrity of the code base over time – and to help ensure software security. Of course teams may have to compromise on coding standards and style conventions, depending on what they have inherited in the code base; and teams that maintain multiple systems will have to follow different guidelines for each system.

Metaphor

In XP, teams are supposed to share a Metaphor: a simple high-level expression of the system architecture (the system is a production line, or a bill of materials) and common names and patterns that can be used to describe the system. It’s a fuzzy concept at best, a weak substitute for more detailed architecture or design, and it’s not of much practical value in maintenance. Maintenance teams have to work with the architecture and patterns that are already in place in the system.

What is important is making sure that the team has a common understanding of these patterns and the basic architecture so that the integrity isn’t lost – if it hasn’t been lost already. Getting the team together and reviewing the architecture, or reverse-engineering it, making sure that they all agree on it and documenting it in a simple way is important especially when taking over maintenance of a new system and when you are planning major changes.

Simple Design

Agile development teams start with simple designs and try to keep them simple. Maintenance teams have to work with whatever design and architecture they inherit, which can be overwhelmingly complex, especially in bigger and older systems. But the driving principle should still be to keep the design of changes and new features as simple as the existing system lets you – and to simplify the system’s design further whenever you can.

Especially when making small changes, simple, just-enough design is good – it means less documentation and less time and less cost. But maintenance teams need to be more risk averse than development teams – even small mistakes can break compatibility or cause a run-time failure or open a security hole. This means that maintainers can’t be as iterative and free to take chances, and they need to spend more time upfront doing analysis, understanding the existing design and working through dependencies, as well as reviewing and testing their changes for regressions afterwards.

Refactoring

Refactoring takes on a lot of importance in maintenance. Every time a developer makes a change or fix they should consider how much refactoring work they should do and can do to make the code and design clearer and simpler, and to pay off technical debt. What and how much to refactor depends on what kind of work they are doing (making a well-thought-out isolated change, or doing shotgun surgery, or pushing out an emergency hot fix) and the time and risks involved, how well they understand the code, how good their tools are (development IDEs for Java and .NET at least have good built-in tools that make many refactorings simple and safe) and what kind of safety net they have in place to catch mistakes – automated tests, code reviews, static analysis.

Some maintenance teams don’t refactor because they are too afraid of making mistakes. It’s a vicious circle – over time the code will get harder and harder to understand and change, and they will have more reasons to be more afraid. Others claim that a maintenance team is not working correctly if they don’t spend at least 50% of their time refactoring.

The real answer is somewhere in between – enough refactoring to make changes and fixes safe. There are cases where extensive refactoring, restructuring or rewriting code is the right thing to do. Some code is too dangerous to change or too full of bugs to leave the way it is – studies show that in most systems, especially big systems, 80% of the bugs cluster in 20% of the code. Restructuring or rewriting this code can pay off quickly, reducing problems in production and significantly reducing the time needed to make and test changes going forward.
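
As a small example of the kind of low-risk refactoring that pays off in maintenance, here is an extract-method step that gives a buried business rule a name without changing behavior; the domain and names are invented:

```java
// Before: the business rule was buried in an expression that every
// maintainer had to decode before touching this code:
//   if (order.total() > 500 && !customer.isBlocked() && customer.yearsActive() >= 2) { ... }
//
// After: the condition is extracted into a named method. Behavior is
// unchanged, but the rule now has a name, one place to fix, one place to test.
interface Order { double total(); }
interface Customer { boolean isBlocked(); int yearsActive(); }

public class DiscountPolicy {

    public void apply(Order order, Customer customer) {
        if (qualifiesForLoyaltyDiscount(order, customer)) {
            applyDiscount(order);
        }
    }

    private boolean qualifiesForLoyaltyDiscount(Order order, Customer customer) {
        return order.total() > 500
                && !customer.isBlocked()
                && customer.yearsActive() >= 2;
    }

    private void applyDiscount(Order order) {
        // discount calculation elided
    }
}
```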

Continuous Testing

Testing is even more important and necessary in maintenance than it is in development. And it’s a major part of maintenance costs. Most maintenance teams rely on developers to test their own changes and fixes by hand to make sure that the change worked and that they didn’t break anything as a side effect. Of course this makes testing expensive and inefficient and it limits how much work the team can do. In order to move fast, to make incremental changes and refactoring safe, the team needs a better safety net, by automating unit and functional tests and acceptance tests.

It can take a long time to put in test scaffolding and tools and write a good set of automated tests. But even a simple test framework and a small set of core fat tests can pay back quickly in maintenance, because a lot of changes (and bugs) tend to be concentrated in the same parts of the code – the same features, framework code and APIs get changed over and over again, and will need to be tested over and over again. You can start small, get these tests running quickly and reliably, get the team to rely on them, fill in the gaps with manual tests and reviews, and then fill out the tests over time. Once you have a basic test framework in place, developers can take advantage of TFD/TDD, especially for bug fixes – the fix has to be tested anyways, so why not write the test first and make sure that you fixed what you were supposed to?
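
For bug fixes, the test-first step can be as simple as reproducing the reported problem in a failing test before touching the code. A sketch, with an invented rounding bug:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Bug report: a fee of 10.005 is billed as 10.00 instead of 10.01.
// Write the failing test that reproduces it first, fix the code,
// then keep the test as a permanent regression check.
public class FeeRoundingBugTest {

    @Test
    public void halfCentsRoundUp() {
        assertEquals("10.01", FeeCalculator.formatFee(10.005));
    }
}

class FeeCalculator {
    // The fix: round half-up on a BigDecimal instead of relying on
    // double formatting, which lost this case.
    static String formatFee(double fee) {
        return java.math.BigDecimal.valueOf(fee)
                .setScale(2, java.math.RoundingMode.HALF_UP)
                .toPlainString();
    }
}
```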

Continuous Integration

To get Continuous Testing to work, you need a Continuous Integration environment. Understanding, automating and streamlining the build and getting the CI server up and running and wiring in tests and static analysis checks and reporting can take a lot of work in an enterprise system, especially if you have to deal with multiple languages and platforms and dependencies between systems. But doing this work is also the foundation for simplifying release and deployment – frequent short releases means that release and deployment has to be made as simple as possible.

Onsite Customer / Product Owner

Working closely with the customer to make sure that the team is delivering what the customer needs when the customer needs it is as important in maintenance as it is in developing a new system. Getting a talented and committed Customer engaged is hard enough on a high-profile development project – but it’s even harder in maintenance. You may end up with too many customers with conflicting agendas competing for the team’s attention, or nobody who has the time or ability to answer questions and make decisions. Maintenance teams often have to make compromises and help fill in this role on their own.

But it doesn’t all fit….

Kilner’s main point of concern isn’t really with Agile methods in maintenance. It’s with incremental design and development in general – that some work doesn’t fit nicely into short time boxes. Short iterations might work ok for bug fixes and small enhancements (they do), but sometimes you need to make bigger changes that have lots of dependencies. He argues that while Agile teams building new systems can stub out incomplete work and keep going in steps, maintenance teams have to get everything working all at once – it’s all or nothing.

It’s not easy to see how big changes can be broken down into small steps that can be fit into short time boxes. I agree that this is harder in maintenance because you have to be more careful in understanding and untangling dependencies before you make changes, and you have to be more careful not to break things. The code and design will sometimes fight the kinds of changes that you need to make, because you need to do something that was never anticipated in the original design, or whatever design there was has been lost over time and any kind of change is hard to make.

It’s not easy – but teams solve these problems all the time. You can use tools to figure out how much of a dependency mess you have in the code and what kind of changes you need to make to get out of this mess. If you are going to spend “weeks, months, or even years” to make changes to a system, then it makes sense to take time upfront to understand and break down build dependencies and isolate run-time dependencies, and put in test scaffolding and tests to protect the team from making mistakes as they go along. All of this can be done in time boxed steps. Just because you are following time boxes and simple, incremental design doesn’t mean that you start making changes without thinking them through.

Read Working Effectively with Legacy Code – Michael Feathers walks through how to deal with these problems in detail, in both object-oriented and procedural languages. What to do if it takes forever to make a change. How to break dependencies. How to find interception points and pinch points. How to find structure in the design and the code. What tests to write and how to get automated tests to work.
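
As one small example of the dependency-breaking moves Feathers describes, here is a seam introduced by extracting an interface and injecting it, so legacy code can be put under test without calling the real back end; the names are invented:

```java
// Before: the legacy method created its own gateway internally, so it
// couldn't be tested without a live back-end connection.
// After: the dependency comes in through an interface (a "seam"), so tests
// can substitute a fake while production keeps using the real gateway.
interface RateSource {
    double rateFor(String currency);
}

class PricingService {
    private final RateSource rates;

    PricingService(RateSource rates) {
        this.rates = rates; // injected: real gateway in production, fake in tests
    }

    double priceInCurrency(double basePrice, String currency) {
        return basePrice * rates.rateFor(currency);
    }
}

class PricingServiceTest {
    void convertsUsingTheProvidedRate() {
        RateSource fakeRates = new RateSource() {
            public double rateFor(String currency) { return 1.5; } // canned rate
        };
        PricingService service = new PricingService(fakeRates);
        assert service.priceInCurrency(100.0, "CAD") == 150.0;
    }
}
```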

Changing data in a production system, especially data shared with other systems, isn’t easy either. You need to plan out API changes and data structure changes as carefully as possible, but you can still make data and database changes in small, structured steps.

To make code changes in steps you can use Branching by Abstraction where it makes sense (like making back-end changes) and you can protect customers from changes through Feature Flags and Dark Launching like Facebook and Twitter and Flickr do to continuously roll out changes – although you need to be careful, because if taken too far these practices can make code more fragile and harder to work with.
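
A bare-bones sketch of a feature flag – enough to ship dark code to production and turn it on later through configuration; real implementations add per-user targeting, kill switches and a proper configuration service (everything here is invented for illustration):

```java
import java.util.Properties;

// Minimal feature flag: new code is deployed dark and only runs when the
// flag is switched on in configuration, so release and launch are decoupled.
public class CheckoutService {

    private final Properties flags;

    public CheckoutService(Properties flags) {
        this.flags = flags;
    }

    public void checkout(String customerId) {
        if (Boolean.parseBoolean(flags.getProperty("checkout.newPricingEngine", "false"))) {
            checkoutWithNewPricingEngine(customerId); // dark-launched path
        } else {
            checkoutWithLegacyPricing(customerId);    // current production path
        }
    }

    private void checkoutWithNewPricingEngine(String customerId) { /* new code */ }
    private void checkoutWithLegacyPricing(String customerId)    { /* existing code */ }
}
```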

Agile development teams follow incremental design and development to help them discover an optimal solution through trial-and-error. Maintenance teams work this way for a different reason – to manage technical risks by breaking big changes down and making small bets instead of big ones.

Working this way means that you have to put in scaffolding (and remember to take it out afterwards) and plan out intermediate steps and review and test everything as you make each change. Sometimes it might feel like you are running in place, that it is taking longer and costing more. But getting there in small steps is much safer, and gives you a lot more control.

Teams working on large legacy code bases and old technology platforms will have a harder time taking on these ideas and succeeding with them. But that doesn’t mean that they won’t work. Yes, you can be Agile in maintenance.

Tuesday, October 4, 2011

Dealing with security vulnerabilities ... er... bugs

A serious problem in many organizations is that development teams (and their business sponsors) don't take ownership for understanding and managing software security risks, and often try to ignore vulnerabilities or hide them. Without real pressure from the top, it's hard to convince developers and management that dealing with security vulnerabilities is a priority because vulnerabilities aren't requirements or real problems — they are potential problems and risks that can be put off until later.

This is wrong. Vulnerabilities found in pen testing and reviews and scans are either bugs — real problems in the code that should be fixed — or they are noise — false positives or motherhood that can be ignored. Treating them as something different and distinct and managing them in a different way is a mistake.

Read my latest post at the SANS AppSec Street Fighter blog on Dealing with Security Vulnerabilities.

Monday, September 26, 2011

Takeaways from OWASP AppSec USA 2011

Last week I attended the OWASP AppSec USA conference in Minneapolis. It was my first time at an OWASP event, and it was an impressive show. More than 600 attendees, some high-quality speakers, lots of vendor representation.

There were several different tracks over the two days of the conference: attacks and defenses, security issues in Mobile computing and Cloud computing, thought leadership, OWASP tools, security patterns and secure SDLCs. And there were lots of opportunities to talk to vendors and smart people. Here were the highlights for me:

Jim Manico showed that there’s hope for developers to write secure code, at least when it comes to protecting web apps from XSS attacks, if they use common good sense and new frameworks and libraries like Google’s Caja, the technology behind OWASP’s Java HTML Sanitizer.
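
As a rough sketch of what this looks like in practice (using the OWASP Java HTML Sanitizer’s policy API as I understand it; the hostile input string is made up): define a whitelist policy once, and push every piece of untrusted HTML through it before it is echoed back into a page.

import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

public class CommentSanitizer {
    // Allow basic formatting and links; strip everything else (script elements, event
    // handlers, javascript: URLs) before the comment is echoed back into a page.
    private static final PolicyFactory POLICY =
            Sanitizers.FORMATTING.and(Sanitizers.LINKS);

    public static String clean(String untrustedHtml) {
        return POLICY.sanitize(untrustedHtml);
    }

    public static void main(String[] args) {
        String comment = "<b>Nice post!</b><script>alert('xss')</script>"; // made-up hostile input
        System.out.println(clean(comment)); // should print something like: <b>Nice post!</b>
    }
}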

There is even some hope for writing secure Mobile apps, as long as developers are careful and informed. This is a more reassuring story than what I was hearing from security experts earlier in the year, when everyone was focused on how pathetically insecure the platforms and tools were for writing Mobile apps.

It's also possible to be secure in the Cloud as long as you take the time to understand the Cloud services that you are using and how they are implemented and what assumptions they make, and if you think through your design. Which a lot of people probably aren’t bothering to do.

There's too much code already out there

We’re still trying to find ways to build software that is secure. But a bigger problem for many organizations is that there is already too much code out there that is insecure. Jeff Williams at Aspect Security guesstimates that there are more than 1 trillion lines of code in the world, and around 15 million developers writing more of it. A lot of this legacy code contains serious security problems – but it is too expensive to find and fix these problems. We need better tools to do this, tools that scale up to enterprises and down to small companies.

Many organizations, especially big enterprises with thousands of applications, are relying heavily on scanning tools to find security vulnerabilities. But it takes too long and costs too much to work through all of the results and to find real risks from the long lists of findings and false positives – to figure out what problems need to be fixed and how to fix them.

HP thinks that they have an answer by correlating dynamic scanning and static code analysis results. The idea is to scan the source code, then run dynamic scans against an application under a run-time monitor which captures problems found by the dynamic scanner and records where the problems occurred in the code. By matching up these results, you can see which static analysis findings are exploitable, and it makes it easier to find and fix problems found in run-time testing. And combining these techniques also makes it possible to find new types of vulnerabilities that can’t be found using either approach on its own. I heard about this idea earlier in the year when Brian Chess presented it at SANS Appsec, but it is interesting to see how far it has come since then. Hybrid testing is limited to HP technology for now (SPI Dynamics and Fortify) and it’s limited by what problems the scanners can find, but it looks like a good step forward in the test tool space.

Good Pen Testing and Bad Pen Testing

It’s clear talking to vendors that there are good pen tests and bad pen tests. In most cases, good pen tests are what they offer, and bad pen tests are what you would get from a competitor. Comparing one vendor to another isn't easy, since an application pen test means different things to different people, and the vendors all had different stories and offerings. Some stressed the quality of their tools, others stressed the experience and ingenuity of their testers. Some offered automated scans with some interpretation, which could work fine for plain-vanilla web apps. Some do combined source code review and scanning. Some don’t want help understanding the code or scoping an engagement – they have a fixed-price menu, ranging from automated scans to manual pen tests using scanners and fuzzers and attack proxies. Others expect to spend a lot of time scoping and understanding what to test, to make sure that they focus on what's important.

SSL is broken

Moxie Marlinspike’s keynote on the problems with SSL and the CA model was entertaining but scary. He showed that the circle of trust model that underlies SSL – I trust the host because my browser trusts the CA and the CA trusts the host – is fundamentally broken now that CAs like Comodo and DigiNotar (and maybe others) have been compromised.

His solution is Convergence, an alternative way to establish trust between parties, where the client is responsible for choosing trust providers (notaries) rather than being locked into a single trust provider selected by the host.

Metrics and numbers for fun

I like numbers and statistics; here are a few that I found especially interesting. Some of them were thrown out quickly, so I hope that I got them right.

At a panel on secure development, one expert estimated that the tax for introducing a secure SDLC, including training people, getting tools and getting people to use them, was about 15% of the total development costs. Although he qualified this number by saying that it was from a pilot, another panelist agreed that it was reasonable, so it seems a good rule of thumb to use for now.

John Steven at Cigital says they have found that an inexperienced team can expect to have 38-40 security vulnerabilities per KLOC. A mature team, following good security hygiene, can get this down to 7-10 vulnerabilities per KLOC.

Chris Wysopal at Veracode presented some interesting numbers in his presentation on Application Security Debt – you can read more about his ideas on applying the technical debt metaphor to application security, and on building economic models for the costs of this debt, on his blog. He referenced findings from the 2010 Verizon breach study and tried to correlate them with analysis that Veracode has done using data from testing customer applications. This showed that the most common application-related breach was through Backdoors or hidden Command Channels, which surprised both of us, since Veracode’s own analysis shows that application Backdoors are not that common. I assumed that application Backdoors were so dangerous because they were so easily found and exploited by attackers. However, when I read the Verizon report again, what Verizon calls a Backdoor is a command channel opened up by malware, which is something quite different from a Backdoor left in an application, so the comparison doesn’t stand up. That leaves SQL Injection as the number one application vulnerability, which shouldn't surprise anyone.

The “A Word”

Most of the speakers and panelists at the conference were from enterprise companies or from vendors trying to sell to the enterprise. Not surprisingly, as a result there was little said about the needs and constraints of small companies, and at the sessions I attended, the speakers had not come to terms with the reality that most companies are adopting or have already adopted more incremental agile development practices. At one panel, Agile development was referred to as the “A Word”.

We urgently need tools and practices that small, fast-moving teams can use, because these teams are building and maintaining a lot of the world's software. It's time for OWASP and ISC2 and the rest of the application security community to accept this and understand what it means and offer help where it's really needed.

Dumbest thing said at the conference

This award goes to one vendor on a panel on security testing who said that it is irresponsible for developers to work with new technologies where they don't understand the security risks, that developers who work on the bleeding edge are putting their customers and businesses at unnecessary risk. Besides being blindingly obvious, it shows a complete lack of connection to the realities of software development and the businesses that rely on software (which is pretty much every business today). Dinis Cruz pointed out the absurdity of this statement, by admitting that we don’t fully understand the security risks of the common platforms that we use today like Java and .NET. This means that nobody should be writing software at all - or maybe it would be ok if we all went back to writing batch programs in COBOL and FORTRAN.

Impressed and Humbled

Overall I learned some good stuff, and I enjoyed the conference and the opportunity to meet some of the people behind OWASP. I was impressed and humbled by the commitment and accomplishments of the many OWASP contributors. Seeing how much they have done, talking to them, understanding how much they care, gives me some hope that we can all find a way to build better software together.

Wednesday, September 7, 2011

Standups – take ‘em or leave ‘em

We left ‘em.

Standup meetings are a core practice in Agile methods like Scrum and XP. Each day the team meets briefly to answer 3 questions: What did I get done yesterday? What am I going to do today? What is getting in my way?

Standups offer a quick check on what’s happening, what’s changed, who’s working on what, who needs help. The meeting is supposed to be short and sweet, no more than 15 minutes a day. Martin Fowler lists some good reasons to hold standups:
- Share commitment
- Communicate status
- Identify obstacles to be solved
- Set direction and focus
- Help to build a team

However, not everyone finds that standups are necessary and some people have started to question the value of standups over time and are looking for substitutes. A number of people have suffered through poorly-run standup meetings. And some people just plain don’t like them.

Our team has been successful following Scrum and XP practices as we transitioned from delivering phased releases (a release every 2-3 months) to 2-3 small releases a month. We looked closely at how other teams worked, spent time learning about incremental and iterative development methods, and took what made sense to us and tried it and adapted it. But one of the Agile practices that we did not take up was standups.

When we introduced the idea of standups to the team a few years ago, we were surprised by the response. The team was roughly split. Some people, especially people who were new to the team, or people who had worked on successful Scrum teams or XP teams before, liked the idea. But other people who had been on the team from the beginning were opposed, some of them strongly opposed. And they had some good reasons:

We were already running an operational business and building and delivering new software at the same time. Team members were helping to support the system, helping with day-to-day operations, and working closely with the business side. Some people were working late and on weekends to get changes in or to do after-hours testing, others were working early to help with startup issues or to coordinate with partners, some people were working in different timezones, some people were working at home, others were at the beck and call of integration partners and important clients. It would not be possible to schedule a daily meeting, even a short one, which would work well for everyone on a sustained basis. This is one example (and there are many others) where Agile development ideas which work well in the controlled, artificial reality of a development project need to be rethought and adapted to fit day-to-day operational demands and constraints.

Some of the team had their hands full with important changes or problems that had to be taken care of urgently. They met regularly with other team members to work through design issues and reviewed each other’s code. They got help from other team members or managers when they needed it – there was no need to wait for a daily meeting to bring up blockers or get whatever information they needed.

And there were a couple of introverts who hated all meetings and didn’t understand why a meeting that they had to go to every day and stand up at would be any better than any other meeting. They saw it as a half hour wasted out of every day – because they knew that a 15-minute meeting, once you include the time taken to save what you were working on, go to the meeting, stand up, get out of the meeting and get back to working productively, would eat up at least 30 minutes of every day, if not more.

Rather than try to fight with the team to implement a practice that we weren’t sure was really needed, we decided to find other ways to get the same results.

Alternatives to standups

There are other ways to share status within (and outside of) the team. Because we have people working together in different countries and different groups, we rely heavily on our bug tracking system to manage the backlog of work, to schedule and plan work, and to radiate status information. The bug tracking system is used by and shared with everyone: developers, testers, support, business operations, systems engineering, IT, compliance and management. Everything to do with the software and systems operations is tracked here, including new features and bug fixes and security vulnerabilities and problems found in testing, and operational and infrastructural changes, compliance reviews and new clients being onboarded. Using the system’s dashboards and notification feeds everyone can easily tell what is happening and what will be happening and when. The team doesn’t need standups to know what is happening, who is working on what, what the schedule is or what needs to be worked on next.

Managers who need to know what’s going on and who want to make sure that everyone is on track can get all of this through MBWA (management by walking around). And regular one-on-one meetings help reinforce priorities and make sure that everyone gets face time and a chance to ask questions without wasting the rest of the team’s time.

Standups help at the start

I can see that standups are more useful in the early stages of a project, when the team is starting to come together and everyone is working through the design, learning the domain and the technology, and feeling each other out – finding out each other’s strengths and quirks.

Without standups, it is harder for new people joining the team to understand what is going on and what’s important, and figure out where they fit in. Pairing up new team members with someone more experienced helps, but if you have a couple or more people joining a team, it makes sense to hold standups for a while at least to help them get up to speed.

Declining value over time

But when people have been working together for a while, standups offer declining value. Once the team has gelled and people know each other and can depend on each other, they don’t need to meet in a room every morning to talk about what they are working on. Like some other practices that are important and useful in the beginning, the team grows up or grows out of them.

And as you move more into maintenance and support work, when people on the team are working on smaller individual changes and fixes and the work tends to be more interrupt-driven, standups are a waste. What one person is working on doesn’t have much if anything to do with what somebody else is doing. They don’t need to, or want to, listen to each other talking about what they did and what they’re going to do, because if it was important they would already know about it anyway.

Stuffed Pigs, Nerf Balls and other Silly Games

One of the things I don’t like about standups is that they are fundamentally paternalistic – you’re treating the team like children, forcing them to get together and stand up in a room every day (they have to stand up because they can’t be trusted to sit down) and making them speak in turn for only a few minutes (but only of course if they are a pig, not a chicken). And on some teams, people go so far as to hand around a stuffed pig or a Nerf ball or follow some other awkward ritual to make sure that people don’t speak out of turn. If this sounds silly and childish, it is. You have a room full of children: team members who can’t ask questions, they can’t solve problems, they have to stand there and follow silly rituals. Day after day, week after week, month after month, year after year…

Rather than treating adults like little children, why not let people meet when they need to or want to – to solve problems, to review requirements or designs, to respond to incidents (Root Cause Analysis), to plan, and to make sure that people know when requirements or priorities are changing in a significant way.

Would we have had different results if we had implemented standups? Probably. Would the results have been better for the team and for the company? I’m not sure. We found other good ways for the team, and the rest of the company, to track what was going on and what needed to be done next. We’ve had some misunderstandings, and some people did get off track – standups might have helped prevent this. And standups would have helped new people joining the team, to come up to speed. But the rest of the team gelled quickly during our crazy startup days, and built up a high level of trust and openness and shared commitment without standups. Can you build high-performing, collaborative teams without standups? Of course you can.

Standups have a place, especially early on with people who don’t know each other, where everyone is learning and uncertain about what to do and about each other. But don’t do something, or keep doing something, if you don’t think that you need to, just because somebody read it in a book or learned about it in a 3-day SCM class. It’s your team, and you can find your own way to succeed.

Wednesday, August 24, 2011

Bugs and Numbers: How many bugs do you have in your code?

If you follow Zero Bug Tolerance of course you’re not supposed to have any bugs to fix after the code is done. But let’s get real. Is there any way to know how many bugs you're missing and will have to fix later, and how many bugs you might already have in your code? Are there any industry measures of code quality that you can use as a starting point?

For questions like these, the first place I look is one of the books by Capers Jones, arguably the leading expert in all things to do with software development metrics. There’s Applied Software Measurement, or Estimating Software Costs, or Software Engineering Best Practices: Lessons from Successful Projects in the Top Companies. These books offer different views into a fascinating set of data that Capers Jones has collected over decades from thousands of different projects. It’s easy to spend hours getting lost in this data set, reading through and questioning the findings that he draws from it.

So what can we learn from Capers Jones about bugs and defect potentials and defect density rates? A lot, actually.

On average 85% of bugs introduced in design and development are caught before the code is released (this is the average in the US as of 2009). His research shows that this defect removal rate has stayed roughly the same over 20 years, which is disappointing given the advances in tools and methods over that time.

We introduce about 5 bugs per Function Point (Capers Jones is awfully fond of measuring everything by Function Points, which are an abstract way of measuring code size), depending on the type of system being built. Surprisingly, Web systems are a bit lower, at 4 bugs per Function Point; other internal business systems are at 5, and military systems average around 7. Using backfiring (a crude technique to convert Function Points back into LOC measures), you can equate 1 Function Point to about 50-55 lines of Java code.

For the sake of simplicity, let’s use 1 Function Point = 50 LOC, and keep in mind that all of these numbers are really rough, and that using backfiring techniques to translate Function Points to source code statements introduces a margin of error – but it’s a lot easier than trying to think in Function Points. And all I want here is a rough indicator of how much trouble a team might be in.

If 85% of bugs are (hopefully) found and fixed before the code is released, this leaves 0.75 bugs per Function Point unfound (and obviously unfixed) in the code when it gets to production. Which means that for a small application of 1,000 Function Points (50,000 or so lines of Java code), you could expect around 750 defects at release.
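
Here is that back-of-the-envelope arithmetic as a small sketch, using only the rough averages quoted above – illustrative numbers, not a measurement:

// Back-of-the-envelope defect estimate using the averages quoted above:
// 5 defects per Function Point, 85% defect removal, ~50 LOC of Java per FP.
public class DefectEstimate {
    public static void main(String[] args) {
        int loc = 50000;                 // size of the code base in lines of Java
        double locPerFp = 50.0;          // rough backfiring ratio
        double defectsPerFp = 5.0;       // average defect potential
        double removalEfficiency = 0.85; // share of defects found before release

        double functionPoints = loc / locPerFp;                 // 1,000 FP
        double introduced = functionPoints * defectsPerFp;      // 5,000 defects
        double residual = introduced * (1 - removalEfficiency); // ~750 escape to production
        double severity1 = residual * 0.25;                     // ~188 severity 1 show stoppers

        System.out.printf("%.0f FP, %.0f defects introduced, %.0f residual, ~%.0f severity 1%n",
                functionPoints, introduced, residual, severity1);
    }
}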

And this is only accounting for the bugs that you don’t already know about: a lot of code is released with a list of known bugs that the development team hasn’t had a chance to fix, or doesn’t think is worth fixing, or doesn’t know how to fix. And this is just your code: it doesn’t account for bugs in the technology stack that the application depends on – the frameworks and application platform, database and messaging middleware, and any open source libraries or COTS that you take advantage of.

Of these 750+ bugs, around 25% will be severity 1 show stoppers – real production problems that cause something significant to break.

Ouch – no wonder most teams will spend a lot of time on support and fixing bugs after releasing a big system. Of course, if you’re building and releasing software incrementally, you’ll find and fix more of these bugs as you go along, but you’ll still be fixing a lot of bugs in production.

Remember that these are rough averages. And remember (especially the other guys out there), we can’t all be above average, no matter how much we would like to be. For risk management purposes, it might be best to stick with averages, or even consider yourself below the bar.

Also keep in mind that defect potentials increase with the size of the system – big apps have more bugs on average. Not only is there a higher potential to write buggy code in bigger systems, but as the code base gets bigger and more complex it’s also harder to find and fix bugs. So big systems get released with even more bugs, and really big apps with a lot more bugs.

All of this gets worse in maintenance

In maintenance, the average defect potential for making changes is higher than in development, about 6 bugs per Function Point instead of 5. And the chance of finding and fixing mistakes in your changes is lower (83%). This is all because it’s harder to work with legacy code that you didn’t write and don’t understand all that well. So you should expect to release 1.08 bugs per Function Point when changing code in maintenance, instead of 0.75 bugs per Function Point.

And maintenance teams still have to deal with the latent bugs in the system, some of which may hide in the code for years, or forever. This includes heisenbugs and ghosts and weird timing issues and concurrency problems that disappear when you try to debug them. On average, 50% of residual latent defects are found each calendar year. The more people using your code, the faster that these bugs will be found.

Of course, once you find these bugs, you still have to fix them. The average maintenance programmer can be expected to fix around 10 bugs per month – and maybe implement some small enhancements too. That's not a great return on investment.

Then there’s the problem of bug re-injections, or regressions – when a programmer breaks something accidentally as a side-effect of making a fix. On average, programmers fixing a bug will introduce a new bug 7% of the time – and this can run as high as 20% for complex, poorly-structured code. Trying to fix these bad fixes is even worse – programmers trying to fix these mistakes have a 15% chance of still messing up the fix, and a 30% chance of introducing yet another bug as a side effect! It's better to roll-back the fix and start again.

Unfortunately, all of this gets worse over time. Unless you are doing a perfect job of refactoring and continuously simplifying the code, you can expect code complexity to increase an average of between 1% and 3% per year. And most systems get bigger over time, as you add more features and copy-and-paste code (of course you don't do that): the code base for a system under maintenance increases between 5-10% per year. As the code gets bigger and more complex, the chance for more bugs also increases each year.

But what if we’re not average? What if we’re best in class?

What if you are doing an almost perfect job, if you are truly best in class? Capers Jones finds that best in class teams create half as many bugs as average teams (2.5 or fewer defects per Function Point instead of 5), and they find and fix 95% or more of these bugs before the code is released. That sounds impressive - it means only 0.125 bugs per Function Point. But for a 50,000 LOC system, that’s still somewhere around 125 bugs on delivery.

And as for zero bugs? In his analysis of 13,000 projects over a period of more than 40 years, there were 2 projects with no defects reported within a year of release. So you can aspire to it. But don’t depend on it.

Monday, August 15, 2011

The C14N challenge

Failing to properly validate input data is behind at least half of all application security problems. In order to properly validate input data, you have to start by first ensuring that all data is in the same standard, simple, consistent format – a canonical form. This is because of all the wonderful flexibility in internationalization and data formatting and encoding that modern platforms and especially the Web offer. Wonderful capabilities that attackers can take advantage of to hide malicious code inside data in all sorts of sneaky ways.

Canonicalization is a conceptually simple idea: take input data and convert it all into a single, simple, consistent, normalized internal format before you do anything else with it. But how exactly do you do this, and how do you know that it has been done properly? What are the steps that programmers need to take to properly canonicalize data? And how do you test for it? This is where things get fuzzy as hell.
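
As one rough sketch of a canonicalization step (using the JDK’s URLDecoder and Normalizer; the double-decoding check is a simple heuristic, not a complete answer to the problem):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

public class Canonicalizer {
    // Reduce an incoming parameter to one simple, consistent form before validating it:
    // URL-decode it, reject anything that was encoded more than once (a common way to
    // smuggle payloads past filters), then apply Unicode NFKC normalization.
    public static String canonicalize(String input) throws UnsupportedEncodingException {
        String once = URLDecoder.decode(input, "UTF-8");
        String twice = URLDecoder.decode(once, "UTF-8");
        if (!once.equals(twice)) {
            throw new IllegalArgumentException("double-encoded input rejected");
        }
        return Normalizer.normalize(once, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(canonicalize("caf%C3%A9"));        // prints: café
        System.out.println(canonicalize("%253Cscript%253E")); // "<script>" encoded twice: rejected
    }
}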

To read my latest post on canonicalization problems (and the search for solutions), go to the SANS Application Security blog.


Monday, July 11, 2011

Developing and Testing in the Cloud

There’s a lot of hype around “the Cloud” and what it can do.

One of the things that I am interested in is Cloud solutions that can help small software companies, and especially kickstart software startups: good tools that development teams can take advantage of to build and test their own stuff, without all of the hassle and expense of internal IT – buying and provisioning their own gear, setting up the tools and systems and networks, and finding someone who understands all of it well enough to do it right and keep it running. I want to find Cloud-based technology that can do for software teams what Salesforce.com does for customer-service businesses, so that software teams can get started quickly and do things right from the start.

Software teams need a core set of shared tools and capabilities:
  • Source code management / version control
  • Bug tracking
  • Collaboration and shared documentation
  • Build and Continuous Integration
  • Source code scanning and static analysis
  • Unit and functional testing
  • System, load and stress testing
  • Code deployment.
Can the Cloud provide all of this in an effective way?

Managing and Building Code in the Cloud

I keep running into people using GitHub to host their source code repositories. Not just people working on Open Source projects, but companies using GitHub to manage commercial projects in hosted private repositories, a service used by a number of startups. In GitHub, developers have a platform for managing code with the Git distributed version control system, and they get access to a good set of tools for a development team, especially distributed teams: admin functions to control permissioning, wikis, a bug tracking system and an online code review tool.

One alternative to GitHub is BitBucket, which uses the Mercurial DVCS. Like GitHub, BitBucket can be used to manage Open Source and private projects, and it looks like it offers a similar set of management capabilities and tools. GitHub is free for Open Source projects and cheap for small teams, while BitBucket’s pricing is based on the number of users; small teams (up to 5 users) are free, for open or closed source.

Another On Demand SCM platform built on Mercurial is Fog Creek Software’s Kiln, which integrates nicely with Fog Creek’s bug tracking and planning system, FogBugz.

Atlassian, which bought BitBucket last year, also offers Jira Studio, a comprehensive hosted development platform centered on a Subversion code repository integrated with the rest of Atlassian’s strong development toolset: Jira bug tracking, FishEye for code searching, Confluence wiki, Greenhopper for Agile team planning, Elastic Bamboo for build and Continuous Integration on Amazon EC2, and Crucible for online code review. That’s almost everything that a team needs (all that is missing is static analysis and a functional and load testing platform). Presumably the same integrated development tool support will extend to BitBucket over time, although there are already some overlaps between BitBucket and Jira Studio.

And there’s CloudBees Dev@cloud which offers a choice of secure Git, SVN or Maven repositories, and Continuous Integration through a hosted Jenkins server.

SpringSource’s Code2Cloud might be a good option, especially for Java / Spring developers, when it is ready – the details are fuzzy at the moment.

Google Code is cool, but for now is only available for Open Source projects.

For enterprises, IBM recently announced Smart Business Development and Test Cloud and IBM Smart Business Development and Test on the IBM Cloud (also known as SmartCloud Enterprise – I’m not sure what the difference is between them; I got too worn out reading through the marketing speak), with all sorts of IBM technology and partner tools. This stuff isn’t for the faint of heart, or the small of wallet.

Continuous Integration in the Cloud

Continuous Integration is a problem that development teams need to solve before they get too far along, and setting up a CI server and keeping it running isn’t trivial. It’s another good fit for the elastic On Demand model of the Cloud, paying for more infrastructure only when you need it.

Besides Atlassian’s hosted end-to-end platform, and CloudBees’ Jenkins as a Service which can be used to build code hosted on a repository accessible through the Internet, there are a few options for CI in the Cloud (courtesy of Pascal Thivent on Stack Overflow):

For Open Source projects there is CodeBetter based on the TeamCity build server. For proprietary projects there is Mike CI which is hosted on Amazon EC2, and includes interfaces to Subversion, Git and Mercurial repositories; and CI Foundry. However, I don’t know how real these services are. Mike CI was down the last couple of times I went to check on it, and CI Foundry’s home page includes this not-very-convincing testimonial:
What people are saying...
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris quis ligula vel ligula varius cursus tristique eget sem. Mauris sagittis, ante id imperdiet auctor, sem mi condimentum neque, quis feugiat quam dui dictum risus.
- Chris Read
Continuous Deployment expert
Thivent suggests that you can roll your own CI solution on Amazon EC2, or maybe use something like “CI in a box”. Using an On Demand platform like EC2 makes sense for CI, rather than provisioning your own gear: it’s a bit more work for you to do the setup, but you can rely on Amazon’s infrastructure to reduce costs and simplify operations.

Bug Tracking in the Cloud

There are some decent and reasonably priced Cloud-based bug tracking systems suitable for small teams, including Fog Creek Software’s FogBugz and Atlassian’s Jira Hosted option, both of which are also included in their hosted code management suites.

Testing in the Cloud

The Cloud has a lot of promise for app testing, especially for large-scale performance and load testing. You’re building a cool new Web platform that needs to support 10,000 or 100,000 concurrent sessions according to your business plan. Before your investors put more money in, they want to see just how real your software is, how far away you are from delivering something that could work. Of course you don’t have the money or time or IT skills to build a large-scale test center of your own at this point, and there’s a lot of waste in doing this – you need a lot of gear, but you won’t need to use all of it very often.

You can spin up a test farm on EC2 surprisingly quickly, even a big one – and as long as you don’t need to run the tests for a long time and you don’t ask for too high a service level, it won’t cost you much at all. You pay for what you need when you need it.

To run stress tests you need more than just the app deployed in a farm. You also need a scalable stress test harness to drive load scenarios and to measure the results. There are some viable options available in the Cloud today: SOASTA with its CloudTest solution for load testing and extreme stress testing, scaling from smallish sites to really big; HP Load Runner in the Cloud running on EC2; JMeter in the Cloud; LoadStorm and BrowserMob.

There are also some other interesting test capabilities, including technology from Sauce Labs, which lets you run Selenium tests on Web apps in the Cloud or manually test your app against different Cloud-based, instantly-available browsers, all with video records of failed tests and other problems found. Simple, useful and cool.
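
As a rough sketch of how this works, a Selenium test simply points a RemoteWebDriver at the provider’s hosted grid instead of a local browser. The endpoint URL, credentials and capability values below are placeholders – check your provider’s documentation for the real ones:

import java.net.URL;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class CloudBrowserSmokeTest {
    public static void main(String[] args) throws Exception {
        // Ask the hosted grid for a specific browser / OS combination.
        DesiredCapabilities caps = DesiredCapabilities.firefox();
        caps.setCapability("platform", "Windows 7");
        caps.setCapability("version", "7");

        // Placeholder endpoint and credentials -- not a real URL.
        WebDriver driver = new RemoteWebDriver(
                new URL("http://USERNAME:ACCESS_KEY@ondemand.example.com/wd/hub"), caps);
        try {
            driver.get("http://www.example.com/");
            System.out.println("Page title: " + driver.getTitle());
        } finally {
            driver.quit(); // release the cloud browser session
        }
    }
}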

A different take on Cloud-based testing that could work for startups and smaller companies is crowd-sourced testing services like uTest, where you get people “in the Cloud” to test your app (web apps, desktop apps and mobile apps), including functional coverage and regression testing (you give them tests, they run them and maybe help improve them), exploratory testing, usability testing, load testing and test planning and management. Like other Cloud-based On Demand solutions, you pay for work when you need it. uTest has a community of something like 40,000 testers available, although it’s not clear how many of them are much good.

Mob4Hire is another crowd-sourced solution specifically for mobile apps – they offer functional testing and usability testing and other services like market research.

Static Analysis in the Cloud

Some of the leading static analysis suppliers have also gone to the Cloud. HP offers Fortify on Demand to scan code for security vulnerabilities, and IBM has Rational Appscan OnDemand which includes assistance from an experienced team to help you interpret the results and plan remediation. Both of these solutions are targeted more towards the enterprise than small teams.

Veracode, which runs binary code analysis for security vulnerabilities and now also offers web vulnerability scanning, is only available through the Cloud. They offer a free trial scan of small Java apps for XSS and SQL Injection flaws (two of the most common and fatal web application vulnerabilities).

For startups that care about building reliable and secure apps (and most of them should, especially Web 2.0 startups and mobile app developers) it probably makes more sense to look at simpler developer IDE-based tools like Findbugs and Google’s CodePro Analytix (both free) or Klocwork Solo for Java which is priced per user.

The Tradeoffs

Managing your code, building it and testing it in the Cloud has a number of clear advantages:
  • Faster time to market – you can get access to the infrastructure and tools quickly, with minimal hassle and setup time.
  • Savings on cost and Capex – especially for load testing – you don’t have to pay upfront, you only have to pay for what you use when you use it, so you can better manage your cash flow.
  • Convenience – you don’t have to provision and operate the gear yourself and add to your operational headaches.
  • Access to specialist skills – your Cloud provider will know these technologies, and how to solve specific problems with them, much better than you can afford to, so you don’t need to contract or hire gurus to help solve problems or make sure everything is set up properly, or waste time figuring it out yourself.
  • Service levels – they’ll have more people to help keep things running, to make sure code and data are backed up, systems are cleaned up and monitored and patched. They’re less likely to lose your code and data than you are.
When it comes to security, the common argument for working with a Cloud-based provider is that they will be more responsible and serious about security than most of their customers will be or can be. Because of economies of scale, they can invest in doing things right, and because they have to be secure to compete. GitHub, for example, looks like they have a good security program in place, and they are hosted by Rackspace, so the data center, network and servers will be set up and taken care of much better than a small company, especially a startup, can afford to or would know how to do.

The concern with using a Cloud platform like this comes down to some fundamental points:

1. It’s a shared platform, used by a lot of companies. The more successful the platform is, the more customers and interesting and valuable data / IP that they manage, the more attractive a target they are to bad guys. A small company by itself is an easy target, but not especially interesting or easy to find. But a platform that holds data for a lot of small, innovative companies? Yummy….[pronounced in a deep, growly voice with a foreign accent]

2. You take on some important risks any time that you outsource something that is core and critical to your business. Your data / code / customer base is more important to you than it is to whoever you outsource this responsibility to. If something goes badly wrong, if you can’t get an important change or an important bug fix out to a customer because the Cloud service isn’t working, or if the integrity or confidentiality of your code is compromised, they will be sad, and they will lose a customer (you)... but you may go out of business.

You can ask for, and pay for, good SLAs - but Amazon’s major EC2 outage earlier this year showed that even the best providers can’t meet their SLA commitments. While Amazon did a lot of things right in handling and recovering from that outage, it affected a lot of customers for too long.

3. There’s more to a secure online platform than using SSL and network firewalls and running in a good data center - pretending otherwise is at best naïve. As a customer, you need to be confident in how the Cloud services provider designed and built the software architecture and platform to protect the confidentiality and integrity of your IP and data, in their multi-tenant partitioning scheme and software security controls, and in their SDLC. Security vulnerabilities in application code are a serious source of risk to businesses on the Web, and most companies still do a poor job of building apps in a secure way.

DropBox, a Cloud-based document storage facility, is an example of a platform that seemed to be secure and confidential until somebody started to look deeper into it, and they continue to run into problems with basic security issues. Unfortunately, none of the hosted platforms provide any kind of statement that I could find on their secure SDLC, or on what steps they take to ensure that the platform is designed and implemented safely.

Testing in the Cloud, especially load testing for Web apps, looks to be a no-brainer – the business case is too compelling to ignore. Many of the other tools have a lot of promise: something like Atlassian’s end-to-end hosted studio would save a lot of time and trouble for a small company, and the upfront cost savings are real.

It comes down to how much trust you can put into your Cloud provider and their SLAs for availability and support, and whether in the end you can afford to trust your core IP to somebody else. 37 Signals, which builds and operates its own Cloud solutions for project management and document sharing, doesn’t think it is a good idea:
We host all the source code for our applications internally for obvious security reasons. That’s not to say Github’s private repository hosting isn’t a good option, especially if you want a hassle-free setup. It’s just not for us.
JH 22 Aug 08
For some companies, especially startups, IP embodied in their code is critically important – it may be all that they have. Ironically, it’s these companies that are most likely to use a Cloud platform like GitHub or Jira Studio or BitBucket. Deciding whether to trust your future to the Cloud is not an easy decision to make. But as the technology continues to get better, it’s a question that is worth asking.