Thursday, June 25, 2009

The Value of Static Analysis Tools

Just how effective is static analysis, and what does it protect you from?

There is a lot of attention given to static analysis tools, especially from the software security community - and some serious venture capital money being thrown at static analysis tool providers such as Coverity.

The emphasis on using static analysis tools started with Cigital's CTO Gary McGraw in his definitive book on Software Security: Building Security In. In a recent interview with Jim Manico for OWASP (Jan 2009), Dr. McGraw went so far as to say that

“My belief is that everybody should be using static analysis tools today. And if you are not using them, then basically you are negligent, and you should prepare to be sued by the army of lawyers that have already hit the beach”.

Statements like this, from a thought leader in the software security community, certainly encourage companies to spend more money on static analysis tools, and of course should help Cigital’s ownership position in leading static analysis tool provider Fortify Software, which Cigital helped to found.

You can learn more about the important role that static analysis plays in building secure systems from Brian Chess, CTO of Fortify, in his book Secure Programming with Static Analysis.

Secure software development maturity models like SAMM and BSIMM emphasize the importance of code reviews to find bugs and vulnerabilities, but especially the use of static analysis tools, and OWASP has a number of free tools and projects in this area.

Now even Gartner has become interested in the emerging static analysis market and its players – evidence that the hype is reaching, or has reached, a critical point. In Gartner’s study of what they call Static Application Security Testing (SAST) suppliers (available from Fortify), they state that

“[...] must adopt SAST technology and processes because the need is strategic. Enterprises should use a short-term, tactical approach to vendor selection and contract negotiation due to the relative immaturity of the market.”

Well, there you have it: whether the products are ready or not, you need to buy them.

Gartner’s analysis puts an emphasis on full-service offerings and suites: principally, I suppose, because CIOs at larger companies, who are Gartner’s customers, don’t want to spend a lot of time finding the best technology and prefer to work with broad solutions from strategic technology partners, like IBM or HP (neither of which has strong static analysis technology yet, so watch for acquisitions of the independents to fill out their security tool portfolios, as they did in the dynamic analysis space). Unfortunately, this has led vendors like Coverity to spend their time and money on filling out a larger ALM portfolio, building and buying in technology for build verification, software readiness (I still don’t understand who would use this) and architecture analysis rather than investing in their core static analysis technology. On their site, Coverity proudly references a recent story "Coverity: a new Mercury Interactive in the making?", which should make their investors happy and their customers nervous – as a former Mercury Interactive, now HP customer, I can attest that while the acquisition of Mercury by HP may have been good for HP and good for Mercury’s investors, it was not good for Mercury’s customers, at least the smaller ones.

The driver behind static analysis is its efficiency: you buy a tool, you run it, it scans thousands and thousands of lines of code and finds problems, you fix the problems, now you’re secure. Sounds good, right?

But how effective is static analysis? Do these tools find real security problems?

We have had success with static analysis tools, but it hasn’t been easy. Starting in 2006, some of our senior developers started working with FindBugs (we’re a Java shop) because it was free, it was easy to get started with, and it found some interesting, and real, problems right away. Once we understood the tool and how it worked, had done some cleanup, and a smart, diligent senior engineer had invested a fair amount of time investigating false positives and setting up filters on some of the checkers, we added FindBugs checking to our automated build process, and it continues to be our first line of defense in static analysis. The developers are used to checking the results of the FindBugs analysis daily, and we take all of the warnings that it reports seriously.

Later in 2006, as part of our work with Cigital to build a software security roadmap, we conducted a bake-off of static analysis tool vendors including Fortify, Coverity, Klocwork (who would probably get more business if they didn't have a cutesy name that is so hard to remember), and a leading Java development tools provider whose pre-sales support was so irredeemably pathetic that we could not get their product installed, never mind working. We did not include Ounce Labs at the time because of pricing, and because we ran out of gas, although I understand that they have a strong product.

As the NIST SAMATE study confirms, working with different tool vendors is a confusing, challenging and time-consuming process: the engines work differently, which is good since they catch different types of problems, but there is no consistency in the way that warnings are reported or rated, and different vendors use different terms to describe the same problem. And there is the significant problem of dealing with noise: handling the large number of false positives that all of the tools report (some are better than others), and understanding what to take seriously.

At the time of our initial evaluation, some of the tools were immature, especially the C/C++ tools that were being extended into the Java code checking space (Coverity and Klocwork). Fortify was the most professional and prepared of the suppliers. However, we were not able to take advantage of Fortify’s data flow and control flow analysis (one of the tool’s most powerful analysis capabilities) because of some characteristics of our software architecture. We verified with Fortify and Cigital consultants that it was not possible to take advantage of the tool’s flow analysis, even with custom rules extensions, without fundamentally changing our code. This left us with relying on the tool’s simpler security pattern analysis checkers, which did not uncover any material vulnerabilities. We decided that with these limitations, the investment in the tool was not justified.

Coverity’s Java checkers were also limited at that time. However, by mid 2007 they had improved the coverage and accuracy of their checkers, especially for security issues and race conditions checking, and generally improved the quality of their Java analysis product. We purchased a license for Coverity Prevent, and over a few months worked our way through the same process of learning the tool, reviewing and suppressing false positives, and integrating it into our build process. We also evaluated an early release of Coverity’s Dynamic Thread Analysis tool for Java: unfortunately the trial failed, as the product was not stable – however, it has potential, and we will consider looking at it again in the future when it matures.

Some of the developers use Klocwork Developer for Java, now re-branded as Klocwork Solo, on a more ad hoc basis: for small teams, the price is attractive, and it comes integrated into Eclipse.

In our build process we have other static analysis checks, including code complexity checking and other metric trend analysis using an open source tool, JavaNCSS, to help identify complex (and therefore high-risk) sections of code, and we have built proprietary package dependency analysis checks into our build to prevent violation of dependency rules. One of our senior developers has now started working with Structure101 to help us get a better understanding of our code and package structure and how it is changing over time. And other developers use PMD to help clean up code, and the static analysis checkers included in IntelliJ.
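To give a sense of what a package dependency check in a build can look like, here is a minimal sketch in Java. It is not the proprietary check described above: the class name, the rule format, and the `com.example` package names are all hypothetical, and it assumes the package-to-package dependencies have already been extracted by some other step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: all names and the rule format are invented for illustration.
final class DependencyRuleCheck {

    // True if 'name' is the package 'prefix' itself or one of its sub-packages.
    static boolean inPackage(String name, String prefix) {
        return name.equals(prefix) || name.startsWith(prefix + ".");
    }

    // dependencies: package -> packages it depends on (assumed to be extracted elsewhere).
    // forbidden:    package prefix -> package prefixes it must not depend on.
    // Returns one "from -> to" entry per violated rule.
    static List<String> findViolations(Map<String, List<String>> dependencies,
                                       Map<String, List<String>> forbidden) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, List<String>> dep : dependencies.entrySet()) {
            String from = dep.getKey();
            for (Map.Entry<String, List<String>> rule : forbidden.entrySet()) {
                if (!inPackage(from, rule.getKey())) {
                    continue; // this rule does not apply to the source package
                }
                for (String to : dep.getValue()) {
                    for (String banned : rule.getValue()) {
                        if (inPackage(to, banned)) {
                            violations.add(from + " -> " + to);
                        }
                    }
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("com.example.domain.order",
                 List.of("com.example.domain.common", "com.example.ui.widgets"));
        deps.put("com.example.ui.widgets", List.of("com.example.domain.order"));

        // Example rule: domain code must not depend on UI code.
        Map<String, List<String>> forbidden =
                Map.of("com.example.domain", List.of("com.example.ui"));

        List<String> violations = findViolations(deps, forbidden);
        violations.forEach(v -> System.out.println("Dependency rule violated: " + v));
        // A real build step would fail the build here when the list is not empty.
    }
}
```

The point of keeping the rules as data is that the check itself stays small, and tightening the architecture becomes a matter of adding a rule rather than changing code.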

Each tool takes different approaches and has different strengths, and we have seen some benefits in using more than one tool as part of a defense-in-depth approach. While we find by far the most significant issues in manual code reviews or exploratory testing or through our automated regression safety net, the static analysis tools have been helpful in finding real problems.

While FindBugs does only “simple, shallow analysis of network security vulnerabilities” and analysis of malicious code vulnerabilities as security checks, it is good at finding small, stupid coding mistakes that escape other checks, and the engine continues to improve over time. This open source project deserves more credit: it offers incredible value to the Java development community, and anyone building code in Java who does not take advantage of it is a fool.

Coverity reports generally few false positives, and is especially good for finding potential thread safety problems and null pointer (null return and forward null) conditions. It also comes with a good management infrastructure for trend analysis and review of findings. Klocwork is the most excitable and noisiest of all of our tools, but it includes some interesting checkers that are not available in the other tools – although after manual code reviews and checks by the other static analysis tools, there is rarely anything of significance left for it to consider.
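For readers unfamiliar with the "null return" and "forward null" categories, the following small Java sketch shows the general shape of each pattern along with a guarded version. The class, methods and data are hypothetical, invented for illustration; this is not output from any of the tools above, and it makes no claim about exactly which warnings a particular tool would raise.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: names and data are invented for illustration.
final class NullDefectExamples {
    private static final Map<String, String> OWNERS = new HashMap<>();
    static {
        OWNERS.put("A-1", "Alice");
    }

    // Returns null when the account id is unknown.
    static String findOwner(String accountId) {
        return OWNERS.get(accountId);
    }

    // "Null return" pattern: the result of a method that can return null
    // is dereferenced without a check.
    static int ownerNameLengthUnsafe(String accountId) {
        return findOwner(accountId).length(); // throws NullPointerException for unknown ids
    }

    // "Forward null" pattern: the value is compared against null, so the code
    // admits it can be null, and is then dereferenced on that same path.
    static String describeUnsafe(String owner) {
        if (owner == null) {
            System.err.println("no owner on record");
        }
        return owner.trim(); // still reached when owner == null
    }

    // Guarded versions of the two methods above.
    static int ownerNameLengthSafe(String accountId) {
        String owner = findOwner(accountId);
        return owner == null ? 0 : owner.length();
    }

    static String describeSafe(String owner) {
        return owner == null ? "(unknown)" : owner.trim();
    }

    public static void main(String[] args) {
        System.out.println(ownerNameLengthSafe("A-1"));
        System.out.println(ownerNameLengthSafe("missing"));
        System.out.println(describeSafe(null));
    }
}
```

Both unsafe methods compile cleanly and pass any test that only uses known account ids, which is why this kind of mistake tends to slip past other checks.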

But more than the problems that the tools find directly, the tools help to identify areas where we may need to look deeper: where the code is complex, or too smarty pants fancy, or otherwise smelly, and requires followup review. In our experience, if a mature analysis tool like FindBugs reports warnings that don’t make sense, it is often because it is confused by the code, which in turn is a sign that the code needs to be cleaned up. We have also seen the number of warnings reported decline over time as developers react to the “nanny effect” of the tools’ warnings, and change and improve their coding practices to avoid being nagged. And the final benefit of using these tools is that this frees up the developers to concentrate on higher-value work in their code reviews: they don’t have to spend so much time looking out for fussy, low-level coding mistakes, because the tools have found them already, so the developers can concentrate on more important and more fundamental issues like correctness, proper input validation and error handling, optimization, simplicity and maintainability.

While we are happy with the toolset we have in place today, I sometimes wonder whether we should beef up our tool-based code checking. But is it worth it?

In a presentation at this year’s JavaOne conference, Prof. Bill Pugh, the father of FindBugs, says that

“static analysis, at best, might catch 5-10% of your software quality problems.”

He goes on to say, however, that static analysis is 80+% effective at finding specific defects and cheaper than other techniques for catching these same defects – silent, nasty bugs and programming mistakes.

Prof. Pugh emphasizes that static analysis tools have value in a defense-in-depth strategy for quality and security, combined with other techniques; that “each technique is more efficient at finding some mistakes than others”; and that “each technique is subject to diminishing returns”.

In his opinion, “testing is far more valuable than static analysis”, and “FindBugs might be more useful as an untested code detector than a bug detector”. If FindBugs finds a bug, you have to ask: “Did anyone test that code?” In our experience, Prof. Pugh’s findings about FindBugs can be applied to the other static analysis tools as well.

As technologists, we are susceptible to the belief that technology can solve our problems – the classic “silver bullet” problem. When it comes to static analysis tools, you’d be foolish not to use a tool at all, but at the same time you’d be foolish to expect too much from them – or pay too much for them.

Wednesday, June 17, 2009

How long can this go on?

Our team delivers software iteratively and incrementally, and over the past 3 years we have experimented with longer (1-2 month) and shorter (1-2 week) iterations, adjusting to circumstances, looking for the proper balance between cost and control.

There are obvious costs in managing an iteration: startup activities (planning, prioritization, kickoff, securing the team's commitment to the goals of the release), technical overheads like source code and build management (branching and merging and associated controls), status reporting to stakeholders, end-to-end system and integration testing, and closure activities (retrospectives, resetting). We don’t just deliver “ship-quality” software at the end of an iteration: in almost every case we go all the way to releasing the code to production, so our costs also include packaging, change control, release management, security and operations reviews, documentation updates and release notes and training, certifications with partners, data conversion, rollback testing, and pre- and post-implementation operations support. Yep, that’s a lot of work.

All of these costs are balanced against control: our ability to manage and contain risks to the project, to the product, and to the organization. I explored how to manage risks through iterative, incremental development in an earlier post on risk management.

We’ve found that if an iteration is too long (a month or more), it is hard to defend the team from changes to priorities, to prevent new requirements from coming in and disrupting the team’s focus. And in a longer cycle, there are too many changes and fixes that need to be reviewed and tested, increasing the chance of mistakes or oversights or regressions.

Shorter releases are easier to manage because, of course, they are necessarily smaller. We can manage the pressure from the business side for changes because of the fast delivery cycle (except for emergency hot fixes, we are usually able to convince our product owner, and ourselves, to wait for the next sprint since it is only a couple of weeks away), and it is easier for everyone to get their heads around what was changed in a release and how to verify it. And shorter cycles keep us closer to our customers, not only giving us faster feedback, but demonstrating constant value. I like to think of it as a “value pipeline”, continuously streaming business value to customers.

One of my favorite books on software project management, Johanna Rothman’s Manage It!, recommends making increments smaller to get faster feedback – feedback not just on the product, but on how you build it and how you can improve. The smaller the iteration, the easier to look at it from beginning-to-end and see where time is wasted, what works, what doesn’t, where time is being spent that isn’t expected.

“Shorter timeboxes will make the problems more obvious so you can solve them.”

Ms. Rothman recommends using the “Divide-by-Two Approach to Reduce Iteration Size”: if the iterations aren’t succeeding, divide the length in half, so 6 weeks becomes 3 weeks, and so on. Smaller iterations provide feedback - longer ones mask the problems.

Ms. Rothman also says that it is difficult to establish a rhythm for the team if iterations are too long. In “Selecting the Right Iteration Length for Your Software Development Process”, Mike Cohn of Mountain Goat Software examines the importance of establishing a rhythm in incremental development. He talks about the need for a sense of urgency: if an iteration is too long, it takes too much time for the team to “warm up” and take things seriously. Of course, this needs to be balanced against keeping the team in a constant state of emergency, and burning everyone out.

Some of the other factors that Mr. Cohn finds important in choosing an iteration length:
- how long can you go without introducing change – avoiding requirements churn during an iteration.
- if cycles are too short (for example, a week) small issues, like a key team member coming down with a cold, can throw the team’s rhythm off and impact delivery.

All of this supports our experience: shorter (but not too-short) cycles help establish a rhythm and build the team’s focus and commitment, constantly driving to delivering customer value. And shorter cycles help manage change and risk.

Now we are experimenting with an aggressive, fast-tracked delivery model: a 3-week end-to-end cycle, with software delivered to production every 2 weeks. The team starts work on designing and building the next release while the current release is in integration, packaging and rollout, overlapping development and release activities. Fast-tracking is difficult to manage, and can add risk if not done properly. But it does allow us to respond quickly to changing business demands and priorities, while giving us time for an intensive but efficient testing and release management process.

We'll review how this approach works over the next few months and change it as necessary, but we intend to continue with short increments. However, I am concerned about the longer-term risks, the potential future downsides to our rapid delivery model.

In The Decline and Fall of Agile, James Shore argues that rapid cycling shortcuts up-front design:

“Up-front design doesn't work when you're using short cycles, and Scrum doesn't provide a replacement. Without continuous, incremental design, Scrum teams quickly dig themselves a gigantic hole of technical debt. Two or three years later, I get a call--or one of my colleagues does. "Changes take too long and cost too much!" I hear. "Teach us about test-driven development, or pairing, or acceptance testing!" By that time, fixing the real problems requires paying back a lot of technical debt, and could take years.”

While Mr. Shore is specifically concerned about loose implementations of Scrum, and its lack of engineering practices compared with other approaches like XP (see also Martin Fowler of ThoughtWorks on the risks of incremental development without strong engineering discipline), the problem is a general one for teams working quickly, in short iterations: even with good engineering discipline, rapid cycling does not leave a lot of time for architecture, design and design reviews, test planning, security reviews... all of those quality gating activities that waterfall methods support. This is a challenge for secure software development, as there is little guidance available on effectively scaling software security SDLC practices to incremental, agile development methods, something that I will explore more later.

Trying to account for architecture and platform decisions and tooling and training in an upfront “iteration zero” isn’t enough, especially if your project is still going strong after 2 or 3 years. What I worry about (and I worry about a lot of things) is that, moving rapidly from sprint to sprint, the team cannot stop and look at the big picture, to properly re-assess architecture and platform technology decisions made earlier. Instead, all the team has a chance to do is make incremental, smaller-scale improvements (tighten up the code here, clean up an interface there, upgrade some of the technology stack), which may leave fundamental questions unanswered, trading off short-term goals (deliver value, minimize the cost and risk of change) against longer-term costs and uncertainties.

One of the other factors that could affect quality in the longer term is the pressure on the team to deliver in a timebox. In Technical Debt: Warning Signs, Catherine Powell raises the concern that developers committing to a date may put schedule ahead of quality:

“Once you've committed to a release date and a feature set, it can be hard to change. And to change it because you really want to put a button on one more screen? Not likely. The "we have to ship on X because X is the date" mentality is very common (and rightly so - you can't be late forever because you're chasing perfection). However, to meet that date you're likely to cut corners, especially if you've underestimated how much time the feature really takes, or how much other stuff is going on.”

Finally, I am concerned that rapid cycling does not give the team sufficient opportunities to pause, to take a breath, to properly reset. If they are constantly moving heads down from one iteration to another, do team members really have a chance to reflect, understand and learn? One of the reasons that I maintain this blog is exactly for this: to explore problems and questions that my team and I face; to research, to look far back and far ahead, without having to focus on the goals and priorities of the next sprint.

These concerns, and others, are explored in Traps & Pitfalls of Agile Development - a Non-Contrarian View:

"Agile teams may be prone to rapid accumulation of technical debt. The accrual of technical debt can occur in a variety of ways. In a rush to completion, Iterative development is left out. Pieces get built (Incremental development) but rarely reworked. Design gets left out, possibly as a backlash to BDUF. In a rush to get started building software, sometimes preliminary design work is insufficient. Possibly too much hope is placed in refactoring. Refactoring gets left out. Refactoring is another form of rework that often is ignored in the rush to complete. In summary, the team may move too fast for it's own good."

Our team’s challenge is not just to deliver software quickly: like other teams that follow these practices, we’ve proven that we can do that. Our challenge is to deliver value consistently, at an extremely high level of quality and reliability, on a continual and sustainable basis. Each design and implementation decision has to be made carefully: if your customer's business depends on you making changes quickly and perfectly, without impacting their day-to-day operations, how much risk can you afford to take on to make changes today so that the system may be simpler and easier to change tomorrow, especially in today's business environment? It is a high stakes game we're playing. I recognize that this is a problem of debt management, and I'll explore the problems of technical debt and design debt more later.

The practices that we have followed have worked well for us so far. But is there a point where rapid development cycles, even when following good engineering practices, provide diminishing returns? When does the accumulation of design decisions made under time pressure, and conscious decisions to minimize the risk of change, add up to bigger problems? Does developing software in short cycles, with a short decision-making horizon, necessarily result in long-term debt?