Building Real Software: September 2014

In my last post, I explained how to do code reviews properly. I recommended taking advantage of static analysis tools like Findbugs, PMD, Klocwork or Fortify to check for common mistakes and bad code before passing the code on to a reviewer, to make the reviewer’s job easier and reviews more effective.

Some readers asked whether static analysis tools can be used instead of manual code reviews. Manual code reviews add delays and costs to development, while static analysis tools keep getting better, faster, and more accurate. So can you automate code reviews, in the same way that many teams automate functional testing? Do you need to do manual reviews too, or can you rely on technology to do the job for you?

Let’s start by understanding what static analysis bug checking tools are good at, and what they aren’t.

What static analysis tools can do – and what they can’t do

In this article, Paul Anderson at GrammaTech does a good job of explaining how static analysis bug finding works, the trade-offs between recall (finding all of the real problems), precision (minimizing false positives) and speed, and the practical limitations of using static analysis tools for finding bugs.

Static analysis tools are very good at catching certain kinds of mistakes, including memory corruption and buffer overflows (for C/C++), memory leaks, illegal and unsafe operations, null pointers, infinite loops, incomplete code, redundant code and dead code.

A static analysis tool knows if you are calling a library incorrectly (as long as it recognizes the function), if you are using the language incorrectly (things that a compiler could find but doesn’t) or inconsistently (indicating that the programmer may have misunderstood something).

And static analysis tools can identify code with maintainability problems, code that doesn't follow good practice or standards, is complex or badly structured and a good candidate for refactoring.

But these tools can’t tell you when you have got the requirements wrong, or when you have forgotten something or missed something important – because the tool doesn't know what the code is supposed to do. A tool can find common off-by-one mistakes and some endless loops, but it won’t catch application logic mistakes like sorting in descending order instead of ascending order, or dividing when you meant to multiply, referring to buyer when it should have been seller, or lessee instead of lessor. These are mistakes that aren't going to be caught in unit testing either, since the same person who wrote the code wrote the tests, and will make the same mistakes.

Tools can’t find missing functions or unimplemented features or checks that should have been made but weren't. They can’t find mistakes or holes in workflows. Or oversights in auditing or logging. Or debugging code left in by accident.

Static analysis tools may be able to find some backdoors or trapdoors – simple ones at least. And they might find some concurrency problems – deadlocks, races and mistakes or inconsistencies in locking. But they will miss a lot of them too.

Static analysis tools like Findbugs can do security checks for you: unsafe calls and operations, use of weak encryption algorithms and weak random numbers, using hard-coded passwords, and at least some cases of XSS, CSRF, and simple SQL injection. More advanced commercial tools that do inter-procedural and data flow analysis (looking at the sources, sinks and paths between) can find other bugs including injection problems that are difficult and time-consuming to trace by hand.

But a tool can’t tell you that you forgot to encrypt an important piece of data, or that you shouldn't be storing some data in the first place. It can’t find logic bugs in critical security features, if sensitive information could be leaked, when you got an access control check wrong, or if the code could fail open instead of closed.

And using one static analysis tool on its own to check code may not be enough. Evaluations of static analysis tools, such as NIST's SAMATE project (a series of comparative studies, where many tools are run against the same code), show almost no overlap between the problems found by different tools (outside of a few common areas like buffer errors) even when the tools are supposed to be doing the same kinds of checks. Which means that to get the most out of static analysis, you will need to run two or more tools against the same code (which is what SonarQube, for example, which integrates its own static analysis results with other tools, including popular free tools, does for you). If you’re paying for commercial tools, this could get very expensive fast.

Tools vs. Manual Reviews

Tools can find cases of bad coding or bad typing – but not bad thinking. These are problems that you will have to find through manual reviews.

A 2005 study Comparing Bug Finding Tools with Reviews and Tests used Open Source bug finding tools (including Findbugs and PMD) on 5 different code bases, comparing what the tools found to what was found through code reviews and functional testing. Static analysis tools found only a small subset of the bugs found in manual reviews, although the tools were more consistent – manual reviewers missed a few cases that the tools picked up.

Just like manual reviews, the tools found more problems with maintainability than real defects (this is partly because one of the tools evaluated – PMD – focuses on code structure and best practices). Testing (black box – including equivalence and boundary testing – and white box functional testing and unit testing) found fewer bugs than reviews. But different bugs. There was no overlap at all between bugs found in testing and the bugs found by the static analysis tools.

Finding problems that could happen - or do happen

Static analysis tools are good at finding problems that “could happen”, but not necessarily problems that “do happen”.

Researchers at Colorado State University ran static analysis tools against several releases of different Open Source projects, and compared what the tools found against the changes and fixes that developers actually made over a period of a few years – to see whether the tools could correctly predict the fixes that needed to be made and what code needed to be refactored.

The tools reported hundreds of problems in the code, but found very few of the serious problems that developers ended up fixing. One simple tool (Jlint) did not find anything that was actually fixed or cleaned up by developers. Of 112 serious bugs that were fixed in one project, only 3 were also found by static analysis tools. In another project, only 4 of 136 bugs that were actually reported and fixed were found by the tools. Many of the bugs that developers did fix were problems like null pointers and incorrect string operations – problems that static analysis tools should be good at catching, but didn’t.

The tools did a much better job of predicting what code should be refactored: developers ended up refactoring and cleaning up more than 70% of the code structure and code clarity issues that the tools reported (PMD, a free code checking tool, was especially good for this).

Ericsson evaluated different commercial static analysis tools against large, well-tested, mature applications. On one C application, a commercial tool found 40 defects – nothing that could cause a crash, but still problems that needed to be fixed. On another large C code base, 1% of the tool’s findings turned out to be bugs serious enough to fix. On the third project, they ran 2 commercial tools against an old version of a C system with known memory leaks. One tool found 32 bugs, another 16: only 3 of the bugs were found by both tools. Surprisingly, neither tool found the already known memory leaks – all of the bugs found were new ones. And on a Java system with known bugs they tried 3 different tools. None of the tools found any of the known bugs, but one of the tools found 19 new bugs that the team agreed to fix.

Ericsson’s experience is that static analysis tools find bugs that are extremely difficult to find otherwise. But it’s rare to find stop-the-world bugs – especially in production code – using static analysis.

This is backed up by another study on the use of static analysis (Findbugs) at Google and on the Sun JDK 1.6.0. Using the tool, engineers found a lot of bugs that were real, but not worth the cost of fixing: deliberate errors, masked errors, infeasible situations, code that was already doomed, errors in test code or logging code, errors in old code that was “going away soon” or other relatively unimportant cases. Only around 10% of medium and high priority correctness errors found by the tool were real bugs that absolutely needed to be fixed.

The Case for Security

So far we've mostly looked at static analysis checking for run-time correctness and general code quality, not security.

Although security builds on code quality – vulnerabilities are just bugs that hackers look for and exploit – checking code for correctness and clarity isn’t enough for a secure app. A lot of investment in static analysis technology over the past 5-10 years has been in finding security problems in code, such as common problems listed in OWASP’s Top 10 or the SANS/CWE Top 25 Most Dangerous Software Errors.

A couple of studies have looked at the effectiveness of static analysis tools compared to manual reviews in finding security vulnerabilities. The first study was on a large application that had 15 known security vulnerabilities found through a structured manual assessment done by security experts. Two different commercial static analysis tools were run across the code. The tools together found less than half of the known security bugs – only the simplest ones, the bugs that didn't require a deep understanding of the code or the design.

And of course the tools reported thousands of other issues that needed to be reviewed and qualified or thrown away as false positives. These other issues including some run-time correctness problems, null pointers and resource leaks, and code quality findings (dead code, unused variables), but no other real security vulnerabilities beyond those already found by the manual security review.

But this assumes that you have a security expert around to review the code. To find security vulnerabilities, a reviewer needs to understand the code (the language and the frameworks), and they also need to understand what kind of security problems to look for.

Another study shows how difficult this is. Thirty developers were hired to do independent security code reviews of a small web app (some security experts, others web developers). They were not allowed to use static analysis tools. The app had 6 known vulnerabilities. 20% of the reviewers did not find any of the known bugs. None of the reviewers found all of the known bugs, although several found a new XSS vulnerability that the researchers hadn’t known about. On average, 10 reviewers would have had only an 80% chance of finding all of the security bugs.

And, not Or

Static analysis tools are especially useful for developers working in unsafe languages like C/C++ (where there is a wide choice of tools to find common mistakes) or dynamically typed scripting languages like Javascript or PHP (where unfortunately the tools aren't that good), and for teams starting off learning a new language and framework. Using static analysis is (or should be) a requirement in highly regulated, safety critical environments like medical devices and avionics. And until more developers get more training and understand more about how to write secure software, we will all need to lean on static analysis (and dynamic analysis) security testing tools to catch vulnerabilities.

But static analysis isn't a substitute for code reviews. Yes, code reviews take extra time and add costs to development, even if you are smart about how you do them – and being smart includes running static analysis checks before you do reviews. If you want to move fast and write good, high-quality and secure code, you still have to do reviews.You can’t rely on static analysis alone.