Wednesday, May 23, 2012

The pursuit of protection: How much testing is “enough”?

I’m definitely not a testing expert. I’m a manager who wants to know when the software that we are building is finished, safe and ready to ship.

Large-scale enterprise systems – the kinds of systems that I work on – are inherently hard to test. They have lots of rules and exceptions, lots of interfaces, lots of customization for different customers and partners, lots of operational dependencies, and they deal with lots of data. We can’t test everything – there are tens of thousands or hundreds of thousands of different scenarios and different paths to follow.

This gets both easier and harder if you are working in Agile methods, building and releasing small pieces of work at a time. Most changes or new features are easy enough to understand and test by themselves. The bigger problem is understanding the impact of each change on the rest of the system that has already been built, what side-effects the change may have, what might have broken. This gets harder if a change is introduced in small steps over several releases, so that some parts are incomplete or even invisible to the test team for a while.

People who write flight control software or medical device controllers need to do exhaustive testing, but the rest of us can’t afford to, and there are clearly diminishing returns. So if you can’t or aren’t going to “test everything”, how do you know when you’re done testing? One answer is that you’re done testing when you run out of time to do any more testing. But that’s not good enough.

You’re done testing when your testers say they’re done

Another answer is that you’re done when the test team says they’re done. When all of the static analysis findings have been reviewed and corrected. When all of the automated tests pass. When the testers have made sure that all of the features that are supposed to be complete were actually completed and are secure, finished their test checklists, made sure that the software is usable and checked it for fit-and-finish, tested for performance and stability, made sure that the deployment and rollback steps work, and completed enough exploratory testing that they’ve stopped finding interesting bugs, and the bugs that they have found (the important ones at least) have all been fixed and re-checked.

This of course assumes that they tested the right things – that they understood the business requirements and priorities, and found most of the interesting and important bugs in the system. But how do you know that they’ve done a good job?

What a lot of testers do is black box testing, which falls into two different forms:

  1. Scripted functional and acceptance testing, manual and automated – how good the testing is depends on how complete and clear the requirements are (which is a challenge for small Agile teams working through informal requirements that keep changing), and how much time the testers have to plan out and run their tests.
  2. Unscripted behavioural or exploratory manual testing – depends on the experience and skill of the tester, and on their familiarity with the system and their understanding of the domain.

With black box testing, you have to trust in the capabilities and care of the people doing the testing work. Even if they have taken a structured, methodical approach to defining and running tests they are still going to miss something. The question is – what, and how much?

Using Code Coverage

To know when you’ve tested enough, you have to stop testing in the dark. You have to look inside the code, using white box structural testing techniques to understand what code has been tested, and then look closer at the code to figure out how to test the code that wasn’t.

A study at Microsoft over 5 years involving thousands of testers found that with scripted, structured functional testing, testers could cover as much as 83% of the code. With exploratory testing they could raise this a few percentage points, to as high as 86%. Then, by looking at code coverage and walking through what was tested and what wasn’t, they were able to come up with tests that brought coverage up above 90%.

Using code coverage this way – instrumenting the code under test, then looking inside it to review and improve the tests that have already been written and to figure out what new tests to write – requires testers and developers to work together even more closely.

How much code coverage is enough?

If you’re measuring code coverage, the question that comes up is how much coverage is enough?

What percentage of your code should be covered before you can ship? 100%? 90%? 80%? You will find a lot of different numbers in the literature and I have yet to find solid evidence showing that any given number is better than another. Cedric Beust, Breaking Away from the Unit Test Group Think

In Continuous Delivery, Jez Humble and David Farley set 80% coverage as a target for each of automated unit testing, functional testing and acceptance testing. Based on their experience, this should provide comprehensive testing.

Some TDD and XP advocates argue for 100% automated test coverage, which is a target to aim for if you are starting off from scratch and want to maintain high standards, especially for smaller systems. But 100% is unnecessarily expensive, and it’s a hopeless target for a large legacy system that doesn’t have extensive automated tests already in place. You’ll reach a point of diminishing returns as you continue to add tests, where each test costs more to write and finds less. The more tests you write, the more of them will be bad tests: duplicate tests that seem to test different things but don’t, tests that don’t test anything important (even if they help make the code coverage numbers look a little better), tests that don’t work but look like they do. All of these tests, good or bad, need to run continuously, need to be maintained, and get in the way of making changes. The costs keep going up. How many shops can afford to achieve this level of coverage, and sustain it over a long period of time, or even want to?

Making Code Coverage Work for You

On the team that I manage now, we rely on automated unit and functional testing at around 70% (statement) coverage – higher in high-risk areas, lower in others. Obviously, automated coverage is also higher in areas that are easier to test with automated tools. We hit this level of coverage more than 3 years ago and it has held steady since then. There hasn’t been a good reason to push it higher – it gives us enough of a safety net for developers to make most changes safely, and it frees the test team up to focus on risks and exceptions.
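As a rough illustration of how a risk-based coverage target like this can be made part of the build, here is a minimal sketch that checks per-area statement coverage against different thresholds. The module names, targets and report format are invented; in practice the numbers would come from whatever coverage tool you are already running.

```python
# Hypothetical sketch: enforce different statement-coverage targets for
# different areas of the code, with a higher bar for high-risk areas.
# Module names, targets and the report format are made up for illustration.

COVERAGE_TARGETS = {
    "order-matching": 85,   # high-risk core logic: higher bar
    "reporting": 60,        # lower-risk code, harder to automate
    "default": 70,          # everything else
}

def check_coverage(report):
    """report: dict of module name -> statement coverage percent."""
    failures = []
    for module, percent in report.items():
        target = COVERAGE_TARGETS.get(module, COVERAGE_TARGETS["default"])
        if percent < target:
            failures.append(f"{module}: {percent}% is below the {target}% target")
    return failures

if __name__ == "__main__":
    problems = check_coverage({"order-matching": 88, "reporting": 55, "gui": 72})
    for p in problems:
        print("FAIL", p)    # a CI job could fail the build here
```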

Of course with the other kinds of testing that we do, manual functional testing and exploratory testing and multi-player war games, semi-automated integration testing and performance testing, and operational system testing, coverage in the end is much higher than 70% for each release. We’ve instrumented some of our manual testing work, to see what code we are covering in our smoke tests and integration testing and exploratory testing work, but it hasn’t been practical so far to instrument all of the testing to get a final sum in a release.

Defect Density, Defect Seeding and Capture/Recapture – Does anybody really do this?

In an article in IEEE Software Best Practices from 1997, Steve McConnell talks about using statistical defect data to understand when you have done enough testing.

The first approach is to use Defect Density data (# of defects per KLOC or some other common definition of size) from previous releases of the system, or even other systems that you have worked on. Add up how many defects were found in testing (assuming that you track this data – some Lean/Agile teams don’t, we do) and how many were found in production. Then measure the size of the change set for each of these releases to calculate the defect density. Do the same for the release that you are working on now, and compare the results. Assuming that your development approach hasn’t changed significantly, you should be able to predict how many more bugs still need to be found and fixed. The more data, of course, the better your predictions.
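As a simple illustration, the calculation looks something like this (all numbers are invented; real data would come from your bug tracker and version control history):

```python
# Sketch of the defect density approach described above.

# Past releases: (total defects found in test and production, change set size in KLOC)
history = [
    (120, 40.0),
    (85, 30.0),
    (150, 55.0),
]

# Average defects per KLOC across past releases
avg_density = sum(defects / kloc for defects, kloc in history) / len(history)

# The release you are working on now
current_change_set_kloc = 25.0
defects_found_so_far = 60

predicted_total = avg_density * current_change_set_kloc
still_to_find = max(0.0, predicted_total - defects_found_so_far)

print(f"Historical defect density: {avg_density:.1f} defects/KLOC")
print(f"Predicted defects in this release: {predicted_total:.0f}")
print(f"Defects probably still to be found: {still_to_find:.0f}")
```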

Defect Seeding, also known as bebugging, is where someone inserts bugs on purpose and then you see how many of these bugs are found by other people in reviews and testing. The percentage of the known [seeded] bugs not found gives an indication of the real bugs that remain. Apparently some teams at IBM, HP and Motorola have used Defect Seeding, and it must come up a lot in interviews for software testing labs (Google “What is Defect Seeding?”), but it doesn’t look like a practical or safe way to estimate test coverage. First, you need to know that you’ve seeded the “right” kind of bugs, across enough of the code to be representative – you have to be good at making bugs on purpose, which isn’t as easy as it sounds. If you do a Mickey Mouse job of seeding the defects and make them too easy to find, you will get a false sense of confidence in your reviews and testing – if the team finds most or all of the seeded bugs, that doesn’t mean that they’ve found most or all of the real bugs. Bugs tend to be either simple and obvious or subtle and hard to find, and bugs tend to cluster in code that was badly designed or badly written, so the seeded bugs need to somehow represent this. And I don’t like the idea of putting bugs into code on purpose. As McConnell points out, you have to be careful in removing the seeded bugs, and then do still more testing to make sure that you didn’t break anything.

And finally, there is Capture/Re-Capture, an approach used to estimate wildlife populations (catch and tag fish in a lake, then see how many of the tagged fish you catch again later), which Watts Humphrey introduced to software engineering as part of TSP to estimate remaining defects from the results of testing or reviews. According to Michael Howard, this approach is sometimes used at Microsoft for security code reviews, so let’s explore this context. You have two reviewers. Both review the same code for the same kinds of problems. Add up the number of problems found by the first reviewer (A), the number found by the second reviewer (B), and separately count the common problems that both reviewers found, where they overlap (C). The total number of estimated defects: A*B/C. The total number of defects found: A+B-C. The total number of defects remaining: A*B/C – (A+B-C).

Using Michael Howard’s example, if Reviewer A found 10 problems, and Reviewer B found 12 problems, and 4 of these problems were found by both reviewers in common, the total number of estimated defects is 10*12/4=30. The total number of defects found so far: 18. So there are 12 more defects still to be found.
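The arithmetic is simple enough to sketch out, using Michael Howard’s numbers from above:

```python
def capture_recapture(found_by_a, found_by_b, found_by_both):
    """Estimate total, found and remaining defects from two independent
    reviews of the same code, using the capture/re-capture formulas above."""
    estimated_total = found_by_a * found_by_b / found_by_both
    found_so_far = found_by_a + found_by_b - found_by_both
    remaining = estimated_total - found_so_far
    return estimated_total, found_so_far, remaining

# Michael Howard's example: A found 10 problems, B found 12, 4 found by both
total, found, remaining = capture_recapture(10, 12, 4)
print(total, found, remaining)   # 30.0 18 12.0
```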

I’m not a statistician either, so this seems like magic to me, and to others. But like the other statistical techniques, I don’t see it scaling down effectively. You need enough people doing enough work over enough time to get useful stats. It works better for large teams working in Waterfall-style, with a long test-and-fix cycle before release. With a small number of people working in small, incremental batches, you get too much variability – a good reviewer or tester could find most or all of the problems that the other reviewers or testers found. But this doesn’t mean that you’ve found all of the bugs in the system.

Your testing is good enough until a problem shows that it is not good enough

In the end, as Martin Fowler points out, you won’t really know if your testing was good enough until you see what happens in production:

The reason, of course, why people focus on coverage numbers is because they want to know if they are testing enough. Certainly low coverage numbers, say below half, are a sign of trouble. But high numbers don't necessarily mean much, and lead to “ignorance-promoting dashboards”. Sufficiency of testing is a much more complicated attribute than coverage can answer. I would say you are doing enough testing if the following is true:
  • You rarely get bugs that escape into production, and
  • You are rarely hesitant to change some code for fear it will cause production bugs.

Test everything that you can afford to. Release the code. When problems happen in production, fix them, then use Root Cause Analysis to find out why they happened and to figure out how you’re going to prevent problems in the future, how to improve the code and how to improve the way you write it and how you test it. Keep learning and keep going.

Thursday, May 17, 2012

Software Development Metrics that Matter

As an industry we do a surprisingly poor job of measuring the work that we do and how well we do it. Outside of a relatively small number of organizations which bought into expensive heavyweight models like CMMI or TSP/PSP (which is all about measuring on a micro-level) or Six Sigma, most of us don’t measure enough, don’t measure the right things, or don’t understand what to do with the things that we do measure. We can’t agree on something as basic as how to measure programmer productivity, or even on consistent measures of system size – should we count lines of code (LOC, SLOC, NCLOC, ELOC) or IFPUG function points or object oriented function points or weighted micro function points or COSMIC Full Function Points, or the number of classes or the number of something else…?

Capers Jones, who has spent most of his career trying to understand the data that we do collect, has this to say:

The software industry lacks standard metric and measurement practices. Almost every software metric has multiple definitions and ambiguous counting rules… The result of metrics problems is a lack of solid empirical data on software costs, effort, schedules, quality, and other tangible matters. Strengths and Weaknesses of Software Metrics, 2006

But how can we get better, or know that we need to get better, or know if we are getting better, if we can’t or don’t measure something? How can we tell if a new approach is helping, or make a business case for more people or a rewrite, or even justify our existing costs in hard times, without proof?

What do you really have to measure?

Different metrics matter to management, the team and the customer. There are metrics that you have to track because of governance or compliance requirements, or because somebody more important than you said so. There are metrics that the team wants to track because they find them useful and want to share. There are stealth metrics that as a manager you would like to track quietly because you think they will give you better insight into the team’s performance and identify problems, but you aren’t sure yet or you don’t want people to adapt and try to game what you are measuring. And most important, the measures that the customer actually cares about: are you delivering what they need when they need it, is the software reliable and usable?

Like most people who manage development shops, I care much more about getting work done than about collecting measurement data. Any data must be easy and cheap to collect, and simple to use and understand. If it takes too much work, people won’t be able to keep it up over the long term – they’ll take shortcuts or just stop doing it. The best metrics are the ones that come for free, as part of the work that we are already doing (see Don Reinertsen’s Managing the Design Factory: http://www.amazon.com/Managing-Design-Factory-Donald-Reinertsen/dp/0684839911). I need a small number of metrics that can do double duty, that can be combined and used in different ways.

Size

In order to compare data over time or between systems and teams, you need a consistent measure of the size of the system, the size of the code base and how this is changing. Counting the Lines of Code, whether it is Source Lines of Code (SLOC) or Effective Lines of Code (ELOC) or NCLOC (Non Commented Lines of Code), is simple and cheap to measure for a code base and easy to understand (as long as you have a consistent way of dealing with blank lines and other formatting issues), but you can’t use it to compare code written in different languages and on different platforms.

This is where Function Points come in – a way of standardizing measurements of work for estimation and costing independent of the platform. Unfortunately it’s not easy to understand or measure Function Points – while there are lots of tools that can measure the number of lines of code in a code base and every programmer can see and understand what 100 or 1000 lines of code in a certain language looks like, Function Point calculations need to be done by people who are certified Function Point counters (I’m not kidding), using different techniques depending on which of the 20 different Function Point measurement methods they choose to follow.

Function Points have been around since the late 1970s but this way of measuring hasn’t caught on widely because it isn’t practical – too many programmers and managers don’t understand it and aren’t convinced it is worth the work. See “Function Points are Fantasy Points”.

A rough compromise is to use “backfiring” rules-of-thumb to convert from LOC in each language to Function Points (you’ll need to find a good backfiring conversion table). With backfiring, 1 Function Point is approximately 55 lines of Java (on average), or 58 lines of C#, or 148 lines of C. There are a lot of concerns about creeping inaccuracies using this approach, but for rough comparisons of size between systems in different languages it’s all that we really have for now.
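A minimal sketch of what backfiring looks like in practice, using the average ratios quoted above (the example code base sizes are invented, and real backfiring tables vary by source and by language level):

```python
# Hypothetical backfiring sketch: convert LOC to approximate Function Points
# using average LOC-per-FP ratios. Treat the ratios as rough rules of thumb.
LOC_PER_FP = {"java": 55, "csharp": 58, "c": 148}

def backfired_function_points(loc_by_language):
    return sum(loc / LOC_PER_FP[language] for language, loc in loc_by_language.items())

# e.g. a system with 220,000 lines of Java and 30,000 lines of C
size_in_fp = backfired_function_points({"java": 220_000, "c": 30_000})
print(f"Approximate size: {size_in_fp:.0f} Function Points")   # roughly 4,200 FP
```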

Where are you spending time and money, and are you doing this responsibly?

Everybody has to track at least some basic cost data – it’s part of running any project or business. What’s interesting is coming up with useful breakdowns of cost data. You want to know how much people are spending on new stuff, on enhancements and other changes, on bug fixing and other rework, on support, on large-scale testing work, on learning – and how much time is being wasted. Whatever buckets you come up with should be coarse-grained – you don’t need detailed, expensive time tracking and other cost data to make management decisions. If everyone is tracking time for billing purposes or project costing or OPEX/CAPEX financial reporting or some other reason, it’s easy. Otherwise, it’s probably enough to take samples: ask everyone to track time for a while, work with this data, then track time again for a while to see what changes. With a common definition of size across systems (see above), you can compare costs across teams and systems, as well as look at trends over time.

For maintenance teams, Capers Jones recommends that teams track Assignment Scope: the amount of code that one programmer can maintain and support in a year. You can use this for planning purposes (how many people will you need to maintain a system), watch this change over time and compare between teams and systems (based on size again), or even against industry benchmarks. According to Capers Jones’s data, an average programmer should be able to maintain and support around 1,000 Function Points of code per year, depending on the programmer’s experience, the language, quality of code and tools.
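A back-of-the-envelope staffing calculation based on Assignment Scope might look like this (the system size is invented for illustration):

```python
# Sketch of using Assignment Scope for maintenance staffing estimates.
# The system size is invented; the 1,000 FP/year figure is Capers Jones's
# rough average, which varies with experience, language, code quality and tools.
system_size_fp = 4200            # size of the code base in Function Points
assignment_scope_fp = 1000       # FP one programmer can maintain per year (average)

maintainers_needed = system_size_fp / assignment_scope_fp
print(f"Roughly {maintainers_needed:.1f} programmers needed to maintain and support this system")
```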

Speed

Everybody wants software delivered as fast as possible. Agile teams measure Velocity to see how fast they are delivering at any point in time, but the problem with velocity, as Mike Cohn and others have explained in detail, is that it can’t be compared across teams, or even within the same team over a long period of time, because the definition of how much is being delivered, in Story Points, isn’t stable or standardized – Story Points mean different things to different teams, and even different things for the same team over a long enough period of time (as people enter and leave the team and as they work on different kinds of problems). Velocity is only useful as a short-term/medium-term predictor of one team’s delivery speed.

Teams that are continuously delivering to production can measure speed by the number of changes they are making, and by how big the changes are (we’re back to measuring size again). Etsy measures the frequency of changes and the size of the change sets, and correlates this with production incident data (frequency and impact and time to recover) to help understand whether they are changing too much or too quickly to be safe.

Probably the most useful measure of speed comes from Lean and Kanban. To determine if you are keeping up with customer demand, measure the turnaround or cycle time on important changes and fixes: how long it takes from when the customer finds a problem or asks for a change to when they get what they asked for. Turnaround is simple to measure and obvious to the business and to the team, especially if you use something like Cumulative Flow Diagrams to make cycle time visible. It’s easy to see if you are keeping up or falling behind, if you are speeding up or slowing down. Focusing on turnaround ties you directly to what is important to the business and your customers.
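Measuring turnaround doesn’t require anything fancy; a minimal sketch using dates pulled from an issue tracker might look like this (the issues and dates here are made up):

```python
# Minimal sketch of measuring turnaround (cycle time): elapsed time from when
# the customer asked for a fix or change to when it was delivered.
from datetime import date

requests = [
    {"id": "FIX-101", "requested": date(2012, 4, 2),  "delivered": date(2012, 4, 6)},
    {"id": "CHG-240", "requested": date(2012, 4, 3),  "delivered": date(2012, 4, 20)},
    {"id": "FIX-115", "requested": date(2012, 4, 10), "delivered": date(2012, 4, 12)},
]

cycle_times = [(r["delivered"] - r["requested"]).days for r in requests]
print(f"Average turnaround: {sum(cycle_times) / len(cycle_times):.1f} days")
print(f"Worst case: {max(cycle_times)} days")
```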

Reliability and Quality

But speed isn’t everything – there’s no point in delivering as fast as possible if what you are delivering is crap. You also need some basic measures of quality and reliability.

For online businesses, Uptime is the most common and most useful measurement of reliability: measuring the number of operational problems, the time between operational failures (MTTF), and the time to recover from each failure (MTTR). Basic Ops stuff.
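The basic arithmetic is straightforward; here is a sketch with made-up incident data:

```python
# Basic reliability arithmetic from operational incident data over a
# measurement window. All numbers are invented for illustration.
incidents_downtime_minutes = [35, 12, 90]          # downtime for each failure
window_minutes = 90 * 24 * 60                      # a 90-day measurement window

total_downtime = sum(incidents_downtime_minutes)
failures = len(incidents_downtime_minutes)

mttr = total_downtime / failures                          # mean time to recover
mttf = (window_minutes - total_downtime) / failures       # mean operating time per failure
availability = 100.0 * (window_minutes - total_downtime) / window_minutes

print(f"MTTR: {mttr:.0f} minutes")
print(f"MTTF: {mttf / 60:.0f} hours")
print(f"Uptime: {availability:.3f}%")
```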

And you need some kind of defect data, which comes for free if you are using a bug tracking system like Bugzilla or Jira. Identify areas in the code that have the most bugs, measure how long it takes to fix bugs and how many bugs the team can fix each month. Over the longer term, track the bug opened/closed ratio to see if more bugs are being found than normal, if the team is falling behind on fixing bugs, and if they need to slow down and review and fix code rather than attempting to deliver features that aren’t actually ready.
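A small sketch of watching the opened/closed ratio over time (the counts are invented for illustration):

```python
# Sketch of tracking the bug opened/closed ratio per month from issue tracker
# counts. A ratio that stays above 1 means the bug backlog is growing.
monthly_counts = [
    ("2012-02", 40, 42),   # (month, bugs opened, bugs closed)
    ("2012-03", 55, 38),
    ("2012-04", 60, 35),
]

for month, opened, closed in monthly_counts:
    ratio = opened / closed
    warning = "  <- falling behind" if ratio > 1.2 else ""
    print(f"{month}: opened/closed = {ratio:.2f}{warning}")
```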

To compare between teams or systems or to watch trends develop over time, one of the key metrics is defect density: the number of bugs per KLOC (or whatever measure of code size you are using – back to the size measurement again). With enough data, you can even try to use defect density to help determine when code is really ready to be shipped.

How Healthy is the Code?

You also want to measure the health of the code base, how much technical debt you are taking on, using code analysis tools. Code coverage is a useful place to start, especially if you are putting in automated testing, or even if you just want to measure the effectiveness of your manual testing. Tools like EMMA and Clover instrument the code and trace execution as you are testing to give you a picture of what code was covered by testing, and more important, what code wasn’t. It’s easy to drill down and see where the holes in your testing are, and to build trends from this data, to set code coverage targets and watch out for people cutting back on testing under pressure.

Measuring Code Complexity, identifying your most complex code and watching to see if complexity is increasing over time helps you make decisions about where to focus your reviews and testing and refactoring work. If you are measuring both test coverage and complexity, you can correlate the data together to identify high risk code: complex routines that are poorly tested. Code health can also be measured by static analysis checkers like Findbugs and PMD or Coverity or Klocwork – tools that look for coding mistakes and security vulnerabilities and violations of good coding practices. Higher-level code quality management tools like Sonar and CodeExcellence, which consolidate and correlate data from multiple static analysis tools, give you an overall summary picture of the health of all of your code, and let you look at code health over time and across teams and systems.
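Correlating the two data sets can be as simple as flagging routines that are both complex and poorly covered. A sketch (the method names, thresholds and numbers are invented; real data would come from your coverage and static analysis tools):

```python
# Sketch of combining complexity and coverage data to flag high-risk code:
# complex routines with poor test coverage.
metrics = [
    {"method": "OrderBook.match",   "complexity": 24, "coverage": 35},
    {"method": "Report.format",     "complexity": 6,  "coverage": 90},
    {"method": "Session.reconcile", "complexity": 18, "coverage": 55},
]

high_risk = [m for m in metrics if m["complexity"] > 15 and m["coverage"] < 60]
for m in sorted(high_risk, key=lambda m: m["complexity"], reverse=True):
    print(f"Review and test first: {m['method']} "
          f"(complexity {m['complexity']}, coverage {m['coverage']}%)")
```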

Using Metrics in Ways that aren’t Evil

Many programmers think that metrics are useless or evil, especially if they are used by management to evaluate and compare programmer productivity. IBM, for example, apparently uses static measurement data to scorecard application developers and teams in an attempt to quantify programmer performance. Using this kind of data to rate developers for HR purposes can easily lead to abuse and to reinforcing the wrong behavior. If you are measuring “productivity” by size and simple code quality measures, no programmers will want to take on hard problems or step up to maintain gnarly legacy code, because their scores will always look bad. Writing lots of simple code to make a tool happy doesn’t make someone a better programmer.

Only measure things to identify potential problems and negative trends, strong points and weak points in the code and your development process. If you want to set goals for the team to improve speed or cost of quality, let them decide what to measure and how to do it. And measure whatever you need to build a business case – to prove to others that the team is moving forward and doing a good job, or to make a case for change. Make metrics work for the team. You can do all of this without adding much to the cost of development, and without being evil.

Thursday, May 10, 2012

Building security into a development team

Getting application developers to understand and take responsibility for software security is difficult. Bootstrapping an Appsec program requires that you get the team up to speed quickly on security risks and what problems they need to look for, how to find and fix and prevent these problems, what tools to use, and convince them that they need to take security seriously. One way to do this is to train everyone on the development team on software security.

But at RSA 2011, Caleb Sima’s presentation Don’t Teach Developers Security challenged the idea that training application developers on software security will make a meaningful difference. He points out (rightly) that you can’t teach most developers anything useful about secure software development in a few hours (which is as much Appsec training as most developers will get anyway). At best, training like this is a long-term investment that will only pay off with reinforcement and experience – the first step on a long road.

Most developers (he suggests as many as 90 out of 100) won’t take a strong interest in software security regardless. They are there to build stuff, that’s what they get paid for, that’s what they care about and that’s what they do best. Customers love them and managers (like me) love them too because they deliver, and that’s what we want them spending their time doing. We don’t want or need them to become AppSec experts. Only a few senior, experienced developers will “get” software security and understand or care about all of the details, and in most cases this is enough. The rest of the team can focus on writing good defensive code and using the right frameworks and libraries properly.

Caleb Sima recommends starting an Appsec program by working with QA. Get an application security assessment: a pen test or a scan to identify security vulnerabilities in the app. Identify the top 2 security issues found. Then train the test team on these issues, what they look like, how to test for them, what tools to use. It’s not practical to expect a software tester to become a pen testing expert, but they can definitely learn how to effectively test for specific security issues. When they find security problems they enter them as bugs like any other bug, and then it’s up to development to fix the bugs.

Get some wins this way first. Then extend security into the development team. Assign one person as a security controller for each application: a senior developer who understands the code and who has the technical skills and experience to take on security problems. Give them extra Appsec training and the chance to play a leadership role. It’s their job to assess technical risks for security issues. They decide on what tools the team will use to test for security problems, recommend libraries and frameworks for the team to use, and help the rest of the team to write secure code.

What worked for us

Looking back on what worked for our Appsec program, we learned similar lessons and took some of the same steps.

While we were still in startup, I asked one of our senior developers to run an internal security assessment and make sure that our app was built in a secure way. I gave him extra time to learn about secure development and Appsec, and gave him a chance to take on a leadership role for the team. When we brought expert consultants in to do additional assessments (a secure design review and code review and pen testing) he took the lead on working with them and made sure that he understood what they were doing and what they found and what we needed to do about it. He selected a static analysis tool and got people to use it. He ensured that our framework code was secure and used properly, and he reviewed the rest of the team’s code for security and reliability problems. Security wasn’t his entire job, but it was an important part of what he did. When he eventually left the team, another senior developer took on this role.

Most development teams have at least one developer whom the rest of the team respects and looks to for help on how to use the language and platform correctly. Someone who cares about how to write good code and who is willing to help others with tough coding problems and troubleshooting. Someone who handles the heavy lifting on frameworks or performance engineering work. This is the developer that you need to take on your core security work. Someone who likes to learn about technical stuff and who picks new things up quickly, who understands and likes hard technical problems (like crypto and session management), and who makes sure that things get done right.

Without knowing it we ended up following a model similar to Adobe’s “security ninja” program, although on a micro-scale. Most developers on the team are white belts or yellow belts with some training in secure software development and defensive programming. Our security lead is the black belt, with deeper technical experience and extra training and responsibility for leading software security for the application. Although we depended on external consultants for the initial assessments and to help us lay out a secure development roadmap, we have been able to take responsibility for secure development into the development team. Security is a part of what they do and how they design and build software today.

This model works and it scales. If as a manager you look at security as an important and fundamental technical problem that needs to be solved (rather than a pain-in-the-ass that needs to be gotten over), then you will find that your senior technical people will take it seriously. And if your best technical people take security seriously, then the rest of the team will too.

Thursday, May 3, 2012

Application Security at Scale

This week’s SANS AppSec conference in Las Vegas took on Application Security at Scale: how can we scale application security programs and technologies to big organizations, to small organizations, and across organizations to millions of programmers worldwide. You can find the presentation slides here. Lots of highlights for me. The conference was kicked off by Jeremiah Grossman from WhiteHat Security, who made it clear that the problem of web application security alone is much bigger than we can take care of with the people and technology that we have today. We need to try different things like:
  1. Game-ification: get developers interested and involved in Appsec using games and challenges like capture-the-flag, or the Elevation of Privilege card game (a game I have to try out)
  2. Use peer pressure and score cards between teams, products, business units – drive better application security through competition (as we learned later, Cisco is one of the organizations score carding business units and products to drive improvement in software security programs)
  3. Good, simple (and I will add inexpensive) online training to get as many developers as possible up to speed on secure design and coding
  4. Write good, usable security frameworks and libraries and build security in by default into the major application frameworks – unfortunately we don’t know what frameworks will get widely adopted until they are widely adopted, so we will always be playing catch-up
  5. Build security into the developer’s workflow – this is what SD Elements is doing
  6. Use WAFs and virtual patching where it makes sense – to raise the bar on attacks by plugging simple issues found by scanners (WAFs used properly could block more than 2/3 of web application vulnerabilities, the kinds that scanners find); and to secure legacy code that nobody wants to try to figure out and fix by hand or in shops where it is too expensive and slow to get fixes out (in many Agile / Devops shops, it’s faster to fix and deploy the code than it is to put in a patch to the WAF).
  7. Bug Bounty programs – if this works for Google and Facebook, it could work for you.
Chris Eng at Veracode presented some metrics collected from the scans that they have done for customers over the past 18 months. The interesting thing for me was the correlation between attack data (from the Verizon 2011 Data Breach Report) and vulnerability data, with the intersections highlighting what we need to focus on. Just like last year (and the year before), SQL injection is the leading problem: 32% of apps scanned had SQL injection vulnerabilities, and 20% of attacks are SQL injection.

The XSS Problem (and some Solutions)

According to Veracode’s data, 68% of web apps have XSS vulnerabilities. This is no surprise after you listen to Jim Manico explain in detail what programmers have to do to prevent XSS. Even getting every developer building web apps to understand all of the different rules for context-correct output encoding and escaping isn’t going to solve the problem: there are too many details for developers to take care of without missing something or making mistakes. “It’s more complex to stop XSS in large-scale apps than it is to do applied crypto and key management properly…. We have never seen a web app that can’t be attacked through XSS”. But there is hope – in a later presentation he explained how Context-Aware Auto-Escaping (aka Auto-Encoding) technology like JXT (a close-to-drop-in replacement for JSP, if you are writing well-formed JSP) and Ivan Ristic’s work on Apache Velocity Auto-Escaping can help protect at least some web apps from XSS. The most promising new technology for me was HTML5 iFrame Sandboxing which looks like it could actually be dropped in today to protect apps, at least if your customers are using modern browsers. I also learned about JavaScript object freezing and sealing to help protect rich client apps.
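To make the point about context concrete, here is a minimal sketch, using only the Python standard library and purely for illustration: the same untrusted value needs different encoding depending on whether it lands in HTML element content or in a URL parameter, and JavaScript, CSS and attribute contexts each need their own encoders on top of that. This is not a complete defense; real applications should rely on a context-aware encoding or templating library rather than hand-rolled escaping.

```python
# Illustration of context-specific output encoding: the same untrusted input
# has to be encoded differently for different output contexts.
import html
import urllib.parse

untrusted = '"><script>alert(1)</script>'

safe_for_html_body = html.escape(untrusted)                  # HTML element content
safe_for_url_param = urllib.parse.quote(untrusted, safe="")  # URL query parameter value

print(safe_for_html_body)   # &quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;
print(safe_for_url_param)   # %22%3E%3Cscript%3Ealert%281%29%3C%2Fscript%3E
```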

I was on a panel that looked at application security in small companies, together with Nick Galbreath at Etsy and Cameron Morris at Partnet – a small company that builds online web shopping portals for the US DoD. At Partnet everyone owns security: many of the developers have been trained on application security, all of them understand the OWASP Top 10, all code that is checked in is reviewed. I presented a case study on our AppSec program, what worked and what didn’t for us, from startup to now. Nick built on an earlier presentation by Zane Lackey at Etsy which explained some of the security controls in Etsy’s frameworks and Continuous Deployment pipeline, and the extensive monitoring and instrumentation feedback loops that they have from production back into development. This includes cool automated checks on changes to high-risk code (they have automated tests that hash specific pieces of code, if the hash value changes the build system automatically alerts the AppSec team and provides them the change set for code review).
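A minimal sketch of that kind of check on changes to high-risk code (the file list, baseline location and alerting hook are invented for illustration; Etsy’s actual implementation is part of their build and test pipeline):

```python
# Sketch of flagging changes to high-risk code for security review by hashing
# security-sensitive source files and comparing against a stored baseline.
# File paths and the baseline location are hypothetical.
import hashlib
import json
import pathlib

HIGH_RISK_FILES = ["auth/login.py", "payments/charge.py"]   # hypothetical paths
BASELINE_FILE = pathlib.Path("high_risk_hashes.json")

def current_hashes():
    hashes = {}
    for name in HIGH_RISK_FILES:
        path = pathlib.Path(name)
        if path.exists():                                   # skip missing files in this sketch
            hashes[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes

def check_for_high_risk_changes():
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    now = current_hashes()
    changed = [name for name, digest in now.items() if baseline.get(name) != digest]
    if changed:
        # in a real pipeline this would alert the AppSec team with the change set
        print("High-risk code changed, flag for security review:", changed)
    BASELINE_FILE.write_text(json.dumps(now, indent=2))

if __name__ == "__main__":
    check_for_high_risk_changes()
```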

Being able to deploy several times a day means that they can deploy new code (including security fixes) extremely quickly with high confidence. Nick also presented on rate limiting to monitor and control event activity in production. I am critical of Continuous Deployment – too many people who try to follow this model put speed ahead of reliability and security, and unnecessarily put their customers at risk. They don’t know when they have crossed the line from a web startup to running a real business. But if you are going to try this and want to do it right, learn everything that you can from Etsy.

Mobile AppSec - Android is a Train-Wreck

From the panel on mobile Appsec: Google has a good page on writing secure apps for Android (I am guessing it is this page) but it’s clear that most developers don’t know about it. The consensus was that Apple’s iOS is the most secure smart phone platform, and Android is a train wreck – according to one of the researchers (Georgia Weidman at Bulb Security), half of the default Android apps have serious security vulnerabilities.

Secure Frameworks and APIs

Chenxi Wang’s Day 2 keynote emphasised the importance of isolating security code and sharing and reusing security code through APIs. This was reinforced in a later panel by Jason Chan at Netflix and Adam Migus at E*Trade – like us, both these organizations rely on simple, extensible secure frameworks or APIs that developers can use to take care of problems like identity management, permissioning, crypto, secure transport, validation. At E*Trade, they find that most security vulnerabilities are because developers didn’t use this code (or didn’t use it properly). It doesn’t cost that much to write and support a secure framework – a few smart people can take care of this for the rest of the organization. And there are Open Source examples today like Apache Shiro that we can use to solve a lot of common security problems.

Pen Testing

An excellent panel on “inside the mind of a pen tester” looked at how expert pen testers think and work and approach problems, and the tools that they use (I learned about chaining multiple attack proxies like Burp Suite and ZAP together to take advantage of the different strengths of each tool). The most important thing in a pen test is for developers and management to get a clear understanding of the risks in the application: what kinds of problems the testers found, how serious they were, what you have to do to fix them, and what you have to do to prove that you fixed them. The real value of pen testing, like any other kind of testing, is the information that you get out of the test. If you’re not learning from pen tests, and the next time the tester comes back they test the same system and find the same problems, what are you paying for?

These pen testing experts had mixed opinions of WAFs – if a customer has a WAF it is usually installed out-of-the-box without tuning, and doesn’t present more than a speed bump to a determined attacker. But like anti-virus protection, it will stop most drive-by attacks – some big sites are seeing that as much as 10% or even 20% of their traffic is potentially dangerous, and a good WAF should be able to catch at least some of this.

Most memorable quotes from the conference

There is no such thing as an internal application. Jeremiah Grossman, WhiteHat Security
You can checkbox compliance but you can’t checkbox security. Monica Bush, University of Wisconsin-Madison
The closing message was sobering. We need more people who understand AppSec and who can write secure code – a lot more people. Big companies may only have a few AppSec generalists supporting thousands of developers, most companies don’t have anyone at all. This isn’t enough. Point-in-time assessments like pen tests aren’t enough either, because the attack space is always changing and the code is always changing – in a few weeks or months at most the results of a pen test may be invalidated. What we are doing today isn’t enough and it’s not going to scale. We need more security burned in and we need continuous security testing, which means more people who understand AppSec and better and more effective tools.