Wednesday, May 15, 2013

Certified Agile: The PMI-ACP Exam

I sat for the Project Management Institute’s Agile Certified Practitioner (PMI-ACP) exam earlier this week. The PMI-ACP tests your understanding of common Agile development methods, values and practices. It focuses on basic Agile principles, and on Scrum and XP in detail, as well as fundamentals of Lean and Kanban.

Unlike the PMP, there is no Book of Knowledge which defines best practices and a process framework for this certification. Instead there is a certification content outline that explains at a high level the tools, techniques, knowledge and skills that you will be expected to know and will be tested on, and a reference list of books to read which includes some of the usual suspects. Out of this list I’d recommend reading Mike Cohn’s books on Agile Estimating and Planning and User Stories - they are useful for the exam and they're worth reading regardless. If you’re not working in an XP shop you should also read Kent Beck’s Extreme Programming Explained to make sure that you understand XP, and you must read up on the basics of Lean and Kanban. And of course you need to memorize the Agile Manifesto and the Twelve Principles of Agile Software Development front to back.

But I know from writing the PMP several years ago that experience and general reading aren’t enough to prepare for a PMI certification exam. PMI wants everyone who holds a certification to know the same things, and to share the same values and to think and act the same way. There’s an emphasis on orthodoxy – you’re tested not on what you would do (based on your experience and common practical knowledge), but what you should do according to PMI's definition of what “the right way" is to do something. And PMI’s exams are as much a test of your ability to read and write an exam as they are of the subject matter, with trick questions and trip-up answers and questions which are purposefully hard to understand, and even some extra questions thrown in which don’t make sense at all. Writing a test like this is not fun, although the PMI-ACP exam is certainly not as hard as the PMP exam - you shouldn’t need the 3+ hours that you’re given to complete this test.

So like others, I decided to use an exam prep guide to finish my studying.

The PMI-ACP Exam: How to Pass on Your First Try by Andy Crowe is a quick overview of the material that you should know for the exam. Easy to read and easy to follow, it defines key terms and “doing Agile right”, roles and responsibilities and rituals and tools, and covers communication and collaboration issues, and includes some sample questions (and access to a sample online exam). This is not an especially insightful book, but I found it useful for last minute review and cramming.

I did most of my studying with Mike Griffiths’ PMI-ACP Exam Prep: A Course in a Book for Passing the PMI Agile Certified Practitioner (PMI-ACP) Exam, a much more complete study guide, and a good overview of Agile development that is worth keeping and reading on its own. This book builds on materials that Griffiths published earlier on his blog and it is especially good on Agile reporting tools.

Griffiths is one of the experts who created the PMI-ACP program and so he understands what you need to know in depth, and he is a good writer. However, his book is harder to study from than Crowe’s, because it contains a lot more details and because it is structured around the artificial domains that PMI uses to describe Agile development. This results in several discontinuities, where an idea or practice is introduced under “Value Driven Delivery” and then continues later under “Adaptive Planning” or “Continuous Improvement” or one of the other domains (it is not necessary by the way to learn the domains for the exam).

If you have solid experience with Agile development (which you need to in order to meet the qualifying bar) especially Scrum and XP, you should be able to pass the exam with the help of Griffiths’ guide and some general reading to fill in gaps.

Studying for the PMI-ACP has made me examine Agile development ideas and practices in more detail (which is why I decided to apply for the certification). But it hasn't changed how I think about Agile practices and methods or how I think you should follow them. I am just as convinced today as I was before that the key is not following some method in a pure way, but instead to build your own toolkit, to borrow what works from different methods and adapt them to your specific requirements, constraints and situation. And the more that you know and understand about Agile methods and practices, the more tools you have for your toolkit.

Tuesday, May 7, 2013

Appsec – Can anything Stop the Bad Guys?

WhiteHat Security recently published their 2012 report on website security. Like Veracode, WhiteHat collects and analyzes data from security tests run across their customer base each year. WhiteHat's analysis focuses on data from dynamic testing of 15,000 sites at 650 organizations – all results manually reviewed and verified. From this data they are able to see trends and to build industry scorecards. The report makes for fascinating reading.

On average, web sites are getting more secure each year: the average web site had over 1,000 vulnerabilities in 2007, and only 56 in 2012. SQL injection, the most popular and most serious attack vector, is found in only 7% of their customer’s web sites.

This is the good news.

What made WhiteHat’s analysis this year especially valuable is that they also surveyed customers about their secure SDLC practices and the effectiveness of their security programs. Although the survey set was small (less than 20% of customers responded), this data allowed WhiteHat to correlate vulnerability data with secure SDLC practices operational controls, as well as appsec program drivers and breach data.

Compliance impact on Appsec

White Hat found that the main driver for fixing security vulnerabilities is compliance – this matches up with findings from the SANS Appsec survey last year.

But they also found that compliance is the number one reason that some vulnerabilities don’t get fixed: many organizations are following the letter of the law, doing what compliance says that they have to and only what they have to, not going any further even if it would make sense to do so from a risk management perspective or to meet customer demands.

Best Practices and Tools – What Works?

Training developers seems to help. More than half of White Hat’s customers had done at least some security training for developers. Organizations that invested in security training for developers had 40% fewer vulnerabilities and resolved them 59% faster.

But other best practices and tools don’t seem to be effective.

Just over half of customers relied on application libraries or frameworks with centralized security controls. Relying too much on these controls seems to provide a false sense of security: organizations that used security libraries or frameworks with security controls had 64% more vulnerabilities and resolved them 27% slower.

One factor that makes these organizations more vulnerable is that if the underlying framework is exploitable, then all of the sites that rely on it are vulnerable, like the recent security problems with Rails. Another problem may be that developers are naïve about what a security library will do for them: Apache Shiro or something like it for example will take care of a lot of application security problems, but it won’t protect your app from SQL injection or XSS or CSRF or other common attacks, leaving big holes for the bad guys. There’s more work that still needs to be done to make an application secure.

Organizations that use static analysis had 15% more vulnerabilities found through WhiteHat's dynamic testing, and resolved them on average 26% slower. Maybe because running a tool doesn't do anything if you don’t fix the vulnerabilities. Or because there isn't a high overlap between the vulnerabilities that static analysis finds and what’s found through dynamic analysis.

But Nothing Stops Breaches

85% of WhiteHat's customers test their apps pre-production, a third of them before every change is pushed out. These organizations are trying to do the right thing.

But almost one quarter of White Hat’s customers had experienced security breaches as a result of an application vulnerability. It doesn't seem to matter if they tested often, or if they trained their developers, or how much they trained them, or if they used use static analysis or secure libraries or a WAF or other operational security controls. These organizations were just as likely to experience a breach as organizations that didn't do as much training or as much testing or didn't use the tools.

WhiteHat’s report raises a lot of fascinating questions. Do the breach findings mean that security testing, or developer training or using secure libraries or other tools don’t work?

Or is this simply evidence of the essential asymmetry of the “Attacker’s Advantage and the Defender’s Dilemma”? Even though the number of serious vulnerabilities on average is declining significantly year on year, 86% of all the web sites that WhiteHat tested had at least one serious vulnerability (and keep in mind that WhiteHat - or any other vendor - can't catch every vulnerability). On average only 61% of these vulnerabilities were fixed and it took 193 days for this to get done. All it takes is one vulnerability for the bad guys to get in, and we’re still giving them too many chances and too much time to succeed.

Or maybe we just need more time to see the results of training and testing and tools and other best practices. Time for developers to understand and fix legacy bugs and to change how they design and build software to be more safe and secure in the first place, to “build security in”. Time for management to understand that compliance shouldn't be the main driver for building secure software. Time to raise the bar enough that the bad guys start looking for another, easier target. We’ll have to wait another year to see WhiteHat’s next report and see if some more time makes any real difference.

Monday, April 29, 2013

What does Code Ownership do to Code?

In my last post, I talked about Code Ownership models, and why you might want to choose one code ownership model (strong, weak/custodial or collective) over another. Most of the arguments over code ownership focus on managing people, team dynamics, and the effects on delivery. But what about the longer term effects on the shape, structure and quality of code – does the ownership model make a difference? What are the long-term effects of letting everyone working on the same code, or of having 1 or 2 people working on the same pieces of code for a long time?

Collective Code Ownership and Code Quality

Over time, changes tend to concentrate in certain areas of code: in core logic and in and behind interfaces (listen to Michael Feathers’ fascinating talk Discovering Startling Things from your Version Control System). This means that the longer a system has been running, the more chances there are for people to touch the same code. Some interesting research work backs up what should be obvious: that the people who understand the code the best are the people who work on it the most, and the people who know the code the best make less mistakes when changing it.

In Don’t Touch my Code!, researchers at Microsoft (BTW, the lead author Christian Bird is not a relative of mine, at least not a relative who I know) found that as more people touch the same piece of code, it leads to more opportunities for misunderstandings and more mistakes. Not surprisingly, people who hadn't worked on a piece of code before made more mistakes, and as the number of developers working on the same module increased, so did the chance of introducing bugs.

Another study, Ownership and Experience in Fix-Inducing Code tries to answer which is more important in code quality: “too many cooks spoil the broth”, or “given enough eyeballs, all bugs are shallow”? Does more people working on the same code lead to more bugs, or does having more people working on the code mean that there are more chances to find bugs early? This research team found that a programmer’s specific experience with the code was the most important factor in determining code quality – code that is changed by the programmer who does most of the work on that code is of higher quality than code written by someone who doesn't normally work on the code, even if that someone is a senior developer who has worked on other parts of the code. And they found that the fewer the people working on a piece of code, the fewer the bugs that needed to be fixed.

And a study on contributions to Linux reinforces that as the number of developers working on the same piece of code increase, the chance of bugs and security problems increases significantly: code touched by more than 9 developers is 16x more likely to have security vulnerabilities, and more vulnerabilities are introduced by developers who are making changes across many different pieces of code.

Long-term Effects of Ownership Approach on Code Structure

I've worked at shops where the same programmers have owned the same code for 3 or 4 or 5 or even 10 years or sometimes even longer. Over that time, that programmer’s biases, strengths, weaknesses and idiosyncrasies are all amplified, wearing deep grooves in the code. This can be a good thing, and a bad thing.

The good thing is that with one person making most or all of the changes, internal consistency in any piece of code will be high – you can look at a piece of code written by that developer and once you understand their approach and way of thinking, the patterns and idioms that they prefer, everything should be familiar and easy to follow. Their style and approach might have changed over time as they learned and improved as a developer, but you can generally anticipate how the rest of the code will work, and you’ll recognize what they are good at and what their blind spots are, what kind of mistakes they are prone to: as I mentioned in the earlier post, this makes code easier to review and easier to test and so easier to find and fix bugs.

If a developer tends to write good, clean, tight code, and if they are diligent about refactoring and keeping the code clean and tight, then most of the code will be good, clean, tight and easy to follow. Of course it follows that if they tend to write sloppy, hard-to-understand, poorly structured code, then most of it will be sloppy, hard-to-understand and poorly-structured. Then again, even this can be a good thing – at least bad code is isolated, and you know what you have to rewrite, instead of someone spreading a little bid of badness everywhere.

When ownership changes – when the primary contributor leaves, and a new owner takes over, the structure and style of the code will change as well. Maybe not right away, because a new owner usually takes some time to get used to the code before they put their stamp on it, but at some point they’ll start adapting it – even unconsciously – to their own preferences and biases and ways of thinking, refactoring or rewriting it to suit them.

If a lot of developers have worked on the same piece of code, they will introduce different ideas, techniques and approaches over time as they each do their part, as they refactor and rewrite things according to their own ideas of what is easy to understand and what isn't, what’s right and wrong. They will each make different kinds of mistakes. Even with clear and consistent shared team conventions and standards, differences and inconsistencies can build up over time, as people leave and new people join the team, creating dissonance and making it harder to follow a thought through the code, harder to test and review, and harder to hold on to the design.

Ownership Models and Refactoring

But as Michael Feathers has found through mining version control history, there is also a positive Ownership Effect on code as more people work on the same code.

Over time, methods and classes tend to get bigger because it’s easier to add code to an existing method than to write a new method, and easier to add another method to an existing class than create a new class. By correlating the number of developers who have touched a piece of code with method size, Feathers research shows that as the number of developers working on a piece of code increases, the average method size tends to get smaller. In other words, having multiple people working on a code base encourages refactoring and simpler code, because people who aren't familiar with the code have to simplify it first in order to understand it.

Feathers has also found that code behind APIs tends to be especially messy – because some interfaces are too hard to change, programmers are forced to come up with their own workarounds behind the scenes. Martin Fowler explains how this problem is made worse by strong code ownership, which inhibits refactoring and makes the code more internally rigid:

In strong code ownership, there's my code and your code. I can't change your code. If I want to change the name of one of my methods, and it's called by your code, I've got to get you to change the call into me before I can change my name. Or I've got to go through the whole deprecation business. Essentially any of my interfaces that you use become published in that situation, because I can't touch your code for any reason at all.

There's an intermediate ground that I call weak code ownership. With weak code ownership, there's my code and your code, but it is accepted that I could go in and change your code. There's a sense that you're still responsible for the overall quality of your code. If I were just going to change a method name in my code, I'd just do it. But on the other hand, if I were going to move some responsibilities between classes, I should at least let you know what I'm going to do before I do it, because it's your code. That's different than the collective code ownership model.

Weak code ownership and refactoring are OK. Collective code ownership and refactoring are OK. But strong code ownership and refactoring are a right pain in the butt, because a lot of the refactorings you want to make you can't make. You can't make the refactorings, because you can't go into the calling code and make the necessary updates there. That's why strong code ownership doesn't go well with refactoring, but weak code ownership works fine with refactoring.
(Design Principles and Code Ownership)

Ownership, Technical Debt or Deepening Insight

An individual owner has a higher tolerance for complexity, because after all it’s their code and they know how it works and it’s not really that hard to understand (not for them at least) so they don’t need to constantly simplify it just to make a change or fix something. It's also easy for them to take short cuts, and even short cuts on short cuts. This can build up over time until you end up with a serious technical debt problem – one person is always working on that code, not because the problem is highly specialized, but because the code has reached a point where nobody else but Scotty can understand it and make it work.

There’s a flip side to spending more time on code too. The more time that you spend on the same problem, the deeper you can see into it. As you return to the same code again and again you can recognize patterns, and areas that you can improve, and compromises that you aren't willing to accept any more. As you learn more about the language and the frameworks, you can go back and put in simpler and safer ways of doing things. You can see what the design really should be, where the code needs to go, and take it there.

There's also opportunity cost of not sticking to certain areas. Focusing on a problem allows you to create better solutions. Specifically, it allows you to create a vision of what needs to be done, work towards that vision and constantly revise where necessary... If you're jumping from problem to problem, you're more likely to create an inferior solution. You'll solve problems, but you'll be creating higher maintenance costs for the project in the long term.
Jay Fields Taking a Second Look at Collective Code Ownership

So far I've found that the only way for a team to take on really big problems is by breaking the problems up and letting different people own different parts of the solution. This means taking on problems and costs in the short term and the long term, trading off quality and productivity against flexibility and consistency – not only flexibility and consistency in how the team works, but in the code itself.

What I've also learned is that whether you have a team of people who each own a piece of the system, or a more open custodian environment, or even if everyone is working everywhere all of the time, you can’t let people do this work completely on their own. It’s critical to have people working together, whether you are pairing in XP or doing regular egoless code reviews. To help people work on code that they’ve never seen before – or to help long-time owners recognize their blind spots. To mentor and to share new ideas and techniques. To keep people from falling into bad habits. To keep control over complexity. To reinforce consistency – across the code base or inside a piece of code.

Thursday, April 25, 2013

Code Ownership – Who Should Own the Code?

A key decision in building and managing any development team is agreeing on how ownership of the code will be divided up: who is going to work on what code; how much work can be, and should be, shared across the team; and who will be responsible for code quality. The approach that you take has immediate impact on the team’s performance and success, and a long-term impact on the shape and quality of the code.

Martin Fowler describes three different models for code ownership on a team:

  1. Strong code ownership – every module is owned exclusively by someone, developers can only change the code that they own, and if they need to change somebody else’s code, they need to talk to that owner and get the owner’s agreement first – except maybe in emergencies.

  2. Weak code ownership – where modules are still assigned to owners, but developers are allowed to change code owned by other people. Owners are expected to keep an eye on any changes that other people make, and developers are expected to ask for permission first before making changes to somebody else’s code.

    This can be thought of as a shared custody model, where an individual is forced to share ownership of their code with others; or Code Stewardship, where the team owns all of the code, but one person is held responsible for the quality of specific code, and for helping other people make changes to it, reviewing and approving all major changes, or pairing up with other developers as necessary. Brad Appleton says the job of a code steward is not to make all of the changes to a piece of code, but to “safeguard the integrity + consistency of that code (both conceptually and structurally) and to widely disseminate knowledge and expertise about it to others”.

  3. Collective Code Ownership – the code base is owned or shared by the entire team, and everyone is free to make whatever changes they need – or want – to make, including refactoring or rewriting code that somebody else originally wrote. This is a model that came out of Extreme Programming, where the Whole Team is responsible together for the quality and integrity of the code and for understanding and keeping the design.

Arguments against Strong/Individual Code Ownership

Fowler and other XP advocates such as Kent Beck don’t like strong individual code ownership, because it creates artificial barriers and dependencies inside the team. Work will stall and pause if you need to wait for somebody to make or even approve a change, and one owner can often become the critical path for the entire team. This could encourage developers to come up with their own workarounds and compromises. For example, instead of changing an API properly (which would involve a change to somebody else’s code), they might shoe horn in a change, like stuffing something into an existing field. Or they might take a copy of somebody’s code and add whatever they need to it, making maintenance harder in the future.

Other arguments against strong ownership are that it can lead to defensiveness and protectionism on the part of some developers (“hey, don’t touch my code!”), where they take any criticism of the code as a personal attack, creating tension on the team and discouraging reviewers from offering feedback and discouraging refactoring efforts; and local over-optimization, if developers are given too much time to spend to polish and perfect their precious code without thinking of the bigger picture.

And of course there is the “hit by a truck factor” to consider – the impact that a person leaving the team will have on productivity if they’re the only one who works on a piece of code.

Ward Cunningham. one of the original XPers, also believes that there is more pride of ownership when code is shared, because everyone’s work is always on display to everyone else on the team.

Arguments against Collective Code Ownership

But there are also arguments against Collective Code Ownership. A post by Mike Spille lists some problems that he has seen when teams try to “over-share” code:

  • Inconsistency. No overriding architecture is discernible, just individual solutions to individual problems. Lots of duplication of effort results, often leading to inconsistent behavior
  • Bugs. People "refactoring" code they don't really understand break something subtle in the original code.
  • Constant rounds of "The Blame Game". People have a knee jerk reaction to bugs, saying "It worked when I wrote it, but since Joe refactored it....well, that's his problem now.".
  • Slow delivery. Nobody has any expertise in any given domain, so people are spending more time trying to understand other people's code, less time writing new code.

Matthias Friedrich, in Thoughts on Collective Code Ownership believes that Collective Code Ownership can only work if you have the right conditions in place:

  • Team members are all on a similar skill level
  • Programmers work carefully and trust each other
  • The code base is in a good state
  • Unit tests are in place to detect problematic changes (although unit tests only go so far)

Remember that Collective Code Ownership came out of Extreme Programming. Successful team ownership depends on everyone sharing an understanding of the domain and the design, and maintaining a high-level of technical discipline: not only writing really good automated tests as a safety net, but everyone following consistent code conventions and standards across the code base, and working in pairs because hopefully one of you knows the code, or at least with two heads you can try to help each other understand it and make fewer mistakes.

Another problem with Collective Code Ownership is that ownership is spread so thin. Justin Hewlett talks about the Tragedy of the Commons problem: people will take care of their own yard, but how many people will pick up somebody else’s litter in the park, or on a street - even if they walk in that park or down that street everyday? If the code belongs to everyone, then there is always “someone else” who can take care of it – whoever that “someone else” may be. As a developer, you’re under pressure, and you may never touch this piece of code again, so why not get whatever you need to do as quickly as possible and get on to the next thing on your list, and let "somebody else" worry about refactoring or writing that extra unit test or...?

Code Ownership in the Real World

I've always worked on or with teams that follow individual (strong or weak) code ownership, except for an experiment in pure XP and Collective Code Ownership on one team over 10 years ago. One (or maybe two) people own different pieces of the code and do all or most of the heavy lifting work on that code. Because it only makes sense to have the people who understand the code best do most of the work, or the most important work. It’s not just because you want the work “done right” – sometimes you don’t really have a choice over who is going to do the work.

As Ralf Sudelbucher points out, Collective Code ownership assumes that all coding work is interchangeable within a team, which is not always true.

Some work isn't interchangeable because of technology: different parts of a system can be written in different languages, with different architectures. You have to learn the language and the framework before you can start to understand the other problems that need to be solved.

Or it might be because of the problem space. Sure, there is always coding on any project that is “just typing”: journeyman work that is well understood, like scaffolding work or writing another web form or another CRUD screen or fixing up a report or converting a file format, work that has to be done and can be taken on by anyone who has been on the team for a while and who understands where to find stuff and how things are done – or who pairs up with somebody who knows this.

But other software development involves solving hard domain problems and technical problems that require a lot of time to understand properly – where it can take days, weeks, months or sometimes even years to immerse yourself in the problem space well enough to know what to do, where anyone can’t just jump in and start coding, or even be of much help in a pair programming situation.

The worst disasters occur when you turn loose sorcerers' apprentices on code they don't understand. In a typical project, not everyone can know everything - except in some mature domains where there have been few business paradigm shifts in the past decade or two.
Jim Coplien, Code Ownership

I met someone who manages software development for a major computer animation studio. His team has a couple of expert developers who did their PHDs and post grad work in animating hair – that’s all that they do, and even if you are really smart you’ll need years of study and experience just to understand how they do what they do.

Lots of scientific and technical engineering domains are also like this – maybe not so deeply specialized, but they involve non-trivial work that can’t be easily or competently done by generalists, even competent generalists. Programming medical devices or avionics or robotics or weapons control; or any business domain where you are working at the leading edge of problem solving, applying advanced statistical models to big data analysis or financial trading algorithms or risk-management models; or supercomputing and high-scale computing and parallel programming, or writing an operating system kernel or solving cryptography problems or doing a really good job of User Experience (UX) design. Not everyone understands the problems that need to be solved, not everyone cares about the problems and not everyone can do a good job of solving them.

Ownership and Doing it Right

If you want the work done right, or need it to be done right the first time, it should be done by someone who has worked on the code before, who knows it and who has proven that they can get the job done. Not somebody who has only a superficial familiarity with the code. Research work by Microsoft and others have shown that as more people touch the same piece of code, there is more chance of misunderstandings and mistakes – and that the people who have done the most work on a piece of code are the ones who make the fewest mistakes.

Fowler comes back to this in a later post about “Shifting to Code Ownership” where he shares a story from a colleague who shifted a team from collective code ownership to weak individual code ownership because weaker or less experienced programmers were making mistakes in core parts of the code and impacting quality, velocity and the team’s morale. They changed their ownership model so anyone could work around the code base, but if they needed to change core code, they had to do this with the help of someone who knew that part of the code well.

In deciding on an an ownership approach, you have to make a trade-off between flexibility and quality, team ownership and individual ownership. With individual ownership you can have siloing problems and dependencies on critical people, and you’ll have to watch out for trucks. But you can get more done, faster, better and by fewer people.

Thursday, April 18, 2013

Architecture-Breaking Bugs – when a Dreamliner becomes a Nightmare

The history of computer systems is also the history of bugs, including epic, disastrous bugs that have caused millions of $ in damage and destruction and even death, as well as many other less spectacular but expensive system and project failures. Some of these appear to be small and stupid mistakes, like the infamous Ariane 5 rocket crash, caused by a one-line programming error. But a one-line programming error, or any other isolated mistake or failure, cannot cause serious damage to a large system, without fundamental failures in architecture and design, and failures in management.

Boeing's 787 Dreamliner Going Nowhere
The Economist, Feb 26 2013

These kinds of problems are what Barry Boehm calls “Architecture Breakers”: where a system’s design doesn't hold up in the real world, when you run face-first into a fundamental weakness or a hard limit on what is possible with the approach that you took or the technology platform that you selected.

Architecture Breakers happen at the edges – or beyond the edges – of the design, off of the normal, nominal, happy paths. The system works, except for a “one in a million” exceptional error, which nobody takes seriously until a “once in a million” problem starts happening every few days. Or the system crumples under an unexpected surge in demand, demand that isn't going to go away unless you can’t find a way to quickly scale the system to keep up – and if you can’t, you won’t have a demand problem any more because those customers won’t be coming back. Or what looks like a minor operational problem turns out to be the first sign of a fundamental reliability or safety problem in the system.

Dreamliner is Troubled by Questions about Safety
NY Times, Jan 10, 2013

Finding Architecture Breakers

It starts off with a nasty bug or an isolated operational issue or a security incident. As you investigate and start to look deeper you find more cases, gaping holes in the design, hard limits to what the system can do, or failures that can’t be explained and can’t be stopped. The design starts to unravel as each problem opens up to another problem. Fixing it right is going to take time and money, maybe even going back to the drawing board and revisiting foundational architectural decisions and technology choices. What looked like a random failure or an ugly bug just turned into something much uglier, and much much more expensive.

Deepening Crisis for the Boeing 787
NY Times, Jan 17 2013

What makes these problems especially bad is that they are found late, way past design and even past acceptance testing, usually when the system is already in production and you have a lot of real customers using it to get real work done. This is when you can least afford to encounter a serious problem. When something does go wrong, it can be difficult to recognize how serious it is right away. It can take two or three or more failures before you realize – and accept – how bad things really are and before you see enough of a pattern to understand where the problem might be.

Boeing Batteries Said to Fail 10 Times Before Incident
Bloomberg, Jan 30 2013

By then you may be losing customers and losing money and you’re under extreme pressure to come up with a fix, and nobody wants to hear that you have to stop and go back and rewrite a piece of the system, or re-architect it and start again – or that you need more time to think and test and understand what’s wrong and what your options are before you can even tell them how long it might take and how much it could cost to fix things.

Regulators Around the Globe Ground Boeing 787s
NY Times, Jan 18 2013

What can Break your Architecture?

Most Architecture Breakers are fundamental problems in important non-functional aspects of a system:

  • Stability and data integrity: some piece of the system won’t stay up under load or fails intermittently after the system has been running for hours or days or weeks, or you lost critical customer data or you can’t recover and restore service fast enough after an operational failure.
  • Scalability and throughput: the platform (language or container or communications fabric or database – or all of them) are beautiful to work with, but can’t keep up as more customers come in, even if you throw more hardware at it. Ask Twitter about trying to scale-out Ruby or Facebook about scaling PHP or anyone who has ever tried to scale-out Oracle RAC.
  • Latency – requirements for real-time response-time/deadline satisfaction escalate, or you run into unacceptable jitter and variability (you chose Java as your run-time platform, what happens when GC kicks in?).
  • Security: you just got hacked and you find out that the one bug that an attacker exploited is only the first of hundreds or thousands of bugs that will need to be found and fixed, because your design or the language and the framework that you picked (or the way that you used it) is as full of security holes as Swiss cheese.
These problems can come from misunderstanding what an underlying platform technology or framework can actually do – what the design tolerances for that architecture or technology are. Or from completely missing, overlooking, ignoring or misunderstanding an important aspect of the design.

These aren’t problems that you can code your way out of, at least not easily. Sometimes the problem isn't in your code any ways: it’s in a third party platform technology that can’t keep up or won’t stay up. The language itself, or an important part of the stack like the container, database, or communications fabric, or whatever you are depending on for clustering and failover or to do some other magic. At high scale in the real world, almost any piece of software that somebody else wrote can and will fall short of what you really need, or what the vendor promised.

Boeing, 787 Battery Supplier at Odds over Fixes
Wall Street Journal, Feb 27 2013

You’ll have to spend time working with a vendor (or sometimes with more than one vendor) and help them understand your problem, and get them to agree that it’s really their problem, and that they have to fix it, and if they can’t fix it, or can’t fix it quickly enough, you’ll need to come up with a Plan B quickly, and hope that your new choice won’t run into other problems that may be just as bad or even worse.

How to Avoid Architecture Breakers

Architecture Breakers are caused by decisions that you made early and got wrong – or that you didn't make early enough, or didn't make at all. Boehm talks about Architecture Breakers as part of an argument against Simple Design – that many teams, especially Agile teams, spend too much time focused on the happy path, building new features to make the customer happy, and not enough time on upfront architecture and thinking about what could go wrong. But Architecture Breakers have been around a lot longer than Agile and simple design: in Making Software (Chapter 10 Architecting: How Much and When), Boehm goes back to the 1980s when he first recognized these kinds of problems, when Structured Programming and later Waterfall were the “right way” to do things.

Boehm’s solution is more and better architecture definition and technical risk management through Spiral software development: a lifecycle with architecture upfront to identify risk areas, which are then explored through iterative, risk-driven design, prototyping and development in multiple stages. Spiral development is like today’s iterative, incremental development methods on steroids, using risk-based architectural spikes, but with much longer iterative development and technical prototyping cycles, more formal risk management, more planning, more paperwork, and much higher costs.

Bugs like these can’t all be solved by spending more time on architecture and technical risk management upfront – whether through Spiral development or a beefed up, disciplined Agile development approach. More time spent upfront won`t help if you make naïve assumptions about scalability, responsiveness or reliability or security; or if you don’t understand these problems well enough to identify the risks. Architecture Breakers won’t be found in design reviews – because you won’t be looking for something that you don’t know could a problem – unless maybe you are running through structured failure modelling exercises like FMEA (Failure mode and effect analysis) or FMECA (Failure mode, effects and criticality analysis), which force you to ask hard questions, but which few people outside of regulated industries have even heard about.

And Architecture Breakers can't all be caught in testing, even extended longevity/soak testing and extensive fuzzing and simulated failures and fault injection and destructive testing and stress testing – even if all the bugs that are found this way are taken seriously (because these kinds of extreme tests are often considered unrealistic).

You have to be prepared to deal with Architecture Breakers. Anticipating problems and partitioning your architecture using something like the Stability Patterns in Michael Nygard’s excellent book Release It! will at least keep serious run-time errors from spreading and taking an entire system out (these strategies will also help with scaling and with containing security attacks). And if and when you do see a “once in a million” error in reviews or testing or production, understand how serious it can be, and act right away – before a Dreamliner turns into a nightmare.

Thursday, April 11, 2013

Software Security Status Quo?

Veracode has released the company’s State of Software Security Report for 2012, the 5th in a series of annual reports that analyzes data collected from customers using Veracode’s cloud-based application security scanning services.

The Important Numbers

As Veracode’s data set continues to get bigger, with more customers and more apps getting scanned, the results get more interesting.

For Web apps, the state of vulnerabilities remains unchanged over the past 18 months:

  • 1/3 of apps remain vulnerable to SQL Injection
  • 2/3 of apps remain vulnerable to XSS, and at least half of all vulnerabilities found in scanning are XSS vulnerabilities
Over the last 3 years, there are no significant changes in the occurrence of different vulnerabilities that cannot be accounted for by changes and improvements in Veracode's technology or testing methods.

For mobile platforms (Android, iOS and Java ME), the most common vulnerabilities found are related to crypto: 64% of Android apps, 58% of iOS apps, and 47% of Java ME apps have crypto vulnerabilities. Outside of crypto, the vulnerability distributions for the different mobile platforms are quite different. It’s possible that these differences are due to fundamental strengths and weaknesses of each platform (different architectures, different APIs and default capabilities provided), but I think that it is still too early to draw meaningful conclusions from this data, as the size of the data set is still very small (although it continues to increase in size, from 1% of the total sample to 3% over the last 18 months).

But Security Vulnerabilities are Getting Fixed, Right?

Some interesting data on remediation, based on Veracode customers resubmitting the same code base for subsequent scans. Almost half of their customers resubmit all or almost all of their apps for re-scanning, regardless of how critical the app is considered to the customer’s business. What’s interesting is which vulnerabilities people chose to fix - bugs that are found in the first scan, but don’t show up later.

For Java, the bugs that are most often fixed are:

  1. Untrusted search path
  2. CRLF injection
  3. Untrusted initialization
  4. Session fixation
  5. Dangerous function

So the first bugs to be fixed seem to be the easiest ones for developers to understand and take care of – low hanging fruit. Remediation decisions don’t seem to be based on risk, but on “let’s see what we can fix now and get the security guys off of our backs”. Security bugs are getting fixed, but it’s clear that SQL Injection and XSS bugs aren’t getting fixed fast enough, because there too many of these vulnerabilities to fix, and because many developers still don’t understand these problems well enough to fix them or prevent them in the first place. PHP developers are much more likely to remediate SQL injection vulnerabilities than Java or .NET developers, but it’s not clear why.

The Art and Science of Predictions

The report results were presented today in a webinar titled “We See the Future … and it’s Not Pretty”, which walked through the data and the predictions that Veracode drew from the data. While the findings seem sound, the predictions are less so: for example, that there will be higher turnover in security jobs (including CISO positions) because appsec programs are not proving effective, and security staff will give up – or get fired – as a result. I can’t see the thread that leads from the data to these conclusions. The authors should read (or re-read) The Signal and the Noise to understand what should go into a high-quality prediction, and what people should try to predict and what they shouldn't.

Tuesday, April 9, 2013

Penetration Testing Shouldn't be a Waste of Time

In a recent post on “Debunking Myths: Penetration Testing is a Waste of Time”, Rohit Sethi looks at some of the disadvantages of the passive and irresponsible way that application pen testing is generally done today: wait until the system is ready to go live, hire an outside firm or consultant, give them a short time to try to hack in, fix anything important that they find, maybe retest to get a passing grade, and now your system is 'certified secure'.

A test like this “doesn't tell you:

  • What are the potential threats to your application?
  • Which threats is your application “not vulnerable” to?
  • Which threats did the testers not assess your application for? Which threats were not possible to test from a runtime perspective?
  • How did time and other constraints on the test affect the reliability of results? For example, if the testers had 5 more days, what other security tests would they have executed?
  • What was the skill level of the testers and would you get the same set of results from a different tester or another consultancy?”

Sethi stresses the importance of setting expectations and defining requirements for pen testing. An outside pen tester will not be able to understand your business requirements or the internals of the system well enough to do a comprehensive job – unless maybe if your app is yet another straightforward online portal or web store written in PHP or Ruby on Rails, something that they have seen many times before.

You should assume that pen testers will miss something, possibly a lot, and there’s no way of knowing what they didn't test or how good a job they actually did on what they did test. You could try defect seeding to get some idea of how careful and smart they were (and how many bugs they didn’t find), but this assumes that you know an awful lot about your system and about security and security testing (and if you’re this good, you probably don’t need their help). Turning on code coverage analysis during the test will tell you what parts of the code didn't get touched – but it won’t help you identify the code that you didn't write but should have, which is often a bigger problem when it comes to security.

You can’t expect a pen tester to find all of the security vulnerabilities in your system – even if you are willing to spend a lot of time and money on it. But pen tests are important because this is a way to find things that are hard for you to find on your own:

  1. Technology-specific and platform-specific vulnerabilities
  2. Configuration and deployment mistakes in the run-time environment
  3. Pointy-Hat problems in areas like authentication and session management that should have been taken care of by the framework that you are using, if it works and if you are using it properly
  4. Fussy problems in information leakage, object enumeration and error handling – problems that look small to you but can be exploited by an intelligent and motivated attacker with time on their side
  5. Mistakes in data validation or output encoding and filtering, that look small to you but…
And if you’re lucky, some other problems that you should have caught on your own but didn’t, like weaknesses in workflow or access control or password management or a race condition.

Pen testing is about information, not about vulnerabilities

The real point of pen testing, or any other kind of testing, is not to find all of the bugs in a system. It is to get information.
  1. Information on examples of bugs in the application that need to be reviewed and fixed, how they were found, and how serious they are.
  2. Information that you can use to calibrate your development practices and controls, to understand just how good (or not good) you are at building software.

Testing doesn't provide all possible information, but it provides some. Good testing will provide lots of useful information. James Bach (Satisfice)

This information leads to questions: How many other bugs like this could there be in the code? Where else should we look for bugs, and what other kinds of bugs or weaknesses could there be in the code or the design? Where did these bugs come from in the first place? Why did we make that mistake? What didn't we know or what didn't we understand? Why didn't we catch the problems earlier? What do we need to do to prevent them or to catch them in the future? If the bugs are serious enough, or there are enough of them, this means going all the way through RCA and exercises like 5 Whys to understand and address Root Cause.

To get high-quality information, you need to share information with pen testers. Give the pen tester as much information as possible

  • Walk through the app with pen testers, hilight the important functions, and provide documentation
  • Take time to explain the architecture and platform
  • Share results of previous pen tests
  • Provide access behind proxies etc

Ask them for information in return: ask them to explain their findings as well as their approach, what they tried and what they covered in their tests and what they didn't, where they spent most of their time, what problems they ran into and where they wasted time, what confused them and what surprised them. Information that you can use to improve your own testing, and to make pen testing more efficient and more effective in the future.

When you’re hiring a pen tester, you’re paying for information. But it’s your responsibility to get as much good information as possible, to understand it and to use it properly.

Site Meter