Building Real Software: September 2013

Thursday, September 19, 2013

Code Reviews Change Over Time

We've been doing code reviews for about 4 years now.

Getting Started with Code Reviews

From the start, developers would help each other out, look at code when someone asked, or sometimes a lead or a senior developer would step in and review code if we were seeing problems in testing or if someone had just joined the team and we anticipated that they needed some extra oversight. We also hired an expert to come in and do a security code review.

After we launched the system, we decided to be proactive and started risk-based reviews: we asked anyone writing code in high-risk areas (like framework and security plumbing, APIs, core business logic, or areas where we had seen problems before) to get a code review. We could have stopped there, and got a lot of value out of code reviews. But we decided to keep going, and made code reviews a standard practice.

This didn't happen overnight. It wasn't difficult to convince the team that code reviews could help – they’d already seen good results from the risk-based reviews that we had done. But it was difficult to change the way that people worked, and to make sure that they had enough time to do reviews properly: time to schedule and carry out the reviews, and time to understand and act on the feedback. And it also took a while to come up with an effective code review process.

At first we asked developers to pick a buddy and arrange a review. The results were mixed. Sometimes a developer would shop around for someone nice or someone really busy who they figured would let them off lightly; or two developers would trade-off (I’ll show you mine if you’ll show me yours) so they could get the reviews over with as quickly as possible. Because people didn't know how much time they needed to get reviews completed, reviews were often left until too late, and done after the code was already tested and released.

And because most people didn't have a lot of experience doing reviews, they weren't sure what they should look for, and how to give meaningful feedback. Developers were frustrated (he told me to do it this way, but on another change she told me that I should do it that way) and sometimes upset by negative criticism.

Eventually we decided that the leads would do most of the reviews. Although this added a lot more to the leads’ workload, and meant that they couldn't do as much coding themselves, it helped in a few ways. The lead developers usually had a much better understanding of the requirements and what the code was supposed to do than another developer who happened to be available, which mean that they had a better chance of finding real mistakes. And because the same person was doing most of their reviews, developers received consistent feedback.

How we do Code Reviews

How we do reviews has stayed pretty much the same over the years.

Non-trivial code changes are reviewed (either before or after the check-in) regardless of who wrote the code or what the code does. We don’t hold formal review meetings with a projector or print outs, and we haven’t found it necessary to use a review tool like Code Collaborator or Crucible or the internal tools that Google or Facebook use to manage and track reviews, although looking back adopting a tool like this early might have helped the team start off better.

Sometimes reviews are done face-to-face, but most of the time they’re done offline: the reviewer and developer exchanging information and maybe patch files by email, because we've found this to be more convenient and easier for everyone to schedule (although people will meet to go through the review in detail if they decide it is necessary).

What has changed over time is what the reviewers look for, and what they find.

Reviews for Correctness

It seems like programmers spend more time arguing over what to look for in code reviews than they actually spend doing code reviews.

We started doing code reviews to improve code quality and to find problems that we weren't catching in testing. Not as a way of teaching inexperienced developers how to write better code, or a way to spread knowledge about how the code works across the team. These might be happy side effects, but reviews should be about making sure that the code works right.

Instead of using long check lists, reviewers started off by asking a small set of questions when they looked at code:

Does the code do what it is supposed to, and does it look like the logic is correct?

Are there any obvious coding mistakes or code that looks fishy? Is there anything obviously missing or incomplete? What has to be fixed to make sure that the code will work correctly?

Reviewing for correctness is hard work – a reviewer has to spend some time understanding the requirements, and more time thinking about the code and trying to get into the head of the person who wrote it.

But even without a lot of time or familiarity, a good reviewer can still see glaring logical mistakes and oversights, inconsistencies (you changed a, b, and d, but you didn't change c. Was this on purpose?), common coding mixups (looking out for < instead of <= or sometimes > in comparisons, off-by-one errors, using the wrong variables in calculations or comparisons – buyer instead of seller, inviter instead of invitee), debugging code left on by accident, redundant code and checks that don’t seem to be needed and other suspect code that doesn't look right or is confusing.

Reviewers can also look out for basic security issues (access control rules/roles being applied properly, correct handling of sensitive data), backwards compatibility, and proper use of APIs and language features (especially for people new to the team).

And while we don’t ask reviewers to spend a lot of time reviewing automated tests, they may also do spot checks to look for holes in the test suite – especially if they find an obvious coding mistake (which means that some tests are either missing or wrong).
Does this code look the same and work the same as the rest of the code base?

Does it follow the style and conventions that the team has agreed to (indentation, headers, capitalization and naming conventions, import sequence…)? Does it use standard templates and auto-formatting rules? We’re not talking about nitpicky perfection here, but we don’t want every piece of code to look different either.

Architectural conventions. Does it use the framework and common libraries properly? Does it properly follow the big patterns (MVC etc) that we want people to understand and use?

We wanted developers to work out issues around consistency and style early on while the team was coming together and everyone was still learning and exploring. This turned out to take longer and cause more trouble than we expected. Not only is it tedious to check for, but style can also be a near religious thing for some people. A developer won’t argue over whether something is a bug or not, at least not after they've understood what’s wrong and why. And most developers appreciate learning how to use the framework or language more effectively, as long as this is done in a polite and respectful way. But some people get seriously wound up over aesthetics and style, what kind of code looks cleaner or is more elegant.

Reviewers could offer other feedback of course, advice and suggestions on how to simplify the solution and improve the code, but we left it up to the developer when or if to act on this advice as long as the code is correct and follows conventions.

Reviewing for Understandability

Over time, what you look for and what you get out of code reviews should change. Because the code has changed. And the people doing the work – the developers and the reviewers – have changed too.

It got easier for reviewers to find bugs and bad code as developers learned more about how to use the code checkers in their IDEs and after we found some good static analysis tools to automatically check for common coding bugs and bad coding practices. This mean that reviewers could spend their time looking for more important design mistakes, or timing errors or sequence accounting errors or concurrency bugs like race conditions and potential deadlocks, and other problems that the tools can’t find.

As the team gained more experience supporting the system in production, troubleshooting and fixing problems and working with Ops, reviewers started to look more closely at defensive programming and run-time safety: error and exception handling (especially exception conditions that are difficult to test for), data validation and limits checking, timeouts and retries, properly enforcing API contracts, and logging and alerting. Making sure that the system “wouldn't boink”.

Because we worked out most of the consistency issues early, it stopped being necessary to review for consistency after a while. There was lots of code, so it was natural and easy for developers to build on what was already there, and to keep following the existing style and patterns.

But this also meant that there was a lot more code that had to be read and understood. So code reviewers became more concerned with anything that made the code harder to read and understand. Although fussing over details like method and variable names and whether comments are needed or make sense might not seem that important, all of this feeds back into correctness – reviewers can’t tell if the code is correct if they aren't sure that they really understand it.

Reviewers also started looking out more for unnecessary code duplication because clones create more opportunities for mistakes when you have to make changes, and code that that hasn't been updated to use new libraries or APIs, or code that isn't needed any more (especially a problem if you use feature switches and don’t get around to removing them), or refactoring that didn't get finished or that got out of control.

Understanding, not Criticizing

Reviews are about understanding the code, and making sure that it works right, not criticizing it.

It’s important to stay out of arguments about “what I think good code is vs. what you think good code is”. A review should never run along the lines of “I know what good OO code is supposed to look like, and apparently you don’t, so let me show you.”

A review should only be about “I have to be able to understand this code well enough to make sure that it works, and in case I have to fix something in this code myself, even though I wouldn't have written it this way and wish that you hadn't (but I’ll keep this part to myself).”

When reviewers step across this line, when people’s pride and sense of professional self-worth are involved, things can get ugly fast.

Besides avoiding arguments, it’s a waste of time giving feedback and suggestions that nobody is going to act on because they’re too busy to get to it or they don’t agree or it isn't necessary. If you want to get good value out of reviews, keep them focused on what’s important.

The main questions that reviewers try to answer now when they look at code are:

Does this code do what it is supposed to do? And will it keep working even if something goes wrong?

Reviewers keep looking for the same kinds of obvious logic and functional problems as they did before. But they also ask: What happens if it runs into an error? Will Ops be able to tell what the problem was? What happens if some other part of the system that it relies on fails? Will it retry and recover, will it fail gracefully, or will it go boom? What’s the chance that this code could get garbage passed to it, and what would happen if it did? And can it pass garbage to some other part of the system?
Can I be sure that this code does what it is supposed to do? Can I understand the code? Can I follow the changes that were made?

Is the code maintainable? Could I fix it in production if something went wrong? Is there any code too that is cute or too clever for its own good? Are there any obvious, safe and simple things that can be done to make the code simpler and easier to understand and easier to change?

Some things get better with time, other things don’t

By looking at changes to code bases over time, Michael Feathers has found that most code is rarely or never changed once it is written; most changes are made to over and over to the same small percentage of the code base.

Because reviewers keep seeing the same code being changed, it changes how they work:

Reviewers who have seen the code before don’t have to spend as much time figuring out what’s going on, they understand the context and have a better chance of understanding the changes and whether the changes are done correctly.

Looking at the same code again, reviewers may see legacy problems that they didn't notice before. And they should be able to offer more valuable, less superficial feedback on how the code can be simplified and improved.

There’s a trade-off being made here too of course. Reviewers can also get stale looking at the same code, and stop looking closely enough. And the reviewer’s biases and blind spots will get amplified over time.
As reviewers work with the same developers over and over, they will get to know each other better. Reviewers will learn what to look out for, what the developer’s strengths and weaknesses are, the developer’s tendencies and biases and blind spots, the kinds of mistakes and oversights that they tend to make.

As developers get to know the reviewers better, feedback will get shorter and simpler, and there will be fewer misunderstandings.

And because developers will know more about what to expect from a review, they will start to check their own code more carefully, and try to find the problems before the reviewer does. This is what Jason Cohen at SmartBear calls “the Ego Effect”: developers write better code because they know that somebody is going to be looking at it closely, especially if it’s someone they respect. Some people feel that this is the most important reason to do code reviews – to make developers work more carefully, so they don’t look foolish in front of somebody else; and to think about writing code that other people will read.

Avoiding Diminishing Returns

Like any practice in software development, you will eventually see diminishing returns from code reviews especially if you keep doing the same things, looking for the same problems in the same way. You may even reach a point where code reviews aren't worth the cost or time.

This hasn't happened to us. We continue to get a lot of value out of reviews, maybe even more now.

Even as our tools have improved and as we've strengthened our testing capability we keep doing reviews because bug checking tools, reviews and testing find different problems and different kinds of problems. We've also found that reviews make testing more effective and efficient, because there’s less back-and-forth between developers and testers over mistakes that are caught and corrected earlier.

And now that reviews are an accepted part of how people work, we don’t have to spend time selling and re-selling the value of code reviews to the team or to the business. As long as code reviews are being taken seriously, and as long as you care about writing good code, they’re worth doing.

Thursday, September 12, 2013

The Real Cost of Change in Software Development

There are two widely opposed (and often misunderstood) positions on how expensive it can be to change or fix software once it has been designed, coded, tested and implemented. One holds that it is extremely expensive to leave changes until late, that the cost of change rises exponentially. The other position is that changes should be left as late as possible, because the cost of changing software is – or at least can be – essentially flat (that’s why we call it “soft”ware).

Which position is right? Why should we care? And what can we do about it?

Exponential Cost of Change

Back in the early 1980s, Barry Boehm published some statistics (Software Engineering Economics, 1981) which showed that the cost of making a software change or fix increases significantly over time – you can see the original curve that he published here.

Boehm looked at data collected from Waterfall-based projects at TRW and IBM in the 1970s, and found that the cost of making a change increases as you move from the stages of requirements analysis to architecture, design, coding, testing and deployment. A requirements mistake found and corrected while you are still defining the requirements costs almost nothing. But if you wait until after you've finished designing, coding and testing the system and delivering it to the customer, it can cost up to 100x as much.

A few caveats here. First, the cost curve is much higher in large projects (in smaller projects, the cost curve is more like 1:4 instead of 1:100). Those cases where the cost of change rises up to 100x are rare, what Boehm calls Architecture-Breakers, where the team gets a fundamental architectural assumption wrong (scaling, performance, reliability) and doesn't find out until after customers are already using the system and running into serious operational problems. And this analysis was all done on a small data sample from more than 30 years ago, when developing code was much more expensive and time-consuming and paperworky, and the tools sucked.

A few other studies have been done since then which mostly back up Boehm's findings – at least the basic idea that the longer it takes for you to find out that you made a mistake, the more expensive it is to correct it. These studies have been widely referenced in books like Steve McConnell’s Code Complete, and used to justify the importance of early reviews and testing:

Studies over the last 25 years have proven conclusively that it pays to do things right the first time. Unnecessary changes are expensive. Researchers at Hewlett-Packard, IBM, Hughes Aircraft, TRW, and other organizations have found that purging an error by the beginning of construction allows rework to be done 10 to 100 times less expensively than when it's done in the last part of the process, during system test or after release (Fagan 1976; Humphrey, Snyder, and Willis 1991; Leffingwell 1997; Willis et al. 1998; Grady 1999; Shull et al. 2002; Boehm and Turner 2004).

In general, the principle is to find an error as close as possible to the time at which it was introduced. The longer the defect stays in the software food chain, the more damage it causes further down the chain. Since requirements are done first, requirements defects have the potential to be in the system longer and to be more expensive. Defects inserted into the software upstream also tend to have broader effects than those inserted further downstream. That also makes early defects more expensive.

There’s some controversy over how accurate and complete this data is, how much we can rely on it, and how relevant it is today when we have much better development tools and many teams have moved from heavyweight sequential Waterfall development to lightweight iterative, incremental development approaches.

Flattening the Cost of Changing Code

The rules of the game should change with iterative and incremental development – because they have to.

Boehm realized back in the 1980s that we could catch more mistakes early (and therefore reduce the cost of development) if we think about risks upfront and design and build software in increments, using what he called the Spiral Model, rather than trying to define, design and build software in a Waterfall sequence.

The same ideas are behind more modern, lighter Agile development approaches. In Extreme Programming Explained (the 1st edition, but not the 2nd) Kent Beck states that minimizing the cost of change is one of the goals of Extreme Programming, and that a flattened change cost curve is “the technical premise of XP”:

Under certain circumstances, the exponential rise in the cost of changing software over time can be flattened. If we can flatten the curve, old assumptions about the best way to develop software no longer hold…

You would make big decisions as late in the process as possible, to defer the cost of making the decisions and to have the greatest possible chance that they would be right. You would only implement what you had to, in hopes that the needs you anticipate for tomorrow wouldn't come true. You would introduce elements to the design only as they simplified existing code or made writing the next bit of code simpler.

It’s important to understand that Beck doesn't say that with XP the change curve is flat. He says that these costs can be flattened if teams work towards this, leveraging key practices and principles in XP, such as:

Simple Design, doing the simplest thing that works, and deferring design decisions as late as possible (YAGNI), so that the design is easy to understand and easy to change
continuous, disciplined Refactoring to keep the code easy to understand and easy to change
Test-First Development – writing automated tests upfront to catch coding mistakes immediately, and to build up a testing safety net to catch mistakes in the future
developers collaborating closely and constantly with the Customer to confirm their understanding of what they need to build and working together in Pairs to design solutions and solve problems, and catch mistakes and misunderstandings early
relying on working software over documentation to minimize the amount of paperwork that needs to be done with each change (write code, not specs)
the team’s experience working incrementally and iteratively – the more that people work and think this way, the better they will get at it.

All of this makes sense and sounds right, although there are no studies that back up these assertions, which is why Beck dropped this change curve discussion from the second edition of his XP book. But by then the idea that change could be flat with Agile development had already become accepted by many people.

The importance of Feedback

Scott Amber agrees that the cost curve can be flattened in Agile development, not because of Simple Design, but because of the feedback loops which are fundamental to iterative, incremental development. Agile methods optimize feedback within the team, developers working closely together with each other and with the Customer and relying on continuous face-to-face communications. And following technical practices like Test-First Development and Pair Programming and Continuous Integration makes these feedback loops even tighter.

But what really matters is getting feedback from the people using the system – it’s only then that you know if you got it right or what you missed. The longer that it takes to design and build something and get feedback from real users, the more time and work that is required to get working software into a real customer’s hands, the higher your cost of change really is.

Optimizing and streamlining this feedback loop is what is driving the Lean Startup approach to development: defining a Minimum Viable Product (something that just barely does the job), getting it out to customers as quickly as you can, and then responding to user feedback through Continuous Deployment and A/B testing techniques until you find out what customers really want.

Even flat change can still be expensive

Even if you do everything to optimize these feedback loops and minimize your overheads, this still doesn’t mean that change will come cheap. Being fast isn’t good enough if you make too many mistakes along the way.

The Post Agilist uses the example of painting a house: assume that it costs $1,000 each time you paint the house, whether you paint it blue or red or white. The cost of change is flat. But if you have to paint it blue first, then red, then white before everyone is happy, you’re wasting time and money.

“No matter how expensive or change the ‘cost of change’ curve may be, the fewer changes that are made, the cheaper and faster the result will be…Planning is not a four letter word.” [however, I would like to point out that “plan” is].

Spending too much time upfront in planning and design is waste. But not spending enough time upfront to find out what you should be building and how you should be building it before you build it, and not taking the care to build it carefully, is also a waste.

Change gets more expensive over time

You also have to accept that the incremental cost of change will go up over the life of a system, especially once a system is being used. This is not just a technical debt problem. The more people using the system, the more people who might be impacted by the change if you get it wrong, the more careful you have to be, which means you need to spend more time on planning and communicating changes, building and testing a roll-back capability, and roll changes out slowly using Canary Releases and Dark Launching – which add costs and delays to getting feedback.

There are also more operational dependencies that you have to understand and take care of, and more data that you have to change or fix up, making changes even more difficult and expensive. If you do things right, keep a good team together and manage technical debt responsibly, these costs should rise gently over the life of a system – and if you don’t, that exponential change curve will kick in.

What is the Real Cost of Change?

Is the real cost of change exponential, or is it flat? The truth is somewhere in between.

There’s no reason that the cost of making a change to software has to be as high as it was 30 years ago. We can definitely do better today, with better tools and better, cheaper ways of developing software. The keys to minimizing the costs of change seem to be:

Get your software into customer hands as quickly as you can. I am not convinced that any organization really needs to push out software changes 10-50-100x a day, but you don’t want to wait months or years for feedback either. Deliver less, but more often. And because you’re going to deliver more often, it makes sense to build a Continuous Delivery pipeline so that you can push changes out efficiently and with confidence. Use ideas from Lean Software Development and maybe Kanban to identify and eliminate waste and to minimize cycle time.
We know that even with lots of upfront planning and design thinking, we won’t get everything right upfront - this is the Waterfall fallacy. But it’s also important not to waste time and money iterating when you don’t need to. Spending enough time upfront in understanding requirements and in design to get it at least mostly right the first time can save a lot later on.
And whether you’re working incrementally and iteratively, or sequentially, it makes good sense to catch mistakes early when you can, whether you do this through Test First Development and pairing, or requirements workshops and code reviews, whatever works for you.

Tuesday, September 3, 2013

This is how Facebook Develops and Deploys Software. Should you care?

A recently published academic paper by Prof. Dror Feitelson at Hebrew University, Eitan Frachtenberg a research scientist at Facebook, and Kent Beck (who is also doing something at Facebook), describes Facebook’s approach to developing and deploying their front-end software. While it would be more interesting to understand how back-end development is done (this is where the real heavy lifting is done scaling up to handle hundreds of millions of users), there are a few things in the paper that are worth knowing about.

Continuous Deployment at Facebook is not Continuous Deployment

Rather than planning work out in projects or breaking work into time boxed Sprints, Facebook developers do most of their work in independent, small changes which are released frequently. This makes sense in Facebook’s online business model, everyone constantly tuning the platform and trying out new options and applications in different user communities, seeing what sticks. It’s a credit to their architecture that so many small, independent changes can actually be done independently and cheaply.

Facebook says that they follow Continuous Deployment, but it’s not Continuous Deployment the way that IMVU made popular where every change is pushed out to customers immediately, or even how a company like Etsy does Continuous Deployment.

At Facebook, code can be released twice a day, but this is done mostly for bug fixes and internal code. New production code is released once per week: thousands of changes by hundreds of developers are packaged up by their small release team on Sundays, run through automated regression testing, and released on Tuesday if the developers who contributed the changes are present. Release engineers assess the risk of changes based on the size of the change, the amount of discussion done in code reviews (which is recorded through an internal code review tool), and on each developer’s “push karma”: how many problems they have seen from code by this developer before.

A tool called “Gatekeeper” controls what features are available to what customers to support dark launching, and all code is released incrementally – to staging, then a subset of users, and so on. Changes can be rolled-back if necessary – individually, or, as a last resort, an entire code release. However, like a lot of Silicon Valley devops shops, they mostly follow the “Real Men only Roll Forward” motto.

Code Ownership

A key to the culture at Facebook is that developers are individually responsible for the code that they wrote, for testing it and supporting it in production. This is reflected in their code ownership model:

Developers must also support the operational use of their software — a combination that’s become known as “devops.” This further motivates writing good code and testing it thoroughly. Developers’ personal stake in keeping the system running smoothly complements the engineering procedures and lets the system maintain quality at scale. Methodologies and tools aren’t enough by themselves because they can always be misused. Thus, a culture of personal responsibility is critical.

Consequently, most source files are modified by only a few engineers. Although at least one other engineer reviews all changes before they’re committed, a third of the source files have only been edited by one engineer, and another quarter by two. Only 10 percent of the files are handled by more than seven engineers. On the other hand, the distribution of engineers per file has a heavy tail, with the most widely shared file handled by no fewer than 870 distinct engineers. These widely shared files are predominantly library files and also include major configuration and top-level PHP files.

Testing? We don’t need no stinking testing…

Facebook doesn't have an independent test team, because, they say, they don’t need one.

First, they depend a lot on code reviews to find bugs:

At Facebook, code review occupies a central position. Every line of code that’s written is reviewed by a different engineer than the original author. This serves multiple purposes: the original engineer is motivated to ensure that the code is of high quality, the reviewer comes with a fresh mind and might find defects or suggest alternatives, and, in general, knowledge about coding practices and the code itself spreads throughout the company.

Developers are also responsible for writing unit tests and their own regression tests – they have “tens of thousands of regression tests” (which doesn't sound like near enough for 10+ million lines of mostly PHP code compiled into C++, both languages which are easy to make coding mistakes in) and automated performance tests.

And developers also test the software by using the development version of Facebook for their personal Facebook use. According to the authors, “this is just one aspect of the departure from traditional software development”. But Facebook developers using their own software internally (and passing this off as “testing”) is no different than the early days at Microsoft where employees were supposed to “eat their own dog food”, a practice that did little if anything to improve the quality of Microsoft products.

Facebook also depends on customers to test the software for them. Software is released in steps for A/B testing and “live experimentation” on subsets of the user base, whether customers want to participate in this testing or not. Because their customer base is so large, they can get meaningful feedback from testing with even a small percentage of users, which at least minimizes the risk and inconvenience to customers.

Security???

While performance is an important consideration for developers at Facebook, there is no mention of security checks or testing anywhere in this description of how Facebook develops and deploys software. No static analysis, dynamic analysis/scanning, pen testing or explanation of how the security team and developers work together, not even for “privacy sensitive code” – although this code is “held to a higher standard” they don’t explain what this “higher standard” is. Presumably they rely on use of libraries and frameworks to handle at least some appsec problems, and possibly look for security bugs in their code reviews, but they don't say.

There isn’t much information available on Facebook’s appsec program anywhere. The security team at Facebook seems to spend a lot of time educating people on how to use Facebook safely and how to develop Facebook apps safely and running their bug bounty program which pays outsiders to find security bugs for them.

A search on security on Facebook mostly comes back with a long list of public security failures, privacy violations and application security vulnerabilities found over the years and continuing up to the present day. Maybe the lack of an effective appsec program is the reason for this.

This is the way Facebook is Developed. Should you care?

While it’s interesting to get a look inside a high-profile organization like Facebook and how they approach development at scale, it’s not clear why this paper was written. There is little about what Facebook is doing (on their front-end development at least) that is unique or innovative, except maybe the way they use BitTorrent to push code changes out to thousands of servers like Twitter does, something that I already heard about a few years ago at Velocity and that has been written about before.

I like the idea of developers being responsible for their work, all the way into production, which is a principle that we also follow. Code reviews are good. Dark launching features is a good practice and has been a common practice in systems for a long time (even before it was called "dark launching"). Not having testers or doing appsec is not good. Otherwise, I'm not sure what the rest of us can learn from or would want to use from this.