Tuesday, November 27, 2012

Why Scrum Won

In the 1990s and early 2000s, a number of different lightweight "agile" development methods sprang up.

Today a few shops use Extreme Programming, most notably ThoughtWorks and Industrial Logic. But if you ask around, especially in enterprise shops, almost everybody who is “doing Agile” today is following Scrum or something based on Scrum.

What happened? Why did Scrum win out over XP, FDD, DSDM, Crystal, Adaptive Software Development, Lean, and all of the other approaches that have come and gone? Why are most organizations following Scrum or planning to adopt Scrum and not the Agile Unified Process or Crystal Clear (or Crystal Yellow, or Orange, Red, Violet, Magenta or Blue, Diamond or Sapphire for that matter)?

Is Scrum that much better than every other idea that came out of the Agile development movement?

Simplicity wins out

Scrum’s principal strength is that it is simpler to understand and follow than most other methods – almost naively simple. There isn't much to it: short incremental sprints, daily standup meetings, a couple of other regular and short planning and review meetings around the start and end of each sprint, some work to prioritize (or order) the backlog and keep it up-to-date, simple progress reporting, and a flat, simple team structure. You can explain Scrum in detail in a few pages and understand it in less than an hour.

This means that Scrum is easy for managers to get their heads around and easy to implement, at the team level at least (how to successfully scale Scrum to the enterprise level – in large integrated programs with distributed teams, using Scrum of Scrums or Communities of Practice or however you are supposed to do it – is still fuzzy as hell).

Scrum is easy for developers to understand too and easy for them to follow. Unlike XP or some of the other better-defined Agile methods, Scrum is not prescriptive and doesn't demand a lot of technical discipline. It lets the team decide what they should do and how they should do it. They can get up to speed and start “doing Agile” quickly and cheaply.

But simplicity isn't the whole answer

But there’s more to Scrum’s success than simplicity. The real trick that put Scrum out front is certification. There’s no such thing as a Certified Extreme Programmer but there are thousands of certified ScrumMasters and certified product owners and advanced certified developers and even more advanced certified professionals and the certified trainers and coaches and officially registered training providers that certified them.

And now the PMI has got involved with its PMI-ACP Certified Agile Practitioner designation which basically ensures that people understand Scrum, with a bit of XP, Lean and Kanban thrown in to be comprehensive.

Whether Scrum certification is good or bad or useful at all is beside the point.

Certification helped Scrum succeed for several reasons. First, certification led to early codification and standardization of what Scrum is all about. Consultants still have their own ideas and continue to fight among themselves over the only right way to do Scrum and the future direction of Scrum and what should be in Scrum and what shouldn't, but the people who are implementing Scrum don’t need to worry about the differences or get caught up in politics and religious wars.

Certification is a win-win-win…

Managers like standardization and certification – especially busy, risk-averse managers in large mainstream organizations. If they are going to “do Agile”, they want to make sure that they do it right. By paying for standardized certified training and coaching on a standardized method, they can be reassured that they should get the same results as everyone else. Because of standardization and certification, getting started with Scrum is low risk: it’s easy to find registered certified trainers and coaches offering good quality professional training programs and implementation assistance. Scrum has become a product – everyone knows what it looks like and what to expect.

Certification also makes it easier for managers to hire new people (demand a certification upfront and you know that new hires will understand the fundamentals of Scrum and be able to fit in right away) and it’s easier to move people between teams and projects that are all following the same standardized approach.

Developers like this too, because certification (even the modest CSM) helps to make them more employable, and it doesn't take a lot of time, money or work to get certified.

But most importantly, certification has created a small army of consultants and trainers who are constantly busy training and coaching a bigger army of certified Scrum practitioners. There is serious competition between these providers, pushing each other to do something to get noticed in the crowd, saturating the Internet with books and articles and videos and webinars and blogs on Scrum and Scrumness, effectively drowning out everything else about Agile development.

And the standardization of Scrum has also helped create an industry of companies selling software tools to help manage Scrum projects, another thing that managers in large organizations like, because these tools help them to get some control over what teams are doing and give them even more confidence that Scrum is real. The tool vendors are happy to sponsor studies and presentations and conferences about Agile (er, Scrum), adding to the noise and momentum behind Scrum.

Scrum certification is a win-win-win for managers, developers, authors, consultants and vendors.

It looks like David Anderson may be trying to do a similar thing with Kanban certification. It’s hard to see Kanban taking over the world of software development – while it’s great for managing support and maintenance teams, and helps to control work flow at a micro-level, Kanban doesn't fit larger project work. But then neither does Scrum. And who would have picked Scrum as the winner 10 years ago?

Tuesday, November 20, 2012

Predictability - Making Promises you can Keep

Speed – being first to market, rapid innovation and conducting fast cheap experiments – is critically important to startups and many high tech firms. This is where Lean Startup ideas and Continuous Deployment come in. And this is why many companies are following Agile development, to design and deliver software quickly and flexibly, incorporating feedback and responding to change.

But what happens when software – and the companies that build software – grow up? What matters at the enterprise level? Speed isn't enough for the enterprise. You have to balance speed and cost and quality. And stability and reliability. And security. And maintainability and sustainability over the long term. And you have to integrate with all of the other systems and programs in your company and those of your customers and partners.

Last week, a group of smart people who manage software development at scale got together to look at all of these problems, at Construx Software’s Executive Summit in Seattle. What became clear is that for most companies, the most important factor isn’t speed, or productivity or efficiency – although everyone is of course doing everything they can to cut costs and eliminate waste. And it isn’t flexibility, trying to keep up with too much change. What people are focused on most, what their customers and sponsors are demanding, is predictability – delivering working software when the business needs it, being a reliable and trusted supplier to the rest of the business, to customers and partners.

Enterprise Agile Development and Predictability

Steve McConnell’s keynote on estimation in Agile projects kicked this off. A lot of large companies are adopting Agile methods because they’ve heard and seen that these approaches work. But they’re not using Agile out of the box because they’re not using it for the same reasons as smaller companies.

Large companies are adapting Agile into hybrid plan-based/Waterfall approaches, combining upfront scope definition, estimating and planning with delivering the project incrementally in Agile time boxes. This is not about discovery and flexibility, defining and designing something as you go along – the problems are too big, they involve too many people and too many parts of the business, and there are too many dependencies and constraints that need to be satisfied. Emergent design and iterative product definition don’t apply here.

Enterprise-level Agile development isn’t about speed either, or “early delivery of value”. It’s about reducing risk and uncertainty. Increasing control and visibility. Using story points and velocity and release Burn Up reporting and evidence of working software to get early confidence about progress on the project and when the work will be done.

The key is to do enough work upfront so that you can make long-term commitments to the business – to understand what the business needs, at least at a high level, and estimate and plan this out first. Then you can follow Agile approaches to deliver working software in small steps, and to deal with changes as they come in. As McConnell says, it’s not about “responding to change over following a plan”. It’s having a plan that includes the ability to respond to change.

By continuously delivering small pieces of working software, and calibrating their estimates with real project data using velocity, a team working this way will be able to narrow the “cone of uncertainty” much faster – they’ll learn quickly about their ability to deliver and about the quality of their estimates, as much as 2x faster than teams following a fully sequential Waterfall approach.
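To make the calibration idea concrete, here is a minimal sketch of how a team might project the remaining sprints from observed velocity – the numbers and names are illustrative, not from McConnell's talk. Reporting a best/worst spread rather than a single date is what stands in for the cone of uncertainty:

    import math

    # Illustrative data: points delivered in each completed sprint,
    # and the estimated points left in the release.
    completed_sprint_points = [23, 31, 27, 29]
    remaining_backlog_points = 240

    avg_velocity = sum(completed_sprint_points) / len(completed_sprint_points)
    lo_velocity = min(completed_sprint_points)
    hi_velocity = max(completed_sprint_points)

    # Report a range of sprint counts rather than a single date --
    # the spread is the team's current cone of uncertainty, and it
    # narrows as more completed sprints feed the calibration.
    for label, velocity in [("best case", hi_velocity),
                            ("expected", avg_velocity),
                            ("worst case", lo_velocity)]:
        sprints = math.ceil(remaining_backlog_points / velocity)
        print(f"{label}: ~{sprints} sprints remaining")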

There are still opportunities to respond to change and reprioritize. But this is more about working incrementally than iteratively.

Kanban and Predictability

Enterprise development managers are also looking at Kanban and Lean Development to manage waste and to maintain focus. But here too the value is in improving predictability, to smooth work out and reduce variability by finding and eliminating delays and bottlenecks. It’s not about optimization and Just-in-Time planning.

As David Anderson explained in his keynote on Delivering Better Predictability, Business Agility and Good Governance with Kanban, senior executives care about productivity, cost and quality – but what they care about most is predictability. The goal of a good software development manager is to be able to make a customer promise that their organization can actually meet.

You do this by keeping the team focused on the business of making software, and trying to drive out everything else: eliminating delays and idle time, cutting back administrative overhead, not doing work that will end up getting thrown away, minimizing time wasted in multi-tasking and context-switching, and not starting work before you’ve finished the work that you’ve already committed to. Anderson says that managers like to start on new work as it comes in because “starting gives your customer a warm comfortable feeling” – until they find out you’ve lied to them, because “we’re working on it” doesn’t mean that the work is actually getting done, or will ever get done. This includes fixing bugs – you don’t just fix bugs right away because you should; you fix bugs as they’re found because the work involved is smaller and more predictable than coming back to fix them later.

Teams can use Kanban to dynamically prioritize and control the work in front of them, to balance support and maintenance requirements against development work and fixed date commitments with different classes of service, and limit Work in Progress (WIP) to shorten lead times and improve throughput, following the Theory of Constraints. This lets you control variability and improve predictability at a micro-level. But you can also use actual throughput data and Cumulative Flow reporting to project forward on a project level and understand how much work the team can do and when they will be done.
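As a rough sketch of that projection (the board data here is made up for illustration), Little's Law ties the quantities together – average lead time equals average WIP divided by average throughput – and the same throughput number projects completion for the rest of the backlog:

    # Illustrative Kanban board data
    weekly_throughput = [6, 8, 7, 5, 9, 7]   # items finished per week
    current_wip = 12                          # items on the board now
    backlog_items = 80                        # committed items not yet started

    avg_throughput = sum(weekly_throughput) / len(weekly_throughput)

    # Little's Law: the lead time a new item can expect once it starts
    avg_lead_time_weeks = current_wip / avg_throughput

    # Straight-line projection for the committed work
    weeks_to_complete = (current_wip + backlog_items) / avg_throughput

    print(f"average lead time: {avg_lead_time_weeks:.1f} weeks")
    print(f"projected completion: ~{weeks_to_complete:.0f} weeks")

Note that cutting the WIP limit shortens the lead time directly, without touching throughput – which is why limiting Work in Progress improves predictability at the micro-level.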

What’s interesting to me is seeing how the same ideas and methods are being used in very different ways by very different organizations – big and small – to achieve success, whether they are focused on fast iterations and responding to rapid change, or managing large programs towards a predictable outcome.

Friday, November 9, 2012

Health Checks, Run-time Asserts and Monkey Armies

After going live, we started building health checks into the system – run-time checks on operational dependencies and status to ensure that the system is set up and running correctly. Over time we have continued to add more run-time checks and tests as we have run into problems, to help make sure that these problems don’t happen again.

This is more than pings and Nagios alerts. This is testing that we installed the right code and configuration across systems. Checking code build version numbers and database schema versions. Checking signatures and checksums on files. That flags and switches that are supposed to be turned on or off are actually on or off. Checking in advance for expiry dates on licenses and keys and certs.

Sending test messages through the system. Checking alert and notification services, making sure that they are running and that other services that are supposed to be running are running, and that services that aren't supposed to be running aren't running. That ports that are supposed to be open are open and ports that are supposed to be closed are closed. Checks to make sure that files and directories that are supposed to be there are there, that files and directories that aren't supposed to be there aren't, that tables that are supposed to be empty are empty. That permissions are set correctly on control files and directories. Checks on database status and configuration.

Checks to make sure that production and test settings are production and test, not test and production. Checking that diagnostics and debugging code has been disabled. Checks for starting and ending record counts and sequence numbers. Checking artefacts from “jobs” – result files, control records, log file entries – and ensuring that cleanup and setup tasks completed successfully. Checks for run-time storage space.

We run these health checks at startup, or sometimes even before startup, after a release or upgrade, and after a failover – to catch mistakes, operational problems and environmental problems. These are tests that need to run quickly and return unambiguous results (things are ok or they’re not). They can be simple scripts that run in production or internal checks and diagnostics in the application code – although scripts are easier to adapt and extend. Some require hooks to be added to the application, like JMX.
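To give a flavour of what these checks look like, here is a stripped-down sketch of a post-deploy health check script – the paths, ports and expected digest are hypothetical, not our actual system:

    import hashlib
    import os
    import socket
    import sys

    def check(description, ok):
        # Unambiguous pass/fail output for each check
        print(("OK  " if ok else "FAIL") + " " + description)
        return ok

    def file_checksum_matches(path, expected_sha256):
        if not os.path.exists(path):
            return False
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest() == expected_sha256

    def port_open(host, port, timeout=2):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    results = [
        check("build version file present",
              os.path.exists("/opt/app/BUILD_VERSION")),
        check("config file matches release manifest",  # hypothetical expected digest
              file_checksum_matches("/opt/app/etc/app.conf",
                  "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")),
        check("service port 8443 open", port_open("localhost", 8443)),
        check("debug port 9999 closed", not port_open("localhost", 9999)),
    ]

    # Fast, unambiguous result: exit non-zero if anything failed
    sys.exit(0 if all(results) else 1)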

Run-time Asserts

Other companies like Etsy do something similar with run-time asserts, using a unit test approach to check for conditions that must be in place for the system to work properly.

These tests can (and should) be run on development and test systems too, to make sure that the run-time environments are correct. The idea is to get away from checks being done by hand, operational checklists and calendar reminders and manual tests. Anything that has a dependency, anything that needs a manual check or test, anything in an operational checklist should have an automated run-time check instead.
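A minimal sketch of the idea – the helper and the checked setting are hypothetical, not Etsy's actual code. Unlike a unit-test assert, a run-time assert records the violation for monitoring instead of crashing production:

    import logging
    import os

    log = logging.getLogger("runtime-asserts")

    def runtime_assert(condition, message):
        # Don't crash production the way a unit-test assert would --
        # log the violation and let monitoring/alerting pick it up.
        if not condition:
            log.error("RUNTIME ASSERT FAILED: %s", message)

    # Example: an environment condition the system depends on
    runtime_assert(os.environ.get("APP_ENV") == "production",
                   "APP_ENV should be 'production' on this host")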

Monkey Armies

The same ideas are behind Netflix’s much-hyped (though not always by Netflix itself) Simian Army, a set of robots that not only check for run-time conditions, but also sometimes take automatic action when run-time conditions are violated – or even violate run-time conditions themselves to test that the system will still run correctly.

The army includes Security Monkey, which checks for improperly configured security groups, firewall rules, expiring certs and so on; and Exploit Monkey, which automatically scans new instances for vulnerabilities when they are brought up. Run-time checking is taken to an extreme in Conformity Monkey, which shuts down services that don’t adhere to established policies, and the famous Chaos Monkey, which automatically forces random failures on systems, in test and in production.

It’s surprising how much attention Chaos Monkey gets – maybe it’s the cool name, or because Netflix has open sourced it along with some of their other monkeys. Sure it’s ballsy to test failover in production by actually killing off systems during the day, even if they are stateless VM instances which by design should fail over without problems (although that is the point – to make sure that they really do fail over without problems, like they are supposed to).
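The core idea fits in a few lines. This toy sketch – not Netflix's implementation – picks a random instance from a pool and kills it, trusting the rest of the system to fail over as designed:

    import random

    def chaos_step(instance_ids, terminate):
        # Force one random failure; recovering from it is up to the
        # system's failover design.
        victim = random.choice(instance_ids)
        print(f"chaos: terminating {victim}")
        terminate(victim)

    # Usage with a stubbed-out terminate function:
    chaos_step(["i-0a1", "i-0b2", "i-0c3"], terminate=lambda instance_id: None)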

There's more to Netflix's success than run-time fault injection and the other monkeys. Still, automatically double-checking as much as you can at run-time is especially important in an engineering-driven, rapidly-changing DevOps or NoOps environment where developers are pushing code into production too fast to properly understand and verify it in advance. But whether you are continuously deploying changes to production (like Etsy and Netflix) or not, getting developers and ops and infosec together to write automated health checks and run-time tests is an important part of getting control over what's actually happening in the system and keeping it running reliably.

Monday, November 5, 2012

SANS Ask the Expert - the Cost of Remediation

An interesting interview with Dan Cornell of Denim Group on the work that they are doing to understand the cost of remediating security vulnerabilities, here on the SANS Application Street Fighter blog.