Monday, April 18, 2016

DevOpsDays: Empathy, Scaling, Docker, Dependencies and Secrets

Last week I attended DevOpsDays 2016 in Vancouver. I was impressed to see how much the DevOps community has grown since I attended my first DevOpsDays event in Mountain View in 2012. There were more than 350 attendees, all of them doing interesting and important work.

Here are the main themes that I followed at this conference:

Empathy – Humanizing Engineering and Ops

There was a strong thread running through the conference on the importance of the human side of engineering and operations, understanding and empathizing with people across the organization. There were two presentations specifically on empathy: one from an engineering perspective by Joyent’s Matthew Smillie, and another excellent presentation on the neuroscience of empathy by Dave Mangot at Librato, which explained how we are all built for empathy and that it is core to our survival. There was also a presentation on gender issues, and several breakout sessions on dealing with people issues and bringing new people into DevOps.

Another side to this was how we use tools to collaborate and build connections between people. More people are depending on – and doing more with – chat systems like HipChat and Slack to do ChatOps: using chat as a general interface to other tools, and leveraging bots like Hubot to automatically trigger and guide actions such as tracking releases and handling incidents.

In some organizations, standups are being replaced with Chatups, as people find new ways to engage and connect with colleagues working remotely, both inside and outside their teams.

Scaling DevOps

All kinds of organizations are dealing with scaling problems in DevOps.

Scaling their organizations. Dealing with DevOps at the extremes: making it work in really large organizations, and figuring out how to do it effectively in small teams.

Scaling Continuous Delivery. Everyone is trying to push out more changes, faster and more often in order to reduce risk (by reducing the batch size of changes), increase engagement (for users and developers), and improve the quality of feedback. Some organizations are already reaching the point where they need to manage hundreds or thousands of pipelines, or optimize single pipelines shared by hundreds of engineers, building and shipping out changes (or newly baked containers) several times a day to many different environments.

A common story for CD as organizations scale up goes something like this:

  1. Start out building a CD capability in an ad hoc way, using Jenkins and adding some plugins and writing custom scripts. Keep going until it can’t keep up.
  2. Then buy and install a commercial enterprise CD toolset, transition over and run until it can’t keep up.
  3. Finally, build your own custom CD server and move your build and test fleet to the cloud and keep going until your finance department shouts at you.

Scaling testing. Coming up with effective strategies for test automation where it adds the most value: in unit testing (at the bottom of the test pyramid) and in end-to-end system testing (at the top of the pyramid). Deciding where to invest your time, understanding the tools and how to use them, and deciding what kinds of tests are worth writing and worth maintaining.

Scaling architecture. Which means more and more experiments with microservices.

Docker, Docker, Docker

Docker is everywhere. In pilots. In development environments. In test environments especially. And more often now, in production. Working with Docker, problems with Docker, and questions about Docker came up in many presentations, breakouts and hallway discussions.

Docker is creating new problems at the start and end of the CD pipeline.

First, it moves configuration management up front into the build step. Every change to the application, or to the stack that it is built and runs on, requires you to “bake a new cake” (Diogenes Rettori at OpenShift) and build and ship out a new container. This places heavy demands on your build environment. You need to find effective and efficient ways to manage all of the layers in your containers, caching dependencies and images to make builds run fast.

Docker is also presenting new challenges at the production end. How do you track and manage and monitor clusters of containers as the application scales out? Kubernetes seems to be the tool of choice here.

Depending on Dependencies

More attention is turning to builds and dependency management: identifying, streamlining and securing third-party and open source dependencies.

Not just your applications and their direct dependencies – but all of the nested dependencies in all of the layers below (the software that your software depends on, and the software that this software depends on, and so on and so on). Especially for teams working with heavy stacks like Java.

There was a lot of discussion on the importance of tracking dependencies and managing your own dependency repositories, using tools like Archiva, Artifactory or Nexus, and private Docker registries. And stripping back unnecessary dependencies to reduce the attack surface and run-time footprint of VMs and containers. One organization does this by continuously cutting down build dependencies and spinning up test environments in Vagrant until things break.

Docker introduces some new challenges, by making dependency management seem simpler and more convenient, and giving developers more control over application dependencies – which is good for them, but not always good for security:

  • Containers are too fat by default: they include generic platform dependencies that you don’t need and, if you leave this up to developers, developer tools that you don’t want to have in production.
  • Containers are shipped with all of the dependencies baked in. Which means that as containers are put together and shipped around, you need to keep track of what versions of what images were built with what versions of what dependencies and when, where they have been shipped to, and what vulnerabilities need to be fixed.
  • Docker makes it easy to pull down pre-built images from public registries. Which means it is also easy to pull images that are out of date or that could contain malware.

You need to find a way to manage these risks without getting in the way and slowing down delivery. Container security tools like Twistlock can scan for vulnerabilities, provide visibility into run-time security risks, and enforce policies.

Keeping Secrets Secret

Docker, CD tooling, configuration management tools like Chef, Puppet and Ansible, and other automation create another set of challenges for ops and security: how to keep the credentials, keys and other secrets that these tools need safe. Keeping them out of code and scripts, out of configuration files, and out of environment variables.

This needs to be handled through code reviews, access control, encryption, auditing, frequent key rotation, and by using a secrets manager like Hashicorp’s Vault.
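
As a rough sketch of what this can look like in code, the example below pulls a database credential from Vault over its HTTP API at run time, instead of reading it from a config file or an environment variable. The Vault address, secret path, key names and token file are invented for illustration, and a real application would normally use one of Vault's client libraries and a proper authentication method rather than raw HTTP calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class VaultSecretExample {

    public static void main(String[] args) throws Exception {
        // The token is issued to this app/host at deploy time,
        // not checked into source control or stored in an environment variable
        String token = new String(Files.readAllBytes(Paths.get("/etc/myapp/vault-token"))).trim();

        // Read the secret over Vault's HTTP API (address, path and response layout assumed here)
        URL url = new URL("https://vault.example.com:8200/v1/secret/myapp/db");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("X-Vault-Token", token);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // The secret values live under the "data" element of the response
        JsonNode data = new ObjectMapper().readTree(body.toString()).get("data");
        String dbUser = data.get("username").asText();
        String dbPassword = data.get("password").asText();

        // Use the credential to open a connection; never log it or write it back to disk
        System.out.println("Fetched database credential for user: " + dbUser);
    }
}

The point of the pattern is that the secret only ever lives in memory in the running process, and rotating it means updating Vault rather than hunting through scripts, config files and environment settings.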

Passion, Patterns and Problems

I met a lot of interesting, smart people at this conference. I experienced a lot of sincere commitment and passion, excitement and energy. I learned about some cool ideas, new tools to use and patterns to follow (or to avoid).

And new problems that need to be solved.

Wednesday, December 23, 2015

DZone's 2015 Guide to Application Security

DZone recently published a Guide to Application Security. It provides a good overview of effective appsec tools and practices, including my article 10 Steps to Secure Software, which looks at the latest release of OWASP's Proactive Controls project.

Wednesday, December 9, 2015

Help make Software Development Safe and Secure

The OWASP community is working on a new set of secure developer guidelines, called the "OWASP Proactive Controls". The latest draft of these guidelines has been posted in "world edit" mode so that anyone can make direct comments or edits to the document, even anonymously.

You can help make software development safer and more secure by reviewing and contributing to the guidelines at this link:

https://docs.google.com/document/d/1e38W6fGv6PmTEFSAwCr9rOj_ACAeKz1bKYgDj2mCACs/edit?usp=sharing

Thanks for your help!

Wednesday, November 11, 2015

DevOps for Financial Services

This summer I wrote an e-book for O'Reilly: DevOps for Finance: Reducing Risk through Continuous Delivery. It looks at DevOps and Continuous Delivery from the perspective of improving reliability and reducing operational and technical risk, while improving security and meeting compliance requirements. It includes an analysis of the challenges that financial services organizations face, and how to address these challenges, with case studies from LMAX, ING, Capital One, Wealthfront and my own firm.

Thursday, August 20, 2015

How to Prevent Catastrophic Failures in Complex Distributed Systems

In his now famous paper How Complex Systems Fail, Dr. Richard Cook explains how and why failures happen in complex systems:

Some Rules of Failure in Complex Systems

4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.

3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.

The net of this: Complex systems are essentially and unavoidably fragile. We can try, but we can’t stop them from failing – there are too many moving pieces, too many variables and too many combinations to understand and to test. And even the smallest change or mistake can trigger a catastrophic failure.

A New Hope

But new research at the University of Toronto on catastrophic failures in complex distributed systems offers some hope – a potentially simple way to reduce the risk and impact of these failures.

The researchers looked at distributed online systems that had been extensively reviewed and tested, but still failed in spectacular ways.

They found that most catastrophic failures were initially triggered by minor, non-fatal errors: mistakes in configuration, small bugs, hardware failures that should have been tolerated. Then, following rule #3 above, a specific and unusual sequence of events had to occur for the catastrophe to unfold.

The bad news is that this sequence of events can’t be predicted – or tested for – in advance.

The good news is that catastrophic failures in complex, distributed systems may actually be easier to fix than anyone previously thought. Looking closer, the researchers found that almost all (92%) of catastrophic failures are the result of incorrect handling of non-fatal errors. These mistakes in error handling caused the system to behave unpredictably, causing other errors, which weren’t always handled correctly or predictably, creating a domino effect.

More than half (58%) of catastrophic failures could be prevented by careful review and testing of error handling code. In 35% of the cases, the faults in error handling code were trivial: the error handler was empty or only logged a failure, or the logic was clearly incomplete. Easy mistakes to find and fix. So easy that the researchers built a freely available static analysis checker for Java byte code, Aspirator, to catch many of these problems.
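
To make that concrete, here is a small, hypothetical Java example of the kind of handler the researchers describe: a catch block that swallows a non-fatal error (or only logs it), so the rest of the system carries on as if the operation succeeded, next to a version that handles the same error deliberately. The class and method names are invented for the sketch.

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CheckpointWriter {

    private static final Logger LOG = Logger.getLogger(CheckpointWriter.class.getName());

    // The kind of handler the study flags: the error is swallowed,
    // so a failed checkpoint looks like a successful one to the rest of the system.
    void saveCheckpointBadly(Checkpoint checkpoint) {
        try {
            checkpoint.writeToDisk();
        } catch (IOException e) {
            // TODO: handle this "impossible" failure later
        }
    }

    // A handler that actually deals with the non-fatal error: log it with context,
    // degrade deliberately, and surface the failure to the caller.
    void saveCheckpoint(Checkpoint checkpoint) throws CheckpointException {
        try {
            checkpoint.writeToDisk();
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Checkpoint write failed; falling back to last good checkpoint", e);
            throw new CheckpointException("Unable to persist checkpoint", e);
        }
    }

    // Hypothetical types, defined only to keep the sketch self-contained
    interface Checkpoint {
        void writeToDisk() throws IOException;
    }

    static class CheckpointException extends Exception {
        CheckpointException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}

This is exactly the class of bug that basic coverage of the error handling paths, a careful code review, or a checker like Aspirator can catch.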

In another 23% of the cases, the error handling logic of a non-fatal error was so wrong that basic statement coverage testing or careful code reviews would have caught the mistakes.

The next challenge that the researchers encountered was convincing developers to take these mistakes seriously. They had to walk developers through understanding why small bugs in error handling, bugs that “would never realistically happen,” needed to be fixed, and why careful error handling is so important.

This is a challenge that we all need to take up – if we hope to prevent catastrophic failure in complex distributed systems.

Tuesday, July 7, 2015

Don’t Blame Bad Software on Developers – Blame it on their Managers

There’s a lot of bad software out there. Unreliable, insecure, unsafe and unusable. It’s become so bad that some people are demanding regulation of software development and licensing software developers as “software engineers” so that they can be held to professional standards, and potentially sued for negligence or malpractice.

Licensing would ensure that everyone who develops software has at least a basic level of knowledge and an acceptable level of competence. But licensing developers won’t ensure good software. Even well-trained, experienced and committed developers can’t always build good software. Because most of the decisions that drive software quality aren’t made by developers – they’re made by somebody else in the organization.

Product managers and Product Owners. Project managers and program managers. Executive sponsors. CIOs and CTOs and VPs of Engineering. The people who decide what’s important to the organization, what gets done and what doesn’t, and who does it – what problems the best people work on, what work gets shipped offshore or outsourced to save costs. The people who do the hiring and firing, who decide how much money is spent on training and tools. The people who decide how people are organized and what processes they follow. And how much time they get to do their work.

Managers – not developers – decide what quality means for the organization. What is good, and what is “good enough”.

Management Mistakes

As a manager, I’ve made a lot of mistakes and bad decisions over my career. Short-changing quality to cut costs. Signing teams up for deadlines that couldn’t be met. Giving marketing control over schedules and priorities, trying to squeeze in more features to make the customer or a marketing executive happy. Overriding developers and testers who told me that the software wouldn’t be ready, that they didn’t have enough time to do things properly. Letting technical debt add up. Insisting that we had to deliver now or never, and that somehow we would make it all right later.

I’ve learned from these mistakes. I think I know what it takes to build good software now. And I try to hold to it. But I keep seeing other managers make the same mistakes. Even at the world’s biggest and most successful technology companies, at organizations like Microsoft and Apple.

These are organizations that control their own destinies. They get to decide what they will build and when they need to deliver it. They have some of the best engineering talent in the world. They have all the good tools that money can buy – and if they need better tools, they just write their own. They’ve been around long enough to know how to do things right, and they have the money and scale to accomplish it.

They should write beautiful software. Software that is a joy to use, and that the rest of us can follow as examples. But they don’t even come close. And it’s not the fault of the engineers.

Microsoft Quality

Problems with software quality at Microsoft are so long-running that “Microsoft Quality” has become a recognized term for software that is just barely “good enough” to be marginally accepted – and sometimes not even that good.

Even after Microsoft became a dominant, global enterprise vendor, quality has continued to be a problem. A 2014 Computerworld article, “At Microsoft, quality seems to be job none”, complains about serious quality and reliability problems in early versions of Windows 10. But Windows 10 is supposed to represent a sea change for Microsoft under their new CEO, a chance to make up for past mistakes, to do things right. So what's going wrong?

The culture and legacy of “good enough” software has been in place for so long that Microsoft seems to be trapped, unable to improve even when they have recognized that good enough isn’t good enough anymore. This is a deep-seated organizational and cultural problem. A management problem. Not an engineering problem.

Apple’s Software Quality Problems

Apple sets themselves apart from Microsoft and the rest of the technology field, and charges a premium based on their reputation for design and engineering excellence. But when it comes to software, Apple is no better than anyone else.

From the epic public face plant of Apple Maps, to constant problems in iTunes and the App Store, problems with iOS updates that fail to install, data lost somewhere in the iCloud, serious security vulnerabilities, error messages that make no sense, and baffling inconsistencies and restrictions on usability, Apple’s software too often disappoints in fundamental and embarrassing ways.

And like Microsoft, Apple management seems to have lost their way:

I fear that Apple’s leadership doesn’t realize quite how badly and deeply their software flaws have damaged their reputation, because if they realized it, they’d make serious changes that don’t appear to be happening. Instead, the opposite appears to be happening: the pace of rapid updates on multiple product lines seems to be expanding and accelerating.

I suspect the rapid decline of Apple’s software is a sign that marketing is too high a priority at Apple today: having major new releases every year is clearly impossible for the engineering teams to keep up with while maintaining quality. Maybe it’s an engineering problem, but I suspect not — I doubt that any cohesive engineering team could keep up with these demands and maintain significantly higher quality.

Marco Arment, Apple has lost the functional high ground, 2015-01-04

Recent announcements at this year’s WWDC indicate that Apple is taking some extra time to make sure that their software works. More finish, less flash. We’ll have to wait and see whether this is a temporary pause or a sign that management is starting to understand (or remember) how important quality and reliability actually are.

Managers: Stop Making the Same Mistakes

If companies like Microsoft and Apple, with all of their talent and money, can’t build quality software, how are the rest of us supposed to do it? Simple. By not making the same mistakes:

  1. Putting speed-to-market and cost in front of everything else. Pushing people too hard to hit “drop dead dates”. Taking “sprints” literally: going as fast as possible, not giving the team time to do things right or a chance to pause and reflect and improve.

    We all have to work within deadlines and budgets, but in most business situations there’s room to make intelligent decisions. Agile methods and incremental delivery provide a way out when you can’t negotiate deadlines or cost, and don’t understand or can’t control the scope. If you can’t say no, you can say “not yet”. Prioritize work ruthlessly and make sure that you deliver the important things as early as you can. And because these things are important, make sure that you do them right.

  2. Leaving testing to the end. Which means leaving bug fixing to after the end. Which means delivering late and with too many bugs.

    Disciplined Agile practices all depend on testing – and fixing – as you code. TDD even forces you to write the tests before the code. Continuous Integration makes sure that the code works every time someone checks in. Which means that there is no reason to let bugs build up.

  3. Not talking to customers, not testing ideas out early. Not learning why they really need the software, how they actually use it, what they love about it, what they hate about it.

    Deliver incrementally and get feedback. Act on this feedback, and improve the software. Rinse and repeat.

  4. Ignoring fundamental good engineering practices. Pretending that your team doesn’t need to do these things, or can’t afford to do them, or doesn’t have time to do them, even though we’ve known for years that doing things right will help to deliver better software faster.

    As a Program Manager or Product Owner or a Business Owner you don’t need to be an expert in software engineering. But you can’t make intelligent trade-off decisions without understanding the fundamentals of how the software is built, and how software should be built. There’s a lot of good information out there on how to do software development right. There’s no excuse for not learning it.

  5. Ignoring warning signs.

    Listen to developers when they tell you that something can’t be done, or shouldn’t be done, or has to be done. Developers are generally too willing to sign on for too much, to reach too far. So when they tell you that they can’t do something, or shouldn’t do something, pay attention.

And when you make mistakes, which you will, learn from them; don’t waste them. When something goes wrong, get the team to review it in a retrospective or run a blameless post mortem to figure out what happened and why, and how you can get better. Learn from audits and pen tests. Take negative feedback from customers seriously. This is important, valuable information. Treat it accordingly.

As a manager, the most important thing you can do is to not set your team up for failure. That’s not asking for too much.

Wednesday, June 24, 2015

Top 10 Lists for Designing and Writing Secure and Safe Software

If you care about writing secure code, you should know all about these Top 10 lists:

OWASP Top 10

The OWASP Top 10 is a community-built list of the 10 most common and most dangerous security problems in online (especially web) applications. Injection flaws, broken authentication and session management, XSS and other nasty security bugs.

These are problems that you need to be aware of and look for, and that you need to prevent in your design and coding. The Top 10 explains how to test for each kind of problem to see if your app is vulnerable (including common attack scenarios), and basic steps you can take to prevent each problem.

If you’re working on mobile apps, take time to understand the OWASP Top 10 Mobile list.

IEEE Top Design Flaws

The OWASP Top 10 is written more for security testers and auditors than for developers. It’s commonly used to classify vulnerabilities found in security testing and audits, and is referenced in regulations like PCI-DSS.

The IEEE Center for Secure Design, a group of application security experts from industry and academia, has taken a different approach. They have come up with a Top 10 list that focuses on identifying and preventing common security mistakes in architecture and design.

This list includes good design practices such as: earn or give, but never assume trust; identify sensitive data and how they should be handled; understand how integrating external components changes your attack surface. The IEEE’s list should be incorporated into design patterns and used in design reviews to try and deal with security issues early.

OWASP Proactive Controls

IEEE’s approach is principle-based – a list of things that you need to think about in design, in the same way that you think about things like simplicity and encapsulation and modularity.

The OWASP Proactive Controls, originally created by security expert Jim Manico, is written at the developer level. It is a list of practical, concrete things that you can do as a developer to prevent security problems in coding and design. How to parameterize queries, and encode or validate data safely and correctly. How to properly store passwords and to implement a forgot password feature. How to implement access control – and how not to do it.
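
For example, the first of these controls, parameterizing queries, comes down to a few lines of code. The sketch below is illustrative only (the table and column names are invented); it uses plain JDBC, but every mainstream data access framework exposes the same idea.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountLookup {

    // Unsafe: concatenating user input into the SQL string lets input like
    //   x' OR '1'='1
    // change the meaning of the query.
    //   String sql = "SELECT id, balance FROM accounts WHERE owner = '" + owner + "'";

    // Safe: the query and the user-supplied value travel separately,
    // so the value can never be executed as SQL.
    public int countAccounts(Connection connection, String owner) throws SQLException {
        String sql = "SELECT COUNT(*) FROM accounts WHERE owner = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, owner);
            try (ResultSet results = statement.executeQuery()) {
                results.next();
                return results.getInt(1);
            }
        }
    }
}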

It points you to Cheat Sheets and other resources for more information, and explains how to leverage the security features of common languages and frameworks, and how and when to use popular, proven security libraries like Apache Shiro and the OWASP Java Encoder.
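
As a small, hedged example of leaning on one of those libraries rather than writing your own escaping code, output encoding with the OWASP Java Encoder looks roughly like this (the surrounding page markup is invented for the illustration):

import org.owasp.encoder.Encode;

public class CommentRenderer {

    // Encode untrusted data for the specific context it is written into,
    // so XSS payloads are rendered as inert text instead of executing.
    public String renderComment(String untrustedComment) {
        return "<div class=\"comment\">" + Encode.forHtml(untrustedComment) + "</div>";
    }

    // A different output context (a JavaScript string) needs a different encoder.
    public String renderCommentForScript(String untrustedComment) {
        return "var comment = \"" + Encode.forJavaScript(untrustedComment) + "\";";
    }
}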

Katy Anton and Jason Coleman have mapped all of these controls together (the OWASP Top 10, the OWASP Proactive Controls and the IEEE top design flaws), showing how the OWASP Proactive Controls implement safe design practices from the IEEE list and how they prevent or mitigate OWASP Top 10 risks.

You can use these maps to look for gaps in your application security practices, in your testing and coding, and in your knowledge, to identify areas where you can learn and improve.
