There is a lot of talk in the devops community about the importance of sharing principles and values, and about silo busting: breaking down the “wall of confusion” between developers and operations to create agile, cross-functional teams. Radical improvement through fundamental organizational changes and building an entirely new culture.
But it doesn’t have to be that hard. All it took us was 3 simple, but important, steps.
Reliability First
When we first launched our online platform, things were pretty crazy. Sales was busy with customer feedback and onboarding more customers. Development was still finishing the backlog of features that were supposed to already be done, responding to changes from sales and partners, and helping to support the system. Ops was trying to stabilize everything, help onboard more customers and address performance issues as more customers came on. We were all rushing forwards, but not always in the same direction.
Our CEO recognized this and made an important decision. He made reliability the #1 priority – the reliability and integrity of our systems and of our customers’ data, and the services we provided. For everyone: not just ops, but development, sales, marketing, compliance, admin. Above everything else. It was more important not to mess up the customers that we had than to get new customers or hit deadlines or cut costs.
Reliability, resilience, integrity have remained our #1 driver for the company over several years as we continued to grow.
This meant that everyone was working towards the same goals – and the goals were easy to understand and measure: improving MTTF, MTTD and MTTR windows; reducing bug counts and variability in response time, improving results of audits and pen tests.
It gave people more reasons to work together at more levels..
It reduced politics and conflicts to a minimum.
Development’s first priority changed from pushing features out ASAP to making sure that the system was running optimally and that any changes wouldn't negatively impact customers. This meant more time spent with ops on understanding the run-time, more time troubleshooting operational issues, more reviews and regression testing and stress testing, anticipating compatibility issues, planning for roll-back and roll-forward recovery.
Smaller, more frequent releases
Spending some more time on testing and reviews and working with ops meant that it took longer to complete a feature. But we still had to keep up with customer demands – we still had to deliver.
We did this by shortening the software release cycle, from 2-3 months to 2-3 weeks or sometimes shorter. Delivering less in each release, sometimes only 1 new feature or some fixes, but delivering much more often. If a change or feature had to be delayed, if developers or testers needed more time to make sure that it was ready, it wasn't a big deal – if something wasn't ready this week, it would be ready next week or soon after, still fast enough for the business and for customers.
Planning and working in shorter horizons meant that development could respond faster to changes in direction and changing priorities, so developers were always focused on delivering what was most important at the time.
Shorter releases drove development to be more agile, to think and work faster. To automate more testing. To pay more attention to deployment, make the steps simpler and more efficient – and safer.
Fewer changes batched together made it easier to review and test. Less chances to make mistakes. Easier to understand what went wrong when we did make a mistake, and less time to fix it.
RCA – Learn from Mistakes
We still made mistakes, shit still happened. When something went seriously wrong, it was my job to explain it to our customers and owners. What went wrong, why, and what we were going to do to make sure that it didn’t happen again.
We didn't know about blameless post mortems, but this is the way we did it anyway. We got developers and testers and ops and managers together in Root Cause Analysis sessions to carefully examine what happened, what went wrong, understand why, and fix it.
We made sure that people focused on the facts and on problem solving: what happened, what happened next, what did we see, what didn’t we see, why? What could we do to fix it or to prevent it from happening again or to recognize and respond to problems like this more effectively in the future? Better training, better tools, better procedures, better documentation, better error handling, better testing and reviews, better configuration checks and run-time checking, better information and better ways of communicating it.
Focusing on details and problems, not people. Proving that it was ok to make mistakes, but not ok to hide them. We got much better: at operations, testing, design, deployment, monitoring, incident handling. And better as an organization. We built transparency and trust within and across teams. We learned how to move forward from failure, and to be more resilient and confident in our ability to deal with serious problems.
Delivering Better and Faster Together
We didn't restructure or change who we were as an organization. Dev and ops still work in separate organizations for different managers in different countries. They have their own projects and their own ways of working, and they don’t always speak the same language or agree on everything.
We have lots of checks and balances and handoffs and paperwork between dev and ops to make sure that things are done properly and to make the regulators happy. There are still more steps that we could automate or simplify, more we can do to build out our Continuous Delivery pipelines, more things we can get out of Puppet and Vagrant and other cool tools.
But if devops is about developers and operations sharing responsibility for the system, trusting each other and helping each other to make sure that the system is always working correctly and optimally, looking for better solutions together, delivering better and faster – then we’ve been doing devops for a while now.