After going live, we started building health checks into the system – run-time checks on operational dependencies and status to ensure that the system is setup and running correctly. Over time we have continued to add more run-time checks and tests as we have run into problems, to help make sure that these problems don’t happen again.
This is more than pings and Nagios alerts. This is testing that we installed the right code and configuration across systems. Checking code build version numbers and database schema versions. Checking signatures and checksums on files. That flags and switches that are supposed to be turned on or off are actually on or off. Checking in advance for expiry dates on licenses and keys and certs.
Sending test messages through the system. Checking alert and notification services, make sure that they are running and that other services that are supposed to be running are running, and that services that aren't supposed to be running aren't running. That ports that are supposed to be open are open and ports that are supposed to be closed are closed. Checks to make sure that files and directories that are supposed to be there are there, that files and directories that aren't supposed to be there aren't, that tables that are supposed to be empty are empty. That permissions are set correctly on control files and directories. Checks on database status and configuration.
Checks to make sure that production and test settings are production and test, not test and production. Checking that diagnostics and debugging code has been disabled. Checks for starting and ending record counts and sequence numbers. Checking artefacts from “jobs” – result files, control records, log file entries – and ensuring that cleanup and setup tasks completed successfully. Checks for run-time storage space.
We run these health checks at startup, or sometimes early before startup, after a release or upgrade, after a failover – to catch mistakes, operational problems and environmental problems. These are tests that need to run quickly and return unambiguous results (things are ok or they’re not). They can be simple scripts that run in production or internal checks and diagnostics in the application code – although scripts are easier to adapt and extend. Some require hooks to be added to the application, like JMX.
Other companies like Etsy do something similar with run-time asserts, using a unit test approach to check for conditions that must be in place for the system to work properly.
These tests can (and should) be run on development and test systems too, to make sure that the run-time environments are correct. The idea is to get away from checks being done by hand, operational checklists and calendar reminders and manual tests. Anything that has a dependency, anything that needs a manual check or test, anything in an operational checklist should have an automated run-time check instead.
The same ideas are behind Netflix’s over-hyped (though not always by Netflix) Simian Army, a set of robots that not only check for run-time conditions, but that also sometimes take automatic action when run-time conditions are violated – or even violate run-time conditions to test that the system will still run correctly.The army includes Security Monkey, which checks for improperly configured security groups, firewall rules, expiring certs and so on; and Exploit Monkey, which automatically scans new instances for vulnerabilities when they are brought up. Run-time checking is taken to an extreme in Conformity Monkey, which shuts down services that don’t adhere to established policies, and the famous Chaos Monkey, which automatically forces random failures on systems, in test and in production.
It’s surprising how much attention Chaos Monkey gets – maybe it’s the cool name, or because Netflix has Open Sourced it along with some of their other monkeys. Sure it’s ballsy to test failover in production by actually killing off systems during the day, even if they are stateless VM instances which by design should failover without problems (although this is the point, to make sure that they really do failover without problems like they are supposed to).
There's more to Netflix's success than run-time fault injection and the other monkeys. Still, automatically double-checking as much as you can at run-time is especially important in an engineering-driven, rapidly-changing Devops or Noops environment where developers are pushing code into production too fast to properly understand and verify in advance. But whether you are continuously deploying changes to production (like Etsy and Netflix) or not, getting developers and ops and infosec together to write automated health checks and run-time tests is an important part of getting control over what's actually happening in the system and keeping it running reliably.