Dynamic Infrastructure, Disaster Recovery and Netflix’s Simian Army

Disaster Recovery has never been a fun thing to plan – and even less of a fun thing to test successfully, especially for the mission-critical apps that most need this level of D.R. planning and testing.

Traditionally, cold standby servers (often entire standby environments) need to be provisioned, brought up and kept running. Not only are these environments expensive to create, they are also time consuming to maintain (consider keeping each tier of the standby environment in sync with each tier of the live production environment).

Cloud Solution versus Traditional Solution – for Common Scenarios

Have you ever worried about a production server (or servers) in your farm crashing?
Have you ever had potentially breaking code changes checked in by developers – that haven’t been completely tested prior to production push?
Have you ever dealt with increasing response times from your web application in the face of increasing user sessions – often to the point where the application crashes?

Scenario	Traditional Solution	Cloud Solution
A server in your web (or data) farm crashes.	Failover to existing nodes that haven’t crashed. However, if multiple nodes crash – or the load balancer crashes – you are still looking at serious downtime.	A monitoring service detects the crashed node and automatically triggers a ‘server template’ that spins up a new one. No manual intervention required. The new instance can be configured exactly like the failed instance – or restored to a prior state.
The development team checks in ‘potentially breaking changes’ that need to be tested out.	A new TEST environment needs to be provisioned – the new codebase deployed and the full testing life cycle needs to be followed.	Your CI tool (Continuous Integration) of choice automatically detects the check-in – and spins up an additional TEST VM to deploy the code to. Your CI Automated Testing then runs your suite of tests against the new codebase – and identifies all the ‘breaking’ changes. This saves a lot of manual effort in creating a test environment and running an entire QA effort around the new code.
Load (concurrent users) spikes suddenly – causing existing servers to choke, maybe even crash.	You would need to manually provision more hardware – for vertical scaling, horizontal scaling, or both. This is time consuming and expensive, and it often entails downtime.	The infrastructure’s auto scaling capabilities spin up new instances and dynamically add them to the pool. No manual intervention required.
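
To make the ‘Cloud Solution’ column a little more concrete, here is a minimal Python sketch of the self-healing and auto scaling ideas from the first and third rows. It is a simulation only and calls no real cloud API: `Instance`, `launch_from_template`, `desired_capacity`, `reconcile` and the figure of 500 users per instance are all hypothetical names and numbers chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from itertools import count
from typing import List

_next_id = count(1)  # simple id generator for the simulated instances


@dataclass
class Instance:
    role: str
    healthy: bool = True
    instance_id: int = field(default_factory=lambda: next(_next_id))


def launch_from_template(role: str) -> Instance:
    """Stand-in for a real 'server template' launch (hypothetical)."""
    return Instance(role=role)


def desired_capacity(concurrent_users: int, users_per_instance: int, minimum: int) -> int:
    """Number of instances needed for the current load, never below the minimum."""
    return max(minimum, math.ceil(concurrent_users / users_per_instance))


def reconcile(pool: List[Instance], desired: int, role: str) -> List[Instance]:
    """Drop crashed instances, then launch replacements until `desired` is met.

    This sketch only scales up; scaling down is left out for brevity.
    """
    survivors = [inst for inst in pool if inst.healthy]
    while len(survivors) < desired:
        survivors.append(launch_from_template(role))
    return survivors


pool = [launch_from_template("web") for _ in range(3)]
pool[1].healthy = False                               # row 1: a node crashes
pool = reconcile(pool, desired=3, role="web")         # a replacement is launched

target = desired_capacity(concurrent_users=2500, users_per_instance=500, minimum=3)
pool = reconcile(pool, desired=target, role="web")    # row 3: load spike, pool grows
```

In a real deployment the monitoring service and the auto scaling group do this reconciliation for you; the point of the sketch is only that ‘heal’ and ‘scale’ are the same operation: compare what you have with what you want, and launch the difference from a template.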

Server Templates (The Magic Of)

Thanks to ‘server templates’ (e.g. VMware templates, AWS AMIs, AWS CloudFormation templates), spinning up entire VMs with a few lines of code has become a straightforward exercise in the cloud world. More importantly, these VMs can be defined with specific ‘roles’ – a WebServer role, a DB role etc. The exact role ‘configuration’ can be stored on a configuration server (Chef Server, Puppet master, Ansible Tower…), which protects it from accidental overwrites or destruction.
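
As a rough illustration of roles plus a protected configuration store, the Python sketch below keeps role definitions in a read-only mapping and combines them with a base image to produce a launch specification. The role names, package lists, ports, the image name `base-image-v1` and the function `render_launch_spec` are all made up for this example; a real setup would keep this data on the configuration server rather than in code.

```python
from types import MappingProxyType

# Read-only role definitions, standing in for a configuration server.
# MappingProxyType rejects item assignment, which mimics the "cannot be
# accidentally overwritten" property described above.
ROLE_CONFIGS = MappingProxyType({
    "web": MappingProxyType({"packages": ("nginx",), "open_ports": (80, 443)}),
    "db": MappingProxyType({"packages": ("postgresql",), "open_ports": (5432,)}),
})


def render_launch_spec(base_image: str, role: str) -> dict:
    """Combine a base image with a role's configuration into one launch spec."""
    config = ROLE_CONFIGS[role]
    return {
        "image": base_image,
        "role": role,
        "packages": list(config["packages"]),
        "open_ports": list(config["open_ports"]),
    }


web_spec = render_launch_spec("base-image-v1", "web")
db_spec = render_launch_spec("base-image-v1", "db")

try:
    ROLE_CONFIGS["web"] = {"packages": ()}  # an accidental overwrite attempt
    overwrite_blocked = False
except TypeError:
    overwrite_blocked = True
```

The same base image yields a web server or a database server depending only on the role it is given, which is what makes the template reusable.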

The bottom line is that you not only get a blueprint for automatic infrastructure creation – you also get a safe for locking this blueprint so that no one can destroy it. Template repositories, template versioning, hardening of repositories – these are all evolving at a rapid pace, making cloud-based solutions as secure as traditional data center solutions.

What does all this have to do with Disaster Recovery?

As you probably guessed from the recap above, in the cloud world, keeping COLD STANDBYs just doesn’t make much sense. When hardware fails, it is relatively painless to re-generate an identical copy of the crashed server.

The devil is in the details, of course, and one has to be mindful of how to recover any data, log files etc. on the crashed server. For example, performance metrics that live only on that server (CPU usage, average memory usage etc.) are lost when it crashes. There are, fortunately, cloud patterns that help with centralized logging, data updates, performance metrics and other commonly needed server stats.
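
The centralized-metrics pattern can be sketched in a few lines of Python. The idea is simply that each server pushes its samples to a store that lives outside the server, so the history survives the crash. `CentralMetricsStore`, `Server` and the sample values are hypothetical; in practice the store would be a managed monitoring or logging service.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class CentralMetricsStore:
    """Holds metrics outside any single server (an in-memory stand-in)."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, instance_id: str, cpu_percent: float) -> None:
        self._samples[instance_id].append(cpu_percent)

    def average_cpu(self, instance_id: str) -> float:
        return mean(self._samples[instance_id])


class Server:
    """A server that ships every sample off-box instead of keeping it locally."""

    def __init__(self, instance_id: str, store: CentralMetricsStore) -> None:
        self.instance_id = instance_id
        self._store = store

    def report_cpu(self, cpu_percent: float) -> None:
        self._store.record(self.instance_id, cpu_percent)


store = CentralMetricsStore()
server = Server("web-1", store)
for sample in (20.0, 40.0, 90.0):
    server.report_cpu(sample)

del server                              # simulate the crash: the server is gone
average = store.average_cpu("web-1")    # its history is still available
```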

Here’s the upshot: cold standby servers can be replaced with on-demand, re-buildable instances. The instances do not need to be on standby – all that is needed are (well-tested) server templates that are easily accessible in a disaster situation. These server templates can recreate the crashed instances, including all of the configuration data that was part of each crashed instance.

Netflix’s Simian Army – Testing D.R. in the real world

Say – your team has designed the perfect D.R. Strategy. Just how far do you test it? Do you take down just the data tier? Do you bring down your entire environment to simulate a real-world disaster situation?

Disasters are random events – and, in order to simulate true disasters, one needs to RANDOMLY bring down (pieces of) a production environment.

Netflix does just that. With an automated tool (named Chaos Monkey), Netflix randomly seeks out instances to destroy. No one is given a ‘heads up’ that Chaos Monkey is about to run; it just runs and wreaks havoc along the way. If you have a truly resilient environment, the monkey’s attempts are essentially negated by new instances spinning up to replace the destroyed ones. Otherwise, the monkey exposes any weaknesses in the infrastructure.
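
The core loop is easy to sketch. The Python below is not Netflix’s Chaos Monkey, just a toy simulation of the idea under stated assumptions: the fleet is a set of made-up instance names, `chaos_monkey` removes one at random, and `auto_replace` is a hypothetical stand-in for whatever auto-recovery the environment provides. The random generator is seeded only so that the run is repeatable.

```python
import random
from typing import List, Set


def chaos_monkey(instances: Set[str], rng: random.Random) -> str:
    """Pick one running instance at random and destroy it."""
    victim = rng.choice(sorted(instances))  # sorted() gives a stable order to choose from
    instances.remove(victim)
    return victim


def auto_replace(instances: Set[str], desired: int, next_id: int) -> int:
    """Launch fresh instances until the fleet is back at `desired` size."""
    while len(instances) < desired:
        instances.add(f"web-{next_id}")
        next_id += 1
    return next_id


rng = random.Random(42)
fleet = {"web-1", "web-2", "web-3"}
next_id = 4
killed: List[str] = []

for _ in range(5):                      # five unannounced attacks
    killed.append(chaos_monkey(fleet, rng))
    next_id = auto_replace(fleet, desired=3, next_id=next_id)
```

After every attack the fleet is back at three instances, which is exactly the ‘negated’ outcome described above. Remove the `auto_replace` call and the fleet shrinks to nothing, which is the weakness the monkey is there to expose.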

If you think destroying single instances is extreme, how about taking out an entire data center? Netflix’s Chaos Gorilla does just that. It simulates the outage of an entire AWS Availability Zone and tests for repercussions. Pretty gutsy if you think about it.
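
A zone-level test boils down to one question: if any single zone disappears, is there still enough capacity left? The Python sketch below checks that for a simulated fleet. The instance ids, the zone names and the minimum of two instances are assumptions made for the example, and `drop_zone` and `survives_zone_loss` are hypothetical helpers, not part of any Netflix tool.

```python
from typing import Dict


def drop_zone(fleet: Dict[str, str], zone: str) -> Dict[str, str]:
    """Return the fleet as it would look with every instance in `zone` gone."""
    return {iid: z for iid, z in fleet.items() if z != zone}


def survives_zone_loss(fleet: Dict[str, str], min_instances: int) -> Dict[str, bool]:
    """For each zone, report whether losing it leaves at least `min_instances`."""
    zones = set(fleet.values())
    return {zone: len(drop_zone(fleet, zone)) >= min_instances for zone in zones}


# Instances spread across three zones: any one zone can be lost.
balanced = {"i-1": "zone-a", "i-2": "zone-a", "i-3": "zone-b", "i-4": "zone-c"}
balanced_report = survives_zone_loss(balanced, min_instances=2)

# Almost everything in one zone: losing zone-a leaves a single instance.
lopsided = {"i-1": "zone-a", "i-2": "zone-a", "i-3": "zone-a", "i-4": "zone-b"}
lopsided_report = survives_zone_loss(lopsided, min_instances=2)
```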

Netflix has a few more monkeys up its sleeve. There is even a ‘latency monkey’ that randomly introduces artificial delays into the servers serving content, to see if upstream servers can handle the ‘throttling’ effectively.
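
Latency injection can be sketched the same way. In the Python below, `with_latency` wraps a function so that every call is artificially delayed, and `call_with_fallback` is a caller that gives up after a timeout and returns a fallback value. All names, the 0.3 second delay and the 0.05 second timeout are assumptions for illustration; this shows the testing idea, not how Netflix implements it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds: float):
    """Wrap `func` so that every call is delayed by `delay_seconds`."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def fetch_title(video_id: int) -> str:
    """Hypothetical downstream service call."""
    return f"title-for-{video_id}"


def call_with_fallback(func, arg, timeout_seconds: float, fallback):
    """Call `func(arg)`, but return `fallback` if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, arg)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


normal = call_with_fallback(fetch_title, 7, timeout_seconds=1.0, fallback="unavailable")

slow_fetch = with_latency(fetch_title, delay_seconds=0.3)
degraded = call_with_fallback(slow_fetch, 7, timeout_seconds=0.05, fallback="unavailable")
```

A caller that degrades gracefully (here, by returning the fallback) passes the test; one that hangs until the slow service answers does not.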

Summary

Traditionally, D.R. meant ‘cold standby’ environments, with huge costs and maintenance headaches. Testing D.R. scenarios was also a challenge, as it was both expensive and time consuming. More often than not, what was tested was not a TRUE disaster, but rather a scaled-down version of a disaster scenario.

With the option of building an entire data center in the cloud, one can now leverage the auto recovery features and devops improvements in cloud technology. A server going down (for whatever reason) is no longer a cause for serious concern. Not only can a cloud service detect the crashed server, it can also trigger the appropriate server creation template to ‘spin up’ an equivalent server.

What about all the configuration data etc. on the crashed server? That too can be recovered from a centralized repository, using the cloud template patterns described above.

Planning (and testing) D.R. for your critical, high-performance web apps no longer needs to be the expensive and risky proposition that it used to be.

Thoughts? Comments?

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Anuj Varma – who has written 1209 posts on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.