Ridonculous Failover Strategies in AWS

Let us define some terms before we get started.

STANDBY environment – an active, functioning environment, ready to take over instantly and automatically.
BACKUP environment – a passive, secondary environment that needs a manual step to become the ACTIVE environment.
FAILOVER environment – either a STANDBY or a BACKUP environment.
RIDONCULOUS – when something is SUCH a good deal that it is beyond ridiculous!

This post discusses a Basic failover option (which entails ZERO cost), an Advanced option and a Super Advanced pattern that addresses cross-region failover and can serve as a D.R. solution as well.

Basic (Cheapest, as in almost free, since the backup deployment incurs no charges until it is launched)

Cost : $0 + RightScale Subscription
Downtime Entailed – 5 – 15 minutes (depending on time required to load up database)
Key AWS Technologies Leveraged – Elastic IP Addresses, Multiple ‘Deployments’ using RightScale Cloud Management Subscription.
Data Loss? – Yes, up to 10 minutes of data, as the DB backups (e.g. a .bak file in SQL Server) are written out to S3 every 10 minutes.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – No
This Basic Failover Architecture leverages Elastic IPs and RightScale’s Cloud Management Subscription.

The Basic Failover Architecture leverages Elastic IPs and Cloned Deployments

The idea is that your failover environment stays dormant until a manual ‘resuscitation’ is performed. On being resuscitated, the environment is assigned the SAME IPs as your (now inactive) production environment. Elastic IP addresses are like Post-it notes: you can take one from instance A and stick it on instance B, and it carries the same information.

The production deployment runs in a primary availability zone.
A backup clone deployment is ready to be launched in a different availability zone if the primary zone fails. Deployments can be created at no cost from the AWS Management Console.
If the primary zone ever fails, all you will have to do is manually launch the backup deployment.
The estimated downtime would be about 5-10 minutes after you launch the backup deployment, plus any additional time that is needed to load the database. When launching the backup deployment, make sure that the appropriate EIPs are associated with the new frontend instances.
No DNS-based failover is required here. Just ensure that your two frontend web servers (in the backup environment) use the same Elastic IPs as the production deployment and that the ‘Associate IP at launch?’ box is checked.
This is the most affordable failover option; creating multiple deployments in AWS is completely free. You only start getting billed for the instances once the deployment is launched, so you can create as many clones of your production deployment as you want.
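
The Elastic IP move described above can be sketched in Python. This is only an illustration, not the RightScale workflow itself: the function name and the allocation and instance IDs are hypothetical, and the ec2 argument is assumed to be a boto3 EC2 client (created with boto3.client("ec2")).

```python
def move_elastic_ips(ec2, eip_to_instance):
    """Re-associate each Elastic IP with an instance in the backup deployment.

    ec2: assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    eip_to_instance: dict mapping Elastic IP allocation IDs to backup
                     instance IDs (hypothetical values from the caller).
    Returns a dict of allocation ID -> new association ID.
    """
    associations = {}
    for allocation_id, instance_id in eip_to_instance.items():
        # AllowReassociation lets the address be taken over even if it is
        # still attached to the failed production instance.
        response = ec2.associate_address(
            AllocationId=allocation_id,
            InstanceId=instance_id,
            AllowReassociation=True,
        )
        associations[allocation_id] = response["AssociationId"]
    return associations
```

Passing the client in as an argument keeps the sketch easy to try against a stub before pointing it at a real account.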

Advanced Failover Option (Double the cost of a single environment)

Cost : Double that of primary environment, plus cost of replicating the DB (data transfer cost during DB replication)
Downtime Entailed – Zero minutes
Key AWS Technologies Leveraged – Multi-AZ RDS Replication, ELB and Route 53 based monitoring of ELB.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – No
This Advanced Failover Architecture leverages EIPs, Elastic Load Balancer (ELB), RDS and Route 53.

In this setup:

A single instance can be deleted at any time; the site continues to operate normally and no responsive action is required.
If an availability zone completely fails, you will still have a completely functional site running in a different availability zone.
If your primary App server or Master DB fails, all incoming requests are automatically rerouted to ELB2 and content is served from the Slave-DB in us-east-1b.
Route 53 automatically points to ELB2, so no changes in DNS records are needed.
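
One way to express this kind of automatic switch is a pair of Route 53 failover alias records, one per load balancer. The sketch below only builds the request body; the function name and the "dns_name" and "hosted_zone_id" dictionary keys are hypothetical, and whether the original setup used failover records or another routing policy is an assumption.

```python
def failover_change_batch(domain, primary_elb, secondary_elb):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY alias records.

    Each *_elb argument is a dict holding the load balancer's "dns_name"
    and "hosted_zone_id" (hypothetical keys chosen for this sketch).
    """
    def alias_record(role, elb):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": elb["hosted_zone_id"],
                    "DNSName": elb["dns_name"],
                    # Let Route 53 use the load balancer's health to decide
                    # when to answer with the secondary record.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {
        "Changes": [
            alias_record("PRIMARY", primary_elb),
            alias_record("SECONDARY", secondary_elb),
        ]
    }
```

The resulting dictionary could then be passed as the ChangeBatch argument of the boto3 Route 53 client's change_resource_record_sets call, together with the HostedZoneId of your own zone.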

RDS Notes

Multi-AZ deployment – synchronous replication for MariaDB, MySQL, Oracle and PostgreSQL is available by default.
Failover times are typically 60-120 seconds.
The failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance.
As a result, you will need to re-establish (not re-configure) any existing connections to your DB instance.
Read Replica – can span regions. A read replica can be promoted to a production instance if needed.
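
Because existing connections have to be re-established after a Multi-AZ failover, application code usually needs some form of reconnect loop. Here is a minimal sketch, assuming your database driver gives you a connect function; run_with_reconnect, its parameters and its default values are all hypothetical and not part of any AWS library.

```python
import time


def run_with_reconnect(connect, query, retries=10, delay_seconds=15,
                       sleep=time.sleep):
    """Run query(conn), opening a fresh connection after each failure.

    connect: zero-argument callable that returns a new DB connection.
    query: callable that takes the connection and returns a result.
    sleep: injectable so the wait can be stubbed out in tests.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            # Connecting again (rather than reusing the old connection)
            # gives the driver a chance to resolve the endpoint afresh.
            conn = connect()
            return query(conn)
        except Exception as error:  # real code should catch driver errors only
            last_error = error
            if attempt < retries - 1:
                sleep(delay_seconds)
    raise last_error
```

The defaults are chosen so that the total wait comfortably covers the 60-120 second failover window mentioned above; tune them to your own workload.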

Super Advanced Pattern – Combine the Read Replica (Cross Regions) and Multi-AZ (Sync, same region) to get Multi-Region Replication and Failover

Cost : Double that of primary environment, plus cost of data transfer during DB replication
Downtime Entailed – Zero minutes
Key AWS Technologies Leveraged – Multi-AZ RDS Replication, ELB and Route 53 based monitoring of ELB.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – Yes
This Super Advanced Architecture leverages Read Replicas, Multi-AZ RDS and CloudFormation (and optionally Route 53)

Salient Points of the Super Advanced Pattern

The pattern uses 2 CloudFormation templates.
The two templates, primary and secondary, are built from the same particle sets.
This ensures each will implement an identical infrastructure in both regions. In this case, only the default parameter values are different, minimizing user input in each region.
Since both the primary and secondary region share templates built from the same particles, failover is completed by taking a snapshot of the read replica and updating the secondary stack to enable Multi-AZ based on that snapshot.
CloudFormation will then initialize an RDS Multi-AZ Instance alongside the read replica in the recovery region.
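
The failover step described above (snapshot the read replica, then update the secondary stack) might look roughly like this in Python. This is a sketch under stated assumptions: rds and cloudformation are assumed to be boto3 clients for the recovery region, and the stack parameter names DBSnapshotIdentifier and MultiAZ are hypothetical, since the actual templates are not shown in this post.

```python
def fail_over_to_secondary(rds, cloudformation, replica_id, snapshot_id,
                           stack_name):
    """Snapshot the read replica, then update the secondary stack from it.

    rds, cloudformation: assumed to be boto3 clients for the recovery region.
    Returns the ID of the stack being updated.
    """
    # 1. Take a snapshot of the cross-region read replica.
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=replica_id,
    )
    # 2. Block until the snapshot is usable.
    rds.get_waiter("db_snapshot_available").wait(
        DBSnapshotIdentifier=snapshot_id
    )
    # 3. Update the secondary stack so it builds a Multi-AZ instance from
    #    the snapshot. The parameter keys below are hypothetical; any other
    #    stack parameters would need to be listed here as well.
    response = cloudformation.update_stack(
        StackName=stack_name,
        UsePreviousTemplate=True,
        Parameters=[
            {"ParameterKey": "DBSnapshotIdentifier",
             "ParameterValue": snapshot_id},
            {"ParameterKey": "MultiAZ", "ParameterValue": "true"},
        ],
    )
    return response["StackId"]
```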

Why not use Beanstalk instead of EC2 Websites?

Elastic Beanstalk is not fault-tolerant across regions, so EC2-hosted websites are the only option for cross-region failover.

Where does Route 53 fit into all this?

In the Basic pattern, Route 53 is not needed (since recovery requires a manual launch of the backup environment). In the Advanced and Super Advanced patterns, Route 53 helps automatically switch the DNS from one AZ to another.

Deployment with zero downtime

On a related topic, if you need continuous deployments across regions with ZERO downtime, try this blog post.

-Use Elastic Container Service with ELBs

Summary

Failover is a hot topic in the cloud world. Certainly, the cloud offers way more options than traditional datacenters. In a traditional data center, it would be near impossible to have a ZERO dollar solution that served as a complete failover backup environment. It might also be ridiculously hard to have a failover node sitting on the other side of the world (a different REGION, in cloud terminology). This post described some basic and advanced options that allow failover nodes to exist either in the same region or in geographically distant regions. These architectures are smart enough to recover from anything between a single instance failure and an entire environment failure.

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Anuj Varma – who has written 1210 posts on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.