Ridonculous Failover Options in the Cloud (AWS)

Let us define some terms before we get started.

STANDBY environment – an active, functioning environment that is ready to take over instantly and automatically.
BACKUP environment – a passive, secondary environment that needs a manual step to become the ACTIVE environment.
FAILOVER environment – either a STANDBY or a BACKUP environment.
RIDONCULOUS – when something is SUCH a good deal that it is beyond ridiculous!

This post discusses a basic failover option (which entails ZERO cost), an advanced failover option, and a super-advanced pattern that addresses cross-region failover and can also serve as a disaster recovery (D.R.) solution.

Basic (Cheapest, as in free, since a backup deployment incurs no charges)

Cost : $0 + RightScale Subscription
Downtime Entailed – 5 – 15 minutes (depending on time required to load up database)
Key AWS Technologies Leveraged – Elastic IP Addresses, Multiple ‘Deployments’ using RightScale Cloud Management Subscription.
Data Loss? – Yes, up to 10 minutes of data loss, as the DB backups (e.g. a .bak file in SQL Server) are written out to S3 every 10 minutes.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – No
This Basic Failover Architecture leverages Elastic IPs and RightScale’s Cloud Management Subscription.
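
The data-loss and downtime figures above are simple arithmetic, and a small sketch can make the bounds explicit. The function names and the sample timestamps below are made up for illustration; only the 10-minute backup interval comes from the list above.

```python
from datetime import datetime, timedelta

# From the list above: a database backup is written to S3 every 10 minutes.
BACKUP_INTERVAL = timedelta(minutes=10)


def data_lost(last_backup_at: datetime, failed_at: datetime) -> timedelta:
    """Anything written after the last completed backup is lost on failure."""
    if failed_at < last_backup_at:
        raise ValueError("failure time precedes the last backup")
    return failed_at - last_backup_at


def estimated_downtime(launch: timedelta, database_load: timedelta) -> timedelta:
    """Downtime is the time to launch the backup deployment plus the database load."""
    return launch + database_load


# Illustrative values only: a failure 7 minutes after the last backup.
lost = data_lost(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 7))
assert lost <= BACKUP_INTERVAL  # loss never exceeds one backup interval
```

As long as backups keep landing on schedule, the loss is bounded by one backup interval. A stalled backup job widens the window, which is worth monitoring separately.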

Basic Option Details – The idea is that your failover environment stays dormant until a manual ‘resuscitation’ is performed. On being resuscitated, the environment is assigned the SAME IPs as your (now inactive) production environment. Elastic IP addresses are like Post-it notes: you can peel one off instance A and stick it on instance B, and it still carries the same address.

The production deployment runs in a primary availability zone.
A backup clone deployment is ready to be launched in a different availability zone if the primary zone fails.
This is the most affordable failover option, because deployments are completely free. Simply create a clone of your production deployment and save it as a backup deployment.
If the primary zone ever fails, all you will have to do is manually launch the backup deployment.
The estimated downtime would be about 5-10 minutes after you launch the backup deployment, plus any additional time that is needed to load the database. When launching the backup deployment, make sure that the appropriate Elastic IPs (EIPs) are associated with the new public-facing instances.
No DNS-based failover is required here. Just ensure that the two front-end web servers in the backup environment use the same Elastic IPs as the production deployment and that the ‘Associate IP at launch?’ box is checked.
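
If you would rather script the Elastic IP move than rely on the launch checkbox, it might look like the sketch below. It assumes `ec2` behaves like `boto3.client("ec2")` and that the EIPs are identified by allocation ID; the function name and the example IDs are hypothetical, and RightScale itself is not involved here.

```python
from typing import Any, Dict, List


def move_elastic_ips(ec2: Any, eip_to_instance: Dict[str, str]) -> List[str]:
    """Re-point each Elastic IP (by allocation ID) at a backup instance.

    `ec2` is assumed to expose associate_address() the way
    boto3.client("ec2") does. Returns the new association IDs.
    """
    association_ids = []
    for allocation_id, instance_id in eip_to_instance.items():
        response = ec2.associate_address(
            AllocationId=allocation_id,
            InstanceId=instance_id,
            # Take the address even if it is still attached to the failed instance.
            AllowReassociation=True,
        )
        association_ids.append(response["AssociationId"])
    return association_ids


# Hypothetical usage (IDs are placeholders, not real resources):
#   import boto3
#   move_elastic_ips(boto3.client("ec2"), {"eipalloc-0abc": "i-0backup1"})
```

Passing the client in as an argument keeps the function easy to exercise against a stub before pointing it at a real account.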

Advanced Failover Option (Double the cost of a single environment)

Cost : Double that of primary environment, plus cost of data transfer during DB replication
Downtime Entailed – Near zero (a Multi-AZ RDS failover typically completes in one to two minutes)
Key AWS Technologies Leveraged – Multi-AZ RDS Replication, ELB and Route 53 based monitoring of ELB.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – No
This Advanced Failover Architecture leverages Elastic Load Balancer (ELB) and RDS (and optionally Route 53)

Advanced Option Details

Any single instance can be terminated at any time; the site continues to operate normally and no responsive action is required. If your primary App server or Master DB fails, all incoming requests are automatically rerouted to ELB2 and content is served from Slave-DB in us-east-1b.
If an availability zone completely fails, you will still have a completely functional site running in a different availability zone.
Route 53 automatically points to ELB2, so no changes in DNS records are needed.
RDS Notes – The failover mechanism automatically changes the DNS record of the DB instance to point to the standby DB instance. As a result, you will need to re-establish (not re-configure) any existing connections to your DB instance.
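
Because connections opened before the failover are no longer usable, application code needs some way to reconnect and retry. Here is a minimal, driver-agnostic sketch: `connect` and `work` are callables you supply, and the function name, retry count, and delay are assumptions rather than anything RDS prescribes.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    connect: Callable[[], object],
    work: Callable[[object], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Open a fresh connection for every attempt and retry on connection errors.

    A new connection resolves the DB endpoint name again, so once the DNS
    record points at the standby the next attempt reaches it (subject to
    any DNS caching on the client side).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: BaseException = RuntimeError("unreachable")
    for attempt in range(attempts):
        try:
            connection = connect()  # re-establish, do not reuse the old handle
            return work(connection)
        except retry_on as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(delay_seconds)
    raise last_error
```

Note that the endpoint name itself never changes, which is why re-establishing the connection is enough and nothing needs to be re-configured. Pass your driver's own connection exception types in `retry_on`.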

Super Advanced Pattern – Combine Cross-Region Read Replicas (asynchronous) and Multi-AZ (synchronous, same region) to get Multi-Region Replication and Failover

Cost : Double that of primary environment, plus cost of data transfer during DB replication
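Downtime Entailed – Near zero for failures within a region (Multi-AZ); a cross-region failover also takes as long as the snapshot and stack update described under Salient Points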
Key AWS Technologies Leveraged – CloudFormation, Multi-AZ RDS, Read Replica RDS, ELB and Route 53.
Recovery from Single Instance Failure – Yes
Recovery from entire AZ failure – Yes
Cross Region Failover – Yes
This Super Advanced Architecture leverages Read Replicas, Multi-AZ RDS and CloudFormation

Salient Points of the Super Advanced Pattern

Using Condensation, a tool that assembles CloudFormation templates from reusable fragments called particles, applies DRY (don't repeat yourself) principles to template creation.
The two templates, primary and secondary, are built from the same particle sets.
This ensures each will implement an identical infrastructure in both regions. In this case, only the default parameter values are different, minimizing user input in each region.
Since both the primary and secondary region share templates built from the same particles, failover is completed by taking a snapshot of the read replica and updating the secondary stack to enable Multi-AZ based on that snapshot.
CloudFormation will then initialize a RDS Multi-AZ Instance alongside the read replica in the recovery region.
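
To make the "same particles, different defaults" idea concrete, here is a sketch in plain Python. It does not use Condensation's actual API. The particle contents are trimmed fragments rather than deployable templates, and the `MultiAZ` parameter name, the instance class, and the engine are assumptions chosen for illustration.

```python
import copy
from typing import Any, Dict, List

# Two hypothetical particles: reusable template fragments shared by both regions.
DATABASE_PARTICLE: Dict[str, Any] = {
    "Parameters": {"MultiAZ": {"Type": "String", "Default": "false"}},
    "Resources": {
        "Database": {
            "Type": "AWS::RDS::DBInstance",
            "Properties": {
                "DBInstanceClass": "db.t3.medium",
                "Engine": "mysql",
                "MultiAZ": {"Ref": "MultiAZ"},
            },
        }
    },
}
LOAD_BALANCER_PARTICLE: Dict[str, Any] = {
    "Resources": {
        "WebLoadBalancer": {
            "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
            "Properties": {"AvailabilityZones": {"Fn::GetAZs": ""}},
        }
    },
}


def build_template(particles: List[Dict[str, Any]], defaults: Dict[str, str]) -> Dict[str, Any]:
    """Merge particles into one template, then override parameter defaults."""
    template: Dict[str, Any] = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {},
        "Resources": {},
    }
    for particle in particles:
        fragment = copy.deepcopy(particle)  # never mutate the shared particle
        template["Parameters"].update(fragment.get("Parameters", {}))
        template["Resources"].update(fragment.get("Resources", {}))
    for name, value in defaults.items():
        if name not in template["Parameters"]:
            raise KeyError(f"unknown parameter: {name}")
        template["Parameters"][name]["Default"] = value
    return template


PARTICLES = [DATABASE_PARTICLE, LOAD_BALANCER_PARTICLE]
primary = build_template(PARTICLES, {"MultiAZ": "true"})
secondary = build_template(PARTICLES, {"MultiAZ": "false"})
```

Both templates end up with identical resources and differ only in a parameter default. That is why a failover can be expressed as a stack update (new parameter values) rather than a new template.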

Why not use Elastic Beanstalk instead of EC2-hosted websites?

Elastic Beanstalk is not fault-tolerant between regions, so EC2-hosted websites are the only option for cross-region failover.

Where does Route 53 fit into all this? What is DNS based Failover?

In the Basic pattern, Route 53 is not needed (since recovery requires a manual launch of the backup environment). In the Advanced and Super Advanced patterns, Route 53 automatically switches DNS from one AZ to another (or, in the Super Advanced pattern, from one region to another). When AWS uses the term ‘DNS based failover’, it is usually referring to using Route 53 in the manner outlined above.
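
As a rough illustration of what DNS-based failover looks like in practice, the sketch below builds a pair of Route 53 failover alias records, one PRIMARY and one SECONDARY, each pointing at an ELB. The helper names, domain, ELB DNS names, and hosted zone IDs are all placeholders. The resulting dictionary follows the `ChangeBatch` shape accepted by boto3's `change_resource_record_sets`.

```python
from typing import Any, Dict


def failover_alias_record(
    record_name: str, role: str, elb_dns_name: str, elb_hosted_zone_id: str
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover alias record pointing at an ELB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-elb",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": elb_hosted_zone_id,
                "DNSName": elb_dns_name,
                # Let Route 53 use the ELB's health to decide when to fail over.
                "EvaluateTargetHealth": True,
            },
        },
    }


def failover_change_batch(
    record_name: str, primary: Dict[str, str], secondary: Dict[str, str]
) -> Dict[str, Any]:
    """Pair a PRIMARY and a SECONDARY record under the same DNS name."""
    return {
        "Comment": "DNS-based failover between two ELBs",
        "Changes": [
            failover_alias_record(record_name, "PRIMARY", **primary),
            failover_alias_record(record_name, "SECONDARY", **secondary),
        ],
    }


# Placeholder values only; substitute your own ELB details.
batch = failover_change_batch(
    "www.example.com.",
    {"elb_dns_name": "elb1.example.elb.amazonaws.com.", "elb_hosted_zone_id": "ZEXAMPLE1"},
    {"elb_dns_name": "elb2.example.elb.amazonaws.com.", "elb_hosted_zone_id": "ZEXAMPLE2"},
)
# Hypothetical usage:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your zone id>", ChangeBatch=batch)
```

While the PRIMARY target is healthy, Route 53 answers with it. When its health evaluation fails, answers switch to the SECONDARY record without anyone editing DNS by hand.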

Deployment with zero downtime

On a related topic, if you need continuous deployment of your application code across geographical regions with ZERO downtime, try this blog post.

-Use Elastic Container Service with ELBs

Summary

Failover is a hot topic in the cloud world. Certainly, the cloud offers way more options than traditional data centers. In a traditional data center, it would be near impossible to have a ZERO dollar solution that served as a complete failover backup environment. It might also be ridiculously hard to have a failover node sitting on the other side of the world (a different REGION, in cloud terminology).

This post described some basic and advanced options that allow failover nodes to exist either in the same region or in geographically removed regions. These architectures are smart enough to recover (fail over) from anything between a single instance failure and an entire environment failure.

In addition, there are now effective cloud patterns that help us deploy code (new application releases etc.) without entailing any downtime.

Anuj holds professional certifications in Google Cloud, AWS as well as certifications in Docker and App Performance Tools such as New Relic. He specializes in Cloud Security, Data Encryption and Container Technologies.

Anuj Varma – who has written 1209 posts on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.