Disaster Recovery and Failover in AWS–Some notes from the field

Anuj Varma — Wed, 21 Jan 2015 18:41:41 +0000

(See also, a newer post on advanced D.R. in the cloud)

D.R. is not the same as Failover, though many people use the terms interchangeably. Failover typically means that a passive node (or nodes) is available in the same clustered environment and can quickly become the active node if needed. The physical location of the standby node is not that important, and it is typically in the same datacenter as the primary node. For example, your web server farm may have 2 or 3 nodes configured, and Oracle RAC has built-in support for multiple failover nodes.

Disaster Recovery differs in that the recovery node (or nodes) should NOT reside in the same data center, or anywhere close to it. A disaster would affect the entire data center, so it wouldn’t make sense to have your recovery node in the same location as the primary node. Hence, provisioning for DR is different from provisioning for Failover.

Replicate all tiers – not just RDS (data tier)

1. Replicate the entire Application – not just RDS. Use Multi-AZ for PROD environments.

Web and Middle Tier: Your application must be able to function in both the source and target locations. That means replicating the rest of your application to the same target location and pointing it to the read-replica (the failover RDS instance). For the Web Tier, this is a little more convoluted than for the data tier. Essentially, your options are:
- a) Full-blown EC2 instance creation using CloudFormation (Server Templates).
- b) A live S3 bucket that serves as a failover node, with appropriate Route 53 configuration.
- For option a), a full-blown web tier replication, you would need to set up custom server templates that can spin up a new web server from either an AMI or a VMware template, and have the app code copied from a ‘configuration’ server.
- For option b), there is a shortcut if your web tier is simple enough (in practice, static content) that it can run off an S3 bucket. In that case, set up a live replica of your website in the S3 bucket, keep monitoring the EC2 instance that hosts the active website, and if the health check fails, route traffic to the S3 bucket (Route 53 is the easiest way to accomplish this).
Data Tier: RDS (in AWS) offers a live read-replica of your RDS instance, which is an ‘out of the box’ solution for a D.R. scenario.
Production: For PROD environments, use Multi-AZ deployment (mirroring) and provisioned IOPS. It is much harder to change this after the fact, for example if you later want to ‘upsize’ your RDS instance.
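To make option b) a bit more concrete, here is a minimal sketch of the pair of Route 53 failover records it relies on: a PRIMARY record pointing at the EC2 web server (tied to a health check) and a SECONDARY alias record pointing at the S3 website endpoint. The function name, domain, IP address, health check ID and zone ID below are all placeholders invented for illustration; the real S3 website endpoint and its hosted zone ID depend on your region and should be taken from the AWS documentation. Note also that an alias to an S3 website requires the bucket name to match the domain name. The code only builds the change batch as a plain dictionary, in the shape the Route 53 ChangeResourceRecordSets API expects, and does not call AWS.

```python
import json


def failover_change_batch(domain, primary_ip, health_check_id,
                          s3_website_dns, s3_website_zone_id):
    """Build a Route 53 change batch with PRIMARY and SECONDARY failover records."""
    primary = {
        "Name": domain,
        "Type": "A",
        "SetIdentifier": "primary-ec2",
        "Failover": "PRIMARY",
        "TTL": 60,  # short TTL so clients notice the switch quickly
        "ResourceRecords": [{"Value": primary_ip}],
        "HealthCheckId": health_check_id,  # Route 53 fails over when this check fails
    }
    secondary = {
        "Name": domain,
        "Type": "A",
        "SetIdentifier": "secondary-s3",
        "Failover": "SECONDARY",
        "AliasTarget": {
            "HostedZoneId": s3_website_zone_id,
            "DNSName": s3_website_dns,
            "EvaluateTargetHealth": False,
        },
    }
    return {
        "Comment": "EC2 primary with S3 static-site failover",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ],
    }


# All values below are hypothetical placeholders.
batch = failover_change_batch(
    domain="www.example.com.",
    primary_ip="203.0.113.10",
    health_check_id="00000000-0000-0000-0000-000000000000",
    s3_website_dns="s3-website-REGION.amazonaws.com.",
    s3_website_zone_id="S3-WEBSITE-ZONE-ID",
)
print(json.dumps(batch, indent=2))
```

If you use boto3, a dictionary like this could be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with your own hosted zone ID.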

2. Set up replication (failure) alerts
Every service has the potential to fail, and AWS-based replication services (including RDS Replication) are no exception. You can configure service alerts (using Amazon SNS) to inform you of the success or failure of your environment replication.
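One way to wire up such alerts is an RDS event subscription that publishes to an SNS topic. The sketch below only assembles the parameters for that subscription; the helper name, subscription name, topic ARN and instance identifiers are made up for illustration, and the event category names are assumptions that should be checked against the current list of RDS event categories before use.

```python
def replication_alert_subscription(name, sns_topic_arn, instance_ids,
                                   categories=("failure", "failover", "read replica")):
    """Assemble parameters for an RDS event subscription that notifies an SNS topic."""
    if not instance_ids:
        raise ValueError("at least one DB instance identifier is required")
    return {
        "SubscriptionName": name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "db-instance",
        "SourceIds": list(instance_ids),
        "EventCategories": list(categories),  # assumed names, verify before use
        "Enabled": True,
    }


# Hypothetical names and ARN, for illustration only.
params = replication_alert_subscription(
    name="prod-replication-alerts",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:dr-alerts",
    instance_ids=["prod-db", "prod-db-replica"],
)
print(params)
```

With boto3 installed and credentials configured, these parameters could be passed as keyword arguments to the RDS client's create_event_subscription call.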

3. Database/Data Tier – Perform daily backups (see the side note below on what you can and cannot do in RDS)
Enabling backups is a good idea because it is simple and effective. Also, to work with read-replicas, you need backups enabled.
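As a rough illustration, automated backups on RDS are controlled by the backup retention period: a value of 0 turns them off, and read-replicas need it to be at least 1. The helper below (a hypothetical name, as are the instance identifier and backup window) builds the settings you would apply to an instance and rejects values that would leave backups disabled.

```python
def backup_settings(instance_id, retention_days=7, window_utc="03:00-04:00"):
    """Build instance settings that keep daily automated backups enabled."""
    # 0 would disable automated backups (and block read-replicas); 35 is the RDS maximum.
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,
        "PreferredBackupWindow": window_utc,  # daily window, in UTC
        "ApplyImmediately": True,
    }


settings = backup_settings("prod-db")  # "prod-db" is a placeholder identifier
print(settings)
```

With boto3, a dictionary like this could be passed as keyword arguments to the RDS client's modify_db_instance call.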

4. Utilize multi-AZ (availability zone) architecture
Make use of multi-AZ architecture on AWS for availability of mission critical applications. In particular, enabling multi-AZ on RDS is the simplest way to replicate the instance within the same region.

If you’re replicating across regions, isolating RDS in a single AZ will introduce downtime as a direct result of the replication mechanism (e.g. backups, read-replica or a combination).
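A quick way to keep an eye on these settings is to audit your instances periodically. The sketch below assumes each instance is described by a dictionary using the field names that the RDS describe_db_instances call returns (MultiAZ, BackupRetentionPeriod, ReadReplicaDBInstanceIdentifiers); the function name and the sample data are invented for illustration, and the checks are only a starting point.

```python
def dr_gaps(instance):
    """Return a list of DR-readiness gaps for one RDS instance description."""
    gaps = []
    if not instance.get("MultiAZ", False):
        gaps.append("not Multi-AZ")
    if instance.get("BackupRetentionPeriod", 0) < 1:
        gaps.append("automated backups disabled")
    if not instance.get("ReadReplicaDBInstanceIdentifiers"):
        gaps.append("no read replica listed")
    return gaps


# Made-up sample descriptions.
instances = [
    {"DBInstanceIdentifier": "prod-db", "MultiAZ": True,
     "BackupRetentionPeriod": 7,
     "ReadReplicaDBInstanceIdentifiers": ["prod-db-replica"]},
    {"DBInstanceIdentifier": "reporting-db", "MultiAZ": False,
     "BackupRetentionPeriod": 0,
     "ReadReplicaDBInstanceIdentifiers": []},
]

report = {i["DBInstanceIdentifier"]: dr_gaps(i) for i in instances}
for name, gaps in report.items():
    print(name, "OK" if not gaps else ", ".join(gaps))
```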

Recreating a prod environment from a dev or staging environment (create a backup, restore a database from backup): read this post.
Pushing large amounts of data into an RDS instance using SQL Bulk Copy (bcp): read this post.

Side Note – Backup and Restore in RDS

Restoring from a backup does not overwrite your existing RDS instance. RDS creates a new instance from the backup, so in practice you end up replacing the original instance with the new one. This has implications. The new instance gets its own endpoint address, so the endpoint associated with your original RDS instance no longer applies, and you will need to remap anything that uses that instance to the new URL.
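The remapping step is easy to get wrong by hand, so here is a small sketch that swaps the host in a connection URL while leaving the credentials, port and database name alone. The function name, hostnames and URL are invented for illustration; where your application actually stores its connection strings is up to you.

```python
from urllib.parse import urlsplit, urlunsplit


def repoint(url, old_host, new_host):
    """Return url with old_host replaced by new_host; other URLs pass through unchanged."""
    parts = urlsplit(url)
    if parts.hostname != old_host.lower():
        return url
    # netloc looks like "user:password@host:port"; keep everything except the host.
    userinfo, _, hostport = parts.netloc.rpartition("@")
    _, sep, port = hostport.partition(":")
    prefix = userinfo + "@" if userinfo else ""
    return urlunsplit(parts._replace(netloc=prefix + new_host + sep + port))


# Hypothetical endpoints for the original and the restored instance.
old = "olddb.example.internal"
new = "newdb.example.internal"
updated = repoint("postgresql://app:secret@olddb.example.internal:5432/orders", old, new)
print(updated)
```

A common alternative is to have applications connect through a DNS name you control, so that a restore only requires updating one DNS record.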

What can you NOT do with RDS?

You can’t copy, paste or create files in the underlying disk system. If your on-site DB server has non-SQL related files on disk, they can’t be ported across.
You can’t run batch files, Windows Command Shell files or PowerShell scripts on the host.
You can’t directly monitor disk space, CPU usage or memory usage from the host. AWS provides a different way of monitoring these (CloudWatch metrics).
You can’t copy backup files onto the local disk from another location and restore databases from there.
You can’t decide which drive your database files go to; AWS has a default location for that.
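Since you cannot log on to the host, disk, CPU and memory monitoring goes through CloudWatch metrics instead. As one hedged example, the sketch below builds the parameters for an alarm on the FreeStorageSpace metric (reported in bytes) in the AWS/RDS namespace. The helper name, alarm name, threshold, instance identifier and SNS topic ARN are placeholders chosen for illustration.

```python
def low_storage_alarm(instance_id, sns_topic_arn, min_free_gib=10):
    """Build CloudWatch alarm parameters for low free storage on an RDS instance."""
    return {
        "AlarmName": f"{instance_id}-low-free-storage",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",  # reported in bytes
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive breaches before alarming
        "Threshold": min_free_gib * 1024 ** 3,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical instance name and topic ARN.
alarm = low_storage_alarm("prod-db", "arn:aws:sns:us-east-1:123456789012:dr-alerts")
print(alarm["AlarmName"], alarm["Threshold"])
```

With boto3, these parameters could be passed as keyword arguments to the CloudWatch client's put_metric_alarm call; similar alarms could be built for CPUUtilization or FreeableMemory.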

Summary

Recovering from a DR scenario is not something that you want to take a chance with. Fortunately, applications hosted in the cloud (or utilizing cloud infrastructure) are a lot easier to recover than conventional apps. This is due to several built-in features, such as live read-replicas and multi-AZ architecture, that make it possible to experiment with different DR configurations.

Contact Anuj Varma to see if he can help with your cloud DR needs. (See also, a newer post on advanced D.R. in the cloud)

The post Disaster Recovery and Failover in AWS–Some notes from the field appeared first on Anuj Varma, Hands-On Technology Architect, Clean Air Activist.
