Backups and Disaster Recovery in AWS - TriNimbus

Backups and Disaster Recovery in AWS

When first learning about AWS, many people ask: “Do I need to explicitly deal with backups of my data and disaster recovery plans? Given that AWS takes care of the underlying hardware, doesn’t that imply that AWS makes sure my data is never lost?”

The thinking that AWS automatically deals with backup and disaster recovery is fed mostly by articles popularizing and oversimplifying cloud computing. These articles tend to emphasize that you don’t have to worry about maintaining your own hardware, leaving it up to your imagination and wishful thinking as to what it means for your data, servers, and network infrastructure.

Unfortunately, it turns out that issues related to backups and disaster recovery need to be explicitly addressed in the cloud as much as in a physical data center. On the bright side, AWS provides technologies that simplify this process.

The goal of this article is to summarize these technologies and show how they can be used to keep your data and systems protected.

Terminology

To understand how AWS supports backup and disaster recovery, it is critical to understand Amazon’s concepts of “availability zones” and “regions”.

An “availability zone” (AZ, e.g. “us-east-1a”) corresponds to one or more AWS data centers with their own internet connection, redundant power supplies, fire protection, etc. Individual AZs are designed to be fully independent (failure of one AZ shouldn’t impact the other AZs).

A “region” (e.g. “us-east-1”) is a group of AZs in geographical proximity, connected by a private, high-speed, low-latency network.

Different regions (e.g. “us-east-1” and “us-west-1”) communicate with each other over the Internet.

Data Storage Options in AWS

AWS provides a number of options for storing your data:

Instance (or ephemeral) store: Hard drives (or solid state drives) in the physical server on which your virtual machine is running. While very fast, this type of storage provides minimal protection against data loss. More importantly, data in the instance store is lost if the virtual machine is stopped or terminated (though it survives a reboot). Therefore, this storage is recommended only for data that can be re-created from other sources, or that is continuously replicated among multiple locations.

Elastic Block Store (EBS) volumes: Block-level storage that is automatically replicated on multiple physical devices within one AZ. EBS volumes exist independently of virtual machines, and can be detached from one virtual machine and reattached to another one (but they can’t be attached to multiple virtual machines at the same time).

EBS Snapshots: Snapshots of EBS volumes built on top of S3 (see below), and therefore automatically replicated across multiple AZs within the same region.

Simple Storage Service (S3): Storage for blobs that are automatically replicated across multiple AZs within the same region. Note that S3 is not a file system, nor is it a block-level device. Content of S3 is accessed using simple “put” and “get” methods.

Storage Gateway: Storage Gateway was originally designed for hybrid environments, providing on-premises iSCSI storage backed by S3. Today, Storage Gateway can also be used in the cloud, providing iSCSI-accessible storage backed by S3.

Glacier: Highly reliable storage of data optimized for cost and durability. It is designed for archiving and infrequent data access (time to retrieve data is measured in hours).
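As an illustration of the EBS snapshot option described above, here is a minimal Python sketch that starts a snapshot of a single volume. It assumes that `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`); the function name, the description text, and the `backup` tag are hypothetical and chosen only for this example.

```python
from datetime import datetime, timezone


def snapshot_volume(ec2, volume_id, now=None):
    """Start a snapshot of one EBS volume and return the new snapshot ID.

    `ec2` is assumed to be a boto3 EC2 client; it is passed in so the
    function can be exercised without touching a real AWS account.
    """
    now = now or datetime.now(timezone.utc)
    description = f"backup of {volume_id} taken {now:%Y-%m-%d %H:%M} UTC"
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=description,
        TagSpecifications=[{
            "ResourceType": "snapshot",
            # Hypothetical tag, useful for finding these snapshots later.
            "Tags": [{"Key": "backup", "Value": "scheduled"}],
        }],
    )
    return response["SnapshotId"]
```

The call returns as soon as the snapshot has been started; the snapshot itself completes in the background. Running a function like this on a schedule is one way to obtain the regular snapshots mentioned in the sections below.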

File Systems

A number of AWS users consider S3 or EBS snapshots to provide sufficient levels of protection for their data. With this approach, standard backup tools are used to create backup files, which are then uploaded to S3 or stored on EBS volumes that are snapshotted on a regular basis. Depending on the application, the EBS volumes on which the live application data is stored can be snapshotted as well. S3 lifecycle policies can be used to limit the amount of data kept in S3, either by expiring old objects or by archiving them to Glacier.
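A lifecycle policy of the kind mentioned above could look roughly like the following Python sketch. The helper names, the `backups/` prefix, and the day counts are assumptions made for this example; `s3` is assumed to be a boto3 S3 client.

```python
def backup_lifecycle(prefix, archive_after_days, delete_after_days):
    """Build a lifecycle configuration for backup files under `prefix`.

    Objects move to Glacier after `archive_after_days` and are deleted
    after `delete_after_days`.
    """
    if delete_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [{
            "ID": f"archive-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [
                {"Days": archive_after_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": delete_after_days},
        }]
    }


def apply_lifecycle(s3, bucket, configuration):
    """Attach the configuration to a bucket (`s3` is a boto3 S3 client)."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=configuration,
    )
```

For example, `backup_lifecycle("backups/", 30, 365)` describes a rule that archives backup files after a month and removes them after a year. Note that applying a configuration replaces any lifecycle rules already set on the bucket.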

In case of large-scale disasters (e.g. earthquakes) it is theoretically possible that all AZs within one region will be impacted. To protect against these types of events, some users duplicate their data among multiple regions. This level of cross-region replication does not happen automatically and needs to be set up explicitly.
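For data kept in EBS snapshots, one way to set up this duplication is to copy each snapshot into a second region. The sketch below assumes that `dest_ec2` is a boto3 EC2 client created for the destination region (the copy is requested from the destination side); the function name and description text are hypothetical.

```python
def copy_snapshots_to_region(dest_ec2, source_region, snapshot_ids):
    """Copy EBS snapshots into the region that `dest_ec2` points at.

    Returns a dict mapping each source snapshot ID to the ID of its copy.
    """
    copies = {}
    for snapshot_id in snapshot_ids:
        response = dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"copy of {snapshot_id} from {source_region}",
        )
        copies[snapshot_id] = response["SnapshotId"]
    return copies
```

A caller might create the client with `boto3.client("ec2", region_name="us-west-1")` and pass `"us-east-1"` as the source region. This sketch does not handle encrypted snapshots or copies into a different AWS account, which need additional parameters and permissions.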

To protect against unauthorized access to an AWS account, some users replicate their backups to S3 under a secondary AWS account.

For mission critical applications TriNimbus recommends the cross-region approach using a secondary AWS account. Read more about multiple AWS accounts here.

Relational Database Services

Amazon’s Relational Database Service (RDS) is a managed service for MySQL, PostgreSQL, Microsoft SQL Server, and Oracle relational databases. It provides built-in functionality for real-time replication across multiple AZs within a region, as well as scheduled snapshots. Some AWS users consider this functionality sufficient for their backup and disaster recovery plans.

For mission-critical applications TriNimbus recommends that the automatic snapshots created by RDS are copied to S3 in another region under a secondary AWS account.
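The cross-region part of this recommendation can be sketched in Python as follows. This is only an illustration under several assumptions: `rds` and `dest_rds` are boto3 RDS clients for the source and destination regions, the helper names and the `copy-` prefix are hypothetical, and the sketch uses the RDS snapshot copy operation rather than writing files to an S3 bucket. It does not cover the secondary-account step or encrypted snapshots.

```python
def latest_automated_snapshot(rds, instance_id):
    """Return the newest available automated snapshot, or None.

    Only the first page of results is inspected, which is enough for
    a sketch but not for instances with very many snapshots.
    """
    response = rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_id,
        SnapshotType="automated",
    )
    available = [
        snapshot for snapshot in response["DBSnapshots"]
        if snapshot.get("Status") == "available"
    ]
    if not available:
        return None
    return max(available, key=lambda snapshot: snapshot["SnapshotCreateTime"])


def copy_snapshot_to_region(dest_rds, snapshot, source_region):
    """Copy one snapshot into the region that `dest_rds` points at."""
    # Automated snapshot names start with "rds:", and a colon is not
    # accepted in the name of the copy, so the prefix is dropped.
    source_name = snapshot["DBSnapshotIdentifier"].replace("rds:", "")
    target_name = "copy-" + source_name
    dest_rds.copy_db_snapshot(
        # Cross-region copies refer to the source snapshot by its ARN.
        SourceDBSnapshotIdentifier=snapshot["DBSnapshotArn"],
        TargetDBSnapshotIdentifier=target_name,
        SourceRegion=source_region,
    )
    return target_name
```

Splitting the selection from the copy keeps each step easy to test and makes it simple to swap in a different selection rule, such as copying every snapshot from the last week.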

EC2 Instances

To back up their EC2 instances (Amazon-speak for “virtual machines”), some users create new AMIs (Amazon Machine Images) every time they update the instances with new software or configuration. The AMIs are automatically replicated to multiple AZs within a region. For added security some users copy their AMIs to other regions.
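The AMI approach, including the copy to another region, might be automated along these lines. The sketch assumes that `source_ec2` and `dest_ec2` are boto3 EC2 clients for the source and destination regions; the function name is hypothetical.

```python
def create_and_copy_ami(source_ec2, dest_ec2, instance_id, name, source_region):
    """Create an AMI from an instance and copy it to a second region.

    Returns a tuple (source image ID, copied image ID).
    """
    # NoReboot=False lets EC2 reboot the instance so that the file
    # system is in a consistent state when the image is taken.
    image_id = source_ec2.create_image(
        InstanceId=instance_id,
        Name=name,
        NoReboot=False,
    )["ImageId"]

    # Wait until the new image is available before copying it.
    source_ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    copy_id = dest_ec2.copy_image(
        Name=name,
        SourceImageId=image_id,
        SourceRegion=source_region,
    )["ImageId"]
    return image_id, copy_id
```

Because the default setting reboots the instance, a function like this is best run during a maintenance window or right after a planned software update.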

An alternative approach to disaster recovery of EC2 instances is to have them scripted using tools such as Chef, Puppet, Salt, or even Bash or PowerShell.

TriNimbus recommends using scripting, as it provides better control over the installed software and its configuration.
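As a very small illustration of the scripted approach, the sketch below launches an instance whose software is installed by a Bash script passed as EC2 user data. Everything specific here is an assumption made for the example: the helper names, the use of `yum` (which presumes an Amazon Linux or similar base image), and the package list. `ec2` is assumed to be a boto3 EC2 client. A real setup would more likely hand over to Chef, Puppet, or Salt at this point.

```python
def bootstrap_script(packages):
    """Build a Bash script that installs the given packages at first boot."""
    if not packages:
        raise ValueError("at least one package is required")
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        # Assumes a yum-based distribution such as Amazon Linux.
        "yum install -y " + " ".join(packages),
    ]
    return "\n".join(lines) + "\n"


def launch_scripted_instance(ec2, image_id, instance_type, packages):
    """Launch one instance that configures itself from the script."""
    response = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        UserData=bootstrap_script(packages),
    )
    return response["Instances"][0]["InstanceId"]
```

Since the script lives in version control rather than inside a machine image, the same instance can be rebuilt from a plain base image in any region.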

Sometimes, especially for certain Windows applications, scripting may be too cumbersome. In these situations the only practical approach to disaster recovery for EC2 instances may be creating AMIs, or backing up all software patches and documenting all configuration changes.

Network Infrastructure

Network infrastructure (virtual private cloud, subnets, security groups, elastic load balancers, etc.) exists at the level of a region. As long as at least one AZ within the region is operational, there is no need to re-create this infrastructure. For this reason some AWS users do not consider explicit backups of their network infrastructure important.

In the very unlikely case of a long-term failure of a whole region, the infrastructure needs to be re-created in the AZs of an alternate region. This has to be done either manually, or by having the infrastructure scripted (e.g. using AWS CloudFormation).

TriNimbus recommends that the network infrastructure is scripted. The scripts can be used not only for disaster recovery but also to create additional environments (e.g. for testing).
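To give a rough idea of what scripted network infrastructure can look like, the following Python sketch generates a small CloudFormation template containing a VPC and a few subnets, and then creates a stack from it. The function names, logical resource names, and CIDR ranges are hypothetical; `cloudformation` is assumed to be a boto3 CloudFormation client.

```python
import json


def network_template(vpc_cidr, subnet_cidrs):
    """Build a CloudFormation template with one VPC and one subnet per CIDR."""
    resources = {
        "Vpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": vpc_cidr},
        }
    }
    for index, cidr in enumerate(subnet_cidrs):
        resources[f"Subnet{index}"] = {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "Vpc"},
                "CidrBlock": cidr,
                # Pick the AZ by position in whichever region the stack
                # is created in, so no AZ names are hard-coded.
                "AvailabilityZone": {
                    "Fn::Select": [str(index), {"Fn::GetAZs": ""}],
                },
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


def create_network(cloudformation, stack_name, template):
    """Create a stack from the template in the client's region."""
    cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=json.dumps(template),
    )
```

Because the template avoids region-specific names, the same call can be pointed at a client for an alternate region during disaster recovery, or reused with a different stack name to build a test environment. The region chosen must have at least as many AZs as there are subnets.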

Hopefully this overview helps to clear up some of the confusion that many newcomers to AWS have about backups and disaster recovery. TriNimbus offers various solutions for managing and supporting environments, storage and backup, and disaster recovery. Learn more about these solutions here.