Amazon Redshift and Designing for Resilience

February 15, 2013

As you may remember from our announcement at re:Invent in November 2012, Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service that delivers fast query performance at less than one tenth the cost of most traditional data warehouse systems. I’ve been eagerly waiting for Amazon Redshift’s launch since we announced the service preview at re:Invent, and I’m delighted that it’s now available for all customers in the US East (N. Virginia) Region, with additional AWS Regions planned for the coming months. To get started with Amazon Redshift, visit: http://aws.amazon.com/redshift.

Amazon Redshift and Resilience

Previously, I’ve written at length about how Amazon Redshift achieves high performance. Today, I’m going to focus on Amazon Redshift’s durability and fault tolerance.

Amazon Redshift uses locally attached storage to deliver high IO performance. To provide data durability, Amazon Redshift maintains multiple copies of your data at all times. When you load data into an Amazon Redshift cluster, it is synchronously replicated to multiple drives on other nodes in the cluster. Your data is also automatically replicated to Amazon S3, which is designed for 99.999999999% durability. Backups of your data to Amazon S3 are continuous, incremental, and automatic. This combination of in-cluster replication and continuous backup to Amazon S3 ensures you have a highly durable system. You simply load the data and Amazon Redshift takes care of the rest.

Amazon Redshift implements a number of features that make the service resilient to drive and node failures within the data warehouse cluster. Although individual component failures are rare, as the number of components in a system increases, the probability that at least one of them fails also increases. When the per-drive failure probability is small, the probability of a drive failure somewhere in a large cluster is approximately the probability of an individual drive failure times the number of drives in the cluster. If you have a 50-node 8XL cluster containing a total of 1,200 hard drives, you will inevitably experience a drive failure at some point. You have to anticipate these sorts of failures and design your systems to be resilient to them.
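To make the arithmetic concrete, here is a minimal Python sketch. It assumes drives fail independently, and the per-drive failure probabilities are made-up illustration values, not measured Amazon Redshift figures. It compares the exact probability of at least one failure, 1 - (1 - p)^n, with the "probability times number of drives" approximation for the 1,200-drive cluster mentioned above.

```python
def prob_any_failure(p_drive: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_drive) ** n_drives


def linear_approx(p_drive: float, n_drives: int) -> float:
    """The 'p times n' approximation, capped at 1 since it is a probability."""
    return min(1.0, p_drive * n_drives)


if __name__ == "__main__":
    n_drives = 1200  # total drives in the 50-node example cluster
    # Hypothetical per-drive failure probabilities over some fixed period.
    for p in (1e-5, 1e-4, 1e-3, 1e-2):
        exact = prob_any_failure(p, n_drives)
        approx = linear_approx(p, n_drives)
        print(f"p={p:g}  exact={exact:.6f}  approx={approx:.6f}")
```

The approximation is always an upper bound on the exact value and tracks it closely while p times n stays well below 1. Once it approaches 1, a failure somewhere in the cluster is close to certain, which is the point of the paragraph above.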

Amazon Redshift continuously monitors your data warehouse cluster for drive and node failures. If Amazon Redshift detects a drive failure, it automatically begins using the other in-cluster copy of the data on that drive to serve queries while also creating another copy of the data on healthy drives within the cluster. If all of the copies within the cluster are unavailable, it will bring the data down from S3. This is all entirely transparent to the running system. If Amazon Redshift detects a failure that requires a node to be replaced, it automatically provisions and configures a new node and adds it to your cluster so you can resume operations.
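The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Amazon Redshift's implementation: the names `read_block` and `BlockSource` and the convention that an unavailable copy returns `None` are assumptions made for the example, and the sketch covers only the read path, not the re-replication onto healthy drives.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a source returns the block's bytes, or None if that
# copy is unavailable (for example, because its drive has failed).
BlockSource = Callable[[str], Optional[bytes]]


def read_block(
    block_id: str,
    in_cluster_copies: Sequence[BlockSource],
    s3_backup: BlockSource,
) -> Tuple[bytes, str]:
    """Return the block and a label for where it was read from.

    Tries each in-cluster copy in order, then falls back to the backup.
    """
    for index, source in enumerate(in_cluster_copies):
        data = source(block_id)
        if data is not None:
            return data, f"cluster-copy-{index}"

    data = s3_backup(block_id)
    if data is None:
        raise LookupError(f"block {block_id!r} not found in any copy")
    return data, "s3"


if __name__ == "__main__":
    failed = lambda block_id: None      # simulates a failed drive
    healthy = lambda block_id: b"rows"  # simulates a readable copy

    print(read_block("b1", [failed, healthy], healthy))
    print(read_block("b1", [failed, failed], healthy))
```

The caller never needs to know which copy served the read, which mirrors the claim that recovery is transparent to the running system.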

But what about a scenario in which you need to restore an entire cluster? You can use any of the saved system or user backups to restore a copy of your cluster with a few clicks. Amazon Redshift automatically provisions and configures your cluster and begins restoring data from Amazon S3 to each node in your cluster in parallel. Amazon Redshift’s streaming restore feature enables you to resume querying as soon as the new cluster is created and basic metadata is restored. The data itself will be pulled down from S3 in the background, or brought in on demand as needed by individual queries. This is important since most queries in a typical data warehouse only access a small fraction of the data. For example, you might have three years of data in the warehouse, but have most queries referencing the last day or week. These queries will become performant quickly, as the hot data set is brought down.
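The streaming restore behavior can be illustrated with a toy model in Python. Everything here is hypothetical: the `StreamingRestore` class, its method names, and the dictionary standing in for the backup in Amazon S3 are assumptions for the sketch, not the service's actual design. It shows the two paths described above: a background process that restores blocks one at a time, and query reads that pull a missing block in on demand.

```python
class StreamingRestore:
    """Toy model: blocks arrive in the background or on first read."""

    def __init__(self, backup: dict):
        self._backup = backup         # stands in for the backup in S3
        self._local = {}              # blocks already restored locally
        self._pending = list(backup)  # order used by the background restore

    def read(self, block_id):
        """Serve a query read, fetching the block on demand if needed."""
        if block_id not in self._local:
            self._local[block_id] = self._backup[block_id]
        return self._local[block_id]

    def background_step(self) -> bool:
        """Restore one block that is not yet local; False when none remain."""
        while self._pending:
            block_id = self._pending.pop(0)
            if block_id not in self._local:
                self._local[block_id] = self._backup[block_id]
                return True
        return False

    def restored_fraction(self) -> float:
        if not self._backup:
            return 1.0
        return len(self._local) / len(self._backup)


if __name__ == "__main__":
    restore = StreamingRestore({"2011": b"a", "2012": b"b", "last_week": b"c"})
    restore.read("last_week")           # hot data is usable immediately
    print(restore.restored_fraction())  # only the queried block is local
    while restore.background_step():    # the rest arrives in the background
        pass
    print(restore.restored_fraction())
```

In this model a query touching only recent data never waits for the older blocks, which is why the hot data set becomes fast well before the full restore completes.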

I tell developers all the time to plan for failure and to design their systems around it. Performance is important, but it just doesn’t matter unless the system is up. I’m very happy to see Amazon Redshift incorporating sound principles of distributed systems design for achieving availability and durability at petabyte scale.

Amazon Redshift and Amazon DynamoDB

I’m also pleased that we have built a powerful and easy-to-use integration between Amazon Redshift and one of our other highly available and durable services: Amazon DynamoDB. You can move all of your Amazon DynamoDB data into an Amazon Redshift table with a single command run from within Amazon Redshift:

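-- Load every item from the DynamoDB table into the Amazon Redshift table.
-- readratio 50 limits the load to 50% of the DynamoDB table's provisioned read throughput.
copy table_redshift from 'dynamodb://table_dynamodb'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx' readratio 50;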

I’m excited that Amazon Redshift is now available to everyone. I can’t wait to see how our customers will use the service.