All Things Distributed
Today, we are excited to announce the limited preview of Amazon Redshift, a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables customers to obtain dramatically increased query performance when analyzing datasets ranging in size from hundreds of gigabytes to a petabyte or more, using the same SQL-based business intelligence tools they use today. Customers have been asking us for a data warehouse service for some time now and we’re excited to be able to deliver this to them.
Amazon Redshift uses a variety of innovations to enable customers to rapidly analyze datasets ranging in size from several hundred gigabytes to a petabyte and more. Unlike traditional row-based relational databases, which store data for each row sequentially on disk, Amazon Redshift stores each column sequentially. This means that Redshift performs much less wasted IO than a row-based database because it doesn’t read data from columns it doesn’t need when executing a given query. Also, because similar data are stored sequentially, Amazon Redshift can compress data efficiently, which further reduces the amount of IO it needs to perform to return results.
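To make the difference between the two layouts concrete, here is a minimal Python sketch. It is a toy model, not a description of Redshift's actual storage engine: the table, the column names, and the helper functions are all invented for illustration, and run-length encoding is used only as one simple example of a compression scheme that benefits from similar values sitting next to each other.

```python
# Toy table: (order_id, customer, region, amount). All values are illustrative.
COLUMNS = ("order_id", "customer", "region", "amount")
ROWS = [
    (1, "alice", "us-east", 120),
    (2, "bob", "us-east", 75),
    (3, "carol", "eu-west", 310),
    (4, "dave", "eu-west", 42),
]


def row_layout(rows):
    """Row store: the values of each row sit next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column store: the values of each column sit next to each other."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def values_read_row_store(rows, wanted):
    """A full scan of a row store touches every value, whatever columns are wanted."""
    return len(row_layout(rows))


def values_read_column_store(rows, wanted):
    """A column store touches only the values of the requested columns."""
    columns = column_layout(rows)
    return sum(len(columns[name]) for name in wanted)


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


if __name__ == "__main__":
    wanted = ["amount"]  # think of a query that only sums the amount column
    print(values_read_row_store(ROWS, wanted))     # 16
    print(values_read_column_store(ROWS, wanted))  # 4
    print(run_length_encode(column_layout(ROWS)["region"]))
```

In this toy example a query that needs one of four columns touches a quarter of the values, and the region column shrinks from four entries to two once adjacent duplicates are collapsed.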
Amazon Redshift’s architecture and underlying platform are also optimized to deliver high performance for data warehousing workloads. Redshift has a massively parallel processing (MPP) architecture, which enables it to distribute and parallelize queries across multiple low cost nodes. The nodes themselves are designed specifically for data warehousing workloads. They contain large amounts of locally attached storage on multiple spindles and are connected by a minimally oversubscribed 10 Gigabit Ethernet network. This configuration maximizes the amount of throughput between your storage and your CPUs while also ensuring that data transfer between nodes remains extremely fast.
When you provision an Amazon Redshift cluster, you can select from 1 to 100 nodes depending on your storage and performance requirements and easily scale up or down as those requirements change. You have a choice of two node types when provisioning a cluster: an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. Amazon Redshift’s MPP architecture makes it easy to resize your cluster to keep pace with your storage and performance requirements. You can start with 2TB of capacity in your data warehouse cluster and easily scale up to a petabyte and more.
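As a rough sizing aid, the node counts implied by these figures can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the function name is hypothetical, it considers storage alone and ignores performance requirements, and it assumes the 1 to 100 node range quoted above applies to both node types.

```python
import math

# Compressed storage per node in TB, as quoted in the post.
NODE_STORAGE_TB = {"XL": 2, "8XL": 16}
MAX_NODES = 100  # upper bound on cluster size quoted in the post


def nodes_needed(data_tb, node_type):
    """Smallest node count whose combined storage holds data_tb terabytes."""
    per_node_tb = NODE_STORAGE_TB[node_type]
    count = max(1, math.ceil(data_tb / per_node_tb))
    if count > MAX_NODES:
        raise ValueError(
            f"{data_tb}TB needs {count} {node_type} nodes, more than {MAX_NODES}"
        )
    return count


if __name__ == "__main__":
    print(nodes_needed(2, "XL"))      # 1
    print(nodes_needed(50, "XL"))     # 25
    print(nodes_needed(50, "8XL"))    # 4
    print(nodes_needed(1600, "8XL"))  # 100
```

The last line shows where "a petabyte and more" comes from: 100 nodes at 16TB each is 1,600TB, or 1.6PB.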
Parallelism isn’t just about queries. Amazon Redshift takes it a step further by applying it to operations like loads, backups, and restores. For example, when loading data from Amazon S3, you simply issue a SQL COPY command with the location of your S3 bucket. Redshift analyzes the contents of your bucket and loads each node in parallel, taking advantage of the increased bandwidth of multiple connections to S3. If you choose to load your data in a round robin fashion, you’re done. If you choose a hash-partitioning scheme, your data is automatically redistributed to the correct node. Amazon Redshift also extends this parallelism to backups, which are taken from each node and are automated, continuous, and incremental. Restoring a cluster from an S3 backup is also a node-parallel operation. With all of these operations, our goal is to minimize the time you spend performing operations with large data sets.
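The two distribution styles mentioned here can be illustrated with a small Python sketch. It is a simplified model and not Redshift's implementation: the function names, the sample rows, and the use of CRC32 as the hash function are all assumptions chosen for the example.

```python
import zlib

# Illustrative rows; the field names are made up for this example.
ORDERS = [
    {"customer": "alice", "amount": 120},
    {"customer": "bob", "amount": 75},
    {"customer": "alice", "amount": 310},
    {"customer": "carol", "amount": 42},
    {"customer": "bob", "amount": 18},
]


def round_robin(rows, num_nodes):
    """Deal rows out to nodes in turn, giving an even spread regardless of content."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def hash_partition(rows, num_nodes, key):
    """Send each row to the node selected by a hash of its key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[digest % num_nodes].append(row)
    return nodes


if __name__ == "__main__":
    for node_id, node_rows in enumerate(round_robin(ORDERS, 2)):
        print("round robin node", node_id, node_rows)
    for node_id, node_rows in enumerate(hash_partition(ORDERS, 2, "customer")):
        print("hash node", node_id, node_rows)
```

Round robin placement depends only on arrival order, so rows can stay on whichever node loaded them. With hash partitioning, all rows sharing a key end up on the same node, which is why rows may have to move between nodes after the initial load.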
The result of our focus on performance has been dramatic. Amazon.com’s data warehouse team has been piloting Amazon Redshift and comparing it to their on-premise data warehouse for a range of representative queries against a two billion row data set. They saw speedups ranging from 10x – 150x!
Until now, these levels of performance and scalability were prohibitively expensive. I’m happy to say that this is not how we do things at Amazon. You can get started with a single 2TB Amazon Redshift node for $0.85/hour On-Demand and pay by the hour with no long-term commitments or upfront costs. This works out to $3,723 per terabyte per year. If you have stable, long-running workloads, you can take advantage of our three-year reserved instance pricing to lower Redshift’s price to under $1,000 per terabyte per year, one tenth the price of most data warehousing solutions available to customers today. In the case of Amazon.com’s data warehouse team, their existing data warehouse is a multi-million dollar system with 32 nodes, 128 CPUs, 4.2TB of RAM, and 1.6PB of disk. They achieved their speedups with an Amazon Redshift cluster with two 8XL nodes and an effective three-year reserved instance price of $3.65/hour, or less than $32,000 per year.
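For readers who want to check the arithmetic, the figures above can be reproduced with a short Python sketch. The only assumptions are a 365-day year (8,760 hours) and the helper names, which are invented here; the hourly rates and storage sizes are the ones quoted in this post.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def yearly_cost(hourly_rate):
    """Cost in dollars of running at hourly_rate for a full year."""
    return hourly_rate * HOURS_PER_YEAR


def cost_per_tb_year(hourly_rate, storage_tb):
    """Yearly cost divided by the storage it pays for, in dollars per TB."""
    return yearly_cost(hourly_rate) / storage_tb


if __name__ == "__main__":
    # Single 2TB node at $0.85/hour On-Demand.
    print(f"{cost_per_tb_year(0.85, 2):,.2f}")       # 3,723.00

    # Two 8XL nodes (2 x 16TB) at an effective $3.65/hour reserved price.
    print(f"{yearly_cost(3.65):,.2f}")               # 31,974.00
    print(f"{cost_per_tb_year(3.65, 2 * 16):,.2f}")  # 999.19
```

The last figure, roughly $999 per terabyte per year for the two-node 8XL cluster, is consistent with the "under $1,000 per terabyte per year" reserved price mentioned above.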
In addition to being expensive, self-managed on-premise data warehouses require significant time and resources to administer. Loading, monitoring, tuning, taking backups, and recovering from faults are complex and time-consuming tasks. Amazon Redshift changes this by managing all the work needed to set up, operate, and scale a data warehouse, enabling you to focus on analyzing your data and generating business insights.
We designed Amazon Redshift with integration and compatibility in mind. Redshift integrates with Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service (RDS) and Amazon Elastic MapReduce coming soon. You can connect your SQL-based clients or business intelligence tools to Amazon Redshift using standard PostgreSQL drivers over JDBC or ODBC connections. Jaspersoft and MicroStrategy have already certified Amazon Redshift for use with their platforms, with additional business intelligence tools coming soon.
I believe that Amazon Redshift’s combination of performance, price, manageability, and compatibility will make analyzing larger and larger data sets economically justifiable. I look forward to seeing how our customers put this technology to work.
To learn more about Amazon Redshift, visit the AWS blog and sign up for an invitation to the limited preview at http://aws.amazon.com/redshift.