Expanding the Cloud: Moving large data sets into Amazon S3 with AWS Import/Export.


Before networks were everywhere, the easiest way to transport information from one computer in your machine room to another was to write the data to a floppy disk, run to the other computer and load the data there from that floppy. This form of data transport was jokingly called "sneaker net". It was efficient because networks only had limited bandwidth and you wanted to reserve that for essential tasks.

In some ways the computing world has changed dramatically; networks have become ubiquitous and their latency and bandwidth capabilities have improved immensely. Alongside this growth in network capabilities we have been able to grow something else to even bigger proportions, namely our datasets. Gigabyte datasets are considered small, terabyte datasets are commonplace, and we see several customers working with petabyte-size datasets.

No matter how much we have improved our network throughput in the past 10 years, our datasets have grown faster, and this is likely to be a pattern that will only accelerate in the coming years. While networks may improve another order of magnitude in throughput, it is certain that datasets will grow two or more orders of magnitude in the same period of time.

At the same time, processing large amounts of data has become commonplace. Where this used to be the domain of physics and biotech researchers or maybe business intelligence, now increasingly other domains are being driven by large datasets. In research we see that traditional social sciences such as psychology and history are moving to become data driven. In the commercial world, for example, no ecommerce site can function anymore without mining massive amounts of data to optimize recommendations to its customers. In the systems management domain, too, datasets are growing faster and faster; consequently, backup and disaster recovery have to deal with increasingly large sets. Log files and monitoring also spew out more and more relevant data.

Many of our customers have large datasets and would love to move them into our storage services and process them in Amazon EC2. However, moving these large datasets over the network can be cumbersome. Consider typical network speeds and how long it would take to move a terabyte dataset:

[Image: speedtable.jpg]
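The arithmetic behind this kind of comparison is easy to reproduce. What follows is a minimal sketch, not the source of the original figures: the link names and nominal speeds, the 80% utilization factor, and the decimal definition of a terabyte are all assumptions chosen for illustration, and `transfer_days` is a hypothetical helper name.

```python
# Rough estimate of how long a dataset takes to move over a network link.
# All link speeds and the utilization factor are illustrative assumptions.

TERABYTE = 10**12  # bytes, decimal definition (assumption)

# Nominal link speeds in megabits per second (assumed, for illustration).
LINKS_MBPS = {
    "T1": 1.544,
    "10 Mbps": 10.0,
    "T3": 44.736,
    "100 Mbps": 100.0,
    "1 Gbps": 1000.0,
}


def transfer_days(size_bytes, link_mbps, utilization=0.8):
    """Return the estimated transfer time in days.

    utilization is the fraction of the nominal link speed that is
    actually available for the transfer (assumed 80% by default).
    """
    bits = size_bytes * 8
    effective_bits_per_second = link_mbps * 1_000_000 * utilization
    seconds = bits / effective_bits_per_second
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    for name, mbps in LINKS_MBPS.items():
        days = transfer_days(TERABYTE, mbps)
        print(f"{name:>8}: {days:8.1f} days per terabyte")
```

Changing the utilization factor or the dataset size shows quickly at which point shipping a disk beats the wire for your own connection.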

Depending on the network throughput available to you and the data set size it may take rather long to move your data into Amazon S3. To help customers move their large data sets into Amazon S3 faster, we offer them the ability to do this over Amazon's internal high-speed network using AWS Import/Export.

AWS Import/Export allows you to ship your data on one or more portable storage devices to be loaded into Amazon S3. For each portable storage device to be loaded, a manifest explains how and where to load the data, and how to map files to Amazon S3 object keys. After loading the data into Amazon S3, AWS Import/Export stores the resulting keys and MD5 checksums in log files so that you can check whether the transfer was successful.
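To illustrate how such a check might look on your side, here is a minimal sketch that compares MD5 checksums of local files against logged ones. It does not parse the actual AWS Import/Export log format, which is not described here; it assumes you have already extracted the log into a plain dictionary mapping each object key to its MD5 hex digest, and that keys correspond to file paths relative to a local directory. The names `md5_of_file` and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(local_root, logged_checksums):
    """Compare local files against logged checksums.

    logged_checksums is assumed to be a dict of
    {object_key: md5_hex_digest}, with each key being a path
    relative to local_root. Returns a list of (key, reason) tuples.
    """
    mismatches = []
    for key, logged_md5 in logged_checksums.items():
        local_path = Path(local_root) / key
        if not local_path.is_file():
            mismatches.append((key, "missing locally"))
        elif md5_of_file(local_path) != logged_md5.lower():
            mismatches.append((key, "checksum differs"))
    return mismatches
```

An empty list from `find_mismatches` would mean every logged key matches the local copy; anything else tells you which files to look at again.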

AWS Import/Export is of great help to many of our customers who have to handle large data sets. We continue to listen to our customers to make sure we are adding features, tools and services that help them solve real problems. For more information on AWS Import/Export visit the detail page.

For more background on the evolution of large datasets and the challenges of moving them over the network, you should read some of the papers by and interviews with Jim Gray, who was a pioneer in the area of computing.

5 Comments

Saurabh said:

Werner:

Bucket Explorer now supports creating the manifest file / signing it with AWS keys and creating the packaging slip for Import/Export Service:

http://www.bucketexplorer.com/documentation/amazon-s3--aws-import-export-service.html

Jason Davies said:

This is pretty genius.

I'd be interested to know about the security measures taken when shipping data, e.g. to prevent data being leaked by someone taking a disk image. Presumably you could simply encrypt the disks, and then run some decryption software at the other end on your EC2 instance?

clarke said:

I would think you'd want to support a Drobo or similar RAID type storage solution. Having tried to copy 1TB of an Oracle DB onto a Firewire800 1TB disk, the speed was so slow it would be quicker for me to upload on a T3.

Though to a certain degree, conceptually I would think you'd want a burstable or ultra-high throughput Internet connection to AWS anyhow (or any cloud service) to minimize performance issues for your end users.

Andy said:

We are so glad Amazon made initial data transfer even simpler. You no longer have to wait weeks before your initial backup completes. Hopefully this will make tools like CloudBerry Backup for S3 even more appealing: http://cloudberrydrive.com
The only thing I would recommend to Amazon is not to require customers to sign up for yet another service but to make it a part of Amazon S3. Besides, make it possible to track import status using a REST API; right now you can only track it by sending emails.

This is great news for shipping huge volumes of data to the cloud, especially for anyone without a T3 connection.

As an Amazon partner, RainStor (http://www.rainstor.com), who provide a cloud archive service, use a different approach to moving large volumes of “structured” data (e.g. data from databases, log files etc.) to S3. RainStor provide a client-side software VM with their service that automatically compresses and encrypts data prior to uploading to S3. With compression rates typically 40:1, this dramatically reduces the upload time and costs. As the clients own the encryption keys, the data is always secure and never vulnerable during transfer or when the data is at rest in S3.

It’s worth considering whether this data loader component should be released as an independent utility for use with other cloud services?

About this Entry

This page contains a single entry by Werner Vogels published on May 20, 2009 9:00 PM.
