Before networks were everywhere, the easiest way to transport information from one computer in your machine room to another was to write the data to a floppy disk, run to the other computer and load the data there from that floppy. This form of data transport was jokingly called "sneaker net". It was efficient because networks only had limited bandwidth and you wanted to reserve that for essential tasks.

In some ways the computing world has changed dramatically; networks have become ubiquitous and their latency and bandwidth have improved immensely. Alongside this growth in network capabilities, something else has grown to even bigger proportions: our datasets. Gigabyte datasets are considered small, terabyte datasets are commonplace, and we see several customers working with petabyte-size datasets.

No matter how much we have improved our network throughput in the past 10 years, our datasets have grown faster, and this is likely to be a pattern that will only accelerate in the coming years. While networks may improve another order of magnitude in throughput, it is certain that datasets will grow two or more orders of magnitude in the same period of time.

At the same time processing large amounts of data has become commonplace. Where this used to be the domain of physics and biotech researchers or maybe business intelligence, now increasingly other domains are being driven by large datasets. In research we see that traditional social sciences such as psychology and history are moving to become data driven. In the commercial world, for example, no ecommerce site can function anymore without mining massive amounts of data to optimize recommendations to its customers. In the systems management domain, datasets are also growing faster and faster; consequently backup and disaster recovery have to deal with increasingly large sets. Log files and monitoring also spew out more and more relevant data.

Many of our customers have large datasets and would love to move them into our storage services and process them in Amazon EC2. However, moving these large datasets over the network can be cumbersome. Consider typical network speeds and how long it would take to move a terabyte dataset:

speedtable.jpg

Depending on the network throughput available to you and the data set size it may take rather long to move your data into Amazon S3. To help customers move their large data sets into Amazon S3 faster, we offer them the ability to do this over Amazon's internal high-speed network using AWS Import/Export.
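The arithmetic behind such a comparison is easy to sketch. The snippet below is illustrative only: the link speeds are common nominal line rates rather than measurements, the 80% utilization figure is an assumption, and the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400
ONE_TERABYTE = 10**12  # decimal terabyte, in bytes


def transfer_days(dataset_bytes, link_bits_per_second, utilization=0.8):
    """Days needed to move dataset_bytes over a link.

    utilization is the assumed fraction of the nominal line rate
    that is actually available for the transfer.
    """
    effective_bps = link_bits_per_second * utilization
    return dataset_bytes * 8 / effective_bps / SECONDS_PER_DAY


# Nominal line rates in bits per second (illustrative, not measured).
LINKS = {
    "T1 (1.544 Mbps)": 1.544e6,
    "10 Mbps": 10e6,
    "100 Mbps": 100e6,
    "1 Gbps": 1e9,
}

if __name__ == "__main__":
    for name, bps in LINKS.items():
        print(f"{name:>16}: {transfer_days(ONE_TERABYTE, bps):8.2f} days")
```

Under these assumptions a terabyte takes well over two months on a T1 line and more than a week at 10 Mbps, which is where shipping a disk starts to look attractive.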

AWS Import/Export allows you to ship your data on one or more portable storage devices to be loaded into Amazon S3. For each portable storage device to be loaded, a manifest explains how and where to load the data, and how to map files to Amazon S3 object keys. After loading the data into Amazon S3, AWS Import/Export stores the resulting keys and MD5 checksums in log files so that you can check whether the transfer was successful.
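Checking a transfer against those logged checksums could look roughly like the sketch below. It assumes you have already parsed the log into a mapping from object key to MD5 hex digest; the log format itself is not described here, so that parsing step is left out, and the function names are hypothetical.

```python
import hashlib


def md5_hex(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without reading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(local_files, logged_checksums):
    """Return the object keys whose logged checksum is missing or differs.

    local_files:      mapping of object key -> path of the local source file
    logged_checksums: mapping of object key -> MD5 hex digest taken from the
                      log (assumed to be parsed already)
    """
    problems = []
    for key, path in local_files.items():
        expected = logged_checksums.get(key)
        if expected is None or expected.lower() != md5_hex(path):
            problems.append(key)
    return sorted(problems)
```

An empty result means every local file has a matching entry in the log; any key that is returned deserves a closer look.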

AWS Import/Export is of great help to many of our customers who have to handle large data sets. We continue to listen to our customers to make sure we are adding features, tools and services that help them solve real problems. For more information on AWS Import/Export visit the detail page.

For more background on the evolution of large data sets and the challenges with moving them over the network you should read some papers and interviews with Jim Gray who was a pioneer in the area of computing.

The Amazon Elastic Compute Cloud (Amazon EC2) embodies much of what makes infrastructure as a service such a powerful technology; it enables our customers to build secure, fault-tolerant applications that can scale up and down with demand, at low cost. Core to achieving these levels of efficiency and fault-tolerance is the ability to acquire and release compute resources in a matter of minutes, and in different Availability Zones.

Of course the best way to achieve efficiency and fault-tolerance while maintaining good performance is to fully automate the management of the Amazon EC2 Instances, such that you can optimize the use of the compute resources in different scenarios. Higher levels of automation allow your applications to quickly respond to changes in usage patterns and failure events in a pre-determined manner.

At Amazon we have tremendous experience with building our applications this way; we make sure that customers are getting consistent performance, even if whole datacenters are failing. To facilitate this we have built unique infrastructure technologies that help our engineers automate the scalability and fault-tolerance of the Amazon ecommerce platform. Core to those technologies is the ability to monitor and measure every possible resource and activity in real-time, and to automate new capacity deployment and the management of services and applications based on the information that flows through the monitoring system.

With the launch of Amazon CloudWatch, Auto Scaling and Amazon Elastic Load Balancing we are now making these unique technologies available to our Amazon Web Services customers. These features will help our customers to monitor their Amazon EC2 Instances, automatically scale them up and down based on the monitoring data, and to efficiently distribute requests to their applications over the different instances even if they are running in different Availability Zones.

These new infrastructure services consist of three core parts:

  • Amazon CloudWatch enables you to monitor Amazon EC2 Instances and Elastic Load Balancers in real-time. It will aggregate and report on metrics such as CPU utilization, data transfer and disk usage, as well as request rates and request latency.
  • Auto Scaling allows you to automatically acquire and release Amazon EC2 Instances based on the metrics reported through Amazon CloudWatch. You can define the conditions under which this should happen and when these conditions are met, Auto Scaling will automatically add or remove compute resources.
  • Amazon Elastic Load Balancing will distribute incoming application traffic over your Amazon EC2 instances that are running in one or more Availability Zones. It can detect the health of Amazon EC2 instances and will stop routing traffic to unhealthy instances until they have recovered and become healthy again.
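To make the interplay of these parts concrete, here is a minimal sketch of the two decisions involved: adjusting capacity from a monitored metric, and routing only to healthy instances. It is not the API of any of these services; the thresholds, class and function names are all assumptions chosen for illustration.

```python
def desired_instance_count(current, avg_cpu_percent, scale_out_above=70.0,
                           scale_in_below=30.0, minimum=2, maximum=10):
    """Threshold rule: add one instance when busy, remove one when idle."""
    if avg_cpu_percent > scale_out_above:
        current += 1
    elif avg_cpu_percent < scale_in_below:
        current -= 1
    # Never leave the configured capacity bounds.
    return max(minimum, min(maximum, current))


class RoundRobinBalancer:
    """Round-robin routing that skips instances marked unhealthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._next = 0

    def mark_unhealthy(self, instance):
        self._healthy.discard(instance)

    def mark_healthy(self, instance):
        if instance in self._instances:
            self._healthy.add(instance)

    def route(self):
        # Try each instance at most once, starting where we left off.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

A real deployment would let the monitoring data drive both calls: the averaged metric feeds the scaling rule, and health checks feed `mark_unhealthy` and `mark_healthy`.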

These services will be of great value to Amazon Web Services customers to simplify the management of their applications and services. With the introduction of these services it will become even easier to optimize performance and fault-tolerance at low-cost.

You can find more information at the detail pages for Amazon CloudWatch, Auto Scaling and Elastic Load Balancing.

Also check out the blog post at the AWS developer weblog with more examples and details, the RightScale blog with their experiences and my blog post from October 2008 for more background.

Making A Dramatic Difference

As some of you may know both my daughters are studying Drama in London. Last time when I visited them I met two friends of Kim, Georgia Munnion and Lauren Hopkins. They are all classmates and they are graduating this year.

Georgia and Lauren impressed me with a plan they have for spending the two months after their graduation in Nepal providing educational Drama Workshops to Himalayan children.

We will use Drama as a basis for a process of individual social development, and to improve individual, group and team-building skills, through a range of individual and ensemble exercises. Using our acquired theatre knowledge and experience, we will provide education and enjoyment to underprivileged children, who are mostly unable to return home during their holidays, due to snow-bound and monsoon-bound trails and the recent outbreak of war.

This is a program they have designed completely by themselves, and as such they are also fully self-funded. They could really use some help to make this plan a reality. I have already contributed, and if you are someone who appreciates passion in young people who want to make a difference by personally working on making the world a better place, maybe you should too.

Read more about their adventure and plans, and about ways you can maybe contribute at the Making a Dramatic Difference weblog.

Good Advice on Keeping Your Database Simple and Fast.

Keeping your database simple and fast is often difficult if you use higher level frameworks such as ActiveRecord in Ruby or Java object persistence technologies such as Hibernate. There is a lot of magic happening out of sight that you have no control over. If you then have to scale your application, it is often the relational database that these technologies require that becomes the performance and scaling bottleneck, often requiring complex custom implementations of partitioning and sharding to make it work.

The AWS services Amazon S3 and Amazon SimpleDB were designed to handle the dominant storage usage patterns within Amazon and they greatly reduced our need to rely on relational storage for scaling our systems. But it is almost never the case that a single storage technique is used in applications and services that need to operate at enterprise scale. For example, it is a common pattern that objects stored in S3 under a primary key have a collection of secondary keys (e.g. metadata) stored in SimpleDB. SimpleDB provides very fast indexing for querying the metadata, and the queries return the primary keys of objects located in S3.
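The shape of that pattern can be sketched with two in-memory stand-ins. Neither class below is the S3 or SimpleDB API; they are hypothetical placeholders that only show how the primary key ties the object store and the metadata index together.

```python
class ObjectStore:
    """In-memory stand-in for an object store keyed by primary key."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class MetadataIndex:
    """In-memory stand-in for an attribute index over the same keys."""

    def __init__(self):
        self._items = {}

    def put_attributes(self, key, attributes):
        self._items[key] = dict(attributes)

    def query(self, **conditions):
        """Return the primary keys whose attributes match every condition."""
        return sorted(
            key for key, attrs in self._items.items()
            if all(attrs.get(name) == value for name, value in conditions.items())
        )


if __name__ == "__main__":
    objects, index = ObjectStore(), MetadataIndex()
    objects.put("msg/0001", b"first message body")
    index.put_attributes("msg/0001", {"sender": "alice", "folder": "inbox"})
    objects.put("msg/0002", b"second message body")
    index.put_attributes("msg/0002", {"sender": "bob", "folder": "inbox"})

    # Query the metadata first, then fetch the matching objects by key.
    for key in index.query(sender="alice"):
        print(key, objects.get(key))
```

The design point is that the large payloads are only ever fetched by primary key, while all searching happens against the small metadata records.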

At SXSW Interactive there was a great panel/presentation by Mike Subelsky, co-founder of AWS customer OtherInbox, about their experiences with scaling Ruby-on-Rails applications in the Cloud. They demonstrated that with Amazon EC2 and Amazon S3 Ruby/Rails scales just fine. The room was packed and there was some great Q&A.

During the Q&A, Joshua Baer, co-founder and CEO of OtherInbox, gave some great insight into the changing role of relational databases and some really good advice about how they were able to keep their database simple and fast. After the session I asked Joshua to explain it once more for the readers of this weblog.

Flexibility is a key advantage of using Amazon Web Services; you can obtain resources instantaneously without the headache of owning them. If you no longer need the resource, you release it and only pay for what you have used. This is a very powerful model that has helped many of our customers drive capital expense out of their IT operation. It has helped both enterprises and startups reduce the risk that comes with developing new products and businesses.

While this on-demand flexibility is ideal for a whole range of scenarios, some Amazon EC2 customers who have more predictable workloads have asked us for even greater flexibility in the cost model through the ability to reserve capacity. To address this need we have introduced Amazon EC2 Reserved Instances, which provide customers who have predictable usage patterns with a way to reduce costs even further.

Using Amazon EC2 Reserved Instances is really simple: you make a low one-time payment for each instance you want to reserve and in turn you receive a significant discount on the hourly charges for that instance. Furthermore, you don't pay hourly charges at all during periods when you have the instance turned off. Reserved Instances give customers more flexibility to reduce their IT costs. As these instances work exactly the same as On-Demand Instances, customers have the power to seamlessly extend their reserved base with scalable on-demand capacity.
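The trade-off described above comes down to a break-even calculation: the one-time payment is recovered once the instance has run long enough at the discounted hourly rate. The sketch below shows that arithmetic; the prices are hypothetical placeholders chosen for round numbers, not actual Amazon EC2 rates, and the function names are made up for this example.

```python
def on_demand_cost(hours, hourly_rate):
    """Total cost of running an On-Demand Instance for the given hours."""
    return hours * hourly_rate


def reserved_cost(hours, upfront_fee, discounted_hourly_rate):
    """Total cost of a Reserved Instance: one-time fee plus discounted usage."""
    return upfront_fee + hours * discounted_hourly_rate


def break_even_hours(upfront_fee, hourly_rate, discounted_hourly_rate):
    """Hours of usage after which the reservation becomes the cheaper option."""
    saving_per_hour = hourly_rate - discounted_hourly_rate
    if saving_per_hour <= 0:
        raise ValueError("the reserved rate must be lower than the on-demand rate")
    return upfront_fee / saving_per_hour


if __name__ == "__main__":
    # Hypothetical placeholder prices, for illustration only.
    UPFRONT, ON_DEMAND_RATE, RESERVED_RATE = 300.0, 0.10, 0.04
    hours = break_even_hours(UPFRONT, ON_DEMAND_RATE, RESERVED_RATE)
    print(f"break-even after about {hours:.0f} hours of usage")
```

Because no hourly charge accrues while the instance is off, only the hours you actually run count toward the break-even point.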

Many of our customers find levels of predictability in many of their workloads. This is true not just for those with major computational tasks such as HPC and data mining: many different types of customers, from enterprises running ERP and CRM applications, to media companies running portals and media streaming, to young businesses serving Web-based applications, are able to find a predictable base level of usage that can now be served through Amazon EC2 Reserved Instances.

Reserved Instances also give customers who already have an existing infrastructure in place a transition model that is closer to their current strategy, but at significant cost saving. Customers who currently own their own hardware to meet capacity needs have a total cost of ownership that includes not only the server, network equipment, rack space, etc., but also operational costs such as power, cooling, system administration, etc., which have to be paid regardless of how much of the capacity is being used. Reserved Instances at first glance appear similar, as you make a single upfront payment to reserve capacity, but the operational cost is only paid if the capacity is indeed used. This allows for the best of both worlds: a significant cost reduction while maintaining the benefits of elastic computing.

For more details see the Amazon EC2 detail page and the AWS Developer Blog.

Expanding the Cloud: Expanding Amazon EC2 for Windows

Today we have some important news for our Amazon EC2 customers who are running Windows Server and SQL Server instances and who have been looking to extend their coverage for fault-tolerance and locality reasons. Starting today Windows instances can be launched in an additional Availability Zone in the US and they can also be launched in two Availability Zones in Europe. This allows developers who use our Windows instances to build solutions that can tolerate various failure and recovery scenarios. It also puts Windows Server into the hands of developers who want low latency for their European customers.

We have also integrated these features into the AWS console, so that you can now use the console to launch instances in any of our regions, regardless of whether it is Europe or the US. Popular third party tools such as Elastic Fox have also been upgraded and provide additional functionality such as advanced naming and tagging.

Rolling out services and features globally is very important to us as it allows our customers to take a uniform approach to serving their customers world-wide.

Eventually Consistent - Revisited

I wrote a first version of this posting on consistency models about a year ago, but I was never happy with it as it was written in haste and the topic is important enough to receive a more thorough treatment. ACM Queue asked me to revise it for use in their magazine and I took the opportunity to improve the article. This is that new version.

Eventually Consistent - Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.

At the foundation of Amazon's cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

Under the covers these services are massive distributed systems that operate on a worldwide scale. This scale creates additional challenges, because when a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and need to be accounted for up front in the design and architecture of the system. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions the customers of these services will be confronted with the consequences of using replication techniques inside the services.

One of the ways in which this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we use a set of guiding principles and abstractions related to large-scale data replication and focus on the trade-offs between high availability and data consistency. In this article I present some of the relevant background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this text appeared as a posting on the All Things Distributed weblog in December 2007 and was greatly improved with the help of its readers.
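A toy model helps to make the term concrete. The sketch below is not how any Amazon service is implemented; it is a deliberately simplified, hypothetical store in which one replica accepts writes immediately and the others catch up only when `propagate` is called, so a reader can observe stale data in between.

```python
class EventuallyConsistentStore:
    """Toy replicated store: replica 0 is updated at once, the rest lazily."""

    def __init__(self, replica_count=3):
        self._replicas = [{} for _ in range(replica_count)]
        self._pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        # The write is acknowledged after reaching a single replica.
        self._replicas[0][key] = value
        for index in range(1, len(self._replicas)):
            self._pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read is served by whichever replica the client happens to reach.
        return self._replicas[replica_index].get(key)

    def propagate(self):
        # Apply outstanding updates in the order they were written.
        for index, key, value in self._pending:
            self._replicas[index][key] = value
        self._pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.write("cart", ["book"])
    print(store.read("cart", 0))  # the replica that took the write
    print(store.read("cart", 2))  # a replica that has not caught up yet
    store.propagate()
    print(store.read("cart", 2))  # same replica after propagation
```

The write stays available even though not every replica has seen it, and the price is the window during which different replicas give different answers.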

Teamwork

A question I get asked frequently is how working in industry is different from working in academia. My answer from the beginning has been that the main difference is teamwork. While in academia there are collaborations among faculty and there are student teams working together, the work is still rather individual, as is the reward structure. In industry you cannot get anything done without teamwork. Products do not get built by individuals but by teams; definition, implementation, delivery and operation are all collaborative processes that have many people from many different disciplines working together.

As such, Information Week's Chief of the Year award cannot be my award. It is an award for all the Amazonians who in the past years have developed technologies and processes that are so innovative that they have defined a whole business landscape: first in ecommerce, and now with Amazon Web Services they are defining Cloud Computing through the delivery of Infrastructure as a Service. Compared to the immense work that was needed to make all of this work, my involvement has been small.

A relentless focus on innovation by all Amazonians has made this possible: from new hardware development to the definition of new business models, from building ultra-reliable storage services to a massively scalable compute cloud, from pervasive monitoring and performance control to revolutionarily efficient software architectures, all at a scale and with a reliability, performance and cost-effectiveness that are unparalleled in today's technology world. All these advances are based on 13 years of experience with building the world's most customer-centric ecommerce operation, and as such the success of AWS is absolutely not the work of a single individual but the success of all Amazonians.

But this is only the beginning. We are intent on building the world's most customer-centric cloud computing operation and, as we have done with ecommerce, we will not accept the old norms of what must be done. We will always focus on what our customers need and work backwards from there. We will continue to innovate and roll out services and features that address the real needs of our customers.

It is still only Day One...

Expanding the Cloud: Amazon EC2 in Europe

Starting today the Amazon Elastic Compute Cloud (EC2) supports the ability to launch instances in multiple geographically distinct regions. The new EU region enables users to launch instances in Europe.

This addresses the requests from many of our European customers and from companies that want to run instances closer to European customers. Over the past year I have visited with many of our European customers and frequently they remarked "if only we had EC2 in Europe". We heard their requests loud and clear and have worked very hard to roll out the European Region. This is a very important milestone on the road to local access to all our services.

These are three of the main drivers for the requests by our customers:

  1. Lower latency from EC2 instances to their clients. The European Region can be accessed with low latency from all major European network hubs.
  2. Low latency access to data stored in the Amazon Simple Storage Service (S3). A large number of customers have stored data in the European Region of Amazon S3. With the new European Region this data can now be accessed with low latency from within EC2 at no cost.
  3. Regulatory requirements may require that data be stored in the EU and/or processing take place within the EU. With the European Regions of Amazon S3 and Amazon EC2 developers now can address those requirements.

The new European Region will also contain two Availability Zones so that developers can build applications that can tolerate a variety of failure scenarios. One can even develop fail-over scenarios that will span multiple continents. Amazon Elastic Block Store will also be available to our customers that launch instances in the European Region.

With the European Regions of Amazon EC2, S3 and SQS, combined with Amazon CloudFront, developers now have a full set of services that can help them address the European market.

I am very excited about the launch of Amazon EC2 in Europe and I am looking forward to working with our European partners and customers to roll out their applications and services in the EU Region.

More details can be found on the Amazon EC2 detail page, the AWS blog and at RightScale.

Expanding the Cloud: Amazon CloudFront

Today marks the launch of Amazon CloudFront, the new Amazon Web Service for content delivery. It integrates seamlessly with Amazon S3 to provide low-latency distribution of content with high data transfer speeds through a world-wide network of edge locations. It requires no upfront commitments and is a pay-as-you-go service in the same style as the other Amazon Web Services.

Amazon CloudFront has been designed to be fast; the service will cache copies of the content in edge locations close to the end-user's location, significantly lowering the access latency to the content. High sustainable data transfer rates can be achieved with the service especially when distributing larger objects.

Amazon CloudFront will be useful for many different application scenarios such as giving your customers low-latency access to popular objects and protecting your site from popularity surges; other popular examples are low-cost delivery of rich media and sustainable fast transfer rates for software distributions.

See also the posting on the AWS Developer weblog and at RightScale.

Amazon CloudFront

Seamless integration

A content delivery service that would extend Amazon S3 has been very high on the wish list of our customers. They were already successfully using Amazon S3 for some of their content distribution needs, but many wanted the choice to do so with even lower latency and with higher data transfer rates to any place in the world.

Customers really appreciate the scalability, reliability and cost-effectiveness of Amazon S3 and the fact that it integrates so easily with Amazon EC2. Amazon CloudFront builds further on that seamless integration by making it really simple to distribute Amazon S3 content world-wide. The combination of the two services is really powerful: Amazon S3 will give you durable storage of your data, and the network of edge locations on three continents used by Amazon CloudFront will deliver the content to your customers with low latency from the most appropriate location.

The network of edge locations

To ensure low-latency delivery, Amazon CloudFront uses a network of edge locations world-wide:

  • United States: Ashburn (VA), Dallas/Fort Worth, Los Angeles, Miami, Newark, Palo Alto, Seattle and St. Louis
  • Europe: Amsterdam, Dublin, Frankfurt and London
  • Asia: Hong Kong and Tokyo

These edge locations work together to direct customers' requests to the edge location that can provide the response with the lowest latency.
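As a rough illustration of what "lowest latency" selection means, consider the sketch below. How Amazon CloudFront actually measures latency and routes requests is not described here, so this is only the simplest possible stand-in: the latency figures are made-up values for a hypothetical client, and the function name is an assumption.

```python
def pick_edge(latencies_ms):
    """Return the edge location with the lowest measured latency.

    latencies_ms: mapping of edge location name -> round-trip time in
    milliseconds as seen from one particular client.
    """
    if not latencies_ms:
        raise ValueError("at least one edge location is required")
    return min(latencies_ms, key=latencies_ms.get)


if __name__ == "__main__":
    # Made-up measurements for a client somewhere in Western Europe.
    measured = {
        "Amsterdam": 12.0,
        "Dublin": 25.0,
        "Frankfurt": 18.0,
        "Ashburn (VA)": 95.0,
        "Tokyo": 240.0,
    }
    print(pick_edge(measured))
```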

Simplicity

Because Amazon CloudFront follows the core principles of all Amazon Web Services it is a unique content delivery service. The simplicity in getting started has been described by many of our early customers as a very important feature.

Using Amazon CloudFront is dead simple:

  1. Put your objects in an Amazon S3 bucket.
  2. Call the CreateDistribution API with the name of the S3 bucket, which will return your distribution's domain name.
  3. Use the new domain name in URLs on your web site or in your application. Whenever these URLs are accessed CloudFront will determine the optimal edge location from which to serve your content.
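The third step amounts to rewriting object references so that they point at the distribution's domain name. A minimal sketch of such a helper is shown below; the function name and the example domain are hypothetical, and the first two steps (uploading to S3 and calling CreateDistribution) are assumed to have been done already.

```python
from urllib.parse import quote


def distribution_url(domain_name, object_key, scheme="http"):
    """Build the URL for an S3 object served through a distribution.

    domain_name: the domain name returned when the distribution was created
    object_key:  the key of the object in the S3 bucket
    """
    # Percent-encode the key but keep "/" so the key hierarchy stays readable.
    return f"{scheme}://{domain_name}/{quote(object_key.lstrip('/'))}"


if __name__ == "__main__":
    # Hypothetical domain name, for illustration only.
    print(distribution_url("d111111abcdef8.cloudfront.net", "images/logo 1.png"))
```

Keeping this in one helper means that switching a site over to the distribution, or back, is a one-line configuration change.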

Many of our private beta customers have reported that it only took them 10-15 minutes from the moment that they first signed up for the service to the moment that Amazon CloudFront was distributing their content.

The second Amazon Web Services principle that sets Amazon CloudFront apart is that no upfront commitments are necessary and you only pay for what you have used. There are no upfront fees or high volume requirements and no negotiations are necessary because we have published low prices from the start. This puts content delivery in the hands of all businesses, and you can exploit the benefits of Amazon's world-wide network of edge locations, regardless of whether you are a highly popular website, a small blog, a complex enterprise application or a developer doing some prototyping.

Tools such as S3Fox have support for Amazon CloudFront built-in such that if you want to avoid any programming you can immediately start exploiting world-wide, low-latency content delivery.

A core distributed systems component

It is not uncommon to think about a service for content delivery such as Amazon CloudFront only in the context of media distribution for web sites, but it actually plays a more fundamental role.

There are two main technology components to such a service; the first is intelligent request routing, which routes requests to the location that can best serve the user given a series of requirements and the status of the network. The second technology component is that of object caching, which is a fundamental building block in both operating systems and in distributed systems.

For example your operating system will have a file cache, where it will store popular, recently-accessed files in memory to provide much faster access and greater throughput. Without a file cache your whole computer would appear much slower as all work would happen at the speed of the disk instead of memory.

Caching is an essential technique that is used to make sure that components can operate at the fastest speed possible, to overcome the performance differences that exist in systems. For example, CPUs have caches that are much faster than memory, memory works as a cache for disks, local disks can function as caches for remote disks, etc.

In distributed systems caching is primarily used to provide fast access to popular objects that are located in remote storage servers. These systems of caching servers often cooperate to create massive aggregate world-wide capacity to provide low latency access. And by using globally decentralized cache servers for distribution, very high data transfer speed can be achieved.
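The core idea of keeping popular objects close to the reader can be shown with a very small cache in front of a slow origin. The sketch below uses least-recently-used eviction, which is an assumption made for this example; nothing above says which eviction policy any particular system uses, and the class and parameter names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Small cache that fetches from an origin on a miss and evicts the
    least recently used entry when it is full."""

    def __init__(self, capacity, fetch_from_origin):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._fetch = fetch_from_origin  # callable: key -> object
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Popular object: serve locally and mark it as recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Miss: pay the cost of going to the remote origin once.
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the oldest entry
        return value
```

Wrapping a slow remote read in `LRUCache(capacity, fetch)` means repeated requests for the same popular key are answered locally, and the hit and miss counters show how well the cache size matches the access pattern.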

Caching technology has long been a centerpiece of computer systems research, and in Amazon CloudFront we use the type of highly advanced algorithms for reliability and scale that you have come to expect from our Amazon services.

Many of our customers will look to Amazon CloudFront for rock-solid content distribution for websites, but its application is not limited to that. Developers can easily integrate the service into their desktop and server applications and benefit from the advanced routing and caching that Amazon CloudFront offers. For example, enterprise-style applications such as NASDAQ's Market Replay application are ideal candidates for integrating Amazon CloudFront to provide low latency access to popular market data while reducing the cost of data transfers.

Graphic by Renato Valdés Olmos of Postmachina