All Things Distributed
<img src=”/images/10yrsseed.jpg”/ width="650">
The epoch of AWS is the launch of Amazon S3 on March 14, 2006, now almost 10 years ago. Looking back over the past 10 years, there are hundreds of lessons that we’ve learned about building and operating services that need to be secure, reliable, scalable, with predictable performance at the lowest possible cost. Given that AWS is a pioneer in building and operating these services world-wide, these lessons have been of crucial importance to our business. As we’ve said many times before, “There is no compression algorithm for experience.” With over a million active customers per month, who in turn may serve hundreds of millions of their own customers, there is no lack of opportunities to gain more experience and perhaps no better environment for continuous improvement in the way we serve our customers.
I have picked a few of these lessons to share with you in the hope that they may be of use for you as well.
1. Build evolvable systems
Almost from day one, we knew that the software we were building would not be the software that would be running a year later. The expectation was that with each order or two of magnitude, we would need to revisit and revise the architecture to make sure we could address the issues of scale.
But we couldn’t adopt the old style approach of upgrading systems through a maintenance outage, as many businesses around the world are relying on our platform for 24/7 availability. We needed to build such an architecture that we could introduce new software components without taking the service down. Marvin Theimer, Amazon Distinguished Engineer, once jokingly said that the evolution of Amazon S3 could best be described as starting off as a single engine Cessna plane, but over time the plane was upgraded to a 737, then a group of 747s, all the way to the large fleet of Airbus 380s that it is now. All the while, we were refueling in midair and moving customers from plane to plane without them even realizing it.
2. Expect the unexpected
Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures. This is a given, whether you are using the highest quality hardware or lowest cost components.
This becomes an even more important lesson at scale: for example, as S3 processes trillions and trillions of storage transactions, anything that has even the slightest probability of error will become realistic. Many of those failure scenarios can be anticipated beforehand, but many more are unknown at design and build time.
We needed to build systems that embrace failure as a natural occurrence even if we did not know what the failure might be. Systems need to keep running even if the “house is on fire.” It is important to be able to manage pieces that are impacted without the need to take the overall system down. We’ve developed the fundamental skill of managing the “blast radius” of a failure occurrence such that the overall health of the system can be maintained.
3. Primitives not frameworks
Pretty quickly, we started to realize that the way customers would like to use our services was a work in progress. When customers left the constraining, old world of IT hardware and datacenters behind, they started to develop systems with new and interesting usage patterns that no one had ever seen before. As such, we needed to be ultra-agile to make sure we were catering to our customers’ needs.
One of the most important mechanisms we provided was to offer customers a collection of primitives and tools, where they could pick and choose their preferred way to engage with the AWS cloud, instead of only providing one framework that they are forced to use, which includes everything and the kitchen sink. This approach has enabled our customers to become so successful, that even later generations of AWS services make use of exactly the same primitive services our customers have become accustomed to.
It is also important to realize that it is hard to predict what certain priorities are for your customers until they have the service in their hands and actually start building with it. This is why we deliver new services often with a minimal feature set and allow our customers to help drive the roadmap for extending the service with new features.
4. Automation is key
Developing software services that need to be operated is radically different from building software that needs to be shipped to customers. Managing systems at scale requires a very different mindset to ensure that we meet the reliability, performance, and scalability expectations of our customers.
A key mechanism to achieve this is to automate the management as much as possible, removing error prone, manual operations. To do this, we needed to build management APIs that control the key functionality of our operations. AWS helps its customers do this too. By decomposing your applications into essential building blocks, each with a management API, you can apply automation rules to maintain reliable and predictable performance at scale. A good litmus test has been that if you need to SSH into a server or an instance, you still have more to automate.
5. APIs are forever
This was a lesson we had already learned from our experiences with Amazon retail, but it became even more important for AWS’s API-centric business. Once customers started building their applications and systems using our APIs, changing those APIs becomes impossible, as we would be impacting our customer’s business operations if we would do so. We knew that designing APIs was a very important task as we’d only have one chance to get it right.
6. Know your resource usage
When building a financial model for a service to identify the appropriate charging model, be sure to have good data about the cost of the service and its operations, especially for running a high volume – low margin business. AWS needed to be very conscious as a service provider about our costs so that we could afford to offer our services to customers and identify areas where we could drive operational efficiencies to cut costs further, and then offer those savings back to our customers in the form of lower prices.
An example in the early days where we did not know the resources required to serve certain usage patterns was with S3: We had assumed that the storage and bandwidth were the resources we should charge for; after running for a while, we realized that the number of requests was an equally important resource. If customers have many tiny files, then storage and bandwidth don’t amount to much even if they are making millions of requests. We had to adjust our model to account for the all the dimensions of resource usage so that AWS could be a sustainable business.
7. Build security in from the ground up
Protecting your customers should always be your number one priority, and it certainly has been for AWS… from both an operational perspective as well as tools and mechanisms; it will forever be our number one investment area.
One approach that we learned quickly is that to build secure services, it is necessary to integrate security at the very beginning of service design. The security team is not a group that does validation after something has been built. They must be partners on day one to make sure that security is fundamentally rock solid from the ground up. There is no compromise when it comes to security.
8. Encryption is a first-class citizen
Encryption is a key mechanism for customers to ensure that they are in full control over who has access to their data. Ten years ago, tools and services for encryption were hard to use and it wasn’t until a few years into our operations that we learned how to best integrate encryption into our services.
It started by providing server-side encryption in S3 for compliance use cases. If you would inspect any disks in our datacenters, none of the data would be accessible. But with the launch of Amazon CloudHSM (for hardware security models) and later Amazon Key Management Service, customers could use their own keys for encryption, which removed the need for AWS to manage their keys.
For some time now, support for encryption has been integrated at the design phase of each new service. For example, in Amazon Redshift, each of the data blocks is encrypted by default with a random key and the collection of these random keys is again encrypted with a master key. The master key can be provided by customers, ensuring that they are the only ones who can decrypt and have access to their critical business data or personal identifiable information.
Encryption continues to be a high priority for our business. We will continue to make it even easier for our customers to make use of encryption so they can better protect themselves and their customers.
9. The importance of the network
AWS has come to support many different workloads; from high-volume transaction processing to video transcoding at scale, from high-performance parallel computing to massive web site traffic. Each of those workloads have unique requirements when it comes to the network.
AWS has developed a unique skill to innovate datacenter layout and operations, such that we can have flexible network infrastructure that can be adapted to meet the needs of our customers’ workloads, whatever they may be. We have learned over time that we should not be afraid to develop our own hardware solutions to ensure our customers can achieve their goals. This enables us to meet our very specific requirements, such as the ability to isolate AWS customers from each other on the network to achieve the highest levels of security.
Another successful example of how AWS-designed networking hardware and software enabled us to further improve performance for our customers was in addressing the virtualization tax on network access from virtual machines. Because network access is a shared resource, customers previously could experience significant jitter on the network at times. Developing a NIC that supported single root IO virtualization allowed us to give each VM its own hardware virtualized NIC. This lowered latency more than 2x and delivered more than 10x improvement in latency variability on the network.
10. No gatekeepers
The AWS team has delivered many services and features over time to create a very broad and deep platform for our customers. But AWS is so much more than the services that we built in-house: a very rich ecosystem exists of services delivered by our partners that extends the platform into many new directions.
For example, we have partners like Stripe offering payment services to Twilio making telephony programmable on AWS. Many of our customers are also building platforms themselves on top of AWS to serve specific vertical needs: Philips is building their Healthsuite Digital Platform for managing healthcare data, Ohpen has built a platform for retail banking on AWS, Eagle Genomics has built a genomics processing platform, and many more. What’s essential is that there are no gatekeepers on the AWS platform that tell our partners what they can and cannot do. “No gatekeepers” liberates the innovative processes and opens the door for many unexpected inventions, which are sure to follow.
I am looking forward to seeing what we learn – and AWS customers accomplish – over the next 10 years. And remember, it is still Day One …