Customer Centricity at Amazon Web Services


In the 2013 Amazon Shareholder Letter, Jeff Bezos explained our decision to pursue a customer-centric approach to our business:

As regular readers of this letter will know, our energy at Amazon comes from the desire to impress customers rather than the zeal to best competitors. We don’t take a view on which of these approaches is more likely to maximize business success. There are pros and cons to both and many examples of highly successful competitor-focused companies. We do work to pay attention to competitors and be inspired by them, but it is a fact that the customer-centric way is at this point a defining element of our culture.

AWS has built a reputation over the years for the breadth and depth of our services and the pace of our innovation, with 280 features released in 2013. One area we don’t spend a lot of time discussing is the significant investments we’ve made in building a World Class Customer Service and Technical Support function. These are the people working behind the scenes who help customers fully leverage all of AWS’s capabilities when running their infrastructure on AWS. We launched AWS Support in 2009, and since the launch the mission has remained constant: to help customers of all sizes and technical abilities successfully utilize the products and features provided by AWS.

Customers are frequently surprised to hear that we have a Support organization that not only helps customers via email, phone, chat, or web cases, but also builds innovative software to deliver better customer experiences. In recent years, this team has released technology such as Support for Health Checks, AWS Trusted Advisor, Support APIs, Trusted Advisor APIs, and many more. One customer-facing feature Jeff highlighted in the Shareholder letter was AWS Trusted Advisor, a tool that our support organization built to move support from reactive help to proactive, preventative help.

I can keep going – Kindle Fire’s FreeTime, our customer service Andon Cord, Amazon MP3’s AutoRip – but will finish up with a very clear example of internally driven motivation: Amazon Web Services. In 2012, AWS announced 159 new features and services. We’ve reduced AWS prices 27 times since launching 7 years ago, added enterprise service support enhancements, and created innovative tools to help customers be more efficient. AWS Trusted Advisor monitors customer configurations, compares them to known best practices, and then notifies customers where opportunities exist to improve performance, enhance security, or save money. Yes, we are actively telling customers they’re paying us more than they need to. In the last 90 days, customers have saved millions of dollars through Trusted Advisor, and the service is only getting started. All of this progress comes in the context of AWS being the widely recognized leader in its area – a situation where you might worry that external motivation could fail. On the other hand, internal motivation – the drive to get the customer to say “Wow” – keeps the pace of innovation fast.

We’ve always focused on hiring highly skilled support engineers, with all hires going through the same technical certification process (Tech Bar Raisers) as any of our developers building services. Over the years, we have scaled the AWS Support organization to meet customer needs: from a team of Linux sysadmins with strong networking skills in one location to a large global team in 17 locations around the world, with Windows sysadmins, networking engineers, DBAs, security specialists, developers, and many more specializations. In 2013, we spent a lot of time developing sophisticated internal tools that make supporting our customers more efficient, including intelligent skills-based case routing tools that provide in-depth technical information to the engineers who address customer needs. Our customers tell us that the service is 78% better than it was 3 years ago.

Customers can feel confident that AWS will work every day to deliver World Class Support. I wanted to share a new video in which our customers discuss this critical behind-the-scenes function in a little more detail. After watching, I believe you will have a better perspective on our mission, our support options, and the benefits that our customers derive from their use of AWS Support:

Updated Lampson's Hints for Computer Systems Design


This year I have not been able to publish many back-to-basics readings, so I will not close the year with a recap of those. Instead, I have a video of a wonderful presentation by Butler Lampson in which he talks about the lessons of the past decades that helped him update his excellent 1983 "Hints for Computer System Design".

The presentation was part of the Heidelberg Laureate Forum held in September of this year. At the Forum, many of the Abel, Fields, and Turing laureates gave presentations. Famous computer scientists such as Fernando Corbató, Stephen Cook, Edward Feigenbaum, Juris Hartmanis, John Hopcroft, Alan Kay, and Vinton Cerf were all at the Forum. You can find a list of selected video presentations here.

For me the highlight was Butler’s presentation on Hints and Principles for Computer System Design. I include it here as it is absolutely worth watching.

We launched DynamoDB last year to address the need for a cloud database that provides seamless scalability, irrespective of whether you are doing ten transactions or ten million transactions, while providing rock-solid durability and availability. Our vision from the day we conceived DynamoDB was to fulfill this need without limiting the query functionality that people have come to expect from a database. However, we also knew that building a distributed database that has unlimited scale and maintains predictably high performance while providing rich and flexible query capabilities is one of the hardest problems in database development, and that solving it would take a lot of effort and invention from our team of distributed database engineers. So when we launched in January 2012, we provided simple query functionality that used hash primary keys or composite primary keys (hash + range). Since then, we have been working on adding flexible querying. You saw the first iteration in April 2013 with the launch of Local Secondary Indexes (LSI). Today, I am thrilled to announce a fundamental expansion of the query capabilities of DynamoDB with the launch of Global Secondary Indexes (GSI). This new capability allows indexing any attribute (column) of a DynamoDB table and performing high-performance queries at any table scale.

Going beyond Key-Value

Advanced key-value data stores such as DynamoDB achieve high scalability on loosely coupled clusters by using the primary key as the partitioning key to distribute data across nodes. Even though the resulting query functionality may appear more limited than that of a relational database on cursory examination, it works exceedingly well for a wide range of applications, as is evident from DynamoDB's rapid growth and adoption by customers like Electronic Arts, Scopely, HasOffers, SmugMug, AdRoll, Dropcam, Digg, and by many teams at Amazon.com (Cloud Drive, Retail). DynamoDB continues to be embraced for workloads in gaming, ad tech, mobile, web apps, and other segments where scale and performance are critical. At Amazon.com, we increasingly default to DynamoDB instead of relational databases when we don’t need complex query, table join, and transaction capabilities, as it offers a more available, more scalable, and ultimately lower-cost solution.

For non-primary-key access in advanced key-value stores, a user has to resort to either maintaining a separate table or some form of scatter-gather query across partitions. Both of these options are less than ideal. For instance, maintaining a separate table for indexes forces users to maintain consistency between the primary key table and the index tables. With a scatter-gather query, on the other hand, the query must be scattered across more and more partitions as the dataset grows, resulting in poor performance over time. DynamoDB's new Global Secondary Indexes remove this fundamental restriction by allowing "scaled out" indexes without ever requiring any bookkeeping on behalf of the developer. Now you can run queries on any item attributes (columns) in your DynamoDB table. Moreover, a GSI's performance is designed to meet DynamoDB's single-digit millisecond latency: you can add items to a Users table for a gaming app with tens of millions of users with UserId as the primary key, yet retrieve them based on their home city with no reduction in query performance.

DynamoDB Refresher

DynamoDB stores information as database tables, which are collections of individual items. Each item is a collection of data attributes. The items are analogous to rows in a spreadsheet, and the attributes are analogous to columns. Each item is uniquely identified by a primary key, which is composed of its first two attributes, called the hash and range. DynamoDB queries refer to the hash and range attributes of items you’d like to access. These query capabilities so far have been based on the default primary index and optional local secondary indexes of a DynamoDB table:

  • Primary Index: Customers can choose from two types of keys for primary index querying: Simple Hash Keys and Composite Hash Key / Range Keys. Simple Hash Key gives DynamoDB the Distributed Hash Table abstraction. The key is hashed over the different partitions to optimize workload distribution. For more background on this please read the original Dynamo paper. Composite Hash Key with Range Key allows the developer to create a primary key that is the composite of two attributes, a “hash attribute” and a “range attribute.” When querying against a composite key, the hash attribute needs to be uniquely matched but a range operation can be specified for the range attribute: e.g. all orders from Werner in the past 24 hours, or all games played by an individual player in the past 24 hours.
  • Local Secondary Index: Local Secondary Indexes allow the developer to create indexes on non-primary key attributes and quickly retrieve records within a hash partition (i.e., items that share the same hash value in their primary key): e.g., if there is a DynamoDB table with PlayerName as the hash key and GameStartTime as the range key, you can use local secondary indexes to run efficient queries on other attributes like “Score.” The query “Show me John’s all-time top 5 scores” will return results automatically ordered by score (see the sketch after this list).
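
To make both query styles concrete, here is a minimal sketch using the AWS SDK for Python (boto3); the table name GameSessions and the index name ScoreIndex are hypothetical, following the scenario above.

```python
import time

import boto3  # assumes AWS credentials and a default region are configured

dynamodb = boto3.client("dynamodb")

# Composite-key query: all games played by John in the past 24 hours.
cutoff = int(time.time()) - 24 * 60 * 60
recent = dynamodb.query(
    TableName="GameSessions",  # hypothetical table from the example above
    KeyConditionExpression="PlayerName = :p AND GameStartTime > :t",
    ExpressionAttributeValues={
        ":p": {"S": "John"},
        ":t": {"N": str(cutoff)},
    },
)

# Local secondary index query: John's all-time top 5 scores, assuming an
# LSI named ScoreIndex with Score as its range key.
top5 = dynamodb.query(
    TableName="GameSessions",
    IndexName="ScoreIndex",
    KeyConditionExpression="PlayerName = :p",
    ExpressionAttributeValues={":p": {"S": "John"}},
    ScanIndexForward=False,  # highest scores first
    Limit=5,
)
```

In both cases the hash attribute is matched exactly, while the range attribute (or the LSI's alternate range key) determines ordering and range conditions.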

What are Global Secondary Indexes?

Global secondary indexes allow you to efficiently query over the whole DynamoDB table, not just within a partition as local secondary indexes do, using any attributes (columns), even as the DynamoDB table horizontally scales to accommodate your needs. Let’s walk through another gaming example. Consider a table named GameScores that keeps track of users and scores for a mobile gaming application. Each item in GameScores is identified by a hash key (UserId) and a range key (GameTitle). The following diagram shows how the items in the table would be organized. (Not all of the attributes are shown.)

Now suppose that you wanted to write a leaderboard application to display top scores for each game. A query that specified the key attributes (UserId and GameTitle) would be very efficient; however, if the application needed to retrieve data from GameScores based on GameTitle only, it would need to use a Scan operation. As more items are added to the table, scans of all the data would become slow and inefficient, making it difficult to answer questions such as:

  • What is the top score ever recorded for the game "Meteor Blasters"?
  • Which user had the highest score for "Galaxy Invaders"?
  • What was the highest ratio of wins vs. losses?

To speed up queries on non-key attributes, you can specify global secondary indexes. For example, you could create a global secondary index named GameTitleIndex, with a hash key of GameTitle and a range key of TopScore. Since the table’s primary key attributes are always projected into an index, the UserId attribute is also present. The following diagram shows what the GameTitleIndex index would look like:

Now you can query GameTitleIndex and easily obtain the scores for "Meteor Blasters". The results are ordered by the range key, TopScore.
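
Here is a minimal sketch of defining and querying GameTitleIndex with boto3; the schema follows the example above, and the capacity numbers are illustrative.

```python
import boto3  # assumes AWS credentials and a default region are configured

dynamodb = boto3.client("dynamodb")

# Create GameScores with a global secondary index keyed on
# GameTitle (hash) and TopScore (range).
dynamodb.create_table(
    TableName="GameScores",
    AttributeDefinitions=[
        {"AttributeName": "UserId", "AttributeType": "S"},
        {"AttributeName": "GameTitle", "AttributeType": "S"},
        {"AttributeName": "TopScore", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "UserId", "KeyType": "HASH"},
        {"AttributeName": "GameTitle", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "GameTitleIndex",
        "KeySchema": [
            {"AttributeName": "GameTitle", "KeyType": "HASH"},
            {"AttributeName": "TopScore", "KeyType": "RANGE"},
        ],
        # The table's primary key attributes are always projected.
        "Projection": {"ProjectionType": "KEYS_ONLY"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)

# Top 5 scores for "Meteor Blasters", highest first.
top_scores = dynamodb.query(
    TableName="GameScores",
    IndexName="GameTitleIndex",
    KeyConditionExpression="GameTitle = :t",
    ExpressionAttributeValues={":t": {"S": "Meteor Blasters"}},
    ScanIndexForward=False,  # descending by the range key, TopScore
    Limit=5,
)
```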

Efficient Queries

Traditionally, databases have been scaled as a whole – tables and indexes together. While this may appear simple, it masks the underlying complexity of varying needs for different types of queries, and consequently different indexes, which results in wasted resources. With global secondary indexes in DynamoDB, you can now have many indexes and tune their capacity independently. These indexes also provide query/cost flexibility, allowing a custom level of clustering to be defined per index: developers can specify which attributes should be “projected” into the secondary index, allowing faster access to often-accessed data while avoiding extra read/write costs for other attributes.
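
As a sketch of that independent tuning, here is how you might dial up just the index's read capacity through the UpdateTable API in boto3, leaving the base table untouched; the numbers are arbitrary.

```python
import boto3  # assumes AWS credentials and a default region are configured

dynamodb = boto3.client("dynamodb")

# Raise the read capacity of GameTitleIndex without changing the
# throughput provisioned for the GameScores table itself.
dynamodb.update_table(
    TableName="GameScores",
    GlobalSecondaryIndexUpdates=[{
        "Update": {
            "IndexName": "GameTitleIndex",
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 50,
                "WriteCapacityUnits": 5,
            },
        }
    }],
)
```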

Start with DynamoDB

The enhanced query flexibility that global and local secondary indexes provide means DynamoDB can support an even broader range of workloads. When designing a new application that will operate in the AWS cloud, take a look at DynamoDB first when selecting a database. If you don’t need the table join capabilities of relational databases, you will be better served from a cost, availability, and performance standpoint by using DynamoDB. If you need support for transactions, use the recently released transaction library. You can also use GSI features with DynamoDB Local for offline development of your application. As your application becomes popular and goes from being used by thousands of users to millions or even tens of millions of users, you will not have to worry about the typical performance and availability bottlenecks that applications face with relational databases, which often force an application re-architecture. You can simply dial up the provisioned throughput that your app needs from DynamoDB and we will take care of the rest without any impact on the performance of your app.
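
For the offline-development path, here is a sketch of pointing boto3 at DynamoDB Local, which listens on port 8000 by default; the dummy credentials are placeholders the SDK requires but DynamoDB Local does not validate.

```python
import boto3

# Connect to a locally running DynamoDB Local instance; the same code
# runs unchanged against the real service once the endpoint is removed.
local = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="us-east-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)
print(local.list_tables())
```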

Dropcam tells us that they adopted DynamoDB for seamless scalability and performance as they continue to innovate on their cloud-based monitoring platform, which has grown to become one of the largest video platforms on the internet today. With GSIs, they do not have to choose between scalability and query flexibility and instead can get both out of their database. Guerrilla Games, the developer of Killzone Shadow Fall, uses DynamoDB for online multiplayer leaderboards and game settings, and will be leveraging GSIs to add more features and increase database performance. Also, Bizo, a B2B digital marketing platform, uses DynamoDB for audience targeting; GSIs will enable lookups using evolving criteria across multiple datasets.

These are just a few examples of where GSIs can help, and I am looking forward to our customers building scalable businesses with DynamoDB. I want application writers to focus on their business logic, leaving the heavy lifting of maintaining consistency across look-up attributes to DynamoDB. To learn more, see Jeff Barr’s blog and the DynamoDB developer guide.

As I discussed in my re:Invent keynote earlier this month, I am now happy to announce the immediate availability of Amazon RDS Cross Region Read Replicas, which is another important enhancement for our customers using or planning to use multiple AWS Regions to deploy their applications. Cross Region Read Replicas are available for MySQL 5.6 and enable you to maintain a nearly up-to-date copy of your master database in a different AWS Region. In case of a regional disaster, you can simply promote your read replica in a different region to a master and point your application to it to resume operations. Cross Region Read Replicas also enable you to serve read traffic for your global customer base from regions that are nearest to them.
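
In API terms, here is a minimal sketch with boto3 of both steps: creating the replica in a destination region from the source instance's ARN, and promoting it to a standalone master in a disaster. All identifiers and regions are illustrative.

```python
import boto3

# Create the replica in the destination region, referencing the source
# database by its ARN.
rds_west = boto3.client("rds", region_name="us-west-2")
rds_west.create_db_instance_read_replica(
    DBInstanceIdentifier="mydb-replica",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:mydb",
    SourceRegion="us-east-1",
)

# In a regional disaster, promote the replica and point the application at it.
rds_west.promote_read_replica(DBInstanceIdentifier="mydb-replica")
```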

About 5 years ago, I introduced you to AWS Availability Zones, which are distinct locations within a Region that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low-latency network connectivity to other Availability Zones in the same Region. Availability Zones have since become the foundational elements for AWS customers to create a new generation of highly available distributed applications in the cloud that are designed to be fault tolerant from the get-go. We also made it easy for customers to leverage multiple Availability Zones to architect the various layers of their applications with a few clicks in the AWS Management Console, using services such as Amazon Elastic Load Balancing, Amazon RDS, and Amazon DynamoDB. In addition, Amazon S3 redundantly stores data in multiple facilities and is designed for 99.999999999% durability and 99.99% availability of objects over a given year. Our SLAs offer even more confidence to customers running applications across multiple Availability Zones. Amazon RDS offers a monthly uptime percentage SLA of 99.95% per Multi-AZ database instance. Amazon EC2 and EBS offer a monthly uptime percentage SLA of 99.95% for instances running across multiple Availability Zones.

As AWS expanded to 9 distinct AWS Regions and 25 Availability Zones across the world during the last few years, many of our customers started to leverage multiple AWS Regions to further enhance the reliability of their applications for disaster recovery. For example, when a disastrous earthquake hit Japan in March 2011, many customers in Japan came to AWS to take advantage of the multiple Availability Zones. In addition, they also backed up their data from the AWS Tokyo Region to the AWS Singapore Region as an additional measure for business continuity. In a similar scenario here in the United States, Milind Borate, the CTO of Druva, an enterprise backup company using AWS, told me that after Hurricane Sandy he got an enormous amount of interest from his customers in the northeastern US to replicate their data to other parts of the country for disaster recovery.

Up until AWS and the cloud, reliable disaster recovery had largely remained cost prohibitive for most companies except large enterprises. It traditionally involved the expense and headaches associated with procuring new co-location space, negotiating pricing with a new vendor, adding racks, setting up network links and encryption, taking backups, initiating a transfer, and monitoring it until the operation completed. While the infrastructure costs for basic disaster recovery could be very high, the associated system and database administration costs could be just as much or more. Even after incurring these costs, given the complexity, customers could still find themselves in a situation where the restoration process did not meet their recovery time objective and/or recovery point objective. AWS provides several easy-to-use and cost-effective building blocks to make disaster recovery very accessible to customers. Using the S3 copy functionality, you can copy the objects/files that are used by your application from one AWS Region to another. You can use the EC2 AMI copy functionality to make your server images available in multiple AWS Regions. In the last 12 months, we launched EBS Snapshot Copy, RDS Snapshot Copy, DynamoDB Data Copy, and Redshift Snapshot Copy, all of which help you easily restore the full stack of your application environments in a different AWS Region for disaster recovery. Amazon RDS Cross Region Read Replicas are another important enhancement for supporting these disaster recovery scenarios.
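
As an illustration of these building blocks, here is a sketch of two of the copy operations with boto3, both run from the destination (disaster recovery) region; all identifiers are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
rds = boto3.client("rds", region_name="eu-west-1")

# Copy a server image so instances can be launched in the DR region.
ec2.copy_image(
    Name="webserver-dr",
    SourceImageId="ami-12345678",
    SourceRegion="us-east-1",
)

# Copy an RDS snapshot so the database can be restored in the DR region.
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snap",
    TargetDBSnapshotIdentifier="mydb-snap-dr",
)
```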

We have heard from Joel Callaway of Zoopla, a property listing and house prices website in the UK that attracts over 20 million visits per month, that they are using the RDS Snapshot Copy feature to easily transfer hundreds of GB of their RDS databases from the US East Region to the EU West (Dublin) Region every week using a few simple API calls. Joel told us that prior to using this feature it used to take them several days and many manual steps to set up a similar disaster recovery process. Joel also told us that he is looking forward to using Cross Region Read Replicas to further enhance their disaster recovery objectives.

AWS customers come from over 190 countries, and a lot of them in turn have global customers. Cross Region Read Replicas also make it even easier for our global customers to scale database deployments to meet the performance demands of high-traffic, globally dispersed applications. This feature enables our customers to better serve read-heavy traffic from an AWS Region closer to their end users to provide a faster response time. Medidata delivers cloud-based clinical trial solutions using AWS that enable physicians to look up patient records quickly and avoid prescribing treatments that might counteract the patient’s clinical trial regimen. Isaac Wong, VP of Platform Architecture at Medidata, told us that their clinical trial platform is global in scope, and the ability to move data closer to the doctors and nurses participating in a trial anywhere in the world through Cross Region Read Replicas enables them to shorten read latencies and allows their health professionals to serve their patients better. Isaac also told us that using the cross-region replication features of RDS, he is able to ensure that life-critical services of their platform are not affected by regional disruption. These are great examples of how many of our customers are very easily and cost-effectively able to implement disaster recovery solutions as well as design globally scalable web applications using AWS.

Note that building a reliable disaster recovery solution requires that every component of your application architecture, be it a web server, load balancer, application, cache, or database server, is able to meet the recovery point and recovery time objectives you have for your business. If you are going to take advantage of Cross Region Read Replicas in RDS, make sure to monitor the replication status through DB Event Notifications and the ReplicaLag metric through CloudWatch to ensure that your read replica is always available and keeping up. Refer to the Cross Region Read Replica section of the Amazon RDS User Guide to learn more.
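
As a sketch of that monitoring step, here is how you might pull the ReplicaLag metric from CloudWatch with boto3; the instance identifier is a placeholder.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Average replication lag, in seconds, for the replica over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "mydb-replica"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```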

AWS re:Invent 2013


Today we are kicking off AWS re:Invent 2013. Over the course of the next three days, we will host more than 200 sessions, training bootcamps, and hands on labs taught by expert AWS staff as well as dozens of our customers.

This year’s conference kicks off with a keynote address by AWS Senior Vice President Andy Jassy, followed by my keynote on Thursday morning. Tune in to hear the latest from AWS and our customers.

If you’re not already here in Vegas with us, you can sign up to watch the keynotes on live stream here.

Outside of the keynotes, there are an incredible number of sessions offering a tailored experience, whether you are a developer, a startup, an executive, or a partner. You can see the full session catalog here. I’m impressed by the scale and technical depth of what’s offered to attendees.

After my keynote on Thursday I will host two fireside chat sessions with cloud innovators and industry influencers:

First, I’ll talk with three technical startup founders.

In the second session, I will talk with three startup influencers.

I will follow those two sessions with Startup Launches, where five companies will either launch their business or a significant feature entirely built on AWS. It will be a busy, fun, and informative afternoon!

I look forward to seeing you around the conference.

Speed of development, scalability, and simplicity of management are among the critical needs of mobile developers. With the proliferation of mobile devices and users, and small agile teams that are tasked with building successful mobile apps that can grow from 100 users to 1 million users in a few days, scalability of the underlying infrastructure and simplicity of management are more important than ever. We created DynamoDB to make it easy to set up and scale databases so that developers can focus on building great apps without worrying about the muck of managing the database infrastructure. As I have mentioned previously, companies like Crittercism and Dropcam have already built exciting mobile businesses leveraging DynamoDB. Today, we are further simplifying mobile app development with our newest DynamoDB feature, Fine-Grained Access Control, which gives you the ability to directly and securely access mobile application data in DynamoDB.

One of the pieces of a mobile infrastructure that developers have to build and maintain is the fleet of proxy servers that authorize requests coming from millions of mobile devices. This proxy tier allows vetted requests to continue to DynamoDB and then filters responses so the user only receives permitted items and attributes. So, if I am building a mobile gaming app, I must run a proxy fleet that ensures “johndoe@gmail.com” only retrieves his game state and nothing else. While Web Identity Federation, which we introduced a few months back, allowed using public identity providers such as Login with Amazon, Facebook, or Google for authentication, it still required a developer to build and deploy a proxy layer in front of DynamoDB for this type of authorization.

With Fine-Grained Access Control, we solve this problem by enabling you to author access policies that include conditions that describe additional levels of filtering and control. This eliminates the need for the proxy layer, simplifies the application stack, and results in cost savings. Using access control this way involves a setup phase of authenticating the user (step 1) and obtaining IAM credentials (step 2). After these steps, the mobile app may directly perform permitted operations on DynamoDB (step 3).

With today’s launch, apps running on mobile devices can send workloads to a DynamoDB table, row, or even a column without going through an intervening proxy layer. For instance, the developer of a mobile app can use Fine-Grained Access Control to restrict the synchronization of user data (e.g., game history) across the many devices on which the user has the app installed. This capability allows apps running on mobile devices to modify only rows belonging to a specific user. Also, by consolidating users’ data in a DynamoDB table, you can obtain real-time insights over the user base, at large scale, without going through expensive joins and batch approaches such as scatter/gather.
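
To make the row-level restriction concrete, here is a sketch of such an access policy built around the dynamodb:LeadingKeys and dynamodb:Attributes condition keys; the table name, account id, and attribute list are placeholders, and the policy would be attached to the role that federated mobile users assume.

```python
import json

# Each Login with Amazon user may only touch items whose hash key equals
# their own user id, and only the listed attributes of those items.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "dynamodb:GetItem",
            "dynamodb:PutItem",
            "dynamodb:UpdateItem",
            "dynamodb:Query",
        ],
        "Resource": ["arn:aws:dynamodb:us-east-1:123456789012:table/GameState"],
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${www.amazon.com:user_id}"],
                "dynamodb:Attributes": ["UserId", "GameTitle", "SaveData"],
            }
        },
    }],
}
print(json.dumps(policy, indent=2))
```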

To get started, please see the Fine-Grained Access Control documentation and Jeff Barr’s blog.

Many of you know Thorsten von Eicken as the founder of RightScale, the company that has helped numerous organizations find their way onto AWS. In what seems almost a previous life by now, Thorsten was one of the top young professors in distributed systems, and I had the great pleasure of working with him at Cornell in the early 90's. What set Thorsten apart from so many other systems research academics was his desire to build practical, working systems, a path that I followed as well.

In the back-to-basics readings this week I am re-reading a paper from 1995 about the work that I did together with Thorsten on solving the problem of end-to-end low-latency communication on high-speed networks. The problem we were facing in those days was that many new high-speed network technologies, such as ATM, became available for standard workstations, but the operating systems were not able to deliver those capabilities to their applications. Throughput was often acceptable, but individual message latency was as bad as over regular Ethernet, a problem that Chandu Thekkath had described earlier in "Limits to Low-Latency Communication on High-Speed Networks".

The lack of low-latency communication meant that distributed systems (e.g., database replication, fault-tolerance protocols) could not benefit from these advances at the network level. The research to unlock these capabilities led to an architecture called U-Net. What set U-Net apart from other research was that it was first and foremost an engineering effort, as we set out to build a system that actually had to function in production. Many of those engineering experiences found their way back into the paper. U-Net also heavily influenced what later became the Virtual Interface Architecture industry standard.

The work on U-Net was continued by Matt Welsh, who built, among other things, a version that could be used with Fast Ethernet on regular PCs and one that could safely integrate into type-safe environments such as the JVM.

U-Net: A User-Level Network Interface for Parallel and Distributed Computing, Anindya Basu, Vineet Buch, Werner Vogels, Thorsten von Eicken. Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), Copper Mountain, Colorado, December 3-6, 1995

AWS Activate – Supporting Startups on AWS


I am very excited to announce AWS Activate, a program designed to provide startups with the resources they need to build applications on AWS.

Startups will forever be a very important customer segment of AWS. They were among our first customers, and along the way some amazing businesses have been built by these startups, many of which run 100% on AWS. Startups operate in a world of high uncertainty and limited capital, so an elastic and on-demand infrastructure at low and variable cost aligns very naturally with their needs. By reducing the cost of failure and democratizing access to infrastructure, the cloud has enabled more startups to build, experiment, and scale.

When we launched AWS, the original mission was:

To enable businesses and developers to use web services to build scalable sophisticated applications

We’re continually amazed at the incredibly sophisticated applications that these startups have built on top of our foundational services. That includes the startups that have become household names – Instagram, Spotify, Pinterest, Dropbox, Etsy, AirBnB, Shazam – as well as incredibly successful companies that you might not yet have heard of, such as Twilio, Viki, Redbus, Floorplanner, and Tellybug, and many more. We’re proud to have helped all of these startups achieve their goals.

As I’ve traveled this past year for AWS Summits, I’ve met startups in countries all over the world building apps for every imaginable use case. What’s exciting to me is that infrastructure is no longer a bottleneck to innovation. The democratization of infrastructure means that an internet startup in Bangalore or Sao Paulo or Manila has access to the same compute power as Amazon.com, the same durability as Dropbox, the same scalability as Airbnb, and the same global footprint as Netflix. The result is that we’re beginning to see more and more startups grow up in more places.

We’re excited to be a part of this global momentum in the startup ecosystem. The challenge now is to support and assist an increasing number of startups across the world.

To that end, today we’re pleased to announce AWS Activate, a new program for startups. AWS Activate is designed to provide startups with the resources they need to build applications on AWS. It includes access to web-based AWS Training courses, to help startups become familiar and proficient with AWS services; an AWS Support period, to provide expert guidance when a startup might need it; and in some cases AWS Promotional Credit. AWS Activate also allows startups to leverage the unique and robust ecosystem that has grown around AWS, both in terms of the developer community and third-party software vendors. The new Startup Forum will give startups a place to find and share tips and lessons learned – there are already posts from customers like Coursera as well as best practice guidance from AWS Solutions Architects. In addition, AWS Activate will include discounts on software that many startups find useful. Included already are exclusive offers from Opscode (for automation), AlertLogic (for security), and SOASTA (for testing).

The best part is – it’s free to join. You can learn more and sign up at AWS Activate.

The anonymity routing network Tor is frequently in the news these days, which makes it a good occasion to read up on the fascinating technologies behind it. Tor stands for The Onion Router, as its technology is based on the onion routing principles. These principles were first described by Goldschlag, et al., from the Naval Research Lab, in their 1996 paper on Hiding Routing Information. Almost immediately, work started on addressing a number of omissions in the original work, in what became known as the second-generation onion router. Tor is the implementation of such a second-generation router and has a number of fascinating features. The paper describing Tor is also very interesting from a practitioner’s point of view, as it deals with the system complexities of implementing the router at scale.

Hiding Routing Information, David M. Goldschlag, Michael G. Reed, and Paul F. Syverson, in the Proceedings of the Workshop on Information Hiding, Cambridge, UK, May 1996.

Tor: The Second-Generation Onion Router, Roger Dingledine, Nick Mathewson and Paul Syverson, in Proceedings of the 13th USENIX Security Symposium, August 2004

Back-to-Basics Weekend Reading - A Decomposition Storage Model


Traditionally, records in a database were stored as a unit: the data in a row was kept together for easy and fast retrieval. Not everybody agreed that this "N-ary Storage Model" (NSM) was the best approach for all workloads, but it stayed dominant until hardware constraints, especially on caches, forced the community to revisit some of the alternatives. Combined with the rise of data warehouse workloads, where there is often significant redundancy in the values stored in columns, database models based on column-oriented storage took off. The first practical modern implementation is probably C-Store by Stonebraker, et al. in 2005. There is a great tutorial by Harizopoulos, Abadi, and Boncz from VLDB 2009 that takes you through the history, trade-offs, and the state of the art. Many of the modern high-performance data warehouses, such as Amazon Redshift, are based on column stores.
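
To see the difference between the two models in miniature, here is a toy sketch; the data and attribute names are made up, and a real DSM stores each binary relation on disk, clustered both by surrogate and by value.

```python
# NSM: each record's attributes are stored together.
rows = [
    {"id": 1, "city": "Seattle", "score": 87},
    {"id": 2, "city": "Dublin", "score": 93},
]

# DSM: the table is decomposed into one (surrogate, value) binary
# relation per attribute.
dsm = {
    attr: [(r["id"], r[attr]) for r in rows]
    for attr in ("city", "score")
}

# A warehouse-style aggregate now scans a single, compact relation
# instead of every full record.
total = sum(value for _, value in dsm["score"])
print(total)  # 180
```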

But the groundwork for column-oriented databases was laid in 1985 when George Copeland and Setrag Khoshafian questioned the NSM with their seminal paper on a "Decomposition Storage Model" (DSM). From the abstract:

There seems to be a general consensus among the database community that the n-ary approach is better. This conclusion is usually based on a consideration of only one or two dimensions of a database system. The purpose of this report is not to claim that decomposition is better. Instead, we claim that the consensus opinion is not well founded and that neither is clearly better until a closer analysis is made along the many dimensions of a database system. The purpose of this report is to move further in both scope and depth toward such an analysis. We examine such dimensions as simplicity, generality, storage requirements, update performance and retrieval performance.

A Decomposition Storage Model, George P. Copeland and Setrag N. Khoshafian, in the Proceedings of the 1985 SIGMOD International Conference on Management of Data