People often ask me if developing for the cloud is any different from developing on-premises software. It really is. In this post, I show some of the reasons why that's true, using the Amazon Redshift team and the approach they have taken to improve the performance of their data warehousing service as an example. The Amazon Redshift team has delivered remarkable gains using a few simple engineering techniques:

  • Leveraging fleet telemetry
  • Verifying benchmark claims
  • Optimizing performance for bursts of user activity

Leveraging fleet telemetry

The biggest difference between developing for the cloud and developing on-premises software is that in the cloud, you have much better access to how your customers are using your services.

Every week, the Amazon Redshift team performs a scan of their fleet and generates a Jupyter notebook showing an aggregate view of customer workloads. They don't collect the specific queries, just generic information such as the operation, count, duration, and plan shape. This yields hundreds of millions of data samples. I picked a few graphs to demonstrate, showing frequency, duration, and query plan for both SELECT and INSERT/UPDATE/DELETE statements.


Looking at the graphs, you can see that customers run almost as many INSERT/UPDATE/DELETE statements on their Amazon Redshift data warehouses as they do SELECT statements. Clearly, they're updating their systems far more frequently than they did on-premises, which changes the nature of the engineering problems the team needs to prioritize.

You can also see that runtime roughly follows a power law distribution—even though the vast majority of queries run in under 100 ms, the aggregate time in each bucket is about the same. Each week, the team's job is to find something that shifts the durations left and aggregate time down by looking at query shapes to find the largest opportunities for improvement.
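
To make the power-law shape concrete, here is a minimal sketch of the kind of bucketing such a notebook might do. The data is synthetic and the bucket edges are my own assumption, not the team's actual analysis code:

    import numpy as np

    # Synthetic stand-in for fleet telemetry: a heavy-tailed sample of
    # query durations in milliseconds (an assumption, not real data).
    durations_ms = np.random.pareto(a=1.1, size=1_000_000) * 10 + 1

    # Logarithmic buckets: 10 ms, 100 ms, 1 s, 10 s, 100 s.
    edges = np.array([10, 100, 1_000, 10_000, 100_000])
    bucket = np.digitize(durations_ms, edges)

    for i, edge in enumerate(edges):
        in_bucket = durations_ms[bucket == i]
        # A power-law-like tail shows shrinking counts but comparable
        # aggregate time per bucket, which is what the fleet graphs show.
        print(f"under {edge:>7,} ms: count={in_bucket.size:>9,}  "
              f"aggregate={in_bucket.sum() / 1000:>12,.0f} s")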

Doing so has yielded impressive results over the past year. On a fleet-wide basis, repetitive queries are 17x faster, deletes are 10x faster, single-row inserts are 3x faster, and commits are 2x faster. I picked these examples because they aren't operations that show up in standard data warehousing benchmarks, yet are meaningful parts of customer workloads.

These sorts of gains aren't magic—just disciplined engineering incrementally improving performance by 5-10% with each patch. Over just the past 6 months, these gains have resulted in a 3.5x increase in Amazon Redshift's query throughput. So, small improvements add up. The key is knowing what to improve.
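
The compounding is easy to verify. Assuming, for illustration, roughly one patch per week at the optimistic end of that range:

    # Small weekly gains compound: a 5% improvement with each weekly
    # patch over 6 months (~26 weeks) multiplies throughput by roughly
    # 1.05 ** 26 ≈ 3.6, consistent with the 3.5x figure above.
    print(1.05 ** 26)  # ~3.56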

Verifying benchmark claims

I believe that making iterative improvements based on trends observed from fleet telemetry data is the best way to improve customer experience. That said, it is important to monitor benchmarks that help customers compare one cloud data warehousing vendor to another.

I've noticed a troubling trend in vendor benchmarking claims over the past year. Below, I show measurements on comparable hardware for Amazon Redshift and three other vendors who have recently been claiming order-of-magnitude better performance and pricing. As you can see, the reality is different from their claims: Amazon Redshift is up to 16 times faster and up to 8 times cheaper than the other vendors.


Note: $/Yr for Amazon Redshift is based on the 1-year Reserved Instance price


It is important, when providing performance data, to use queries derived from industry-standard benchmarks such as TPC-DS, not synthetic workloads skewed to show cherry-picked queries. It is important to show both cases where you're better and cases where you're behind.

And, it is important to provide the specific setup so customers can replicate the numbers for themselves. The code and scripts used by the Amazon Redshift team for benchmarking are available on GitHub and the accompanying dataset is hosted in a public Amazon S3 bucket. The scientific method requires results to be reproducible—in the cloud, it should be straightforward for customers to do so.

Note: You need valid AWS credentials to access the public S3 data. Script users should update the DDL file with their own AWS keys to load the TPC-DS data.
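
For illustration, loading one TPC-DS table might look roughly like the following sketch. The cluster endpoint, table, bucket path, and credential placeholders are all hypothetical; the real DDL and table list come from the GitHub repository:

    import psycopg2

    # Endpoint, database, and credentials are placeholders.
    conn = psycopg2.connect(
        host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="tpcds", user="awsuser", password="...")

    # One of the TPC-DS tables; the CREDENTIALS clause is where the
    # note above says to put your own AWS keys.
    copy_sql = """
        COPY store_sales
        FROM 's3://example-tpcds-bucket/store_sales/'
        CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
        GZIP DELIMITER '|';
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)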

Optimizing performance for bursts of user activity

Another significant difference between on-premises systems and the cloud is the abundance of available resources. A typical data warehouse has significant variance in concurrent query usage over the course of a day. It is more cost-effective to add resources just for the period during which they are required rather than provisioning to peak demand.



Concurrency Scaling is a new feature in Amazon Redshift that adds transient capacity when needed, to handle heavy demand from concurrent users and queries. Due to the performance improvements discussed above, 87% of current customers don't have any significant queue wait times and don't need concurrency beyond what their main cluster provides. The remaining 13% have bursts in concurrent demand, averaging 10 minutes at a time.

With the new feature, Amazon Redshift automatically spins up a cluster for the period during which increased concurrency causes queries to wait in the queue. For every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling. This means that Concurrency Scaling is free for more than 97% of customers.

For any usage that exceeds accrued credits at the end of the month, customers are billed on a per-second basis. This ensures that customers not only get consistently fast performance, but also predictable month-to-month costs, even during periods of high demand variability. In the following diagram, see how the throughput of queries derived from the TPC-H benchmark goes up as the number of concurrent users increases and Amazon Redshift adds transient clusters.
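
As a back-of-the-envelope sketch of the credit and billing model just described (the usage figures are illustrative assumptions, not published pricing math):

    SECONDS_PER_HOUR = 3600

    def billable_seconds(main_cluster_hours: float, burst_seconds: float) -> float:
        """One hour of Concurrency Scaling credit accrues for every
        24 hours the main cluster runs; usage beyond accrued credits
        is billed per second."""
        credit_seconds = (main_cluster_hours / 24) * SECONDS_PER_HOUR
        return max(0.0, burst_seconds - credit_seconds)

    # A cluster running all month (~720 hours) accrues ~30 hours of
    # credit, so a 10-minute burst each day stays entirely within it.
    print(billable_seconds(720, 30 * 10 * 60))  # -> 0.0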

Concurrency Scaling is a good example of how the Amazon Redshift team is able to leverage the elasticity of cloud resources to automatically scale capacity as needed. For Amazon Redshift customers, this results in consistently fast performance for all users and workloads, even with thousands of concurrent queries.

Conclusion

Concurrency Scaling is launching soon. You can sign up for the preview to receive an email notification when the feature is available for you to try. I hope to see you at re:Invent 2018, where you can hear more about Amazon Redshift's performance optimization techniques and how they are helping AWS customers reduce their analysts' time-to-insight.

Ciao Milano! – An AWS Region is coming to Italy!

Today, I am happy to announce our plans to open a new AWS Region in Italy! The AWS Europe (Milan) Region is the 25th AWS Region that we've announced globally. It's the sixth AWS Region in Europe, joining existing regions in France, Germany, Ireland, the UK, and the new Region that we recently announced in Sweden. The AWS Europe (Milan) Region will have three Availability Zones and be ready for customers in early 2020.

Currently we have 57 Availability Zones across 19 technology infrastructure Regions. As of this announcement, another five AWS Regions and 15 Availability Zones are coming over the next year in Bahrain, Hong Kong SAR, Italy, South Africa, and Sweden. We are continuing to work to open additional Regions all over the world where our customers need them the most.

Organizations across Italy have been using the AWS Cloud for over a decade, from AWS Regions located outside of Italy. This has led us to steadily increase our investment in Italy to serve our growing base of startup, government, and enterprise customers across many vertical industries, including automotive, financial services, media and entertainment, high technology, education, and energy.

In 2012, Amazon opened its first Italian office and its first Italian point of presence (PoP) based in Milan. Since then, AWS has added another PoP in Palermo in 2017. We have offices in Rome and Milan, where we continue to help Italian customers of all sizes move to the AWS Cloud. In Italy, we employ local teams of account managers, solutions architects, business development managers, partner managers, professional services consultants, technology evangelists, start-up community developers, marketing managers, and many more.

Some of the largest enterprises and public sector organizations in Italy are using AWS to build innovations and power their businesses, drive cost savings, accelerate innovation, and speed time-to-market. This includes:

  • Enterprises such as Decysion, Docebo, Eataly, Edizioni Conde Nast, ENEL, Ferrero, GEDI Gruppo Editoriale, Imperia & Monferrina, Lamborghini, Mediaset, Navionics, Pirelli, Pixartprinting, SEAT Pagine Gialle, Tagetik Software, and Vodafone Italy.
  • Public sector customers such as A2A Smart City, City of Cagliari, Corte dei Conti, Istituto Centrale per i Beni Sonori ed Audiovisivi, Madisoft, National Institute for Astrophysics, National Institute of Molecular Genetics, Politecnico di Milano, Politecnico Torino, Regione Autonoma Sardegna, UniNettuno, and Università degli Studi di Cagliari.

Lamborghini, the world-famous manufacturer of elite, luxury sports cars based in Italy, has been using AWS to reduce the cost of their infrastructure by 50 percent, while also achieving better performance and scalability. Today, their time-to-market is close to zero. The Lamborghini website was being hosted on outdated infrastructure when the company decided to boost their online presence to coincide with the launch of their Aventador J sports car. When evaluating the different options, Lamborghini looked at an on-premises data center, which was costly; a local hosting provider, which did not offer scalability; and cloud computing with AWS. The company decided it wanted the scalability, flexibility, and cost benefits of working in the cloud. By moving to AWS, Lamborghini was able to prepare the development and test environment in a couple of days. The website went online in less than one month and was able to support a 250 percent increase in traffic around the launch of the Aventador J.

ENEL is one of the leading energy operators in the world. ENEL is using AWS to transform its entire business, closing all of their data centers by 2018, migrating workloads from over 6,000 on-premises servers onto AWS in nine months, and using AWS IoT services to better manage and understand energy consumption.

Seat Pagine Gialle Italia is an Italian media company most famous for producing the Yellow and White Pages directories. Since first launching in 1925, Seat Pagine Gialle has been looking to help companies to market themselves better, and has branched out from telephone directories. They now offer street maps, online advertising, mobile and web content creation, and ecommerce services.

Seat Pagine Gialle currently hosts over 100,000 websites while serving the needs of over 12.5 million households and over 3 million businesses in Italy. To meet such large traffic numbers, they need a technology infrastructure that is secure, reliable, and flexible. Seat Pagine Gialle is working to move their technology infrastructure to AWS. Seat Pagine Gialle is now able to scale its infrastructure, and at the same time reduce operational costs by 50 percent. This helps support over 50 million searches every month on its networks across Italy.

GEDI Gruppo Editoriale is an Italian multimedia giant that publishes some of the largest circulation newspapers in Italy, including La Repubblica and La Stampa. They are the trusted source of news for millions of Italians. Using AWS has allowed them to scale up and down whenever their news cycles require it, allowing them to deliver the news to readers when it is needed the most.

A great example of this comes from the Italian general elections in March 2018 where they experienced the highest peak traffic ever for Repubblica.it. On that day, they had over 80 million pageviews and 18.8 million unique visits. Instead of spending all their resources on making sure that the website was available, La Repubblica was able to provide their readers with continuous special election-day coverage with real-time data of elections results.

AWS also has a vibrant partner ecosystem in Italy as part of the AWS Partner Network (APN). This includes system integration (SI) and independent software vendor (ISV) partners who have built cloud practices and innovative technology solutions using AWS.

APN SIs working in Italy include Accenture, BeSharp, Capgemini, Claranet, CloudReach, Deloitte, DXC, NTT Data, Sopra Steria, Storm Reply, Techedge, XPeppers, and Zero12. They help enterprise and public sector customers migrate to the AWS Cloud, deploy mission-critical applications, and provide a full range of monitoring, automation, and management services for customers' AWS environments. Some of our Italian ISV partners include Avantune, Docebo, Doxee, Tagetik Software, and TeamSystem.

We are also focused on supporting start-up companies across Italy. In 2013, we launched a dedicated program called AWS Activate. This program gives startups access to guidance and one-on-one time with AWS experts. We also give them web-based training, self-paced labs, customer support, third-party offers, and up to $100,000 in AWS service credits—all at no charge. This is in addition to the work AWS already does with the venture capital community, startup accelerators, and incubators to help startups grow in the cloud.

To support the rapid growth of their portfolio companies, we also work with accelerator organizations such as H-Farm, Nana Bianca, and PoliHub and VC firms like United Ventures and P101. Startup customers have built their businesses on top of AWS, including Beintoo, brumbrum, DoveConviene, Ennova, FattureinCloud, Musement, Musixmatch, Prima Assicurazioni, Satispay, SixthContinent, Spreaker, and Wyscout.

The new AWS Europe (Milan) Region—together with our portfolio of existing European Regions in France, Germany, Ireland, and the UK—will provide customers in Italy and across EMEA with highly reliable, scalable, secure, fast, and low-latency access to the powerful and innovative capabilities of the AWS Cloud.

AWS GovCloud (US-East) now open

Today, I'm happy to announce that the AWS GovCloud (US-East) Region, our 19th global infrastructure Region, is now available for use by customers in the US. With this launch, AWS now provides 57 Availability Zones, with another 12 zones and four Regions in Bahrain, Cape Town, Hong Kong SAR, and Stockholm expected to come online by 2020.

The AWS GovCloud (US-East) Region is our second AWS GovCloud (US) Region, joining AWS GovCloud (US-West) to further help US government agencies, the contractors that serve them, and organizations in highly regulated industries move more of their workloads to the AWS Cloud by implementing a number of US government-specific regulatory requirements.

Similar to the AWS GovCloud (US-West), the AWS GovCloud (US-East) Region provides three Availability Zones and meets the stringent requirements of the public sector and highly regulated industries, including being operated on US soil by US citizens. It is accessible only to vetted US entities and AWS account root users, who must confirm that they are US citizens or US permanent residents. The AWS GovCloud (US-East) Region is located in the eastern part of the United States, providing customers with a second isolated Region in which to run mission-critical workloads with lower latency and high availability.

In 2011, AWS was the first cloud provider to launch an isolated infrastructure Region designed to meet the stringent requirements of government agencies and other highly regulated industries when it opened the AWS GovCloud (US-West) Region. The new AWS GovCloud (US-East) Region also meets the top US government compliance requirements, including:

  • Federal Risk and Authorization Management Program (FedRAMP) Moderate and High baselines
  • US International Traffic in Arms Regulations (ITAR)
  • Federal Information Security Management Act (FISMA) Low, Moderate, and High baselines
  • Department of Justice's Criminal Justice Information Services (CJIS) Security Policy
  • Department of Defense (DoD) Impact Levels 2, 4, and 5

The AWS GovCloud (US) environments also conform to commercial security and privacy standards such as:

  • Healthcare Insurance Portability and Accountability Act (HIPAA)
  • Payment Card Industry (PCI) Security
  • System and Organization Controls (SOC) 1, 2, and 3
  • ISO/IEC 27001, ISO/IEC 27017, ISO/IEC 27018, and ISO 9001 compliance, which is primarily for healthcare, life sciences, medical devices, automotive, and aerospace customers

Some of the largest organizations in the US public sector, as well as the education, healthcare, and financial services industries, are using AWS GovCloud (US) Regions. They appreciate the reduced latency, added redundancy, data durability, resiliency, greater disaster recovery capability, and the ability to scale across multiple Regions. This includes US government agencies and companies such as:

The US Department of Treasury, US Department of Veterans Affairs, Adobe, Blackboard, Booz Allen Hamilton, Drakontas, Druva, ECS, Enlighten IT, General Dynamics Information Technology, GE, Infor, JHC Technology, Leidos, NASA's Jet Propulsion Laboratory, Novetta, PowerDMS, Raytheon, REAN Cloud, a Hitachi Vantara company, SAP NS2, and Smartronix.

The US Department of Veterans Affairs is responsible for providing vital services like healthcare to America's veterans. It enjoys the flexibility and cost savings of AWS while efficiently innovating to better serve US military veterans. By using AWS, they have been able to provide a more streamlined experience to veterans, giving them faster, easier access to benefits, healthcare, and other essential services through vets.gov.

Raytheon, one of the world's leading providers of technology for mission-critical defense systems, is using the AWS GovCloud (US) Regions to comply with a wide range of government regulations. By using AWS, they have been able to reduce the time to build, test, and scale software from weeks to hours.

State and local governments, including law enforcement agencies, are using the AWS GovCloud (US) Regions to store their data in a cost-effective way. For example, the Indiana State Police Department relies on AWS GovCloud (US) to innovate and advance law enforcement through technology, while securely storing data that is CJIS-compliant.

Drakontas provides products for law enforcement, criminal justice, infrastructure protection, transportation, and military communities. Its DragonForce software, which combines multiple planning tools to deliver real-time information to commanders and field members, is built entirely on AWS.

For more information about the AWS GovCloud (US-East) Region, I encourage you to visit the Public Sector section of the AWS website.

An AWS Region is coming to South Africa!

Today, I am excited to announce our plans to open a new AWS Region in South Africa! AWS is committed to South Africa's transformation. The AWS Africa (Cape Town) Region is another milestone of our growth and part of our long-term investment in and commitment to the country. It is our first Region in Africa, and we're shooting to have it ready in the first half of 2020.

The new AWS Africa (Cape Town) Region will have three Availability Zones and provide lower latency to end users across Sub-Saharan Africa. AWS customers will also be able to store their data in South Africa with the assurance that their content won't move unless they move it. Those looking to comply with the upcoming Protection of Personal Information Act (POPIA) will have access to secure infrastructure that meets the most rigorous international compliance standards.

This news marks the 23rd AWS Region that we have announced globally. We already have 55 Availability Zones across 19 infrastructure regions that customers can use today. Another four AWS Regions (and 12 Availability Zones) in Bahrain, Hong Kong SAR, Sweden, and a second AWS GovCloud (US) Region are expected to come online in the coming months. Despite this rapid growth, we have no plans to slow down or stop there. We are actively working to open additional Regions in the locations where our customers need them most.

We have a long history in South Africa. AWS has been an active member of the local technology community since 2004. In December of that year, we opened the Amazon Development Center in Cape Town. That's where we built many pioneering networking technologies, our next-generation software for customer support, and the technology behind our compute service, Amazon EC2.

In 2015, we expanded our presence in the country, opening an AWS office in Johannesburg. Since then, we have seen an acceleration in AWS adoption. In 2017, we brought the Amazon Global Network to Africa, through AWS Direct Connect. Earlier this year, we launched infrastructure on the African continent by introducing Amazon CloudFront to South Africa, with two new edge locations in Johannesburg and Cape Town.

Because of our expansion, we now count a number of well-known South African enterprises as customers such as Absa, Discovery, Investec, MedScheme, MiX Telematics, MMI Holdings Limited, Old Mutual, Pick n Pay, Standard Bank, and Travelstart. We also work with some of Africa's fastest growing startups such as Aerobotics, Apex Innovations, Asoriba, Custos Media, EMS Invirotel, Entersekt, HealthQ, JUMO, Luno, Mukuru, PayGate, Parcel Ninja, Simfy Africa, Zapper, Zanibal, and Zoona.

Innovative organizations in the South African public sector are using AWS to help change lives across the continent, such as Hyrax Biosciences. Hyrax has developed an AWS-based technology called Exatype, which rapidly and accurately tests HIV drug resistance. Traditionally, it cost $300 to $500 to run a single resistance test. With the AWS-based system, Exatype can do this at a fraction of the cost.

Many of our startup customers in Africa are leveraging the AWS Cloud to grow into successful global businesses. One example is JUMO, a technology company that has developed a platform for operating inclusive financial services marketplaces to serve small businesses and individuals in emerging markets. Since it launched in 2014, more than 9 million people have saved or borrowed on the JUMO platform. The platform gives these customers real-time access to loans, savings, and insurance products from banks.

JUMO has offices in Ghana, Kenya, Zambia, South Africa, Tanzania, Uganda, Rwanda, Pakistan, Singapore, and the United Kingdom. Using AWS allows JUMO to rapidly expand their operations throughout Africa and Asia, and enables the company to focus on business expansion and forge technology partnerships. They partner with leading international and pan-African banks like Barclays and Letshego, and large mobile money operators like Airtel, MTN, Telenor, and Tigo. JUMO uses a broad range of behavioral and payments data, near real-time analytics, and predictive modeling to create financial identities for people who were previously beyond the reach of banks. Using AWS, JUMO has been able to process this data more than 1,000 times faster. What would have taken two weeks on local servers now takes only a few minutes.

One of my favorite stories to come from Africa, however, is the work we are doing alongside our partners—Intel and Digital Divide Data—to help the National Museums of Kenya digitize their entire collection of artifacts. The National Museums of Kenya holds one of the largest collections of archaeology and paleontology in the world. This project uses 3D digital imagery to create records of over one million items. This makes the records more accessible to researchers around the globe. It also preserves the museums' collection for future generations.

As well as helping customers in South Africa, and across the continent with technology, we also have a number of programs to help foster startups and to support the development of technology skills in the education sector. With AWS Activate, we have been supporting startups across Africa with access to guidance and 1:1 time with AWS experts. We offer web-based training, self-paced labs, customer support, third-party offers, and up to $100,000 in AWS service credits—all at no charge.

This has helped unearth innovative startups like Asoriba. Asoriba is a web and mobile platform that runs entirely on AWS and enables church administrators to effectively manage church membership, events, communications, and finance. The platform also allows church members to easily make donations and offerings via a mobile app. Asoriba already has 1,500 churches as members, serving 150,000 congregants, with the aim of reaching all of Africa's 521 million churchgoers.

We also work with the venture capital community, accelerators, and incubators in South Africa. In Cape Town, AWS works with organizations such as Demo Africa, LaunchLab, Mzansi Commons, and Silicon Cape as well as co-working hubs, such as Workshop17. We provide coaching and mentorship, technical support, and resources to help African startups launch their businesses and go global.

For educators and students, we have AWS Educate. This program gives access to resources such as AWS credits, a jobs board, and training content to accelerate cloud-related learning. With this program, we are already working with institutes such as the University of Cape Town and Stellenbosch University in South Africa. We also support the Explore Data Science Academy to educate students on data analytics skills and produce the next generation of data scientists in Africa.

Another program for higher education institutes is AWS Academy, which provides AWS-authorized courses for students to acquire in-demand cloud computing skills. The program has already attracted the country's major academic institutions, including the University of Cape Town, University of Johannesburg, and Durban University of Technology. By providing resources to tertiary institutes, we believe we can grow the number of cloud professionals and create a generation of cloud-native technology experts to help grow African economies into the future.

We have also been investing in helping with a number of philanthropic and charity activities in South Africa. We support organizations such as AfricaTeenGeeks, an NGO that teaches children to code; Code4CT, a charity set up to inspire and empower young girls by equipping them with technical skills; DjangoGirls, which introduces women to coding; and GirlCode, which supports the empowerment of women through technology. Our engineers work with these and other charities to provide coaching, mentoring, and AWS credits.

We look forward to working with customers from startups to enterprises, across the public and private sectors, and many more as we grow our business in South Africa and across the African continent. For more information about our activities in South Africa, including webinars, meetups, customer case studies, and more, see the AWS Africa page.

Make your voice count by simply saying, "Alexa, let's chat."

A while back, I wrote about the Alexa Prize, a university competition where participating teams are creating socialbots focused on advancing human-computer interaction. We are now in year two, heading into the final stretch for 2018, and I thought I would give everyone an update.

For those who aren't familiar, Alexa Prize teams use customer feedback to advance several areas of conversational AI, with the grand challenge being a socialbot that can engage coherently for 20 minutes in a fun, high-quality conversation on popular topics such as entertainment, sports, technology, and fashion. Teams are thinking big about how to make strides in areas including knowledge acquisition, natural language understanding and generation, context modeling, common-sense reasoning, and dialog planning.

On July 2, the eight 2018 Alexa Prize teams, from universities around the world, entered the semifinals phase of the competition. In mid-August, the two teams with the highest average customer ratings will automatically advance to the finals round. At least one other team will be selected by Amazon based on their socialbot's relevance, coherence, interestingness, speed, and scientific merit. While a mix of innovation, passion, and creativity fuels the teams, customer engagement is key to reaching their goals. Your interactions and feedback will be crucial to helping teams continue to improve their socialbots into the finals round.

To be part of the conversation, simply say "Alexa, let's chat" on any Alexa-enabled device and start your Alexa Prize journey. To date, customers have already logged thousands of hours of conversation with the eight different socialbots currently available.

To end a conversation, simply say "Stop." You will then be prompted to provide a verbal rating and feedback. You can easily try multiple socialbots just by saying "Alexa, let's chat" again.

To learn more about the Alexa Prize, go to: www.alexaprize.com

A one size fits all database doesn't fit anyone

A common question that I get is why we offer so many database products. The answer for me is simple: Developers want their applications to be well architected and to scale effectively. To do this, they need to be able to use multiple databases and data models within the same application.

Seldom can one database fit the needs of multiple distinct use cases. The days of the one-size-fits-all monolithic database are behind us, and developers are now building highly distributed applications using a multitude of purpose-built databases. Developers are doing what they do best: breaking complex applications into smaller pieces and then picking the best tool to solve each problem. The best tool for a job usually differs by use case.

For decades, because the only database choice was a relational database, the data was modeled as relational no matter its shape or function in the application. Instead of the use case driving the requirements for the database, it was the other way around: the database drove the data model for the application use case. Is a relational database purpose-built for a normalized schema and for enforcing referential integrity in the database? Absolutely, but the key point here is that not all application data models or use cases match the relational model.

As I have talked about before, one of the reasons we built Amazon DynamoDB was that Amazon was pushing the limits of what was a leading commercial database at the time, and we were unable to sustain the availability, scalability, and performance that our growing Amazon.com business demanded. We found that about 70 percent of our operations were key-value lookups, where only a primary key was used and a single row would be returned. With no need for referential integrity and transactions, we realized these access patterns could be better served by a different type of database. Further, with the growth and scale of Amazon.com, boundless horizontal scale needed to be a key design point; scaling up simply wasn't an option. This ultimately led to DynamoDB, a nonrelational database service built to scale out beyond the limits of relational databases.
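
That access pattern, fetching a single item by its primary key, is exactly what DynamoDB optimizes for. Here is a minimal boto3 sketch, using a hypothetical table and key:

    import boto3

    # Hypothetical table and key. The point is the shape of the call:
    # one item fetched by primary key, with no joins or transactions.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("customer-sessions")

    response = table.get_item(Key={"session_id": "abc-123"})
    item = response.get("Item")  # the single row, or None if absent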

This doesn't mean that relational databases lack utility in present-day development, or that they aren't available, scalable, or high-performing. The opposite is true. In fact, this has been proven by our customers, as Amazon Aurora remains the fastest-growing service in AWS history. What we experienced at Amazon.com was using a database beyond its intended purpose. That learning is at the heart of this blog post: databases are built for a purpose, and matching the use case with the database will help you write high-performance, scalable, and more functional applications faster.

Purpose-built databases

The world is still changing, and the categories of nonrelational databases continue to grow. We are increasingly seeing customers wanting to build Internet-scale applications that require diverse data models. In response to these needs, developers now have the choice of relational, key-value, document, graph, in-memory, and search databases. Each of these databases solves a specific problem or a group of problems.

Let's take a closer look at the purpose for each of these databases:

  • Relational: A relational database is self-describing because it enables developers to define the database's schema as well as relations and constraints between rows and tables in the database. Developers rely on the functionality of the relational database (not the application code) to enforce the schema and preserve the referential integrity of the data within the database. Typical use cases for a relational database include web and mobile applications, enterprise applications, and online gaming. Airbnb is a great example of a customer building high-performance and scalable applications with Amazon Aurora. Aurora provides Airbnb a fully-managed, scalable, and functional service to run their MySQL workloads.

  • Key-value: Key-value databases are highly partitionable and allow horizontal scaling at levels that other types of databases cannot achieve. Use cases such as gaming, ad tech, and IoT lend themselves particularly well to the key-value data model where the access patterns require low-latency Gets/Puts for known key values. The purpose of DynamoDB is to provide consistent single-digit millisecond latency for any scale of workloads. This consistent performance is a big part of why the Snapchat Stories feature, which includes Snapchat's largest storage write workload, moved to DynamoDB.

  • Document: Document databases are intuitive for developers to use because the data in the application tier is typically represented as a JSON document. Developers can persist data using the same document model format that they use in their application code. Tinder is one example of a customer that is using the flexible schema model of DynamoDB to achieve developer efficiency.

  • Graph: A graph database's purpose is to make it easy to build and run applications that work with highly connected datasets. Typical use cases for a graph database include social networking, recommendation engines, fraud detection, and knowledge graphs. Amazon Neptune is a fully-managed graph database service. Neptune supports both the Property Graph model and the Resource Description Framework (RDF), giving you the choice of two graph APIs: TinkerPop and RDF/SPARQL. Current Neptune users are building knowledge graphs, making in-game offer recommendations, and detecting fraud. For example, Thomson Reuters is helping their customers navigate a complex web of global tax policies and regulations by using Neptune.

  • In-memory: Financial services, ecommerce, web, and mobile applications have use cases such as leaderboards, session stores, and real-time analytics that require microsecond response times and can see large spikes in traffic at any time. We built Amazon ElastiCache, offering Memcached and Redis, to serve low-latency, high-throughput workloads, such as McDonald's, that cannot be served with disk-based data stores (a short leaderboard sketch follows this list). Amazon DynamoDB Accelerator (DAX) is another example of a purpose-built data store. DAX was built to make DynamoDB reads an order of magnitude faster.

  • Search: Many applications output logs to help developers troubleshoot issues. Amazon Elasticsearch Service (Amazon ES) is purpose-built for providing near real-time visualizations and analytics of machine-generated data by indexing, aggregating, and searching semi-structured logs and metrics. Amazon ES is also a powerful, high-performance search engine for full-text search use cases. Expedia is using more than 150 Amazon ES domains, 30 TB of data, and 30 billion documents for a variety of mission-critical use cases, ranging from operational monitoring and troubleshooting to distributed application stack tracing and pricing optimization.
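
As an example of the in-memory case above, here is a minimal leaderboard sketch on ElastiCache for Redis. The endpoint and scores are placeholders, not a customer's actual workload:

    import redis

    # Hypothetical leaderboard on ElastiCache for Redis; the endpoint
    # is a placeholder. Sorted sets keep entries ranked by score.
    r = redis.Redis(host="example.abc123.use1.cache.amazonaws.com", port=6379)

    r.zadd("leaderboard", {"alice": 3120, "bob": 2890, "carol": 3305})
    top_three = r.zrevrange("leaderboard", 0, 2, withscores=True)
    print(top_three)  # [(b'carol', 3305.0), (b'alice', 3120.0), (b'bob', 2890.0)]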

Building applications with purpose-built databases

Developers are building highly distributed and decoupled applications, and AWS enables developers to build these cloud-native applications by using multiple AWS services. Take Expedia, for example. Though to a customer the Expedia website looks like a single application, behind the scenes Expedia.com is composed of many components, each with a specific function. By breaking an application such as Expedia.com into multiple components that have specific jobs (such as microservices, containers, and AWS Lambda functions), developers can be more productive by increasing scale and performance, reducing operations, increasing deployment agility, and enabling different components to evolve independently. When building applications, developers can pair each use case with the database that best suits the need.

To make this real, take a look at some of our customers that are using multiple different kinds of databases to build their applications:

  • Airbnb uses DynamoDB to store users' search history for quick lookups as part of personalized search. Airbnb also uses ElastiCache to store session states in-memory for faster site rendering, and they use MySQL on Amazon RDS as their primary transactional database.
  • Capital One uses Amazon RDS to store transaction data for state management, Amazon Redshift to store web logs for analytics that need aggregations, and DynamoDB to store user data so that customers can quickly access their information with the Capital One app.
  • Expedia built a real-time data warehouse for the market pricing of lodging and availability data for internal market analysis by using Aurora, Amazon Redshift, and ElastiCache. The data warehouse performs a multistream union and self-join with a 24-hour lookback window using ElastiCache for Redis. The data warehouse also persists the processed data directly into Aurora MySQL and Amazon Redshift to support both operational and analytical queries.
  • Zynga migrated the Zynga poker database from a MySQL farm to DynamoDB and got a massive performance boost. Queries that used to take 30 seconds now take one second. Zynga also uses ElastiCache (Memcached and Redis) in place of their self-managed equivalents for in-memory caching. The automation and serverless scalability of Aurora make it Zynga's first choice for new services using relational databases.
  • Johnson & Johnson uses Amazon RDS, DynamoDB, and Amazon Redshift to minimize time and effort spent on gathering and provisioning data, and allow the quick derivation of insights. AWS database services are helping Johnson & Johnson improve physicians' workflows, optimize the supply chain, and discover new drugs.

Just as developers are no longer writing monolithic applications, they are also no longer using a single database for all use cases in an application; they are using many databases. Though the relational database remains alive and well, and is still well suited for many use cases, purpose-built databases for key-value, document, graph, in-memory, and search use cases can help you optimize for functionality, performance, and scale and, more importantly, your customers' experience. Build on.

The workplace of the future

This article, titled "Die Arbeitswelt der Zukunft" ("The workplace of the future"), appeared in German last week in the "Digitalisierung" column of WirtschaftsWoche.

We already have an idea of how digitalization, and above all new technologies like machine learning, big-data analytics or IoT, will change companies' business models — and are already changing them on a wide scale. So now's the time to examine more closely how different facets of the workplace will look and the role humans will have.

In fact, the future is already here – but it's still not evenly distributed. Science fiction author William Gibson said that nearly 20 years ago. We can observe a gap between the haves and the have-nots: namely, between those who are already using future technologies and those who are not. The consequences of this are particularly visible in the labor market: many people still don't know which skills will be required in the future or how to learn them.

Against that background, it's natural for people – even young digital natives – to feel some growing uncertainty. According to a Gallup poll, 37% of millennials are afraid of losing their jobs in the next 20 years due to AI. At the same time there are many grounds for optimism. Studies by the German ZEW Center for European Economic Research, for example, have found that companies that invest in digitalization create significantly more jobs than companies that don't.

How many of the jobs that we know today will even exist in the future? Which human activities can be taken over by machines or ML-based systems? Which tasks will be left over for humans to do? And will there be completely new types of jobs in the future that we can't even imagine today?

Future of work or work of the future?

All of these questions are legitimate. "But where there is danger, a rescuing element grows as well." German poet Friedrich Hölderlin knew that already in the 19th century. As for me, I'm a technology optimist: Using technology to drive customer-centric convenience, such as in the cashier-less Amazon Go stores, will create shifts in where jobs are created. Thinking about the work of tomorrow, it doesn't help to base the discussion on structures that exist today. After the refrigerator was invented in the 1930s, many people who worked in businesses that sold ice feared for their jobs. Indeed, refrigerators made this business superfluous for the most part; but in its place, many new jobs were created. Companies that produced refrigerators needed people to build them, and now that food could be preserved, whole new businesses were created to target that market. We should not let ourselves be guided in our thinking by the perception of work as we know it today. Instead, we should think about what the workplace could look like in the future. And to do that, we need to ask ourselves an entirely different question, namely: What is changing in the workplace, both from an organizational and a qualitative standpoint?

Many of the tasks carried out by people in manufacturing, for example, have remained similar over time in terms of the workflows. Even the activities of doctors, lawyers or taxi drivers have hardly changed in the last decade, at least in terms of their underlying processes. Only parts of the processes are being performed by machines, or at least supported by them. Ultimately, the desired product or service is delivered in – hopefully – the desired quality. But in the age of digitalization, people do much more than fill the gaps between the machines. The work done by humans and machines is built around solving customer problems. It's no longer about producing a car, but about the service "mobility", about bringing people to a specific location. "I want to be in a central place in Berlin as quickly as possible" is the requirement that needs to be fulfilled. In the first step we might reach this goal by combining the fastest mobility services through a digital platform; in the next, it might be a task fulfilled by Virtual Reality. These new offerings are organized on platforms or networks, and less so in processes. And artificial intelligence makes it possible to break down tasks in such a way that everyone contributes what he or she can do best. People define problems and pre-structure them, and machines or algorithms develop solutions that people evaluate in the end.

Radiologists are now assisted by machine-learning-driven tools that allow them to evaluate digital content in ways that were not possible before. Many radiologists have even claimed that ML-driven advice has significantly improved their ability to interpret X-rays.

I would even go a step further, because I believe it's possible to "rehumanize" work and make our unique abilities as human beings even more important. Until now, access to digital technologies was limited above all by a machine's abilities. The interfaces to our systems will no longer be machine-driven; in the future, humans will be the starting point. For example, anyone who wanted to teach a robot how to walk in the age of automation had to exactly calculate every single angle of slope from the upper to lower thigh, as well as the speed of movement and other parameters, and then formulate them as commands in a programming language. In the future, we'll be able to communicate and work with robots more intensively in our own language. Teaching a robot to walk will be much easier: the robot can be controlled by anyone via voice command, and it could train itself by analyzing how humans walk via a motion scanner, applying the process, and perfecting it.

With the new technological possibilities and greater computing power, work in the future will be more focused on people and less on machines. Machine learning can make human labor more effective. Companies like C-SPAN show how: scores of people used to have to scan video material for hours in order to create keywords, for example, for a person's name. Today, automated face recognition can do this task in seconds, allowing employees to immediately begin working with the results.

Redefining the relationship between human and machine

The progress at the interface of human and machine is happening at a very fast pace, with an already visible impact on how we work. In the future, technology can become a much more natural part of our workplace, activated by several input methods: speaking, seeing, touching, or even smelling. Take voice-control technologies, a field that is currently undergoing a real disruption. This area distinguishes itself radically from what we knew until now as the "hands-free" work approach, which ran purely on simple voice commands. Modern voice-control systems can understand, interpret, and answer conversations in a professional way, which makes a lot of work processes easier to perform. Examples are giving diagnoses to patients or legal advice. By the end of 2018, voice input will have already significantly changed the way we develop devices and apps. People will be able to connect technologies into their work primarily through voice. One can already get an inkling of what that looks like in detail.

At the US space agency NASA, for example, Amazon Alexa organizes the booking of conference rooms. A room doesn't have to be requested for every single meeting: anyone who needs a room asks Alexa, and the rest happens automatically. Everyone knows the stress caused by telephone conferences: they never start on time because someone hasn't found the right dial-in number, and it takes a while until you've typed in the 8-digit number plus a 6-digit conference code. A voice command creates a lot more productivity. The AWS service Amazon Transcribe could start creating a transcript right away during the meeting and send it to all participants afterwards. Other companies, such as the Japanese firm Mitsui or the software provider bmc, use Alexa for Business to achieve more efficient and better collaboration between their employees.

The software provider fme also uses voice control to offer its customers innovative applications in the fields of business intelligence, social business collaboration, and enterprise content management. The customers of fme mainly come from life sciences and industrial manufacturing. Employees can search different types of content using voice control, navigate easily through the content, and have the content displayed or read to them. Users can have Alexa explain individual tasks to them in OpenText Documentum, to give another example. This could be used to make the onboarding of new employees faster and cheaper; their managers would not have to perform the same information ritual again and again. A similar approach can be found at pharmaceutical company AstraZeneca, which uses Alexa in its manufacturing: Team members can ask questions about standard processes to find out what they need to do next.

Of course, responsibilities and organizations will change as a result of these developments. Resources for administrative tasks can be turned into activities that have a direct benefit for the customer. Regarding the character of work in the future, we will probably need more "architects," "developers," "creatives," "relationship experts," "platform specialists," and "analysts" and fewer people who need to perform tasks according to certain pre-determined steps, as well as fewer "administrators". By speaking more to humans' need to create and shape, work might ultimately become more fulfilling and enjoyable.

Expanding the digital world

This new understanding of the relationship between man and machine has another important effect: It will significantly expand the number of people who can participate in digital value creation: older people, people who at the moment don't have access to a computer or smartphone, people for whom using the smartphone in a specific situation is too complicated, and people in developing countries who can't read or write. A good example of the latter is rice farmers who work with the International Rice Research Institute, an organization based near Manila, the Philippines. The institute's mission is to fight poverty, hunger and malnutrition by easing the lives and work of rice farmers. Rice farmers can benefit from knowledge to which they wouldn't have access were they on their own. The institute has saved 70,000 DNA sequences of different types of rice, from which conclusions can be drawn about the best conditions for growing rice. Every village has a telephone, and by using it the farmers can access this knowledge: they select their dialect in a menu and describe which piece of land they tend. The service is based on machine learning. It generates recommendations on how much fertilizer is needed and when the best time is to plant the crops. So with the help of digital technologies, farmers can see how their work becomes more valuable: a given amount of effort produces a richer harvest of rice.

Until now, we have had only a tiny insight into the possibilities for the world of work. But they make clear that the quality of work for us humans will most probably increase, and that technology can allow us to perform many activities that we still cannot imagine today. Although there are twice as many robots per capita in German companies as in US firms, German industry still has trouble finding qualified employees rather than having to fight unemployment. In the future, we humans will be able to carry out activities in a way that is closer to our creative human nature than is the case today. I believe that if we want to do justice to the technological possibilities, we should do it like Hölderlin and have faith in the rescue, but at the same time try to minimize the risks by understanding and shaping things.

Changing the calculus of containers in the cloud

I wrote to you over two years ago about what happens under the hood of Amazon ECS. Last year at re:Invent, we launched AWS Fargate, and today, I want to explore how Fargate fundamentally changes the landscape of container technology.

I spend a lot of time talking to our customers and leaders at Amazon about innovation. One of the things I've noticed is that ideas and technologies which dramatically change the way we do things are rarely new. They're often the combination of an existing concept with an approach, technology, or capability in a particular way that's never been successfully tried before.

The rapid embrace of containers in the past four years is the result of blending old technology (containers) with a new toolchain and workflow (i.e., Docker), and the cloud. In our industry, four years is a long time, but I think we've only just started exploring how this combination of code packaging, well-designed workflows, and the cloud can reshape the ability of developers to quickly build applications and innovate.

Containers solve a fundamental code portability problem and enable new infrastructure patterns on the cloud. Having a consistent, immutable unit of deployment to work with lets you abstract away all the complexities of configuring your servers and deployment pipelines every time you change your code or want to run your app in a different place. But containers also put another layer between your code and where it runs. They are an important, but incremental, step on the journey of being able to write code and have it run in the right place, with the right scale, with the right connections to other bits of code, and the right security and access controls.

Solving these higher-order problems of deploying, scheduling, and connecting containers across environments gave us container management tools. Container orchestration has always seemed to me to be very not cloud native. Managing a large server cluster and optimizing the scheduling of containers, all backed by a complex distributed state store, is counter to the premise of the cloud. Customers choose the cloud to pay as they go, avoid guessing capacity, get deep operational control without operational burden, build loosely coupled services with limited blast radii to prevent failures, and self-serve everything they need to run their code.

You should be able to write your code and have it run, without having to worry about configuring complex management tools, open source or not. This is the vision behind AWS Fargate. With Fargate, you don't need to stand up a control plane, choose the right instance type, or configure all the other components of your application stack like networking, scaling, service discovery, load balancing, security groups, permissions, or secrets management. You simply build your container image, define how and where you want it to run, and pay for the resources you need. Fargate has native integrations to Amazon VPC, Auto Scaling, Elastic Load Balancing, IAM roles, and Secrets Management. We've taken the time to make Fargate production ready with a 99.99% uptime SLA and compliance with PCI, SOC, ISO, and HIPAA.

With AWS Fargate, you can provision resources to run your containers at a much finer grain than with an EC2 instance. You can select exactly the CPU and memory your code needs, and the amount you pay scales exactly with how many containers you run. You don't have to guess at capacity to handle spiky traffic, and you get the benefit of perfect scale, which lets you offload a ton of operational effort onto the cloud. MiB for MiB, this might mean that cloud native technologies like Fargate look more expensive than more traditional VM infrastructure on paper. But if you look at the full cost of running an app, we believe most applications will be cheaper with Fargate as you only pay what you need. Our customers running Fargate see big savings in the developer hours required to keep their apps running smoothly.

The entire ecosystem of container orchestration solutions arose out of necessity because there was no way to natively procure a container in the cloud. Whether you use Kubernetes, Mesos, Rancher, Nomad, ECS, or any other system no longer matters, because with Fargate, there is nothing to orchestrate. The only thing that you have to manage is the construction of the applications themselves. AWS Fargate finally makes containers cloud native.

I think the next area of innovation we will see after moving away from thinking about underlying infrastructure is application and service management. How do you interconnect the different containers that run independent services, ensure visibility, manage traffic patterns, and security for multiple services at scale? How do independent services mutually discover one another? How do you define access to common data stores? How do you define and group services into applications? Cloud native is about having as much control as you want and I am very excited to see how the container ecosystem will evolve over the next few years to give you more control with less work. We look forward to working with the community to innovate forward on the cloud native journey on behalf of our customers.

Getting Started

AWS Fargate already seamlessly integrates with Amazon ECS. You just define your application as you do for Amazon ECS. You package your application into task definitions, specify the CPU and memory needed, define the networking and IAM policies that each container needs, and upload everything to Amazon ECS. After everything is set up, AWS Fargate launches and manages your containers for you.
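
Here is a minimal boto3 sketch of that flow. Every name, ARN, image, and subnet below is an illustrative placeholder, not a prescribed setup:

    import boto3

    ecs = boto3.client("ecs")

    # Register a task definition: the container image plus the exact
    # CPU and memory it needs.
    ecs.register_task_definition(
        family="my-web-app",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="256",     # 0.25 vCPU
        memory="512",  # 512 MiB
        executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        containerDefinitions=[{
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app:latest",
            "portMappings": [{"containerPort": 80}],
        }],
    )

    # Run it on Fargate: no instances to choose, no cluster capacity
    # to manage.
    ecs.run_task(
        cluster="default",
        launchType="FARGATE",
        taskDefinition="my-web-app",
        networkConfiguration={"awsvpcConfiguration": {
            "subnets": ["subnet-0abc1234"],
            "assignPublicIp": "ENABLED",
        }},
    )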

AWS Fargate support for Amazon EKS, the Elastic Kubernetes Service, will be available later in 2018.

Looking back at 10 years of compartmentalization at AWS


At AWS, we don't mark many anniversaries. But every year when March 14th comes around, it's a good reminder that Amazon S3 originally launched on Pi Day, March 14, 2006. The Amazon S3 team still celebrates with homemade pies!

March 26, 2008 doesn't have any delicious desserts associated with it, but that's the day we launched Availability Zones for Amazon EC2. A concept that has since changed infrastructure architecture, Availability Zones are now at the core of both AWS and customer reliability and operations.

Powering the virtual instances and other resources that make up the AWS Cloud are real physical data centers with AWS servers in them. Each data center is highly reliable, and has redundant power, including UPS and generators. Even though the network design for each data center is massively redundant, interruptions can still occur.

Availability Zones draw a hard line around the scope and magnitude of those interruptions. No two zones are allowed to share low-level core dependencies, such as power supply or a core network. Different zones can't even be in the same building, although sometimes they are large enough that a single zone spans several buildings.

We launched with three autonomous Availability Zones in our US East (N. Virginia) Region. By using zones, and failover mechanisms such as Elastic IP addresses and Elastic Load Balancing, you can provision your infrastructure with redundancy in mind. When two instances are in different zones, and one suffers from a low-level interruption, the other instance should be unaffected.
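
As a minimal illustration of provisioning with redundancy in mind, the sketch below launches one instance in each of two zones using boto3. The AMI and subnet IDs are placeholders, and each subnet is assumed to live in a different Availability Zone.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder subnets, each assumed to sit in a different zone.
subnets_by_zone = {
    "us-east-1a": "subnet-0aaaaaaaaaaaaaaaa",
    "us-east-1b": "subnet-0bbbbbbbbbbbbbbbb",
}

# One instance per zone: a low-level interruption in one zone
# should leave the instance in the other zone unaffected.
for zone, subnet_id in subnets_by_zone.items():
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
    )
```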

How Availability Zones have changed over the years

Availability Zones were originally designed for physical redundancy, but over time they have become re-used for more and more purposes. Zones impact how we build, deploy, and operate software, as well as how we enforce security controls between our largest systems.

For example, many AWS services are now built so that as much functionality as possible can be autonomous within an Availability Zone. The calls used to launch and manage EC2 instances, fail over an RDS instance, or handle the health of instances behind a load balancer, all work within one zone.

This design has a double benefit. First, if an Availability Zone does lose power or connectivity, the remaining zones are unaffected. The second benefit is even more powerful: if there is an error in the software, the risk of that error affecting other zones is minimized.

We maximize this benefit when we deploy new versions of our software, or operational changes such as a configuration edit, as we often do so zone-by-zone, one zone in a Region at a time. Although we automate, and don't manage instances by hand, our developers and operators know not to build tools or procedures that could impact multiple Availability Zones. I'd wager that every new AWS engineer knows within their first week, if not their first day, that we never want to touch more than one zone at a time.
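
To show the shape of that rule, here is a purely hypothetical sketch of a zone-by-zone rollout; deploy_to_zone, zone_is_healthy, and rollback_zone stand in for your own deployment and monitoring tooling.

```python
import time

# Hypothetical hooks into deployment and monitoring tooling.
def deploy_to_zone(zone, version): ...
def rollback_zone(zone, version): ...
def zone_is_healthy(zone): return True

def rollout(zones, version, bake_seconds=900):
    """Deploy to one Availability Zone at a time, never two at once."""
    for zone in zones:
        deploy_to_zone(zone, version)
        # Bake: watch this zone before touching the next one.
        deadline = time.time() + bake_seconds
        while time.time() < deadline:
            if not zone_is_healthy(zone):
                rollback_zone(zone, version)
                raise RuntimeError(f"rollout halted in {zone}")
            time.sleep(30)

rollout(["us-east-1a", "us-east-1b", "us-east-1c"], "v42", bake_seconds=60)
```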

Availability Zones run deep in our AWS development and operations culture, at every level. AWS customers can think of zones in terms of redundancy, "Use two or more Availability Zones for reliability." At AWS, we think of zones in terms of isolation, "Stay within the Availability Zone, as much as possible."

Silo your traffic or not – you choose

When your architecture does stay within an Availability Zone as much as possible, there are more benefits. One is that the latency within a zone is incredibly low. Today, packets between EC2 instances in the same zone take just tens of microseconds to reach each other.

Another benefit is that redundant, zonal architectures are easier to recover when complex issues and emergent behaviors arise. If all of the calls between the various layers of a service stay within one Availability Zone, then when issues occur they can quickly be remediated by removing the entire zone from service, without first needing to identify which layer or component was the trigger.

Many of you also use this kind of "silo" pattern in your own architecture, where Amazon Route 53 or Elastic Load Balancing can be used to choose an Availability Zone to handle a request, but can also be used to keep subsequent internal requests and dependencies within that same zone. This is only meaningful because of the strong boundaries and separation between zones at the AWS level.
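
A small sketch of the silo idea from the instance's point of view: the EC2 instance metadata service (a real endpoint) tells us which zone we're in, while the per-zone endpoint map for a downstream dependency is hypothetical.

```python
import urllib.request

# Real endpoint: the EC2 instance metadata service reports our zone.
ZONE_URL = ("http://169.254.169.254/latest/meta-data/"
            "placement/availability-zone")

def my_zone():
    with urllib.request.urlopen(ZONE_URL, timeout=1) as resp:
        return resp.read().decode()

# Hypothetical per-zone endpoints for a downstream dependency.
BACKEND_BY_ZONE = {
    "us-east-1a": "backend-1a.example.internal",
    "us-east-1b": "backend-1b.example.internal",
}

def backend_endpoint():
    # Prefer the backend in our own zone; fall back if unknown.
    return BACKEND_BY_ZONE.get(my_zone(),
                               next(iter(BACKEND_BY_ZONE.values())))
```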

Regional isolation

Not too long after we launched Availability Zones, we also launched our second Region, EU (Ireland). Early in the design, we considered operating a seamless global network, with open connectivity between instances in each Region. Services such as S3 would have behaved as "one big S3," with keys and data accessible and mutable from either location.

The more we thought through this design, the more we realized that there would be risks of issues and errors spreading between Regions, potentially resulting in large-scale interruptions that would defeat our most important goals:

  • To provide the highest levels of availability
  • To allow Regions to act as standby sites for each other
  • To provide geographic diversity and lower latencies to end users

Our experience with the benefits of Availability Zones meant that instead we doubled down on compartmentalization, and decided to isolate Regions from each other with our hardest boundaries. Since then, and still today, our services operate autonomously in each Region: full stacks of S3, DynamoDB, Amazon RDS, and everything else.

Many of you still want to be able to run workloads and access data globally. For our edge services such as Amazon CloudFront, Amazon Route 53, and AWS Lambda@Edge, we operate over 100 points of presence. Each is autonomous, with its own compartmentalization.

As we develop and ship our services that span Regions, such as S3 cross-region object replication, Amazon DynamoDB global tables, and Amazon VPC inter-region peering, we take enormous care to ensure that the dependencies and calling patterns between Regions are asynchronous and ring-fenced with high-level safety mechanisms that prevent errors from spreading.

Doubling down on compartmentalization, again

With the phenomenal growth of AWS, it is humbling to consider how many customers are served even by our smallest Availability Zones. For some time now, many of our services have been operating service stacks that are compartmentalized even within zones.

For example, AWS HyperPlane—the internal service that powers NAT gateways, Network Load Balancers, and AWS PrivateLink—is internally subdivided into cells that each handle a distinct set of customers. If there are any issues with a cell, the impact is limited not just to an Availability Zone, but to a subset of customers within that zone. Of course, all sorts of automation immediately kick in to mitigate any impact to even that subset.
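
The core mechanic of cell-based partitioning is simple enough to sketch; the cell count and hash below are illustrative, not HyperPlane's actual scheme.

```python
import hashlib

NUM_CELLS = 16  # illustrative; real systems choose this carefully

def cell_for_customer(customer_id: str) -> int:
    """Deterministically map a customer to a cell.

    A stable hash means a given customer always lands in the same
    cell, so an issue in one cell touches only that fixed subset
    of customers.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CELLS

print(cell_for_customer("customer-42"))  # same customer, same cell, every time
```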

Ten years after launching Availability Zones, we're excited that we're still relentless about reducing the impact of potential issues. We firmly believe it's one of the most important strategies for achieving our top goals of security and availability. We now have 54 Availability Zones, across 18 geographic Regions, and we've announced plans for 12 more. Beyond that geographic growth, we'll be extending the concept of compartmentalization that underlies Availability Zones deeper and deeper, to be more effective than ever.

Infinitely scalable machine learning with Amazon SageMaker


In machine learning, more is usually more. For example, training on more data means more accurate models.

At AWS, we continue to strive to enable builders to build cutting-edge technologies faster in a secure, reliable, and scalable fashion. Machine learning is one such transformational technology that is top of mind not only for CIOs and CEOs, but also developers and data scientists. Last re:Invent, to make the problem of authoring, training, and hosting ML models easier, faster, and more reliable, we launched Amazon SageMaker. Now, thousands of customers are trying Amazon SageMaker and building ML models on top of their data lakes in AWS.

While building Amazon SageMaker and applying it to large-scale machine learning problems, we realized that scalability is one of the key aspects that we need to focus on. So, when designing Amazon SageMaker, we took on a challenge: to build machine learning algorithms that can handle an infinite amount of data. What does that even mean, though? Clearly, no customer has an infinite amount of data.

Nevertheless, for many customers, the amount of data that they have is indistinguishable from infinite. Bill Simmons, CTO of Dataxu, states, "We process 3 million ad requests a second - 100,000 features per request. That's 250 trillion ad requests per day. Not your run-of-the-mill data science problem!" For these customers and many more, the notion of "the data" does not exist. It's not static. Data always keeps being accrued. Their answer to the question "how much data do you have?" is "how much can you handle?"

To make things even more challenging, a system that can handle a single large training job is not nearly good enough if training jobs are slow or expensive. Machine learning models are usually trained tens or hundreds of times. During development, many different versions of the eventual training job are run. Then, to choose the best hyperparameters, many training jobs are run simultaneously with slightly different configurations. Finally, re-training is performed every x-many minutes/hours/days to keep the models updated with new data. In fraud or abuse prevention applications, models often need to react to new patterns in minutes or even seconds!

To that end, Amazon SageMaker offers algorithms that train on indistinguishable-from-infinite amounts of data both quickly and cheaply. This sounds like a pipe dream. Nevertheless, this is exactly what we set out to do. This post lifts the veil on some of the scientific, system design, and engineering decisions we made along the way.

Streaming algorithms

To handle unbounded amounts of data, our algorithms adopt a streaming computational model. In the streaming model, the algorithm passes over the dataset only once and assumes a fixed memory footprint. This memory restriction precludes basic operations like storing the data in memory, random access to individual records, shuffling the data, or reading through the data several times.

Streaming algorithms are infinitely scalable in the sense that they can consume any amount of data. The cost of adding more data points is independent of the entire corpus size: processing the 10th gigabyte and the 1,000th gigabyte is conceptually the same. Because the memory footprint of the algorithms is fixed, they are guaranteed not to run out of memory (and crash) as the data grows. The compute cost and training time depend linearly on the data size, so training on twice as much data costs twice as much and takes twice as long.
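
To see what the fixed-memory, single-pass restriction looks like in practice, here is a classic streaming computation, Welford's online algorithm for mean and variance. It is not a SageMaker algorithm, just a minimal illustration of the model: constant state, one update per record, no second pass.

```python
class StreamingMeanVar:
    """Welford's online mean/variance: O(1) memory, one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: float):
        # Cost per record is constant, regardless of how much
        # data has already been seen.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = StreamingMeanVar()
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(x)
print(stats.mean, stats.variance)  # 5.0 and ~4.57 (sample variance)
```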

Traditional machine learning algorithms usually consume data from persistent sources such as local disk, Amazon S3, or Amazon EBS. Streaming algorithms, by contrast, also natively consume ephemeral data sources such as Amazon Kinesis streams, pipes, database query results, and almost any other data source.

Another significant advantage of streaming algorithms is the notion of state. The algorithm state contains all the variables, statistics, and data structures needed to perform updates, that is, everything required to continue training. By formalizing this concept and facilitating it with software abstractions, we provide checkpointing capabilities and fault resiliency for all algorithms. Moreover, checkpointing enables multi-pass/multi-epoch training for persistent data, a pause/resume mechanism that is useful for cost-effective HPO, and incremental training that updates the model with only the new data rather than rerunning the entire training job from scratch.
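
A hypothetical sketch of that state abstraction: everything needed to resume training lives in one serializable object, so pause/resume and incremental training reduce to saving and reloading it. The class and serialization details here are illustrative, not SageMaker's internals.

```python
import pickle

class TrainerState:
    """Everything required to continue training, in one object."""
    def __init__(self):
        self.weights = {}      # model parameters
        self.records_seen = 0  # progress marker for resuming

def checkpoint(state: TrainerState, path: str):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def resume(path: str) -> TrainerState:
    with open(path, "rb") as f:
        return pickle.load(f)

# Pause/resume: train, checkpoint, and later continue from the same
# state. This is also the basis for incremental training on new data.
state = TrainerState()
state.records_seen = 1_000_000
checkpoint(state, "/tmp/model.ckpt")
state = resume("/tmp/model.ckpt")
assert state.records_seen == 1_000_000
```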

Acceleration and distribution

When AWS customers run large-scale training tasks on Amazon SageMaker, they are interested in reducing the running time and cost of their job, irrespective of the number and kinds of machines used under the hood. Amazon SageMaker algorithms are therefore built to take advantage of many Amazon EC2 instance types, support both CPU and GPU computation, and distribute across many machines.

Cross-instance support relies heavily on containerization. Amazon SageMaker training supports powerful container management mechanisms that include spinning up large numbers of containers on different hardware with fast networking and access to the underlying hardware, such as GPUs. For example, a training job that takes ten hours to run on a single machine can be run on 10 machines and conclude in one hour. Furthermore, switching those machines to GPU-enabled ones could reduce the running time to minutes. This can all be done without touching a single line of code.

To seamlessly switch between CPU and GPU machines, we use Apache MXNet to interface with the underlying hardware. By designing algorithms that operate efficiently on different types of hardware, our algorithms gain record speeds and efficiency.
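
As a tiny illustration (assuming a recent MXNet version), the same tensor computation below runs on CPU or GPU by changing only the context object.

```python
import mxnet as mx

# Pick a GPU context when one is available, otherwise fall back to CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# Identical code either way; only 'ctx' decides where it executes.
a = mx.nd.random.uniform(shape=(1024, 1024), ctx=ctx)
b = mx.nd.random.uniform(shape=(1024, 1024), ctx=ctx)
c = mx.nd.dot(a, b)
print(c.context)
```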

Distribution across machines is achieved via a parameter server that stores the state of all the machines participating in training. The parameter server is designed for maximal throughput: it updates parameters asynchronously and offers only loose consistency properties for the parameters. While those properties would be unacceptable in a typical relational database design, for machine learning the tradeoff between accuracy and speed is worth it.
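
A toy sketch of that pattern (nothing like the real parameter server's scale or sophistication): workers push updates asynchronously, and reads deliberately skip coordination, so they may observe slightly stale values in exchange for throughput.

```python
import threading

class ToyParameterServer:
    """Toy parameter server: asynchronous pushes, loosely
    consistent reads."""

    def __init__(self, dim):
        self.params = [0.0] * dim
        self.lock = threading.Lock()

    def push(self, grads, lr=0.01):
        # A brief lock per update; no global barrier across workers.
        with self.lock:
            for i, g in enumerate(grads):
                self.params[i] -= lr * g

    def pull(self):
        # Deliberately uncoordinated read: values may be mid-update.
        return list(self.params)

server = ToyParameterServer(dim=4)
workers = [threading.Thread(target=server.push, args=([1.0] * 4,))
           for _ in range(8)]
for w in workers: w.start()
for w in workers: w.join()
print(server.pull())  # each parameter nudged by 8 async updates (-0.08)
```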

Post-training model tuning and rich states

Processing massively scalable datasets in a streaming manner poses a challenge for model tuning, also known as hyperparameter optimization (HPO). In HPO, many training jobs are run with different configurations or training parameters, with the goal of finding the best configuration, usually the one corresponding to the best test accuracy. In the streaming setting, rerunning many training jobs over the same data is impossible.

For ephemeral data sources, the data is simply no longer available for rerunning the training job (for persistent data sources, rerunning is possible but inefficient). Amazon SageMaker algorithms solve this by training an expressive state object from which many different models can be created. That is, a large number of different training configurations can be explored after only a single training job.
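
One illustrative instance of such a state (my sketch, not SageMaker's implementation): for linear models, a single streaming pass can accumulate the sufficient statistics XᵀX and Xᵀy, after which ridge-regression models for many different regularization strengths can be solved without touching the data again.

```python
import numpy as np

d = 3
xtx = np.zeros((d, d))  # accumulates X^T X
xty = np.zeros(d)       # accumulates X^T y

# One streaming pass over the data (synthetic stream for illustration).
rng = np.random.default_rng(0)
for _ in range(10_000):
    x = rng.normal(size=d)
    y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1)
    xtx += np.outer(x, x)
    xty += x * y

# Post-training tuning: many ridge models from one state,
# without re-reading a single record.
for lam in (0.01, 0.1, 1.0, 10.0):
    w = np.linalg.solve(xtx + lam * np.eye(d), xty)
    print(lam, np.round(w, 3))
```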

Summary

Amazon SageMaker offers production-ready, infinitely scalable algorithms such as:

  • Linear Learner
  • Factorization Machines
  • Neural Topic Modeling
  • Principal Component Analysis (PCA)
  • K-Means clustering
  • DeepAR forecasting

They adhere to the design principles above and rely on Amazon SageMaker's robust training stack. They are operationalized by a thick, common SDK that allows us to test them thoroughly before deployment. We have invested heavily in the research and development of each algorithm, and every one of them advances the state of the art. Amazon SageMaker algorithms train larger models on more data than any other open-source solution out there. When a comparison is possible, Amazon SageMaker algorithms often run more than 10x faster than other ML solutions like Spark ML. Amazon SageMaker algorithms often cost cents on the dollar to train, in terms of compute costs, and produce more accurate models than the alternatives.

I think the time is here for using large-scale machine learning in large-scale production systems. Companies with truly massive and ever-growing datasets must not fear the overhead of operating large ML systems or developing the associated ML know-how. AWS is delighted to innovate on our customers' behalf and to be a thought leader, especially in exciting areas like machine learning. I hope and believe that Amazon SageMaker and its growing set of algorithms will change the way companies do machine learning.