Under the Hood of Amazon EC2 Container Service


In my last post about Amazon EC2 Container Service (Amazon ECS), I discussed the two key components of running modern distributed applications on a cluster: reliable state management and flexible scheduling. Amazon ECS makes building and running containerized applications simple, but how that happens is what makes Amazon ECS interesting. Today, I want to explore the Amazon ECS architecture and what this architecture enables. Below is a diagram of the basic components of Amazon ECS:

How we coordinate the cluster

Let’s talk about what Amazon ECS actually does. The core of Amazon ECS is the cluster manager, a backend service that handles the tasks of cluster coordination and state management. On top of the cluster manager sit various schedulers. Cluster management and container scheduling are decoupled from each other, allowing customers to use and build their own schedulers. A cluster is simply a pool of compute resources available to a customer’s applications. At this time, the pool of resources is the CPU, memory, and networking resources of Amazon EC2 instances, as partitioned by containers. Amazon ECS coordinates the cluster through the Amazon ECS Container Agent running on each EC2 instance in the cluster. The agent allows Amazon ECS to communicate with the EC2 instances in the cluster to start, stop, and monitor containers as requested by a user or scheduler. The agent is written in Go, has a minimal footprint, and is available on GitHub under an Apache license. We encourage contributions, and feedback is most welcome.

How we manage state

To coordinate the cluster, we need a single source of truth about the clusters themselves: the EC2 instances in the clusters, the tasks running on those instances, the containers that make up a task, and the resources available or occupied (e.g., network ports, memory, CPU). There is no way to successfully start and stop containers without accurate knowledge of the state of the cluster. To solve this, state needs to be stored somewhere, so at the heart of any modern cluster manager is a key/value store.

This key/value store acts as the single source of truth for all information on the cluster: state, and all changes to state transitions, are entered and stored here. To be robust and scalable, this key/value store needs to be distributed for durability and availability, and to protect against network partitions and hardware failures. But because the key/value store is distributed, keeping data consistent and handling concurrent changes becomes more difficult, especially in an environment where state is constantly changing (e.g., containers stopping and starting). Some form of concurrency control therefore has to be put in place to make sure that multiple state changes don’t conflict. For example, if two developers request all the remaining memory resources from a certain EC2 instance for their containers, only one container can actually receive those resources; the other has to be told its request could not be completed.

To achieve concurrency control, we implemented Amazon ECS using one of Amazon’s core distributed systems primitives: a Paxos-based transactional journal, a data store that keeps a record of every change made to a data entry. Any write to the data store is committed as a transaction in the journal with a specific order-based ID. The current value of an entry in the data store is the result of all transactions made on it, as recorded by the journal. Any read from the data store is a snapshot in time of the journal. For a write to succeed, the proposed write must be the latest transaction since the last read. This primitive allows Amazon ECS to store its cluster state information with optimistic concurrency, which is ideal in environments where constantly changing data is shared, such as the state of a shared pool of compute resources. This architecture affords Amazon ECS high availability, low latency, and high throughput because the data store is never pessimistically locked.
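To make the optimistic concurrency model concrete, here is a toy sketch in Python of the version-checked write pattern described above. This is an illustration only, not the actual Amazon ECS data store, and all names are hypothetical:

```python
class ConflictError(Exception):
    """Raised when a write loses an optimistic concurrency race."""

class JournaledStore:
    """Toy, single-process model of a journal-backed key/value store.

    Each write is appended to a journal with a monotonically increasing
    transaction ID; the current value of a key is the result of replaying
    the journal entries for that key.
    """

    def __init__(self):
        self.journal = []   # list of (txn_id, key, value)
        self.current = {}   # key -> (txn_id, value) snapshot

    def read(self, key):
        """Return (txn_id, value): a snapshot in time of the journal."""
        return self.current.get(key, (0, None))

    def write(self, key, value, expected_txn_id):
        """Commit only if nothing else was written since our last read."""
        latest_txn_id, _ = self.current.get(key, (0, None))
        if latest_txn_id != expected_txn_id:
            raise ConflictError(f"{key}: expected txn {expected_txn_id}, "
                                f"latest is {latest_txn_id}")
        new_txn_id = latest_txn_id + 1
        self.journal.append((new_txn_id, key, value))
        self.current[key] = (new_txn_id, value)
        return new_txn_id

# Two schedulers race for the last of an instance's memory: the second
# write sees a stale transaction ID and must re-read and retry, or fail.
store = JournaledStore()
txn, _ = store.read("instance-1/memory")
store.write("instance-1/memory", {"free_mib": 0}, expected_txn_id=txn)
try:
    store.write("instance-1/memory", {"free_mib": 0}, expected_txn_id=txn)
except ConflictError as e:
    print("second request rejected:", e)
```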

Programmatic access through the API

Now that we have a key/value store, we can successfully coordinate the cluster and ensure that the desired number of containers is running because we have a reliable method to store and retrieve the state of the cluster. As mentioned earlier, we decoupled container scheduling from cluster management because we want customers to be able to take advantage of Amazon ECS’ state management capabilities. We have opened up the Amazon ECS cluster manager through a set of API actions that allow customers to access all the cluster state information stored in our key/value store in a structured manner.

Through ‘list’ commands, customers can retrieve the clusters under management, the EC2 instances running in a specific cluster, the running tasks, and the container configuration that makes up the tasks (i.e., the task definition). Through ‘describe’ commands, customers can retrieve details of specific EC2 instances and the resources available on each. Lastly, customers can start and stop tasks anywhere in the cluster. We recently ran a series of load tests on Amazon ECS, and we want to share some of the performance characteristics customers should expect when building applications on Amazon ECS.
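As a sketch of what this looks like in practice with the AWS SDK for Python (boto3), the snippet below walks a cluster’s state; the cluster name is a placeholder:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# 'list' calls enumerate the clusters under management, the EC2 container
# instances registered to one of them, and the tasks running there.
clusters = ecs.list_clusters()["clusterArns"]
instances = ecs.list_container_instances(cluster="my-cluster")["containerInstanceArns"]
tasks = ecs.list_tasks(cluster="my-cluster")["taskArns"]

# 'describe' calls drill into each instance's remaining resources and
# into the tasks scheduled onto the cluster.
if instances:
    details = ecs.describe_container_instances(
        cluster="my-cluster", containerInstances=instances
    )
    for ci in details["containerInstances"]:
        free = {r["name"]: r.get("integerValue")
                for r in ci["remainingResources"]
                if r["name"] in ("CPU", "MEMORY")}
        print(ci["ec2InstanceId"], free)

if tasks:
    for task in ecs.describe_tasks(cluster="my-cluster", tasks=tasks)["tasks"]:
        print(task["taskDefinitionArn"], task["lastStatus"])
```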

The above graph shows a load test where we added and removed instances from an Amazon ECS cluster and measured the 50th and 99th percentile latencies of the API call ‘DescribeTask’ over a seventy-two hour period. As you can see, the latency remains relatively jitter-free despite large fluctuations in cluster size. Amazon ECS scales with you no matter how large your cluster grows, all without you needing to operate or scale a cluster manager.

This set of API actions forms the basis of solutions that customers can build on top of Amazon ECS. A scheduler just provides logic around how, when, and where to start and stop containers. Amazon ECS’ architecture is designed to share the state of the cluster and allow customers to run as many varieties of schedulers (e.g., bin packing, spread) as needed for their applications. The architecture enables the schedulers to query the exact state of the cluster and allocate resources from a common pool. The optimistic concurrency control in place allows each scheduler to receive the resources it requested without resource conflicts. Customers have already created a variety of interesting solutions on top of Amazon ECS, and we want to share a few compelling examples.
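For illustration, here is a minimal, hypothetical bin-packing scheduler built on those same APIs: it reads the cluster state, picks the fullest instance that still fits the task, and places the task with StartTask. The cluster and task definition names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

def remaining(ci, name):
    """Pull one remaining-resource value (e.g., CPU or MEMORY) off an instance."""
    for r in ci["remainingResources"]:
        if r["name"] == name:
            return r["integerValue"]
    return 0

def schedule_binpack(cluster, task_def, cpu_needed, mem_needed):
    """Place one task on the fullest instance that still fits it (bin packing)."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    instances = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]

    candidates = [ci for ci in instances
                  if remaining(ci, "CPU") >= cpu_needed
                  and remaining(ci, "MEMORY") >= mem_needed]
    if not candidates:
        raise RuntimeError("no instance has enough free CPU/memory")

    # Bin packing: prefer the instance with the LEAST remaining memory
    # that still fits, leaving larger holes free for larger tasks.
    target = min(candidates, key=lambda ci: remaining(ci, "MEMORY"))
    return ecs.start_task(
        cluster=cluster,
        taskDefinition=task_def,
        containerInstances=[target["containerInstanceArn"]],
    )

# Hypothetical usage: place one copy of a task definition.
# schedule_binpack("my-cluster", "my-task:1", cpu_needed=256, mem_needed=512)
```

If the cluster state changed between the read and the placement, the optimistic concurrency control described above rejects the conflicting placement, and a real scheduler would simply re-read and retry.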

Hailo – Custom scheduling atop an elastic resource pool

Hailo is a free smartphone app that allows people to hail licensed taxis directly to their location. Hailo has a global network of over 60,000 drivers and more than a million passengers. Hailo was founded in 2011 and has been built on AWS since day one. Over the past few years, Hailo has evolved from a monolithic application running in one AWS region to a microservices-based architecture running across multiple regions. Previously, each microservice ran atop a statically partitioned cluster of instances. The problem Hailo experienced was low resource utilization across each partition. This architecture wasn’t very scalable, and Hailo didn’t want its engineers to worry about the details of the infrastructure or the placement of the microservices.

Hailo decided it wanted to schedule containers based on service priority and other runtime metrics atop an elastic resource pool. They chose Amazon ECS as the cluster manager because it is a managed service that can easily enforce task state and fully exposes the cluster state via API calls. This allowed Hailo to build a custom scheduler with logic that met their specific application needs.

Remind – Platform as a service

Remind is a web and mobile application that enables teachers to text message students and stay in touch with parents. Remind has 24 million users and over 1.5 million teachers on its platform, and it delivers 150 million messages per month. Remind initially used Heroku to run its entire application infrastructure, from the message delivery engine, front-end API, and web client to the chat backends. Most of this infrastructure was deployed as a large monolithic application.

As its user base grew, Remind wanted the ability to scale horizontally, so around the end of 2014 the engineering team started to explore moving to a microservices architecture using containers. The team wanted to build a platform as a service (PaaS) on top of AWS that was compatible with the Heroku API. At first, the team looked at a few open-source solutions (e.g., CoreOS and Kubernetes) to handle cluster management and container orchestration, but the engineering team was small, so it didn’t have the time to manage the cluster infrastructure and keep the cluster highly available.

After briefly evaluating Amazon ECS, the team decided to build their PaaS on top of this service. Amazon ECS is fully managed and provides operational efficiency, allowing engineering resources to focus on developing and deploying applications; there are no clusters to manage or scale. In June, Remind open-sourced their PaaS solution on ECS as “Empire.” Remind saw large improvements in performance (e.g., latency and stability) with Empire, as well as security benefits. Their plan over the next few months is to migrate over 90% of the core infrastructure onto Empire.

Amazon ECS – a fully managed platform

These are just a couple of the use cases we have seen from customers. The Amazon ECS architecture allows us to deliver a highly scalable, highly available, low latency container management service. The ability to access shared cluster state with optimistic concurrency through the API empowers customers to create whatever custom container solution they need. We have focused on removing the undifferentiated heavy lifting for customers. With Amazon ECS, there is no cluster manager to install or operate: customers can and should just focus on developing great applications.

We have delivered a number of features since our preview last November. Head over to Jeff Barr’s blog for a recap of the features we have added over the past year. Read our documentation and visit our console to get started. We still have a lot more on our roadmap and we value your feedback: please post questions and requests to our forum or on /r/aws.

Back-to-Basics Weekend Reading - Data Compression


Data compression today is still as important as it was in the early days of computing. Although in those days all computer and storage resources were very limited, the objects in use were also much smaller than they are today. We have seen a shift from generic compression to compression for specific file types, especially images, audio, and video. In this weekend’s back-to-basics reading we go back in time, to 1987 to be specific, when Lelewer and Hirschberg wrote a survey paper covering 40 years of data compression research. It covers all the areas that we like in a back-to-basics paper: it does not present the most modern results, but it gives you a great understanding of the fundamentals. It is a substantial paper, but an easy read.

Data Compression, D.A. Lelewer and D.S. Hirschberg, ACM Computing Surveys 19, 3 (1987), 261-297.

In just three short years, Amazon DynamoDB has emerged as the backbone for many powerful Internet applications such as AdRoll, Druva, DeviceScape, and Battlecamp. Many happy developers are using DynamoDB to handle trillions of requests every day. I am excited to share with you that today we are expanding DynamoDB with streams, cross-region replication, and database triggers. In this blog post, I will explain how these three new capabilities empower you to build applications with distributed systems architecture and create responsive, reliable, and high-performance applications using DynamoDB that work at any scale.

DynamoDB Streams enables your application to get real-time notifications of your tables’ item-level changes. Streams provide you with the underlying infrastructure to create new applications, such as continuously updated free-text search indexes, caches, or other creative extensions requiring up-to-date table changes. DynamoDB Streams is the enabling technology behind two other features announced today: cross-region replication maintains identical copies of DynamoDB tables across AWS regions with push-button ease, and triggers execute AWS Lambda functions on streams, allowing you to respond to changing data conditions. Let me expand on each one of them.

DynamoDB Streams

DynamoDB Streams provides you with a time-ordered sequence, or change log, of all item-level changes made to any DynamoDB table. The stream is exposed via the familiar Amazon Kinesis interface. Using streams, you can apply the changes to a full-text search data store such as Elasticsearch, push incremental backups to Amazon S3, or maintain an up-to-date read cache.

I have heard from many of you that one of your common challenges is keeping DynamoDB data in sync with other data sources, such as search indexes or data warehouses. In traditional database architectures, the database often runs a small search engine or data warehouse engine on the same hardware. However, collocating all engines in a single database turns out to be cumbersome, because the scaling characteristics of a transactional database are different from those of a search index or data warehouse. A more scalable option is to decouple these systems and build a pipe that connects the engines, feeding all change records from the source database to the data warehouse (e.g., Amazon Redshift) and to Elasticsearch machines.

The velocity and variety of the data that you are managing continues to increase, making it more challenging to keep up with change as you try to manage systems and applications in real time and respond to changing conditions. A common design pattern is to capture transactional and operational data (such as logs) that requires high throughput and performance in DynamoDB, and provide periodic updates to search clusters and data warehouses. However, in the past, you had to write code to manage the data changes and to keep the search engine and data warehousing engines in sync. For cost and manageability reasons, some developers have collocated the extract job, the search cluster, and the data warehouse on the same box, leading to performance and scalability compromises. DynamoDB Streams simplifies and improves this design pattern with a distributed systems approach.

You can enable the DynamoDB Streams feature for a table with just a few clicks in the AWS Management Console, or through the DynamoDB API. Once configured, you can use an Amazon EC2 instance to read the stream through the Amazon Kinesis interface and apply the changes in parallel to the search cluster, the data warehouse, and any number of other data consumers. You can read the changes as they occur in real time or in batches, as your requirements dictate. At launch, an item’s change record is available in the stream for 24 hours after it is created. An even simpler option is an AWS Lambda function: you only need to code the logic, then set it and forget it.
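As a sketch of both halves of that flow with the AWS SDK for Python, the snippet below enables a stream on an existing table and then reads the change log through the Kinesis-style interface; the table name is a placeholder:

```python
import boto3

dynamodb = boto3.client("dynamodb")
streams = boto3.client("dynamodbstreams")

# Enable the stream on an existing table (also a few clicks in the console).
table = dynamodb.update_table(
    TableName="orders",  # placeholder table name
    StreamSpecification={"StreamEnabled": True,
                         "StreamViewType": "NEW_AND_OLD_IMAGES"},
)["TableDescription"]
stream_arn = table["LatestStreamArn"]

# Read the change log via the Kinesis-style interface:
# shards -> shard iterator -> records.
desc = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]
for shard in desc["Shards"]:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest available change
    )["ShardIterator"]
    for record in streams.get_records(ShardIterator=iterator)["Records"]:
        # Each record says what changed (INSERT/MODIFY/REMOVE) and to which keys.
        print(record["eventName"], record["dynamodb"]["Keys"])
```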

No matter which mechanism you choose to use, we make the stream data available to you instantly (latency in milliseconds) and how fast you want to apply the changes is up to you. Also, you can choose to program post-commit actions, such as running aggregate analytical functions or updating other dependent tables. This new design pattern allows you to keep your remote data consumers current with the core transactional data residing in DynamoDB at a frequency you desire and scale them independently, thereby leading to better availability, scalability, and performance. The Amazon Kinesis API model gives you a unified programming experience between your streaming apps written for Amazon Kinesis and DynamoDB Streams.

DynamoDB Cross-region Replication

Many modern database applications rely on cross-region replication for disaster recovery, minimizing read latencies (by making data available locally), and easy migration. Today, we are launching cross-region replication support for DynamoDB, enabling you to maintain identical copies of DynamoDB tables across AWS regions with just a few clicks. We have provided you with an application with a simple UI to set up and manage cross-region replication groups and build globally-distributed applications with ease. When you set up a replication group, DynamoDB automatically configures a stream between the tables, bootstraps the original data from source to target, and keeps the two in sync as the data changes. We have publicly shared the source code for the cross-region replication utility, which you can extend to build your own versions of data replication, search, or monitoring applications.

A great example of the application of this cross-region replication functionality is Mapbox, a popular mapping platform that enables developers to integrate location information into their mobile or online applications. Mapbox deals with location data from all over the globe, and its key focus areas have been availability and performance. Having been part of the preview program, Jake Pruitt, Software Developer at Mapbox, told us, “DynamoDB Streams unlocks cross-region replication - a critical feature that enabled us to fully migrate to DynamoDB. Cross-region replication allows us to distribute data across the world for redundancy and speed.” The new feature enables Mapbox to deliver better availability and performance, because they can access all needed data from the nearest data center.

DynamoDB Triggers

From the dawn of databases, the pull method has been the preferred model for interaction with a database. To retrieve data, applications are expected to make API calls and read the data. To get updates from a table, customers have to constantly poll the database with another API call. Relational databases use triggers as a mechanism to enable applications to respond to data changes. However, the execution of the triggers happens on the same machine as the one that runs the database and an errant trigger can wreak havoc on the whole database. In addition, such mechanisms do not scale well for fast-moving data sets and large databases.

To achieve a truly scalable, high-performance, and flexible system, we need to decouple the execution of triggers from the database and bring the data changes to the applications as they occur. Enter DynamoDB Triggers: an event-driven mechanism that enables developers to define Java or JavaScript functions that run outside the database in response to specific data changes in your DynamoDB tables. Specifically, these functions are configured and executed as AWS Lambda functions, giving you the ability to scale on the fly and pay only for the fractions of computing seconds consumed. All you need to do is register the AWS Lambda function that should be executed in response to a specific data change in the DynamoDB table; Lambda and DynamoDB take care of the rest. DynamoDB creates a stream and pushes the data to the trigger code, and Lambda automatically creates and manages the resources needed to handle the trigger. Since the Lambda function executes on hosts different from those of the DynamoDB table, both the DynamoDB table and the Lambda function scale independently, isolating the risk of errant triggers.
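As an illustration, here is a minimal sketch of what such a trigger function might look like. The table attributes, threshold, and helper are hypothetical, and it is written in Python for consistency with the other sketches in this post (as noted above, trigger functions at launch are written in Node.js or Java):

```python
# Hypothetical DynamoDB trigger. Lambda invokes the handler with a batch
# of stream records; each record carries the event type and the item's
# old/new images in DynamoDB's attribute-value format (e.g., {"N": "7"}).

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        change = record["dynamodb"]
        old_qty = int(change.get("OldImage", {}).get("quantity", {"N": "0"})["N"])
        new_qty = int(change.get("NewImage", {}).get("quantity", {"N": "0"})["N"])
        # Hypothetical business rule: react when stock falls below a threshold.
        if new_qty < 10 <= old_qty:
            notify_low_inventory(change["Keys"])
    return {"processed": len(event["Records"])}

def notify_low_inventory(keys):
    # Placeholder: publish to Amazon SNS, open a refill order, etc.
    print("low inventory for item:", keys)
```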

Triggers are powerful mechanisms that react to events dynamically and in real time. Here is a practical real-world example of how triggers can be very useful to businesses: TOKYU HANDS is a fast-growing business with over 70 shops all over Japan. Their cloud architecture has two main components: a point-of-sales system and a merchandising system. The point-of-sales system records changes from all the purchases and stores them in DynamoDB. The merchandising system is used to manage the inventory and identify the right timing and quantity for refilling the inventory. The key challenge for them has been keeping these systems in sync constantly. After previewing DynamoDB Triggers, Naoyuki Yamazaki, Cloud Architect, told us, “TOKYU HANDS is running in-store point-of-sales system backed by DynamoDB and various AWS services. We really like the full-managed service aspect of DynamoDB. With DynamoDB Streams and DynamoDB Triggers, we would now make our systems more connected and automated to respond faster to changing data such as inventory.” This new feature will help them manage inventory better to deliver a good customer experience while gaining more business efficiency.

You can also use triggers to power many modern Internet of Things (IoT) use cases. For example, you can program home sensors to write the state of temperature, water, gas, and electricity directly to DynamoDB. Then, you can set up Lambda functions to listen for updates on the DynamoDB tables and automatically notify users via mobile devices whenever specific levels of changes are detected.
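A minimal sketch of the producer side of that pattern, using the AWS SDK for Python; the table and attribute names are hypothetical:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

def record_reading(sensor_id, temperature_c):
    """Write one sensor reading; a stream-attached Lambda can then react to it."""
    dynamodb.put_item(
        TableName="home-sensors",  # placeholder table name
        Item={
            "sensor_id": {"S": sensor_id},
            "timestamp": {"N": str(int(time.time()))},
            "temperature_c": {"N": str(temperature_c)},
        },
    )

record_reading("living-room-1", 21.5)
```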

Summing It All Up

If you are building mobile, ad-tech, gaming, web, or IoT applications, you can use DynamoDB to build globally distributed applications that deliver consistent and fast performance at any scale. With the three new features that I mentioned, you can now enrich those applications to consume high-velocity data changes and react to updates in near real time. What this means is that, with DynamoDB, you are now empowered to create unique applications that were difficult and expensive to build and manage before. Let me illustrate with an example.

Let’s say that you are managing a supply-chain system with suppliers all over the globe. We all know the advantages that real-time inventory management can provide to such a system. Yet, building such a system that provides speed, scalability, and reliability at a low cost is not easy. On top of this, adding real-time updates for inventory management or extending the system with custom business logic with your own IT infrastructure is complex and costly.

This is where AWS and DynamoDB with cross-region replication, DynamoDB Triggers, and DynamoDB Streams can serve as a one-stop solution that handles all your requirements of scale, performance, and manageability, leaving you to focus on your business logic. All you need to do is to write the data from your products into DynamoDB. As illustrated in this example, if you use RFID tags on your products, you can directly feed the data from the scanners into DynamoDB. Then, you can use cross-region replication to sync the data across multiple AWS regions and bring the data close to your supply base. You can use triggers to monitor for inventory changes and send notifications in real time. To top it all off, you will have the flexibility to extend the DynamoDB Streams functionality for your custom business requirements. For example, you can feed the updates from the stream into a search index and use it for a custom search solution, thereby enabling your internal systems to locate the updates to inventory based on text searches. When you put it all together, you have a powerful business solution that scales to your needs, lets you pay only for what you provision, and helps you differentiate your offerings in the market and drive your business forward faster than before.

The combination of DynamoDB Streams, cross-region replication, and DynamoDB Triggers certainly offers immense potential to enable new and intriguing user scenarios with significantly less effort. You can learn more about these features on Jeff Barr’s blog. I am definitely eager to hear how each of you will use streams to drive more value for your businesses. Feel free to add a comment below and share your thoughts.

Today, Amazon announced the Alexa Skills Kit (ASK), a collection of self-service APIs and tools that make it fast and easy for developers to create new voice-driven capabilities for Alexa. With a few lines of code, developers can easily integrate existing web services with Alexa or, in just a few hours, they can build entirely new experiences designed around voice. No experience with speech recognition or natural language understanding is required—Amazon does all the work to hear, understand, and process the customer’s spoken request so you don’t have to. All of the code runs in the cloud — nothing is installed on any user device.

The easiest way to build a skill for Alexa is to use AWS Lambda, an innovative compute service that runs a developer’s code in response to triggers and automatically manages the compute resources in the AWS Cloud, so there is no need for a developer to provision or continuously run servers. Developers simply upload the code for the new Alexa skill they are creating, and AWS Lambda does the rest, executing the code in response to Alexa voice interactions and automatically managing the compute resources on the developer’s behalf.

Using a Lambda function for your service also eliminates some of the complexity around setting up and managing your own endpoint:

  • You do not need to administer or manage any of the compute resources for your service.
  • You do not need an SSL certificate.
  • You do not need to verify that requests are coming from the Alexa service yourself. Access to execute your function is controlled by permissions within AWS instead.
  • AWS Lambda runs your code only when you need it and scales with your usage, so there is no need to provision or continuously run servers.
  • For most developers, the Lambda free tier is sufficient for the function supporting an Alexa skill. The first one million requests each month are free. Note that the Lambda free tier does not automatically expire, but is available indefinitely.

AWS Lambda supports code written in Node.js (JavaScript) and Java. You can copy JavaScript code directly into the inline code editor in the AWS Lambda console or upload it in a zip file. For basic testing, you can invoke your function manually by sending it JSON requests in the Lambda console.
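To give a feel for the shape of a skill, here is a minimal sketch of a handler together with the ASK response envelope. The intent name is hypothetical, and the sketch uses Python for consistency with the other examples in this post; as noted above, you would write the real function in Node.js (JavaScript) or Java:

```python
# Minimal sketch of a custom Alexa skill handler (hypothetical intent name).

def handler(event, context):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return speak("Welcome. Ask me for today's weather.")
    if request["type"] == "IntentRequest":
        if request["intent"]["name"] == "GetWeatherIntent":  # hypothetical intent
            return speak("It is sunny and 72 degrees.")
    return speak("Sorry, I didn't understand that.")

def speak(text):
    """Wrap plain text in the Alexa Skills Kit response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }
```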

In addition, Amazon announced today that the Alexa Voice Service (AVS), the same service that powers Amazon Echo, is now available to third party hardware makers who want to integrate Alexa into their devices—for free. For example, a Wi-Fi alarm clock maker can create an Alexa-enabled clock radio, so a customer can talk to Alexa as they wake up, asking “What’s the weather today?” or “What time is my first meeting?” Read the press release here.

Got an innovative idea for how voice technology can improve customers’ lives? The Alexa Fund was also announced today and will provide up to $100 million in investments to fuel voice technology innovation. Whether that’s creating new Alexa capabilities with the Alexa Skills Kit, building devices that use Alexa for new and novel voice experiences using the Alexa Voice Service, or something else entirely, if you have a visionary idea, Amazon would love to hear from you.

For more details about Alexa you can check out today’s announcements on the AWS blog and Amazon Appstore blog.

This weekend we go back in time all the way to the beginning of operating systems research. The first SOSP conference, in 1967, included several papers that laid the foundation for the development of structured operating systems. There was, of course, the lauded paper on the THE operating system by Dijkstra, but for this weekend I picked the paper on memory locality by Peter Denning, as this work laid the groundwork for the development of virtual memory systems.

The Working Set Model for Program Behavior, Peter J. Denning, Proceedings of the First ACM Symposium on Operating Systems Principles, October 1967, Gatlinburg, TN, USA.

As we know, the running time of most algorithms increases as the input set grows in size. There is one noticeable exception: a class of distributed algorithms, dubbed local algorithms, that run in constant time, independently of the size of the network. Being highly scalable and fault tolerant, such algorithms are ideal for the operation of large-scale distributed systems. Furthermore, even though the model of local algorithms is very limited, in recent years we have seen many positive results for nontrivial problems. In this weekend’s paper, Jukka Suomela surveys the state of the art in the field, covering impossibility results, deterministic local algorithms, randomized local algorithms, and local algorithms for geometric graphs.

Survey of Local Algorithms, Jukka Suomela, ACM Computing Surveys 45, 2 (February 2013), Article No. 24.

An important way of engaging with AWS customers is through the AWS Global Summit Series. All AWS Summits feature a keynote address highlighting the latest announcements from AWS and customer testimonials, technical sessions led by AWS engineers, and hands-on technical training. You will learn best practices for deploying applications on AWS, optimizing performance, monitoring cloud resources, managing security, cutting costs, and more. You will also have opportunities to meet AWS staff and partners to get your technical questions answered.

At the Summits we focus on education and helping our customers: there are deep technical developer sessions, broad sessions on architectural principles, sessions for enterprise decision makers, and sessions on how to best exploit AWS in public sector or education settings. In all of these we make sure that it is not just the AWS team talking; we invite customers to join us and tell us about their journey.

I am fortunate to join our customers for a number of these Summits. London and Stockholm were already great events, but now we have another series coming up in three very packed weeks starting at the end of June:

Hope to see you in one of these locations!

The AWS Pop-up Loft opens in New York City


Over a year ago the AWS team opened a "pop-up loft" in San Francisco at 925 Market Street. The goal of opening the loft was to give developers an opportunity to get in-person support and education on AWS, to network, to get some work done, or to just hang out with peers. It became a great success; every time I visit the loft there is a great buzz, with people getting advice from our solution architects, getting training, or attending talks and demos. It became such a hit among developers that last August, after its initial four-week run, we decided to reopen the loft, making sure everyone would have continued access to this important resource.

Building on the success of the San Francisco loft, the team will now also open a pop-up loft in New York City on June 24 at 350 West Broadway. It extends the concept pioneered in SF to give NYC developers access to great AWS resources:

  • Ask an Architect: You can schedule a 1:1, 60-minute session with a member of the AWS technical team. Bring your questions about AWS architecture, cost optimization, services and features, and anything else AWS related. And don’t be shy — walk-ins are welcome too.
  • Technical Presentations: AWS solution architects, product managers, and evangelists deliver technical presentations covering some of the highest-rated and best-attended sessions from recent AWS events. Talks cover solutions and services including Amazon Echo, Amazon DynamoDB, mobile gaming, Amazon Elastic MapReduce, and more.
  • AWS Technical Bootcamps: Limited to 25 participants per bootcamp, these full-day bootcamps include hands-on lab exercises that use a live environment with the AWS console. Usually these cost $600, but at the AWS Pop-up Loft we are offering them for free. Bootcamps you can register for include “Getting Started with AWS — Technical,” “Store and Manage Big Data in the Cloud,” “Architecting Highly Available Apps,” and “Taking AWS Operations to the Next Level.”
  • Self-paced, Hands-on Labs: Beginners through advanced users can attend labs on topics that range from creating Amazon EC2 instances to launching and managing a web application with AWS CloudFormation. Usually $30 each, these labs are offered for free in the AWS loft.

You are all invited to join us for the grand opening party at the loft on June 24 at 7 PM. There will be food, drinks, a DJ, and free swag. The event will be packed, so RSVP today if you want to come and mingle with hot startups, accelerators, incubators, VCs, and our AWS technical experts. Entrance is on a first-come, first-served basis.

Both the San Francisco loft and the New York City loft have packed calendars with events. Visit their web pages for more details and to sign up.

I have signed up to do two loft events:

  • Fireside Chat with AWS Community Heroes: On June 16 starting at 6 PM, Jeremy Edberg (Reddit/Netflix) and Valentino Volonghi (AdRoll) will join me at the San Francisco loft for a fireside chat about startups, technology, entrepreneurship, and more. Jeremy and Valentino have been recognized by AWS as Community Heroes, an honor reserved for developers who’ve had a real impact within the community. Following the talk, we’ll kick off a unique networking social including specialty cocktails, beer, wine, food, and party swag!
  • Fireside Chat with NYC Founders: On July 7, a number of startup founders who have gone through NYC accelerators will join me for a conversation about trends in the New York startup scene.

I hope to see you there!

Today was a big day for the Amazon Web Services teams as a whole range of new services and functionality was delivered to our customers. Here is a brief recap of it:

The Amazon Machine Learning service

As I wrote last week, machine learning is becoming an increasingly important tool for building advanced, data-driven applications. At Amazon we have hundreds of teams using machine learning, and by making use of the Machine Learning service we can significantly reduce the time it takes them to bring their technologies into production. And you no longer need to be a machine learning expert to be able to use it.

Amazon Machine Learning is a service that allows you to easily build predictive applications, including fraud detection, demand forecasting, and click prediction. Amazon ML uses powerful algorithms to create machine learning models by finding patterns in your existing data, and then uses these patterns to make predictions from new data as it becomes available. The Amazon ML console and API provide data and model visualization tools, as well as wizards to guide you through the process of creating machine learning models, measuring their quality, and fine-tuning the predictions to match your application requirements. Once the models are created, you can get predictions for your application by using the simple API, without having to implement custom prediction generation code or manage any infrastructure. Amazon ML is highly scalable and can generate billions of predictions and serve those predictions in real time and at high throughput. With Amazon ML there is no setup cost and you pay as you go, so you can start small and scale as your application grows. Details are on the AWS Blog.
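As a sketch of what the prediction side looks like with the AWS SDK for Python, the call below requests a real-time prediction; the model ID, endpoint URL, and record fields are placeholders, and it assumes a model with a real-time endpoint has already been created:

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Assumes a model already exists and a real-time endpoint has been enabled
# for it (e.g., via create_realtime_endpoint). All values are placeholders.
response = ml.predict(
    MLModelId="ml-exampleModelId",
    Record={"ip_address": "198.51.100.7", "order_total": "412.50"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(response["Prediction"])  # predicted label/value plus per-class scores
```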

The Amazon Elastic File System

AWS has for a while been offering a range of storage solutions: objects, block storage, databases, archiving, etc. Customers have been asking us to add file system functionality to this set of solutions, as much of their traditional software requires an EC2-mountable shared file system. When we designed Amazon EFS, we decided to build along the AWS principles: elastic, scalable, highly available, consistent in performance, secure, and cost-effective.

Amazon EFS is a fully-managed service that makes it easy to set up and scale shared file storage in the AWS Cloud. With a few clicks in the AWS Management Console, customers can use Amazon EFS to create file systems that are accessible to EC2 instances and that support standard operating system APIs and file system semantics. Amazon EFS file systems can automatically scale from small file systems to petabyte-scale without needing to provision storage or throughput. Amazon EFS can support thousands of concurrent client connections with consistent performance, making it ideal for a wide range of uses that require on-demand scaling of file system capacity and performance. Amazon EFS is designed to be highly available and durable, storing each file system object redundantly across multiple Availability Zones. With Amazon EFS, there is no minimum fee or setup costs, and customers pay only for the storage they use. Details on the AWS Blog.
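As a sketch, and assuming the SDK support that accompanies the launch, creating a file system and making it mountable from a subnet could look like this with the AWS SDK for Python; the IDs below are placeholders:

```python
import boto3

efs = boto3.client("efs", region_name="us-east-1")

# Create a file system; the creation token makes the call idempotent.
fs = efs.create_file_system(CreationToken="my-shared-fs")
fs_id = fs["FileSystemId"]

# Expose it in a subnet so the EC2 instances there can mount it over NFS.
# The subnet and security group IDs are placeholders.
efs.create_mount_target(
    FileSystemId=fs_id,
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
```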

The Amazon EC2 Container Service (ECS)

Containers are an important building block in modern software development, and since the launch of Amazon ECS last November it has become a very important tool for architects and developers. Today Amazon ECS moves into General Availability (GA), so you can use it for your production systems.

With GA, Amazon ECS also delivers a new scheduler to support long-running applications; see my detailed blog post, State Management and Scheduling with the Amazon EC2 Container Service. Also read the details on the AWS Blog.

AWS Lambda

One of the most exciting technologies we have built lately at AWS is AWS Lambda. Developers have flocked to this serverless programming technology to build event-driven services. A great example at the AWS Summit today came from Valentino Volonghi on AdRoll's use of Lambda to deliver real-time updates around the world to their Amazon DynamoDB instances. Today AWS Lambda is entering General Availability. Two areas where Lambda is driving particular innovation at our customers are mobile and the Internet of Things (IoT). We have taken feedback from our customers and extended AWS Lambda with great new functionality to support their innovative application development:

  • Synchronous Events – You can now create AWS Lambda functions that respond to events in your application in real time (synchronously) as well as asynchronously. Synchronous requests allow mobile and IoT apps to move data transformations and analysis to the cloud, and make it easy for any application or web service to use Lambda to create back-end functionality. Synchronous events operate with low latency, so you can deliver dynamic, interactive experiences to your users (see the invocation sketch after this list). To learn more about using synchronous events, read Getting Started: Handling Synchronous Events in the AWS Lambda Developer Guide.
  • AWS Mobile SDK support for AWS Lambda (Android, iOS) – AWS Lambda is now included in the AWS Mobile SDK, making it easy to build mobile applications that use Lambda functions as their app backend. When invoked through the mobile SDK, the Lambda function automatically has access to data about the device, app, and end user identity, making it easy to create rich, personalized responses to in-app activity. To learn more, visit the AWS Mobile SDK page.
  • Target, Filter, and Route Amazon SNS Notifications with AWS Lambda – You can now invoke a Lambda function by sending it a notification in Amazon SNS, making it easy to modify or filter messages before routing them to mobile devices or other destinations. Apps and services that already send SNS notifications, such as Amazon CloudWatch, gain automatic integration with AWS Lambda through SNS messages without needing to provision or manage infrastructure.
  • Apply Custom Logic to User Preferences and Game State – Amazon Cognito makes it easy to save user data, such as app preferences or game state, in the AWS Cloud and synchronize it among all the user’s devices. You can now use AWS Lambda functions to validate, audit, or modify data as it is synchronized and Cognito will automatically propagate changes made by your Lambda function to the user’s devices. End user identities created using Amazon Cognito are also included in Lambda events, making it easy to store or search for customer-specific data in a mobile, IoT, or web backend.
  • AWS CloudTrail Integration – Lambda now supports AWS CloudTrail logging for API requests. You can also use AWS Lambda to automatically process CloudTrail events to add security checks, auditing, or notifications for any AWS API call.
  • Enhanced Kinesis Stream Management – You can now add, edit and remove Kinesis streams as event sources for Lambda functions using the AWS Lambda console, as well as view existing event sources for your Lambda functions. Multiple Lambda functions can now respond to events in a single Kinesis or DynamoDB stream.
  • Increased Default Limits – Lambda now offers 100 concurrent executions and 1,000 TPS as a default limit and you can contact customer service to have these limits quickly raised to match your production needs.
  • Enhanced Metrics and Logging – In addition to viewing the number of executions of your Lambda function and its error rate and duration, you can now also see throttled attempts through a CloudWatch metric for each function. Amazon CloudWatch Logs now also support time-based sorting, making it easier to search Lambda logs and correlate them with CloudWatch metrics. API enhancements make it easier to distinguish problems in your code (such as uncaught top-level exceptions or timeouts) from errors you catch and return yourself.
  • Simplified Access Model and Cross-Account Support – Lambda now supports resource policies and cross-account access, making it easier to configure event sources such as Amazon S3 buckets and allowing the bucket owner to be in a separate AWS account. Separate IAM roles are no longer required to invoke a Lambda function, making it faster to set up event sources.
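
As promised above, here is a minimal sketch of the synchronous model with the AWS SDK for Python: a client invokes a function and waits for its result. The function name and payload are placeholders:

```python
import json
import boto3

lam = boto3.client("lambda")

# InvocationType='RequestResponse' runs the function synchronously and
# returns its result; 'Event' keeps the asynchronous behavior.
resp = lam.invoke(
    FunctionName="my-backend-function",  # placeholder function name
    InvocationType="RequestResponse",
    Payload=json.dumps({"action": "transform", "value": 42}),
)
print(json.loads(resp["Payload"].read()))
```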

In a few weeks we will also launch Java as a supported programming language in Lambda. More details are on the AWS Blog.

For more details on these new services visit the official AWS Blog and the What's New Section on the AWS website.

Last November, I had the pleasure of announcing the preview of Amazon EC2 Container Service (ECS) at re:Invent. At the time, I wrote about how containerization makes it easier for customers to decompose their applications into smaller building blocks resulting in increased agility and speed of feature releases. I also talked about some of the challenges our customers were facing as they tried to scale container-based applications including challenges around cluster management. Today, I want to dive deeper into some key design decisions we made while building Amazon ECS to address the core problems our customers are facing.

Running modern distributed applications on a cluster requires two key components - reliable state management and flexible scheduling. These are challenging problems that engineers building software systems have been trying to solve for a long time. In the past, many cluster management systems assumed that the cluster was going to be dedicated to a single application or would be statically partitioned to accommodate multiple users. In most cases, the applications you ran on these clusters were limited and set by the administrators. Your jobs were often put in job queues to ensure fairness and increased cluster utilization. For modern distributed applications, many of these approaches break down, especially in the highly dynamic environment enabled by Amazon EC2 and Docker containers. Our customers expect to spin up a pool of compute resources for their clusters on demand and dynamically change the resources available as their jobs change over time. They expect these clusters to span multiple availability zones, and increasingly want to distribute multiple applications - encapsulated in Docker containers - without the need to statically partition the cluster. These applications are typically a mix of long running processes and short lived jobs with varying levels of priority. Perhaps most importantly, our customers told us that they wanted to be able to start with a small cluster and grow over time as their needs grew without adding operational complexity.

A modern scheduling system demands better state management than is available with traditional cluster management systems. Customers running Docker containers across a cluster of Amazon EC2 instances need to know where those containers are running and whether they are in their desired state. They also need information about the resources in use and the resources still available, as well as the ability to respond to failures, including the possibility that an entire Availability Zone becomes unavailable. This requires storing the state of the cluster in a highly available, distributed key/value store. Our customers have told us that scaling and operating these data storage systems is very challenging. Furthermore, they felt that this was undifferentiated heavy lifting and would rather focus their energy on running their applications and growing their businesses. Let's dive into the innovations of Amazon ECS that address these problems and remove much of the complexity and "muck" of running a high-performance, highly scalable, Docker-aware cluster management system.

State Management with Amazon ECS

At Amazon, we have built a number of core distributed systems primitives to support our needs. Amazon ECS is built on top of one of these primitives: a Paxos-based transaction journal that maintains a history of state transitions. Transitions are offered and accepted using optimistic concurrency control, and accepted offers are then replicated, allowing for a highly available, highly scalable, ACID-compliant data store. We expose this state management behind a simple set of APIs. You call the Amazon ECS List and Describe APIs to access the state of your cluster. These APIs give the details of all the instances in your cluster and all the tasks running on those instances. The Amazon ECS APIs respond quickly whether you have a cluster with one instance and a few containers, or a dynamic cluster with hundreds of instances and thousands of containers. There is nothing to install and no database to manage.

Scheduling with Amazon ECS

The state management system underlying Amazon ECS enables us to provide our customers with very powerful scheduling capabilities. Amazon ECS operates as a shared-state cluster management system, giving schedulers full visibility into the state of the cluster. The schedulers compete for the resources they need, and our state management system resolves conflicts and commits serializable transactions to ensure a consistent and highly available view of cluster state. These transactional guarantees ensure that changes in state are not lost, a very important property for ensuring your jobs get the resources they require. Scheduling decisions can therefore be made in parallel by multiple schedulers, letting you move quickly to create the distributed applications that are becoming increasingly common.

Amazon ECS includes schedulers for common workloads like long running services or run once jobs, and customers can write custom schedulers to meet their unique business or application requirements. This means there is no job queue, no waiting for tasks to start while locks are in place, and that your most important applications are getting the resources they need.

I hope this post has given you some insights into how and why we built Amazon ECS. Developing a system with these capabilities is hard and requires a lot of experience in building, scaling, and operating distributed systems. With Amazon ECS, these capabilities are available to you with just a few API calls. Building modern applications has never been easier. For a walkthrough of the new Amazon ECS service scheduler and other features, please read Jeff Barr's post on the AWS Blog, and for a full list of features and capabilities, read Chris Barclay's post on the AWS Compute Blog.