The Story of Apollo - Amazon’s Deployment Engine


Automated deployments are the backbone of a strong DevOps environment. Without efficient, reliable, and repeatable software updates, engineers need to redirect their focus from developing new features to managing and debugging their deployments. Amazon first faced this challenge many years ago.

When making the move to a service-oriented architecture, Amazon refactored its software into small independent services and restructured its organization into small autonomous teams. Each team took on full ownership of the development and operation of a single service, and they worked directly with their customers to improve it. With this clear focus and control, the teams were able to quickly produce new features, but their deployment process soon became a bottleneck. Manual deployment steps slowed down releases and introduced bugs caused by human error. Many teams started to fully automate their deployments to fix this, but that was not as simple as it first appeared.

Deploying software to a single host is easy. You can SSH into a machine, run a script, get the result, and you’re done. The Amazon production environment, however, is more complex than that. Amazon web applications and web services run across large fleets of hosts spanning multiple data centers. The applications cannot afford any downtime, planned or otherwise. An automated deployment system needs to carefully sequence a software update across a fleet while it is actively receiving traffic. The system also needs built-in logic to respond correctly to the many potential failure cases.

It didn’t make sense for each of the small service teams to duplicate this work, so Amazon created a shared internal deployment service called Apollo. Apollo’s job was to reliably deploy a specified set of software across a target fleet of hosts. Developers could define their software setup process for a single host, and Apollo would coordinate that update across an entire fleet of hosts. This made it easy for developers to “push-button” deploy their application to a development host for debugging, to a staging environment for tests, and finally to production to release an update to customers. The added efficiency and reliability of automated deployments removed the bottleneck and enabled the teams to rapidly deliver new features for their services.

Over time, Amazon has relied on and dramatically improved Apollo to fuel the constant stream of improvements to our web sites and web services. Thousands of Amazon developers use Apollo each day to deploy a wide variety of software, from Java, Python, and Ruby apps, to HTML web sites, to native code services. In the past 12 months alone, Apollo was used for 50M deployments to development, testing, and production hosts. That’s an average of more than one deployment each second.

The extensive use of Apollo inside Amazon has driven the addition of many valuable features. It can perform a rolling update across a fleet where only a fraction of the hosts are taken offline at a time to be upgraded, allowing an application to remain available during a deployment. If a fleet is distributed across separate data centers, Apollo will stripe the rolling update to simultaneously deploy to an equivalent number of hosts in each location. This keeps the fleet balanced and maximizes redundancy in the case of any unexpected events. When the fleet scales up to handle higher load, Apollo automatically installs the latest version of the software on the newly added hosts.
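To make the striping idea concrete, here is a minimal sketch (not Apollo's actual implementation) of how a rolling update can be split into batches that take an equal number of hosts offline in each data center at a time. The `striped_batches` function and the `fleet` data are hypothetical illustrations:

```python
from itertools import zip_longest

def striped_batches(fleet_by_dc, batch_per_dc):
    """Yield deployment batches that take an equal number of hosts
    offline in each data center at a time, keeping the fleet balanced."""
    # Split each data center's host list into chunks of batch_per_dc hosts.
    chunked = {
        dc: [hosts[i:i + batch_per_dc] for i in range(0, len(hosts), batch_per_dc)]
        for dc, hosts in fleet_by_dc.items()
    }
    # Stripe: the Nth batch combines the Nth chunk from every data center,
    # so no single location ever loses more than batch_per_dc hosts at once.
    for chunks in zip_longest(*chunked.values(), fillvalue=[]):
        yield [host for chunk in chunks for host in chunk]

# Hypothetical fleet spanning two data centers.
fleet = {
    "dc-east": ["e1", "e2", "e3", "e4"],
    "dc-west": ["w1", "w2", "w3"],
}
batches = list(striped_batches(fleet, batch_per_dc=2))
# The first batch upgrades two hosts in each data center simultaneously;
# the rest of the fleet keeps serving traffic.
```

The key property is that redundancy is preserved per location: an outage in one data center during the deployment still leaves most of every other data center's hosts serving traffic.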

Apollo also tracks the detailed deployment status on individual hosts, and that information is leveraged in many scenarios. If the number of failed host updates crosses a configurable threshold, Apollo will automatically halt a deployment before it affects the application’s availability. On the next deployment, such as a quick rollback to the prior version, Apollo will start updating these failed hosts first, thus bringing the whole fleet to a healthy state as quickly as possible. Developers can monitor ongoing deployments and view their history to answer important questions like “When was this code deployed into production, and which hosts are currently running it?” or “What version of the application was running in production last week?”
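The two behaviors above can be sketched together: halt once failures cross a threshold, and on a subsequent run visit previously failed hosts first. This is an illustrative simplification, not Apollo's real logic; `deploy`, `update_host`, and the host names are all hypothetical:

```python
def deploy(hosts, update_host, max_failures, previously_failed=()):
    """Update hosts one at a time. Previously failed hosts go first, so a
    quick rollback heals them immediately; halt once failures cross the
    configurable threshold, before availability is affected fleet-wide."""
    ordered = [h for h in previously_failed if h in hosts]
    ordered += [h for h in hosts if h not in previously_failed]
    failed = []
    for host in ordered:
        if len(failed) >= max_failures:
            # Threshold crossed: stop touching healthy hosts.
            return {"status": "halted", "failed": failed,
                    "remaining": ordered[ordered.index(host):]}
        if not update_host(host):  # update_host returns True on success
            failed.append(host)
    status = "succeeded" if not failed else "completed-with-failures"
    return {"status": status, "failed": failed}

# Example: "h2" fails its update, so with a threshold of 1 the
# deployment halts before touching "h3" and "h4".
result = deploy(["h1", "h2", "h3", "h4"],
                update_host=lambda h: h != "h2",
                max_failures=1)
```

A real system would also batch hosts, run health checks after each update, and time out stuck hosts, but the ordering and halting rules are the core of the safety story.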

Many of our customers are facing similar issues as they increase the rate of their application updates. They’ve asked us how we do it, because they want to optimize their processes to achieve the same rapid delivery. Since automated deployments are a fundamental requirement for agile software delivery, we created a new service called AWS CodeDeploy.

CodeDeploy allows you to plug in your existing application setup logic, and then configure the desired deployment strategy across your fleets of EC2 instances. CodeDeploy will take care of orchestrating the fleet rollout, monitoring the status, and giving you a clear dashboard to control and track all of your deployments. It simplifies and standardizes your software release process so that developers can focus on what they do best: building new features for their customers. Jeff Barr’s blog post includes a great walkthrough of using CodeDeploy, and is a good place to start to gain a deeper understanding of the service.
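The “plug in your existing setup logic” piece is expressed through an AppSpec file (`appspec.yml`) bundled with your application revision, which maps your own scripts to deployment lifecycle events. A minimal example along the lines of the CodeDeploy documentation might look like this (paths and script names are illustrative):

```yaml
version: 0.0
os: linux
files:
  # Copy the bundle's contents to the instance.
  - source: /
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/install_dependencies.sh
  ApplicationStart:
    - location: scripts/start_server.sh
  ValidateService:
    # Fail the deployment on this host if the health check fails.
    - location: scripts/health_check.sh
      timeout: 300
```

CodeDeploy runs these hooks on each instance in the order dictated by your chosen deployment configuration, which is what lets it roll an update across a fleet without you scripting the orchestration yourself.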

AWS CodeDeploy is the first in a set of ALM services designed to help customers build a productive cloud development process. We look forward to sharing more about the internal processes and tools that Amazon uses for agile software delivery. Apollo is just one piece of a larger set of solutions. We’d like to hear about the ways that you think we can help improve your delivery process, so please connect directly with the product team on the CodeDeploy forum.


blog comments powered by Disqus