Driving down the cost of Big-Data analytics

August 18, 2011

The Amazon Elastic MapReduce (EMR) team announced today the ability to seamlessly use Amazon EC2 Spot Instances with their service, significantly driving down the cost of data analytics in the cloud. Many of our Big-Data customers already saw a big drop in their AWS bill last month when the cost of incoming bandwidth was dropped to $0.00. Now, given that customers using Spot Instances have historically seen cost savings of up to 66% over On-Demand Instance prices, Amazon EMR customers are poised to achieve even greater cost savings.

Analyzing vast amounts of data is critical for companies looking to incorporate customer insights into their business, including building recommendation engines or optimizing customer targeting. Hadoop is quickly becoming the preferred tool for this type of large-scale data analytics. However, Hadoop users often waste significant intellectual bandwidth on managing clusters and running Hadoop jobs rather than focusing on creating value through analytics. Amazon Elastic MapReduce takes away much of this muck by providing a hosted Hadoop framework that enables businesses, researchers, data analysts, and developers to easily and efficiently spin up resizable clusters for distributed processing of large data sets.

An interesting observation is that data analytics is no longer the purview of large enterprises. Every young business launching today knows they must integrate data collection and analytics from the start. In order to compete in today’s market, these companies must have a deep understanding of their customers’ behavior, allowing them to continuously improve how they serve them. Launching a business with a minimally viable product and then rapidly iterating in the direction that customers lead them is becoming a standard approach to success. However, this cannot be done without efficient, scalable data analytics. Many of these startups are using Hadoop for data processing and Amazon Elastic MapReduce is the ideal environment for them: it provides instant scalability and lets them focus on analytics while EMR handles the hassle of running the various Hadoop components. Given the initial shoestring budget of many of these new companies, driving down the overall cost of analytics using Spot Instances is a huge benefit.

There are three categories of instances in an Amazon EMR cluster: 1) the Master Instance Group, which contains the Hadoop Master Node that schedules the various tasks, 2) the Core Instance Group, which contains instances that both store the data to be analyzed and run map and reduce tasks, and 3) the Task Instance Group, which only runs map and reduce tasks. For each instance group, you can decide to use On-Demand Instances (possibly from your Reserved Instances pool) or Spot Instances. If you choose to use Spot Instances, you provide the bid price you are willing to pay for each instance in that group. If the current Spot Price is below the bid price, the Instance Group will launch. Which instance groups are appropriate for Spot Instances depends on the use case. For example, for data-critical workloads you might decide to run only the Task Group on Spot Instances, with the Core Group on On-Demand, while if you are performing application testing you may decide to run all Instance Groups using Spot Instances.

If you want a quick introduction to mixing Spot Instances with On-Demand Instances in an Amazon EMR cluster, watch this Getting Started Video. More details can be found in the Spot Instances Section of the Amazon Elastic MapReduce Developer Guide. The posting on the AWS developer blog also has some more background.