Modeling systems has always been part of the computer scientist's toolkit. We often reduce systems to simple queuing models to answer throughput and latency questions, and then use those results to predict resource usage and drive allocation. But can one actually be confident that such a simple model accurately reflects reality? With the increased complexity of distributed systems built on a large-scale autonomous-services model, these techniques become a lot less reliable.
I would like to use modeling techniques to focus on more than just achieving simple SLAs. I want to understand the cost impact of using certain algorithms in combination with specific node and network configurations, especially under certain failure scenarios. I would use such an economic model during the design phase of a service or application to evaluate different algorithms for achieving consistency and availability based on their cost impact. For example, if a service needs to achieve state persistence that can survive a complete datacenter outage, and the service needs to be accessed by clients in ten datacenters with a certain SLA, there is a range of algorithmic and configuration choices to be made.
In these situations, systems design has often focused on trying to achieve the performance and availability SLAs first, which in itself is difficult enough. The economics of the different algorithm and configuration choices are often considered to be of secondary importance. However, when you are determining the cost of a system, you have to consider the size of the replication units in combination with the density of the storage nodes, the reliability of the storage system, the step-function cost of inter-datacenter networking, and the location and reliability of data caches. This results in a base cost plus a cost per storage operation that is different in a quorum-based system than in a probabilistic system. This holds especially true when the model also includes the cost of recovering from a cache node, storage node, or datacenter failure.
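To make the "base cost plus cost per storage operation" framing concrete, here is a minimal sketch of such an economic model. Every number and parameter below is a hypothetical illustration, not a measured figure; the point is only that the same cost function, fed with the message patterns of a quorum-based versus a probabilistic (gossip-style) replication scheme, can rank configurations by total cost rather than by SLA alone.

```python
def monthly_cost(replicas, node_cost, ops_per_month, msgs_per_op,
                 msg_cost, inter_dc_fraction, inter_dc_premium):
    """Base cost (provisioned replicas) plus a per-operation network cost.

    inter_dc_fraction: share of messages that cross datacenter boundaries.
    inter_dc_premium:  cost multiplier for an inter-datacenter message,
                       relative to a local one. All values are illustrative.
    """
    base = replicas * node_cost
    # Blend local and inter-datacenter message costs into one effective rate.
    per_msg = msg_cost * (1 + inter_dc_fraction * inter_dc_premium)
    return base + ops_per_month * msgs_per_op * per_msg

# Quorum-based scheme: 5 replicas, each write must reach a majority (3 msgs/op).
quorum = monthly_cost(replicas=5, node_cost=200.0, ops_per_month=1e8,
                      msgs_per_op=3, msg_cost=1e-6,
                      inter_dc_fraction=0.4, inter_dc_premium=9.0)

# Probabilistic scheme: fewer synchronous messages per op (gossip fanout 2),
# but more replicas to reach comparable durability.
prob = monthly_cost(replicas=7, node_cost=200.0, ops_per_month=1e8,
                    msgs_per_op=2, msg_cost=1e-6,
                    inter_dc_fraction=0.4, inter_dc_premium=9.0)

print(f"quorum: ${quorum:,.0f}/month, probabilistic: ${prob:,.0f}/month")
```

A real model would replace these constants with distributions driven by measured data and add terms for recovery traffic after cache, node, or datacenter failures, which is exactly where the simple intuitions start to break down.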
Many have assumed that throwing a lot of cheap hardware at the problem is the answer to many of these questions, but our experience is that when taking complex multi-datacenter configurations into account, the answers are less obvious. As we build new services we need better models that can handle these very complex, multi-variant scenarios to make sure we build the right services at the right cost. At Amazon we are fortunate to have a lot of data that will allow us to make progress on these questions.
I have positions open for experienced engineers/scientists who want to work with me on the problems of modeling complex distributed systems. These are some of the things I will look for in candidates:
- You have a very solid understanding of distributed systems and networking
- You know how to do data-driven analysis and truly understand statistics
- You understand large scale monitoring and data collection architectures
- You are familiar with the current state of the art in distributed systems modeling
- You are an experienced engineer with a track record of building complex systems
- If you are less experienced, you may instead have an advanced degree with proven expertise in modeling complex distributed systems, along with demonstrated involvement in large software projects (e.g., open source)
- You have a proven ability to effectively communicate the results of data analysis and modeling
- You live in or are willing to relocate to the greater Seattle area
If you are interested in this work and feel you are qualified, send me an email with your resume.