Job Opening for a Senior Research Engineer

July 19, 2007

When I was building one of my first teams at Amazon, one that had to work on some really advanced distributed systems technology, I put up a job description on this weblog. I was certainly pleased with the responses. Last year at a conference I heard from some of my former academic colleagues that they were using this description to educate their students about where they were lacking in knowledge or experience. “Werner’s requirements” were used to explain to them that if they wanted to work on one of the really interesting distributed systems of this world, they had better recognize the importance of [insert some random topic].

There are only a few engineers at Amazon who work directly for me, and I currently have a job opening for such a position. It has most of the requirements from the previous description, but in this role there is more emphasis on analysis and modeling. If you are interested and you feel you qualify, you can apply online at the Amazon careers site for job # 025213 or send email to the address in the right column of this page.

This is a job with some really tough requirements, but it is an important job: you will have direct influence on how systems are built at Amazon, and that is something we take very seriously.

Senior Research Engineer in the Office of the CTO

Amazon.com's website is the most well-known front end to one of the world's largest and busiest service-oriented architectures. Its systems requirements are very challenging: maintain high-availability and guaranteed performance in an ultra-scalable fashion while being very cost efficient. From webpage rendering to order pipeline workflow, from data-warehouse to distributed caching, all require unique solutions. Many of these solutions require significant innovation: often these challenges have not even been addressed in research in a production setting at the scale of Amazon.

As an engineer working directly for the Chief Technology Officer, you will be confronted with Amazon.com's toughest problems. You need to be able to dive deep on technology issues, use your analytical skills to reduce a problem to its fundamentals, and create solutions. Important in this process is that you use computer science theory and knowledge of advanced research to design solutions that are fundamental in nature and can provide a solid basis for the Amazon.com platform on which to build.

The nature of the Amazon.com platform is that of a massive distributed system. As with any distributed system, its overall scalability can often be reduced to the scalability of its state management systems. Much of your work will touch the way that data is transferred and stored on tens of thousands of servers across many datacenters around the world. You will need to have a thorough understanding of distributed storage systems, scalable database technologies and data stream processing.

In this position you will investigate the fundamentals of the Amazon system architecture and use modeling to create insight into its reliability, durability, efficiency, performance and scalability. A particular emphasis is on the use of economic models to reason about the optimal use of resources and to build a proper foundation for service pricing.

This position requires you to have good communication skills, as a significant portion of your time will be spent interacting with other Amazon engineers company-wide. You will need to be able to produce written materials and presentations that target engineers and project managers as well as senior executives. You will mentor engineers on the fundamentals of distributed systems and computer science theory.

What specific things are we looking for in you?

You know your distributed systems theory: You know about logical time, snapshots, stability, message ordering, but also ACID and multi-level transactions. You have heard about the FLP impossibility argument. You know why failure detectors can solve it (but you do not have to remember which one diamond-w was). You have at least once tried to understand Paxos by reading the original paper.

You have a good sense for distributed systems practice: you can reason about churn and locality in DHTs. You intuitively know when to apply ordered communication and when to use transactions. You can reason about data consistency in a system where hundreds of nodes are geographically distributed. You know why, for example, autonomy and symmetry are important properties for distributed systems design. You like the elegance of systems based on epidemic techniques.

You have good common sense about scale and availability: you frown when someone mentions two-phase commit in the same sentence as high-availability. You also realize that protocols that require a system "to be stable for a sufficiently long period of time" are not a good basis for building real systems. You understand the elegance of state-machine replication, but understand why it is hard to apply at large scale. You have a solid intuition about the impact of design decisions on the ability to achieve data consistency, and you are not scared by the idea of building systems based on 'eventually consistent' data.

You know about the advances in database technologies. You understand how database performance is optimized and how data-partitioning impacts query optimization. You realize what the limitations of commercial databases are and possess a good intuition about where solutions can be found. You are aware that column orientation and stream processing are not just research topics but actually solve hard problems.

Some of your heroes have actually built real systems: worshipping Dijkstra and Lamport is okay as long as you also know why Jim Gray and Bruce Lindsay deserve a red carpet. You are not afraid to confront Felipe Cabrera or Marvin Theimer when you think they are wrong (never happens, of course :-)).

You have actually built some real systems yourself. At work or at school, you must have faced some really hard distributed problems and solved them. You may have been involved in an open-source project that has solid distributed systems components.

In summary:

You have a very solid understanding of distributed systems and networking
You know how to do data-driven analysis and truly understand statistics
You understand large scale monitoring and data collection architectures
You are familiar with the current state of the art in distributed systems modeling
You are a solid engineer with a track record of building complex systems or you have demonstrated involvement with large public software projects (e.g. open-source)
You have a proven ability to effectively communicate the results of data analysis and modeling
Your communication skills allow you to communicate clearly with diverse audiences

For this position we require a PhD in computer science with expertise in distributed systems and you must have demonstrated expertise in modeling complex distributed systems. If you have an equivalent advanced degree with demonstrated expertise in this field through publications and/or completed products or projects you may also apply.

Update: in response to the question of how flexible the PhD requirement is: it is up to you to convince me it is not relevant in your case. That will be hard, but not impossible.