These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

February 18, 2003

ws-reliability continues to be harmful

I was surprised to read Mark Baker's statement that he feels there is no need for reliable communication provisions in web-services runtimes.

I think "reliable messaging" is a huge waste of time. [Mark Baker]

I have always respected Mark's opinion, even though I am not much of a REST fanatic myself. In this case, however, I think that Mark is so caught up in the Web part of the web-services world that he fails to see that the ws-* specs are basically geared towards a world that is transport independent. As such, you cannot build your systems around GET/PUT/POST semantics, because you may run over transports other than HTTP.

But even if you believe that HTTP encapsulation should be the only one, experience with RPC systems has shown the need for an integrated notion of at-most-once for non-idempotent operations, and for exactly-once and ordering if you want to use asynchronous operations. Asynchronous operations remain our only hope for building high-performance systems.
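To make the at-most-once point concrete, here is a minimal sketch of how a runtime might protect a non-idempotent operation from client retries. The class name `AtMostOnceServer`, the request IDs, and the `withdraw` handler are all hypothetical, invented for illustration; the reply cache lives in memory and is never pruned, so this ignores the crash recovery and garbage collection that a real runtime would need.

```python
class AtMostOnceServer:
    """Executes each request ID at most once; retries get the cached reply."""

    def __init__(self, handler):
        self._handler = handler
        self._replies = {}  # request_id -> reply of the first execution

    def handle(self, request_id, payload):
        # A retransmitted request must not re-run the operation.
        if request_id in self._replies:
            return self._replies[request_id]
        reply = self._handler(payload)
        self._replies[request_id] = reply
        return reply


# Hypothetical non-idempotent operation: running it twice withdraws twice.
balance = {"value": 100}

def withdraw(amount):
    balance["value"] -= amount
    return balance["value"]

server = AtMostOnceServer(withdraw)
first = server.handle("req-1", 30)   # executes the withdrawal
retry = server.handle("req-1", 30)   # client retry: served from the cache
```

The retry returns the same reply as the first call and the balance is only reduced once, which is exactly the property that plain resend-on-timeout cannot give you for operations that are not idempotent.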

Maybe most of this confusion is based on the ws-reliability spec that now has been submitted to OASIS for consideration. Quite a few words have already been wasted over how incomplete and technically incorrect this spec is. The spec is a collection of simple and intermediate distributed-systems techniques that provide a mixture of guarantees, many of which are simply incorrect or at best incomplete.

It would be good if the OASIS people would go back to the drawing board and find the right abstraction to put these guarantees into context. One of the higher-level critiques of the spec is that loosely coupled messages suddenly become coupled, without providing a context in which this coupling takes place. If we established the notion of a session or a message stream, this would provide a context within which exactly-once or ordering guarantees can be evaluated. This abstraction can also be used as a unit of failure management, instead of having to do that at the individual message level. ws-reliability does the latter, and at best one can state that it is an incomplete mess. That is not surprising, as it is very hard to get right.
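As a rough illustration of what such a session abstraction buys you, the sketch below numbers messages per session, so that duplicates and reordering can be judged against the session's own state rather than message by message. Everything here (`ReceiverSession`, the session ID, the `abort` method) is a hypothetical name made up for this example and is not taken from the ws-reliability spec.

```python
class ReceiverSession:
    """Receiving end of one session: in-order, exactly-once delivery."""

    def __init__(self, session_id, deliver):
        self.session_id = session_id
        self._deliver = deliver   # callback into the service
        self._next_seq = 0        # next sequence number to hand to the service
        self._pending = {}        # out-of-order messages, keyed by sequence number

    def receive(self, seq, message):
        """Accept a possibly duplicated or reordered message.

        Returns the next expected sequence number as a cumulative ack.
        """
        # Anything below _next_seq was already delivered: drop the duplicate.
        if seq >= self._next_seq and seq not in self._pending:
            self._pending[seq] = message
        # Deliver the longest in-order run that is now available.
        while self._next_seq in self._pending:
            self._deliver(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return self._next_seq

    def abort(self):
        """Fail the session as a unit; returns how many buffered messages were dropped."""
        dropped = len(self._pending)
        self._pending.clear()
        return dropped


delivered = []
session = ReceiverSession("orders-1", delivered.append)
# The network reorders message 0 and 1 and duplicates message 1.
arrivals = [(1, "b"), (0, "a"), (1, "b"), (2, "c")]
acks = [session.receive(seq, msg) for seq, msg in arrivals]
```

The service sees "a", "b", "c" exactly once and in order, and a failure is handled by aborting the whole session rather than by reasoning about each message in isolation. A second `ReceiverSession` would carry its own sequence space, so parallel streams between the same two services would not interfere with each other.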

The session or stream can be used to shield the service from the invocation strategy. By now we know how to build high-performance distributed systems that can handle synchronous, asynchronous and reply-less requests, as long as you have such a containing abstraction. This abstraction also provides you with a context to run parallel streams between services without inter-stream contamination.

There is something in Mark's story about which he is right, although I am not sure whether he intended it the same way as I do: true end-to-end reliability cannot be achieved at the simple protocol level or through an intermediate runtime. Especially if you have a set of services that need to agree on the outcome of an operation, you will need services at a higher level that can help you with that. These services will invariably include some agreement and consensus protocols. Of course I come from a school where we believe that these services can be made easy if you deploy failure detectors at the same time. But Paxos or a partially synchronous protocol could work just as well.
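For readers who have not run into failure detectors, the simplest form is a timeout on heartbeats. The sketch below is only that: a hypothetical `HeartbeatFailureDetector` with the clock passed in explicitly so the example is deterministic. It can wrongly suspect a slow node, and it is not the agreement protocol itself, only the building block such a protocol would consult.

```python
class HeartbeatFailureDetector:
    """Suspects any node whose last heartbeat is older than the timeout."""

    def __init__(self, timeout):
        self._timeout = timeout
        self._last_seen = {}  # node -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        self._last_seen[node] = now

    def suspected(self, now):
        # A suspicion is a guess, not a fact: a slow node looks like a dead one.
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self._timeout
        )


detector = HeartbeatFailureDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=4.0)      # "a" keeps reporting, "b" goes silent
suspects = detector.suspected(now=5.0)
```

Here only "b" ends up suspected, because its last heartbeat is five time units old against a timeout of three.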

These advanced reliability guarantees are best provided through a choreography specification that includes handling of a variety of execution failures both at the service semantic and the runtime level.

Posted by Werner Vogels at February 18, 2003 10:35 AM

Comments

"Quite a /few words/ has already been wasted"

I think that link should point to here

Posted by: Bill de hÓra on February 20, 2003 11:17 AM