These are the old pages from the weblog as they were published at Cornell. Visit www.allthingsdistributed.com for up-to-date entries.

January 16, 2003

An unreliable ws-reliability spec

Clemens reviewed ws-reliability yesterday and I agree with most of his comments. Besides the issues that Clemens mentions, the spec has further technical problems. Some of the most prominent are:

  • The requirement that messages need to be persisted has not been thought through well enough (as Clemens already hinted at). The operation on the sender side seems obvious: when you recover, you try to get acknowledgements for those messages you think you have sent but that may have gotten lost in the crash. At the receiver, however, this is less obvious. What does it mean to have delivered the message to the application successfully? Can you be sure about the point of the possible crash? Can you be sure never to deliver duplicate messages to the application during recovery? Does the app also need to handle duplicates? There are no conditions specified for how to remove received messages from the persistent store at the receiver.
  • What exactly are the semantics of an acknowledgement? Does it mean the message was stored in persistent storage? Or that it was successfully delivered to the application?
  • What does time-to-live really mean when received messages are stored persistently? I can send an ack telling the sender I received the message, then get delayed for some reason (maybe a crash), and when I want to deliver the message I notice that its time has expired. According to the current spec I cannot deliver this message and have to drop it. Hence the message transport becomes unreliable.
  • The requirement to send a simple ack immediately for each message will introduce a real mess. When a message gets lost and a subsequent message is received, the receiver acks the new message, making the sender believe that it was reliably received. However, the receiver cannot deliver that message to the app until it has received the retransmission of the missing one. This can cause unreliable behavior: you may have to drop the message if there is a ttl field, or, if the sender crashes before it could retransmit the missing message, the receiver gets stuck forever with a message it has received but cannot deliver. The solution here should have been a delayed ack or a negative ack, allowing the receiver to treat the new message as volatile until the retransmission gap has been filled.
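
The delayed or negative ack suggested in the last point can be sketched roughly as follows. This is only an illustration, not anything taken from the spec: the class name, the sequence numbering starting at 1, and the in-memory buffers are all assumptions, and persistence is left out entirely. The receiver acknowledges only the contiguous prefix of the sequence and reports the gaps, so a message that arrives after a lost one stays volatile until the gap is filled.

```python
class InOrderReceiver:
    """Hypothetical receiver that acks only the contiguous prefix of a sequence."""

    def __init__(self):
        self.next_expected = 1   # lowest sequence number not yet delivered (assumed to start at 1)
        self.pending = {}        # out-of-order messages, held as volatile
        self.delivered = []      # stands in for the hand-off to the application

    def receive(self, seq, payload):
        """Process one message and return (cumulative_ack, missing_sequence_numbers)."""
        if seq >= self.next_expected:
            # Keep the first copy only; later copies of the same number are duplicates.
            self.pending.setdefault(seq, payload)
        # Deliver every message that has become contiguous.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        # Everything below the highest buffered number that is absent is a gap (negative ack).
        highest = max(self.pending, default=0)
        missing = [n for n in range(self.next_expected, highest) if n not in self.pending]
        return self.next_expected - 1, missing


receiver = InOrderReceiver()
first = receiver.receive(1, "a")    # in order: delivered and acknowledged
second = receiver.receive(3, "c")   # message 2 is missing: 3 is buffered, not acknowledged
third = receiver.receive(2, "b")    # gap filled: 2 and 3 are delivered together
```

With this shape the sender never sees an ack for message 3 while message 2 is outstanding, so it has no reason to believe that message 3 was reliably received.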

The spec is full of these problems. It makes it all very unreliable. Hopefully IBM & MS will do a better job on this.

Posted by Werner Vogels at January 16, 2003 12:03 PM
Comments

I'll try and address both postings at once -

Issues from http://radio.weblogs.com/0108971/2003/01/15.html -

First, let me say thank you for pointing out these issues. Many of these things were
discussed during the initial formation of the spec, and we decided that it would be better to
wait until the formation of the OASIS TC, which has now happened. Although the initial spec
was posted as a group of vendors collaborating on a specification, our goal all along has been
to use the initial spec as input to the formation of a WG or TC. We didn't want to go down
too many implementation detail paths, particularly when it comes to things like inherent
requirements on the underlying infrastructure. We also didn't want to go too far down a
"proprietary" path as a rogue gang of vendors, without bringing it to a broader forum such as
an OASIS TC. We look forward to ironing these types of issues out with the other WS-RM TC
members. That being said, let me see if I can address your points specifically -

>Because WS-Reliability is unaware of and not integrated with WS-Routing, it is only useful as
>a point to point mechanism. While routing from the sender to the receiver will likely be
>possible, the "ReplyTo" to send the acknowledgement message to does specify a plain URL and
>doesn't allow integration with a reverse path as per WS-Routing. This means that unless the
>ACK message can be piggybacked on a synchronous response (the luckiest of all circumstances),
>the spec requires either direct connectivity from the receiver back to the sender, which may
>be impossible due to firewalls and NAT, or requires some form of acknowledgement message
>dispatcher gateway at the sender's site, which requires some form of central service
>deployment as well. In short: This doesn't really work for a desktop PC wishing to reliably
>deliver a message to an external service from within the corporate firewall.

Good issue. We actually had many discussions and early versions of the spec that had
attempted to address multi-hop, and perhaps even WS-Routing. Multi-hop issues in general are
being discussed in other work groups like XMLP (SOAP 1.2), WS-Architecture, and WS-I. We look
forward to converging with those discussions to make sure we are in step and doing the right
thing. There is also a bigger issue with WS-Routing in particular in that it is thus far a
proprietary specification.

Another point is that the growing trend in the industry for supporting asynchronous
messaging-style web services communication for interactions within and across the extended
enterprise is going to mean that most organizations will host asynchronous listeners anyhow.
WS-Reliability is not driving the charge there; it's already happening. I agree that some
sort of routing or dispatching is still needed to get back to the desktop PC.
That's a good issue to flesh out in the TC.

>There's quite a few problems to be solved with regards to simple sequence numbers and resends
>of an unaltered, carbon-copy (2.2.2) of the original message considering the accuracy of
>message timestamps, digital signatures, context coordination and techniques to avoid replay
>attacks. Sending the exact same message may be entirely impossible, even if it couldn't be
>delivered properly and therefore the "MUST" requirement of 2.2.2 cannot be fulfilled. Also,
>in 2.2.2 there's a reference to a "specified number of resend attempts" -- who specifies
>them?

We chose to use the message id as the thing that determines whether a message is a duplicate,
for these reasons. The specified number of resend attempts is intended to be a configurable
option, but falls under the category of a requirement on the underlying infrastructure, which
is yet to be specified.

>The spec rightfully calls for persistent storage of messages (2.2.3), but doesn't spell out
>rules for when messages must be written to persistent storage in the process (it should
>obviously be before sending and after receiving, but before acknowledgement and forward).

I thought that section 2.2.3 was pretty clear about it. I will make a note of that as an item
of discussion in the TC.

>What I find also very noteworthy is that the authors say that they have yet to address
>synchronization between sender and receiver and establishing a common understanding by sender
>and receiver about whether the message was properly delivered (meaning that the send/ack
>cycle was fully completed). I assume that once they do so, they'll throw the synchronous,
>piggybacked reply on top of HTTP out of the window, because this creates an in-doubt
>situation for the acknowledging party.

That situation is currently addressed by message redelivery on the sender side, and dupe
elimination on the receiver side. We will make a note to revisit this in the TC discussions.
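
As a rough illustration of redelivery plus duplicate elimination keyed on the message id, here is a minimal sketch. None of it comes from the spec: the names `DedupReceiver` and `send_with_retries`, the `ack_lost` callback used to simulate a lost acknowledgement, and the in-memory set of seen ids (which would have to be persisted in practice) are all assumptions.

```python
class DedupReceiver:
    """Hypothetical receiver that delivers each message id at most once."""

    def __init__(self):
        self.seen_ids = set()   # assumption: a real receiver would keep this durably
        self.delivered = []     # stands in for the hand-off to the application

    def receive(self, message_id, payload):
        """Deliver on first sight of an id; ack every copy so that resends stop."""
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.delivered.append(payload)
        return message_id       # the acknowledgement


def send_with_retries(receiver, message_id, payload, max_attempts, ack_lost):
    """Resend until an ack gets through; return the attempt number, or None on giving up.

    max_attempts plays the role of the configurable number of resend attempts.
    ack_lost(attempt) is a test hook that says whether the ack for that attempt is lost.
    """
    for attempt in range(1, max_attempts + 1):
        ack = receiver.receive(message_id, payload)
        if ack == message_id and not ack_lost(attempt):
            return attempt
    return None


rx = DedupReceiver()
# The acks for the first two attempts are lost, so the same message is sent three times.
attempts = send_with_retries(rx, "m1", "hello", max_attempts=3, ack_lost=lambda n: n < 3)
```

The receiver sees three copies but hands the payload to the application once. What the sketch does not settle is the in-doubt case where the sender gives up without ever seeing an ack.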

Now that we have formed an OASIS TC, you have a public place to have these discussions. Feel
free to post your feedback to wsrm-comments@oasis-open.org.

Issues from http://www.allthingsdistributed.com/historical/archives/000013.html

>The requirement that messages need to be persisted has not been thought through well enough
>(as Clemens already hinted at). The operation on the sender side seems obvious, when you
>recover you try to get acknowledgements for those messages you think you have sent, but may
>have gotten lost in the crash. However at the receiver this is less obvious. What does it
>mean to have delivered the message to the application successfully? Can you be sure about the
>point of the possible crash? Can you be sure never to deliver duplicate messages to the
>application during recovery? Does the app also need to handle duplicates? There are no
>conditions specified for how to remove received messages from the persistent store at the
>receiver.

Issues 3 + 4 in appendix 2 are general statements that we need to further refine the semantics
of failure and recovery. Many of us in the TC have very strong experience in enterprise
messaging and are very capable of figuring this stuff out.

>What exactly are the semantics of an acknowledgement? Does this mean the message was stored
>in persistent storage? Or that it was successfully delivered to the application?

My view is that a message can be considered acknowledgeable once it has been safely
persisted. Failure to deliver it to the application can be addressed by the notion of a
centralized fault location, or dead message queue, as noted in Appendix 2, section 3.

>What does time-to-live really mean when received messages are stored persistently? I
>can send an ack telling the sender I received the message, then I get delayed for some reason
>(maybe a crash) and when I want to deliver the message I notice that its time has expired.
>According to the current spec I cannot deliver this message and have to drop it. Hence the
>message transport becomes unreliable.

This is also addressed by Appendix 2, section 3. We look forward to other alternatives, which
can be discussed in the WSRM OASIS forum.
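
The two answers above (ack once persisted, with undeliverable or expired messages going to a dead message queue) could look roughly like this. It is a sketch under stated assumptions and not what Appendix 2 specifies: the class name, the `expires_at` parameter, the injectable clock, and the plain dict and list standing in for durable storage are all hypothetical.

```python
import time


class PersistThenAckReceiver:
    """Hypothetical receiver: the ack means "persisted", not "delivered"."""

    def __init__(self, deliver, now=time.time):
        self.store = {}          # stands in for a durable message store
        self.dead_letters = []   # stands in for the dead message queue
        self.deliver = deliver   # application callback
        self.now = now           # injectable clock, so expiry can be tested

    def receive(self, message_id, payload, expires_at):
        """Persist first, then acknowledge."""
        self.store[message_id] = (payload, expires_at)
        return message_id

    def dispatch(self):
        """Hand stored messages to the application; park failures instead of dropping them."""
        for message_id in list(self.store):
            payload, expires_at = self.store[message_id]
            if self.now() > expires_at:
                self.dead_letters.append((message_id, payload, "expired"))
            else:
                try:
                    self.deliver(payload)
                except Exception as exc:
                    self.dead_letters.append((message_id, payload, repr(exc)))
            # The message leaves the store only after delivery or dead-lettering.
            del self.store[message_id]


delivered = []
clock = [100.0]
rx = PersistThenAckReceiver(delivered.append, now=lambda: clock[0])
ack = rx.receive("a", "fresh", expires_at=150.0)
rx.receive("b", "stale", expires_at=90.0)   # already expired by the time it is dispatched
rx.dispatch()
```

An expired message is then not silently lost, but the sender still holds an ack for something the application never received, which is the concern raised in the original post.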

>The requirement to send a simple ack immediately for each message will introduce a real mess.
>The scenario in which a message gets lost and a subsequent message is received, will trigger
>an ack for this new message making the sender believe that it is reliably received. However
>the receiver cannot deliver the message to the app until it has received the retransmission
>of the missing message. This can cause unreliable behavior because you may have to drop the
>message if there is a ttl field, or if the sender crashes before it could retransmit the
>missing message, the receiver gets stuck with the message it has received forever without
>being able to deliver. The solution here should have been to do a delayed ack or send a
>negative ack, allowing the receiver to treat the new message as volatile until the
>retransmission gap has been filled.

This is recognized by section 6 in Appendix 2.

Again, now that we have formed an OASIS TC, you have a public place to have these discussions. Feel
free to post your feedback to wsrm-comments@oasis-open.org.

Dave

Posted by: Dave Chappell on February 24, 2003 07:53 PM