a dialog before the first answer may be
retried. That can cause a world of trouble if the early messages cause some
serious (and non-idempotent) work to
happen. What’s an application developer to do?
There are three broad ways to make
sure you do not have bugs with the initialization stage:
˲ Trivial work. You can simply send
a message back and forth, which does
nothing but establish the dialog to the
specific partner in the scalable server.
This is what TCP does with its SYN messages.
˲ Read-only work. The initiating application can read some stuff from the
scalable server. This won’t leave a mess
if there are retries.
˲ Pending work. The initiating application can send a bunch of stuff, which
the partner will accumulate. Only when
subsequent round-trips have occurred
(and you know which specific partner
and the actual state for the dialog has
been connected) will the accumulated
state be permanently applied. If the
server accumulates state and times out,
then the accumulated state can be discarded without introducing bugs.
Using one of these three approaches, the application developer (and not
the plumbing) will ensure that no bugs
are lying in wait for a retry to a different back-end partner. To eliminate this
risk completely, the plumbing can use
TCP’s trick and send a trivial round-trip
set of messages (the SYN messages in
TCP), which hooks up the partner without bothering the application layered
on top. On the other hand, allowing
the application to do useful work with
the round-trip (for example, read some
data) is cool.
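The pending-work approach can be sketched in a few lines. What follows is a minimal illustration, not a prescribed implementation: the Partner class, its method names, the per-message IDs, and the 30-second timeout are all assumptions invented for the example. Early messages only accumulate work; a later round-trip applies it; a timeout throws it away.

```python
import time


class Partner:
    """Hypothetical back-end partner that defers non-idempotent work."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed value, tune as needed
        self.clock = clock
        self.pending = {}       # dialog_id -> (started_at, {message_id: work})
        self.confirmed = set()  # dialogs whose work has been applied
        self.applied = []       # stand-in for permanently applied state

    def accumulate(self, dialog_id, message_id, work):
        """Early message: remember the work, but do not apply it yet."""
        if dialog_id in self.confirmed:
            return  # late retry of an early message; nothing to do
        _, items = self.pending.setdefault(dialog_id, (self.clock(), {}))
        items.setdefault(message_id, work)  # a retried message adds nothing

    def confirm(self, dialog_id):
        """Later round-trip: the dialog is bound to this partner, so apply."""
        if dialog_id in self.confirmed:
            return True  # retried confirmation; do not apply twice
        entry = self.pending.pop(dialog_id, None)
        if entry is None:
            return False  # unknown or expired dialog
        _, items = entry
        self.applied.extend(items.values())
        self.confirmed.add(dialog_id)
        return True

    def expire(self):
        """Discard accumulated state for dialogs that never confirmed."""
        now = self.clock()
        stale = [d for d, (started_at, _) in self.pending.items()
                 if now - started_at > self.timeout_seconds]
        for dialog_id in stale:
            del self.pending[dialog_id]
        return stale
```

If a retry lands on a different partner, the first partner is left holding only pending state, which expire() eventually drops without having touched anything permanent. A real service would also need to keep the confirmed set and applied state durable.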
The Closing-Stage Ambiguity
In any interaction, the last message from
one application service to another cannot
be guaranteed. The only way to know it
was received is to send a message saying it was. That means it is no longer
the last message.
Somehow, some way, the application must deal with the fact that the last
message or messages sent in the same direction may simply go poof in the
network. This may be in a simple request-response, or it may be in a complex
full-duplex chatter. When the last messages are sent, it is just a matter of
luck if they actually get delivered (see Figure 11). In each interaction
between two partners, the last messages sent in the same direction cannot be
guaranteed. They may simply disappear.
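A toy model makes the regress concrete. The sketch below is an illustration under a simplifying assumption: the dialog is strictly turn-taking, so any message followed by one from the other party counts as implicitly confirmed. The function name unconfirmed_tail and the (sender, text) transcript format are invented for the example.

```python
def unconfirmed_tail(transcript):
    """Return the trailing run of messages sent in the same direction.

    transcript is a list of (sender, text) pairs in send order. Under the
    turn-taking assumption, nothing sent afterward can vouch for these
    messages, so their delivery is a matter of luck.
    """
    if not transcript:
        return []
    last_sender = transcript[-1][0]
    tail = []
    for sender, text in reversed(transcript):
        if sender != last_sender:
            break
        tail.append((sender, text))
    tail.reverse()
    return tail


dialog = [("A", "request"), ("B", "response")]
print(unconfirmed_tail(dialog))   # B's response is the unconfirmed tail

dialog.append(("A", "ack"))       # A confirms the response...
print(unconfirmed_tail(dialog))   # ...and now the ack is the unconfirmed tail
```

Adding an acknowledgment never empties the tail; it only changes which message is sitting in it.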
Conclusion
Distributed systems can pose challenges to applications sending messages.
The messaging transport can be downright mischievous. The target for the
message may be an illusion of a partner
implemented by a set of worker bees.
These, in turn, may have challenges in
their coordination of the state of your
work. Also, the system you think you are
talking to may be, in fact, subcontracting the work to other systems. This, too,
can add to the confusion.
Sometimes an application has
plumbing that captures its model of
communicating partners, lifetime of
the partners, scalability, failure management, and all the issues needed to
have a great two-party dialog between
communicating application components. Even in the presence of great
supporting plumbing, there are still
semantic challenges intrinsic to messaging.
This article has sketched a few principles used by grizzled old-timers to
provide resilience even when “stuff
happens.” In most cases, these programming techniques are used as
patches to applications when the rare
anomalies occur in production. As a
whole, they are not spoken about too
often and rarely crop up during testing. They typically happen when the
application is under its greatest stress
(which may be the most costly time to
realize you have a problem).
Some basic principles are:
˲ Every message may be retried and,
hence, must be idempotent.
˲ Messages may be reordered.
˲ Your partner may experience amnesia as a result of failures, poorly managed durable state, or load-balancing
switchover to its evil twin.
˲ Guaranteed delivery of the last
message is impossible.
Keeping these principles in mind
can lead to a more robust application.
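The first two principles can be illustrated with a small receiver that ignores duplicates and holds early arrivals until the gap is filled. This is a sketch under assumptions, not a recipe from any particular plumbing: the Receiver class is hypothetical, and it assumes the sender stamps each message with a per-dialog sequence number starting at 0.

```python
class Receiver:
    """Hypothetical receiver that tolerates retries and reordering."""

    def __init__(self):
        self.next_seq = 0     # next sequence number to process
        self.held = {}        # seq -> payload that arrived ahead of its turn
        self.processed = []   # stand-in for the application's real work

    def receive(self, seq, payload):
        """Accept one message; return the payloads processed as a result."""
        if seq < self.next_seq or seq in self.held:
            return []         # duplicate from a retry: ignore it
        self.held[seq] = payload
        done = []
        while self.next_seq in self.held:
            done.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        self.processed.extend(done)
        return done


r = Receiver()
r.receive(1, "b")   # arrives early, is held
r.receive(0, "a")   # fills the gap; "a" then "b" are processed
r.receive(0, "a")   # retry, ignored
```

The sketch does nothing about amnesia: next_seq and held live only in memory, so a restart or a switch to another partner would forget them unless that state is kept durable. It also cannot rescue a final message that never arrives.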
While it is possible for plumbing
or platforms to remove some of these
concerns from the application, this can
occur only when the communicating
apps share common plumbing. The
emergence of such common environments is not imminent (and may never
happen). In the meantime, developers
need to be thoughtful of these potential
dysfunctions in erecting applications.
Acknowledgments
Thank you to Erik Meijer and Jim Maurer for their commentary and editorial
improvements.
Related articles
on queue.acm.org
BASE: An Acid Alternative
Dan Pritchett
http://queue.acm.org/detail.cfm?id=1394128
A Co-Relational Model of Data
for Large Shared Data Banks
Erik Meijer and Gavin Bierman
http://queue.acm.org/detail.cfm?id=1961297
Testable System Administration
Mark Burgess
http://queue.acm.org/detail.cfm?id=1937179
References
1. IBM. WebSphere MQ; http://www-01.ibm.com/software/integration/wmq/.
2. Tanenbaum, A. S. Computer Networks, 4th ed. Prentice Hall, 2002.
3. World Wide Web Consortium, Network Working Group. 1999. Hypertext Transfer Protocol -- HTTP/1.1; http://www.w3.org/Protocols/rfc2616/rfc2616.html.
4. Wolter, R. An Introduction to SQL Server Service Broker (2005); http://msdn.microsoft.com/en-us/library/ms345108(v=sql.90).aspx.
5. Transmission Control Protocol; http://www.ietf.org/rfc/rfc793.txt
Pat Helland has worked in distributed systems,
transaction processing, databases, and similar areas since
1978. For most of the 1980s, he was the chief architect of
Tandem Computers’ TMF (Transaction Monitoring Facility),
which provided distributed transactions for the NonStop
System. With the exception of a two-year stint at Amazon,
Helland worked at Microsoft Corporation from 1994–2011,
where he was the architect for Microsoft Transaction
Server and SQL Service Broker. He also contributed to
Cosmos, a distributed computation and storage system
that provides back-end support for Bing.