best meets the needs of the business.
What does it take to keep on going?
As the Black Knight from “Monty
Python and the Holy Grail” says, “’Tis
but a scratch” (https://www.youtube.com/watch?v=ZmInkxbvlCs; thank you
to Peter Vosshall for raising the Black
Knight in discussion when we were
both younger many years ago).
The Disconcerting Discontinuum
Back when you had only one database
for an application to worry about, you
didn’t have to think about partial results. You also didn’t have to think
about data arriving after some other
data. It was all simply there.
Now, you can do so much more with
big distributed systems, but you have
to be more sophisticated about the trade-off between timely answers and complete answers. The best systems will
adapt and interpret their problems as,
“’Tis but a scratch!”
The Calculus of Service Availability
Ben Treynor, Mike Dahlin,
Vivek Rau and Betsy Beyer
Toward Higher Precision
Rick Ratzel and Rodney Greenstreet
A Lesson in Resource Management
1. Dean, J. and Barroso, L.A. The tail at scale. Commun.
ACM 56, 2 (Feb. 2013), 74–80; https://dl.acm.org/
2. Hall, A., Tudorica, A., Buruiana, F., Hofmann, R.,
Ganceanu, S. and Hofmann, T. Trading off accuracy
for speed in PowerDrill. In Proceedings of the Intern.
Conf. Data Engineering, 2016; https://ai.google/
3. Kandula, S., Lee, K., Chaudhuri, S. and Friedman, M.
Experiences with approximating queries in Microsoft’s
production big-data clusters. In Proceedings of the
VLDB Endowment 12, 2 (2019), 2131–2142; http://bit.
4. Mogul, J.C., Isaacs, R. and Welch, B. Thinking about
availability in large service infrastructures. In
Proceedings of the 16th Workshop on Hot Topics in
Operating Systems, 2017, 12–17; https://dl.acm.org/
5. Mogul, J.C. and Wilkes, J. Nines are
not enough: meaningful metrics for clouds. In
Proceedings of the Workshop on Hot Topics in
Operating Systems, 2019, 136–141; https://dl.acm.org/
Pat Helland has been implementing transaction systems,
databases, application platforms, distributed systems,
fault-tolerant systems, and messaging systems since
1978. He currently works at Salesforce.
Copyright held by author/owner.
Publication rights licensed to ACM.
is that it is OK to retry the laggard requests because they are idempotent. It
does not cause harm to do them two or
Knowing you can’t know. The more
complex the set of inputs, the more
likely you won’t see everything in a
timely fashion. The more complex the
store-and-forward queuing, the more
likely stuff will arrive too late or not
at all. The more distant the sources of
your inputs, the more challenges you
As we have seen, sometimes it can
be effective to retry the request for input. In particular, in some systems,
retrying can ensure all the inputs are
In other systems, the inputs are
not simply fetched but are rattling
their way through queues similar to
Highway 101 through San Francisco.
In these environments, the processing probably has to simply cut off with
what it has and do the best it can. This
means you can’t guarantee the stuff is
ready when you want it.
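The cutoff described above can be sketched in a few lines. This is a minimal illustration and not any particular system's implementation: the source names, the delays, and the helpers `fetch`, `gather_until`, and `demo_cutoff` are all hypothetical, and `asyncio.sleep` stands in for real network or queue latency.

```python
import asyncio

async def fetch(name, delay):
    """Hypothetical input source: answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return name, len(name)  # placeholder payload: the length of the name

async def gather_until(deadline, sources):
    """Collect whatever arrives within `deadline` seconds and abandon the rest."""
    tasks = {asyncio.create_task(fetch(n, d)): n for n, d in sources.items()}
    done, pending = await asyncio.wait(tasks.keys(), timeout=deadline)
    for task in pending:
        task.cancel()  # stop waiting on the laggards
    results = dict(task.result() for task in done)
    missing = sorted(tasks[t] for t in pending)
    return results, missing

def demo_cutoff():
    # Assumed delays: two nearby sources and one that is far too slow.
    sources = {"east": 0.01, "west": 0.02, "overseas": 5.0}
    return asyncio.run(gather_until(0.5, sources))

if __name__ == "__main__":
    results, missing = demo_cutoff()
    print("answered with:", results, "still missing:", missing)
```

Returning the list of missing sources alongside the results lets the caller decide whether the partial input is good enough or whether the answer should be marked as degraded.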
So, if you know you can only
probably know, what’s the plan?
Approximating queries. There is
some fun new work describing analytics with approximate answers. By
expressing sampling operators, some
systems can provide really good answers based on a small subset of all
the inputs one would normally examine.2,3 In the cited systems, there is
more focus on sampling for performance when everything is working
and timely. Still, it’s quite similar to
what you would do to build systems
that return answers based on what is
available in a timely fashion.
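To make the sampling idea concrete, here is a small sketch that estimates a sum from a uniform random sample and scales it up. It is only an illustration under simple assumptions (uniform sampling without replacement, data held in memory); it is not the technique used by the cited systems, and the function name `approximate_sum` is hypothetical.

```python
import random

def approximate_sum(values, sample_size, seed=0):
    """Estimate sum(values) from a uniform sample of `sample_size` items."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    # Scale the sample total by the inverse of the sampling fraction.
    return sum(sample) * len(values) / sample_size

if __name__ == "__main__":
    values = list(range(1, 100_001))
    exact = sum(values)
    estimate = approximate_sum(values, sample_size=1_000)
    error = abs(estimate - exact) / exact
    print(f"exact={exact} estimate={estimate:.0f} relative error={error:.2%}")
```

The same scaling applies when the "sample" is simply whichever inputs arrived in time, with the important caveat that late arrivals are rarely a uniform sample, so the estimate can be biased.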
Returning partial answers. Many
systems work to give some answer
in a timely fashion even if they are
wounded. Many sophisticated websites will dribble out partial answers to
the browser as they become available:
The text for a product description may
arrive before the product image; the
product image may arrive before the
user reviews. This decoupling yields a
faster overall result and, in general, a
more satisfied user.
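A rough sketch of this decoupling follows: each part of the page is loaded independently and handed over as soon as it is ready, so a slow part does not hold up the fast ones. The part names, the delays, and the helpers `load_part`, `render_page`, and `demo_stream` are assumptions made for illustration; a real site would flush each fragment to the browser instead of appending to a list.

```python
import asyncio

async def load_part(name, delay):
    """Hypothetical page fragment that takes `delay` seconds to produce."""
    await asyncio.sleep(delay)
    return name

async def render_page(parts):
    """Return the fragment names in the order they became available."""
    tasks = [asyncio.create_task(load_part(n, d)) for n, d in parts.items()]
    sent = []
    for next_done in asyncio.as_completed(tasks):
        fragment = await next_done
        sent.append(fragment)  # stand-in for flushing to the browser
    return sent

def demo_stream():
    # Assumed delays: text is fast, the image is slower, reviews are slowest.
    parts = {"reviews": 0.2, "image": 0.1, "description": 0.01}
    return asyncio.run(render_page(parts))

if __name__ == "__main__":
    print(demo_stream())
```

The order in which the parts are requested does not matter here; the user sees each one in the order it completes.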
When dealing with answers relying
on data from a distance, it’s important
to consider how to decouple results
and, where possible, return a degraded answer quickly when and if that
best meets the needs of the business.