making it easier to derive high-level explanations of end-to-end interactions spanning many nodes in distributed computations. But there is no free lunch. Broadly speaking, large-scale tracing systems impose on adopters both an instrumentation burden (the effort that goes into tweaking existing code to add instrumentation points, to propagate metadata, or both) and an overhead burden (the runtime cost of trace capture and propagation). The collection of papers chosen here illustrates some strategies for ameliorating these burdens, as well as some creative applications for high-level explanations.
Tracing with Context Propagation
Sigelman, B.H., Barroso, L.A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., Shanbhag, C.
Dapper, a large-scale distributed systems tracing infrastructure. Google Technical Report (2010); http://research.
Dapper represents some of the “early” industrial work on context-based tracing. It minimizes the instrumentation burden by relying on Google’s relatively homogeneous infrastructure, in which all code relies on a common RPC (remote procedure call) library, threading library, and so on. It minimizes the overhead burden by sampling only a small fraction of requests at ingress and propagating trace metadata alongside requests, ensuring that if a request is sampled, all of the interactions that contributed to its response are sampled as well.
Dapper’s data model (a tree of nested spans capturing causal and temporal relationships among the services participating in a call graph) and basic architecture have become the de facto standard for trace collection in industry. Zipkin (created at Twitter) was the first open-source “clone” of Dapper; Zipkin and its derivatives (including the recently announced Amazon Web Services X-Ray) are in widespread use today.
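To make the mechanics concrete, here is a minimal sketch in Python of Dapper-style head-based sampling and span propagation. It is not Dapper’s actual API; the names (TraceContext, child_span) and the reporting scheme are hypothetical, and real implementations hide all of this inside the shared RPC and threading libraries.

```python
import random
import time
import uuid
from contextlib import contextmanager

SAMPLE_RATE = 1 / 1024  # head-based: decide once, at ingress

class TraceContext:
    """Trace metadata propagated alongside every downstream request."""
    def __init__(self, trace_id, parent_span_id, sampled):
        self.trace_id = trace_id
        self.parent_span_id = parent_span_id
        self.sampled = sampled

def start_trace():
    # The sampling decision is made once, where the request enters the
    # system; children inherit it, so sampled traces are always complete.
    return TraceContext(uuid.uuid4().hex, None, random.random() < SAMPLE_RATE)

@contextmanager
def child_span(ctx, name, report):
    """Open a span nested under ctx; pass the yielded context downstream
    (e.g., in RPC headers) so callees become children of this span."""
    span_id = uuid.uuid4().hex
    start = time.time()
    try:
        yield TraceContext(ctx.trace_id, span_id, ctx.sampled)
    finally:
        if ctx.sampled:  # unsampled requests pay (almost) nothing
            report({"trace_id": ctx.trace_id, "span_id": span_id,
                    "parent": ctx.parent_span_id, "name": name,
                    "duration_s": time.time() - start})

spans = []
ctx = start_trace()
ctx.sampled = True  # force sampling so the demo reports its spans
with child_span(ctx, "frontend", spans.append) as c1:
    with child_span(c1, "storage.read", spans.append):
        time.sleep(0.01)
print(spans)  # two spans sharing a trace_id, linked parent to child
```

Because the sampling decision travels with the context, a request is either traced end to end or not at all, which is what keeps the per-request overhead negligible for the unsampled majority.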
Mace, J., Roelke, R., Fonseca, R.
Pivot Tracing: Dynamic causal monitoring for distributed systems. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), 378–393; http://cs.brown.
Dapper was by no means the first system design to advocate in-line context propagation. The idea goes back at least as far as X-Trace, pioneered by Rodrigo Fonseca at UC Berkeley. Fonseca (now at Brown University) is still doing impressive work in this space. Pivot Tracing presents the database take on low-overhead dynamic tracing: it models events as tuples, identifies code locations that represent sources of data, and turns dynamic instrumentation into a query planning and optimization problem. Pivot Tracing reuses Dapper/X-Trace-style context propagation to allow efficient correlation of events according to causality. Query the streams!
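Pivot Tracing’s signature operation is the “happened-before join”: events from different processes are joined only when one causally precedes the other, with the join key carried in propagated metadata the authors call baggage. The following is a toy Python sketch of the idea; the tracepoint names and event representation are invented, and the real system compiles such queries into dynamic instrumentation rather than evaluating them over materialized streams.

```python
from collections import defaultdict

# Each traced event is a tuple: (tracepoint name, baggage, attributes).
# Baggage is the Dapper/X-Trace-style context that rides along with a
# request, so events can be joined by causality rather than by time.

def happened_before_join(events, group_key):
    """Aggregate events by a value that some causally preceding
    event stashed in the baggage (the happened-before join)."""
    totals = defaultdict(int)
    for ev in events:
        group = ev["baggage"].get(group_key)  # written upstream
        if group is not None:
            totals[group] += ev["attrs"]["delta"]
    return dict(totals)

# Query: bytes read per client, across machines. A client-side
# tracepoint wrote its id into the baggage; counters observed later
# in the same request inherit that baggage automatically.
events = [
    {"tracepoint": "incrBytesRead", "baggage": {"client": "c1"},
     "attrs": {"delta": 4096}},
    {"tracepoint": "incrBytesRead", "baggage": {"client": "c2"},
     "attrs": {"delta": 1024}},
    {"tracepoint": "incrBytesRead", "baggage": {"client": "c1"},
     "attrs": {"delta": 8192}},
]
print(happened_before_join(events, "client"))  # {'c1': 12288, 'c2': 1024}
```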
Chow, M., Meisner, D., Flinn, J., Peek, D., Wenisch, T.F.
The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (2014), 217–231; https://www.
What about enterprises that can’t (or just don’t want to) overcome the instrumentation and overhead burdens of tracing? Could they reconstruct causal relationships after the fact, from unstructured system logs? The Mystery Machine describes a system that begins by liberally formulating hypotheses about how events across a distributed system could be correlated (for example, Is one a cause of the other? Are they mutually exclusive? Do they participate in a pipelined computation?) and then mines logs for evidence that contradicts those hypotheses (for example, a log in which two events A and B are concurrent immediately refutes the hypothesis that A and B are mutually exclusive). Over time, the set of surviving hypotheses converges into a model of system interactions that can be used to answer many of the same questions that tracing systems answer.
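A minimal sketch of that refutation loop might look as follows, assuming logs have already been parsed into per-request traces mapping each segment to a (start, end) interval; the hypothesis names and trace format are invented for illustration.

```python
# Candidate relationships between two segments a and b. Start by
# assuming all of them; every observed trace can only refute.
HYPOTHESES = {"a_before_b", "b_before_a", "mutually_exclusive"}

def surviving_hypotheses(traces, a, b):
    alive = set(HYPOTHESES)
    for trace in traces:
        if a not in trace or b not in trace:
            continue
        (sa, ea), (sb, eb) = trace[a], trace[b]
        if ea > sb:                 # a did not finish before b began
            alive.discard("a_before_b")
        if eb > sa:                 # b did not finish before a began
            alive.discard("b_before_a")
        if sa < eb and sb < ea:     # the intervals overlap (concurrency)
            alive.discard("mutually_exclusive")
    return alive

# Each trace maps a segment to its (start, end) interval, as mined
# from timestamps in ordinary logs.
traces = [
    {"server": (0, 5), "network": (5, 9), "client": (9, 12)},
    {"server": (0, 4), "network": (4, 7), "client": (7, 11)},
]
# -> {'a_before_b', 'mutually_exclusive'}: segments that never overlap
# leave both hypotheses standing; happens-before is the stronger claim.
print(surviving_hypotheses(traces, "server", "network"))
```

Note that two hypotheses can survive for the same pair (segments that always run back to back never overlap); in that case the happens-before relationship is the more informative one to keep in the model.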
Alvaro, P., Andrus, K., Sanden, C., Rosenthal, C., Basiri, A., Hochstein, L.
Automating failure-testing research at Internet scale. In Proceedings of the 7th ACM Symposium on Cloud Computing (2016), 17–28.
The raison d’être of the systems just described is understanding the causes of end-to-end latency as perceived by users. But armed with detailed “explanations” of how a large-scale distributed system produces its outcomes, we can do so much more. My research group at UC Santa Cruz has been exploring the use of explanations of “good” or expected system outcomes to drive fault-injection infrastructures in order to root out bugs in ostensibly fault-tolerant code. The basic idea is that if we can explain how a distributed system functions in the failure-free case, and how it provides redundancy to overcome faults, we can better understand which combinations of faults could undermine those good outcomes, and target fault injection at exactly those combinations.
This approach, called lineage-driven fault injection (LDFI), originally relied on idealized, fine-grained data provenance to explain distributed executions (see our previous paper, “Lineage-driven Fault Injection,” by Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein, presented at SIGMOD 2015). This more recent paper describes how the LDFI approach was adapted to “snap in” to the microservice architecture at Netflix and to build rich models of system redundancy from Zipkin-style call-graph traces.
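The experiment-selection loop at LDFI’s core can be sketched in a few lines of Python. Assume each “support” (a set of components whose joint survival explains one way the system has produced a good outcome) has been extracted from call-graph traces; the real system encodes this reasoning as a Boolean satisfiability problem rather than the brute-force enumeration shown here.

```python
from itertools import combinations

def ldfi_experiments(supports, components, max_faults=2):
    """Yield fault sets that disable every known way the system has
    succeeded: each support must lose at least one member."""
    for k in range(1, max_faults + 1):
        for faults in combinations(sorted(components), k):
            fault_set = set(faults)
            if all(fault_set & support for support in supports):
                yield fault_set

# Two supports mined from traces of good executions: the response was
# served either from cache A, or from replicas B and C together.
supports = [{"A"}, {"B", "C"}]
components = {"A", "B", "C"}
for faults in ldfi_experiments(supports, components):
    print(sorted(faults))   # ['A', 'B'] and ['A', 'C']
```

Each yielded fault set is an experiment worth running: if the system nevertheless succeeds, the new trace reveals a support the model was missing, and the loop repeats until either a bug is found or no refuting experiment remains.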
Although distributed systems are a mature research area in academia and are ubiquitous in industry, the art of debugging distributed systems is still in its infancy. It is clear that conventional debuggers, and along with them conventional best practices for deriving ex-