Better understanding data requires
tracking its history and context.
BY LUCIAN CARATA, SHERIF AKOUSH,
NIKILESH BALAKRISHNAN, THOMAS BYTHEWAY,
RIPDUMAN SOHAN, MARGO SELTZER, AND ANDY HOPPER
A Primer on Provenance
Assessing the quality or validity of a piece of data is
not usually done in isolation. You typically examine the
context in which the data appears and try to determine
its original sources or review the process through
which it was created. This is not so straightforward
when dealing with digital data, however: the result
of a computation might have been derived from
numerous sources and by applying complex successive
transformations, possibly over long periods of time.
As the quantity of data that contributes to a
particular result increases, keeping track of how
different sources and transformations are related to
each other becomes more difficult. This constrains the
ability to answer questions regarding a result’s history,
such as: What were the underlying assumptions on
which the result is based? Under what conditions does
it remain valid? What other results were derived from
the same data sources?
The metadata that needs to be systematically captured to answer those
(or similar) questions is called
provenance (or lineage): a graph
describing the relationships among
all the elements (sources, processing
steps, contextual information, and dependencies) that contributed to the existence of a piece of data.
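A minimal sketch of what such a graph might look like in code; the node names and the dependency relation below are invented for illustration and do not follow any particular provenance system or standard:

```python
# Illustrative provenance graph as a plain adjacency map.
# Each derived item maps to the elements it directly depends on:
# sources, processing steps, and configuration alike are nodes.
provenance = {
    "report.pdf":    ["stats.csv", "plot.png"],
    "plot.png":      ["stats.csv", "plot_script.py"],
    "stats.csv":     ["raw_measurements.db", "cleaning_step"],
    "cleaning_step": ["cleaning_config.yaml"],
}

def direct_dependencies(item):
    """Return the elements that directly contributed to `item`."""
    return provenance.get(item, [])

# Answers "what did plot.png directly depend on?"
print(direct_dependencies("plot.png"))  # ['stats.csv', 'plot_script.py']
```

Real systems record far richer metadata on each node and edge (timestamps, process identities, versions), but the underlying structure is a directed graph much like this one.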
This article presents current research in this field from a practical perspective, discussing not only existing
systems and the fundamental concepts
needed for using them in applications
today, but also future challenges and
opportunities. A number of use cases
illustrate how provenance might be
useful in practice.
Where does data come from? Consider the need to understand the conditions, parameters, or assumptions
behind a given result—in other words,
the ability to point at a piece of data (for
example, a research result or an anomaly
in a system trace) and ask: Where did
it come from? This would be useful
for experiments involving digital data
(such as in silico experiments in biology, other types of numerical simulations, or system evaluations in computer science).
The provenance for each run of
such experiments contains the links
between results and corresponding
starting conditions or configuration
parameters. This becomes especially important
when considering processing pipelines, where some early results
serve as the basis of further experiments. Manually tracking all the parameters from a final result through intermediate data back to the original sources
is burdensome and error-prone.
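That backward walk is exactly what a provenance graph automates: starting from a result, follow dependency edges transitively to recover every source, intermediate, and parameter it rests on. A hedged sketch, with a made-up two-stage pipeline standing in for a real experiment:

```python
# Hypothetical pipeline: final_result was computed from an intermediate
# product and a parameter file; the intermediate came from raw input
# plus a model configuration. All names are invented for illustration.
pipeline = {
    "final_result":   ["intermediate_a", "run_params.json"],
    "intermediate_a": ["raw_input.dat", "model_config.yaml"],
}

def all_sources(item, graph):
    """Collect every upstream element `item` depends on, transitively."""
    seen = set()
    stack = [item]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# One query replaces the manual bookkeeping described above.
print(sorted(all_sources("final_result", pipeline)))
# ['intermediate_a', 'model_config.yaml', 'raw_input.dat', 'run_params.json']
```

Running the same traversal in the opposite edge direction answers the dual question raised next: given a source, which results were derived from it?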
Of course, researchers are not the
only ones requiring this type of tracking. The same techniques could be
used to help people in the business or
financial sectors—for example, figuring out the set of assumptions behind
the statistics reported to a board of
directors, or determining which mortgages were part of a traded security.
Who is using this data? Instead of
tracking a result back to its sources,