In many ways, you might already be running a specialized version of such a system: auditing, tracing, and change-tracking solutions all collect some form of provenance, even though they might not identify it as such. The advantage of thinking about provenance as a stand-alone concept is the ability to use this metadata in a principled way, enabling result verifiability and complex historical queries regardless of the underlying collection mechanisms, and across applications and software stacks.
Historically, provenance systems were the focus of research in the database field, with the aim of understanding how and when materialized views should be updated in response to changes in the underlying tables. Because of the well-defined relational model, it has proven possible both to derive precise provenance information from queries [7] and to develop formalisms that allow its concise representation [13]. This has been further extended in systems such as Trio, allowing records to incorporate an associated uncertainty, which can be propagated across multiple queries.
In contrast, capturing provenance
for applications performing arbitrary
computations (not restricted to a small
set of valid transformations) has proven more challenging. Research efforts
in this area have focused on the collection of provenance at particular points
in the software stack (by modifying applications, the runtime environment,
or the kernel).
The accompanying figure presents
a general timeline of provenance systems. This article looks at the characteristics of eight of these (PASS, SPADE,
VisTrails, ZOOM, Burrito, SPROV, Lipstick, and RAMP), each representative
of a larger class of solutions:
Operating-System Level. PASS [22, 23] and SPADE [12] investigate provenance by observing application events such as process creation or I/O, which are then used to infer dependencies among different pieces of data.
Workflows. VisTrails [26] and ZOOM [2] are workflow-management systems with the ability to track provenance for the execution of various workflows and (in the case of VisTrails) for the evolution of the workflows themselves.
You can capture provenance to understand where a given result has been subsequently used, or to find what data was further derived from it. For example, a company might want to identify all internal uses of a certain piece of code in order to respect licensing agreements, or to keep track of code still using deprecated or unsafe functions that need to be removed.
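This kind of "where was it used?" question amounts to a forward traversal of the provenance graph, starting from the artifact of interest. The sketch below assumes provenance is already available as a simple in-memory mapping from each artifact to the artifacts directly derived from it; the file and binary names are purely illustrative.

```python
from collections import deque

def downstream_uses(provenance, artifact):
    """Breadth-first forward traversal of a provenance graph.

    `provenance` maps each artifact to the artifacts directly
    derived from it; returns everything transitively derived
    from `artifact`.
    """
    seen = set()
    queue = deque(provenance.get(artifact, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(provenance.get(node, ()))
    return seen

# Hypothetical dependency graph: a source file with a deprecated
# function, and the libraries/binaries/reports derived from it.
graph = {
    "md5_hash.c": ["libcrypto.a"],
    "libcrypto.a": ["server_bin", "cli_bin"],
    "server_bin": ["audit_report.pdf"],
}
print(sorted(downstream_uses(graph, "md5_hash.c")))
# → ['audit_report.pdf', 'cli_bin', 'libcrypto.a', 'server_bin']
```

A real system would of course derive the graph from captured events rather than a hand-written dictionary, but the query itself stays this simple.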
Using similar mechanisms, end users should be able to track what personal information is used by a mobile
application and determine whether
it is displayed locally or sent over the
network to a third party. The same use case covers the propagation of erroneous results, when we need to understand which pieces of data have been invalidated by the discovery of an error.
How was it obtained? Provenance
can also be used to obtain a better
understanding of the actual process
through which different pieces of input data are transformed into outputs.
This is important in situations where
computer engineers or system administrators need to debug the problems
that arise when running complex software stacks.
In cases where it is possible to differentiate between correct and erroneous
system output, comparing their provenance will point to a list of potential
root causes of the error. In more complex scenarios, the issue might not be
directly linked to particular outputs but
to an (undesired) change in behavior.
Detecting system intrusions or explaining why the response tail latency has
increased by 20% for a server are good
examples. In those cases, grouping outputs with similar provenance could be
used for identifying normal versus abnormal system behavior and explaining
the differences between the two.
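The comparison described above can be sketched very directly. Assuming each output's provenance is represented as a set of recorded inputs and process steps (a simplification of what real systems capture), anything common to all bad outputs but absent from every good one is a candidate root cause; the data and library names below are hypothetical.

```python
def candidate_root_causes(good_prov, bad_prov):
    """Compare provenance of known-good and known-bad outputs.

    Each argument maps an output name to the set of inputs and
    process steps in its provenance. Items appearing in every
    bad output's provenance but in no good output's provenance
    are flagged as candidate root causes.
    """
    good_union = set().union(*good_prov.values()) if good_prov else set()
    bad_common = set.intersection(*bad_prov.values()) if bad_prov else set()
    return bad_common - good_union

# Hypothetical provenance for four reports: the bad ones were
# all built against a newer library version.
good = {"report_a": {"data_v1", "lib_2.0"},
        "report_b": {"data_v2", "lib_2.0"}}
bad = {"report_c": {"data_v1", "lib_2.1"},
       "report_d": {"data_v3", "lib_2.1"}}
print(candidate_root_causes(good, bad))  # → {'lib_2.1'}
```

Grouping outputs by such provenance similarity, rather than diffing two fixed groups, extends the same idea to the behavioral-drift scenarios mentioned above.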
Together, the three use cases provide
an overview of the ideal provenance
application space, but they do not describe the technical details involved in
making those applications possible.
To realize each scenario in practice, one or more provenance systems need to be integrated into the data-processing workflow, becoming responsible for capturing provenance, propagating it among related components, and making it accessible to user queries.
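At the application level, the capture step can be as lightweight as interposing on the workflow's own functions. The sketch below is not how the systems surveyed here work (PASS and SPADE, for instance, observe events at the OS level); it is a minimal illustration, assuming a single-process Python workflow, of a wrapper that records each step's inputs and output into an in-memory log that queries could later consume.

```python
import functools

PROVENANCE = []  # in-memory provenance log (assumption: single process)

def track(fn):
    """Decorator recording each call's inputs and output, making
    provenance capture transparent to the workflow code."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        PROVENANCE.append({"step": fn.__name__,
                           "inputs": (args, kwargs),
                           "output": result})
        return result
    return wrapper

# Two hypothetical workflow steps.
@track
def normalize(x):
    return x / 10

@track
def score(x):
    return x + 1

score(normalize(20))
print([record["step"] for record in PROVENANCE])
# → ['normalize', 'score']
```

Propagation and querying then reduce to linking such records across components, which is exactly where the design choices of the surveyed systems diverge.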