Application Level. Burrito14 tracks
user-space events, while also supporting additional user-provided annotations. SPROV16 focuses on the security of provenance and provides a thin
wrapper around the standard C I/O
library. A newer version can use
provenance captured by other
systems, such as PASS.
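To make the idea of a thin I/O wrapper concrete, here is a minimal sketch in Python. It is not SPROV's implementation (which wraps the C I/O library and adds security guarantees that are omitted here); the names traced_open and provenance_log, and the fields recorded, are assumptions made purely for illustration.

```python
import os
import tempfile

# Hypothetical in-memory provenance store; a real system would persist
# and protect these records.
provenance_log = []


def traced_open(path, mode="r", **kwargs):
    """Open a file as usual, but first record which process accessed it and how."""
    # Simplification: any mode other than plain reading counts as a write.
    access = "read" if mode.startswith("r") and "+" not in mode else "write"
    provenance_log.append(
        {"pid": os.getpid(), "path": os.path.abspath(path), "access": access}
    )
    return open(path, mode, **kwargs)


# Usage: the application calls traced_open wherever it would call open.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results.txt")
    with traced_open(target, "w") as f:
        f.write("42\n")
    with traced_open(target) as f:
        data = f.read()
```

Because the wrapper sits inside the application, it sees only the I/O that the application routes through it, which is the integration cost of application-level capture.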
Big Data. Lipstick1 and RAMP24 both
tackle the problem of tracking provenance in big-data scenarios
(MapReduce jobs).
The properties of these systems
define what can be recorded and
with what trade-offs, overhead, and security implications.
Provenance System Properties
Effectively using provenance systems
in practice requires understanding
several aspects:
• The exact metadata being captured.
• The effort required to integrate
provenance systems into existing
workflows (running special kernels,
making runtime changes, or linking
applications with provenance libraries).
• How provenance metadata can
later be queried.
• The overhead imposed
by those systems.
• The security issues added by provenance, which might require different access controls from those of the
data itself.
This article categorizes the properties of the systems selected as representative according to these features,
referring to the motivating use cases
as required.
What can it capture? The metadata
captured by provenance systems typically relates the state of digital entities (files, tables, programs, network
connections, and so on) at different
stages in their lifetimes to historic dependencies on other entities or processes. In this context, two concepts
are fundamental for determining
what is captured and how: granularity
and layering.
Granularity. The granularity of capture
refers to the size of the basic primitives
that accumulate provenance
within a system. Consider a scientist
who uses a configuration file storing
various experiment parameters as one
of the inputs to a simulation program.
Capturing provenance at file granularity
will establish the dependency between
the simulation program and the
configuration file name. The scientist,
however, is interested in understanding
the relationships between the simulation
results and individual parameters in the
file, and this requires capture
at subfile granularity.
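The difference between the two granularities can be sketched in a few lines of Python. This is an illustrative sketch, not code from any of the surveyed systems: the function names, the "key = value" configuration format, the program name simulate, and the file name params.cfg are all assumptions made for the example.

```python
def parse_config(text):
    """Parse hypothetical 'key = value' lines into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def capture_file_level(program, config_name):
    """File granularity: a single dependency on the file as a whole."""
    return [(program, "used", config_name)]


def capture_subfile_level(program, config_name, config_text):
    """Subfile granularity: one dependency per parameter in the file."""
    return [
        (program, "used", f"{config_name}#{key}={value}")
        for key, value in parse_config(config_text).items()
    ]


config_text = "timestep = 0.01\nparticles = 5000\n"
file_deps = capture_file_level("simulate", "params.cfg")
param_deps = capture_subfile_level("simulate", "params.cfg", config_text)
```

Here file_deps holds one record naming only params.cfg, while param_deps holds one record per parameter, which is what the scientist needs in order to relate results to individual settings. The finer records cost more to capture and store, since the capture layer has to understand the file's format.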
The exact meaning of varying granularity (from fine to coarse) also depends
on the underlying data model of the
application. For example, database-provenance systems could store provenance metadata for an entire table, a
row within the table, or for each cell.
Provenance capture at the table level
is coarse grained and can answer questions such as: From which other tables
has table X derived its data? Finer granularities would determine the relation-
[Figure: A timeline of provenance systems. Timeline marks: ’91, ’98, ’02, ’05, ’06, ’09, ’11, ’12, ’13. Rows: database; e-science (workflows); systems. Systems shown: Trio, Taverna, Pegasus, ES3, Chimera, Kepler, ZOOM, VisTrails, RAMP, Lipstick, SPADEv2, Burrito, PASSv2, PASS, TREC, Tioga, Geolineus, SPROV. Labels: code wrappers; uncertainty and lineage; GIS; syscall interposition; augmented language or APIs.]