cern where provenance is being recorded at an extremely fine-grained level. In
such situations it is common for the
cost of provenance capture to equal
or exceed the cost of the recorded operation, leading to slowdowns of 100% or more. For example, in the Lipstick
system operator-level provenance is
reported to lead to a slowdown of two
to three times,1 while in the RAMP system, where provenance is collected
at the tuple level by propagating tags
through a MapReduce workflow, it is
common to observe a temporal overhead of up to 75%.
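The two figures above are quoted in different units (a slowdown factor and a percentage), so a small conversion helps when comparing them. The sketch below is illustrative only: the function names and timings are assumptions rather than measurements from Lipstick or RAMP, and it reads "a slowdown of N times" as N times the baseline run time.

```python
def overhead_percent(baseline_s: float, with_provenance_s: float) -> float:
    """Extra run time caused by provenance capture, as a percentage of the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (with_provenance_s - baseline_s) / baseline_s * 100.0


def slowdown_factor(overhead_pct: float) -> float:
    """Convert a percentage overhead into a 'times the baseline' factor."""
    return 1.0 + overhead_pct / 100.0


# Hypothetical timings: a 10 s job that takes 17.5 s with capture enabled.
print(overhead_percent(10.0, 17.5))  # 75.0
print(slowdown_factor(75.0))         # 1.75
# Under the stated reading, a run that takes three times as long has 200% overhead.
print(overhead_percent(10.0, 30.0))  # 200.0
```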
Spatial overhead. Similar to temporal overhead, the spatial overhead of
systems recording process execution
is a function of the amount of data per
operation and the number of recorded
operations. In the set of studied systems only half (SPROV, Burrito, Lipstick, and RAMP) are capable of recording data changes.
While the actual overhead of any
workload is sensitive to multiple factors, here are two reported data points
for illustrative purposes:
• The general-purpose PASSv2 system requires, on average, approximately 20% additional space overhead (compared with the original
output size) to log all the operations
for a workload representative of real-world applications.
• The Burrito system, running on a
real user workload, required 800MB
for provenance storage and 2GB for file
versions over a two-month period.
These results indicate that storage
overhead should not be prohibitive for
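As a back-of-the-envelope aid, the two data points can be turned into rough planning numbers. This is a sketch under stated assumptions: the function names are hypothetical, 1GB is taken as 1,000MB, "two months" is taken as 60 days, and the 50GB output size is an invented example rather than a figure from either system.

```python
def estimated_provenance_bytes(num_operations: int, bytes_per_operation: float) -> float:
    """Spatial overhead modelled as (recorded operations) x (data per operation)."""
    return num_operations * bytes_per_operation


def relative_overhead_gb(output_gb: float, overhead_fraction: float) -> float:
    """Extra space needed when overhead is quoted relative to the output size."""
    return output_gb * overhead_fraction


def daily_growth_mb(total_mb: float, days: int) -> float:
    """Average storage growth per day over an observation period."""
    return total_mb / days


# PASSv2-style figure: roughly 20% on top of a hypothetical 50GB of output.
print(relative_overhead_gb(50.0, 0.20))

# Burrito figures from the text: 800MB of provenance plus 2GB of file versions.
# Assumptions: 1GB = 1,000MB and "two months" = 60 days.
burrito_total_mb = 800 + 2 * 1000
print(round(daily_growth_mb(burrito_total_mb, 60), 1))  # 46.7
```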
Overhead trade-offs. Generally
speaking, there is a direct trade-off
between capture granularity and
provenance overhead. SPADEv2, for
example, allows users to capture information at the function call or an
application-defined level at the cost
of increased temporal and spatial capture overhead. Similarly, SPROV allows
users to specify modifications in higher-level semantics (for example, “new
section added to file”) at the cost of reduced per-operation observability.
For users to adopt the most suitable
system for their needs, it may be useful
for them to determine in advance what
provenance information will be required to
answer queries and at what granularity
this information will be sufficient, and
then to map those requirements to the appropriate system.
Most systems also delay provenance
construction in order to minimize
capture overhead. PASSv2, for
example, captures raw operation
records, converting them to their final
representation via an asynchronous
user-space daemon.
SPADEv2 uses separate provenance collection
threads to extract, filter, and commit
operations to the provenance log.
Other systems delay provenance collection
to query time to avoid wasting
resources computing provenance
that will never be accessed. For example,
Lipstick carries out provenance
construction only when a query is
issued.1 This delayed provenance-construction
property is present in some
workflow systems as well. ZOOM, for
example, will compute some of the
provenance at query time, based on
the current user view. Depending on
the required cardinality, timeliness,
and complexity of provenance queries,
deciding on those trade-offs may
considerably reduce overhead.
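The two deferral strategies described above can be sketched in a few lines. This is a minimal illustration, not the design of PASSv2, Lipstick, or ZOOM: the class, the function, and the record layout are all hypothetical. The first part keeps the capture path cheap by queueing raw records for a background thread; the second builds lineage only when a query asks for it.

```python
import queue
import threading


class DeferredProvenanceLog:
    """Capture enqueues raw records; a worker builds the final form later."""

    def __init__(self):
        self._raw = queue.Queue()
        self.records = []  # final representation, filled by the worker
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def capture(self, operation, obj):
        # Hot path: store the raw tuple and return immediately.
        self._raw.put((operation, obj))

    def _drain(self):
        while True:
            item = self._raw.get()
            if item is None:  # sentinel queued by close()
                break
            operation, obj = item
            self.records.append({"operation": operation, "object": obj})

    def close(self):
        # Flush everything captured so far, then stop the worker.
        self._raw.put(None)
        self._worker.join()


def lineage_at_query_time(raw_events, target):
    """Compute the ancestors of `target` from (source, derived) pairs on demand."""
    parents = {}
    for source, derived in raw_events:
        parents.setdefault(derived, set()).add(source)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


log = DeferredProvenanceLog()
log.capture("write", "/tmp/a")
log.capture("read", "/tmp/b")
log.close()
print(log.records)
print(lineage_at_query_time([("a", "b"), ("b", "c"), ("x", "y")], "c"))
```

The trade-off is visible in the shape of the code: the queue adds latency before a record is queryable, while query-time construction pays the traversal cost on every query instead of once at capture.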
Security issues. It is imperative for
provenance data to be secured against
unauthorized access and to not leak
any information about the data against
which it is collected.
This requires provenance to be managed under different access policies
than those of the data. Doing so allows
the user flexibility over the disclosure
of provenance information. For example, one might make provenance
inaccessible to people outside an organization, as it would reveal proprietary
workflow or processes. The final data
result, however, might be freely available to anyone.
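The idea of governing provenance by a policy separate from that of the data can be illustrated with a small sketch. Everything here is hypothetical (the class, field names, and user identifiers are invented for illustration), and a real system would enforce such checks in a trusted component rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedArtifact:
    """A data item whose provenance is guarded by its own reader list."""
    data: str
    provenance: list
    data_is_public: bool = False
    data_readers: set = field(default_factory=set)
    provenance_readers: set = field(default_factory=set)

    def read_data(self, user: str) -> str:
        if self.data_is_public or user in self.data_readers:
            return self.data
        raise PermissionError(f"{user} may not read the data")

    def read_provenance(self, user: str) -> list:
        # Checked against a separate policy: access to the data grants nothing here.
        if user in self.provenance_readers:
            return list(self.provenance)
        raise PermissionError(f"{user} may not read the provenance")


# The final result is public; the workflow that produced it is internal only.
report = ProtectedArtifact(
    data="quarterly results",
    provenance=["extract", "clean", "aggregate"],
    data_is_public=True,
    provenance_readers={"alice@corp"},
)
print(report.read_data("outsider"))
print(report.read_provenance("alice@corp"))
```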
Formally, the security aspects of
provenance are defined as its
confidentiality (only authorized parties can
read it) and its integrity (it cannot be
forged or altered). Both properties are
considered essential for performing
integrity, validation, and consistency
checks on data.
Two solutions to the problem of
providing secure provenance have
been put forth. The first leverages the
concept of reference monitors: Patrick
McDaniel et al. discuss a secure
system for end-to-end provenance
based on the principle of a host-based
tamper-proof provenance monitor
that mirrors the well-known reference
monitor concept for the enforcement
of security policies.19 The presence of
the reference monitor means the security
of provenance collection does not
have to rely on the integrity of other
system components such as the kernel.
While this solution is feasible,
there is no known practical implementation
to date.
The second solution is based on
provenance chains11, 16 where processes that generate provenance must attest to the information added in an
encrypted, nonmodifiable, and nonrepudiable manner. Guaranteeing these
three properties ensures all collected
provenance can remain confidential
and keep its integrity. Of the systems
included in this article, SPROV16 is
a practical implementation of provenance chains. It primarily provides
confidentiality and integrity guarantees for file modifications.
SPROV leverages a number of concepts in cryptography to fulfill the security requirements: confidentiality is
maintained by encrypting the metadata describing each change; record
integrity is maintained by checksumming records; and attestation is supported by signing records with the private key of the creating user.
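The general shape of a provenance chain, with each record checksummed, linked to its predecessor, and attested by its creator, can be sketched with the Python standard library. This is not SPROV's actual construction. Two simplifications are assumed and should be read as such: metadata encryption is omitted because the standard library has no suitable cipher, and an HMAC stands in for a digital signature. Because an HMAC uses a shared secret key, it does not provide nonrepudiation; a real system would use an asymmetric signature scheme. All names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous checksum" for the first record


def _payload(user, change, prev):
    body = {"user": user, "change": change, "prev": prev}
    return json.dumps(body, sort_keys=True).encode()


def append_record(chain, user_key: bytes, user: str, change: str):
    """Append a record that is checksummed, chained, and tagged by its creator."""
    prev = chain[-1]["checksum"] if chain else GENESIS
    payload = _payload(user, change, prev)
    record = {
        "user": user,
        "change": change,
        "prev": prev,
        "checksum": hashlib.sha256(payload).hexdigest(),
        # Stand-in for a signature (assumption): HMAC with the user's secret key.
        "tag": hmac.new(user_key, payload, hashlib.sha256).hexdigest(),
    }
    chain.append(record)
    return record


def verify_chain(chain, keys: dict) -> bool:
    """Return True only if every link, checksum, and tag is intact."""
    prev = GENESIS
    for rec in chain:
        payload = _payload(rec["user"], rec["change"], rec["prev"])
        key = keys.get(rec["user"])
        if key is None or rec["prev"] != prev:
            return False
        if hashlib.sha256(payload).hexdigest() != rec["checksum"]:
            return False
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, rec["tag"]):
            return False
        prev = rec["checksum"]
    return True


keys = {"alice": b"alice-secret", "bob": b"bob-secret"}
chain = []
append_record(chain, keys["alice"], "alice", "created report.txt")
append_record(chain, keys["bob"], "bob", "new section added to file")
print(verify_chain(chain, keys))   # True

chain[0]["change"] = "nothing happened"  # tamper with history
print(verify_chain(chain, keys))   # False
```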
In addition to the key concepts of
confidentiality and integrity, SPROV
provides a number of useful features
that may be of interest to the practitioner (and a consideration for future
secure provenance systems): through
the use of cryptographic commitments,3 SPROV enables selective exposure of records to third parties; by
employing broadcast encryption,
it supports selective access control
for multiple auditors without requiring a proportional increase in the number of keys; finally,
threshold encryption27 is supported,
enabling separation-of-duty scenarios in which the decryption of records
requires participation from at least
one auditor in a number of distinct
SPROV has no mechanism for preventing unauthorized reads, relying
instead on the encryption of records to
prevent unauthorized access. It is, however, the only system of those included
here that provides any provenance confidentiality and integrity guarantees.
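To give a feel for how cryptographic commitments permit selective exposure, the sketch below uses a simple hash-based commitment: a record's owner publishes only a digest, and can later open it to a chosen third party by revealing the record and a random nonce. This is a generic textbook construction offered as an assumption-laden illustration; it is not taken from SPROV, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

NONCE_BYTES = 32  # fixed length keeps nonce || record unambiguous


def commit(record: bytes):
    """Return (commitment, nonce). Publish the commitment; keep the nonce private."""
    nonce = secrets.token_bytes(NONCE_BYTES)
    commitment = hashlib.sha256(nonce + record).hexdigest()
    return commitment, nonce


def verify_opening(commitment: str, nonce: bytes, record: bytes) -> bool:
    """Check that a revealed (nonce, record) pair matches a published commitment."""
    if len(nonce) != NONCE_BYTES:
        return False
    candidate = hashlib.sha256(nonce + record).hexdigest()
    return hmac.compare_digest(candidate, commitment)


record = b"alice: new section added to file"
commitment, nonce = commit(record)

# Later, the owner reveals this one record to an auditor of their choosing.
print(verify_opening(commitment, nonce, record))               # True
print(verify_opening(commitment, nonce, b"a forged record"))   # False
```

The random nonce hides the record from anyone who sees only the digest, while the hash binds the owner to the record that was committed.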