Data Model (PROV-DM20), or by providing a universal API and allowing each component to both accept and generate provenance using it. PASSv2 provides a disclosed-provenance API (DPAPI) that can be used for this purpose.
A second issue exists, however.
Merely collecting metadata at different layers will result in islands of provenance, unrelated to each other. To actually map provenance objects between layers, all entities describing the same event must be grouped, for example by tagging them with a unique identifier.
SPADEv2, for example, uses a multisource fusion filter (with process ID as a tag) to combine provenance data from multiple sources describing the same event and working at the same level of abstraction. When provenance is reported at different levels of abstraction, SPADEv2 uses a cross-layer composition filter that serves the same purpose.
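The idea of tag-based fusion can be sketched in a few lines of Python: records reported by different layers are grouped under a shared tag and their annotations merged. The record layout, field names, and the choice of process ID as the tag are illustrative assumptions, not SPADEv2's actual interface.

    from collections import defaultdict

    def fuse_by_tag(records, tag_key="pid"):
        """Merge provenance records from multiple sources that carry the
        same tag value (here, a process ID) into one combined record."""
        groups = defaultdict(list)
        for record in records:
            groups[record[tag_key]].append(record)

        fused = {}
        for tag, group in groups.items():
            merged = {tag_key: tag, "sources": [], "annotations": {}}
            for record in group:
                merged["sources"].append(record["source"])
                merged["annotations"].update(record.get("annotations", {}))
            fused[tag] = merged
        return fused

    # Two layers report metadata about the same event in process 4242.
    records = [
        {"source": "os",  "pid": 4242, "annotations": {"exe": "/usr/bin/python3"}},
        {"source": "app", "pid": 4242, "annotations": {"script": "copy.py"}},
    ]
    print(fuse_by_tag(records)[4242])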
Data versioning. Provenance collection in a given layer typically involves
capturing the chain of events performed by the application on a given
piece of data, though this does not necessarily require the system to capture
multiple versions of data as it is being
transformed. Assume a user edits a file
using a text editor on a PASS-enabled
system. The provenance metadata
saved by PASS can provide information
such as which program was used to edit the file, how many bytes were written, and so on, but it is not possible to revert the file to a previous state or to know what the actual data changes were. In cases where the current contents of the file depend on values in previous versions, provenance systems need to store data versions in addition to events in order to ensure full verifiability.
Because of this, provenance systems such as Burrito14 not only track system-call-level events but also run on top of a versioning file system. Other systems, such as Lipstick and RAMP, do not require versioning because they run on top of append-only file systems (all versions are implicitly stored).
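As a rough sketch of what storing versions alongside events buys, the following Python fragment keeps a content-addressed snapshot of a file with every recorded write, so earlier states can be recovered rather than only knowing that a write occurred. The function names and record layout are hypothetical, not the interface of PASS or Burrito.

    import hashlib
    import time

    snapshots = {}   # content hash -> file bytes (the stored versions)
    events = []      # append-only log of provenance events

    def record_write(path, actor):
        # Capture both the event and the resulting version of the data.
        with open(path, "rb") as f:
            data = f.read()
        version = hashlib.sha256(data).hexdigest()
        snapshots.setdefault(version, data)
        events.append({"time": time.time(), "actor": actor,
                       "path": path, "version": version})

    def revert(path, version):
        # Only possible because data versions were stored, not just events.
        with open(path, "wb") as f:
            f.write(snapshots[version])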
Versioning can prove expensive when done for certain layers in the stack (such as for hardware registers), but in other cases it might simplify the capture of provenance. This is the case at the application layer, where data moves between individual rows or cells. Of course, multiple granularities can be considered at the same time.
Systems such as PASS6 capture
provenance by intercepting system
calls made by applications as they execute. At this level, provenance is fine-grained and can provide a detailed picture of an application's execution and dependencies. The noise level in the collected data, however, is also high, making it harder to extract useful information.
Consider a Python script that copies
one file to another. When running the
script, the Python interpreter will first
read and load any required modules
from disk. Thus, beyond the dependency on the actual input, the final provenance graph will link the output file to all of the Python modules used by the interpreter. This extra data can make it difficult for an end user to sift through the provenance graph, so heuristics are generally needed to determine which entities are important and which should be ignored.
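One simple heuristic, shown here as an assumption rather than the policy of any particular system, is to drop dependency edges that point at interpreter modules or shared libraries, which tend to dominate system-call-level provenance graphs.

    # Paths treated as noise; the prefixes are illustrative.
    NOISE_PREFIXES = ("/usr/lib/python", "/usr/lib/", "/lib/")

    def prune(edges):
        """edges: list of (output_file, input_file) dependency pairs."""
        return [(out, inp) for out, inp in edges
                if not inp.startswith(NOISE_PREFIXES)]

    edges = [
        ("copy.txt", "original.txt"),
        ("copy.txt", "/usr/lib/python3.11/shutil.py"),
        ("copy.txt", "/usr/lib/python3.11/os.py"),
    ]
    print(prune(edges))   # keeps only the dependency on original.txt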
Workflow systems such as VisTrails26
avoid the noise problem and can capture provenance at any granularity, because the processing steps and their
dependencies are explicitly declared by
the end user. Such systems, however,
are also inherently limited to recording
only those data transformations that
were part of the defined workflow.
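The contrast can be made concrete with a small sketch: when each workflow step declares its inputs and outputs (the structure below is illustrative, not VisTrails' actual model), the provenance edges follow directly from the declaration and contain no noise.

    workflow = [
        {"step": "clean", "inputs": ["raw.csv"],   "outputs": ["clean.csv"]},
        {"step": "plot",  "inputs": ["clean.csv"], "outputs": ["figure.png"]},
    ]

    # Provenance edges follow directly from the declared dependencies.
    edges = [(out, inp)
             for step in workflow
             for out in step["outputs"]
             for inp in step["inputs"]]
    print(edges)   # [('clean.csv', 'raw.csv'), ('figure.png', 'clean.csv')]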
The n-by-m problem. Regardless of the system chosen, accurately determining the dependencies between input and output data may not be possible. This is illustrated by
the n-by-m problem, where a program
reads n input files and writes m output
files. Even when tracing system calls
for individual reads and writes, it is not
possible to infer which reads affected
a particular write, so the provenance
graph has to link each output file to all
of the inputs. A system that is unaware of the semantics of the individual data transformations within a process will always report a number of such false-positive relationships. Both PASS and VisTrails have this problem, because they treat the process or each workflow step as a black box.
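A short sketch of the problem: with only read and write system calls visible, a tracer must conservatively link every output to every input, even if each output really depends on a single input file (the file names below are made up).

    inputs = ["in1.csv", "in2.csv", "in3.csv"]    # n = 3 input files
    outputs = ["out1.csv", "out2.csv"]            # m = 2 output files

    # Without knowledge of the program's semantics, the only safe
    # provenance graph contains all n * m edges.
    edges = [(out, inp) for out in outputs for inp in inputs]
    print(len(edges))   # 6 edges, most of them potential false positives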
The n-by-m problem can be solved by capturing provenance at an even finer granularity. This can be done using binary instrumentation techniques25 and computing the provenance of the output as a function of the executed code path and data dependencies. Although this method requires no modification of the application, the trade-off is a significant increase in space and time overhead. A low-overhead alternative would be to modify the application to explicitly disclose relevant provenance using an API such as CPL,17 but this requires additional effort from the developer, as we discuss later.
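As a sketch of disclosed provenance (using a made-up disclosure class rather than CPL's real API), the application itself reports the exact dependency for each output, eliminating the false positives a system-call tracer would record.

    class Disclosure:
        """Hypothetical disclosure API; CPL's real interface differs."""
        def __init__(self):
            self.edges = []

        def derived_from(self, output, inputs):
            for inp in inputs:
                self.edges.append((output, inp))

    prov = Disclosure()

    def transform(src, dst):
        with open(src) as fin, open(dst, "w") as fout:
            fout.write(fin.read().upper())
        prov.derived_from(dst, [src])   # exact, developer-disclosed dependency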
Granularity is not the only aspect that users need to think about when determining their requirements for a provenance system. It is just as important to know in which layer the provenance collection takes place.
Layering. Provenance metadata can be captured at multiple layers in the stack: in the application, in middleware (runtimes and libraries), in the operating system, and/or in hardware. Capturing provenance across multiple layers gives users the ability to reason about their data and processes at different levels of abstraction, with each layer providing a different view of the same set of events happening in the system.
For example, consider copying rows
between two tables in a spreadsheet
and saving the result. A system that
collects provenance at the operating-system layer will observe a number of
I/O operations to/from the file. The
notions of tables and rows, however,
are known only to the application, and
dependencies among them cannot be
inferred from the metadata collected
by lower layers. If such relationships need to be queried, provenance must be captured at the application layer as well.
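The two views can be sketched as follows; the record layouts are illustrative assumptions, not the schema of any particular system. Only the application-layer record can answer a query about row-level dependencies, while the operating-system record shows only file-level I/O.

    # Operating-system view of saving the spreadsheet: file-level I/O only.
    os_layer = [
        {"layer": "os", "event": "write", "path": "budget.xlsx", "bytes": 8192},
    ]

    # Application view of the same action: which rows were copied where.
    app_layer = [
        {"layer": "app", "event": "copy_rows",
         "src": ("Sheet1", "A2:A10"), "dst": ("Sheet2", "A2:A10")},
    ]

    provenance = os_layer + app_layer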
Cooperation between layers. When provenance capture is required at multiple layers, a practitioner could choose a different (specialized) provenance system for each layer in the stack or a single provenance system designed to span capture across multiple layers.
In both cases, multiple provenance-aware components must cooperate by communicating metadata between the different layers. This can be achieved either by adhering to a common provenance data model, such as the Open Provenance Model (OPM)21 or Provenance