or (observed) low-level operations
carried out by executing processes. Broadly speaking, the time and
space overhead for capturing the
provenance of workflow evolution is
proportional to both the number of
changes in the workflow and the number of times a workflow is executed.
In comparison, the provenance overhead of capturing an execution log is
proportional to the number of recordable operations executed.
Time overhead. In practice, the
provenance-capture cost of workflow
systems (and, by extension, other disclosed provenance systems) is minuscule because of their limited approach
to collecting running process information. Both ZOOM and VisTrails, for
example, report an approximately 1%
increase in execution time.2,10
For systems that record process execution, provenance capture costs are
a function of the costs of intercepting
and recording observable operations.
While intuitively it may appear that
provenance capture at the operation
level is prohibitively expensive from a
temporal perspective, reported results
show this is not the case. Kernel-based
system-call interception mechanisms
such as in PASSv2 have a 1% to 23%
overhead on workloads representative
of real-world applications.22,23 Similarly, SPADEv2, which uses kernel-auditing infrastructure for provenance capture, reports less than a 10% overhead
on Windows, Linux, and OS X for production Apache runs.12
For I/O-heavy workloads, however,
provenance capture may impose larger
runtime overheads. PASS, for example,
reports up to a 230% overhead on small
file benchmarks,23 even though the absolute increase in execution times remains small.
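The gap between relative and absolute overhead can be shown with a short calculation. This is a minimal sketch: the 230% and 23% figures come from the text, while the baseline runtimes and the helper name `overhead_breakdown` are hypothetical values chosen only for illustration.

```python
def overhead_breakdown(baseline_s, relative_overhead_pct):
    """Return (total runtime, absolute increase) in seconds."""
    absolute_increase_s = baseline_s * relative_overhead_pct / 100.0
    return baseline_s + absolute_increase_s, absolute_increase_s


# Hypothetical baselines; only the percentages are taken from the text.
small_total, small_extra = overhead_breakdown(baseline_s=0.2, relative_overhead_pct=230)
large_total, large_extra = overhead_breakdown(baseline_s=600.0, relative_overhead_pct=23)

print(f"small-file run:   +{small_extra:.2f} s (total {small_total:.2f} s)")
print(f"long-running job: +{large_extra:.2f} s (total {large_total:.2f} s)")
```

Under these assumed baselines, the run with the larger percentage overhead adds well under a second, while the run with the smaller percentage adds minutes.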
The interception mechanism can
also significantly influence provenance-capture overhead in this regard. SPADEv2, for example, supports
operation interception via the kernel-auditing mechanisms on OS X, while
on Windows it requires a file-system
filter driver that relays operations to
the provenance collector. As a consequence, provenance-enabled Apache
builds are 50% slower on Windows but
only 5% slower on OS X.
The temporal cost of recording operations may also be of potential con-
Directed. The second major paradigm is the directed query, an approach more closely linked to the classic field of database query. It requires
the user to express questions about the
provenance of data as queries in a language that is often a specialized extension of SQL or a path query language.
This is effective if the user knows
precisely what information is required,
but unlike exploratory methods, the directed query approach does not facilitate discovery of new insights about the
provenance graph.
One example of the directed approach is vtPQL,26 used in the VisTrails
system. The language is designed to
enable the user to express provenance
queries about three different aspects
of the workflow: the execution log, the
abstract workflow representation, and
the evolution of the workflow in time.
The user can specify restrictions on
all of these spaces simultaneously—
for example, restricting the execution
logs to a particular day, highlighting
a single workflow module, and choosing a particular version of the workflow. This is helpful, as it allows the
user to think in terms of orthogonal
querying concerns.
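The idea of restricting several orthogonal dimensions at once can be sketched in a few lines of Python. This is not vtPQL syntax, which the text does not show. The record type `ExecutionRecord`, the function `directed_query`, and the sample data are all hypothetical stand-ins for the three query spaces described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExecutionRecord:
    workflow_version: int  # position in the workflow's evolution history
    module: str            # module in the abstract workflow
    run_date: date         # day the execution was logged


def directed_query(records, *, on_day=None, module=None, version=None):
    """Return the records that satisfy every restriction that is not None."""
    return [
        r for r in records
        if (on_day is None or r.run_date == on_day)
        and (module is None or r.module == module)
        and (version is None or r.workflow_version == version)
    ]


# Hypothetical execution log.
log = [
    ExecutionRecord(1, "load", date(2024, 1, 5)),
    ExecutionRecord(1, "render", date(2024, 1, 5)),
    ExecutionRecord(2, "render", date(2024, 1, 5)),
    ExecutionRecord(2, "render", date(2024, 1, 6)),
]

# Restrict all three dimensions simultaneously.
hits = directed_query(log, on_day=date(2024, 1, 5), module="render", version=2)
print(hits)
```

Each keyword argument constrains one dimension independently, so dropping a restriction widens the result without affecting the others.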
Hybrid. Some systems use a hybrid
of the two paradigms. For example,
the ZOOM system2 starts from a user-provided “declaration of interest” to
derive a contextually appropriate minimal form of the provenance graph. The
heart of the system is an algorithm
that summarizes “irrelevant” parts of
the graph in ways that maintain their
semantics. The user needs only to
provide the list of the modules in the
workflow definition that are of interest
and is then allowed to browse the provenance graph without being distracted
by unimportant pieces of information.
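The general idea of hiding uninteresting modules while preserving connectivity can be sketched as follows. This is a simplified reachability-based stand-in, not ZOOM's actual algorithm, which the text describes only at a high level. The function `summarize`, the graph encoding, and the module names are assumptions made for illustration.

```python
def summarize(edges, relevant):
    """Collapse a workflow DAG onto the modules the user cares about.

    edges:    dict mapping each module to a list of downstream modules.
    relevant: set of modules named in the user's declaration of interest.
    Returns a dict mapping each relevant module to the set of relevant
    modules reachable from it through irrelevant modules only.
    """
    view = {}
    for start in relevant:
        reached, seen = set(), set()
        stack = list(edges.get(start, []))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in relevant:
                reached.add(node)  # keep the dependency, hide the path to it
            else:
                stack.extend(edges.get(node, []))  # look through hidden module
        view[start] = reached
    return view


# Hypothetical five-step workflow.
workflow = {
    "align": ["format"],
    "format": ["filter"],
    "filter": ["tree"],
    "tree": ["plot"],
    "plot": [],
}

view = summarize(workflow, relevant={"align", "tree"})
print(view)
```

In this sketch the intermediate `format` and `filter` steps disappear from the view, but the dependency of `tree` on `align` is retained.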
Understanding Overhead
As with any computational functionality, provenance capture has associated
temporal and spatial costs. Given that
provenance support is likely to be an
additional consideration to the primary function of the system, leveraged
only when the lineage properties of the
data are required, it is imperative to
minimize the overhead.
General-purpose provenance systems typically capture either
(disclosed) evolutions of a given workflow