interest in a process through a provenance query,
essentially performing a reverse graph traversal over the
data flow DAG and terminating according to the
query-specified scope; the query output is a DAG subset. Scoping can be based on types of relationships,
intermediary results, services, or subprocesses [7].
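The reverse traversal can be sketched in a few lines of Python. This is only an illustrative sketch, not the architecture's actual query interface: the edge representation, the function name provenance_query, and the two scoping parameters (a set of allowed relationship types and a depth limit) are assumptions made for the example.

```python
from collections import deque

# Hypothetical data flow DAG. Each edge points from an effect to a cause
# it depends on: (effect, relationship, cause). All names are illustrative.
EDGES = [
    ("result", "is based on", "intermediate_a"),
    ("result", "is based on", "intermediate_b"),
    ("result", "is justified by", "report"),
    ("intermediate_a", "in response to", "request_a"),
    ("intermediate_b", "in response to", "request_b"),
]


def provenance_query(edges, start, allowed_relationships=None, max_depth=None):
    """Walk backwards from `start` and return the edges inside the scope.

    The scope is given by an optional set of relationship types to follow
    and an optional maximum traversal depth; the result is a subset of
    the edges of the original DAG.
    """
    causes_of = {}
    for effect, relationship, cause in edges:
        causes_of.setdefault(effect, []).append((relationship, cause))

    selected = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # scope reached: do not expand this node further
        for relationship, cause in causes_of.get(node, []):
            if (allowed_relationships is not None
                    and relationship not in allowed_relationships):
                continue  # relationship type is outside the query scope
            selected.append((node, relationship, cause))
            if cause not in seen:
                seen.add(cause)
                queue.append((cause, depth + 1))
    return selected


full = provenance_query(EDGES, "result")
data_only = provenance_query(EDGES, "result", {"is based on"})
one_step = provenance_query(EDGES, "result", max_depth=1)
```

With no scope the query returns the whole graph reachable from the starting item; restricting the relationship types or the depth prunes it to the part the querier cares about.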
IN HEALTH CARE MANAGEMENT
To illustrate our approach, we explore a health care
management application. The Organ Transplant
Management (OTM) system under development by
the Catalan Transplant Organization, Catalonia,
Spain, manages all the activities pertaining to organ
transplants across multiple Catalan hospitals and
their regulatory authority, the government of Catalonia, Spain [1]. OTM consists of a complex
process involving the surgery itself, along with such
activities as data collection and patient organ analysis that must comply with a set of regulatory rules.
OTM is supported by an IT infrastructure that
maintains records that allow medical personnel to
view (and edit) a given patient’s local file within a
given institution or laboratory. However, the system
does not yet connect records, capture the dependencies among them, or allow external auditors or
patients’ families to analyze or understand how decisions are made.
By making OTM provenance-aware, powerful queries that are impossible without provenance awareness can now be
supported (such as find all doctors
involved in a decision, find all blood-test
results involved in a donation decision,
and find all data that led to a decision). Such functionality can be made available not only to the medical profession but also to regulators and families.
Here, we limit ourselves to a simplified subset of
the OTM workflow—the process leading to the decision of whether or not to donate an organ. As a hospitalized patient’s health declines and in anticipation
of a potential organ donation, an attending doctor
requests the full health record for the patient and
sends a blood sample for analysis. Through a context-sensitive menu-driven user interface (UI), the attending doctor submits the requests that are then passed to
a software component (the donor data collector)
responsible for collecting all expected results. If brain
death is observed and logged into the system and if all
requested data and analysis results are obtained, the
system asks the doctor to decide about the donation
of an organ. The decision, or the outcome of the doctor’s medical judgment based on the collected data, is
explained in a report submitted by the doctor as the
decision’s justification.
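The condition under which the system asks for a decision can be written down as a small check. This is a minimal sketch of the rule as described above, not the donor data collector's actual code; the function name and the string labels for requested items are hypothetical.

```python
def ready_for_decision(brain_death_logged, requested, received):
    """Return True when the doctor can be asked to decide.

    Both conditions from the scenario must hold: brain death has been
    observed and logged, and every requested item (health record,
    analysis results) has been obtained.
    """
    return bool(brain_death_logged) and set(requested) <= set(received)


# Hypothetical labels for the two requests made by the attending doctor.
requested = ["health_record", "blood_analysis"]

waiting = ready_for_decision(True, requested, ["health_record"])
not_logged = ready_for_decision(False, requested, requested)
ready = ready_for_decision(True, requested, requested)
```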
Figure 3 (top) outlines the components involved in
this scenario and their interactions. The UI sends
requests (I1, I2, I3) to the donor data collector service,
which gets data from the patient records database (I4,
I5), along with analysis results from the laboratory
(I6, I7), and finally requests a decision (I8, I9).
To make OTM provenance-aware, designers are
augmenting OTM with the ability to produce an
explicit representation of the process taking place,
including p-assertions for all interactions (I1–I9), relationship p-assertions capturing dependencies between
data items, and state p-assertions. Figure 3 (bottom)
outlines the DAG representing a donation decision’s
provenance, which consists of relationship p-assertions produced by provenance-aware OTM. DAG
nodes denote data items, whereas DAG edges (in
blue) represent relationships (such as data dependencies, like “is based on” and “is justified by,” and causal
relationships, like “in response to” and “is caused by”).
Each data item is annotated by the interaction in
which it occurs. Further, the UI asserts a service-state
p-assertion for each of its interactions about the users
logged into the system.
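One way to picture the three kinds of p-assertions is as simple records. The sketch below is our illustration only and does not reproduce the open specification's data model: the class and field names are assumptions, as is the assignment of data items to particular interactions (the text gives only the interaction labels I1 to I9).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    """Documents one message exchanged between two components."""
    interaction: str   # for example "I1"
    sender: str
    receiver: str
    item: str          # data item carried by the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    """Documents that one data item depends on another."""
    effect: str
    effect_interaction: str   # interaction in which the effect occurs
    relationship: str         # for example "is based on"
    cause: str
    cause_interaction: str    # interaction in which the cause occurs


@dataclass(frozen=True)
class ServiceStatePAssertion:
    """Documents a component's state during an interaction."""
    asserter: str
    interaction: str
    logged_in_users: Tuple[str, ...]


def build_dag(relationships):
    """Map each (item, interaction) node to its outgoing edges."""
    dag = {}
    for r in relationships:
        node = (r.effect, r.effect_interaction)
        edge = (r.relationship, (r.cause, r.cause_interaction))
        dag.setdefault(node, []).append(edge)
    return dag


# Hypothetical assertions; the item-to-interaction mapping is assumed.
relationships = [
    RelationshipPAssertion("donation_decision", "I9", "is based on",
                           "blood_test_result", "I7"),
    RelationshipPAssertion("donation_decision", "I9", "is based on",
                           "health_record", "I5"),
]
state = ServiceStatePAssertion("UI", "I1", ("X",))
dag = build_dag(relationships)
```

Keeping the interaction label on each node mirrors the annotation of data items in Figure 3 and lets a query join relationship p-assertions with the state asserted for the same interaction.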
Authorized users can then issue provenance queries
that navigate the provenance graph, pruning it
according to the querier’s needs; for example, from the
graph, we can derive that users X and Y both contributed to causing a donation decision. Figure 3
includes only a limited number of components, but in
real-life examples involving vast amounts of documentation, users—doctors, patients, or regulatory
authorities—benefit from a powerful and accurate
provenance-query facility.
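A query such as "find all users involved in a decision" can be sketched by walking the relationship p-assertions backwards and looking up the state asserted for each interaction encountered. The data below is hypothetical and chosen only to reproduce the "users X and Y" outcome; the item names, the item-to-interaction mapping, and the function users_involved are assumptions for this example.

```python
# Relationship p-assertions as (effect item, relationship, cause item).
RELATIONSHIPS = [
    ("donation_decision", "is based on", "health_record"),
    ("donation_decision", "is based on", "blood_test_result"),
    ("health_record", "in response to", "record_request"),
    ("blood_test_result", "in response to", "analysis_request"),
]

# Interaction in which each data item occurs (assumed mapping).
ITEM_INTERACTION = {
    "record_request": "I1",
    "analysis_request": "I2",
    "health_record": "I5",
    "blood_test_result": "I7",
    "donation_decision": "I9",
}

# Service-state p-assertions by the UI: user logged in per interaction.
LOGGED_IN = {"I1": "X", "I2": "Y"}


def users_involved(item):
    """Collect users logged in during any interaction that led to `item`."""
    causes_of = {}
    for effect, _relationship, cause in RELATIONSHIPS:
        causes_of.setdefault(effect, []).append(cause)

    users, seen, stack = set(), set(), [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        user = LOGGED_IN.get(ITEM_INTERACTION.get(current))
        if user is not None:
            users.add(user)
        stack.extend(causes_of.get(current, []))
    return users


decision_users = users_involved("donation_decision")
record_users = users_involved("health_record")
```

Starting the same traversal from an intermediate item, such as the health record, narrows the answer to the users behind that item alone.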
EXISTING SYSTEMS
The approach we’ve explored here is derived from an
extensive requirements analysis [8] that resulted in a
complete architectural specification [7] used as the
basis for writing an open specification of data models and interfaces. The open approach allows the
documentation of complex distributed applications,
possibly involving multiple technologies (such as
Web services, command-line executables, and
monolithic executables). It also allows the expression
of complex provenance queries to identify data and
scoping processes independent of the technologies
being used.
The Virtual Data System [4] and myGrid [10] are
execution environments for scientific workflows that
provide support for provenance. They focus on producing documentation from a workflow enactor’s
viewpoint using data models compatible with p-assertions. They assume their respective workflow lan-