figure 2: An example of a c-table.
Location
U. Washington
U. Wisconsin
Y! Research
(X = 1) ( Y = 1) (X = 1) ( Y = 1) (X = 2) ( Y = 1)
>>>
<<<
>>>
<<
>>
1 11 234
(X = 2) ( Y = 1) (X = 2) ( Y = 1) (X = 1) (X = 1)
1 11 234
(X = 3) ( Y = 1) (X = 3) ( Y = 1)
1 11 2
introduced by Trio4 and consists of
maybe-tuples, which may or may not be
in the database, and x-tuples, which are
sets of mutually exclusive tuples.
Several applications require a richer
representation formalism, one that
can express complex correlations between tuples, and several such formalisms have been described in the literature: lineage-based, 4, 18 U-relations, 2
or the closure of BID tables under
conjunctive queries. 12 Others are the
Probabilistic Relational Model of
Friedman et al. 17 that separates the
data from the probabilistic network,
and Markov Chains. 25, 34 Expressive formalisms, however, are often difficult to
understand by users, and increase the
complexity of query evaluation, which
lead researchers to search for simpler, workable models for probabilistic
data. 4
All representation formalisms are,
at their core, an instance of database
normalization: they decompose a
probabilistic database with correlated
tuples into several BID tables. This is
similar to the factor decomposition
in graphical models, 9 and also similar
to database normalization based on
multivalued dependencies. 39 A first
question is how to design the normal
representation given a probabilistic database. This requires a combination of
techniques from graphical models and
database normalization, but, while the
connection between these two theories
was described by Verma and Pearl39
in the early 1990s, to date there exists
no comprehensive theory of normalization for probabilistic databases. A
second question is how to recover the
complex probabilistic database from
its normalized representation as BID
tables. This can be done through SQL
views12 or through lineage.
Lineage. The lineage of a tuple is an
annotation that defines its derivation.
Lineage is used both to represent probabilistic data, and to represent query
results. The Trio system4 recognized
the importance of lineage in managing
Lineage is a powerful tool in
ProbDMS because of the following
important property: the answer to a
query over a c-table can always be represented as another c-table, using the
same hidden variables. In other words,
it is always possible to compute the lineage of the output tuples from those
of the input tuples. This is called a
closure property and was first shown by
Imielinski and Lipski. 22 We illustrate
this property on the database in Figure
1, where each tuple has a very simple
lineage. Consider now the SQL query in
Figure 3(a), which finds the affiliations
of all people who performed some
service for the VLDB conference. The
answer to this query is precisely the c-table in Figure 2.
data with uncertainty, and called itself
a ULDB, for uncertainty-lineage database. In Trio, when new data is produced by queries over uncertain data,
the lineage is computed automatically
and captures all correlations needed
for computing subsequent queries
over the derived data.
Lineage also provides a powerful
mechanism for understanding and re-
solving uncertainty. With lineage, user
feedback on correctness of results can
be traced back to the sources of the rel- facet 2: Query evaluation
evantdata, allowing unreliablesources Query evaluation is the most difficult
to be identified. Users can provide technical challenge in a ProbDMS.
much detailed feedback if data lineage One approach is to separate the que-
is made visible to them. For example, ry and lineage evaluation from the
in information extraction applications probabilistic inference on the lineage
where data items are generated by pipe- expression. Various algorithms have
lines of AI operators, users can indicate been used for the latter, such as Monte
if a data item is correct, as well as look Carlo approximation algorithms. 11, 31
at the lineage of data items to locate Recently, a much more general Monte
the exact operator making the error. Carlo framework has been proposed by
Thenotionoflineageisderivedfrom Jampani et al. 23 Variable Elimination9
a landmark paper by Imielinski and was used by Sen and Deshpande. 36
Lipski22 from 1984, who introduced Another approach is to integrate the
c-tables as a formalism for represent- probabilistic inference with the query
ing incomplete databases. We describe computation step. With this approach,
c-tables and lineage by using the exam- one can leverage standard data man-
ple in Figure 2. In a c-table, each tuple agement techniques to speed up the
is annotated with a Boolean expression probabilistic inference, such as static
over some hidden variables; today, analysis on the query or using material-
we call that expression lineage. In our ized views. This has led to safe queries
example there are three tuples, U. of and to partial representations of mate-
Washington, U. of Wisconsin, rialized views.
and Y! Research, each annotated Safety. Two of the authors showed
with a lineage expression over vari- that certain SQL queries can be evalu-
ables X , X , Y , Y , Y . The semantics of ated on a probabilistic database by
13124
a c-table is a set of possible worlds. An pushing the probabilistic inference
assignment of the variables defines the completely inside the query plan. 11
world consisting of precisely those tu- Thus, for these queries there is no need
ples whose lineage is true under that for a separate probabilistic inference
assignment, and the c-table “means” step: the output probabilities are com-
the set of possible worlds defined by all puted inside the database engine, dur-
possible assignments. For an illustra- ing normal query processing. The per-
tion, in our example any assignment formance improvements can be large,
containing X = 3, Y = 1, X = 2, Y = 1 for example, Ré et al. 31 observed two or-
1234
(and any values for the other variables) ders of magnitude improvements over
defines the world {Y! Research, U. a Monte Carlo simulation. Queries for
of Washington}, while any assign- which this is possible are called safe
ment with Y = Y = Y = 0 defines the queries, and the relational plan that
123
emptyworld. computes the output probabilities