A new direction is taken by a recent
project at IBM Almaden23 that builds
a database system where Monte Carlo
simulations are pushed deep inside
the engine, thus able to evaluate any
query safe or unsafe. What is particularly promising about this approach
is that through clever query optimization techniques, such as tuple bundles,
the cost of sampling operations can be
drastically reduced. A complementary
approach, explored by Olteanu et al., 29
is to rewrite queries into ordered binary
decision diagrams (OBDD). They have
observed that safe plans lead to linear-sized OBDDs. This raises the possibility that other tractable cases of OBDDs
can be inferred, perhaps by analyzing
both the query expression and the database statistics.
Materialized Views. The use of materialized views to answer queries is a very
powerful tool in data management. 21
In its most simple formulation, there
are a number of materialized views, for
example, answers to previous queries,
and the query is rewritten in terms of
these views, to improve performance.
In the case of probabilistic databases, materialized views have been
studied in Ré and Suciu. 33 Because of
the dichotomy of the query complexity, materialized views can have a dramatic impact on query evaluation: a
query may be unsafe, hence #P-hard,
but after rewriting it in terms of views
it may become a safe query, and thus is
in PTIME. There is no magic here, we
don’t avoid the #P-hard problem, we
simply take advantage of the fact that
the main cost has already been paid
when the view was materialized.
The major challenge in using materialized views over probabilistic data
is that we need to represent the view’s
output. We can always compute the lineage of all the tuples in the view, and
this provides a complete representation of the view, but it also defeats our
purpose, since using these lineage expressions during query evaluation does
not simplify the probabilistic inference
problem. Instead, we would like to use
only the marginal tuple probabilities
that have been computed for the materialized view, not their lineage. For
example, it may happen that all tuples
are independent probabilistic events,
and in this case we only need the
marginal probabilities; we say in this
case the view is fully representable. In
general, not all tuples in the view are
independent, but it is always possible
to partition the tuples into blocks such
that tuples from different blocks are
independent, and, moreover, there exists a “best” such partition33; within
a block, the correlations between the
tuples remain unspecified. The blocks
are described at the schema level, by a
specific set of attributes: grouping by
those attributes gives the blocks. This
is called a partial representation, and
can be used to evaluate some queries
over the views. Note that the problem
of finding a good partial representation of the view is done by a static
analysis that is orthogonal to the analysis whether the view is safe or unsafe:
there are examples for all four combinations of safe/unsafe representable/
unrepresentable views.
facet 3: user interface
The semantics of query Q on a probabilistic database with possible worlds W
is, in theory, quite simple, and is given
by the image probability space over the
set of possible answers, {Q(I) | I ∈ W}.
In practice, it is impossible, and perhaps useless, to return all possible sets
of answers. An important problem in
probabilistic databases is how to best
figure 3: An sQL query on the data in figure 1(a) returning the affiliations of all researchers who performed some service for VLDB. The
query follows the syntax of MayBMs, 2 where confidence( ) is an aggregate operator returning the output probability. The figure shows an
unsafe plan in (b) and a safe plan in (c), and also traces the computation of the output probability of Y! Research: it assumes there is a single
researcher Fred with that affiliation, and that Fred performed two services for VLDB. The safe plan rewritten in sQL is shown in (d): the aggregate function prod is not supported by most relational engines, and needs to be rewritten in terms of sum, logarithms, and exponentiation.
SELECT x.Affiliation, confidence( )
FROM Researchers x, Services y
WHERE x.Name = y.Name
and y.Conference = ’VLDB’
GROUP BY x.Affiliation
(a)
Y! Research 1 − ( 1 − p q )( 1 − p q )
33
11 12
Affiliation
Fred Y! Research VLDB Session Chair pq
3
1
1
Fred Y! Research VLDB PC Member p q
3
1
2
Researchers
Name
Conference
Services
Fred Y! Research p
3
1
Fred VLDB
Fred VLDB
Session Chair q
1
PC Member q
2
(b)
SELECT x.Affiliation, 1-prod(1-x.P y.P)
*
FROM Researchers x,(SELECT Name, 1-(1-prod(P))
FROM Services
WHERE Conference = ’VLDB’
GROUP BY Name) y
WHERE x.Name = y.Name
GROUP BY x.Affiliation
(d)
Y! Research
1 − p ( 1 − q )( 1 − q )
3
11 2
Fred Y! Research 1 − p ( 1 − q )( 1 − q )
3
11 2
Affiliation
Name
Fred 1−( 1−q1)( 1−q2)
Name
Researchers
Fred Y! Research p
3
1
Conference
Services
Fred VLDB SessionChair q
1
Fred VLDB PCMember q
2
(c)