correctly is called a safe plan. To understand the context of this result we
review a fundamental principle in relational query processing: the separation
between what and how.
In a relational query the user specifies what she wants: relational query
languages like SQL are declarative.
The system translates the query into
relational algebra, using operators
like join , selection s, projection-with-duplicate-elimination Π. The resulting expression is called a relational
plan and represents how the query will
be evaluated. The separation between
what and how was first articulated by
Codd when he introduced the relational data model, 8 and is at the core
of any modern relational database system. A safe plan allows probabilities to
be computed in the relational algebra,
by extending its operators to manipulate probabilities. 12 There are multiple
ways to extend them, the simplest is to
assume all tuples to be independent:
a join that combines two tuples computes the new probability as p p , and
12
a duplicate elimination that replaces
n tuples with one tuple computes the
output probability as 1 − ( 1 − p )...
1
( 1 − p ). A safe plan is by definition a
n
plan in which all these operations
are provably correct. The correctness proof (or safety property) needs
to be done by the query optimizer,
through a static analysis on the plan.
Importantly, safety does not depend
on the actual instance of the database,
instead, once a plan has been proven
to be safe, it can be executed on any database instance.
We illustrate with the query in Figure
3(a). Any modern relational database
engine will translate it into the logical plan shown in (b). However, this
plan is not safe, because the operation
Π (projection-with-duplicate-
Affiliation
elimination) combines tuples that are
not independent, and therefore the out-
put probabilities are incorrect. The fig-
ure illustrates this for the output value Y!
Research, by tracing its computation
through the query plan: the output prob-
ability is 1 − ( 1 − p 1q )( 1 − p 1q ). However,
31 32
the lineage of Y! Research is (X = 3 ∧ Y
11
= 1) ∧ (X = 3 ∧ Y = 1), hence the correct
12
probability is p 1( 1 − ( 1 − q )( 1 − q )).
3 12
Alternatively, consider the plan
shown in (c). This plan performs an early projection and duplicate elimination
on Services. It is logically equivalent
to the plan in (b), i.e., the extra duplicate elimination is harmless. However,
the new plan computes the output
probability correctly: the figure illustrates this for the same output value,
Y! Research. Note that although
plans (b) and (c) are logically equivalent over conventional databases, they
are no longer equivalent in their treatment of probabilities: one is safe, the
other not.
Safety is a central concept in query
processing on probabilistic databases.
A query optimizer needs to search not
just for a plan with lowest cost, but for
one that is safe as well, and this may
affect the search strategy and the outcome: in a conventional database there
is no reason to favor the plan in (c) over
that in (b) (and in fact modern optimizers will not choose plan (c) because the
extra duplication elimination increases the cost), but in a probabilistic database plan (c) is safe while (b) is unsafe.
A safe plan can be executed directly
by a database engine with only small
changes to the implementation of its
relational operators. Alternatively, a
safe plan can be executed by expressing it in regular SQL and executing it
on a conventional database engine,
without any changes: Figure 3(d) illustrates how the safe plan can be converted back into SQL.
Safe plans have been described for
databases with independent tuples, 11
for BID databases, 12 for queries whose
predicates have aggregate operators, 32
and for Markov Chain databases. 34
While conceptually a safe plan ties the
probabilistic inference to the query
plan, Olteanu et al. 29 have shown that it
is possible to separate them at runtime:
the optimizer is free to choose any query plan (not necessarily safe), then the
probabilistic inference is guided by the
information collected from the safe
plan. This results in significant execution speed-up for typical SQL queries.
Dichotomy of Query Evaluation.
Unfortunately, not all queries admit
safe plans. In general, query evaluation
on a probabilistic database is no easier
than general probabilistic inference.
The latter is known to be a hard problem. 35 In databases, however, one can
approach the query evaluation problem differently, in a way that is best
explained by recalling an important
distinction made by Vardi in a landmark paper in 1982.38 He proposed that
the query expression (which is small)
and the database (which is large) be
treated as two different inputs to the
query evaluation problem, leading to
three different complexity measures:
the data complexity (when the query
is fixed), the expression complexity
(when the database is fixed), and the
combined complexity (when both
are part of the input). For example, in
conventional databases, all queries
have data complexity in PTIME, while
the combined complexity is PSPACE
complete.
We apply the same distinction to
query evaluation on probabilistic databases. Here the data complexity offers
a more striking picture: some queries
are in PTIME (for example, all safe queries), while others have #P-hard data
complexity. In fact, for certain query
languages or under certain assumptions it is possible to prove a complete
dichotomy, that is, each query belongs
to one of these two classes. 10, 12, 32, 34
Figure 4 describes the simplest dichotomy theorem, for conjunctive queries
without self-joins over databases with
independent tuples, first proven by
Dalvi and Suciu. 11 Safe queries are by
definition in the first class; under the
dichotomy property, any unsafe query
has #P-hard data complexity. For unsafe queries we really have no choice
but to resort to a probabilistic inference algorithm that solves, or approximates ,a #P-hard problem. The abrupt
change in complexity from PTIME to
#P-hard is unique to probabilistic databases, and it means that query optimizers need to make special efforts to
identify and use safe queries.
An active line of research develops
query evaluation techniques that soften the transition from safe to unsafe
queries. One approach extends the
reach of safe plans: for example, safe
subplans can be used to speed up processing unsafe queries, 33 functional
dependencies on the data, or knowing
that some relations are deterministic
can be used to find more safe plans, 11,
29
and safe plans have been described
for query languages for streams of
events. 34
Another approach is to optimize
the general-purpose probabilistic inference on the lineage expressions. 36