present the set of possible query answers to the user. To date, two practical approaches have been considered:
ranking tuples and aggregation over
imprecise values.
Ranking and Top-k Query Answering.
In this approach the system returns all
possible answer tuples and their prob-
abilities: p(t ∈ Q ), p( t ∈ Q), . . . noted
12
previously: the correlations between
the tuples are thus lost. The emphasis
in this approach is to rank these tuples,
and restrict them to the top k.
One way to rank tuples is in decreas-
ing order of their output probabili-
ties: 31 p(t ∈ Q) ≥ p(t ∈ Q) ≥ . . . Often,
12
however, there may be a user-specified
order criteria, and then the system
needs to combine the user’s ranking
scores with the output probability. 37 A
separate question is whether we can
use ranking to our advantage to speed
up query performance by returning
only the k highest ranked tuples: this
problem is called top-k query answer-
ing. One can go a step further and drop
the output probabilities altogether: Ré
et al. 31 argue that ranking the output
tuples is the only meaningful seman-
tics for the user, and propose focus-
ing the query processor on computing
the ranking, instead of the output
probabilities.
The power of top-k query answering
in speeding up query processing has
been illustrated in a seminal paper by
Fagin et al. 16 When applied to probabilistic databases that principle leads to
a technique called multisimulation. 31
It assumes that a tuple’s probability p(t
∈ Q) is approximated by an iterative algorithm, like a Monte Carlo simulation: after some number steps n, the
probability p(t ∈ Q) is known to be, with
high probability, in an interval (p − e ,
n
p + e ), where e decreases with n. The
nn
idea in the multisimulation algorithm
is to carefully control how to allocate
the simulation steps among all candidate tuples in the query’s answer, in order to identify the top-k tuples without
wasting iterations on the other tuples.
Multisimulation reduces the computation effort roughly by a factor of N/k,
where N is the number of all possible
answers, and k is the number of top tuples returned to the user.
Aggregates over Imprecise Data. In
SQL, aggregates come in two forms:
value aggregates, as in for each company return the sum of the profits in all its
units, and predicate aggregates, as in
return those companies having the sum
of profits greater than 1M. Both types
of aggregates are needed in probabilistic databases. The first type is interpreted as expected value, and most
aggregate functions can be computed
easily using the linearity of expectation. For instance, the complexities of
computing sum and count aggregates
over a column are the same as the complexities of answering the same query
without the aggregate, such as where
all possible values of the column are
returned along with their probabilities. Complexities of computing min
11
and max are the same as those of computing the underlying queries with the
aggregates replaced by projections
removing the columns. 11 One aggregate whose expected value is more difficult to compute is average, which
is an important aggregate function for
OLAP over imprecise data. One can
compute the expected values of sum
and count( ), but the expected value
*
of average is not their ratio. A surprising result was shown by Jayram et
al. 24 who proved that average can be
computed efficiently. They give an exact algorithm to compute average on
a single table in time O(n log2 n). They
figure 4: The dichotomy of conjunctive queries without self-joins on tuple-independent probabilistic
databases is captured by Hierarchical Queries.
Hierarchical Queries
In the case of tuple-independent databases (where all tuples are independent) safe queries are precisely the
hierarchical queries; we define hierarchical queries here.
A conjunctive query is:
q(z¯) : − body
where body consists of a set of subgoals g ,g ,..., g , and ¯ z are called the head variables. Denote Vars(g ) the set of
12 k i
variables occurring in gi and Vars (q) = �i = 1, k Vars(gi). For each x � Vars(q) denote sg(x) = {gi | x � Vars(gi)}.
DEFINITION 2. 1. Let q be a conjunctive query and ¯ z its head variables. q is called hierarchical if for all x, y �
Vars(q) − ¯ z, one of the following holds: (a) sg(x) � sg(y), or (b) sg(x) � sg(y), or (c) sg(x) � sg(y) = .
A conjunctive query is without self-joins if any two distinct subgoals refer to distinct relation symbols.
THEOREM 2. 2 (DICHOTOMY). (Dalvi and Suciu11, 12) Let q be a conjunctive query without self-joins. ( 1) If q is
hierarchical then its data complexity over tuple-independent databases is in PTIME. ( 2) If q is not hierarchical
then its data complexity over tuple-independent databases is #P-hard.
To illustrate the theorem, consider the two queries:
q(z) : − R(x, z), S(x, y), T(x, z)
1
q2(z) : − R(x, z), S(x, y), T(y, z)
In q we have sg(x) = {R, S, T}, sg( y) = {S}; hence it is hierarchical and can be evaluated in PTIME.
1
In q we have sg(x) = {R, S}, sg( y) = {S, T }; hence it is nonhierarchical and is #P-hard.
2