also give efficient algorithms to compute various aggregates when the data
is streaming.
The second type of aggregates, those
occurring in the HAVING clause of an
SQL query, have also been considered. 32
In this case, one needs to compute the
entire density function of the random
variable represented by the aggregate,
and this is more difficult than computing the expected value. Similar to safe
queries, the density function can sometimes be computed efficiently and exactly, but it is hard in general. Worse, in
contrast to safe queries, which can always be efficiently approximated, there
exists HAVING queries that do not admit efficient approximations.
A Little history of the
(Possible) Worlds
There is a rich literature on probabilistic
databases, and we do not aim here to be
complete; rather, as in Gombrich’s classic A Little History of the World, we aim to
“catch a glimpse.” Early extensions of
databases with probabilities date back
to Wong40 and Cavallo and Pittarelli. 6 In
an influential paper Barbara et al. 3 described a probabilistic data model that
is quite close to the BID data model,
and showed that SQL queries without
duplicate elimination or other aggregations can be evaluated efficiently. ProbView27 removed the restriction on queries, but returned confidence intervals
instead of probabilities. At about the
same time, Fuhr and Roelleke18 started
to use c-tables and lineage for probabilistic databases and showed that every
query can be computed this way.
Probabilities in databases have also
been studied in the context of “
reliability of queries,” which quantifies
the probability of a query being correct
assuming that tuples in the database
have some probability of being wrong.
Grädel et al. 19 were the first to prove
that a simple query can have data complexity that is #P-hard.
Andritsos et al. 1 have applied probabilistic databases to the problem of
consistent query answering over inconsistent databases. They observed that
the “certain tuples” 21 to a query over an
inconsistent database are precisely the
tuples with probability 1 under probabilistic semantics.
The intense interest in probabilistic databases seen today is due to a
number of influential projects: applications to sensor data, 7, 15 data cleaning, 1 and information extraction, 20 the
safe plans of Dalvi and Suciu, 11 the Trio
system4 that introduced ULDBs, and
the advanced representation systems
described in Antova et al. 2 and Sen and
Deshpande. 36
Conclusion
Many applications benefit from finding valuable facts in imprecise data,
the diamonds in the dirt, without having to clean the data first. The goal
of probabilistic databases is to make
uncertainty a first-class citizen, and
to reduce the cost of using such data,
or (more likely) to enable applications
that were otherwise prohibitively expensive. This article describes some
of the recent advances for large-scale
query processing on probabilistic databases and their roots in classical data
management concepts.
This work was partially supported by
NSF Grants IIS-0454425, IIS-0513877,
IIS-0713576, and a Gift from Microsoft.
An extended version of this work with
additional references is available at
http://www.cs.washington.edu/homes/
suciu/.
References
1. Andritsos, P. and Fuxman, A., Miller, R.J. Clean
answers over dirty databases. In ICDE (2006).
2. Antova, L., Jansen, T., koch, C. and Olteanu, D. Fast
and simple relational processing of uncertain data.
In ICDE (2008).
3. Barbara, D., Garcia-Molina, h. and Porter, D. The
management of probabilistic data. IEEE Trans.
Knowl. Data Eng. 4, 5 (1992), 487–502.
4. Benjelloun, O., Sarma, A.D., halevy, A., Theobald,
M. and Widom, J. Databases with uncertainty and
lineage. VLDBJ 17, 2 (2008), 243–264.
5. Burdick, D., Deshpande, P., Jayram, T.S.,
Ramakrishnan, R. and Vaithyanathan, S. Efficient
allocation algorithms for OLAP over imprecise data.
In VLDB (2006), 391–402.
6. Cavallo, R. and Pittarelli, M. The theory of
probabilistic databases. In Proceedings of VLDB
(1987), 71–81.
7. Cheng, R., kalashnikov, D. and Prabhakar, S.
Evaluating probabilistic queries over imprecise data.
In SIGMOD (2003), 551–562.
8. Codd, E.F. Relational completeness of data base
sublanguages. In Database Systems (1972),
Prentice-hall, 65–98.
9. Cowell, R., Dawid, P., Lauritzen, S. and Spiegelhalter,
D., eds. Probabilistic Networks and Expert Systems
(1999), Springer.
10. Dalvi, N. and Suciu, D. The dichotomy of conjunctive
queries on probabilistic structures. In PODS (2007),
293–302.
11. Dalvi, N. and Suciu, D. Efficient query evaluation
on probabilistic databases. VLDB J. 16, 4 (2007),
523–544.
12. Dalvi, N. and Suciu, D. Management of probabilistic
data: Foundations and challenges. In PODS (Beijing,
China, 2007) 1–12 (invited talk).
13. Darwiche, A. A differential approach to inference in
bayesian networks. J. ACM 50, 3 (2003), 280–305.
14. DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D.,
Doan, A. and Ramakrishnan, R. Dblife: A community
information management platform for the database
research community. In CIDR (2007), 169–172.
15. Deshpande, A., Guestrin, C., Madden, S., hellerstein,
J.M. and hong, W. Model-driven data acquisition in
sensor networks. In VLDB (2004), 588–599.
16. Fagin, R., Lotem, A. and Naor, M. Optimal aggregation
algorithms for middleware. In Proceedings of the
20th ACM SIGMOD-SIGACT-SIGAR T Symposium on
Principles of Database Systems (2001), ACM Press,
102–113.
17. Friedman, N., Getoor, L., koller, D. and Pfeffer A.
Learning probabilistic relational models. In IJCAI
(1999), 1300–1309.
18. Fuhr, N. and Roelleke, T. A probabilistic relational
algebra for the integration of information retrieval
and database systems. ACM Trans. Inf. Syst. 15, 1
(1997), 32–66.
19. Grädel, E., Gurevich, Y. and hirsch, C. The complexity
of query reliability. In PODS (1998), 227–234.
20. Gupta, R. and Sarawagi, S. Creating probabilistic
databases from information extraction models.
In VLDB (2006), 965–976.
21. halevy, A. Answering queries using views: A survey.
VLDB J. 10, 4 (2001), 270–294.
22. Imielinski, T. and Lipski, W. Incomplete information
in relational databases. J. ACM 31 (Oct. 1984),
761–791.
23. Jampani, R., Xu, F., Wu, M., Perez, L., Jermaine, C. and
haas, P. MCDB: A Monte Carlo approach to managing
uncertain data. In SIGMOD (2008), 687–700.
24. Jayram, T., kale, S. and Vee, E. Efficient aggregation
algorithms for probabilistic data. In SODA (2007).
25. kanagal, B. and Deshpande, A. Online filtering,
smoothing and probabilistic modeling of streaming
data. In ICDE (2008), 1160–1169.
26. Lafferty, J., McCallum, A. and Pereira, F. Conditional
random fields: Probabilistic models for segmenting
and labeling sequence data. In ICML (2001).
27. Lakshmanan, L., Leone, N., Ross, R. and
Subrahmanian, V. Probview: A flexible probabilistic
database system. ACM Trans. Database Syst. 22, 3
(1997).
28. Nierman, A. and Jagadish, h. Pro TDB: Probabilistic
data in XML. In VLDB (2002), 646–657.
29. Olteanu, D., huang, J. and koch, C. SPROUT: Lazy vs.
eager query plans for tuple independent probabilistic
databases. In ICDE (2009).
30. Rastogi, V., Suciu, D. and hong, S. The boundary
between privacy and utility in data publishing.
In VLDB (2007).
31. Ré, C., Dalvi, N. and Suciu, D. Efficient Top-k query
evaluation on probabilistic data. In ICDE (2007).
32. Ré, C., Suciu, D. Efficient evaluation of having queries
on a probabilistic database. In Proceedings of DBPL
(2007).
33. Ré, C. and Suciu, D. Materialized views in probabilistic
databases for information exchange and query
optimization. In Proceedings of VLDB (2007).
34. Ré, C., Letchner, J., Balazinska, M. and Suciu, D.
Event queries on correlated probabilistic streams. In
SIGMOD (Vancouver, Canada, 2008).
35. Roth, D. On the hardness of approximate reasoning.
Artif. Intell. 82, 1–2 (1996), 273–302.
36. Sen, P. and Deshpande, A. Representing and querying
correlated tuples in probabilistic databases. In ICDE,
2007.
37. Soliman, M.A., Ilyas, I.F. and Chang, k.C.-C.
Probabilistic top- and ranking-aggregate queries.
ACM Trans. Database Syst. 33, 3 (2008).
38. Vardi, M. Y. The complexity of relational query
languages. In Proceedings of 14th ACM SIGACT
Symposium on the Theory of Computing (San
Francisco, California, 1982), 137–146.
39. Verma, T. and Pearl, J. Causal networks: Semantics
and expressiveness. Uncertainty Artif. Intell. 4
(1990), 69–76.
40. Wong, E. A statistical approach to incomplete
information in database systems. ACM Trans.
Database Syst. 7, 3 (1982), 470–488.
Nilesh Dalvi ( ndalvi@yahoo-inc.com)
Yahoo!Research, Santa Clara, CA.
Christopher Ré ( chrisre@cs.washington.edu)
University of Washington, Seattle, WA.
Dan Suciu ( suciu@cs.washington.edu)
University of Washington, Seattle, WA.
© 2009 ACM 0001-0782/09/0700 $10.00