review articles

Doi: 10.1145/1538788.1538810

Treasures abound from hidden facts found
in imprecise data sets.

BY niLesh DALVi, ChRis ToPheR Ré, AnD DAn suCiu
Probabilistic
Databases:
Diamonds in
the Dirt
a wide ranGe
of applications have recently emerged
that require managing large, imprecise data sets.

The reasons for imprecision in data are as diverse as the applications themselves: in sensor and RFID data, imprecision is due to measurement errors; 15, 34 in information extraction, imprecision comes from the inherent ambiguity in natural-language text; 20, 26 and in business intelligence, imprecision is tolerated because of the high cost of data cleaning. 5 In some applications, such as privacy, it is a requirement that the data be less precise. For example, imprecision is purposely inserted to hide sensitive attributes of individuals so that the data may be published. 30 Imprecise data has no place in traditional, precise database applications like payroll and inventory, and so, current database management systems are not prepared to deal with it. In contrast, the newly emerging applications offer value precisely because

they query, search, and aggregate large volumes of imprecise data to find the “diamonds in the dirt.” This wide variety of new applications points to the need for generic tools to manage imprecise data. In this article, we survey the state of the art of techniques that handle imprecise data by modeling it as probabilistic data. 2–4, 7, 12, 15, 23, 27, 36

A probabilistic database management system, or ProbDMS, is a system that stores large volumes of probabilistic data and supports complex queries. A ProbDMS may also need to perform some additional tasks, such as updates or recovery, but these do not differ from those in conventional database management systems and will not be discussed here. The major challenge in a ProbDMS is that it needs both to scale to large data volumes, a core competence of database management systems, and to do probabilistic inference, which is a problem studied in AI. While many scalable data management systems exist, probabilistic inference is a hard problem, 35 and current systems do not scale to the same extent as data management systems do. To address this challenge, researchers have focused on the specific nature of relational probabilistic data, and exploited the special form of probabilistic inference that occurs during query evaluation. A number of such results have emerged recently: lineage-based representations, 4 safe plans, 11 algorithms for top-k queries, 31, 37 and representations of views over probabilistic data. 33 What is common to all these results is they apply and extend well-known concepts that are fundamental to data management, such as the separation of query and data when analyzing complexity, 38 incomplete databases, 22 the threshold algorithm, 16 and the use of materialized views to answer queries. 21 In this article, we briefly survey the key concepts in probabilistic database systems and explain the intellectual roots of these concepts in data management.

References:

Archives