tioned Fred in one Web page refer to
the same entity as Fred in another Web
page?) The example in Figure 1 presents a very simplified view of a general
principle: uncertain data is annotated
with a confidence score, which is interpreted as a probability. Here, we use
“probabilistic data” and “uncertain
data” as synonyms.
facets of a PROBDMs
There are three important, related
facets of any ProbDMS: How do we
store (or represent) a probabilistic database? How do we answer queries using our chosen representation? How
do we present the result of queries to
the user?
There is a tension between the power of a representation system, that is,
as the system more faithfully models
correlations, it becomes increasingly
difficult to scale the system. A simple
representation where each tuple is an
independent probabilistic event is easier to process, but it cannot faithfully
model the correlations important to all
applications. In contrast, a more complicated representation, for example,
a large Markov Network, 9 can capture
the semantics of the data very faithfully, but it may be impossible to compute
even simple SQL queries using this representation. An extra challenge is to ensure the representation system maps
smoothly to relational data, so that the
nonprobabilistic part of the data can
be processed by a conventional database system.
A ProbDMS must support complex,
decision-support style SQL, with ag-
gregates. While some applications can
benefit from point queries, the real
value comes from queries that search
many tuples, or aggregate over many
data values. For example, the answer
to find the affiliation of PC Chair of SIG-
MOD’2008 is inherently imprecise (and
can be answered more effectively by
consulting the SIGMOD’2008 home
page), but a query like find all institutions (affiliations) with more than 20
SIGMOD and VLDB PC Members
returns more interesting answers. There
are two logical steps in computing an
SQL query on probabilistic data: first,
fetch and transform the data, and second, perform probabilistic inference.
A straightforward but naïve approach
is to separate the two steps: use a database engine for the first step and a
general-purpose probabilistic inference technique for the second. But
9, 13
on large data the probabilistic inference quickly dominates the total running time. A better approach is to integrate the two steps, which allows us to
leverage some database-specific techniques, such as query optimization,
using materialized views, and schema
information, to speed up the probabilistic inference.
Designing a good user interface
raises new challenges. The answer to
an SQL query is a set of tuples, and it is
critical to find some way to rank these
tuples because there are typically lots
of false positives when the input data
is imprecise. Alternatively, aggregation
queries can extract value from imprecise data, because errors tend to cancel
each other out (the Law of the Large
Numbers). A separate and difficult task
is how to indicate to the user the correlations between the output tuples. For
example, the two highest ranked tuples
may be mutually exclusive, but they
could also be positively correlated. As a
result, their ranks alone convey insufficient information to the user. Finally,
a major challenge of this facet is how
to obtain feedback from the users and
how to employ this feedback to “clean”
the underlying database. This is a hard
problem, that to date has not yet been
solved.
Probabilistic databases have found
usage in a wide class of applications.
Sensor data is obtained from battery-powered sensors that acquire temperature, pressure, or humidity readings
from the surrounding environment.
The BBQ system15 showed that a probabilistic data model could represent
well this kind of sensor data. A key
insight was the probabilistic model
could answer many queries with sufficient confidence without needing to
acquire additional readings. This is an
important optimization since acquiring fewer sensor readings allows longer
battery life, and so more longer lasting
sensor deployments. Information Extraction is a process that extracts data
items of a given type from large corpora of text. 26 The extraction is always
noisy, and the system often produces
several alternatives. Gupta and Sarawagi20 have argued that such data is best
stored and processed by a probabilistic
database. In Data Cleaning, deduplication is one of the key components and
is also a noisy and imperfect process.
figure 1: example of a probabilistic database. This is a block-independent-disjoint database: the eight tuples in Researchers are
grouped in four groups of disjoint events, for example, t1, t 2, t 3 are disjoint, and so are, t 1, t 2, while tuples from different blocks
111 44
are independent, for example, t 2, t 2, t 1 are independent; the five tuples in Services are independent probabilistic events. This
124
database can be represented as a c-table using the hidden variables X , X , X , X for Researchers and Y , Y , Y , Y , Y for Services.
1234 12345
ResReeasrcehae rcrhs:ers:
NamNeame AffAiflfiialtiiaotnion
t11 1 t1 FreFdred U. WUa. s Whainsghtinognton
t 2 t 2 U. WUi.s Wcoisncsoinnsin
11
t 3 t 3 Y! RYe!sReeasrc eharch
11
t 1 t 1 SueSue U. WUa. s Whainsghtinognton
22
t 1 t 1 JohJnohn U. WUi.s Wcoisncsoinnsin
3
3
t 2 t 2 U. WUa. s Whainsghtinognton
33
t 1 t 1 FraFnrkank Y! RYe!sReeasrc eharch
44
t 2 t 2 M. RMe.sReeasrc eharch
44
(a) (a)
P P
p1 p= 10=. 30. 3
1
1
p2 p= 20=. 20. 2
11
p3 p= 30=. 50. 5
11
p1 p= 11.=01.0
22
p1 p= 10=. 70. 7
3
3
p2 p= 20=. 30. 3
33
p1 p= 10=. 90. 9
44
p p= 20=. 1 0.1
2
4
4
SerSviecrevsic: es:
X1 =X 11= 1
X1 =X 12= 2
X1 =X 13= 3
X2 =X 21= 1
X3 =X 31= 1
X3 =X 32= 2
X4 =X 41= 1
X4 =X 42= 2
NamNeame ConCfoenrfeenrceence RolReole
S S FreFdred VLDVBLDB SesSseiosnsiConhaCirhair
1
1
S 2 S FreFdred VLDVBLDB PC PMCemMbee 2 mrber
S3 S3 JohJnohn SIGSMIGOMDOD PCPMCemMbee mrber
S 4 S JohJnohn VLDVBLDB PC PMCemMbee 4 mrber
S5 S SueSue SIGSMIGOMDOD ChaCi 5 rhair
(b) (b)
P P
q1 =q 10. = 2 0.2
q2 =q20=. 80. 8
q3 =q 30=. 70. 7
q4 =q 40. = 7 0.7
q =q 0. = 5 0.5
5
5
Y1 =Y11= 1
Y2 =Y21= 1
Y3 =Y31= 1
Y4 =Y41= 1
Y =Y1= 1
5
5