Example 4.1. Take the database instances and rules in
Figure 8 as an example: each tuple in relations R, S, and Q is
a random variable, and V contains all random variables. The
inference rules F1 and F2 ground factors with the same name in
the factor graph, as illustrated in Figure 8. Both F1 and F2 are
implemented as SQL statements in DeepDive.
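To make the grounding step concrete, the following is a minimal sketch, using sqlite3 and an invented schema loosely modeled on Figure 8 (the variable ids and tuple values are hypothetical, not DeepDive's actual representation). A rule such as F1: Q(x) :- R(x, y) can be grounded by a single SQL join whose result rows are the factors:

```python
import sqlite3

# Hypothetical schema: each tuple in R and Q is a random variable
# identified by an integer id (vid).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE R (vid INTEGER PRIMARY KEY, x TEXT, y TEXT);
CREATE TABLE Q (vid INTEGER PRIMARY KEY, x TEXT);
INSERT INTO R VALUES (0, 'a', 'b'), (1, 'a', 'c'), (2, 'd', 'e');
INSERT INTO Q VALUES (10, 'a'), (11, 'd');
""")

# Ground the rule F1: Q(x) :- R(x, y) as one SQL join.  Each result
# row becomes one factor connecting an R-variable to a Q-variable.
factors = con.execute("""
    SELECT 'F1' AS name, R.vid AS in_var, Q.vid AS out_var
    FROM R JOIN Q ON R.x = Q.x
""").fetchall()
print(sorted(factors))  # one factor per matching (R, Q) pair
```

The key point is that grounding is an ordinary relational query, which is what makes the view-maintenance machinery of the next paragraph applicable.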
Incremental grounding. Because DeepDive is based on
SQL, we are able to take advantage of decades of work on
incremental view maintenance. The input to this phase
is the same as the input to the grounding phase: a set of
SQL queries and the user schema. The output of this phase
is how the output of grounding changes, that is, a set of
modified variables ∆V and their factors ∆F. Since V and F
are simply views over the database, any view maintenance
techniques can be applied to incremental grounding.
DeepDive uses the DRed algorithm [17], which handles both
additions and deletions. Recall that in DRed, for each relation Ri in the user’s schema, we create a delta relation,
Rδi, with the same schema as Ri and an additional column
count. For each tuple t, t.count represents the number of
derivations of t in Ri. On an update, DeepDive updates delta relations in two steps. First, for tuples in Rδi, DeepDive
directly updates the corresponding counts. Second, an
SQL query called a delta rule is executed, which processes
these counts to generate modified variables ∆V and factors ∆F. We found that the overhead of DRed is modest
and the gains may be substantial, so DeepDive always runs
DRed, except on the initial load.
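The first DRed step above can be sketched in a few lines. This is a toy, dictionary-based illustration of derivation counting (not DeepDive's implementation): a tuple is a member of Rδi while its count is positive, and the tuples whose count crosses zero are exactly the modified variables ∆V.

```python
from collections import Counter

# Delta relation for R: maps each tuple to its number of derivations.
# The initial contents are invented for illustration.
R_delta = Counter({('a', 'b'): 1, ('a', 'c'): 2})

def apply_update(delta, insertions, deletions):
    """Step 1 of DRed: adjust derivation counts; return the tuples
    whose membership changed (count crossed zero)."""
    changed = set()
    for t in insertions:
        delta[t] += 1
        if delta[t] == 1:          # tuple newly derivable
            changed.add(t)
    for t in deletions:
        delta[t] -= 1
        if delta[t] == 0:          # tuple no longer derivable
            changed.add(t)
            del delta[t]
    return changed

# ('a', 'c') drops from 2 derivations to 1: still derivable, so it is
# not reported as changed; only the newly inserted tuple is.
delta_V = apply_update(R_delta, insertions=[('d', 'e')], deletions=[('a', 'c')])
print(delta_V)  # {('d', 'e')}
```

Step 2, the delta rule, is then an ordinary SQL query over these counted relations.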
4.2. Statistical inference and learning
The main task that DeepDive conducts on factor graphs
is statistical inference, that is, determining for a given
node what the marginal probability is that this node takes
the value 1, that it is a correct output tuple that should be
included in the final knowledge base. In general, computing
these marginal probabilities is #P-hard [45]. Like many other
systems, DeepDive uses Gibbs sampling [40] to estimate the
marginal probability of every tuple in the database.
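As a self-contained illustration of marginal estimation by Gibbs sampling, consider the following sketch on a tiny, invented log-linear factor graph (two binary variables; the factors and weights are hypothetical, not from any DeepDive program). Each factor adds its weight to the log-score of a world when it is satisfied, and each Gibbs update resamples one variable from its conditional distribution given the rest:

```python
import math
import random

random.seed(0)

# Hypothetical model: phi_1 rewards v0 = 1 (weight 1.5);
# phi_2 rewards v0 = v1 (weight 2.0).
W_PRIOR, W_EQ = 1.5, 2.0

def conditional_p1(i, state):
    """P(v_i = 1 | the other variable): log-odds is the score of
    v_i = 1 minus the score of v_i = 0 under the two factors."""
    other = state[1 - i]
    score = (W_PRIOR if i == 0 else 0.0) + (W_EQ if other == 1 else -W_EQ)
    return 1.0 / (1.0 + math.exp(-score))

def gibbs(n_samples, burn_in=1000):
    state, counts = [0, 0], [0, 0]
    for step in range(burn_in + n_samples):
        for i in (0, 1):            # one sweep over all variables
            state[i] = 1 if random.random() < conditional_p1(i, state) else 0
        if step >= burn_in:
            for i in (0, 1):
                counts[i] += state[i]
    return [c / n_samples for c in counts]

marginals = gibbs(20000)
print(marginals)  # both variables lean toward 1, pulled by the prior on v0
```

The exact marginals for this model (by enumerating the four worlds) are roughly 0.82 for v0 and 0.74 for v1, and the sampled estimates approach these as the number of sweeps grows.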
Efficiency and scalability. There are two components to
scaling statistical algorithms: statistical efficiency, roughly how
many steps an algorithm takes to converge, and hardware efficiency, how efficient each of those steps is. We introduced this
terminology and studied it extensively in a recent paper [51].
DimmWitted, the statistical inference and learning engine
in DeepDive [51], is built upon our research on how to design a
high-performance statistical inference and learning engine
on a single machine [27, 32, 50, 51]. DimmWitted models Gibbs
sampling as a “column-to-row access” operation: each row
corresponds to one factor, each column to one variable, and
the nonzero elements in the data matrix correspond to edges
in the factor graph. To process one variable, DimmWitted
fetches one column of the matrix to get the set of factors, and
other columns to get the set of variables that connect to the
same factor. On standard benchmarks, DimmWitted was
3.7× faster than GraphLab’s implementation without any
application-specific optimization. Compared with traditional
work, the main novelty of DimmWitted is that it considers
both hardware efficiency and statistical efficiency for executing an inference and learning task.
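The column-to-row access pattern can be sketched as follows; the data layout here is a minimal, hypothetical stand-in for DimmWitted's sparse matrix (one row per factor, one column per variable, nonzeros as edges). Resampling a variable requires its column (the factors it touches) and then each of those factors' rows (the co-occurring variables):

```python
# Column view of the sparse matrix: variable -> factors it appears in.
factors_of_var = {0: [0, 1], 1: [0], 2: [1]}
# Row view: factor -> variables it connects.
vars_of_factor = {0: [0, 1], 1: [0, 2]}

def markov_blanket(v):
    """Variables needed to resample v: fetch v's column to get its
    factors, then each factor's row to get the connected variables."""
    neighbors = set()
    for f in factors_of_var[v]:              # column access
        neighbors.update(vars_of_factor[f])  # row accesses
    neighbors.discard(v)
    return neighbors

print(sorted(markov_blanket(0)))  # → [1, 2]
```

Framing the sampler this way turns each Gibbs update into a predictable sparse-matrix access pattern, which is what the hardware-efficiency optimizations below exploit.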
• Hardware efficiency. DeepDive takes into consideration the architecture of modern nonuniform memory
access (NUMA) machines. A NUMA machine usually
contains multiple nodes (sockets), where each socket
contains multiple CPU cores. To achieve higher hardware efficiency, one wants to decrease the communication across different NUMA nodes.
• Statistical efficiency. Pushing hardware efficiency to the
extreme might decrease statistical efficiency, because
the lack of communication between nodes might
decrease the rate of convergence of a statistical inference and learning algorithm. DeepDive takes advantage
of the theoretical results of model averaging [53] and our
own results about lock-free execution [27, 32].
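The model-averaging idea can be illustrated with a deliberately simple sketch (the objective, data shards, and learning rate are invented for illustration): each worker, standing in for a NUMA node, runs an epoch of SGD on its own model replica with no cross-node communication, and the replicas are averaged only between epochs:

```python
# Hedged sketch of epoch-level model averaging across workers.

def sgd_epoch(w, data, lr=0.1):
    """One SGD epoch minimizing sum of (w - x)^2 over this worker's shard."""
    for x in data:
        w -= lr * 2 * (w - x)
    return w

shards = [[1.0, 1.2, 0.8], [2.0, 1.8, 2.2]]   # per-worker data shards
w = 0.0
for epoch in range(50):
    # Each replica trains independently during the epoch (no cross-talk),
    # trading some statistical efficiency for hardware efficiency.
    replicas = [sgd_epoch(w, shard) for shard in shards]
    w = sum(replicas) / len(replicas)          # average between epochs

print(round(w, 2))  # converges to the global optimum, the mean 1.5
```

Each replica alone would drift toward its own shard's mean; periodic averaging recovers the global solution while keeping communication infrequent.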
On the whole corpus of Paleobiology, the factor graph con-
tains more than 0.2 billion random variables and 0.3 billion
factors. On this factor graph, DeepDive is able to run Gibbs
sampling on a machine with four sockets (10 cores per
socket), and we find that we can generate 1000 samples for
all 0.2 billion random variables in 28 min. This is more than
4× faster than a non-NUMA-aware implementation.
Incremental inference. Due to our choice of incremen-
tal grounding, the input to DeepDive’s inference phase is a
factor graph along with a set of changed variables and fac-
tors. The goal is to recompute the marginal probabilities output by the system. Our approach is to frame the incre-
mental maintenance problem as approximate inference.
Previous work in the database community has looked at
how machine learning data products change in response
to both new labels [24] and new data [7, 8]. In KBC, both the pro-
gram and data change on each iteration. Our proposed ap-
proach can cope with both types of change simultaneously.
The technical question is which approximate inference
algorithms to use in KBC applications. We choose to study
two popular classes of approximate inference techniques:
sampling-based materialization (inspired by sampling-
based probabilistic databases such as MCDB [20]) and vari-
ational-based materialization (inspired by techniques for
approximating graphical models [44]). Applying these tech-
niques to incremental maintenance for KBC is novel, and
it is not theoretically clear how the techniques compare.
Thus, we conducted an experimental evaluation of these
two approaches on a diverse set of DeepDive programs.
We found these two approaches to be sensitive to changes
along three largely orthogonal axes: the size of the factor
graph, the sparsity of correlations, and the anticipated
number of future changes. Performance varies by up
to two orders of magnitude at different points of this space.
Our study of the tradeoff space highlights that neither
materialization strategy dominates the other. To automati-
cally choose the materialization strategy, we developed a
simple rule-based optimizer [42].
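The shape of such an optimizer can be sketched as a few rules over the three axes identified above. The thresholds and decision rules below are entirely invented for illustration; they are not DeepDive's actual optimizer, only an example of how a rule-based chooser over these axes could look:

```python
# Hypothetical rule-based chooser between materialization strategies.

def choose_strategy(n_factors, correlation_density, expected_changes):
    """Pick 'sampling' or 'variational' materialization (toy rules)."""
    if correlation_density < 0.01:
        # Very sparse correlations (illustrative threshold).
        return "variational"
    if expected_changes > 0.1 * n_factors:
        # Many anticipated future changes (illustrative threshold).
        return "sampling"
    return "variational"

print(choose_strategy(n_factors=10**6,
                      correlation_density=0.05,
                      expected_changes=5 * 10**5))  # → sampling
```

The real optimizer [42] derives its decisions from the measured tradeoff space rather than fixed constants like these.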
Footnote i: For example, for the grounding procedure illustrated in Figure 8, the
delta rule for F1 is qδ(x) :− Rδ(x, y).