efforts. For example, developers might decide to sink a large
amount of time into improving the quality of some upstream
component of their pipeline, only to find that it has a negligible effect on end system performance—essentially, running
into an Amdahl’s law for quality. In contrast, by formulating
the task as a single probabilistic inference problem, DeepDive
allows the developer to effectively profile the end-to-end
quality of his or her application. We argue that our approach
leads to higher quality end-to-end models in less time, which
is the ultimate goal of all information extraction systems.
Like other KBC systems, DeepDive uses a high-level
declarative language to enable the user to describe application inputs, outputs, and model structure. 9, 31, 33 DeepDive’s
language is based on SQL, but also inherits Markov Logic
Networks’ formal semantics to enable users to declaratively
describe their KBC task as a type of probabilistic graphical
model called a factor graph. 11, 33
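To make this abstraction concrete, the following is a minimal sketch of the factor-graph semantics on a toy spouse-extraction example. It is written in plain Python rather than in DeepDive’s declarative language, and the variable names and weights are invented for illustration: each candidate fact is a Boolean random variable, each weighted rule is a factor, and the probability of a possible world is proportional to the exponential of the summed weights of the factors it satisfies, as in Markov Logic.

```python
import itertools
import math

# Toy factor graph in the Markov Logic style (illustrative only; this is
# not DeepDive's actual declarative syntax). Each candidate fact is a
# Boolean random variable; each weighted rule is a factor over a subset
# of the variables.
variables = ["HasSpouse(Ann,Bob)", "Mention1_IsSpouse", "Mention2_IsSpouse"]

# A factor is (weight, variables it touches, predicate over their values).
factors = [
    # Each labeled "spouse" mention weakly implies the extracted fact.
    (1.5, ["Mention1_IsSpouse", "HasSpouse(Ann,Bob)"], lambda m, f: (not m) or f),
    (1.5, ["Mention2_IsSpouse", "HasSpouse(Ann,Bob)"], lambda m, f: (not m) or f),
    # A weak prior that each individual mention really denotes spouses.
    (0.8, ["Mention1_IsSpouse"], lambda m: m),
    (0.8, ["Mention2_IsSpouse"], lambda m: m),
]

def score(world):
    """Unnormalized probability: exp(sum of weights of satisfied factors)."""
    return math.exp(sum(w for w, vs, pred in factors
                        if pred(*(world[v] for v in vs))))

# Exact marginals by enumerating all 2^n possible worlds (toy-sized only).
Z = 0.0
mass_true = {v: 0.0 for v in variables}
for values in itertools.product([False, True], repeat=len(variables)):
    world = dict(zip(variables, values))
    s = score(world)
    Z += s
    for v in variables:
        if world[v]:
            mass_true[v] += s

for v in variables:
    print(f"P({v} = True) = {mass_true[v] / Z:.3f}")
```

For a graph this small the marginals can be computed exactly by enumeration; at the scale of a real KBC task this is infeasible, which is why DeepDive relies on the sampling-based inference described next.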
DeepDive uses a standard execution model9, 31, 33 in
which programs go through two main phases, grounding
and inference. In the grounding phase, DeepDive evaluates
a sequence of SQL queries to produce a factor graph that
describes a set of random variables and how they are correlated. Essentially, every tuple in the database that represents a candidate extraction, that is, a fact that could potentially be included in the output knowledge base, is included as a random variable (node) in this factor graph. In the inference phase, DeepDive
then takes the factor graph from the grounding phase and
performs statistical inference using standard techniques,
for example, Gibbs sampling. 47, 50 The output of inference is
the marginal probability of every tuple in the output knowledge base. As with Google’s Knowledge Vault12 and others, 34
DeepDive also produces marginal probabilities that are
calibrated: if one examined all facts assigned probability 0.9, approximately 90% of them should be correct. To calibrate these probabilities, DeepDive estimates
(i.e., learns) parameters of the statistical model from data.
Inference is a subroutine of the learning procedure and is
the critical loop. Inference and learning are computationally intense (hours on 1TB RAM/48-core machines).
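To give a feel for what the inference phase computes, the sketch below runs a simple Gibbs sampler over the same kind of toy factor graph. The variables, weights, and sampling schedule are again invented for illustration and say nothing about DeepDive’s actual, heavily engineered sampler; the point is only that the fraction of post-burn-in samples in which a variable is true estimates its marginal probability, which is the quantity DeepDive reports for each candidate fact.

```python
import math
import random

random.seed(0)

# Same kind of toy factor graph: Boolean variables for candidate facts,
# weighted Markov Logic style factors (all names and weights invented).
variables = ["HasSpouse(Ann,Bob)", "Mention1_IsSpouse", "Mention2_IsSpouse"]
factors = [
    (1.5, ["Mention1_IsSpouse", "HasSpouse(Ann,Bob)"], lambda m, f: (not m) or f),
    (1.5, ["Mention2_IsSpouse", "HasSpouse(Ann,Bob)"], lambda m, f: (not m) or f),
    (0.8, ["Mention1_IsSpouse"], lambda m: m),
    (0.8, ["Mention2_IsSpouse"], lambda m: m),
]

def log_score(world):
    """Sum of weights of the factors satisfied by this truth assignment."""
    return sum(w for w, vs, pred in factors if pred(*(world[v] for v in vs)))

def gibbs_marginals(num_samples=20000, burn_in=2000):
    """Estimate P(v = True) for every variable with a simple Gibbs sampler."""
    world = {v: random.random() < 0.5 for v in variables}
    true_counts = {v: 0 for v in variables}
    for step in range(burn_in + num_samples):
        for v in variables:
            # Resample v conditioned on the current values of all others.
            world[v] = True
            log_true = log_score(world)
            world[v] = False
            log_false = log_score(world)
            p_true = 1.0 / (1.0 + math.exp(log_false - log_true))
            world[v] = random.random() < p_true
        if step >= burn_in:
            for v in variables:
                true_counts[v] += world[v]
    return {v: true_counts[v] / num_samples for v in variables}

for v, p in gibbs_marginals().items():
    print(f"P({v} = True) ~ {p:.3f}")
```

These estimated marginals are also what calibration is measured against: bucketing the extracted facts by predicted probability and checking the empirical precision within each bucket.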
In our experience, we have found that DeepDive can reliably obtain extremely high quality on a range of KBC tasks.
In the past few years, DeepDive has been used to build dozens of high-quality KBC systems by a handful of technology companies, a number of law enforcement agencies via
DARPA’s MEMEX program, and scientists in fields such as
paleobiology, drug repurposing, and genomics. Recently, we
compared the quality of a DeepDive system’s extractions to
those provided by human volunteers over the last 10 years
for a paleobiology database, and we found that the DeepDive
system had higher quality (both precision and recall) on
many entities and relationships. Moreover, on all of the
extracted entities and relationships, DeepDive had no worse
quality. 36 Additionally, the winning entry of the 2014 TAC-KBP competition was built on DeepDive. 1
One key lesson learned was that in all cases, enabling
developers to iterate quickly was critical to achieving such
high quality. More broadly, we have seen that the process
of developing KBC systems for real applications is fundamentally iterative: quality requirements change, new data sources arrive, and new concepts are needed in the application. Thus, DeepDive’s architecture is designed around a set
of techniques that not only make the execution of statistical
inference and learning efficient, but also make the entire
pipeline incremental in the face of changes both to the data
and to the declarative specification.
This article aims to give a broad overview of DeepDive.
The rest of the article is organized as follows. Section 2
describes some example applications of DeepDive and
outlines core technical challenges. Section 3 presents the
system design and language for modeling KBC systems
inside DeepDive. We discuss the different techniques in
Section 4 and give pointers for readers who are interested
in each technique.
2. APPLICATIONS AND CHALLENGES
KBC plays a critical role in many analysis tasks, both scientific and industrial, and is often the bottleneck to answering
new and impactful macroscopic questions. In many scientific analyses, for example, one first needs to assemble a
large, high-quality knowledge base of facts (typically from
the literature) in order to understand macroscopic trends
and patterns, for example, about the amount of carbon in
the Earth’s atmosphere throughout time36 or all the drugs
that interact with a particular gene, 29 and some scientific
disciplines have undertaken decade-long collection efforts
to this end, for example, PaleoDB.org and PharmaGKB.org.
In parallel, KBC has attracted interest from industry15, 52
and many areas of academia outside of computer science. 2, 3, 6,
14, 23, 25, 31, 34, 37, 41, 43, 48 To understand the common patterns in
KBC systems, we are actively collaborating with scientists
from a diverse set of domains, including geology, 49
paleontology, 36 pharmacology for drug repurposing, and others. We first describe one KBC application we built, called
PaleoDeepDive, then present a brief description of other
applications built with similar purposes and finally discuss
the challenges inherent in building such systems.
2.1. PaleoDB and PaleoDeepDive
Paleontology is based on the description and biological classification of fossils, an enterprise that has been recorded
in hundreds to thousands of scientific publications over
the past four centuries. One central task that paleontologists have long been concerned with is the construction of
a knowledge base about fossils from scientific publications.
Existing knowledge bases compiled by human volunteers—
for example, PaleoDB—have already greatly expanded the
intellectual reach of paleontology and led to many fundamental new insights into macroevolutionary processes
and the nature of biotic responses to global environmental change. However, the current process of using human
volunteers is usually expensive and time-consuming. For
example, PaleoDB, one of the largest such knowledge bases,
took more than 300 professional paleontologists and 11
human years to build over the last two decades, resulting
in PaleoDB.org. To get a sense of the impact of this database on this field, at the time of writing, this dataset has contributed to 205 publications, of which 17 have appeared in
Nature or Science.