DeepDive: Declarative Knowledge Base Construction
By Ce Zhang, Christopher Ré, Michael Cafarella, Christopher De Sa, Alex Ratner, Jaeho Shin, Feiran Wang, and Sen Wu
DOI: 10.1145/3060586
Abstract
The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database
with information from unstructured data sources, such as
emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems
of data extraction, cleaning, and integration. We describe
DeepDive, a system that combines database and machine
learning ideas to help develop KBC systems. The key idea
in DeepDive is to frame traditional extract–transform–load
(ETL) style data management problems as a single large
statistical inference task that is declaratively defined by the
user. DeepDive leverages the effectiveness and efficiency
of statistical inference and machine learning for difficult
extraction tasks, while not requiring users to directly
write any probabilistic inference algorithms. Instead,
domain experts interact with DeepDive by defining features
or rules about the domain. DeepDive has been successfully
applied to domains such as pharmacogenomics, paleobiology, and anti-human-trafficking enforcement, achieving
human-caliber quality at machine-caliber scale. We present the applications, abstractions, and techniques used in
DeepDive to accelerate the construction of such dark data
extraction systems.
1. INTRODUCTION
The goal of knowledge base construction (KBC) is to populate a structured relational database from unstructured
input sources, such as text documents, PDFs, and diagrams.
As the amount of available unstructured information has
skyrocketed, this task has become a critical component
in enabling a wide range of new analysis tasks. For example, analyses of protein–protein interactions for biological, clinical, and pharmacological applications29; online
human trafficking activities for law enforcement support;
and paleontological facts for macroscopic climate studies36
are all predicated on leveraging data from large volumes
of text documents. This data must be collected in a structured format in order to be used, however, and in most cases
doing this extraction by hand is untenable, especially when
domain expertise is required. Building an automated KBC
system is thus often the key development step in enabling
these analysis pipelines.
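To make the task concrete, the following is a minimal Python sketch of populating a relational table from raw text. It is not DeepDive's interface: the sentences, the `spouse` table, the feature names, and the weights are all hypothetical and chosen only for illustration, and a simple per-candidate logistic score stands in for the much richer statistical inference described in this paper.

```python
import math
import re
import sqlite3

# Hypothetical input sentences, invented for illustration only.
SENTENCES = [
    "Alice Smith and her husband Bob Smith attended the gala.",
    "Carol Jones met Dave Brown at a conference.",
]

# Hypothetical feature weights; a real system would learn these from data.
WEIGHTS = {"spouse_word_between": 2.5, "same_last_name": 1.0, "bias": -1.5}

# Crude stand-in for person-mention detection: two capitalized words.
NAME = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")
SPOUSE_WORDS = ("husband", "wife", "married")


def candidates(sentence):
    """Yield each pair of person-like mentions and the text between them."""
    mentions = list(NAME.finditer(sentence))
    for i, left in enumerate(mentions):
        for right in mentions[i + 1:]:
            between = sentence[left.end():right.start()]
            yield left.group(), right.group(), between


def features(left, right, between):
    """Domain-specific features: the part a domain expert would supply."""
    return {
        "spouse_word_between": any(w in between for w in SPOUSE_WORDS),
        "same_last_name": left.split()[1] == right.split()[1],
        "bias": True,
    }


def probability(feats):
    """Logistic function of the summed weights of the active features."""
    score = sum(WEIGHTS[name] for name, active in feats.items() if active)
    return 1.0 / (1.0 + math.exp(-score))


def build_kb(sentences):
    """Populate an in-memory relational table with scored candidate facts."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE spouse (person1 TEXT, person2 TEXT, probability REAL)"
    )
    for sentence in sentences:
        for left, right, between in candidates(sentence):
            p = probability(features(left, right, between))
            db.execute("INSERT INTO spouse VALUES (?, ?, ?)", (left, right, p))
    return db


kb = build_kb(SENTENCES)
rows = kb.execute(
    "SELECT person1, person2, probability FROM spouse "
    "ORDER BY probability DESC"
).fetchall()
for person1, person2, p in rows:
    print(f"{person1} | {person2} | {p:.2f}")
```

In this sketch the only domain-specific code is `features`; candidate generation, scoring, and loading are generic, which loosely mirrors the division of labor between domain experts and the system.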
The process of populating a structured relational database from unstructured sources has also received renewed
interest in the database community through high-profile
start-up companies, established companies such as IBM’s
Watson,5,15 and a variety of research efforts.9,26,31,41,46 At the
same time, the natural language processing and machine
learning communities are attacking similar problems.3,12,22
Although different communities place differing emphasis
on the extraction, cleaning, and integration phases, all seem
to be converging toward a common set of techniques that
includes a mix of data processing, machine learning, and
engineers-in-the-loop.a
Here, we discuss DeepDive, our open-source engine for
constructing knowledge bases with human-caliber quality
at machine-caliber scale (Figure 1). DeepDive takes the viewpoint
that in information extraction, the problems of extraction,
cleaning, and integration are not disjoint algorithmic
problems, though the database community has treated them
as such for several decades. Instead, these problems can be
more effectively attacked jointly, and viewed as a single statistical
inference problem that takes all available information
into account to produce the best possible end result.
We have found that one of the most harmful inefficiencies of
traditional pipelined approaches is that developers struggle
to understand how changes to the separate extraction, cleaning,
or integration modules improve the overall system quality,
leading them to incorrectly distribute their development
The original version of this paper is entitled “Incremental
Knowledge Base Construction Using DeepDive” and
was published in Proceedings of the VLDB Endowment,
2015. This paper also contains content from other previously
published work.16,36,39,51
Figure 1. Knowledge base construction (KBC) is the process
of populating a structured relational knowledge base from
unstructured sources. DeepDive is a system aimed at facilitating the
KBC process by allowing domain experts to integrate their domain
knowledge without worrying about algorithms.
[Figure 1 labels. Input: Unstructured Documents. Process: Knowledge Base Construction (KBC). Output: Structured Knowledge Base. Domain Experts: Features, Knowledge, Supervision.]
a http://deepdive.stanford.edu.