MAY 2017 | VOL. 60 | NO. 5 | COMMUNICATIONS OF THE ACM 97
would have another number. A relation associates two (or
more) entities, and represents the fact that there exists a
relationship between the participating entities. For example, “Barack_Obama_1” and “Michelle_Obama_1” participate in the HasSpouse relation, which indicates that they
are married. These real-world entities and relationships are
described in text. A mention is a span of text in an input document that refers to an entity or relationship: “Michelle”
may be a mention of the entity “Michelle_Obama_1.” A
relation mention is a phrase that connects two mentions
that participate in a relation, such as the phrase “and his wife” connecting the mentions Barack Obama and
M. Obama. The process of mapping mentions to entities is
called entity linking.
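The four object types and the linking step can be sketched in a few lines of Python. This is only an illustration of the definitions above, not DeepDive's data model: the class names, the `find_mention` helper, and the hand-written alias table are all assumptions made for the example, and real entity linking is far more involved than a dictionary lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str          # e.g. "Michelle_Obama_1"


@dataclass(frozen=True)
class Mention:
    doc_id: str
    start: int              # character offset where the span begins
    end: int                # character offset one past the end of the span
    text: str


@dataclass(frozen=True)
class Relation:
    name: str               # e.g. "HasSpouse"
    args: tuple             # participating Entity objects


@dataclass(frozen=True)
class RelationMention:
    name: str
    args: tuple             # participating Mention objects
    phrase: str             # text connecting the mentions


def find_mention(doc_id, sentence, text):
    """Hypothetical helper: locate a mention by exact string match."""
    start = sentence.find(text)
    if start < 0:
        raise ValueError(f"{text!r} not found in sentence")
    return Mention(doc_id, start, start + len(text), text)


def link_entities(mentions, alias_table):
    """Toy entity linking: map each mention to an entity via an alias table."""
    return {m: alias_table.get(m.text) for m in mentions}


# Assumed alias table, written by hand for this example only.
ALIASES = {
    "Barack Obama": Entity("Barack_Obama_1"),
    "M. Obama": Entity("Michelle_Obama_1"),
}

sentence = "Barack Obama and his wife M. Obama"
m1 = find_mention("doc1", sentence, "Barack Obama")
m2 = find_mention("doc1", sentence, "M. Obama")

relation_mention = RelationMention("HasSpouse", (m1, m2), "and his wife")
links = link_entities([m1, m2], ALIASES)
has_spouse = Relation("HasSpouse", (links[m1], links[m2]))
```

Keeping mentions (spans in a document) separate from entities (real-world objects) is what allows two different people named “Michelle Obama” to receive different identifiers.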
3.2. The DeepDive framework^f
DeepDive is an end-to-end framework for building KBC systems. In this section, we walk through each phase. DeepDive
supports both SQL and Datalog, but we use Datalog syntax
for this exposition. The rules we describe in this section are
manually created by the user of DeepDive, and the process
of creating these rules is application-specific. For simplicity
of exposition, we focus on an example with text input in the
rest of the section (Figure 7).^g
Candidate mapping and feature extraction. All data in
DeepDive—preprocessed input, intermediate data, and
final output—is stored in a relational database. The first
phase populates the database using a set of SQL queries and
user-defined functions (UDFs) that we call feature extractors.
By default, DeepDive stores all documents in the database
in one sentence per row with markup produced by standard
3. KBC USING DEEPDIVE
We describe DeepDive, an end-to-end framework for building KBC systems with a declarative language.
3.1. Definitions for KBC systems
The input to a KBC system is a heterogeneous collection of
unstructured, semistructured, and/or structured data, ranging from text documents to existing but incomplete KBs,
and an application schema specifying the target relations to
extract. The output of the system is a relational database containing relations extracted from the input according to the
application schema. Creating the knowledge base involves
extraction, cleaning, and integration.
Example 3.1. Figure 6 illustrates a running example in which
our goal is to construct a knowledge base with pairs of individuals who are married to each other. The input to the system is a collection of news articles and an incomplete set of married people;
the output is a knowledge base (KB) containing pairs of people
that the input sources assert to be married. A KBC system extracts
linguistic patterns, for example, “. . . and his wife . . .” between
a pair of mentions of individuals (e.g., Barack Obama and
M. Obama); these patterns are then used as features in a classifier
deciding whether this pair of mentions indicates that they are
married (in the HasSpouse relation).
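As a rough illustration of the pattern extraction in Example 3.1, the following Python sketch turns the text between two mentions into string features. The function names and the feature naming scheme (`PHRASE_BETWEEN`, `NUM_WORDS_BETWEEN`) are assumptions for this example and are not taken from DeepDive; a real system would use richer linguistic markup than plain string matching.

```python
def phrase_between(sentence, first, second):
    """Return the text between two mention strings, or None if either is missing."""
    i = sentence.find(first)
    if i < 0:
        return None
    # Search for the second mention only after the end of the first one.
    j = sentence.find(second, i + len(first))
    if j < 0:
        return None
    return sentence[i + len(first):j].strip()


def spouse_features(sentence, first, second):
    """Build simple string features for a candidate pair of mentions."""
    phrase = phrase_between(sentence, first, second)
    if phrase is None:
        return []
    words = phrase.lower().split()
    return [
        "PHRASE_BETWEEN=" + "_".join(words),
        "NUM_WORDS_BETWEEN=%d" % len(words),
    ]


features = spouse_features(
    "Barack Obama and his wife M. Obama", "Barack Obama", "M. Obama"
)
print(features)
# ['PHRASE_BETWEEN=and_his_wife', 'NUM_WORDS_BETWEEN=3']
```

Features of this kind would then be passed to a classifier that decides whether the pair of mentions belongs in the HasSpouse relation.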
We adopt standard terminology from KBC, for example,
ACE. There are four types of objects that a KBC system seeks
to extract from input documents, namely entities, relations,
mentions, and relation mentions. An entity is a real-world
person, place, or thing. For example, “Michelle_Obama_1”
represents the actual entity for a person whose name is
“Michelle Obama”; another individual with the same name
f http://www.itl.nist.gov/iad/mig/tests/ace/2000/.
g For more information, including examples, please see http://deepdive.stanford.edu.
Note that our engine is built on Postgres and Greenplum for
all SQL processing and UDFs. There is also a port to MySQL.
Figure 5. Another challenge of building high-quality KBC systems
is that one usually needs to deal with data at the scale of terabytes.
These data are not only processed with traditional relational
operations, but also operations involving machine learning and
statistical inference. Thus, DeepDive consists of a set of techniques
to increase the speed, scale, and incremental execution of inference
tasks involving billions of correlated random variables.
[Figure 5 labels: Input Documents (2TB); OCR + NLP (340K Machine Hours); 100M Sentences; 300M Candidates; 30M Extractions; parse-tree nodes S, NP, VP, Det, N, V]
Figure 6. A KBC system takes unstructured documents as input
and outputs a structured knowledge base. The runtimes are for the
TAC-KBP competition system. To improve quality, the developer
adds new rules and new data with error analysis conducted on the
result of the current snapshot of the system. DeepDive provides a
declarative language to specify each different type of rule and data,
and techniques to incrementally execute this iterative process.
[Figure 6 labels: Input (1.8M docs; example text: “Barack Obama and his wife M. Obama ...”); KBC system built with DeepDive, with phases Candidate Mapping & Feature Extraction (3h), Supervision (1h), and Learning and Inference (3h); Output (2.4M facts; HasSpouse ...); Engineering-in-the-loop development: feature ext. rules, new documents, inference rules, supervision rules, updated KB, error analysis]