need to be considered together—that is, jointly—to make a
correct extraction. In Figure 4, for example, to reach the
extraction that the genus Xenacanthus appears in the
location named Obara, the extraction system needs
to consult extractions from text, tables, and external
structured sources.
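The Figure 4 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Place` record, the `GAZETTEER` contents (including the placeholder second entry and the context keywords), and `resolve_location` are hypothetical names introduced here, and the deterministic keyword overlap stands in for, and is much simpler than, the joint probabilistic inference that DeepDive performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float            # decimal degrees north
    lon: float            # decimal degrees east
    context: frozenset    # hypothetical keywords describing the place


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees/minutes/seconds to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600


# Hypothetical external knowledge base. The first entry uses the
# coordinates shown in Figure 4; the second is a made-up placeholder
# that only serves to make the name "Obara" ambiguous.
GAZETTEER = [
    Place("Obara", "Czech Republic",
          dms_to_decimal(49, 27, 34), dms_to_decimal(16, 36, 8),
          frozenset({"boskovice", "moravia"})),
    Place("Obara", "Placeholder country", 0.0, 0.0,
          frozenset({"placeholder"})),
]


def resolve_location(name, text_snippet, gazetteer):
    """Pick the gazetteer entry whose context best overlaps the text."""
    words = set(text_snippet.lower().replace(",", " ").split())
    candidates = [p for p in gazetteer if p.name == name]
    if not candidates:
        return None
    return max(candidates, key=lambda p: len(p.context & words))


# Extraction from the table: (location name, genus).
table_extraction = ("Obara", "Xenacanthus")
# Extraction from the text of the same article.
text_snippet = "from the Obara village near Boskovice, in central Moravia"

place = resolve_location(table_extraction[0], text_snippet, GAZETTEER)
# Truncate coordinates to whole degrees, as in the figure's final extraction.
final_extraction = (
    f"{place.name} {int(place.lat)}°N {int(place.lon)}°E",
    table_extraction[1],
)
print(final_extraction)
```

No single source suffices here: the table supplies the genus and the location name, the text supplies the context that disambiguates the name, and the external knowledge base supplies the coordinates.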
Scale. KBC systems need to be able to ingest massive numbers of documents, far outstripping the document counts
of even well-funded human curation efforts. For example,
Figure 5 illustrates the data flow of PaleoDeepDive. The input to PaleoDeepDive contains nearly 300K journal articles
and books, whose total size exceeds 2TB. These raw inputs
are then processed with tools such as OCR and linguistic
parsing, which are computationally expensive and may take
hundreds of thousands of machine hours.d
Multimodal input. We have found that text is often not
enough: the data that interest scientists are frequently
located in the tables, figures, and images of articles. For
example, in geology, more than 50% of the facts that we are
interested in are buried in tables.16 For paleontology, the
relationships between taxa, also known as taxonomy, are
almost exclusively expressed in section headers.36 For
pharmacology, it is not uncommon for a simple diagram to
contain a large number of metabolic pathways. Additionally,
external sources of information (other knowledge bases, e.g.,
Freebase and Macrostrat) typically contain high-quality
signals that we would like to leverage and integrate. To
build a high-quality KBC system, we need to deal with these
diverse modalities of input.e
of people and organizations (e.g., age, birthplace, spouses,
and shareholders) from 1.3 million newswire and web
documents; this task is also termed slot filling. In the
2014 evaluation, 31 US and international teams
participated in the competition, including a Stanford team that
submitted a solution based on DeepDive.1 The
DeepDive-based solution achieved the highest precision, recall, and
F1 of all the submissions.
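For reference, the three metrics can be computed as follows. This is a minimal sketch using the standard set-based definitions; the official evaluation's scoring procedure may differ in its details, and the `gold` and `predicted` triples below are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for a set of extracted facts."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical slot-filling facts as (entity, slot, value) triples.
gold = {
    ("Alice", "birthplace", "Paris"),
    ("Alice", "spouse", "Bob"),
    ("Acme", "shareholder", "Carol"),
    ("Bob", "age", "42"),
}
predicted = {
    ("Alice", "birthplace", "Paris"),
    ("Alice", "spouse", "Bob"),
    ("Acme", "shareholder", "Dave"),
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```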
2.3. Challenges
In all the applications mentioned above, KBC systems built
with DeepDive achieved high quality, as illustrated in Figure 3.
Achieving this level of quality requires that we deal with
several challenging aspects of the KBC problem.
Unstructured data complexity. In its full generality, the
KBC task encompasses several longstanding grand
challenges of computer science, including machine reading
and computer vision. Even for simple schemas, extraction of
structured information from unstructured sources involves
many challenging aspects. For example, consider extracting
the relation Causes(Gene, Phenotype), that is,
assertions of a genetic mutation causing a certain phenotype
(symptom), from the scientific literature (see Section 2.2).
Genes generally have standardized forms of expression (e.g.,
BRCA1); however, they are easily confused with acronyms
for diseases they cause; signals from across the document
must be used to resolve these false positives. Phenotypes
are even more challenging, because they can be expressed
in many synonymous forms (e.g., “headache,” “head pain,”
and “pain in forehead”). And extracting pairs that
participate in the Causes relation requires dealing with all
the standard challenges of linguistic variation and
complexity, as well as application-specific domain terminology.
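To make the Causes(Gene, Phenotype) example concrete, the following sketch generates candidate pairs from a single sentence. It is an illustration only, not DeepDive's extractor: `PHENOTYPE_SYNONYMS`, `KNOWN_GENES`, `GENE_PATTERN`, and `causes_candidates` are hypothetical names, the dictionaries are toy stand-ins for curated resources, and the output is a list of candidates that a later statistical step would still have to accept or reject.

```python
import re

# Hypothetical synonym table: surface form -> canonical phenotype name.
PHENOTYPE_SYNONYMS = {
    "headache": "headache",
    "head pain": "headache",
    "pain in forehead": "headache",
}

# Hypothetical gene dictionary; a real system would load a curated list.
KNOWN_GENES = {"BRCA1", "TP53"}

# Gene symbols tend to be short upper-case tokens such as BRCA1, but so
# are many unrelated acronyms, hence the dictionary check below.
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")


def causes_candidates(sentence):
    """Return candidate (gene, canonical phenotype) pairs in a sentence."""
    genes = sorted({m for m in GENE_PATTERN.findall(sentence)
                    if m in KNOWN_GENES})
    lowered = sentence.lower()
    phenotypes = sorted({canonical
                         for surface, canonical in PHENOTYPE_SYNONYMS.items()
                         if surface in lowered})
    return [(gene, phenotype) for gene in genes for phenotype in phenotypes]


sentence = "Mutations in BRCA1 were associated with head pain in the DNA cohort."
candidates = causes_candidates(sentence)
print(candidates)
```

The dictionary check filters the acronym "DNA", and the synonym table maps "head pain" to a canonical form, but neither addresses a gene symbol that doubles as a disease acronym; that ambiguity is the kind that needs signals from across the document.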
This challenge becomes even more serious when information
comes from different sources that potentially
Figure 3. Quality of KBC systems built with DeepDive. On many
applications, KBC systems built with DeepDive achieve quality comparable
to (and sometimes better than) that of professional human volunteers,
and lead to similar scientific insights on topics such as biodiversity.
This quality is achieved by iteratively integrating diverse sources of
data; often, quality scales with the amount of information we enter
into the system.
[Figure 3 panels. Quality of PaleoDeepDive, one row of values per relation: Taxon–Taxon 27, 97, 92; Taxon–Fm. 4, 96, 84; Fm.–Time 3, 92, 89; Fm.–Location 5, 94, 90. Scale of PaleoDeepDive: Documents processed 300K+, 40K; Test set extractions; 129K, 60K. Biodiversity curve: total diversity (number of species, 0 to 3000) against geological time (M, 500 to 0), with curves for Human and PaleoDeepDive.]
Figure 4. One challenge of building high-quality KBC systems is
exploiting diverse sources of information jointly to extract data
accurately. In this example page of a paleontology journal article,
identifying the correct location of Xenacanthus requires integrating
information from within tables, text, and external structured
knowledge bases. This problem becomes even more challenging
when many extractors are not 100% accurate, motivating the joint
probabilistic inference engine inside DeepDive.
[Figure 4 panels. Input Document; Table Snippet. Extractions from Table: AppearInLocation(Obara, Xenacanthus). External Knowledge: (Czech Republic; GPS 49° 27’ 34” N, 16° 36’ 8” E)... Extractions from Text: “… from the Obara village near Boskovice, in central Moravia”. Final Extraction: AppearInLocation(Obara 49°N 16°E, Xenacanthus).]
d http://www.freebase.com/.
e http://macrostrat.org/.