MAY 2017 | VOL. 60 | NO. 5 | COMMUNICATIONS OF THE ACM 95
call, whereas human annotators only obtain 93% precision
and 92.5% recall. MEMEX has been covered on 60 Minutes and
other news sources, currently supports the operations of
several law enforcement agencies nationwide, and has been
used in at least one arrest and conviction.
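The precision and recall figures quoted for phone number extraction can be made concrete with a small sketch. The regular expression, the sample ads, and the gold labels below are all hypothetical and chosen only for illustration; DeepDive's own extractors are statistical, not a single pattern like this. The sketch only shows how extracted (ad, phone) pairs would be scored against human-labeled pairs.

```python
import re

# Hypothetical candidate pattern for US-style 10-digit phone numbers.
# An assumption for illustration, not DeepDive's actual extractor.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")


def extract_phones(text):
    """Return the set of normalized 10-digit phone strings found in text."""
    return {"".join(m.groups()) for m in PHONE_RE.finditer(text)}


def precision_recall(predicted, gold):
    """Score a set of predicted items against a set of gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up ads using fictitious 555-01xx numbers.
ads = {
    "ad1": "Call (415) 555-0134 anytime",
    "ad2": "text 415.555.0199 or 4155550123",
    "ad3": "new in town, call four one five 555 0177",
}

# Made-up gold labels, as a human annotator might record them.
gold = {
    ("ad1", "4155550134"),
    ("ad2", "4155550199"),
    ("ad2", "4155550123"),
    ("ad3", "4155550177"),  # area code spelled out; the pattern misses it
}

predicted = {
    (ad_id, phone)
    for ad_id, text in ads.items()
    for phone in extract_phones(text)
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data every extracted pair is correct, but the number with a spelled-out area code is missed, so precision is perfect while recall is not. Obfuscated numbers of this kind are one reason a fixed pattern alone would likely fall short of the recall reported in the text.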
Medical genetics. The body of literature in the life sciences
has been growing at an accelerating pace, to the extent that
it has become unrealistic for scientists to perform research
based solely on reading and/or keyword search. Numerous
manually curated structured knowledge bases are likewise
unable to keep pace with exponential increases in the number
of publications available online. For example, OMIM is an
authoritative database of human genes and Mendelian genetic
disorders that dates back to the 1960s and so far contains
about 6000 hereditary diseases or phenotypes, growing at a
rate of roughly 50 records per month for many years. By
contrast, almost 10,000 publications were deposited into
PubMed Central per month last year. In collaboration with
Prof. Gill Bejerano at Stanford, we are developing DeepDive
applications to create knowledge bases in the field of medical
genetics. Specifically, we use DeepDive to extract mentions of
direct causal relationships between specific gene variants and
clinical phenotypes from the literature; these extractions are
presently being applied to clinical genetic diagnostics and
reproductive counseling.b
Pharmacogenomics. Understanding the interactions of
chemicals in the body is key to drug discovery. However,
the majority of these data reside in the biomedical literature
and cannot be easily accessed. The Pharmacogenomics
Knowledge Base is a high-quality database that aims to
annotate the relationships between drugs, genes, diseases,
genetic variation, and pathways in the literature. In
collaboration with Emily Mallory and Prof. Russ Altman at
Stanford, we used DeepDive to extract mentions of gene–gene
interactions from the scientific literature,29 and are currently
developing DeepDive applications with extraction schemas
that include relations between genes, diseases, and drugs
in order to predict novel pharmacological relationships.c
TAC-KBP. TAC-KBP is a NIST-sponsored research competition
in which the task is to extract common properties
The potential impact of automating this labor-intensive
extraction task and the difficulty of the task itself provided
an ideal test bed for our KBC research. In particular, we
constructed a prototype called PaleoDeepDive36 that takes in
PDF documents and extracts a set of paleontological entities
and relations (see Figure 2). This prototype attacks
challenges in optical character recognition, natural language
processing, information extraction, and integration. Some
statistics about the process are shown in Figure 3. As part
of the validation of this system, we performed a double-blind
experiment to assess the quality of PaleoDeepDive
versus PaleoDB. We found that PaleoDeepDive achieved
accuracy comparable to, and sometimes better than, that
of PaleoDB (see Figure 3).36 Moreover, PaleoDeepDive was
able to process roughly 10x the number of documents, with
per-document recall roughly 2.5x that of human annotators.
2.2. Beyond paleontology
The success of PaleoDeepDive motivates a series of other
KBC applications in a diverse set of domains, including both
natural and social sciences. Although these applications
focus on very different types of KBs, they are usually built
in a way similar to PaleoDeepDive. This similarity across
applications has motivated us to build DeepDive as a unified
framework to support these diverse applications.
Human trafficking. Human trafficking is an odious crime
that uses physical, economic, or other means of coercion to
obtain labor from human beings, who are often used in sex
or factory work. Identifying victims of human trafficking is
difficult for law enforcement using traditional means; however, like many other forms of commerce, sex work advertising is now online, where providers of sex services post ads
containing price, location, contact information, physical
characteristics, and other data. As part of the DARPA MEMEX
project, we ran DeepDive on approximately 90M advertisements and 0.5M forum posts, creating two distinct
structured tables that included extracted attributes about
potentially trafficked workers, such as price, location, phone
number, service types, age, and various other attributes that
can be used to detect signs of potential trafficking or abuse.
In many cases, DeepDive is able to extract these attributes
with comparable or greater quality levels than human annotators; for example, on phone number extraction from service ads, DeepDive achieves 99.5% precision and 95.5% re-
Figure 2. Example relations extracted from text, tables, and diagrams in the paleontology literature by PaleoDeepDive.
[Figure: input types shown are Natural Language Text, Document Layout, and Image. Sample text: "... The Namurian Tsingyuan Formation from Ningxia, China, divided into three members ...". Example extractions: Formation–Time (Tsingyuan Fm., Namurian); Formation–Location (Tsingyuan Fm., Ningxia); Taxon–Formation (Retispira, Tsingyuan Fm.); Taxon–Taxon (Strobeus rectilinea, Buccinum rectineum); Taxon–Real Size (Shansiella tongxinensis, 5 cm x 5 cm).]
b http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/.
c https://www.pharmgkb.org/.