References
1. Angeli, G. et al. Stanford's 2014 slot filling systems. In TAC KBP (2014).
2. Banko, M. et al. Open information extraction from the Web. In IJCAI (2007).
3. Betteridge, J., Carlson, A., Hong, S.A., Hruschka, E.R., Jr., Law, E.L., Mitchell, T.M., Wang, S.H. Toward never ending language learning. In AAAI Spring Symposium (2009).
4. Brin, S. Extracting patterns and relations from the world wide web. In WebDB (1999).
5. Brown, E. et al. Tools and methods for building Watson. IBM Research Report (2013).
6. Carlson, A. et al. Toward an architecture for never-ending language learning. In AAAI (2010).
7. Chen, F., Doan, A., Yang, J., Ramakrishnan, R. Efficient information extraction over evolving text data. In ICDE (2008).
8. Chen, F. et al. Optimizing statistical information extraction programs over evolving text. In ICDE (2012).
9. Chen, Y., Wang, D.Z. Knowledge expansion over probabilistic knowledge bases. In SIGMOD (2014).
10. De Sa, C., Olukotun, K., Ré, C. Ensuring rapid mixing and low bias for asynchronous Gibbs sampling. arXiv preprint arXiv:1602.07415 (2016).
11. Domingos, P., Lowd, D. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool, 2009.
12. Dong, X.L. et al. From data fusion to knowledge fusion. In VLDB (2014).
13. Ehrenberg, H.R., Shin, J., Ratner, A.J., Fries, J.A., Ré, C. Data programming with DDLite: Putting humans in a different part of the loop. In HILDA'16, SIGMOD (2016), 13.
14. Etzioni, O. et al. Web-scale information extraction in KnowItAll: Preliminary results. In WWW (2004).
15. Ferrucci, D. et al. Building Watson: An overview of the DeepQA project. AI Magazine (2010).
5. RELATED WORK
KBC has been an area of intense study over the last decade.2, 3, 6, 14, 23, 25, 31, 37, 41, 43, 48, 52 Within this space, there are a number of approaches.
5.1. Rule-based systems
The earliest KBC systems used pattern matching to extract relationships from text. The most well-known example is the "Hearst pattern" proposed by Hearst18 in 1992. In her seminal work, Hearst observed that a large number of hyponyms can be discovered by simple patterns, for example, "X such as Y." Hearst's technique has formed the basis of many further techniques that attempt to extract high-quality patterns from text. Rule-based (pattern-matching-based) KBC systems, such as IBM's SystemT,25, 26 have been built to aid developers in constructing high-quality patterns. These systems provide the user with a (declarative) interface to specify a set of rules and patterns to derive relationships, and they have achieved state-of-the-art quality on tasks such as parsing.26
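To make the flavor of such rules concrete, below is a minimal sketch of a Hearst-style "X such as Y" matcher. The regular expression and example sentence are our own illustration, not code from SystemT or any other production system; those systems instead expose declarative rule languages for composing and maintaining many patterns of this kind.

```python
import re

# A minimal sketch of Hearst-style hyponym extraction with the
# "X such as Y" pattern. Illustrative only.
SUCH_AS = re.compile(r"(?P<hypernym>\w+)\s+such as\s+(?P<hyponyms>[^.;]+)")

def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group("hypernym")
        # Split the conjunction "Y1, Y2, and Y3" into individual hyponyms.
        for part in re.split(r",|\band\b|\bor\b", m.group("hyponyms")):
            part = part.strip()
            if part:
                pairs.append((part, hypernym))
    return pairs

print(extract_hyponyms(
    "works by authors such as Herrick, Goldsmith, and Shakespeare"))
# [('Herrick', 'authors'), ('Goldsmith', 'authors'), ('Shakespeare', 'authors')]
```

A single pattern like this has high precision on the sentences it matches but covers only a small fraction of the ways a relationship is expressed, which is what motivates the statistical approaches below.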
5.2. Statistical approaches
One limitation of rule-based systems is that the developer needs to ensure that all rules provided to the system are high-precision rules. Over the last decade, probabilistic (or machine learning) approaches have been proposed that allow the system to select automatically from a range of a priori defined features. In these approaches, each extracted tuple is associated with a marginal probability that it is true. DeepDive, Google's Knowledge Graph, and IBM's Watson are built on this approach. Within this space, there are three styles of systems, based on classification,2, 3, 6, 14, 48 maximum a posteriori inference,23, 31, 43 and probabilistic graphical models.11, 37, 52 Our work on DeepDive is based on graphical models.
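As a toy illustration of what "marginal probability" means here, the sketch below builds a two-variable factor graph with made-up weights and estimates the marginal probability that one extraction is true by Gibbs sampling, the style of inference DeepDive runs at scale over much larger graphs. None of this is DeepDive code.

```python
import math
import random

# A toy factor graph over two Boolean variables x1 and x2 with
# log-linear factors (all weights are made up for illustration):
#   prior1 * x1          -- evidence that extraction x1 is true
#   prior2 * x2          -- weaker (negative) evidence about x2
#   corr   * [x1 == x2]  -- the two extractions tend to agree
W = {"prior1": 1.5, "prior2": -0.5, "corr": 2.0}

def conditional_prob(var, other_val):
    """P(var = 1 | the other variable) under the log-linear model."""
    def score(val):
        prior = W["prior1"] if var == "x1" else W["prior2"]
        return prior * val + W["corr"] * (1 if val == other_val else 0)
    return math.exp(score(1)) / (math.exp(score(0)) + math.exp(score(1)))

def gibbs_marginal(num_samples=100_000, seed=0):
    """Estimate P(x1 = 1), the marginal probability reported per tuple."""
    random.seed(seed)
    x1 = x2 = 0
    count = 0
    for _ in range(num_samples):
        x1 = 1 if random.random() < conditional_prob("x1", x2) else 0
        x2 = 1 if random.random() < conditional_prob("x2", x1) else 0
        count += x1
    return count / num_samples

print(f"Estimated P(x1 = 1) ~ {gibbs_marginal():.3f}")
```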
6. CURRENT DIRECTIONS
6.1. Data programming
In a standard DeepDive KBC application (e.g., as in Section 3.2), the weights of the factor graph that models the extraction task are learned using either hand-labeled training data or distant supervision. However, in many applications, assembling hand-labeled training data is prohibitively expensive (e.g., when domain expertise is required), and distant supervision can be insufficient or time consuming to implement perfectly. For example, users may come up with many potential distant supervision rules that overlap, conflict, and are of varying, unknown quality, and deciding which rules to include and how to resolve their overlaps could take many development cycles. In a new approach called data programming,38 we allow users to specify arbitrary labeling functions, which subsume distant supervision rules and allow users to programmatically generate training data with increased flexibility. We then learn the relative accuracies of these labeling functions and denoise their labels using automated techniques, resulting in improved performance on the KBC applications outlined above.
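The following sketch conveys the programming model. The labeling functions, heuristics, and candidate format are our own illustrations rather than part of any released API, and the unweighted vote at the end is only a stand-in for the generative model that actually learns each function's accuracy and denoises its labels.

```python
# A minimal sketch of data programming (names and heuristics are ours).
# Each labeling function votes +1 (true), -1 (false), or 0 (abstain) on
# a candidate spouse mention.
KNOWN_SPOUSES = {("Barack Obama", "Michelle Obama")}  # tiny stand-in KB

def lf_marriage_keyword(c):
    # Heuristic: the sentence mentions marriage.
    return 1 if "married" in c["sentence"].lower() else 0

def lf_known_spouses(c):
    # Distant-supervision-style rule: the pair appears in an existing KB.
    return 1 if (c["person1"], c["person2"]) in KNOWN_SPOUSES else 0

def lf_same_person(c):
    # A person cannot be married to themselves.
    return -1 if c["person1"] == c["person2"] else 0

LABELING_FUNCTIONS = [lf_marriage_keyword, lf_known_spouses, lf_same_person]

def weak_label(candidate):
    """Combine labeling-function votes into one (noisy) training label."""
    total = sum(lf(candidate) for lf in LABELING_FUNCTIONS)
    return 1 if total > 0 else (-1 if total < 0 else 0)

candidate = {
    "sentence": "Barack Obama and Michelle Obama were married in 1992.",
    "person1": "Barack Obama",
    "person2": "Michelle Obama",
}
print(weak_label(candidate))  # 1: used as a positive training example
```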
6.2. Lightweight extraction
In some cases, users may have simple extraction tasks which
need to be implemented rapidly, or may wish to first iterate
on a simpler initial version of a more complex extraction task.
For example, a user might have a complex extraction task
involving multiple entity and relation types, connected by a
variety of inference rules, over a large web-scale dataset; but
they may want to start by iterating on just a single relationship
over a subset of the data. For these cases, we are developing a
lightweight, Jupyter notebook-based extraction system called
Snorkel, intended for quick iterative development of simple
extraction models using data programming.13 We envision
Snorkel as a companion and complement to DeepDive.j
6.3. Asynchronous inference
One method for speeding up the inference and learning
stages of DeepDive is to execute them asynchronously. In
recent work, we observed that asynchrony can introduce bias in Gibbs sampling and outlined sufficient conditions under which this bias is negligible.10 Further theoretical and applied work in this direction will allow for faster asynchronous execution of complex DeepDive models.
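As a rough sketch of the mechanism (our own toy example, not DeepDive's sampler), the worker threads below perform Gibbs updates on a shared state vector without synchronization, so each update may read slightly stale neighbor values. In CPython the global interpreter lock serializes these updates, but the staleness pattern that causes the bias is the same.

```python
import math
import random
import threading

# A toy asynchronous Gibbs sampler for a small chain-structured Ising
# model. Threads resample variables of a shared state vector without
# locks; an update may observe stale neighbor values, which is the
# source of the bias analyzed in the cited work. Constants are made up.
N = 20                  # number of binary variables in the chain
COUPLING = 0.5          # strength of agreement between neighbors
STEPS_PER_WORKER = 50_000

state = [random.randint(0, 1) for _ in range(N)]

def conditional_prob_one(i):
    """P(x_i = 1 | current, possibly stale, neighbor values)."""
    field = 0.0
    for j in (i - 1, i + 1):
        if 0 <= j < N:
            field += COUPLING * (1 if state[j] == 1 else -1)
    return 1.0 / (1.0 + math.exp(-2.0 * field))

def worker(seed, ones, visits):
    rng = random.Random(seed)
    for _ in range(STEPS_PER_WORKER):
        i = rng.randrange(N)
        value = 1 if rng.random() < conditional_prob_one(i) else 0
        state[i] = value   # lock-free write to the shared state
        ones[i] += value   # thread-local statistics, merged below
        visits[i] += 1

stats = [([0] * N, [0] * N) for _ in range(4)]
threads = [threading.Thread(target=worker, args=(s, o, v))
           for s, (o, v) in enumerate(stats)]
for t in threads:
    t.start()
for t in threads:
    t.join()

marginals = [sum(o[i] for o, _ in stats) / max(sum(v[i] for _, v in stats), 1)
             for i in range(N)]
print("Estimated marginals:", [round(m, 2) for m in marginals])
```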
Acknowledgments
We gratefully acknowledge the support of the Defense
Advanced Research Projects Agency (DARPA) XDATA program under no. FA8750-12-2-0335 and DEFT program
under no. FA8750-13-2-0039, DARPA’s MEMEX program
and SIMPLEX program, the National Science Foundation
(NSF) CAREER Award under no. IIS-1353606, the Office of
Naval Research (ONR) under awards nos. N000141210041
and N000141310129, the National Institutes of Health
Grant U54EB020405 awarded by the National Institute of
Biomedical Imaging and Bioengineering (NIBIB) through
funds provided by the trans-NIH Big Data to Knowledge
(BD2K) initiative, the Sloan Research Fellowship, the Moore
Foundation, American Family Insurance, Google, and
Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
authors and do not necessarily reflect the views of DARPA,
AFRL, NSF, ONR, NIH, or the US government.
j snorkel.stanford.edu.