for granted. A fact checker based on
Open IE seems like a natural next
step.[e]
Consider a schoolchild incorrectly
identifying the capital of North Dakota, or the date of India’s independence, in her homework. The fact
checker could automatically detect
the error and underline the erroneous
sentence in blue.[f] Right-clicking on
the underlined sentence would bring
up the conflicting facts that led the
checker to its conclusion.
Where would the fact checker’s
knowledge base originate? While resources such as WordNet and the CIA
World Factbook are of high quality,
they are inherently limited in scope
because of the labor-intensive process by which they are compiled. Even
Wikipedia, which is put together by
a large number of volunteers, only
had about two million articles at last
count—and they were not guaranteed
to contain accurate information. To
provide a checker with broad scope, it
is natural to use all of the above but
also include information extracted
from the Web via Open IE.
Of course, the use of information
extracted from the Web increases
the chance that a correct fact will be
flagged as erroneous. Again, this is
similar to utilities such as the spell
checker and grammar checker, which
also periodically misidentify words or
sentences as incorrect. Our goal, of
course, is to build fact checkers with
high precision and recall. In addition, when a fact is flagged as potentially incorrect, the checker provides
an easy means of accessing the source
of the information that led it to this
determination.
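The behavior described above can be illustrated with a minimal sketch. It assumes the knowledge base is a flat list of (arg1, relation, arg2) tuples, each tagged with the page it came from, and that the relation being checked has a single correct value. The names `Extraction` and `FactChecker` and the `example.org` sources are hypothetical and chosen only for illustration; this is not the design of any existing system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One (arg1, relation, arg2) tuple plus the page it came from."""
    arg1: str
    relation: str
    arg2: str
    source: str  # hypothetical URL or resource name


class FactChecker:
    """Toy checker: flags a claim when the knowledge base holds only
    different values for the same (arg1, relation) pair. Assumes the
    relation has one correct value, which is a simplification."""

    def __init__(self, extractions):
        self._index = defaultdict(list)
        for ext in extractions:
            self._index[(ext.arg1.lower(), ext.relation.lower())].append(ext)

    def check(self, arg1, relation, arg2):
        """Return (flagged, evidence); evidence lists the conflicting facts."""
        known = self._index.get((arg1.lower(), relation.lower()), [])
        if any(ext.arg2.lower() == arg2.lower() for ext in known):
            return False, []  # the claim is supported
        # Flag only when conflicting facts exist; unknown claims pass.
        return bool(known), known


# Illustrative knowledge base; the sources are placeholders.
kb = [
    Extraction("North Dakota", "has capital", "Bismarck", "example.org/page1"),
    Extraction("India", "gained independence in", "1947", "example.org/page2"),
]
checker = FactChecker(kb)
flagged, evidence = checker.check("North Dakota", "has capital", "Fargo")
# flagged is True, and evidence holds the Bismarck extraction with its source
```

Returning the conflicting extractions together with their sources mirrors the right-click behavior: the user can inspect why a sentence was underlined and discount the flag if the source looks unreliable.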
[e] This idea comes from Krzysztof Gajos.
[f] Blue is used to distinguish its findings from the red underline for misspellings and the green underline for grammatical errors.
Conclusion and Directions for Future Work
This article sketched the transformation of information extraction (IE)
from a targeted method, appropriate
for finding instances of a particular
relationship in text, to an open-ended
method (which we call “Open IE”)
that scales to the entire Web and can
support a broad range of unanticipated
questions over arbitrary relations.
Open IE also supports aggregating, or
“fusing,” information across a large
number of Web pages in order to provide comprehensive answers to questions such as “What do people think
about the Thinkpad laptops?” in the
Opine system [15] or “What kills bacteria?” in Figure 4.
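As a rough illustration of what fusing information across pages could look like, the sketch below ranks candidate answers to a query of the form (?, relation, arg2) by the number of distinct pages that support each one. The function name `fuse`, the tuple layout, and the sample data are assumptions made for this example; the data is invented and is not taken from Figure 4.

```python
from collections import defaultdict


def fuse(extractions, relation, arg2):
    """Rank answers to (?, relation, arg2) by distinct supporting pages."""
    support = defaultdict(set)
    for arg1, rel, obj, page in extractions:
        if rel == relation and obj == arg2:
            support[arg1].add(page)  # a set ignores repeats from one page
    # Most widely supported answers first; ties broken alphabetically.
    return sorted(
        ((answer, len(pages)) for answer, pages in support.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Invented sample extractions: (arg1, relation, arg2, page).
sample = [
    ("antibiotics", "kills", "bacteria", "page-a"),
    ("antibiotics", "kills", "bacteria", "page-b"),
    ("antibiotics", "kills", "bacteria", "page-a"),  # duplicate page
    ("chlorine", "kills", "bacteria", "page-c"),
    ("frost", "kills", "plants", "page-d"),          # different question
]
ranked = fuse(sample, "kills", "bacteria")
# ranked == [("antibiotics", 2), ("chlorine", 1)]
```

Counting distinct pages rather than raw occurrences is one simple way to keep a single repetitive page from dominating the ranking.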
We expect future work to improve
both the precision and recall of Open
IE (for example, see Downey [8] and
Yates [24]). We have begun to integrate
Open IE with inference, which would
enable an Open IE system to reason
based on the facts and generalizations
it extracts from text. The challenge, of
course, is to make this reasoning process tractable in the face of billions of
facts and rules. We foresee opportunities to unify Open IE with information
provided by ontologies such as WordNet and Cyc, as well as with human-contributed knowledge in OpenMind
and FreeBase, in order to improve
the quality of extracted information
and facilitate reasoning. Finally, we
foresee the application of Open IE to
other languages besides English.
Acknowledgments
This research was supported in part
by NSF grants IIS-0535284 and IIS-0312988,
ONR grant N00014-08-1-0431,
SRI CALO grant 03-000225, and
the WRF/TJ Cable Professorship, as
well as by gifts from Google. It was carried out at the University of Washington’s Turing Center. We wish to thank
the members of the AI and KnowItAll
groups for many fruitful discussions.
References
1. Agichtein, E. and Gravano, L. Snowball: Extracting
relations from large plain-text collections.
In Proceedings of the 5th ACM International
Conference on Digital Libraries (2000).
2. ARPA. Proceedings of the 3rd Message
Understanding Conference (1991).
3. Banko, M., Cafarella, M., Soderland, S., Broadhead,
M. and Etzioni, O. Open information extraction from
the Web. In Proceedings of the International Joint
Conference on Artificial Intelligence (2007).
4. Banko, M. and Etzioni, O. The tradeoffs between
traditional and open relation extraction. In
Proceedings of the Association of Computational
Linguistics (2008).
5. Brin, S. Extracting patterns and relations from the
World Wide Web. In Proceedings of the Workshop
at the 6th International Conference on Extending
Database Technology (Valencia, Spain, 1998),
172–183.
6. Bunescu, R. and Mooney, R. Learning to extract
relations from the Web using minimal supervision.
In Proceedings of the Association of Computational
Linguistics (2007).
7. Downey, D., Etzioni, O. and Soderland, S. A
probabilistic model of redundancy in information
extraction. In Proceedings of the International Joint
Conference on Artificial Intelligence (2005).
8. Downey, D., Schoenmackers, S. and Etzioni, O. Sparse
information extraction: Unsupervised language
models to the rescue. In Proceedings of the
Association of Computational Linguistics (2007).
9. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu,
A., Shaked, T., Soderland, S., Weld, D. and Yates, A.
Unsupervised named-entity extraction from the Web:
An experimental study. Artificial Intelligence 165, 1
(2005), 91–134.
10. Feldman, R., Rosenfeld, B., Soderland, S. and Etzioni,
O. Self-supervised relation extraction from the Web.
In Proceedings of the International Symposium
on Methodologies for Intelligent Systems (2006),
755–764.
11. Kim, J. and Moldovan, D. Acquisition of semantic
patterns for information extraction from corpora. In
Proceedings of the 9th IEEE Conference on Artificial
Intelligence for Applications (1993), 171–176.
12. Lafferty, J., McCallum, A. and Pereira, F. Conditional
random fields: Probabilistic models for segmenting
and labeling sequence data. In Proceedings of the
2001 International Conference on Machine Learning.
13. McCallum, A. Efficiently inducing features of
conditional random fields. In Proceedings of the 19th
Conference on Uncertainty in Artificial Intelligence
(Acapulco, 2003), 403–410.
14. Poon, H. and Domingos, P. Joint inference in
information extraction. In Proceedings of the 22nd
National Conference on Artificial Intelligence (2007),
913–918.
15. Popescu, A. and Etzioni, O. Extracting product
features and opinions from reviews. In Proceedings
of the Empirical Methods on Natural Language
Processing Conference (2005).
16. Popescu, A.-M. Information extraction from
unstructured Web text. Ph.D. thesis, University of
Washington (2007).
17. Riloff, E. Automatically constructing extraction
patterns from untagged text. In Proceedings of the
13th National Conference on Artificial Intelligence
(1996), 1044–1049.
18. Riloff, E. and Jones, R. Learning dictionaries for
information extraction by multi-level bootstrapping.
In Proceedings of the AAAI-99 Conference (1999),
1044–1049.
19. Schubert, L. Can we derive general world knowledge
from texts? In Proceedings of the Human Language
Technology Conference (2002).
20. Shinyama, Y. and Sekine, S. Preemptive information
extraction using unrestricted relation discovery. In
Proceedings of the Human Language Technology/NAACL
Conference (2006).
21. Soderland, S. Learning information extraction rules
for semi-structured and free text. Machine Learning
34, 1–3 (1999), 233–272.
22. Soderland, S., Fisher, D., Aseltine, J. and Lehnert,
W. CRYSTAL: Inducing a conceptual dictionary.
In Proceedings of the 14th International Joint
Conference on Artificial Intelligence (1995),
1314–1321.
23. Weld, D., Wu, F., Adar, E., Amershi, S., Fogarty, J.,
Hoffmann, R., Patel, K. and Skinner, M. Intelligence in
Wikipedia. In Proceedings of the 23rd Conference on
Artificial Intelligence (2008).
24. Yates, A. and Etzioni, O. Unsupervised resolution of
objects and relations on the Web. In Proceedings of
the Human-Language Technology Conference (2007).
Oren Etzioni (etzioni@cs.washington.edu) is a professor
of computer science and the founder and director of the
Turing Center at the University of Washington, Seattle.
Michele Banko (banko@cs.washington.edu) is a Ph.D.
candidate at the University of Washington, Seattle.
Stephen Soderland (soderland@cs.washington.edu)
is a research scientist in the Department of Computer
Science and Engineering at the University of Washington,
Seattle.
Daniel S. Weld (weld@cs.washington.edu) is the
Thomas J. Cable/WRF Professor of Computer Science
and Engineering at the University of Washington, Seattle.