as open problems for future work.
Post-extraction, TextRunner’s collection of triples is made efficiently searchable using Lucene, a high-performance indexing and search engine (http://lucene.apache.org/). Thus TextRunner can be queried for tuples containing particular entities (for example, Edison), relationships (invented), or relationships between two entities (such as Microsoft and IBM). The triples returned in response to a query are ranked by a fairly complex formula, but a key parameter that boosts a tuple’s rank is the number of times it has been extracted from the Web. Because the Web corpus is highly redundant, we have found that repeated extraction is strongly correlated with an increased likelihood that an extraction is correct.
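As a minimal sketch of the idea, the following Java fragment indexes (arg1, relation, arg2) tuples as Lucene documents and queries them by relation and argument. The field names, sample tuples, and extraction counts are invented for illustration; this is not TextRunner’s actual schema or ranking formula, and the API calls assume a reasonably recent Lucene release.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TripleIndex {

    // Each extracted tuple becomes one Lucene document with exact-match fields
    // for the two arguments and the relation, plus the extraction count, which
    // redundancy-based ranking can use as a correctness signal.
    static void addTriple(IndexWriter w, String arg1, String rel, String arg2,
                          int extractionCount) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("arg1", arg1, Field.Store.YES));
        doc.add(new StringField("rel", rel, Field.Store.YES));
        doc.add(new StringField("arg2", arg2, Field.Store.YES));
        doc.add(new StoredField("count", extractionCount));  // illustrative value
        w.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();  // in-memory index for the sketch
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        addTriple(writer, "Edison", "invented", "light bulb", 412);
        addTriple(writer, "antibiotics", "kill", "bacteria", 938);
        writer.close();

        // A question such as "What kills bacteria?" becomes a lookup for tuples
        // with the matching relation and second argument; arg1 holds the answer.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        BooleanQuery query = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("rel", "kill")), BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("arg2", "bacteria")), BooleanClause.Occur.MUST)
            .build();
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            Document d = searcher.doc(hit.doc);
            System.out.println(d.get("arg1") + " (extracted "
                + d.getField("count").numericValue() + " times)");
        }
    }
}
```

Storing the extraction count alongside each tuple makes it easy to boost frequently extracted tuples at ranking time, in the spirit of the redundancy signal described above.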
We have run TextRunner on a collection of over 120 million Web pages
and extracted over 500 million tuples.
By analyzing random samples of the
output, we have determined that the
precision of the extraction process
exceeds 75% on average [4]. In collaboration with Google, we have also run a version of TextRunner on over one billion public Web pages and found that an order-of-magnitude larger corpus boosts both precision and recall. Other researchers have investigated techniques closely related to Open IE, but at a substantially smaller scale [20, 23].
applications of open ie
IE has numerous applications, but some tasks require the full power of Open IE because of the scope and diversity of the information to be extracted. This diversity is often referred to as the “long tail,” reflecting the distribution of information requests: some are very common, but most are issued infrequently.
We consider three such tasks here. First and foremost is “question answering,” the task of succinctly providing an answer to a user’s factual question. In Figure 4, for example, the question is “What kills bacteria?” It turns out that the most comprehensive answer to that question is produced by collecting information across thousands of Web sites that address this topic. Using Open IE, the range of questions TextRunner can address mirrors the unbounded scope and diversity of its Web corpus.
The two additional tasks are:
• “Opinion mining,” in which Open IE can extract opinion information about particular objects (including products, political candidates, and more) contained in blog posts, reviews, and other texts.
• “Fact checking,” in which Open IE can identify assertions that directly or indirectly conflict with the body of knowledge extracted from the Web and various other knowledge bases.
Opinion Mining is the process of taking a corpus of text expressing multiple opinions about a particular set of entities and creating a coherent overview of those opinions. Through this process, opinions are labeled as positive or negative, salient attributes of the entities are identified, and specific sentiments about each attribute are extracted and compared.
In the special case of mining product reviews, opinion mining can be decomposed into the following main subtasks, originally described in Popescu [15] (see the sketch after this list):
1. Identify product features. In a
given review, features can be explicit
(for example, “the size is too big”) or
implicit (“the scanner is slow”).
2. Identify opinions regarding product features. For example, “the size is
too big” contains the opinion phrase
“too big,” which corresponds to the
“size” feature.
3. Determine the polarity of opinions. Opinions can be positive (for example, “this scanner is so great”) or
negative (“this scanner is a complete
disappointment”).
4. Rank opinions based on their
strength. “Horrible,” say, is a stronger
adjective than “bad.”
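Purely to make the decomposition concrete, the toy sketch below wires subtasks 2 through 4 together using a tiny hand-coded lexicon. The lexicon values, the intensifier rule, and all class and method names are invented for this example; real systems such as Opine (described next) induce polarity and strength without such hand labeling.

```java
import java.util.List;
import java.util.Map;

public class OpinionToy {
    // Hypothetical polarity/strength lexicon: the sign encodes polarity
    // (subtask 3), the magnitude encodes strength (subtask 4).
    static final Map<String, Double> LEXICON = Map.of(
            "great", 1.0, "good", 0.5, "disappointment", -0.9,
            "slow", -0.5, "bad", -0.5, "horrible", -1.0, "big", -0.3);

    record Opinion(String feature, String phrase, double score) {}

    // Subtasks 2-3: score an opinion phrase from its head word; intensifiers
    // such as "too" and "so" amplify the base strength.
    static double score(String phrase) {
        String[] words = phrase.split("\\s+");
        double base = LEXICON.getOrDefault(words[words.length - 1], 0.0);
        return (phrase.startsWith("too ") || phrase.startsWith("so ")) ? base * 1.5 : base;
    }

    public static void main(String[] args) {
        // (feature, opinion phrase) pairs, as if produced by subtasks 1-2.
        List<Opinion> opinions = List.of(
                new Opinion("size", "too big", score("too big")),
                new Opinion("scanner", "slow", score("slow")),
                new Opinion("scanner", "so great", score("so great")));

        // Subtask 4: rank opinions by strength (absolute score).
        opinions.stream()
                .sorted((a, b) -> Double.compare(Math.abs(b.score()), Math.abs(a.score())))
                .forEach(o -> System.out.printf("%-8s %-12s %s (%.2f)%n",
                        o.feature(), "\"" + o.phrase() + "\"",
                        o.score() >= 0 ? "positive" : "negative", o.score()));
    }
}
```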
Opine [16] is an unsupervised information-extraction system that embodies solutions to all of these subtasks. It relies on Open IE techniques to address the broad and diverse range of products without requiring hand-tagged examples of each product type. Opine was the first system to report its precision and recall on the tasks of opinion-phrase extraction and opinion-polarity determination in the context of known product features and sentences. When tested on hotels and consumer electronics, Opine extracted opinions with a precision of 79% and a recall of 76%, and identified the polarity of opinions with a precision of 86% and a recall of 89%.
Fact Checking. Spell checkers and grammar checkers are word-processing utilities that we have come to take