rules. The rules were applied to Web
pages identified via search-engine
queries, and the resulting extractions
were assigned a probability using
information-theoretic measures derived from search engine hit counts.
For example, KnowItAll utilized generic extraction patterns like “<X> is
a <Y>” to find a list of candidate members X of the class Y. When this pattern was used, say, for the class
Country, it would match a sentence that
included the phrase “X is a country.”
Next, KnowItAll used frequency statistics computed by querying search
engines to identify which instantiations were most likely to be bona fide
members of the class. For example, in
order to estimate the likelihood that
“China” was the name of a country,
KnowItAll used automatically generated phrases associated with the class
Country to see if there was a high correlation between the numbers of documents containing the word “China”
and those containing the phrase
“countries such as.” Thus KnowItAll
was able to confidently label China,
France, and India as members of the
class Country while correctly knowing
that the existence of the sentence,
“Garth Brooks is a country singer” did
not provide sufficient evidence that
“Garth Brooks” is the name of a country.7 Moreover, KnowItAll learned a
set of relation-specific extraction patterns (for example, “capital of <country>”) that led it to extract additional
countries, and so on.
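
The two steps described above, generic pattern matching followed by a frequency-based check, can be illustrated with a minimal sketch. It is not KnowItAll's implementation: the toy corpus, the function names, and the simple co-occurrence ratio are assumptions made for illustration, and a small in-memory list of sentences stands in for search-engine hit counts.

```python
import re

# Toy corpus standing in for Web pages. The sentences are invented for
# illustration, apart from the article's own Garth Brooks example.
DOCS = [
    "China is a country in East Asia.",
    "Trade grew in countries such as China and India.",
    "Visitors to countries such as China need a visa.",
    "Garth Brooks is a country singer.",
    "Garth Brooks released a new album.",
    "Garth Brooks toured last year.",
]


def candidates(docs, class_word):
    """Apply the generic pattern '<X> is a <Y>' for a single class word Y."""
    pattern = re.compile(
        r"([A-Z]\w+(?: [A-Z]\w+)*) is an? " + re.escape(class_word) + r"\b"
    )
    return {m.group(1) for doc in docs for m in pattern.finditer(doc)}


def hit_count(docs, *phrases):
    """Count documents containing every phrase (a stand-in for hit counts)."""
    return sum(all(p in doc for p in phrases) for doc in docs)


def cooccurrence_score(docs, instance, discriminator):
    """Fraction of the instance's documents that also contain the discriminator.

    Assumption: a simplified ratio, not the exact statistic KnowItAll used.
    """
    total = hit_count(docs, instance)
    return hit_count(docs, instance, discriminator) / total if total else 0.0


for name in sorted(candidates(DOCS, "country")):
    print(name, round(cooccurrence_score(DOCS, name, "countries such as"), 2))
```

On this toy corpus the pattern proposes both "China" and "Garth Brooks" as candidates, but only "China" co-occurs with the phrase "countries such as", so a threshold on the score would keep China and reject Garth Brooks.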
KnowItAll is self-supervised; instead of utilizing hand-tagged training data, the system selects and labels its own training examples and
iteratively bootstraps its learning
process. But while self-supervised
systems are a species of unsupervised
systems, unlike classic unsupervised
systems they do utilize labeled examples and do form classifiers whose
accuracy can be measured using
standard metrics. Instead of relying
on hand-tagged data, self-supervised
systems autonomously “roll their
own” labeled examples. (See Feldman10 for discussion of an additional
self-supervised IE system inspired by
KnowItAll.)
While self-supervised, KnowItAll
is relation-specific. It requires a laborious bootstrapping process for each
relation of interest, and the set of relations has to be named by the human
user in advance. This is a significant
obstacle to open-ended extraction
because unanticipated concepts and
relations are often encountered while
processing text.
The Intelligence in Wikipedia
(IWP) project23 uses a different form
of self-supervised learning to train
its extractors. IWP bootstraps from
the Wikipedia corpus, exploiting the
fact that each article corresponds to a
primary object and that many articles
contain infoboxes—tabular summaries of the most important attributes
(and their values) of these objects. For
example, Figure 2 shows the “Beijing”
infobox for the class Settlement that
was dynamically generated from the
accompanying attribute/value data.
IWP is able to use Wikipedia pages with infoboxes as training data
in order to learn classifiers for page
type. By using the values of infobox
attributes to match sentences in the
article, IWP can train extractors for
the various attributes. Further, IWP
can autonomously learn a taxonomy
over infobox classes, construct schema mappings between the attributes
of parent/child classes, and thus use
shrinkage to improve both recall and
precision. Once extractors have been
successfully learned, IWP can extract
values from general Web pages in order to complement Wikipedia with
additional content.
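
The idea of using infobox values to label sentences can be sketched as follows. This is a simplified illustration under stated assumptions, not IWP's actual pipeline: the function name, the exact-substring matching rule, and the fictional article and infobox are all hypothetical.

```python
import re


def label_sentences(article_text, infobox):
    """Pair each infobox attribute with the article sentences that mention its value.

    Assumption: a case-insensitive substring match stands in for whatever
    matching heuristics a real system would need.
    """
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value.lower() in sentence.lower():
                examples.append((attribute, sentence))
    return examples


# Fictional article and infobox, invented purely for illustration.
ARTICLE = (
    "Exampleville is a town on the coast. "
    "Its population is 12,345 as of the last census. "
    "Jane Doe has served as mayor since the last election."
)
INFOBOX = {"population": "12,345", "mayor": "Jane Doe"}

for attribute, sentence in label_sentences(ARTICLE, INFOBOX):
    print(attribute, "->", sentence)
```

Each (attribute, sentence) pair produced this way is a labeled training example obtained without hand tagging; a per-attribute extractor could then be trained on such pairs.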
Open Information Extraction
While most IE work has focused on a
small number of relations in specific
preselected domains, certain corpora—encyclopedias, news stories,
email, and the Web itself—are unlikely to be amenable to these methods. Under such circumstances, the
relations of interest are both numerous and serendipitous—they are not
known in advance. In addition, the
Web corpus contains billions of documents, necessitating highly scalable
extraction techniques.
The challenge of Web extraction
led us to focus on Open Information Extraction (Open IE), a novel
extraction paradigm that tackles an
unbounded number of relations, eschews domain-specific training data,
and scales linearly (with low constant
factor) to handle Web-scale corpora.
For example, an Open IE system
might operate in two phases. First, it
would learn a general model of how
relations are expressed in a particular
language. Second, it could utilize this
model as the basis of a relation-independent extractor whose sole input
is a corpus and whose output is a set
of extracted tuples that are instances
of a potentially unbounded set of
relations. This general model would be based on unlexicalized
features such as part-of-speech tags
(for example, the identification of a
verb in the surrounding context) and
domain-independent regular expressions (for example, the presence of
capitalization and punctuation).
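
To make the notion of a relation-independent extractor concrete, here is a minimal sketch that relies only on part-of-speech tags and never on the words themselves. It is an illustration under assumptions, not the extractor of any system discussed here: the single "noun phrase, verb, optional preposition, noun phrase" shape, the function names, and the hand-tagged example sentences are all hypothetical, and a real system would obtain tags from a tagger and learn its model rather than hard-code it.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags


def run(tagged, start, accept):
    """Return the index just past the longest run of accepted tags from `start`."""
    end = start
    while end < len(tagged) and accept(tagged[end][1]):
        end += 1
    return end


def words(tagged, start, end):
    """Join the tokens in tagged[start:end] into a string."""
    return " ".join(token for token, _ in tagged[start:end])


def extract_tuples(tagged):
    """Extract (arg1, relation, arg2) triples of the shape Noun+ Verb+ (Prep)? (Det)* Noun+."""
    triples = []
    i = 0
    while i < len(tagged):
        arg1_end = run(tagged, i, lambda t: t in NOUN_TAGS)
        if arg1_end == i:  # no noun phrase starts at this position
            i += 1
            continue
        rel_end = run(tagged, arg1_end, lambda t: t.startswith("VB"))
        has_verb = rel_end > arg1_end
        if has_verb and rel_end < len(tagged) and tagged[rel_end][1] == "IN":
            rel_end += 1  # optional preposition, as in "moved to"
        arg2_start = run(tagged, rel_end, lambda t: t == "DT")  # skip determiners
        arg2_end = run(tagged, arg2_start, lambda t: t in NOUN_TAGS)
        if has_verb and arg2_end > arg2_start:
            triples.append((
                words(tagged, i, arg1_end),
                words(tagged, arg1_end, rel_end),
                words(tagged, arg2_start, arg2_end),
            ))
        i = arg1_end
    return triples


# Hand-tagged, invented sentences expressing two unrelated relations.
S1 = [("Acme", "NNP"), ("acquired", "VBD"), ("the", "DT"), ("startup", "NN"), (".", ".")]
S2 = [("Alice", "NNP"), ("moved", "VBD"), ("to", "IN"), ("Springfield", "NNP"), (".", ".")]

print(extract_tuples(S1))
print(extract_tuples(S2))
```

The same code yields a triple for both sentences even though it contains no vocabulary specific to "acquired" or "moved to", which is the sense in which such features are unlexicalized.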
Is there a general model of relationships in English, though? To address this question we examined a
sample of 500 sentences selected at
random from the IE training corpus
developed by Bunescu and Mooney.6
We found that most relationships expressed in this sample could in fact
be characterized by a compact set of
relation-independent patterns. See
Table 1 for these patterns and an estimate of their frequency.a In contrast,
traditional IE methods learn lexical
models of individual relations from
hand-labeled examples of sentences
that express these relations. Such an
IE system might learn that the presence of the phrase “headquarters
located in” indicates an instance of
the headquarters relation. But lexical
features are relation-specific. When
using the Web as a corpus, the relations of interest are not known prior
to extraction, and their number is immense. Thus an Open IE system cannot rely on hand-labeled examples of
each relation. Table 2 summarizes
the differences between traditional
and Open IE.
Systems such as KnowItAll and
IWP may be seen as steps in the direction of Open IE, but the former
didn’t scale as well as desired and the
latter seems incapable of extracting
more than 40,000 relations. Knext19
appears to fit the Open IE paradigm,
(Footnote a: For simplicity, we restricted our study to binary relationships.)