table 1: taxonomy of binary relationships. nearly 95% of 500 randomly
selected sentences belong to one of the eight categories noted here.
Relative frequency
37. 8
category
verb
22. 8
noun + Prep
16.0
verb + Prep
9. 4
Infinitive
5. 2
modifier
1. 8
coordinaten
1.0
coordinatev
0.8
Appositive
simplified Lexico-syntactic Pattern
e1 verb e2
X established Y
e1 nP Prep e2
X settlement with Y
e1 verb Prep e2
X moved to Y
e1 to verb e2
X plans to acquire Y
e1 verb e2 noun
X is Y winner
e1 (and|,|-|:) e2 nP
X-Y deal
e1 (and|,) e2 verb
X , Y merge
e1 nP (:|,)? e2
X hometown : Y
table 2: the contrast between traditional and open ie.
Input
traditional ie
corpus + labeled data
Relations
Complexity
Specified In Advance
O (D R)
D documents, R relations
open ie
corpus + domain-Independent
methods
discovered Automatically
O (D)
D documents
figure 3: information extraction as sequence labeling. a cRf is used to identify
the relationship, born in, between Kafka and Prague. entities are labeled as ent.
the B-ReL label indicates the start of a relation, with i-ReL indicating the continuation
of the sequence.
Kafka , a writer born in Prague , wrote “ The metamorphosis .”
enT
o
o
enT
b-rel
I-rel
enT
Kafka
,
a
writer
born
in
Prague
but its precision, recall, and scalability have yet to be demonstrated.
the textRunner system
TextRunner3, 4 is a fully implemented
Open IE system that utilizes the two-phase architecture described here.b
TextRunner extracts high-quality information from sentences in a scalable and general manner. Instead of
requiring relations to be specified in
its input, TextRunner learns the relations, classes, and entities from its
corpus using its relation-independent
extraction model.
TextRunner’s first phase uses a
general model of language. Specifically, it trains a graphical model called a
conditional random field (CRF)
12 to
maximize the conditional probability
of a finite set of labels, given a set of
input observations. By making a first-order Markov assumption about the
dependencies among the output variables, and thus arranging variables
sequentially in a linear chain, extraction can be treated as a sequence-la-beling problem. Using a CRF, the extractor learns to assign labels to each
of the words in a sentence denoting
the beginning and end both of entity
names and the relationship string.c
See Figure 3 for an illustration.
In the second phase, TextRunner’s
extractor scans sentences linearly and
rapidly extracts one or more textual
triples that aim to capture (some of)
the relationships in each sentence.
For example, given the sentence “
Kafka, a writer born in Prague, wrote The
Metamorphosis,” the extractor forms
the triple (Kafka, born in, Prague).
The triple consists of three strings, in
which the first and third are meant to
denote entities and the second to denote the relationship between them.
Of course, there are many subtle-ties to successful extraction from a
corpus as large and heterogeneous as
the Web. First, the same entities may
be referred to by a variety of names
(for example, Edison, Thomas Edison,
Thomas Alva Edison, and so on). Second, the same string (say, John Smith)
may refer to different entities. Third,
vagaries of natural language (such as
pronoun resolution, metaphor, anaphora, complex or ungrammatical
sentences) have to be unraveled to
correctly extract information. Fourth,
the Web is rife with incorrect information (for example, Elvis killed JFK). In
fact, there are many more challenges
that we do not have room to discuss
here, though we have addressed some
of them in our research. For instance,
the Resolver system24 computes the
probability that two strings are synonymous based on a highly scalable and
unsupervised analysis of TextRunner
tuples. Numerous other issues remain
b See www.cs.washington.edu/research/textrunner
c Although TextRunner has initially focused
on extracting binary relationships, its model
structure can be extended to identify relationships with greater arity.