<P1> Donald E. Martella </P1>, formerly vice president, was named <POS> president </POS> of <ORG> Topologix, Inc </ORG>. <P1> He </P1> succeeds <P2> Jack Harper </P2>, a company founder who was named chairman.
Supervised machine learning
“<P1> was named <POS> of <ORG>”
“<P1> succeeds <P2>”
<P1> Mr. Smith </P1> succeeds <P2> Jack Harper </P2>.
Figure 2: Sample Wikipedia infobox and the attribute/value data used to generate it.
{{Infobox Settlement
|official_name = Běijīng
|other_name = 北京
|native_name =
|settlement_type = [[Municipality of China|Municipality]]
|image_skyline = SA Temple of Heaven.jpg
|image_caption = The [[Temple of Heaven]], a symbol of Beijing
|citylogo_size =
|image_map = China-Beijing.png
|mapsize = 275px
|map_caption = Location within China
|subdivision_type = Country
|subdivision_name = [[People’s Republic of China]]
|subdivision_type1 = [[Political divisions of China#County level|County-level&nbsp;divisions]]
|subdivision_name1 = 18
|subdivision_type2 = [[Political divisions of China#Township level|Township divisions]]
|subdivision_name2 = 273
|leader_title = [[Communist Party of China|CPC]] Beijing Committee Secretary
|leader_name = [[Liu Qi (Communist)|Liu Qi]]
|leader_title1 = [[Mayor]]
|leader_name1 = [[Wang Qishan]]
|established_title = Settled
|established_date = ca. 473 BC
…
}}
of MUC-3 and MUC-4 was Latin-American Terrorism,2 and the task was to
fill templates with information about
specific terrorist actions, with fields
for the type of event, date, location,
perpetrators, weapons, victims, and
physical targets. Subsequent MUC
conferences focused on domains
such as joint ventures, microelectronics, or management succession.
The first IE systems relied on some
form of pattern-matching rules that
were manually crafted for each domain. Rules that assigned the semantic class PhysicalTarget to the
term bank in the terrorism domain,
for example, needed to be altered to
identify instances of the class
Corporation in the joint-ventures domain.
These systems were clearly not scalable or portable across domains.
Supervised Methods. Modern IE,
beginning with the works of Soderland,21,22 Riloff,17 and Kim and Moldovan,11 automatically learns an extractor from a training set in which
domain-specific examples have been
tagged. With this machine-learning approach, an IE system uses a
domain-independent architecture
and sentence analyzer. When the examples are fed to machine-learning
methods, domain-specific extraction
patterns can be automatically learned
and used to extract facts from text.
Figure 1 shows an example of such
extraction rules, learned to recognize
persons moving into and out of top
corporate-management positions.
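The idea of learning extraction patterns from tagged examples can be illustrated with a minimal sketch. It is not the method of any of the cited systems: it only turns the short contexts between adjacent tagged spans in the Figure 1 sentence into slot patterns, then applies them to new text. The function names, the two-word gap limit, and the capitalized-name heuristic are all assumptions made for illustration.

```python
import re

# Tagged training sentence taken from Figure 1.
TAGGED = (
    "<P1> Donald E. Martella </P1>, formerly vice president, was named "
    "<POS> president </POS> of <ORG> Topologix, Inc </ORG>. "
    "<P1> He </P1> succeeds <P2> Jack Harper </P2>, a company founder "
    "who was named chairman."
)

SPAN = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")
# Assumed heuristic: a name is a run of capitalized tokens; a period is kept
# only when another capitalized token follows (as in "Mr. Smith").
TOKEN = r"[A-Z]\w*(?:\.(?=\s+[A-Z]))?"
NAME = rf"{TOKEN}(?: {TOKEN})*"


def induce_patterns(tagged, max_gap_words=2):
    """Collect (left_tag, gap_text, right_tag) from adjacent tagged spans."""
    spans = list(SPAN.finditer(tagged))
    patterns = set()
    for left, right in zip(spans, spans[1:]):
        gap = tagged[left.end():right.start()].strip()
        is_short_phrase = (
            re.fullmatch(r"[A-Za-z ]+", gap) is not None
            and len(gap.split()) <= max_gap_words
        )
        if is_short_phrase and left.group(1) != right.group(1):
            patterns.add((left.group(1), gap, right.group(1)))
    return patterns


def extract(text, patterns):
    """Apply each induced pattern to untagged text and return slot fillers."""
    results = []
    for left, gap, right in sorted(patterns):
        regex = rf"(?P<{left}>{NAME}) {re.escape(gap)} (?P<{right}>{NAME})"
        match = re.search(regex, text)
        if match:
            results.append(match.groupdict())
    return results


patterns = induce_patterns(TAGGED)
facts = extract("Mr. Smith succeeds Jack Harper.", patterns)
```

On this input the sketch recovers the "<P1> succeeds <P2>" pattern and fills it from the new sentence. It does not recover "<P1> was named <POS> of <ORG>", because that would require generalizing over the intervening appositive phrase, which a real learner does and this sketch does not attempt.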
The development of suitable training data for IE requires substantial effort and expertise. DIPRE,5 Snowball,1 and Meta-Bootstrapping18 sought to
address this problem by reducing the
amount of manual labor necessary to
perform relation-specific extraction.
Rather than demand hand-tagged corpora, these systems required a user to
specify relation-specific knowledge
through either of the following: a
small set of seed instances known to
satisfy the relation of interest; or a set
of hand-constructed extraction patterns to begin the training process.
For instance, by specifying the set {Bolivia, city, Colombia, district, Nicaragua}
over a corpus in the terrorism
domain, these IE systems learned
patterns (for example, headquartered
in <x>, to occupy <x>, and shot in <x>)
that identified additional names of
locations. Recent advances include
automatic induction of features when
learning conditional random fields13
and high-level specification of extraction frameworks using Markov logic networks.14 Nevertheless, the amount
of manual effort still scales linearly
with the number of relations of interest, and these target relations must
be specified in advance.
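The seed-based approach described above can be sketched as a single bootstrapping round: find where seed instances occur, keep the words just before them as patterns, then use those patterns to propose new instances. This is a simplified illustration rather than the algorithm of DIPRE, Snowball, or Meta-Bootstrapping; the toy corpus, the two-word context window, and the function names are assumptions, and real systems iterate and score patterns to limit drift.

```python
import re

# Seed set from the text; the corpus sentences are made up for illustration.
SEEDS = {"Bolivia", "city", "Colombia", "district", "Nicaragua"}
CORPUS = [
    "The rebels were headquartered in Bolivia.",
    "A judge was shot in Colombia.",
    "Troops moved to occupy Nicaragua.",
    "The group was headquartered in Peru.",
    "A journalist was shot in Guatemala.",
    "Forces tried to occupy Honduras.",
]


def tokenize(sentence):
    return re.findall(r"\w+", sentence)


def learn_patterns(corpus, seeds, context=2):
    """Record the `context` words that precede each seed occurrence."""
    patterns = set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token in seeds and i >= context:
                patterns.add(" ".join(tokens[i - context:i]).lower())
    return patterns


def apply_patterns(corpus, patterns, context=2):
    """Return every word that directly follows a learned pattern."""
    found = set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i in range(context, len(tokens)):
            if " ".join(tokens[i - context:i]).lower() in patterns:
                found.add(tokens[i])
    return found


patterns = learn_patterns(CORPUS, SEEDS)
new_locations = apply_patterns(CORPUS, patterns) - SEEDS
```

With this toy corpus the learned patterns correspond to the examples in the text (headquartered in <x>, to occupy <x>, shot in <x>), and the new candidates are the location names that were not among the seeds. The seed set itself still has to be chosen per relation, which is the manual cost the paragraph above refers to.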
Self-Supervised Methods. The KnowItAll Web IE system9 took the next step
in automating IE by learning to label
its own training examples using only
a small set of domain-independent
extraction patterns. KnowItAll was
the first published system to carry out
extraction from Web pages that was
unsupervised, domain-independent,
and large-scale.
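As a rough illustration of how a small set of domain-independent patterns can label training examples without hand tagging, the sketch below fills generic templates with a class name and collects the matching noun phrases. The three templates, the capitalized-phrase heuristic, and the sample text are assumptions chosen for the example; they are not taken from the KnowItAll implementation.

```python
import re

# Illustrative generic templates; {cls} is a class name, {x} an instance slot.
GENERIC_PATTERNS = [
    "{cls} such as {x}",
    "{cls} including {x}",
    "{x} and other {cls}",
]
# Assumed heuristic: an instance is a run of capitalized words.
INSTANCE = r"(?P<x>[A-Z]\w*(?: [A-Z]\w*)*)"


def instantiate(class_plural):
    """Turn each generic template into a regex specific to one class."""
    return [
        re.compile(template.format(cls=re.escape(class_plural), x=INSTANCE))
        for template in GENERIC_PATTERNS
    ]


def label_examples(text, class_plural):
    """Collect instances of the class; these can serve as training labels."""
    instances = set()
    for rule in instantiate(class_plural):
        instances.update(rule.findall(text))
    return instances


TEXT = (
    "He visited cities such as Paris last year. "
    "Critics praised Casablanca and other films."
)
cities = label_examples(TEXT, "cities")
films = label_examples(TEXT, "films")
```

The same templates serve both classes, so adding a new class in this sketch requires only its name rather than new hand-tagged text.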
For a given relation, the set of generic patterns was used to automatically instantiate relation-specific extraction rules, which were then used
to learn domain-specific extraction