Open Information Extraction from the Web

Targeted IE methods are transforming into open-ended techniques.

By Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld
Say you want to select a quiet, centrally located Manhattan hotel. Google returns an overwhelming seven million results in response to the query “new york city hotels.” Or, say you are trying to assemble a program committee for an annual conference composed of researchers who have published at the conference in previous years, and to balance it geographically. While today’s Web search engines identify potentially relevant documents, you are forced to sift through a long list of URLs, scan each document to identify any pertinent bits of information, and assemble the extracted findings before you can solve your problem.
Over the coming decade, Web searching will increasingly transcend keyword queries in favor of systems that automate the tedious and error-prone task of sifting through documents. Moreover, we will build systems that fuse relevant pieces of information into a coherent overview, thus reducing from hours to minutes the time required to perform complex tasks.
Information extraction (IE)—a
venerable technology that maps natural-language text into structured
relational data—offers a promising
avenue toward this goal. Although extracting data from text is inherently
challenging, given the ambiguous
and idiosyncratic nature of natural
language, substantial progress has
been made over the last few decades.
This article surveys a range of IE
methods, but we highlight Open Information Extraction,3,4 wherein the
identities of the relations to be extracted are unknown and the billions
of documents found on the Web necessitate highly scalable processing.
At the core of an IE system is an extractor, which processes text; it overlooks irrelevant words and phrases
and attempts to home in on entities
and the relationships between them.
For example, an extractor might map
the sentence “Paris is the stylish capital of France” to the relational tuple
(Paris, CapitalOf, France), which
might be represented in RDF or another formal language.
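To make the extractor's mapping concrete, here is a minimal sketch of a pattern-based extractor for just this one sentence shape. It is a toy illustration, not any system described in this article: the regular expression, function name, and relation label are all assumptions chosen for the example.

```python
import re

# Toy pattern for sentences of the form "<X> is the ... capital of <Y>".
# Real extractors rely on parsers and learned models rather than a single
# hand-written regular expression.
CAPITAL_PATTERN = re.compile(
    r"^(?P<arg1>[A-Z]\w+) is the (?:\w+ )*capital of (?P<arg2>[A-Z]\w+)"
)

def extract_capital(sentence: str):
    """Map a sentence to a (subject, relation, object) tuple, if it matches."""
    match = CAPITAL_PATTERN.match(sentence)
    if match:
        return (match.group("arg1"), "CapitalOf", match.group("arg2"))
    return None

print(extract_capital("Paris is the stylish capital of France"))
# ('Paris', 'CapitalOf', 'France')
```

The resulting tuple is the relational form an IE system would then store or serialize, for example as an RDF triple.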
Considerable knowledge is necessary to accurately extract these tuples
from a broad range of text. Existing
techniques obtain it in ways ranging
from direct knowledge-based encoding (a human enters regular expressions or rules) to supervised learning
(a human provides labeled training
examples) to self-supervised learning
(the system automatically finds and
labels its own examples). Here, we
briefly survey these methods.
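The self-supervised end of this spectrum can be sketched in a few lines: a known seed tuple is used to label the system's own training examples, by treating any sentence mentioning both arguments as a positive instance of the relation. This sketch is illustrative only; the corpus, seed list, and matching criterion are assumptions, and real systems use far more careful matching.

```python
# A seed tuple the system already trusts.
SEED_TUPLES = [("Paris", "CapitalOf", "France")]

# A stand-in for a large document collection.
CORPUS = [
    "Paris is the stylish capital of France.",
    "France borders Spain and Italy.",
]

def label_examples(corpus, seeds):
    """Return (sentence, relation) pairs where both seed arguments co-occur."""
    examples = []
    for sentence in corpus:
        for arg1, relation, arg2 in seeds:
            if arg1 in sentence and arg2 in sentence:
                examples.append((sentence, relation))
    return examples

print(label_examples(CORPUS, SEED_TUPLES))
# [('Paris is the stylish capital of France.', 'CapitalOf')]
```

The labeled pairs then serve as training data for a learned extractor, removing the human from the labeling loop.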
Knowledge-Based Methods. The first
IE systems were domain-specific. A
series of DARPA Message Understanding Conferences (MUCs) challenged
the NLP community to build systems
that handled robust extraction from
naturally occurring text. The domain