vertical-domain portals, including
product search and the Libra portal
for scholarly search on extracted records about authors, papers, conferences, and communities.
Once the facts are gathered and organized into searchable form, a typical IR issue arises concerning how a
system should rank the results of an
entity-centric query. To this end, an
advanced statistical language model
(LM) has been extended from the form
of document-oriented bags of words
to the form of structured records.
Libra is an example of the Statistical-Web approach.
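To make the idea of a record-oriented language model concrete, here is a minimal sketch, not Libra's actual model. It assumes each record is a dictionary of text fields, that per-field weights (`field_weights`) are chosen by hand, and that Jelinek-Mercer smoothing with parameter `lam` is applied against field-level collection statistics. All names here are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_collection(records, fields):
    # Background term counts per field, used for smoothing.
    stats = {f: Counter() for f in fields}
    for r in records:
        for f in fields:
            stats[f].update(tokenize(r.get(f, "")))
    return stats

def score_record(query, record, collection, field_weights, lam=0.5):
    # Log-likelihood of the query under a mixture of per-field
    # language models (assumed weights, assumed smoothing).
    log_p = 0.0
    for term in tokenize(query):
        p_term = 0.0
        for field, weight in field_weights.items():
            tokens = tokenize(record.get(field, ""))
            p_field = tokens.count(term) / len(tokens) if tokens else 0.0
            total = sum(collection[field].values())
            p_bg = collection[field][term] / total if total else 0.0
            p_term += weight * ((1 - lam) * p_field + lam * p_bg)
        # Floor unseen terms so the logarithm stays defined.
        log_p += math.log(max(p_term, 1e-12))
    return log_p

def rank(query, records, field_weights):
    collection = build_collection(records, field_weights)
    return sorted(
        records,
        key=lambda r: score_record(query, r, collection, field_weights),
        reverse=True,
    )
```

The difference from a bag-of-words document model is that each field contributes its own term distribution, so a match in a paper's title can count more than a match in its venue.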
Cimple/DBLife. The Cimple project,11,22
being carried out jointly by the
University of Wisconsin and Yahoo!
Research, is similar to Libra, aiming
to generate and maintain community-specific portals with structured information gathered from Web sources.
However, it applies a number of methods to achieve this goal, as we illustrate
by discussing its flagship application:
the DBLife portal.
DBLife features automatically compiled “super-homepages” of researchers with bibliographic data, as well as
facts about community services (such
as PC work), colloquium lectures, and
more. For gathering and reconciling these facts, Cimple has a suite of
DB-style extractors based on pattern
matching and dictionary lookups. The
extractors are combined into execution plans and periodically applied to
a carefully selected set of relevant Web
sources, including prominent sites like
DBLP and the Dbworld archive and important conference and university pages that are selected semi-automatically.
While the overall approach makes use
of IR concepts like tf*idf-based ranking
and Web-graph link analysis, Cimple
emphasizes a more DB-oriented toolkit
for declarative extraction programs, using Datalog as a query-language framework and DB rewriting techniques for
Cimple leans more toward the Semantic-Web approach and less toward
a Statistical-Web approach. In addition, it contains Social-Web elements,
most notably, a Wiki-based mechanism for users to provide feedback
about incorrect facts they identify on
KnowItAll/TextRunner. Both Libra
and Cimple operate on the basis of one
page at a time, then aim to extract as
many facts as possible from the given
page. A dual view is to focus on one or
more entity types or relationship types,
aiming to populate them by inspecting many pages and exploiting their
redundancies. For example, a user
might want to find all cities on planet
Earth, along with all scientists, guitar
players, and other unary relations (entity
types). For binary relations, a user
might consider gathering all CEOs
of all companies, all (city, river) pairs
where a city is located on a river, or the
answers to questions like: Who discovered what? and Which enzyme triggers
which biochemical process?
The KnowItAll project5,6,12 at the
University of Washington in Seattle
has pursued this goal, using techniques that combine pattern matching, linguistic analysis, and statistical
learning. KnowItAll starts with a set of
seeds: the instances of the relation of
interest (such as a set of cities or a set
of (city, river) pairs).12 This is the only
“training input” needed by KnowItAll,
which automatically finds sentences
on the Web with the seeds, extracts linguistic patterns surrounding the seeds,
performs statistical analyses to identify
strong patterns, and finally identifies
the most useful patterns to obtain extraction rules. For example, the phrase
templates “located in downtown $x”
and “$x is located on the banks of $y”
may be determined to be good rules for
extracting cities and (city, river) pairs,
respectively. Now these rules can be applied to newly seen Web pages, yielding
facts or fact candidates, some in turn
considered as new, additional seeds.
Statistical inference is needed to
identify good rules and to assess the confidence in the harvested facts.
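The seed-driven loop described above can be sketched as follows. This is a toy illustration, not KnowItAll's implementation. It assumes patterns are simply the text between the two arguments of a seed pair, that entity names are capitalized words, and that a raw support count (`min_support`) stands in for the statistical assessment of pattern strength. All function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: entity names are one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def induce_patterns(sentences, pairs):
    # Count the text found between x and y for each known pair.
    counts = Counter()
    for s in sentences:
        for x, y in pairs:
            i, j = s.find(x), s.find(y)
            if i != -1 and j != -1 and i + len(x) <= j:
                middle = s[i + len(x):j]
                if middle.strip():
                    counts[middle] += 1
    return counts

def apply_patterns(sentences, middles):
    # Use each strong pattern as an extraction rule "$x<middle>$y".
    facts = set()
    for m in middles:
        rx = re.compile(rf"({NAME}){re.escape(m)}({NAME})")
        for s in sentences:
            facts.update(rx.findall(s))
    return facts

def bootstrap(sentences, seeds, rounds=2, min_support=2):
    facts = set(seeds)
    for _ in range(rounds):
        counts = induce_patterns(sentences, facts)
        strong = [m for m, c in counts.items() if c >= min_support]
        # Harvested facts become additional seeds for the next round.
        facts |= apply_patterns(sentences, strong)
    return facts
```

On a corpus where two seed pairs share the context " is located on the banks of the ", that context passes the support threshold and then extracts a third, previously unknown pair from a sentence with the same shape. A context seen with only one seed is discarded.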
The TextRunner tool5 pays special
attention to scalability and simplifies
the entire fact-gathering pipeline. It
has a completely unsupervised bootstrapping phase for identifying simple
patterns, just enough to identify, with
high confidence, noun phrases and
verbal patterns. When TextRunner sees
a new Web page, it aggressively extracts
all potentially meaningful instances of
all possible binary relation types from
the page text; Banko et al.5 refer to this
processing mode as open information
extraction, or “machine reading.”
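As a rough illustration of the open-extraction mode, the sketch below pulls (noun phrase, relation phrase, noun phrase) triples from text without naming any relation in advance. It is a crude stand-in, not TextRunner's method: it assumes noun phrases are capitalized word sequences and relation phrases are short runs of lowercase words, whereas the real system identifies both with learned patterns. The names `NP`, `REL`, and `open_extract` are hypothetical.

```python
import re

# Assumption: a noun phrase is a sequence of capitalized words.
NP = r"[A-Z][\w-]*(?: [A-Z][\w-]*)*"
# Assumption: a relation phrase is one to five lowercase words.
REL = r"[a-z]+(?: [a-z]+){0,4}"
TRIPLE = re.compile(rf"({NP}) ({REL}) ({NP})")

def open_extract(text):
    # Return every (arg1, relation, arg2) candidate, sentence by sentence.
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in TRIPLE.finditer(sentence):
            triples.append(match.groups())
    return triples
```

Because no relation type is fixed beforehand, "discovered" and "is located on the" would both come out as relation phrases of candidate facts; deciding which candidates to trust is a separate, statistical step.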
April 2009 | Vol. 52 | No. 4 | Communications of the ACM