could pave the way toward semantic IR
on digital libraries (such as PubMed),
news, and blogs and also aid natural-language question answering and
searching the deep, or hidden, Web.
Harvesting, Searching,
Ranking the Web
The Web has the potential to be the
world’s most comprehensive knowledge base, but we are still far from exploiting it. Valuable scientific and cultural content is mixed up with huge
amounts of noisy, low-quality, unstructured text and media. The challenge is
how to extract the important facts from
the Web and organize them into an
explicit knowledge base that captures
entities and semantic relationships
among them. Imagine a formally structured Wikipedia with the same scale
and richness as Wikipedia itself, but
one that offers a precise and concise representation of knowledge and supports
expressive and precise querying.
Figure 2 outlines what such a knowledge base might look like, depicting an
excerpt from our own Yet Another Great
Ontology (YAGO) knowledge base,24 a
typed entity-relationship graph that can
be represented in the RDF or Owl-Lite
data models. Building and maintaining it in a largely automated manner is
not only difficult but also an opportunity for
computer science to contribute toward
high-value assets for science, culture,
and society. DB and IR methods
have the potential to play major
roles in this endeavor.
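To make the idea of a typed entity-relationship graph concrete, the following is a minimal sketch that stores facts as subject-predicate-object triples, in the spirit of RDF. The facts are hand-written toy data rather than an excerpt of YAGO, and the names `Triple`, `match`, `types_of`, and the predicate labels are assumptions made for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hand-written toy facts for illustration; not taken from YAGO itself.
FACTS = [
    Triple("Max_Planck", "type", "physicist"),
    Triple("Max_Planck", "bornIn", "Kiel"),
    Triple("Max_Planck", "hasWon", "Nobel_Prize"),
    Triple("Kiel", "type", "city"),
    Triple("Kiel", "locatedIn", "Germany"),
    Triple("physicist", "subClassOf", "scientist"),
]


def match(facts, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given fields; None is a wildcard."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def types_of(facts, entity):
    """Direct types of an entity plus every superclass reachable via subClassOf."""
    pending = [t.obj for t in match(facts, subject=entity, predicate="type")]
    seen = set()
    while pending:
        cls = pending.pop()
        if cls not in seen:
            seen.add(cls)
            pending.extend(
                t.obj for t in match(facts, subject=cls, predicate="subClassOf")
            )
    return seen
```

Calling `types_of(FACTS, "Max_Planck")` follows the class hierarchy, so the entity counts as a scientist even though only the more specific type is stated explicitly.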
With a knowledge base that sublimates valuable content from the Web,
we could address difficult questions
beyond the capabilities of today’s keyword-based search engines. For example, a user might ask for a list of drugs
that inhibit proteases and obtain a fairly comprehensive list of drugs for this
HIV-relevant family of enzymes. Such
advanced information requests are
posed by knowledge workers, including scientists, students, journalists,
historians, and market researchers.
Although it is possible to find relevant
answers, the process is laborious and
time-consuming, as it often requires
rephrasing queries and browsing
through many potentially promising
but ultimately useless result pages.
The following example questions illustrate this complexity:
Which German Nobel laureate survived both world wars and outlived all
four of his children? The answer is Max
Planck. The bits and pieces needed to
answer are not difficult to locate: lists
of Nobel prize winners, birth and death
dates of the relevant people, the names
of family members extracted from biographies, and dates associated with the
various children. Gathering and connecting these facts is straightforward
for a human but could take days
of manually inspecting Web pages.
Which politicians are also accomplished scientists? Today’s search engines fail on such questions because
they match words and return pages
rather than identify entities (such as
persons) and test their relationships.
Moreover, the question entails a difficult ranking problem. Wikipedia
alone contains hundreds of names
listed in the categories “Politicians”
and “Scientists.” An insightful answer
must rank important people first, say,
the German chancellor Angela Merkel,
who has a doctoral degree in physical
chemistry, and Benjamin Franklin,
who made scientific discoveries and
was a founding father of the U.S.
How are Max Planck, Angela Merkel,
Jim Gray, and the Dalai Lama related?
All four have doctoral degrees from
German universities (honorary doctorates for Gray and the Dalai Lama).
Discovering interesting facts about
multiple entities and their connections on the Web is virtually impossible due to the sheer number of interconnected pages about these four
famous people.
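Over an explicit knowledge base, a relatedness question of this kind could reduce to finding graph patterns shared by all the entities. The sketch below intersects two-hop paths over toy triples. The university names are deliberate placeholders, and the predicate names and both functions are hypothetical, chosen only to illustrate the idea.

```python
# Toy triples; "University_A" to "University_D" are placeholders, not real facts.
FACTS = [
    ("Max_Planck", "hasDoctorateFrom", "University_A"),
    ("Angela_Merkel", "hasDoctorateFrom", "University_B"),
    ("Jim_Gray", "hasDoctorateFrom", "University_C"),
    ("Dalai_Lama", "hasDoctorateFrom", "University_D"),
    ("University_A", "locatedIn", "Germany"),
    ("University_B", "locatedIn", "Germany"),
    ("University_C", "locatedIn", "Germany"),
    ("University_D", "locatedIn", "Germany"),
    ("Max_Planck", "type", "physicist"),
    ("Jim_Gray", "type", "computer_scientist"),
]


def two_hop_paths(facts, entity):
    """All (p1, p2, target) such that entity -p1-> x -p2-> target holds."""
    paths = set()
    for subject, p1, middle in facts:
        if subject != entity:
            continue
        for subject2, p2, target in facts:
            if subject2 == middle:
                paths.add((p1, p2, target))
    return paths


def shared_connections(facts, entities):
    """Two-hop paths that every one of the given entities has in common."""
    path_sets = [two_hop_paths(facts, e) for e in entities]
    return set.intersection(*path_sets)
```

With these toy facts, the only path common to all four entities is a doctorate from some institution located in Germany, which mirrors the answer given in the text.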
Note that even though the questions are asked in natural language,
they would remain equally difficult to
answer even if expressed in a formal
language. Conversely, a rich knowledge base of entities and relationships
would enable much more effective
natural-language question answering.
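As a rough illustration of why entities and relationships help, the question about politicians who are also scientists becomes a type intersection followed by ranking. This is a sketch under stated assumptions: the type assignments follow the examples in the text, while the importance scores are made-up placeholders for whatever ranking signal a real system would compute, and both function names are hypothetical.

```python
# Types follow the examples in the text; the scores are invented placeholders.
ENTITY_TYPES = {
    "Angela_Merkel": {"politician", "scientist"},
    "Benjamin_Franklin": {"politician", "scientist"},
    "Max_Planck": {"scientist"},
    "Jim_Gray": {"scientist"},
}
IMPORTANCE = {
    "Angela_Merkel": 0.9,
    "Benjamin_Franklin": 0.8,
    "Max_Planck": 0.7,
    "Jim_Gray": 0.6,
}


def entities_with_types(entity_types, required):
    """Entities whose type set contains every required type."""
    return {e for e, types in entity_types.items() if required <= types}


def ranked_answer(entity_types, importance, required):
    """Matching entities, most important first; ties broken alphabetically."""
    hits = entities_with_types(entity_types, required)
    return sorted(hits, key=lambda e: (-importance.get(e, 0.0), e))
```

A keyword engine would instead match the words "politician" and "scientist" against page text. Here the test is applied to the entities themselves, and the ranking step decides which of the matches are shown first.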
Information organization and
search on the Web are being augmented with increasingly sophisticated
structure, context awareness, and semantic flavor in the form of faceted
search, vertical-domain search, entity
search, and deep-Web search. All major search engines recognize a large
fraction of worldwide product names,
have built-in knowledge about geographic locations, and return high-precision
results for popular queries
about consumer interests, travel, and
entertainment. Information-extraction and entity-search methods are
clearly at work. But these efforts focus
only on specific domains. Generalizing the approach toward a universal
methodology for knowledge harvesting requires bolder steps, and three
major research avenues promise to
contribute to this goal:
Semantic-Web-style knowledge repositories (such as ontologies and taxonomies). Included are general-purpose ontologies and thesauri (such
as SUMO, OpenCyc, and WordNet),
as well as domain-specific ontologies
and terminological taxonomies (such
as GeneOntology and UMLS in the
biomedical domain);
Large-scale information extraction
(IE) from text sources in the spirit of a
Statistical Web. IE methods—entity
recognition and learning relational
patterns—are increasingly scalable
and less dependent on human supervision1,10,21; and
Social tagging and Web 2.0 communities that constitute the social Web. Human contributions are abundant in the
form of semantically annotated Web
pages, phrases in pages, images, and
videos, together providing “wisdom of
the crowds.” Freebase and other such
endeavors collect structured data records from human communities. Wikipedia is another example of the Social
Web paradigm, including semistructured data (such as infoboxes) that can
be augmented with explicit facts.4,24,27
Research projects often combine
elements of the semantic, statistical,
and social approaches. Here, we discuss several interesting projects, highlighting YAGO results:
Libra. Aiming to support entity
search on the Web, the Microsoft Research Lab in Beijing has developed
comprehensive technology for information extraction, including pattern-matching algorithms tailored to
typical Web-page layouts and trained
learning of patterns using advanced
models (such as hierarchical conditional random fields28). A particularly
fruitful focus is to extract entities and
their attributes from product-related
pages with HTML tables and lists.
These methods and tools are being
used to build and maintain several
60 Communications of the ACM | April 2009 | Vol. 52 | No. 4