contributed articles
DOI: 10.1145/1498765.1498784
Comprehensive knowledge bases would tap
the Web’s deepest information sources and
relationships to address questions beyond
today’s keyword-based search engines.
BY GERHARD WEIKUM, GJERGJI KASNECI, MAYA RAMANATH,
AND FABIAN SUCHANEK
Database and Information-Retrieval Methods for Knowledge Discovery
Our aim here
is to advocate for the integration of
database systems (DB) and information-retrieval (IR)
methods to address applications that are emerging
from the ongoing explosion and diversification
of digital information. One grand goal of such an
endeavor is the automatic building and maintenance
of a comprehensive knowledge base of facts from
encyclopedic sources and the scientific literature.
Facts should be represented in terms of typed entities
and relationships and allow expressive queries that
return ranked results with precision in
an efficient and scalable manner. We
thus explore how DB and IR methods
might contribute toward this ambitious goal.
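
To make the idea of typed entities, relationships, and ranked query results concrete, the following is a minimal sketch, not the authors' system. The class names, the `confidence` score, the `query` helper, and the toy facts are all assumptions introduced here for illustration; a real knowledge base would use far richer schemas and ranking models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    type: str  # hypothetical type label, e.g. "person" or "city"


@dataclass(frozen=True)
class Fact:
    subject: Entity
    relation: str
    obj: Entity
    confidence: float  # hypothetical extraction confidence in [0, 1]


def query(facts, relation=None, subject_type=None, object_type=None, k=10):
    """Return up to k facts matching the typed pattern, highest confidence first."""
    matches = [
        f for f in facts
        if (relation is None or f.relation == relation)
        and (subject_type is None or f.subject.type == subject_type)
        and (object_type is None or f.obj.type == object_type)
    ]
    return sorted(matches, key=lambda f: f.confidence, reverse=True)[:k]


# Toy data, invented for illustration only.
einstein = Entity("Albert Einstein", "person")
ulm = Entity("Ulm", "city")
princeton = Entity("Princeton", "city")
physics = Entity("Physics", "field")

FACTS = [
    Fact(einstein, "bornIn", ulm, 0.95),
    Fact(einstein, "bornIn", princeton, 0.10),  # simulated noisy extraction
    Fact(einstein, "worksIn", physics, 0.90),
]

# Typed pattern: which cities were persons born in, ranked by confidence?
top = query(FACTS, relation="bornIn", subject_type="person", object_type="city")
for fact in top:
    print(fact.subject.name, fact.relation, fact.obj.name, fact.confidence)
```

The pattern match plays the role of the precise, DB-style part of the query, while sorting by a score stands in for IR-style ranking. Combining the two in one result list is the kind of integration the article argues for.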
DB and IR are separate fields in
computer science due to historical
accident. Both investigate concepts,
models, and computational methods
for managing large amounts of complex information, though each began
almost 40 years ago with very different application areas as motivations
and technology drivers; for DB it was
accounting systems (such as online
reservations and banking), and for
IR it was library systems (such as bibliographic catalogs and patent collections). Moreover, these two directions
and their related research communities emphasized very different aspects
of information management; for DB
it was data consistency, precise query
processing, and efficiency, and for IR
it was text understanding, statistical
ranking models, and user satisfaction.
There were attempts at integration in the late 1990s, most notably the
probabilistic datalog and probabilistic relational-algebra models [13, 14],
the proximal node model [19], and the WHIRL approach to similarity joins [9].
But it is
only in the past few years that mission-critical applications have emerged
with a compelling need for integrated
DB and IR methods and platforms.
From an IR perspective, digital libraries of all kinds are becoming rich
information repositories, with documents augmented by metadata and
annotations captured in semistructured data formats (such as XML); enterprise search on intranet data represents a variant of this theme.
From a DB point of view, application
domains (such as customer support,
product and market research, and
health-care management) reflect data
growth in terms of both structured
and unstructured information. Web
2.0 applications (such as social networks) require support for structured
and textual data, as well as ranking
and recommendation in the presence
56 COMMUNICATIONS OF THE ACM | APRIL 2009 | VOL. 52 | NO. 4