contributed articles

DOI: 10.1145/1498765.1498784

Comprehensive knowledge bases would tap
the Web’s deepest information sources and
relationships to address questions beyond
today’s keyword-based search engines.

By Gerhard Weikum, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek

Database and Information-Retrieval Methods for Knowledge Discovery
Our aim here is to advocate for the integration of database systems (DB) and information-retrieval (IR) methods to address applications that are emerging from the ongoing explosion and diversification of digital information. One grand goal of such an endeavor is the automatic building and maintenance of a comprehensive knowledge base of facts from encyclopedic sources and the scientific literature. Facts should be represented in terms of typed entities and relationships and allow expressive queries that return ranked results with precision in an efficient and scalable manner. We thus explore how DB and IR methods might contribute toward this ambitious goal.
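To make the idea of typed entities, relationships, and ranked query results concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not a design taken from any system discussed in this article: the names `Entity`, `Fact`, and `query`, the `confidence` score used for ranking, and the three sample facts with their scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    name: str
    type: str  # hypothetical type label, e.g. "scientist" or "city"


@dataclass(frozen=True)
class Fact:
    subject: Entity
    relation: str
    obj: Entity
    confidence: float  # assumed score in [0, 1]; values below are illustrative


def query(facts: List[Fact], relation: str,
          subject_type: Optional[str] = None,
          object_type: Optional[str] = None,
          top_k: int = 10) -> List[Fact]:
    """Return facts matching the relation and type constraints,
    ranked by descending confidence and cut off at top_k."""
    matches = [
        f for f in facts
        if f.relation == relation
        and (subject_type is None or f.subject.type == subject_type)
        and (object_type is None or f.obj.type == object_type)
    ]
    return sorted(matches, key=lambda f: f.confidence, reverse=True)[:top_k]


# Tiny illustrative knowledge base; the confidence values are made up.
planck = Entity("Max Planck", "scientist")
einstein = Entity("Albert Einstein", "scientist")
kiel = Entity("Kiel", "city")
ulm = Entity("Ulm", "city")
germany = Entity("Germany", "country")

FACTS = [
    Fact(planck, "bornIn", kiel, 0.90),
    Fact(einstein, "bornIn", ulm, 0.95),
    Fact(kiel, "locatedIn", germany, 0.99),
]

# "Which scientists were born in which cities?", best-supported facts first.
results = query(FACTS, "bornIn", subject_type="scientist", object_type="city")
for fact in results:
    print(fact.subject.name, fact.relation, fact.obj.name, fact.confidence)
```

The type constraints play the role of the precise, structured conditions familiar from DB queries, while sorting by a score stands in for IR-style ranking. A real system would presumably derive such scores from extraction evidence and statistics rather than from hand-assigned constants.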

DB and IR are separate fields in computer science due to historical accident. Both investigate concepts, models, and computational methods for managing large amounts of complex information, though each began almost 40 years ago with very different application areas as motivations and technology drivers; for DB it was accounting systems (such as online reservations and banking), and for IR it was library systems (such as bibliographic catalogs and patent collections). Moreover, these two directions and their related research communities emphasized very different aspects of information management; for DB it was data consistency, precise query processing, and efficiency, and for IR it was text understanding, statistical ranking models, and user satisfaction.

There were attempts at integration (late 1990s), most notably the probabilistic datalog and probabilistic relational-algebra models [13, 14], the proximal node model [19], and the WHIRL approach to similarity joins [9]. But it is only in the past few years that mission-critical applications have emerged with a compelling need for integrated DB and IR methods and platforms. From an IR perspective, digital libraries of all kinds are becoming rich information repositories, with documents augmented by metadata and annotations captured in semistructured data formats (such as XML); enterprise search on intranet data represents a variant of this theme.

From a DB point of view, application domains (such as customer support, product and market research, and health-care management) reflect data growth in terms of both structured and unstructured information. Web 2.0 applications (such as social networks) require support for structured and textual data, as well as ranking and recommendation in the presence

56 Communications of the ACM | April 2009 | Vol. 52 | No. 4
