of uncertain information of highly diverse quality (see Figure 1). The Figure categorizes information systems
along two dimensions: how the data
is to be managed and how the data is
to be searched. The first divides the
world of digital data into structured
data (such as schema-oriented records with numerical, categorical, and
short-string attributes) and unstructured data (such as natural-language
text and multimodal information, including speech and video) and loose
collections of heterogeneous records.
The second dimension distinguishes
sophisticated query languages that express logical conditions from simple
keyword search as the prevalent way
of posing queries to search engines.
Since the late 1960s, DB and IR systems have resided in two totally separate quadrants in the Figure, while
the other two seemed
useless or unoccupied.
Since the late 1990s, DB and IR researchers
have explored these previously blank quadrants (middle of the
Figure). IR-style keyword search over
structured data (such as relational databases) makes sense when the structural data description—the schema—
is so complex that information needs
cannot be concisely or conveniently
expressed in a structured query. As an
example of this difficulty, consider a
social-network database with tables
of users, friends, and posted items
(such as photos, videos, and recommended books or songs), as well as
ratings and comments. Assume a user
wants to find the connections shared
by Alon, Raghu, and Surajit with respect to the Semantic Web. Answers
might be that the three co-authored a
book on the Semantic Web, two edited
a book, one commented on it, or the
three are friends and one posted a video called “Semantic Web Saga.” With
structured querying, where each value
(such as “Alon”) refers to a particular
attribute (such as User.Name or
Friend.Name), the combinatorial options lead to very complex queries with
many joins and unions. Much simpler
is to state five keywords—“Alon, Raghu, Surajit, Semantic, Web”—and let
the system compute the most meaningful answers in a relational graph.
This relaxed attitude toward the
schema (which value should occur in
which attribute) naturally entails IR-style ranking.
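
To make the idea of ranked keyword search over a relational graph concrete, the following is a minimal sketch and not the method of any particular system. It assumes a hypothetical toy database in which every tuple is a graph node, every foreign-key link is an undirected edge, and an answer is identified by a "connecting" node whose total hop distance to the nearest match of each keyword is small. The node identifiers, the sample titles, and the function names (build_adjacency, distances_from, keyword_search) are all illustrative assumptions.

```python
from collections import deque

# Hypothetical toy data: each node is one tuple; its text is what keywords match.
NODES = {
    "u1": "Alon",
    "u2": "Raghu",
    "u3": "Surajit",
    "u4": "Gerhard",
    "b1": "The Semantic Web Handbook",  # invented book title
    "v1": "Semantic Web Saga",          # video title from the running example
    "b2": "Database Tuning",            # invented book title
}

# Undirected edges stand in for foreign-key links (authorship, posting, ...).
EDGES = [
    ("u1", "b1"), ("u2", "b1"), ("u3", "b1"),  # three co-authors of b1
    ("u3", "v1"),                              # Surajit posted v1
    ("u1", "b2"), ("u4", "b2"),                # unrelated co-authorship
]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def distances_from(sources, adj):
    """Multi-source BFS: hop distance from the nearest source to each reachable node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def keyword_search(keywords, nodes, edges, top_k=3):
    """Rank nodes by the sum of their distances to the nearest match of each keyword.

    No keyword is bound to a particular attribute; a keyword matches any node
    whose text contains it as a whole word. Lower scores mean tighter answers.
    """
    adj = build_adjacency(edges)
    per_keyword = []
    for kw in keywords:
        matches = [n for n, text in nodes.items()
                   if kw.lower() in text.lower().split()]
        if not matches:
            return []  # some keyword matches nothing: no answer covers all keywords
        per_keyword.append(distances_from(matches, adj))

    scored = []
    for node in nodes:
        # Keep only nodes connected to at least one match of every keyword.
        if all(node in dist for dist in per_keyword):
            scored.append((sum(dist[node] for dist in per_keyword), node))
    scored.sort()
    return scored[:top_k]


if __name__ == "__main__":
    query = ["Alon", "Raghu", "Surajit", "Semantic", "Web"]
    for score, node in keyword_search(query, NODES, EDGES):
        print(score, node, NODES[node])
```

On this toy data the co-authored book ranks first, because it is one hop from each of the three users and matches both title keywords itself. Real systems would use richer scoring (for example, edge weights and text-relevance statistics); the distance sum here only illustrates why relaxing the schema calls for ranking.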
Conversely, linguistic and learning-based information-extraction
techniques have been applied in order to augment textual sources with
structured records and enable expressive DB-style querying over originally
unstructured data. Consider an information request about “the life of the
scientist Max Planck” to be evaluated
over an XML-based digital library, perhaps an extended form of Wikipedia.
A simple approach would be to formulate a keyword query like “life scientist
April 2009 | Vol. 52 | No. 4 | Communications of the ACM