Figure 2: Excerpt from the YAGO knowledge base.
[Graph figure. Classes: Entity, Organization, Person, Location, Scientist, Politician, Biologist, Physicist, Country, State, City. Entities: Max Planck, Erwin Planck, Angela Merkel, Germany, Kiel, Schleswig-Holstein, Nobel Prize, Max Planck Society. Date values: Apr. 23, 1858; Oct. 23, 1944; Oct. 4, 1947. Edge labels: subclass, instanceOf, bornOn, bornIn, diedOn, hasWon, fatherOf, locatedIn, citizenOf, and means, the last linking name strings such as “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, and “Angela Dorothea Merkel” to entities, in some cases with weights (0.1, 0.9).]
Abstracting from this application-centric discussion, we have identified
several compelling motivations for
bringing IR concepts to DB systems
and vice versa, leading to the following DB and IR concepts and methods
a developer would find useful:
Approximate matching and record
linkage. Adding text-matching functionality to DB systems often entails
approximate matching (for example, to handle
spelling variants) and, when text fields
refer to named entities, leads to record
linkage for matching entities. For example, the strings “William J. Clinton” and “Bill Clinton” likely denote
the same person, and the names “M-31”
and “NGC 224” should be reconciled to denote the Andromeda galaxy.
Approximate matching by similarity
measures requires IR-style ranking.
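To make the link between similarity measures and ranking concrete, here is a minimal sketch that scores candidate strings against a query by Jaccard similarity over character trigrams and returns them in ranked order. The function names (`ngrams`, `jaccard`, `rank_matches`) and the choice of trigrams are assumptions made for illustration; they are not a method prescribed by the article.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, padded string."""
    # Collapse whitespace and pad with one space on each side so that
    # word boundaries contribute n-grams too.
    s = " " + " ".join(text.lower().split()) + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity of the n-gram sets of two strings, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def rank_matches(query, candidates, n=3):
    """Return (score, candidate) pairs, best match first."""
    scored = [(jaccard(query, c, n), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    candidates = ["George W. Bush", "Bill Gates", "William J. Clinton"]
    for score, name in rank_matches("Bill Clinton", candidates):
        print(f"{score:.2f}  {name}")
```

A purely string-based measure like this one only goes so far: “M-31” and “NGC 224” share no trigrams and score zero, so reconciling such names would need additional knowledge, for example a table of known aliases.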
Too-many-answers ranking.
Preference search over, say, travel portals and
product catalogs often poses a too-many-answers
problem. Narrowing the
query conditions may overshoot by producing too few or even no results; interactive reformulation and browsing are
time-consuming and might irritate users. Large result sets inevitably require
ranking based on data and/or workload
statistics, as well as on user profiles;
Schema relaxation and heterogeneity. In the DB world, the norm is that
applications access multiple databases, often with a run-time choice of
the data sources. Even if each source
contains structured data records and
comes with an explicit schema, there
is no unified global schema unless a
breakthrough could be achieved to
magically perform perfect on-the-fly
data integration. So the application
program must be able to cope with the
heterogeneity of the underlying schema names, XML tags, and Resource
Description Framework (RDF) properties, and queries must be schema-agnostic
or at least tolerant of schema
relaxation;
Information extraction and uncertain
data. Textual information contains
named entities and relationships in
natural-language sentences that can
be made explicit through information-extraction techniques (pattern matching, statistical learning, and natural-language processing). However, this
approach can lead to large knowledge
bases with facts that exhibit uncertainty; querying extracted facts thus
entails ranking.
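The following sketch illustrates why querying extracted facts entails ranking: pattern-based extraction yields facts with differing confidence, and a query returns them ordered by that confidence. The two patterns, their confidence values, the `bornIn` relation label, and the function names are hypothetical choices made for this example, not the extraction rules of any particular system.

```python
import re

# A capitalized name such as "Max Planck" (illustrative; real entity
# recognition is far more involved).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical extraction patterns: (regex, relation, confidence).
# The confidence values are made up to show a strong and a weak pattern.
PATTERNS = [
    (re.compile(rf"(?P<s>{NAME}) was born in (?P<o>[A-Z][a-z]+)"), "bornIn", 0.9),
    (re.compile(rf"(?P<s>{NAME}) grew up in (?P<o>[A-Z][a-z]+)"), "bornIn", 0.4),
]


def extract_facts(sentences):
    """Apply every pattern to every sentence; return (s, r, o, confidence)."""
    facts = []
    for sentence in sentences:
        for pattern, relation, confidence in PATTERNS:
            for m in pattern.finditer(sentence):
                facts.append((m.group("s"), relation, m.group("o"), confidence))
    return facts


def query(facts, subject=None, relation=None):
    """Return the matching facts, highest confidence first."""
    hits = [
        f for f in facts
        if (subject is None or f[0] == subject)
        and (relation is None or f[1] == relation)
    ]
    return sorted(hits, key=lambda f: f[3], reverse=True)


if __name__ == "__main__":
    sentences = [
        "Max Planck was born in Kiel.",
        "Max Planck grew up in Munich.",
    ]
    for fact in query(extract_facts(sentences), subject="Max Planck"):
        print(fact)
```

Here the weaker pattern produces a competing, lower-confidence candidate for the same relation, so the query result is a ranked list rather than a single certain answer.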
Entity search and ranking. Recognizing entities in text sources allows entity-search queries about, say, electronics products, travel destinations, and
movie stars, boosting search capabilities on intranets, portals, news feeds,
and the business- and entertainment-oriented parts of the Web. Extracting binary relations between entities,
as well as place and time attributes,
April 2009 | Vol. 52 | No. 4 | Communications of the ACM