Technical Perspective
Schema Mappings:
Rules for Mixing Data
By Alon Halevy
DOI: 10.1145/1629175.1629200
When you search for flight tickets on
your favorite Web site, your query is often dispatched to tens of databases to
produce an answer. When you search
for products on Amazon.com, you
are seeing results from thousands of
vendor databases that were developed
before Amazon existed. Did you ever
wonder how that happens? What is
the theory behind it all? At the core,
these systems are powered by schema
mappings that provide the glue to tie
all these databases together. The following paper by ten Cate and Kolaitis
will give you a glimpse into the theoretical foundations underlying schema mappings and might even inspire
you to work in the area.
The scenarios I’ve noted here are
examples of data management applications that require access to multiple heterogeneous data sets. Data
integration is the field that develops
architectures, systems, formalisms,
and algorithms for combining data
from multiple sources, be they relational databases, XML repositories, or
data from the Web. The goal of a data
integration system is to offer uniform
access to a collection of sources, and
free the user from having to locate individual sources, learn their specific
interaction details, and manually
combine the data. The work on data
integration spans multiple fields of
computer science, including data
management, artificial intelligence
systems, and human-computer interaction. The field has been nicknamed
the “AI-complete” problem of data
management due to the challenges
that arise from reconciling multiple
models of data created by humans,
and the realization that we never expect to solve data integration completely automatically.
Data integration challenges are
pervasive in practice. Large enterprises often must combine data from
hundreds of repositories, and scientists constantly face an explosion in
the number of sources being created
in their domain. The Web provides an
extreme case of data integration with
tens of millions of independently developed data sources. Fortunately,
data integration is also a pervasive
problem in government organizations, enabling a steady stream of
research on the topic. In a nutshell,
data integration is difficult because
the data sets were developed independently and for different purposes.
Therefore, different developers model
varying aspects of the data, use inconsistent terminology, and make different assumptions about the data.
There are several architectures
for data integration systems, and the
appropriate choice depends on the
needs of the application. In some cases it is possible to collect all the data
in one physical repository; in other
cases data must be exchanged from a
source database to a target. In other
scenarios, organizational boundaries or other factors dictate that data
must be left at the original sources
and combination of the relevant data
can only occur in response to a query.
Regardless of the architecture used,
the core of data integration relies on
schema mappings that specify how
to translate terms (for example, table
names and attribute names) between
different sources and relate differing
database organizations. Much of the
effort in building a data integration
application is to construct schema
mappings and maintain them over
time. Building the mappings is difficult mainly because it requires
understanding the semantics of the
source and target databases (which may
require more than one person), and
the ability to express the semantic relationship formally (which may require
a database specialist in addition to
the domain experts). There has been
a large body of research on providing
assistance in creating and debugging
schema mappings.
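As a minimal illustration of the term translation described above, the following Python sketch treats a schema mapping as a table of attribute renamings from a source schema to a target schema. The schemas, the attribute names, and the `apply_mapping` helper are all hypothetical and chosen only for illustration; real mappings also relate differing database organizations, which a simple renaming cannot capture.

```python
# Hypothetical source-to-target attribute renaming; all names are illustrative.
VENDOR_TO_CATALOG = {
    "prod_name": "title",
    "cost_usd": "price",
    "sku": "item_id",
}


def apply_mapping(row, mapping):
    """Rename the attributes of one source row into the target vocabulary.

    Attributes without an entry in the mapping are dropped, since the
    target schema has no term for them.
    """
    return {mapping[attr]: value for attr, value in row.items() if attr in mapping}


vendor_row = {"prod_name": "USB cable", "cost_usd": 4.99, "sku": "A-17", "warehouse": "W3"}
catalog_row = apply_mapping(vendor_row, VENDOR_TO_CATALOG)
```

Even in this toy case, someone has to know that `cost_usd` and `price` mean the same thing, which is the semantic understanding the mapping author must supply.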
A schema mapping must be written
in some logical formalism. In the earliest data integration systems, schema
mappings were written like ordinary
view definitions (now known as global-as-view, or GAV,
mappings), where an integrated view
is defined over tables from multiple
sources. With time, it became evident
that this approach did not scale to a
large number of sources, so local-as-view (LAV)
mappings were developed. In LAV, the
focus is on describing the contents of
an individual source irrespective of
the other sources. LAV mappings are
complemented by a general reasoning engine that infers how to combine
data from multiple sources, given a
particular query. As the study of mappings progressed, researchers discovered close relationships between
mapping formalisms and constraint
languages such as tuple-generating
dependencies.
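The contrast between the two styles can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any real system is built: the two flight sources, the integrated relation Flight(origin, dest, carrier), and the helpers `gav_flights` and `lav_answer` are all hypothetical, and the "reasoning engine" is reduced to a simple compatibility check on one attribute.

```python
# Hypothetical sources; relation and attribute names are illustrative only.
AIRLINE_A = [("SEA", "SFO"), ("SFO", "JFK")]   # rows of (origin, destination)
AIRLINE_B = [{"from": "SEA", "to": "JFK"}]     # same kind of data, different layout


def gav_flights():
    """GAV style: the integrated relation Flight(origin, dest, carrier)
    is written as a single view over all the sources at once."""
    rows = {(o, d, "A") for (o, d) in AIRLINE_A}
    rows |= {(r["from"], r["to"], "B") for r in AIRLINE_B}
    return rows


# LAV style: each source is described on its own, in terms of the
# integrated schema, with no reference to the other sources.
LAV_DESCRIPTIONS = {
    "airline_a": {
        "carrier": "A",
        "fetch": lambda: {(o, d, "A") for (o, d) in AIRLINE_A},
    },
    "airline_b": {
        "carrier": "B",
        "fetch": lambda: {(r["from"], r["to"], "B") for r in AIRLINE_B},
    },
}


def lav_answer(carrier):
    """Toy stand-in for the reasoning engine: pick the sources whose
    description is compatible with the query, then combine their data."""
    rows = set()
    for description in LAV_DESCRIPTIONS.values():
        if description["carrier"] == carrier:
            rows |= description["fetch"]()
    return rows
```

In this sketch, adding a third source under the GAV style means editing `gav_flights` itself, whereas under the LAV style it means adding one more independent entry to `LAV_DESCRIPTIONS`, which hints at why describing sources individually copes better with many sources.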
Though some properties of these
languages in isolation are well understood, this paper sheds significant
light, for the first time, on the relationships between these languages. The
authors identify general properties
of mappings (that are not tied to the
formalism in which they are written),
and show how these properties can be
used to characterize the language that
can express a mapping. Besides
providing several insightful results,
I believe their paper merits careful
study because it opens up a new and
exciting field of research involving the
expressive power of data integration
systems.
Alon Halevy is a research scientist at Google, where he
manages a team looking into how structured data can be
used in Web search,