Structural Characterizations of Schema-Mapping Languages
DOI: 10.1145/1629175.1629201
By Balder ten Cate and Phokion G. Kolaitis
Abstract
Information integration is a key challenge faced by all major
organizations, business and governmental ones alike. Two
research facets of this challenge that have received considerable attention in recent years are data exchange and data
integration. The study of data exchange and data integration has been facilitated by the systematic use of schema
mappings, which are high-level specifications that describe
the relationship between two database schemas. Schema
mappings are typically expressed in declarative languages
based on logical formalisms and are chosen with two criteria in mind: (a) expressive power sufficient to specify
interesting data interoperability tasks and (b) desirable
structural properties, such as query rewritability and existence of universal solutions, that, in turn, imply good algorithmic behavior.
Here, we examine these and other fundamental structural properties of schema mappings from a new perspective by asking: How widely applicable are these properties?
Which schema mappings possess these properties and
which do not? We settle these questions by establishing
structural characterizations to the effect that a schema
mapping possesses certain structural properties if and
only if it can be specified in a particular schema-mapping
language. More concretely, we obtain structural characterizations of schema-mapping languages such as global-as-view (GAV) dependencies and local-as-view (LAV)
dependencies. These results delineate the tools available
in the study of schema mappings and pinpoint the properties of schema mappings that one stands to gain or
lose by switching from one schema-mapping language to
another.
1. Introduction
The aim of information integration is to synthesize information distributed over multiple heterogeneous sources
into a single unified format. Information integration has
been recognized as a key (and costly) challenge faced by
large organizations today (see Bernstein and Haas [3, 12]). It is
also well understood [12] that information integration is not
a single problem but, rather, a collection of interrelated
problems that include extracting and cleaning data from
the sources, deriving a unified format for the integrated
data, transforming data from the sources into data conforming with the unified format, and answering queries
over the unified format. In this article, we focus on
relational information integration; that is, we assume
that the sources are databases over (different) relational
schemas, called source or local schemas, and also that the
unified format is some other relational schema, called the
target or the global schema. A relational schema or simply
a schema consists of names of relations and names of the
columns of each relation. A database instance or simply
an instance for a given schema is a collection containing,
for each relation name in the schema, a finite relation (i.e.,
a table of records). An example of a source schema and a
target schema is given in Figure 1. The source schema
consists of three relation names that contain information
about direct orders from a manufacturer together with
information about retail sales; the target schema consists
of a single relation name intended to summarize the sales
records. Figure 1 also depicts a source instance and three
target instances that will be used later on to illustrate the
main concepts.
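To make the definitions of schema and instance concrete, the following minimal Python sketch represents a schema as a map from relation names to column names and an instance as a map from relation names to finite sets of records. The relation names, column names, and records are hypothetical illustrations and are not the contents of Figure 1; the helper is_instance_of is likewise an assumption introduced only for this sketch.

```python
from typing import Dict, Set, Tuple

# A schema: each relation name is paired with the names of its columns.
Schema = Dict[str, Tuple[str, ...]]
# An instance: each relation name is paired with a finite set of records.
Instance = Dict[str, Set[tuple]]


def is_instance_of(instance: Instance, schema: Schema) -> bool:
    """Return True if `instance` has one relation per relation name in
    `schema` and every record has as many fields as the relation has columns."""
    if set(instance) != set(schema):
        return False
    return all(
        len(record) == len(schema[name])
        for name, records in instance.items()
        for record in records
    )


# Hypothetical source schema with three relation names (not those of Figure 1).
SOURCE_SCHEMA: Schema = {
    "DirectOrder": ("customer", "product"),
    "Retail": ("store", "product"),
    "StoreCustomer": ("store", "customer"),
}

# Hypothetical target schema with a single relation name.
TARGET_SCHEMA: Schema = {"Sales": ("customer", "product", "store")}

# A hypothetical source instance: one finite relation per relation name.
source_instance: Instance = {
    "DirectOrder": {("ann", "pen")},
    "Retail": {("s1", "ink")},
    "StoreCustomer": {("s1", "bob")},
}

print(is_instance_of(source_instance, SOURCE_SCHEMA))
```

Using sets of tuples reflects that a relation is a finite table of records without duplicates; an empty set is a legitimate relation, so an instance may leave some relations empty.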
Two important facets of information integration are data
exchange and data integration. Both these facets deal with
the attainment of information integration, but they adopt
distinctly different approaches. Data exchange is the problem of transforming data residing in different sources into
data structured under a target schema; in particular, data
exchange entails the materialization of data, after the data
have been extracted from the sources and restructured into
the unified format. In contrast, data integration can be
described as symbolic or virtual integration: users are provided with the capability to pose queries and obtain answers
via the unified format interface, while the data remain in
the sources and no materialization of the restructured data
takes place. Figure 2 depicts the data-integration and data-exchange tasks.
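The contrast between materialization and virtual integration can be illustrated with a small Python sketch. It is a deliberate simplification under stated assumptions: the transformation to_target, the relation names Retail and Sales, and the functions exchange and integrate are all hypothetical, and the virtual variant simply recomputes the target view on demand rather than rewriting the query over the sources, as a real data integration system might.

```python
from typing import Callable, Dict, Set

Instance = Dict[str, Set[tuple]]


def to_target(source: Instance) -> Instance:
    """Hypothetical hand-written transformation: one Sales record
    for every record of the (hypothetical) Retail relation."""
    return {"Sales": {(store, product) for (store, product) in source["Retail"]}}


def exchange(source: Instance) -> Instance:
    """Data exchange: restructure the source data and materialize the
    result as a target instance; later queries run on this stored copy."""
    return to_target(source)


def integrate(read_sources: Callable[[], Instance],
              query: Callable[[Instance], object]) -> object:
    """Data integration (simplified): nothing is stored; each query is
    answered from whatever the sources contain when the query is posed."""
    return query(to_target(read_sources()))


def count_sales(target: Instance) -> int:
    """A query posed against the unified (target) format."""
    return len(target["Sales"])


sources: Instance = {"Retail": {("s1", "pen")}}
materialized = exchange(sources)        # target instance stored once
sources["Retail"].add(("s2", "ink"))    # the sources change afterwards

stale_count = count_sales(materialized)               # reflects the stored copy
live_count = integrate(lambda: sources, count_sales)  # reflects current sources
print(stale_count, live_count)
```

In this sketch the materialized target does not see the later update to the sources, whereas the virtual query does; this is the trade-off between the two approaches that the paragraph above describes.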
In both data exchange and data integration, the relationship between the local schemas and the global schema must
be spelled out. One way to accomplish this is via programs or
SQL scripts written by human experts; this, however, can be
an expensive and error-prone undertaking due to the complexity of the transformations involved. Instead, the research
community has introduced schema mappings, a higher level
A previous version of this article appeared in the
Proceedings of the 12th International Conference on
Database Theory, 2009.