Schema matching (a set of
correspondences bet ween
attributes of relations)
Schema-mapping generation
Source schema S
Source data
Schema-mapping
Executable script generation
Executable transformation
script (SQL/XSLT/...)
Target schema T
queries in relational databases (see Figure 2).
A system that makes systematic use of schema mappings
is Clio, a data-exchange system that started as a research
prototype at the IBM Almaden Research Center and is now
part of IBM’s suite of information integration tools.
13 The
architecture of Clio is depicted in Figure 3. The system has
a schema-matching component, a schema-mapping generation component, and an executable-script generation
component. The schema-matching component produces a set
of correspondences between attributes of relations in a
source schema and a target schema; these correspondences
are derived automatically or semi-automatically through an
interaction with the user and via a graphical user interface
that allows the user to intervene and make changes. The
schema-mapping generation component takes as input
these attribute correspondences and returns a schema mapping. In general, there are more than one schema mapping
that are consistent with a set of attribute correspondences.
Clio produces just one of these possible schema mappings
but the user can again intervene and edit the schema mapping returned by the system. Finally, the executable-script
generation component automatically transforms this
schema mapping into a set of scripts in some language, such
as SQL or XSLT.
2. scHemA mAPPinGs AnD lAnGuAGes
In this section, we define the basic notions about schema
mappings and present some of the main schema-mapping
languages studied by the research community.
2. 1. Basic notions
A (relational) schema is a tuple R = (R
1, . . . , Rn) of relation symbols each of which has a fixed arity (number of attributes).
An R-instance is a tuple I = (R1I,... , RnI) of finite relations,
whose arities match those of the relation symbols of R.
A fact of I is an expression Ri a, where i £ n and a is a tuple of
values belonging to the relation RiI. The active domain of I,
denoted by adom (I), is the set of all values occurring in the
relation RiI, for 1 £ i £ n. We will be usually working with two
disjoint schemas, called the source schema s = (S1, . . . , Sn)
and the target schema t = (T1, . . . , Tm). An s-instance is
called a source instance and a t-instance is called a target
instance. Whenever we consider a pair of instances (I, J), it
is understood that I is a source instance and J is a target
instance.
As stated earlier, a schema mapping is a declarative
specification that describes the relationship between
two schemas. More precisely, a schema mapping is a triple M = (s, t, S), where s is a source schema, t is a target schema disjoint from s, and S is a set of sentences
(i.e., formulas with no free variables) in some logical
formalism. This is the syntactic view of schema mappings. There is also a complementary semantic view
of schema mappings that we present next. Let I be a
source instance and J a target instance. We say that J
is a solution for I w.r.t. a schema mapping M = (s, t, S)
if (I, J ) |= S, which means that (I, J) satisfies every sentence
in S. Consequently, from a semantic standpoint, a schema
mapping M can be thought of as a triple M = (s, t, W),
where W is the set of all pairs (I, J) such that J is a solution
for I w.r.t. M. So, as a semantic object, a schema mapping
M is triple M = (s, t, W), where s is a source schema, t is
a target schema disjoint from s, and W is a collection of
pairs (I, J) with I a source instance and J a target instance.
Let L be a logical language, let S be a set of L-sentences,
and let M = (s, t, W) be a schema mapping given as a semantic object. We say that M is L-definable by S if for every source
instance I and every target instance J, we have that (I, J) Î
W if and only if (I, J) |= S. When we work with schema mappings in the sequel, it will be clear from the context if the
schema mapping at hand is viewed as a syntactic object or
as a semantic one.
Example 2. 1. To illustrate these notions, consider the
schema mapping in Figure 1. In this example, the source
schema s consists of the relations DirectCustomer,
DirectOrder, and Retail, while the target schema t
consists of the single relation Sales. The relationship between
source and target is then described by the schema mapping M = (s, t, S), where S consists of the two first-order
sentences listed in Figure 1. As a semantic object, M is
the triple (s, t, W), where W consists of all pairs (I, J) satisfies the two sentences in S. In particular, the pairs (I, J1)
and (I, J2) belong to W, but the pair (I, J3) does not. In other
words, J1 and J2 are solutions for I w.r.t. to M, but J3 is not.
2. 2. schema-mapping languages
What is a “good” language for specifying schema mappings? To address this question, let us reflect on the relational database model, introduced by E.F. Codd 40 years
ago.
4 One of the reasons for the success of this model is
that it supports powerful, high-level database query languages, such as relational algebra and relational calculus,
that have formed the foundation for SQL. Relational algebra and relational calculus have the same expressive power
as first-order logic5; in fact, relational calculus is a syntactic
variant of first-order logic tailored for databases. Thus, at
first sight, it is natural to ask: can first-order logic also be
used as a schema-mapping language? It is not hard to show,
however, that basic algorithmic problems in data integration and data exchange, such as existence-of-solutions and