necessary to define a common schema—data from both sources can be
merged into a single self-describing
XML document—though in scenarios
such as warehousing applications the
transformation and fusing of the original data into a well-defined format is
required. Still, its flexibility and the
ubiquity of free parsers make XML
attractive in scenarios with looser requirements, and it is increasingly being used for transferring data between
systems and sometimes as a format for
storing data as well. 3
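As a rough illustration of merging without a shared schema, the sketch below wraps two differently structured fragments under one root element using Python's standard `xml.etree.ElementTree` module. The fragment contents, the element names, and the `merge_sources` helper are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical records from two sources that use different element names.
crm_xml = "<customer><name>Ann Lee</name><phone>555-0100</phone></customer>"
billing_xml = "<account><holder>Ann Lee</holder><balance>42.50</balance></account>"

def merge_sources(*fragments: str) -> ET.Element:
    """Wrap independently structured XML fragments under a single root."""
    root = ET.Element("merged")
    for fragment in fragments:
        # Each fragment keeps its own tags, so the result stays self-describing.
        root.append(ET.fromstring(fragment))
    return root

merged = merge_sources(crm_xml, billing_xml)
merged_text = ET.tostring(merged, encoding="unicode")
```

The merged document can be parsed by any XML parser, but nothing in it says that `name` and `holder` denote the same thing; that reconciliation is the transformation work a warehousing scenario would still require.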
Schema Standards. It is easier to integrate data from different sources if
they use the same schema. This consistency avoids the need to reformat
the data before integrating it, and it
also ensures that data from all of the
sources share the same meaning.
Even if sources do not conform to
a common schema, each source may
be able to relate its data to a common standard, either industry-wide or
enterprise-specific. Thus two sources
can be related by composing the two
mappings that relate each of them to
the standard. This approach enables integration only of information that
appears in the standard, and because
a standard is often a least common denominator, some information is lost
in the composition.
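The composition of two source-to-standard mappings can be sketched as follows. This is a minimal illustration that assumes each mapping is a simple field-to-field dictionary; the field names and the `compose` helper are hypothetical, and real mappings usually involve transformations rather than plain renaming.

```python
# Hypothetical mappings from each source's fields to a standard's fields.
# None marks a field that the standard has no place for.
a_to_std = {"cust_nm": "name", "zip": "postal_code", "loyalty_tier": None}
b_to_std = {"full_name": "name", "postcode": "postal_code", "fax": None}

def compose(a_to_std, b_to_std):
    """Relate source A's fields to source B's fields via the shared standard."""
    # Invert B's mapping so we can go from a standard field to a B field.
    std_to_b = {std: b for b, std in b_to_std.items() if std is not None}
    mapping, lost = {}, []
    for a_field, std in a_to_std.items():
        if std is not None and std in std_to_b:
            mapping[a_field] = std_to_b[std]
        else:
            # No path through the standard: this information is lost.
            lost.append(a_field)
    return mapping, lost

a_to_b, lost_fields = compose(a_to_std, b_to_std)
```

Here `loyalty_tier` has no counterpart in the standard, so it drops out of the composed mapping, which is the least-common-denominator effect described above.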
There are many industry-wide schema standards. 18, 28, 29 Some are oriented
toward generic kinds of data, such as
geographic information or software-engineering information. Others pertain to particular application domains
such as computer-aided design, news
stories, and medical billing.
When the schema standard is abstract and focuses on creating a taxonomy of terms, it is usually called an
ontology. Ontologies are often used as
controlled vocabularies—for example,
in the biomedical domain—rather
than as data formats. 12, 13
Data Cleansing. When the same or
related information is described in
multiple places (possibly within a single source), often some of the occurrences are inconsistent or just plain
wrong—that is, “dirty.” They may be
dirty because the data, such as inventory and purchase-order information
about the same equipment, were independently obtained. Or they may simply have errors such as misspellings,
be missing recent changes, or be in a
form that is inappropriate for a new
use that will be made of it.
A typical initial step in information
integration is to inspect each of the
data sources, perhaps with the aid of
data-profiling tools, for the purpose
of identifying problematic data. Then
a data-cleansing tool may be used to
transform the data into a common
standardized representation. A typical data-cleansing step, for example,
might correct misspellings of street
names or put all addresses in a common format. 10 Often, data-profiling
and -cleansing tools are packaged together as part of an ETL tool set.
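A cleansing step of the kind just described might look like the sketch below. It assumes small hand-built lookup tables; the `SUFFIXES` and `CORRECTIONS` tables and the `clean_address` function are hypothetical, and a production tool would rely on much larger reference data and locale-specific rules.

```python
import re

# Hypothetical lookup tables, for illustration only.
SUFFIXES = {"st": "Street", "str": "Street", "ave": "Avenue", "rd": "Road"}
CORRECTIONS = {"mian": "Main", "brodway": "Broadway"}

def clean_address(raw: str) -> str:
    """Put a street address into one standardized representation."""
    # Replace periods and commas with spaces, then split on whitespace.
    tokens = re.sub(r"[.,]", " ", raw).split()
    cleaned = []
    for token in tokens:
        key = token.lower()
        if key in CORRECTIONS:
            cleaned.append(CORRECTIONS[key])   # fix a known misspelling
        elif key in SUFFIXES:
            cleaned.append(SUFFIXES[key])      # expand an abbreviation
        else:
            cleaned.append(token.capitalize()) # normalize letter case
    return " ".join(cleaned)

first = clean_address("12 mian st.")
second = clean_address("12  MAIN Street")
```

Both inputs come out as the same string, so later steps can compare them directly. Note that a naive table like this would also expand "St" in "St James", one reason real cleansing tools need context-aware rules.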
One important type of data cleansing is entity resolution, or deduplication, which identifies and merges information from multiple sources that
refer to the same entity. Mailing lists
are a well-known application; we have
all received duplicate mail solicitations
with different spellings of our names
or addresses. On the other hand,
sometimes seeming “duplicates” are
perfectly valid, because there really are
two different persons with very similar
names (John T. Jutt and his son John J.
Jutt) living at the same address.
Many data-cleansing tools exist,
based on different approaches or applied at different levels or scales. For
individual fields, a common technique
is edit distance: two values are treated as duplicates if changing a small number of
characters transforms one value into
the other. For records, the values of
all fields have to be considered; more
sophisticated systems look at groups
of records and accumulate evidence
over time as new data appears in the
system.
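The field-level technique can be sketched with the standard Levenshtein edit distance. This is one common definition of edit distance, not necessarily the one any particular tool uses; the `likely_duplicates` helper and its `max_edits` threshold of 2 are assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts,
    deletes, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,        # delete ch_a
                current[j - 1] + 1,     # insert ch_b
                previous[j - 1] + cost, # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

def likely_duplicates(a: str, b: str, max_edits: int = 2) -> bool:
    """Flag two field values as duplicates if few edits separate them.
    The threshold is a hypothetical tuning parameter."""
    return edit_distance(a.lower(), b.lower()) <= max_edits

father_son = edit_distance("John T. Jutt", "John J. Jutt")
```

The father and son named earlier differ by a single character, so a field-level test alone would flag them as duplicates. That is why record-level and group-level evidence matters.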
Schema Mapping. A fundamental operation for all information-integration
systems is identifying how a source
database schema relates to the target
integrated schema. Schema-mapping
tools, which tackle this challenge,
typically display three vertical panes
(see Figure 2). The left and right panes
show the two schemas to be mapped;
the center pane is where the designer
defines the mapping, usually by drawing lines between the appropriate
parts of the schemas and annotating
the lines with the required transformations. Some tools offer design assistance in generating these transfor-