one above. We summarize these approaches in this section, along with the
general types of products used. For information on specific products, we refer the interested reader to IT research
companies that publish comprehensive comparisons and to Web search
engines (using the product categories
we define here as keyword queries).
Data Warehouse Loading. A data
warehouse is a database that consolidates data from multiple sources. 7
For example, it may combine sales information from subsidiaries to give a
sales picture for the whole company.
Because subsidiaries have overlapping
sets of customers and may have inconsistent information about a particular
customer, data must be cleansed to
reconcile such differences. Moreover,
each subsidiary may have a database
schema (that is, data representation)
that differs from the warehouse schema. So each subsidiary’s data has to be
reshaped into the common warehouse
schema.
Extract-Transform-Load (ETL) tools
address this problem21 by simplifying
the programming of scripts. An ETL
tool typically includes a repertoire of
cleansing operations (such as detection of approximate duplicates) and
reshaping operations (such as Structured Query Language [SQL]-style operations to select, join, and sort data).
The tool may also include scheduling
functions to control periodic loading
or refreshing of the data warehouse.
Some ETL tools are customized for
master data management—that is, to
produce a data warehouse that holds
the master copy of critical enterprise
reference data, such as information
about customers or products. Master
data is first integrated from multiple
sources and then itself becomes the
definitive source of that data for the
enterprise. Master data-management
tools sometimes include domain-specific functionality. For example,
for customer or vendor information,
they may have formats for name and
address standardization and cleansing functions to validate and correct
postal codes.
Virtual Data Integration. While
warehouses materialize the integrated
data, virtual data integration gives the
illusion that data sources have been
integrated without materializing the
Beyond classical
information-
technology
applications,
information
integration is also a
large and growing
part of science,
engineering,
and biomedical
computing, as
independent labs
often need to use
and combine each
other’s data.
integrated view. Instead, it offers a mediated schema against which users can
pose queries. The implementation,
often called a query mediator35 or en-terprise-information integration (EII)
system, 16, 27 translates the user’s query
into queries on the data sources and
integrates the result of those queries
so that it appears to have come from a
single integrated database. EII is still
an emerging technology, currently less
popular than data warehousing.
Although the databases cover related subject matter, they are heterogeneous in that they may use different
database systems and structure the
data using different schemas. An EII
system might be used, for example,
by a financial firm to prepare for each
customer a statement of portfolio positions that combines information
about his or her holdings from the local customer database with stock prices retrieved from an external source.
To cope with this heterogeneity
in EII, a designer creates a mediated
schema that covers the desired subject matter in the data sources and
maps the data source schemas to the
new mediated schema. Data cleansing and reshaping problems appear in
the EII context, too. But the solutions
are somewhat different in EII because
data must be transformed as part of
query processing rather than via the
periodic batch process associated with
loading a data warehouse.
EII products vary, depending on the
types of data sources to be integrated.
For example, some products focus on
integrating SQL databases, some on
integrating Web services, and some on
integrating bioinformatics databases.
Message Mapping.
Message-orient-ed middleware helps integrate independently developed applications by
moving messages between them. If
messages pass through a broker, the
product is usually called an enterprise-application integration (EAI) system. 1
If a broker is avoided through all applications’ use of the same protocol (for
example, Web services), then the product is called an enterprise service bus.
If the focus is on defining and controlling the order in which each application is invoked (as part of a multistep
service), then the product is called a
workflow system.
In addition to the protocol-transla-