Beyond classical information-technology applications, information integration is also a large and growing part
of science, engineering, and biomedical computing, as independent labs
often need to use and combine each
other’s data.
Software vendors offer numerous
tools to reduce the effort, and hence
the cost, of integration and to improve
the quality. Moreover, because information integration is a complex and
multifaceted task, many of these tools
are highly specialized. The resulting
profusion of tools can be confusing. In
this article, we try to clear up any confusion by:
• Exploring an example of a typical integration problem
• Describing types of information-integration tools used in practice
• Reviewing core technologies that lie at the heart of integration tools
• Identifying future trends
An Example
Consider a large auto manufacturer’s
support center that receives a flood of
emails and service-call transcriptions
every day. From any given text, company analysts can extract the type of
car, the dealership, and whether the
customer is pleased or annoyed with
the service. But to truly understand
the reasons for the customer’s sentiment, the company also needs to know
something about the dealership and
the transaction—information that is
kept in a relational database.
Solving such an integration problem is an iterative process. The data
must first be understood and then
prepared for integration by means of
“cleansing” and “standardization.”
Next, specifications are needed regarding what data should be integrated and how they are related. Finally,
an integration program is generated
and executed by some type of integration engine. The results are examined,
and any anomalies must be resolved,
which often requires returning to step
one and studying the data.
Many technologies are needed to
support this process. We introduce a
few here and then describe them in
greater depth, along with others, in
subsequent sections.
The first step toward integrating the
text and the relational data is to understand what transactions and other
information they contain and how to
relate that information to each dealership. The manufacturer next needs to
decide how to represent the integrated
information. A simple schema—auto
model, customer, dealership, date
sold, price, size of dealership, date of
problem, problem type—might suffice
(see Figure 1). But how should each
field be represented, and where will the
data come from? Let’s assume that the
auto model, customer, and dealership
information will be extracted from the
textual complaint, as will the date of
problem and problem type. The relational database has tables about dealerships and transactions that can provide the rest of the information: size of
dealership, date sold, and price.
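One way to make these questions concrete is to write the target schema down as a record type and note where each field is expected to come from. The following is a minimal sketch only: the name TargetRecord, the field types, and the FIELD_SOURCE table are illustrative assumptions, not part of any particular integration product.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Labels for the two sources in the example (hypothetical constants).
TEXT = "complaint text"
DATABASE = "relational database"


@dataclass
class TargetRecord:
    """Target schema from the example; the field types are assumptions."""
    auto_model: str
    customer: str
    dealership: str
    date_sold: Optional[str] = None
    price: Optional[int] = None
    size_of_dealership: Optional[str] = None
    date_of_problem: Optional[str] = None
    problem_type: Optional[str] = None


# Where each target field is expected to come from, as described in the text.
FIELD_SOURCE = {
    "auto_model": TEXT,
    "customer": TEXT,
    "dealership": TEXT,
    "date_of_problem": TEXT,
    "problem_type": TEXT,
    "date_sold": DATABASE,
    "price": DATABASE,
    "size_of_dealership": DATABASE,
}


def fields_from(source):
    """List the target fields supplied by one source, in schema order."""
    return [f.name for f in fields(TargetRecord) if FIELD_SOURCE[f.name] == source]
```

Calling fields_from(DATABASE) lists the fields that still have to be looked up once the text has been processed, which is a convenient checklist when specifying the mappings.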
Next, programs are needed to extract structured information from the
email or transcription text. These programs’ outputs provide a schema for
the text data—the “fields” that can be
queried. Matching and mapping tools
can be used to relate this derived schema to the target schema. Similarly,
matching and mapping must be done
for the relational schema. The dealership name extracted from the text can
be connected to an entry in the dealership table, and the customer, auto
model, and dealership to the transactions table, thus joining the two tables
with the textual data.
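The mapping and join just described can be sketched with an in-memory SQLite database. This is an illustration under several assumptions: the table and column names, the derived field names (model, dealer, writer, sent), and the TEXT_TO_TARGET mapping are hypothetical, the sample rows are taken from Figure 1, and the extracted values are assumed to have already been aligned with the database formats.

```python
import sqlite3

# Build a toy version of the relational source (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dealers (
    dealer_id TEXT PRIMARY KEY, dealership TEXT, size TEXT);
CREATE TABLE transactions (
    trans_id INTEGER PRIMARY KEY, customer TEXT, dealer_id TEXT,
    auto_model TEXT, date_sold TEXT, price INTEGER);
INSERT INTO dealers VALUES ('bb32', 'Oshkosh Billboy Ford', '300 cars per month');
INSERT INTO transactions VALUES
    (1234, 'Jutt, John J.', 'bb32', 'Galaxy', '4/21/07', 5000);
""")

# Output of a hypothetical extraction program: the derived schema of the text.
extracted = {
    "model": "Galaxy",
    "dealer": "Oshkosh Billboy Ford",
    "writer": "Jutt, John J.",
    "sent": "Oct. 28, 2007",
}

# Schema mapping from the derived text schema to the target schema.
TEXT_TO_TARGET = {
    "model": "auto_model",
    "dealer": "dealership",
    "writer": "customer",
    "sent": "date_of_problem",
}


def integrate(conn, extracted):
    """Rename extracted fields, then probe the database for the rest."""
    record = {TEXT_TO_TARGET[key]: value for key, value in extracted.items()}
    row = conn.execute(
        """
        SELECT t.date_sold, t.price, d.size
        FROM transactions AS t
        JOIN dealers AS d ON d.dealer_id = t.dealer_id
        WHERE d.dealership = ? AND t.customer = ? AND t.auto_model = ?
        """,
        (record["dealership"], record["customer"], record["auto_model"]),
    ).fetchone()
    if row is not None:
        record["date_sold"], record["price"], record["size_of_dealership"] = row
    return record


result = integrate(conn, extracted)
```

The exact-match WHERE clause works here only because the extracted values were assumed to be clean; real text rarely cooperates.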
Programs are needed to align data
instances: because it is unlikely that
the data formats of the extracted text
are identical to those in the relational
database, some data cleansing will
be required. For example, dealership
names may not exactly match. These
data-integration programs must then
be executed, often using a commercial
integration product.
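As a rough illustration of instance alignment, the sketch below matches a dealership mention from the text against database names by token overlap (Jaccard similarity) and rewrites a signature into a "family, given" form. The function names, the 0.4 threshold, and the second candidate dealership are assumptions made for the example; commercial cleansing tools typically use richer techniques.

```python
import re


def tokens(name):
    """Lowercase a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two names (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(mention, candidates, threshold=0.4):
    """Return the most similar candidate, or None if nothing is close enough."""
    scored = [(jaccard(mention, c), c) for c in candidates]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None


def to_last_first(name):
    """Rewrite 'John J. Jutt' as 'Jutt, John J.' (single names are unchanged)."""
    given, _, family = name.strip().rpartition(" ")
    return f"{family}, {given}" if given else family


# 'Oshkosh Auto Mall' is a made-up distractor, not from the example data.
dealerships = ["Oshkosh Billboy Ford", "Oshkosh Auto Mall"]
matched = best_match("Billboy in Oshkosh", dealerships)
customer_key = to_last_first("John J. Jutt")
```

A threshold keeps the matcher from forcing a link when no candidate is plausible; the unmatched cases are the anomalies that send an analyst back to studying the data.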
Types of Information-Integration Tools
A variety of architectural approaches
can be used to solve problems like the
Figure 1. Annotators extract key information from email messages. This information is used to probe the relational source data to retrieve additional facts needed for the target schema.

Source data and schema

  Email
    ts: Oct. 28, 2007
    Sirs:
    My Galaxy’s brakes are squealing after only 6 months!
    I purchased this clunker at Billboy in Oshkosh…
    Sincerely,
    John J. Jutt

  DEALERS
    Dealership:          Oshkosh Billboy Ford
    DealerID:            bb32
    Size of dealership:  300 cars per month
    Owner:               Bill Boy
    Annual revenue:      $5M

  TRANSACTIONS
    TransID:             1234
    Customer:            Jutt, John J.
    DealerID:            bb32
    Auto model:          Galaxy
    Date sold:           4/21/07
    Price:               5000

Target schema

    Auto model:          Galaxy
    Customer:            John J. Jutt
    Dealership:          Oshkosh Billboy Ford
    Date sold:           4/21/07
    Price:               5000
    Size of dealership:  300 cars per month
    Problem date:        Oct. 28, 2007
    Problem type: