mations, which are often complex. 26, 30
Once the mapping is defined, most
tools can generate a program to transform data conforming to the source
schema into data conforming to the
target schema. 15 For an ETL engine,
the tool might generate a script in the
engine’s scripting language. For an EII
system, it might generate a query in
the query language, such as SQL. For
an EAI system, it might transform XML
documents from a source-message format to that of the target. For an object-to-relational mapping system, it might
generate a view that transforms rows
into objects. 23
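As an illustration of generating a program from a mapping, the sketch below turns a simple column-level mapping into a SQL statement and runs it against an in-memory SQLite database. The table names, column names, and the `generate_sql` helper are hypothetical, chosen only for this example; real mapping tools handle far richer transformations.

```python
import sqlite3
from typing import Dict


def generate_sql(source_table: str, target_table: str, mapping: Dict[str, str]) -> str:
    """Build an INSERT ... SELECT statement from a mapping of
    target column -> expression over source columns.
    Identifiers are interpolated directly, so this is only safe for trusted input."""
    target_cols = ", ".join(mapping.keys())
    source_exprs = ", ".join(mapping.values())
    return (
        f"INSERT INTO {target_table} ({target_cols}) "
        f"SELECT {source_exprs} FROM {source_table}"
    )


# Hypothetical mapping between two employee schemas.
mapping = {
    "EmpName": "First_Name || ' ' || Last_Name",
    "EmpSal": "Salary_of_Employee",
}
sql = generate_sql("Employee_S1", "Emp_S2", mapping)

# Run the generated statement to show that it transforms source rows into target rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee_S1 (First_Name TEXT, Last_Name TEXT, Salary_of_Employee REAL)"
)
conn.execute("CREATE TABLE Emp_S2 (EmpName TEXT, EmpSal REAL)")
conn.execute("INSERT INTO Employee_S1 VALUES ('Ada', 'Lovelace', 52000.0)")
conn.execute(sql)
rows = conn.execute("SELECT EmpName, EmpSal FROM Emp_S2").fetchall()
print(sql)
print(rows)
```

The mapping stays declarative (target column to source expression), and the executable statement is derived from it, which is the division of labor the paragraph above describes.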
Schema Matching. Large schemas
have several thousand elements, presenting a major problem for a schema-mapping tool. To map an element of
Schema 1 into a plausible match in
Schema 2, the designer may have to
scroll through dozens of screens. To
avoid this tedious process, the tool may
offer a schema-matching algorithm, 31
which uses heuristic or machine-learning techniques to find plausible
matches based on whatever information it has available—for example,
name similarity, data-type similarity,
structure similarity, an externally supplied thesaurus, or a library of previously matched schemas. The human
user must then validate the match.
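A minimal sketch of such a matcher, assuming only two of the information sources listed above (name similarity and an externally supplied thesaurus), might look like the following. The function names, the thesaurus contents, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration, not the method of any particular tool.

```python
from difflib import SequenceMatcher
from typing import FrozenSet, List, Set, Tuple


def normalize(name: str) -> str:
    """Lowercase a name and drop separators such as underscores."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_score(a: str, b: str, thesaurus: Set[FrozenSet[str]]) -> float:
    """Score two element names in [0, 1]; thesaurus synonyms score 1.0."""
    na, nb = normalize(a), normalize(b)
    if na == nb or frozenset((na, nb)) in thesaurus:
        return 1.0
    return SequenceMatcher(None, na, nb).ratio()


def suggest_matches(
    source: str, targets: List[str], thesaurus: Set[FrozenSet[str]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k target elements ranked by name score, best first."""
    scored = [(t, name_score(source, t, thesaurus)) for t in targets]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical thesaurus with a single synonym pair.
THESAURUS = {frozenset(("salary", "wages"))}

suggestions = suggest_matches("Salary", ["HireDate", "Wages", "DeptNo"], THESAURUS)
for target, score in suggestions:
    print(f"Salary -> {target}: {score:.2f}")
```

The output is a ranked list of candidates rather than a decision, which leaves the final validation to the human user.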
Schema-matching algorithms do
well at matching individual elements
with somewhat similar names, such
as Salary_of_Employee in Schema 1
and EmpSal in Schema 2, or when
matching predefined synonyms, such
as Salary and Wages. Some techniques
leverage data values. For example, the
algorithm might suggest a match between the element Salary of the source
database and Stpnd of the target if
they both have values of the same type
within a certain numerical range.
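The data-value heuristic can be sketched as follows, assuming that "same type within a certain numerical range" is approximated by requiring a shared numeric type and a minimum overlap between the two columns' value ranges. The overlap measure, the 0.5 threshold, and the sample values are illustrative assumptions.

```python
from typing import Sequence


def same_numeric_type(a: Sequence, b: Sequence) -> bool:
    """True if every value in both columns has the same type, int or float."""
    types = {type(v) for v in a} | {type(v) for v in b}
    return len(types) == 1 and types <= {int, float}


def range_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of the combined [min, max] span covered by both columns."""
    lo, hi = max(min(a), min(b)), min(max(a), max(b))
    span = max(max(a), max(b)) - min(min(a), min(b))
    if span == 0:
        return 1.0  # every value in both columns is identical
    return max(0.0, hi - lo) / span


def plausible_by_values(a: Sequence, b: Sequence, min_overlap: float = 0.5) -> bool:
    """Suggest a match when types agree and the value ranges largely overlap."""
    if not a or not b:
        return False
    return same_numeric_type(a, b) and range_overlap(a, b) >= min_overlap


# Hypothetical sample values from the source and target databases.
salary = [42000.0, 58000.0, 91000.0]
stpnd = [40000.0, 61000.0, 88000.0]
age = [23.0, 41.0, 67.0]

print(plausible_by_values(salary, stpnd))
print(plausible_by_values(salary, age))
```

With these sample values, Salary and Stpnd overlap heavily and are suggested as a match, while Salary and Age do not overlap at all and are rejected.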
But matching algorithms are ineffective when there are no hints to exploit. They cannot map an element
called PW (that is, person’s wages)
to EmpSal when no data values are
available; nor can they readily map
combinations of elements, such as Total_Price in Schema 1 and Quantity ×
Unit_Cost in Schema 2. That is, these
algorithms are helpful for avoiding
tedious activities but not for solving
subtle matching problems.
Keyword Search. Keyword search is
second nature to us all as a way to find
information. A search engine accepts
a user’s keywords as input and returns
a rank-ordered list of documents that
is generated using a pre-built index
and other information, such as anchor text and click-throughs. 5 A less
familiar view of search is as a form of
integration—for example, when a Web
search on a keyword yields an integrated list of pages from multiple Web
sites. In more sophisticated scenarios,
the documents to be searched reside
in multiple repositories such as digital libraries or content stores, where it
is not possible to build a single index.
In such cases, federated search can be
used to explore each store individually
and merge the results. 24
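The following sketch shows one simple way to federate a search: query each store separately, normalize the store-local scores, and merge the hits into one ranked list. The toy term-count scoring, the per-store normalization, and the store contents are assumptions for illustration; production systems use much more careful score merging.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, store-local relevance score)


def search_store(docs: Dict[str, str], keyword: str) -> List[Hit]:
    """Toy per-store search: the score is the number of keyword occurrences."""
    kw = keyword.lower()
    hits = [(doc_id, float(text.lower().split().count(kw))) for doc_id, text in docs.items()]
    return [hit for hit in hits if hit[1] > 0]


def federated_search(
    stores: Dict[str, Dict[str, str]], keyword: str
) -> List[Tuple[str, str, float]]:
    """Search each store individually, then merge into one ranked list."""
    merged: List[Tuple[str, str, float]] = []
    for store_name, docs in stores.items():
        hits = search_store(docs, keyword)
        if not hits:
            continue
        top = max(score for _, score in hits)
        # Normalize by the store's best score so stores are roughly comparable.
        merged.extend((store_name, doc_id, score / top) for doc_id, score in hits)
    return sorted(merged, key=lambda hit: hit[2], reverse=True)


# Hypothetical repositories that cannot share a single index.
stores = {
    "library": {
        "lib-1": "schema mapping and schema matching",
        "lib-2": "query processing basics",
    },
    "content": {
        "cs-7": "a schema for invoices",
        "cs-9": "schema evolution and schema versioning",
    },
}

results = federated_search(stores, "schema")
for store_name, doc_id, score in results:
    print(f"{store_name}/{doc_id}: {score:.2f}")
```

No shared index is built here; each store is free to rank its own documents however it likes, and only the merge step sees all the results.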
While keyword search does integrate information, it does so “loosely.”
The results are often imprecise, incomplete, or even irrelevant. By contrast, integration of structured data via
an ETL tool or a query mediator can
create new types of records by correlating and merging information from
different sources. The integration request has precise semantics, and the
answer normally includes all possible
relevant matches from these data sets,
assuming that the source data and
entity resolution are correct (both of
these are big assumptions). Both precise and loose integration techniques
have merit for different scenarios. Keyword search may even be used against
structured data to get a quick feel for
what is available and set the stage for
more precise integration.
Information Extraction. Information
extraction 25 is the broad term for a set
of techniques that produce structured
information from free-form text. Concepts of interest are extracted from
document collections by employing a
set of annotators, which may either be
custom code or specially constructed
extraction rules that are interpreted
and executed by an information-extraction system. In some scenarios,
when sufficient labeled training data
is available, machine-learning techniques may also be employed.
Important tasks include named-entity recognition (to identify people,
places, and companies, for example)
and relationship extraction (such as
a customer’s phone number or address). When a text fragment is
recognized as a concept, that fact can
be recorded by surrounding it with
XML tags that identify the concept, by
adding an entry in an index, or by copying the values into a relational table.
The result is better-structured information that can more easily be combined with other information, thus
aiding integration.
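A rule-based annotator of the kind described above can be sketched with regular expressions. The two patterns, the concept names, and the sample sentence are hypothetical and deliberately simplistic; the sketch records each recognized fragment both by wrapping it in XML-style tags and by copying it into rows suitable for a relational table.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical extraction rules: concept name -> pattern that recognizes it.
ANNOTATORS: Dict[str, "re.Pattern[str]"] = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "company": re.compile(r"\b[A-Z][A-Za-z]+ (?:Inc|Corp|Ltd)\b"),
}


def annotate(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the text with recognized concepts wrapped in tags,
    plus (concept, value) rows that could be loaded into a table.
    Note: the text is not XML-escaped, a simplification for this sketch."""
    rows: List[Tuple[str, str]] = []
    tagged = text
    for concept, pattern in ANNOTATORS.items():
        # Collect values from the original text, then tag the working copy.
        rows.extend((concept, m.group(0)) for m in pattern.finditer(text))
        tagged = pattern.sub(
            lambda m, c=concept: f"<{c}>{m.group(0)}</{c}>", tagged
        )
    return tagged, rows


tagged, rows = annotate("Call Acme Corp at 555-867-5309 about the order.")
print(tagged)
print(rows)
```

Once the values are in rows, they can be joined with other structured data like any other table.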
Dynamic Web Technologies. When a
portal is used to integrate data, it usually needs to be dynamically generated
from files and databases that reside
on backend servers. The evolution of
Web technologies has made such data
access easier. Particularly helpful has
been the advent of Web services and
Really Simple Syndication (RSS) feeds,
along with many sites offering their
data in XML. 6 Development technology has been evolving too, with rapid
improvement of languages, runtime
libraries, and graphical development
frameworks for dynamic generation of
Web pages.
One popular way to integrate dynamic content is a “mashup,” which
is a Web page that combines information and Web services. For example,
because a service for displaying maps
may offer two functions—one to display a map and another to add a glyph
that marks a labeled position on the
map—it could be used to create a
mashup that displays a list of stores
and their locations on the map. To
reduce the programming effort of creating mashups, frameworks are now
emerging that provide a layer of information integration analogous to EII
systems, but which are tailored to the
new “Web 2.0” environment. 2
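The store-locator mashup can be sketched as follows. `MapService` is a purely hypothetical local stand-in for a remote map service that offers the two functions mentioned above (display a map, add a labeled glyph); the store data and the text rendering are likewise invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MapService:
    """Hypothetical stand-in for a remote map service with two functions."""
    glyphs: List[Tuple[float, float, str]] = field(default_factory=list)

    def add_glyph(self, lat: float, lon: float, label: str) -> None:
        """Mark a labeled position on the map."""
        self.glyphs.append((lat, lon, label))

    def display_map(self) -> str:
        """Render the map; here just a text listing of the glyphs."""
        lines = ["[map]"]
        lines += [f"  * {label} @ ({lat:.4f}, {lon:.4f})" for lat, lon, label in self.glyphs]
        return "\n".join(lines)


def store_locator_mashup(stores: List[Dict], service: MapService) -> str:
    """Combine a store list (one source) with the map service (another)."""
    for store in stores:
        service.add_glyph(store["lat"], store["lon"], store["name"])
    listing = "\n".join(f"- {store['name']}" for store in stores)
    return listing + "\n" + service.display_map()


# Hypothetical store data, as it might come from a backend database.
stores = [
    {"name": "Downtown", "lat": 47.6097, "lon": -122.3331},
    {"name": "Airport", "lat": 47.4502, "lon": -122.3088},
]
service = MapService()
page = store_locator_mashup(stores, service)
print(page)
```

The mashup code itself owns neither the store data nor the map; it only wires the two together on one page.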
Future Trends
Today, every step of the information-integration process requires a good
deal of manual intervention, which
constitutes the main cost. Because
integration steps are often complex,
some human involvement seems unavoidable. Yet more automation is
surely possible—for example, to explain the behavior of mappings, identify anomalous input data, and trace
the source of query results. 8 Researchers and product developers continue
to explore ways to reduce human effort not only by improving the core
technologies mentioned in this article
and the integration tools that embody