them but also by trying to simplify the
process through better integration of
the tools themselves.
Information integration is currently a brittle process; changing the structure of just one data source can force
an integration redesign. This problem
of schema evolution 32 has received
much attention from researchers, but
surprisingly few commercial tools that
might reduce the cost of integration
are available to address the problem.
Another cause of brittleness, and another topic of research, 9 arises from
the complex rules for handling the
inconsistencies and incompleteness
of different sources. One possible approach, for example, is to offer a tool
that suggests minimal changes to
source data, thereby eliminating many
of the unanticipated inconsistencies.
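As a rough illustration of such a tool, the sketch below assumes that integrated records should agree on one attribute whenever they share a key. For each key whose records conflict, it proposes changing the less common values to the most common one, which touches the fewest cells. The function name, the record layout, and the majority-vote rule are all assumptions made for this example; they are not taken from any particular product or from the cited research.

```python
from collections import Counter, defaultdict


def suggest_repairs(rows, key, attr):
    """Suggest cell changes so that rows sharing `key` agree on `attr`.

    Hypothetical helper: within each conflicting group it keeps the most
    common value and proposes rewriting the others.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[key]].append(idx)

    suggestions = []
    for indexes in groups.values():
        counts = Counter(rows[i][attr] for i in indexes)
        if len(counts) < 2:
            continue  # the group is already consistent
        target, _ = counts.most_common(1)[0]
        for i in indexes:
            if rows[i][attr] != target:
                # (row index, attribute, current value, suggested value)
                suggestions.append((i, attr, rows[i][attr], target))
    return suggestions


# Records for the same customer drawn from three made-up sources.
rows = [
    {"source": "crm", "customer_id": 7, "city": "San Jose"},
    {"source": "billing", "customer_id": 7, "city": "San Jose"},
    {"source": "support", "customer_id": 7, "city": "Sna Jose"},
    {"source": "crm", "customer_id": 9, "city": "Redmond"},
]
suggestions = suggest_repairs(rows, "customer_id", "city")
```

Here the only suggestion is to change the city in the third record. A real tool would also need to weigh source reliability and let a person accept or reject each proposed change.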
Most past work has focused on the
problems of information-technology
shops, where the goal of integration is
usually known at the outset of a project. But some recent work addresses
problems in other domains, notably
science, engineering, and personal-information management. In these
domains, information integration is
often an exploratory activity in which
a user integrates some information,
evaluates the result, and consequently identifies additional information
to integrate. In this scenario, called
“dataspaces,” 17 finding the right data
sources is important, as is automated
tracking of how the integrated data
was derived, called its “provenance.” 34
Semantic technologies such as ontologies and logic-based reasoning engines may also help with the integration task. 19
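To make the idea of provenance tracking concrete, the following minimal sketch attaches to every value the set of source records it was derived from, and unions those sets whenever values are combined. The `Traced` class, the `combine` helper, and the source identifiers are hypothetical names chosen for this example; actual provenance systems record far richer information, such as the transformations applied.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Traced:
    """A value together with the identifiers of the records it came from."""
    value: object
    sources: frozenset = field(default_factory=frozenset)


def from_source(value, source_id):
    """Wrap a raw value read from one source record."""
    return Traced(value, frozenset([source_id]))


def combine(fn, *inputs):
    """Apply fn to the raw values and union the provenance of the inputs."""
    result = fn(*(t.value for t in inputs))
    sources = frozenset().union(*(t.sources for t in inputs))
    return Traced(result, sources)


# Two revenue figures from made-up sources, integrated into one total.
q1 = from_source(1200, "sales_db.orders#17")
q2 = from_source(800, "partner_feed.csv#3")
total = combine(lambda a, b: a + b, q1, q2)
```

After this runs, `total.value` is 2000 and `total.sources` names both contributing records, so a user exploring the integrated data can trace a surprising result back to its origins.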
Information integration is a vibrant
field powered not only by engineering innovation but also by evolution
of the problem itself. Initially, information integration was stimulated by
the needs of enterprises; for the last
decade, it has also been driven by the
desire to integrate the vast collection
of data available on the Web. Recent
trends—the continual improvement
of Web-based search, the proliferation
of hosted applications, cloud storage,
Web-based integration services, and
open interfaces to Web applications
(such as social networks), among others—present even more challenges to
the field. Information integration will
keep large numbers of software engineers and computer-science researchers busy for a long time to come.
Acknowledgments
We are grateful to Denise Draper, Alon
Halevy, Mauricio Hernández, David
Maier, Sergey Melnik, Sriram Raghavan, and the anonymous referees for
many suggested improvements.
References
1. Alonso, G., Casati, F., Kuno, H.A., and Machiraju, V. Web
Services—Concepts, Architectures and Applications.
Springer, 2004.
2. Altinel, M., Brown, P., Cline, S., Kartha, R., Louie,
E., Markl, V., Mau, L., Ng, Y-H., Simmen, D. E., and
Singh, A. DAMIA—A data mashup fabric for intranet
applications. VLDB Conference (2007), 1370–1373.
3. Babcock, C. XML plays big integration role.
Information Week (May 24, 2004);
www.informationweek.com/story/showArticle.jhtml?articleID=20900153.
4. Bernstein, P. A. and Melnik, S. Model management 2.0:
Manipulating richer mappings. In Proceedings of the
ACM SIGMOD Conference (2007), 1–12.
5. Brin, S. and Page, L. The anatomy of a large-scale
hypertextual Web search engine. Computer Networks
30, 1–7 (1998), 107–117.
6. Carey, M.J. Data delivery in a service-oriented world:
The BEA AquaLogic data services platform. In
Proceedings of the ACM SIGMOD Conference (2006),
695–705.
7. Chaudhuri, S. and Dayal, U. An overview of data
warehousing and OLAP technology. ACM SIGMOD
Record 26, 1 (1997), 65–74.
8. Chiticariu, L. and Tan, W. C. Debugging schema
mappings with routes. VLDB Conference (2006),
79–90.
9. Chomicki, J. Consistent query answering: Five
easy pieces. In Proceedings of the International
Conference on Database Theory (2007), 1–17.
10. Dasu, T. and Johnson, T. Exploratory Data Mining and
Data Cleaning. John Wiley, 2003.
11. Firestone, J.M. Enterprise Information Portals and
Knowledge Management. Butterworth-Heinemann
(Elsevier Science, KMCI Press), 2003.
12. Foundational Model of Anatomy, Structural
Informatics Group, University of Washington;
http://sig.biostr.washington.edu/projects/fm/.
13. Gene Ontology; http://www.geneontology.org/.
14. Haas, L. M. Beauty and the beast: The theory and
practice of information integration. International
Conference on Database Theory (2007), 28–43.
15. Haas, L. M., Hernández, M.A., Ho, H., Popa, L., and
Roth, M. Clio grows up: From research prototype to
industrial tool. In Proceedings of the ACM SIGMOD
Conference (2005), 805–810.
16. Halevy, A. Y., Ashish, N., Bitton, D., Carey, M.J., Draper,
D., Pollock, J., Rosenthal, A., and Sikka, V. Enterprise
information integration: Successes, challenges, and
controversies. In Proceedings of the ACM SIGMOD
Conference (2005), 778–787.
17. Halevy, A. Y., Franklin, M.J., and Maier, D. Principles of
dataspace systems. ACM Symposium on Principles of
Database Systems (2006), 1–9.
18. Health Level Seven; http://www.hl7.org/.
19. Hepp, M., De Leenheer, P., de Moor, A., and Sure,
Y. (Eds.). Ontology management: Semantic web,
semantic web services, and business applications. Vol.
7 of series Semantic Web And Beyond. Springer, 2008.
20. IDC. Worldwide Data Integration and Access
Software 2008–2012 Forecast. Doc No. 211636 (Apr.
2008).
21. Kimball, R. and Caserta, J. The Data Warehouse ETL
Toolkit. Wiley and Sons, 2004.
22. Ludäscher, B., Papakonstantinou, Y., and Velikhov, P.
Navigation-driven evaluation of virtual mediated views.
Extending Database Technology (2000), 150–165.
23. Melnik, S., Adya, A., and Bernstein, P. A. Compiling
mappings to bridge applications and databases. In
Proceedings of the ACM SIGMOD Conference (2007),
461–472.
24. Meng, W., Yu, C., and Liu, K. Building efficient and
effective metasearch engines. ACM Computing
Surveys 34, 1 (2002), 48–89.
25. McCallum, A. Information extraction: Distilling
structured data from unstructured text. ACM Queue 3,
9 (Nov. 2005).
26. Miller, R.J., Haas, L.M., and Hernández, M.A. Schema
mapping as query discovery. VLDB Conference (2000),
77–88.
27. Morgenthal, J. P. Enterprise Information Integration: A
Pragmatic Approach. Lulu.com, 2005.
28. OASIS standards; www.oasis-open.org/specs/.
29. OMG Specifications;
www.omg.org/technology/documents/modeling_spec_catalog.htm.
30. Popa, L., Velegrakis, Y., Miller, R. J., Hernández, M. A.,
and Fagin, R. Translating Web data. VLDB Conference
(2002), 598–609.
31. Rahm, E. and Bernstein, P.A. A survey of approaches
to automatic schema matching. VLDB Journal 10, 4
(2001), 334–350.
32. Roddick, J. F. and de Vries, D. Reduce, reuse, recycle:
Practical approaches to schema integration, evolution,
and versioning. Advances in Conceptual Modeling—
Theory and Practice, Lecture Notes in Computer
Science, 4231. Springer, 2006.
33. Smith, M. Toward enterprise information integration.
Softwaremag.com (Mar. 2007);
www.softwaremag.com/L.cfm?Doc=1022-3/2007.
34. Tan, W-C. Provenance in databases: Past, current, and
future. IEEE Data Eng. Bulletin 30, 4 (2007), 3–12.
35. Wiederhold, G. Mediators in the architecture of future
information systems. IEEE Computer 25, 3 (1992),
38–49.
36. Workshop on Information Integration, October 2006;
http://db.cis.upenn.edu/iiworkshop/postworkshop/index.htm.
Philip A. Bernstein (philbe@microsoft.com) is a principal
researcher in the database group of Microsoft Research in
Redmond, WA.
Laura M. Haas (laura@almaden.ibm.com) is an IBM
distinguished engineer and director of computer science at
the IBM Almaden Research Center in San Jose, CA.