devised. At the bottom, commercial
services fill the need for primary, additional, or backup sites for collections of
individual or private value. At the top,
stewardship is primarily in the hands
of libraries, museums, archives, government funding agencies, and other
trusted institutions. In the middle, institutions, communities, enterprises,
and others are the primary stewards
of data, wrestling with institutional
and community solutions for stable
and sustainable digital stewardship
and preservation. The next decade will
likely see more creative partnerships
among all the players in the Pyramid,
as well as more attention to who actually pays the data bill and how its costs
are managed.
Creating an economically viable
Data Pyramid must also be complemented with continued research into
and development of solutions that address the technical challenges of data
management and preservation, resulting in the ability to utilize and create
new knowledge from the data being
stored. For example, the process of
searching and mining data depends
on how it is organized, what additional
information (metadata) is associated
with it, and what information might be
included about the relationship (ontological structure)
of data items to one
another in a collection. All these functions are active and important areas
for research. Privacy and policy controls for data collections and security
of the supporting infrastructure are
also critical research areas. Addressing the technical, economic, and social
aspects of digital preservation will be
critical to ensuring that the information age has the foundation required
to achieve its potential.
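To make the dependence of search on organization and metadata concrete, here is a minimal sketch in Python. It is an illustration only, not a method described in this article: the `DataItem` record, the `search` and `follow` helpers, and the sample collection are all hypothetical names and data invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    """One item in a collection (hypothetical record layout for illustration)."""
    identifier: str
    # Descriptive metadata, e.g. {"format": "TIFF", "subject": "brain"}.
    metadata: dict = field(default_factory=dict)
    # Relationships to other items: relation name -> list of identifiers.
    related: dict = field(default_factory=dict)


def search(collection, **criteria):
    """Return the items whose metadata matches every key/value pair given."""
    return [
        item for item in collection
        if all(item.metadata.get(key) == value for key, value in criteria.items())
    ]


def follow(collection, item, relation):
    """Return the items that `item` points to through the named relation."""
    index = {entry.identifier: entry for entry in collection}
    return [index[ref] for ref in item.related.get(relation, []) if ref in index]


# Invented sample data: a scan, a segmentation derived from it, and a photo.
collection = [
    DataItem("scan-001", {"format": "TIFF", "subject": "brain"},
             {"derived": ["seg-001"]}),
    DataItem("seg-001", {"format": "NIfTI", "subject": "brain"}),
    DataItem("photo-17", {"format": "JPEG", "subject": "family"}),
]

brain_items = search(collection, subject="brain")
derived_items = follow(collection, collection[0], "derived")
```

Without the `metadata` and `related` fields, the only way to find the brain scans or the products derived from them would be to open and inspect every item, which is why metadata and relationship information matter so much for searching and mining at scale.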
Top 10 Guidelines for
Data Stewardship
Whether your data portfolio is of personal, community, or societal value (or
some combination), its viability and
usefulness to you will result from how
you plan for stewardship and preservation over its lifetime. The following guidelines help promote effective
stewardship and preservation of digital data:
1. Make a plan. Create an explicit
strategy for stewardship and preservation for your data, from its inception
to the end of its lifetime; explicitly consider what that lifetime may be;
2. Be aware of data costs and include
them in your overall IT budget. Ensure
that all costs are factored in, including hardware, software, expert support, and time. Determine whether it
is more cost-effective to regenerate
some of your information rather than
preserve it over a long period;
3. Associate metadata with your data.
Metadata is needed to be able to find
and use your data immediately and for
years to come. Identify relevant standards for data/metadata content and
format, following them to ensure the
data can be used by others;
4. Make multiple copies of valuable
data. Store some of them off-site and
in different systems;
5. Plan for the transition of digital
data to new storage media ahead of time.
Include budgetary planning for new
storage and software technologies, file
format migrations, and time. Migration must be an ongoing process. Migrate data to new technologies before
your storage media become obsolete;
6. Plan for transitions in data stewardship. If the data will eventually be
turned over to a formal repository,
institution, or other custodial environment, ensure it meets the requirements of the new environment and
that the new steward indeed agrees to
take it on;
7. Determine the level of “trust” required when choosing how to archive
data. Are the resources of the U.S. National Archives and Records Administration necessary, or will Google do?
8. Tailor plans for preservation and
access to the expected use.
Gene-sequence data used daily by hundreds of
thousands of researchers worldwide
may need a different preservation and
access infrastructure from, say, digital
photos viewed occasionally by family
members;
9. Pay attention to security. Be aware
of what you must do to maintain the
integrity of your data; and
10. Know the regulations. Know
whether copyright, the Health Insurance Portability and Accountability
Act of 1996, the Sarbanes-Oxley Act of
2002, the U.S. National Institutes of
Health publishing expectations, or
other policies and/or regulations are
relevant to your data, ensuring your approach
to stewardship and publication
is compliant.
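Guidelines 4 and 9 (multiple copies, data integrity) can be partly supported by a routine fixity check. The following is a minimal Python sketch, assuming the steward recorded a SHA-256 digest when the data was first stored; the function names `sha256_of` and `verify_copies` are hypothetical and chosen only for this illustration. It is one possible approach, not a complete preservation procedure.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(expected_digest, copies):
    """Map each copy's path to True if it exists and matches the digest."""
    results = {}
    for copy in copies:
        path = Path(copy)
        results[str(path)] = path.is_file() and sha256_of(path) == expected_digest
    return results
```

A steward might run `verify_copies` on a schedule against every on-site and off-site copy; any path mapped to `False` is missing or has changed since the digest was recorded and should be restored from a copy that still matches.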
While adherence is not a magic bullet guaranteeing the long-term safety
and accessibility of fragile digital data,
these guidelines help focus appropriate attention, effort, and support on
the maintenance and preservation of
our valued digital information. Such
attention is critical to our ability to
harness the immense potential of the
information age to illuminate and empower us in our changing world.
Acknowledgment
I am grateful to John Gantz, Chris
Greer, Nancy McGovern, David Minor,
David Reinsel, Brian Schottlaender,
Jan Zverina, and the reviewers for their
useful comments and generous help
with this article.
Francine Berman (berman@sdsc.edu) is director of the
San Diego Supercomputer Center, Professor of Computer
Science and Engineering, and HPC High Performance
Computing Endowed Chair in the Jacobs School of
Engineering at the University of California, San Diego,
and is also co-chair of the Blue Ribbon Task Force on
Sustainable Digital Preservation and Access.