indicate that 2007 marked the “crossover” year in which more digital data
was created than there was data storage
to host it. At that point, the amount
of digital data (information created,
captured, or replicated in digital form)
exceeded the amount of storage (all
empty or usable space on hard drives,
tapes, CDs, DVDs, and volatile and
nonvolatile memory). At the crossover
point, this amount was estimated to
be around 264 exabytes (264 × 10¹⁸
bytes).5 This is almost a million times
the amount of digital data hosted in
2008 by the U.S. Library of Congress
(www.loc.gov/library/libarch-digital.html) and more than 20,000 times the
aggregate of permanent electronic records projected to be stored in 2010 by
the U.S. National Archives and Records
Administration (www.archives.gov/era/). The IDC report further projected
that by 2011 the amount of digital information created would be nearly 1.8
zettabytes (1.8 × 10²¹ bytes), or more than
twice the amount of available storage,
estimated at 800+ exabytes.
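The "more than twice" comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal (SI) prefixes, treats "nearly 1.8 zettabytes" as exactly 1.8 zettabytes, and treats "800+ exabytes" as exactly 800 exabytes, so the ratio it prints is an upper bound. The variable and function names are hypothetical and are not taken from the IDC report.

```python
# Decimal (SI) prefixes, matching the powers of ten quoted in the text.
EXABYTE = 10**18
ZETTABYTE = 10**21

# Assumption: "nearly 1.8 zettabytes" is taken as exactly 1.8 ZB,
# stored as an exact integer number of bytes.
created_2011 = 18 * ZETTABYTE // 10

# Assumption: "800+ exabytes" is taken as exactly 800 EB (a lower bound
# on storage, so the ratio below is an upper bound).
storage_2011 = 800 * EXABYTE


def shortfall_ratio(created: int, storage: int) -> float:
    """Return how many times the data created exceeds the available storage."""
    return created / storage


ratio = shortfall_ratio(created_2011, storage_2011)
print(f"created: {created_2011 // EXABYTE} EB, storage: {storage_2011 // EXABYTE} EB")
print(f"created/storage ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out at 2.25. Because the storage estimate is given as "800+", the true ratio would be somewhat lower, and it stays above 2 as long as available storage is below 900 exabytes.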
The methodology under which
these estimates were derived (what is
counted and how it is calculated6) is
fascinating and has generated considerable community discussion. However, even under variations
of the IDC methodology, the trend
is unmistakable: We do not produce
storage capacity at the same rate we
produce digital information. Even if
we wanted to, we cannot keep all of our
digital data.
The thoughtful and methodical
selection of which data sets to keep
(called “appraisal” in the archival
world) will be critical to communities
used to keeping it all. In the research
and education community, methods
for community appraisal (coupled
with the need for budgets to ensure
adequate data stewardship and preservation for selected data sets) will likewise be more important over the next
decade.
The need for community appraisal
will push academic disciplines beyond
individual stewardship, where project
leaders decide which data is valuable,
which should be preserved, and how
long it should be preserved (except
where regulation, policy, and/or publication protocols mandate specific
stewardship and preservation timeframes). Some communities are beginning to develop explicit appraisal
criteria and community stewardship
models for valuable reference data
collections (such as the Protein Data
Bank, www.rcsb.org/pdb/home/home.do, in the life sciences and the Panel
Study of Income Dynamics, psidonline.isr.umich.edu/Guide/, in the social sciences). Over the next decade, as
more data is generated and the costs of
data CI are incorporated into the “IT
bill” at our institutions and enterprises, we can expect to devote more time
and attention to the criteria and process through which we appraise data
for stewardship and preservation.
Trend 2. More and more policies and
regulations require the access, stewardship, and/or preservation of digital
data. Even before the information age,
the Copyright Clause (Article 1, Section 8) of the U.S. Constitution and
subsequent regulation set the stage
for policy with respect to the rights
and dissemination of information in
the U.S. Today, many forms of digital rights management and a broad
range of public policies govern the
access, stewardship, and preservation of digital data around the world.
In the U.S., the Sarbanes-Oxley Act of
2002 promotes appropriate and responsible management and preservation
of digital financial and other records
for publicly owned companies, and
the Health Insurance Portability and
Accountability Act of 1996 ensures the
privacy of digital medical records. On
the research front, investigators at the
U.S. National Institutes of Health are
required to submit digital copies of
their publications to PubMed Central
(publicaccess.nih.gov/), and the U.S.
National Science Foundation’s data-sharing policy “expects its awardees to
share results of NSF-assisted research
and education projects with others
both within and outside the scientific
and engineering research and education community.”10
Increased emphasis on the access,
preservation, and use of digital materials is not limited to the U.S. For example,
in the U.K., the Joint Information Systems Committee (www.jisc.ac.uk/) and
the British Library (www.bl.uk/npo)
Figure 1: Data cyberinfrastructure at the San Diego Supercomputer Center. [Diagram labels: data sources (instruments, computers, sensornets); coordinated data cyberinfrastructure (data storage, data management, data use, data access); coordinated component systems (modeling, analysis, simulation, visualization, portals, curation, file systems, database systems, collection management, data integration); data services (database selection and schema design, portal creation and collection publication, data analysis, data mining, data hosting, preservation services, and domain-specific tools: Next-Generation Biology Workbench, GEON portal, Kepler workflow management).]