data is an example of cyberinfrastructure—the coordinated aggregate of
information technologies and systems
(including experts and organizations)
enabling work, recreation, research,
education, and life in the information
age. The relationship between cyberinfrastructure in the cyberworld and infrastructure in the physical world was
described in the U.S. National Science
Foundation’s 2003 Final Report of the
Blue Ribbon Advisory Panel on Cyberinfrastructure, commonly known as the
“Atkins Report”2 after its chair, Dan
Atkins: “The term infrastructure has
been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and
similar public works that are required
for an industrial economy to function.
Although good infrastructure is often
taken for granted and noticed only
when it stops functioning, it is among
the most complex and expensive
things that society creates. The newer
term cyberinfrastructure refers to infrastructure based upon distributed
computer, information, and communication technology. If infrastructure
is required for an industrial economy,
then we could say that cyberinfrastructure is required for a knowledge economy.”
The implication of the report is
that, like infrastructure in the physical world, data cyberinfrastructure, or
data CI, should exhibit critical characteristics that render it useful, usable,
cost-effective, and unremarkable. The
innovation, development, prototyping, and deployment of CI with such
characteristics constitute a massive
endeavor for all sectors, including the
academic sector.1,3
What are the components of data
CI? In the research and education
community, users want a coordinated
environment that manages digital
data from creation to preservation,
accommodates data ingested from instruments, sensors, computers, laboratories, people, and other sources,
and includes data management tools
and resources, data storage, and data
use facilities (such as computers for
analysis, simulation, modeling, and
visualization). Users want to store and
use their data for periods spanning the
short term (days) to the long term
(decades and beyond), and they want it to
be available to their collaborators and
communities through portals and other environments. Figure 1 outlines the
portfolio of coordinated components
that constitute the data CI environment at the San Diego Supercomputer
Center (www.sdsc.edu/). Such environments must be designed to meet the
needs of the target user community
while being continually maintained
and evolved to support digital data
over the long term.
Trends
A 2008 International Data Corporation
(IDC) white paper sponsored by EMC
Corporation5 described the world we
live in as awash in digital data: an estimated 281 exabytes (2.25 × 10^21 bits)
in 2007. This is equivalent to 281 trillion digitized novels but less than 1%
of Avogadro’s number, or the number
of atoms in 12 grams of carbon-12 (6.022
× 10^23). By IDC estimates, the amount
of digital data in our cyberworld will
surpass Avogadro’s number by 2023.5
Even if these estimates are off significantly, storing, accessing, managing,
preserving, and dealing with digital
data is clearly a fundamental need and
an immense challenge.
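The arithmetic behind these figures can be checked with a short sketch. It assumes decimal exabytes (10^18 bytes), which the text does not state. The annual growth factor it prints is not an IDC figure: it is only what a constant year-over-year rate would have to be to connect the 2007 estimate to Avogadro's number in 2023.

```python
# Sanity-check the figures quoted from the IDC estimate.
# Assumptions (not stated in the text): decimal exabytes (10**18 bytes)
# and a constant year-over-year growth factor between 2007 and 2023.

EXABYTE_BYTES = 10**18
BITS_PER_BYTE = 8
AVOGADRO = 6.022e23

data_2007_bytes = 281 * EXABYTE_BYTES
data_2007_bits = data_2007_bytes * BITS_PER_BYTE

# 281 exabytes spread over 281 trillion novels implies this size per novel.
bytes_per_novel = data_2007_bytes // (281 * 10**12)

# How close the 2007 bit count is to Avogadro's number.
fraction_of_avogadro = data_2007_bits / AVOGADRO

# Growth factor per year implied by the two endpoints alone
# (a simplification; real growth need not be constant).
years = 2023 - 2007
implied_annual_growth = (AVOGADRO / data_2007_bits) ** (1 / years)

print(f"2007 estimate: {data_2007_bits:.3e} bits")
print(f"implied novel size: {bytes_per_novel:,} bytes")
print(f"share of Avogadro's number: {fraction_of_avogadro:.2%}")
print(f"implied growth factor per year: {implied_annual_growth:.2f}")
```

Under these assumptions the 2007 total is roughly 0.4% of Avogadro's number, a "digitized novel" works out to about one megabyte, and the endpoints imply growth of somewhat over 40% per year.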
The development of data CI is greatly affected by both current and projected use-case scenarios, and our need
to search, analyze, model, mine, and
visualize digital data informs how we
organize, present, store, and preserve
it. More broadly, data CI is influenced
by trends in technology, economics,
policy, and law. Four significant trends
reflect the larger environment in which
data CI is evolving:
Trend 1. More digital data is being
created than there is storage to host it.
Estimates from the IDC white paper