data is an example of cyberinfrastructure: the coordinated aggregate of information technologies and systems (including experts and organizations) enabling work, recreation, research, education, and life in the information age. The relationship between cyberinfrastructure in the cyberworld and infrastructure in the physical world was described in the U.S. National Science Foundation’s 2003 Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure, commonly known as the “Atkins Report”² after its Chair, Dan Atkins: “The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function. Although good infrastructure is often taken for granted and noticed only when it stops functioning, it is among the most complex and expensive things that society creates. The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information, and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.”
The implication of the report is that, like infrastructure in the physical world, data cyberinfrastructure, or data CI, should exhibit critical characteristics that render it useful, usable, cost-effective, and unremarkable. The innovation, development, prototyping, and deployment of CI with such characteristics constitute a massive endeavor for all sectors, including the academic sector.¹,³
What are the components of data CI? In the research and education community, users want a coordinated environment that manages digital data from creation to preservation, accommodates data ingested from instruments, sensors, computers, laboratories, people, and other sources, and includes data management tools and resources, data storage, and data use facilities (such as computers for analysis, simulation, modeling, and visualization). Users want to store and use their data for periods spanning the short-term (days) to the long-term (decades and beyond), and they want it to be available to their collaborators and communities through portals and other environments. Figure 1 outlines the portfolio of coordinated components that constitute the data CI environment at the San Diego Supercomputer Center (www.sdsc.edu/). Such environments must be designed to meet the needs of the target user community while being continually maintained and evolved to support digital data over the long term.
A 2008 International Data Corporation (IDC) white paper sponsored by EMC Corporation⁵ described the world we live in as awash in digital data: an estimated 281 exabytes (2.25 × 10²¹ bits) in 2007. This is equivalent to 281 trillion digitized novels but less than 1% of Avogadro’s number, or the number of atoms in 12 grams of carbon-12 (6.022 × 10²³). By IDC estimates, the amount of digital data in our cyberworld will surpass Avogadro’s number by 2023.⁵ Even if these estimates are off significantly, storing, accessing, managing, preserving, and dealing with digital data is clearly a fundamental need and an immense challenge.
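The arithmetic behind these comparisons can be checked in a few lines. The sketch below is illustrative only and rests on assumptions not stated in the white paper as quoted here: a decimal exabyte (10^18 bytes), roughly one megabyte per digitized novel, and a constant compound growth rate between 2007 and 2023. The function names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Assumptions (not from the source): decimal exabytes, ~1 MB per novel,
# and constant compound growth between 2007 and 2023.

AVOGADRO = 6.022e23      # value quoted in the text
BITS_PER_BYTE = 8
EXABYTE = 10**18         # bytes in one decimal exabyte (assumed)
NOVEL_BYTES = 10**6      # assumed size of one digitized novel


def exabytes_to_bits(exabytes):
    """Convert a quantity in exabytes to bits."""
    return exabytes * EXABYTE * BITS_PER_BYTE


def implied_annual_growth(start, target, years):
    """Constant yearly growth rate that takes `start` to `target` in `years`."""
    return (target / start) ** (1 / years) - 1


bits_2007 = exabytes_to_bits(281)
fraction_of_avogadro = bits_2007 / AVOGADRO
novels = 281 * EXABYTE / NOVEL_BYTES
growth = implied_annual_growth(bits_2007, AVOGADRO, 2023 - 2007)

print(f"2007 data volume: {bits_2007:.3e} bits")
print(f"Share of Avogadro's number: {fraction_of_avogadro:.2%}")
print(f"Equivalent novels: {novels:.3e}")
print(f"Implied annual growth to 2023: {growth:.1%}")
```

Under these assumptions the 2007 figure comes out well under 1% of Avogadro’s number, consistent with the text, and reaching that number by 2023 would require the total to grow by roughly two-fifths each year. The growth figure is a consequence of the simplifying assumption of constant growth, not a number reported by IDC.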
The development of data CI is greatly affected by both current and projected use-case scenarios, and our need to search, analyze, model, mine, and visualize digital data informs how we organize, present, store, and preserve it. More broadly, data CI is influenced by trends in technology, economics, policy, and law. Four significant trends reflect the larger environment in which data CI is evolving:
Trend 1. More digital data is being created than there is storage to host it. Estimates from the IDC white paper