Viewpoints
DOI: 10.1145/1859204.1859220
Viewpoint
We Need a Research Data Census
This past year was a census year in the U.S. We responded to arguably the most long-lived and broad-based gathering of domiciliary information about the American public
anywhere. U.S. Census data, collected
every decade, provides a detailed picture of how many of us there are, where
we live, and how we’re distributed by
age, gender, household, ethnic diversity, and other characteristics.
The Census (http://2010.census.gov/2010census/index.php) provides
an evidence-based snapshot of America. This important information is
publicly available and used in a variety
of ways—to guide in the planning of
senior centers, schools, bridges, and
emergency services, to make assessments informed by societal trends
and attributes, and to make predictions about future social and economic
needs. The Census is particularly valuable as a planning tool in the building
of physical infrastructure, as the distribution and characteristics of the population drive the development of hospitals, public works projects, and other
essential facilities and services.
Given the role and importance of the
Census in the physical world, it is useful to ask what provides an analogous
evidence-based and publicly available
snapshot of the “inhabitants” of the
Digital World—our digital data.
What do we know about our data?
How much is there? Where does it reside? What are its characteristics? Good
“top-down” methodological estimates
of these questions have come from the
reports on the increasing deluge of digital information developed by the IDC
(http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf), by Bohn and Short
(http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf),
and (some time ago) by Lyman and
Varian (http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf). These
provide intriguing, analytically derived
bounds of the Digital World.
However, to make economic decisions that can drive the cost-effective
development and deployment of the
cyberinfrastructure needed to support
long-lived digital data, we need more
resolution. This is particularly important in the research arena, where federal R&D agencies apportion funding
between two competing priorities: conducting basic research, and creating and supporting the cyberinfrastructure that enables that research.
Just as the U.S. Census drives planning
for infrastructure in the physical world,
a Research Data Census would inform