provide comprehensive information
about the necessary data systems and
environments required to support
data.
A Research Data Census will provide
some specifics critical to cost-effective
planning for stewardship of federally
funded research data, however, and it
will allow us to infer some key requirements for data cyberinfrastructure.
In particular, a Research Data Census
could help inform:
• Useful estimates of the storage capacity required for data stewardship,
and a lower bound on data that must
be preserved for future timeframes.
Data required by regulation or policy
to be preserved is a lower bound on
valued preservation-worthy research
data—additional data sets will need
to be preserved for research progress
(for example, National Virtual Observatory data sets).
• The types of data services most important for research efforts. Knowing
the most common types of useful services and tools can help drive academic
and commercial efforts.
• Estimates of the size, training, and
skill sets that will be needed for today’s
and tomorrow’s data work force.
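The lower-bound reasoning in the first item can be illustrated with simple arithmetic. This is a minimal sketch under stated assumptions: the `(size_tb, mandated)` pairs and the sample figures are hypothetical, and treating the total reported volume as an upper limit applies only to data that is actually reported to the census.

```python
def preservation_bounds(datasets):
    """Return (lower_tb, reported_tb) for a list of (size_tb, mandated) pairs.

    lower_tb    -- data that regulation or policy requires to be preserved
    reported_tb -- all data reported, mandated or not
    The volume of preservation-worthy data lies somewhere between the two.
    """
    datasets = list(datasets)  # allow any iterable, since we pass over it twice
    lower_tb = sum(size for size, mandated in datasets if mandated)
    reported_tb = sum(size for size, _ in datasets)
    return lower_tb, reported_tb


# Hypothetical sample: two mandated data sets and one valued but unmandated one.
sample = [(120.0, True), (30.0, False), (50.0, True)]
lower_tb, reported_tb = preservation_bounds(sample)
print(f"Preserve at least {lower_tb} TB of {reported_tb} TB reported")
```

The gap between the two figures is where community judgment is needed: unmandated data sets that are still valuable for research progress fall in that range.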
Getting it Done
A Data Census sounds like a big job,
and it is. However, there is potential
to use existing mechanisms to help
gather the needed information efficiently. We already provide annual and
final reports to federal R&D agencies
to describe the results of sponsored research. One could imagine a straightforward addition to annual reporting
vehicles and/or sites such as grants.gov
to collect this information (preferably
electronically). Although U.S.
Census information is gathered every
10 years, the Research Data Census
would require frequent updating in order to provide useful information for
planning purposes about our dynamically changing data landscape. The
right periodicity for reporting is a topic
for discussion, but an annual update
probably provides the best resolution
for the purpose of tracking trends.
Note also that there is real complexity in doing an effective Data Census:
much of our data is generated from
collaborative research, which can
cross institutional, agency, and national
boundaries. The Data Census
reporting mechanisms must take this
into account to produce relatively accurate counts. Data sets are often replicated for preservation purposes—do
we count the data in all copies (all of
which require storage), or do we count
only the non-replicated data? (It is interesting to note that the U.S. Census
has a related problem and addresses it
in question 10: “Does person 1 sometimes live or stay somewhere else?”
If yes, check all that apply….) As with
any survey, careful design is critical in
order to ensure the results are accurate
and useful as the basis for making predictions and tracking trends.
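The replica-counting question above can be made concrete with a small sketch. The record fields (`dataset_id`, `size_tb`, `copies`, `reporter`) and the sample figures are hypothetical, chosen only for illustration; a real census would have to define its own reporting schema and a way to recognize the same data set across institutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetReport:
    dataset_id: str  # hypothetical identifier shared by every report of one data set
    reporter: str    # institution filing the report
    size_tb: float   # size of a single copy, in terabytes
    copies: int      # replicas held by this reporter


def census_totals(reports):
    """Return (unique_tb, stored_tb).

    unique_tb counts each data set once, however many institutions report it.
    stored_tb counts every replica, since each copy consumes storage.
    """
    size_by_id = {}
    stored_tb = 0.0
    for r in reports:
        stored_tb += r.size_tb * r.copies
        size_by_id.setdefault(r.dataset_id, r.size_tb)  # first report wins
    return sum(size_by_id.values()), stored_tb


# Hypothetical collaborative project reported by two institutions.
reports = [
    DatasetReport("sky-survey", "University A", 40.0, 3),
    DatasetReport("sky-survey", "University B", 40.0, 2),
    DatasetReport("genome-panel", "University A", 10.0, 1),
]
unique_tb, stored_tb = census_totals(reports)
print(f"unique: {unique_tb} TB, stored: {stored_tb} TB")
```

Both numbers are useful for different purposes: the unique figure describes how much distinct data the community holds, while the stored figure is closer to the capacity that must actually be provisioned.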
Using the Research Data Census to
Create Effective Data Stewardship
An important outcome of the Research
Data Census would be evidence-based
information on the amount of data in
the research community that must be
preserved over time. This would help in
understanding and meeting our needs
for archival services and community
repositories.
Such information can help cut data
management and preservation problems down to size. Knowing that data
valued by a particular community is
typically of a certain type, a certain
size, and/or needed over a certain
timeframe, can help the community
plan for the effective stewardship of
that data. For example, accurate estimates of the digital data emanating
from the Large Hadron Collider at
CERN have been instrumental in the
development of a data analysis and
management plan for the High Energy
Physics community.
It is likely that some of the capacity needed for stewardship of research
data will come from university libraries
reinventing themselves to address 21st
century information needs; some of
the capacity may come from the commercial sector, which has responded to
emerging needs for digital storage and
preservation through the development
of commercial services. In some cases,
the federal government will take on
the stewardship responsibilities for research data (for example, the NIST Science Reference Data). It is clear that the
size, privacy, longevity, preservation,
access, and other requirements for research data preclude a “one-size-fits-all”
approach to creation of supporting
data cyberinfrastructure. It is also true
that no one sector will be able to take
on the responsibility for stewardship
of all research data. A national strategic
partnership spanning distinct sectors
and stakeholder communities is needed to effectively address the capacity,
infrastructure, preservation, and privacy issues associated with the growing
deluge of research data. The development of a Research Data Census can
provide critical information for more
effectively developing this partnership.
No Time Like the Present
The 2010 requirement for a data
management plan at the National
Science Foundation (http://www.nsf.gov/news/news_summ.jsp?cntn_id=116928&org=NSF)
joins existing requirements for data sharing and management at NIH and elsewhere. Such
requirements expand community
awareness about responsible digital
data stewardship and will exacerbate
the emerging need for reliable, cost-effective data storage and preservation
at the national scale.
A Research Data Census will provide a foundation for estimating the
data cyberinfrastructure required for
strategic stewardship. It can lay the
groundwork today for access to our
most valuable digital research assets
tomorrow, and the new discoveries
and innovation they drive.
Francine Berman (bermaf@rpi.edu) is Vice President for
Research at Rensselaer Polytechnic Institute, the former
director of the San Diego Supercomputer Center, and the
co-chair of the Blue Ribbon Task Force for Sustainable
Digital Preservation and Access (http://brtf.sdsc.edu).