Government R&D agencies (such as the
National Science Foundation) have an
opportunity to ensure that the lack of adequate data infrastructure does not
present a roadblock to innovative research and educational programs.
Developing and sustaining the infrastructure that ensures that research
data is available to the public and accessible for reuse and reproducibility requires stable economic models.
While there is much support for the
development of tools, technologies,
building blocks, and data-commons
approaches, few U.S. federal programs
directly address the resource challenges for data stewardship or provide help
for libraries, domain repositories, and
other stewardship environments to become self-sustaining and address the
need for public access.
While the U.S. federal government
cannot take on the entire responsibility
for stewardship of sponsored research
data and its infrastructure, neither
should it shy away from providing seed
or transition funding for institutions and
organizations to develop sustainable
stewardship options for the national
community. We encourage the community, inside and outside of government,
to support the development and piloting
of sustainable data stewardship models
for data-driven research and data science
education through strategic programs,
guidance, and cross-agency and public-private partnerships. Science-centric
government agencies like the National
Science Foundation should coordinate
with peer agencies like the National Institutes of Health that focus on similar issues to leverage investments and provide
economies of scope and scale.
Realizing the Potential
The research, education, and infrastructure discussions here focus on
developing a foundation that can increase the pool of data scientists and
data-literate professionals to meet the
current and near-term challenges of
data-driven efforts in all sectors, as well
as the need to evolve data science as a
discipline that can meet the challenges
of future data-driven scenarios.
Data is everywhere, providing an increasingly important tool for a broad
spectrum of endeavors. As systems grow
“smarter” and take on more autonomous and decision-making capabilities,
we will increasingly face data science
technical challenges and the social challenges of governance, ethics, policy, and
privacy. Addressing them will be critical
to rendering data-driven systems useful,
effective, and productive, rather than intrusive, limiting, and destructive. Such
solutions will be particularly important
in highly data-driven environments like
the Internet of Things. Moreover, as
fundamental computational platforms
change in response to the looming end
of Moore’s Law scaling of semiconductors,12 there will be tremendous opportunities to reimagine the entire hardware/software enterprise in the light of
future data needs.
Our community must be prepared to
deal with future scenarios by encouraging the initial research that lays the
groundwork for innovative uses of data,
well-functioning data-focused systems,
useful policy and protections, and effective governance of data-driven environments. With both programmatic resources and a platform for community
leadership, federal R&D agencies like
the National Science Foundation play
an important role in guiding the community toward innovation. Both deep
efforts to expand the field and its impact and broad efforts to help data
science reach its potential for transforming 21st-century research,
education, commerce, and life are needed.
Acknowledgments
We would like to thank the National
Science Foundation for convening this
group and the institutions and organizations of the co-authors for their support for this work.
References
1. Bengio, Y., LeCun, Y., and Hinton, G. Deep learning.
Nature 521 (May 28, 2015), 436–444.
2. Berman, F. (co-chair), Rutenbar, R. (co-chair),
Christensen, H., Davidson, S., Estrin, D., Franklin,
M., Hailpern, B., Martonosi, M., Raghavan, P.,
Stodden, V., and Szalay, A. Realizing the Potential
of Data Science: Final Report from the National
Science Foundation Computer and Information
Science and Engineering Advisory Committee
Data Science Working Group. National Science
Foundation Computer and Information Science and
Engineering Advisory Committee Report, Dec. 2016;
3. Cho, A. The discovery of the Higgs Boson. Science 338,
6114 (Dec. 21, 2012), 1524–1525.
4. Columbia University Data Science Institute. Master of
Science in Data Science; http://datascience.columbia.
5. Coursera. Master of Computer Science in Data
6. Dhar, V. When to trust robots with decisions, and
when not to. Harvard Business Review (May 17, 2016);
7. Moore-Sloan Data Science Program; http://msdse.org/
8. University of California, Berkeley. Data Science
Education Program; http://data.berkeley.edu/data-science-education-program
9. University of Chicago. Master of Science in
Computational Analysis & Public Policy; https://capp.
10. University of Illinois, Urbana-Champaign, CS@
ILLINOIS. Master of Computer Science in Data
Science, Data Science Track; http://www.cs.uiuc.edu/
11. University of Michigan. Undergraduate Program in
Data Science; https://www.eecs.umich.edu/eecs/
12. Waldrop, M. M. The chips are down for Moore’s Law.
Nature 530, 7589 (Feb. 11, 2016), 144–146.
Francine Berman (email@example.com) is the Edward P.
Hamilton Distinguished Professor in Computer Science
at Rensselaer Polytechnic Institute, Troy, NY, USA, and
Chair of the Research Data Alliance / U.S. She served as
Co-Chair of the Data Science Working Group of the NSF
CISE Advisory Committee.
Rob Rutenbar (firstname.lastname@example.org) is a professor
of computer science and electrical and computer
engineering and Senior Vice Chancellor for Research
at the University of Pittsburgh, Pittsburgh, PA, USA. He
served as Co-Chair of the Data Science Working Group of
the NSF CISE Advisory Committee.
Henrik Christensen (email@example.com) is a
professor of computer science and Director of the
Institute for Contextual Robotics at the University of
California at San Diego, USA.
Susan Davidson (firstname.lastname@example.org) is the Weiss
Professor of Computer and Information Science at the
University of Pennsylvania, Philadelphia, PA, USA.
Deborah Estrin (email@example.com) is Associate
Dean and professor of computer science at Cornell Tech
in New York City and a professor of public health at Weill
Cornell Medical College, New York, USA.
Michael Franklin (firstname.lastname@example.org) is the Liew
Family Chairman of Computer Science and Senior Advisor
to the Provost for Data and Computing at the University of
Brent Hailpern (email@example.com) is a Distinguished
Research Staff Member, Science Director of the IBM
Cognitive Horizons Network, and Head of Computer
Science for IBM Research, San Jose, CA, USA.
Margaret Martonosi (firstname.lastname@example.org) is the Hugh
Trumbull Adams ’35 Professor of Computer Science at
Princeton University, Princeton, NJ, USA.
Padma Raghavan (email@example.com) is a
professor of computer science and computer engineering
and Vice President of Research at Vanderbilt University,
Nashville, TN, USA.
Victoria Stodden (firstname.lastname@example.org) is an associate
professor in the School of Information Sciences at the
University of Illinois at Urbana-Champaign, USA.
Alex Szalay (email@example.com) is Bloomberg Distinguished
Professor in the Departments of Physics and Astronomy
and Computer Science at the Johns Hopkins University,
Baltimore, MD, USA.
© 2018 ACM 0001-0782/18/4 $15.00