The datasets must be available for reproducible research and hosted by reliable infrastructure.
Lack of such infrastructure and datasets will inhibit success. Education
and training in data science is most authentic in a setting where students can
work on data that represents the datasets and environments they will see
in the professional arena; that is, data
that is both at-scale and embedded in
a stewardship infrastructure that enables it to be a useful tool in analysis,
modeling, and mining.
In the best case, data infrastructure should support access to data for
research and education that is equivalent to access to any other key utility:
it must be “always on,” robust enough to support extensive use,
and of good quality. In the
world of data, this comes down to
responsible stewardship, meaning
there must be actors, plans, and both
“social” and technical infrastructure
to ensure the following:
Data is appropriately tracked, monitored, and identified. Who created, curated, and used the data? Can it be persistently identified? Are there adequate privacy and security controls?
Data is well cared for. Who is committed to keeping it, in what formats, and for how long? Who is committed to funding data stewardship? And how will it be stored and migrated to next-generation media?
Data is discoverable and useful. How is data made available and to whom? What services are needed to make good use of it? And what metadata and other information is needed to promote reproducibility?
Data stewardship is compliant with policy and good practice. Does stewardship comply with community standards and appropriate policy regarding reporting, intellectual property, and other concerns? Are the rights, licenses, and other properties that will determine appropriate use clear? And what data and metadata are to be kept, who owns it and its by-products, and who has access to it and its metadata or parts of it?
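Several of the stewardship questions above (who created and curates the data, how it is identified, in what format and for how long it is kept, and under what license) can be captured in a small machine-readable record. The following is a minimal sketch in Python, not a standard: the `StewardshipRecord` class, its field names, and the example identifier are all hypothetical, chosen only to illustrate the idea. A SHA-256 checksum is included as one common way to detect whether a dataset has changed since it was recorded.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StewardshipRecord:
    """Hypothetical minimal record; field names are illustrative only."""
    identifier: str      # persistent identifier assigned to the dataset
    creator: str         # who created the data
    curator: str         # who is committed to keeping it
    license: str         # rights that determine appropriate use
    file_format: str     # format in which the data is kept
    retain_until: str    # ISO date until which retention is committed
    sha256: str          # checksum used to detect later changes
    access: List[str] = field(default_factory=list)  # who may access it


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(data: bytes, **fields) -> StewardshipRecord:
    """Build a record whose checksum is computed from the data itself."""
    return StewardshipRecord(sha256=checksum(data), **fields)


def verify(data: bytes, record: StewardshipRecord) -> bool:
    """True if the data still matches the checksum stored in the record."""
    return checksum(data) == record.sha256


# Toy dataset and made-up metadata, for illustration only.
data = b"station,temp_c\nA,21.5\nB,19.0\n"
record = make_record(
    data,
    identifier="example:dataset-0001",
    creator="Example Lab",
    curator="Example University Library",
    license="CC-BY-4.0",
    file_format="text/csv",
    retain_until="2035-12-31",
    access=["public"],
)
print(json.dumps(asdict(record), indent=2))
print("unchanged:", verify(data, record))
```

A record like this answers only the technical half of the questions; the commitments it names (funding, retention, migration) still depend on the actors and plans described above.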
Since data will become the core for
research and insight for a broad set of
academic disciplines, access to it in a
usable form on a reasonable time scale
becomes the entry point for any effective research and education agenda.
It is important that we do not “standardize” data science too quickly, continuing to explore
configurations of courses, areas, projects, faculty, and partnerships to gain
critical experience in how best to educate new generations of data scientists.
In addition to “data science” programs and majors that serve to evolve
data science as a discipline, data science skills are increasingly critical
as training for other disciplines and
professions as they become more and
more data-enabled. Effective training
will empower data-enabled professionals and domain scientists to use
data effectively and operate within a
broader data-driven environment; develop an appreciation of what data can
and cannot tell us; acquire appropriate technical knowledge about
how data should be handled; gain
awareness that correlation in data does
not necessarily imply causality; and begin to develop a sense of responsible
methodologies and ethical principles
in the use of data.
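The point that correlation does not necessarily imply causality can be made concrete with a small simulation. The sketch below is an illustrative assumption, not an example from any particular curriculum: two variables `x` and `y` never influence each other, but both depend on a shared hidden factor `z`. They appear strongly correlated until the shared factor is removed. All names, the seed, and the noise levels are arbitrary choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# z is a hidden common cause; x and y each equal z plus independent noise.
# Neither x nor y has any effect on the other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Raw correlation: the theoretical value for this setup is 1 / 1.25 = 0.8.
r_raw = pearson(x, y)

# Subtracting the common cause leaves only independent noise,
# so the correlation should fall close to zero.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_adjusted = pearson(x_resid, y_resid)

print(f"correlation of x and y:          {r_raw:.2f}")
print(f"after removing the common cause: {r_adjusted:.2f}")
```

In real data the common cause is rarely known or measured this cleanly, which is part of why this awareness has to be taught rather than assumed.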
More specific training in the nuts and
bolts of dealing with data is also critical for various data-driven professions.
Training in programming and software
engineering is useful for students who
will be using data-driven simulations
and models in their research. Training
in version control and the subtleties of
stewardship, including working with repositories for data and software, should
be taught to computational researchers. And training in best practices for
digital scholarship and reproducibility should be integrated into research-methodology curricula. The ethics of
using (and misusing) data should be incorporated into all training programs to
promote effective and responsible data
use. Courses teaching these skills can
be made available in a variety of venues,
from university courses and modules to
online courses to professional courses
that could be developed by scientific societies and communities.
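As one illustration of what training in reproducibility might cover, the sketch below records the inputs needed to repeat a small computation (a random seed and the interpreter version) alongside a digest of the result, so that a later run can be checked against the original. It is a minimal sketch under assumed names: `run_analysis` and `manifest` are hypothetical stand-ins for a real analysis and its provenance record.

```python
import hashlib
import json
import platform
import random


def run_analysis(seed: int, n: int = 1000) -> float:
    """Stand-in for a real analysis: the mean of n seeded random draws."""
    rng = random.Random(seed)  # local generator, so no global state leaks in
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(sample) / n


def manifest(seed: int) -> dict:
    """Record what is needed to repeat the run and to check its result."""
    result = run_analysis(seed)
    return {
        "seed": seed,
        "python": platform.python_version(),
        "result": result,
        "result_sha256": hashlib.sha256(repr(result).encode()).hexdigest(),
    }


first = manifest(seed=42)
second = manifest(seed=42)  # repeating with the same seed

print(json.dumps(first, indent=2))
print("reproduced:", first["result_sha256"] == second["result_sha256"])
```

Storing such a manifest in version control next to the code and data gives other researchers a concrete target to reproduce.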
Data Science Research and
Any innovative agenda in data science
research and education will depend
on a foundation of enabling data infrastructure and useful datasets. Research in data science needs access to
sufficiently large and numerous datasets to illuminate and validate results.