this prior work has little evaluation of the impacts of data science in introductory education [6, 16].
The RealTimeWeb project (RTW) [2] made it easy for instructors to bring web-based, rapidly-updating data into their
introductory classrooms by writing light-weight specifications of
remote data streams. CORGIS is similar to RTW, moving from
a small collection of web-based data to a wide collection of local
data. The change was motivated by the authors' perceived scarcity
of quality, real-time datasets. While the technical challenges
of integrating real-time data are interesting, they are of secondary
importance to the value of having a diversity of relevant datasets.
The Sinbad project of Hamid [11] takes real-time data access
in a different direction by reducing the technical requirements on
instructors even further. The library uses sophisticated reflection
techniques to automatically infer the structure of data, so that students can access any desired URL endpoint and receive structured
data. Although it provides a flexible architecture, this project does
not attempt to provide datasets, making it more appropriate for
advanced students who can find their own data to work with.
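To make the idea of structure inference concrete, the following is a minimal Python sketch that derives a schema-like description from a sample JSON response. It is not Sinbad's API or its reflection-based technique; the function name `infer_structure` and the sample payload are hypothetical, and the sketch assumes that the first element of a list is representative of the rest.

```python
import json
from typing import Any


def infer_structure(value: Any) -> Any:
    """Describe the shape of a parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        # A map is described by the structure of each of its fields.
        return {key: infer_structure(child) for key, child in value.items()}
    if isinstance(value, list):
        # Assumption: the first element is representative of the whole list.
        return [infer_structure(value[0])] if value else []
    # Leaves are described by the name of their Python type.
    return type(value).__name__


# Invented payload for illustration; not from a real endpoint.
sample = json.loads('{"name": "KBCB", "readings": [{"temp": 54.5, "ok": true}]}')
schema = infer_structure(sample)
```

A student-facing library could use such a description to decide which fields to expose, although a production tool would need to handle heterogeneous lists and missing fields.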
The BRIDGES project provides students with visualizations
of their algorithms on datasets [4]. BRIDGES does not focus on
organizing datasets directly, but incorporates existing datasets
(including some directly from the CORGIS project). The use
of visualizations can help students understand the interaction
between their algorithms and data.
The CORGIS project aspires to have multiple (3-4) datasets relevant to each major career path that potential students might
seek. To achieve this, we draw on data from wide-ranging,
open-access sources including governments, research publications, journalists, non-profits, and industry. Figure 1 provides
an overview of how datasets are added to the CORGIS project.
First, real-world data is collected and preprocessed into a
“clean” state. Most of the human effort for the project comes at
this phase, as organizing a dataset requires expertise. Figure 2
shows a representation for the cleaned structure of a dataset. Every dataset becomes a list of hierarchical maps (a tree), where the
leaf nodes are simple data types: numbers, text, and booleans.
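The cleaned structure described above can be sketched in Python as a list of nested dictionaries. This is an illustrative sketch only: the records, field names, and the helpers `leaves` and `is_clean` are hypothetical and not taken from the CORGIS code base, and the sketch assumes that nesting happens only through maps.

```python
from typing import Any, Iterator

# Invented records for illustration; not drawn from a real CORGIS dataset.
records = [
    {"city": "Blacksburg", "weather": {"temperature": 54, "raining": False}},
    {"city": "Newark", "weather": {"temperature": 61, "raining": True}},
]

# The simple leaf types named in the text: numbers, text, and booleans.
LEAF_TYPES = (bool, int, float, str)


def leaves(node: Any, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, value) for every leaf in a hierarchical map."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaves(child, path + (key,))
    elif isinstance(node, LEAF_TYPES):
        yield path, node
    else:
        # Anything else (e.g. None for a missing value) is not "clean".
        raise TypeError(f"unsupported leaf at {path}: {type(node).__name__}")


def is_clean(dataset: Any) -> bool:
    """Check that a dataset is a list of maps whose leaves are simple types."""
    if not isinstance(dataset, list):
        return False
    try:
        for record in dataset:
            if not isinstance(record, dict):
                return False
            for _ in leaves(record):
                pass
    except TypeError:
        return False
    return True
```

Under these assumptions, a record containing a missing value such as `None` fails the check, which mirrors the requirement that missing values be removed during cleaning.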
introductory courses creates challenges for instructors, both
pedagogically and technologically. Finding many, varied datasets can be difficult, and they often require cleaning to be suitable for beginners (e.g., to remove missing values, to scale them to
the appropriate size, to choose a convenient format). To overcome these problems, we present here the open-source CORGIS
project, the Collection Of Really Great and Interesting
dataSets, which makes a wide variety of student-ready datasets
available. Our materials are free and open-source, available at
This paper begins with a brief review of relevant educational
theory and existing projects in this space. We then briefly describe
the technical innovations of the CORGIS project and its pedagogical affordances, and specific ways that it can be used in a course.
We report and evaluate results of an intervention conducted using the software. Finally, we discuss the future of the project.
Educational Theory grounds the development of the CORGIS
project. We use Situated Learning Theory [13] to better understand
how students learn, and the MUSIC Model of Academic Motivation [12] to better understand why students choose to engage.
Situated Learning Theory suggests that authentic contexts are crucial for learners. Originally proposed by Lave and
Wenger, SL Theory argues that tasks in the learning environment should parallel real-world tasks, to maximize the authenticity of the experience [13]. Some interpretations of this
theory draw a distinction between the content (e.g., learning to
program) and the context (e.g., by creating a video game), and
stress that proper contextualization is important for students’
comprehension and investment [5].
In our research, we rely on the MUSIC Model of Academic
Motivation [12]. Built as a meta-model, it incorporates many
existing theories and is tailored particularly for education. The
model distinguishes between students’ situational Interest in an
academic activity and their sense of the Usefulness of the activity to their long-term career goals. We connect this distinction
to the different contexts available to introductory courses. Creating games and animations, for example, appeals to students’
situational interest, while we posit that data science will appeal
more to students’ sense of usefulness. The MUSIC model also
incorporates students’ expectation for appropriate Successes
when learning, their sense of eMpowerment within the activity,
and their perception of their instructors' Caring towards them.
From these two theories, we hypothesize that Data Science
will be a beneficial context for students, since students will perceive the context as closely related to their career goals.
RELATED AND PRIOR WORK
The use of data analysis for contextualization represents an
actively growing movement [1, 6, 10, 16]. Upper-division courses have employed situated learning experiences using data of
varying size and complexity for several years [7, 17]. However,
Figure 1: How datasets are added to CORGIS