and dictionaries). We direct students to create a hand-written “data map”, like the one in Figure 2. This diagram visually
describes the structure of the dataset, and students use it as a
guide to developing the necessary code constructs (e.g., iteration, dictionary access), much as they would navigate a real
map to find a path in a maze. Because of the vastness of most
datasets’ structure, students must be judicious in what branches and leaves of the structure they diagram, in anticipation of
what they will use in their code.
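As a minimal sketch of how a data map translates into code, consider a small nested list of dictionaries. The field names and values below are invented for illustration and are not taken from any actual CORGIS dataset. Each step down the map becomes either one loop (for a list) or one key lookup (for a dictionary).

```python
# Hypothetical records: a top-level list of dictionaries, each with nested
# dictionaries. Names and values are made up for illustration only.
reports = [
    {"city": "Blacksburg", "data": {"temperature": {"high": 60, "low": 40}}},
    {"city": "Richmond", "data": {"temperature": {"high": 70, "low": 50}}},
    {"city": "Norfolk", "data": {"temperature": {"high": 80, "low": 60}}},
]


def average_high(records):
    """Follow the path list -> record -> "data" -> "temperature" -> "high"."""
    total = 0
    for record in records:  # iteration over the top-level list
        total += record["data"]["temperature"]["high"]  # dictionary access
    return total / len(records)


print(average_high(reports))
```

Only the branches that the question needs ("data", "temperature", "high") appear in the code, which mirrors the advice to diagram only the parts of the structure that will be used.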
LARGE-SCALE DATA ANALYSIS
For our course’s final project, students are tasked with formulating questions about a dataset of their own choosing, writing
code to create visualizations, and then interpreting those visualizations using relevant domain knowledge. We claim that this
is an authentic form of assessment for students in the course,
modeling what they might do if they were to incorporate data
science into their own careers to solve open-ended problems using computational techniques. The nature of the CORGIS data
structure requires them to use a number of coding constructs,
including iteration and dictionary access. In addition to their
code, students must turn in a 5-minute video presentation reporting their results. Students share their videos with each other
to demonstrate the breadth available in computing. Students are
required to describe the abstractions used in the project and the
inherent limitations of the dataset. This turns a weakness in the
dataset into an important learning experience for the students.
The final project is assessed both by course staff and peers.
In this section, we evaluate the CORGIS project’s progress in
two ways. First, we present empirical metrics for the datasets.
Second, we present survey results from a course that incorporated real-world data through CORGIS.
At the time of writing, there are over 40 datasets in the CORGIS gallery, and we are actively working to add more datasets.
Figure 6 reveals characteristics of the datasets within the corpus. If the data’s structure is seen as a tree, the Average Branch
Factor (ABF) is the mean number of fields in a child. The height
of a dataset is the maximal depth of the tree, and Fields is the
number of leaves. Rows is the number of records in the dataset, while size is the amount of disk space used by a dataset.
Although we are pleased with the narrow distribution on some
of the attributes (e.g., heights), the dispersion of ABF and Fields
suggests that some datasets need to have fewer fields organized
into more branches.
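The tree metrics above can be illustrated with a short sketch. It rests on several assumptions: it measures a single hypothetical record rather than a whole dataset, it counts every dictionary or list as a branching node, and it treats any other value as a leaf. The counting conventions behind Figure 6 may differ, so this is one plausible reading rather than the method used to produce the figure.

```python
def describe(node):
    """Return (height, leaf_count, branch_sizes) for a nested structure.

    Assumed conventions (not confirmed by the paper): a leaf has height 0,
    and every dict or list contributes its number of children to branch_sizes.
    """
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return 0, 1, []  # atomic value: one leaf, no branching

    height, leaves, branch_sizes = 0, 0, [len(children)]
    for child in children:
        child_height, child_leaves, child_branches = describe(child)
        height = max(height, child_height)
        leaves += child_leaves
        branch_sizes.extend(child_branches)
    return height + 1, leaves, branch_sizes


# Hypothetical record, invented for illustration.
record = {
    "city": "Blacksburg",
    "data": {"temperature": {"high": 60, "low": 40}, "precipitation": 0.0},
}

height, fields, branch_sizes = describe(record)
abf = sum(branch_sizes) / len(branch_sizes)
print(height, fields, abf)
```

Under these conventions, a record with many fields packed into one flat dictionary has a large branching factor and a small height, which is the pattern the dispersion in ABF and Fields points to.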
Figure 7 shows the distribution of atomic and composite
types within datasets. Numeric and string types dominate. Most
datasets have few or no boolean types, and few datasets have
more than just the top-level list (which is present in every dataset).
The x-axis shows percentages of all types within the dataset,
with numerics and strings as the most common. The chart does
ments also benefit from the wide variety of the CORGIS collection.
Learners can choose their own datasets to explore things
related to their own interests and career goals, increasing their
sense of agency.
The Visualizer facilitates early exploration of datasets, without
the need to program. In the beginning of our course, students
generate graphs in groups (each group is assigned a dataset) and
then on their own (free to choose their own dataset). They are
tasked with creating visualizations to explore questions about the
distribution, trends, and relationships of the data. For example,
in the crime dataset, they can identify the downward trend of
violent crime rates over time in different states. This is an opportunity to discuss how complex, real-world entities and phenomena can be represented with computable abstractions (e.g., numbers). Additionally, this gives students practice with selecting and
interpreting different kinds of charts. In our experience, many
students struggle with aspects of graphs such as distinguishing
bar charts and histograms, or knowing when to use line plots.
When students start programming, we give them practice problems contextualized with CORGIS datasets. Although the scope
of these problems is similar to those found in systems like CodingBat [15], the problems can be more realistic. The complexity
of using complete datasets is avoided by using the simpler interfaces exposed by the libraries. For example, students might be
tasked with writing a program to print whether an umbrella is
necessary depending on the weather in their current city (requiring
only a function call to the Weather library, an if-statement,
and the print statement). A wide range of programming topics
can be contextualized with the CORGIS libraries, including collection-based iteration, decisions, printing and visualization, and
functions. In our course, datasets are incorporated through a
block-based programming environment [3] and then in a regular Python programming environment, but the libraries and raw
datasets could be used in a variety of development environments.
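The umbrella problem described above might look roughly like the following sketch. The function `get_forecast` is a hypothetical stand-in that returns canned values; it is not the interface of the actual Weather library, whose function names are not given here. The point is only the shape of the exercise: one function call, one if-statement, and one print.

```python
def get_forecast(city):
    """Hypothetical stand-in for a weather library call; returns canned data."""
    canned = {"Blacksburg": "rain", "Phoenix": "sunny"}
    return canned.get(city, "unknown")


def umbrella_advice(city):
    """Decide whether an umbrella is needed based on the forecast."""
    if get_forecast(city) == "rain":
        return "Bring an umbrella."
    return "No umbrella needed."


print(umbrella_advice("Blacksburg"))
```

Separating the decision into `umbrella_advice` is a choice made here so the logic can be checked without a network connection; a first exercise could just as well place the if-statement at the top level.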
A learning goal in our course is for students to be able to navigate and manipulate complex data structures (e.g., nested lists
Figure 5: The CORGIS Gallery