interpretation is also not alone: our results support Lister’s
argument that CS grades are generally not bimodal.
We invite readers to replicate our findings at other institutions.c The code to replicate the analysis is available
online at https://github.com/patitsas/bimodality.
4. STUDY 2: HUMAN INTERPRETATION OF DISTRIBUTIONS
So if CS grades are rarely bimodal, why does the belief in
bimodality persist? An insight came one day while generating
random normal distributions in R: with only 100
data points, the resulting histogram often had more than
one peak and could easily be misread as
“bimodal”. A typical “large class” does not have a large
enough sample size to consistently provide a smooth curve.
Indeed, many of the distributions produced by R’s rnorm
looked very much like the grade distributions we had seen in
our own classes and called “bimodal.”
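The phenomenon is easy to reproduce. The sketch below is a Python analogue of that R experiment; the mean, standard deviation, bin count, and peak-counting heuristic are our illustrative choices, not the paper's. It draws many 100-point samples from a single normal distribution and counts how often their histograms show more than one peak:

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_peaks(sample, bins=10):
    """Count local maxima in a histogram of the sample."""
    counts, _ = np.histogram(sample, bins=bins)
    peaks = 0
    for i in range(len(counts)):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if counts[i] > left and counts[i] > right:
            peaks += 1
    return peaks

# Draw 1000 "classes" of 100 grades each from one normal distribution
# (mean 70, sd 10 are illustrative values, not from the paper's data),
# and count how many produce a multi-peaked histogram.
trials = 1000
multi_peaked = sum(
    histogram_peaks(rng.normal(70, 10, size=100)) > 1
    for _ in range(trials)
)
print(f"{100 * multi_peaked / trials:.1f}% of truly normal samples look multi-peaked")
```

Even though every sample comes from a single unimodal distribution, a substantial fraction of the histograms show spurious extra peaks at this sample size.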
Interested in whether instructor perceptions affect the
interpretation of noisy distributions, we designed an experiment in which participants were presented with histograms of
distributions produced by R's rnorm function and asked to
categorize each distribution (normal, bimodal, uniform, etc.).
We initially had two research questions:
1. Do CS instructors who believe in innate ability categorize more noisy distributions as bimodal?
2. If we prime participants that CS distributions are commonly thought to be bimodal, are they then more likely
to see bimodal distributions in the noise?
Once we analyzed our data for those two research questions, a third research question arose:
3. If instructors label noisy distributions as bimodal, are
they more likely to agree with the idea of innate CS
ability? (i.e., is there a possible feedback loop between
looking at distributions and instructors’ beliefs?)
4.1. Experimental design
A difficulty in studies looking at priming effects is that you
cannot state the purpose of the study in the consent form. If
you do, then you are priming participants, even the participants you want in your control group. To disguise our study,
we presented it as one asking people how often they saw
various distribution shapes in their own classes.
We presented each participant with the six histograms
shown in Figure 1, all of which we generated using R’s
rnorm function. We generated a few dozen histograms and
selected the six histograms from that pool: one to be clearly
normal (distribution 1), one that was mildly skewed as
though students who were failing were pushed up to 50% (distribution 5), one where a ceiling effect was visible (distribution 6), and three noisy distributions which had multiple peaks (distributions 2–4).
We asked each participant whether they saw this shape of distribution in their own classes.
3.2. Testing for normality
A variety of null hypothesis tests, such as Anderson-Darling, Shapiro-Wilk, and Pearson's chi-squared, determine whether a dataset is normally distributed. We chose Shapiro-Wilk because it has been found to have the highest statistical power.
Shapiro-Wilk test. For the Shapiro-Wilk test, the null
hypothesis is that the population is normally distributed. So, if
p < α, we can reject the null hypothesis and have evidence that
the population is not normally distributed. We could reject the
null hypothesis for 106 classes. This indicates that 13.6% of the
classes in the data set are not normally distributed. As with the
results of Hartigan’s Dip Test, this does not mean that the null
hypothesis is necessarily false in these cases. There are many
reasons a distribution may be non-normal: for example, it
could be too skewed, it could be the wrong shape (e.g., triangular or uniform), or it could be multimodal.
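As a hedged illustration of the decision rule (using scipy's `shapiro` rather than the paper's R pipeline, and synthetic grades rather than the UBC data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05  # illustrative significance level; the paper's α is not restated here

normal_grades = rng.normal(70, 10, size=150)   # a plausibly normal class
bimodal_grades = np.concatenate([              # two well-separated clusters
    rng.normal(55, 6, size=75),
    rng.normal(85, 6, size=75),
])

for name, grades in [("normal", normal_grades), ("bimodal", bimodal_grades)]:
    stat, p = stats.shapiro(grades)
    verdict = "reject normality" if p < alpha else "fail to reject"
    print(f"{name}: W={stat:.3f}, p={p:.4f} -> {verdict}")
```

Note that failing to reject says only that the data are consistent with normality, not that they are normal.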
It is worth noting that of the 45 classes where we rejected
the null hypothesis that they were unimodal, for 44 of these
classes we also rejected the null hypothesis that they were
normal. As such, 44 of the 106 (41.5%) classes that
tested as not-normal also tested as multimodal.
For the 86.4% of classes where we failed to reject the null
hypothesis, we cannot guarantee that they are actually normal (a type II error). To estimate how many are actually normal, we bootstrapped a likely value of β, the type II error rate. This yielded
an estimated false negative rate of 1.48%.
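The paper's bootstrap procedure is not detailed here; one hypothetical way to estimate β is to simulate non-normal classes at plausible class sizes and count how often Shapiro-Wilk fails to flag them. Every parameter below (class sizes, cluster means, spread) is an illustrative assumption, not the paper's method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = 0.05
trials = 500
misses = 0

for _ in range(trials):
    n = int(rng.integers(50, 300))     # hypothetical class size
    # Two overlapping grade clusters: a non-normal alternative hypothesis.
    half = n // 2
    grades = np.concatenate([
        rng.normal(62, 9, size=half),
        rng.normal(78, 9, size=n - half),
    ])
    _, p = stats.shapiro(grades)
    if p >= alpha:
        misses += 1                    # failed to reject a non-normal class

beta_hat = misses / trials
print(f"estimated type II error rate under this alternative: {beta_hat:.2%}")
```

The estimate depends heavily on which alternative distribution one assumes, which is why the resulting β should be read as a rough bound rather than a precise rate.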
From our data, we estimate that 85.1% of the final grades
in UBC’s CS classes are normally distributed. This indicates
that grades from a computer science class are typically normally distributed.
Skewness. Although most of the distributions appear to
be normally distributed, it is worth noting that the average skewness of all the distributions is –0.33, whereas a
normal distribution should have a skewness of zero. If
we only consider the distributions for which we failed to reject normality, the average skewness is –0.13. This
provides some sanity checking on our normality testing:
the “normal” distributions are not particularly skewed.
For the classes where we rejected the null hypothesis of
normality (i.e., probably not normal), the average skewness was larger in magnitude. This is likely why many of these classes
were flagged by Shapiro-Wilk as not normal. Higher
skewness could also be a result of the ceiling effect in these classes.
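A sketch of how such a skewness check might look, on synthetic data; the cap at 100% mimicking a grading ceiling is our assumption for illustration, not the paper's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A symmetric (normal) class versus one compressed by a grade ceiling.
symmetric = rng.normal(70, 10, size=1000)
ceiling = np.minimum(rng.normal(85, 12, size=1000), 100.0)

print(f"symmetric sample skewness:     {stats.skew(symmetric):+.2f}")
print(f"ceiling-capped sample skewness: {stats.skew(ceiling):+.2f}")
```

Capping the top of the distribution removes the right tail, which pushes the skewness negative, the same direction observed in the grade data.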
3.3. Discussion
We only examined final grades: our analysis did not include
term grades. And because grades came from only one institution,
one may wonder about generalizability. We tried to acquire
grade distributions from other institutions, but generally
found it difficult to gather the same scale of data. What
stood out for us is that our colleagues (both at UBC and elsewhere) would routinely assert that their CS grades are
bimodal, and our analysis gives evidence to the contrary.
Although we cannot assert from this analysis that every university has the same distributions as UBC, the large scale of
data both in numbers and time-span is compelling. Our
c Since the original ICER publication, our findings have been replicated at a
university in the United States.