variance), and so should not be analyzed with classic parametric tests either. To make matters worse, the tests
that people typically use to check the
normality or heteroscedasticity of data
are not reliable when both problems are present.
So, basically, you should always run
modern robust tests in preference to
the classic ones. I have come to the sad
conclusion that I am going to have to
learn R. However, at least it is free and
a package called nparLD does ANOVA-type statistics. Kaptein et al.'s paper3
gives an example of such an analysis,
which I am currently practicing with.
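Here is a rough sketch of what such an analysis can look like in R, if I understand the package correctly; the data frame and column names are invented for illustration, and the exact call will depend on your design:

    # Hypothetical long-format data: each participant rated a
    # system at several time points.
    #   ratings$score   - Likert-type response
    #   ratings$time    - repeated-measures factor
    #   ratings$subject - participant identifier
    library(nparLD)
    result <- nparLD(score ~ time, data = ratings,
                     subject = "subject", description = FALSE)
    summary(result)  # ANOVA-type statistic (ATS) and its p-value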
You might think this is the end of the
statistical jiggery pokery required to
publish some seemingly simple results
correctly. Uh-uh, it gets more complicated. The APA style guidelines require
authors to publish effect size as well as
significance results. What is the difference? Significance testing checks to
see if differences in the means could
have occurred by chance alone. Effect
size tells you how big the difference
was between the groups. Randolph,
Julnes, Sutinen, and Lehman4, in what
amounts to a giant complaint about
the reporting practices of researchers
in computer science education, pointed out that the stats reported by
computer science education folk do
not contain enough information;
missing effect sizes are one problem.
Apparently it is not just us: Paul Ellis reports similar results with psychologists
in The Essential Guide to Effect Sizes.
Ellis also comments that there is
a viewpoint that not reporting effect
size is tantamount to withholding evidence. Yikes! Robert Coe has a useful
article, “It’s the Effect Size, Stupid,” on
what effect size is, why it matters, and
which measures one can use. Researchers often use Cohen’s d or the correlation coefficient r as a measure of effect size. For Cohen’s d, there is even a
handy way of saying whether the effect
size is small, medium, or big.
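As a quick illustration with made-up numbers: Cohen's d for two independent groups is just the difference in means divided by the pooled standard deviation, with conventional benchmarks of roughly 0.2 for small, 0.5 for medium, and 0.8 for large:

    # Hypothetical scores from two independent groups.
    g1 <- c(12, 15, 14, 10, 13, 16)
    g2 <- c(9, 11, 10, 12, 8, 10)
    n1 <- length(g1); n2 <- length(g2)
    # Pooled standard deviation, then Cohen's d.
    sp <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) /
               (n1 + n2 - 2))
    (mean(g1) - mean(g2)) / sp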
Unfortunately, if you have nonparametric data, effect size reporting seems to get trickier, and Cohen's way of interpreting the size of the effect no longer makes sense (indeed, some people question
whether it makes sense at all). Also, it is
difficult for nonexperts to understand.
Common language effect sizes or
probability of superiority statistics can
solve this problem (Grissom2). It is “the
probability that a randomly sampled
member of a population given one
treatment will have a score (y1) that is
higher on the dependent variable than
that of a randomly sampled member of
a population given another treatment
(y2)" (Grissom2). An example from Robert Coe: Consider a common language
effect size of 0.92 in a comparison of
heights of males and females. In other words, "in 92 out of 100 blind dates
among young adults, the male will be
taller than the female." If you have
Likert-type data with an independent
design and you want to report an effect
size, it is quite easy. SPSS won't do it for
you, but you can do it with Excel (or R, as sketched below): PS = U/(mn), where U is the Mann-Whitney U result, m is the number of people in condition 1, and n is the number of people in condition 2 (Grissom2).
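A minimal sketch of that calculation in R, with invented response vectors; for independent samples, the W statistic reported by wilcox.test() is the Mann-Whitney U:

    # Hypothetical Likert responses from two independent conditions.
    cond1 <- c(4, 5, 3, 4, 5, 2, 4)   # m = 7 participants
    cond2 <- c(2, 3, 3, 1, 4, 2)      # n = 6 participants
    u <- wilcox.test(cond1, cond2)$statistic   # Mann-Whitney U
    u / (length(cond1) * length(cond2))        # PS = U/(mn)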
If you have a repeated measures design, refer to Grissom and Kim's Effect Sizes for Research
(2006, p. 115): PSdep = w/n, where n is
the number of participants and w refers
to "wins," cases where the score was higher on the
second measure than on the
first. Grissom2 has a handy table for
converting between probability of superiority and Cohen's d, as well as a way of
interpreting the size of the effect.
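Again as a sketch with invented data; the final line uses the normal-based relation PS = pnorm(d/sqrt(2)), which, as far as I understand, is what underlies such conversion tables:

    # Hypothetical before/after scores for the same n participants.
    before <- c(3, 2, 4, 3, 5, 2, 3, 4)
    after  <- c(4, 3, 4, 5, 4, 3, 4, 5)
    wins <- sum(after > before)   # "wins": higher on second measure
    wins / length(before)         # PSdep = w/n
    # Converting Cohen's d to PS under normality:
    # d = 0.5 corresponds to PS of about 0.64.
    pnorm(0.5 / sqrt(2))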
I am not a stats expert in any way.
This is just my current understand-
ing of the topic from recent reading,
although I have one or two remaining
questions. If you want to read more,
you could consult a forthcoming paper
by Maurits Kaptein and myself in this
year’s CHI conference (Kaptein and
Robertson5). I welcome any corrections
from stats geniuses! I hope it is useful
but I suspect colleagues will hate me for
bringing it up. I hate myself for reading
any of this in the first place. It is much
easier to do things incorrectly.
1. Erceg-Hurn, D.M. and Mirosevich, V.M. (2008).
Modern robust statistical methods: An easy way to
maximize the accuracy and power of your research.
The American Psychologist, 63(7), 591–601.
2. Grissom, R.J. (1994). Probability of the superior
outcome of one treatment over another. Journal
of Applied Psychology, 79(2), 314–316.
3. Kaptein, M.C., Nass, C., and Markopoulos, P. (2010).
Powerful and consistent analysis of Likert-type
rating scales. Proceedings of the 28th International
Conference on Human Factors in Computing Systems
- CHI '10 (p. 2391). New York, NY: ACM Press.
4. Randolph, J., Julnes, G., Sutinen, E., and Lehman, S.
(2008). A methodological review of computer
science education research. Journal of Information
Technology Education, (7), 135–162. www.jite.org/
5. Kaptein, M. and Robertson, J. (in press). Rethinking
statistical methods for CHI. Accepted for publication
in CHI 2012, Austin, TX. http://judyrobertson.typepad.
By replacing ANOVA with nonparametric
or robust statistics, we risk ending up in
another local maximum. Robust statistics
are just another way to squeeze your data
into a shape appropriate for the infamous
“minimizing sum of squares” statistics.
Those had their rise in the 20th century
because they were computable by pure
brainpower (or the ridiculously slow
computers of the time).
If only HCI researchers and psychologists
would learn their tools and
acknowledge the progress achieved in
econometrics or biostatistics! For example,
linear regression and model selection
strategies are there to replace the
one-by-one null hypothesis testing with
subsequent adjustment of the alpha level. With
maximum likelihood estimation, a person
no longer needs to worry about Gaussian
error terms. Just use Poisson regression
for counts and logistic regression for binary
outcomes. The latter can also model Likert-type scales appropriately with meaningful
parameters and in multifactorial designs.
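A sketch of what this looks like in R; the data frame and variable names are invented for illustration, and polr() from the MASS package fits an ordinal (proportional-odds) logistic model for Likert-type responses:

    library(MASS)   # for polr()
    # Hypothetical data frame d: 'errors' is a count, 'success'
    # is binary, 'rating' is Likert-type, 'condition' is the
    # experimental factor.
    m_pois  <- glm(errors ~ condition, data = d, family = poisson)
    m_logit <- glm(success ~ condition, data = d, family = binomial)
    m_ord   <- polr(ordered(rating) ~ condition, data = d)
    summary(m_pois)   # maximum likelihood estimates throughout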
Once you start discovering this world of
modern regression techniques, you start
seeing more in your data than just a number
of means and their differences. You start
seeing its shape and begin reasoning about
the underlying processes. This can truly
be a source of inspiration.
Judy Robertson is a lecturer at Heriot-Watt University.