neity cannot be taken for granted.17
Instead, visibility variance turned out to be the regular case,
producing a remarkable effect: progress no longer follows the
geometric series, moving instead much more slowly over the long term.
The consequence of ignoring visibility variance is the same as that of
not accounting for incompleteness: the progress of a study is
overestimated, so the required sample size is underestimated.
In their 2010 meta-study, Hwang and Salvendy6 analyzed the results of
many research papers published since 1990 in order to define a general
rule for sample size (replacing Nielsen’s magic number five). Hwang and
Salvendy’s minimum criterion for inclusion was that a study reported
average discovery rates, that is, the number of successful problem
discoveries divided by the total number of trials (number of problems
multiplied by number of sessions). However, this statistic may be
inappropriate, as it accounts for neither incompleteness nor
visibility variance. Taking one reference dataset from the meta-study
as an example, I now aim to show how the 10±2 rule is biased. It turns
out that the sample size required for an 80% target is much greater
than previously assumed.
Figure 3. Fit of the LNBzt model (logit-normal binomial model with zero-truncation) on the Law and Hvannberg study.8
[Plot placeholder. Axes: Times Discovered, Frequency. Empirical seen: 88; unseen: 74. LNBzt m = −3.091, s = 1.52; nlogLik = 140.691; AIC = 285.524.]
Seen and unseen
In a 2004 study conducted by Law and Hvannberg,8 17 independent
usability inspection sessions found 88 unique usability problems. The
authors reported the frequency distribution of the discovery of each
problem. A first glance at the frequency distribution reveals that
nearly half the problems were discovered only once (see Figure 1).
This result raises the suspicion that the study did not uncover all
existing problems, meaning the dataset is most likely incomplete.
In the study, a total of 207 events represented successful discovery
of problems. Assuming completeness, the binomial probability is
estimated as p = 207/(17 × 88) = 0.138. Using Virzi’s formula, Hwang
and Salvendy estimated that the 80% target would be met after 11
sessions, supporting their 10±2 rule. However, Figure 1 shows the
theoretical binomial distribution is far from matching the observed
distribution, reflecting three discrepancies:
Never-observed problems. The theoretical distribution predicts a considerable number of never-observed
problems;
Singletons. More problems are observed in exactly one session than is
predicted by the theoretical distribution; and
Frequent occurrences. The number of
frequently observed problems (in more
than five sessions) is undercounted by
the theoretical distribution.
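The naive estimate discussed above can be reproduced in a few lines. The following is a minimal sketch, assuming Virzi’s formula takes the usual form 1 − (1 − p)^n for the proportion of problems discovered after n sessions; the function names are illustrative and do not come from the original studies.

```python
import math

# Figures reported for the Law and Hvannberg dataset
SESSIONS = 17
SEEN_PROBLEMS = 88
DISCOVERIES = 207


def discovery_rate(discoveries, sessions, problems):
    """Average discovery rate: discoveries / (sessions * problems)."""
    return discoveries / (sessions * problems)


def sessions_for_target(p, target=0.8):
    """Real-valued n solving 1 - (1 - p)**n = target (assumed form of Virzi's formula)."""
    return math.log(1 - target) / math.log(1 - p)


# Assumes completeness: all existing problems are among the 88 seen
p_naive = discovery_rate(DISCOVERIES, SESSIONS, SEEN_PROBLEMS)
n_naive = sessions_for_target(p_naive)

print(f"p = {p_naive:.3f}")
print(f"sessions for 80% target = {n_naive:.1f}")
```

Rounded to the nearest whole session, the result matches the 11 sessions cited above. The estimate is only as good as the completeness assumption behind it.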
The first discrepancy indicates the study was incomplete, as the
binomial model would predict eight unseen problems. The GT estimator
Lewis proposed is an adjustment researchers can make for such
incomplete datasets; it smooths the data by setting the number of
unseen events to the number of singletons, here 41.b With the GT
adjustment, the binomial model obtains an estimate of p = 0.094 (see
Figure 2) and predicts a sample size of 16 for an 80% discovery
target, which is considerably beyond the 10±2 rule.
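The effect of the GT adjustment on the estimate can be checked with a short calculation. This is a sketch under two assumptions: the adjustment is applied in the simplified form described above (unseen problems set equal to the number of singletons), and the sample size follows from solving 1 − (1 − p)^n = 0.8. The function names are hypothetical.

```python
import math

# Figures reported for the Law and Hvannberg dataset
SESSIONS = 17
SEEN_PROBLEMS = 88
DISCOVERIES = 207
SINGLETONS = 41  # problems discovered in exactly one session


def gt_adjusted_rate(discoveries, sessions, seen, singletons):
    """Discovery rate after adding as many unseen problems as there are singletons.

    Simplified reading of the GT adjustment as described in the text.
    """
    estimated_total = seen + singletons
    return discoveries / (sessions * estimated_total)


def sessions_for_target(p, target=0.8):
    """Real-valued n solving 1 - (1 - p)**n = target."""
    return math.log(1 - target) / math.log(1 - p)


p_gt = gt_adjusted_rate(DISCOVERIES, SESSIONS, SEEN_PROBLEMS, SINGLETONS)
n_gt = sessions_for_target(p_gt)

print(f"GT-adjusted p = {p_gt:.3f}")
print(f"sessions for 80% target = {n_gt:.1f}")
```

The unrounded solution lies slightly above 16, so it agrees with the reported figure when rounded to the nearest session; a stricter reading that rounds up would give 17.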
Variance matters
The way many researchers understand variance is likely shaped by the
common analysis of variance (ANOVA) and underlying Gaussian
distribution. Strong variance in a dataset is interpreted as noise,
possibly forcing researchers to increase the sample size; variance is
therefore often called a nuisance parameter. Conveniently, the
Gaussian distribution has a separate parameter for variance,
uncoupling it from the parameter of interest, the mean. That is, more
variance makes the estimation less accurate but usually does not
introduce bias. Here, I address why variance is not harmless for
statistical models rooted in the binomial realm, as when trying to
predict the sample size of a usability study.
b Lewis favors an equally weighted combination of normalization procedure and GT adjustment, but its theoretical justification is tenuous, ultimately making only a small difference to prediction (p=0.085).
The binomial distribution has a remarkable property: Its variance is
tied to the binomial parameters, the sample size n and the probability
p, as in Var = np(1−p). If the observed variance exceeds np(1−p), it is
called overdispersion, and the data can no longer be taken as
binomially distributed. Overdispersion has an interesting
interpretation: The probability parameter p varies, meaning, in this
case, problems vary in terms of visibility. Indeed, Figures 1 and 2
show the observed distribution of problem discovery has much fatter
left and right tails than the plain binomial and GT-adjusted models;
more variance is apparently observed than can be handled by the
binomial model.
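A small simulation illustrates how varying visibility produces overdispersion. This is a sketch on simulated data, not a reanalysis of the original dataset. It assumes each problem’s visibility p follows a logit-normal distribution and borrows the values m = −3.091 and s = 1.52 shown in Figure 3 as the mean and standard deviation on the logit scale, which is an interpretation of that figure. Unlike the zero-truncated model, it also counts problems that were never discovered.

```python
import math
import random
import statistics


def simulate_discovery_counts(n_problems, n_sessions, mu, sigma, rng):
    """Simulate how often each problem is discovered when visibility varies."""
    counts = []
    for _ in range(n_problems):
        # Per-problem visibility: logit(p) ~ Normal(mu, sigma)
        logit_p = rng.gauss(mu, sigma)
        p = 1.0 / (1.0 + math.exp(-logit_p))
        # One Bernoulli trial per session
        counts.append(sum(rng.random() < p for _ in range(n_sessions)))
    return counts


rng = random.Random(1)  # fixed seed so the run is repeatable
n_sessions = 17
counts = simulate_discovery_counts(5000, n_sessions, -3.091, 1.52, rng)

# Single pooled p, as the plain binomial model would estimate it
p_hat = sum(counts) / (len(counts) * n_sessions)
binomial_var = n_sessions * p_hat * (1 - p_hat)  # np(1 - p)
observed_var = statistics.pvariance(counts)
dispersion_ratio = observed_var / binomial_var

print(f"pooled p = {p_hat:.3f}")
print(f"binomial variance np(1-p) = {binomial_var:.2f}")
print(f"observed variance = {observed_var:.2f}")
print(f"dispersion ratio = {dispersion_ratio:.2f}")
```

A dispersion ratio well above 1 signals that a single value of p cannot describe the counts, which is the pattern the fat tails in the observed distribution suggest.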
Regarding sample-size estimation
in usability studies, the 2006 edition
of the International Encyclopedia of
Ergonomics and Human Factors says,
“There is no compelling evidence that
a probability density function would
lead to an advantage over a single
value for p.”19 However, my own 2008–
2009 results call this assertion into
question. The regular case seems to
be that p varies, strongly affecting the