Here, I explain each of the biases (shown in red in Figure 1) and classify them by type, beginning with activity bias resulting from how people use the Web and the hidden bias of people without Internet access. I then address bias in Web data and how it potentially taints the algorithms that use it, followed by biases created through our interaction with websites and how content and use recycle back to the Web or to Web-based systems, creating various types of second-order bias.
Consider the following survey of research on bias on the Web, some of which I was involved in personally, focusing on the significance of the categories of bias identified, not on methodological aspects of the research. For more detail, see the References and the research listed in the online appendix “Further Reading” (dl.acm.org/citation.cfm?doid=3209581&picked=formats) of this article.
Activity Bias, or Wisdom of a Few
In 2011, a study by Wu et al.28 on how people followed other people on Twitter found that the most popular 0.05% of people attracted almost 50% of all participants; that is, half of the Twitter users in the dataset were following only a few select celebrities. I thus asked myself: What percentage of active Web users generates half the content on a social media website? I did not, however, consider the silent majority of Web users who only watch the Web without contributing to it, which in itself is a form of self-selection bias.14 Saez-Trumper and I8 analyzed four datasets, and, as I detail here, the results surprised us.
Exploring a Facebook dataset from 2009 with almost 40,000 active users, we found that 7% of them produced 50% of the posts. In a larger dataset of Amazon reviews from 2013, it was just 4% of the active users. In a very large dataset from 2011 with 12 million active Twitter users, the result was only 2%. Finally, we learned that the first version of half the entries of English Wikipedia was researched and posted by 0.04% of its registered editors, or approximately 2,000 people. Only a small percentage of all users thus contribute to the Web, and the notion that it represents the wisdom of the overall crowd is an illusion.
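To make the measurement behind these figures concrete, here is a minimal sketch, not the code used in our study,8 of how such a percentage can be computed from per-user contribution counts: sort users by activity and find the smallest group whose contributions add up to half of the total. The function name and the toy data are illustrative only.

# Smallest share of active users accounting for half of all content.
def share_of_users_producing_half(contributions):
    """contributions: list of per-user contribution counts (posts, reviews, edits)."""
    counts = sorted(contributions, reverse=True)   # most active users first
    total = sum(counts)
    running, users = 0, 0
    for c in counts:
        running += c
        users += 1
        if running >= total / 2:                   # this group covers half the content
            break
    return users / len(counts)

# Toy data: a few prolific users plus a long tail of occasional contributors.
toy = [500, 300, 200, 100] + [1] * 996
print(f"{share_of_users_producing_half(toy):.1%} of users produce half the content")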
In light of such findings,8 it did not make sense that just 4% of the people would voluntarily write half of all the reviews in the Amazon dataset; I sensed something else was at play. A month after publication of our results, my hunch was confirmed. In October 2015, Amazon began a corporate campaign against paid fake reviews that continued in 2016 by suing almost 1,000 people accused of writing them.
Our analysis8 also found that if we consider only the reviews that some people find helpful, the percentage decreases to 2.5%, exploiting the positive correlation between the average helpfulness of each review, as rated by users, and a proxy for text quality. Although the example of English Wikipedia is the most biased, it represents a positive bias. The 2,000 people at the start of English Wikipedia probably triggered a snowball effect that helped Wikipedia become the vast encyclopedic resource it is today.
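As an illustration of that last step, the following hedged sketch checks whether user-assigned helpfulness tracks a simple text-quality proxy; the proxy actually used in the analysis is not specified here, so word count serves purely as a hypothetical stand-in, and the sample reviews are invented.

from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length numeric sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

reviews = [
    {"text": "Great product, works as described and arrived early.",
     "helpful_votes": 12, "total_votes": 14},
    {"text": "ok", "helpful_votes": 0, "total_votes": 3},
    {"text": "Detailed comparison with two alternatives I also own, including photos.",
     "helpful_votes": 30, "total_votes": 33},
]
helpfulness = [r["helpful_votes"] / r["total_votes"] for r in reviews]
quality_proxy = [len(r["text"].split()) for r in reviews]  # word count as a hypothetical stand-in
print(f"correlation between helpfulness and the proxy: {pearson(helpfulness, quality_proxy):+.2f}")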
Zipf’s least-effort principle,29 also called Zipf’s law, maintains that many people do only a little while few people do a lot, possibly helping explain a big part of activity bias. However, economic and social incentives also play a role in yielding this result. For example, Zipf’s law can be seen in most Web measures characterizing both the growth of the Web and its use, such as the distribution of the number of links per webpage shown for the U.K. Web in Figure 2.
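To connect Zipf’s law to the percentages reported earlier, here is a small sketch under the simplifying assumption that per-user activity follows a Zipf distribution with exponent alpha, meaning the r-th most active user contributes in proportion to 1/r^alpha; with alpha near 1, a tiny fraction of users already accounts for half of all activity, and that fraction shrinks as the population grows.

def zipf_half_share(num_users, alpha=1.0):
    # Per-user activity under Zipf's law: rank r contributes 1 / r**alpha.
    activity = [1.0 / r ** alpha for r in range(1, num_users + 1)]
    total = sum(activity)
    running = 0.0
    for users, a in enumerate(activity, start=1):
        running += a
        if running >= total / 2:       # smallest top group covering half the activity
            return users / num_users

for n in (10_000, 1_000_000):
    print(f"{n:>9,} users: {zipf_half_share(n):.2%} of them account for half the activity")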
Figure 1. The vicious cycle of bias on the Web: activity bias, data bias, algorithmic bias, interaction bias, self-selection bias, and second-order bias, linked through the Web, the screen, and the algorithm.
Figure 2. Shame effect (line with small trend direction) vs. minimal effort (line with notable trend direction) on the number of links on U.K. webpages, with the intersection between 12 and 13 links; the plot shows the number of pages (10^-6 to 10^0) against the number of links (10^1 to 10^3) on log-log axes. Data at the far right is probably due to pages having been written by software, not by Web users or developers.5