58 COMMUNICATIONS OF THE ACM | JUNE 2018 | VOL. 61 | NO. 6
contributed articles
A second set of biases is due to the interaction between different types of bias. Consider Figure 4, which plots the fraction of biographies of women in Wikipedia,16 a curve that could be explained through systemic gender bias throughout human history.25 However, an underlying factor hides a deeper bias that is revealed when looking more closely at the creation process. In the category of biographies, Wikipedia statistics show that less than 12% of Wikipedia's biographies are about women. In other categories, gender bias is even worse, reaching 4% in geography. On the other hand, as the percentage of all publicly reported Wikipedia female editors is just 11%, biographies actually show a small positive bias. Keep in mind these values are themselves biased, as not all Wikipedia editors identify their gender, and women might thus be underrepresented.
Our third source of data bias is Web spam, a well-known human-generated malicious bias that is difficult to characterize. The same applies to content (near) duplication (such as mirrored websites) that, in 2003, represented approximately 20% of static Web content.13
Since measuring almost any bias is difficult, its effect on prediction algorithms using machine learning is likewise difficult to understand. As Web data represents a biased sample of the population to begin with, studies based on social media may have a significant amount of error we can be sure is not uniformly distributed. For the same reason, the results of such research cannot be extrapolated to the rest of the population; consider, for example, the polling errors in the 2016 U.S. presidential election,18 though online polls predicted the outcome better than live polls. Other sources of error include biased data samples (such as those due to selection bias) or samples too small for the analytical technique at hand.7
Algorithmic Bias and Fairness
Algorithmic bias is added by the algorithm itself and is not present in the input data. If the input data is biased, the output of the algorithm might also reflect that bias. However, even if all possible biases are detected, defining how an algorithm should proceed is generally difficult, in the same way people disagree over what is a fair solution to any controversial issue. It may even require calling on a human expert to help detect whether an output includes any bias at all. In a 2016 research effort that used a corpus of U.S. news to learn she-he analogies through word embeddings, most of the results were reported as biased, as in nurse-surgeon and diva-superstar instead of queen-king.9 A quick Web search showed that approxi-
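The she-he analogy probe mentioned above works by simple vector arithmetic over word embeddings: the word closest to vec("he") - vec("she") + vec("nurse") answers "she is to nurse as he is to ?". A minimal sketch of this technique follows; the 4-dimensional vectors and tiny vocabulary are invented purely for illustration (the study in question used embeddings trained on a large U.S. news corpus):

```python
import numpy as np

# Toy embedding table; real studies use vectors trained on large corpora.
# These vectors are made up so that the first dimension roughly encodes gender.
emb = {
    "she":     np.array([ 1.0,  0.0, 0.5, 0.1]),
    "he":      np.array([-1.0,  0.0, 0.5, 0.1]),
    "nurse":   np.array([ 0.9,  0.8, 0.2, 0.0]),
    "surgeon": np.array([-0.9,  0.8, 0.2, 0.0]),
    "queen":   np.array([ 0.95, 0.1, 0.0, 0.9]),
    "king":    np.array([-0.95, 0.1, 0.0, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude=()):
    """Return the word w maximizing cos(emb[w], emb[b] - emb[a] + emb[c]),
    i.e., answer 'a is to c as b is to w'."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in exclude)
    return max(candidates, key=lambda w: cosine(emb[w], target))

# she:queen :: he:king is the expected, unbiased kind of analogy...
print(analogy("she", "he", "queen", exclude=("she", "he", "queen")))  # king
# ...but the same arithmetic also surfaces gendered occupation pairings.
print(analogy("she", "he", "nurse", exclude=("she", "he", "nurse")))  # surgeon
```

With vectors learned from real news text rather than toy values, the same query returns the biased pairings the study reported, which is the point: the arithmetic faithfully reflects whatever associations the training corpus contains.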
Figure 6. Dependency graph of biases affecting user interaction; its nodes include data and algorithmic bias, self-selection bias, position bias, interaction bias, social bias, mouse-movement bias, click bias, ranking bias, scrolling bias, and presentation bias.
Figure 5. Heat maps of eye-tracking analysis on web-search results pages, from 2005 (left) to 2014 (right).18 As with all the relative heat maps presented in this study, the red areas are those where participants spent the most amount of time looking as a percentage of the total time they looked at the page, followed by yellow, then green. The distinct triangle shape is not visible because searchers are scanning vertically more than they are reading horizontally.
Possible classification of biases whereby the cultural and cognitive columns are user-dependent.

Bias Type           Statistical   Cultural   Cognitive
Algorithmic              •
Presentation             •
Position                 •
Sampling                 •
Data                     •            •           •
Second-order             •            •           •
Activity                 •            •
User Interaction                      •           •
Ranking                  •                        •
Social                                •           •
Self-selection           •