13. Fetterly, D., Manasse, M., and Najork, M. On the
evolution of clusters of near-duplicate webpages.
Journal of Web Engineering 2, 4 (Oct. 2003), 228–246.
14. Gong, W., Lim, E.-P., and Zhu, F. Characterizing silent
users in social media communities. In Proceedings
of the Ninth International AAAI Conference on Web
and Social Media (Oxford, U.K., May 26–29). AAAI,
Fremont, CA, 2015, 140–149.
15. Graells-Garrido, E. and Lalmas, M. Balancing diversity
to countermeasure geographical centralization in
microblogging platforms. In Proceedings of the 25th
ACM Conference on Hypertext and Social Media
(Santiago, Chile, Sept. 1–4). ACM Press, New York,
2014, 231–236.
16. Graells-Garrido, E., Lalmas, M., and Menczer, F. First
women, second sex: Gender bias in Wikipedia. In
Proceedings of the 26th ACM Conference on Hypertext
and Social Media (Guzelyurt, TRNC, Cyprus, Sept. 1–4).
ACM Press, New York, 2015, 165–174.
17. Lazer, D.M.J. et al. The science of fake news. Science
359, 6380 (Mar. 2018), 1094–1096.
18. Mediative. The Evolution of Google’s Search Results
Pages & Effects on User Behaviour. White paper, 2014;
http://www.mediative.com/SERP
19. Mercer, A., Deane, C., and McGeeney, K. Why 2016
Election Polls Missed Their Mark. Pew Research
Center, Washington, D.C., Nov. 2016; http://www.
pewresearch.org/fact-tank/2016/11/09/why-2016-
election-polls-missed-their-mark/
20. Olteanu, A., Castillo, C., Diaz, F., and Kiciman, E. Social
Data: Biases, Methodological Pitfalls, and Ethical
Boundaries. SSRN, Rochester, NY, Dec. 20, 2016;
https://ssrn.com/abstract=2886526
21. Pariser, E. The Filter Bubble: How the New
Personalized Web Is Changing What We Read and
How We Think. Penguin, London, U.K., 2011.
22. Saez-Trumper, D., Castillo, C., and Lalmas, M. Social
media news communities: Gatekeeping, coverage,
and statement bias. In Proceedings of the ACM
International Conference on Information and
Knowledge Management (San Francisco, CA, Oct. 27–
Nov. 1). ACM Press, New York, 2013, 1679–1684.
23. Silberzahn, R. and Uhlmann, E.L. Crowdsourced
research: Many hands make tight work. Nature 526,
7572 (Oct. 2015), 189–191; https://psyarxiv.com/qkwst/
24. Smith, M., Patil, D.J., and Muñoz, C. Big Data: A Report
on Algorithmic Systems, Opportunity, and Civil Rights.
Executive Office of the President, Washington, D.C.,
2016; https://obamawhitehouse.archives.gov/sites/
default/files/microsites/ostp/2016_0504_data_
discrimination.pdf
25. Wagner, C., Garcia, D., Jadidi, M., and Strohmaier, M.
It’s a man’s Wikipedia? Assessing gender inequality
in an online encyclopedia. In Proceedings of the Ninth
International AAAI Conference on Web and Social
Media (Oxford, U.K., May 26–29). AAAI, Fremont, CA,
2015, 454–463.
26. Wang, T. and Wang, D. Why Amazon’s ratings might
mislead you: The story of herding effects. Big Data 2, 4
(Dec. 2014), 196–204.
27. White, R. Beliefs and biases in Web search. In
Proceedings of the 36th ACM SIGIR Conference
(Dublin, Ireland, July 28–Aug. 1). ACM Press, New
York, 2013, 3–12.
28. Wu, S., Hofman, J.M., Mason, W.A., and Watts, D.J.
Who says what to whom on Twitter. In Proceedings of
the 20th International Conference on the World Wide
Web (Hyderabad, India, Mar. 28–Apr. 1). ACM Press,
New York, 2011, 705–714.
29. Zipf, G.K. Human Behavior and the Principle of Least
Effort. Addison-Wesley Press, Cambridge, MA, 1949.
Ricardo Baeza-Yates (rbaeza@acm.org) is Chief
Technology Officer of NTENT, a search technology
company based in Carlsbad, CA, USA, and Director of
Computer Science Programs at Northeastern University,
Silicon Valley campus, San Jose, CA, USA.
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00.
user. If a personalization algorithm
uses only our interaction data, we see
only what we want to see, thus biasing
the content to our own selection biases and
keeping us in a closed world, closed off
to new items we might actually like. This
issue must be counteracted through collaborative
filtering or task contextualization, as well as
through diversity, novelty, serendipity, and even,
if requested, giving us the other side. This has a
positive effect on online privacy because, by
incorporating such techniques, less personal
information is required.
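The diversity and novelty counteraction described above is often implemented as a re-ranking step over personalized results. The following is a minimal sketch (not from the article) in the style of maximal marginal relevance; the items, scores, and toy similarity function are hypothetical:

```python
def rerank_with_diversity(candidates, relevance, similarity, lam=0.7, k=5):
    """Greedy diversity-aware re-ranking: trade off personalized
    relevance against similarity to items already selected, so the
    result list is not a closed loop of near-identical items."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(item):
            # Penalize items similar to what is already in the list.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Hypothetical toy data: items sharing a first letter are "duplicates."
relevance = {"a1": 0.9, "a2": 0.88, "a3": 0.86, "b1": 0.6, "c1": 0.5}
def similarity(x, y):
    return 1.0 if x[0] == y[0] else 0.0

print(rerank_with_diversity(relevance, relevance, similarity, lam=0.7, k=3))
# → ['a1', 'b1', 'c1'] rather than three near-identical "a" items
```

Pure relevance ranking would return the three "a" items; the diversity penalty surfaces items from other groups instead, which is one way a system can break a user out of a closed world.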
Conclusion
The problem of bias is much more complex than I have outlined here, where I
have covered only part of the problem.
Indeed, the foundation involves all of
our personal biases. Moreover,
many of the biases described here manifest
beyond the Web ecosystem (such as in mobile
devices and the Internet of Things). The
table here aims to classify
all the main biases against the three
types of bias I mentioned earlier. We
can group them in three clusters: The
top one involves just algorithms; the
bottom one—activity, user interaction,
and self-selection—involves those that
come just from people; and the middle
one—data and second-order—includes
those involving both. The question
marks in the first line indicate that each
program probably encodes the cultural
and cognitive biases of their creators.
One antecedent to support this claim is
an interesting data-analysis experiment
where 29 teams in a worldwide crowdsourcing
challenge performed a statistical analysis
for a problem involving racial discrimination.23
In early 2017, USACM published
the seven properties algorithms must
fulfill to achieve transparency and
accountability:1 awareness, access and
redress, accountability, explanation,
data provenance, auditability, and
validation and testing. This article is
most closely aligned with awareness.
The IEEE Computer Society also
began a project in 2017 to define
standards in this area, and at least
two new conferences on the topic were
held in February 2018. My colleagues
and I are also working on a website
with resources on “fairness measures”
related to algorithms
(http://fairness-measures.org/), and
there are surely other such initiatives.
All of them should help us define the
ethics of algorithms, particularly with
respect to machine learning.
As any attempt to be unbiased might
already be biased through our own cultural
and cognitive biases, the first step is thus
to be aware of bias. Only if Web designers
and developers know of its existence can
they address and, if possible, correct it.
Otherwise, our future could be a fictitious
world based on biased perceptions from which
not even diversity, novelty, or serendipity
would be able to rescue us.
Acknowledgments
I thank Jeanna Matthews, Leila Zia, and
the anonymous reviewers for their helpful
comments, as well as Amanda Hirsch for her
earlier English revision.
References
1. ACM U.S. Public Policy Council. Statement on
Algorithmic Transparency and Accountability. ACM,
Washington, D.C., Jan. 2017; https://www.acm.org/
binaries/content/assets/public-policy/2017_usacm_
statement_algorithms.pdf
2. Agarwal, D., Chen, B.-C., and Elango, P. Explore/exploit
schemes for Web content optimization. In Proceedings
of the Ninth IEEE International Conference on Data
Mining (Miami, FL, Dec. 6–9). IEEE Computer Society
Press, 2009.
3. Baeza-Yates, R., Castillo, C., and López, V. Characteristics
of the Web of Spain. Cybermetrics 9, 1 (2005), 1–41.
4. Baeza-Yates, R. and Castillo, C. Relationship between
Web links and trade (poster). In Proceedings of the
15th International Conference on the World Wide Web
(Edinburgh, U.K., May 23–26). ACM Press, New York,
2006, 927–928.
5. Baeza-Yates, R., Castillo, C., and Efthimiadis, E.N.
Characterization of national Web domains. ACM
Transactions on Internet Technology 7, 2 (May 2007),
article 9.
6. Baeza-Yates, R., Pereira, Á., and Ziviani, N.
Genealogical trees on the Web: A search engine user
perspective. In Proceedings of the 17th International
Conference on the World Wide Web (Beijing, China, Apr.
21–25). ACM Press, New York, 2008, 367–376.
7. Baeza-Yates, R. Incremental sampling of query logs.
In Proceedings of the 38th ACM SIGIR Conference
(Santiago, Chile, Aug. 9–13). ACM Press, New York,
2015, 1093–1096.
8. Baeza-Yates, R. and Saez-Trumper, D. Wisdom of the
crowd or wisdom of a few? An analysis of users’ content
generation. In Proceedings of the 26th ACM Conference
on Hypertext and Social Media (Guzelyurt, TRNC,
Cyprus, Sept. 1–4). ACM Press, New York, 2015, 69–74.
9. Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and
Kalai, A. Man is to computer programmer as woman
is to homemaker? Debiasing word embeddings.
In Proceedings of the 30th Conference on Neural
Information Processing Systems (Barcelona, Spain,
Dec. 5–10). Curran Associates, Inc., Red Hook, NY,
2016, 4349–4357.
10. Caliskan, A., Bryson, J.J., and Narayanan, A. Semantics
derived automatically from language corpora contain
human-like biases. Science 356, 6334 (Apr. 2017),
183–186.
11. Chapelle, O. and Zhang, Y. A dynamic Bayesian network
click model for Web search ranking. In Proceedings of
the 18th International Conference on the World Wide
Web (Madrid, Spain, Apr. 20–24). ACM Press, New York,
2009, 1–10.
12. Dupret, G.E. and Piwowarski, B. A user-browsing
model to predict search engine click data from past
observations. In Proceedings of the 31st ACM SIGIR
Conference (Singapore, July 20–24). ACM Press, New
York, 2008, 331–338.
Watch the author discuss
his work in this exclusive
Communications video.
https://cacm.acm.org/videos/
bias-and-the-web