content associated with large cities and
tourist attractions. Another example
of the network effect of Web bias is the
link structure of the Web itself. Figure 3
plots the number of links from the Web
within Spain to other countries, along
with exports from Spain to the same
other countries. 3 The countries toward
the bottom right are outliers, as they
had all sold the right to use their domains for other purposes (such as
.fm, the country-code top-level domain
for the Federated States of Micronesia). Ignoring them, the correlation
between exports and number of links
is more than 0.8 for Spain. In fact,
the more developed a country is, the
greater the correlation, ranging from
0.6 for Brazil to 0.9 for the U.K. 4
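To make the outlier argument concrete, here is a minimal sketch of how such a correlation could be computed with and without countries that resold their domains. All rows are made-up placeholders, not the data behind Figure 3, and the codes "aa" to "ee" are hypothetical. Taking logarithms before correlating is an assumption suggested by the logarithmic axes of the figure; the text does not say whether the reported values use raw or log-transformed numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows: (country-code domain, linked domains, exports in thousands of US$).
# The numbers are invented for illustration only.
rows = [
    ("aa", 20, 5_000),
    ("bb", 150, 60_000),
    ("cc", 900, 400_000),
    ("dd", 4_000, 1_500_000),
    ("ee", 30_000, 20_000_000),
    ("fm", 25_000, 10),  # resold domain: many links, almost no exports
]

# Domains treated as outliers because they are used for other purposes.
RESOLD = {"fm"}

def log_correlation(rows, excluded=frozenset()):
    """Correlate log10(links) with log10(exports), skipping excluded domains."""
    kept = [(links, exports) for code, links, exports in rows if code not in excluded]
    xs = [math.log10(links) for links, _ in kept]
    ys = [math.log10(exports) for _, exports in kept]
    return pearson(xs, ys)

with_outliers = log_correlation(rows)
without_outliers = log_correlation(rows, RESOLD)
print(f"with outliers: {with_outliers:.2f}, without: {without_outliers:.2f}")
```

In this toy data a single resold domain is enough to hide an otherwise strong relationship, which is why the outliers are set aside before the correlation is reported.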
(such as number of pages per website
or number of links per webpage). Figure 2 plots the number of links in U.K.
webpages on the x-axis and the number of webpages on the y-axis. Zipf’s
law is clearly visible on the right side, in
the line with the more negative slope.
However, at the beginning of the x-axis
there is a strong social force, which I call the
“shame effect,” that makes the slope
less negative. It also illustrates that
many people prefer to exert the least
effort, though most people also need
to feel they do enough to avoid feeling
ashamed of their effort. 5 These two effects are common characteristics of
people’s activity on the Web.
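The two regimes described for Figure 2 can be illustrated by fitting a straight line on log-log axes to each part of a histogram. The sketch below uses a synthetic histogram, not the U.K. data; the exponents (-0.5 for the head, -2.0 for the tail) and the break at 10 links are assumptions chosen only to show a less negative slope followed by a more negative one. A least-squares fit on logarithms is a rough way to read off a slope and is not claimed to be the method used for the figure.

```python
import math

def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic histogram: (links per page, number of pages). Invented values.
# Head: a gentle decay, standing in for the flatter start of the curve.
head = [(k, 1e6 * k ** -0.5) for k in range(1, 10)]
# Tail: a steeper power law, scaled so both pieces would meet at k = 10.
tail = [(k, 1e6 * 10 ** 1.5 * k ** -2.0) for k in range(10, 1000)]

head_slope = loglog_slope(head)  # recovers the assumed head exponent
tail_slope = loglog_slope(tail)  # recovers the assumed tail exponent
print(f"head slope: {head_slope:.2f}, tail slope: {tail_slope:.2f}")
```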
Finally, Nobel laureate Herbert Simon said, “A wealth of information
creates a poverty of attention.” Activity
bias thus generates a “digital desert”
across the Web, that is, Web content no one
ever sees. A lower bound comes from
Twitter data, where Saez-Trumper and
I 8 found that 1.1% of the tweets were
written and posted by people without
followers. Reviewing Wikipedia use statistics gave us an upper bound, whereby
31% of the articles added or modified in
May 2014 were never visited in June.
The actual size of the digital desert on
the Web likely lies in the first half of the
1% to 31% range.
On the other hand, bias is not always negative. Due to activity bias, all
levels of Web caching are highly effective at keeping the most used content
readily available, and the load on websites and the Internet in general is thus
much lower than it would otherwise be.
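The caching benefit can be sketched with a small simulation that compares the hit rate of a least-recently-used (LRU) cache under skewed and uniform request streams. Everything here is an assumption made for illustration: the catalog size, the cache capacity, the 1/rank popularity weights, and the choice of LRU itself, since the text does not name a caching policy.

```python
import random
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    """Fraction of requests served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for item in requests:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            cache[item] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used item
    return hits / len(requests)

# Hypothetical sizes, chosen only to keep the simulation small.
N_ITEMS, N_REQUESTS, CAPACITY = 10_000, 50_000, 100

rng = random.Random(0)  # fixed seed so repeated runs give the same streams
items = range(N_ITEMS)

# Skewed popularity: the item at rank r is requested with weight 1/r.
zipf_weights = [1 / (rank + 1) for rank in items]
skewed = rng.choices(items, weights=zipf_weights, k=N_REQUESTS)

# Baseline without activity bias: every item is equally likely.
uniform = rng.choices(items, k=N_REQUESTS)

skewed_rate = lru_hit_rate(skewed, CAPACITY)
uniform_rate = lru_hit_rate(uniform, CAPACITY)
print(f"skewed: {skewed_rate:.2%}, uniform: {uniform_rate:.2%}")
```

With a cache holding only 1% of the catalog, the uniform stream rarely hits, while the skewed stream is served from the cache far more often, which is the effect the paragraph describes.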
Data Bias
As with people’s skills, data quality is
heterogeneous and thus, to some extent,
expected to be biased. People
working in government, universities,
and other institutions that deal with
information should publish data of
higher quality and less bias, while social
media as a whole is much larger,
more biased, and, without doubt, of lower
average quality. On the other hand,
the number of people contributing to
social media is probably at least one
order of magnitude greater than the
number of people working in information-based
institutions. There is thus
more data of any quality coming from
all people, including high-quality data,
no matter what definition of quality
one uses. Still, a lot of fake content
on the Web seems to spread faster than
reliable content. 17
The first set of biases seen in people
interacting with the Web is due to their
demographics. Accessing and using the
Internet correlates with educational,
economic, and technological bias, as
well as other characteristics, causing a
ripple effect of bias in Web content and
links. For example, it is estimated that
over 50% of the most popular websites
are in English, while native English
speakers make up only approximately 5%
of the world’s population; this increases
to 13% if all English speakers are included, as estimated by Wikipedia.
Geographical bias is also seen in Web
Figure 3. Economic bias in links for the Web in Spain. 3
[Plot: Number of Linked Domains on the x-axis (10 to 100,000) and Exports (Thousands of US$) on the y-axis (1 to 100,000,000), both on logarithmic scales.]
Figure 4. Accumulated fraction of women’s biographies in Wikipedia. 16
[Plot: axes labeled Fraction of Biographies Per Year (0.00 to 0.25) and Cumulative Fraction (0.0 to 1.0).]