of this analysis along with the rankings
of two data brokers. In second place is
comScore, which is found on 38% of pages, followed by Facebook with 31%. It is
striking that these two companies combined still have less reach than Google.
Additionally, companies were categorized according to their type of revenue model. Some 80% of the top 10
companies are advertisers. The only
exceptions to this rule are Adobe and
Amazon. Adobe offers a mix of software and services, including traffic
analytics. Amazon is in the business
of both consumer-retail sales as well
as Web hosting with the Amazon Web
Services (AWS) division. At present it is
unclear if AWS data is integrated into
Amazon product recommendations or
deals, but the possibility exists.
While advertisers dominate online
tracking, I was also able to detect two
major data brokers: Experian (5% of
pages), and Acxiom (3% of pages). The
main business model of data brokers is
to collect information about individuals and households in order to sell it to
financial institutions, employers, marketers, and other entities with such
interest. Credit scores provided by
Experian help determine if a given individual qualifies for a loan, and if so,
at what interest rate. Given that a 2007
study revealed that “62.1% of all bankruptcies ... were medical,” 11 it is possible that some data brokers not only
know when a given person suffered a
medical-related bankruptcy, but perhaps even when they first searched
for information on the ailment that
caused their financial troubles.
Health information leakage. The HTTP 1.1 protocol specification warns that “the source of a link [URI] might be private information or might reveal an otherwise private information source” and advises that “[c]lients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.” 8 In simpler terms, Web pages that include third-party elements, but do not use secure HTTP requests, risk leaking sensitive data via the Referer field. Of the pages analyzed, only 3.24% used secure HTTP; the rest used non-encrypted HTTP connections and thereby potentially transmitted sensitive information to third parties. Unsurprisingly, a significant amount of
counts for 8% of the top requested elements. Table 1 presents additional detail on the file extensions found.
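The grouping of requested elements by file extension can be sketched as follows. This is a minimal illustration, not the method used in the analysis: the mapping from extensions to the categories of Table 1 is an assumption, and the function names `classify` and `extension_shares` are hypothetical.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

# Hypothetical extension-to-category mapping; the article does not list
# which extensions fall under each type in Table 1.
CATEGORIES = {
    ".js": "JavaScript",
    ".gif": "Image", ".png": "Image", ".jpg": "Image", ".jpeg": "Image",
    ".php": "Dynamic Page", ".asp": "Dynamic Page", ".aspx": "Dynamic Page",
}

def classify(url: str) -> str:
    """Return a Table 1 style category for a requested element URL."""
    path = urlsplit(url).path          # drops the query string and fragment
    ext = splitext(path)[1].lower()    # "" when the path has no extension
    if not ext:
        return "No Extension"
    return CATEGORIES.get(ext, "Other")

def extension_shares(urls):
    """Percentage of requests per category, rounded to whole numbers."""
    counts = Counter(classify(u) for u in urls)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}
```

Classifying on the URL path rather than the full URL keeps query strings from being mistaken for extensions.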
Given that tracking occurs on the so-called Invisible Web, it initially appears odd that so many mechanisms are images. However, when investigating the images themselves, it is clear they provide little indication of whom they belong to, and thus users are kept in the dark as to their purpose or presence. An examination of the top 100 requested images determined that only 24% contained information that would alert the user they had initiated contact with a third party. Many images were only a single pixel in size, and are often referred to as tracking pixels as their only purpose is to initiate HTTP requests. The most popular image, found on 45% of pages, was a single tracking pixel with the name __utm.gif, which is part of the Google Analytics service. The second most popular image is the clearly identifiable Facebook “Like” button that was found on 16% of pages. It is unclear how many users elect to “Like” an illness, but Facebook is able to record page visits regardless of whether a user clicks the “Like” button, or even has a Facebook account in the first place. Google and Facebook are not alone, however; a number of other companies also track users online.
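To illustrate why a request for a single-pixel image is enough to reveal a page visit, the sketch below applies only the rule quoted from the HTTP 1.1 specification: the Referer header is withheld when a securely transferred page triggers a non-secure request, and is sent otherwise. The function name `referer_for` and the example URLs are hypothetical, and real browsers apply additional policies not modeled here.

```python
from typing import Optional
from urllib.parse import urldefrag, urlsplit

def referer_for(page_url: str, element_url: str) -> Optional[str]:
    """Return the Referer value sent with a request for element_url,
    or None when the header is withheld under the quoted rule."""
    page_secure = urlsplit(page_url).scheme == "https"
    element_secure = urlsplit(element_url).scheme == "https"
    if page_secure and not element_secure:
        return None
    # The Referer value never includes the fragment part of the page URL.
    return urldefrag(page_url).url

# Hypothetical page and tracking pixel, both fetched over plain HTTP.
page = "http://health.example/conditions/diabetes"
pixel = "http://tracker.example/pixel.gif"
print(referer_for(page, pixel))
```

In this sketch the third party receives the full address of the health page, including the name of the condition, even though the user never sees the image.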
Corporate ownership. While security and privacy research has often focused on how user privacy is violated, insufficient attention has been given to who is collecting user information. The simple answer is that a variety of advertising companies have developed a massive data collection infrastructure that is designed to avoid detection, as well as ignore, counteract, or evade user attempts at limiting collection. Despite the wide range of entities collecting user data online, a handful of privately held U.S. advertising firms dominate the landscape of the Invisible Web.
Some 78% of pages analyzed included elements that were owned by Google. Such elements represent a number of hosted services and use a variety of domain names: they range from traffic analytics (google-analytics.com), advertisements (doubleclick.net), hosted JavaScript (googleapis.com), to videos (youtube.com). Regardless of the type of services provided, in some way all of these HTTP requests funnel information back to Google. This means a single company has the ability to record the Web activity of a huge number of individuals seeking sensitive health-related information without their knowledge or consent.
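The idea of attributing requests on many domains to a single owner can be sketched as below. This is an illustrative sketch, not the analysis code: the `OWNERS` table contains only the four domains named in the text, and the function names `owner_of` and `company_reach`, along with the input format, are assumptions.

```python
from urllib.parse import urlsplit

# Domains named in the text, all attributed to Google.
OWNERS = {
    "google-analytics.com": "Google",
    "doubleclick.net": "Google",
    "googleapis.com": "Google",
    "youtube.com": "Google",
}

def owner_of(url):
    """Return the owning company for a request URL, or None if unknown."""
    host = (urlsplit(url).hostname or "").lower()
    # Match the domain itself or any of its subdomains.
    for domain, company in OWNERS.items():
        if host == domain or host.endswith("." + domain):
            return company
    return None

def company_reach(pages):
    """pages maps a page URL to the element URLs it requested.
    Returns the percentage of pages on which each company owns
    at least one requested element."""
    hits = {}
    for elements in pages.values():
        companies = {owner_of(u) for u in elements} - {None}
        for company in companies:
            hits[company] = hits.get(company, 0) + 1
    return {c: 100 * n / len(pages) for c, n in hits.items()}
```

Counting each company at most once per page is what makes the result comparable to a "percentage of pages" figure, regardless of how many of its domains a page contacts.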
While Google is the elephant in the
room, they are far from alone. Table 2
details the top 10 firms found as part
Table 1. Types of file extensions.
Type            %
No Extension    47
JavaScript      33
Image           8
Dynamic Page    4
Other           8
Table 2. Corporate ownership and risk assessment (N = 80,142).
Rank  Pages (%)  Company     Revenue                Identification  Blind Discrimination
1     78         Google      Advertising            X               X
2     38         comScore    Advertising            —               X
3     31         Facebook    Advertising            X               X
4     22         AppNexus    Advertising            —               X
5     18         AddThis     Advertising            —               X
6     18         Twitter     Advertising            —               X
7     16         Quantcast   Advertising            —               X
8     16         Amazon      Retail and Hosting     —               X
9     11         Adobe       Software and Services  —               X
10    11         Yahoo!      Advertising            —               X
...   —          —           —                      —               —
31    5          Experian    Data Broker            X               —
...   —          —           —                      —               —
47    3          Acxiom      Data Broker            X               —