of browser and computer the user is
on. In this case, the user employs the
Mozilla Firefox browser on a Macintosh
computer. Such information is helpful
when loading specially optimized pages for smartphones or tablets.
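As a rough illustration of the request described above, the following Python sketch splits the excerpted request into its method, path, and headers. The function name `parse_request` is hypothetical, the request is the simplified excerpt from the text rather than a complete HTTP message, and the User-Agent value is left truncated as it appears in the text.

```python
def parse_request(raw: str) -> dict:
    """Split a simplified HTTP request excerpt into method, path, and headers."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    # The first line holds the method and the path. The excerpt in the text
    # omits the protocol version, so anything after the path is ignored.
    method, path, *_ = lines[0].split()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {"method": method, "path": path, "headers": headers}


# The User-Agent value is truncated in the text and is kept truncated here.
EXAMPLE_REQUEST = """
GET /hiv/
Host: www.cdc.gov
User-Agent: Mozilla/5.0 (Macintosh...
"""

parsed = parse_request(EXAMPLE_REQUEST)
print(parsed["method"], parsed["path"], parsed["headers"]["Host"])
```

The "Host" header identifies which server is being addressed, while the path names the page being requested from it.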
Once this request has been made,
the CDC Web server sends the user an
HTML file. This file contains the text
of the page as well as a set of instructions that tells the Web browser how
to download and style additional elements such as images (Figure 1, step 2). To
get the CDC logo, the following HTTP request is made:
GET /TemplatePackage/images/cdcHeaderLogo.gif
Host: www.cdc.gov
User-Agent: Mozilla/5.0 (Macintosh...
Referer: http://www.cdc.gov/hiv/
This request introduces a new piece
of information called the Referer,
which contains the address of the page
the user is currently viewing. The CDC
Web server may keep records of all
HTTP requests in order to determine
what pages and content are being requested most often.
Because the “Host” for both requests
is identical (www.cdc.gov), the user is
only interacting with a single party, and
such requests are called “first-party requests.”
The only two parties who know
the user is looking up information
about HIV are the user and the CDC.
However, the HTML file also contains
code that makes requests to outside parties.
These types of third-party requests
typically download third-party elements
such as images and JavaScript. Because
users are often unaware of
such requests, they form the basis of the
so-called “Invisible Web.”
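The distinction between first- and third-party requests can be sketched in Python by comparing the host a request is sent to with the host of the page named in the Referer. This is a minimal sketch under stated assumptions: the function name `classify_request` is hypothetical, `third-party.example` is a placeholder host, and exact host matching is a simplification (an analysis could instead compare registered domains).

```python
from urllib.parse import urlsplit


def classify_request(request_host: str, referer: str) -> str:
    """Label a request by comparing its Host with the host of the referring page."""
    # urlsplit(...).hostname is lower-cased and excludes any port number.
    page_host = urlsplit(referer).hostname or ""
    if request_host.lower() == page_host:
        return "first-party"
    return "third-party"


# The CDC logo request from the text: Host and Referer share the same host.
print(classify_request("www.cdc.gov", "http://www.cdc.gov/hiv/"))

# A placeholder outside host (an assumption, not taken from the text).
print(classify_request("third-party.example", "http://www.cdc.gov/hiv/"))
```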
On the CDC’s HIV page, third-party requests are made to the servers
of Facebook, Pinterest, Twitter, and
Google. In the case of the first three
companies, the requested elements
are all social media buttons, which allow for the sharing of content via the
“Recommend,” “Tweet,” or “Pin It”
icons (Figure 1, step 3). It is unlikely that
many users would understand that the
presence of these buttons indicates
that their data is sent to these companies. In contrast, the Google elements
on the page are entirely invisible and
there is no Google logo present. One of
identify 80,142 unique health-related
Web pages by compiling responses to
queries for 1,986 common diseases.
This selection of pages represents what
users are actually visiting, rather than a
handful of specific health portals.
Having identified a population of
health-related Web pages, I created a
custom software platform to monitor
the HTTP requests initiated to third
parties. I discovered that 91% of pages
make requests to additional parties,
potentially putting user privacy at risk.
Given that HTTP requests often include
the URI of the page currently being
viewed (known as the “Referer” [sic]),
information about specific symptoms,
treatments, and diseases may be transmitted. My analysis shows 70% of URIs
contain such sensitive information.
This proliferation of third-party requests makes it possible for corporations to assemble dossiers on the health
conditions of unwitting users. To identify which corporations are
the recipients of this data, I have also
analyzed the ownership of the most requested third-party domains. This has
produced a revealing picture of how
personal health information becomes
the property of private corporations.
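To make concrete how a Referer URI can reveal a health condition, the Python sketch below checks whether a URI's path or query string contains a term from a word list and computes the share of URIs that do. It is only an illustration and not the custom software platform described above: the function names, the term list, and the `health.example` addresses are all assumptions.

```python
from typing import Iterable
from urllib.parse import unquote, urlsplit


def mentions_term(uri: str, terms: Iterable[str]) -> bool:
    """Return True if the URI's path or query string contains any of the terms."""
    parts = urlsplit(uri)
    haystack = unquote(parts.path + "?" + parts.query).lower()
    # Plain substring matching is a simplification: "hiv" would also match
    # an unrelated path such as "/archive/".
    return any(term.lower() in haystack for term in terms)


def share_with_terms(uris: Iterable[str], terms: Iterable[str]) -> float:
    """Return the fraction of URIs that mention at least one term."""
    uri_list = list(uris)
    term_list = list(terms)
    if not uri_list:
        return 0.0
    hits = sum(mentions_term(uri, term_list) for uri in uri_list)
    return hits / len(uri_list)


# Illustrative inputs only; apart from the CDC address these are made up.
SAMPLE_URIS = [
    "http://www.cdc.gov/hiv/",
    "http://health.example/about",
    "http://health.example/conditions/diabetes?page=2",
]
SAMPLE_TERMS = ["hiv", "diabetes"]

share = share_with_terms(SAMPLE_URIS, SAMPLE_TERMS)
print(f"{share:.0%} of the sample URIs mention a listed term")
```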
This article begins with a short primer
on how third-party HTTP requests
work, reviews previous research in this
area, details methodology and findings,
and concludes with suggestions for protecting
health privacy online.
Background: Third-Party HTTP Requests
A real-world example is the best way
to understand how information
is leaked to third parties on a typical
Web page. When a user searches online for “HIV,” one of the top results is
for the U.S. Centers for Disease Control and Prevention (CDC) page with
the address http://www.cdc.gov/hiv/ (footnote a).
Clicking on this result initiates what
is known as a “first-party” Hypertext
Transfer Protocol (HTTP) request to
the CDC Web server (Figure 1, step 1). A portion of such a request is as follows:
GET /hiv/
Host: www.cdc.gov
User-Agent: Mozilla/5.0 (Macintosh...
This request is sent to the CDC Web
server (“Host: www.cdc.gov”) and is an
instruction to return (“GET”) the page
with the address “/hiv/.” This request
also includes “User-Agent” information that tells the server what kind
a. As of April 2014.
Figure 1. First- and third-party requests on the CDC Web page for HIV/AIDS.
[Figure 1 diagram. Steps shown: (1) the user initiates a request to download the Web page from the CDC server; (2) the Web page initiates a request to download the CDC logo (image) from the CDC server; (3) the Web page initiates a request to download a share button (image) from Facebook; (4) the Web page initiates a request to download JavaScript from Google. Requests to the CDC are first-party requests (green); requests to Facebook and Google are third-party requests (red).]