are most likely to visit, regardless of whether
the site is health-centric.
Third-party request detection. To detect third-party HTTP requests, my methodology employs a “headless” Web browser named PhantomJS.24 PhantomJS requires no GUI and has very low resource utilization, making it well suited for large-scale analyses. Because it is built on WebKit, PhantomJS’s underlying rendering engine is capable of executing JavaScript, setting and storing cookies, and producing screen captures. Most important for this project, PhantomJS allows for the direct monitoring of HTTP requests without resorting to browser hacks or network proxies.
The most recent versions of PhantomJS (1.5+) do not support the Adobe Flash browser plug-in. To address this potential limitation, I conducted testing with an older version of PhantomJS (1.4) and Flash. The inclusion of Flash led to much higher resource utilization, instability, and a large performance penalty. While this method successfully analyzed Flash requests, I determined that Flash elements were comparatively rare and had negligible effect on the top-level trends presented below. Therefore, I decided to forgo analysis of Flash requests in favor of greater software reliability by using the most recent version of PhantomJS (1.9).
To take full advantage of PhantomJS, I created a custom software platform named WebXray that drives PhantomJS, collects and analyzes the output in Python, and stores results in MySQL. The workflow begins with a predefined list of Web page addresses that are ingested by a Python script. PhantomJS then loads each Web address, waits 30 seconds to allow all redirects and content loading to complete, and sends JSON-formatted output back to Python for analysis.
This technique represents an improvement over methods such as searching for known advertising elements detected by popular programs such as Ghostery or AdBlock.4 As of March 2014, Ghostery reports that the WebMD Web page for “HIV/AIDS” contains four trackers. In contrast, WebXray detects the same page initiating requests to thirteen distinct third-party domains.
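The counting of distinct third-party domains described above can be illustrated with a minimal Python sketch. This is not WebXray's code: the function names, the example URLs, and the rule used to decide what counts as the "same" domain are all assumptions made for illustration. In particular, the sketch compares only the last two labels of each hostname, a naive stand-in for a Public Suffix List lookup that would mishandle suffixes such as "co.uk".

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def naive_registered_domain(url: str) -> str:
    """Return the last two labels of the URL's hostname, e.g. 'example.com'.

    Assumption: this naive rule stands in for a Public Suffix List lookup.
    URLs without a hostname (such as data: URIs) yield an empty string.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def third_party_domains(page_url: str, request_urls: Iterable[str]) -> Set[str]:
    """Collect the distinct domains requested by a page that differ from its own."""
    page_domain = naive_registered_domain(page_url)
    domains = set()
    for request_url in request_urls:
        domain = naive_registered_domain(request_url)
        # Skip requests with no hostname and requests to the page's own domain.
        if domain and domain != page_domain:
            domains.add(domain)
    return domains


# Hypothetical request log for a single page load.
logged_requests = [
    "http://www.example.com/css/site.css",
    "http://img.example.com/logo.png",
    "http://ads.example.net/pixel.gif",
    "http://cdn.example.net/lib.js",
    "http://metrics.example.org/m.js",
]
found = third_party_domains("http://www.example.com/hiv-aids", logged_requests)
print(sorted(found))  # ['example.net', 'example.org']
```

Counting distinct domains rather than individual requests means that a page loading many resources from one outside host still registers a single third party.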
company,4,17,18,25 but often utilize their own methodologies for analysis. Krishnamurthy and Wills have conducted many of the most important studies in this area18 and developed the idea of a privacy footprint17 based upon the number of nodes a given user is exposed to as they surf the Web. This team has consistently found high levels of tracking on the Web, including on sites dealing with sensitive personal information such as health.17 Other teams have performed comparative analyses between countries4 and explored general trends in tracking mechanisms.19,25 A common theme among all measurement research is that the amount of tracking on the Web is increasing and shows no signs of abating. The data presented in this article updates and advances extant findings with a focus on how users are tracked when they seek health information online.
Methodology
To quickly and accurately reveal third-party HTTP requests on health-related Web pages, my methodology has four main components: page selection, third-party request detection, request analysis, and corporate ownership analysis.
Page selection. A variety of websites such as newspapers, government agencies, and academic institutions provide health information online. Thus, limiting analysis to popular health-centric sites fails to reach many of the sites users actually visit.16 To wit, the Pew Internet and American Life Project found “77% of online health seekers say they began at a search engine such as Google, Bing, or Yahoo”9 as opposed to a health portal like WebMD.com. To best model the pages a user would visit after receiving a medical diagnosis, I first compiled a list of 1,986 diseases and conditions based on data from the Centers for Disease Control, the Mayo Clinic, and Wikipedia. Next, I used the Bing search API to find the top 50 search results for each term.b Once duplicates and binary files (pdf, doc, xls) were filtered out, a set of 80,142 unique Web pages remained. A major contribution of this study beyond prior work is that my analysis focuses on the pages that users seeking medical information
b Search results were localized to U.S./English.
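The filtering step in page selection can be sketched in Python as follows. With 1,986 terms and up to 50 results each, the candidate pool is at most 1,986 × 50 = 99,300 URLs before filtering. The function name, the exact-string notion of a duplicate, and the extension-based test for binary files are assumptions for illustration; the text states only that duplicates and pdf, doc, and xls files were removed.

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Extensions named in the text; matching on the URL path is an assumption.
BINARY_EXTENSIONS = (".pdf", ".doc", ".xls")


def select_pages(search_results: Iterable[str]) -> List[str]:
    """Drop binary-file links and duplicate URLs, keeping first-seen order."""
    seen = set()
    pages = []
    for url in search_results:
        # Inspect only the path so that query strings do not hide the extension.
        path = urlsplit(url).path.lower()
        if path.endswith(BINARY_EXTENSIONS):
            continue
        # Assumption: two results are duplicates only if the URLs match exactly.
        if url in seen:
            continue
        seen.add(url)
        pages.append(url)
    return pages


# Hypothetical search results pooled across several search terms.
results = [
    "http://a.example/diabetes",
    "http://a.example/diabetes",
    "http://b.example/report.PDF",
    "http://c.example/data.xls?download=1",
    "http://d.example/asthma",
]
print(select_pages(results))
```

Here the second, third, and fourth entries are discarded, leaving two unique Web pages.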