This is because Ghostery and AdBlock rely on curated blacklists of known trackers rather than reporting all requests.
Request analysis. The primary goal of WebXray is to identify third-party requests by comparing the domain of the Web page being visited to the domains of requests being made. For example, the address “http://example.com” and the request “http://images.example.com/logo.png” both share the domain “example.com,” thus constituting a first-party request.
Alternatively, a request from the same page to “http://www.google-analytics.com/ga.js,” which has the domain “google-analytics.com,” is recognized as a third-party request. The same technique used for HTTP requests is also applied to evaluate the presence
of third-party cookies. The method is
not flawless, as a given site may actually use many domains, or a subdomain
may point to an outside party. However,
when evaluating these types of requests
in aggregate, such problems constitute
the statistical noise that is present in
any large dataset.
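As an illustration of this comparison, the following sketch (not drawn from WebXray itself) checks whether a request shares a registered domain with the page that issued it. The function names and the naive two-label heuristic are assumptions made for illustration; a production crawler would consult the Public Suffix List to handle domains such as example.co.uk.

from urllib.parse import urlsplit

def registered_domain(url):
    # Naive registered-domain extraction: keep the last two labels of the
    # hostname. (A real tool would use the Public Suffix List instead.)
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def is_third_party(page_url, request_url):
    # A request is third-party when its registered domain differs from
    # that of the page being visited.
    return registered_domain(request_url) != registered_domain(page_url)

print(is_third_party("http://example.com",
                     "http://images.example.com/logo.png"))    # False: first-party
print(is_third_party("http://example.com",
                     "http://www.google-analytics.com/ga.js"))  # True: third-party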
Finally, to evaluate larger trends in tracking mechanisms, third-party requests are dissected to extract arguments (for example, “?SITEID=123”) and file extensions such as .js (JavaScript), .jpg (image), and .css (cascading style sheet).
Removing arguments also allows
for a more robust analysis of which elements are the most prevalent, as argument strings often have specific site
identifiers, making them appear unique
when they are not.
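The sketch below shows one way such dissection could be performed; it is an illustration rather than the study’s actual code, and the example URL and SITEID argument are invented.

import os
from urllib.parse import urlsplit, parse_qsl

def dissect_request(url):
    # Split a request URL into its argument string and file extension, and
    # return the URL with arguments removed so that otherwise identical
    # elements (e.g., tracker.js?SITEID=123 vs. ?SITEID=456) group together.
    parts = urlsplit(url)
    args = dict(parse_qsl(parts.query))                  # {"SITEID": "123"}
    extension = os.path.splitext(parts.path)[1].lower()  # ".js", ".jpg", or ""
    stripped = parts._replace(query="", fragment="").geturl()
    return stripped, extension, args

print(dissect_request("http://tracker.example.net/pixel.js?SITEID=123"))
# ('http://tracker.example.net/pixel.js', '.js', {'SITEID': '123'})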
Corporate ownership. A specific focus of this investigation is to determine
which corporate bodies are receiving
information from health-related Web
pages. While it is possible to programmatically detect requests to third-party domains, it is not always clear to whom those domains belong. By examining domain registration records, I have been able to pair seemingly obscure domain names (for example, “2mdn.net,” “fbcdn.net”) with their
corporate owners (for example, Google,
Facebook). This process has allowed
me to follow the data trail back to the
corporations that are the recipients of
user data. To date, the literature has
given much more attention to technical
mechanisms, and much less to the underlying corporate dynamics. This fresh
analytical focus highlights the power of
a handful of corporate giants.
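In practice this pairing can be expressed as a simple lookup table compiled offline from registration (WHOIS) records; the sketch below is an assumption about how such a table might look, populated only with the two example domains named above.

# Hand-curated table pairing request domains with their corporate owners,
# compiled from domain registration (WHOIS) records. Only the two example
# domains mentioned in the text are listed here.
DOMAIN_OWNERS = {
    "2mdn.net": "Google",
    "fbcdn.net": "Facebook",
}

def owner_of(request_domain):
    # Return the corporate owner of a third-party domain, if known.
    return DOMAIN_OWNERS.get(request_domain, "unknown")

print(owner_of("fbcdn.net"))  # Facebook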
Limitations. While this methodology is resource efficient and performs
well at large scale, it comes with several
potential limitations, many of which
would produce an under-count of the
number of third-party requests. First,
given the rapid rate at which pages are accessed, it is possible that rate-limiting
mechanisms on servers may be triggered (that is, the requests generated by
my IP would be identifiable as automated), and my IP address could be blacklisted, resulting in an under-count.
Second, because I use PhantomJS
without browser plugins such as Flash,
Java, and Silverlight, some tracking
mechanisms may not load or execute
properly, resulting in an under-count.
Third, many tracking mechanisms are
designed to be difficult to detect by a
user, and an under-count could result
from a failure to detect particularly
clever tracking mechanisms. Therefore,
the findings presented here constitute a
lower bound on the number of requests being made.
Findings
In April 2014, I scanned 80,142 Web
pages that were collected from search
results for 1,986 common diseases with
the intent of detecting the extent to which, and the ways in which, the sensitive health data of users was being leaked.
General trends. I have broken up my
top-level findings into five general categories based on information gleaned
from the TLDs used. They are: all pages,
commercial pages (.com), non-profit
pages (.org), government pages (.gov),
and education-related pages (.edu).
This information is illustrated in Figure
2. Of all pages examined, 91% initiate
some form of third-party HTTP request,
86% download and execute third-party
JavaScript, and 71% utilize cookies. Unsurprisingly, commercial pages were
above the global mean and had the most
third-party requests (93%), JavaScript
(91%), and cookies (82%). Education
pages had the least third-party HTTP
requests (76%) and JavaScript (73%),
with a full quarter of the pages free of
third-party requests. Government pages
stood out for relatively low prevalence
of third-party cookies, with only 21% of
pages storing user data in this way. Figure 2 details these findings.
Mechanisms. Given that 91% of pages make third-party HTTP requests, it
is helpful to know what exactly is being
requested. Many third-party requests
lack extensions and, when viewed in a browser, display only blank pages that
generate HTTP requests and may also
manipulate browser caches. Such requests accounted for 47% of the top 100
requests and may point toward emerging trends in the ongoing contest between user preferences and tracking
techniques. The second most popular
type of requested element was JavaScript files (33%). These files are able
to execute arbitrary code in a user’s
browser and may be used to perform
fingerprinting techniques, manipulate
caches and HTML5 storage, as well as
initiate additional requests. The third
most popular type of content is the
tried-and-true image file, which ac-
[Figure 2. Prevalence of third-party requests, JavaScript, and cookies by TLD. Bar chart of the percent of pages with third-party requests, JavaScript, and cookies, broken out by .com, .org, .gov, and .edu.]
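As a rough illustration of how requested elements can be bucketed into the types discussed above (extension-less, JavaScript, image), consider the following sketch; the URLs, the image-extension set, and the bucket names are invented for illustration, and the counts bear no relation to the study’s figures.

import os
from collections import Counter
from urllib.parse import urlsplit

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}  # assumed set, for illustration

def element_type(url):
    # Bucket a request by the extension of its path; requests with no
    # extension (often blank tracking pages) are grouped together.
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    if not ext:
        return "no extension"
    if ext == ".js":
        return "javascript"
    return "image" if ext in IMAGE_EXTS else "other"

sample = [  # invented example requests
    "http://tracker.example.net/collect",
    "http://cdn.example.net/ga.js",
    "http://ads.example.net/pixel.jpg",
]
print(Counter(element_type(u) for u in sample))
# Counter({'no extension': 1, 'javascript': 1, 'image': 1})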