plex URIs that use GET requests to pass
on stateᵃ, thus obscuring the identity of
the actual resources.
a. These characters (?, #, =, and &), followed by keywords, may appear after the last “slash” in the URI, making for the long URIs often generated by dynamic content servers.
URIs that carry state are used heavily in Web applications but are, to
date, largely unanalyzed. For example, in a June 2007 talk, Udi Manber,
Google’s VP of engineering, addressed
the issue of why Web search is so difficult,²⁵ explaining that on an average
day, 20%–25% of the searches seen by
Google have never been submitted before and that each of these searches
generates a unique identifier (using
server-specific encoding information).
So a Web-graph model would represent only the requesting document
(whether the request comes from a
user or is generated by, for example,
dynamic advertisement content) linked
to the www.google.com node. However, if, as is widely reported, Google
receives more than 100 million queries
per day, and if 20% of them are unique,
then more than 20 million links, represented as new URIs that encode the
search term(s), should show up in the
Web graph every day, or around 200 per
second. Do these links follow the same
power laws? Do the same growth models explain these behaviors? We simply
don’t know.
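To make such state-carrying URIs concrete, here is a small TypeScript sketch; the parameter names q and hl mirror common search-engine conventions but are assumptions, not a documented API:

// Build a state-carrying URI of the kind discussed above: the search
// terms become query parameters after the "?", separated by "&".
const query = "web science emergent behavior";
const params = new URLSearchParams({ q: query, hl: "en" });
const searchUri = `https://www.google.com/search?${params.toString()}`;
console.log(searchUri);
// https://www.google.com/search?q=web+science+emergent+behavior&hl=en

// A graph model that treats URIs as opaque node identifiers sees a
// brand-new node for every novel query, even though the resource behind
// it is the same search service.
const parsed = new URL(searchUri);
console.log(parsed.hostname);              // www.google.com
console.log(parsed.searchParams.get("q")); // web science emergent behavior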
Analyzing the Web solely as a graph
also ignores many of its dynamics
(especially at short timescales). Many
phenomena known to Web users (such
as denial-of-service attacks caused by
flooding a server and the need to click
the same link multiple times before getting a response) cannot be explained by
the Web-graph model and often cannot
be expressed in terms amenable to
such graph-based analysis. Representing them at the networking level, ignoring protocols and how they work, also
misses key aspects of the Web, as well
as a number of behaviors that emerge
from the interactions of millions of requests hitting many thousands of servers every second. Web dynamics were
analyzed more than a decade ago,²⁰ but
the combination of (i) the exponential
growth in the amount of Web content,
(ii) the change in the number, power,
and diversity of Web servers and applications,
and (iii) the increasing number of diverse users from everywhere
in the world makes a similar analysis
impossible today without creating and
validating new models of the Web’s
dynamics. Such models must also pay
special attention to the details of the
Web’s architecture, as well as to the
complexity of the interactions actually
taking place there.
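As one illustration of why such models are needed, consider a deliberately toy, discrete-time sketch (in TypeScript) of a dynamic the Web graph cannot express: users re-clicking a link when a server is slow, which amplifies load. All parameter values here are arbitrary assumptions, not measurements:

// Server capacity and demand per time step (arbitrary toy numbers).
const CAPACITY = 100;     // requests the server can absorb per tick
const NEW_PER_TICK = 90;  // fresh demand, just under capacity
const BURST = 500;        // a one-off spike at tick 0
const RETRY_PROB = 0.8;   // fraction of failed requests retried next tick

let pending = BURST;
for (let t = 0; t < 10; t++) {
  pending += NEW_PER_TICK;
  const served = Math.min(pending, CAPACITY);
  const failed = pending - served;
  pending = Math.round(failed * RETRY_PROB); // retries re-enter the queue
  console.log(`tick ${t}: served=${served} failed=${failed} retrying=${pending}`);
}

Even in this crude sketch the backlog drains far more slowly than the one-tick burst that caused it, because retries feed demand back into the queue; a static link-structure model has no vocabulary for such feedback.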
Additionally, modern, sophisticated Web sites provide powerful
user-interface functionality by running large script systems within the
browser. These applications access the
underlying remote data model through
Web APIs. This application architecture allows users and entrepreneurs
to quickly build many new forms of
global systems using the processing
power of users’ machines and the storage capacity of a mass of conventional
Web servers. Like the basic Web, each
such system is interesting mainly for
its emergent macro-scale properties,
of which we have little understanding.
Are such systems stable? Are they fair?
Do they effectively create a new form
of currency? And if they do, should it
be regulated?
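A minimal sketch of that application architecture, assuming a hypothetical JSON endpoint (the URL and response shape are invented for illustration):

// Script running in the browser fetches the underlying remote data
// model over a Web API and processes it locally on the user's machine.
interface Photo {
  id: string;
  owner: string;
  url: string;
}

async function loadPhotos(user: string): Promise<Photo[]> {
  const resp = await fetch(
    `https://api.example.org/users/${encodeURIComponent(user)}/photos`);
  if (!resp.ok) throw new Error(`API error: ${resp.status}`);
  return resp.json();
}

// Layout, filtering, and interaction happen client-side; the server is
// reduced to serving data, which is what lets such systems scale across
// a mass of conventional Web servers.
loadPhotos("alice").then(photos =>
  photos.forEach(p => console.log(`${p.owner}: ${p.url}`)));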
Similarly, many user-generated
content sites now store personal information yet have rather simplistic
systems to restrict access to a person’s
“friends.” This information is not available to wide-scale analysis. Other
sites that need it can access it only
by posing as the user or as a friend; a
number of three-party authentication
protocols are being deployed to allow
this. A complex system is thus being
built piece by piece, with no invariants
(such as “my employer will never see
this picture”) assured for the user.
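One widely deployed family of such three-party protocols follows the OAuth-style authorization-code pattern; the sketch below is schematic, with hypothetical endpoints and parameter names:

// Step 1: the consumer site redirects the user to the data-hosting site,
// requesting delegated access instead of asking for the user's password.
const authorizeUrl =
  "https://photos.example.org/oauth/authorize" +
  "?client_id=consumer-app" +
  "&redirect_uri=" + encodeURIComponent("https://consumer.example.com/cb") +
  "&scope=read_photos";

// Step 2: the user approves and is redirected back with a short-lived
// code. Step 3: the consumer exchanges the code for an access token,
// server to server, and can then act "as the user" within that scope.
async function exchangeCode(code: string): Promise<string> {
  const resp = await fetch("https://photos.example.org/oauth/token", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ grant_type: "authorization_code", code }),
  });
  const { access_token } = await resp.json();
  return access_token;
}

Note what such a protocol does not provide: any invariant about where the data flows afterward. Nothing in the handshake can express, let alone enforce, “my employer will never see this picture.”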
The purpose of this discussion is not
to go into the details of Web protocols
or the relative merits of Web-modeling
approaches but to stress that they are
critical to the current and continued
working of the Web. Understanding
the protocols and issues is important
to understanding the Web as a technical construct and to analyzing and
modeling its dynamic nature. Our ability to engineer Web systems with desirable properties at scale requires that
we understand these dynamics. This
analysis and modeling are thus an important challenge to computer scientists if they are to be able to understand