Optimization 5: Use compression and
delta encoding. Compression of HTML
and other text-based components can
reduce the amount of content traveling over the middle mile to one-tenth
of the original size. The use of delta
encoding, where a server sends only
the difference between a cached HTML
page and a dynamically generated version, can also greatly cut down on the
amount of content that must travel
over the long-haul Internet.
While these techniques are part of
the HTTP/1.1 specification, browser
support is unreliable. A highly distributed network that controls both endpoints of the middle mile, however, can employ compression and delta encoding successfully regardless of browser support. In this case, performance
is improved because very little data
travels over the middle mile. The edge
server then decompresses the content
or applies the delta encoding and delivers the complete, correct content to the
end user.
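To make the idea concrete, the following sketch uses zlib's preset-dictionary feature, available in Java through java.util.zip, to compress a freshly generated page against the cached copy so that little more than the difference travels over the middle mile. The class name and sample pages are invented for illustration; this is a conceptual sketch, not a description of Akamai's actual mechanism.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: a cached page serves as a shared compression dictionary,
// so the bytes sent across the middle mile encode little more than the delta.
public class DeltaCompressionSketch {

    // Compress the dynamic page using the cached page as a preset dictionary.
    static byte[] compressAgainstCache(byte[] dynamicPage, byte[] cachedPage) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(cachedPage);
        deflater.setInput(dynamicPage);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // The edge server holds the same cached page and reverses the process.
    static byte[] decompressAgainstCache(byte[] delta, byte[] cachedPage)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(delta);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsDictionary()) {
                inflater.setDictionary(cachedPage);
            } else {
                out.write(buf, 0, n);
            }
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] cached = "<html><body>Welcome back, <b>guest</b>.</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] dynamic = "<html><body>Welcome back, <b>Alice</b>.</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] delta = compressAgainstCache(dynamic, cached);
        byte[] restored = decompressAgainstCache(delta, cached);
        System.out.println(dynamic.length + " bytes shrink to " + delta.length
                + "; round trip ok: "
                + new String(restored, StandardCharsets.UTF_8)
                        .equals(new String(dynamic, StandardCharsets.UTF_8)));
    }
}
```

Because both endpoints belong to the same network, the edge server can be trusted to hold the cached copy needed to reverse the encoding before handing the full page to the browser.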
Optimization 6: Offload computations
to the edge. The ability to distribute applications to edge servers provides the
ultimate in application performance
and scalability. Akamai’s network enables distribution of J2EE applications
to edge servers that create virtual application instances on demand. As with edge page assembly, edge
computation enables complete origin
server offloading, resulting in tremendous scalability and extremely low application latency for the end user.
While not every type of application
is an ideal candidate for edge computation, large classes of popular applications—such as contests, product catalogs, store locators, surveys, product
configurators, games, and the like—
are well suited to it.
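As a rough illustration of why such applications offload well, the following self-contained sketch answers store-locator queries entirely from data held alongside the handler, so no request ever reaches the origin. It uses the JDK's built-in HttpServer rather than a full J2EE deployment, and the class name, port, query convention, and store data are all invented for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical edge-resident store locator: the data needed to answer a query
// is replicated alongside the handler, so lookups never touch the origin.
public class EdgeStoreLocator {

    private static final Map<String, String> STORES = Map.of(
            "02139", "Cambridge, MA - 1 Main St.",
            "94105", "San Francisco, CA - 50 Market St.");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/locate", exchange -> {
            // Expect a query such as /locate?zip=02139 (convention invented here).
            String query = exchange.getRequestURI().getQuery();
            String zip = (query != null && query.startsWith("zip="))
                    ? query.substring(4) : "";
            String body = STORES.getOrDefault(zip, "No store found near " + zip);
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();   // every lookup is answered entirely at the edge
    }
}
```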
Putting It All Together
Many of these techniques require a
highly distributed network. Route optimization, as mentioned, depends on
the availability of a vast overlay network that includes machines on many
different networks. Other optimizations such as prefetching and page assembly are most effective if the delivering server is near the end user. Finally,
many transport and application-layer
optimizations require bi-nodal connections within the network (that is, you
control both endpoints). To maximize
the effect of this optimized connection, the endpoints should be as close
as possible to the origin server and the
end user.
Note also that these optimizations
work in synergy. TCP overhead is in
large part a result of a conservative approach that guarantees reliability in
the face of unknown network conditions. Because route optimization gives
us high-performance, congestion-free
paths, it allows for a much more aggressive and efficient approach to
transport-layer optimizations.
Highly Distributed Network Design
As noted earlier,
building and managing a robust, highly
distributed network is not trivial. At Akamai, we sought to build a system with
extremely high reliability—no downtime, ever—and yet scalable enough
to be managed by a relatively small
operations staff, despite operating in
a highly heterogeneous and unreliable
environment. Here are some insights
into the design methodology.
The fundamental assumption behind Akamai’s design philosophy is
that a significant number of component or other failures are occurring at
all times in the network. Internet systems present numerous failure modes,
such as machine failure, data-center
failure, connectivity failure, software
failure, and network failure—all occurring with greater frequency than
one might think. As mentioned earlier,
for example, there are many causes of
large-scale network outages—
including peering problems, transoceanic
cable cuts, and major virus attacks.
Designing a scalable system that
works under these conditions means
embracing the failures as natural and
expected events. The network should
continue to work seamlessly despite
these occurrences. We have identified
some practical design principles that
result from this philosophy, which we
share here.
Principle 1: Ensure significant redundancy in all systems to facilitate failover.
Although this may seem obvious and
simple in theory, it can be challenging
in practice. Having a highly distributed
network enables a great deal of redundancy, with multiple backup possibilities ready to take over if a component
fails. To ensure robustness of all systems, however, you will likely need to
work around the constraints of existing
protocols and interactions with third-party software, as well as balance
trade-offs involving cost.
For example, the Akamai network
relies heavily on DNS (Domain Name
System), which has some built-in constraints that affect reliability. One example is DNS’s restriction on the size
of responses, which limits the number
of IP addresses that we can return to a
relatively static set of 13. Because the generic top-level domain servers that supply the critical answers to akamai.net queries required greater reliability, we took several steps, including the use of IP Anycast.
We also designed our system to take
into account DNS’s use of TTLs (time
to live) to fix resolutions for a period
of time. Though the efficiency gained
through TTL use is important, we need
to make sure users aren’t being sent
to servers based on stale data. Our approach is to use a two-tier DNS, employing longer TTLs at the global level and shorter TTLs at the local level, which softens the trade-off between DNS efficiency and responsiveness to changing conditions. In addition, we have built
in appropriate failover mechanisms at
each level.
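The following sketch illustrates the two-tier idea in miniature: a long-lived answer maps a hostname to a nearby cluster, and a short-lived answer maps that cluster to the servers that are healthy right now. The class names, TTL values, and addresses are invented for illustration and do not reflect Akamai's implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Conceptual sketch of two-tier DNS answers with different TTLs: a long-lived
// global answer coexists with a short-lived local one, so stale data expires
// quickly where it matters most.
public class TwoTierDnsSketch {

    private static final class Answer {
        final String value;
        final Instant expires;
        Answer(String value, Duration ttl) {
            this.value = value;
            this.expires = Instant.now().plus(ttl);
        }
    }

    private final Map<String, Answer> cache = new HashMap<>();

    // Return a cached answer while it is fresh; otherwise re-resolve.
    private String lookup(String key, Duration ttl, Supplier<String> resolve) {
        Answer a = cache.get(key);
        if (a == null || Instant.now().isAfter(a.expires)) {
            a = new Answer(resolve.get(), ttl);
            cache.put(key, a);
        }
        return a.value;
    }

    // Global tier: map the hostname to a nearby server cluster. This mapping
    // changes rarely, so a long TTL keeps DNS traffic low.
    String resolveGlobal(String hostname) {
        return lookup("global:" + hostname, Duration.ofMinutes(30),
                () -> "cluster-42.g.example.net");
    }

    // Local tier: map the cluster to currently healthy servers. A short TTL
    // means clients stop using a failed server within seconds.
    String resolveLocal(String cluster) {
        return lookup("local:" + cluster, Duration.ofSeconds(20),
                () -> "192.0.2.10, 192.0.2.11");
    }

    public static void main(String[] args) {
        TwoTierDnsSketch dns = new TwoTierDnsSketch();
        String cluster = dns.resolveGlobal("www.example.com");
        System.out.println("serve " + cluster + " from " + dns.resolveLocal(cluster));
    }
}
```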
Principle 2: Use software logic to provide message reliability. This design
principle speaks directly to scalability.
Rather than building dedicated links
between data centers, we use the public Internet to distribute data—
including control messages, configurations,
monitoring information, and customer content—throughout our network.
We improve on the performance of
existing Internet protocols—for example, by using multirouting and limited
retransmissions with UDP (User Datagram Protocol) to achieve reliability
without sacrificing latency. We also use
software to route data through intermediary servers to ensure communications (as described in Optimization 2),
even when major disruptions (such as
cable cuts) occur.
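The following sketch shows the flavor of this approach: a control message is sent over UDP along several candidate paths at once, with a bounded number of retransmissions while waiting for an acknowledgment. The addresses, timeout, and acknowledgment convention are invented for illustration and are not Akamai's protocol.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch: push the same message down several candidate paths and
// retransmit a bounded number of times, so a lost packet costs bounded latency.
public class ReliableUdpSketch {

    static boolean sendWithRetries(byte[] message, List<InetSocketAddress> paths,
                                   int maxRetries, int timeoutMillis) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMillis);
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                // Multirouting: send the same message along every candidate path.
                for (InetSocketAddress path : paths) {
                    socket.send(new DatagramPacket(message, message.length, path));
                }
                try {
                    byte[] buf = new byte[16];
                    DatagramPacket ack = new DatagramPacket(buf, buf.length);
                    socket.receive(ack);            // any path's ACK is enough
                    return true;
                } catch (SocketTimeoutException e) {
                    // Limited retransmission: try again, at most maxRetries times,
                    // rather than stalling the way a conservative sender would.
                }
            }
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] msg = "config-update-v7".getBytes(StandardCharsets.UTF_8);
        List<InetSocketAddress> paths = List.of(
                new InetSocketAddress("198.51.100.1", 9000),   // example intermediaries
                new InetSocketAddress("203.0.113.2", 9000));
        System.out.println("delivered: " + sendWithRetries(msg, paths, 3, 250));
    }
}
```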
Principle 3: Use distributed control for
coordination. Again, this principle is
important for both fault tolerance and
scalability. One practical example is the
use of leader election, where leadership
evaluation can depend on many factors