including machine status, connectivity
to other machines in the network, and
monitoring capabilities. When connectivity of a local lead server degrades, for
example, a new server is automatically
elected to assume the role of leader.
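
The article does not spell out the election mechanism, but a minimal sketch of a health-driven local leader election might look like the following, assuming each machine periodically exchanges status reports with its peers. The MachineStatus fields, the connectivity threshold, and the tie-breaking rule are illustrative assumptions, not Akamai's actual protocol.

    from __future__ import annotations
    from dataclasses import dataclass
    from typing import List, Optional

    CONNECTIVITY_FLOOR = 0.8  # assumed threshold below which a leader counts as "degraded"

    @dataclass
    class MachineStatus:
        machine_id: str
        alive: bool
        connectivity: float  # fraction of peers that can currently reach this machine

    def elect_leader(statuses: List[MachineStatus]) -> Optional[str]:
        # Pick the healthiest live machine; break ties by lowest machine_id so
        # every peer computes the same answer from the same status reports.
        live = [s for s in statuses if s.alive]
        if not live:
            return None
        return min(live, key=lambda s: (-s.connectivity, s.machine_id)).machine_id

    def maybe_reelect(current_leader: str, statuses: List[MachineStatus]) -> Optional[str]:
        # Keep the current leader unless it is down or its connectivity has degraded.
        by_id = {s.machine_id: s for s in statuses}
        leader = by_id.get(current_leader)
        if leader and leader.alive and leader.connectivity >= CONNECTIVITY_FLOOR:
            return current_leader
        return elect_leader(statuses)

Because the decision is a pure function of the shared status data, any peer can recompute it and arrive at the same leader without central coordination.
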
Principle 4: Fail cleanly and restart.
Based on the previous principles, the
network has already been architected
to handle server failures quickly and
seamlessly, so we are able to take a
more aggressive approach to failing
problematic servers and restarting
them from a last known good state.
This sharply reduces the risk of operating in a potentially corrupted state. If
a given machine continues to require
restarting, we simply put it into a “long
sleep” mode to minimize impact to the
overall network.
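
As a rough illustration of this restart-or-quarantine policy, the sketch below tracks recent restarts per machine and escalates to "long sleep" after repeated failures. The hooks, threshold, and time window are hypothetical placeholders rather than Akamai's actual mechanism.

    import time

    MAX_RESTARTS = 3            # assumed: restarts allowed before "long sleep"
    RESTART_WINDOW_SECS = 3600  # assumed: window over which restarts are counted

    class RestartSupervisor:
        def __init__(self):
            self._restarts = {}  # machine_id -> timestamps of recent restarts

        def on_fault(self, machine_id, restart, long_sleep, now=None):
            # Restart a faulty machine from its last known good state, or take it
            # out of service ("long sleep") if it keeps failing within the window.
            now = time.time() if now is None else now
            recent = [t for t in self._restarts.get(machine_id, ())
                      if now - t < RESTART_WINDOW_SECS]
            if len(recent) >= MAX_RESTARTS:
                long_sleep(machine_id)   # the network absorbs the lost capacity
            else:
                recent.append(now)
                self._restarts[machine_id] = recent
                restart(machine_id)      # reboot from the last known good state
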
Principle 5: Phase software releases.
After passing the quality assurance (QA)
process, software is released to the live
network in phases. It is first deployed
to a single machine. Then, after performing the appropriate checks, it is
deployed to a single region, then possibly to additional subsets of the network, and finally to the entire network.
The nature of the release dictates how
many phases and how long each one
lasts. The previous principles, particularly use of redundancy, distributed
control, and aggressive restarts, make
it possible to deploy software releases
frequently and safely using this phased
approach.
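
A minimal sketch of such a phased release plan follows; the phase names, target fractions, and check hooks are assumed for illustration, since the article does not specify how many phases a given release uses or how each is validated.

    PHASES = [
        ("single machine", 0.0001),   # assumed target fractions of the network
        ("single region",  0.01),
        ("network subset", 0.25),
        ("entire network", 1.0),
    ]

    def run_phased_release(deploy_to_fraction, checks_pass, abort):
        # Advance through the phases only while post-deployment checks keep passing.
        for name, fraction in PHASES:
            deploy_to_fraction(fraction)
            if not checks_pass(name):
                abort(name)    # halt (and roll back) the release at this phase
                return False
        return True
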
Principle 6: Notice and proactively
quarantine faults. The ability to isolate
faults, particularly in a recovery-oriented computing system, is perhaps one
of the most challenging problems and
an area of important ongoing research.
Here is one example. Consider a hypothetical situation where requests for
a certain piece of content with a rare
set of configuration parameters trigger a latent bug. Automatically failing
the servers affected is not enough, as
requests for this content will then be
directed to other machines, spreading
the problem. To solve this problem,
our caching algorithms constrain each
set of content to certain servers so as
to limit the spread of fatal requests. In
general, no single customer’s content
footprint should dominate any other
customer’s footprint among available
servers. These constraints are dynamically determined based on current levels of demand for the content, while
keeping the network safe.
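
One way to realize this kind of constraint is rendezvous (highest-random-weight) hashing over a demand-sized, capped subset of a region's servers. The sketch below is only an illustration of the idea, not Akamai's caching algorithm; the sizing rule and cap are assumptions.

    import hashlib

    def assign_servers(content_id, servers, demand_fraction, max_fraction=0.5):
        # Return the subset of a region's servers allowed to cache this content.
        # demand_fraction: current share of regional demand for the content (0..1).
        # max_fraction:    cap so no single item or customer spans too many servers.
        n = len(servers)
        k = max(1, min(n, int(n * min(demand_fraction, max_fraction))))
        # Rendezvous hashing is deterministic per content item, so every front
        # end computes the same constrained server set without coordination.
        def weight(server):
            digest = hashlib.sha256(f"{content_id}:{server}".encode()).hexdigest()
            return int(digest, 16)
        return sorted(servers, key=weight, reverse=True)[:k]

Requests for a given item are then directed only to its assigned subset, so a fatal request can take down at most those servers instead of cascading across the region.
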
Practical Results and Benefits
Besides the inherent fault-tolerance
benefits, a system designed around
these principles offers numerous other
benefits.
Faster software rollouts. Because the
network absorbs machine and regional
failures without impact, Akamai is able
to safely but aggressively roll out new
software using the phased rollout approach. As a benchmark, we have historically implemented approximately
22 software releases and 1,000 customer configuration releases per month to
our worldwide network, without disrupting our always-on services.
Minimal operations overhead. A
large, highly distributed, Internet-based network can be very difficult to
maintain, given its sheer size, number
of network partners, heterogeneous
nature, and diversity of geographies,
time zones, and languages. Because
the Akamai network design is based
on the assumption that components
will fail, however, our operations team
does not need to be concerned about
most failures. In addition, the team
can aggressively suspend machines or
data centers if it sees any slightly worrisome behavior. There is no need to
rush to get components back online
right away, as the network absorbs the
component failures without impact to
overall service.
This means that at any given time, it
takes only eight to 12 operations staff
members, on average, to manage our
network of approximately 40,000 devices (consisting of more than 35,000 servers plus switches and other networking
hardware). Even at peak times, we successfully manage this global, highly
distributed network with fewer than 20
staff members.
Lower costs, easier to scale. In addition to the minimal operational staff
needed to manage such a large network, this design philosophy has had
several implications that have led to
reduced costs and improved scalability. For example, we use commodity
hardware instead of more expensive,
more reliable servers. We deploy in
third-party data centers instead of having our own. We use the public Internet
instead of having dedicated links. We
deploy in greater numbers of smaller
regions—many of which host our servers for free—rather than in fewer, larger, more “reliable” data centers where
congestion can be greatest.
Conclusion
Even though we’ve seen dramatic advances in the ubiquity and usefulness
of the Internet over the past decade,
the real growth in bandwidth-intensive
Web content, rich media, and Web- and IP-based applications is just beginning. The challenges presented by this
growth are many: as businesses move
more of their critical functions online, and as consumer entertainment
(games, movies, sports) shifts to the
Internet from other broadcast media,
the stresses placed on the Internet’s
middle mile will become increasingly
apparent and detrimental. As such, we
believe the issues raised in this article
and the benefits of a highly distributed
approach to content delivery will only
grow in importance as we collectively
work to enable the Internet to scale to
the requirements of the next generation of users.
Tom Leighton co-founded Akamai Technologies in August 1998. Serving as chief scientist and as a director on the board, he is Akamai's technology visionary, as well as a key member of the executive committee setting the company's direction. He is an authority on algorithms for network applications. Leighton is a fellow of the American Academy of Arts and Sciences, the National Academy of Sciences, and the National Academy of Engineering.
A previous version of this article appeared in the October 2008 issue of ACM Queue magazine.