Web-scale computing means running hundreds
of thousands of servers. It requires a fundamentally
different approach from smaller environments.
Consistent server hardware and datacenter plans,
as well as consistent and simple configurations, are
essential. Everything is designed to expect and embrace
failure without human intervention. Operations
must be largely autonomous. Software must assume
crashes. Development, test, and deployment must have
integrated and automated solutions.
This article offers a brief survey of techniques
that are well known to those working at Web
scale yet frequently surprising to others.
Web-scale infrastructure implies
lots of servers working together—
often tens or hundreds of thousands
of servers all working toward the
same goal. How can the complexity
of these environments be managed?
How can commonality and simplicity be introduced?
A lot of effort goes into achieving
uniform and fungible hardware resources in Web-scale datacenters. If
you can have just one kind of server
with the same CPU, same DRAM, same
storage, and the same network capacity, then any server is as good as the next
server. When all the servers are the
same, there is a single pool of spares
and a single resource to allocate.
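
To make the "single pool" idea concrete, here is a minimal sketch in Python. It is an illustration only, not something taken from a real datacenter system: the class name `UniformPool`, its methods, and the server and job names are all hypothetical. It assumes every server is identical, so allocation never has to match a job against hardware characteristics, and the free pool doubles as the spare pool.

```python
from collections import deque


class UniformPool:
    """A single pool of interchangeable servers (hypothetical illustration)."""

    def __init__(self, server_ids):
        self.free = deque(server_ids)  # any free server is as good as the next
        self.in_use = {}               # server id -> job name

    def allocate(self, job):
        # No matching on CPU, DRAM, storage, or network: take the next free server.
        if not self.free:
            raise RuntimeError("pool exhausted")
        server = self.free.popleft()
        self.in_use[server] = job
        return server

    def replace_failed(self, failed):
        # The free pool is also the spare pool: move the job to any free server.
        job = self.in_use.pop(failed)
        return self.allocate(job)


pool = UniformPool(f"srv-{i:03d}" for i in range(5))
first = pool.allocate("index-shard-0")      # hypothetical job name
replacement = pool.replace_failed(first)    # the failed server is simply dropped
```

With several kinds of hardware, the same code would need one pool (and one stock of spares) per kind, plus logic to decide which kind a job may use. Uniformity removes that bookkeeping.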
You want to treat your hardware
Embrace failure so
it does not embrace you.
BY PAT HELLAND, SIMON WEAVER, AND ED HARRIS