People don’t scale—dermatological
issues notwithstanding. Say your operations staff can manage 100 servers per
person. This is very typical in a DevOps
environment. Each server needs attention to validate state, handle failures,
and perform repairs. Frequently, the
servers come in many flavors and you
want to optimize the hardware for each
server. Does everyone understand all
the server types? What are their different operational tasks?
Some 50,000 servers at 100 servers
per person require 500 operators. This
gets out of hand very quickly as you scale.
Zen and the art of datacenter maintenance. To support Web-scale datacenters, we have had to evolve from
Ops to DevOps to NoOps, as detailed
in Figure 2. Historically, with manual
operations, people not only decided
what to do, but also did all the actual
work required to operate the servers.
DevOps is a huge step forward as it automates the grunt work of operations.
Still, this is not adequate when scaling
to tens of thousands of servers. In a
NoOps or autonomously managed system, the workflow and control over the
operational tasks are also automated.
Software at Web Scale
Software must embrace failures with
pools of stateless servers and special
storage servers designed to cope with
the loss of replicas.
Stateless servers and whack-a-mole.
Stateless servers accept requests, may
read data from other servers, may write
data to other servers, and then return
results. When a stateless server dies, its
work must be restarted. Stateless servers are designed to fail.
Stateless servers must be idempotent. Restarting the work must still give
the correct answer. Learning about
idempotent behavior is an essential
part of large-scale systems. Idempotence is not that hard!
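As a minimal sketch (the account names, request ids, and retry scenario are illustrative, not from the article), one common way to get idempotent behavior is to remember which requests have already been applied, so replaying restarted work gives the same answer:

```python
# Hypothetical sketch: a plain "increment" is not idempotent, because
# replaying it after a failure double-counts. Deduplicating by a unique
# request id makes the retry harmless.

balance = {"alice": 100}
applied_requests = set()  # ids of requests already applied

def deposit(request_id, account, amount):
    """Idempotent deposit: replaying the same request is a no-op."""
    if request_id in applied_requests:
        return balance[account]      # already applied; same answer
    applied_requests.add(request_id)
    balance[account] += amount
    return balance[account]

# A stateless server dies mid-request, so its work is restarted:
deposit("req-42", "alice", 25)       # first attempt
deposit("req-42", "alice", 25)       # retry after restart: no double-count
assert balance["alice"] == 125
```

A bare `balance[account] += amount`, with no request-id check, would credit the account twice when the work is restarted.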
Frequently, stateless servers run as
a pool of servers with the number increasing or decreasing as the demand
fluctuates. When a server fails, demand rises on its siblings, which likely causes a replacement to pop back to life. Just like the arcade game whack-a-mole, as soon as you hit one, another pops up.
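A toy sketch of that behavior (the capacity figure and the `Pool` class are hypothetical, for illustration only): the pool reconciles toward a demand-driven target size, so a failed server is simply replaced by the next reconciliation pass.

```python
import math

# Hypothetical sketch of a demand-driven pool of stateless servers.
# The per-server capacity is a made-up number for illustration.
REQS_PER_SERVER = 500

def target_pool_size(reqs_per_sec):
    """Servers needed to cover the current demand."""
    return math.ceil(reqs_per_sec / REQS_PER_SERVER)

class Pool:
    def __init__(self):
        self.next_id = 0
        self.servers = set()

    def reconcile(self, reqs_per_sec):
        """Grow or shrink toward the demand-driven target."""
        target = target_pool_size(reqs_per_sec)
        while len(self.servers) < target:
            self.servers.add(self.next_id)
            self.next_id += 1
        while len(self.servers) > target:
            self.servers.pop()

    def fail(self, server_id, reqs_per_sec):
        """A server dies; reconciliation pops a replacement up."""
        self.servers.discard(server_id)
        self.reconcile(reqs_per_sec)

pool = Pool()
pool.reconcile(2400)             # ceil(2400 / 500) = 5 servers
assert len(pool.servers) == 5
pool.fail(0, 2400)               # whack one mole...
assert len(pool.servers) == 5    # ...another pops up
```

Real systems drive the same loop from load metrics rather than a single request rate, but the shape is the same: state a target, then converge on it.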
To each according to its need. Concurrent requests for a stateless service may each want hardware tuned to the work at hand. As the variety grows, though, the individual details of the various types of servers become overwhelming. The aggregate system, with its large number of server types, is full of detail and complexity.
Attack of the clones. In a typical
datacenter, you pick a standard server
configuration and insist everyone use
the same type. Just like Henry Ford’s
Model T, you can have any color you
like as long as it’s black.
With one SKU (stock keeping unit) to
order, you gain huge leverage with ven-
dors in buying servers. In addition, there
is a single pool of spares for that SKU.
Now, there are some exceptions. Each company is phasing in a new server type while phasing out an old one. Also, it's common to have a few special server types, such as one for compute loads and one for storage.
Still, tightly controlling the variety
of server types is essential.
The short life of hardware in the
datacenter. Messing with stuff in the
datacenter causes problems. Upgrading servers can cause inconsistencies.
Just don’t do it! When repairing, you really only want to replace the server with
an identical spare and then repave its
software. Maybe the broken server can
be fixed and become a spare.
Servers and other gear provide less
value over time. New servers offer more
computation and storage for the same
form factor and the same electricity. The value received for the electrical cost diminishes.
Datacenter hardware is typically
decommissioned and discarded (or
returned to its lessor) after three years.
It’s just not worth keeping.
Datacenter servers and roast beef are
worth a lot less after a few years.
That means there is a lot of pressure to place new servers into production quickly. Say it takes two months
to commission, activate, and load data
into new servers. In addition, it may
take one month to decommission the
servers and get the data out of them.
That is three months out of a total three-year lifetime not productively used, more than 8% of the server's life. The life cycle of servers is a big deal.
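Using the article's own figures, the unproductive fraction is easy to work out (a trivial sketch; only the numbers stated above are used):

```python
# Worked numbers from the text: two months to commission, activate,
# and load data; one month to decommission; a three-year life.
lifetime_months = 36
commission_months = 2
decommission_months = 1

overhead = commission_months + decommission_months
productive = lifetime_months - overhead
print(f"{overhead} of {lifetime_months} months lost "
      f"({overhead / lifetime_months:.1%} of the server's life)")
# 3 of 36 months lost (8.3% of the server's life)
```

Shaving even a week off commissioning is therefore worth real money across tens of thousands of servers.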
Operations at Web Scale
Operations at Web scale are very different from operations at smaller scale. It's not practical to be hands-on. This leads to autonomous datacenter management.