need assignment to servers. Load balancing requests across a pool of servers is much like spraying work. When requests to servers are taking longer than hoped, it is likely because work is queued up waiting for the individual servers. If you spray the work across more servers, the per-server queue is shorter, and responses come back faster.
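To make the effect concrete, here is a minimal sketch (not from the article) that models each server as an independent M/M/1 queue; the arrival rate, service rate, and pool sizes are assumed numbers chosen only for illustration.

# Minimal sketch: spraying a fixed request load across more servers shortens
# each server's queue and so lowers the mean response time.
ARRIVAL_RATE = 900.0   # total requests per second (assumed)
SERVICE_RATE = 100.0   # requests per second one server can handle (assumed)

def mean_response_time(num_servers):
    """Mean response time when the load is sprayed evenly across num_servers."""
    per_server_load = ARRIVAL_RATE / num_servers
    if per_server_load >= SERVICE_RATE:
        return float("inf")                        # queue grows without bound
    return 1.0 / (SERVICE_RATE - per_server_load)  # M/M/1: W = 1 / (mu - lambda)

for n in (10, 12, 15, 20, 40):
    print(f"{n:3d} servers -> {mean_response_time(n) * 1000:6.1f} ms per request")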
Adding servers to a server pool usually lowers the response time for requests. Removing servers reclaims resources. Given a response-time goal, an automated robot can grow and shrink the pool to meet it, as sketched below.
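A hedged sketch of such a robot: a simple control loop that widens the pool when measured latency misses the goal and shrinks it when there is comfortable headroom. The measure_p99_latency_ms and resize_pool hooks, the 200 ms goal, and the headroom factor are all assumptions, stand-ins for whatever monitoring and provisioning a real deployment exposes.

GOAL_MS = 200.0   # response-time goal (assumed)
HEADROOM = 0.8    # shrink only when comfortably under the goal (assumed)

def autoscale_step(pool_size, measure_p99_latency_ms, resize_pool):
    """Run one control-loop step; return the new pool size."""
    latency = measure_p99_latency_ms()
    if latency > GOAL_MS:
        pool_size += 1              # too slow: add a server to shorten queues
    elif latency < GOAL_MS * HEADROOM and pool_size > 1:
        pool_size -= 1              # well under goal: reclaim a server
    resize_pool(pool_size)
    return pool_size

# Example with fake hooks: latency of 250 ms misses the goal, so the pool grows.
print(autoscale_step(10, lambda: 250.0, lambda n: None))   # -> 11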
Avoiding memory loss from traumatic server injury. Most distributed
storage systems keep each piece of
data on three separate servers in three
separate racks in a datacenter. The
choice of three replicas is a function of
the durability goals, the MTBF (mean
time between failures), and the MTTR
(mean time to repair).
Availability = MTBF / (MTBF + MTTR)
Since we assume one in five servers fails every year, our MTBF is relatively short. To improve availability, we need either a longer MTBF or a shorter MTTR. By shortening our MTTR, we can dramatically improve our availability and data durability.
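As a worked example (the numbers are assumed, not from the article), the formula above gives the availability for the one-in-five-per-year failure rate and a few candidate repair times:

# Availability = MTBF / (MTBF + MTTR), with MTBF = 5 years (one in five servers
# fails per year) and several assumed repair times.
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 5 * HOURS_PER_YEAR                  # about 43,800 hours
for mttr in (24.0, 1.0, 1.0 / 60):         # a day, an hour, a minute to repair
    print(f"MTTR {mttr:8.4f} h -> availability {availability(mtbf, mttr):.7f}")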
Assume the data contained in each server is cut into pieces and the pieces have their additional replicas on many different servers, as shown in Figure 3. For example, the data on server-A is cut into 100 pieces. Each of those pieces has its secondary and tertiary replicas on different servers, perhaps as many as 200 servers in addition to server-A. If server-A fails, the roughly 100 servers holding its secondary replicas each copy their piece to a new home, potentially on yet another 100 servers. In the limit, this parallelism can reduce the MTTR by a factor of 100.
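A back-of-envelope sketch of that reduction, with assumed sizes and copy rates, might look like this:

# Assumed numbers: the failed server held 2 TB of data, and each participating
# server can re-replicate at roughly 0.1 GB/s without hurting foreground work.
DATA_GB = 2000.0        # data on the failed server (assumed)
COPY_RATE_GBPS = 0.1    # effective copy rate per source server (assumed)
PIECES = 100            # pieces per server, each repaired from a different source

serial_hours = DATA_GB / COPY_RATE_GBPS / 3600                  # one server copies everything
parallel_hours = (DATA_GB / PIECES) / COPY_RATE_GBPS / 3600     # 100 servers copy in parallel

print(f"serial repair:   {serial_hours:5.2f} hours")        # ~5.6 hours
print(f"parallel repair: {parallel_hours * 60:5.1f} minutes")  # ~3.3 minutes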
Notice in Figure 3 that each slot of data in server-B (S9, S2, S4, S8, and S5) has two other replicas on different servers. If server-B fails, each of these slots gets a replacement replica on a new third server. The placement of that new replica preserves the guarantee that each replica lives on a separate server.
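A minimal sketch of that placement rule follows (assumptions, not GFS or HDFS code); the slot-to-server map is hypothetical beyond server-B's slots named above.

import random

# When a server fails, pick a new home for each slot it held, choosing only
# servers that do not already hold a replica of that slot.
def re_replicate(failed_server, placement, live_servers):
    """placement maps slot -> set of servers holding it; returns slot -> new server."""
    moves = {}
    for slot, holders in placement.items():
        if failed_server not in holders:
            continue
        holders.discard(failed_server)
        candidates = live_servers - holders     # keep every replica on a separate server
        new_home = random.choice(sorted(candidates))
        holders.add(new_home)
        moves[slot] = new_home
    return moves

# Hypothetical placements for server-B's slots (S9, S2, S4, S8, S5).
placement = {
    "S9": {"server-B", "server-C", "server-E"},
    "S2": {"server-B", "server-D", "server-E"},
    "S4": {"server-B", "server-C", "server-F"},
    "S8": {"server-B", "server-C", "server-D"},
    "S5": {"server-B", "server-D", "server-F"},
}
live = {"server-A", "server-C", "server-D", "server-E", "server-F"}
print(re_replicate("server-B", placement, live))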
This approach is tried and true with GFS (Google File System),2 HDFS (Hadoop Distributed File System),3
Figure 2. Ops, DevOps, and NoOps.

Tasks                              Manual         Automated      Autonomous
                                   Operations     Operations     Operations
                                   "Ops"          "DevOps"       "NoOps"
Who sets the goals?                Human          Human          Human
Who decides when to start?         Human          Human          Machine
Who adjudicates priorities?        Human          Human          Machine
Who does the work?                 Human          Machine        Machine
Who generates validation data?     Human          Machine        Machine
Who interprets validation data?    Human          Human          Machine
Who handles failures?              Human          Human          Machine
Who handles exceptions?            Human          Human          Human
Figure 3. Data replication. [Figure: slots S1 through S10, each replicated on three of servers A through F; a second panel shows the pool after server-B fails, with its slots (S9, S2, S4, S8, S5) re-replicated onto the remaining servers.]
Figure 1. Typical server failure rate.

Assume 4 SATA disks per server.
Assume a 4% disk failure rate per year.
4 disks × 4% per disk means 16% of the servers fail from disk.
Assume 4% miscellaneous failures (e.g., power supplies, etc.).
20% of servers fail each year.
1 in 1825 servers fails each day (1825 = 5 × 365).
With 50,000 servers, expect about 27 servers to fail each day.