relative to processor cycles, exposing
the network stack as a critical latency
bottleneck [22].
This is, in part, the result
of a user-kernel context switch in the
TCP/IP/Ethernet stack—and possibly
additional work to copy the message
from the application buffer into the
kernel buffer and back again at the receiver. A two-pronged hardware/software approach tackled this latency penalty: OS bypass and zero copy, both of
which are aimed at eliminating the user-kernel switch for every message and
avoiding a redundant memory copy by
allowing the network transport to grab
the message payload directly from the
user application buffers.
To ameliorate the performance
impact of a user/kernel switch, OS bypass can be used to deposit a message
directly into a user-application buffer.
The application participates in the
messaging protocol by spin-waiting
on a doorbell memory location. Upon
arrival, the NIC deposits the message
contents in the user-application buffer, and then “rings” the doorbell to
indicate message arrival by writing the
offset into the buffer where the new
message can be found. When the user
thread detects the updated value, the
incoming message is processed entirely from user space.
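The doorbell handshake described above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the class `DoorbellBuffer`, the sentinel `NO_MESSAGE`, and the use of a timer thread to stand in for the NIC are all hypothetical, and a real NIC would write the buffer by DMA with explicit memory-ordering guarantees rather than through Python attributes.

```python
import threading

NO_MESSAGE = -1  # assumed sentinel: doorbell value meaning "nothing new"


class DoorbellBuffer:
    """Toy model of a user-application receive buffer plus a doorbell word."""

    def __init__(self, size=4096):
        self.buffer = bytearray(size)  # stands in for the user-application buffer
        self.doorbell = NO_MESSAGE     # stands in for the doorbell memory location
        self.length = 0                # length of the most recent message

    def nic_deposit(self, offset, payload):
        """Simulated NIC: deposit the payload, then ring the doorbell."""
        self.buffer[offset:offset + len(payload)] = payload
        self.length = len(payload)
        # Ring last, so the payload is in place before the offset becomes visible.
        # Real hardware needs a memory barrier here; this model relies on the
        # interpreter executing these statements in order.
        self.doorbell = offset

    def spin_wait(self):
        """User thread: poll the doorbell until it changes, then read the message."""
        while self.doorbell == NO_MESSAGE:
            pass  # busy-wait; no system call is made on this path
        offset = self.doorbell
        self.doorbell = NO_MESSAGE  # re-arm the doorbell for the next message
        return bytes(self.buffer[offset:offset + self.length])


def demo():
    """Deliver one message from a simulated NIC thread to a spinning receiver."""
    rx = DoorbellBuffer()
    nic = threading.Timer(0.01, rx.nic_deposit, args=(128, b"hello"))
    nic.start()
    message = rx.spin_wait()
    nic.join()
    return message
```

Calling `demo()` returns `b"hello"`. The receiver burns a core while it waits, which is the price of keeping the kernel off the receive path.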
Zero-copy message-passing protocols avoid this additional memory copy
from user to kernel space, and vice versa at the recipient. An interrupt signals
the arrival of a message, and an interrupt handler services the new message
and returns control to the user application. The interrupt latency—the time
from when the interrupt is raised until
control is handed to the interrupt handler—can be significant, especially if
interrupt coalescing is used to amortize the latency penalty across multiple
interrupts. Unfortunately, while interrupt coalescing improves message
efficiency (that is, increased effective
bandwidth), it does so at the cost of
both increased message latency and latency variance.
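The coalescing trade-off can be made concrete with a small model. The sketch below is a simplification, not a description of any particular NIC: it assumes messages arrive at a fixed spacing, that the NIC raises one interrupt after every `batch` messages (no timeout), and that each interrupt adds a fixed latency. The function name and parameters are hypothetical.

```python
from statistics import mean, pvariance


def coalescing_model(batch, gap_us, irq_us):
    """Per-message cost when one interrupt is raised per `batch` messages.

    batch  -- messages coalesced into one interrupt (assumed count-based)
    gap_us -- assumed fixed spacing between message arrivals, in microseconds
    irq_us -- assumed interrupt latency, in microseconds
    """
    # Message i (0-based) waits for the remaining batch-1-i arrivals,
    # then pays the interrupt latency once the interrupt is finally raised.
    latencies = [(batch - 1 - i) * gap_us + irq_us for i in range(batch)]
    return {
        "interrupts_per_message": 1 / batch,
        "mean_latency_us": mean(latencies),
        "latency_variance_us2": pvariance(latencies),
    }


no_coalescing = coalescing_model(batch=1, gap_us=10, irq_us=5)
coalesced = coalescing_model(batch=4, gap_us=10, irq_us=5)
```

With these illustrative numbers, coalescing four messages cuts the interrupts per message from 1 to 0.25, but the mean latency rises from 5 to 20 microseconds and the latency variance rises from 0 to 125 microseconds squared, since the first message in each batch waits far longer than the last.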
Figure 3. Example packet routing through a switch chip. [Diagram labels: incoming packet with header (H), payload, and footer (F); dest. address, routing lookup table, egress port; input ports, crossbar, output ports; output port (to next hop).]
Figure 4. Throughput (accepted bandwidth) as load varies. [Plot of throughput (bits/s) against offered load (bits/s); labels: α, post-saturation instability.]
Scalable, Manageable, and Flexible
In general, cloud computing requires two types of services: user-facing computation (for example, serving Web pages) and inward computation (for example, indexing, search, and map/reduce). Outward-facing functionality can sometimes be done at the “border” of the Internet where commonly requested pages are cached and serviced by edge servers, while inward computation is generally carried out by a cluster in a data center with tightly coupled, orchestrated communication. User demand is diurnal for a geographic region; thus, multiple data centers are positioned around the globe to accommodate the varying demand. When possible, demand may be spread across nearby data centers to load-balance the traffic.
The sheer enormity of this computing infrastructure makes nimble deployment very challenging. Each cluster is built up rack by rack and tested as units (rack, top-of-rack switch, among others), as well as in its entirety with production-level workloads and traffic intensity.
The cluster ecosystem undergoes organic growth over its life span, propelled by the rapid evolution of software, both applications and, to a lesser extent, the operating system. The fluid-like software demands of Web applications often consume the cluster resources that contain them, making flexibility a top priority in such a fluid system. For example, adding 10% additional storage capacity should mean adding no more than 10% more servers to the cluster. This linear growth function is critical to the scalability of the system: adding fractionally more servers results in a commensurate growth in the overall cluster capacity. Another aspect of this flexibility is the granularity of resource additions, which is often tied to the cluster packaging constraints. For example, adding another rack to a cluster, with, say, 100 new servers, is more manageable than adding a whole row, with tens of racks, on the data-center floor.
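The effect of packaging granularity on linear growth, discussed above, can be illustrated with a little arithmetic. The sketch below is hypothetical: the function name, the 1,000-server starting cluster, and the 20-rack row are assumptions chosen for illustration (the text says only "tens of racks"); the 100-server rack comes from the example in the text.

```python
import math


def servers_to_add(current_servers, growth_percent, unit_size):
    """Servers actually installed when capacity must grow by growth_percent
    and servers can only be added in whole units of unit_size (a rack, a row)."""
    needed = math.ceil(current_servers * growth_percent / 100)
    units = math.ceil(needed / unit_size)  # round up to whole packaging units
    return units * unit_size


# Assumed 1,000-server cluster growing by 10%.
by_rack = servers_to_add(1000, 10, unit_size=100)       # one 100-server rack
by_row = servers_to_add(1000, 10, unit_size=20 * 100)   # one assumed 20-rack row
```

Growing by the rack adds exactly the 100 servers that a 10% increase calls for, so growth stays linear. Growing by the row forces 2,000 servers onto the floor to satisfy the same 10% demand, which is why finer-grained additions are easier to manage.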