sioned by providing ample bandwidth for even antagonistic traffic patterns. Overprovisioning within large-scale networks, however, is prohibitively expensive. Alternatively, implementing QoS (quality of service) policies to segregate traffic into distinct classes and provide performance isolation and high-level traffic engineering is a step toward ensuring that application-level SLAs are satisfied. Most QoS policies are implemented by switch and NIC (network interface controller) hardware, where traffic is segregated based on priority, either explicitly marked by routers and hosts or implicitly steered using port ranges. The goal is the same: a high-performance network that provides predictable latency and bandwidth characteristics across varying traffic patterns.
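As a rough illustration of explicit priority marking, the sketch below sets the DSCP bits of outgoing packets on a TCP socket using the standard IP_TOS socket option. The traffic-class names, the class-to-code-point mapping, and the helper function are assumptions made for this sketch, not something prescribed by the article.

```python
import socket

# Illustrative DSCP code points (6-bit values carried in the upper bits of the
# IP TOS byte). The mapping of traffic classes to code points is an assumption.
DSCP_CLASSES = {
    "bulk":        0x08,  # CS1: low-priority background transfers
    "best_effort": 0x00,  # default class
    "latency":     0x2E,  # EF: latency-sensitive request/response traffic
}

def open_marked_connection(host, port, traffic_class="best_effort"):
    """Open a TCP connection whose packets carry an explicit DSCP marking,
    so switches can steer the flow into the matching QoS queue."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tos = DSCP_CLASSES[traffic_class] << 2   # DSCP occupies the upper 6 bits of TOS
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    sock.connect((host, port))
    return sock
```

Implicit steering works the other way around: the host sends unmarked traffic, and switches classify it by matching on well-known port ranges.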
Data-Center Traffic
Traffic within a data-center network is often measured and characterized according to flows, which are sequences of packets from a source to a destination host. When referring to Internet protocols, a flow is further refined to include a specific source and destination port number and transport type (UDP or TCP, for example). Traffic is asymmetric, with client-to-server requests being abundant but generally small. Server-to-client responses, however, tend to be larger flows; of course, this, too, depends on the application. From the purview of the cluster, Internet traffic becomes highly aggregated, and as a result the mean of traffic flows says very little because aggregated traffic exhibits a high degree of variability and is non-Gaussian.
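To make the flow definition concrete, the sketch below aggregates per-packet records into flows keyed on the five-tuple described above. The packet-record format and the example values are assumptions made for illustration.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packet records into flows keyed on the classic five-tuple:
    (src addr, dst addr, src port, dst port, transport protocol).
    Each packet is assumed to be a dict with those fields plus a byte count."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["bytes"]
    return flows

# Example: two small client-to-server requests and one large server-to-client response.
packets = [
    {"src": "10.0.0.1", "dst": "10.0.1.5", "sport": 51000, "dport": 80, "proto": "TCP", "bytes": 200},
    {"src": "10.0.0.1", "dst": "10.0.1.5", "sport": 51000, "dport": 80, "proto": "TCP", "bytes": 180},
    {"src": "10.0.1.5", "dst": "10.0.0.1", "sport": 80, "dport": 51000, "proto": "TCP", "bytes": 1_000_000},
]
for key, stats in aggregate_flows(packets).items():
    print(key, stats)
```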
Figure 2. A conventional tree-like data-center network topology. (Figure labels: Internet, border routers (BR), cluster routers (CR), aggregation switches (AS), layer-2 switches (L2S), top-of-rack (ToR) switches, and hosts (H) grouped into clusters.)
The transient load imbalance induced by elephant flows can adversely affect any innocent-bystander flows that are patiently waiting for a heavily utilized link common to both routes. For example, an elephant flow from A to B might share a common link with a flow from C to D. Any long-lived contention for the shared link increases the likelihood of discarding a packet from the C-to-D flow. Any discarded packet results in an unacknowledged packet at the sender's transport layer and is retransmitted when the timeout period expires. Since the timeout period is generally one or two orders of magnitude more than the network's round-trip time, this additional latency22 is a significant source of performance variation.
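To see why a retransmission timeout dominates end-to-end latency, consider a back-of-the-envelope calculation with illustrative numbers (assumed here, not taken from the text): a 100-microsecond round-trip time and a timeout two orders of magnitude larger.

```python
# Illustrative numbers only: an intra-cluster round-trip time of 100 microseconds
# and a retransmission timeout two orders of magnitude larger, per the text above.
rtt_s = 100e-6           # round-trip time
rto_s = 100 * rtt_s      # retransmission timeout

no_loss_latency = rtt_s             # request/response completes in one round trip
one_discard_latency = rto_s + rtt_s # sender waits out the timeout, then retransmits

print(f"no loss:     {no_loss_latency * 1e3:.2f} ms")
print(f"one discard: {one_discard_latency * 1e3:.2f} ms "
      f"({one_discard_latency / no_loss_latency:.0f}x slower)")
```

A single discard thus turns a sub-millisecond exchange into one roughly a hundred times slower under these assumptions, which is exactly the kind of tail-latency variation the innocent-bystander flow experiences.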
Today’s typical multitiered data-center network23 has a significant amount of oversubscription, where the hosts attached to the rack switch (that is, the first tier) have significantly more provisioned bandwidth between one another (say, an order of magnitude more) than they do with hosts in other racks. This rack affinity is necessary to reduce network cost and improve utilization. The traffic intensity emitted by each host fluctuates over time, and the transient load imbalance that results from this varying load can create contention and ultimately result in discarded packets for flow control.
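For a concrete sense of what an order of magnitude of oversubscription means, the sketch below computes the oversubscription ratio of a hypothetical rack switch; the port counts and link speeds are assumptions for illustration only.

```python
def oversubscription_ratio(hosts, host_gbps, uplinks, uplink_gbps):
    """Ratio of bandwidth provisioned toward the rack's hosts to the
    bandwidth the rack switch has toward the rest of the network."""
    downlink_capacity = hosts * host_gbps
    uplink_capacity = uplinks * uplink_gbps
    return downlink_capacity / uplink_capacity

# Hypothetical rack: 40 hosts at 10 Gbps each, 4 uplinks at 10 Gbps each.
ratio = oversubscription_ratio(hosts=40, host_gbps=10, uplinks=4, uplink_gbps=10)
print(f"oversubscription: {ratio:.0f}:1")   # 10:1, an order of magnitude
```

When every host in such a rack bursts toward other racks at once, only a tenth of the offered load fits on the uplinks, which is where the contention and packet discards described above come from.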
Traffic between clusters is typically less time critical and as a result can be staged and scheduled. Inter-cluster traffic is less orchestrated and consists of much larger payloads, whereas intra-cluster traffic is often fine-grained with bursty behavior. At the next level, between data centers, bandwidth over vast distances is often very expensive, so traffic streams and patterns are highly regulated to keep those expensive links highly utilized. When congestion occurs, the most important traffic gets access to the links. Understanding the granularity and distribution of network flows is essential to capacity planning and traffic engineering.
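One way to picture the staging and scheduling of inter-data-center traffic is as a priority queue of pending bulk transfers that is drained whenever the expensive wide-area link has spare capacity. The sketch below is a deliberately simplified model of that idea; the transfer names, priorities, sizes, and link budget are all assumed values.

```python
import heapq

# Simplified model: pending inter-data-center transfers are drained in priority
# order so that, under congestion, the most important traffic gets the link first.
transfers = [
    (0, "index-replication", 40),   # priority 0 = most important, size in arbitrary units
    (2, "log-backup",        120),
    (1, "photo-sync",        60),
]
heapq.heapify(transfers)

link_budget = 100          # capacity available in this scheduling interval
while transfers and link_budget > 0:
    priority, name, size = heapq.heappop(transfers)
    sent = min(size, link_budget)
    link_budget -= sent
    print(f"sent {sent:>3} units of {name} (priority {priority})")
    if sent < size:        # whatever did not fit waits for the next interval
        heapq.heappush(transfers, (priority, name, size - sent))
```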
Data-Center Network Architecture
The network topology describes precisely how switches and hosts are interconnected. This is commonly represented as a graph in which vertices
represent switches or hosts, and links