layer-2 domain is typically limited to a few hundred due
to Ethernet scaling overheads (packet flooding and ARP
broadcasts). To limit these overheads and to isolate different services or logical server groups (e.g., e-mail, search,
web front ends, web back ends), servers are partitioned into
virtual LANs (VLANs) placed into distinct layer-2 domains.
Unfortunately, this conventional design suffers from three limitations:
Limited server-to-server capacity: As we go up the
hierarchy, we are confronted with steep technical and
financial barriers in sustaining high bandwidth. Thus, as
traffic moves up through the layers of switches and routers, the oversubscription ratio increases rapidly. For example, servers typically have 1:1 oversubscription to other
servers in the same rack—that is, they can communicate
at the full rate of their interfaces (e.g., 1 Gbps). We found
that uplinks from ToRs are typically 1:2 to 1:20 oversubscribed (i.e., 1–10 Gbps of uplink for 20 servers), and paths
through the highest layer of the tree can be 1:240 oversubscribed. This large oversubscription factor fragments the
server pool by preventing idle servers from being assigned
to overloaded services, and it severely limits the entire
data center’s performance.
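The ratios quoted above follow from simple arithmetic, sketched below. The function names are illustrative only, and the worst-case figure assumes every server sends across the oversubscribed path at the same time and shares it equally, which is a simplification rather than a measured result.

```python
def oversubscription(servers: int, nic_gbps: float, uplink_gbps: float) -> float:
    """Return N such that the uplink is 1:N oversubscribed.

    N is the ratio of the aggregate server bandwidth below the link
    to the capacity of the link itself.
    """
    return (servers * nic_gbps) / uplink_gbps


def worst_case_per_server_mbps(nic_gbps: float, ratio: float) -> float:
    """Bandwidth each server gets across a 1:ratio path under full load.

    Assumes all servers send at once and share the path equally.
    """
    return nic_gbps * 1000.0 / ratio


# 20 servers with 1 Gbps interfaces behind one ToR, as in the text.
best = oversubscription(20, 1, 10)   # 10 Gbps of uplink: 1:2
worst = oversubscription(20, 1, 1)   # 1 Gbps of uplink: 1:20

# Across the highest layer of the tree (1:240), a 1 Gbps server is left
# with roughly 4 Mbps when every server is sending.
top_layer_mbps = worst_case_per_server_mbps(1, 240)
```

Under these assumptions, a server that can reach its rack neighbors at 1 Gbps may see only a few Mbps to a server on the far side of the tree, which is why placement within the hierarchy matters so much in the conventional design.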
Fragmentation of resources: As the cost and performance of communication depend on distance in
the hierarchy, the conventional design encourages service planners to cluster servers nearby in the hierarchy.
Moreover, spreading a service outside a single layer-2
domain frequently requires the onerous task of reconfiguring IP addresses and VLAN trunks, since the IP addresses
used by servers are topologically determined by the access
routers above them. Collectively, this contributes to the
squandering of computing resources across the data center. The consequences are egregious. Even if there is plentiful spare capacity throughout the data center, it is often
effectively reserved by a single service (and not shared), so
that this service can scale out to nearby servers to respond
rapidly to demand spikes or to failures. In fact, the growing
resource needs of one service have forced data center operations to evict other services in the same layer-2 domain,
incurring significant cost and disruption.
Poor reliability and utilization: Above the ToR, the
basic resilience model is 1:1. For example, if an aggregation switch or access router fails, there must be sufficient
remaining idle capacity on the counterpart device to carry
the load. This forces each device and link to be run at
no more than 50% of its maximum utilization. Inside a layer-2
domain, use of the Spanning Tree Protocol means that
even when multiple paths between switches exist, only
a single one is used. In the layer-3 portion, Equal Cost
Multipath (ECMP) is typically used: when multiple paths of
the same length are available to a destination, each router
uses a hash function to spread flows evenly across the available next hops. However, the conventional topology offers
at most two paths.
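The ECMP behavior described above can be sketched as follows. This is a minimal illustration, not how any particular router implements it: the 5-tuple flow key, the use of SHA-256, and the names `ecmp_next_hop` and `next_hops` are assumptions made for the example, and real devices typically use cheaper hardware hash functions.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick a next hop for a flow by hashing its identifying fields.

    `flow` is assumed to be a 5-tuple:
    (src_ip, dst_ip, protocol, src_port, dst_port).
    Every packet of a flow hashes to the same next hop, so packets
    within a flow are not reordered across paths.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# The conventional topology offers at most two equal-cost paths.
next_hops = ["access-router-A", "access-router-B"]  # hypothetical names

flows = [("10.0.0.1", "10.0.1.1", "tcp", 40000 + i, 80) for i in range(1000)]
counts = {hop: 0 for hop in next_hops}
for flow in flows:
    counts[ecmp_next_hop(flow, next_hops)] += 1
```

Because the choice depends only on the flow key, the split is even across flows rather than across bytes, so a few very large flows can still load one path more heavily than the other.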
3. MEASUREMENTS AND IMPLICATIONS
Developing a new network architecture requires a quantitative
understanding of the traffic matrix (who sends how
much data to whom and when?) and churn (how often does
the state of the network change due to switch/link failures
and recoveries, etc.?). We studied the production data
centers of a large cloud service provider, and we use the
results to drive our choices in designing VL2. Details of the
methodology and results can be found in other papers [10, 16].
Here we present the key findings that directly impact the
design of VL2.
Figure 2. Mice are numerous; 99% of flows are smaller than 100MB. However, more than 90% of bytes are in flows between 100MB and 1GB.
[Figure: two panels, a PDF and a CDF, each plotting a Flow Size curve and a Total Bytes curve against Flow Size (Bytes) on a logarithmic axis from 1 to 1e+12.]
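The two cumulative distributions in Figure 2 differ only in how each flow is weighted: by count for the flow-size CDF, and by bytes for the total-bytes CDF. The sketch below shows one way to compute both from a list of flow sizes. The function name and the sample data are made up for illustration and are not the measurements reported here.

```python
def flow_and_byte_cdfs(flow_sizes):
    """Return (size, flow_cdf, byte_cdf) points, one per flow, sorted by size.

    flow_cdf is the fraction of flows no larger than `size`;
    byte_cdf is the fraction of all bytes carried by those flows.
    For tied sizes, the last point with that size gives the CDF value.
    """
    sizes = sorted(flow_sizes)
    total_flows = len(sizes)
    total_bytes = sum(sizes)
    points = []
    cumulative_bytes = 0
    for rank, size in enumerate(sizes, start=1):
        cumulative_bytes += size
        points.append((size, rank / total_flows, cumulative_bytes / total_bytes))
    return points


# Synthetic example: 99 small "mice" of 10 KB and one 500 MB flow.
sample = [10_000] * 99 + [500_000_000]
points = flow_and_byte_cdfs(sample)

mice_size, mice_flow_cdf, mice_byte_cdf = points[98]
# 99% of the flows are mice, yet they carry well under 1% of the bytes.
```

Even in this toy sample the two curves diverge sharply, which is the same qualitative pattern the figure caption describes: most flows are small, but most bytes travel in a small number of large flows.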