Figure 3. Number of concurrent connections has two modes: (1) 10 flows per node more than 50% of the time and (2) 80 flows per node for at least 5% of the time. [Plot omitted: PDF and CDF of the number of concurrent flows in/out of each machine (x-axis from 1 to 1000) versus fraction of time and cumulative fraction.]
than 50% of the time, an average machine has about ten
concurrent flows, but at least 5% of the time it has greater
than 80 concurrent flows. We almost never see more than
100 concurrent flows.
The distributions of flow size and number of concurrent
flows both imply that flow-based VLB will perform well on
this traffic. Since even big flows are only 100 MB (1 s of transmit time at 1 Gbps), randomizing at flow granularity (rather than at packet granularity) will not cause perpetual congestion even if there is an unlucky placement of too many flows on the same link.
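For concreteness, the parenthetical transmit-time estimate follows from a one-line calculation (assuming 100 MB means 10^8 bytes and a fully utilized 1 Gbps link):

$$10^{8}\,\text{bytes} \times 8\,\tfrac{\text{bits}}{\text{byte}} = 8\times10^{8}\,\text{bits}, \qquad \frac{8\times10^{8}\,\text{bits}}{10^{9}\,\text{bits/s}} = 0.8\,\text{s} \approx 1\,\text{s},$$

so even an unlucky placement persists only for about a second before the flows involved complete.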
Volatile traffic patterns: While the sizes of flows show a strong pattern, the traffic patterns inside a data center are highly divergent. When we cluster the traffic patterns, we find that more than 50 representative patterns are required to describe the traffic in the data center. Further, the traffic pattern varies frequently: 60% of the time the network spends no more than 100 s in one pattern before switching to another.
Frequent failures: As discussed in Section 2, conventional
data center networks apply 1 + 1 redundancy to improve reliability at higher layers of the hierarchical tree. This hierarchical topology is intrinsically unreliable: even with huge effort and expense to increase the reliability of the network devices close to the top of the hierarchy, we still see failures on those devices, resulting in significant downtime. In 0.3%
of failures, all redundant components in a network device
group became unavailable (e.g., the pair of switches that
comprise each node in the conventional network (Figure 1)
or both the uplinks from a switch). The main causes of failures are network misconfigurations, firmware bugs, and
faulty components.
With no obvious way to eliminate failures from the top
of the hierarchy, VL2’s approach is to broaden the top levels
of the network so that the impact of failures is muted and
performance degrades gracefully, moving from 1 + 1 redundancy to n + m redundancy.
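As a back-of-the-envelope illustration of why broadening helps (the 1/n figure is made precise for the scale-out topology in Section 4.1; n = 10 here is only an example value, and treating the conventional pair as the n = 2 case is our simplification): if a layer's capacity is spread over n devices, a single device failure leaves a fraction (n - 1)/n of it, so

$$\text{remaining capacity} = \frac{n-1}{n}: \qquad n=2 \Rightarrow 50\%, \qquad n=10 \Rightarrow 90\%.$$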
4. Virtual Layer 2 Networking
Before detailing our solution, we briefly discuss our design
principles and preview how they will be used in our design.
Randomizing to cope with volatility: The high divergence
and unpredictability of data center traffic matrices suggest
that optimization-based approaches to traffic engineering
risk congestion and complexity to little benefit. Instead,
VL2 uses VLB: destination-independent (e.g., random)
traffic spreading across the paths in the network. VLB, in
theory, ensures a noninterfering packet-switched network6
(the counterpart of a non-blocking circuit-switched network) as long as (a) traffic spreading ratios are uniform, and
(b) the offered traffic patterns do not violate edge constraints
(i.e., line card speeds). We use ECMP to pursue the former
and TCP’s end-to-end congestion control to pursue the latter. While these design choices do not perfectly ensure the
two assumptions (a and b), we show in Section 5.1 that our
scheme’s performance is close to the optimum in practice.
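The following sketch (in Python; the switch names, 5-tuple fields, and hash choice are illustrative assumptions, not VL2's implementation) shows the flow-granularity spreading that condition (a) relies on: a hash of a flow's headers picks one of the equal-cost paths, so spreading ratios are uniform in expectation while packets of a single flow stay in order on one path.

import hashlib

def pick_path(flow_5tuple, equal_cost_paths):
    # Hash the flow identifier (not the payload), so every packet of a flow
    # takes the same path while different flows spread uniformly across all
    # equal-cost paths, regardless of the offered traffic matrix.
    digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(equal_cost_paths)
    return equal_cost_paths[index]

# Hypothetical topology: ten intermediate switches reachable at equal cost.
intermediate_switches = ["int-%d" % i for i in range(10)]
flow = ("10.0.0.17", "10.0.3.42", 6, 51344, 80)  # src IP, dst IP, proto, sport, dport
print(pick_path(flow, intermediate_switches))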
Building on proven networking technology: VL2 is based
on IP routing and forwarding technologies already available in commodity switches: link-state routing, ECMP forwarding, and IP anycasting. VL2 uses a link-state routing
protocol to maintain the switch-level topology, but not to
disseminate end hosts’ information. This strategy protects
switches from needing to learn voluminous, frequently
changing host information. Furthermore, the routing
design uses ECMP forwarding along with anycast addresses
to enable VLB while minimizing control plane messages
and churn.
Separating names from locators: To be able to rapidly
grow or shrink server allocations and rapidly migrate
VMs, the data center network must support agility, which means being able to host any service on any server. This,
in turn, calls for separating names from locations. VL2’s
addressing scheme separates servers’ names, termed
application-specific addresses (AAs), from their locations, termed location-specific addresses (LAs). VL2 uses
a scalable, reliable directory system to maintain the mappings between names and locators. A shim layer running
in the networking stack on every host, called the VL2 agent,
invokes the directory system’s resolution service.
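A minimal sketch of how the agent and the directory system fit together, with hypothetical class and function names (the real agent operates inside the host networking stack; this only illustrates the resolve-then-address pattern behind the AA/LA split):

class DirectoryClient:
    # Hypothetical stand-in for the VL2 directory system's resolution service;
    # it holds the AA -> LA mappings described above.
    def __init__(self, aa_to_la):
        self._aa_to_la = dict(aa_to_la)

    def resolve(self, aa):
        # In VL2 this would be a query to directory servers; here, a dict lookup.
        return self._aa_to_la[aa]

def agent_send(payload, dst_aa, directory):
    # Sketch of the VL2 agent's role: resolve the destination's
    # application-specific address (AA), then address the packet to the
    # returned locator (LA) so the underlying network can deliver it.
    dst_la = directory.resolve(dst_aa)
    outer = {"outer_dst": dst_la, "inner_dst": dst_aa}  # illustrative headers only
    return outer, payload

directory = DirectoryClient({"20.0.0.55": "10.1.1.1"})  # illustrative addresses
print(agent_send(b"GET /", "20.0.0.55", directory))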
Embracing end systems: The rich and homogeneous
programmability available at data center hosts provides a
mechanism to rapidly realize new functionality. For example, the VL2 agent enables fine-grained path control by
adjusting the randomization used in VLB. The agent also
replaces Ethernet’s ARP functionality with queries to the
VL2 directory system. The directory system itself is also
realized on regular servers, rather than switches, and thus
offers flexibility, such as fine-grained access control between
application servers.
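Because resolution flows through the directory system rather than an Ethernet broadcast, policy can be applied at lookup time. The sketch below shows one way such fine-grained access control between application servers might look; the ACL format and function name are assumptions for illustration, not VL2's actual policy interface.

def resolve_with_acl(requester_aa, target_aa, aa_to_la, acl):
    # Directory-side check: reveal the target's locator only if the
    # requesting server is permitted to reach it; otherwise deny resolution.
    if target_aa not in acl.get(requester_aa, set()):
        return None
    return aa_to_la.get(target_aa)

aa_to_la = {"20.0.0.55": "10.1.1.1"}  # illustrative AA -> LA mapping
acl = {"20.0.0.7": {"20.0.0.55"}}     # 20.0.0.7 may reach 20.0.0.55
print(resolve_with_acl("20.0.0.7", "20.0.0.55", aa_to_la, acl))  # -> '10.1.1.1'
print(resolve_with_acl("20.0.0.9", "20.0.0.55", aa_to_la, acl))  # -> None (denied)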
4.1. Scale-out topologies
As described in Sections 2 and 3, conventional hierarchical
data center topologies have poor bisection bandwidth and
are susceptible to major disruptions due to device failures.
Rather than scale up individual network devices with more
capacity and features, we scale out the devices—building a
broad network offering huge aggregate capacity using a large
number of simple, inexpensive devices, as shown in Figure 4.
This is an example of a folded Clos network6 where the links
between the intermediate switches and the aggregation
switches form a complete bipartite graph. As in the conventional topology, ToRs connect to two aggregation switches,
but the large number of paths between any two aggregation
switches means that if there are n intermediate switches, the
failure of any one of them reduces the bisection bandwidth
by only 1/n—a desirable property we call graceful degradation