spot-free performance for arbitrary traffic matrices, VL2
uses VLB as its traffic engineering philosophy. As illustrated
in Figure 5, VL2 achieves VLB using a combination of ECMP
routing implemented by the switches and packet encapsulation implemented by the shim on each server. ECMP, a
mechanism already implemented in the hardware of most
switches, will distribute flows across the available paths in
the network, with packets that have the same source and destination addresses taking the same path to avoid packet reordering. To leverage all the available paths in the network and
overcome some limitations in ECMP, the VL2 agent on each
sender encapsulates each packet to an intermediate switch.
Hence, the packet is first delivered to one of the intermediate switches, decapsulated by the switch, delivered to the
ToR’s LA, decapsulated again, and finally sent to the destination server. The source address in the outer headers of
the encapsulated packet is set to a hash of the inner packet’s
addresses and ports—this provides additional entropy to
better distribute flows between the same servers across the available paths.
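As a rough illustration of this mechanism, the Python sketch below builds the two outer headers and derives the outer source address from a hash of the inner addresses and ports; the anycast address, the packet representation, and the choice of hash function are illustrative assumptions, not the actual VL2 encapsulation format.

import hashlib

# Assumption: a single anycast LA shared by all intermediate switches.
INTERMEDIATE_ANYCAST_LA = "10.0.0.1"

def flow_entropy_address(src_aa, dst_aa, src_port, dst_port, proto):
    # Hash the inner packet's addresses and ports into an IPv4-shaped
    # value; switches hashing the outer header then see per-flow entropy.
    key = f"{src_aa}|{dst_aa}|{src_port}|{dst_port}|{proto}".encode()
    return ".".join(str(b) for b in hashlib.sha1(key).digest()[:4])

def encapsulate(inner, dst_tor_la):
    # Two outer headers: bounce the packet off an intermediate switch,
    # then deliver it to the destination ToR's LA; each hop strips one header.
    return {
        "outer_dst": INTERMEDIATE_ANYCAST_LA,
        "outer_src": flow_entropy_address(inner["src_aa"], inner["dst_aa"],
                                          inner["src_port"], inner["dst_port"],
                                          inner["proto"]),
        "next_dst": dst_tor_la,
        "payload": inner,
    }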
One potential issue for both ECMP and VLB is the chance
that uneven flow sizes and random spreading decisions will
cause transient congestion on some links. Our evaluation
did not find this to be a problem on data center workloads
(Section 5), but should it occur, the VL2 agent on the sender
can detect and deal with it via simple mechanisms. For
example, it can change the hash used to create the source
address periodically or whenever TCP detects a severe
congestion event (e.g., a full window loss) or an Explicit Congestion Notification.
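A minimal sketch of this reaction, assuming the agent can observe congestion signals from the local TCP stack and mixes a per-flow salt into the hash used above:

import hashlib
import os

class FlowPathSelector:
    # Per-flow salt mixed into the outer-source hash; changing the salt
    # changes the outer source address and hence, with high probability,
    # the ECMP path the flow takes.
    def __init__(self):
        self.salt = os.urandom(4)

    def reroute(self):
        # Invoked periodically, or when TCP reports a severe congestion
        # event such as a full-window loss or an ECN mark (assumed signals).
        self.salt = os.urandom(4)

    def outer_source(self, five_tuple):
        key = repr(five_tuple).encode() + self.salt
        return ".".join(str(b) for b in hashlib.sha1(key).digest()[:4])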
4.3. Maintaining host information using
the VL2 directory system
The VL2 directory system provides two key functions: (1) lookups and updates for AA-to-LA mappings and (2) a reactive cache update mechanism that ensures eventual consistency of the mappings with very little update overhead.
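In interface terms, these two functions amount to something like the following sketch; the names are illustrative, not the actual VL2 API.

from typing import Dict, Optional

class DirectoryService:
    def __init__(self):
        self._mappings: Dict[str, str] = {}   # AA -> LA of the hosting ToR

    def lookup(self, aa: str) -> Optional[str]:
        # Resolve an application address to its current locator address.
        return self._mappings.get(aa)

    def update(self, aa: str, la: str) -> None:
        # Install or change a mapping; the reactive cache-update mechanism
        # described later corrects caches that still hold the old LA.
        self._mappings[aa] = la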
We expect the lookup workload for the directory system
to be frequent and bursty because servers can communicate
with up to hundreds of other servers in a short time period,
with each new flow generating a lookup for an AA-to-LA mapping. The bursty nature of the workload implies that lookups
Figure 6. VL2 directory system architecture.
require high throughput and low response time to quickly
establish a large number of connections. Since lookups
replace ARP, their response time should match that of ARP,
that is, tens of milliseconds. For updates, however, the workload is driven by server-deployment events, most of which are
planned ahead by the data center management system and
hence can be batched. The key requirement for updates is
reliability, and response time is less critical.
Our directory service replaces ARP in a conventional L2
network, and ARP ensures eventual consistency via timeout
and broadcasting. This implies that eventual consistency of
AA-to-LA mappings is acceptable as long as we provide a reliable update mechanism. Nonetheless, because we intend to support live VM migration in a VL2 network, our directory system should be able to correct stale entries without breaking any ongoing communications.
The differing performance requirements and workload
patterns of lookups and updates lead us to a two-tiered directory system architecture consisting of (1) a modest number (50–100 servers for 100K servers) of read-optimized, replicated lookup servers that cache AA-to-LA mappings and that communicate with VL2 agents, and (2) a small number (5–10 servers) of write-optimized, asynchronous replicated state-machine (RSM) servers offering a strongly consistent, reliable store of AA-to-LA mappings. The lookup servers ensure
low latency, high throughput, and high availability for a high
lookup rate. Meanwhile, the RSM servers ensure strong consistency and durability for a modest rate of updates using
the Paxos19 consensus algorithm.
Each lookup server caches all the AA-to-LA mappings
stored at the RSM servers and independently replies to
lookup queries from agents using the cached state. Since
strong consistency is not required, a lookup server lazily
synchronizes its local mappings with the RSM every 30s. To
achieve high availability and low latency, an agent sends a
query to k (two in our prototype) randomly chosen lookup
servers and simply chooses the fastest reply. Since AA-to-LA
mappings are cached at lookup servers and in VL2 agents’ caches, an update can lead to inconsistency. To resolve this inconsistency, the cache-update protocol leverages a key observation: a stale host mapping needs to be corrected only when
that mapping is used to deliver traffic. Specifically, when a
stale mapping is used, some packets arrive at a stale LA—a
ToR that no longer hosts the destination server.
The ToR forwards such non-deliverable packets to a lookup
server, triggering the lookup server to correct the stale
mapping in the source’s cache via unicast.
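The agent-side lookup can be sketched as follows; the RPC placeholder query_fn, the thread-pool approach, and the k=2 default are assumptions for illustration rather than the prototype's actual implementation.

import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def resolve_aa(aa, lookup_servers, query_fn, k=2):
    # Query k randomly chosen lookup servers in parallel and return the
    # first usable reply; query_fn(server, aa) stands in for the real RPC.
    chosen = random.sample(lookup_servers, k)
    pool = ThreadPoolExecutor(max_workers=k)
    futures = [pool.submit(query_fn, server, aa) for server in chosen]
    try:
        for future in as_completed(futures):
            try:
                la = future.result()
            except Exception:
                continue           # a failed or timed-out replica; wait for the next reply
            if la is not None:
                return la          # fastest successful reply wins
        return None
    finally:
        pool.shutdown(wait=False)  # do not block on the slower replica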
5. Evaluation
In this section, we evaluate VL2 using a prototype running on
an 80-server testbed and 10 commodity switches (Figure 7).
Our goals are first to show that VL2 can be built from components available today, and second, that our implementation
meets the objectives described in Section 1.
The testbed is built using the Clos network topology of
Figure 4, consisting of three intermediate switches, three
aggregation switches, and four ToRs. The aggregation and
intermediate switches have 24 10Gbps Ethernet ports, of
which 6 ports are used on each aggregation switch and 3