simply extracts its relevant portion. Doing so simplifies configuration generation but every switch has to be updated with
the new config each time the cluster configuration changes.
Since cluster configurations do not change frequently, this
additional overhead is not significant.
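To make the extraction step concrete, the following is a minimal sketch, assuming a JSON encoding of the monolithic file keyed by switch hostname; the path and field names are our illustration, not the production format.

```python
import json
import socket

def load_my_config(path="/etc/fabric/cluster_config.json"):
    """Extract this switch's portion of the monolithic cluster config.

    The monolithic file (illustrative format) carries one entry per
    switch, chassis and ToR alike; each switch simply indexes into it
    with its own identity and ignores the rest.
    """
    with open(path) as f:
        cluster = json.load(f)       # configuration for the entire cluster
    me = socket.gethostname()        # this switch's identity
    return cluster["switches"][me]   # only this switch's role, ports, IPs
```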
Switch management approach. We designed a simple management system on the switches. We did not require most of
the standard network management protocols. Instead, we
focused on protocols to integrate with our existing server
management infrastructure. We benefited from not drawing arbitrary lines between server and network infrastructure; in fact, we set out to make switches essentially look like
regular machines to the rest of the fleet. Examples include large-scale monitoring, image management and installation, and
syslog collection and alerting.
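For instance, pointing a switch's logging at the fleet's existing syslog collectors requires nothing switch-specific. A minimal sketch, assuming a hypothetical collector endpoint:

```python
import logging
import logging.handlers

# Send switch logs to the same collectors the server fleet uses, so the
# existing collection and alerting pipeline applies unchanged. The
# collector address below is hypothetical.
handler = logging.handlers.SysLogHandler(
    address=("syslog-collector.example.net", 514))
log = logging.getLogger("switch")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.warning("linkscan: port 7 flapped 3 times in 60s")  # example event
```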
Fabric operation and management. For fabric operation
and management, we continued with the theme of leveraging the existing scalable infrastructure built to manage
and operate the server fleet. We built additional tools that
were aware of the network fabric as a whole, thus hiding
complexity in our management software. As a result, we
could focus on developing only a few tools that were truly
specific to our large scale network deployments, including
link/switch qualification, fabric expansion/upgrade, and
network troubleshooting at scale. Also important was collaborating closely with the network operations team to provide training before introducing each major network fabric generation, expediting the ramp of each technology across the fleet.
Troubleshooting misbehaving traffic flows in a network
with such high path diversity is daunting for operators.
Therefore, we extended debugging utilities such as traceroute and ICMP to be aware of the fabric topology. This helped
with locating switches in the network that were potentially
blackholing flows. We proactively detect such anomalies by
running probes across servers randomly distributed in the
cluster. On probe failures, these servers automatically run
traceroutes and identify suspect failures in the network.
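A stripped-down version of such a prober might look as follows; the peer list, probe port, and reporting are our assumptions, and the production utility is topology-aware rather than plain traceroute.

```python
import random
import socket
import subprocess

PEERS = ["10.0.1.17", "10.0.44.3", "10.0.97.251"]  # hypothetical probe targets

def probe_ok(peer, port=7, timeout=1.0):
    """Return True if a TCP probe to the peer succeeds within the timeout."""
    try:
        with socket.create_connection((peer, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_probe_round():
    peer = random.choice(PEERS)  # random pairwise probing across the cluster
    if not probe_ok(peer):
        # On failure, record the path taken so suspect switches can be
        # identified; requires the traceroute binary to be installed.
        trace = subprocess.run(["traceroute", "-n", peer],
                               capture_output=True, text=True)
        print(f"probe to {peer} failed; suspect path:\n{trace.stdout}")
```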
6.1. Fabric congestion
Despite the capacity in our fabrics, our networks experienced high congestion drops as utilization approached
25%. We found that several factors contributed to congestion: (i) the inherent burstiness of flows led to inadmissible traffic in short time intervals, typically seen as incast8 or outcast20; (ii) our commodity switches possessed limited buffering, which was suboptimal for our server TCP stack; (iii) certain parts of the network were intentionally kept oversubscribed to save cost, for example, the uplinks of a ToR; and (iv) imperfect flow hashing, especially during failures and in the presence of variation in flow volume.
We used several techniques to alleviate the congestion in our fabrics. First, we configured our switch hardware schedulers to drop packets based on QoS; thus, on congestion, we would discard lower priority traffic. Second, we tuned the hosts to bound their TCP congestion window for intracluster traffic, to avoid overrunning the small buffers in our switch chips (one possible mechanism is sketched below). Third, for our early fabrics, we employed link-level pause at ToRs to keep the servers' NICs from dropping packets.
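The text above does not specify the exact host-side knob, so the sketch below shows only one plausible Linux mechanism for bounding a connection's effective window, the TCP_WINDOW_CLAMP socket option; the clamp value is illustrative.

```python
import socket

TCP_WINDOW_CLAMP = 10  # option number from <linux/tcp.h> (Linux-only)

def intracluster_socket(clamp_bytes=64 * 1024):
    """Create a TCP socket whose advertised window is clamped.

    Capping the window bounds how much data the remote sender can keep
    in flight, keeping bursts within shallow switch buffers. The 64 KB
    value is illustrative, not a recommended setting.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, TCP_WINDOW_CLAMP, clamp_bytes)
    return s
```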
Redundant Firepath master instances run on pre-selected spine switches specified via static configuration. The Firepath Master Redundancy Protocol (FMRP) handles master election and bookkeeping between the active and backup masters over the CPN.
FMRP has been robust in production over multiple years
and many clusters. Since master election is sticky, a misbehaving master candidate does not cause mastership changes or churn in the network. In the rare case of a CPN partition, a multi-master situation may result, which immediately alerts network operators for manual intervention.
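The stickiness property can be illustrated with a toy election loop; this is our sketch of the behavior described above, not FMRP's actual protocol or message format.

```python
import time

class StickyElection:
    """Toy sticky master election: an incumbent is never preempted.

    A backup promotes itself only after the incumbent's heartbeats stop,
    so a flapping candidate cannot induce mastership churn. The timeout
    is illustrative.
    """
    HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before failover

    def __init__(self, my_id):
        self.my_id = my_id
        self.master_id = None      # unknown until a heartbeat arrives
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, sender_id):
        # Sticky: only the current incumbent (or the first candidate
        # seen) refreshes mastership; other candidates are ignored.
        if self.master_id in (None, sender_id):
            self.master_id = sender_id
            self.last_heartbeat = time.monotonic()

    def maybe_promote(self):
        if time.monotonic() - self.last_heartbeat > self.HEARTBEAT_TIMEOUT:
            self.master_id = self.my_id  # incumbent gone; take over
```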
Cluster border router. Our cluster fabrics peer with external networks via BGP. To this end, we integrated a BGP stack
on the CBR with Firepath. This integration has two key aspects: (i) enabling the BGP stack on the CBRs to communicate inband with external BGP speakers, and (ii) supporting
route exchange between the BGP stack and Firepath. Figure
11B shows the interaction between the BGP stack, Firepath,
and the switch kernel and embedded stack.
A proxy process on the CBR exchanges routes between
BGP and Firepath. This process exports intra-cluster routes
from Firepath into the BGP RIB and picks up inter-cluster
routes from the BGP RIB, redistributing them into Firepath.
We made a simplifying assumption, summarizing routes to the cluster prefix for external BGP advertisement and redistributing only the /0 default route into Firepath. In this way, Firepath manages
only a single route for all outbound traffic, assuming all
CBRs are viable for traffic leaving the cluster. Conversely, we
assume all CBRs are viable to reach any part of the cluster
from an external network. The rich path diversity inherent to
Clos fabrics enables both these assumptions.
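In effect, the proxy's route exchange reduces to two summarizations, one in each direction. A compressed sketch, assuming dictionary-like RIB interfaces that are our invention for illustration:

```python
CLUSTER_PREFIX = "10.4.0.0/14"  # hypothetical cluster aggregate
DEFAULT_ROUTE = "0.0.0.0/0"

def sync_routes(bgp_rib, firepath_rib, cbr_next_hops):
    """One pass of the CBR proxy's route exchange (interfaces illustrative).

    Outbound: advertise a single aggregate covering the whole cluster.
    Inbound: collapse all external reachability into one default route
    in Firepath pointing at the set of CBRs, since the Clos fabric's
    path diversity makes every CBR a viable exit and entry point.
    """
    bgp_rib.advertise(CLUSTER_PREFIX)                    # summarize outbound
    firepath_rib.install(DEFAULT_ROUTE, cbr_next_hops)   # one route inbound
```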
5.3. Configuration and management
Next, we describe our approach to cluster network configuration and management prior to Jupiter. Our primary goal
was to manufacture compute clusters and network fabrics as
fast as possible throughout the entire fleet. Thus, we favored
simplicity and reproducibility over flexibility. We supported
only a limited number of fabric parameters, used to generate all the information needed by various groups to deploy
the network, and built simple tools and processes to operate
the network. As a result, the system was easily adopted by a
wide set of technical and non-technical support personnel
responsible for building data centers.
Configuration generation approach. Our key strategy was
to view the entire cluster network top-down as a single static
fabric composed of switches with pre-assigned roles, rather than bottom-up as a collection of switches individually
configured and assembled into a fabric. We also limited the
number of choices at the cluster level, essentially providing a simple menu of fabric sizes and options based on the projected maximum size of a cluster as well as the chassis type.
The configuration system is a pipeline that accepts a specification of basic cluster-level parameters, such as the size of the spine, the base IP prefix of the cluster, and the list of ToRs and their rack indexes, and then generates a set of output files for various operations groups.
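A compressed sketch of such a pipeline follows; the parameter names and the set of output files are assumptions based on the description above.

```python
from dataclasses import dataclass
import ipaddress

@dataclass
class ClusterSpec:
    spine_size: int      # number of spine switches
    base_prefix: str     # e.g., "10.4.0.0/14"
    tor_racks: dict      # ToR name -> rack index

def generate(spec: ClusterSpec) -> dict:
    """Expand a small cluster-level spec into per-group outputs.

    Mirrors the top-down pipeline described above: a single static view
    of the fabric is expanded into everything each operations group
    needs. Output names are illustrative.
    """
    subnets = ipaddress.ip_network(spec.base_prefix).subnets(new_prefix=24)
    tor_configs = {
        tor: {"rack": rack, "subnet": str(next(subnets)), "role": "tor"}
        for tor, rack in sorted(spec.tor_racks.items())
    }
    return {
        "switch_configs": tor_configs,             # per-switch addressing/roles
        "cabling_plan": sorted(spec.tor_racks),    # for the deployment team
        "monitoring_targets": list(tor_configs),   # for alerting infrastructure
    }
```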
We distribute a single monolithic cluster configuration
to all switches (chassis and ToRs) in the cluster. Each switch