94 COMMUNICATIONS OF THE ACM | SEPTEMBER 2016 | VOL. 59 | NO. 9
out-of-band Control Plane Network (CPN) appeared to be
substantially simpler and more efficient from a computation
and communication perspective. The switches could then
calculate forwarding tables based on current link state as deltas relative to the underlying, known static topology that was
pushed to all switches.
Overall, we treated the datacenter network as a single
fabric with tens of thousands of ports rather than a collection of hundreds of autonomous switches that had to
dynamically discover information about the fabric. We
were, at this time, inspired by the success of large-scale distributed storage systems with a centralized manager. This design informed the control architecture for both Jupiter datacenter networks and Google's B4 WAN,17 both of which are based on OpenFlow18 and custom SDN control stacks.
5.2. Routing
We now present the key components of Firepath, our routing
architecture for Firehose, Watchtower, and Saturn fabrics.
A number of these components anticipate some of the principles of modern SDN, especially in using logically centralized state and control. First, all switches are configured with
the baseline or intended topology. The switches learn actual
configuration and link state through pair-wise neighbor
discovery. Next, routing proceeds with each switch exchanging its local view of connectivity with a centralized Firepath
master, which redistributes global link state to all switches.
Switches locally calculate forwarding tables based on this
current view of network topology. To maintain robustness, we implement a Firepath master election protocol.
Finally, we leverage standard BGP only for route exchange
at the edge of our fabric, redistributing BGP-learned routes through Firepath.
Neighbor discovery to verify connectivity. Building a
fabric with thousands of cables invariably leads to multiple cabling errors. Moreover, correctly cabled links may
be re-connected incorrectly after maintenance. Allowing traffic to use a miscabled link can lead to forwarding
loops. Links that fail unidirectionally or develop high
packet error rates should also be avoided and scheduled
for replacement. To address these issues, we developed
Neighbor Discovery (ND), an online liveness and peer correctness checking protocol. ND uses the configured view
of cluster topology together with a switch’s local ID to determine the expected peer IDs of its local ports and verifies that via message exchange.
Firepath. We support Layer 3 routing all the way to the
ToRs via a custom Interior Gateway Protocol (IGP), Firepath.
Firepath implements centralized topology state distribution, but distributed forwarding table computation with
two main components. A Firepath client runs on each fabric
switch, and a set of redundant Firepath masters run on a
selected subset of spine switches. Clients communicate
with the elected master over the CPN. Figure 11 shows
the interaction between the Firepath client and the rest of
the switch stack. Figure 12 illustrates the protocol message
exchange between various routing components.
At startup, each client is loaded with the static topology
of the entire fabric called the cluster config. Each client
collects the state of its local interfaces from the embedded
stack’s interface manager and transmits this state to the
master. The master constructs a Link State Database (LSD)
with a monotonically increasing version number and distributes it to all clients via UDP/IP multicast over the CPN. After the initial full update, a subsequent LSD contains only the diffs from the previous state. The entire network's LSD fits within a 64KB payload. On receiving an LSD update, each client computes shortest-path forwarding with Equal-Cost Multi-Path (ECMP) and programs the hardware forwarding tables local to its switch.
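The master's aggregation step can be sketched as below. The class and method names are hypothetical, chosen only to illustrate the versioned-diff scheme: unchanged reports produce no new LSD, while a real change bumps the version and yields a delta for multicast.

```python
# Illustrative sketch of the Firepath master's LSD construction
# (names are assumptions, not Firepath's actual data structures).
class FirepathMaster:
    def __init__(self):
        self.version = 0
        self.link_state = {}  # (switch, port) -> "up" | "down"

    def on_interface_update(self, switch, port, state):
        """Fold a client's interface report into the LSD.

        Returns (new_version, diff) to multicast over the CPN, or
        None when the report changes nothing.
        """
        if self.link_state.get((switch, port)) == state:
            return None  # no change; no new LSD version
        self.link_state[(switch, port)] = state
        self.version += 1
        return self.version, {(switch, port): state}

master = FirepathMaster()
# First report of a link creates LSD version 1 with a one-entry diff.
assert master.on_interface_update("spine_04", 7, "up") == (1, {("spine_04", 7): "up"})
# A redundant report of unchanged state produces no new version.
assert master.on_interface_update("spine_04", 7, "up") is None
```

Monotonic version numbers let a client detect a missed multicast and request a full resynchronization rather than applying a diff against the wrong base state.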
Path diversity and convergence on failures. For rapid convergence on interface state change, each client computes
the new routing solution and updates the forwarding tables
independently upon receiving an LSD update. Since clients
do not coordinate during convergence, the network can
experience small transient loss while the network transitions from the old to the new state. However, assuming
churn is transient, all switches eventually act on a globally
consistent view of network state.
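The per-client computation described above, shortest paths with all equal-cost next hops retained for ECMP, might look like the following sketch. With unit link costs a breadth-first search suffices; the graph shape and names are illustrative assumptions, not the Firepath client's actual code.

```python
# Sketch of per-client ECMP shortest-path computation over the
# current link-state view (illustrative; unit link costs assumed).
from collections import deque

def ecmp_next_hops(graph, src):
    """For each destination, the set of neighbors of `src` that lie
    on some shortest path. `graph` maps a node to its up neighbors."""
    dist, hops = {src: 0}, {}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
            if dist[v] == dist[u] + 1:
                # First hop toward v is v itself when u is the source;
                # otherwise v inherits u's first hops.
                hops.setdefault(v, set()).update({v} if u == src else hops[u])
    return hops

# Two spines give the ToR two equal-cost paths to the peer ToR, so
# traffic can be hashed across both.
graph = {
    "tor_a": {"spine_1", "spine_2"},
    "spine_1": {"tor_a", "tor_b"},
    "spine_2": {"tor_a", "tor_b"},
    "tor_b": {"spine_1", "spine_2"},
}
assert ecmp_next_hops(graph, "tor_a")["tor_b"] == {"spine_1", "spine_2"}
```

Since every client runs this same deterministic computation over the same LSD, the switches converge to consistent forwarding state without coordinating with one another.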
Firepath LSD updates contain routing changes due to
planned and unplanned network events. The frequency of
such events observed in a typical cluster is approximately 2000 times per month, or equivalently about 70 times per day or 3 times per hour.
Firepath master redundancy. The centralized Firepath
master is a critical component in the Firepath system. It collects and distributes interface states and synchronizes the
Firepath clients via a keepalive protocol. For availability,
we run redundant master instances on pre-selected spine
switches. Switches know the candidate masters via their
Figure 11. Firepath component interactions. (A) Non-CBR fabric switch and (B) CBR switch.
Figure 12. Protocol messages between Firepath client and Firepath master, between Firepath masters and between CBR and external