The following sections explain these challenges and the rationale for our approach in detail.
Our topological approach, reliance on merchant silicon, and load balancing across multipath are substantially similar to contemporaneous research.1,15 Our centralized control protocols running on switch-embedded processors are also related to subsequent substantial efforts in SDN. Based on our experience in the datacenter, we later applied SDN to our Wide Area Network.17 For the WAN, more CPU-intensive traffic engineering and BGP routing protocols led us to move control protocols onto external servers with more plentiful CPU.
3. NETWORK EVOLUTION
3.1. Firehose
Table 2 summarizes the multiple generations of our cluster
network. With our initial approach, Firehose 1.0 (or FH1.0), our
nominal goal was to deliver 1Gbps of nonblocking bisection
bandwidth to each of 10K servers. Figure 3 details the FH1.0
topology, which used 8x10G switches in both the aggregation blocks and the spine blocks. The ToR switch delivered 2x10GE
ports to the fabric and 24x1GE server ports.
Each aggregation block hosted 16 ToRs and exposed
32x10G ports towards 32 spine blocks. Each spine block had
32x10G ports towards 32 aggregation blocks, resulting in a fabric
that scaled to 10K machines at 1G average bandwidth to any
machine in the fabric.
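As a quick sanity check on these figures, the following back-of-the-envelope sketch (illustrative Python, not part of the original design tooling; the counts of 32 aggregation and 32 spine blocks are taken from the text above) recomputes the fabric scale from the stated port counts:

```python
# FH1.0 scale, derived from the port counts quoted in the text.
AGG_BLOCKS = 32             # each spine block has 32x10G towards 32 aggregation blocks
TORS_PER_AGG_BLOCK = 16     # ToRs hosted per aggregation block
SERVERS_PER_TOR = 24        # 24x1GE server-facing ports per ToR
TOR_UPLINK_GBPS = 2 * 10    # 2x10GE fabric-facing ports per ToR
AGG_UPLINKS_GBPS = 32 * 10  # 32x10G ports from each aggregation block towards the spine

server_ports = AGG_BLOCKS * TORS_PER_AGG_BLOCK * SERVERS_PER_TOR  # 12,288 ports
spine_capacity_gbps = AGG_BLOCKS * AGG_UPLINKS_GBPS               # 10,240 Gbps

print(f"server ports: {server_ports}")                                       # ~10K machines
print(f"spine-facing capacity: {spine_capacity_gbps / 1000:.1f} Tbps")        # the 10T entry in Table 2
print(f"average bandwidth for 10K servers: {spine_capacity_gbps / 10_000:.2f} Gbps")
print(f"ToR oversubscription: {SERVERS_PER_TOR / TOR_UPLINK_GBPS:.1f}:1")     # 24x1G down vs 2x10G up
```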
Since we did not have any experience building switches
but we did have experience building servers, we attempted
to integrate the switching fabric into the servers via a PCI
board. See top right inset in Figure 3. However, the uptime
of servers was less than ideal. Servers crashed and were upgraded more frequently than desired, and reboot times were long. Network disruptions from a server failure were especially problematic for servers housing a ToR, since each ToR connected multiple other servers to the first stage of the topology.
The resulting wiring complexity of server-to-server connectivity, electrical reliability issues, availability problems, and the general issues associated with our first foray into switching meant that FH1.0 never saw production traffic. At the same time, we consider it a landmark effort internally. Without it and the associated learning, the efforts that followed would not have been possible.
Our first production deployment of a custom datacenter cluster fabric was Firehose 1.1 (FH1.1). We had learned from FH1.0 not to use servers to house switch chips. Thus, we built custom enclosures standardized around a Compact PCI chassis, each with six independent linecards and a dedicated Single-Board Computer (SBC) to control the linecards over PCI. See the insets in Figure 4. The fabric chassis did
Table 2. Multiple generations of datacenter networks.

Generation | First deployed | Merchant silicon | ToR config | Aggregation block config | Spine block config | Fabric speed | Host speed | Bisection BW
Legacy network | 2004 | Vendor | 48x1G | – | – | 10G | 1G | 2T
Firehose 1.0 | 2005 | 8x10G, 4x10G (ToR) | 2x10G up, 24x1G down | 2x32x10G (B) | 32x10G (NB) | 10G | 1G | 10T
Firehose 1.1 | 2006 | 8x10G | 4x10G up, 48x1G down | 64x10G (B) | 32x10G (NB) | 10G | 1G | 10T
Watchtower | 2008 | 16x10G | 4x10G up, 48x1G down | 4x128x10G (NB) | 128x10G (NB) | 10G | nx1G | 82T
Saturn | 2009 | 24x10G | 24x10G | 4x288x10G (NB) | 288x10G (NB) | 10G | nx10G | 207T
Jupiter | 2012 | 16x40G | 16x40G | 8x128x40G (B) | 128x40G (NB) | 10/40G | nx10G/nx40G | 1.3P

B indicates blocking; NB indicates nonblocking.
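To put the last column of Table 2 in perspective, the short sketch below (again illustrative Python, using only the bisection bandwidth values from the table) computes the overall capacity growth across generations:

```python
# Bisection bandwidth per generation, in Tbps (values from Table 2).
bisection_tbps = {
    "Legacy network (2004)": 2,
    "Firehose 1.0 (2005)": 10,
    "Firehose 1.1 (2006)": 10,
    "Watchtower (2008)": 82,
    "Saturn (2009)": 207,
    "Jupiter (2012)": 1300,  # 1.3P
}

growth = bisection_tbps["Jupiter (2012)"] / bisection_tbps["Legacy network (2004)"]
print(f"bisection bandwidth growth, 2004 to 2012: {growth:.0f}x")  # 650x
```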
Table 1. High-level summary of challenges we faced and our approach to address them.

Challenge | Our approach (section discussed in)
Introducing the network to production | Initially deploy as bag-on-the-side with a fail-safe big-red button (3.1)
High availability from cheaper components | Redundancy in fabric, diversity in deployment, robust software, necessary protocols only,
Individual racks can leverage full uplink capacity to external clusters | Introduce Cluster Border Routers to aggregate external bandwidth shared by all server racks (4.1)
Routing scalability | Scalable in-house IGP, centralized topology view and route control (5.2)
Interoperate with external vendor gear | Use standard BGP between Cluster Border Routers and vendor gear (5.2.5)
Small on-chip buffers | Congestion window bounding on servers, ECN, dynamic buffer sharing of chip buffers,
Routing with massive multipath | Granular control over ECMP tables with proprietary IGP (5.1)
Operating at scale | Leverage existing server installation and monitoring software; tools build and operate fabric as a whole; move beyond individual chassis-centric network view; single cluster-wide configuration (5.3)
Inter cluster networking | Portable software, modular hardware in other applications in the network hierarchy (4.2)