the management overhead of hosting external applications while simultaneously delivering new levels of performance and scale for Cloud customers.
Inspired by the community’s ability to scale out computing with parallel arrays of commodity servers, we sought a
similar approach for networking. This paper describes
our experience with building five generations of custom
data center network hardware and software by leveraging
commodity hardware components, while addressing the
control and management requirements introduced by our
approach. We used the following principles in constructing our networks:
Clos topologies: To support graceful fault tolerance,
increase the scale and bisection bandwidth of our datacenter networks, and
accommodate lower-radix switches, we adopted Clos topologies [1, 9, 15] for our datacenters. Clos topologies can scale to
nearly arbitrary size by adding stages to the topology, principally limited by failure domain considerations and control
plane scalability. They also have substantial in-built path
diversity and redundancy, so the failure of any individual
element results in a relatively small reduction in capacity.
However, they introduce substantial challenges as well,
including managing the fiber fanout and more complex
routing across multiple equal-cost paths.
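As an illustration of these scaling and path-diversity properties, the sketch below models a folded three-tier Clos (fat-tree style) built from identical k-port switches. The switch radices are illustrative, and the model ignores the failure-domain and control-plane limits noted above.

```
# Back-of-the-envelope model of a folded three-tier Clos (fat-tree style)
# built from identical k-port switches. Illustrative sketch only, not a
# description of any production fabric.

def clos_capacity(radix: int):
    """Return (hosts, inter_pod_paths) for a three-tier Clos of radix-port switches."""
    assert radix % 2 == 0, "radix must be even to split ports between up and down links"
    half = radix // 2
    hosts = radix * half * half      # k pods * k/2 edge switches * k/2 hosts = k^3 / 4
    inter_pod_paths = half * half    # one shortest path through each spine switch
    return hosts, inter_pod_paths

if __name__ == "__main__":
    for radix in (24, 48, 64):
        hosts, paths = clos_capacity(radix)
        print(f"{radix}-port switches: ~{hosts:,} hosts, {paths} equal-cost inter-pod paths")
```

Adding a further stage multiplies the host count again, which is why failure-domain size and control-plane scale, rather than the topology itself, become the practical limits.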
Merchant silicon: Rather than use commercial switches
targeting small volumes, large feature sets, and high reliability, we targeted general-purpose merchant switch silicon:
commodity-priced, off-the-shelf switching components. To
keep pace with server bandwidth demands, which scale with
cores per server and Moore's law, we emphasized bandwidth
density and frequent refresh cycles. Regularly upgrading
network fabrics with the latest generation of commodity
switch silicon allows us to deliver exponential growth in
bandwidth capacity in a cost-effective manner.
Centralized control protocols: Control and management
become substantially more complex with Clos topologies
because we dramatically increase the number of discrete
switching elements. Existing routing and management protocols were not well-suited to such an environment. To control this complexity, we observed that individual datacenter
switches played a pre-determined forwarding role based on
the cluster plan. We took this observation to one extreme by
collecting and distributing dynamically changing link state
information from a central, dynamically elected node in the
network. Individual switches could then calculate forwarding tables based on current link state relative to a statically configured topology.
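The sketch below illustrates this division of labor in simplified form: a central node aggregates link up/down reports and redistributes them, and each switch prunes its statically configured view of the topology to the links currently up before computing equal-cost next hops. All names and data structures here are illustrative, not our production protocol.

```
# Simplified illustration of centralized link-state dissemination: the
# central node publishes the set of failed links, and each switch derives
# its forwarding table from a statically configured baseline topology
# pruned to the links currently up. Illustrative only.

from collections import deque

# Baseline topology every switch knows statically (adjacency lists).
BASELINE = {
    "tor1":   ["agg1", "agg2"],
    "tor2":   ["agg1", "agg2"],
    "agg1":   ["tor1", "tor2", "spine1", "spine2"],
    "agg2":   ["tor1", "tor2", "spine1", "spine2"],
    "spine1": ["agg1", "agg2"],
    "spine2": ["agg1", "agg2"],
}

def live_topology(baseline, down_links):
    """Prune the baseline to links not reported down by the central node."""
    down = {frozenset(l) for l in down_links}
    return {u: [v for v in nbrs if frozenset((u, v)) not in down]
            for u, nbrs in baseline.items()}

def distances_to(topo, root):
    """Hop count from every reachable node to `root` (BFS)."""
    dist, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in topo.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def forwarding_table(topo, switch):
    """ECMP table at `switch`: destination -> set of equal-cost next hops."""
    table = {}
    for dst in topo:
        if dst == switch:
            continue
        dist = distances_to(topo, dst)
        if switch not in dist:
            continue  # destination unreachable under current failures
        hops = {n for n in topo.get(switch, []) if dist.get(n) == dist[switch] - 1}
        if hops:
            table[dst] = hops
    return table

if __name__ == "__main__":
    failed = {("agg1", "spine1")}          # reported to and rebroadcast by the central node
    topo = live_topology(BASELINE, failed)
    print(forwarding_table(topo, "tor1"))  # tor1 still reaches every destination via agg1/agg2
```

As described above, the central node is dynamically elected; each switch holds only the static topology and the most recent link state it receives.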
Overall, our software architecture more closely resembles
control in large-scale storage and compute platforms than
traditional networking protocols. Network protocols typically use soft state based on pair-wise message exchange, emphasizing local autonomy. We were able to use the distinguishing characteristics and needs of our datacenter deployments to simplify control and management protocols, anticipating many of the tenets of modern Software Defined Networking (SDN) deployments [12].
The datacenter networks described in this paper represent some of the largest in the world, are in deployment at dozens of sites across
the planet, and support thousands of internal and external
services, including external use through Google Cloud
Platform. Our cluster network architecture found substantial
reuse for inter-cluster networking in the same campus and
even WAN deployments [17] at Google.
2. BACKGROUND AND RELATED WORK
The tremendous growth rate of our infrastructure served as
key motivation for our work in datacenter networking. Figure 1
shows aggregate server communication rates since 2008.
Traffic has increased 50x in this time period, roughly doubling
every year. A combination of remote storage access [5, 13], large-scale data processing [10, 16], and interactive web services [3] drives
our bandwidth demands. More recently, the growth rate has
increased further with the popularity of the Google Cloud
Platform [14] running on our shared infrastructure.
In 2004, we deployed traditional cluster networks similar to [4]. This configuration supported 40 servers connected at
1Gb/s to a Top of Rack (ToR) switch with approximately 10:1
oversubscription in a cluster delivering 100Mb/s among 20k
servers. High bandwidth applications had to fit under a single
ToR to avoid the heavily oversubscribed ToR uplinks. Deploying
large clusters was important to our services because there were
many affiliated applications that benefited from high-bandwidth communication. Examples of such affiliated applications include large-scale data processing to
produce and continuously refresh a search index, web search,
and ad serving. Larger clusters also
substantially improve bin-packing efficiency for job scheduling by reducing stranding from cases where a job cannot be
scheduled in any one cluster despite the aggregate availability
of sufficient resources across multiple small clusters.
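The per-server bandwidth quoted above follows directly from the rack configuration; the quick check below uses only the figures in the text (the 10:1 ratio is approximate, so the per-server number is too).

```
# Sanity check of the 2004-era ToR numbers quoted above (approximate).
servers_per_tor = 40
server_link_gbps = 1.0
oversubscription = 10.0

tor_server_facing_gbps = servers_per_tor * server_link_gbps   # 40 Gb/s toward servers
tor_uplink_gbps = tor_server_facing_gbps / oversubscription   # ~4 Gb/s toward the rest of the cluster
per_server_mbps = tor_uplink_gbps / servers_per_tor * 1000    # ~100 Mb/s when traffic leaves the rack

print(f"~{tor_uplink_gbps:.0f} Gb/s of uplink, ~{per_server_mbps:.0f} Mb/s per server off-rack")
```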
While our traditional cluster network architecture largely
met our scale needs, it fell short in terms of overall performance
and cost. With bandwidth per host limited to 100Mbps,
packet drops associated with incast [8] and outcast [20] were severe
pain points. Increasing bandwidth per server would have substantially increased cost per server and reduced cluster scale.
We realized that existing commercial solutions could
not meet our scale, management, and cost requirements.
Hence, we decided to build our own custom data center network hardware and software. We started with the key insight
that we could scale cluster fabrics to near arbitrary size by
leveraging Clos topologies (Figure 2) and the then-emerging
(ca. 2003) merchant switching silicon industry [11]. Table 1 summarizes a number of the top-level challenges we faced in
constructing and managing building-scale network fabrics.
Figure 2. A generic three-tier Clos architecture with edge switches
(ToRs), aggregation blocks, and spine blocks. All generations of Clos
fabrics deployed in our datacenters follow variants of this architecture.