Third, for our early fabrics, we employed link-level pause at ToRs to keep servers from over-running oversubscribed uplinks. Fourth, we enabled Explicit Congestion Notification (ECN) on our switches and optimized the host stack response to ECN signals [2] (sketched below). Fifth, we monitored application bandwidth
requirements in the face of oversubscription ratios and could
provision bandwidth by deploying Pluto ToRs with four or
eight uplinks as required. Sixth, the merchant silicon had shared memory buffers used by all ports, and we tuned the buffer sharing scheme on these chips to dynamically allocate a disproportionate fraction of total chip buffer space to absorb temporary traffic bursts (also sketched below). Finally, we carefully configured switch hashing functionality to support good ECMP load balancing across multiple fabric paths.
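To make the host-side ECN response concrete, below is a minimal sketch of a DCTCP-style sender in the spirit of [2]: it keeps a moving average of the fraction of ECN-marked ACKs and shrinks the congestion window in proportion to that fraction, rather than halving it as classic TCP would. The gain G, the window floor, and the per-window bookkeeping are illustrative assumptions, not our production host stack.

```python
# Minimal sketch of a DCTCP-style ECN response [2].
# The gain G and the window floor are illustrative, not production values.

G = 1.0 / 16.0  # EWMA gain for the marked-ACK fraction

class DctcpSender:
    def __init__(self, init_cwnd=10):
        self.cwnd = float(init_cwnd)  # congestion window, in segments
        self.alpha = 0.0              # smoothed fraction of ECN-marked ACKs
        self.acked = 0                # ACKs seen this window
        self.marked = 0               # ECN-marked ACKs seen this window

    def on_ack(self, ecn_marked):
        self.acked += 1
        if ecn_marked:
            self.marked += 1
        # Roughly once per window of data, update alpha and react.
        if self.acked >= self.cwnd:
            frac = self.marked / self.acked
            self.alpha = (1 - G) * self.alpha + G * frac
            if self.marked > 0:
                # Scale the window down in proportion to congestion extent.
                self.cwnd = max(2.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0  # additive increase when unmarked
            self.acked = self.marked = 0

# Example: feed the sender a stream of ACKs with ~10% ECN marking.
sender = DctcpSender()
for i in range(100):
    sender.on_ack(ecn_marked=(i % 10 == 0))
```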
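Similarly, the buffer tuning above can be illustrated with the dynamic-threshold style of sharing common in shared-memory merchant silicon: each port may occupy up to a configurable multiple of the currently unused buffer, so an idle chip lets one bursting port temporarily claim a disproportionate share, while every port's limit contracts automatically as the chip fills. The chip model and constants below are hypothetical, not actual chip tuning.

```python
# Sketch of dynamic-threshold buffer sharing in a shared-memory switch chip.
# TOTAL_BUFFER and ALPHA are hypothetical, not actual chip tuning.

TOTAL_BUFFER = 12 * 1024 * 1024  # total packet buffer, in bytes
ALPHA = 2.0                      # per-port multiple of the *unused* buffer

class SharedBuffer:
    def __init__(self):
        self.used = 0        # bytes occupied across all ports
        self.port_used = {}  # per-port occupancy

    def admit(self, port, pkt_bytes):
        """Accept a packet only if the port is under its dynamic threshold."""
        free = TOTAL_BUFFER - self.used
        threshold = ALPHA * free  # threshold shrinks as the chip fills
        if self.port_used.get(port, 0) + pkt_bytes > threshold:
            return False          # drop: port exceeded its current share
        self.port_used[port] = self.port_used.get(port, 0) + pkt_bytes
        self.used += pkt_bytes
        return True

    def release(self, port, pkt_bytes):
        self.port_used[port] -= pkt_bytes
        self.used -= pkt_bytes
```

With this scheme, a single bursting port on an otherwise idle chip can claim up to ALPHA / (1 + ALPHA) of the total buffer, which is exactly the disproportionate-but-bounded behavior described above.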
Our congestion mitigation techniques delivered substantial improvements. We reduced the packet discard rate
in a typical Clos fabric at 25% average utilization from 1%
to < 0.01%. Further improving fabric congestion response
remains an ongoing effort.
6.2. Outages
While the overall availability of our datacenter fabrics has
been satisfactory, our outages fall into three categories representing the most common failures in production: (i) control
software problems at scale; (ii) aging hardware exposing previously unhandled failure modes; and (iii) misconfigurations
of certain components.
Control software problems at large scale. A datacenter power event once caused the entire fabric to restart simultaneously, and the control software did not converge without manual intervention. The instability arose because our liveness protocol (ND) and route computation contended for the limited CPU resources of embedded switch CPUs. On a full fabric reboot, routing experienced enormous churn, which prevented ND from responding to heartbeat messages quickly enough. The missed heartbeats then snowballed into further routing churn, as link state spuriously flapped from up to down and back up again. We stabilized the network by manually bringing up a few blocks at a time.
Going forward, we included the worst-case full-fabric reboot in our test plans. Since the largest-scale datacenter fabric could never be built in a hardware test lab, we launched efforts to stress test our control software at scale in virtualized environments. We also heavily scrutinized the timer values in our liveness protocols, tuning them for the worst case at the cost of somewhat slower reaction in the common case. Finally, we reduced the priority of non-critical processes that shared the same CPU.
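On a Linux-based embedded switch stack, that last mitigation can be as simple as lowering the scheduling priority of the route-computation worker so the liveness responder keeps meeting its heartbeat deadlines. A minimal sketch follows; the process split, work functions, and nice value are illustrative stand-ins, not the actual ND or routing processes.

```python
# Sketch: keep a liveness responder ahead of CPU-hungry route computation
# sharing one embedded CPU. The work functions and nice value are
# illustrative stand-ins.
import os
import time
import multiprocessing as mp

def recompute_routes():
    time.sleep(0.1)   # stand-in for a CPU-heavy routing pass

def answer_heartbeats():
    time.sleep(0.01)  # stand-in for replying to a liveness heartbeat

def routing_worker():
    os.nice(10)  # deprioritize: routing churn can tolerate delay
    while True:
        recompute_routes()

def liveness_worker():
    # Runs at default (higher) priority, so heartbeat replies are not
    # starved even while routing churns after a fabric-wide reboot.
    while True:
        answer_heartbeats()

if __name__ == "__main__":
    mp.Process(target=routing_worker, daemon=True).start()
    liveness_worker()
```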
Aging hardware exposes unhandled failure modes. Over years of deployment, our built-in fabric redundancy degraded as a result of aging hardware. For example, our software was vulnerable to internal/backplane link failures, leading to rare traffic blackholing. Another example centered on failures of the CPN. Each fabric chassis had dual redundant links to the CPN in active-standby mode. We initially did not actively monitor the health of both the active and standby links. With age, the vendor gear suffered from unidirectional failures of some CPN links, exposing unhandled corner cases in our routing protocols. Both of these problems would have been easier to mitigate had the proper monitoring and alerting been in place for fabric backplane and CPN links.
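The CPN lesson suggests probing every redundant link, standby included, and in both directions, since unidirectional failures pass naive up/down checks. Below is a minimal sketch of such a health monitor; send_probe and alert are hypothetical stand-ins for the fabric's probing and alerting machinery.

```python
# Sketch: actively probe active AND standby CPN links in both directions,
# so unidirectional failures are caught before a failover depends on them.
# send_probe() and alert() are hypothetical stand-ins.

def link_healthy(link):
    # A link is healthy only if probes succeed in each direction;
    # a one-way check would miss unidirectional failures.
    return (send_probe(link, direction="tx") and
            send_probe(link, direction="rx"))

def check_cpn(chassis):
    for link in (chassis.active_link, chassis.standby_link):
        if not link_healthy(link):
            # Alert even for the standby: silent standby failures are the
            # ones that turn a routine failover into an outage.
            alert(f"CPN link {link.name} on {chassis.name} unhealthy")
```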
Component misconfiguration. A prominent misconfiguration outage occurred on a Freedome fabric. Recall that a Freedome chassis runs the same codebase as the CBR, with its integrated BGP stack, and that a CLI interface to the CBR BGP stack supported configuration. We did not implement locking to prevent simultaneous read/write access to the BGP configuration. During a planned BGP reconfiguration of a Freedome block, a separate monitoring system coincidentally used the same interface to read the running config while the change was underway. Unfortunately, the resulting partial configuration led to undesirable behavior between the Freedome and its BGP peers.

We mitigated this error by quickly reverting to the previous configuration. However, it taught us to harden our operational tools further: it was not enough for tools to configure the fabric as a whole; they needed to do so in a safe, secure, and consistent way.
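The missing safeguard here is ordinary mutual exclusion around the configuration interface. As a minimal sketch, an advisory file lock can serialize readers and writers of the running config so a reader never observes a half-applied change; the lock-file path and the config accessors below are illustrative assumptions, not our actual tooling.

```python
# Sketch: serialize access to a device's running config with an advisory
# file lock, so a monitoring read can never observe a half-applied change.
# The lock path and the read/write helpers are illustrative.
import fcntl
from contextlib import contextmanager

LOCK_PATH = "/var/run/bgp_config.lock"  # hypothetical

@contextmanager
def config_lock(exclusive):
    with open(LOCK_PATH, "w") as f:
        # Writers take an exclusive lock; readers share one.
        fcntl.flock(f, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def read_running_config():
    with config_lock(exclusive=False):
        return fetch_config_via_cli()    # hypothetical helper

def apply_config(new_config):
    with config_lock(exclusive=True):
        push_config_via_cli(new_config)  # hypothetical helper
```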
7. Conclusion
This paper presents a retrospective on ten years and five generations of production datacenter networks. We employed complementary techniques to deliver more bandwidth to larger clusters than would otherwise have been possible at any cost. We built multi-stage Clos topologies from bandwidth-dense but feature-limited merchant switch silicon. Because existing routing protocols were not easily adapted to Clos topologies, we departed from conventional wisdom and built a centralized route controller that leveraged global configuration of a predefined cluster plan pushed to every datacenter switch. This centralized control extended to our management infrastructure, enabling us to eschew complex protocols in favor of best practices from managing the server fleet. Our approach has enabled us to deliver substantial bisection bandwidth for building-scale fabrics, all with significant application benefit.

Acknowledgments
Many teams contributed to the success of the datacenter network within Google. In particular, we would like to acknowledge the Platforms Networking (PlaNet) Hardware and Software Development, Platforms Software Quality Assurance (SQA), Mechanical Engineering, Cluster Engineering (CE), Network Architecture and Operations (NetOps), Global Infrastructure Group (GIG), and Site Reliability Engineering (SRE) teams, to name a few.

References
1. Al-Fares, M., Loukissas, A., Vahdat, A. A scalable, commodity data center network architecture. ACM SIGCOMM Computer Communication Review 38, 4 (2008), 63–74.
2. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review 41, 4 (2011), 63–74.
3. Barroso, L.A., Dean, J., Hölzle, U. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2 (2003), 22–28.
4. Barroso, L.A., Hölzle, U. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture 4, 1 (2009).
5. Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., et al. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011),