ers—including Amazon, Google, and
Microsoft—had received advance
notice and had completed, or nearly
completed, an initial round of patching on their hypervisors to address
these concerns. Moreover, some of
the providers informed us that they
had patched months before the vulnerabilities were publicized without
any noticeable performance impact.
For a within-an-instance attack, the attacker would have to run code on the same instance, which would require access to the system or to an application on it that could be exploited. It was
not immediately clear what needed to
be done to completely protect against
the multiple variants that could be
used in this attack. The protections
implemented by the public cloud providers remediated Meltdown, but the
Spectre variants required multiple
mitigations. Google published a binary modification technique called
Retpoline that it used to patch its
systems against Spectre Variant 2.
This had the benefit of minimal performance impact compared with CPU
patches. Mitigations for other providers included chip firmware, hypervisor patches, operating system patches, and even application rewrites.
Spectre remediation is made even
more complicated because customers and cloud providers have to work
in tandem, depending on the cloud
service in use. Our impact analysis determined that the within-an-instance
risk was not significantly increased by
running instances in the public cloud:
It was essentially the same risk faced
with the internal servers. Accordingly,
we treated it as we treated all of our
servers: by making individual, risk-based decisions.
Servers. At Goldman Sachs, server performance is critical, so we must be careful in patching our servers. In financial services, many critical applications are time-sensitive and effective only if the processing is completed rapidly—for example, applications that perform trading or large-scale, complex risk calculations. These patches could
have very real-world implications. If the hundreds of thousands of public cloud processors used every night to perform complex risk calculations had their processing speed reduced by 30%, those calculations might not finish in time to be useful.
It would have been simpler to patch
all our machines, but we were wary of
news that patches might cause significant performance impacts.
Initially, estimates of the performance impact of patching varied wildly across blogs and articles, with little official data to cite. On
January 18, 2018, Eric Siron of Altaro.com summarized that sentiment, saying, “We’ve all seen the estimations that the Meltdown patch might affect performance in the range of 5% to 30%. What we haven’t seen is a reliable data set indicating what happens in a real-world environment.”2
Those ranges were borne out in our
own testing of patches, with some
systems suffering worse slowdowns
than others. Moreover, roundtables
with other chief information security
officers indicated similar ranges.
These patches had a particularly
poor risk trade-off: high potential performance impact, imperfect security
benefit. Normally, a patch fixes a vulnerability. Because these are fundamental design vulnerabilities—and
worse, vulnerabilities in the hardware
design—the patch opportunities are
limited. Rather than fixing the underlying vulnerability, they essentially
put up a labyrinth to stop an adversary
from exploiting it, but the underlying vulnerability remains. Moreover,
our experience with complex vulnerabilities is that the first patch is often
flawed, so we expected that many of
the patches would be updated over
time—an expectation that has since proven correct.
Although patching was clearly going to be problematic, our quick triage highlighted some good news.
Exploiting these vulnerabilities required executing code locally on the
victim machine. That led to considering which parts of the operating environment are likely to run untrusted code: hypervisors in the public
cloud, employee endpoints such as
desktops and laptops, mobile devices, and the browsers or applications
that often open email attachments.
Since patches could have significant
performance impacts, every decision
would have to involve a risk trade-off.
The conclusion was that desktops were most at risk, and testing showed that the performance impact would be manageable. We thus immediately began to patch all of our desktops. For servers, we decided to investigate further and make more nuanced, risk-based decisions. The risk of cyberattack had to be balanced against the operational risk of a patch breaking or significantly slowing the systems.
There was no information that the
vulnerabilities were being actively
exploited, which was reassuring. On
the other hand, the nature of the vulnerabilities is such that exploitation
is difficult to detect. If we know a vulnerability is being exploited, we will
try to push a patch even if there is a
high risk of the patch breaking some
of the systems. With these vulnerabilities, the lack of known exploitation
reinforced the decision to take more
time assessing our servers.
To aid in this assessment of risk,
we examined our patch strategy and
compensating controls through the
following lenses: public cloud, servers, employee endpoints, browsers,
and email. These lenses also helped
communicate the risks to our business leadership.
Public cloud. Research showed
that attacks leveraging Meltdown and
Spectre could target a public cloud
environment. In certain cases, an attacker could defeat the technology
used by the public cloud providers to
ensure isolation between customers’
instances. If a malicious user were
able to bypass the hypervisor or container engine controls, then that user
could access other customers’ data
collocated on the same hardware.
Thus, our most immediate concerns were public cloud providers. The
public cloud risk could be further broken into instance-to-instance attacks
and within-an-instance attacks.
In an instance-to-instance attack,
a customer could attack another
customer on the same hypervisor.
Meltdown was the most obvious vector for this attack. An attacker could
theoretically just pay for an instance
on the public cloud and then target
any other customer on that hardware.
Fortunately, several large provid-