everyone will choose that easier option. To offset this impulse, if one option is much more costly, expose that
cost at the user decision point. For example, moving disks into the cloud is
convenient for users but much more
time-consuming (and costly) than the
alternative. The team exposed the cost
of moving disks as a 24-hour duration,
which was much less convenient than
the one-hour duration for a simple exchange of a corporate network-hosted
instance for a Cloud-hosted instance.
Simply exposing this information
when users had to choose between the
two options saved an estimated 1.8
petabytes of data moves.
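The idea of exposing cost at the decision point can be sketched in a few lines. This is a hypothetical illustration, not the team's actual tooling; the option names, descriptions, and the prompt format are invented, while the 24-hour and one-hour estimates come from the article.

```python
# Hypothetical sketch: surface the cost of each migration option at the
# decision point, so users weigh convenience against duration before choosing.

from dataclasses import dataclass

@dataclass
class MigrationOption:
    name: str          # invented identifier, not from the article
    description: str
    estimated_hours: int

OPTIONS = [
    MigrationOption(
        name="move_disk",
        description="Copy your existing disk to a Cloud-hosted instance",
        estimated_hours=24,  # duration cited in the article
    ),
    MigrationOption(
        name="fresh_instance",
        description="Exchange for a fresh Cloud-hosted instance (no data copied)",
        estimated_hours=1,   # duration cited in the article
    ),
]

def render_prompt(options: list[MigrationOption]) -> str:
    """Build the text shown to users, with each option's cost made explicit."""
    lines = ["Choose a migration path:"]
    for opt in options:
        lines.append(f"  [{opt.name}] {opt.description} (~{opt.estimated_hours}h)")
    return "\n".join(lines)
```

The point is simply that the duration appears in the same place the choice is made, rather than being buried in documentation.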
Never waste an opportunity to gather
data. Before the migration, the team
didn’t know what proportion of users
depended heavily on the contents of
their local disks. It turns out that only
about 50% of users cared enough about
preserving their disks to wait 24 hours
for the move to complete. That’s a valuable data point for future service expansions or migrations.
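The data point itself falls out of the decision records almost for free. A minimal sketch, assuming each user's choice is logged under an invented label such as "move_disk":

```python
# Hypothetical sketch: derive the "how many users actually need their local
# disk?" figure from logged migration choices. The choice labels are assumed.

def disk_preservation_rate(choices: list[str]) -> float:
    """Fraction of users who chose the slow, disk-preserving move."""
    if not choices:
        return 0.0
    return choices.count("move_disk") / len(choices)
```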
Don’t be tempted to make a special
case out of a “one-time” migration.
Your future self will be thankful if you
take the opportunity to homogenize
when making lasting changes. Previous generations of the corporate network-hosted virtual desktop system
had a slightly different on-disk layout
than the current models used for testing. Not only was this an unpleasant
surprise in production, but it was also
almost impossible to test since no existing tools would create the old disk
type. Fortunately, during the design
phase the team had resisted the urge
to “simplify” the data-copying phase
by putting user data on a second GCE
disk—doing so would have made
these instances special snowflakes for
the lifetime of the Cloud-hosted platform.
Keep the organization flexible.
Organizing the team into virtual workstreams has multiple benefits. This
strategy allowed the team to quickly
gather expertise across reporting
chains, expand and contract teams
throughout the project, reduce communication overhead between teams,
assign singular deliverable objectives
to work groups, and reduce territoriality across teams.
This is an opportunity to “get it right.”
The migration to Google Cloud allowed
the team to reconsider certain implementations that had ossified over time
within the team and organization.
Since a cloud desktop is composed of
a GCE instance running a custom image (production of which is fairly cheap
and well documented; https://cloud.
custom_images), the infrastructure
scales extraordinarily well. Very little
changed when piloting with a dozen
instances versus running with thousands, and what Google has implemented here should be directly applicable to other, smaller companies
without requiring much specialization
to the plan detailed in this article.
While the migration of virtual desktops to Cloud wasn’t painless, it has
been a solid success and a foundation
for further work. Looking to the future,
the Google Corporate Cloud Migrations team is engaged in two primary
streams of work: improving the virtual desktop experience and enabling
Google corporate server workloads to
run on Cloud.
In the desktop space, the team
plans to improve the service management experience by developing various tools that supplement the Google
Cloud platform to help manage the
fleet of cloud desktops. These add-ons
include a disk-inspection tool and a
fleet-management command-line tool
that integrates and orchestrates actions between Cloud and other corporate systems.
There are several possibilities for
improving fleet cost effectiveness. On
the simple end of the spectrum, cloud
desktop could automatically request
that owners of idle machines delete instances they don’t actually need.
Finally, the end-user experience
could be improved by implementing a
self-serve VM cold migration between
datacenters, allowing traveling users
to relocate their instances to a nearby
datacenter to reduce latency to their
VM. Note that these plans are scoped to
cloud desktop as part of the customer/
application-specific logic, as opposed
to features Google as a company is planning for Compute Engine in general.
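At its core, the self-serve relocation feature reduces to picking the lowest-latency region and cold-migrating the instance there. A minimal sketch of the selection step, with illustrative region names and latency figures that are not from the article:

```python
# Hypothetical sketch: choose the target region for a self-serve cold
# migration by minimizing measured round-trip latency from the user.

def nearest_region(latencies_ms: dict[str, float]) -> str:
    """Return the region with the smallest measured round-trip latency."""
    if not latencies_ms:
        raise ValueError("no latency measurements available")
    return min(latencies_ms, key=latencies_ms.get)
```

The migration itself would then be a cold move: stop the instance, copy or snapshot its disk to the chosen region, and recreate the instance there.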
As for server workloads, the team
is building on lessons learned from
cloud desktop to provide a migration
path. The main technical challenges in
this space include:
˲ Cataloging and characterizing the
˲ Creating scalable and auditable service and VM lifecycle management;
˲ Maintaining multiple flavors of managed operating systems;
˲ Extending BeyondCorp semantics to protocols that are hard to proxy;
˲ Tackling a new set of security and
˲ Creating performant shared storage solutions for services requiring databases;
˲ Creating migration tools to automate toilsome operations; and
˲ Implementing a number of service-specific requirements.
Migrating server workloads also has
the added organizational complexity of
a heterogeneous group of service owners, each with varying priorities and requirements from the departments and
business functions they support.
Matt Fata is a Site Reliability Manager at Google, where
he works on corporate virtualization solutions. He has
previously worked as a network engineer and as an IT
support desk manager.
Philippe-Joseph Arida is a Technical Program Manager at Google, where he works on making GCP the best platform for enterprise workloads. He previously worked as a PM at Microsoft on desktop, server, and search.
Patrick Hahn is a Site Reliability Engineer at Google and
the Technical Lead of the cloud desktop project. He has
previously worked as a sysadmin in the Web development,
managed IT, and quantitative finance industries.
Betsy Beyer is a technical writer for Google Site
Reliability Engineering in NYC, and the editor of Site
Reliability Engineering: How Google Runs Production
Systems and the Site Reliability Workbook.
Copyright held by owners/authors.
Publication rights licensed to ACM. $15.00.