ing to figure out which change caused
the problem. If you make one change at
a time, and there is a failure, the search
becomes a no-brainer. It is also easier to
back out one change than many.
Heck, even Google, with its highly
sophisticated testing technologies and
methodologies, understands that subtle differences between the staging environment and the production environment may result in deployment failures.
They “canary” their software releases:
upgrading one instance, waiting to see
if it starts properly, then upgrading the
remaining instances slowly over time.
This is not a testing methodology; it
is an insurance policy against incomplete testing. Their testing
people are excellent, but nobody is
perfect. The canary technique is now an
industry best practice and is even embedded in the Kubernetes system. (The
term canary is derived from “canary in a
coalmine.” The first instance to be upgraded dies as a warning sign that there
is a problem, just as coal miners used to
bring with them birds, usually canaries,
which are more sensitive to poisonous
gas than humans. If the canary died, it
was a sign to evacuate.)
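The canary idea can be sketched in a few lines. This is only an illustration, not how Google or Kubernetes implements it: the function name `canary_rollout` and the `upgrade` and `is_healthy` callbacks are hypothetical stand-ins for whatever your deployment tooling provides.

```python
import time
from typing import Callable, List, Sequence


def canary_rollout(
    instances: Sequence[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pause_seconds: float = 0.0,
) -> List[str]:
    """Upgrade one instance first; continue only if it starts properly.

    Returns the names of the instances that were upgraded.
    """
    if not instances:
        return []
    canary, rest = instances[0], instances[1:]
    upgrade(canary)
    if not is_healthy(canary):
        # The canary "died": stop so the other instances keep the old release.
        raise RuntimeError(f"canary {canary} failed its health check")
    upgraded = [canary]
    for name in rest:
        time.sleep(pause_seconds)  # roll out slowly over time
        upgrade(name)
        if not is_healthy(name):
            raise RuntimeError(f"{name} failed after upgrade; halting rollout")
        upgraded.append(name)
    return upgraded
```

If the canary fails, only one instance has been touched, so only one instance needs to be backed out.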
Since these problems are caused by
software being tightly coupled to a particular schema, the solution is to loosen
the coupling. The two can be decoupled
by writing software that works with multiple schemas at the same time. This
separates rollout from activation.
The first phase is to write code that
doesn’t make assumptions about the
fields in a table. In SQL terms, this means
SELECT statements should specify the
exact fields needed, rather than using
SELECT *. If you do use SELECT *, don’t
assume the fields are in a particular order. LAST_NAME may be the third field
today, but it might not be tomorrow.
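As a small illustration of this discipline, the sketch below names the fields it needs and reads them by name rather than by position. It uses Python's built-in sqlite3 module purely as a stand-in database; the `person` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be indexed by column name
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO person (first_name, last_name) VALUES ('Ada', 'Lovelace')")


def last_names(conn):
    # Name the exact fields needed instead of SELECT *, and read them by
    # name so the code never depends on where a field sits in the table.
    rows = conn.execute("SELECT id, last_name FROM person ORDER BY id")
    return [row["last_name"] for row in rows]


before = last_names(conn)
# A later, unrelated schema change does not disturb the query.
conn.execute("ALTER TABLE person ADD COLUMN title TEXT")
after = last_names(conn)
```

Here `before` and `after` hold the same list, because the query never looked at fields it did not ask for.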
With this discipline, deleting a field
from the schema is easy. New releases are
deployed that don’t use the field, and
everything just works. The schema can
be changed after all the instances are
running updated releases. In fact, since
the vestigial field is ignored, you can
procrastinate and remove it later, much
later, possibly waiting until the next
(otherwise unrelated) schema change.
Adding a new field is a simple matter
of creating it in the schema ahead of the
first software release that uses it. We use
Technique 1 (applications manage their
own schema) and deploy a release that
modifies the schema but doesn’t use the
field. With the right transactional locking
hullabaloo, the first instance that
is restarted with the new software will
cleanly update the schema. If there is
a problem, the canary will die. You can
fix the software and try a new canary.
Reverting the schema change is optional.
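One way an application might add its own field at startup is sketched below. It assumes SQLite through Python's sqlite3 module, where `BEGIN IMMEDIATE` takes the write lock up front; other databases need their own locking approach. The helper name `ensure_column` and the `person` table are hypothetical.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing. Returns True if it was added.

    `table`, `column`, and `decl` must be trusted constants from the
    application's own code; identifiers cannot be passed as SQL parameters.
    """
    # Take the write lock first so that two instances starting at the same
    # time cannot both decide the column is missing.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        added = column not in existing
        if added:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
        conn.execute("COMMIT")
        return added
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None leaves transaction control to the explicit
# BEGIN/COMMIT statements above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT)")
first_start = ensure_column(conn, "person", "title", "TEXT")
second_start = ensure_column(conn, "person", "title", "TEXT")
```

The first instance to start makes the change. Every later instance finds the field already present and does nothing, so the same release can be deployed everywhere.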
Since the schema and software are
decoupled, developers can start using
the new field at their leisure. While in
the past upgrades required finding a
maintenance window compatible with
multiple teams, now the process is decoupled
and all parties can work in a
coordinated way but not in lockstep.
More complicated changes require
more planning. When splitting a field,
removing some fields, adding others,
and so on, the fun really begins.
First, the software must be written to work with both the old and new
schemas and, most importantly, must
also handle the transition phase. Suppose you are migrating from storing a
person’s complete name in one field,
to splitting it into individual fields for
first, middle, last name, title, and so on.
The software must detect which field(s)
exist and act appropriately. It must also
work correctly while the database is in
transition and both sets of fields exist.
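A minimal sketch of such detection follows, again using SQLite via Python's sqlite3 module as a stand-in. The `person` table, the column names `full_name`, `first_name`, and `last_name`, and both function names are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def display_name(conn, person_id):
    """Return a person's name under the old, new, or transitional schema."""
    cols = column_names(conn, "person")
    wanted = [c for c in ("full_name", "first_name", "last_name") if c in cols]
    if not wanted:
        raise RuntimeError("person table has no name columns")
    row = conn.execute(
        f"SELECT {', '.join(wanted)} FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    if row is None:
        return None
    values = dict(zip(wanted, row))
    # Prefer the new fields when they are populated; otherwise fall back
    # to the old single field (an unconverted row, or the old schema).
    parts = [values.get("first_name"), values.get("last_name")]
    if any(parts):
        return " ".join(p for p in parts if p)
    return values.get("full_name")
```

Because the function asks the database which fields exist, the same release runs unchanged before, during, and after the schema change.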
Once both sets of fields exist, a batch
job might run that splits names and
stores the individual parts, nulling the
old field. The code must handle the
case where some rows are unconverted
and others are converted.
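The batch job might look something like the sketch below, which converts a limited number of rows per transaction so the table is never locked for long. It assumes SQLite via Python's sqlite3 module and hypothetical column names. The `split_name` rule (first word, then the rest) is deliberately naive; real names need far more care.

```python
import sqlite3


def split_name(full_name):
    # Naive split for illustration only: first word, then everything else.
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


def convert_batch(conn, limit=100):
    """Convert up to `limit` unconverted rows; return how many were found."""
    rows = conn.execute(
        "SELECT id, full_name FROM person WHERE full_name IS NOT NULL LIMIT ?",
        (limit,),
    ).fetchall()
    with conn:  # one transaction per batch: commit on success, else roll back
        for person_id, full_name in rows:
            first, last = split_name(full_name)
            # Store the parts and null the old field in the same statement,
            # so a row is always either fully converted or untouched.
            conn.execute(
                "UPDATE person SET first_name = ?, last_name = ?, full_name = NULL "
                "WHERE id = ? AND full_name IS NOT NULL",
                (first, last, person_id),
            )
    return len(rows)
```

A driver can call `convert_batch` repeatedly until it returns 0. The job can be stopped and restarted at any point, because rows whose old field is already NULL are skipped.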
The process for doing this conversion is documented in the accompanying sidebar “The Five Phases of a
Live Schema Change.” It has many
phases, involving creating new fields,
updating software, migrating data,
and removing old fields. This is called
the McHenry Technique in The Practice of Cloud System Administration (of
which I am coauthor with Strata R.
Chalup and Christina J. Hogan); it is
also called Expand/Contract in Release
It!: Design and Deploy Production-Ready
Software by Michael T. Nygard.
The technique is sophisticated
enough to handle the most complex
schema changes on a live distributed
system. Plus, each and every mutation
can be rolled back individually.
The number of phases can be reduced
for special cases. If one is only
adding fields, phase 5 is skipped because
there is nothing to be removed.
The process reduces to what was described
earlier in this article. Phases 4
and 5 can be combined or overlapped.
Alternatively, phase 5 from one schema
change can be merged into phase 2 of
the next schema change.
With these techniques you can roll
through the most complex schema
changes without downtime.
Using SQL databases is not an impediment to doing DevOps. Automating
schema management and a little developer discipline enable more vigorous
and repeatable testing, shorter release
cycles, and reduced business risk.
Automating releases liberates us.
It turns a worrisome, stressful, manual upgrade process into a regular
event that happens without incident.
It reduces business risk but, more
importantly, creates a more sustainable workplace.
When you can confidently deploy
new releases, you do it more frequently.
New features that previously sat unreleased for weeks or months now reach
users sooner. Bugs are fixed faster. Security holes are closed sooner. Frequent, confident releases enable the company to provide better
value to customers.
Thanks to Sam Torno, Mark Henderson,
and Taryn Pratt, SRE, Stack Overflow Inc.;
Steve Gunn, independent; Harald Wagener, iNNOVO Cloud GmbH; Andrew
Clay Shafer, Pivotal; Kristian Köhntopp,
Booking.com, Ex-MySQL AB.
Thomas A. Limoncelli is the SRE manager at Stack
Overflow Inc. in New York City. His books include The
Practice of System and Network Administration, The
Practice of Cloud System Administration, and Time
Management for System Administrators. He blogs at
EverythingSysadmin.com and tweets at @YesThatTom.
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00