bug in advance, it just leads to embarrassment when they must admit they
knew about such a risk before it actually happened.
It is difficult to say much about the
technical issue without looking into
the system itself. (Remember KV’s earlier comment about snowflakes.) The
most common way of handling this
type of freezing, which is itself not completely
satisfying, is to have a watchdog process that checks whether the system is
making progress and restarts it after a
suitable timeout when it believes the
system is stuck.
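To make the idea concrete, here is a minimal sketch of such a watchdog in Python. Everything in it is an assumption made for illustration: the `Watchdog` class, its `heartbeat` and `check` methods, and the injectable `clock` are hypothetical names, not anything taken from the system under discussion. The worker is expected to call `heartbeat()` whenever it makes progress, and some supervisor loop is expected to call `check()` periodically.

```python
import time


class Watchdog:
    """Hypothetical progress watchdog: restarts a worker that stops reporting progress."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s          # how long silence is tolerated
        self.clock = clock                  # injectable so the logic can be tested
        self.last_progress = clock()
        self.restarts = 0

    def heartbeat(self):
        """Called by the worker each time it completes a unit of work."""
        self.last_progress = self.clock()

    def is_stuck(self):
        """True when no progress has been reported for longer than the timeout."""
        return self.clock() - self.last_progress > self.timeout_s

    def check(self, restart):
        """Call restart() if the worker looks stuck; return whether it was called."""
        if not self.is_stuck():
            return False
        restart()
        self.restarts += 1
        # Give the freshly restarted worker a full timeout before judging it again.
        self.last_progress = self.clock()
        return True
```

Note that the sketch only decides when to fire. What `restart` actually does to the stuck process is left to the caller.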
There are several problems with the
watchdog approach. The first is what
the watchdog will actually do. Some
watchdogs operate by restarting a
stuck process, and they do this bluntly, by killing the process and restarting it. If the computations undertaken by the system are all idempotent,
then there is little risk because any
operation that did not complete will
be restarted from the beginning and
should have no side effects. Most systems have side effects, which means
such restarts can cause a cascade of
errors through the whole system. If
the errors are obvious, then a human
operator might be able to roll back
the system to a good, known state and
start the system again. But what if
the errors are a type of silent corruption, returning incorrect answers (as
I mentioned at the beginning of this
column)? In that case, the watchdog is
likely to do more harm than good.
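The difference between idempotent and side-effecting work can be shown with a toy example in Python. The function names and the `op-1` operation below are hypothetical, and the sketch assumes each operation carries a unique identifier and that recording the identifier happens atomically with applying the change, which a real system would have to guarantee by other means.

```python
def replay_naive(balance, ops):
    """Apply every operation blindly; an operation that is replayed is applied again."""
    for _op_id, amount in ops:
        balance += amount
    return balance


def replay_deduplicated(balance, ops, applied_ids):
    """Apply each operation at most once by remembering the IDs already applied."""
    for op_id, amount in ops:
        if op_id in applied_ids:
            continue                      # already applied before the restart
        balance += amount
        applied_ids.add(op_id)
    return balance


# What a restart can look like: "op-1" took effect, the process was killed
# before it could report completion, and the restarted process ran it again.
ops_seen = [("op-1", 100), ("op-1", 100)]

naive_balance = replay_naive(0, ops_seen)                 # 200: silently wrong
safe_balance = replay_deduplicated(0, ops_seen, set())    # 100
```

The naive version raises no error at all; it simply returns the wrong number, which is the silent corruption case.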
Even if a watchdog approach is not
otherwise harmful, there is a second
problem of choosing an appropriate
timeout duration. Since the system
becomes jammed when it does not
have enough work, some people will
want to set the watchdog timer to be
very fast so as to prevent these jams
from reducing the overall efficiency
of the system. A very short watchdog
timeout has the potential to make
the system thrash, since each restart
caused by the watchdog firing will
require the system to do work to return
to its running state. All the work
done by the system when a process is
restarted is pure overhead; it does not
help the system perform the work it
was intended to do. Conversely, setting
a watchdog timeout to be too
long risks having the system remain
stuck for long periods, again reducing
overall efficiency. Too often, the
choice of these timeouts is accomplished
by a form of black magic, referred
to as “taking a wild guess,” followed
by a heuristic, which is referred
to as “taking another wild guess,” to
see if it is better than the first.
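The trade-off can at least be written down, even if the inputs are hard to come by. The Python sketch below is a toy cost model, not a tuning method: the function `wasted_time`, its parameters, and all the numbers are made up for illustration. It assumes you have measured the gaps between progress events on a healthy system, know roughly how many real stalls occur, and know what one restart costs.

```python
def wasted_time(timeout_s, healthy_gaps_s, n_real_stalls, restart_cost_s):
    """Toy estimate of the seconds lost for one candidate watchdog timeout."""
    # A healthy gap longer than the timeout triggers a needless restart.
    spurious_restarts = sum(1 for gap in healthy_gaps_s if gap > timeout_s)
    # Each real stall sits idle for the full timeout before the watchdog fires.
    stall_wait = n_real_stalls * timeout_s
    # Every restart, needed or not, is pure overhead.
    restart_overhead = (spurious_restarts + n_real_stalls) * restart_cost_s
    return stall_wait + restart_overhead


# Made-up measurements: gaps between progress events on a healthy system (seconds).
healthy_gaps = [1, 2, 2, 3, 8]

too_short = wasted_time(1, healthy_gaps, n_real_stalls=4, restart_cost_s=10)   # 84
moderate = wasted_time(5, healthy_gaps, n_real_stalls=4, restart_cost_s=10)    # 70
too_long = wasted_time(60, healthy_gaps, n_real_stalls=4, restart_cost_s=10)   # 280
```

With these invented numbers, the very short timeout loses time to needless restarts and the very long one loses it to waiting, which is the shape of the problem described above. It does nothing about side effects caused by the restarts themselves.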
Do not underestimate the number
of production systems that use these
approaches. I believe if we truly knew
how many of the systems we depend
on every day used black magic under
the hood, we would all be more likely to
buy land in Wyoming, build bunkers,
and live in them.
Unfortunately, as KV has discussed
before, debugging distributed systems
is difficult, but it turns out that not
debugging them and having them fail
catastrophically makes for even more
difficult days.
KV
Related articles on queue.acm.org
Poisonous Programmers
Kode Vicious
https://queue.acm.org/detail.cfm?id=1348585
From the EDVAC to WEBVACs
Daniel C. Wang
https://queue.acm.org/detail.cfm?id=2756508
Too Big NOT to Fail
Pat Helland, Simon Weaver, and Ed Harris
https://queue.acm.org/detail.cfm?id=3077383
George V. Neville-Neil (kv@acm.org) is the proprietor of
Neville-Neil Consulting and co-chair of the ACM Queue
editorial board. He works on networking and operating
systems code for fun and profit, teaches courses on
various programming-related subjects, and encourages
your comments, quips, and code snips pertaining to his
Communications column.
Copyright held by author.