Viewpoints
COLLAGE BY ANDRIJ BORYS ASSOCIATES/SHUTTERSTOCK
DOI: 10.1145/3084356
Kode Vicious
Forced Exception Handling
You can never discount the human element in programming.
ure to handle nonfatal errors?”
Well, let’s see what else the paper
had to say and then think about how
software is actually implemented in
the real world, rather than how we believe
it ought to be implemented in the
illusory world that management and
marketing inhabit.
To get to the heart of why nonfatal
errors might have led to fatal errors, we
need look no further than this snippet
from the paper: “This difference is likely
because the Java compiler forces developers to catch all the checked exceptions; and a variety of errors are
expected to occur in large distributed
systems, and the developers program more defensively. However, we
found they were often simply sloppy in
handling these errors” (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf).
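To make the quoted point concrete, here is a minimal Java sketch of how a forced catch can turn into sloppy handling. The class name, method names, and the config-file scenario are all hypothetical and are not taken from the paper. The compiler insists that the checked IOException be caught, but it has no opinion on what the handler does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical example: reading a port number from a config file.
class CheckedExceptionDemo {

    // Sloppy: the catch block exists only to satisfy the compiler.
    // A missing file silently becomes port 0, and the real failure
    // shows up later, far from its cause.
    static int readPortSloppy(Path config) {
        int port = 0;
        try {
            port = Integer.parseInt(Files.readString(config).trim());
        } catch (IOException e) {
            // TODO: handle this
        }
        return port;
    }

    // More careful: the error is logged and the caller is told,
    // through the return type, that no value could be read.
    static Optional<Integer> readPortCareful(Path config) {
        try {
            return Optional.of(Integer.parseInt(Files.readString(config).trim()));
        } catch (IOException e) {
            System.err.println("cannot read " + config + ": " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Path missing = Path.of("no-such-file.conf"); // assumed not to exist
        System.out.println(readPortSloppy(missing));
        System.out.println(readPortCareful(missing));
    }
}
```

Both versions compile equally well, which is the point: the language can force the presence of a handler, not the quality of one.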
Hopefully anyone who has been a
Dear KV,
I subscribe to “The Morning Paper,” a
daily summary prepared by one person,
Adrian Colyer, who curates research papers and sends them out to interested
readers (https://blog.acolyer.org). Last
fall he reviewed “Simple Testing Can Prevent Most Critical Failures: An Analysis of
Production Failures in Distributed Data-Intensive Systems” (https://blog.acolyer.org/2016/10/06/simple-testing-can-prevent-most-critical-failures/). It had
some surprising results, including:
• Almost all catastrophic failures (48
in total, or 92%) are the result of incorrect handling of nonfatal errors explicitly signaled in software;
• Error handlers with TODO or FIXME in the comments. This example
took down a 4,000-node production
cluster; and
• Error handlers that catch an abstract
exception type (for example, Exception or
Throwable in Java) and then take drastic
action such as aborting the system. This
example brought down a whole Hadoop
Distributed File System (HDFS) cluster.
And the list went on from there.
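As a rough illustration of the last pattern in that list, the Java sketch below contrasts an over-broad handler with a narrow one. Everything in it is hypothetical: ReplicaBusyException, sendBlock, and the retry count are invented for the example and do not come from HDFS or from the paper, and a returned string stands in for actually aborting the process.

```java
// Hypothetical example of an over-broad catch versus a narrow one.
class BroadCatchDemo {

    // Invented recoverable error: the peer is temporarily busy.
    static class ReplicaBusyException extends Exception {
        ReplicaBusyException(String msg) { super(msg); }
    }

    // Invented operation that fails on the first two attempts.
    static void sendBlock(int attempt) throws ReplicaBusyException {
        if (attempt < 2) {
            throw new ReplicaBusyException("replica busy on attempt " + attempt);
        }
    }

    // Anti-pattern: Throwable matches everything, so a transient,
    // recoverable error gets the same drastic response as a real disaster.
    static String overBroad() {
        try {
            sendBlock(0);
            return "ok";
        } catch (Throwable t) {
            return "ABORT"; // stands in for System.exit(1)
        }
    }

    // Narrow handler: only the expected error is caught, and the
    // response (a bounded retry) fits that error.
    static String narrow() {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                sendBlock(attempt);
                return "ok after " + (attempt + 1) + " attempts";
            } catch (ReplicaBusyException e) {
                // Expected and transient: fall through and retry.
            }
        }
        return "gave up";
    }

    public static void main(String[] args) {
        System.out.println(overBroad());
        System.out.println(narrow());
    }
}
```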
I have been reading your Kode Vicious columns for quite a while, and
as I read the review and then the paper itself, it looked like something you
would be interested in, so I have sent
along the link.
Helpfully Not in Error
Dear Helpfully,
Yes, KV also reads “The Morning Paper,”
although he has to admit that he
does not read everything that arrives in
his inbox from that list. Of course, the
paper you mention piqued my interest,
and one of the things you did not point
out is that it is actually a study of distributed
systems failures. Now, how can we
make programming harder? I know!
Let’s take a problem on a single
system and distribute it. Someday I
would like to see a paper that tells us
if problems in distributed systems
increase along with the number of
nodes, or the number of interconnections.
Being an optimist, I can
only imagine that it is N(N + 1) / 2,
or worse.
I don’t think you pointed out this paper to KV just to hear me bang my head
on my desk while thinking distributed
systems, so let’s assume you’re asking
the “Why?” question: “Why is it the
case that 92% of the catastrophic failures in this paper are caused by a fail-