Viewpoints
IMAGE COLLAGE BY ANDRIJ BORYS ASSOCIATES/SHUTTERSTOCK
The risk, of course, is that the system
will jam, not when it is convenient for
someone to add a dummy job to clear
the jam, but during some operation
that could cause data loss or return
incorrect results. I rather suspect that
having a system like this jam while coordinating, for example, the balancing of electrical power across a power
grid would have spectacular and perhaps fatal results.
I am not saying every bug must be
fixed at the expense of doing otherwise productive work, but it is bugs
like this one that, in my experience,
tend to hit at the absolute worst possible time. If the team knew about the
Dear KV,
I have been working with a distributed
job-control system for a large computing cluster for the past year. The system
was developed in-house by one of the
co-founders of the company, and he
continues to work on it sporadically,
while a small team of us adds new features and tries to fix bugs. The code isn’t
terrible, but it has one major defect—if
the system doesn’t have enough jobs in
its queues, it tends to freeze up. I have
been working with one other person on
my team to diagnose the problem, but
it has been assigned a very low priority by management because as long as
we add dummy jobs when the system
would otherwise be idle, the bug does
not occur. I have never seen a system
act like this, and I have to wonder: Is
this kind of problem common in distributed job-control systems?
Jobless
Dear Jobless,
Is the specific problem of a system
freezing up because of starvation
common in distributed job-control
systems? It has been my experience
that each distributed system is a precious snowflake—and KV does not
like snow!
Let’s first address the high-level issue—the fact that no one cares if you
fix the bug, because if you put in dummy jobs, the system “just works.” The
phrase “just works” is one of the most
overused in computing, and what it
really indicates, in this case, is that
someone is intellectually lazy, or that
his or her motivation lies elsewhere.
“Why should we care that we’re running
our systems at 100% power draw,
when fixing the problem would cost
time and money?” Apart from the
fact that computing now consumes a
significant percentage of the world’s
electricity, leaving a bug like this unaddressed
can have other deleterious
side effects.
That a system can randomly jam
does not just indicate a serious bug
in the system; it is also a major source
of risk. You do not say what your distributed job-control system controls,
but let’s just say I hope it is not something with significant, real-world
side effects—like a power station, jet
aircraft, or financial trading system.
DOI: 10.1145/3208099
Kode Vicious
Watchdogs vs. Snowflakes
Taking wild guesses.