Viewpoints
DOI: 10.1145/2184319.2184331
Article development led by
queue.acm.org
Kode Vicious
Scale Failure
Dear KV,
I have been digging into a network-based logging system at work because,
from time to time, the system jams up,
even when there seems to be no good
reason for it to do so. What I found
would be funny, if only it were not my
job to fix it: the central dispatcher for
the entire logging system is a simple
for loop around a pair of read and
write calls; the for loop takes input
from one of a set of file descriptors and
sends output to one of another set of
file descriptors. The system works fine
as long as none of the remote readers or
writers ever blocks, and normally that is
not a problem. The problem has come
about because what was once handling
fewer than 10 machines is now handling 40, some of which are remote
across a wide area network. The obvious fix is to make the code nonblocking, but what I am surprised about is
that anyone would write code this way.
It is obvious from the first time you look
at the code that it cannot scale.
Blocked and Loopy
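
To make the nonblocking fix mentioned in the letter concrete, here is a minimal sketch in Python using the standard selectors module. It is not the code from Loopy's system, which the column does not show. The Dispatcher class, its method names, and the fixed source-to-destination routing are all assumptions made for illustration. The sketch also assumes that no socket is both a source and a destination, and it does not cap the per-destination buffer, which a production dispatcher would need to do.

```python
import selectors
import socket


class Dispatcher:
    """Hypothetical dispatcher: forwards bytes from sources to destinations
    without ever blocking on a single slow peer."""

    def __init__(self):
        self.sel = selectors.DefaultSelector()
        self.pending = {}  # destination socket -> bytearray of unsent data

    def add_route(self, src, dst):
        # Non-blocking mode: recv/send return at once instead of stalling the loop.
        src.setblocking(False)
        dst.setblocking(False)
        self.pending.setdefault(dst, bytearray())
        # Remember the destination alongside the source registration.
        self.sel.register(src, selectors.EVENT_READ, data=dst)

    def step(self, timeout=0.1):
        """Service every descriptor that is ready right now, then return."""
        for key, events in self.sel.select(timeout):
            if events & selectors.EVENT_READ:
                self._on_readable(key.fileobj, key.data)
            if events & selectors.EVENT_WRITE:
                self._on_writable(key.fileobj)

    def _on_readable(self, src, dst):
        try:
            chunk = src.recv(4096)
        except BlockingIOError:
            return  # nothing to read after all; try again on a later step
        if not chunk:
            # Empty read means the peer closed the connection.
            self.sel.unregister(src)
            return
        buf = self.pending[dst]
        if not buf:
            # First queued data for this destination: start watching
            # for the moment it can accept a write.
            self.sel.register(dst, selectors.EVENT_WRITE)
        buf += chunk

    def _on_writable(self, dst):
        buf = self.pending[dst]
        try:
            sent = dst.send(buf)  # may accept only part of the buffer
        except BlockingIOError:
            return
        del buf[:sent]
        if not buf:
            # Nothing left to send: stop watching this destination.
            self.sel.unregister(dst)
```

The point of the design is that a reader or writer that stalls only delays its own queued data. The loop goes on servicing every other descriptor, which is what the simple read-then-write for loop cannot do once one of the 40 machines sits across a slow wide area link.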
Photograph by Alicia Kubista
Dear Loopy,
I would like to say that I am sure the
original author of the code you are
looking at was not trying to torture you;
but after seeing many similar pieces of
code, it is difficult for me to continue
to accept this particular bit of make-believe.
What you are probably looking
at is “throwaway” or “prototype” code
that got away. The person who wrote
the code probably had a boss pop
into his cubicle one day with a “great
idea” to improve the logging system
by using the network and a central
dispatcher, and then ask the programmer
to code up something simple to
toss around. That something simple
is what you now see. In my mind, I see
the programmer getting the code running
and, since programmers are optimists,
being excited when it ran and
considering it done.
Dear KV,
My employer recently deployed a system on its network that is very sensitive to variations in network traffic. Although our team let people know that
the amount of load on our network
might cause problems with this particular application, it was decided to deploy
the software anyway and see what happened in production. As you can imagine, most of the time things work pretty
well; but occasionally, often because of
random misconfigurations or because
another application abuses the network