you hard. If converting some number of records took
five minutes locally, it’s going to take 500 minutes (8 1/3
hours) remotely. If I were you, I would bring a book to
work, or download a movie like the rest of your coworkers, before you start any more database conversions.
There are ways to ameliorate these problems, though
getting around the speed of light has eluded better minds
for quite a while. The first is to run all the conversions
locally; the second is to write the conversion program
so that it batches requests. This makes more efficient use
of the high-bandwidth network that your company has
probably paid a small fortune to rent. If the database conversion on the server side can process a batch of records
in less time than it takes to move all those records across
the link, then you will gain some time; but if each record
must be processed serially, then you’re always going to
suffer with slow performance over a long-distance link.
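To make the batching idea concrete, here is a minimal sketch in Python. The convert_one and convert_batch calls are hypothetical stand-ins, not any real API; the round-trip arithmetic is the point.

def convert_serially(records, server):
    # One round trip per record: N records pay N times the link latency.
    for record in records:
        server.convert_one(record)

def convert_batched(records, server, batch_size=1000):
    # One round trip per batch: N records pay roughly N / batch_size
    # times the latency, putting that expensive bandwidth to work.
    for start in range(0, len(records), batch_size):
        server.convert_batch(records[start:start + batch_size])

The knob to watch is batch_size: big enough to amortize the round trip, small enough that a failed batch is cheap to retry.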
I’ve been trying to debug a problem in a network server.
Under load the server seems to lock up, but when I look
at the process status on the system, the state is always
RUN, which means the process isn’t blocked. The program is not in an infinite loop, because each time I attach
a debugger, the code is in a different part of its processing
loop, but there is never any output or progress. How can
a program be running, not stuck in an infinite loop, and
yet produce no output?
Dead Tired from Looking for a Deadlock
What you’re looking at—actually what is staring you in
the face—is a livelock, not a deadlock. Though most people learn about deadlock when they study computer science, they rarely learn about livelock until they’re in one.
Livelock is common in software where the general flow of
the code is: read, process, write, read, process, write.
In a livelock situation the code is overwhelmed by
the incoming load of work. Many systems—including, it
would seem, your software—have a safety valve so that if
overwhelmed, they can drop unfinished work. The problem is that the safety valve may not be good enough.
If a server drops a request because it runs out of time,
memory, or other resources, then it will get stuck in a
loop where it reads a request, tries to process it, and fails,
then goes back and reads the next request, tries to process
it, and fails. For all intents and purposes the code looks
like it’s running, and it is, but it’s not succeeding at doing
what it should. It’s actually flailing and failing.
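A stripped-down sketch of that failure mode, assuming each request carries its arrival time and a 100-millisecond deadline, looks like this:

import time

DEADLINE = 0.1  # seconds a request may wait before the safety valve fires

def serve(queue):
    # queue is a collections.deque of (arrival_time, work) pairs,
    # oldest first; a real server's producer would keep refilling it.
    done = dropped = 0
    while queue:
        arrived, work = queue.popleft()
        if time.monotonic() - arrived > DEADLINE:
            dropped += 1   # too stale to bother with; drop it...
            continue       # ...and spin straight back to the read
        work()             # reached only when we are not swamped
        done += 1
    return done, dropped

Under heavy load the drop branch is taken on every iteration: the process sits in RUN the whole time, the loop never blocks, and done never moves. That is your livelock.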
One way to overcome livelock is to buy more
resources, such as faster processors or more memory. This
“throw money at the problem” solution is intellectually
cowardly and should be kept as a last resort. In many
situations it’s just not possible to get more raw hardware
power since many companies already run their servers
with the fastest processors and the maximum amount of
memory. Usually companies do this because they have
hired the wrong people to write their software (i.e., engineers who don’t think about what resources their code
is using—you know, idiots). For the nonidiots there are a
couple of other ways to deal with livelock.
A common trick is to build code into the system that
allows someone to tune the processing loop. The tuning
is fairly simple: it says that the system will process only so
many pieces of work per unit of time. Initially the knob is
set to infinite, until the system winds up in livelock, and
then the knob is turned down to just under the number
of requests that caused the livelock. Yes, this means that
the system will throw away requests for work, but it will
still be able to do work, instead of being swamped and
unable to do any work at all. Using this trick requires
your software to know how many requests it’s processing
per unit of time, so you’ll need some code to count that.
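The counting and the knob together fit in a few lines. Here is one way to sketch it in Python, with max_per_sec as the knob you turn down once you have seen where livelock sets in:

import time

class AdmissionKnob:
    # Admits at most max_per_sec requests in any one-second window
    # and tells the caller to shed the rest.
    def __init__(self, max_per_sec=float("inf")):
        self.max_per_sec = max_per_sec
        self.window_start = time.monotonic()
        self.count = 0

    def admit(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            self.window_start = now   # roll over to a new window
            self.count = 0
        if self.count >= self.max_per_sec:
            return False              # over the knob: drop this one
        self.count += 1
        return True

The default of infinite matches the advice above: the knob stays wide open until the system livelocks, and then you dial it down to just under the offending rate.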
To implement a system in which the number of
requests can be tuned, you need to design the code so
that it can throw away requests as early as possible in the
processing loop, perhaps even as far back as the read. If
the system knows that it’s overloaded, it should not waste
additional resources trying to bring in any more work.
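In practice that means consulting the knob before you read, not after. Assuming the AdmissionKnob sketched above and a socket-like connection, the shape is:

def handle_connection(conn, knob):
    # Shed at the door: if we are over the limit, spend nothing on
    # reading, parsing, or buffering work we are only going to drop.
    if not knob.admit():
        conn.close()
        return
    request = conn.recv(65536)   # only now pay to bring the bytes in
    conn.sendall(process(request))

def process(request):
    return request   # stand-in for the real work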
Often the problem in a livelock situation is related to
processing time, in that there isn’t enough of it. If you
think that time is your problem, then you need to profile
your software and see where it’s wasting, er, spending its
time. If the processing can be sped up, then obviously
your code will be able to stay out of livelock longer.
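If you want to see where the time actually goes, Python's standard cProfile will do for a first pass; here, serve and queue stand in for whatever your real processing loop and work queue are:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
serve(queue)          # run the workload under the profiler
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)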
Tuning software is a hard problem, and one that has
been covered in Queue before (February 2006). Since I'm
sure you’ve kept all your back issues, you should easily be
able to go back and reference the excellent advice therein.
KODE VICIOUS, known to mere mortals as George V.
Neville-Neil, works on networking and operating system
code for fun and profit. He also teaches courses on various
subjects related to programming. His areas of interest are
code spelunking, operating systems, and rewriting your bad
code (OK, maybe not that last one). He earned his bachelor’s
degree in computer science at Northeastern University and is
a member of ACM, the Usenix Association, and IEEE. He is an
avid bicyclist and traveler who makes San Francisco his home.