k ode vicious

you hard. If converting some number of records took five minutes locally, it’s going to take 500 minutes (8 1/3 hours) remotely. If I were you, I would bring a book to work, or download a movie like the rest of your cowork-ers, before you start any more database conversions.

There are ways to ameliorate these problems, though getting around the speed of light has eluded better minds for quite a while. The first is to run all the conversions locally; the second is to write the conversion program so that it batches requests. This makes more efficient use of the high-bandwidth network that your company has probably paid a small fortune to rent. If the database conversion on the server side can process a batch of records in less time than it takes to move all those records across the link, then you will gain some time; but if each record must be processed serially, then you’re always going to suffer with slow performance over a long-distance link.

KV

Dear KV,

I’ve been trying to debug a problem in a network server. Under load the server seems to lock up, but when I look at the process status on the system, the state is always RUN, which means the process isn’t blocked. The program is not in an infinite loop, because each time I attach a debugger, the code is in a different part of its processing loop, but there is never any output or progress. How can a program be running, not stuck in an infinite loop, and yet produce no output?

Dead Tired from Looking for a Deadlock

Dear Dead,

What you’re looking at—actually what is staring you in the face—is a livelock, not a deadlock. Though most people learn about deadlock when they study computer science, they rarely learn about livelock until they’re in one. Livelock is common in software where the general flow of the code is: read, process, write, read, process, write.

In a livelock situation the code is overwhelmed by the incoming load of work. Many systems—including, it would seem, your software—have a safety valve so that if overwhelmed, they can drop unfinished work. The problem is that the safety valve may not be good enough.

If a server drops a request because it runs out of time, memory, or other resources, then it will get stuck in a loop where it reads a request, tries to process it, and fails, then goes back and reads the next request, tries to process it, and fails. For all intents and purposes the code looks like it’s running, and it is, but it’s not succeeding at doing what it should. It’s actually flailing and failing.

One way to overcome livelock is to buy more

resources, such as faster processors or more memory. This “throw money at the problem” solution is intellectually cowardly and should be kept as a last resort. In many situations it’s just not possible to get more raw hardware power since many companies already run their servers with the fastest processors and the maximum amount of memory. Usually companies do this because they have hired the wrong people to write their software (i.e., engineers who don’t think about what resources their code is using—you know, idiots). For the nonidiots there are a couple of other ways to deal with livelock.

A common trick is to build code into the system that allows someone to tune the processing loop. The tuning is fairly simple: it says that the system will process only so many pieces of work per unit of time. Initially the knob is set to infinite, until the system winds up in livelock, and then the knob is turned down to just under the number of requests that caused the livelock. Yes, this means that the system will throw away requests for work, but it will still be able to do work, instead of being swamped and unable to do any work at all. Using this trick requires your software to know how many requests it’s processing per unit of time, so you’ll need some code to count that.

To implement a system in which the number of requests can be tuned, you need to design the code so that it can throw away requests as early as possible in the processing loop, perhaps even as far back as the read. If the system knows that it’s overloaded, it should not waste additional resources trying to bring in any more work.

Often the problem in a livelock situation is related to processing time, in that there isn’t enough of it. If you think that time is your problem, then you need to profile your software and see where it’s wasting, er, spending its time. If the processing can be sped up, then obviously your code will be able to stay out of livelock longer.

Tuning software is a hard problem, and one that has been covered in Queue before (February 2006). Since Im sure you’ve kept all your back issues, you should easily be able to go back and reference the excellent advice therein.

KV

KODE VICIOUS, known to mere mortals as George V. Neville-Neil, works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor’s degree in computer science at Northeastern University and is a member of ACM, the Usenix Association, and IEEE. He is an avid bicyclist and traveler who makes San Francisco his home. © 2008 ACM 1542-7730/08/0300 $5.00

References:

mailto:feedback@acmqueue.com

Archives