system to find bugs!” I hear the DevOps folks cry. And cry they will, because
sorting through all that data to find the
needle in the noise will definitely not
make them happier or give them the
ability to find the bug.
What is needed in any monitoring
system is the ability to increase or reduce the level of polling and data collection as system needs dictate. If you
are actively debugging a system, then
you probably want to turn the volume
of data up to 11, but if the system is
running well, you can dial the volume back down to 4 or 5. The volume
can be thought of as the polling frequency times the amount of data being captured. Perhaps you want more
frequent polling but less data per request, or perhaps you want more data
for a broader picture but polled less
frequently. These are the horizontal
and vertical adjustments you should be
able to make to your system at runtime.
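As a rough sketch of those two adjustments, the shell functions below treat volume as bytes gathered per hour and re-read a settings file before each poll, so either knob can be turned while the poller runs. The names volume_per_hour, poll_once, INTERVAL, and COLLECT_CMD, along with the settings-file layout, are assumptions made for illustration and do not come from any particular monitoring product.

```shell
# Sketch only: function names, variables, and the settings file are hypothetical.

# volume_per_hour INTERVAL_SECONDS BYTES_PER_POLL
# Volume = polling frequency x data per poll, expressed as bytes per hour.
volume_per_hour() {
    echo $(( 3600 * $2 / $1 ))
}

# poll_once SETTINGS_FILE
# Re-reads the settings (INTERVAL, COLLECT_CMD) and runs one collection,
# so edits to the file take effect on the next poll without a restart.
# Pass a path containing a slash so "." does not search PATH.
poll_once() {
    . "$1"
    eval "$COLLECT_CMD"
}

# A long-running poller would wrap it like this:
#   while :; do poll_once /path/to/settings; sleep "$INTERVAL"; done
```

With made-up numbers, `volume_per_hour 10 2000` and `volume_per_hour 60 12000` both come to 720000 bytes per hour: the first polls often and asks for little, the second polls rarely and asks for a lot.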
A one-size-fits-all monitoring system
fits no one well. The fear, of course,
is that by not having the volume at 11
you will miss something important—
and that is a valid fear—but unless the
whole reason for your existence is to
capture all events at all times, you will
have to find the right balance between
0 and maximum volume.
George V. Neville-Neil (email@example.com) is the proprietor of
Neville-Neil Consulting and co-chair of the ACM Queue
editorial board. He works on networking and operating
systems code for fun and profit, teaches courses on
various programming-related subjects, and encourages
your comments, quips, and code snips pertaining to his
Copyright held by author.
is definitely a handle for those people
somewhere on social media—you need
to find the Goldilocks zone for your
monitoring system. To find that zone,
you must first know what you’re asking for. Figure out which commands
the monitoring system is going to execute on your servers, and then run
them individually in a test environment and measure the resources they
require. You care about runtime, which
can be found to a coarse level with the
time(1) command. Here is an example from the server just mentioned.
time sysctl -a > /dev/null
sysctl -a > /dev/null 0.02s user 0.24s system 98% cpu
Here, grabbing all of the system’s
various system-control variables takes
about a quarter of a second of CPU time,
most of which is system overhead—that
is, time spent in the operating system
getting the information you requested.
The time(1) command can be used on
any utility or program you choose.
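To put a single number on a measurement like the one above, the user and system times can be added together. The helper below is a minimal sketch that assumes the portable `time -p` output format (lines beginning with real, user, and sys) rather than the shell-specific format shown above; the name cpu_seconds is made up for this example.

```shell
# cpu_seconds: read `time -p` style lines on standard input and print
# user + sys seconds, i.e. the CPU time the command actually consumed.
# (Hypothetical helper; "real" wall-clock time is deliberately ignored.)
cpu_seconds() {
    awk '$1 == "user" || $1 == "sys" { total += $2 }
         END { printf "%.2f\n", total }'
}
```

In sh or bash it could be used as `{ time -p sysctl -a > /dev/null; } 2>&1 | cpu_seconds`. Fed the figures from the run above (0.02 user, 0.24 system), it prints 0.26, the "about a quarter of a second" of CPU time.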
Now that you have a rough guess
as to the amount of CPU time that the
request might take, you need to know
how much data you’re talking about.
Using a program that counts characters, such as wc(1), will give you an
idea of how much data you’re going to
be gathering and moving off the system for each polling request.
sysctl -a | wc -c
You would be grabbing more than
a quarter of a megabyte of data here,
which in today’s world isn’t much, but
it still averages out to 6,314 bytes per
second if you poll every minute; and, in
reality, the instantaneous rate is much
higher, causing a 3Mbps blip on the
network every time you request those values.
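The arithmetic behind those figures is easy to reproduce. In the sketch below, payload_bytes, avg_rate, and burst_bits are hypothetical helper names, and the 378840-byte payload used in the note that follows is not a measured value: it is back-calculated from the 6,314 bytes per second quoted above for a 60-second polling interval.

```shell
# payload_bytes CMD [ARGS...]: bytes a command writes to standard output.
# tr strips the padding some wc implementations put around the count.
payload_bytes() {
    "$@" 2>/dev/null | wc -c | tr -d ' '
}

# avg_rate BYTES INTERVAL_SECONDS: average bytes per second over one interval
avg_rate() {
    echo $(( $1 / $2 ))
}

# burst_bits BYTES: bits that cross the network for a single poll
burst_bits() {
    echo $(( $1 * 8 ))
}
```

`avg_rate 378840 60` prints 6314, and `burst_bits 378840` prints 3030720, roughly 3 megabits; delivered within a second, that is the 3Mbps blip. On a live system, `payload_bytes sysctl -a` would supply the first argument.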
Of course, no one in his or her right
mind would just blindly dump all the
sysctl values from the kernel every
minute—you would be much more nuanced in asking for data. KV has seen
a lot of unsubtle things in his time, including monitoring systems that were
set up to do just this sort of ridiculous
level of monitoring. “We don’t want to
lose any events; we need a transparent