available on those systems anyway.
Even for cases where engineers can
access the buggy software with the
tools they need, pausing the program
in the debugger usually represents an
unacceptable disruption of production
service and an unacceptable risk
that a fat-fingered debugger command
might cause the program to crash.
Administrators often cannot take the risk
of downtime in order to understand a
failure that caused a previous outage.
More importantly, they should not have
to. Even in 1951 Gill cited the
“extravagant waste of machine time involved”
in concluding that “single-[step]
operation is a useful facility for the
maintenance engineer, but the programmer
can only regard it as a last resort.”
The most crippling problem with
in situ debugging is that it can only be used
to understand reproducible problems.
Many production issues are either very
rare or involve complex interactions
of many systems, which are often very
difficult to replicate in a development
environment. The rarity of such issues
does not make them unimportant:
quite the contrary, an operating
system crash that happens only once a
week can be extremely costly in terms
of downtime, but any bug that can be
made to occur only once a week is very
difficult to debug live. Similarly, a fatal
error that occurs once a week in an
application used by thousands of people
may result in many users hitting the bug
each day, but engineers cannot attach a
debugger on every user’s system.
So-called printf debugging is a
common technique for dealing with
the reproducibility issue. In this approach,
engineers modify the software
to log bits of relevant program state
at key points in the code. This causes
data to be collected without human
intervention so it can be examined
after a problem occurs to understand
what happened. By automating the
data collection, this technique usually
results in significantly less impact to
production service because when the
program crashes, the system can immediately
restart it without waiting
for an engineer to log in and debug the
problem interactively.
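
The technique can be illustrated with a minimal sketch. Python is used here only for illustration, and the `transfer` function, its parameters, and the logger name are hypothetical rather than taken from any system discussed in this article. The code records relevant state at key points so that the log can be read after a failure, without anyone attaching a debugger:

```python
import io
import logging


def make_logger(stream):
    """Return a logger that writes state records to `stream`."""
    logger = logging.getLogger("printf_debug_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def transfer(logger, balances, src, dst, amount):
    """Hypothetical operation instrumented with log statements."""
    # Log the inputs and relevant state on entry.
    logger.debug("transfer start src=%s dst=%s amount=%d balances=%r",
                 src, dst, amount, balances)
    if balances[src] < amount:
        # Log the state that explains the failure before raising.
        logger.error("insufficient funds src=%s have=%d need=%d",
                     src, balances[src], amount)
        raise ValueError("insufficient funds")
    balances[src] -= amount
    balances[dst] += amount
    logger.debug("transfer done balances=%r", balances)


# An in-memory stream keeps the sketch self-contained; a real service
# would more likely log to a file or a logging daemon.
log_output = io.StringIO()
log = make_logger(log_output)
accounts = {"a": 10, "b": 0}
transfer(log, accounts, "a", "b", 4)
try:
    transfer(log, accounts, "a", "b", 100)
except ValueError:
    pass  # a production program might crash here and simply be restarted
print(log_output.getvalue(), end="")
```

The log contains only what the engineer thought to record in advance, which is the main limitation of this approach: state that was not logged is gone once the program exits.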
Figure 1. A simple MDB example.
mdb core
Loading modules: [ ld.so.1 ]
> ::status
debugging core file of example1 (32-bit) from solaron
file: /export/home/dap/tmp/example1
initial argv: ./example1
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=10
> ::walk thread | ::findstack -v
stack pointer for thread 1: 8047b98
[ 08047b98 func+0x20() ]
08047bbc main+0x21(1, 8047bdc, 8047be4)
08047bd0 _start+0x80(1, 8047cc4, 0, 8047ccf, 8047cdc, 8047ced)
> func+0x20::dis
...
func+0x20: movl $0x0,(%eax)
...
use.” Most importantly, after the system saves all the program state, it can
restart the program immediately to
restore service quickly. With such systems in place, even rare bugs can often
be root-caused and fixed based on the
first occurrence, whether in development, test, or production. This enables
software vendors to fix bugs before too
many users encounter them.
To summarize, in order to root-cause failures that occur anywhere
from development to production, a
postmortem debugging facility must
satisfy several constraints:
• Application software must not
require modifications that cannot be
used in production in order to support
postmortem debugging, such as unoptimized code or additional debug data
that would significantly impact performance (or affect correctness at all).
• The facility must be always on: It
must not require an administrator to
attach a debugger or otherwise enable
postmortem support before the problem occurs.
• The facility must be fully automatic: It should detect the crash, save
program state, and then immediately
allow the system to restart the failed
component to restore service as quickly as possible.
• The dump (saved state) must be
comprehensive: A stack trace, while
probably the single most valuable
piece of information, very often does
not provide sufficient information to
root-cause a problem from a single occurrence. Usually engineers want both
global state and each thread’s state
(including stack trace and each stack
frame’s arguments and variables). Of
course, there’s a wide range of possible results in this dimension; the
“constraint” (such as it is) is that the
facility must provide enough information to be useful for nontrivial problems. The more information that can
be included in the dump, the more
likely engineers will be able to identify
the root cause based on just one occurrence.
• The dump must be transferable to
other systems for analysis. This allows
engineers to analyze the data using
whatever tools they need in a familiar
environment and obviates the need
for engineers to access production systems in many cases.
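
To make these constraints concrete, here is a small sketch of a dump-on-failure hook. It is an illustration under stated assumptions, not the facility this article describes: it uses Python's `sys.excepthook` as a stand-in for operating system core dumps, and the names `capture_dump`, `save_dump`, `install`, `run_demo`, and the `requests_served` counter are all hypothetical. The hook is installed once at startup (always on), runs without human intervention (automatic), records every frame's local variables plus selected global state (comprehensive), and writes a plain JSON file that can be copied to another machine (transferable).

```python
import json
import os
import sys
import tempfile
import time
import traceback


def capture_dump(exc, global_state):
    """Build a JSON-serializable record of an exception and its context."""
    frames = []
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": lineno,
            # repr() keeps arbitrary objects serializable.
            "locals": {name: repr(val) for name, val in frame.f_locals.items()},
        })
    return {
        "time": time.time(),
        "error": f"{type(exc).__name__}: {exc}",
        "frames": frames,  # outermost frame first, failing frame last
        "globals": {name: repr(val) for name, val in global_state.items()},
    }


def save_dump(dump, directory):
    """Write the dump as a plain JSON file that can be copied elsewhere."""
    path = os.path.join(directory, f"dump-{os.getpid()}.json")
    with open(path, "w") as f:
        json.dump(dump, f, indent=2)
    return path


def install(directory, global_state):
    """Call once at startup so the hook is in place before any failure."""
    def hook(exc_type, exc, tb):
        path = save_dump(capture_dump(exc, global_state), directory)
        sys.stderr.write(f"uncaught {exc_type.__name__}; dump saved to {path}\n")
        # After the hook returns, the interpreter exits with a nonzero
        # status, so an external supervisor can restart the program.
    sys.excepthook = hook


def func(ptr):
    return ptr["value"]  # raises TypeError when ptr is None


def main():
    return func(None)


def run_demo():
    """Simulate a crash without terminating this example script."""
    try:
        main()
    except TypeError as exc:
        dump = capture_dump(exc, {"requests_served": 3})
        return dump, save_dump(dump, tempfile.mkdtemp())


demo, demo_path = run_demo()
print(demo["error"])
print([f["function"] for f in demo["frames"]])
```

In a real program, `install()` would be called at startup and the demo function would be unnecessary. The sketch falls short of the comprehensiveness constraint in two ways: it records only the failing thread, and it stores `repr()` strings rather than the full object graph. A complete facility would capture each thread's state and enough memory to inspect arbitrary data structures.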