EA That’s my experience in debugging the debugger.
SB Just an anecdote here. A debugger is the only program
I’ve ever written where three machines are playing: the
machine you’re debugging, the machine you’re running
on, and the machine the debugger was compiled on.
DTrace seems to be a real leap forward in debugging.
Bryan, can you tell us a little bit about why you did it and
what you wish you had done differently?
BC The reason we did it is the same reason you guys
did your things: we needed it. We were trying to debug
incredibly complicated systems, and Sun was building
larger and larger systems with SMP. Our systems got
dramatically larger and more complicated in a very short
period of time. We had an SMP kernel in Solaris, and we
were struggling to understand the system when it failed
fatally.
That’s why Mike [Shapiro] developed mdb, and I
helped him by developing some of the intelligent modules
we can plug into mdb. Once we had actually diagnosed
the fatal failures of the system, then we had this problem
of the transient failures. Why does the system suck?
EA Heisenbugs.
BC Yes, the software is up, it’s functioning correctly, but
it’s sucking at some level. I was working in the Solaris
performance group—that’s why I originally came to
Sun, to work with [Sun Fellow and CTO for storage] Jeff
Bonwick—and we were grasping at straws using these
tools that would give you only the happy/sad state of the
system. The tools would tell us, “Here’s the number of
operations you’re doing, here’s your percent utilization,”
and so on, and then we would try to back-calculate where
we were in the software stack. The problem is that you’re
looking at the lowest layer of the software stack, trying to
draw inferences about the highest layer of the software
stack.
What we didn’t realize when we set out to do DTrace
is how acute this problem was, and the problem was so
much worse for people who were developing Java or PHP
or Python or Ruby, because they’re at an even higher level
of abstraction. They’re inducing more unintended work
out of the system. They’ve got systems that suck even
more than ours.
That’s the reason we developed DTrace. Historically,
we have had two branches of our code. We have had the
branch that is debuggable, with all this ifdef debug junk
in it, and then we’ve had the branch that we ship.
It’s kind of absurd that where the bugs are most critical—in those production environments—we’ve got the
least amount of infrastructure to understand what is
going on.
EA I’ve argued that you just ship it with the debugging in
it for two reasons: first, you don’t want to ship something different from what you tested; and second, you
always need to test stuff.
BC There’s a certain level of debugging infrastructure that
you should ship, but the problem is that when you are
looking at the debugging infrastructure in the very bowels of the system, that debugging infrastructure has costs
associated with it. Even if it’s as simple as loading a flag
to indicate that something is not enabled, that’s a load,
a compare, and a branch. That costs. And when you do a
load, a compare, and a branch when you are scheduling
a thread, you will have a system that is too slow to ship.
You’ll have the Linux guys laughing at you because your
scheduler is slow. It’s a little hard to make the argument,
“Our scheduler is slow because we need to debug it when
it’s broken.”
That’s not something a user of that scheduler wants to
hear. We realized we needed to change that model, and
that’s what DTrace does. My final observation on DTrace
is that there should be no probe effect when the instrumentation is disabled. If I’m not asking the question,
then my app runs just as fast as if it weren’t there at all.
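
[Editor's sketch, not from the conversation: the contrast BC draws above can be illustrated roughly as follows. This is only an analogy in Python, under the assumption that swapping a function reference stands in for enabling a probe; it is not how DTrace itself is implemented. All names (DEBUG_ENABLED, schedule_with_flag, Scheduler, enable_probe) are hypothetical.]

```python
# Hypothetical illustration: a flag-guarded hot path versus instrumentation
# that is only installed when someone asks a question.

DEBUG_ENABLED = False   # hypothetical debug flag compiled into the hot path
trace_log = []          # where enabled instrumentation records events


def schedule_with_flag(thread_id):
    """Static debugging code: every call pays for the flag check."""
    if DEBUG_ENABLED:                       # load, compare, branch on each call
        trace_log.append(("flag", thread_id))
    return thread_id % 4                    # stand-in for choosing a CPU


def schedule_plain(thread_id):
    """The same hot path with no debugging code in it at all."""
    return thread_id % 4


class Scheduler:
    """Holds whichever version of the hot path is currently installed."""

    def __init__(self):
        # Disabled state: the uninstrumented function runs, with no check.
        self.schedule = schedule_plain

    def enable_probe(self):
        # Instrumentation is swapped in only on request.
        def traced(thread_id):
            trace_log.append(("probe", thread_id))
            return schedule_plain(thread_id)
        self.schedule = traced

    def disable_probe(self):
        # Restore the original function; nothing is left behind.
        self.schedule = schedule_plain


sched = Scheduler()
cpu_before = sched.schedule(7)   # probe disabled: nothing is recorded
sched.enable_probe()
cpu_during = sched.schedule(7)   # probe enabled: one event is recorded
sched.disable_probe()
```

[In the flag-guarded version the check runs on every call, whether or not anyone is debugging. In the swapped-in version the disabled path is the original function, so the cost appears only while the probe is enabled.]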
EA Right, and that’s pretty profound.
SB I’d like to ask Eric a similar question. Debugging code
that deals with network events as opposed to system
events is a different game. I’d be interested in knowing
what the challenges have been in debugging sendmail
over the years.
EA The obvious one is you’re dealing with multiple
machines running a protocol at the same time. It’s
sometimes not clear which end of the connection you’re
debugging. You need to make sure that the output gets
someplace usable, which is actually harder than it looks.
A lot of people don’t realize I wrote syslog, which is a
standard tool now, in the process of writing sendmail,
precisely because I needed to have this place where it
would go, which was not stdout, because stdout wasn’t
connected to anything. At that point, there were log files
scattered all over the system, so I used MPX files. That’s
what I originally built syslog on.
Other things are a little subtler—timing issues, for
example. You’re sometimes dealing with TCP/IP. TCP/IP
implementations vary far more than anyone really
wants to admit, and certainly for SMTP a lot of these little
things can turn into great big things.
BC This is an interesting common theme among the
three of us. Eric developed syslog because it was a tool he needed to understand and debug sendmail; we
developed DTrace because it was a tool that we needed to