and turn into nicely formatted output suitable for documenting a program. It can produce Unix man pages,
LaTeX, HTML, RTF, PostScript, and PDF.
What is most interesting for the code spelunker is
Doxygen’s ability to extract information from any source
code by running preprocessors over the code in question. Doxygen is a static analysis tool in that it analyzes
the source code of a program but does not look into
the program state while it is running. The great thing
about a static analysis tool is that it can be run at any
time and does not require that the software be executing. In analyzing something like an operating system,
this is extremely helpful. The features that make Doxygen
most relevant to our work are those related to how data
is extracted from the source code. When you start out
with the intention of documenting your own code with
Doxygen, you are already working with the system and
very little extra needs to be done. If you’re code spelunking an unknown code base, then you will need to be
more aggressive and manually turn on certain features in
the Doxyfile, which is Doxygen’s configuration file. These
features are listed here:
Feature          Meaning
EXTRACT_ALL      Extract everything you can from the source code
SOURCE_BROWSER   Create a full cross reference of the source code
CLASS_DIAGRAMS   Create class diagrams and inheritance graphs
HAVE_DOT         Create useful code spelunking graphs
CALL_GRAPH       Make a call graph following all function calls
CALLER_GRAPH     Output a graph of the caller dependencies
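As a rough illustration, the features listed above might be switched on in a Doxyfile like this. This is a minimal sketch, not a complete configuration: the INPUT path is a hypothetical placeholder, and RECURSIVE is an extra Doxygen option, not part of the list above, assumed here so that subdirectories are scanned. HAVE_DOT also assumes the Graphviz dot tool is installed, and newer Doxygen releases may report some tags, such as CLASS_DIAGRAMS, as obsolete.

```
# Hypothetical source location; substitute the tree being explored.
INPUT          = path/to/source
# Assumption: scan subdirectories as well (not in the article's list).
RECURSIVE      = YES

# Options from the feature list above.
EXTRACT_ALL    = YES
SOURCE_BROWSER = YES
CLASS_DIAGRAMS = YES
HAVE_DOT       = YES
CALL_GRAPH     = YES
CALLER_GRAPH   = YES
```

Running doxygen with this file in the current directory would then produce the cross-referenced output and graphs, subject to the assumptions noted above.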
The option HAVE_DOT is the most important one
because it’s what allows Doxygen to generate the most
useful output for the code spelunker, including class, collaboration, call, and caller graphs. We’ll take a brief look
at two of these types of graphs: call and caller. The code
that we’re analyzing in this article is the TCP/IP stack of
the FreeBSD operating system. The BSD TCP/IP stack has
been studied in the past12 and continues to be studied by
researchers working on the TCP/IP suite.
For our examples we look at a single function, ip_output(), which is called in various parts of the network
stack in order to send an IP datagram to the network.
The ip_output() function is quite important to the stack
because all normal packet transmissions flow through it.
If a bug were found in this function, or if the API needed
to be changed for some reason, it would be important to
trace back all of the current users (callers) of the function.
In figure 1 we see the caller graph produced by Doxygen
for ip_output().
In figure 1 no fewer than 16 separate routines, in
nearly as many modules, depend on the ip_output() function. To effect a fix or update the API, all of these routines
need to be investigated.
The opposite of a caller graph is a call graph. A call
graph is familiar to users of the tools mentioned previously, such as Cscope and global, which allow the user
to move interactively through the call graph of a function, jumping into and out of underlying functions while
browsing the source code. Doxygen gives us a different
way of interacting with the call graph. Figure 2 shows the
call graph for the ip_output() function.
The call graph, like the caller graph, provides a good
visual overview of how the function fits into the overall system. Both of these figures function as maps from
which we can derive clues as to how the software is structured. One clue that is relatively easy to see is another
hot spot in the packet output code, namely tcp_output(),
which is called from seven different routines.
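The relationship between the two kinds of graph can be sketched in a few lines of Python. This is an illustration only, not how Doxygen computes its graphs: the edges below are a toy example chosen for readability rather than extracted from the FreeBSD sources, and the helper names invert and transitive_callers are hypothetical. A call graph maps each function to the functions it calls; inverting it yields the caller graph, and walking the caller graph gives every routine that would need to be checked if a function changed.

```python
from collections import defaultdict

# Toy call graph: caller -> functions it calls directly.
# These edges are illustrative only, not extracted from FreeBSD.
CALLS = {
    "tcp_output": ["ip_output"],
    "udp_output": ["ip_output"],
    "ip_output": ["ip_fragment"],
    "ip_fragment": [],
}


def invert(call_graph):
    """Turn a call graph (caller -> callees) into a caller graph (callee -> callers)."""
    callers = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)
    return {name: sorted(who) for name, who in callers.items()}


def transitive_callers(caller_graph, func):
    """Return every function that can reach func, directly or indirectly."""
    seen, stack = set(), [func]
    while stack:
        for caller in caller_graph.get(stack.pop(), []):
            if caller not in seen:
                seen.add(caller)
                stack.append(caller)
    return sorted(seen)


CALLERS = invert(CALLS)
print(CALLERS["ip_output"])                        # direct callers
print(transitive_callers(CALLERS, "ip_fragment"))  # everything upstream
```

The direct callers correspond to one level of a caller graph, while the transitive walk corresponds to following the graph all the way up.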
The kind of information that Doxygen can show
comes at a price. Generating the graphs shown here,
which required analyzing 136 files consisting of 125,000
lines of code, took 45 minutes on a dual-core 2.5-GHz
MacBook Pro laptop. Most of the time was taken up by
generating the call and caller graphs, which are by far
the most useful pieces of information to a code spelunker. 5
DTrace. One of the most talked-about system tools in
the past few years is DTrace, a project from Sun Microsystems released under the CDDL (Common Development
and Distribution License) that has been ported to the
FreeBSD and Mac OS X operating systems. Regardless of
whether the designers of DTrace were specifically targeting code spelunking when they wrote their tool, it is
clearly applicable.
DTrace has several components: a command-line
program, a language, and a set of probes that give information about various events that occur throughout the
system. The system was designed such that it could be
run against an application for which the user had no
source code.
DTrace is the next logical step in the line of program-tracing tools that came before it, such as ktrace
and truss. What DTrace brings to code spelunking is a
much richer set of primitives, both in terms of its set of
probes and the D language, which makes it easier for
code spelunkers to answer the questions they have. A
program such as ktrace shows only the system calls that
the traced program makes while it’s running, that is,
the application’s interactions with the operating system.
ACM QUEUE November/December 2008 29