and turn into nicely formatted output suitable for documenting a program. It can produce Unix man pages, LaTeX, HTML, RTF, PostScript, and PDF.
What is most interesting for the code spelunker is Doxygen’s ability to extract information from any source code by running preprocessors over the code in question. Doxygen is a static analysis tool in that it analyzes the source code of a program but does not look into the program state while it is running. The great thing about a static analysis tool is that it can be run at any time and does not require that the software be executing. In analyzing something like an operating system, this is extremely helpful. The features that make Doxygen most relevant to our work are those related to how data is extracted from the source code. When you start out with the intention of documenting your own code with Doxygen, you are already working with the system and very little extra needs to be done. If you’re code spelunking an unknown code base, then you will need to be more aggressive and manually turn on certain features in the Doxyfile, which is Doxygen’s configuration file. These features are listed here:
Feature          Meaning
EXTRACT_ALL      Extract everything you can from the source code
SOURCE_BROWSER   Create a full cross reference of the source code
CLASS_DIAGRAMS   Create class diagrams and inheritance graphs
HAVE_DOT         Create useful code spelunking graphs
CALL_GRAPH       Make a call graph following all function calls
CALLER_GRAPH     Output a graph of the caller dependencies
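As a rough sketch of how the options listed above might be switched on together, the following Python snippet renders them as Doxyfile lines. The option names come from the list above; INPUT and RECURSIVE are standard Doxyfile settings added here as an assumption so that Doxygen knows where to look, and the source path is a placeholder. Check the option names against the Doxygen version you have installed, since the set of supported options changes between releases.

```python
# Doxyfile settings named in the feature list, all switched on.
SPELUNKING_OPTIONS = {
    "EXTRACT_ALL": "YES",
    "SOURCE_BROWSER": "YES",
    "CLASS_DIAGRAMS": "YES",
    "HAVE_DOT": "YES",
    "CALL_GRAPH": "YES",
    "CALLER_GRAPH": "YES",
}


def render_overrides(source_dir, options=SPELUNKING_OPTIONS):
    """Return Doxyfile text that points Doxygen at source_dir with options set."""
    # INPUT and RECURSIVE are assumptions added for this sketch.
    lines = [f"INPUT = {source_dir}", "RECURSIVE = YES"]
    lines += [f"{name} = {value}" for name, value in options.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "sys/netinet" is a placeholder path, not one taken from the article.
    print(render_overrides("sys/netinet"), end="")
```

The printed lines can be appended to an existing Doxyfile. Doxygen can also read its configuration from standard input when invoked as `doxygen -`, so the output could be piped in after the contents of a base Doxyfile.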
The option HAVE_DOT is the most important one because it’s what allows Doxygen to generate the most useful output for the code spelunker, including class, collaboration, call, and caller graphs. We’ll take a brief look at two of these types of graphs: call and caller. The code that we’re analyzing in this article is the TCP/IP stack of the FreeBSD operating system. The BSD TCP/IP stack has been studied in the past12 and continues to be studied by researchers working on the TCP/IP suite.
For our examples we look at a single function, ip_output(), which is called in various parts of the network stack in order to send an IP datagram to the network. The ip_output() function is quite important to the stack because all normal packet transmissions flow through it. If a bug were found in this function, or if the API needed to be changed for some reason, it would be important to trace back all of the current users (callers) of the function. In figure 1 we see the caller graph produced by Doxygen for ip_output().
In figure 1 no fewer than 16 separate routines, in nearly as many modules, depend on the ip_output() function. To effect a fix or update the API, all of these routines need to be investigated.
The opposite of a caller graph is a call graph. A call graph is familiar to users of the tools mentioned previously, such as Cscope and global, which allow the user to move interactively through the call graph of a function, jumping into and out of underlying functions while browsing the source code. Doxygen gives us a different way of interacting with the call graph. Figure 2 shows the call graph for the ip_output() function.
The call graph, like the caller graph, provides a good visual overview of how the function fits into the overall system. Both of these figures function as maps from which we can derive clues as to how the software is structured. One clue that is relatively easy to see is another hot spot in the packet output code, namely tcp_output(), which is called from seven different routines.
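The relationship between the two kinds of graph can be sketched in a few lines of Python: a caller graph is a call graph with every edge reversed. The graph below is a toy. Apart from ip_output and tcp_output, the function names are hypothetical, and the edges are illustrative rather than a reproduction of figures 1 and 2.

```python
from collections import defaultdict

# Toy call graph: each function maps to the functions it calls.
# sender_a, helper_x, and helper_y are made-up names for illustration.
CALL_GRAPH = {
    "tcp_output": ["ip_output"],
    "sender_a": ["ip_output", "tcp_output"],
    "ip_output": ["helper_x", "helper_y"],
}


def invert(call_graph):
    """Turn a call graph (caller -> callees) into a caller graph (callee -> callers)."""
    callers = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)
    # Sort so the result is stable and easy to read.
    return {name: sorted(who) for name, who in callers.items()}


if __name__ == "__main__":
    for callee, who in sorted(invert(CALL_GRAPH).items()):
        print(f"{callee} <- {', '.join(who)}")
```

Counting the callers of each function in the inverted graph is one simple way to spot hot spots of the kind described above.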
The kind of information that Doxygen can show comes at a price. Generating the graphs shown here, which required analyzing 136 files consisting of 125,000 lines of code, took 45 minutes on a dual-core 2.5-GHz MacBook Pro laptop. Most of the time was taken up by generating the call and caller graphs, which are by far the most useful pieces of information to a code spelunker. 5
DTrace. One of the most talked-about system tools in the past few years is DTrace, a project from Sun Microsystems released under the CDDL (Common Development and Distribution License) that has been ported to the FreeBSD and Mac OS X operating systems. Regardless of whether the designers of DTrace were specifically targeting code spelunking when they wrote their tool, it is clearly applicable.
DTrace has several components: a command-line program, a language, and a set of probes that give information about various events that occur throughout the system. The system was designed such that it could be run against an application for which the user had no source code.
DTrace is the next logical step in the line of program-tracing tools that came before it, such as ktrace and truss. What DTrace brings to code spelunking is a much richer set of primitives, both in its set of probes and in the D language, which makes it easier for code spelunkers to answer the questions they have. A program such as ktrace shows only the system calls that the traced program executes while it's running, that is, the application's interactions with the operating system.
ACM QUEUE November/December 2008 29