Figure 3. Abbreviated output from
the execution of ls under DTrace.
> sudo dtrace -s calls.d -c ls
dtrace: script ‘calls.d’ matched
[output of ls command removed for brevity]
dtrace: pid 7008 has exited
strcoll_l 148
__error 326
_none_mbrtowc 662
pthread_getspecific 1424
a quick example will demonstrate how
it can be used for code spelunking.
When presented with a new and unknown system to spelunk, one of the first
things to find out is which services the
program uses. Programs like ktrace and
truss can show this type of information,
but DTrace extends this ability greatly.
We will now find out what services the ls
program requires to execute as well as
which ones are used most often.
3: @[probefunc] = count();
The script here is written in the D
language and should be relatively easy
to decipher for anyone familiar with C.
The script contains a single function,
which counts the entry into any call that
the ls program makes. Where a C programmer might find a function name
and argument list we instead see what
is called a predicate (a probe description, in DTrace's own terms). The predicate is a
way of selecting the probes that DTrace
will record data for. The predicate on
line 1 selects the entry into any call for
the associated process. When the calls.d
script is executed with dtrace in Figure
3, its pid$ variable is replaced with the
process ID of the program that is given
after the -c command-line argument.
DTrace also allows the tracing of live
processes by replacing -c with -p and
the program name with a live process
ID. Figure 3 gives abbreviated output
from the execution of ls under DTrace.
Only the last several lines, those with
high entry counts, are shown. From this
snapshot we can see that ls does a lot of
work with the string functions strcoll
and strcmp, and if we were trying to optimize the program we might look first
at where these functions were called.
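As a minimal sketch of that first step, the aggregation that DTrace prints can be ranked with a few lines of Python. The sample text is modeled on the counts reported in Figure 3; the helper names parse_counts and hottest are hypothetical, and the parser assumes each aggregation line is simply a function name followed by a count.

```python
import re

# Sample lines modeled on the counts reported in Figure 3.
SAMPLE_OUTPUT = """\
dtrace: pid 7008 has exited
strcoll_l 148
__error 326
_none_mbrtowc 662
pthread_getspecific 1424
"""

# Assumed line shape: a function name, whitespace, then an integer count.
LINE_PATTERN = re.compile(r"^\s*(\S+)\s+(\d+)\s*$")


def parse_counts(text):
    """Return {function_name: count} for every line shaped like 'name count'."""
    counts = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:  # banner lines such as "dtrace: pid ... has exited" are skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


def hottest(counts, limit=3):
    """Return the `limit` most frequently entered functions, busiest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    for name, count in hottest(parse_counts(SAMPLE_OUTPUT)):
        print(f"{count:6d}  {name}")
```

Any line of program output that happens to look like a name followed by a number would also be counted, so in practice the program's own output should be separated from the DTrace output before parsing.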
With thousands of predefined probe
points, and the ability to dynamically
create probes for user processes, it’s
obvious that DTrace is the most powerful code spelunking tool developed in
the last decade.
In reviewing the tools mentioned
here—as well as those that are not—a
few challenges remain apparent. The
first challenge is the move by some
developers away from a tool-based approach to an all-in-one approach.
A tool-based approach can best be
understood by looking at the programs
available on any Unix-like system. The
use of several programs, mixed and
matched, to complete a task has several obvious benefits that are well documented by others. When working with
large code bases, the drawbacks of an
all-in-one approach, such as an IDE,
become a bit clearer. A system such as
the FreeBSD kernel is already several
hundred megabytes of text. Processing
that code base with tools like Cscope
and global in order to make it more easily navigable generates a further 175MB
of data. Although 175MB of data may be
small in comparison to the memory of
the average desktop or laptop, which
routinely comes with 2GB to 4GB of
RAM, storing all that state in memory
while processing leads to lower performance in whatever tool is being used.
The pipeline processing of data, which
keeps in-memory data small, improves
the responsiveness of the tools involved. Loading the FreeBSD kernel into
Eclipse took quite a long time and then
took up several hundred megabytes of
RAM. I have seen similar results with
other IDEs on other large code bases.
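The point about pipelines keeping in-memory data small can be sketched with Python generators, which pass one line at a time from stage to stage much as a shell pipeline does. This is only an illustration under stated assumptions: the stage names read_lines, grep, and head are hypothetical, and the sample declarations stand in for a large source file.

```python
import io
import itertools


def read_lines(stream):
    """Yield one line at a time; only the current line is held in memory."""
    for line in stream:
        yield line.rstrip("\n")


def grep(lines, needle):
    """Pass through only the lines containing `needle`, like a grep stage."""
    return (line for line in lines if needle in line)


def head(lines, limit):
    """Stop pulling from the earlier stages after `limit` lines."""
    return itertools.islice(lines, limit)


if __name__ == "__main__":
    # Hypothetical stand-in for a large source file; a real run would pass
    # an open file object instead of this in-memory sample.
    source = io.StringIO(
        "int tcp_input(void);\n"
        "static int helper(void);\n"
        "int tcp_output(void);\n"
        "int udp_input(void);\n"
    )
    for line in head(grep(read_lines(source), "tcp_"), 10):
        print(line)
```

Because every stage is lazy, no stage ever builds a list of the whole input, so memory use stays roughly constant regardless of the size of the code base being searched.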
An even larger challenge looms for
those who work on not only large, but
heterogeneous, code bases. Most Web
sites today are a melange of PHP or Python with C or C++ extensions, using
MySQL or PostgreSQL as a database
backend, all on top of an OS written
in C. It is often the case that tracking
down particularly difficult problems
requires crossing language barriers
several times—from PHP into C++ and
then into SQL, then perhaps back to C
or C++. Thus far I have seen no evidence
of tools that understand how to analyze
these cross-language interactions.
The area that deserves the most attention is visualization. Of all the tools
reviewed, only Doxygen generates interesting and usable visual output. The
other tools have a very narrow, code-based focus in which the user is usually looking at only a small part of the
system being investigated.
Working in this way is a bit like trying to understand the United States by
staring at a street sign in New York. The
ability to look at a high-level representation of the underlying system without
the fine details would be perhaps the
best tool for the code spelunker. Being able to think of software as a map
that can be navigated in different ways,
for instance, by class relations and call
graphs, would make code spelunkers
far more productive.
One last area that has not been
covered is the network. Network spelunking, the ability to understand an
application based on its network traffic, is still in a very nascent state, with
tools like Wireshark being the state of
the art. Many applications are already
running online, and being able to
understand and work with them at the
network level is very important.
1. CScope Man Page; http://cscope.sourceforge.net/
2. Doxygen Web Site; http://www.stack.nl/~dimitri/
3. GNU GLOBAL Source Code Tag System. Tama
Communications Corp., Apr. 21, 2008.
4. gprof; http://www.gnu.org/manual/gprof-2.9.1/gprof .
5. Graphviz Web Site; http://www.graphviz.org/.
6. How To Use DTrace. Sun Microsystems, 2005.
Available on the Web at http://www.sun.com/
7. HPC System Call Usage Trends. Terry Jones, Andrew
Tauferner, Todd Inglett. Linux Clusters Institute, 2007.
8. ktrace: standard tool on open source OSes.
9. Neville-Neil, G.V. Code spelunking: Exploring
cavernous code bases. ACM Queue (Sept. 2003). ACM,
10. Sun Microsystems. Solaris Dynamic Tracing Guide
11. Truss is available on Solaris.
12. Wright, G.R. and Stevens, W.R. TCP/IP Illustrated, Vol.
2: The Implementation. Addison-Wesley Professional.
George V. Neville-Neil (firstname.lastname@example.org) is a columnist for
Communications and ACM Queue, as well as a member
of the Queue Editorial Board. He works on networking and
operating system code and teaches courses on various
subjects related to programming.
© 2008 ACM 0001-0782/08/1000 $5.00