with many tools, the plotting of data
can mislead you as easily as it can lead
you somewhere.
There are plenty of tools with which
to plot your data, and I usually shy away
from advocating particular tools in
these responses, but I can say that if you
were trying to plot a lot of data, where a
lot is more than 32,767 elements, you
would be wise to use something like
gnuplot. Every time I’ve seen people try
to use a certain vendor’s spreadsheet to
plot data sets larger than 32,767, things
have gone awry—I might even say that
they were brought up “short” by that
particular program. The advantage of
gnuplot is that as long as you have a lot
of memory (and memory is inexpensive now), you can plot very large data
sets. KV recently outfitted a machine
with 24GB of RAM just to plot some important data. I’m a big believer in big
memory for data, though not for programs,
but let’s just stop that digression here.
Let’s now walk through the important points to remember when plotting
data. The first is that if you intend to
compare several plots, your measurement axis—the one on which you’re
showing the magnitude of a value—
absolutely must remain constant or
be easily comparable among the total
set of graphs that you generate. A plot
with a y-axis that goes from 0 to 10 and
another with a y-axis from 0 to 25 may
look the same, but their meaning is
completely different. If the data you’re
plotting runs from 0 to 25, then all of
your graphs should run from, for example, 0 to 30. Why would you waste those
last five ticks? Because when you’re
generating data from a large data set,
you might have missed something, perhaps a crazy outlier that goes to 60, but
only on every 1,000th sample. If you set
the limits of your axes too tightly initially, then you might never find those
outliers, and you would have done an
awful lot of work to convince yourself—
and whoever else sees your pretty little
plot—that there really isn’t a problem,
when in fact it was right under your
nose, or more correctly, right above the
limit of your graph.
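
As a rough illustration of the point above, here is a minimal Python sketch that picks one set of y-axis limits for a whole group of plots, leaves some headroom, and then counts how many points in the full data set land outside those limits. The function names (shared_y_limits, count_clipped), the headroom and step parameters, and the synthetic data are all assumptions made for this example; they do not come from any plotting tool.

```python
import math

def shared_y_limits(datasets, headroom=5, step=5):
    """Return one (low, high) pair to use for every plot being compared.

    headroom is extra space (in data units) above the largest value seen;
    both ends are rounded outward to a multiple of step.
    """
    lowest = min(min(d) for d in datasets)
    highest = max(max(d) for d in datasets)
    low = math.floor(lowest / step) * step
    high = math.ceil((highest + headroom) / step) * step
    return low, high

def count_clipped(data, limits):
    """Count the points that would fall outside the plotted range."""
    low, high = limits
    return sum(1 for value in data if value < low or value > high)

# Synthetic data (hypothetical): one run spans 0 to 10, the other 0 to 25
# with an outlier of 60 on every 1,000th sample.
run_a = [i % 11 for i in range(5000)]
run_b = [60 if i % 1000 == 999 else i % 26 for i in range(5000)]

# Limits chosen from a first look at the data (here, the first 500 samples).
limits = shared_y_limits([run_a, run_b[:500]])

# Checking the full run against those limits exposes the outliers.
clipped = count_clipped(run_b, limits)
```

With these inputs the limits come out as 0 to 30, and the check against the full run reports the five samples sitting at 60. The same pair of numbers can then be applied to every graph in the set, for example with `set yrange [0:30]` in gnuplot.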
Since you mention you are plotting
large data sets, I’ll assume you mean
more than 100,000 points. I have routinely
plotted data that runs into the
millions of individual points. When
you plot the data the first time, it’s important
not only to get the y-axis limits
correct, but also to plot as much data
as absolutely possible, given the limits
of the system on which you’re plotting
the data. Some problems or effects are
not easily seen if you reduce the data too
much. Reduce the data set by 90% (look
at every 10th sample), and you might
miss something subtle but important. If
your data won’t all fit into main memory
in one go, then break it down by chunks
along the x-axis. If you have one million
samples, graph them 100,000 at a time,
print out the graphs, and tape them together.
Yes, it’s kind of a quick-and-dirty
solution but it works, trust me.
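
To make the trade-off concrete, the following Python sketch contrasts the two approaches on made-up data: keeping every 10th sample versus walking the full data set in chunks along the x-axis. The helper name chunks_along_x, the chunk size, and the placement of the single spike are assumptions for illustration only. The sketch holds everything in a list for simplicity; with data that truly does not fit in memory, each chunk would instead be read from a file as needed.

```python
def chunks_along_x(samples, chunk_size=100_000):
    """Yield (start_index, chunk) pairs that cover every sample exactly once."""
    for start in range(0, len(samples), chunk_size):
        yield start, samples[start:start + chunk_size]

# Synthetic data (hypothetical): one million flat samples with a single
# spike at an index that is not a multiple of 10.
samples = [1.0] * 1_000_000
samples[123_457] = 60.0

# Reducing the data by 90% (every 10th sample) silently drops the spike.
decimated = samples[::10]
spike_survives_decimation = max(decimated) == 60.0

# Walking the full data set 100,000 samples at a time keeps it visible.
# Each chunk would be handed to the plotting tool; here we only record
# the largest value seen in each one.
chunk_maxima = [(start, max(chunk)) for start, chunk in chunks_along_x(samples)]
```

Here the decimated copy never sees the value 60, while the second of the ten chunks (the one starting at sample 100,000) does. Using the same y-axis limits for every chunk keeps the taped-together pages comparable.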
George V. Neville-Neil (kv@acm.org) is the proprietor of
Neville-Neil Consulting and a member of the ACM Queue
editorial board. He works on networking and operating
systems code for fun and profit, teaches courses on
various programming-related subjects, and encourages
your comments, quips, and code snips pertaining to his
Communications column.