the science of finding the traces of those trails to detect the presence
of hidden information. General and specific algorithms are the two
main types of attacks used to detect hidden information. Specific
algorithms target a certain type of steganography, such as precise formulations of least-significant bit modification, in the hope of directly
detecting the presence of a hidden message. These techniques work
well against files whose messages were hidden with readily available
programs, because the algorithms those programs use are generally well known. General algorithms, however, are considered to be “blind” in
the sense that they attempt to find an embedded message without
knowing the process that was used to hide the data [7].
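To make the target of a specific attack concrete, the following is a minimal sketch of sequential least-significant bit embedding, not the method of any particular program. The names `embed_lsb` and `extract_lsb` are illustrative, and the cover is assumed to be a flat sequence of 8-bit samples such as raw grayscale pixel values.

```python
def embed_lsb(cover: bytes, message: bytes) -> bytes:
    """Hide each message bit in the least-significant bit of one cover byte."""
    # Unpack the message into bits, most significant bit of each byte first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this message")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the sample, then set it to the message bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Rebuild n_bytes of message from the low bits of the first 8 * n_bytes samples."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for sample in stego[8 * i : 8 * i + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)
```

Each sample changes by at most one intensity level, which is why the change is hard to see, and the fixed, sequential layout is exactly the kind of regularity that a specific algorithm can test for.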
Because specific algorithms are limited by nature to previously
existing steganography techniques, the area of general algorithms is
more fruitful in steganalysis research. Furthermore, general algorithms can be broken down into two different types of attack. The first
type is the empirical attack, which looks for noticeable changes within
the cover file. This can be as simple as a person scanning an image
with the naked eye or as involved as employing a computer to decompose an image into its bit planes to find abnormalities. However, as
steganographic algorithms become more sophisticated, empirical
attacks lose much of their utility because they are limited to what can
be observed directly from the cover file. The second and more
powerful type of attack is statistical analysis, which overcomes the limits of empirical attacks by examining a file to see whether
its makeup deviates in unusual ways from a normal file of its type.
Popular examples of this type of analysis include checking the palette
ordering of an image, detecting image signatures that seem unusual
for native files of that type, and finding a high degree of periodicity
among coefficients that could indicate patterns of embedded data [6].
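The bit-plane decomposition mentioned above can be sketched in a few lines. This is an illustrative example only: it assumes an 8-bit grayscale image stored as a list of rows, and `neighbor_agreement` is a hypothetical, very crude measure of structure chosen for this sketch, not a statistic taken from any published detector.

```python
def bit_plane(pixels: list[list[int]], plane: int) -> list[list[int]]:
    """Return the 0/1 values of one bit plane (plane 0 is the least significant)."""
    if not 0 <= plane <= 7:
        raise ValueError("plane must be between 0 and 7")
    return [[(value >> plane) & 1 for value in row] for row in pixels]


def neighbor_agreement(bits: list[list[int]]) -> float:
    """Fraction of horizontally adjacent bit pairs that are equal."""
    pairs = [(row[i], row[i + 1]) for row in bits for i in range(len(row) - 1)]
    if not pairs:
        raise ValueError("need at least one row with two or more pixels")
    return sum(a == b for a, b in pairs) / len(pairs)


# A tiny 2x4 "image": plane 0 alternates, plane 2 is constant within each row.
image = [[0, 1, 2, 3], [4, 5, 6, 7]]
for plane in (0, 2):
    print(plane, bit_plane(image, plane), neighbor_agreement(bit_plane(image, plane)))
```

Rendering each plane as a black-and-white image supports the empirical inspection described earlier, while a number computed from a plane is a first step toward the statistical approach. Whether any particular value counts as abnormal depends on what is typical for files of that type.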
Steganography and Wikipedia
As with many other tools, steganography can be used for both positive
and negative purposes. Just as easily as two pen pals can share a fun
private message, criminals and terrorists also have the ability to
secretly communicate using steganography. With the billions of images
being passed back and forth on the Internet, many things could easily
slip under the radar. For example, examine the two similar images in
Figure 1. They could belong to a varied collection of sites, ranging from
topics such as travel to photography or even weather services.
Figure 1: A text file is hidden within one of these images.
Is it obvious which one contains the hidden information?
Using the free S-Tools steganographic suite, I took an image and
imported a text file into it in less than a minute. With any steganography program that has a well-designed user interface, almost anyone can do this without any formal training. The fact that any of the
billions of images on the Internet could potentially hide a dangerous
message is a large problem. Even with the original image right next to
it, there is almost no visual way to tell a dangerous message from an
innocent one. Although there is a large amount of steganography
research on finding new algorithms for either hiding or recovering
data, there seems to be relatively little research on how to
apply steganalysis to real situations. With the large amount of data
flowing across the Internet, governments and private companies
need a feasible way to sift through it all in order to find dangerous messages and agendas. To understand how steganalysis on such a massive scale could be undertaken, I designed an
elementary search of Wikipedia, a popular user-driven encyclopedia, looking for hidden information within its images.
Because anyone can edit the pages on Wikipedia, it is not only the
perfect model for a large database with potential hidden information
lurking in its corners, but is also an example of a seemingly innocent
site that should be a legitimate concern for various organizations.
Experimental Setup and Raw Data
To model a large-scale database scan, the experiment has
two distinct steps:
1. Use the program Wikix to download a set of images from Wikipedia.
2. Use the program StegAlyzerSS to scan the images for hidden data.
The testing computer was my laptop, which dual-boots Windows
XP and Ubuntu 7.10 with a 1.86GHz Pentium M chip and 2GB of
RAM. I ran five image-scanning trials, ranging from
196 to 13,147 files each (with 13,147 files being approximately 5.23GB of data). Because the Wikipedia image database was
quoted to be about 406GB of data as of October 2007, scaling up from that ratio puts Wikipedia’s
database at over one million images. Figure 2
shows the number of files that were scanned in each trial.
Figure 2: The number of files scanned in each trial.
Scanning time grew linearly with the number of files. On average, the
scanner processed about 9.4 files/second; about 13,000 files took roughly 27 minutes.
Extrapolated to the entire image database, a full scan would take about 31
hours. The scanning time for each trial is illustrated in Figure 3.
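The extrapolation above can be reproduced with a short calculation. This is a back-of-the-envelope sketch that assumes the average file size in the largest trial is representative of the whole database and that throughput stays constant; the function and constant names are illustrative.

```python
# Figures quoted in the text.
TRIAL_FILES = 13_147        # files in the largest trial
TRIAL_GB = 5.23             # approximate size of that trial
DATABASE_GB = 406           # quoted size of the Wikipedia image database
FILES_PER_SECOND = 9.4      # average scanning rate


def estimate_image_count(database_gb: float, trial_files: int, trial_gb: float) -> float:
    """Scale the trial's files-per-gigabyte ratio up to the whole database."""
    return database_gb * trial_files / trial_gb


def estimate_scan_hours(image_count: float, files_per_second: float) -> float:
    """Convert a file count into hours of scanning at a constant rate."""
    return image_count / files_per_second / 3600


images = estimate_image_count(DATABASE_GB, TRIAL_FILES, TRIAL_GB)
hours = estimate_scan_hours(images, FILES_PER_SECOND)
print(f"estimated images: {images:,.0f}")
print(f"estimated scan time: {hours:.1f} hours")
```

Under these assumptions the result is a little over one million images and roughly 30 hours of scanning, in line with the figures quoted in the text. The small difference from the quoted 31 hours presumably comes from rounding or from the rate varying between trials.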
StegAlyzerSS looks for three different trails left by steganography
programs: the signature of a particular algorithm, appended information after the end-of-file character, and disturbances to the least-significant bits of the image file. In every trial, StegAlyzerSS did not
find any signatures, and, excluding the 5th trial, no traces of least-