and two decades later suffix trees and
companion structures with their applications gave rise to several chapters in reference books by Crochemore and Rytter, Dan Gusfield, and
Crochemore, Hancart, and Lecroq
(see the appendix available with this
article in the ACM Digital Library).
The space required by suffix trees
has been a nuisance in applications
where they were needed the most.
With genomes on the order of giga-bytes, for instance, the space difference between 20 times larger than
the source versus, say, only 11 times
larger, can be substantial. For a few
lustra, Stefan Kurtz and his co-workers devoted their effort to cleverly allocating the tree and some of its companion structures.
28 In 2001, David R.
Clark and J. Ian Munro proposed one
of the best space-saving methods on
secondary storage.
13 Clark and Munro’s “succinct suffix tree” sought to
preserve as much of the structure of
the suffix tree as possible. Udi Manber
and Eugene W. Myers took a different
approach, however. In 1990, they introduced the “suffix array,”
31 which
eliminated most of the structure of
the suffix tree, but was still able to
implement many of the same operations, requiring space equal to 2 integers per text character and searching
in time O(|P| + log n) (reducible to 1 by
accepting search time O(|P| + log n)).
The suffix array stores the suffixes of
the input in lexicographic order and
can be seen as the sequence of leaves’
labels as found in the suffix tree by a
preorder traversal that would expand
each node according to the lexicographic order.
Although the suffix array seemed
at first to be a different data structure
than the suffix tree, the distinction
has receded. For example, Manber
and Myers’s original construction of
the suffix array took O(n log n) time
for any alphabet, but the suffix array
could be constructed in linear time
from the suffix tree for any alphabet.
In 2001, Toru Kasai et al.
27 showed the
suffix tree could be constructed in lin-
ear time from the suffix array. There-
fore, the suffix array was shown to be
a succinct representation of the suffix
tree. In 2003, three groups presented
three different modifications of Far-
ach’s algorithm for suffix tree con-
struction to give the first linear-time
algorithms for directly constructing
the suffix array; that is, the first linear-
time algorithms for computing suffix
arrays that did not first compute the
full suffix tree. Since then, there have
been many algorithms for fast con-
struction of suffix arrays, notably by
Nong, Zhang, and Chan,
35 which is
linear time and fast in practice. With
fast construction algorithms and
small space required, the suffix ar-
ray is the suffix-tree variant that has
gained the most widespread adoption
in software systems. A more recent
succinct suffix tree and array, which
take O(n) bits to represent for a binary
alphabet (O(n log σ) bits otherwise),
was presented by Grossi and Vitter.
21
Actually, the histories of suffix
trees and compression are tightly intertwined. This should not come as a
surprise, since the redundancies that
pattern discovery tries to unearth are
ideal candidates to be removed for
purposes of compression. In 1994, M.
Burrows and D.J. Wheeler proposed a
breakthrough compression method
based on suffix sorting.
11 Circa 1995,
Amihood Amir, Gary Benson, and
Martin Farach posed the problem of
searching in compressed texts.
1 In
2000, Paolo Ferragina and Giovanni
Manzini introduced the FM-inde x, a
compressed suffix array based on the
Burrows-Wheeler transform.
19 This
structure, which may be smaller than
the source file, supports searching
without decompression. This was extended to compressed tree indexing
problems in Ferragina et al.
18 using a
modification of the Burrows-Wheeler
transform.
Fallout, Extensions,
and Challenges
As highlighted out the outset, there
has been hardly any application of
text processing that did not need
these indexes at one point or another.
A prominent case has been searching with errors, a problem first efficiently tackled in 1985 by Gad Landau in his Ph.D. thesis.
29 In this kind
of search, one looks for substrings of
the text that differ from the pattern in
a limited number of errors such as a
single character deletion, insertion
or substitution. To efficiently solve
this problem, Landau combined suf-
Although the
suffix array
seemed at first
to be a different
data structure than
the suffix tree,
the distinction
has receded.