Their impact on computer science
and IT at large cannot be overstated.
Text searching and bioinformatics
would not be the same without them.
In 2013, the Combinatorial Pattern
Matching symposium celebrated the
40th anniversary of the appearance of
Weiner’s invention of the suffix tree41
with a special session entirely dedicated to that event.
History Bits and Pieces
At the dawn of “stringology,” Donald
Knuth conjectured the problem of
finding the longest substring common to two long text sequences of total length n required (n log n) time. An
O(n log n)-time had been provided by
Karp, Miller, and Rosenberg.
26 That
construction was destined to play a
role in parallel pattern matching, but
Knuth’s conjecture was short lived: in
1973, Peter Weiner showed the problem admitted an elegant linear-time
solution,
41 as long as the alphabet of
the string was fixed. Such a solution
was actually a byproduct of a construction he had originally set up for
a different purpose, that is, identifying any substring of a text file without specifying all of them. In doing
so, Weiner introduced the notion of
a textual inverted index that would
elicit refinements, analyses, and applications for 40 years and counting,
a feature hardly shared by any other
data structure.
Weiner’s original construction processed the text file from right to left.
As each new character was read in, the
structure, which he called a “bi-tree,”
would be updated to accommodate
longer and longer suffixes of the text
file. Thus, this was an inherently off-line construction, since the text had
to be known in its entirety before the
construction could begin. Alternatively, one could say the algorithm
would build the structure for the reverse of the text online. About three
years later, Ed McCreight provided a
left-to-right algorithm and changed
the name of the structure to “suffix
tree,” a name that would stick.
32
Let x be a string of n − 1 symbols
over some alphabet Σ and an extra
character not in Σ. The expanded suffix tree Tx associated with x is a digital
search tree collecting all suffixes of x$.
Specifically, Tx is defined as follows.
Over the years, such structures
have held center stage in text search-
ing, indexing, statistics, and com-
pression as well as in the assembly,
alignment, and comparison of bi-
osequences. Their range of scope ex-
tends to areas as diverse as detecting
plagiarism, finding surprising sub-
strings in a text, testing the unique
decipherability of a code, and more.
Figure 1. The expanded suffix tree of the string x = abcabcaba.
c
a
b
c
a
a
a
$
$
a
$
$
3
6
10
8
9
7
4
1
a
a
a
a
c
c
b
b
b
b
$
$
5
2
b
a
a
a
$
$
$
$
b
b
ca
a
a
Figure 2. Building an expanded suffix tree by insertion of consecutive suffixes (showing
here the insertion of abcaba$).
1
$
b
c
a
b
$
$
2
4
c
a
c
a
b
b
ac
c
a
a
a
$
3
b
b
b
c
c
a
b
a
a
b
The insertion of suffix sufi (i = 1, 2, …, n) consists of two phases. In the first phase, we search for sufi
in Ti – 1. Note the presence of guarantees that every suffix will end in a distinct leaf. Therefore, this
search will end with failure sooner or later. At that point, we will have identified the longest prefix of
sufi that has a locus (that is, a terminal node) in Ti – 1. Let headi abcab in the example be this prefix
and α the locus of headi. We can write sufi = headi ∙ taili with taili (a$ in the example) nonempty. In
the second phase, we need to add to Ti – 1 a path leaving node α and labeled taili. This achieves the
transformation of Ti – 1 into Ti .