1. T_x has n leaves, labeled from 1 to n.
2. Each arc is labeled with a symbol of Σ ∪ {$}. For any i, 1 ≤ i ≤ n, the concatenation of the labels on the path from the root of T_x to leaf i is precisely the suffix suf_i = x_i x_{i+1} … x_{n−1}$.
3. For any two suffixes suf_i and suf_j of x$, if w_ij is their longest common prefix, then the path in T_x relative to w_ij is the same for suf_i and suf_j.
An example of an expanded suffix tree is given in Figure 1.
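As a concrete illustration of properties 1–3, the short Python sketch below enumerates the suffixes suf_i of x$ that the root-to-leaf paths of T_x must spell out; the example string is an assumption chosen purely for exposition.

    # List the suffixes of x$ that an expanded suffix tree must spell out,
    # one per leaf (leaf i corresponds to suf_i, with 1-based positions as in the text).
    x = "abcabcaba"      # hypothetical example string
    s = x + "$"          # append the end-marker $

    for i in range(1, len(s) + 1):
        suf_i = s[i - 1:]
        print(f"leaf {i}: suf_{i} = {suf_i}")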
The tree can be interpreted as the state transition diagram of a deterministic finite automaton in which all nodes and leaves are final states, the root is the initial state, and the labeled arcs, which are assumed to point downward, represent part of the state-transition function. The state transitions not specified in the diagram lead to a unique non-final sink state. Our automaton recognizes the (finite) language consisting of all substrings of the string x. This observation also clarifies how the tree can be used in an online search: letting y be the pattern, we follow the downward path in the tree in response to consecutive symbols of y, one symbol at a time. Clearly, y occurs in x if and only if this process leads to a final state. In terms of T_x, we say the locus of a string y is the node α, if it exists, such that the path from the root of T_x to α is labeled y.
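The following Python sketch mirrors this search procedure. It assumes the expanded tree is represented as nested dictionaries in which every node maps a symbol to a child node (the construction sketch after the next paragraph produces exactly this representation); the function names are illustrative, not taken from the article.

    # Online search in the expanded tree (suffix trie), represented as nested
    # dictionaries. Every node of the tree is a final state, so the pattern
    # occurs in x exactly when the downward walk never falls off the tree;
    # a missing transition plays the role of the non-final sink state.

    def walk(trie, y):
        """Follow the downward path labeled by y; return the node reached
        (the locus of y) or None if some transition is unspecified."""
        node = trie
        for symbol in y:
            if symbol not in node:     # unspecified transition: sink state
                return None
            node = node[symbol]
        return node

    def occurs(trie, y):
        """y occurs in x if and only if the walk ends in a (final) state of the tree."""
        return walk(trie, y) is not None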
An algorithm for the direct construction of the expanded T_x (often called a suffix trie) is readily derived (see Figure 2). We start with an empty tree and add to it the suffixes of x one at a time. This procedure takes Θ(n²) time and O(n²) space; however, it is easy to reduce the space to O(n), thereby producing a suffix tree in compact form (Figure 3). Once this is done, it becomes possible to aim for an O(n)-time construction, non-trivial as one would expect.
At the CPM Conference of 2013, McCreight revealed that his O(n)-time construction was not born as an alternative to Weiner's: he had developed it in an effort to understand Weiner's paper, but when he showed it to Weiner, asking him to confirm that he had understood that paper, the answer was "No, but you have come up with an entirely different and elegant construction!" In unpublished lecture notes of 1975, Vaughan Pratt displayed the duality of this structure and Weiner's "repetition finder."37
McCreight's algorithm was still inherently offline, and it immediately triggered a search for an online version. Some partial attempts at an online algorithm were made, but such a variant had to wait almost two decades for Esko Ukkonen's paper in 1995.39 In all these linear-time constructions, linearity rested on the assumption of a finite alphabet; without that assumption, they take Θ(n log n) time. In 1997, Martin Farach introduced an algorithm that abandoned the one-suffix-at-a-time approach prevalent until then; this algorithm gives a linear-time reduction from suffix-tree construction to character sorting, and thus is optimal for all alphabets.17 In particular, it runs in linear time for a larger class of alphabets, for example, when the alphabet size is polynomial in the input length.
Around 1984, Blumer et al.9 and Crochemore14 exposed the surprising result that the smallest finite automaton recognizing all and only the suffixes of a string of n characters has only O(n) states and edges. Initially dubbed a directed acyclic word graph (DAWG), it can be reduced even further if all states are made terminal states.14 It then accepts all substrings of the string and is called the factor (or substring) automaton. There is a nice relation between these index data structures when the string has no end-marker and its suffixes are marked with terminal states in the tree.
Then the suffix tree is the edge-compacted version of the tree, and its number of nodes can be minimized, as with any automaton, thereby providing the compact DAWG of the string. Permuting the two operations, compaction and minimization, leads to the same structure. Apparently, Anatoli Slissenko (see the appendix avail-
key insights
- The suffix tree is the core data structure in string analysis.
- It has a rich history, with connections to compression, matching, automata, data structures, and more.
- There are powerful techniques to build suffix trees and use them efficiently in many applications.
Figure 3. A suffix tree in compact form.
This is obtained by first collapsing every chain formed by nodes with only one child into a single arc. The resulting compact version of T_x has at most n internal nodes, since there are n + 1 leaves in total and every internal node is branching. The label of a generic arc is now a substring of x$, rather than a single symbol. However, arc labels can be expressed by suitable pairs of pointers into a common copy of x$, thus achieving an O(n) space bound overall.
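A small Python sketch of this compaction step is given below, reusing the nested-dictionary trie from the earlier construction sketch; it keeps the collapsed labels as explicit substrings for readability, whereas reaching the O(n) space bound would require storing each label as a pair of pointers into a shared copy of x$, as explained above. All names here are illustrative assumptions.

    # Compact a nested-dictionary suffix trie by collapsing every chain of
    # single-child nodes into one arc. The result maps arc labels (strings) to
    # compacted child subtrees; leaves remain empty dictionaries.
    # Note: labels are kept as explicit substrings here for clarity only; an
    # O(n)-space version would store (start, end) pointers into a copy of x$.

    def compact(trie):
        compacted = {}
        for symbol, child in trie.items():
            label = symbol
            while len(child) == 1:                  # unary chain: absorb into the arc label
                (next_symbol, next_child), = child.items()
                label += next_symbol
                child = next_child
            compacted[label] = compact(child)       # recurse below the branching node
        return compacted

    # Example (hypothetical), using build_suffix_trie from the earlier sketch:
    # tree = compact(build_suffix_trie("abcabcaba"))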