Thus, by a remarkable alignment of stars, the compression method brought about by Lempel and Ziv was not only optimal in the information-theoretic sense, but it also found an optimal, linear-time implementation by the suffix tree, as was detailed immediately by Michael Rodeh, Vaughan Pratt, and Shimon Even.38
In his original paper, Weiner listed a few applications of his "bi-tree," including most notably offline string searching: preprocessing a text file to support queries that return the occurrences of a given pattern in time linear in the length of the pattern. And of course, the "bi-tree" addressed Knuth's conjecture by showing how to find the longest substring common to two files in linear time for a finite alphabet. There followed unpublished notes by Pratt entitled "Improvements and Applications for the Weiner Repetition Finder."37 A decade later, Alberto Apostolico would list more applications in a paper entitled "The Myriad Virtues of Suffix Trees."2
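As a rough illustration of the offline string searching Weiner had in mind, the sketch below answers pattern-occurrence queries from a precomputed index. It substitutes a naively built suffix array for the suffix tree (so construction is not linear and each query costs an extra logarithmic factor, unlike the tree); the function names and the `"\xff"` sentinel trick are illustrative assumptions, not anything from the original papers.

```python
# Hypothetical sketch: offline substring search via a naive suffix array,
# standing in for Weiner's suffix tree. Construction here is O(n^2 log n)
# and queries are O(m log n); the suffix tree achieves O(n) and O(m).
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Starting positions of the suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """Sorted starting positions of pattern in text."""
    suffixes = [text[i:] for i in sa]          # materialized only for clarity
    lo = bisect_left(suffixes, pattern)
    # "\xff" acts as a sentinel larger than any alphabet symbol, so the
    # upper bound covers every suffix that begins with the pattern.
    hi = bisect_right(suffixes, pattern + "\xff")
    return sorted(sa[lo:hi])

text = "bananas"
sa = build_suffix_array(text)
print(occurrences(text, sa, "an"))   # positions where "an" occurs
```

The binary search works because every occurrence of the pattern is a prefix of some suffix, and those suffixes form a contiguous run in sorted order.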
able with this article in the ACM Digital Library under Source Material) ended up with a similar structure for his work on the detection of repetitions in strings. These automata provide another, more efficient counterexample to Knuth's conjecture when they are used, against the grain, as pattern-matching machines (see Figure 4).
The appearance of suffix trees dovetailed with some interesting and independent developments in information theory. In his famous approach to the notion of information, Kolmogorov equated the information or structure in a string to the length of the shortest program needed to produce that string on a Universal Turing Machine. Unfortunately, this measure is not computable, and even if it were, most long strings are incompressible (that is, they lack a short program producing them), since long strings vastly outnumber the comparatively few short programs (themselves strings).
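The counting argument behind that last claim is elementary enough to check by hand; the snippet below is a minimal illustration (the function names are mine), treating programs as binary strings: there are 2^n strings of length n but only 2^n − 1 programs strictly shorter, so some string of every length is incompressible, and only a vanishing fraction compresses well.

```python
# Counting demonstration of incompressibility: 2**n strings of length n,
# but only 2**0 + 2**1 + ... + 2**(n-1) = 2**n - 1 shorter binary programs.
def strings_of_length(n):
    return 2 ** n

def programs_shorter_than(n):
    # All binary programs of length 0, 1, ..., n-1.
    return sum(2 ** k for k in range(n))

# Pigeonhole: shorter programs can never cover all strings of length n.
for n in (8, 16, 32):
    assert programs_shorter_than(n) < strings_of_length(n)

# Fraction of length-32 strings that could compress by 10 or more bits:
fraction = programs_shorter_than(32 - 10) / strings_of_length(32)
print(f"{fraction:.6f}")   # ≈ 0.000977, just under 1 in 1024
```

The same count shows that fewer than one string in 2^k can be compressed by k bits, no matter how clever the compressor.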
The regularities exploited by Kolmogorov's universal and omniscient machine could be of any conceivable kind, but what if one limited them to the syntactic redundancies affecting a text in the form of repeated substrings? If a string is repeated many times, one could profitably encode all occurrences by a pointer to a common copy. This copy could be internal or external to the text. In the former case, one could have pointers going in both directions or only in one direction, allow or forbid nesting of pointers, and so on. In his doctoral thesis, Jim Storer showed that virtually all such "macro schemes" are intractable, except one. Not long before that, in a landmark paper entitled "On the Complexity of Finite Sequences,"30 Abraham Lempel and Jacob Ziv had proposed a variable-to-block encoding, based on a simple parsing of the text, with the feature that the compression achieved would match, in the limit, that produced by a compressor tailored to the source probabilities.
Figure 4. The compact suffix tree (left) and the suffix automaton (right) of the string “bananas.”
Failure links are represented by the dashed arrows. Although the automaton is an index on the string, it can also be used as a pattern-matching machine to locate substrings of "bananas" in another text, or to compute their longest common substring. The process runs online on the second string. Assume, for example, that "bana" has just been scanned from the second string and the current state of the automaton is state 4. If the next letter is "n," the common substring is "banan" of length 5 and the new state is 5. If the next letter is "s," the failure link is used: from state 3′, corresponding to the common substring "ana" of length 3, we get the common substring "anas" with the new state 7. If the next letter is "b," iterating the failure link leads to state 0 and we get the common substring "b" with the new state 1. Finally, any other next letter produces the empty common substring and state 0.
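The scanning process the caption describes can be sketched in code. The construction below is the standard online suffix-automaton algorithm (in the lineage of Blumer et al. and Crochemore); the class and method names are mine, and the internal state numbering will not match the figure's labels, though the failure-link mechanics are the same.

```python
# Sketch: build the suffix automaton of one string, then scan a second
# string online, following transitions where possible and failure
# (suffix) links otherwise, tracking the current common substring.
class SuffixAutomaton:
    def __init__(self, text):
        self.link = [-1]         # failure/suffix links
        self.length = [0]        # length of the longest string per state
        self.next = [{}]         # outgoing transitions
        last = 0
        for ch in text:
            cur = len(self.length)
            self.length.append(self.length[last] + 1)
            self.link.append(0)
            self.next.append({})
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p != -1:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:                      # split: clone state q
                    clone = len(self.length)
                    self.length.append(self.length[p] + 1)
                    self.link.append(self.link[q])
                    self.next.append(dict(self.next[q]))
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def longest_common_substring(self, other):
        """Scan `other` online, as in the caption, returning a longest
        substring it shares with the indexed string."""
        state = length = best = best_end = 0
        for i, ch in enumerate(other):
            while state and ch not in self.next[state]:
                state = self.link[state]   # follow failure links
                length = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                length += 1
            else:
                state = length = 0         # fall back to the initial state
            if length > best:
                best, best_end = length, i + 1
        return other[best_end - best:best_end]

print(SuffixAutomaton("bananas").longest_common_substring("cabanas"))
```

Because each character of the second string advances either a transition or a failure link that only ever shortens the current match, the whole scan runs in time linear in the second string, as the caption's "online" claim requires.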