et al.,
24 the structure of generalized
suffix tree is crucially used to design
a linear machine-word data structure
to return the top-k most frequent documents containing a pattern p in time
nearly linear in pattern size.
One surprising variant of the suffix
tree was introduced by Brenda Baker
for purposes of detection of plagiarism in student reports as well as optimization in software development.
7
This variant of pattern matching,
called “parameterized matching,” enables one to find program segments
that are identical up to a systematic
change of parameters, or substrings
that are identical up to a systematic
relabeling or permutation of the characters in the alphabet. One obvious
extension of the notion of a suffix
tree is to more than one dimension,
albeit the mechanics of the extension
itself are far from obvious.
34 Among
more distant relatives, one finds
“wavelet trees.” Originally proposed
as a representation of compressed
suffix arrays,
20 wavelet trees enable
one to perform on general alphabets
the ranking and selection primitives
previously limited to bit vectors, and
more.
The list could go on and on, but the
scope of this article was not meant
to be exhaustive. Actually, after 40
years of unrelenting developments,
it is fair to assume the list will continue to grow. Open problems also
abound. For instance, many of the
observed sequences are expressed in
numbers rather than characters, and
in both cases are affected by various
types of errors. While the outcome of
a two-character comparison is just
one bit, two numbers can be more or
less close, depending on their difference or some other metric. Likewise,
two text strings can be more or less
similar, depending on the number of
elementary steps necessary to change
one in the other. The most disruptive
aspect of this framework is the loss of
the transitivity property that leads to
the most efficient exact string matching solutions. And yet indexes capable of supporting fast and elegant approximate pattern queries of the kind
just highlighted would be immensely
useful. Hopefully, they will come up
soon and, in time, have their own 40th
-anniversary celebration.
Acknowledgments. We are grateful to Ed McCreight, Ronnie Martin,
Vaughan Pratt, Peter Weiner, and Jacob Ziv for discussions and help. We
are indebted to the referees for their
careful scrutiny of an earlier version
of this article, which led to many improvements.
References
1. Amir, A., Benson, G. and Farach, M. Let sleeping
files lie: Pattern matching in Z-compressed files. In
Proceedings of the 5th ACM-SIAM Annual Symposium
on Discrete Algorithms (Arlington, VA, 1994), 705–714.
2. Apostolico, A. The myriad virtues of suffix trees.
Combinatorial Algorithms on Words, vol. 12 of NATO
Advanced Science Institutes, Series F. A. Apostolico
and Z. Galil, Eds. Springer-Verlag, Berlin, 1985, 85–96.
3. Apostolico, A., Bock, M.E. and Lonardi, S. Monotony of
surprise and large-scale quest for unusual words.
J. Computational Biology 10, 3 / 4 (2003), 283–311.
4. Apostolico, A., Denas, O. and Dress, A. Efficient tools
for comparative substring analysis. J. Biotechnology
149, 3 (2010), 120–126.
5. Apostolico, A. and Preparata, F.P. Optimal off-line
detection of repetitions in a string. Theor. Comput. Sci.
22, 3 (1983), 297–315.
6. Apostolico, A. and Preparata, F.P. Data structures
and algorithms for the strings statistics problem.
Algorithmica 15, 5 (May 1996), 481–494.
7. Baker, B.S. Parameterized duplication in strings:
Algorithms and an application to software maintenance.
SIAM J. Comput. 26, 5 (1997), 1343–1362.
8. Béal, M.-P., Mignosi, F. and Restivo, A. Minimal
forbidden words and symbolic dynamics. In
Proceedings of the 13th Annual Symposium on
Theoretical Aspects of Computer Science, vol. 1046 of
Lecture Notes in Computer Science (Grenoble, France,
Feb. 22–24, 1996). Springer, 555–566.
9. Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D.,
Chen, M. T. and Seiferas, J. The smallest automaton
recognizing the subwords of a text. Theor. Comput. Sci.
40, 1 (1985), 31–55.
10. Brodal, G.S., Lyngsø, R.B., Östlin, A. and Pedersen, C. N.S.
Solving the string statistics problem in time O(n log n).
In Proceedings of the 29th International Colloquium on
Automata, Languages and Programming, vol. 2380 of
Lecture Notes in Computer Science (Malaga, Spain,
July 8–13, 2002). Springer, 728–739.
11. Burrows, M. and Wheeler, D.J. A block-sorting lossless
data compression algorithm. Technical Report 124,
Digital Equipment Corp., May 1994.
12. Chairungsee, S. and Crochemore, M. Using minimal
absent words to build phylogeny. Theoretical
Computer Science 450, 1 (2012), 109–116.
13. Clark, D.R. and Munro, J.I. Efficient suffix trees on
secondary storage. In Proceedings of the 7th ACM-SIAM Annual Symposium on Discrete Algorithms,
(Atlanta, GA, 1996), 383–391.
14. Crochemore, M. Transducers and repetitions.
Theor. Comput. Sci., 45, 1 (1986), 63–86.
15. Crochemore, M., Mignosi, F. and Restivo, A. Automata
and forbidden words. Information Processing Letters
67, 3 (1998), 111–117.
16. Crochemore, M., Mignosi, F., Restivo, A and Salemi,
S. Data compression using antidictonaries. In
Proceedings of the IEEE: Special Issue Lossless Data
Compression 88, 11 (2000). J. Storer, Ed., 1756–1768.
17. Farach, M. Optimal suffix tree construction with large
alphabets. In Proceedings of the 38th IEEE Annual
Symposium on Foundations of Computer Science
(Miami Beach, FL, 1997), 137–143.
18. Ferragina, P., Luccio, F., Manzini, G. and Muthukrishnan,
S. Compressing and indexing labeled trees with
applications. JACM 57, 1 (2009).
19. Ferragina, P. and Manzini, G. Opportunistic data
structures with applications. In FOCS (2000), 390–398.
20. Grossi, R., Gupta, A. and Vitter, J. S. High-order entropy-compressed text indexes. In SODA (2003), 841–850.
21. Grossi, R. and Vitter, J.S. Compressed suffix arrays
and suffix trees with applications to text indexing and
string matching. In Proceedings ACM Symposium on
the Theory of Computing (Portland, OR, 2000). ACM
Press, 397–406).
22. Gusfield, D. Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology.
Cambridge University Press, Cambridge, U.K., 1997.
23. Harel, D. and Tarjan, R.E. Fast algorithms for finding
nearest common ancestors. SIAM J. Comput. 13, 2
(1984), 338–355.
24. Hon, W.-K., Shah, R. and Vitter, J.S. Space-efficient
framework for top-k string retrieval problems. In
FOCS. IEEE Computer Society, 2009, 713–722.
25. Hui, L.C.K. Color set size problem with applications
to string matching. In Proceedings of the 3rd
Annual Symposium on Combinatorial Pattern
Matching, no. 644 in Lecture Notes in Computer
Science, (Tucson, AZ, 1992). A. Apostolico, M.
Crochemore, Z. Galil, and U. Manber, Eds. Springer-Verlag, Berlin, 230–243.
26. Karp, R.M., Miller, R.E., and Rosenberg, A.L. Rapid
identification of repeated patterns in strings, trees
and arrays. In Proceedings of the 4th ACM Symposium
on the Theory of Computing (Denver, CO, 1972). ACM
Press, 125–13.
27. Kasai, T., Lee, G., Arimura, H., Arikawa, S. and Park,
K. Linear-time longest-common-prefix computation
in suffix arrays and its applications. CPM.
Springer-Verlag, 2001, 181–192.
28. Kurtz, S. Reducing the space requirements of suffix
trees. Softw. Pract. Exp. 29, 13 (1999), 1149–1171.
29. Landau, G.M. String matching in erroneus input.
Ph. D. Thesis, Department of Computer Science, Tel-Aviv University, 1986.
30. Lempel, A. and Ziv, J. On the complexity of finite
sequences. IEEE Trans. Inf. Theory 22 (1976), 75–81.
31. Manber, U. and Myers, G. Suffix arrays: A new method
for on-line string searches. In Proceedings of the 1st
ACM-SIAM Annual Symposium on Discrete
Algorithms (San Francisco, CA, 1990), 319–327.
32. McCreight, E.M. A space-economical suffix tree
construction algorithm. J. Algorithms 23, 2 (1976),
262–272.
33. Muthukrishnan, S. Efficient algorithms for document
listing problems. In Proceedings of the 13th ACM-
SIAM Annual Symposium on Discrete Algorithms
(2002), 657–666.
34. J. C. Na, P. Ferragina, R. Giancarlo, and K. Park. Two-dimensional pattern indexing. In Encyclopedia of
Algorithms. 2008.
35. Nong, G., Zhang, S. and Chan, W.H. Two efficient
algorithms for linear time suffix array construction.
IEEE Trans. Comput. 60, 10 (2011), 1471–1484.
36. Poe, E.A. The Gold-Bug and Other Tales. Dover Thrift
Editions Series. Dover, 1991.
37. Pratt, V. Improvements and applications for the
Weiner repetition finder. Manuscript, 1975.
38. Rodeh, M., Pratt, V. and Even, S. Linear algorithm
for data compression via string matching. J. Assoc.
Comput. Mach. 28, 1 (1981), 16–24.
39. Ukkonen, E. On-line construction of suffix trees.
Algorithmica 14, 3 (1995), 249–260.
40. Ulitsky, I., Burstein, D., Tuller, T. and Chor, B. The
average common substring approach to phylogenomic
reconstruction. J. Computational Biology 13, 2 (2006),
336–350.
41. Weiner, P. Linear pattern matching algorithms. In
Proceedings of the 14th Annual IEEE Symposium on
Switching and Automata Theory, ( Washington, D. C.,
1973), 1–11.
Alberto Apostolico held joint appointments with Georgia
Tech’s School of Computational Science and Engineering
School of Interactive computing as a professor and a
researcher. He passed away on July 20, 2015.
Maxime Crochemore ( maxime.crochemore@kcl.ac.uk)
is a professor at King’s College London and Université
Paris-Est, France.
Martin Farach-Colton ( farach@cs.rutgers.edu) is a
professor in the Department of Computer Science at
Rutgers University, Piscataway, NJ.
Zvi Galil ( galil@cc.gatech.edu) is Dean of the College of
Computing at Georgia Institute of Technology, Atlanta, GA.
S. Muthukrishnan ( muthu@cs.rutgers.edu) is a professor
in the Department of Computer Science at Rutgers
University, Piscataway, NJ.
Copyright held by authors.
Publication rights licensed to ACM. $15.00.