|
ABSTRACT
Sorting the suffixes of a string into lexicographical order is a fundamental task in a number of contexts, most notably lossless compression (Burrows--Wheeler transformation) and text indexing (suffix arrays). Most approaches to suffix sorting produce a sorted array of suffixes directly, continually moving suffixes into their final place in the array until the ordering is complete. In this article, we describe a novel and resource-efficient (time and memory) approach to suffix sorting, which works in a complementary way—by assigning each suffix its rank in the final ordering, before converting to a sorted array, if necessary, once all suffixes are ranked. We layer several powerful extensions on this basic idea and show experimentally that our approach is superior to other leading algorithms in a variety of real-world contexts.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 1, 53--86.
|
| |
2
|
Andersson, A. and Nilsson, S. 1998. Implementing radix sort. ACM Journal of Experimental Algorithmics 3.
|
| |
3
|
Bentley, J. L. and McIlroy, M. D. 1993. Engineering a sort function. Software—Practice and Experience 23, 11, 1249--1265.
|
| |
4
|
Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, New Orleans, LA. 360--369.
|
| |
5
|
Burrows, M. and Wheeler, D. J. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, CA.
|
| |
6
|
Ferragina, P. and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS 00). IEEE Computer Society, Redondo Beach, CA. 390--398.
|
| |
7
|
González, R., Grabowski, S., Mäkinen, V., and Navarro, G. 2005. Practical implementation of rank and select queries. In Poster Proceedings Volume of Fourth Workshop on Efficient and Experimental Algorithms (WEA'05). CTI Press and Ellinika Grammata, Greece. 27--38.
|
| |
8
|
Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35, 2, 378--407.
|
| |
9
|
Itoh, H. and Tanaka, H. 1999. An efficient method for in memory construction of suffix arrays. In Proceedings of the sixth Symposium on String Processing and Information Retrieval. IEEE Computer Society, Cancun, Mexico, 81--88.
|
| |
10
|
Kärkkäinen, J. and Burkhardt, S. 2003. Fast lightweight suffix array construction and checking. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 55--69.
|
| |
11
|
Kärkkäinen, J. and Sanders, P. 2003. Simple linear work suffix array construction. In Proceedings of the 30th International Colloq. Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 2971. Springer-Verlag, Berlin. 943--955.
|
| |
12
|
Karp, R. M., Miller, R. E., and Rosenberg, A. L. 1972. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the Fourth Annual ACM Symposium on Theory of Computing. Denver, Colorado. 125--136.
|
| |
13
|
Kim, D. K., S., S. J., Park, H., and Park, K. 2003. Linear-time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 186--199.
|
| |
14
|
Ko, P. and Aluru, S. 2003. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching: 14th Annual Symposium, CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin. 200--210.
|
| |
15
|
Larsson, N. J. and Sadakane, K. 1999. Faster suffix--sorting. Tech. Rep. LU-CS-TR:99-214 [LUNFD6/(NFCS-3140)/1-20/(1999)], Department of Computer Science, Lund University, Sweden.
|
| |
16
|
Mäkinen, V. and Navarro, G. 2005. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12, 2, 40--66.
|
| |
17
|
Manber, U. and Myers, G. W. 1993. Suffix arrays: A new model for on-line string searches. SIAM Journal of Computing 22, 5, 935--948.
|
| |
18
|
Manzini, G. 2004. Two space saving tricks for linear time lcp computation. In SWAT 2004: 9th Scandinavian Workshop on Algorithm Theory, T. Hagerup and J. Katajainen, Eds. Lecture Notes in Computer Science, vol. 3111. Springer-Verlag, Berlin. 372--383.
|
| |
19
|
Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33--50.
|
| |
20
|
McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Computing Systems 6, 1, 5--27.
|
| |
21
|
Moffat, A. and Isal, R. Y. K. 2005. Word-based compression using the Burrows-Wheeler transform. Information Processing and Management 41, 1175--1192.
|
| |
22
|
Musser, D. R. 1997. Introspective sorting and selection algorithms. Software—Practice and Experience 27, 8, 983--993.
|
| |
23
|
Puglisi, S. J. 2005. Exposition and analysis of a suffix sorting algorithm. Tech. Rep. CAS-05-02-WS, Department of Computing and Software, McMaster University, Canada.
|
| |
24
|
Puglisi, S. J., Smyth, W. F., and Turpin, A. H. 2005. The performance of linear time suffix sorting algorithms. In Proceedings of the Data Compression Conference DCC '05, M. Cohn and J. Storer, Eds. IEEE Computer Society Press, Los Alamitos, CA. 358--368.
|
| |
25
|
Puglisi, S. J., Smyth, W. F., and Turpin, A. H. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys. 39, 2, Article 1.
|
| |
26
|
Sadakane, K. 1998. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proceedings of the Data Compression Conference DCC '98. IEEE Computer Society, Los Alamitos, CA. 129--138.
|
| |
27
|
Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Thirteenth ACM-SIAM Symposium on Discrete Algorithms. 225--232.
|
| |
28
|
Schindler, M. 2002. szip homepage. http://www.compressconsult.com/szip/. Available Online.
|
| |
29
|
Schürmann, K. and Stoye, J. 2005. An incomplex algorithm for fast suffix array construction. In Proceedings of The Seventh Workshop on Algorithm Engineering and Experiments (ALENEX'05). SIAM. 77--85.
|
| |
30
|
Seward, J. 2000. On the performance of BWT sorting algroithms. In Proceedings of the Data Compression Conference DCC '00. IEEE Computer Society, Los Alamitos, CA. 173--182.
|
| |
31
|
Seward, J. 2004. The bzip2 and libbzip2 home page. http://sources.redhat.com/bzip2/. Available Online.
|
| |
32
|
Sinha, R. and Zobel, J. 2004. Cache-conscious sorting of large sets of strings with dynamic tries. ACM Journal of Experimental Algorithmics 9.
|
| |
33
|
Smyth, W. F. 2003. Computing Patterns in Strings. Addison-Wesley-Pearson Education Limited, Essex, England.
|
| |
34
|
Vines, P. and Zobel, J. 1998. Compression techniques for Chinese text. Software—Practice and Experience 28, 12, 1299--1314.
|
|