|
ABSTRACT
We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length l substring of S whose compression is the least (Least Compressible Substring or LCS problem).Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis.We present the first known, nearly optimal algorithms for substring compression problems---SCQ, LCS and their generalizations---that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
{BCL02} D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88(048702), 2002.
|
| |
6
|
|
| |
7
|
{BW94} M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report SRC-RR-124, Hewlett Packard Laboratories, 1994.
|
 |
8
|
|
| |
9
|
{CLMT02} X. Chen, M. Li, B. Ma, and J. Tromp. DNA-Compress: Fast and effective DNA sequence compression. Bioinformatics, 18(12):1696--1698, 2002.
|
| |
10
|
|
 |
11
|
|
| |
12
|
{EMS03} F. Ergun, S. Muthukrishnan, and S. C. Sahinalp. Comparing sequences with segment rearrangements. In Proceedings of the 23rd Conference on Foundations of Software Technology and Theoretical Computer Science, 2003.
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
 |
18
|
|
 |
19
|
Richard M. Karp , Raymond E. Miller , Arnold L. Rosenberg, Rapid identification of repeated patterns in strings, trees and arrays, Proceedings of the fourth annual ACM symposium on Theory of computing, p.125-136, May 01-03, 1972, Denver, Colorado, United States
[doi> 10.1145/800152.804905]
|
| |
20
|
|
| |
21
|
Ming Li , Xin Chen , Xin Li , Bin Ma , Paul Vitányi, The similarity metric, Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, January 12-14, 2003, Baltimore, Maryland
|
 |
22
|
|
| |
23
|
{MSIO00} T. Matsumoto, K. Sadakane, H. Imai, and T. Okazaki. Can general-purpose compression schemes really compress DNA sequences? In Currents in Computational Molecular Biology, pages 76--77, 2000.
|
| |
24
|
|
| |
25
|
|
| |
26
|
{RDD+97} E. Rivals, O. Delgrange, J.-P. Delahaye, M. Dauchet, M.-O. Delorme, A. Henaut, and E. Ollivier. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Comp. Appl. BioSci, 13(2):131--136, 1997.
|
| |
27
|
{RDDD94} E. Rivals, O. Delgrange, M. Dauchet, and J-P. Delahaye. Compression and sequence comparison. In Proceedings of DIMACS Workshop on Sequence Comparison, 1994.
|
| |
28
|
{RDDD97} E. Rivals, M. Dauchet, J. Delahaye, and O. Delgrange. Fast discerning repeats in DNA sequences with a compression algorithm. In Proc. Genome Informatics Workshop, pages 215--226, 1997.
|
 |
29
|
|
| |
30
|
{Sah97} S. C. Sahinalp. Locally consistent parsing for string processing. PhD thesis, University of Maryland, 1997.
|
 |
31
|
|
| |
32
|
{SS03} D. Shapira and J. Storer. Large edit distance with multiple block operations. In 10th International Symposium on String Processing and Information Retrieval (SPIRE), volume 2857 of Lecture Notes in Computer Science, 2003.
|
| |
33
|
{SV95} S. C. Sahinalp and U. Vishkin. Data compression using locally consistent parsing. Technical report, University of Maryland Department of Computer Science, 1995.
|
| |
34
|
|
| |
35
|
{SYKT01} H. Sato, T. Yoshioka, A. Konagaya, and T. Toyoda. DNA data compression in the post genome era. In Proceedings of Genome Informatics 12, pages 512--514, 2001.
|
| |
36
|
{Wei73} P. Weiner. Linear pattern matching algorithm. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1--11, 1973.
|
| |
37
|
{ZL77} J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23:337--343, 1977.
|
|