ACM Home Page
Please provide us with feedback. Feedback
An efficient normalized maximum likelihood algorithm for DNA sequence compression
Full text PdfPdf (427 KB)
Source ACM Transactions on Information Systems (TOIS) archive
Volume 23 ,  Issue 1  (January 2005) table of contents
Pages: 3 - 34  
Year of Publication: 2005
ISSN:1046-8188
Authors
Gergely Korodi  Tampere University of Technology, Tampere, Finland
Ioan Tabus  Tampere University of Technology, Tampere, Finland
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 9,   Downloads (12 Months): 113,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1055709.1055711
What is a DOI?

ABSTRACT

This article presents an efficient algorithm for DNA sequence compression, which achieves the best compression ratios reported over a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximate matching blocks, (2) encoding by a first order context coding and (3) representing strings in clear, to make efficient use of the redundancy sources in DNA data, under fast execution times. One of the main algorithmic features is the constraint on the matching blocks to include reasonably long contiguous matches, which not only reduces significantly the search time, but also can be used to modify the NML model to exploit the constraint for getting smaller code lengths. The algorithm handles the changing statistics of DNA data in an adaptive way and by predictively encoding the matching pointers it is successful in compressing long approximate matches. Apart from comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403--410.
 
3
Chen, X., Kwong, S., and Li, M. 2001. A compression algorithm for DNA sequences. IEEE Engineering in Medicine and Biology. 61--66.
 
4
Chen, X., Li, M., Ma, B., and Tromp, J. 2002. DNACompress: fast and effective DNA sequence compression. Bioinformatics 18, 1696--1698.
 
5
Grumbach, S. and Tahi, F. 1993. Compression of DNA sequences. In Proceedings of the Data Compression Conference. 340--350.
 
6
 
7
 
8
Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., and Zhang, H. 2001. An information based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149--154.
 
9
 
10
 
11
Ma, B., Tromp, J., and Li, M. 2002. PatternHunter: Faster and more sensitive homology search. Bioinformatics 18, 440--445.
 
12
Matsumoto, T., Sadakane, K., and Imai, H. 2000. Biological sequence compression algorithms. In Genome Informatics Workshop. Universal Academy Press, 43--52.
 
13
Pearson, W. and Lipman, D. 1988. Improved tools for biological sequence comparison. In Proc. Natl. Acad. Sci. USA. Vol. 85. 2444--2448.
 
14
Rissanen, J. 1976. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. 20, 3, 198--203.
 
15
Rivals, E., Delahaye, J., Dauchet, M., and Delgrange, O. 1995. A guaranteed compression scheme for repetitive DNA sequences. Tech. Rep. IT--285, LIFL Lille I Univ.
 
16
 
17
 
18
Wilbur, W. and Lipman, D. 1983. Rapid similarity searches of nucleic acid and protein data banks. In Proc. Natl. Acad. Sci. USA. Vol. 80. 726--730.
 
19
Williams, H. and Zobel, J. 1997. Compression of nucleotide databases for fast searching. Computer Applications in the Biosciences 13, 5, 549--554.
 
20
Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23, 3, 337--343.


Collaborative Colleagues:
Gergely Korodi: colleagues
Ioan Tabus: colleagues