ACM Home Page
Please provide us with feedback. Feedback
Mining periodic patterns with gap requirement from sequences
Full text PdfPdf (523 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 1 ,  Issue 2  (August 2007) table of contents
Article No. 7  
Year of Publication: 2007
ISSN:1556-4681
Authors
Minghua Zhang  National University of Singapore, Singapore
Ben Kao  The University of Hong Kong, Hong Kong
David W. Cheung  The University of Hong Kong, Hong Kong
Kevin Y. Yip  Yale University, New Haven, CT
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 224,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1267066.1267068
What is a DOI?

ABSTRACT

We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. J. Molec. Biol. 215, 403--410.
 
3
Bairoch, A. and Boeckmann, B. 1992. The swiss-prot protein sequence data bank. Nucleic Acids Res. 20, Suppl, 2019--2022.
 
4
Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M., and Rodier, F. 1985. The mosaic genome of warm-blooded vertebrates. Science 228, 4702, 953--958.
 
5
Coward, E. and Drablos, F. 1998. Detecting periodic patterns in biological sequences. Bioinformatics 14, 6, 498--507.
 
6
Fickett, J. W. and Tung, C. S. 1992. Assessment of protein coding measures. Nucleuic Acids Res. 20, 6441--6450.
 
7
 
8
Herzel, H., Weiss, O., and Trifonov, E. N. 1999. 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 3, 187--193.
 
9
Jonassen, I. 1996. Efficient discovery of conserved patterns using a pattern graph. Tech. Rep. Report No. 118, University of Bergen.
 
10
 
11
 
12
NCBI. The National Center for Biotechnology Information web site. http://www.ncbi.nlm.nih.gov.
 
13
 
14
Reddy, P. and Housman, D. 1997. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology 9, 3, 364--372.
 
15
Rigoutsos, I. and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics 14, 1.
 
16
 
17
Tomita, M., Wada, M., and Kawashima, Y. 1999. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. J. Molec. Evolut. 49, 182--192.
 
18
Trifonov, E. N. 1998. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A 249, 511--516.
 
19
van Belkum, A., W. van Leeuwen, S. S., Willemse, D., van Alphen, L., and Verbrugh, H. 1997. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infect. Immun. 65, 12, 5017--5027.
 
20
Widom, J. 1996. Short-range order in two eukaryotic genomes: Relation to chromosome structure. J. Molec. Biol. 259, 579--588.
21
22
23
 
24
Zhang, M., Kao, B., Yip, C., and Cheung, D. 2001. A GSP-based efficient algorithm for mining frequent sequences. In Proceedings of IC-AI'2001 (Las Vegas, NV).


Collaborative Colleagues:
Minghua Zhang: colleagues
Ben Kao: colleagues
David W. Cheung: colleagues
Kevin Y. Yip: colleagues