|
ABSTRACT
We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
2
|
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. J. Molec. Biol. 215, 403--410.
|
| |
3
|
Bairoch, A. and Boeckmann, B. 1992. The swiss-prot protein sequence data bank. Nucleic Acids Res. 20, Suppl, 2019--2022.
|
| |
4
|
Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M., and Rodier, F. 1985. The mosaic genome of warm-blooded vertebrates. Science 228, 4702, 953--958.
|
| |
5
|
Coward, E. and Drablos, F. 1998. Detecting periodic patterns in biological sequences. Bioinformatics 14, 6, 498--507.
|
| |
6
|
Fickett, J. W. and Tung, C. S. 1992. Assessment of protein coding measures. Nucleuic Acids Res. 20, 6441--6450.
|
| |
7
|
|
| |
8
|
Herzel, H., Weiss, O., and Trifonov, E. N. 1999. 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 3, 187--193.
|
| |
9
|
Jonassen, I. 1996. Efficient discovery of conserved patterns using a pattern graph. Tech. Rep. Report No. 118, University of Bergen.
|
| |
10
|
|
| |
11
|
|
| |
12
|
NCBI. The National Center for Biotechnology Information web site. http://www.ncbi.nlm.nih.gov.
|
| |
13
|
Jian Pei , Jiawei Han , Behzad Mortazavi-Asl , Helen Pinto , Qiming Chen , Umeshwar Dayal , Meichun Hsu, PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proceedings of the 17th International Conference on Data Engineering, p.215-224, April 02-06, 2001
|
| |
14
|
Reddy, P. and Housman, D. 1997. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology 9, 3, 364--372.
|
| |
15
|
Rigoutsos, I. and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics 14, 1.
|
| |
16
|
|
| |
17
|
Tomita, M., Wada, M., and Kawashima, Y. 1999. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. J. Molec. Evolut. 49, 182--192.
|
| |
18
|
Trifonov, E. N. 1998. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A 249, 511--516.
|
| |
19
|
van Belkum, A., W. van Leeuwen, S. S., Willemse, D., van Alphen, L., and Verbrugh, H. 1997. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infect. Immun. 65, 12, 5017--5027.
|
| |
20
|
Widom, J. 1996. Short-range order in two eukaryotic genomes: Relation to chromosome structure. J. Molec. Biol. 259, 579--588.
|
 |
21
|
Jiong Yang , Wei Wang , Philip S. Yu, Mining asynchronous periodic patterns in time series data, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.275-279, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347150]
|
 |
22
|
|
 |
23
|
|
| |
24
|
Zhang, M., Kao, B., Yip, C., and Cheung, D. 2001. A GSP-based efficient algorithm for mining frequent sequences. In Proceedings of IC-AI'2001 (Las Vegas, NV).
|
|