|
ABSTRACT
Pattern discovery in sequences is an important problem in many applications, especially in computational biology and text mining. However, due to the noisy nature of data, the traditional sequential pattern model may fail to reflect the underlying characteristics of sequence data in these applications. There are two challenges: First, the mutation noise exists in the data, and therefore symbols may be misrepresented by other symbols; Secondly, the order of symbols in sequences could be permutated. To address the above problems, in this paper we propose a new sequential pattern model called mutable permutation patterns. Since the Apriori property does not hold for our permutation pattern model, a novel Permu-pattern algorithm is devised to mine frequent mutable permutation patterns from sequence databases. A reachability property is identified to prune the candidate set. Last but not least, we apply the permutation pattern model to a real genome dataset to discover gene clusters, which shows the effectiveness of the model. A large amount of synthetic data is also utilized to demonstrate the efficiency of the Permu-pattern algorithm.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. Bergeron, J. Stoye. On the Similarity of Sets of Permutations and Its Applications to Genome Comparison. COCOON 2003.
|
| |
2
|
R. Eres, G.M. Landau, L. Parida. Permutation Pattern Discovery in Biosequences. Journal of Computational Biology, 2004, 11(6):1050--1060. doi:10.1089/cmb.2004.11.1050.
|
| |
3
|
|
| |
4
|
BLAST, available at http://ncbi.nih.gov/BLAST/.
|
 |
5
|
|
| |
6
|
G. Deckers-Hebestreit, K. Altendorf. THE F0F1-TYPE ATP SYNTHASES OF BACTERIA: Structure and Function of the F0 Complex. Annual Review of Microbiology, Vol. 50: 791--824.
|
| |
7
|
R. Durbin, S. Eddy, A. Krogh, G. Mitchison. Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
|
| |
8
|
C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
|
| |
9
|
|
| |
10
|
D.Graur, W.H. Li. Fundamentals of Molecular Evolution. Ed. Sinauer Associates, Inc., 1991.
|
| |
11
|
J. Han, J. Pei. Mining Frequent Patterns by Pattern--growth: Methodology and Implications. KDD, 2000.
|
| |
12
|
|
| |
13
|
M. Hu, J. Yang, and W. Su. Permu--pattern: Discovery of Mutable Permutation Patterns with Proximity Constraint. CWRU EECS Dept. Tech. Report, 2007.
|
| |
14
|
M. Joshi, G. Karypis, V. Kumar. A Universal Formulation of Sequential Patterns. KDD workshop on Temporal Data Mining, 2001.
|
| |
15
|
G.M. Landau, L. Parida, O. Weimann. Gene Proximity Analysis Across Whole Genomes via PQ Trees. Journal of Computational Biology, 12(10), pp 1289---1306, 2005.
|
| |
16
|
R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, N. Maltsev. The Use of Gene Clusters to Infer Functional Coupling. Proc. Natl. Acad. Sci. U.S.A., 96(6):2896´lC2901, 1999.
|
| |
17
|
Jian Pei , Jiawei Han , Behzad Mortazavi-Asl , Helen Pinto , Qiming Chen , Umeshwar Dayal , Meichun Hsu, PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proceedings of the 17th International Conference on Data Engineering, p.215-224, April 02-06, 2001
|
 |
18
|
|
| |
19
|
Jian Pei , Jian Liu , Haixun Wang , Ke Wang , Philip S. Yu , Jianyong Wang, Efficiently Mining Frequent Closed Partial Orders, Proceedings of the Fifth IEEE International Conference on Data Mining, p.753-756, November 27-30, 2005
[doi> 10.1109/ICDM.2005.57]
|
| |
20
|
R. Rymon. Search Through Systematic Set Enumeration. In Int'l. Conf. on Principles of Knowledge Representation and Reasoning, 1992.
|
| |
21
|
|
| |
22
|
T. Schmidt, J. Stoye. Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences. CPM, 2004
|
| |
23
|
Roman L. Tatusov, Natalie D. Fedorova, John D. Jackson, Aviva R. Jacobs, Boris Kiryutin, Eugene V. Koonin, Dmitri M. Krylov, Raja Mazumder, Sergei L. Mekhedov, Anastasia N. Nikolskaya, B. Sridhar Rao, Sergei Smirnov, Alexander V. Sverdlov, Sona Vasudevan, Yuri I. Wolf, Jodie J. Yin, Darren A. Natale. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 2003, 4:41
|
 |
24
|
|
| |
25
|
|
 |
26
|
|
| |
27
|
|
|