|
ABSTRACT
Pattern discovery in long sequences is of great importance in many applications including computational biology study, consumer behavior analysis, system performance analysis, etc. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact to the biological function of the protein. It would be desirable if the occurrence of D in the observation can be related to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of compatibility matrix as the means to provide a probabilistic connection from the observation to the underlying true value. A new metric match is also proposed to capture the "real support" of a pattern which would be expected if a noise-free environment is assumed. In addition, in the context we address, a pattern could be very long. The standard pruning technique developed for the market basket problem may not work efficiently. As a result, a novel algorithm that combines statistical sampling and a new technique (namely border collapsing) is devised to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model (with respect to the noise) and the efficiency of the probabilistic algorithm.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2), 127-158, 1999.
|
 |
4
|
|
| |
5
|
G. Benson and M. Waterman. A method for fast database search for all k-nucleotide repeats. Nucleic Acid Research, 22, 4828-4836, 1994.
|
| |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
R. Durbin, S. Eddy, A. Krough, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
|
| |
10
|
|
| |
11
|
National Center for Biotechnology Information. Available at "http:/www.ncbi.nlm.nih.gov".
|
| |
12
|
|
 |
13
|
Jiawei Han , Jian Pei , Yiwen Yin, Mining frequent patterns without candidate generation, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.1-12, May 15-18, 2000, Dallas, Texas, United States
|
 |
14
|
Jiawei Han , Jian Pei , Behzad Mortazavi-Asl , Qiming Chen , Umeshwar Dayal , Mei-Chun Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.355-359, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347167]
|
| |
15
|
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30, 1963.
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
 |
22
|
|
 |
23
|
Jason Tsong-Li Wang , Gung-Wei Chirn , Thomas G. Marr , Bruce Shapiro , Dennis Shasha , Kaizhong Zhang, Combinatorial pattern discovery for scientific data: some preliminary results, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.115-125, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
24
|
|
 |
25
|
|
 |
26
|
Jiong Yang , Wei Wang , Philip S. Yu, Mining asynchronous periodic patterns in time series data, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.275-279, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347150]
|
 |
27
|
|
| |
28
|
J. Yang, W. Wang, and P. Yu. Mining long sequential patterns in a noisy environment. IBM Research Report, 2001.
|
| |
29
|
|
| |
30
|
M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. 3rd KDD, 283-286, 1997.
|
| |
31
|
|
CITED BY 31
|
|
|
|
|
|
|
|
Jian Pei , Jiawei Han , Behzad Mortazavi-Asl , Jianyong Wang , Helen Pinto , Qiming Chen , Umeshwar Dayal , Mei-Chun Hsu, Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach, IEEE Transactions on Knowledge and Data Engineering, v.16 n.11, p.1424-1440, November 2004
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lei Chang , Tengjiao Wang , Dongqing Yang , Hua Luan , Shiwei Tang, Efficient algorithms for incremental maintenance of closed sequential patterns in large databases, Data & Knowledge Engineering, v.68 n.1, p.68-106, January, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jian Pei , Haixun Wang , Jian Liu , Ke Wang , Jianyong Wang , Philip S. Yu, Discovering Frequent Closed Partial Orders from Strings, IEEE Transactions on Knowledge and Data Engineering, v.18 n.11, p.1467-1481, November 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|