|
ABSTRACT
In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for this problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Baldi, P., Chauvin, Y., McClure, M. and Hunkapiller, T. (1994) Hidden Markov Models of Biological Primary Sequence Information. Proceedings of the National Academy of Science, 91, pp. 1059--1063.
|
 |
3
|
|
| |
4
|
Chow, C. K. (1962) A recognition method using neighbor dependence, IRE Trans. Elect. Comput., vol. EC-11, pp. 683--690.
|
| |
5
|
Chu, J. T. (1974) Error Bounds for a Contextual Recognition Procedure. IEEE Trans. Elect. Comput., Vol. C-20, No. 10.
|
| |
6
|
Chudova, D. and Smyth P. (2002) Pattern discovery in sequences under a Markov assumption, Technical Report ICS-TR-02-08, University of California, Irvine.
|
| |
7
|
Duda, R. O., and Hart, P. E. (1973) Pattern Recognition and Scene Analysis, New York, NY: John Wiley.
|
| |
8
|
Eddy, S.R. (1995) Multiple Alignment Using Hidden Markov Models. Intelligent Systems for Molecular Biology, 3, pp. 114--120.
|
| |
9
|
Van Helden, J.V., Abdre, B, and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of Yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281, pp. 827--842
|
| |
10
|
Lee, E.T. (1974) Bounds and approximations for error probabilities in character recognition. Proceedings of the International Conference on Cybernetics and Society, pp. 324--329.
|
| |
11
|
Lawrence, C.E., Altshul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208--214
|
| |
12
|
Liu X, Brutlag D L, Liu J S. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 127--138.
|
| |
13
|
Liu, J. S., Neuwald, A.F., and Lawrence, C.E. (1995) Bayesian models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies. Journal of the American Statistical Association, 90, pp. 1156--1170.
|
| |
14
|
McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition, New York, NY: John Wiley and Sons.
|
| |
15
|
Pevzner, P. A. (2000) Computational Molecular Biology: an Algorithmic Approach. Cambridge, MA: The MIT Press.
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
Robison, K., McGuire, A.M., Church, G.M. (1998) A comprehensive library of DNA-binding Site Matrices for 55 Proteins Applied to the Complete Escherichia coli K-12 Genome. Journal of Molecular Biology, 284:241--254.
|
| |
20
|
Stormo, G.D., Hartzell, G.W. Proc. Natl. Acad. Sci., 86:1183--1187, 1989
|
| |
21
|
Sze, S.-H., Gelfand, M.S., Pevzner, P. A., (2002) Finding weak motifs in DNA sequences. Pacific Symposium on Biocomputing 2002, pp. 235--246
|
| |
22
|
|
CITED BY 7
|
|
|
|
|
Gang Wu , Yi Wu , Long Jiao , Yuan-Fang Wang , Edward Y. Chang, Multi-camera spatio-temporal fusion and biased sequence-data learning for security surveillance, Proceedings of the eleventh ACM international conference on Multimedia, November 02-08, 2003, Berkeley, CA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|