|
Warning: The download time has expired please click on the item to try again.
ABSTRACT
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem. Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems.
We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l, d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. D. Andersen, S. J. Taplitz, S. Wong, G. Bristol, B. Larkin, and H. R. Herschman. Metal-dependent binding of a factor in vivo to the metal-responsive elements of the metallothionein 1 gene promoter. Molecular and Cellular Biology, 7:3574-81, 1987.
|
| |
2
|
|
 |
3
|
|
| |
4
|
|
| |
5
|
D. S. W. Boam, A. R. Clark, and K. Docherty. Positive and negative regulation of the human insulin gene by multiple trans-acting factors. Journal of Biological Chemistry, 265:8285-96, 1990.
|
| |
6
|
A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.
|
| |
7
|
J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 2001. To appear.
|
| |
8
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
|
| |
9
|
Y. M. Fraenkel, Y. Mandel, D. Friedberg, and H. Margalit. Identification of common motifs in unaligned DNA sequences: application to Escherichia coli Lrp regulon. Computer Applications in the Biosciences, 11(4):379-387, 1995.
|
| |
10
|
D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from Escherichia coli. Journal of Molecular Biology, 186(1):117-128, Nov. 1985.
|
| |
11
|
|
| |
12
|
W. S. Hayes and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing, pages 279-290, 1998.
|
| |
13
|
G. Z. Hertz and G. D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.
|
 |
14
|
|
| |
15
|
M. Kozak. Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiological Reviews, 47(1):1-45, 1983.
|
| |
16
|
C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.
|
| |
17
|
C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.
|
| |
18
|
B. Lewin. Genes VI. Oxford University Press, 1997.
|
| |
19
|
M. Linial, N. Linial, N. Tishby, and G. Yona. Global self-organization of all known protein sequences reveals inherent biological signatures. Journal of Molecular Biology, 268:539-556, 1997.
|
| |
20
|
C. J. McInerny, J. F. Partridge, G. E. Mikesell, D. P. Creemer, and L. L. Breeden. A novel Mcm1-dependent element in the SWI4, CLN3, CDC6, and CDC47 promoters activates M/G-specific transcription. Genes & Development, 11:1277-1288, 1997.
|
| |
21
|
A. L. Means and P. G. Farnham. Transcription initiation from the dihydrofolate reductase promoter is positioned by hip1 binding at the initiation site. Molecular and Cellular Biology, 10:653-61, 1990.
|
| |
22
|
S. Natsan and M. Gilman. Yy1 facilitates the association of serum response factor with the c-fos serum response element. Molecular and Cellular Biology, 15:5975-82, 1995.
|
| |
23
|
|
 |
24
|
|
 |
25
|
|
| |
26
|
|
| |
27
|
|
| |
28
|
R. Staden. Methods for discovering novel motifs in nucleic acid sequences. Computer Applications in the Biosciences, 5(4):293-298, 1989.
|
| |
29
|
|
| |
30
|
J. van Helden, B. Andr~, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-842, Sept. 4 1998.
|
| |
31
|
E. Wingender, P. Dietze, H. Karas, and R. Kn~ppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbfbraunschweig. de/TRANSFAC/.
|
CITED BY 26
|
|
|
|
|
Yoseph Barash , Gal Elidan , Nir Friedman , Tommy Kaplan, Modeling dependencies in protein-DNA binding sites, Proceedings of the seventh annual international conference on Research in computational molecular biology, p.28-37, April 10-14, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
Xiaoming Wu , Bo Wang , Changxin Song , Jingzhi Cheng, A combined model and a varied Gibbs sampling algorithm used for motif discovery, Proceedings of the second conference on Asia-Pacific bioinformatics, p.99-104, January 01, 2004, Dunedin, New Zealand
|
|
|
Jessica Lin , Eamonn Keogh , Stefano Lonardi , Bill Chiu, A symbolic representation of time series, with implications for streaming algorithms, Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, June 13-13, 2003, San Diego, California
|
|
|
Mayur Datar , Nicole Immorlica , Piotr Indyk , Vahab S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the twentieth annual symposium on Computational geometry, June 08-11, 2004, Brooklyn, New York, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
C. Iliopoulos , K. Perdikuri , E. Theodoridis , A. Tsakalidis , K. Tsichlas, Algorithms for extracting motifs from biological weighted sequences, Journal of Discrete Algorithms, v.5 n.2, p.229-242, June, 2007
|
|
|
Dragomir Yankov , Eamonn Keogh , Jose Medina , Bill Chiu , Victor Zordan, Detecting time series motifs under uniform scaling, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
Eran Segal , Yoseph Barash , Itamar Simon , Nir Friedman , Daphne Koller, From promoter sequence to expression: a probabilistic framework, Proceedings of the sixth annual international conference on Computational biology, p.263-272, April 18-21, 2002, Washington, DC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|