|
ABSTRACT
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named as Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases, documented prokaryotic, and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits its excellent capacity while compared with other multiple sequence alignment methods.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell, fourth ed. Garland Science, 2004.
|
| |
2
|
C.E. Lawrence et al., "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment," Science, vol. 262, pp. 208-214, 1993.
|
| |
3
|
A.F. Neuwald, J.S. Liu, and C.E. Lawrence, "Gibbs Motif Sampling: Detection of Bacterial Outer Membrane Protein Repeats," Protein Science, vol. 4, pp. 1618-1632, 1995.
|
| |
4
|
A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich et al., "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, pp. D138-D141, 2004.
|
| |
5
|
A. Phillips, D. Janies, and W. Wheeler, "Multiple Sequence Alignments in Phylogenetic Analysis," Molecular Phylogenetics and Evolution, vol. 16, pp. 317-330, 2000.
|
| |
6
|
B. Rost and C. Sander, "Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure," Proteins, vol. 19, pp. 55-72, 1994.
|
| |
7
|
C.D. Livingstone and G.J. Barton, "Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation," Computer Applications in the Biosciences, vol. 9, pp. 745-756, 1993.
|
| |
8
|
J.D. Thompson, F. Plewniak, and O. Poch, "A Comprehensive Comparison of Multiple Sequence Alignment Programs," Nucleic Acids Research, vol. 27, pp. 2682-2690, 1999.
|
| |
9
|
"The ENCODE Project Consortium: Identification and Analysis of Functional Elements in 1 Percent of the Human Genome by the ENCODE Pilot Project," Science, vol. 447, pp. 799-816, 2007.
|
| |
10
|
K.D. MacIsaac and E. Fraenkel, "Practical Strategies for Discovering Regulatory DNA Sequence Motifs," PLoS Computational Biology, vol. 2, e36, 2006.
|
| |
11
|
M. Tompa et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites," Nature Biotechnology , vol. 23, pp. 137-144, 2005.
|
| |
12
|
G.D. Stormo, "DNA Binding Sites: Representation and Discovery," Bioinformatics, vol. 16, pp. 16-23, 2000.
|
| |
13
|
H. Ji and W.H. Wong, "Computational Biology: Towards Deciphering Gene Regulatory Information in Mammalian Genomes," Biometrics, vol. 62, pp. 645-663, 2006.
|
| |
14
|
G.D. Stormo and G.W. Hartzell, "Identifying Protein-Binding Sites from Unaligned DNA Fragments," Proc. Nat'l Academy of Sciences USA, vol. 86, pp. 1183-1187, 1989.
|
| |
15
|
X. Liu, D.L. Brutlag, and J.S. Liu, "BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes," Proc. Sixth Pacific Symp. Biocomputing (PSB '01), vol. 6, pp. 127-138, 2001.
|
| |
16
|
|
| |
17
|
S. Sinha and M. Tompa, "Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresention," Nucleic Acids Research, vol. 30, pp. 5549-5560, 2002.
|
| |
18
|
G. Pesole, N. Prunella, S. Liuni, M. Attimonelli, and C. Saccone, "WORDUP: An Efficient Algorithm for Discovering Statistically Significant Patterns in DNA Sequences," Nucleic Acids Research, vol. 20, pp. 2871-2875, 1992.
|
| |
19
|
H.J. Bussemaker, H. Li, and E.D. Siggia, "Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis," Proc. Nat'l Academy of Sciences USA, vol. 97, pp. 10096-10100, 2000.
|
| |
20
|
G. Pavesi, G. Mauri, and G. Pesole, "An Algorithm for Finding Signals of Unknown Length in DNA Sequences," Bioinformatics, vol. 17, no. 1, pp. S207-S214, 2001.
|
| |
21
|
J. Buhler and M. Tompa, "Finding Motifs Using Random Projection," J. Computational Biology, vol. 9, pp. 225-242, 2002.
|
| |
22
|
A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
|
| |
23
|
C.E. Lawrence and A.A. Reilly, "An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences," Proteins: Structure, Function and Genetics, vol. 7, pp. 41-51, 1990.
|
| |
24
|
G.C.G. Wei and M.A. Tanner, "A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms," J. Am. Statistical Assoc., vol. 85, pp. 699-704, 1990.
|
| |
25
|
G. Celeux, D. Chauveau, and J. Diebolt, "Stochastic Versions of the EM Algorithm: An Experimental Study in the Mixture Case," J. Statistical Computation and Simulation, vol. 55, pp. 287-314, 1996.
|
| |
26
|
B. Delyon, M. Lavielle, and E. Moulines, "Convergence of a Stochastic Approximation Version of the EM Algorithm," Annals of Statistics, vol. 27, pp. 94-128, 1999.
|
| |
27
|
R.A. Levine and G. Casella, "Implementations of Monte Carlo EM Algorithm," J. Computational and Graphical Statistics, vol. 10, pp. 422-439, 2001.
|
| |
28
|
C.F.J. Wu, "On the Convergence Properties of the EM Algorithm," The Annals of Statistics, vol. 11, pp. 95-103, 1983.
|
| |
29
|
|
| |
30
|
S. Kullback and R.A. Leibler, "On Information and Sufficiency," Annals of Math. Statistics, vol. 22, pp. 79-86, 1951.
|
| |
31
|
O.G. Berg and P.H. von Hippel, "Selection of DNA Binding Sites by Regulatory Proteins: Statistical-Mechanical Theory and Application to Operators and Promoters," J. Molecular Biology, vol. 193, pp. 723-750, 1987.
|
| |
32
|
L. Wang and T. Jiang, "On the Complexity of Multiple Sequence Alignment," J. Computational Biology, vol. 1, pp. 337-348, 1994.
|
| |
33
|
C.-P. Bi, "SEAM: A Stochastic EM-Type Algorithm for Motif-Finding in Biopolymer Sequences," J. Bioinformatics and Computational Biology, vol. 5, pp. 47-77, 2007.
|
| |
34
|
|
| |
35
|
S. Keles, M.J. van der Laan, S. Dudoit, B. Xing, and M.B. Eisen, "Supervised Detection of Regulatory Motifs in DNA Sequences," Statistical Applications in Genetics and Molecular Biology, vol. 2, p. 5, 2003.
|
| |
36
|
C.-P. Bi, "Multiple Sequence Local Alignment Using Monte Carlo EM Algorithm," Proc. Third Int'l Symp. Bioinformatics Research and Applications (ISBRA '07), vol. 4463, I. Mandoiu and A. Zelikovsky, eds., pp. 465-476, 2007.
|
| |
37
|
M.C. Frith, U. Hansen, J.L. Spouge, and Z. Weng, "Finding Functional Sequence Elements by Multiple Local Alignment," Nucleic Acids Research, vol. 32, pp. 189-200, 2004.
|
| |
38
|
B. De Crombrugghe, S. Busby, and H. Buc, "Cyclic AMP Receptor Protein: Role in Transcription Activation," Science, vol. 224, pp. 831-838, 1984.
|
| |
39
|
O.G. Berg, "Selection of DNA Binding Sites by Regulatory Proteins: The LexA Protein and the Arginine Repressor Use Different Strategies for Functional Specificity," Nucleic Acids Research, vol. 16, pp. 5089-5105, 1988.
|
| |
40
|
G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Roze, and Y. Moreau, "A Gibbs Sampling Method to Detect Over-Represented Motifs in the Upstream Regions of Co-Expressed Genes," J. Computational Biology, vol. 9, pp. 447-464, 2002.
|
| |
41
|
C.-P. Bi, J.S. Leeder, and C.A. Vyhlidal, "A Comparative Study on Computational Two-Block Motif Detection: Algorithms and Applications," Molecular Pharmaceutics, vol. 5, pp. 3-16, 2008.
|
| |
42
|
H. Salgado, S. Gama-Castro, M. Peralta-Gil, E. Diaz-Peredo, F. Sanchez-Solano, A. Santos-Zavaleta, I. Martinez-Flores, V. Jimenez-Jacinto, C. Bonavides-Martinez, J. Segura-Salazar, A. Martinez-Antonio, and J. Collado-Vides, "RegulonDB (version 5.0): Escherichia Coli K-12 Transcriptional Regulatory Network, Operon Organization, and Growth Conditions," Nucleic Acids Research, vol. 34, pp. D394-D397, 2006.
|
| |
43
|
A.E. Kel, O.V. Kel-Margoulis, P.J. Farnham, S.M. Bartley, E. Wingender, and M.Q. Zhang, "Computer-Assisted Identification of Cell Cycle-Related Genes: New Targets for E2F Transcription Factors," J. Molecular Biology, vol. 309, pp. 99-120, 2001.
|
| |
44
|
C.M. Klinge, "Estrogen Receptor Interaction with Estrogen Response Elements," Nucleic Acids Research, vol. 29, pp. 2905-2919, 2001.
|
| |
45
|
W.R. Gilks, S. Richardson and D.J. Spielgelhalter, eds., Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.
|
| |
46
|
F. Liang and W.H. Wong, "Evolutionary Monte Carlo: Applications to Cp Model Sampling and Change Point Problem," Statistica Sinica, vol. 10, pp. 317-342, 2000.
|
| |
47
|
C.-P. Bi, "Evolutionary Metropolis Sampling in Sequence Alignment Space," Proc. IEEE Congress on Evolutionary Computation (CEC '08), pp. 189-194, 2008.
|
| |
48
|
|
| |
49
|
C.-P. Bi, "A Genetic-Based EM Motif-Finding Algorithm for Biological Sequence Analysis," Proc. IEEE Symp. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB '07), pp. 275-282, 2007.
|
| |
50
|
C.-P. Bi, "Data Augmentation Algorithms for Detecting Conserved Domains in Protein Sequences: A Comparative Study," J. Proteome Research, vol. 7, pp. 192-210, 2008.
|
| |
51
|
R.C. Edgar, "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput," Nucleic Acids Research, vol. 32, pp. 1792-1797, 2004.
|
| |
52
|
J.D. Thompson, D.G. Higgins, and T.J. Gibson, "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice," Nucleic Acids Research, vol. 22, pp. 4673-4680, 1994.
|
| |
53
|
A. Sandelin, W. Alkema, P. Engstrom, W.W. Wasserman, and B. Lenhard, "JASPAR: An Open-Access Database for Eukaryotic Transcription Factor Binding Profiles," Nucleic Acids Research, vol. 32, pp. D91-D94, 2004.
|
| |
54
|
C. Tuerk and L. Gold, "Systematic Evolution of Ligands by Exponential Enrichment: RNA Ligands to Bacteriophage T4 DNA Polymerase," Science, vol. 249, pp. 505-510, 1990.
|
| |
55
|
E. Blanco, D. Farre, M.M Alba, X. Messeguer, and R. Guigo, "ABS: A Database of Annotated Regulatory Binding Sites from Orthologous Promoters," Nucleic Acids Research, vol. 34, pp. D63-D67, 2006.
|
| |
56
|
|
| |
57
|
T. Krell, A.J. Molina-Henares, and J.L. Ramos, "The IclR Family of Transcriptional Activators and Repressors Can Be Defined by a Single Profile," Protein Science, vol. 15, pp. 1207-1213, 2006.
|
| |
58
|
J. Chen, J.B. Anderson, C. DeWeese-Scott, N.D. Fedorova, L.Y. Geer, S. He, D.I. Hurwitz, J.D. Jackson, A.R. Jacobs, C.J. Lanczycki, C.A. Liebert, C. Liu, T. Madej, A. Marchler-Bauer, G.H. Marchler, R. Mazumder, A.N. Nikolskaya, B.S. Rao, A.R. Panchenko, B.A. Shoemaker, V. Simonyan, J.S. Song, P.A. Thiessen, S. Vasudevan, Y. Wang, R.A. Yamashita, J.J. Yin, and S.H. Bryant, "MMDB: Entrezs 3D-Structure Database," Nucleic Acids Research, vol. 31, pp. 474-477, 2003.
|
| |
59
|
R.G. Zhang, Y. Kim, T. Skarina, S. Beasley, R. Laskowski, C. Arrowsmith, A. Edwards, A. Joachimiak, and A. Savchenko, "Crystal Structure of Thermotoga Maritima 0065, a member of the IclR transcriptional factor family," J. Biological Chemistry, vol. 277, pp. 19183-19190, 2002.
|
| |
60
|
J.D. Thompson, P. Koehl, R. Ripp, and O. Poch, "BAliBASE 3.0: Latest Developments of the Multiple Sequence Alignment Benchmark," Proteins, vol. 61, pp. 127-136, 2005.
|
| |
61
|
S. Kim, Z. Wang, and M. Dakilic, "iGibbs: Improving Gibbs Motif Sampler for Proteins by Sequence Clustering and Iterative Pattern Sampling," Proteins: Structure, Function, and Bioinformatics, vol. 66, pp. 671-681, 2007.
|
| |
62
|
S.T. Jensen, X.S. Liu, Q. Zhou, and J.S. Liu, "Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective," Statistical Science, vol. 19, pp. 188-204, 2004.
|
| |
63
|
C.-P. Bi, "DNA Motif Alignment by Evolving a Population of Markov Chains," BMC Bioinformatics, in press.
|
| |
64
|
Z.S. Qin, L.A. McCue, W. Thompson, L. Mayerhfer, C.E. Lawrence, and J.S. Liu, "Identification of Co-Regulated Genes through Bayesian Clustering of Predicted Regulatory Binding Sites," Nature Biotechnology, vol. 21, pp. 435-439, 2003.
|
|