ABSTRACT
Signal finding (pattern discovery) in biological sequences is a fundamental problem in both computer science and molecular biology. Many approaches have been proposed for extracting interesting patterns (or motifs) from DNA/RNA and protein sequences. Some approaches are based on pairwise or multiple alignment techniques; some use biological knowledge and others do not. In this paper, we propose a de novo framework that performs motif identification and exploits a constrained co-clustering technique, allowing one to simultaneously find associations between groups of protein sequences and groups of motifs. We show that the presented approach is able to group together protein sequences belonging to the same family and, at the same time, to provide a set of characterizing motifs.
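To make the input of such a co-clustering step concrete, here is a minimal sketch of a sequence-by-motif occurrence matrix. It is an illustration under stated assumptions, not the framework's actual motif identification procedure: exact k-mers shared by several sequences stand in for motifs, and the function names, the toy sequences, and the parameters `k` and `min_support` are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_motifs(seqs, k, min_support):
    """Keep k-mers that occur in at least min_support sequences.

    Assumption: exact k-mers are a simple stand-in for motifs here.
    """
    support = Counter()
    for s in seqs:
        # Each sequence contributes at most once per k-mer.
        support.update(kmers(s, k))
    return sorted(m for m, c in support.items() if c >= min_support)


def occurrence_matrix(seqs, motifs):
    """Rows are sequences, columns are motifs; 1 marks an occurrence."""
    return [[1 if m in s else 0 for m in motifs] for s in seqs]


# Toy, made-up protein fragments (not real data).
sequences = ["MKVLAAGC", "TTKVLAGG", "PQRSTWYH", "AAQRSTWC"]
motifs = candidate_motifs(sequences, k=4, min_support=2)
matrix = occurrence_matrix(sequences, motifs)

for seq, row in zip(sequences, matrix):
    print(seq, row)
```

In this toy matrix the first two rows share one column and the last two rows share the other two columns. This block structure is what a co-clustering method would try to recover, by grouping rows (sequences) and columns (motifs) at the same time.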