ABSTRACT
The motif discovery problem consists of finding over-represented patterns in a collection of sequences. Its difficulty stems partly from the many possible ways of defining both the motif space to be searched and the notion of over-representation. Because the size of the search space is generally exponential in the motif length, many heuristic methods, including evolutionary algorithms, have been developed. However, comparatively little attention has been devoted to evaluating motif quality adequately, especially when comparing motifs of different lengths. We propose an evolution strategy for the motif discovery problem based on a new fitness function that simultaneously takes into account (1) the number of motif occurrences, (2) the motif length, and (3) the motif's information content. Experimental results show that the proposed method succeeds in uncovering the correct motif positions and length with high accuracy.
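To make the third criterion concrete, the following is a minimal sketch of how the information content of a set of aligned motif occurrences is commonly computed, assuming a DNA alphabet and a uniform background distribution. It is not the fitness function proposed in the paper, which the abstract does not specify; the function name `information_content` and the example sites are hypothetical, and the sketch omits pseudocounts and small-sample corrections.

```python
import math
from collections import Counter


def information_content(sites, alphabet="ACGT"):
    """Total information content, in bits, of equal-length aligned sites.

    Assumes a uniform background over `alphabet` and that every site
    contains only symbols from `alphabet` (illustrative assumptions).
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits per position for DNA
    total = 0.0
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        entropy = 0.0
        for base in alphabet:
            p = counts[base] / len(sites)
            if p > 0:
                entropy -= p * math.log2(p)
        total += max_bits - entropy  # conserved columns contribute more
    return total


# Hypothetical example sites, not taken from the paper.
short_motif = ["ACGT", "ACGT", "ACGT"]
long_motif = ["ACGTAC", "ACGTAC", "ACGTAC"]
print(information_content(short_motif))
print(information_content(long_motif))
```

Because the total grows with the number of conserved positions, a longer motif scores higher than a shorter one even when both are perfectly conserved. This is one reason why comparing motifs of different lengths requires a measure that accounts for length and number of occurrences as well.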
INDEX TERMS
Primary Classification:
I. Computing Methodologies
I.2 ARTIFICIAL INTELLIGENCE
I.2.8 Problem Solving, Control Methods, and Search
Subjects: Heuristic methods
Additional Classification:
I. Computing Methodologies
I.5 PATTERN RECOGNITION
I.5.1 Models
Subjects: Statistical
J. Computer Applications
J.3 LIFE AND MEDICAL SCIENCES
Subjects: Biology and genetics
General Terms: Algorithms, Design, Experimentation, Theory
Keywords: computational biology, DNA, EA, ES, evolution strategies, evolutionary algorithms, local search, motif discovery, transcription factor