|
ABSTRACT
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
S. Altschul T. Madden A. Schäffer J. Zhang Z. Zhang W. Miller and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” <i>Nucleic Acids Research,</i> vol. 25, no. 17, pp. 3389-3402, 1997.
|
| |
4
|
B. Ma J. Tromp and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” <i>Bioinformatics,</i> vol. 18, no. 3, pp. 440-445, 2002.
|
| |
5
|
S. Schwartz J. Kent A. Smit Z. Zhang R. Baertsch R. Hardison D. Haussler and W. Miller, “Human-Mouse Alignments with BLASTZ,” <i>Genome Research,</i> vol. 13, pp. 103-107, 2003.
|
| |
6
|
L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” <i>BMC Bioinformatics,</i> vol. 5, no. 149, Oct. 2004.
|
| |
7
|
P. Pevzner and M. Waterman, “Multiple Filtration and Approximate Pattern Matching,” <i>Algorithmica,</i> vol. 13, pp. 135-154, 1995.
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
 |
11
|
|
| |
12
|
B. Brejova D. Brown and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” <i>Proc. Third Int'l Workshop Algorithms in Bioinformatics (WABI),</i> pp. 39-54, Sept. 2003.
|
| |
13
|
|
| |
14
|
|
| |
15
|
M. Csürös, “Performing Local Similarity Searches with Variable Length Seeds,” <i>Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM),</i> pp. 373-387, 2004.
|
| |
16
|
M. Li B. Ma D. Kisman and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” <i>J. Bioinformatics and Computational Biology,</i> vol. 2, no. 3, pp. 417-440, Sept. 2004.
|
 |
17
|
|
| |
18
|
D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” <i>Proc. Fourth Int'l Workshop Algorithms in Bioinformatics (WABI),</i> pp. 170-181, Sept. 2004.
|
| |
19
|
J. Xu D. Brown M. Li and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” <i>Proc. 15th Symp. Combinatorial Pattern Matching,</i> pp. 47-58, 2004.
|
| |
20
|
|
| |
21
|
F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” <i>Bioinformatics,</i> vol. 17, pp. 1067-1076, 2001.
|
| |
22
|
L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” <i>Bioinformatics,</i> vol. 18, no. 10, pp. 1340-1349, 2002.
|
| |
23
|
S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” <i>J. Bioinformatics and Computational Biology,</i> vol. 1, no. 2, pp. 343-361, 2003.
|
| |
24
|
J. Zheng T. Close T. Jiang and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” <i>Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM),</i> pp. 273-283, 2003.
|
| |
25
|
|
CITED BY 3
|
|
Pierre Peterlongo , Nadia Pisanti , Frédéric Boyer , Alair Pereira do Lago , Marie-France Sagot, Lossless filter for multiple repetitions with Hamming distance, Journal of Discrete Algorithms, v.6 n.3, p.497-509, September, 2008
|
|
|
|
|
|
|
INDEX TERMS
Primary Classification:
I.
Computing Methodologies
I.5
PATTERN RECOGNITION
I.5.3
Clustering
Subjects:
Similarity measures
Additional Classification:
I.
Computing Methodologies
I.5
PATTERN RECOGNITION
I.5.0
General
J.
Computer Applications
J.3
LIFE AND MEDICAL SCIENCES
Subjects:
Biology and genetics
General Terms:
Algorithms,
Design,
Measurement,
Performance,
Theory
Keywords:
Index Terms- Filtration,
string matching,
gapped seed,
gapped q-gram,
local alignment,
sequence similarity,
seed family,
multiple spaced seeds,
dynamic programming,
EST,
oligonucleotide selection.
|