APPENDICES and SUPPLEMENTS
|
|
The software suite accompanying the article; this is a small Unix tar file, which includes the source code, a Makefile, and the test files used in the article.
|
ABSTRACT
The most important features of a string matching algorithm are its efficiency and its flexibility. Efficiency has traditionally received more attention, while flexibility in the search pattern is becoming a more and more important issue. Most classical string matching algorithms are aimed at quickly finding an exact pattern in a text, being Knuth-Morris-Pratt (KMP) and the Boyer-Moore (BM) family the most famous ones. A recent development uses deterministic "suffix automata" to design new optimal string matching algorithms, e.g. BDM and TurboBDM. Flexibility has been addressed quite separately by the use of "bit-parallelism", which simulates automata in their nondeterministic form by using bits and exploiting the intrinsic parallelism inside the computer word, e.g. the Shift-Or algorithm. Those algorithms are extended to handle classes of characters and errors in the pattern and/or in the text, their drawback being their inability to skip text characters. In this paper we merge bit-parallelism and suffix automata, so that a nondeterministic suffix automaton is simulated using bit-parallelism. The resulting algorithm, called BNDM, obtains the best from both worlds. It is much simpler to implement than BDM and nearly as simple as Shift-Or. It inherits from Shift-Or the ability to handle flexible patterns and from BDM the ability to skip characters. BNDM is 30%-40% faster than BDM and up to 7 times faster than Shift-Or. When compared to the fastest existing algorithms on exact patterns (which belong to the BM family), BNDM is from 20% slower to 3 times faster, depending on the alphabet size. With respect to flexible pattern searching, BNDM is by far the fastest technique to deal with classes of characters and is competitive to search allowing errors. In particular, BNDM seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that area. As a theoretical development related to flexible pattern matching, we introduce a new automaton to recognize suffixes of patterns with classes of characters. To the best of our knowledge, this automaton has not been studied before.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
| |
4
|
|
| |
5
|
BAEZA-YATES, R. AND NAVARRO, G. 1999. Faster approximate string matching. <i>Algorithmica 23</i>, 2, 127-158.
|
| |
6
|
|
| |
7
|
BLUMER, A., EHRENFEUCHT, A., AND HAUSSLER, D. 1989. Average sizes of suffix trees and dawgs. <i>Discrete Applied Mathematics 24</i>, 1, 37-45.
|
 |
8
|
|
| |
9
|
|
| |
10
|
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1993. Fast practical multi-pattern matching. Rapport 93-3, Institut Gaspard Monge, Universiteacute; de Marne la Valleacute;e.
|
| |
11
|
|
| |
12
|
CZUMAJ, A., CROCHEMORE, M., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Speeding up two string-matching algorithms. <i>Algorithmica 12</i>, 247-267.
|
| |
13
|
|
| |
14
|
FISCHER, M. J. AND PATERSON, M. 1974. String matching and other products. In R. M. KARP Ed., <i>Proceedings SIAM-AMS Complexity of Computation</i> (Providence, RI, 1974), pp. 113-125.
|
| |
15
|
HORSPOOL, R.N. 1980. Practical fast searching in strings. <i>Software Practice and Experience 10</i>, 501-506.
|
| |
16
|
|
| |
17
|
KNUTH, D. E., MORRIS, J. H., AND PRATT, V.R. 1977. Fast pattern matching in strings. <i>SIAM Journal on Computing 6</i>, 1, 323-350.
|
| |
18
|
LECROQ, T. 1992. <i>Recherches de mot.</i> Ph.D. thesis, Universiteacute; d'Orleacute;ans, France.
|
 |
19
|
|
| |
20
|
NAVARRO, G. 1997. A partial deterministic automaton for approximate string matching. In <i>Proc. of WSP'97</i> (1997), pp. 112-124. Carleton University Press.
|
| |
21
|
NAVARRO, G. 1998. <i>Approximate Text Searching</i>. Ph.D. thesis, Department of Computer Science, University of Chile. Tech. Report TR/DCC-98-14. ftp://ftp.dcc.uchile.cl/- pub/users/gnavarro/thesis98.ps.gz.
|
 |
22
|
|
| |
23
|
NAVARRO, G. 2000b. Nrgrep: a fast and flexible pattern matching tool. Technical Report TR/DCC-2000-3 (Aug.), Dept. of Computer Science, Univ. of Chile. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
 |
27
|
Gonzalo Navarro , Mathieu Raffinot, Fast and simple character classes and bounded gaps pattern matching, with application to protein searching, Proceedings of the fifth annual international conference on Computational biology, p.231-240, April 22-25, 2001, Montreal, Quebec, Canada
[doi> 10.1145/369133.369220]
|
| |
28
|
PINTER, R.Y. 1985. Efficient string matching with don't care pattern. In A. APOSTOLICO AND Z. GALIL Eds., <i>Combinatorial Algorithms on Words</i>, pp. 11-29. Springer-Verlag, Berlin.
|
| |
29
|
RAFFINOT, M. 1997a. Asymptotic estimation of the average number of terminal states in dawgs. In <i>Proc. WSP'97</i> (1997), pp. 140-148. Carleton University Press.
|
| |
30
|
RAFFINOT, M. 1997b. On the multi backward dawg matching algorithm (MultiBDM). In <i>Proc. WSP'97</i> (1997), pp. 149-165. Carleton University Press.
|
 |
31
|
|
| |
32
|
|
 |
33
|
|
| |
34
|
WU, S., MANBER, U., AND MYERS, E. 1996. A sub-quadratic algorithm for approximate limited expression matching. <i>Algorithmica 15</i>, 1, 50-67.
|
| |
35
|
YAO, A. 1979. The complexity of pattern matching for a random string. <i>SIAM J. on Computing 8</i>, 368-387.
|
CITED BY 18
|
|
|
|
|
Gonzalo Navarro , Mathieu Raffinot, Fast and simple character classes and bounded gaps pattern matching, with application to protein searching, Proceedings of the fifth annual international conference on Computational biology, p.231-240, April 22-25, 2001, Montreal, Quebec, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|