ACM Home Page
Please provide us with feedback. Feedback
A guided tour to approximate string matching
Full text PdfPdf (1.19 MB)
Source ACM Computing Surveys (CSUR) archive
Volume 33 ,  Issue 1  (March 2001) table of contents
Pages: 31 - 88  
Year of Publication: 2001
ISSN:0360-0300
Author
Gonzalo Navarro  Univ. of Chile, Santiago, Chile
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 116,   Downloads (12 Months): 1092,   Citation Count: 124
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/375360.375365
What is a DOI?

ABSTRACT

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
ALTSCHUL, S., GISH, W., MILLER, W., MYERS,G.,AND LIPMAN, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
 
4
 
5
 
6
APOSTOLICO, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. Springer-Verlag, Barlin, 85-96.
 
7
 
8
 
9
APOSTOLICO,A.AND GUERRA, C. 1987. The Longest Common Subsequence problem revisited. Algorithmica 2, 315-336.
 
10
ARAUJO, M., NAVARRO,G.,AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing (WSP '97), Carleton Univ. Press. 2-20.
 
11
ARLAZAROV, V., DINIC, E., KONROD, M., AND FARADZEV, I. 1975. On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209, 1210. Original in Russian in Dokl. Akad. Nauk SSSR 194, 1970.
 
12
ATALLAH, M., JACQUET,P.,AND SZPANKOWSKI, W. 1993. A probabilistic approach to pattern matching with mismatches. Random Struct. Algor. 4, 191- 213.
 
13
 
14
BAEZA-YATES, R. 1991. Some new results on approximate string matching. In Workshop on Data Structures, Dagstuhl, Germany. Abstract.
 
15
 
16
17
 
18
BAEZA-YATES,R.AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187-199. Preliminary version as Tech. Rep. CS-88-36, Data Structuring Group, Univ. of Waterloo, Sept. 1988.
 
19
 
20
BAEZA-YATES,R.AND NAVARRO, G. 1998. New and faster filters for multiple approximate string matching. Tech. Rep. TR/DCC-98-10, Dept. of Computer Science, University of Chile. Random Struct. Algor. to appear. ftp://ftp. dcc.ptuchile.cl/pub/users/gnavarro/multi. ps.gz.
 
21
 
22
 
23
 
24
 
25
 
26
BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., AND SEIFERAS, J. 1985. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31-55.
27
 
28
 
29
CHANG,W.AND LAWLER, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327-344. Preliminary version in FOCS '90.
 
30
 
31
CHVATAL,V.AND SANKOFF, D. 1975. Longest common subsequences of two random sequences. J. Appl. Probab. 12, 306-315.
 
32
COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM '95), 41-54.
 
33
 
34
 
35
 
36
 
37
 
38
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI,W.,AND RYTTER,W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247-267.
39
 
40
 
41
DEKEN, J. 1979. Some limit results for longest common subsequences. Discrete Math. 26, 17-31.
 
42
 
43
 
44
45
 
46
 
47
 
48
GIEGERICH, R., KURTZ, S., HISCHKE,F.,AND OHLEBUSCH, E. 1997. A general technique to improve filter algorithms for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press. 38-52. Preliminary version as Tech. Rep. 96-01, Universit at Bielefeld, Germany, 1996.
 
49
GONNET, G. 1992. A tutorial introduction to Computational Biochemistry using Darwin. Tech. rep., Informatik E. T. H., Zuerich, Switzerland.
 
50
 
51
GONZALEZ,R.AND THOMASON, M. 1978. Syntactic Pattern Recognition. Addison-Wesley, Reading, MA.
52
 
53
GROSSI,R.AND LUCCIO, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113-120.
 
54
55
 
56
57
 
58
HOLSTI,N.AND SUTINEN, E. 1994. Approximate string matching using q-gram places. In Proceedings of 7th Finnish Symposium on Computer Science. Univ. of Joensuu. 23-32.
 
59
 
60
HORSPOOL, R. 1980. Practical fast searching in strings. Software Practice Exper. 10, 501-506.
 
61
JOKINEN,P.AND UKKONEN, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91). Springer- Verlag, Berlin, vol. 16, 240-248.
 
62
 
63
 
64
KECECIOGLU,J.AND SANKOFF, D. 1995. Exact and approximation algorithms for the inversion distance between two permutations. Algorithmica 13, 180-210.
 
65
 
66
KNUTH, D., MORRIS, J., JR, AND PRATT, V. 1977. Fast pattern matching in strings. SIAM J. Com-put. 6, 1, 323-350.
67
 
68
KUMAR,S.AND SPAFFORD, E. 1994. A patternmatching model for intrusion detection. In Proceedings of the National Computer Security Conference, 11-21.
 
69
KURTZ, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press. 156- 170.
 
70
 
71
 
72
 
73
 
74
LAWRENCE,S.AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107-109.
 
75
 
76
LEVENSHTEIN, V. 1965. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transmission 1, 8-17.
 
77
LEVENSHTEIN, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 8, 707-710. Original in Russian in Dokl. Akad. Nauk SSSR 163, 4, 845-848, 1965.
 
78
LIPTON,R.AND LOPRESTI, D. 1985. A systolic array for rapid string comparison. In Proceedings of the Chapel Hill Conference on VLSI, 363- 376.
 
79
LOPRESTI,D.AND TOMKINS, A. 1994. On the search-ability of electronic ink. In Proceedings of the 4th International Workshop on Frontiers in Handwriting Recognition, 156-165.
 
80
81
 
82
LUCZAK,T.AND SZPANKOWSKI, W. 1997. A suboptimal lossy data compression based on approximate pattern matching. IEEE Trans. Inf. Theor. 43, 1439-1451.
 
83
MANBER,U.AND WU, S. 1994. GLIMPSE: A tool to search through entire file systems. In Proceedings of USENIX Technical Conference. USENIX Association, Berkeley, CA, USA. 23-32. Preliminary version as Tech. Rep. 93-34, Dept. of Computer Science, Univ. of Arizona, Oct. 1993.
 
84
MASEK,W.AND PATERSON, M. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18-31.
 
85
MASTERS, H. 1927. A study of spelling errors. Univ. of Iowa Studies in Educ. 4,4.
86
 
87
MELICHAR, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition (ICPR '96). IEEE CS Press, Silver Spring, MD. 256-260. Preliminary version in Computer Anal-ysis of Images and Patterns (LNCS, vol. 970, 1995).
88
 
89
 
90
MYERS, G. 1994a. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345-374. Perliminary version in Tech. Rep. TR90-25, Computer Science Dept., Univ. of Arizona, Sept. 1991.
 
91
MYERS, G. 1994b. Algorithmic Advances for Searching Biosequence Databases. Plenum Press, New York, 121-135.
 
92
MYERS, G. 1986a. Incremental alignment algorithms and their applications. Tech. Rep. 86-22, Dept. of Computer Science, Univ. of Arizona.
 
93
MYERS, G. 1986b. An O(ND) difference algorithm and its variations. Algorithmica 1, 251-266.
 
94
MYERS, G. 1991. An overview of sequence comparison algorithms in molecular biology. Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona.
95
 
96
NAVARRO, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 125-139.
 
97
NAVARRO, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 112-124.
 
98
NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Dept. of Computer Science, Univ. of Chile. Tech. Rep. TR/DCC-98-14. ftp://ftp. dcc.uchile.cl/pub/users/gnavarro/thesis98. ps.gz.
 
99
 
100
NAVARRO, G. 2000b. Nrgrep: A fast and flexible pattern matching tool, Tech. Rep. TR/DCC-2000-3. Dept. of Computer Science, Univ. of Chile, Aug. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/ nrgrep.ps.gz.
 
101
NAVARRO,G.AND BAEZA-YATES, R. 1998a. Improving an algorithm for approximate pattern matching. Tech. Rep. TR/DCC-98- 5, Dept. of Computer Science, Univ. of Chile. Algorithmica, to appear. ftp:// ftp.dcc.uchile.cl/pub/users/gnavarro/dexp. ps.gz.
 
102
NAVARRO,G.AND BAEZA-YATES, R. 1998b. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1,2.http://www.clei.cl.
 
103
 
104
 
105
106
 
107
 
108
NEEDLEMAN,S.AND WUNSCH, C. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 444-453.
 
109
 
110
 
111
 
112
RIVEST, R. 1976. Partial-match retrieval algorithms. SIAM J. Comput. 5,1.
 
113
SAHINALP,S.AND VISHKIN, U. 1997. Approximate pattern matching using locally consistent parsing. Manuscript, Univ. of Maryland Institute for Advanced Computer Studies (UMIACS).
 
114
SANKOFF, D. 1972. Matching sequences under deletion/insertion constraints. In Proceedings of the National Academy of Sciences of the USA, vol. 69, 4-6.
 
115
SANKOFF,D.AND KRUSKAL, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
 
116
SANKOFF,D.AND MAINVILLE, S. 1983. Common Subsequences and Monotone Subsequences. Addison-Wesley, Reading, MA, 363-365.
 
117
 
118
SELLERS, P. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math. 26, 787-793.
 
119
SELLERS, P. 1980. The theory and computation of evolutionary distances: pattern recognition. J. Algor. 1, 359-373.
 
120
SHI, F. 1996. Fast approximate string matching with q-blocks sequences. In Proceedings of the 3rd South American Workshop on String Processing (WSP'96). Carleton Univ. Press. 257- 271.
121
 
122
SUTINEN, E. 1998. Approximate Pattern Matching with the q-Gram Family. Ph.D. thesis, Dept. of Computer Science, Univ. of Helsinki, Finland. Tech. Rep. A-1998-3.
 
123
 
124
 
125
 
126
 
127
128
 
129
 
130
UKKONEN, E. 1985b. Finding approximate patterns in strings. J. Algor. 6, 132-137.
 
131
 
132
 
133
UKKONEN, E. 1995. Constructing suffix trees online in linear time. Algorithmica 14, 3, 249- 260.
 
134
UKKONEN,E.AND WOOD, D. 1993. Approximate string matching with suffix automata. Algorithmica 10, 353-364. Preliminary version in Rep. A-1990-4, Dept. of Computer Science, Univ. of Helsinki, Apr. 1990.
 
135
ULLMAN, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Comput. J. 10, 141-147.
 
136
VINTSYUK, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 52-58.
137
 
138
WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.
 
139
WEINER, P. 1973. Linear pattern matching algorithms. In Proceedings of IEEE Symposium on Switching and Automata Theory, 1-11.
 
140
 
141
WU,S.AND MANBER, U. 1992a. Agrepfia fast approximate pattern-matching tool. In Proceedings of USENIX Technical Conference. USENIX Association, Berkeley, CA, USA. 153-162.
142
 
143
 
144
WU, S., MANBER,U.,AND MYERS, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15,1,50- 67. Preliminary version as Tech. Rep. TR29-36, Computer Science Dept., Univ. of Arizona, 1992.
 
145
YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 368- 387.
 
146
147

CITED BY  124


REVIEW

"Paul Cull : Reviewer"

String matching is used in a variety of areas, including word processing, information retrieval, and computational biology. This survey considers the problem of locating within a text string all substrings which are nearly the same as an inp  more...