|
ABSTRACT
We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
ALTSCHUL, S., GISH, W., MILLER, W., MYERS,G.,AND LIPMAN, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
|
| |
4
|
|
| |
5
|
|
| |
6
|
APOSTOLICO, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. Springer-Verlag, Barlin, 85-96.
|
| |
7
|
|
| |
8
|
|
| |
9
|
APOSTOLICO,A.AND GUERRA, C. 1987. The Longest Common Subsequence problem revisited. Algorithmica 2, 315-336.
|
| |
10
|
ARAUJO, M., NAVARRO,G.,AND ZIVIANI, N. 1997. Large text searching allowing errors. In Proceedings of the 4th South American Workshop on String Processing (WSP '97), Carleton Univ. Press. 2-20.
|
| |
11
|
ARLAZAROV, V., DINIC, E., KONROD, M., AND FARADZEV, I. 1975. On economic construction of the transitive closure of a directed graph. Sov. Math. Dokl. 11, 1209, 1210. Original in Russian in Dokl. Akad. Nauk SSSR 194, 1970.
|
| |
12
|
ATALLAH, M., JACQUET,P.,AND SZPANKOWSKI, W. 1993. A probabilistic approach to pattern matching with mismatches. Random Struct. Algor. 4, 191- 213.
|
| |
13
|
|
| |
14
|
BAEZA-YATES, R. 1991. Some new results on approximate string matching. In Workshop on Data Structures, Dagstuhl, Germany. Abstract.
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
| |
18
|
BAEZA-YATES,R.AND GONNET, G. 1994. Fast string matching with mismatches. Information and Computation 108, 2, 187-199. Preliminary version as Tech. Rep. CS-88-36, Data Structuring Group, Univ. of Waterloo, Sept. 1988.
|
| |
19
|
|
| |
20
|
BAEZA-YATES,R.AND NAVARRO, G. 1998. New and faster filters for multiple approximate string matching. Tech. Rep. TR/DCC-98-10, Dept. of Computer Science, University of Chile. Random Struct. Algor. to appear. ftp://ftp. dcc.ptuchile.cl/pub/users/gnavarro/multi. ps.gz.
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
BLUMER, A., BLUMER, J., HAUSSLER, D., EHRENFEUCHT, A., CHEN, M., AND SEIFERAS, J. 1985. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31-55.
|
 |
27
|
|
| |
28
|
|
| |
29
|
CHANG,W.AND LAWLER, E. 1994. Sublinear approximate string matching and biological applications. Algorithmica 12, 4/5, 327-344. Preliminary version in FOCS '90.
|
| |
30
|
|
| |
31
|
CHVATAL,V.AND SANKOFF, D. 1975. Longest common subsequences of two random sequences. J. Appl. Probab. 12, 306-315.
|
| |
32
|
COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM '95), 41-54.
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI,W.,AND RYTTER,W. 1994. Speeding up two string-matching algorithms. Algorithmica 12, 247-267.
|
 |
39
|
|
| |
40
|
|
| |
41
|
DEKEN, J. 1979. Some limit results for longest common subsequences. Discrete Math. 26, 17-31.
|
| |
42
|
|
| |
43
|
|
| |
44
|
|
 |
45
|
James C. French , Allison L. Powell , Eric Schulman, Applications of approximate word matching in information retrieval, Proceedings of the sixth international conference on Information and knowledge management, p.9-15, November 10-14, 1997, Las Vegas, Nevada, United States
[doi> 10.1145/266714.266721]
|
| |
46
|
|
| |
47
|
|
| |
48
|
GIEGERICH, R., KURTZ, S., HISCHKE,F.,AND OHLEBUSCH, E. 1997. A general technique to improve filter algorithms for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press. 38-52. Preliminary version as Tech. Rep. 96-01, Universit at Bielefeld, Germany, 1996.
|
| |
49
|
GONNET, G. 1992. A tutorial introduction to Computational Biochemistry using Darwin. Tech. rep., Informatik E. T. H., Zuerich, Switzerland.
|
| |
50
|
|
| |
51
|
GONZALEZ,R.AND THOMASON, M. 1978. Syntactic Pattern Recognition. Addison-Wesley, Reading, MA.
|
 |
52
|
|
| |
53
|
GROSSI,R.AND LUCCIO, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113-120.
|
| |
54
|
|
 |
55
|
|
| |
56
|
|
 |
57
|
|
| |
58
|
HOLSTI,N.AND SUTINEN, E. 1994. Approximate string matching using q-gram places. In Proceedings of 7th Finnish Symposium on Computer Science. Univ. of Joensuu. 23-32.
|
| |
59
|
|
| |
60
|
HORSPOOL, R. 1980. Practical fast searching in strings. Software Practice Exper. 10, 501-506.
|
| |
61
|
JOKINEN,P.AND UKKONEN, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91). Springer- Verlag, Berlin, vol. 16, 240-248.
|
| |
62
|
|
| |
63
|
|
| |
64
|
KECECIOGLU,J.AND SANKOFF, D. 1995. Exact and approximation algorithms for the inversion distance between two permutations. Algorithmica 13, 180-210.
|
| |
65
|
|
| |
66
|
KNUTH, D., MORRIS, J., JR, AND PRATT, V. 1977. Fast pattern matching in strings. SIAM J. Com-put. 6, 1, 323-350.
|
 |
67
|
|
| |
68
|
KUMAR,S.AND SPAFFORD, E. 1994. A patternmatching model for intrusion detection. In Proceedings of the National Computer Security Conference, 11-21.
|
| |
69
|
KURTZ, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). Carleton Univ. Press. 156- 170.
|
| |
70
|
|
| |
71
|
|
| |
72
|
|
| |
73
|
|
| |
74
|
LAWRENCE,S.AND GILES, C. L. 1999. Accessibility of information on the web. Nature 400, 107-109.
|
| |
75
|
|
| |
76
|
LEVENSHTEIN, V. 1965. Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transmission 1, 8-17.
|
| |
77
|
LEVENSHTEIN, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10, 8, 707-710. Original in Russian in Dokl. Akad. Nauk SSSR 163, 4, 845-848, 1965.
|
| |
78
|
LIPTON,R.AND LOPRESTI, D. 1985. A systolic array for rapid string comparison. In Proceedings of the Chapel Hill Conference on VLSI, 363- 376.
|
| |
79
|
LOPRESTI,D.AND TOMKINS, A. 1994. On the search-ability of electronic ink. In Proceedings of the 4th International Workshop on Frontiers in Handwriting Recognition, 156-165.
|
| |
80
|
|
 |
81
|
|
| |
82
|
LUCZAK,T.AND SZPANKOWSKI, W. 1997. A suboptimal lossy data compression based on approximate pattern matching. IEEE Trans. Inf. Theor. 43, 1439-1451.
|
| |
83
|
MANBER,U.AND WU, S. 1994. GLIMPSE: A tool to search through entire file systems. In Proceedings of USENIX Technical Conference. USENIX Association, Berkeley, CA, USA. 23-32. Preliminary version as Tech. Rep. 93-34, Dept. of Computer Science, Univ. of Arizona, Oct. 1993.
|
| |
84
|
MASEK,W.AND PATERSON, M. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18-31.
|
| |
85
|
MASTERS, H. 1927. A study of spelling errors. Univ. of Iowa Studies in Educ. 4,4.
|
 |
86
|
|
| |
87
|
MELICHAR, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition (ICPR '96). IEEE CS Press, Silver Spring, MD. 256-260. Preliminary version in Computer Anal-ysis of Images and Patterns (LNCS, vol. 970, 1995).
|
 |
88
|
|
| |
89
|
|
| |
90
|
MYERS, G. 1994a. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345-374. Perliminary version in Tech. Rep. TR90-25, Computer Science Dept., Univ. of Arizona, Sept. 1991.
|
| |
91
|
MYERS, G. 1994b. Algorithmic Advances for Searching Biosequence Databases. Plenum Press, New York, 121-135.
|
| |
92
|
MYERS, G. 1986a. Incremental alignment algorithms and their applications. Tech. Rep. 86-22, Dept. of Computer Science, Univ. of Arizona.
|
| |
93
|
MYERS, G. 1986b. An O(ND) difference algorithm and its variations. Algorithmica 1, 251-266.
|
| |
94
|
MYERS, G. 1991. An overview of sequence comparison algorithms in molecular biology. Tech. Rep. TR-91-29, Dept. of Computer Science, Univ. of Arizona.
|
 |
95
|
|
| |
96
|
NAVARRO, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 125-139.
|
| |
97
|
NAVARRO, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton Univ. Press, 112-124.
|
| |
98
|
NAVARRO, G. 1998. Approximate Text Searching. Ph.D. thesis, Dept. of Computer Science, Univ. of Chile. Tech. Rep. TR/DCC-98-14. ftp://ftp. dcc.uchile.cl/pub/users/gnavarro/thesis98. ps.gz.
|
| |
99
|
|
| |
100
|
NAVARRO, G. 2000b. Nrgrep: A fast and flexible pattern matching tool, Tech. Rep. TR/DCC-2000-3. Dept. of Computer Science, Univ. of Chile, Aug. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/ nrgrep.ps.gz.
|
| |
101
|
NAVARRO,G.AND BAEZA-YATES, R. 1998a. Improving an algorithm for approximate pattern matching. Tech. Rep. TR/DCC-98- 5, Dept. of Computer Science, Univ. of Chile. Algorithmica, to appear. ftp:// ftp.dcc.uchile.cl/pub/users/gnavarro/dexp. ps.gz.
|
| |
102
|
NAVARRO,G.AND BAEZA-YATES, R. 1998b. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1,2.http://www.clei.cl.
|
| |
103
|
|
| |
104
|
|
| |
105
|
|
 |
106
|
|
| |
107
|
|
| |
108
|
NEEDLEMAN,S.AND WUNSCH, C. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 444-453.
|
| |
109
|
|
| |
110
|
|
| |
111
|
|
| |
112
|
RIVEST, R. 1976. Partial-match retrieval algorithms. SIAM J. Comput. 5,1.
|
| |
113
|
SAHINALP,S.AND VISHKIN, U. 1997. Approximate pattern matching using locally consistent parsing. Manuscript, Univ. of Maryland Institute for Advanced Computer Studies (UMIACS).
|
| |
114
|
SANKOFF, D. 1972. Matching sequences under deletion/insertion constraints. In Proceedings of the National Academy of Sciences of the USA, vol. 69, 4-6.
|
| |
115
|
SANKOFF,D.AND KRUSKAL, J., Eds. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
|
| |
116
|
SANKOFF,D.AND MAINVILLE, S. 1983. Common Subsequences and Monotone Subsequences. Addison-Wesley, Reading, MA, 363-365.
|
| |
117
|
|
| |
118
|
SELLERS, P. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math. 26, 787-793.
|
| |
119
|
SELLERS, P. 1980. The theory and computation of evolutionary distances: pattern recognition. J. Algor. 1, 359-373.
|
| |
120
|
SHI, F. 1996. Fast approximate string matching with q-blocks sequences. In Proceedings of the 3rd South American Workshop on String Processing (WSP'96). Carleton Univ. Press. 257- 271.
|
 |
121
|
|
| |
122
|
SUTINEN, E. 1998. Approximate Pattern Matching with the q-Gram Family. Ph.D. thesis, Dept. of Computer Science, Univ. of Helsinki, Finland. Tech. Rep. A-1998-3.
|
| |
123
|
|
| |
124
|
|
| |
125
|
|
| |
126
|
|
| |
127
|
|
 |
128
|
|
| |
129
|
|
| |
130
|
UKKONEN, E. 1985b. Finding approximate patterns in strings. J. Algor. 6, 132-137.
|
| |
131
|
|
| |
132
|
|
| |
133
|
UKKONEN, E. 1995. Constructing suffix trees online in linear time. Algorithmica 14, 3, 249- 260.
|
| |
134
|
UKKONEN,E.AND WOOD, D. 1993. Approximate string matching with suffix automata. Algorithmica 10, 353-364. Preliminary version in Rep. A-1990-4, Dept. of Computer Science, Univ. of Helsinki, Apr. 1990.
|
| |
135
|
ULLMAN, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Comput. J. 10, 141-147.
|
| |
136
|
VINTSYUK, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 52-58.
|
 |
137
|
|
| |
138
|
WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.
|
| |
139
|
WEINER, P. 1973. Linear pattern matching algorithms. In Proceedings of IEEE Symposium on Switching and Automata Theory, 1-11.
|
| |
140
|
|
| |
141
|
WU,S.AND MANBER, U. 1992a. Agrepfia fast approximate pattern-matching tool. In Proceedings of USENIX Technical Conference. USENIX Association, Berkeley, CA, USA. 153-162.
|
 |
142
|
|
| |
143
|
|
| |
144
|
WU, S., MANBER,U.,AND MYERS, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15,1,50- 67. Preliminary version as Tech. Rep. TR29-36, Computer Science Dept., Univ. of Arizona, 1992.
|
| |
145
|
YAO, A. 1979. The complexity of pattern matching for a random string. SIAM J. Comput. 8, 368- 387.
|
| |
146
|
|
 |
147
|
|
CITED BY 124
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jun Sun , Zhulong Wang , Hao Yu , Fumihito Nishino , Yukata Katsuyama , Satoshi Naoi, Effective text extraction and recognition for WWW images, Proceedings of the 2003 ACM symposium on Document engineering, November 20-22, 2003, Grenoble, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Arne Halaas , Børge Svingen , Magnar Nedland , Pål Sætrom , Ola Snøve, Jr. , Olaf René Birkeland, A recursive MISD architecture for pattern matching, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v.12 n.7, p.727-734, July 2004
|
|
|
Bijit Hore , Hakan Hacigumus , Bala Iyer , Sharad Mehrotra, Indexing text data under space constraints, Proceedings of the thirteenth ACM international conference on Information and knowledge management, November 08-13, 2004, Washington, D.C., USA
|
|
|
Namunu C. Maddage , Changsheng Xu , Mohan S. Kankanhalli , Xi Shao, Content-based music structure analysis with applications to music semantics understanding, Proceedings of the 12th annual ACM international conference on Multimedia, October 10-16, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Moninder Singh , Jayant R. Kalagnanam , Sudhir Verma , Amit J. Shah , Swaroop K. Chalasani, Automated cleansing for spend analytics, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christian Bird , Alex Gourley , Prem Devanbu , Michael Gertz , Anand Swaminathan, Mining email social networks, Proceedings of the 2006 international workshop on Mining software repositories, May 22-23, 2006, Shanghai, China
|
|
|
|
|
|
|
|
|
Sudipto Guha , H. V. Jagadish , Nick Koudas , Divesh Srivastava , Ting Yu, Approximate XML joins, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Changsheng Xu , Namunu C. Maddage , Xi Shao , Qi Tian, Content-adaptive digital music watermarking based on music structure analysis, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), v.3 n.1, p.1-es, February 2007
|
|
|
|
|
|
|
|
|
Ying-Dar Lin , Kuo-Kun Tseng , Tsern-Huei Lee , Yi-Neng Lin , Chen-Chou Hung , Yuan-Cheng Lai, A platform-based SoC design and implementation of scalable automaton matching for deep packet inspection, Journal of Systems Architecture: the EUROMICRO Journal, v.53 n.12, p.937-950, December, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Luis Gravano , Panagiotis G. Ipeirotis , H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Deise de Brum Saccol , Nina Edelweiss , Renata de Matos Galante , Carlo Zaniolo, XML version detection, Proceedings of the 2007 ACM symposium on Document engineering, August 28-31, 2007, Winnipeg, Manitoba, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hsin-Min Lu , Daniel Zeng , Lea Trujillo , Ken Komatsu , Hsinchun Chen, Ontology-enhanced automatic chief complaint classification for syndromic surveillance, Journal of Biomedical Informatics, v.41 n.2, p.340-356, April, 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Erik Linstead , Sushil Bajracharya , Trung Ngo , Paul Rigor , Cristina Lopes , Pierre Baldi, Sourcerer: mining and searching internet-scale software repositories, Data Mining and Knowledge Discovery, v.18 n.2, p.300-336, April 2009
|
|
|
|
|
|
|
|
|
Wei Wang , Chuan Xiao , Xuemin Lin , Chengqi Zhang, Efficient approximate entity extraction with edit distance constraints, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|
|
|
|
|
Lior Aronovich , Ron Asher , Eitan Bachmat , Haim Bitner , Michael Hirsch , Shmuel T. Klein, The design of a similarity based deduplication system, Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, May 04-April 06, 2009, Haifa, Israel
|
|
|
|
|
|
Nilesh Dalvi , Ravi Kumar , Bo Pang , Raghu Ramakrishnan , Andrew Tomkins , Philip Bohannon , Sathiya Keerthi , Srujana Merugu, A web of concepts, Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 29-July 01, 2009, Providence, Rhode Island, USA
|
|
|
|
|
|
|
|
|
|
REVIEW
"Paul Cull : Reviewer"
String matching is used in a variety of areas, including
word processing, information retrieval, and computational
biology. This survey considers the problem of locating within
a text string all substrings which are nearly the same as an
inp
more...
|