|
ABSTRACT
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. A well known dynamic programming algorithm takes time O(nm) to solve this problem, and it is an important open problem in Combinatorial Pattern Matching to significantly improve this bound.We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we approximate the string edit distance upto a factor of O(log n log*n). Our result is a near linear time deterministic algorithm for this version of the problem. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors, outliers, and streaming computations with strings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Stephen Alstrup , Gerth Stølting Brodal , Theis Rauhe, Pattern matching in dynamic texts, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.819-828, January 09-11, 2000, San Francisco, California, United States
|
| |
3
|
Amihood Amir , Moshe Lewenstein , Ely Porat, Faster algorithms for string matching with k mismatches, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.794-803, January 09-11, 2000, San Francisco, California, United States
|
| |
4
|
|
| |
5
|
Graham Cormode , Mike Paterson , Süleyman Cenk Sahinalp , Uzi Vishkin, Communication complexity of document exchange, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.197-206, January 09-11, 2000, San Francisco, California, United States
|
 |
6
|
|
| |
7
|
{Gal85} Z. Galil. Open problems in stringology. In Combinatorial Algorithms on Words, pages 1-8. Springer, 1985.
|
| |
8
|
Ashish Goel , Piotr Indyk , Kasturi Varadarajan, Reductions among high dimensional proximity problems, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.769-778, January 07-09, 2001, Washington, D.C., United States
|
 |
9
|
|
| |
10
|
|
 |
11
|
|
| |
12
|
|
 |
13
|
Richard M. Karp , Raymond E. Miller , Arnold L. Rosenberg, Rapid identification of repeated patterns in strings, trees and arrays, Proceedings of the fourth annual ACM symposium on Theory of computing, p.125-136, May 01-03, 1972, Denver, Colorado, United States
[doi> 10.1145/800152.804905]
|
 |
14
|
Eyal Kushilevitz , Rafail Ostrovsky , Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.614-623, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276877]
|
| |
15
|
|
 |
16
|
|
| |
17
|
{MP80} W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20:18-31, 1980.
|
 |
18
|
|
| |
19
|
{MSU97} K. Mehlhorn, R. Sundar, and C. Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183-198, February 1997.
|
| |
20
|
{Mye86} E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1:251-256, 1986.
|
 |
21
|
|
| |
22
|
{SV95} S. C. Sahinalp and U. Vishkin. Data compression using locally consistent parsing, 1995.
|
| |
23
|
|
CITED BY 21
|
|
A. Andoni , M. Deza , A. Gupta , P. Indyk , S. Raskhodnikova, Lower bounds for embedding edit distance into normed spaces, Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, January 12-14, 2003, Baltimore, Maryland
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
K. Selçuk Candan , Mehmet E. Dönderler , J. Ramamoorthy , Jong W. Kim, Clustering and indexing of experience sequences for popularity-driven recommendations, Proceedings of the 3rd ACM workshop on Continuous archival and retrival of personal experences, October 28-28, 2006, Santa Barbara, California, USA
|
|
|
Parikshit Gopalan , T. S. Jayram , Robert Krauthgamer , Ravi Kumar, Estimating the sortedness of a data stream, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, p.318-327, January 07-09, 2007, New Orleans, Louisiana
|
|
|
|
|
|
|
|
|
|
|
|
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE Design Automation Conference on
Gwo-Dong Chen
, Daniel D. Gajski
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
|