ACM Home Page
Please provide us with feedback. Feedback
Efficient algorithms for substring near neighbor problem
Full text PdfPdf (312 KB)
Source Symposium on Discrete Algorithms archive
Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm table of contents
Miami, Florida
Pages: 1203 - 1212  
Year of Publication: 2006
ISBN:0-89871-605-5
Authors
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
: SIAM Activity Group on Discrete Mathematics
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 61,   Citation Count: 4
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1109557.1109690
What is a DOI?

ABSTRACT

In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. Specifically, for a string T of length n, we present a data structure which does the following: given a pattern P, if there is a substring of T within the distance R from P, it reports a (possibly different) substring of T within distance cR from P. The length of the pattern P, denoted by m, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses Õ(n1+1/c) space1 and with Õ(n1/c + mno(1)) query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length m is fixed in advance. In addition, our data structure can be constructed in time Õ(n1+1/c + n1+o(1)M1/3), where M is an upper bound for m. This essentially matches the preprocessing bound of [Ind98] as long as the term Õ(n1+1/c) dominates the running time, which is the case when, e.g., c < 3.We also extend our results to the case where the distances are measured according to the l1 distance. The query time and the space bound are essentially the same, while the preprocessing time becomes Õ(n1+1/c + n1+o(1)M2/3).


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
{And05}A. Andoni. Approximate Nearest Neighbor Problem in High Dimensions. Master of Engineering Thesis, Massachusetts Institute of Technology, Cambridge, MA, June 2005.
 
2
 
3
 
4
5
 
6
{Buh01} J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419--428, 2001.
7
8
9
 
10
 
11
 
12
 
13
14
 
15
 
16
 
17
{JP04} N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. The MIT Press Cambridge, MA, 2004.
18
19
 
20
 
21
 
22


Collaborative Colleagues:
Alexandr Andoni: colleagues
Piotr Indyk: colleagues