ACM Home Page
Please provide us with feedback. Feedback
Similarity estimation techniques from rounding algorithms
Full text PdfPdf (211 KB)
Source Annual ACM Symposium on Theory of Computing archive
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing table of contents
Montreal, Quebec, Canada
SESSION: Session 7A table of contents
Pages: 380 - 388  
Year of Publication: 2002
ISBN:1-58113-495-9
Author
Moses S. Charikar  Princeton University, Princeton, NJ
Sponsor
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 38,   Downloads (12 Months): 285,   Citation Count: 54
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/509907.509965
What is a DOI?

Warning: The download time has expired please click on the item to try again.


ABSTRACT

(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, Prh&egr;F[h(x) = h(y)] = sim(x,y), where sim(x,y) &egr; [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A &Pgr; B|}{|A &Pgr B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:<ol>

  • A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/&pgr;, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
  • A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) &xie; Eh&egr;\F [d(h(P),h(Q))] &xie; O(log n log log n). EMD(P, Q).
  • </ol>.


    REFERENCES

    Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

    1
     
    2
     
    3
    4
     
    5
     
    6
    A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.
    7
     
    8
     
    9
     
    10
     
    11
     
    12
    13
     
    14
     
    15
     
    16
     
    17
     
    18
     
    19
     
    20
    A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.
    21
    22
     
    23
    24
     
    25
    26
     
    27
    T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.
     
    28
     
    29
     
    30
    31
    32
     
    33
     
    34
    N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.
     
    35
    N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.
     
    36
     
    37
    Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.
     
    38
    Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.
     
    39
     
    40
     
    41
    M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.
     
    42

    CITED BY  55