|
Warning: The download time has expired please click on the item to try again.
ABSTRACT
(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, Prh&egr;F[h(x) = h(y)] = sim(x,y), where sim(x,y) &egr; [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A &Pgr; B|}{|A &Pgr B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:<ol> A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/&pgr;, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) &xie; Eh&egr;\F [d(h(P),h(Q))] &xie; O(log n log log n). EMD(P, Q).</ol>.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.
|
 |
7
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
| |
8
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
9
|
Andrei Z. Broder , Robert Krauthgamer , Michael Mitzenmacher, Improved classification via connectivity information, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.576-585, January 09-11, 2000, San Francisco, California, United States
|
| |
10
|
Gruia Calinescu , Howard Karloff , Yuval Rabani, Approximation algorithms for the 0-extension problem, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.8-16, January 07-09, 2001, Washington, D.C., United States
|
| |
11
|
Chandra Chekuri , Sanjeev Khanna , Joseph Naor , Leonid Zosin, Approximation algorithms for the metric labeling problem via a new linear programming formulation, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.109-118, January 07-09, 2001, Washington, D.C., United States
|
| |
12
|
Zhiyuan Chen , H. V. Jagadish , Flip Korn , Nick Koudas , S. Muthukrishnan , Raymond T. Ng , Divesh Srivastava, Counting Twig Matches in a Tree, Proceedings of the 17th International Conference on Data Engineering, p.595-604, April 02-06, 2001
|
 |
13
|
Zhiyuan Chen , Nick Koudas , Flip Korn , S. Muthukrishnan, Selectively estimation for Boolean queries, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.216-225, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335225]
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.
|
 |
21
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
 |
22
|
Aristides Gionis , Dimitrios Gunopulos , Nick Koudas, Efficient and tumble similar set retrieval, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.247-258, May 21-24, 2001, Santa Barbara, California, United States
|
| |
23
|
|
 |
24
|
|
| |
25
|
|
 |
26
|
|
| |
27
|
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.
|
| |
28
|
|
| |
29
|
|
| |
30
|
|
 |
31
|
|
 |
32
|
Piotr Indyk , Rajeev Motwani , Prabhakar Raghavan , Santosh Vempala, Locality-preserving hashing in multidimensional spaces, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.618-625, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258656]
|
| |
33
|
|
| |
34
|
N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.
|
| |
35
|
N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.
|
| |
36
|
|
| |
37
|
Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.
|
| |
38
|
Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.
|
| |
39
|
|
| |
40
|
|
| |
41
|
M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.
|
| |
42
|
|
CITED BY 55
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ludmila Cherkasova , Kave Eshghi , Charles B. Morrey , Joseph Tucek , Alistair Veitch, Applying syntactic similarity algorithms for enterprise information management, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
Guangtao Xue , Yi Jiang , Yinyuan You , Minglu Li, A topology-aware hierarchical structured overlay network based on locality sensitive hashing scheme, Proceedings of the second workshop on Use of P2P, GRID and agents for the development of content networks, June 25-25, 2007, Monterey, California, USA
|
|
|
|
|
|
Peng Xia , Dan Feng , Hong Jiang , Lei Tian , Fang Wang, FARMER: a novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file system performance, Proceedings of the 17th international symposium on High performance distributed computing, June 23-27, 2008, Boston, MA, USA
|
|
|
Abhinandan S. Das , Mayur Datar , Ashutosh Garg , Shyam Rajaram, Google news personalization: scalable online collaborative filtering, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marco Furini , Filippo Geraci , Manuela Montangero , Marco Pellegrini, VISTO: visual storyboard for web video browsing, Proceedings of the 6th ACM international conference on Image and video retrieval, p.635-642, July 09-11, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
|
|
|
|
|
|
Scott Huffman , April Lehman , Alexei Stolboushkin , Howard Wong-Toi , Fan Yang , Hein Roehrig, Multiple-signal duplicate detection for search evaluation, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wei Dong , Zhe Wang , Moses Charikar , Kai Li, Efficiently matching sets of features with random histograms, Proceeding of the 16th ACM international conference on Multimedia, October 26-31, 2008, Vancouver, British Columbia, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|