| LSH forest: self-tuning indexes for similarity search |
| Full text |
Pdf
(248 KB)
|
| Source
|
International World Wide Web Conference
archive
Proceedings of the 14th international conference on World Wide Web
table of contents
Chiba, Japan
SESSION: Link-based similarity
table of contents
Pages: 651 - 660
Year of Publication: 2005
ISBN:1-59593-046-9
|
|
Authors
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 16, Downloads (12 Months): 95, Citation Count: 14
|
|
|
ABSTRACT
We consider the problem of indexing high-dimensional data for answering (approximate) similarity-search queries. Similarity indexes prove to be important in a wide variety of settings: Web search engines desire fast, parallel, main-memory-based indexes for similarity search on text data; database systems desire disk-based similarity indexes for high-dimensional data, including text and images; peer-to-peer systems desire distributed similarity indexes with low communication cost. We propose an indexing scheme called LSH Forest which is applicable in all the above contexts. Our index uses the well-known technique of locality-sensitive hashing (LSH), but improves upon previous designs by (a) eliminating the different data-dependent parameters for which LSH must be constantly hand-tuned, and (b) improving on LSH's performance guarantees for skewed data distributions while retaining the same storage and query overhead. We show how to construct this index in main memory, on disk, in parallel systems, and in peer-to-peer systems. We evaluate the design with experiments on multiple text corpora and demonstrate both the self-tuning nature and the superior performance of LSH Forest.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
K. Aberer. Scalable data access in p2p systems using unbalanced search trees. In Proc. WDAS, 2002.
|
 |
2
|
|
 |
3
|
Stefan Berchtold , Christian Böhm , Hans-Peter Kriegal, The pyramid-technique: towards breaking the curse of dimensionality, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.142-153, June 01-04, 1998, Seattle, Washington, United States
|
| |
4
|
|
| |
5
|
|
 |
6
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
7
|
|
| |
8
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
9
|
Soumen Chakrabarti , Byron Dom , Prabhakar Raghavan , Sridhar Rajagopalan , David Gibson , Jon Kleinberg, Automatic resource compilation by analyzing hyperlink structure and associated text, Proceedings of the seventh international conference on World Wide Web 7, p.65-74, April 1998, Brisbane, Australia
|
 |
10
|
Junghoo Cho , Narayanan Shivakumar , Hector Garcia-Molina, Finding replicated Web collections, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.355-366, May 15-18, 2000, Dallas, Texas, United States
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
 |
15
|
Christos Faloutsos , King-Ip Lin, FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.163-174, May 22-25, 1995, San Jose, California, United States
|
| |
16
|
|
 |
17
|
|
| |
18
|
P. Ganesan, M. Bawa, and H. Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. In Proc. VLDB, 2004.
|
 |
19
|
David Gibson , Jon Kleinberg , Prabhakar Raghavan, Inferring Web communities from link topology, Proceedings of the ninth ACM conference on Hypertext and hypermedia : links, objects, time and space---structure in hypermedia systems: links, objects, time and space---structure in hypermedia systems, p.225-234, June 20-24, 1998, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/276627.276652]
|
| |
20
|
|
| |
21
|
A. Gutman. R-trees: A dynamic index structure for spatial searching. In Proc. of SIGMOD, 1997.
|
 |
22
|
Taher H. Haveliwala , Aristides Gionis , Dan Klein , Piotr Indyk, Evaluating strategies for similarity search on the web, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511502]
|
 |
23
|
|
| |
24
|
N. Katayama and S. Satoh. The r*-tree: An efficient and robust access method for points and rectangles. In Proc. of SIGMOD, 1997.
|
| |
25
|
|
| |
26
|
|
 |
27
|
|
 |
28
|
|
 |
29
|
|
 |
30
|
|
 |
31
|
|
| |
32
|
|
| |
33
|
|
CITED BY 14
|
|
|
|
|
|
|
|
|
|
|
Qin Lv , William Josephson , Zhe Wang , Moses Charikar , Kai Li, Multi-probe LSH: efficient indexing for high-dimensional similarity search, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wei Dong , Zhe Wang , William Josephson , Moses Charikar , Kai Li, Modeling LSH for performance tuning, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
Yufei Tao , Ke Yi , Cheng Sheng , Panos Kalnis, Quality and efficiency in high dimensional nearest neighbor search, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
|
|