|
ABSTRACT
We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case when similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. To perform similarity-based search, also, only a small number of partitions are queried. Our approach is stateless and incremental. The decision as to which partitions the features of the document should be routed to (for storing at ingestion time, and for similarity based search at query time) is solely based on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Atul Adya , William J. Bolosky , Miguel Castro , Gerald Cermak , Ronnie Chaiken , John R. Douceur , Jon Howell , Jacob R. Lorch , Marvin Theimer , Roger P. Wattenhofer, Farsite: federated, available, and reliable storage for an incompletely trusted environment, Proceedings of the 5th symposium on Operating systems design and implementation Due to copyright restrictions we are not able to make the PDFs for this conference available for downloading, December 09-11, 2002, Boston, Massachusetts
[doi> 10.1145/1060289.1060291]
|
 |
2
|
|
 |
3
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
 |
10
|
Peter B. Danzig , Jongsuk Ahn , John Noll , Katia Obraczka, Distributed indexing: a scalable mechanism for distributed information retrieval, Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, p.220-229, October 13-16, 1991, Chicago, Illinois, United States
[doi> 10.1145/122860.122883]
|
| |
11
|
F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 113--126. USENIX, June 2003.
|
| |
12
|
|
| |
13
|
K. Eshghi and H. K. Tang. A framework for analyzing and improving content-based chunking algorithms. Technical Report HPL-2005-30(R.1), Hewlett Packard Laboraties, Palo Alto, 2005.
|
| |
14
|
|
 |
15
|
|
 |
16
|
|
 |
17
|
John Kubiatowicz , David Bindel , Yan Chen , Steven Czerwinski , Patrick Eaton , Dennis Geels , Ramakrishna Gummadi , Sean Rhea , Hakim Weatherspoon , Chris Wells , Ben Zhao, OceanStore: an architecture for global-scale persistent storage, Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, p.190-201, November 2000, Cambridge, Massachusetts, United States
|
| |
18
|
Purushottam Kulkarni , Fred Douglis , Jason LaVoie , John M. Tracey, Redundancy elimination within large collections of files, Proceedings of the annual conference on USENIX Annual Technical Conference, p.5-5, June 27-July 02, 2004, Boston, MA
|
| |
19
|
Mark Lillibridge , Sameh Elnikety , Andrew Birrell , Mike Burrows , Michael Isard, A cooperative internet backup scheme, Proceedings of the annual conference on USENIX Annual Technical Conference, p.3-3, June 09-14, 2003, San Antonio, Texas
|
 |
20
|
|
 |
21
|
|
| |
22
|
|
 |
23
|
|
 |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
|
 |
28
|
Sylvia Ratnasamy , Paul Francis , Mark Handley , Richard Karp , Scott Schenker, A scalable content-addressable network, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.161-172, August 2001, San Diego, California, United States
|
| |
29
|
|
| |
30
|
|
| |
31
|
|
 |
32
|
|
 |
33
|
Ion Stoica , Robert Morris , David Karger , M. Frans Kaashoek , Hari Balakrishnan, Chord: A scalable peer-to-peer lookup service for internet applications, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.149-160, August 2001, San Diego, California, United States
|
| |
34
|
N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, T. Bressoud, and A. Perrig. Opportunistic use of content addressable storage for distributed file systems. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 127--140, June 2003.
|
| |
35
|
|
| |
36
|
B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22(1):41--53, January 2004.
|
|