Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
San Jose, California, USA
SESSION: Research track papers
Pages: 105-112
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Deepavali Bhagwat  University of California
Kave Eshghi  Hewlett Packard Labs
Pankaj Mehra  Hewlett Packard Labs
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281207

ABSTRACT

We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case where similarity-based search is performed by finding documents that have features in common with the query document. Although all the features of all the documents could be stored in a single index, such an index does not scale. Our approach is to divide the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store its features. Likewise, a similarity-based search queries only a small number of partitions. Our approach is stateless and incremental: the decision as to which partitions a document's features are routed to (for storage at ingestion time, and for similarity-based search at query time) is based solely on the features of the document.

Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
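
The abstract does not say which function maps a document's features to partitions, so the following Python sketch is only an illustration of the stateless idea under stated assumptions. It assumes features are strings, hashes them with SHA-1, and routes a document to the partitions named by its smallest feature hashes (a min-hash style choice that is an assumption here, not the paper's stated method). The names feature_hash, route, PartitionedIndex, and the fanout parameter are all hypothetical.

```python
import hashlib


def feature_hash(feature: str) -> int:
    """Map a feature string to a 64-bit integer (SHA-1 is an arbitrary choice)."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(features, num_partitions: int, fanout: int = 2):
    """Pick up to `fanout` partitions using only the document's own features.

    ASSUMPTION: partitions are chosen from the smallest feature hashes.
    No global state is consulted, so ingestion and query use the same rule.
    """
    chosen = []
    for h in sorted({feature_hash(f) for f in features}):
        p = h % num_partitions
        if p not in chosen:
            chosen.append(p)
        if len(chosen) == fanout:
            break
    return chosen


class PartitionedIndex:
    """Toy in-memory stand-in for a feature index split across servers."""

    def __init__(self, num_partitions: int, fanout: int = 2):
        self.num_partitions = num_partitions
        self.fanout = fanout
        # One inverted index per partition: feature -> set of document ids.
        self.partitions = [dict() for _ in range(num_partitions)]

    def ingest(self, doc_id: str, features) -> None:
        """Store all of the document's features in each routed partition."""
        unique = set(features)
        for p in route(unique, self.num_partitions, self.fanout):
            for f in unique:
                self.partitions[p].setdefault(f, set()).add(doc_id)

    def search(self, features):
        """Query only the routed partitions; rank documents by shared features."""
        wanted = set(features)
        shared = {}  # doc_id -> set of query features the document shares
        for p in route(wanted, self.num_partitions, self.fanout):
            index = self.partitions[p]
            for f in wanted:
                for doc_id in index.get(f, ()):
                    shared.setdefault(doc_id, set()).add(f)
        ranked = [(doc_id, len(fs)) for doc_id, fs in shared.items()]
        ranked.sort(key=lambda item: (-item[1], item[0]))
        return ranked
```

Because route depends only on the feature set, a query whose features are identical to an ingested document's is always sent to the partitions that hold that document. For documents that merely overlap, this sketch can miss matches whenever the smallest hashes differ, which is the precision and recall trade-off the abstract refers to.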



