ACM Home Page
Please provide us with feedback. Feedback
Efficient and tumble similar set retrieval
Full text PdfPdf (344 KB)
Source International Conference on Management of Data archive
Proceedings of the 2001 ACM SIGMOD international conference on Management of data table of contents
Santa Barbara, California, United States
Pages: 247 - 258  
Year of Publication: 2001
ISBN:1-58113-332-4
Also published in ...
Authors
Aristides Gionis  Stanford University
Dimitrios Gunopulos  University of California, Riverside
Nick Koudas  AT&T Laboratories
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 40,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/375663.375689
What is a DOI?

ABSTRACT

Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle we create structures that make use of well known and widely used data structuring techniques, as a means to ease integration with existing infrastructure.

We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing suitably encoded (in a way that preserves similarity) binary vectors in Hamming space thus, reducing the problem to one of similarity query processing in Hamming space. Then, we introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote for indexing. Finally we present experimental results from a prototype implementation of our techniques using real life datasets exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
AFS93
 
AGM99
 
ALSS95
BCFM98
 
BGMZ97
 
CDF+00
 
CJK+01
CKKM00
 
Coh97
Fal85
FRM94
GG98
 
HNP95
IM98
 
Ind00
 
MS93
F. J MacWilliams and A. Sloane. The Theory 0f Error Correcting Codes. North Holland, 1993.
 
SM96
 
Udi94
Manber Udi. Finding Similar Files in a Large File System. Winter USENIX Technical Conference, October 1994.
 
vR79
YIO93

CITED BY  10

Collaborative Colleagues:
Aristides Gionis: colleagues
Dimitrios Gunopulos: colleagues
Nick Koudas: colleagues