ACM Home Page
Please provide us with feedback. Feedback
Efficient similarity search and classification via rank aggregation
Full text PdfPdf (199 KB)
Source International Conference on Management of Data archive
Proceedings of the 2003 ACM SIGMOD international conference on Management of data table of contents
San Diego, California
SESSION: Similarity queries I table of contents
Pages: 301 - 312  
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Ronald Fagin  IBM Almaden Research Center, San Jose, CA
Ravi Kumar  IBM Almaden Research Center, San Jose, CA
D. Sivakumar  IBM Almaden Research Center, San Jose, CA
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 9,   Downloads (12 Months): 97,   Citation Count: 43
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/872757.872795
What is a DOI?

ABSTRACT

We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. Also, we extend our approach to deal with the k nearest neighbors.We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed---similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated ranking of database query results. In Proceedings of the First Biennial Conference on Innovative Data Systems Research, 2003.
 
3
J. J. Bartholdi, C. A. Tovey, and M. A. Trick. The computational difficulty of manipulating an election. Social Choice and Welfare, 6(3):227--241, 1989.
 
4
 
5
6
 
7
P. Diaconis and R. Graham. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society, Series B, 39(2):262--268, 1977.
8
 
9
10
11
 
12
 
13
 
14
 
15
I. Ilyas, W. Aref, and A. Elmagarmid. Joining ranked inputs in practice. In Proceedings of the 28th International Conference on Very Large Databases, pages 950--961, 2002.
16
 
17
J. G. Kemeny. Mathematics without numbers. Daedalus, 88:571--591, 1959.
18
 
19
 
20
 
21
H. P. Young. Condorcet's theory of voting. American Political Science Review, 82:1231--1244, 1988.
 
22
H. P. Young and A. Levenglick. A consistent extension of Condorcet's election principle. SIAM Journal on Applied Mathematics, 35(2):285--300, 1978.

CITED BY  42

Collaborative Colleagues:
Ronald Fagin: colleagues
Ravi Kumar: colleagues
D. Sivakumar: colleagues