|
ABSTRACT
We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. Also, we extend our approach to deal with the k nearest neighbors.We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed---similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated ranking of database query results. In Proceedings of the First Biennial Conference on Innovative Data Systems Research, 2003.
|
| |
3
|
J. J. Bartholdi, C. A. Tovey, and M. A. Trick. The computational difficulty of manipulating an election. Social Choice and Welfare, 6(3):227--241, 1989.
|
| |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
P. Diaconis and R. Graham. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society, Series B, 39(2):262--268, 1977.
|
 |
8
|
Cynthia Dwork , Ravi Kumar , Moni Naor , D. Sivakumar, Rank aggregation methods for the Web, Proceedings of the 10th international conference on World Wide Web, p.613-622, May 01-05, 2001, Hong Kong, Hong Kong
[doi> 10.1145/371920.372165]
|
| |
9
|
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
I. Ilyas, W. Aref, and A. Elmagarmid. Joining ranked inputs in practice. In Proceedings of the 28th International Conference on Very Large Databases, pages 950--961, 2002.
|
 |
16
|
|
| |
17
|
J. G. Kemeny. Mathematics without numbers. Daedalus, 88:571--591, 1959.
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
H. P. Young. Condorcet's theory of voting. American Political Science Review, 82:1231--1244, 1988.
|
| |
22
|
H. P. Young and A. Levenglick. A consistent extension of Condorcet's election principle. SIAM Journal on Applied Mathematics, 35(2):285--300, 1978.
|
CITED BY 38
|
Alexandr Andoni , Ronald Fagin , Ravi Kumar , Mihai Patrascu , D. Sivakumar, Corrigendum to "efficient similarity search and classification via rank aggregation" by Ronald Fagin, Ravi Kumar and D. Sivakumar (proc. SIGMOD'03), Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
|
|
|
|
Herwig Lejsek , Fridrik H. Ásmundsson , Björn Thór Jónsson , Laurent Amsaleg, Blazingly fast image copyright enforcement, Proceedings of the 14th annual ACM international conference on Multimedia, October 23-27, 2006, Santa Barbara, CA, USA
|
|
|
|
Herwig Lejsek , Fridrik H. Ásmundsson , Björn Thór Jónsson , Laurent Amsaleg, Scalability of local image descriptors: a comparative study, Proceedings of the 14th annual ACM international conference on Multimedia, October 23-27, 2006, Santa Barbara, CA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yu-Ting Liu , Tie-Yan Liu , Tao Qin , Zhi-Ming Ma , Hang Li, Supervised rank aggregation, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
Ronald Fagin , Ravi Kumar , Mohammad Mahdian , D. Sivakumar , Erik Vee, Comparing and aggregating rankings with ties, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
Flavio Chierichetti , Alessandro Panconesi , Prabhakar Raghavan , Mauro Sozio , Alessandro Tiberi , Eli Upfal, Finding near neighbors through cluster pruning, Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 11-13, 2007, Beijing, China
|
|
Tao Qin , Xu-Dong Zhang , De-Sheng Wang , Tie-Yan Liu , Wei Lai , Hang Li, Ranking with multiple hyperplanes, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
Aristides Gionis , Heikki Mannila , Kai Puolamäki , Antti Ukkonen, Algorithms for discovering bucket orders from data, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
Sudipto Guha , Nick Koudas , Amit Marathe , Divesh Srivastava, Merging the results of approximate match operations, Proceedings of the Thirtieth international conference on Very large data bases, p.636-647, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Peer to Peer - Readers of this Article have also read:
-
Open signaling for ATM, internet and mobile networks (OPENSIG'98)
ACM SIGCOMM Computer Communication Review
29, 1
Andrew T. Campbell
, Irene Katzela
, Kazuho Miki
, John Vicente
-
Constructing reality
Proceedings of the 11th annual international conference on Systems documentation
Douglas A. Powell
, Norman R. Ball
, Mansel W. Griffiths
-
Active bridging
ACM SIGCOMM Computer Communication Review
27, 4
D. Scott Alexander
, Marianne Shaw
, Scott M. Nettles
, Jonathan M. Smith
-
Active electronic mail
Proceedings of the 2002 ACM symposium on Applied computing
S. Karnouskos
, A. Vasilakos
-
Object-oriented database management system for process control systems—development and evaluation
Proceedings of the 1999 ACM symposium on Applied computing
Ryuji Wakizono
, Toshikazu Kawamura
, Takehiko Tsuchiya
, Takahiro Hatanaka
, Tatsuji Tanaka
|