|
ABSTRACT
The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Sunil Arya , David M. Mount , Nathan S. Netanyahu , Ruth Silverman , Angela Wu, An optimal algorithm for approximate nearest neighbor searching, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.573-582, January 23-25, 1994, Arlington, Virginia, United States
|
 |
3
|
|
| |
4
|
BAEZA-YATES, R. 1997. Searching: An algorithmic tour. In Encyclopedia of Computer Science and Technology, vol. 37, A. Kent and J. Williams, Eds., Marcel Dekker, New York, 331-359.
|
| |
5
|
BAEZA-YATES,R.AND NAVARRO, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the Fifth South American Symposium on String Processing and Information Retrieval (SPIRE'98), IEEE Computer Science Press, Los Alamitos, Calif., 14-22.
|
| |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
BENTLEY, J. 1979. Multidimensional binary search trees in database applications. IEEE Trans. Softw. Eng. 5, 4, 333-340.
|
 |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
 |
15
|
|
| |
16
|
|
| |
17
|
BRITO, M., CH~VEZ, E., QUIROZ, A., AND YUKICH,J. 1996. Connectivity of the mutual k-nearest neighbor graph in clustering and outlier detection. Stat. Probab. Lett. 35, 33-42.
|
| |
18
|
BUGNION, E., FHEI, S., ROOS, T., WIDMAYER,P., AND WIDMER, F. 1993. A spatial index for approximate multiple string matching. In Proceedings of the First South American Workshop on String Processing (WSP'93), R. Baeza-Yates and N. Ziviani, Eds., 43-53.
|
 |
19
|
|
| |
20
|
|
| |
21
|
CH~VEZ, E. 1999. Optimal discretization for pivot based algorithms. Manuscript. ftp:// garota.fismat.umich.mx/pub/users/elchavez/ minimax.ps.gz.
|
| |
22
|
CH~VEZ,E.AND MARROQUYN, J. 1997. Proximity queries in metric spaces. In Proceedings of the Fourth South American Workshop on String Processing ( WSP'97), R. Baeza-Yates, Ed. Carleton University Press, Ottawa, 21-36.
|
| |
23
|
|
| |
24
|
CH~VEZ,E.AND NAVARRO, G. 2001a. Towards measuring the searching complexity of metric spaces. In Proceedings of the Mexican Computing Meeting, vol. II, 969-972.
|
| |
25
|
|
| |
26
|
|
| |
27
|
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
 |
32
|
Paolo Ciaccia , Marco Patella , Pavel Zezula, A cost model for similarity queries in metric spaces, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.59-68, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275495]
|
| |
33
|
|
| |
34
|
CLARKSON, K. 1999. Nearest neighbor queries in metric spaces. Discrete Comput. Geom. 22,1, 63-93.
|
| |
35
|
COX,T.AND COX, M. 1994. Multidimensional Scaling. Chapman and Hall, London.
|
| |
36
|
|
| |
37
|
|
| |
38
|
DUDA,R.AND HART, P. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
|
 |
39
|
|
 |
40
|
|
| |
41
|
C. Faloutsos , R. Barber , M. Flickner , J. Hafner , W. Niblack , D. Petkovic , W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994
[doi> 10.1007/BF00962238]
|
| |
42
|
|
| |
43
|
|
 |
44
|
|
 |
45
|
|
| |
46
|
Joseph F. Hair, Jr. , Rolph E. Anderson , Ronald L. Tatham , William C. Black, Multivariate data analysis (4th ed.): with readings, Prentice-Hall, Inc., Upper Saddle River, NJ, 1995
|
| |
47
|
|
| |
48
|
|
| |
49
|
KALANTARI,I.AND MCDONALD, G. 1983. A data structure and an algorithm for the nearest point problem. IEEE Trans. Softw. Eng. 9,5.
|
| |
50
|
|
| |
51
|
|
| |
52
|
|
| |
53
|
|
| |
54
|
NAVARRO,G.AND REYES, N. 2001. Dynamic spatial approximation trees. In Proceedings of the Eleventh Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 213-222.
|
| |
55
|
|
 |
56
|
|
| |
57
|
NOLTEMEIER, H. 1989. Voronoi trees and applications. In Proceedings of the International Workshop on Discrete Algorithms and Complex-ity (Fukuoka, Japan), 69-74.
|
| |
58
|
|
 |
59
|
|
 |
60
|
Nick Roussopoulos , Stephen Kelley , Frédéric Vincent, Nearest neighbor queries, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.71-79, May 22-25, 1995, San Jose, California, United States
|
| |
61
|
|
 |
62
|
|
| |
63
|
SANKOFF,D.AND KRUSKAL, J., EDS. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.
|
 |
64
|
|
 |
65
|
|
| |
66
|
|
| |
67
|
UHLMANN, J. 1991a. Implementing metric trees to satisfy general proximity/similarity queries. Manuscript.
|
| |
68
|
UHLMANN, J. 1991b. Satisfying general proximity/ similarity queries with metric trees. Inf. Proc. Lett. 40, 175-179.
|
| |
69
|
VERBARG, K. 1995. The C-Ttree: A dynamically balanced spatial index. Comput. Suppl. 10, 323-340.
|
| |
70
|
|
| |
71
|
WATERMAN, M. 1995. Introduction to Computational Biology. Chapman and Hall, London.
|
| |
72
|
|
| |
73
|
WHITE,D.AND JAIN, R. 1996. Algorithms and strategies for similarity retrieval. Tech. Rep. VCL-96-101 (July), Visual Computing Laboratory, University of California, La Jolla.
|
| |
74
|
YAO, A. 1990. Computational Geometry,J.Van Leeuwen, Ed. Elsevier Science, New York, 345-380.
|
| |
75
|
|
| |
76
|
YIANILOS, P. 1999. Excluded middle vantage point forests for nearest neighbor search. In DIMACS Implementation Challenge, ALENEX'99 (Baltimore, Md).
|
| |
77
|
|
| |
78
|
|
CITED BY 97
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Caetano Traina, Jr. , Roberto F. Filho , Agma J. Traina , Marcos R. Vieira , Christos Faloutsos, The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient, The VLDB Journal — The International Journal on Very Large Data Bases, v.16 n.4, p.483-505, October 2007
|
|
|
B. Vassiliadis , A. Stefani , L. Drossos , K. Ioannou, Knowledge discovery in multimedia repositories: the role of metadata, Proceedings of the 7th WSEAS International Conference on Mathematical Methods and Computational Techniques In Electrical Engineering, p.330-335, October 27-29, 2005, Sofia, Bulgaria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Qin Lv , William Josephson , Zhe Wang , Moses Charikar , Kai Li, Efficient filtering with sketches in the ferret toolkit, Proceedings of the 8th ACM international workshop on Multimedia information retrieval, October 26-27, 2006, Santa Barbara, California, USA
|
|
|
|
|
|
Maria Camila N. Barioni , Humberto Razente , Agma Traina , Caetano Traina, Jr., SIREN: a similarity retrieval engine for complex data, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Guoren Wang , Xiangmin Zhou , Bin Wang , Baiyou Qiao , Donghong Han, A hyperplane based indexing technique for high-dimensional data, Information Sciences: an International Journal, v.177 n.11, p.2255-2268, June, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Qin Lv , William Josephson , Zhe Wang , Moses Charikar , Kai Li, Multi-probe LSH: efficient indexing for high-dimensional similarity search, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yi Zhuang , Yueting Zhuang , Qing Li , Lei Chen , Yi Yu, Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach, Proceedings of the 11th international conference on Extending database technology: Advances in database technology, March 25-29, 2008, Nantes, France
|
|
|
|
|
|
Pedro H. Bugatti , Agma J. M. Traina , Caetano Traina, Jr., Assessing the best integration between distance-function and image-feature to answer similarity queries, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
Claudio Gennaro , Matteo Mordacchini , Salvatore Orlando , Fausto Rabitti, Processing complex similarity queries in peer-to-peer networks, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Fabrizio Falchi , Claudio Lucchese , Salvatore Orlando , Raffaele Perego , Fausto Rabitti, A metric cache for similarity search, Proceeding of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval, October 30-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
Fabrizio Falchi , Claudio Lucchese , Salvatore Orlando , Raffaele Perego , Fausto Rabitti, Caching content-based queries for robust and efficient image retrieval, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, March 24-26, 2009, Saint Petersburg, Russia
|
|
|
Fábio C. M. Ricotta , Thatyana de F. P. Seraphim , Enzo Seraphim , Edmilson M. Moreira , Caetano Traina, Jr., Paginação de resultados em consultas por abrangência, Proceedings of the 23rd Brazilian symposium on Databases, October 13-17, 2008, Campinas, Sao Paulo, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|