|
ABSTRACT
Similarity join algorithms find pairs of objects that lie within a certain distance ε of each other. Algorithms that are adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of a multidimensional index. For these algorithms, when the data lies in a metric space, the usual solution is to embed the data in vector space and then make use of a multidimensional index. Such an approach has a number of drawbacks when the data is high dimensional as we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome via the introduction of a new method called Quickjoin that does not require a multidimensional index and instead adapts techniques used in distance-based indexing for use in a method that is conceptually similar to the Quicksort algorithm. A formal analysis is provided of the Quickjoin method. Experiments show that the Quickjoin method significantly outperforms two existing techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
| |
5
|
|
 |
6
|
Christian Böhm , Bernhard Braunmüller , Markus Breunig , Hans-Peter Kriegel, High performance clustering based on the similarity join, Proceedings of the ninth international conference on Information and knowledge management, p.298-305, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354832]
|
 |
7
|
Christian Böhm , Bernhard Braunmüller , Florian Krebs , Hans-Peter Kriegel, Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.379-388, May 21-24, 2001, Santa Barbara, California, United States
|
| |
8
|
|
| |
9
|
|
 |
10
|
Thomas Brinkhoff , Hans-Peter Kriegel , Bernhard Seeger, Efficient processing of spatial joins using R-trees, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.237-246, May 25-28, 1993, Washington, D.C., United States
|
 |
11
|
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
| |
16
|
Dohnal, V., Gennaro, C., Savino, P., and Zezula, P. 2003a. Similarity join in metric spaces. In Proceedings of the 25th European Conference on IR Research (Advances in Information Retrieval) (ECIR'03). Pisa, Italy, 452--467.
|
| |
17
|
Dohnal, V., Gennaro, C., and Zezula, P. 2003b. Similarity join in metric spaces using eD-Index. In Proceedings of the 14th International Conference on Database and Expert Systems Applications. (DEXA'03). Prague, Czech Republic, 484--493.
|
| |
18
|
|
 |
19
|
|
| |
20
|
C. Faloutsos , R. Barber , M. Flickner , J. Hafner , W. Niblack , D. Petkovic , W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994
[doi> 10.1007/BF00962238]
|
 |
21
|
Christos Faloutsos , King-Ip Lin, FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.163-174, May 22-25, 1995, San Jose, California, United States
|
 |
22
|
|
| |
23
|
|
| |
24
|
Hettich, S. and Bay, S. D. 1999. The UCI KDD archive. Department of Information and Computer Science. University of California, Irvine, CA. [http://kdd.ics.uci.edu].
|
 |
25
|
|
| |
26
|
Hoare, C. A. R. 1962. Quicksort. Comput. J. 5, 1, 10--15.
|
| |
27
|
Hristescu, G. and Farach-Colton, M. 1999. Cluster-preserving embedding of proteins. Tech. rep., Department of Computer Science, Rutgers University, Piscataway, NJ.
|
| |
28
|
|
 |
29
|
|
 |
30
|
|
 |
31
|
|
 |
32
|
|
| |
33
|
|
| |
34
|
Kahveci, T., Lang, C., and Singh, A. K. 2003. Joining massive high-dimensional datasets. In Proceedings of the 19th IEEE International Conference on Data Engineering. Bangalore, India, 264--276.
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
|
| |
39
|
|
 |
40
|
|
| |
41
|
|
| |
42
|
|
| |
43
|
Levenshtein, V. A. 1966. Binary codes capable of correcting deletions, insertion, and reversals. Cybern. Control Theory 10, 8, 707--710.
|
| |
44
|
Linial, N., London, E., and Rabinovich, Y. 1995. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15, 215--245.
|
| |
45
|
|
| |
46
|
|
| |
47
|
|
 |
48
|
|
 |
49
|
|
| |
50
|
|
| |
51
|
|
 |
52
|
|
| |
53
|
|
 |
54
|
|
| |
55
|
|
| |
56
|
|
| |
57
|
|
| |
58
|
|
| |
59
|
|
| |
60
|
|
| |
61
|
|
| |
62
|
|
| |
63
|
Uhlmann, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inform. Process. Lett. 40, 4, 175--179.
|
| |
64
|
Voronoi, G. 1909. Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Deuxiême mémoire: Recherches sur les parallèlloèdres primitifs. Seconde partie. Journal für die Reine und Angewandte Mathematik 136, 2, 67--181.
|
 |
65
|
Jason Tsong-Li Wang , Xiong Wang , King-Ip Lin , Dennis Shasha , Bruce A. Shapiro , Kaizhong Zhang, Evaluating a class of distance-mapping algorithms for data mining and clustering, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.307-311, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312264]
|
| |
66
|
Chenyi Xia , Hongjun Lu , Beng Chin Ooi , Jing Hu, Gorder: an efficient method for KNN join processing, Proceedings of the Thirtieth international conference on Very large data bases, p.756-767, August 31-September 03, 2004, Toronto, Canada
|
| |
67
|
|
| |
68
|
|
| |
69
|
|
|