|
ABSTRACT
The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance &egr;.One of the most efficient algorithms for processing similarity-joins is the Multidimensional-Spatial Join (MSJ) by Koudas and Sevcik. In our previous work --- pursued for the two-dimensional case --- we found however that MSJ has several performance shortcomings in terms of CPU and I/O cost as well as memory-requirements. Therefore, MSJ is not generally applicable to high-dimensional data.In this paper, we propose a new algorithm named Generic External Space Sweep (GESS). GESS introduces a modest rate of data replication to reduce the number of expensive distance computations. We present a new cost-model for replication, an I/O model, and an inexpensive method for duplicate removal. The principal component of our algorithm is a highly flexible replication engine.Our analytical model predicts a tremendous reduction of the number of expensive distance computations by several orders of magnitude in comparison to MSJ (factor 107). In addition, the memory requirements of GESS are shown to be lower by several orders of magnitude. Furthermore, the I/O cost of our algorithm is by factor 2 better (independent from the fact whether replication occurs or not). Our analytical results are confirmed by a large series of simulations and experiments with synthetic and real high-dimensional data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Jonathan Ashley , Myron Flickner , James Hafner , Denis Lee , Wayne Niblack , Dragutin Petkovic, The query by image content (QBIC) system, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.475, May 22-25, 1995, San Jose, California, United States
|
 |
2
|
|
 |
3
|
Stefan Berchtold , Christian Böhm , Daniel A. Keim , Hans-Peter Kriegel, A cost model for nearest neighbor search in high-dimensional data space, Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.78-86, May 11-15, 1997, Tucson, Arizona, United States
[doi> 10.1145/263661.263671]
|
| |
4
|
Jochen Van den Bercken , Björn Blohsfeld , Jens-Peter Dittrich , Jürgen Krämer , Tobias Schäfer , Martin Schneider , Bernhard Seeger, XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, Proceedings of the 27th International Conference on Very Large Data Bases, p.39-48, September 11-14, 2001
|
 |
5
|
Jochen van den Bercken , Jens-Peter Dittrich , Bernhard Seeger, javax.XXL: a prototype for a library of query processing algorithms, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.588, May 15-18, 2000, Dallas, Texas, United States
|
| |
6
|
|
 |
7
|
Christian Böhm , Bernhard Braunmüller , Markus Breunig , Hans-Peter Kriegel, High performance clustering based on the similarity join, Proceedings of the ninth international conference on Information and knowledge management, p.298-305, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354832]
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
Christos Faloutsos , Bernhard Seeger , Agma Traina , Caetano Traina, Jr., Spatial join selectivity using power laws, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.177-188, May 15-18, 2000, Dallas, Texas, United States
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
 |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
|
| |
22
|
D. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
|
| |
23
|
|
| |
24
|
|
CITED BY 7
|
|
Jens-Peter Dittrich , Bernhard Seeger , David Scot Taylor , Peter Widmayer, On producing join results early, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.134-142, June 09-11, 2003, San Diego, California
|
|
|
|
|
|
Chenyi Xia , Hongjun Lu , Beng Chin Ooi , Jing Hu, Gorder: an efficient method for KNN join processing, Proceedings of the Thirtieth international conference on Very large data bases, p.756-767, August 31-September 03, 2004, Toronto, Canada
|
|
|
Jens-Peter Dittrich , Bernhard Seeger , David Scot Taylor , Peter Widmayer, Progressive merge join: a generic and non-blocking sort-based join algorithm, Proceedings of the 28th international conference on Very Large Data Bases, p.299-310, August 20-23, 2002, Hong Kong, China
|
|
|
Jochen Van den Bercken , Björn Blohsfeld , Jens-Peter Dittrich , Jürgen Krämer , Tobias Schäfer , Martin Schneider , Bernhard Seeger, XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries, Proceedings of the 27th International Conference on Very Large Data Bases, p.39-48, September 11-14, 2001
|
|
|
|
|
|
|
|