ACM Home Page
Please provide us with feedback. Feedback
Similarity search: a matching based approach
Full text PdfPdf (496 KB)
Source Very Large Data Bases archive
Proceedings of the 32nd international conference on Very large data bases table of contents
Seoul, Korea
SESSION: Research sessions: Research 18: Algorithms & data mining table of contents
Pages: 631 - 642  
Year of Publication: 2006
Authors
Anthony K. H. Tung  National Univ. of Singapore
Rui Zhang  Univ. of Melbourne
Nick Koudas  Univ. of Toronto
Beng Chin Ooi  National Univ. of Singapore
Sponsors
: Naver
Microsoft : Microsoft
: Kaist
: CCF-DBS
: Systems Applications Products
: SK telecom
: Advanced Information Technology Research Center
: AJU Information Technology Co., Ltd
Korea Info Sci Society : Korea Information Science Society
: The Database Society of Japan
: International Business Management
: Google
SIGMOD: ACM Special Interest Group on Management of Data
: Air Force Office of Scientific Research/Asian Office of Aerospace R&D
: US Army ITC-PAC Asian Research Office
: K.I.S.S. SIG on Databases
: LG Electronics
ORACLE : ORACLE
: Samsung SOS
: Kosef
Publisher
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 80,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

Warning: The download time has expired please click on the item to try again.


ABSTRACT

Similarity search is a crucial task in multimedia retrieval and data mining. Most existing work has modelled this problem as the nearest neighbor (NN) problem, which considers the distance between the query object and the data objects over a fixed set of features. Such an approach has two drawbacks: 1) it leaves many partial similarities uncovered; 2) the distance is often affected by a few dimensions with high dissimilarity. To overcome these drawbacks, we propose the k-n-match problem in this paper.The k-n-match problem models similarity search as matching between the query object and the data objects in n dimensions, where n is a given integer smaller than dimensionality d and these n dimensions are determined dynamically to make the query object and the data objects returned in the answer set match best. The k-n-match query is expected to be superior to the kNN query in discovering partial similarities, however, it may not be as good in identifying full similarity since a single value of n may only correspond to a particular aspect of an object instead of the entirety. To address this problem, we further introduce the frequent k-n-match problem, which finds a set of objects that appears in the k-n-match answers most frequently for a range of n values. Moreover, we propose search algorithms for both problems. We prove that our proposed algorithm is optimal in terms of the number of individual attributes retrieved, which is especially useful for information retrieval from multiple systems. We can also apply the proposed algorithmic strategy to achieve a disk based algorithm for the (frequent) k-n-match query. By a thorough experimental study using both real and synthetic data sets, we show that: 1) the k-n-match query yields better result than the kNN query in identifying similar objects by partial similarities; 2) our proposed method (for processing the frequent k-n-match query) outperforms existing techniques for similarity search in terms of both effectiveness and efficiency.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
[1] http://www1.cs.columbia.edu/CAVE/research/ softlib/coil-100.html.
 
2
[2] ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.
 
3
[3] http://kdd.ics.uci.edu/databases/CorelFeatures/ CorelFeatures.data.html.
 
4
[4] C. C. Aggarwal. Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In ICDE, 2002.
 
5
6
 
7
 
8
 
9
[9] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, 2001.
 
10
11
12
13
 
14
 
15
[15] Richard W. Hamming. Error detecting and error correcting codes. Bell Systems Technical Journal, 29:147-160, 1950.
 
16
17
18
19
20
 
21
 
22


Collaborative Colleagues:
Anthony K. H. Tung: colleagues
Rui Zhang: colleagues
Nick Koudas: colleagues
Beng Chin Ooi: colleagues