Estimating the selectivity of tf-idf based cosine similarity predicates
Source
ACM SIGMOD Record archive
Volume 36, Issue 2 (June 2007)
Pages 7-12
Year of Publication: 2007
ISSN: 0163-5808
Authors
Sandeep Tata  University of Michigan, Ann Arbor, Michigan
Jignesh M. Patel  University of Michigan, Ann Arbor, Michigan
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1328854.1328855

ABSTRACT

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
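
To make the estimated quantity concrete, the following is a minimal sketch of what "the selectivity of a tf.idf based cosine similarity predicate" means: the fraction of strings in a collection whose cosine similarity to a query string meets a threshold. It computes that fraction exactly by brute force, which is the baseline an estimator tries to approximate cheaply; it is not the estimation method proposed in the paper. The whitespace tokenizer, the idf formula log(1 + N/df), and all function names are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase, whitespace-separated words are the tokens.
    return s.lower().split()


def build_idf(strings):
    # Document frequency counts each token at most once per string.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumption: idf(t) = log(1 + N / df(t)); other variants are common.
    return {t: math.log(1 + n / d) for t, d in df.items()}


def tfidf_vector(s, idf):
    # Sparse, L2-normalized tf.idf vector; tokens unseen in the collection are dropped.
    tf = Counter(tokenize(s))
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    # For unit-length vectors the cosine is just the dot product.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings satisfying cosine(query, s) >= threshold.
    idf = build_idf(strings)
    q = tfidf_vector(query, idf)
    hits = sum(1 for s in strings if cosine(q, tfidf_vector(s, idf)) >= threshold)
    return hits / len(strings)


if __name__ == "__main__":
    names = [
        "university of michigan",
        "michigan state university",
        "new york university",
        "acm sigmod record",
    ]
    for tau in (0.4, 0.9):
        print(tau, exact_selectivity("university of michigan", names, tau))
```

The brute-force scan touches every string, so its cost grows linearly with the collection; a query optimizer needs an estimate of the same fraction without performing that scan.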



