Estimating the selectivity of tf-idf based cosine similarity predicates
Source
ACM SIGMOD Record
Volume 36, Issue 2 (June 2007)
Pages 7-12
Year of Publication: 2007
ISSN: 0163-5808
ABSTRACT
An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
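To make the estimated quantity concrete, the following is a minimal brute-force sketch of the exact selectivity of a cosine similarity predicate: the fraction of strings in a collection whose tf-idf cosine similarity to a query string meets a threshold. It is not the estimation approach presented in the paper; it only illustrates the ground truth that an estimator would be compared against. The whitespace tokenization, the `log(1 + n / df)` idf weighting, the default idf for unseen tokens, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumed tokenization: lowercase, split on whitespace.
    return s.lower().split()


def build_idf(strings):
    # Document frequency counts each token once per string.
    n = len(strings)
    df = Counter()
    for s in strings:
        df.update(set(tokenize(s)))
    # Assumed idf variant: log(1 + n / df). Tokens never seen in the
    # collection fall back to the idf of a token with df = 1.
    idf = {t: math.log(1 + n / c) for t, c in df.items()}
    return idf, math.log(1 + n)


def vectorize(s, idf, default_idf):
    # Sparse tf-idf vector, normalized to unit length.
    tf = Counter(tokenize(s))
    w = {t: c * idf.get(t, default_idf) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm > 0 else {}


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    return sum(x * v.get(t, 0.0) for t, x in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings whose similarity to the query is >= threshold.
    # Assumes a non-empty collection.
    idf, default_idf = build_idf(strings)
    q = vectorize(query, idf, default_idf)
    hits = sum(
        1 for s in strings
        if cosine(q, vectorize(s, idf, default_idf)) >= threshold
    )
    return hits / len(strings)
```

This scan touches every string in the collection, which is the cost a query optimizer cannot afford at planning time; hence the need for an estimate.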
CITED BY 2
Sam Small, Joshua Mason, Fabian Monrose, Niels Provos, Adam Stubblefield, To catch a predator: a natural language approach for eliciting malicious payloads, Proceedings of the 17th conference on Security symposium, p.171-183, July 28-August 01, 2008, San Jose, CA