ACM Home Page
Please provide us with feedback. Feedback
Towards estimating the number of distinct value combinations for a set of attributes
Full text PdfPdf (270 KB)
Source Conference on Information and Knowledge Management archive
Proceedings of the 14th ACM international conference on Information and knowledge management table of contents
Bremen, Germany
SESSION: Paper session DB-8 (databases): query optimisation table of contents
Pages: 656 - 663  
Year of Publication: 2005
ISBN:1-59593-140-6
Authors
Xiaohui Yu  University of Toronto, Toronto, ON, Canada
Calisto Zuzarte  IBM Toronto Lab, Markham, ON, Canada
Kenneth C. Sevcik  University of Toronto, Toronto, ON, Canada
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 9,   Downloads (12 Months): 39,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1099554.1099719
What is a DOI?

ABSTRACT

Accurately and efficiently estimating the number of distinct values for some attribute(s) or sets of attributes in a data set is of critical importance to many database operations, such as query optimization and approximation query answering. Previous work has focused on the estimation of the number of distinct values for a single attribute and most existing work adopts a data sampling approach. This paper addresses the equally important issue of estimating the number of distinct value combinations for multiple attributes which we call COLSCARD (for COLumn Set CARDinality). It also takes a different approach that uses existing statistical information (e.g., histograms) available on the individual attributes to assist estimation. We start with cases where exact frequency information on individual attributes is available, and present a pair of lower and upper bounds on COLSCARD that are consistent with the available information, as well as an estimator of COLSCARD based on probability. We then proceed to study the case where only partial information (in the form of histograms) is available on individual attributes, and show how the proposed estimator can be adapted to this case. We consider two types of widely used histograms and show how they can be constructed in order to obtain optimal approximation. An experimental evaluation of the proposed estimation method on synthetic as well as two real data sets is provided.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
D. Barbará, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, T. Johnson, R. T. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. IEEE Data Eng. Bull., 20(4):3--45, 1997.
 
2
J. Bunge and M. Fitzpatrick. Estimating the number of species: A review. Journal of the American Statistical Association, 88:364--373, 1993.
3
 
4
 
5
 
6
P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. Journal of the American Statistical Association, 93:1475--1487, 1998.
7
8
9
 
10
 
11
 
12
 
13
F. Olken and D. Rotem. Random sampling from databases - a survey, March 1995.
 
14
G. Özsoyoglu, K. Du, A. Tjahjana, W.-C. Hou, and D. Y. Rowland. On estimating COUNT, SUM, and AVERAGE. In DEXA, pages 406--412, 1991.
15
 
16
C. B. S. Hettich and C. Merz. UCI repository of machine learning databases, 1998.

Collaborative Colleagues:
Xiaohui Yu: colleagues
Calisto Zuzarte: colleagues
Kenneth C. Sevcik: colleagues