ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
Statistical profile estimation in database systems
Full text PdfPdf (2.94 MB)
Source ACM Computing Surveys (CSUR) archive
Volume 20 ,  Issue 3  (September 1988) table of contents
Pages: 191 - 221  
Year of Publication: 1988
ISSN:0360-0300
Authors
Michael V. Mannino  Univ. of Texas at Austin, Austin
Paicheng Chu  Ohio State Univ., Columbus
Thomas Sager  Univ. of Texas at Austin, Austin
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 124,   Citation Count: 69
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/62061.62063
What is a DOI?

ABSTRACT

A statistical profile summarizes the instances of a database. It describes aspects such as the number of tuples, the number of values, the distribution of values, the correlation between value sets, and the distribution of tuples among secondary storage units. Estimation of database profiles is critical in the problems of query optimization, physical database design, and database performance prediction. This paper describes a model of a database of profile, relates this model to estimating the cost of database operations, and surveys methods of estimating profiles. The operators and objects in the model include build profile, estimate profile, and update profile. The estimate operator is classified by the relational algebra operator (select, project, join), the property to be estimated (cardinality, distribution of values, and other parameters), and the underlying method (parametric, nonparametric, and ad-hoc). The accuracy, overhead, and assumptions of methods are discussed in detail. Relevant research in both the database and the statistics disciplines is incorporated in the detailed discussion.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
APERS, P. M. G., HEV~ER, A. R., ^NO Y^O, S. B. 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1, 57-68.
 
2
ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K. 1985. Counting unique values of an attribute without sorting. Tech. Rep. RJ 4960, IBM Research Division.
3
4
 
5
BREIMAN, L., MEISEL, W., AND PURCELL, E. 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135-144.
 
6
CACOULLOS, V., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 178-189.
7
 
8
9
10
 
11
 
12
CHRISTODOULAKIS, S. 1983a. Estimating record selectivities. Inf. Syst. 8, 2 105-115.
13
 
14
CHRISTODOULAKIS, $. 1984a. Estimating block selectivities. Inf. Syst. 9, 1, 69-79.
15
 
16
CLARK, C., AND SCHKADE, L. 1983. StatisticalAnatysis for Administrative Decisions. South-Western, Cincinnati.
 
17
 
18
DEVROYE, L. 1985. A note on the L~ consistency of variable kernel estimates. Ann. Star. 13, 1041-1049.
19
20
21
22
 
23
FRASER, D. 1957. Nonparametric Methods in Statistics. Wiley, New York.
 
24
 
25
HEVNER, A., AND YAO, S. B. 1979. Query processing in distributed database systems. IEEE Trans. Softw. Eng. SE-3, 3.
 
26
27
 
28
JOHNSON, N., AND KOTZ, S. 1970. Distributions in Statistics: Continuous Univariate Distributions, vols. I and 2. Houghton Mifflin, Boston.
29
30
 
31
 
32
KOTZ, S., AND JOHNSON, N. 1977. Urn Models and Their Application. Wiley, New York.
33
 
34
LOHMAN, G., MOHAN, C., HAAS, L., LINDSAY, B., SELINGER, P., WILMS, P., AND DANIELS, D. 1985. Query processing in R*. In Query Processing in Database Systems, W. Kim, D. Batory, and D. Reiner, Eds. Springer-Verlag, New York, pp. 31-47.
35
 
36
MACKERT, L., AND LOHMAN, G. 1985. Index scans using a finite LRU buffer: A validated I/O model. IBM Research Rep. RJ4836, Almaden Research Center, San Jose, Calif.
37
 
38
 
39
MAHALANOBIS, P. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Sciences of India 12, pp. 35-49.
 
40
MANNINO, M. 1986. Selectivity estimation in unify SQL. Tech. Rep. Dept. of Management Science and Information Systems, Univ. of Texas, Austin.
 
41
 
42
MERRETT, T. H., AND OTOO, E. 1979. Distribution models of relations. In Proceedings of the 5th International Conference on Very Large Data Bases (Rio de Janeiro, Brazil, Oct.). ACM, New York, pp. 418-425.
 
43
MONTGOMERY, A., D'SOuzA, D., AND LEE, S. 1983. The cost of relational algebraic operations on skewed data: Estimates and experiments. In Information Processing Letters 83. Elsevier North-Holland, New York, pp. 235-241.
 
44
MOORE, D., AND YACKEL, J. 1977. Consistency properties of nearest neighbor density function estimates. Ann. Stat. 5, 143-154.
45
 
46
MUTHUSWAMY, B., AND KERSCHBERG, G. 1985. A DDSM for relational query optimization. Tech. Rep., Univ. of South Carolina, Columbia. Also in Proceedings of the ACM Annual Conference (Denver, Colo., Oct.). ACM, New York.
47
 
48
PIATETSKY-SHAPIRO, G. 1985. Estimating the number of distinct attribute values by using sampling. Submitted for publication.
 
49
 
50
SAMSON, W., AND BENDELL, A. 1983. Rank order distributions and secondary key indexing (extended abstract). In Proceedings of the 2nd International Conference on Databases, (Cambridge, England).
51
 
52
SCHKOLNICK, M., AND TIBERIO, P. 1979. Considerations in developing a design tool for a relational DBMS. In Proceedings of the IEEE COMPSAC Conference (Chicago, Ill., Nov.). IEEE Press, New York, pp. 228-235.
 
53
STONEBRAKER, M., RUBENSTEIN, B., AND GUTMANN, A. 1983. Application of abstract data types and abstract indices to CAD databases. In Proceedings of Database Week: Engineering Design Applications (San Jose, Calif., May). pp. 107-113.
 
54
STUROES, H. 1926. The choice of class interval. J. Am. Stat. Assoc. 65-66.
 
55
TAPIA, R., AND THOMPSON, J. 1978. Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore, Md.
 
56
 
57
UNIFY CORPORATION. 1985. Unix Relational Database Management System--Reference Manual. Release 3.2. Portland, Oreg.
 
58
 
59
WEOMAN, E. 1983. Density estimation. In Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz and N. Johnson, Eds. Wiley, New York.
 
60
WERTZ, W. 1978. Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht, Gottingen.
61
62
 
63
ZAHORJAN, J., BELL. B., AND SEVCIK, K. 1983. Estimating block transfers when record access probabilities are non-uniform. Inf. Process. Lett. 16, 5 (June), 249-252.
 
64
ZIPF, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass.

CITED BY  70

Collaborative Colleagues:
Michael V. Mannino: colleagues
Paicheng Chu: colleagues
Thomas Sager: colleagues