|
ABSTRACT
A statistical profile summarizes the instances of a database. It describes aspects such as the number of tuples, the number of values, the distribution of values, the correlation between value sets, and the distribution of tuples among secondary storage units. Estimation of database profiles is critical in the problems of query optimization, physical database design, and database performance prediction. This paper describes a model of a database of profile, relates this model to estimating the cost of database operations, and surveys methods of estimating profiles. The operators and objects in the model include build profile, estimate profile, and update profile. The estimate operator is classified by the relational algebra operator (select, project, join), the property to be estimated (cardinality, distribution of values, and other parameters), and the underlying method (parametric, nonparametric, and ad-hoc). The accuracy, overhead, and assumptions of methods are discussed in detail. Relevant research in both the database and the statistics disciplines is incorporated in the detailed discussion.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
APERS, P. M. G., HEV~ER, A. R., ^NO Y^O, S. B. 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1, 57-68.
|
| |
2
|
ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K. 1985. Counting unique values of an attribute without sorting. Tech. Rep. RJ 4960, IBM Research Division.
|
 |
3
|
|
 |
4
|
Philip A. Bernstein , Nathan Goodman , Eugene Wong , Christopher L. Reeve , James B. Rothnie, Jr., Query processing in a system for distributed databases (SDD-1), ACM Transactions on Database Systems (TODS), v.6 n.4, p.602-625, Dec. 1981
[doi> 10.1145/319628.319650]
|
| |
5
|
BREIMAN, L., MEISEL, W., AND PURCELL, E. 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135-144.
|
| |
6
|
CACOULLOS, V., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 178-189.
|
 |
7
|
|
| |
8
|
|
 |
9
|
D. D. Chamberlin , M. M. Astrahan , W. F. King , R. A. Lorie , J. W. Mehl , T. G. Price , M. Schkolnick , P. Griffiths Selinger , D. R. Slutz , B. W. Wade , R. A. Yost, Support for repetitive transactions and ad hoc queries in System R, ACM Transactions on Database Systems (TODS), v.6 n.1, p.70-94, March 1981
[doi> 10.1145/319540.319550]
|
 |
10
|
|
| |
11
|
|
| |
12
|
CHRISTODOULAKIS, S. 1983a. Estimating record selectivities. Inf. Syst. 8, 2 105-115.
|
 |
13
|
|
| |
14
|
CHRISTODOULAKIS, $. 1984a. Estimating block selectivities. Inf. Syst. 9, 1, 69-79.
|
 |
15
|
|
| |
16
|
CLARK, C., AND SCHKADE, L. 1983. StatisticalAnatysis for Administrative Decisions. South-Western, Cincinnati.
|
| |
17
|
|
| |
18
|
DEVROYE, L. 1985. A note on the L~ consistency of variable kernel estimates. Ann. Star. 13, 1041-1049.
|
 |
19
|
David J DeWitt , Randy H Katz , Frank Olken , Leonard D Shapiro , Michael R Stonebraker , David Wood, Implementation techniques for main memory database systems, Proceedings of the 1984 ACM SIGMOD international conference on Management of data, June 18-21, 1984, Boston, Massachusetts
|
 |
20
|
|
 |
21
|
|
 |
22
|
|
| |
23
|
FRASER, D. 1957. Nonparametric Methods in Statistics. Wiley, New York.
|
| |
24
|
|
| |
25
|
HEVNER, A., AND YAO, S. B. 1979. Query processing in distributed database systems. IEEE Trans. Softw. Eng. SE-3, 3.
|
| |
26
|
|
 |
27
|
|
| |
28
|
JOHNSON, N., AND KOTZ, S. 1970. Distributions in Statistics: Continuous Univariate Distributions, vols. I and 2. Houghton Mifflin, Boston.
|
 |
29
|
|
 |
30
|
|
| |
31
|
|
| |
32
|
KOTZ, S., AND JOHNSON, N. 1977. Urn Models and Their Application. Wiley, New York.
|
 |
33
|
|
| |
34
|
LOHMAN, G., MOHAN, C., HAAS, L., LINDSAY, B., SELINGER, P., WILMS, P., AND DANIELS, D. 1985. Query processing in R*. In Query Processing in Database Systems, W. Kim, D. Batory, and D. Reiner, Eds. Springer-Verlag, New York, pp. 31-47.
|
 |
35
|
|
| |
36
|
MACKERT, L., AND LOHMAN, G. 1985. Index scans using a finite LRU buffer: A validated I/O model. IBM Research Rep. RJ4836, Almaden Research Center, San Jose, Calif.
|
 |
37
|
|
| |
38
|
|
| |
39
|
MAHALANOBIS, P. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Sciences of India 12, pp. 35-49.
|
| |
40
|
MANNINO, M. 1986. Selectivity estimation in unify SQL. Tech. Rep. Dept. of Management Science and Information Systems, Univ. of Texas, Austin.
|
| |
41
|
|
| |
42
|
MERRETT, T. H., AND OTOO, E. 1979. Distribution models of relations. In Proceedings of the 5th International Conference on Very Large Data Bases (Rio de Janeiro, Brazil, Oct.). ACM, New York, pp. 418-425.
|
| |
43
|
MONTGOMERY, A., D'SOuzA, D., AND LEE, S. 1983. The cost of relational algebraic operations on skewed data: Estimates and experiments. In Information Processing Letters 83. Elsevier North-Holland, New York, pp. 235-241.
|
| |
44
|
MOORE, D., AND YACKEL, J. 1977. Consistency properties of nearest neighbor density function estimates. Ann. Stat. 5, 143-154.
|
 |
45
|
|
| |
46
|
MUTHUSWAMY, B., AND KERSCHBERG, G. 1985. A DDSM for relational query optimization. Tech. Rep., Univ. of South Carolina, Columbia. Also in Proceedings of the ACM Annual Conference (Denver, Colo., Oct.). ACM, New York.
|
 |
47
|
|
| |
48
|
PIATETSKY-SHAPIRO, G. 1985. Estimating the number of distinct attribute values by using sampling. Submitted for publication.
|
| |
49
|
|
| |
50
|
SAMSON, W., AND BENDELL, A. 1983. Rank order distributions and secondary key indexing (extended abstract). In Proceedings of the 2nd International Conference on Databases, (Cambridge, England).
|
 |
51
|
P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts
[doi> 10.1145/582095.582099]
|
| |
52
|
SCHKOLNICK, M., AND TIBERIO, P. 1979. Considerations in developing a design tool for a relational DBMS. In Proceedings of the IEEE COMPSAC Conference (Chicago, Ill., Nov.). IEEE Press, New York, pp. 228-235.
|
| |
53
|
STONEBRAKER, M., RUBENSTEIN, B., AND GUTMANN, A. 1983. Application of abstract data types and abstract indices to CAD databases. In Proceedings of Database Week: Engineering Design Applications (San Jose, Calif., May). pp. 107-113.
|
| |
54
|
STUROES, H. 1926. The choice of class interval. J. Am. Stat. Assoc. 65-66.
|
| |
55
|
TAPIA, R., AND THOMPSON, J. 1978. Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore, Md.
|
| |
56
|
|
| |
57
|
UNIFY CORPORATION. 1985. Unix Relational Database Management System--Reference Manual. Release 3.2. Portland, Oreg.
|
| |
58
|
|
| |
59
|
WEOMAN, E. 1983. Density estimation. In Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz and N. Johnson, Eds. Wiley, New York.
|
| |
60
|
WERTZ, W. 1978. Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht, Gottingen.
|
 |
61
|
|
 |
62
|
|
| |
63
|
ZAHORJAN, J., BELL. B., AND SEVCIK, K. 1983. Estimating block transfers when record access probabilities are non-uniform. Inf. Process. Lett. 16, 5 (June), 249-252.
|
| |
64
|
ZIPF, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass.
|
CITED BY 68
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bharat Bhasker , Csaba J. Egyhazy , Konstantinos P. Triantis, The architecture of a heterogeneous distributed database management system: the distributed access view integrated database (DAVID), Proceedings of the 1992 ACM annual conference on Communications, p.173-179, March 03-05, 1992, Kansas City, Missouri, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Kamil Saraç , Ömer Eğecioǧlu , Amr El Abbadi, Iterated DFT based techniques for join size estimation, Proceedings of the seventh international conference on Information and knowledge management, p.348-355, November 02-07, 1998, Bethesda, Maryland, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Marin J. Strauss, Optimal and approximate computation of summary statistics for range aggregates, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.227-236, May 2001, Santa Barbara, California, United States
|
|
|
|
|
|
Brian Dunkel , Qiang Zhu , Wing Lau , Suyun Chen, Multiple-granularity interleaving for piggyback query processing, Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, p.2, November 08-11, 1999, Mississauga, Ontario, Canada
|
|
|
Qiang Zhu , Brian Dunkel , Nandit Soparkar , Suyun Chen , Berni Schiefer , Tony Lai, A piggyback method to collect statistics for query optimization in database management systems, Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative research, p.25, November 30-December 03, 1998, Toronto, Ontario, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Wan-Sup Cho , Wook-Shin Han , Ki-Hyung Hong , Kyu-Young Whang, Estimating nested selectivity in object-oriented databases, Proceedings of the ninth international conference on Information and knowledge management, p.94-101, November 06-11, 2000, McLean, Virginia, United States
|
|
|
|
|
|
|
|
|
|
|
|
Nick Koudas , S. Muthukrishnan , Divesh Srivastava, Optimal histograms for hierarchical range queries (extended abstract), Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.196-204, May 15-18, 2000, Dallas, Texas, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carlo Dell'aquila , Ezio Lefons , Filippo Tangorra, Analytic-based estimation of query result sizes, Proceedings of the 4th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering Data Bases, p.1-7, February 13-15, 2005, Salzburg, Austria
|
|
|
|
|
|
|
|
|
|
|
|
H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Viswanath Poosala , Kenneth C. Sevcik , Torsten Suel, Optimal Histograms with Quality Guarantees, Proceedings of the 24rd International Conference on Very Large Data Bases, p.275-286, August 24-27, 1998
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carlo Dell'Aquila , Ezio Lefons , Filippo Tangorra, Analytic use of bitmap indices, Proceedings of the 6th Conference on 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, p.159-164, February 16-19, 2007, Corfu Island, Greece
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carlo Dell'aquila , Ezio Lefons , Filippo Tangorra, Capturing semantics from bitmap indices for data analysis, Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, p.438-443, September 22-24, 2006, Lisbon, Portugal
|
|