|
ABSTRACT
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. Before DBMSs providing highly-accurate approximate answers can become a reality, many new techniques for summarizing data and for estimating answers from summarized data must be developed. This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution. We quantify their advantages over standard sample views in terms of the number of additional sample points for the same view size, and hence in providing more accurate query answers. Finally, we consider their application to providing fast approximate answers to hot list queries. Our algorithms maintain their accuracy in the presence of ongoing insertions to the data warehouse.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
AMS96
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
Ant92
|
|
| |
AS94
|
|
| |
AZ96
|
|
| |
BDF+97
|
D. Barbar~i, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3- 45, 1997.
|
 |
BM96
|
Roberto J. Bayardo, Jr. , Daniel P. Miranker, Processing queries for first-few answers, Proceedings of the fifth international conference on Information and knowledge management, p.45-52, November 12-16, 1996, Rockville, Maryland, United States
[doi> 10.1145/238355.238372]
|
 |
BMUT97
|
Sergey Brin , Rajeev Motwani , Jeffrey D. Ullman , Shalom Tsur, Dynamic itemset counting and implication rules for market basket data, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.255-264, May 11-15, 1997, Tucson, Arizona, United States
|
| |
FJS97
|
|
| |
Fla85
|
|
| |
FM83
|
P. Flajolet and G. N. Martin. Probabilistic counting. In Proc. 24th IEEE Syrup. on Foundations of Computer Science, pages 76-82, October 1983.
|
| |
FM85
|
|
| |
GM95
|
P.B. Gibbons and Y. Matias, August 1995. Presentation and feedback during a Bell Labs-Teradata presentation to Walmart scientists and executives on proposed improvements to the Teradata DBS.
|
| |
GM97
|
E B. Gibbons and Y. Matias. Synopsis data structures, concise samples, and mode statistics. Manuscript, July 1997.
|
| |
GMP97a
|
P.B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997.
|
| |
GMP97b
|
|
| |
GPA+98
|
E B. Gibbons, V. Poosala, S. Acharya, Y. Bartal, Y. Matias, S. Muthukrishnan, S. Ramaswamy, and T. Suel. AQUA: System and techniques for approximate query answering. Technical report, Bell Laboratories, Murray Hill, New Jersey, February 1998.
|
 |
HHW97
|
Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States
|
| |
HK95
|
M. Hofri and N. Kechris. Probabilistic counting of a large number of events. Manuscript, 1995.
|
| |
HNSS95
|
|
 |
IC93
|
|
| |
Ioa93
|
|
 |
IP95
|
|
| |
Mat92
|
Y. Matias. Highly Parallel Randomized Algorithmics. PhD thesis, Tel Aviv University, Israel, 1992.
|
 |
Mor78
|
|
| |
MSY96
|
Y. Matias, S. C. Sahinalp, and N. E. Young. Performance evaluation of approximate priority queues. Presented at DIMACS Fifth Implementation Challenge: Priority Queues, Dictionaries, and Point Sets, organized by D. S. Johnson and C. McGeoch, October 1996.
|
| |
MVN93
|
Yossi Matias , Jeffrey Scott Vitter , Wen-Chun Ni, Dynamic generation of discrete random variates, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.361-370, January 25-27, 1993, Austin, Texas, United States
|
| |
MVY94
|
Yossi Matias , Jeffrey Scott Vitter , Neal E. Young, Approximate data structures with applications, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.187-194, January 23-25, 1994, Arlington, Virginia, United States
|
| |
OR89
|
|
| |
OR92
|
|
 |
PIHS96
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
Pre97
|
D. Pregibon. Mega-monitoring: Developing and using telecommunications signatures, October 1997. Invited talk at the DIMACS Workshop on Massive Data Sets in Telecommunications.
|
 |
Vit85
|
|
| |
VL93
|
|
 |
WVZT90
|
|
CITED BY 112
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jeffrey Scott Vitter , Min Wang , Bala Iyer, Data cube approximation and histograms via wavelets, Proceedings of the seventh international conference on Information and knowledge management, p.96-104, November 02-07, 1998, Bethesda, Maryland, United States
|
|
|
|
|
|
|
|
|
M. Burrows , U. Erlingson , S.-T. A. Leung , M. T. Vandevoorde , C. A. Waldspurger , K. Walker , W. E. Weihl, Efficient and flexible value sampling, ACM SIGPLAN Notices, v.35 n.11, p.160-167, Nov. 2000
|
|
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lukasz Golab , David DeHaan , Erik D. Demaine , Alejandro Lopez-Ortiz , J. Ian Munro, Identifying frequent items in sliding windows over on-line packet streams, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
|
|
|
|
|
|
|
|
|
Cheqing Jin , Weining Qian , Chaofeng Sha , Jeffrey X. Yu , Aoying Zhou, Dynamically maintaining frequent items over a data stream, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
|
|
|
|
|
|
|
|
|
|
|
|
Joseph M. Hellerstein , Ron Avnur , Andy Chou , Christian Hidber , Chris Olston , Vijayshankar Raman , Tali Roth , Peter J. Haas, Interactive Data Analysis: The Control Project, Computer, v.32 n.8, p.51-59, August 1999
|
|
|
M. Burrows , U. Erlingson , S-T. A. Leung , M. T. Vandevoorde , C. A. Waldspurger , K. Walker , W. E. Weihl, Efficient and flexible value sampling, ACM SIGOPS Operating Systems Review, v.34 n.5, p.160-167, Dec. 2000
|
|
|
|
|
|
|
|
|
Zina Ben Miled , Jin Liu , Omran Bukhres , Huian Li , Jesse Martin , Chavali Balagopalakrishna , Robert Oppelt, Use and Maintenance of Histograms for Large Scientific Database Access Planning: A Case Study of a Pharmaceutical Data Repository, Journal of Intelligent Information Systems, v.23 n.2, p.145-178, September 2004
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edith Cohen , Nick Duffield , Haim Kaplan , Carsten Lund , Mikkel Thorup, Sketching unaggregated data streams for subpopulation-size queries, Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 11-13, 2007, Beijing, China
|
|
|
Sumeet Singh , Cristian Estan , George Varghese , Stefan Savage, Automated worm fingerprinting, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation, p.4-4, December 06-08, 2004, San Francisco, CA
|
|
|
|
|
|
|
|
|
Amol Deshpande , Carlos Guestrin , Samuel R. Madden , Joseph M. Hellerstein , Wei Hong, Model-driven data acquisition in sensor networks, Proceedings of the Thirtieth international conference on Very large data bases, p.588-599, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Edith Cohen , Nick Duffield , Haim Kaplan , Carsten Lund , Mikkel Thorup, Algorithms and estimators for accurate summarization of internet traffic, Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, October 24-26, 2007, San Diego, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|