|
ABSTRACT
Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization, and approximate query answering. They are also a basic tool for data visualization and analysis.In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation of the data distribution of the underlying continuous stream. On demand, an accurate histogram is derived from this dynamic data structure. We propose algorithms for extracting such an accurate histogram and we analyze their behavior and tradeoffs. The proposed algorithms are able to provide approximate guarantees about the quality of the estimation of the histograms they extract.We complement our analytical results with a thorough experimental evaluation using real data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, The Aqua approximate query answering system, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.574-576, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
3
|
|
 |
4
|
|
 |
5
|
Nicolas Bruno , Surajit Chaudhuri , Luis Gravano, STHoles: a multidimensional workload-aware histogram, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.211-222, May 21-24, 2001, Santa Barbara, California, United States
|
| |
6
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
7
|
|
 |
8
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509966]
|
| |
9
|
A. Gilbert, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Quicksand: quick summary and analysis of network data. DIMACS tech report.
|
| |
10
|
|
| |
11
|
Jim Gray , Adam Bosworth , Andrew Layman , Hamid Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total, Proceedings of the Twelfth International Conference on Data Engineering, p.152-159, February 26-March 01, 1996
|
 |
12
|
|
| |
13
|
S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE, Feb. 2002.
|
 |
14
|
|
| |
15
|
|
 |
16
|
Dimitrios Gunopulos , George Kollios , Vassilis J. Tsotras , Carlotta Domeniconi, Approximating multi-dimensional aggregate range queries over real attributes, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.463-474, May 15-18, 2000, Dallas, Texas, United States
|
| |
17
|
|
 |
18
|
Peter J. Haas , Jeffrey F. Naughton , S. Seshadri , Arun N. Swami, Fixed-precision estimation of join selectivity, Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.190-201, May 25-28, 1993, Washington, D.C., United States
[doi> 10.1145/153850.153875]
|
 |
19
|
|
| |
20
|
|
 |
21
|
|
| |
22
|
H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Viswanath Poosala , Kenneth C. Sevcik , Torsten Suel, Optimal Histograms with Quality Guarantees, Proceedings of the 24rd International Conference on Very Large Data Bases, p.275-286, August 24-27, 1998
|
| |
23
|
|
| |
24
|
|
 |
25
|
Ju-Hong Lee , Deok-Hwan Kim , Chin-Wan Chung, Multi-dimensional selectivity estimation using compressed histogram information, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.205-214, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
26
|
S. Madden and M. Franklin. Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. Proceedings of ICDE, Feb. 2002.
|
 |
27
|
Yossi Matias , Jeffrey Scott Vitter , Min Wang, Wavelet-based histograms for selectivity estimation, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.448-459, June 01-04, 1998, Seattle, Washington, United States
|
| |
28
|
|
| |
29
|
S. Muthukrishnan, V. Poosala, and T. Suel. Partitioning two dimensional arrays: algorithms, complexity and applications. Proc. Intl Conf. Database Theory, 1998.
|
| |
30
|
|
 |
31
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
 |
32
|
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay, Random sampling techniques for space efficient online computation of order statistics of large datasets, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.251-262, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
33
|
|
 |
34
|
Jeffrey Scott Vitter , Min Wang , Bala Iyer, Data cube approximation and histograms via wavelets, Proceedings of the seventh international conference on Information and knowledge management, p.96-104, November 02-07, 1998, Bethesda, Maryland, United States
[doi> 10.1145/288627.288645]
|
 |
35
|
Yi-Leh Wu , Divyakant Agrawal , Amr El Abbadi, Applying the golden rule of sampling for query estimation, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.449-460, May 21-24, 2001, Santa Barbara, California, United States
|
CITED BY 58
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Theodore Johnson , Flip Korn , S. Muthukrishnan , Oliver Spatscheck , Divesh Srivastava, Holistic UDAFs at streaming speeds, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Neal Patwari , Alfred O. Hero, III , Adam Pacholski, Manifold learning visualization of network traffic data, Proceeding of the 2005 ACM SIGCOMM workshop on Mining network data, August 26-26, 2005, Philadelphia, Pennsylvania, USA
|
|
|
|
|
|
|
|
|
|
|
|
Anna C. Gilbert , Sudipto Guha , Piotr Indyk , Yannis Kotidis , S. Muthukrishnan , Martin J. Strauss, Fast, small-space algorithms for approximate histogram maintenance, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
|
|
|
Robert L. Grossman , Yunhong Gu , David Hanley , Michal Sabala , Joe Mambretti , Alex Szalay , Ani Thakar , Kazumi Kumazoe , Oie Yuji , Minsun Lee , Yoonjoo Kwon , Woojin Seok, Data mining middleware for wide-area high-performance networks, Future Generation Computer Systems, v.22 n.8, p.940-948, October 2006
|
|
|
|
|
|
Jizhou Luo , Xiaofang Zhou , Yu Zhang , Heng Tao Shen , Jianzhong Li, Selectivity estimation by batch-query based histogram and parametric method, Proceedings of the eighteenth conference on Australasian database, p.93-102, January 30-February 02, 2007, Ballarat, Victoria, Australia
|
|
|
|
|
|
|
|
|
|
|
|
S. Subramaniam , T. Palpanas , D. Papadopoulos , V. Kalogeraki , D. Gunopulos, Online outlier detection in sensor data using non-parametric models, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wei Wang , Haifeng Jiang , Hongjun Lu , Jeffrey Xu Yu, Bloom histogram: path selectivity estimation for XML data with updates, Proceedings of the Thirtieth international conference on Very large data bases, p.240-251, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
Graham Cormode , Mayur Datar , Piotr Indyk , S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), Proceedings of the 28th international conference on Very Large Data Bases, p.335-345, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|