ABSTRACT
Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their storage. Prefix redundancy is high in dense areas of cubes, but suffix redundancy is significantly higher in sparse areas. Together, the two collapse the exponential size of high-dimensional full cubes into a dramatically condensed data structure. Eliminating suffix redundancy yields an equally dramatic reduction in cube computation time, because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation among attributes in the cube. A petabyte 25-dimensional cube was shrunk this way to a 2.3 GB Dwarf cube in less than 20 minutes, a 1:400,000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure that requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions.

This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include detailed comparisons against previously published techniques on real and synthetic datasets. The comparisons show that Dwarfs outperform these techniques by far on all counts: storage space, creation time, query response time, and cube updates.
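As a rough illustration of the suffix redundancy the abstract refers to, the Python sketch below builds the full cube of a tiny fact table by brute force and checks that some group-by cells carry identical aggregates. The toy table, the names `full_cube` and `ALL`, and the choice of SUM as the aggregate are assumptions made for this example. The sketch is not the Dwarf construction algorithm; it only shows the kind of repetition that Dwarf is described as coalescing.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # hypothetical marker for "aggregated over this dimension"

def full_cube(rows):
    """Naive full cube: SUM of the measure for every value/ALL combination.

    Each row is (dim_1, ..., dim_d, measure). A row contributes to 2**d
    cells, which is why full cubes grow exponentially with dimensionality.
    """
    cube = defaultdict(int)
    for *dims, measure in rows:
        for mask in product((False, True), repeat=len(dims)):
            key = tuple(ALL if hidden else value
                        for value, hidden in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

# Toy fact table (illustrative data): store, customer, product, price.
facts = [
    ("S1", "C2", "P2", 70),
    ("S1", "C3", "P1", 40),
    ("S2", "C1", "P1", 90),
    ("S2", "C1", "P2", 50),
]

cube = full_cube(facts)

# Store S2 has a single customer, C1. Every cell under (S2, C1, ...) therefore
# equals the matching cell under (S2, ALL, ...): the two suffixes are
# identical and would only need to be stored once.
redundant_pairs = [
    (("S2", "C1", p), ("S2", ALL, p)) for p in ("P1", "P2", ALL)
]
for specific, rolled_up in redundant_pairs:
    print(specific, rolled_up, cube[specific], cube[rolled_up])
```

Sparse data makes this situation common: the fewer tuples share a given prefix, the more often the aggregates below it are exact copies of aggregates already computed elsewhere.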