ACM Home Page
Please provide us with feedback. Feedback
Snakes and sandwiches: optimal clustering strategies for a data warehouse
Full text PdfPdf (1.47 MB)
Source International Conference on Management of Data archive
Proceedings of the 1999 ACM SIGMOD international conference on Management of data table of contents
Philadelphia, Pennsylvania, United States
Pages: 37 - 48  
Year of Publication: 1999
ISBN:1-58113-084-8
Also published in ...
Authors
H. V. Jagadish  U of Illinois, Urbana-Champaign
Laks V. S. Lakshmanan  IIT, Bombay
Divesh Srivastava  AT&T Labs-Research
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 43,   Citation Count: 7
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/304182.304186
What is a DOI?

ABSTRACT

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes. Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
 
5
C. Faloutsos and Y. Rong. Spatial access methods using fractals: Algorithms and performance evaluation. Technical Report Tech. Report UMIACS- TR-89-31, University of Maryland, 1990.
6
 
7
 
8
 
9
10
11
12
 
13
 
14
H. V. Jagadish, L. V. S. Lakshmanan, and D. Srivastava. Snakes and sandwiches: Optimal clustering strategies for a data warehouse. Technical report, ATL:T Labs Research and Concordia University, 1999. (in preparation).
 
15
16
17
18
 
19
 
20
 
21
 
22
Y. Tanaka. Adaptive segmentation schemes for large relational database machines. In Proc. 3rd International Conference on Database Machines, Munich, FRG, 1983.

CITED BY  7

Collaborative Colleagues:
H. V. Jagadish: colleagues
Laks V. S. Lakshmanan: colleagues
Divesh Srivastava: colleagues