ACM Home Page
Please provide us with feedback. Feedback
Information-theoretic tools for mining database structure from large data sets
Full text PdfPdf (217 KB)
Source International Conference on Management of Data archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data table of contents
Paris, France
SESSION: Research sessions: schema discovery table of contents
Pages: 731 - 742  
Year of Publication: 2004
ISBN:1-58113-859-8
Authors
Periklis Andritsos  University of Toronto
Renée J. Miller  University of Toronto
Panayiotis Tsaparas  Univ. of Rome, "La Sapienza"
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 2,   Downloads (12 Months): 87,   Citation Count: 10
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1007568.1007650
What is a DOI?

ABSTRACT

Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In VLDB, pages 586--597, Hong Kong, China, 2002.
 
5
P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In EDBT, pages 123--146, Heraklion, Greece, 2004.
6
 
7
8
 
9
10
 
11
 
12
13
 
14
J. A. Hoffer and D. G. Severance. The Use of Cluster Analysis in Physical Data Base Design. In VLDB, pages 69--86, Framingham, MA, USA, 1975.
 
15
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
16
17
18
 
19
L. Popa, Y. Velegrakis, M. Hernandez, R. J. Miller, and R. Fagin. Translating web data. In VLDB, pages 598--609, Hong Kong, China, Aug. 2002.
 
20
 
21
22
 
23
S. Sarawagi-(Editor). Special Issue on Data Cleaning. Bulletin of the Technical Committee on Data Engineering, Volume 23(4), December 2000.
 
24
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174--185, Washington, DC, USA, 1993.
 
25
I. Savnik and P. A. Flach. Disocvery of Mutlivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195--211, 2000.
 
26
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS-12, pages 617--623, Breckenridge, CO, 1999.
 
27
N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urban-Champaign, IL, 1999.
 
28
29

CITED BY  10
Collaborative Colleagues:
Periklis Andritsos: colleagues
Renée J. Miller: colleagues
Panayiotis Tsaparas: colleagues