|
ABSTRACT
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
 |
3
|
Sanjay Agrawal , Surajit Chaudhuri , Vivek Narasayya, Materialized view and index selection tool for Microsoft SQL server 2000, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.608, May 21-24, 2001, Santa Barbara, California, United States
|
| |
4
|
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In VLDB, pages 586--597, Hong Kong, China, 2002.
|
| |
5
|
P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In EDBT, pages 123--146, Heraklion, Greece, 2004.
|
 |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
|
 |
10
|
Tamraparni Dasu , Theodore Johnson , S. Muthukrishnan , Vladislav Shkapenyuk, Mining database structure; or, how to build a data quality browser, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
[doi> 10.1145/564691.564719]
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
J. A. Hoffer and D. G. Severance. The Use of Cluster Analysis in Physical Data Base Design. In VLDB, pages 69--86, Framingham, MA, USA, 1975.
|
| |
15
|
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
|
 |
16
|
|
 |
17
|
|
 |
18
|
|
| |
19
|
L. Popa, Y. Velegrakis, M. Hernandez, R. J. Miller, and R. Fagin. Translating web data. In VLDB, pages 598--609, Hong Kong, China, Aug. 2002.
|
| |
20
|
|
| |
21
|
|
 |
22
|
|
| |
23
|
S. Sarawagi-(Editor). Special Issue on Data Cleaning. Bulletin of the Technical Committee on Data Engineering, Volume 23(4), December 2000.
|
| |
24
|
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174--185, Washington, DC, USA, 1993.
|
| |
25
|
I. Savnik and P. A. Flach. Disocvery of Mutlivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195--211, 2000.
|
| |
26
|
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS-12, pages 617--623, Breckenridge, CO, 1999.
|
| |
27
|
N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urban-Champaign, IL, 1999.
|
| |
28
|
|
 |
29
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
CITED BY 10
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Danushka Bollegala , Yutaka Matsuo , Mitsuru Ishizuka, Disambiguating Personal Names on the Web using Automatically Extracted Key Phrases, Proceeding of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence August 29 -- September 1, 2006, Riva del Garda, Italy, p.553-557, May 22, 2006
|
|