Mining database structure; or, how to build a data quality browser
Source: International Conference on Management of Data
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Madison, Wisconsin
SESSION: Research sessions: potpourri
Pages: 240 - 251  
Year of Publication: 2002
ISBN: 1-58113-497-5
Authors
Tamraparni Dasu  AT&T Labs-Research
Theodore Johnson  AT&T Labs-Research
S. Muthukrishnan  AT&T Labs-Research
Vladislav Shkapenyuk  AT&T Labs-Research
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/564691.564719

ABSTRACT

Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the ability of the analyst to make data-driven discoveries, most of the time spent performing an analysis goes to identifying, gathering, cleaning, and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but they assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.

We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used, for example, to prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.
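
To make the idea of "identifying which fields have similar values" concrete, here is a minimal sketch that scores every pair of fields by the set resemblance (Jaccard coefficient) of their values and reports high-scoring pairs as candidate join paths. The abstract does not say how Bellman computes similarity, so this is an illustration under assumptions and not the paper's method: the function names, the threshold, and the toy `table.column` data are all hypothetical, and exact set comparison like this would likely need to be replaced by sampling or sketching at the scale the abstract describes.

```python
from itertools import combinations


def resemblance(a, b):
    """Jaccard resemblance |A ∩ B| / |A ∪ B| of two collections of field values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_joins(fields, threshold=0.5):
    """Return (field_1, field_2, score) for each pair scoring at or above threshold."""
    pairs = []
    # Compare every unordered pair of fields once, in name order.
    for (name_a, vals_a), (name_b, vals_b) in combinations(sorted(fields.items()), 2):
        score = resemblance(vals_a, vals_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Highest resemblance first.
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical toy fields, keyed as "table.column" (not from the paper).
fields = {
    "orders.cust_id": [1, 2, 2, 3, 4],
    "customers.id": [1, 2, 3, 4, 5],
    "customers.zip": [7001, 7002, 7003],
}

for left, right, score in candidate_joins(fields):
    print(f"{left} ~ {right}: resemblance {score:.2f}")
```

In this toy data, only `customers.id` and `orders.cust_id` share values, so they are the only pair reported. A symmetric score like this says nothing about join direction or size, which the abstract lists as separate estimation problems.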


