ABSTRACT
Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the analyst's ability to make data-driven discoveries, most of the time in an analysis is spent identifying, gathering, cleaning, and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but they assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys.

We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used, for example, to prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.
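To make the idea of "identifying which fields have similar values" concrete, the following is a minimal sketch of one common way to compare two fields cheaply: summarize each field's distinct values with a min-hash signature and estimate the Jaccard resemblance from the signatures. The abstract does not state which summaries Bellman uses, so the technique choice, the function names, and the parameter `num_hashes` are assumptions made for illustration only.

```python
import hashlib


def min_hash_signature(values, num_hashes=128):
    """Summarize the distinct values of a field as a min-hash signature.

    Hypothetical helper for illustration; not taken from the paper.
    """
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("field has no values to summarize")
    signature = []
    for i in range(num_hashes):
        # A different salt per position simulates an independent hash function.
        salt = str(i).encode()
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(v.encode(), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for v in distinct
            )
        )
    return signature


def estimate_resemblance(sig_a, sig_b):
    """Estimate Jaccard resemblance as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_resemblance(values_a, values_b):
    """Exact Jaccard resemblance of two fields' distinct values, for comparison."""
    set_a = {str(v) for v in values_a}
    set_b = {str(v) for v in values_b}
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up example fields: a key column and a column that partly overlaps it.
customer_ids = range(1000)
order_customer_ids = range(500, 1500)

sig_customers = min_hash_signature(customer_ids)
sig_orders = min_hash_signature(order_customer_ids)

print("estimated:", estimate_resemblance(sig_customers, sig_orders))
print("exact:    ", exact_resemblance(customer_ids, order_customer_ids))
```

The signatures are small and fixed in size, so every pair of fields can be compared without rescanning the tables. Pairs with a high estimated resemblance are candidates for join paths that an analyst could then inspect more closely.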