On schema matching with opaque column names and data values
Source: International Conference on Management of Data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
San Diego, California
SESSION: Meta data management
Pages: 205-216
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Jaewoo Kang  University of Wisconsin-Madison, Madison, WI
Jeffrey F. Naughton  University of Wisconsin-Madison, Madison, WI
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/872757.872783

ABSTRACT

Most previous solutions to the schema matching problem rely in some fashion on identifying "similar" column names in the schemas to be matched, or on recognizing common domains in the data stored in the schemas. While each of these approaches is valuable in many cases, neither is infallible, and there exist instances of the schema matching problem to which they do not even apply. Such problem instances typically arise when the column names in the schemas and the data in the columns are "opaque" or very difficult to interpret. In this paper we propose a two-step technique that works even in the presence of opaque column names and data values. In the first step, we measure the pair-wise attribute correlations in the tables to be matched and construct a dependency graph using mutual information as a measure of the dependency between attributes. In the second step, we find matching node pairs in the dependency graphs by running a graph matching algorithm. We validate our approach with an experimental study, the results of which suggest that such an approach can be a useful addition to a set of (semi) automatic schema matching techniques.
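
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the distance metric or the graph matching algorithm, so the squared-difference cost and the exhaustive permutation search below are assumptions chosen for brevity, and all function names (mutual_information, dependency_graph, match_graphs) are hypothetical. The sketch also assumes both tables have the same number of columns and that the search space is small enough to enumerate.

```python
import itertools
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def dependency_graph(columns):
    """Step 1: weighted adjacency matrix of pair-wise mutual information.

    The diagonal entry [i][i] is MI(X; X), which equals the entropy of column i.
    """
    k = len(columns)
    return [[mutual_information(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]


def match_graphs(g1, g2):
    """Step 2 (assumed variant): exhaustive search over node permutations.

    Returns (perm, cost) where node i of g1 is matched to node perm[i] of g2
    and cost is the summed squared difference of the aligned edge weights.
    """
    k = len(g1)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(k)):
        cost = sum((g1[i][j] - g2[perm[i]][perm[j]]) ** 2
                   for i in range(k) for j in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Illustrative data only: table B holds the same rows as table A, but with
# every value recoded to an opaque token and the columns reordered.
table_a = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 2, 3, 0, 1, 2, 3],
]
recode = lambda col, tag: [f"{tag}{v}" for v in col]
table_b = [
    recode(table_a[2], "q"),
    recode(table_a[0], "z"),
    recode(table_a[1], "k"),
]

perm, cost = match_graphs(dependency_graph(table_a), dependency_graph(table_b))
print(perm, cost)  # perm is (1, 2, 0): A0 -> B1, A1 -> B2, A2 -> B0
```

Because mutual information depends only on the joint distribution of values and not on the values themselves, the recoded columns of table B produce the same dependency graph as table A up to a reordering of nodes, which is what the matching step recovers. Exhaustive search grows factorially with the number of columns, so a practical matcher would need a heuristic search instead.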


