|
ABSTRACT
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., [author] corresponds to [first name, last name] in the Books domain). We observe that the co-occurrences patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., [first name, last name]) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm cares both positive and negative correlations, especially the subtlety of negative correlations, due to its special importance in schema matching. This leads to the introduction of a new correlation measure, $H$-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
 |
2
|
|
| |
3
|
M. K. Bergman. The deep web: Surfacing hidden value. Technical report, BrightPlanet LLC, Dec. 2000.
|
 |
4
|
Sergey Brin , Rajeev Motwani , Craig Silverstein, Beyond market baskets: generalizing association rules to correlations, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.265-276, May 11-15, 1997, Tucson, Arizona, United States
|
| |
5
|
H. D. Brunk. An Introduction to Mathematical Statistics. New York, Blaisdell Pub. Co., 1965.
|
| |
6
|
K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Technical Report UIUCDCS-R-2003-2321, Department of Computer Science, UIUC, Feb. 2003.
|
| |
7
|
K. C.-C. Chang, B. He, C. Li, and Z. Zhang. The UIUC web integration repository. Computer Science Department, University of Illinois at Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository, 2003.
|
 |
8
|
|
 |
9
|
AnHai Doan , Pedro Domingos , Alon Y. Halevy, Reconciling schemas of disparate data sources: a machine-learning approach, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.509-520, May 21-24, 2001, Santa Barbara, California, United States
|
 |
10
|
|
| |
11
|
B. He, K. C.-C. Chang, and J. Han. Automatic complex schema matching across web query interfaces: A correlation mining approach. Technical Report UIUCDCS-R-2003-2388, Dept. of Computer Science, UIUC, July 2003.
|
| |
12
|
B. He, T. Tao, and K. C.-C. Chang. Clustering structured web sources: A schema-based, model-differentiation approach. In EDBT'04 ClustWeb Workshop, 2004.
|
 |
13
|
Panagiotis G. Ipeirotis , Luis Gravano , Mehran Sahami, Probe, count, and classify: categorizing hidden web databases, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.67-78, May 21-24, 2001, Santa Barbara, California, United States
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
M. Porter. The porter stemming algorithm. Accessible at http://www.tartarus.org/~martin/PorterStemmer.
|
| |
18
|
|
| |
19
|
L. Seligman, A. Rosenthal, P. Lehner, and A. Smith. Data integration: Where does the time go? Bulletin of the Tech. Committee on Data Engr., 25(3), 2002.
|
 |
20
|
|
 |
21
|
|
|