Automatic complex schema matching across Web query interfaces: A correlation mining approach
Source: ACM Transactions on Database Systems (TODS)
Volume 31, Issue 1 (March 2006)
Pages: 346-395
Year of Publication: 2006
ISSN:0362-5915
Authors
Bin He  University of Illinois at Urbana-Champaign, Urbana, IL
Kevin Chen-Chuan Chang  University of Illinois at Urbana-Champaign, Urbana, IL
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1132863.1132872

ABSTRACT

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. Although complex matchings are common, most existing techniques focus on simple 1:1 matchings, because the search space of complex matchings is far more complex. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web,” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces, and the results show good accuracy for discovering complex matchings. Further, to automate the entire matching process, we incorporate automatic techniques for interface extraction. Executing the DCM framework on automatically extracted interfaces, we find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such “noisy” schemas, we integrate it with a novel “ensemble” approach, which creates an ensemble of DCM matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting.
As a principled basis, we provide analytic justification of the robustness of the ensemble approach. Empirically, our experiments show that the “ensemblization” indeed significantly boosts the matching accuracy over automatically extracted, and thus noisy, schema data. By employing the DCM framework with the ensemble approach, we thus complete an automatic process of matching Web query interfaces.
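
The two ideas in the abstract, co-occurrence as a signal for grouping versus synonym attributes and voting over randomized trials, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the DCM framework itself: the function names are hypothetical, the Jaccard-style ratio is a stand-in for whatever correlation measure the article actually uses, the thresholds are arbitrary, and the vote is taken over labeled attribute pairs instead of over ranked matchings.

```python
import random
from collections import Counter
from itertools import combinations


def cooccurrence_scores(schemas):
    """Return (scores, present): a Jaccard-style co-occurrence score for
    every attribute pair, and how many interfaces contain each attribute.
    The Jaccard ratio is an illustrative stand-in, not the article's measure."""
    present = Counter()   # interfaces containing each attribute
    together = Counter()  # interfaces containing both attributes of a pair
    for schema in schemas:
        attrs = sorted(set(schema))
        present.update(attrs)
        together.update(combinations(attrs, 2))
    scores = {}
    for a, b in combinations(sorted(present), 2):
        both = together[(a, b)]
        scores[(a, b)] = both / (present[a] + present[b] - both)
    return scores, present


def classify_pairs(schemas, group_min=0.7, min_support=2):
    """Label pairs that usually co-occur as 'group' candidates and pairs of
    reasonably frequent attributes that never co-occur as 'synonym'
    candidates. Both thresholds are arbitrary illustrative values."""
    scores, present = cooccurrence_scores(schemas)
    labels = set()
    for (a, b), score in scores.items():
        if score >= group_min:
            labels.add(("group", a, b))
        elif score == 0 and min(present[a], present[b]) >= min_support:
            labels.add(("synonym", a, b))
    return labels


def ensemble_vote(schemas, trials=25, sample_size=None, seed=0):
    """Run classify_pairs on random subsets of the schemas and keep only
    the labels found in more than half of the trials (simplified voting)."""
    rng = random.Random(seed)
    if sample_size is None:
        sample_size = max(1, int(0.8 * len(schemas)))
    votes = Counter()
    for _ in range(trials):
        sample = rng.sample(schemas, sample_size)
        votes.update(classify_pairs(sample))
    return {label for label, count in votes.items() if count > trials / 2}
```

On a toy Books-style input in which three interfaces list `first name` and `last name` together and three others list `author` instead, `classify_pairs` labels (`first name`, `last name`) as a grouping candidate and pairs `author` with each name field as a synonym candidate. A single noisy subset can produce spurious labels, which is the situation the majority vote in `ensemble_vote` is meant to dampen.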


