ACM Home Page
Please provide us with feedback. Feedback
Statistical schema matching across web query interfaces
Full text PdfPdf (294 KB)
Source International Conference on Management of Data archive
Proceedings of the 2003 ACM SIGMOD international conference on Management of data table of contents
San Diego, California
SESSION: Meta data management table of contents
Pages: 217 - 228  
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Bin He  University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang  University of Illinois at Urbana-Champaign
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 96,   Citation Count: 52
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/872757.872784
What is a DOI?

ABSTRACT

Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
M. K. Bergman. The deep web: Surfacing hidden value. Technical report, BrightPlanet LLC, Dec. 2000.
 
3
P. Bickel and K. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall, 2001.
 
4
K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Report UIUCDCS-R-2003-2321, Dept. of Computer Science, UIUC, Feb. 2003.
5
 
6
 
7
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39:1--38, 1977.
8
 
9
A. Halevy, O. Etzioni, A. Doan, Z. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the structure chasm. Conf. on Innovative Database Research, 2003.
 
10
B. He, T. Tao, C. Li, and K. C.-C. Chang. Clustering structured web sources: A schema-based, model-differentiation approach. Report UIUCDCS-R-2003-2322, Dept. of Computer Science, UIUC, Feb. 2003.
 
11
 
12
C. J. Lloyd. Statistical Analysis of Categorical Data. Wiley, 1999.
 
13
 
14
15
 
16
 
17
L. Seligman, A. Rosenthal, P. Lehner, and A. Smith. Data integration: Where does the time go? Bulletin of the Tech. Committee on Data Engr., 25(3), 2002.

CITED BY  52

Collaborative Colleagues:
Bin He: colleagues
Kevin Chen-Chuan Chang: colleagues