|
ABSTRACT
Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
M. K. Bergman. The deep web: Surfacing hidden value. Technical report, BrightPlanet LLC, Dec. 2000.
|
| |
3
|
P. Bickel and K. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall, 2001.
|
| |
4
|
K. C.-C. Chang, B. He, C. Li, and Z. Zhang. Structured databases on the web: Observations and implications. Report UIUCDCS-R-2003-2321, Dept. of Computer Science, UIUC, Feb. 2003.
|
 |
5
|
|
| |
6
|
|
| |
7
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39:1--38, 1977.
|
 |
8
|
AnHai Doan , Pedro Domingos , Alon Y. Halevy, Reconciling schemas of disparate data sources: a machine-learning approach, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.509-520, May 21-24, 2001, Santa Barbara, California, United States
|
| |
9
|
A. Halevy, O. Etzioni, A. Doan, Z. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the structure chasm. Conf. on Innovative Database Research, 2003.
|
| |
10
|
B. He, T. Tao, C. Li, and K. C.-C. Chang. Clustering structured web sources: A schema-based, model-differentiation approach. Report UIUCDCS-R-2003-2322, Dept. of Computer Science, UIUC, Feb. 2003.
|
| |
11
|
|
| |
12
|
C. J. Lloyd. Statistical Analysis of Categorical Data. Wiley, 1999.
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
|
| |
17
|
L. Seligman, A. Rosenthal, P. Lehner, and A. Smith. Data integration: Where does the time go? Bulletin of the Tech. Committee on Data Engr., 25(3), 2002.
|
CITED BY 52
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Robin Dhamankar , Yoonkyong Lee , AnHai Doan , Alon Halevy , Pedro Domingos, iMAP: discovering complex semantic matches between database schemas, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Robert McCann , Bedoor AlShebli , Quoc Le , Hoa Nguyen , Long Vu , AnHai Doan, Mapping maintenance for data integration systems, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Alon Halevy , Michael Franklin , David Maier, Principles of dataspace systems, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.1-9, June 26-28, 2006, Chicago, IL, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jiying Wang , Ji-Rong Wen , Fred Lochovsky , Wei-Ying Ma, Instance-based schema matching for web databases by domain-specific query probing, Proceedings of the Thirtieth international conference on Very large data bases, p.408-419, August 31-September 03, 2004, Toronto, Canada
|
|
|
Hai He , Weiyi Meng , Clement Yu , Zonghuan Wu, Wise-integrator: an automatic integrator of web search interfaces for E-commerce, Proceedings of the 29th international conference on Very large data bases, p.357-368, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lorenzo Blanco , Valter Crescenzi , Paolo Merialdo , Paolo Papotti, Supporting the automatic construction of entity aware search engines, Proceeding of the 10th ACM workshop on Web information and data management, October 30-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
Guija Choe , Young-Kwang Nam , Joseph Goguen , Guilian Wang, Query generation for retrieving data from distributed semistructured documents using a metadata interface, Computer Languages, Systems and Structures, v.35 n.4, p.422-434, December, 2009
|
|