Integration of heterogeneous databases without common domains using queries based on textual similarity
Source: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Seattle, Washington, United States
Pages: 201 - 212  
Year of Publication: 1998
ISBN:0-89791-995-5
Author
William W. Cohen  AT&T Labs-Research, 180 Park Avenue, Florham Park NJ
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/276304.276323

ABSTRACT

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
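
The idea of scoring name constants by textual similarity instead of exact equality can be illustrated with a small sketch. The code below is not WHIRL and is not taken from the paper: it is a minimal vector-space similarity join, assuming a simple word tokenizer and one common TF-IDF weighting, log(tf + 1) * log(N / df). The function names (tokenize, tfidf_vectors, cosine, similarity_join) and the sample names are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into word tokens (an assumed tokenizer)."""
    return re.findall(r"\w+", name.lower())


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict) per name.

    The weighting log(tf + 1) * log(N / df) is an assumption of this sketch.
    """
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: in how many names each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        vec = {t: math.log(tf + 1) * math.log(n_docs / df[t])
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # A name made only of terms that occur everywhere gets an empty vector.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    """Naively score every (left, right) pair and return the k best matches."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), x, y)
             for x, a in zip(left, lv)
             for y, b in zip(right, rv)]
    pairs = [p for p in pairs if p[0] > 0]
    return sorted(pairs, key=lambda p: p[0], reverse=True)[:k]


if __name__ == "__main__":
    left = ["William W. Cohen", "Dan Suciu"]
    right = ["Cohen, William", "Suciu, D.", "Craig Knoblock"]
    for score, a, b in similarity_join(left, right):
        print(f"{score:.3f}  {a!r} ~ {b!r}")
```

This sketch scores all pairs exhaustively, which corresponds to the naive style of inference the abstract contrasts WHIRL with; it ranks candidate matches by score and makes no yes/no decision about identity.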


