ABSTRACT
Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
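To make the vector-space similarity mentioned above concrete, the following is a minimal sketch of scoring name constants with TF-IDF weighted cosine similarity. The abstract does not specify the weighting scheme, tokenizer, or stemming that WHIRL uses, so the log-scaled TF-IDF formula, the helper names (`tokenize`, `tfidf_vectors`, `cosine`), and the sample names below are all illustrative assumptions, not the paper's implementation.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per name.

    Assumed weighting: (1 + log tf) * log(N / df), a common IR variant.
    """
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: number of names in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        vec = {t: (1 + math.log(tf)) * math.log(n_docs / df[t])
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize to unit length so the dot product equals the cosine.
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical name constants drawn from two imagined sources.
names = [
    "AT&T Bell Labs",
    "AT&T Labs Research",
    "Bell Laboratories",
    "Lucent Technologies",
]
vecs = tfidf_vectors(names)

for i in range(1, len(names)):
    print(f"{names[0]!r} vs {names[i]!r}: {cosine(vecs[0], vecs[i]):.3f}")
```

Under this scheme, names that share rare tokens score higher than names that share only common ones, and names with no tokens in common score zero. A similarity join would then rank candidate pairs by this score instead of requiring exact equality after normalization.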