ABSTRACT
The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a “soft” database management system which supports “similarity joins,” based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
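To make the idea of a ranked "similarity join" concrete, here is a minimal Python sketch. It is not WHIRL's implementation: it assumes a TF-IDF cosine similarity as the text metric, uses an illustrative weighting formula, and enumerates every pair, which is the kind of naive method the abstract says WHIRL improves on. The function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document.

    The weighting (1 + log tf) * log(1 + N/df) is an assumption chosen
    for illustration; other TF-IDF variants would work similarly.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def similarity_join(left, right, top_k=3):
    """Naive similarity join: score every pair, return the top_k by score."""
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    scored = []
    for i, a in enumerate(left_vecs):
        for j, b in enumerate(right_vecs):
            # Cosine similarity of unit vectors is just their dot product.
            score = sum(w * b.get(t, 0.0) for t, w in a.items())
            if score > 0:
                scored.append((score, left[i], right[j]))
    scored.sort(key=lambda item: -item[0])
    return scored[:top_k]


# Hypothetical informal names drawn from two sources with no shared keys.
publishers_a = ["Kluwer Academic Publishers", "AAAI Press"]
publishers_b = ["Kluwer Academic", "The AAAI Press, Menlo Park",
                "IEEE Computer Society Press"]

for score, a, b in similarity_join(publishers_a, publishers_b):
    print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Each answer is a pair of names plus a score, and the list is ordered by that score, which mirrors the ranked answer substitutions described above. The nested loop costs time proportional to the product of the two relation sizes, so it only illustrates the semantics, not an efficient evaluation strategy.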
CITED BY 32
Carina F. Dorneles , Carlos A. Heuser , Andrei E. N. Lima , Altigran Soares da Silva , Edleno Silva de Moura, Measuring similarity between collection of values, Proceedings of the 6th annual ACM international workshop on Web information and data management, November 12-13, 2004, Washington DC, USA
Natalie Glance , Matthew Hurst , Kamal Nigam , Matthew Siegler , Robert Stockton , Takashi Tomokiyo, Deriving marketing intelligence from online discussion, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
Moisés G. de Carvalho , Marcos André Gonçalves , Alberto H. F. Laender , Altigran S. da Silva, Learning to deduplicate, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
Moisés G. Carvalho , Alberto H. F. Laender , Marcos André Gonçalves , Altigran S. da Silva, Replica identification using genetic programming, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
Carlos A. Heuser , Francisco N. A. Krieser , Viviane Moreira Orengo, SimEval: a tool for evaluating the quality of similarity functions, Tutorials, posters, panels and industrial contributions at the 26th international conference on Conceptual modeling, November 01-01, 2007, Auckland, New Zealand
Omar Benjelloun , Hector Garcia-Molina , David Menestrina , Qi Su , Steven Euijong Whang , Jennifer Widom, Swoosh: a generic approach to entity resolution, The VLDB Journal — The International Journal on Very Large Data Bases, v.18 n.1, p.255-276, January 2009
REVIEW
"Richard S. Marcus : Reviewer"
A core problem in data integration is that databases from
heterogeneous sources often name objects differently. Using the SMART
vector-space textual retrieval system techniques of Gerard Salton, Cohen
has devised an elaborate methodology for o