Text joins in an RDBMS for web data integration
Source: International World Wide Web Conference archive
Proceedings of the 12th international conference on World Wide Web
Budapest, Hungary
SESSION: Information retrieval 2
Pages: 90-101
Year of Publication: 2003
ISBN: 1-58113-680-3
Authors
Luis Gravano  Columbia University
Panagiotis G. Ipeirotis  Columbia University
Nick Koudas  AT&T Labs--Research
Divesh Srivastava  AT&T Labs--Research
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775152.775166

ABSTRACT

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
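
The text join described in the abstract can be illustrated with a minimal sketch. It pairs strings from two relations whose cosine similarity reaches a given threshold. Everything specific here is an assumption made for illustration and not taken from the paper: whitespace tokenization, the particular tf-idf weighting formula, computing weights separately per relation, and the names `tokenize`, `tfidf_vectors`, and `text_join`. The sketch computes the exact join with a nested loop in plain Python; it does not implement the sampling-based approximation or the SQL-based execution that the paper proposes.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokens; other token choices are possible.
    return s.lower().split()


def tfidf_vectors(strings):
    """Return one unit-length tf-idf weight vector (as a dict) per string."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings in which each token appears.
    df = Counter(tok for d in docs for tok in d)
    vectors = []
    for d in docs:
        # Assumed weighting: tf * log(1 + n / df); illustrative only.
        w = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()} if norm else {})
    return vectors


def text_join(r1, r2, threshold):
    """Exact join: all pairs (a, b, sim) with cosine similarity >= threshold."""
    v1, v2 = tfidf_vectors(r1), tfidf_vectors(r2)
    result = []
    for i, a in enumerate(v1):
        for j, b in enumerate(v2):
            # Vectors are unit length, so the dot product is the cosine.
            sim = sum(w * b[t] for t, w in a.items() if t in b)
            if sim >= threshold:
                result.append((r1[i], r2[j], sim))
    return result


if __name__ == "__main__":
    left = ["AT&T Labs Research", "Columbia University"]
    right = ["AT&T Labs", "Columbia Univ", "IBM Research"]
    for a, b, sim in text_join(left, right, 0.3):
        print(f"{a!r} ~ {b!r}: {sim:.3f}")
```

The nested loop compares every pair of strings, which is the cost that motivates an approximate strategy for large relations.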


