Web data integration using approximate string join
Source International World Wide Web Conference archive
Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters
New York, NY, USA
POSTER SESSION: Posters
Pages: 364-365
Year of Publication: 2004
ISBN:1-58113-912-8
Authors
Yingping Huang  University of Notre Dame, Notre Dame, IN
Gregory Madey  University of Notre Dame, Notre Dame, IN
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1013367.1013477

ABSTRACT

Web data integration is an important preprocessing step for web mining. It is highly likely that several records on the web whose textual representations differ may represent the same real world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high dimensional grid space, then pairs of points whose distances are 1 are identified. We implement it using Oracle SQL and PL/SQL. Finally, we evaluate this method using real data sets. Experimental results suggest that our method is both accurate and efficient.
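
To make the notion of an approximate string join concrete, the following is a minimal Python sketch of the naive baseline: compare every pair of strings and keep those whose edit distance is below a threshold. This is not the mapping method proposed in the paper, which maps strings to points in a grid space and was implemented in Oracle SQL and PL/SQL. The choice of Levenshtein edit distance as the distance measure, the function names `edit_distance` and `approximate_self_join`, and the sample records are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_self_join(strings, threshold):
    """Return index pairs (i, j), i < j, whose edit distance is < threshold.

    Hypothetical helper: a quadratic nested-loop join, shown only to
    illustrate what an approximate string join computes.
    """
    return [
        (i, j)
        for (i, s), (j, t) in combinations(enumerate(strings), 2)
        if edit_distance(s, t) < threshold
    ]


# Illustrative records (made up for this sketch, not from the paper's data).
records = ["Notre Dame", "Notre Dam", "Notre-Dame", "New York"]
pairs = approximate_self_join(records, threshold=2)
```

The nested-loop join performs a quadratic number of distance computations, which is the cost that filtering techniques for approximate string join aim to avoid.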



