Do not crawl in the DUST: different URLs with similar text
Source
International World Wide Web Conference
Proceedings of the 16th international conference on World Wide Web
Banff, Alberta, Canada
SESSION: Mining textual data
Pages: 111 - 120  
Year of Publication: 2007
ISBN: 978-1-59593-654-7
Authors
Ziv Bar-Yossef  Technion and Google, Haifa, Israel
Idit Keidar  Technion, Haifa, Israel
Uri Schonfeld  UCLA, Los Angeles, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1242572.1242588

ABSTRACT

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
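
To illustrate how a crawler might use such rules once they are known, here is a minimal Python sketch. It is not DustBuster itself: the abstract does not specify the rule format or the mining procedure, so the substring-substitution form, the two example rules, and the helper names `canonicalize` and `group_by_canonical` are all assumptions made for illustration. Rule discovery from logs and verification by sampling are out of scope here.

```python
from typing import Dict, List, Tuple

# A rule rewrites the first occurrence of `alpha` in a URL to `beta`.
# ASSUMPTION: this substring-substitution form and the rules below are
# hypothetical examples, not rules reported in the paper.
Rule = Tuple[str, str]

RULES: List[Rule] = [
    ("/index.html", "/"),        # default-document alias
    ("http://www.", "http://"),  # host alias
]


def canonicalize(url: str, rules: List[Rule], max_passes: int = 10) -> str:
    """Apply every rule once per pass until the URL stops changing."""
    for _ in range(max_passes):
        new_url = url
        for alpha, beta in rules:
            new_url = new_url.replace(alpha, beta, 1)
        if new_url == url:
            break
        url = new_url
    return url


def group_by_canonical(urls: List[str], rules: List[Rule]) -> Dict[str, List[str]]:
    """Group URLs that map to the same canonical form under the rules."""
    groups: Dict[str, List[str]] = {}
    for u in urls:
        groups.setdefault(canonicalize(u, rules), []).append(u)
    return groups


if __name__ == "__main__":
    crawl_frontier = [
        "http://www.example.com/news/index.html",
        "http://example.com/news/",
    ]
    for canonical, aliases in group_by_canonical(crawl_frontier, RULES).items():
        print(canonical, "<-", aliases)
```

Under these assumed rules, both URLs in `crawl_frontier` collapse to one canonical form, so a crawler would need to fetch only one of them. The `max_passes` bound is a guard against rule sets that rewrite each other in a cycle.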


