Do not crawl in the DUST: Different URLs with similar text
Source
ACM Transactions on the Web (TWEB)
Volume 3, Issue 1 (January 2009)
Article No. 3
Year of Publication: 2009
ISSN: 1559-1131
Authors
Ziv Bar-Yossef  Technion Israel Institute of Technology, Haifa, Israel
Idit Keidar  Technion Israel Institute of Technology, Haifa, Israel
Uri Schonfeld  University of California, Los Angeles, CA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1462148.1462151

ABSTRACT

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
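
To make the idea of a URL-transforming rule concrete, the following is a minimal sketch of how a crawler might apply such rules once they are known. It is not the DustBuster algorithm, which discovers rules from logs. The abstract does not state what form the rules take, so the sketch assumes simple substring substitutions; the rule list, the example.com URLs, and the function names canonicalize and group_by_canonical are all hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical rules, written as (old substring, new substring).
# These are illustrative assumptions, not rules reported by the authors.
Rule = Tuple[str, str]
RULES: List[Rule] = [
    ("/index.html", "/"),
    ("http://www.example.com", "http://example.com"),
]


def canonicalize(url: str, rules: Sequence[Rule] = RULES) -> str:
    """Apply each substitution rule in order and return the rewritten URL."""
    for old, new in rules:
        url = url.replace(old, new)
    return url


def group_by_canonical(
    urls: Iterable[str], rules: Sequence[Rule] = RULES
) -> Dict[str, List[str]]:
    """Group URLs that map to the same canonical form under the rules."""
    groups: Dict[str, List[str]] = {}
    for url in urls:
        groups.setdefault(canonicalize(url, rules), []).append(url)
    return groups


if __name__ == "__main__":
    log_urls = [
        "http://www.example.com/news/index.html",
        "http://example.com/news/",
        "http://example.com/about",
    ]
    for canonical, members in group_by_canonical(log_urls).items():
        print(canonical, members)
```

Under these assumptions, a crawler would fetch one URL per group instead of every member. As the abstract notes, a rule only indicates that content is likely to be similar, so a real system would still confirm each rule by fetching a small sample of pages.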


