ABSTRACT
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.