Do not crawl in the DUST: different URLs with similar text
Source
International World Wide Web Conference
Proceedings of the 16th international conference on World Wide Web
Banff, Alberta, Canada
SESSION: Mining textual data
Pages: 111-120
Year of Publication: 2007
ISBN: 978-1-59593-654-7
ABSTRACT
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, because web server software often uses aliases and redirections and dynamically generates the same page from several different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
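The idea of rules that transform one URL into another with similar content, checked by fetching a small sample of pages, can be illustrated with a minimal Python sketch. This is not the DustBuster algorithm itself: the abstract does not specify the rule format or the verification procedure, so the substring-substitution representation of a rule, the function names (`apply_rule`, `canonicalize`, `support_by_sampling`), and the `fetch` and `similar` callables are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed rule format: (alpha, beta) means "replacing substring alpha with
# beta in a URL is likely to yield a URL with similar content".
Rule = Tuple[str, str]


def apply_rule(url: str, rule: Rule) -> str:
    """Replace the first occurrence of alpha with beta.

    Returns the URL unchanged when alpha does not occur in it.
    """
    alpha, beta = rule
    return url.replace(alpha, beta, 1)


def canonicalize(url: str, rules: Iterable[Rule], max_passes: int = 10) -> str:
    """Apply all rules repeatedly until the URL stops changing.

    max_passes bounds the loop in case a rule set never converges.
    """
    rule_list: List[Rule] = list(rules)
    for _ in range(max_passes):
        new_url = url
        for rule in rule_list:
            new_url = apply_rule(new_url, rule)
        if new_url == url:
            break
        url = new_url
    return url


def support_by_sampling(
    rule: Rule,
    sample_urls: Iterable[str],
    fetch: Callable[[str], str],
    similar: Callable[[str, str], bool],
) -> float:
    """Estimate how often a rule preserves content on a sample of URLs.

    Only URLs that contain alpha are considered. `fetch` and `similar`
    are hypothetical callables supplied by the caller.
    """
    applicable = [u for u in sample_urls if rule[0] in u]
    if not applicable:
        return 0.0
    hits = sum(
        1 for u in applicable if similar(fetch(u), fetch(apply_rule(u, rule)))
    )
    return hits / len(applicable)
```

A crawler could, under these assumptions, keep only rules whose sampled support exceeds a chosen threshold and then call `canonicalize` on each URL before deciding whether to fetch it, so that aliases such as `http://www.example.com/index.html` and `http://example.com/` map to one entry.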