Evaluation of crawling policies for a web-repository crawler
Source: Conference on Hypertext and Hypermedia
Proceedings of the seventeenth conference on Hypertext and hypermedia
Odense, Denmark
SESSION: Web engineering
Pages: 157 - 168  
Year of Publication: 2006
ISBN:1-59593-417-0
Authors
Frank McCown, Old Dominion University, Norfolk, Virginia
Michael L. Nelson, Old Dominion University, Norfolk, Virginia
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 145,   Citation Count: 9
DOI: http://doi.acm.org/10.1145/1149941.1149972

ABSTRACT

We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.


