ABSTRACT
We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.