Bringing your dead links back to life: a comprehensive approach and lessons learned
Source
Conference on Hypertext and Hypermedia
Proceedings of the 20th ACM conference on Hypertext and hypermedia
Torino, Italy
SESSION: Hypertext structure and usage
Pages 15-24
Year of Publication: 2009
ISBN: 978-1-60558-486-7
Authors
Atsuyuki Morishima, University of Tsukuba, Tsukuba, Japan
Akiyoshi Nakamizo, Shibaura Institute of Technology, Tokyo, Japan
Toshinari Iida, University of Tsukuba, Tsukuba, Japan
Shigeo Sugimoto, University of Tsukuba, Tsukuba, Japan
Hiroyuki Kitagawa, University of Tsukuba, Tsukuba, Japan
ABSTRACT
This paper presents an experimental study of the automatic correction of broken (dead) Web links, focusing in particular on links broken by the relocation of Web pages. Our first contribution is an algorithm that incorporates a comprehensive set of heuristics, some of which are novel, in a single unified framework. Our second contribution is a relatively large-scale experiment; analysis of its results revealed the characteristics of the problem of finding moved Web pages. We demonstrated empirically that the problem of searching for moved pages is different from typical information retrieval problems. First, it is impossible to identify the final destination until the page is moved, so the index-server approach is not necessarily effective. Second, there is a strong bias in where the new address is likely to be, so crawler-based solutions can be implemented effectively without searching the entire Web. We analyzed the experimental results in detail to show how important each heuristic is in real Web settings, and conducted statistical analyses to show that our algorithm succeeds in correctly finding new links for more than 70% of broken links at a 95% confidence level.
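To make the closing claim concrete, the sketch below shows one common way a statement of the form "more than 70% at a 95% confidence level" can be checked: compute a one-sided lower confidence bound on the success proportion and compare it with 0.70. The abstract does not say which statistical procedure the authors used, so the Wilson score bound here is a stand-in chosen for illustration, and the function name and the counts (380 recovered links out of 500) are hypothetical, not figures from the paper.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound for a binomial proportion.

    Illustrative only: the paper's own statistical method is not given
    in the abstract, so this is an assumed substitute.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    # Standard normal quantile for a one-sided bound at the given level.
    z = NormalDist().inv_cdf(confidence)
    p_hat = successes / trials

    denominator = 1 + z * z / trials
    centre = p_hat + z * z / (2 * trials)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return (centre - margin) / denominator


# Hypothetical counts, NOT taken from the paper.
recovered, broken = 380, 500
lower = wilson_lower_bound(recovered, broken)
print(f"observed rate: {recovered / broken:.3f}")
print(f"95% lower bound: {lower:.3f}")
print("claim '> 70%' supported:", lower > 0.70)
```

With a bound of this kind, the claim holds only when the lower bound itself, not just the observed rate, exceeds 0.70, which is why the sample size matters as much as the raw success rate.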