ABSTRACT
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
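The three basic static selection criteria named above (TF, DF, and TFIDF) can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the study: the tokenizer, the signature length `k=5`, the logarithmic IDF form, the alphabetical tie-break, and the function name `lexical_signature` are all assumptions made for illustration, and the PW, hybrid, and TS methods are not covered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return the k best terms of `doc` under the chosen criterion.

    method="tf"    : highest term frequency within the document
    method="df"    : lowest document frequency across the corpus
    method="tfidf" : highest tf * log(N / df)
    """
    if method not in ("tf", "df", "tfidf"):
        raise ValueError(f"unknown method: {method}")

    tf = Counter(tokenize(doc))
    n_docs = len(corpus)

    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))

    def score(term):
        d = max(df[term], 1)  # a term missing from the corpus counts as df = 1
        if method == "tf":
            return tf[term]
        if method == "df":
            return -d  # rarer terms rank higher
        return tf[term] * math.log(n_docs / d)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]
```

On a toy corpus, `lexical_signature(page, corpus, method="tf")` tends to pick frequent, common words, while `method="df"` picks the rarest ones, which mirrors the trade-off between finding relevant documents and unique identification described in the abstract.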
REFERENCES
1. Aimar, A., Casey, J., Drakos, N., Hannell, I., Khodabandeh, A., Palazzi, P., Rousseau, B., and Ruggier, M. 1995. WebLinker, a tool for managing WWW cross-references. Computer Networks and ISDN Systems 28, 1--2 (Dec.), 99--107. DOI: 10.1016/0169-7552(95)00089-4
2. Andrews, K., Kappe, F., and Maurer, H. 1995. The Hyper-G network information system. J. Univers. Comput. Sci. 1, 4 (April), 206--220.
3. Arms, W., Blanchi, C., and Overly, E. 1997. An architecture for information in digital libraries. D-Lib Mag. Available at http://www.dlib.org/dlib/february97/cnri/02arms1.html.
4. Berners-Lee, T., Fielding, R., and Frystyk, H. 1996. Hypertext transfer protocol---http/1.0. Available at http://www.w3.org/Protocols/HTTP/1.0/draft-ietf-http-spec.html.
9. Douglis, F., Feldmann, A., Krishnamurthy, B., and Mogul, J. C. 1997. Rate of change and other metrics: A live study of the world wide web. In Proceedings of the USENIX Symposium on Internet Technologies and Systems. 147--158.
10. Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T. 1999. RFC 2616---hypertext transfer protocol---http/1.1. Available at http://www.w3.org/Protocols/rfc2616/rfc2616.html.
12. Fisher, R. A. 1966. Design of Experiments, 8th ed. Hafner (Macmillan), New York, NY.
13. Giles, C. L., Bollacker, K. D., and Lawrence, S. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries (Pittsburgh, Pennsylvania, United States, June 23--26). 89--98. DOI: 10.1145/276675.276685
15. Gulesian, M. 1996. Netscape livewire pro 1.0. Available at http://www.dbmsmag.com/9612d08.html.
19. Ingham, D., Little, M., Caughey, S., and Shrivastava, S. 1995. W3Objects: Bringing object-oriented technology to the Web. In Proceedings of the 4th International World Wide Web Conference. 89--105.
22. Lawrence, S. and Giles, C. L. 1998b. Searching the World Wide Web. Science 280, 5360 (April), 98--100.
23. Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.
25. Lawrence, S., Pennock, D. M., Flake, G. W., Krovetz, R., Coetzee, F. M., Glover, E., Nielsen, F. Å., Kruger, A., and Giles, C. L. 2001. Persistence of Web references in scientific research. Computer 34, 2 (Feb.), 26--31.
27. Noreen, E. W. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley, New York, NY.
28. Oberholzer, G. and Wilde, E. 2002. Extended link visualization with DHTML: The Web as an open hypermedia system. Tech. rep. TIK-125. Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Zurich, Switzerland.
30. Phelps, T. A. and Wilensky, R. 2000a. Robust hyperlinks: Cheap, everywhere, now. In Proceedings of Digital Documents and Electronic Publishing 2000 (DDEP00).
32. Pitkow, J. 1998a. Web characterization activity: Answers to the W3C HTTP-NGs protocol design group's questions. World Wide Web Consortium (WWW.W3.org).
34. Reich, V. and Rosenthal, D. S. H. 2001. Lockss: A permanent Web publishing and access system. D-Lib Mag. 7, 6 (June), 55--68.
35. Rusmevichientong, P., Pennock, D. M., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation. 121--128.
36. Shafer, K., Weibel, S., Jul, E., and Fausey, J. 1996. Introduction to persistent uniform resource locators. In INET 96. Internet Society, Reston, Va.
37. Sollins, K. and Masinter, L. 1994. Functional Requirements for Uniform Resource Names, Network Working Group, RFC 1737. Available at http://www.faqs.org/rfcs/rfc1737.html.
39. Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes, 2nd ed. Harcourt Science and Technology Company, San Diego, CA.
CITED BY 5
Atsuyuki Morishima, Akiyoshi Nakamizo, Toshinari Iida, Shigeo Sugimoto, Hiroyuki Kitagawa, Bringing your dead links back to life: a comprehensive approach and lessons learned, Proceedings of the 20th ACM conference on Hypertext and hypermedia, June 29-July 01, 2009, Torino, Italy
Atsuyuki Morishima, Akiyoshi Nakamizo, Toshinari Iida, Shigeo Sugimoto, Hiroyuki Kitagawa, Why are moved web pages difficult to find?: the WISH approach, Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain
INDEX TERMS
Primary Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
General Terms: Algorithms, Experimentation, Measurement, Performance, Reliability, Verification
Keywords: Broken URLs, TREC, World Wide Web, dead links, digital libraries, indexing, information retrieval, inverse document frequency, lexical signatures, robust hyperlinks, search engines, term frequency