ABSTRACT
A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but behaves almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
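As an illustration of the three basic term-selection criteria named above (TF, DF, and TFIDF), the following is a minimal Python sketch, not the implementation used in the study. The function names, the tokenizer, the signature length `k`, the alphabetical tie-breaking, and the use of a small in-memory corpus to estimate document frequency are all assumptions made for the example. PW and the hybrid methods are not shown, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(corpus):
    """Count, for each term, how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return the k terms of `doc` ranked best under the chosen method.

    method="tf":    prefer terms that occur most often in the document.
    method="df":    prefer terms that occur in the fewest corpus documents.
    method="tfidf": prefer terms with the highest tf * log(N / df).
    """
    tf = Counter(tokenize(doc))
    df = document_frequencies(corpus)
    n = len(corpus)

    def score(term):
        # Treat terms unseen in the corpus as df = 1 to avoid division by zero.
        term_df = max(df[term], 1)
        if method == "tf":
            return tf[term]
        if method == "df":
            return -term_df  # rarer terms get a higher score
        if method == "tfidf":
            return tf[term] * math.log(n / term_df)
        raise ValueError(f"unknown method: {method}")

    # Highest score first; ties are broken alphabetically (an assumption).
    ranked = sorted(tf, key=lambda term: (-score(term), term))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "apple apple apple banana cherry",
        "apple banana banana",
        "apple durian",
    ]
    for method in ("tf", "df", "tfidf"):
        print(method, lexical_signature(corpus[0], corpus, method=method, k=2))
```

On a realistic collection, the TF variant would need stop-word removal to avoid selecting function words, and document frequencies would have to come from a search engine or a large index rather than a toy list.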
REFERENCES
[1] Alberto Aimar, James Casey, Nikos Drakos, Ian Hannell, Arash Khodabandeh, Paolo Palazzi, Bertrand Rousseau, and Mario Ruggier. WebLinker, a tool for managing WWW cross-references. Computer Networks and ISDN Systems, 28(1-2):99--107, December 1995. doi:10.1016/0169-7552(95)00089-4
[2] K. Andrews, F. Kappe, and H. Maurer. The Hyper-G Network Information Systems. Journal of Universal Computer Science, 1(4):206--220, April 1995.
[3] W. Arms, C. Blanchi, and E. Overly. An Architecture for Information in Digital Libraries. D-Lib Magazine, February 1997.
[7] D. Ingham, M. Little, S. Caughey, and S. Shrivastava. W3Objects: Bringing Object-Oriented Technology to the Web. In The Web Journal, pages 89--105. 4th International World Wide Web Conference, December 1995.
[9] S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98--100, April 1998.
[10] S. Lawrence and C. L. Giles. Accessibility of information on the Web. Nature, 400:107--109, July 1999.
[12] Steve Lawrence, David M. Pennock, Gary William Flake, Robert Krovetz, Frans M. Coetzee, Eric Glover, Finn Årup Nielsen, Andries Kruger, and C. Lee Giles. Persistence of Web References in Scientific Research. Computer, 34(2):26--31, February 2001.
[13] G. Oberholzer and E. Wilde. Extended Link Visualization with DHTML: The Web as an open hypermedia system. Technical Report TIK-Report No. 125, Computer Engineering and Networks Laboratory (TIK), ETH Zürich, January 2002.
[14] T. A. Phelps and R. Wilensky. Robust Hyperlinks: Cheap, Everywhere, Now. In Proceedings of Digital Documents and Electronic Publishing 2000 (DDEP00), September 2000.
[15] J. Pitkow. Web Characterization Activity Answers to the W3C HTTP-NGs Protocol Design Group's Questions. World Wide Web Consortium, 1998. http://www.w3.org/WCA/Reports/1998-01-PDG-answers.htm.
[17] K. Shafer, S. Weibel, E. Jul, and J. Fausey. Introduction to Persistent Uniform Resource Locators. In INET 96. Internet Society, Reston, Va., 1996.
[18] K. Sollins and L. Masinter. Functional Requirements for Uniform Resource Names. Internet Request for Comments, Dec 1994. http://ietf.org/rfc/rfc1737.txt.
[20] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes 2nd Edition. A Harcourt Science and Technology Company, 525 B Street, Suite 1990, San Diego, CA 92101 - 4495, USA, 1999.
INDEX TERMS
Primary Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
Additional Classification:
E. Data
E.2 DATA STORAGE REPRESENTATIONS
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.2 Information Storage
H.3.7 Digital Libraries
General Terms: Algorithms, Experimentation, Measurement, Performance, Reliability, Verification
Keywords: TREC, broken URLs, dead links, digital, indexing, information retrieval, inverse document frequency, lexical signatures, libraries, robust hyperlinks, search engines, term frequency, world wide web