Analysis of lexical signatures for finding lost or related documents
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Tampere, Finland
SESSION: Web Information Retrieval
Pages: 11-18
Year of Publication: 2002
ISBN: 1-58113-561-0
Authors
Seung-Taek Park  The Pennsylvania State University, University Park, PA
David M. Pennock  The Pennsylvania State University, University Park, PA
C. Lee Giles  NEC Research Institute
Robert Krovetz  NEC Research Institute
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/564376.564381

ABSTRACT

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
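
The three basic selection criteria named in the abstract (TF, DF, and TFIDF) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's own implementation: the tokenizer, the function names `tokenize`, `document_frequencies`, and `signature`, the logarithmic IDF weighting, the alphabetical tie-breaking, and the toy three-document corpus are all hypothetical. The PW and hybrid methods are not shown because the abstract does not define them precisely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(corpus):
    """Count, for each term, how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def signature(doc, df, n_docs, method="tfidf", k=5):
    """Pick k terms from doc as a lexical signature.

    method="tf":    highest term frequency within the document
    method="df":    lowest document frequency in the corpus (rarest terms)
    method="tfidf": highest tf * log(n_docs / df)
    Ties are broken alphabetically so the result is deterministic.
    """
    tf = Counter(tokenize(doc))
    if method == "tf":
        score = lambda t: tf[t]
    elif method == "df":
        score = lambda t: -df.get(t, 1)
    elif method == "tfidf":
        score = lambda t: tf[t] * math.log(n_docs / df.get(t, 1))
    else:
        raise ValueError(f"unknown method: {method}")
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


# Toy corpus, invented for this example only.
corpus = [
    "robust hyperlinks robust signature web page",
    "web page search engine ranking web",
    "search engine index web document",
]
df = document_frequencies(corpus)

tf_sig = signature(corpus[0], df, len(corpus), method="tf", k=2)
df_sig = signature(corpus[0], df, len(corpus), method="df", k=3)
tfidf_sig = signature(corpus[0], df, len(corpus), method="tfidf", k=4)
```

In this toy setting the DF signature contains only terms that occur in a single document, which matches the abstract's remark that DF favors unique identification, while the TFIDF signature ranks the common term "web" last because its IDF weight is zero. The resulting term list would then be submitted to a search engine as a query.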


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
[2] K. Andrews, F. Kappe, and H. Maurer. The Hyper-G Network Information Systems. Journal of Universal Computer Science, 1(4):206-220, April 1995.
[3] W. Arms, C. Blanchi, and E. Overly. An Architecture for Information in Digital Libraries. D-Lib Magazine, February 1997.
[7] D. Ingham, M. Little, S. Caughey, and S. Shrivastava. W3Objects: Bringing Object-Oriented Technology to the Web. In The Web Journal, pages 89-105. 4th International World Wide Web Conference, December 1995.
[9] S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98-100, April 1998.
[10] S. Lawrence and C. L. Giles. Accessibility of information on the Web. Nature, 400:107-109, July 1999.
[13] G. Oberholzer and E. Wilde. Extended Link Visualization with DHTML: The Web as an open hypermedia system. Technical Report TIK-Report No. 125, Computer Engineering and Networks Laboratory (TIK), ETH Zürich, January 2002.
[14] T. A. Phelps and R. Wilensky. Robust Hyperlinks: Cheap, Everywhere, Now. In Proceedings of Digital Documents and Electronic Publishing 2000 (DDEP00), September 2000.
[15] J. Pitkow. Web Characterization Activity Answers to the W3C HTTP-NGs Protocol Design Group's Questions. World Wide Web Consortium, 1998. http://www.w3.org/WCA/Reports/1998-01-PDG-answers.htm.
[17] K. Shafer, S. Weibel, E. Jul, and J. Fausey. Introduction to Persistent Uniform Resource Locators. In INET 96. Internet Society, Reston, Va., 1996.
[18] K. Sollins and L. Masinter. Functional Requirements for Uniform Resource Names. Internet Request for Comments, Dec 1994. http://ietf.org/rfc/rfc1737.txt.
[20] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes 2nd Edition. A Harcourt Science and Technology Company, 525 B Street, Suite 1990, San Diego, CA 92101 - 4495, USA, 1999.

