Analysis of lexical signatures for improving information persistence on the World Wide Web
Source: ACM Transactions on Information Systems (TOIS)
Volume 22, Issue 4 (October 2004)
Pages: 540-572
Year of Publication: 2004
ISSN: 1046-8188
Authors
Seung-Taek Park  Yahoo! Research Labs, Pasadena, CA
David M. Pennock  Yahoo! Research Labs, Pasadena, CA
C. Lee Giles  The Pennsylvania State University, University Park, PA
Robert Krovetz  Ask Jeeves, Piscataway, NJ
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 69,   Citation Count: 3
DOI: http://doi.acm.org/10.1145/1028099.1028101

ABSTRACT

A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
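
The three basic static scoring families named in the abstract (TF, DF, and TFIDF) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the toy corpus, the signature length k, the tie-breaking rule, and the function name lexical_signature are all assumptions made for illustration, and the exact formulas used in the paper (including PW and the hybrid methods) may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return the k terms of `doc` ranked by an assumed TF, DF, or TFIDF score.

    tf    : most frequent terms in the document first
    df    : terms appearing in the fewest corpus documents first
    tfidf : tf * log(N / df), highest first
    Ties are broken alphabetically (an arbitrary choice for determinism).
    """
    term_counts = Counter(tokenize(doc))
    doc_term_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    def doc_freq(term):
        # Number of corpus documents containing the term (at least 1 to avoid division by zero).
        return max(sum(1 for terms in doc_term_sets if term in terms), 1)

    def score(term):
        tf = term_counts[term]
        df = doc_freq(term)
        if method == "tf":
            return tf
        if method == "df":
            return -df  # rarer terms get a higher score
        if method == "tfidf":
            return tf * math.log(n_docs / df)
        raise ValueError(f"unknown method: {method}")

    ranked = sorted(term_counts, key=lambda t: (-score(t), t))
    return ranked[:k]


# Toy corpus (hypothetical data, not from the study).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the zebra and the zebra ran",
]
doc = corpus[2]
tf_sig = lexical_signature(doc, corpus, method="tf", k=2)
df_sig = lexical_signature(doc, corpus, method="df", k=2)
tfidf_sig = lexical_signature(doc, corpus, method="tfidf", k=2)
```

On this toy corpus the TF signature includes the common word "the", the DF signature contains only terms unique to the document, and the TFIDF signature ranks the frequent and rare term "zebra" first. The signature terms would then be submitted as a query to a search engine, which is a step this sketch does not cover.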


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Andrews, K., Kappe, F., and Maurer, H. 1995. The Hyper-G network information system. J. Univers. Comput. Sci. 1, 4 (April), 206--220.
 
3
Arms, W., Blanchi, C., and Overly, E. 1997. An architecture for information in digital libraries. D-Lib Mag. Available at http://www.dlib.org/dlib/february97/cnri/02arms1.html.
 
4
Berners-Lee, T., Fielding, R., and Frystyk, H. 1996. Hypertext transfer protocol---http/1.0. Available at http://www.w3.org/Protocols/HTTP/1.0/draft-ietf-http-spec.html.
 
5
 
6
 
7
 
8
 
9
Douglis, F., Feldmann, A., Krishnamurthy, B., and Mogul, J. C. 1997. Rate of change and other metrics: A live study of the world wide web. In Proceedings of the USENIX Symposium on Internet Technologies and Systems. 147--158.
 
10
Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T. 1999. RFC 2616---hypertext transfer protocol---http/1.1. Available at http://www.w3.org/Protocols/rfc2616/rfc2616.html.
 
11
 
12
Fisher, R. A. 1966. Design of Experiments, 8th ed. Hafner (Macmillan), New York, NY.
13
 
14
 
15
Gulesian, M. 1996. Netscape livewire pro 1.0. Available at http://www.dbmsmag.com/9612d08.html.
 
16
 
17
 
18
 
19
Ingham, D., Little, M., Caughey, S., and Shrivastava, S. 1995. W3Objects: Bringing object-oriented technology to the Web. In Proceedings of the 4th International World Wide Web Conference. 89--105.
20
 
21
 
22
Lawrence, S. and Giles, C. L. 1998b. Searching the World Wide Web. Science 280, 5360 (April), 98--100.
 
23
Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.
 
24
 
25
26
 
27
Noreen, E. W. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley, New York, NY.
 
28
Oberholzer, G. and Wilde, E. 2002. Extended link visualization with DHTML: The Web as an open hypermedia system. Tech. rep. TIK-125. Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Zurich, Switzerland.
29
 
30
Phelps, T. A. and Wilensky, R. 2000a. Robust hyperlinks: Cheap, everywhere, now. In Proceedings of Digital Documents and Electronic Publishing 2000 (DDEP00).
 
31
 
32
Pitkow, J. 1998a. Web characterization activity: Answers to the W3C HTTP-NGs protocol design group's questions. World Wide Web Consortium (WWW.W3.org).
 
33
 
34
Reich, V. and Rosenthal, D. S. H. 2001. Lockss: A permanent Web publishing and access system. D-Lib Mag. 7, 6 (June), 55--68.
 
35
Rusmevichientong, P., Pennock, D. M., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation. 121--128.
 
36
Shafer, K., Weibel, S., Jul, E., and Fausey, J. 1996. Introduction to persistent uniform resource locators. In INET 96. Internet Society, Reston, Va.
 
37
Sollins, K. and Masinter, L. 1994. Functional Requirements for Uniform Resource Names, Network Working Group, RFC 1737. Available at http://www.faqs.org/rfcs/rfc1737.html.
38
 
39
Witten, I. H., Moffat, A., and Bell, T. C. 1999. Managing Gigabytes, 2nd ed. Harcourt Science and Technology Company, San Diego, CA.

