A comparison of techniques for estimating IDF values to generate lexical signatures for the web
Source
Workshop On Web Information And Data Management
Proceedings of the 10th ACM workshop on Web information and data management
Napa Valley, California, USA
SESSION: System issues
Pages 39-46
Year of Publication: 2008
ISBN:978-1-60558-260-3
Authors
Martin Klein, Old Dominion University, Norfolk, VA, USA
Michael L. Nelson, Old Dominion University, Norfolk, VA, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458502.1458510

ABSTRACT

For bounded datasets such as the TREC Web Track, computing term frequency (TF) and inverse document frequency (IDF) is not difficult. However, since IDF cannot be calculated directly for the entire web, it must be estimated. We see a need to estimate accurate IDF values in order to generate TF-IDF based lexical signatures (LSs) of web pages. Future applications that generate such LSs will require real-time IDF computation. Therefore, we conducted a comparison study of different methods for estimating IDF values of web pages. Our objective is to investigate how accurate these estimation methods are compared to a baseline. We use the Google N-grams as our baseline and compare it against two IDF estimation techniques, which are based on: 1) a "local universe" consisting of textual content and the corresponding document frequencies from copies of URLs from the Internet Archive and 2) "screen scraping", a technique that queries the Google web interface for document frequencies. We found a term overlap of 70 to 80% between the results of the two methods and the baseline. We further found strong agreement in the rank correlation of TF-IDF ranked terms between our methods: Kendall τ is approximately 0.8, and the M-Score (which penalizes discordances in higher ranks) is even higher, peaking well above 0.9. These preliminary results lead us to conclude that both methods are appropriate for creating accurate IDF values for web pages.
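
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the common IDF form log(N / DF), whitespace tokenization, a default DF of 1 for unseen terms, and term overlap normalized by the longer signature; none of these choices is stated in the abstract. The function names and the toy document frequencies are hypothetical. The M-Score is left out because the abstract does not define it.

```python
import math
from collections import Counter
from itertools import combinations


def lexical_signature(text, doc_freq, corpus_size, k=5):
    """Return the k terms with the highest TF-IDF score, best first."""
    tf = Counter(text.lower().split())  # assumption: naive whitespace tokens
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)  # assumption: unseen terms get DF = 1
        scores[term] = count * math.log(corpus_size / df)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def term_overlap(ls_a, ls_b):
    """Shared terms divided by the length of the longer signature."""
    if not ls_a and not ls_b:
        return 1.0
    return len(set(ls_a) & set(ls_b)) / max(len(ls_a), len(ls_b))


def kendall_tau(ls_a, ls_b):
    """Kendall tau over the terms that appear in both ranked lists."""
    pos_b = {term: i for i, term in enumerate(ls_b)}
    common = [term for term in ls_a if term in pos_b]
    n = len(common)
    if n < 2:
        raise ValueError("need at least two shared terms")
    concordant = discordant = 0
    # In each pair, x precedes y in ls_a, so only the order in ls_b matters.
    for x, y in combinations(common, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy data (hypothetical): one page, two sources of document frequencies.
page = "web archive web page signature archive web"
df_baseline = {"web": 500, "archive": 50, "page": 400, "signature": 10}
df_estimate = {"web": 600, "archive": 40, "page": 100, "signature": 20}

ls_base = lexical_signature(page, df_baseline, corpus_size=1000, k=4)
ls_est = lexical_signature(page, df_estimate, corpus_size=1000, k=4)
print(ls_base, ls_est)
print(term_overlap(ls_base[:3], ls_est[:3]), kendall_tau(ls_base, ls_est))
```

In this toy case the two DF sources agree on the top two terms and swap the last two, so the overlap of the three-term signatures and the rank correlation of the four-term lists both fall below 1.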


