Why inverse document frequency?
Source: Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, Pittsburgh, Pennsylvania
Pages: 1 - 8
Year of Publication: 2001
Publisher: Association for Computational Linguistics, Morristown, NJ, USA
ABSTRACT
Inverse Document Frequency (IDF) is a popular measure of a word's importance. The IDF invariably appears in a host of heuristic measures used in information retrieval. However, so far the IDF has itself been a heuristic. In this paper, we show IDF to be optimal in a principled sense. We show that IDF is the optimal weight of a word with respect to minimization of a Kullback-Leibler distance suitably generalized to nonnegative functions which need not be probability distributions. This optimization problem is closely related to the maximum entropy problem. We show that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where we treat each document as the query that retrieves itself. That is, IDF is optimal for document self-retrieval.
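The abstract gives no formulas, so the following is only an illustrative sketch under stated assumptions. It uses the common textbook definition IDF(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. It also uses the standard generalized Kullback-Leibler divergence for nonnegative functions, D(p || q) = sum_i [ p_i log(p_i / q_i) - p_i + q_i ]. The paper's exact formulation may differ, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(N / df)} for a list of tokenized documents.

    Assumes the common definition of IDF; not taken from the paper itself.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def generalized_kl(p, q):
    """Generalized KL divergence between nonnegative sequences p and q.

    Computes sum(p_i * log(p_i / q_i) - p_i + q_i). Neither p nor q has to
    sum to 1. Requires q_i > 0 wherever p_i > 0; a term with p_i == 0
    contributes q_i.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / q_i) - p_i + q_i
        else:
            total += q_i
    return total


# Toy corpus (hypothetical): "the" occurs in every document, "dog" in one.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
weights = idf_weights(docs)
```

In this toy corpus a word that occurs in every document receives weight zero and rarer words receive larger weights. When `p` and `q` both sum to 1, the extra `- p_i + q_i` terms cancel and `generalized_kl` reduces to the ordinary Kullback-Leibler divergence.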