Generalized inverse document frequency
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: IR: theory
Pages 399-408
Year of Publication: 2008
ISBN: 978-1-59593-991-3
ABSTRACT
Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Sparck Jones relevance weight. However, this derivation, and others related to it, typically make a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more generalized form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we also undertake a rigorous empirical evaluation that shows generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
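For orientation, the classical quantities the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's method. It shows the textbook IDF, log(N / n), and the form the Robertson-Sparck Jones relevance weight takes when no relevance information is available, with the usual 0.5 smoothing. The function names `classical_idf` and `rsj_idf` are hypothetical, and the generalized IDF derived in the paper is not reproduced here.

```python
import math


def _check_counts(num_docs: int, doc_freq: int) -> None:
    # Both formulas assume the term occurs in at least one document
    # and in no more documents than the collection contains.
    if num_docs <= 0 or not 0 < doc_freq <= num_docs:
        raise ValueError("require 0 < doc_freq <= num_docs")


def classical_idf(num_docs: int, doc_freq: int) -> float:
    """Textbook IDF: log(N / n), where n is the term's document frequency."""
    _check_counts(num_docs, doc_freq)
    return math.log(num_docs / doc_freq)


def rsj_idf(num_docs: int, doc_freq: int) -> float:
    """RSJ weight with no relevance information:
    log((N - n + 0.5) / (n + 0.5))."""
    _check_counts(num_docs, doc_freq)
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Illustrative collection of 1000 documents (made-up numbers).
    for n in (10, 500, 600):
        print(n, classical_idf(1000, n), rsj_idf(1000, n))
```

For rare terms the two weights are close. They diverge for frequent terms: the RSJ form becomes negative once a term occurs in more than half of the collection, whereas log(N / n) never drops below zero.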