| TF-IDF uncovered: a study of theories and probabilities |
Source
|
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Singapore, Singapore
SESSION: Probabilistic models
Pages 435-442
Year of Publication: 2008
ISBN:978-1-60558-164-4
|
|
Authors
|
|
Thomas Roelleke
|
Queen Mary, University of London, London, United Kingdom
|
|
Jun Wang
|
Queen Mary, University of London, London, United Kingdom
|
|
ABSTRACT
Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper reviews the existing interpretations and then systematically relates TF-IDF to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent terms and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy between P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/x dx = log x. The study shows that components such as divergence from randomness and pivoted document length are inherent parts of a document-query independence (DQI) measure and, interestingly, that an integral of the DQI over the term occurrence probability leads to TF-IDF.
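The role of the integral of 1/x can be illustrated with a small numerical check. The sketch below is not the paper's derivation (the abstract integrates a DQI measure, which is not defined here); it only assumes the common definition idf(t) = -log(df(t)/N) and shows that integrating 1/x from the term occurrence probability p = df/N up to 1 gives the same value. The function names and the toy collection figures are hypothetical.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency under the assumed definition -log(df / N)."""
    return -math.log(df / n_docs)


def integral_of_reciprocal(lower: float, upper: float = 1.0,
                           steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Plain TF-IDF weight: raw term frequency times idf (one common variant)."""
    return tf * idf(df, n_docs)


# Hypothetical toy collection: 1000 documents, the term occurs in 10 of them.
n_docs, df = 1000, 10
p = df / n_docs  # term occurrence probability

# The integral of 1/x from p to 1 equals log(1) - log(p) = -log(p) = idf.
approx = integral_of_reciprocal(p)
exact = idf(df, n_docs)
print(f"integral: {approx:.6f}  idf: {exact:.6f}")
```

The midpoint rule is used only to keep the example dependency-free; any quadrature routine would serve the same purpose.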