ABSTRACT
This paper reports on theoretical investigations into the assumptions underlying the inverse document frequency (idf). We show that an intuitive idf-based function for the probability of a term being informative assumes disjoint document events. By assuming documents to be independent rather than disjoint, we arrive at a Poisson-based probability of being informative. The framework is useful for understanding and deciding how to estimate and combine parameters in probabilistic retrieval models.
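The abstract gives no formulas, so the following is only a generic sketch of why the choice between disjoint and independent events matters, not the paper's own definitions. It assumes, purely for illustration, that each of N documents contributes an event with the same small probability lam / N; the names N, lam, and the helper functions are hypothetical. Under disjointness the probabilities simply add up, whereas under independence the probability that at least one event occurs is 1 - (1 - lam/N)^N, which approaches the Poisson-style value 1 - e^(-lam) as N grows.

```python
import math


def prob_union_disjoint(probs):
    """P(at least one event) for mutually exclusive events: probabilities add."""
    return sum(probs)


def prob_union_independent(probs):
    """P(at least one event) for independent events: 1 - prod(1 - p)."""
    none_occurs = 1.0
    for p in probs:
        none_occurs *= 1.0 - p
    return 1.0 - none_occurs


def poisson_limit(lam):
    """Limit of 1 - (1 - lam/N)**N as N grows: 1 - exp(-lam)."""
    return 1.0 - math.exp(-lam)


# Hypothetical illustration values, not taken from the paper.
N = 10_000   # number of documents (assumption)
lam = 1.0    # expected number of events across the collection (assumption)
probs = [lam / N] * N

disjoint = prob_union_disjoint(probs)
independent = prob_union_independent(probs)
limit = poisson_limit(lam)

print(f"disjoint:      {disjoint:.6f}")
print(f"independent:   {independent:.6f}")
print(f"Poisson limit: {limit:.6f}")
```

With these illustrative values the disjoint combination sums to lam itself, while the independent combination stays strictly below 1 and sits very close to the exponential limit.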