Probabilistic models of information retrieval based on measuring the divergence from randomness
Source: ACM Transactions on Information Systems (TOIS)
Volume 20, Issue 4 (October 2002)
Pages: 357-389
Year of Publication: 2002
ISSN:1046-8188
Authors
Gianni Amati  University of Glasgow, Fondazione Ugo Bordoni, Roma, Italy
Cornelis Joost Van Rijsbergen  University of Glasgow, Glasgow, Scotland
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/582415.582416

ABSTRACT

We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
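
The abstract states no formulas, so the following is only a minimal sketch of how such a weight might be assembled. It assumes a binomial basic model with success probability 1/N, a geometric approximation of Bose-Einstein statistics with mean F/N, a Laplace-style information gain 1/(tfn + 1) for the first normalization, and a logarithmic document-length rescaling for the second. All function and parameter names are hypothetical, and the specific formulas are assumptions drawn from the general divergence-from-randomness literature rather than from the text above.

```python
import math


def binomial_information(tf, total_occurrences, num_docs):
    """Return -log2 of the binomial probability of seeing tf occurrences.

    Assumption: each of the term's total_occurrences tokens falls in a
    given document with probability 1 / num_docs (num_docs must exceed 1).
    lgamma is used so that a non-integer, normalized tf is accepted.
    """
    p = 1.0 / num_docs
    f = total_occurrences
    log_comb = math.lgamma(f + 1) - math.lgamma(tf + 1) - math.lgamma(f - tf + 1)
    log_prob = log_comb + tf * math.log(p) + (f - tf) * math.log1p(-p)
    return -log_prob / math.log(2)


def geometric_information(tf, total_occurrences, num_docs):
    """Return -log2 of a geometric approximation to Bose-Einstein statistics.

    Assumption: the mean term frequency is lam = total_occurrences / num_docs
    and Prob(tf) = (1 / (1 + lam)) * (lam / (1 + lam)) ** tf.
    """
    lam = total_occurrences / num_docs
    return math.log2(1 + lam) + tf * math.log2((1 + lam) / lam)


def length_normalized_tf(tf, doc_len, avg_doc_len):
    """Second normalization (assumed form): rescale tf by document length."""
    return tf * math.log2(1 + avg_doc_len / doc_len)


def dfr_weight(tf, doc_len, avg_doc_len, total_occurrences, num_docs,
               basic_model=geometric_information):
    """Combine a basic model with the two normalizations in succession."""
    tfn = length_normalized_tf(tf, doc_len, avg_doc_len)
    information = basic_model(tfn, total_occurrences, num_docs)
    # First normalization (assumed Laplace form): the gain shrinks as tfn grows.
    gain = 1.0 / (tfn + 1.0)
    return gain * information


if __name__ == "__main__":
    # A rare term should outweigh a common one at equal tf and document length.
    rare = dfr_weight(3, 100, 100, total_occurrences=10, num_docs=1000)
    common = dfr_weight(3, 100, 100, total_occurrences=10000, num_docs=1000)
    print(rare > common)
```

Passing `basic_model=binomial_information` swaps the random process while leaving both normalizations unchanged, which mirrors the modular structure the abstract describes.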

