ABSTRACT
We introduce a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric and are obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. The random processes we study include the binomial distribution and Bose--Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document--query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models that form baseline alternatives to the standard tf-idf model.
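The abstract gives no formulas, so the following is only a minimal sketch of how the pieces it describes could fit together, not the authors' definitive formulation. It assumes one commonly cited instantiation of the framework: a geometric approximation of Bose--Einstein statistics as the randomness model, the Laplace law of succession as the first normalization, and a logarithmic rescaling of term frequency by document length as the second normalization. The function name `dfr_weight`, its parameter names, and the free parameter `c` are hypothetical and chosen for illustration.

```python
import math


def dfr_weight(tf, doc_len, avg_doc_len, coll_freq, num_docs, c=1.0):
    """Illustrative divergence-from-randomness style term weight.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    coll_freq   -- total occurrences of the term in the collection
    num_docs    -- number of documents in the collection
    c           -- assumed tuning parameter for length normalization
    """
    if tf <= 0:
        return 0.0

    # Second normalization (assumed form): rescale tf as if the
    # document had the average length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)

    # Randomness model (assumed form): geometric approximation of
    # Bose--Einstein statistics with mean lam = coll_freq / num_docs.
    # Prob(tfn) = 1/(1+lam) * (lam/(1+lam))**tfn, and the
    # informative content is -log2 Prob(tfn).
    lam = coll_freq / num_docs
    inf1 = math.log2(1.0 + lam) + tfn * math.log2((1.0 + lam) / lam)

    # First normalization (assumed form): Laplace law of succession.
    # The gain of accepting the term as a descriptor is 1/(tfn+1).
    gain = 1.0 / (tfn + 1.0)

    return gain * inf1


if __name__ == "__main__":
    # A term seen twice in an average-length document.
    print(dfr_weight(tf=2, doc_len=100, avg_doc_len=100,
                     coll_freq=100, num_docs=1000))
```

In this sketch the two normalizations are applied in succession, as the abstract describes: the length-based rescaling first produces `tfn`, and the gain factor is then computed from that rescaled frequency. Under these assumptions a term that is rarer in the collection receives a larger weight, and the same raw frequency counts for less in a longer document.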
INDEX TERMS
Primary Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.3 Information Search and Retrieval
Subjects: Retrieval models
Additional Classification:
G. Mathematics of Computing
G.3 PROBABILITY AND STATISTICS
General Terms: Algorithms, Experimentation, Measurement, Theory
Keywords: Aftereffect model, BM25, Bose--Einstein statistics, Laplace, Poisson, binomial law, document length normalization, eliteness, idf, information retrieval, probabilistic models, randomness, succession law, term frequency normalization, term weighting