ABSTRACT
In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per-query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data not only for probabilistic search engines like INQUERY but also for vector space search engines like SMART for English. We have also used this model to fit the output of other search engines, such as LSI search engines and search engines indexing other languages like Chinese.
It is then shown that, given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics.
This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
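As a rough illustration of the idea described above, the following sketch fits a two-component mixture (exponential for non-relevant scores, normal for relevant scores) to one query's scores with a plain EM loop, then maps a score to a posterior probability of relevance. It is not the authors' implementation: the function names, the initialisation from the top 10% of scores, the fixed iteration count, and the assumption that scores are non-negative are all choices made here for illustration.

```python
import math


def posterior_relevant(s, w, lam, mu, sigma):
    """P(relevant | score s) under the fitted two-component mixture."""
    # Normal density weighted by the relevant-class prior w.
    rel = w * math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # Exponential density weighted by the non-relevant prior (1 - w).
    non = (1.0 - w) * lam * math.exp(-lam * s)
    total = rel + non
    return rel / total if total > 0 else 0.0


def fit_exp_normal_mixture(scores, n_iter=200):
    """EM for p(s) = (1 - w) * Exp(s; lam) + w * Normal(s; mu, sigma).

    Assumes scores are non-negative (shift them first if they are not).
    Returns (w, lam, mu, sigma).
    """
    n = len(scores)
    ranked = sorted(scores)
    # Hypothetical initialisation: treat the top 10% of scores as relevant.
    top = ranked[int(0.9 * n):]
    w = 0.1
    lam = 1.0 / max(sum(ranked) / n, 1e-12)
    mu = sum(top) / len(top)
    sigma = max(1e-3, math.sqrt(sum((s - mu) ** 2 for s in top) / len(top)))

    for _ in range(n_iter):
        # E-step: responsibility of the normal (relevant) component per score.
        resp = [posterior_relevant(s, w, lam, mu, sigma) for s in scores]
        r_sum = max(sum(resp), 1e-12)
        # M-step: weighted maximum-likelihood updates for each component.
        w = r_sum / n
        non_weighted_sum = sum((1.0 - r) * s for r, s in zip(resp, scores))
        lam = (n - r_sum) / max(non_weighted_sum, 1e-12)
        mu = sum(r * s for r, s in zip(resp, scores)) / r_sum
        var = sum(r * (s - mu) ** 2 for r, s in zip(resp, scores)) / r_sum
        sigma = max(1e-3, math.sqrt(var))  # floor guards against collapse
    return w, lam, mu, sigma


def average_probabilities(probs):
    """Combine one document's relevance probabilities from several engines."""
    return sum(probs) / len(probs)
```

To combine engines in this sketch, one would fit the mixture separately to each engine's scores for the query, convert each document's score with `posterior_relevant`, and pass the per-engine probabilities for that document to `average_probabilities`.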