ABSTRACT
Various measures, such as binary preference (bpref), inferred average precision (infAP), and binary normalised discounted cumulative gain (nDCG), have been proposed as alternatives to mean average precision (MAP) because they are less sensitive to the completeness of relevance judgements. Since the primary aim in building a system is to train it to respond to user queries in a more robust and stable manner, this paper investigates how much the choice of evaluation measure used for training matters under different levels of evaluation incompleteness. We simulate evaluation incompleteness by sampling from the relevance assessments. Through large-scale experiments on two standard TREC test collections, we examine retrieval sensitivity to training, i.e. whether a training process based on any of the four measures discussed affects the final retrieval performance. Experimental results show that training by bpref, infAP and nDCG provides significantly better retrieval performance than training by MAP when the completeness of relevance judgements is extremely low. As the completeness of relevance judgements increases, the measures behave more similarly.
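The following is a minimal sketch of the kind of simulation the abstract describes: relevance assessments are subsampled, and a ranking is then scored with average precision and with bpref. It is not the authors' experimental code. The function names, the uniform per-query sampling scheme, and the data layout (a dictionary mapping query id to `{doc_id: relevance}`) are assumptions made for illustration, and the bpref formula follows one common formulation (the one used by trec_eval), which may differ in detail from the paper's.

```python
import random


def sample_qrels(qrels, fraction, seed=0):
    """Keep a uniform random fraction of the judged documents of each query.

    Assumption: uniform sampling per query; the paper may sample differently.
    """
    rng = random.Random(seed)
    sampled = {}
    for qid, judged in qrels.items():
        docs = sorted(judged)  # sort for reproducibility
        k = max(1, round(fraction * len(docs)))
        sampled[qid] = {d: judged[d] for d in rng.sample(docs, k)}
    return sampled


def average_precision(ranking, judged):
    """AP for one query; unjudged documents count as non-relevant."""
    num_rel = sum(1 for rel in judged.values() if rel > 0)
    if num_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if judged.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_rel


def bpref(ranking, judged):
    """bpref for one query; unjudged documents are skipped entirely."""
    num_rel = sum(1 for rel in judged.values() if rel > 0)
    num_nonrel = len(judged) - num_rel
    if num_rel == 0:
        return 0.0
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc not in judged:
            continue  # ignore unjudged documents
        if judged[doc] > 0:
            if nonrel_seen == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_seen, num_rel) / min(num_rel, num_nonrel)
        else:
            nonrel_seen += 1
    return total / num_rel


# Hypothetical toy data: one query, five ranked documents.
ranking = ["d1", "d2", "d3", "d4", "d5"]
full = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}
partial = {"d1": 1, "d3": 1, "d4": 0}  # d2 and d5 are no longer judged
```

Scoring `ranking` against `full` and against `partial` shows how the two measures treat missing judgements differently: average precision treats the unjudged `d2` and `d5` as non-relevant, whereas bpref only considers the documents that were actually judged.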