Relevance assessment: are judges exchangeable and does it matter
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Singapore, Singapore
SESSION: Evaluation--2
Pages 667-674
Year of Publication: 2008
ISBN: 978-1-60558-164-4
Authors
Peter Bailey, Microsoft, Redmond, WA, USA
Nick Craswell, Microsoft, Cambridge, United Kingdom
Ian Soboroff, NIST, Gaithersburg, MD, USA
Paul Thomas, CSIRO, Canberra, Australia
Arjen P. de Vries, CWI, Amsterdam, Netherlands
Emine Yilmaz, Microsoft Research, Cambridge, United Kingdom
ABSTRACT
We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who are topic originators and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who are those who did not define topics and are not experts in the task. Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, due to changes in the relative performance of assessed systems, and gold standard judges are preferred.
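The two comparisons described in the abstract, agreement between judges and stability of system rankings, can be illustrated with a minimal sketch. The abstract does not name the statistics used, so the choice of Cohen's kappa for judge agreement and Kendall's tau for ranking similarity is an assumption made here for illustration. The function names, judgement lists, and system scores are all hypothetical and are not data from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judges over the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of documents on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each judge's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two score dicts keyed by system name."""
    concordant = discordant = 0
    for s, t in combinations(sorted(scores_x), 2):
        product = (scores_x[s] - scores_x[t]) * (scores_y[s] - scores_y[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(scores_x) * (len(scores_x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical binary relevance judgements (1 = relevant) on eight documents.
gold = [1, 1, 0, 0, 1, 0, 1, 0]
bronze = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(gold, bronze)

# Hypothetical effectiveness scores for three systems under each judgement set.
scores_gold = {"A": 0.40, "B": 0.35, "C": 0.30}
scores_bronze = {"A": 0.38, "B": 0.29, "C": 0.31}
tau = kendall_tau(scores_gold, scores_bronze)
```

In this toy data the judges agree on six of eight documents, yet systems B and C swap places under the bronze judgements. That is the kind of change in relative system performance the abstract reports.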