Relevance assessment: are judges exchangeable and does it matter?
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Singapore, Singapore
SESSION: Evaluation--2
Pages 667-674
Year of Publication: 2008
ISBN:978-1-60558-164-4
Authors
Peter Bailey  Microsoft, Redmond, WA, USA
Nick Craswell  Microsoft, Cambridge, United Kingdom
Ian Soboroff  NIST, Gaithersburg, MD, USA
Paul Thomas  CSIRO, Canberra, Australia
Arjen P. de Vries  CWI, Amsterdam, Netherlands
Emine Yilmaz  Microsoft Research, Cambridge, United Kingdom
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390334.1390447

ABSTRACT

We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who are topic originators and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who are those who did not define topics and are not experts in the task.

Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, due to changes in the relative performance of assessed systems, and gold standard judges are preferred.
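
The two comparisons described above, agreement between judges and stability of system rankings across assessment sets, can be illustrated with a minimal Python sketch. The abstract does not name the statistics used, so the choice of Cohen's kappa for agreement and Kendall's tau for ranking correlation is an assumption. The judgement lists, system names, and scores below are hypothetical values made up for illustration, not data from the paper.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of labels from two judges."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Fraction of documents on which the two judges gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each judge's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two dicts mapping system name to score.

    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    systems = sorted(scores_x)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_x[s] - scores_x[t]) * (scores_y[s] - scores_y[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical binary relevance judgements on the same eight documents.
gold_labels = [1, 1, 0, 0, 1, 0, 1, 0]
bronze_labels = [1, 0, 0, 0, 1, 1, 1, 0]

# Hypothetical effectiveness scores for four systems under each judgement set.
gold_scores = {"A": 0.40, "B": 0.35, "C": 0.30, "D": 0.20}
bronze_scores = {"A": 0.38, "B": 0.39, "C": 0.28, "D": 0.22}

kappa = cohen_kappa(gold_labels, bronze_labels)
tau = kendall_tau(gold_scores, bronze_scores)
```

In this made-up example the judges agree on six of eight documents, yet one pair of systems (A and B) swaps order under the bronze scores. This is the kind of change in relative performance the abstract refers to, shown here only as an illustration.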



