Including summaries in system evaluation
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Boston, MA, USA
SESSION: Evaluation and measurement II
Pages 508-515  
Year of Publication: 2009
ISBN: 978-1-60558-483-6
Authors
Andrew Turpin  RMIT University, Melbourne, Australia
Falk Scholer  RMIT University, Melbourne, Australia
Kalervo Järvelin  University of Tampere, Tampere, Finland
Mingfang Wu  RMIT University, Melbourne, Australia
J. Shane Culpepper  RMIT University, Melbourne, Australia
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1571941.1572029

ABSTRACT

In batch evaluation of retrieval systems, performance is calculated based on predetermined relevance judgements applied to a list of documents returned by the system for a query. This evaluation paradigm, however, ignores the current standard operation of search systems, which require the user to view summaries of documents prior to reading the documents themselves.

In this paper we modify the popular IR metrics MAP and P@10 to incorporate the summary reading step of the search process, and study the effects on system rankings using TREC data. Based on a user study, we establish likely disagreements between relevance judgements of summaries and of documents, and use these values to seed simulations of summary relevance in the TREC data. Re-evaluating the runs submitted to the TREC Web Track, we find the average correlation between system rankings and the original TREC rankings is 0.8 (Kendall τ), which is lower than commonly accepted for system orderings to be considered equivalent. The system that has the highest MAP in TREC generally remains amongst the highest MAP systems when summaries are taken into account, but other systems become equivalent to the top ranked system depending on the simulated summary relevance.
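
The idea of folding summary judgements into P@10 and MAP, and then comparing system orderings with Kendall τ, can be illustrated with a small sketch. The abstract does not give the paper's exact metric definitions, so the rule used here is an assumption made only for illustration: a document contributes to the score only when both its summary and the document itself are judged relevant. All function and variable names are hypothetical.

```python
from itertools import combinations


def is_hit(doc, doc_rel, summary_rel=None):
    """True if doc counts as relevant. ASSUMPTION: when summary judgements
    are supplied, both the summary and the document must be relevant."""
    relevant = bool(doc_rel.get(doc, False))
    if summary_rel is not None:
        relevant = relevant and bool(summary_rel.get(doc, False))
    return relevant


def precision_at_k(ranking, doc_rel, k=10, summary_rel=None):
    """Fraction of the top k ranked documents that count as relevant."""
    return sum(is_hit(d, doc_rel, summary_rel) for d in ranking[:k]) / k


def average_precision(ranking, doc_rel, summary_rel=None):
    """Average precision for one query, normalised by the number of
    documents judged relevant; MAP is the mean of this over queries."""
    total_relevant = sum(1 for r in doc_rel.values() if r)
    if total_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if is_hit(doc, doc_rel, summary_rel):
            hits += 1
            score += hits / rank
    return score / total_relevant


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {system: score} mappings over the same
    systems; tied pairs count as neither concordant nor discordant."""
    systems = list(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs


# Toy data, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4"]
doc_rel = {"d1": True, "d2": False, "d3": True, "d4": True}
summary_rel = {"d1": True, "d2": True, "d3": False, "d4": True}

print(precision_at_k(ranking, doc_rel, k=4))
print(precision_at_k(ranking, doc_rel, k=4, summary_rel=summary_rel))
print(average_precision(ranking, doc_rel, summary_rel=summary_rel))
```

In this toy case d3 is relevant but its summary is not, so it stops counting once summaries are considered. Computing such scores for every system and passing the two resulting score tables to kendall_tau gives a correlation of the kind reported above.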

Given that system orderings alter when summaries are taken into account, the small amount of effort required to judge summaries in addition to documents (19 seconds vs 88 seconds on average in our data) should be undertaken when constructing test collections.


