ABSTRACT
A methodology based on "information nuggets" has recently emerged as the de facto standard by which answers to complex questions are evaluated. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that provide a method to quantify scoring variations. Understanding these variations in turn lets researchers be more confident in their comparisons of systems.