|
|||||||||||||||||||||
|
|||||||||||||||||||||
ABSTRACT
A common practice in comparative evaluation of information retrieval (IR) systems is to create a test collection comprising a set of topics (queries), a document corpus, and relevance judgments, and to monitor the performance of retrieval systems over such a collection. A typical evaluation of a system involves computing a performance metric, e.g., Average Precision (AP), for each topic and then using the average performance metric, e.g., Mean Average Precision (MAP) to express the overall system performance. However, averages do not capture all the important aspects of system performance, and used alone may not thoroughly express system effectiveness, i.e., average of performance can mask large variance in individual topic effectiveness. The author hypothesis is that, in addition to the average of overall performance, attention needs to be paid to how a system performance varies across topics. This variability can be measured by calculating the standard deviation (SD) of individual performance scores. We refer to this performance variation as Volatility. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references. INDEX TERMS
Primary Classification:
Additional Classification:
General Terms:
|
|||||||||||||||||||||