ABSTRACT
The power of a statistical test specifies the sample size required to reliably detect a given true effect. In IR evaluation, the power corresponds to the number of topics that are likely to be sufficient to detect a certain degree of superiority of one system over another. To predict the power of a test, one must estimate the variability of the population being sampled from; here, of between-system score deltas. This paper demonstrates that basing such an estimation either on previous experience or on trial experiments leaves wide margins of error. Iteratively adding more topics to the test set until power is achieved is more efficient; however, we show that it leads to a bias in favour of finding both power and significance. A hybrid methodology is proposed, and the reporting requirements of the experimenter using this methodology are laid out. We also demonstrate that greater statistical power is achieved for the same relevance assessment effort by evaluating a large number of topics shallowly than a small number deeply.
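As an illustration of why the variability estimate matters, the following minimal sketch computes the number of topics needed to detect a given mean score delta between two systems. It is not the paper's procedure: it assumes a two-sided paired test under a normal approximation, and the function name `topics_needed`, its parameters, and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def topics_needed(delta, sd, alpha=0.05, power=0.8):
    """Approximate topic count for a two-sided paired test.

    Assumption (not from the paper): normal approximation,
    n = ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2,
    where `delta` is the true mean between-system score difference
    and `sd` is the standard deviation of per-topic score deltas.
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for significance level
    z_power = z.inv_cdf(power)          # quantile for the target power
    return math.ceil(((z_alpha + z_power) * sd / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical values: the required topic count grows with the
    # square of the assumed standard deviation of score deltas.
    for sd in (0.10, 0.15, 0.20):
        print(sd, topics_needed(delta=0.05, sd=sd))
```

Because the required number of topics scales with the square of the assumed standard deviation, a modest error in that estimate translates into a large error in the predicted test-set size, which is consistent with the wide margins of error described above.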