Posted on 1977-01-01, 00:00. Authored by Wei-Hao Lin, Alexander Hauptmann.
Evaluating retrieval systems in a controlled environment
with a large set of topics has been the core paradigm in
the information retrieval community. Voorhees and Buckley
proposed estimating the reliability of retrieval experiments
by calculating the probability of making wrong effectiveness
judgments between two retrieval systems over two retrieval
experiments [2]; this paper calls that probability the Retrieval
Experiment Error Rate (REER). They showed how topic set size
affects the reliability of retrieval experiments.
However, the REER model in the previous work was justified
empirically, without a derivation based on statistical
principles. We fill this gap and show that REER can indeed be
derived from statistical principles. Based on the derived model,
we explain why a successful experiment design depends on
factors including a sufficient number of topics, a large enough
difference in measurement scores between systems, and a
homogeneous distribution of retrieval scores across topics and
systems, which reduces the variance of the score differences.
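
The three factors can be illustrated with a small simulation. The sketch below is not the derivation from the paper; it is a minimal illustration under a simplifying assumption of our own, namely that per-topic score differences between two systems are independent draws from a normal distribution. The function names and the parameters `mean_diff` and `sd_diff` are hypothetical. It estimates how often two independent experiments disagree on which system is better, and compares that estimate with the closed form that follows from the normality assumption.

```python
import random
from statistics import NormalDist


def simulated_disagreement_rate(n_topics, mean_diff, sd_diff, trials=5000, seed=0):
    """Monte Carlo estimate of how often two experiments rank two systems differently.

    Assumption (ours, for illustration): each per-topic score difference is an
    independent draw from Normal(mean_diff, sd_diff).
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        verdicts = []
        for _ in range(2):  # two experiments, each with its own topic set
            total = sum(rng.gauss(mean_diff, sd_diff) for _ in range(n_topics))
            verdicts.append(total > 0)  # True if system A looks better
        if verdicts[0] != verdicts[1]:
            disagreements += 1
    return disagreements / trials


def analytic_disagreement_rate(n_topics, mean_diff, sd_diff):
    """Closed form under the same normality assumption.

    A single experiment favours system A with probability
    p = Phi(mean_diff * sqrt(n_topics) / sd_diff), so two independent
    experiments disagree with probability 2 * p * (1 - p).
    """
    p = NormalDist().cdf(mean_diff * n_topics ** 0.5 / sd_diff)
    return 2 * p * (1 - p)


if __name__ == "__main__":
    for n in (5, 25, 50):
        print(n,
              round(simulated_disagreement_rate(n, 0.02, 0.1), 3),
              round(analytic_disagreement_rate(n, 0.02, 0.1), 3))
```

Under this assumption the disagreement rate falls as the number of topics grows, as the mean score difference grows, or as the standard deviation of the differences shrinks, which matches the three factors listed above.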