An In-Depth Comparison of Keyword Specific Thresholding and Sum-to-One Score Normalization
The quality of a spoken term detection (STD) system depends critically on the choice of a “thresholding” function, which decides from a candidate detection's score whether or not to output it. In the context of the IARPA Babel program and the NIST OpenKWS evaluation series, the penalty for missing an occurrence depends on the frequency of the keyword, so it is desirable either to apply different thresholds to different keywords or to normalize the scores before applying a global threshold. This paper compares two widely used thresholding algorithms, keyword specific thresholding (KST) and sum-to-one score normalization (STO), analyzes the difference in their performance in detail, and recommends the use of the “estimated KST” algorithm.
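To make the two strategies concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes the term-weighted value (TWV) metric with false-alarm weight beta = 999.9, assumes detection scores are positive posterior-like values in (0, 1], and estimates the number of true occurrences of a keyword as the sum of its detection scores. The function names, the `gamma` exponent, and the `speech_seconds` parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

BETA = 999.9  # assumed TWV false-alarm weight


def kst_threshold(n_true_est, speech_seconds, beta=BETA):
    """Keyword-specific threshold for an estimated occurrence count.

    A detection with posterior p is worth emitting when its expected gain
    p / N_true exceeds its expected cost (1 - p) * beta / (T - N_true),
    which rearranges to p > beta * N_true / (T + (beta - 1) * N_true).
    """
    return beta * n_true_est / (speech_seconds + (beta - 1.0) * n_true_est)


def estimated_kst_decisions(detections, speech_seconds, beta=BETA):
    """Return one keep/drop decision per (keyword, score) pair.

    Assumption: N_true for each keyword is estimated as the sum of the
    scores of all its candidate detections.
    """
    expected = defaultdict(float)
    for keyword, score in detections:
        expected[keyword] += score
    return [
        score >= kst_threshold(expected[keyword], speech_seconds, beta)
        for keyword, score in detections
    ]


def sum_to_one(detections, gamma=1.0):
    """Rescale scores so that each keyword's scores sum to one.

    The normalized scores can then be compared against a single global
    threshold. gamma is an assumed optional sharpening exponent.
    """
    totals = defaultdict(float)
    for keyword, score in detections:
        totals[keyword] += score ** gamma
    return [(keyword, score ** gamma / totals[keyword])
            for keyword, score in detections]
```

In this sketch both methods use the same per-keyword quantity, the sum of the scores. KST turns it into a threshold for that keyword, whereas STO divides each score by it and leaves the choice of a global threshold to a later step.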