Posted on 1987-01-01, 00:00. Authored by Yu-Gang Jiang, Jun Yang, Chong-Wah Ngo, Alexander Hauptmann
Using local keypoints extracted as salient
image patches, an image can be described as a “bag of visual words”
(BoW), a representation that has shown promise
for object and scene classification. The performance of BoW
features in semantic concept detection for large-scale multimedia
databases is subject to various representation choices. In this
paper, we conduct a comprehensive study on the representation
choices of BoW, including vocabulary size, weighting scheme,
stop word removal, feature selection, spatial information, and
visual bi-gram. We offer practical insights into how to optimize
the performance of BoW by making appropriate representation
choices. For the weighting scheme, we elaborate a soft-weighting
method to assess the significance of a visual word to an image.
We show experimentally that soft-weighting outperforms
other popular weighting schemes such as TF-IDF by a large
margin. Our extensive experiments on TRECVID data sets also
indicate that the BoW feature alone, with appropriate representation
choices, already produces highly competitive concept detection
performance. Based on our empirical findings, we further apply
our method to detect a large set of 374 semantic concepts. The
detectors, as well as the features and detection scores on several
recent benchmark data sets, are released to the multimedia
community.
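
To make the contrast between hard assignment and soft weighting concrete, here is a minimal sketch. The abstract does not give the formula, so everything here is an assumption for illustration, not the authors' exact method: each keypoint contributes to its `top_n` nearest visual words, the contribution of the i-th nearest word is halved at each rank, and similarity is taken as `1 / (1 + distance)`. The function name `bow_histograms` and its parameters are hypothetical.

```python
import math


def bow_histograms(descriptors, vocabulary, top_n=4):
    """Return (hard, soft) bag-of-visual-words histograms for one image.

    descriptors: list of keypoint descriptors (sequences of floats)
    vocabulary:  list of visual-word centroids of the same dimension
    top_n:       number of nearest words each keypoint votes for (assumed)
    """
    num_words = len(vocabulary)
    top_n = min(top_n, num_words)
    hard = [0.0] * num_words
    soft = [0.0] * num_words

    for descriptor in descriptors:
        # Euclidean distance from this keypoint to every visual word.
        dists = [math.dist(descriptor, word) for word in vocabulary]
        # Indices of the top_n closest words, nearest first.
        ranked = sorted(range(num_words), key=lambda k: dists[k])[:top_n]

        # Hard assignment: one full vote for the single nearest word.
        hard[ranked[0]] += 1.0

        # Soft weighting (assumed form): a similarity-scaled vote for each
        # of the top_n words, halved at every step down the ranking.
        for rank, k in enumerate(ranked):
            similarity = 1.0 / (1.0 + dists[k])  # illustrative choice
            soft[k] += similarity / (2 ** rank)

    return hard, soft


if __name__ == "__main__":
    keypoints = [[0.0, 0.0], [1.0, 0.0]]
    words = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
    hard, soft = bow_histograms(keypoints, words, top_n=2)
    print("hard:", hard)
    print("soft:", soft)
```

In this sketch a keypoint lying between two visual words contributes to both, whereas hard assignment discards that ambiguity. Real pipelines would use high-dimensional descriptors and a vocabulary learned by clustering.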