Carnegie Mellon University

Improved audio features for large-scale multimedia event detection

journal contribution
posted on 2014-07-01, 00:00 authored by Florian Metze, Shourabh Rawat, Yipei Wang

In this paper, we present recent experiments on using Artificial Neural Networks (ANNs), a new “delayed” approach to speech vs. non-speech segmentation, and the extraction of large-scale pooling features (LSPF) for detecting “events” in consumer videos, using the audio channel only. An “event” is defined as a sequence of observations in a video that can be directly observed or inferred. Ground truth is given by a semantic description of the event and by a number of example videos. We describe and compare several algorithmic approaches, and report results on the 2013 TRECVID Multimedia Event Detection (MED) task, using arguably the largest such research set currently available. The presented system achieved the best results in most audio-only conditions. While the overall finding is that MFCC features perform best, we find that ANN as well as LSP features provide complementary information at various levels of temporal resolution. This paper provides an analysis of both low-level and high-level features, investigating their relative contributions to overall system performance.
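As an illustration of the kind of low-level audio features discussed in the abstract, the sketch below computes frame-level MFCCs from a video's audio track and mean-pools them over fixed-length windows. It is a minimal sketch only, not the system described in the paper: the use of librosa, the 16 kHz sampling rate, the 13 coefficients, and the 2-second pooling window are illustrative assumptions.

import numpy as np
import librosa

def pooled_mfcc_features(audio_path, sr=16000, n_mfcc=13, window_s=2.0):
    """Compute MFCCs and mean-pool them over fixed-length windows.

    Hypothetical illustration of low-level audio features for event
    detection; all parameter values are assumptions, not taken from
    the paper.
    """
    # Load the audio track of a clip, resampled to a fixed rate.
    y, sr = librosa.load(audio_path, sr=sr)
    # Frame-level MFCCs: shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Number of MFCC frames that fall into one pooling window
    # (512 samples is librosa's default hop length).
    hop_s = 512 / sr
    frames_per_window = max(1, int(window_s / hop_s))
    # Mean-pool consecutive frames to get one vector per window.
    pooled = [mfcc[:, i:i + frames_per_window].mean(axis=1)
              for i in range(0, mfcc.shape[1], frames_per_window)]
    return np.stack(pooled)  # shape (n_windows, n_mfcc)

A clip-level event classifier could then be trained on such pooled vectors, or on statistics aggregated over them, which is the general role low-level features play in the system described above.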

History

Publisher Statement

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Date

2014-07-01
