Carnegie Mellon University

Towards Usable Multimedia Event Detection

Thesis by Zhenzhong Lan

We often come across events on our daily commute, such as a traffic jam, a person running a red light, or an ambulance approaching. These are complex events that humans can effortlessly recognize and react to appropriately. Enabling computers to recognize complex events as reliably as humans do would facilitate many important applications, such as self-driving cars, smart security systems, and elderly care systems. However, existing computer vision and multimedia research focuses mainly on detecting elementary visual concepts (for example, actions, objects, and scenes). Such detection alone is generally insufficient for decision making. Hence there is a pressing need for complex event detection systems, and much more research emphasis should be placed on developing them.

Compared to elementary visual concept detection, complex event detection is much more difficult in terms of representing both the task and the data that describe it. Unlike elementary visual concepts, complex events are higher-level abstractions over longer temporal spans, and they have richer content with more dramatic variations. The web videos that depict these events are generally much larger in size, noisier in content, and sparser in labels than the images used for concept detection research. Thus, complex event detection introduces several novel research challenges that have not been sufficiently studied in the literature. In this dissertation, we propose a set of algorithms to address these challenges. Together, they enable us to build a multimedia event detection (MED) system that is practically useful for complex event detection.

The proposed algorithms significantly improve the accuracy and speed of our MED system by addressing the aforementioned challenges. For example, our new data augmentation step and our new way of integrating multi-modal information significantly reduce the impact of large event variation; our two-stage Convolutional Neural Network (CNN) training method allows us to obtain in-domain CNN features from noisy labels; and our new feature smoothing technique offers a thorough solution to the problem of noisy, uninformative background content dominating the video representations, among other improvements.
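As a rough illustration of the two-stage training idea (a minimal sketch, not the exact pipeline developed in this dissertation), the snippet below pre-trains a frame-level CNN on a large, noisily labelled web corpus and then fine-tunes it on smaller in-domain event data; the backbone, dummy data loaders, class counts, and hyperparameters are all placeholder assumptions.

```python
# Minimal sketch of a two-stage CNN training scheme, assuming PyTorch/torchvision.
# Stage 1 pre-trains on noisy web labels; stage 2 fine-tunes on in-domain event data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def dummy_loader(n_samples: int, n_classes: int) -> DataLoader:
    # Stand-in for real frame data; replace with actual video-frame loaders.
    x = torch.randn(n_samples, 3, 224, 224)
    y = torch.randint(0, n_classes, (n_samples,))
    return DataLoader(TensorDataset(x, y), batch_size=8, shuffle=True)

def train(model: nn.Module, loader: DataLoader, epochs: int, lr: float) -> nn.Module:
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(frames), labels)
            loss.backward()
            opt.step()
    return model

# Stage 1: learn generic visual features from a large, noisily labelled web corpus.
model = models.resnet18(weights=None)                 # any frame-level CNN would do
model.fc = nn.Linear(model.fc.in_features, 1000)      # hypothetical web-label vocabulary
model = train(model, dummy_loader(64, 1000), epochs=1, lr=0.01)

# Stage 2: swap the classifier head and fine-tune on in-domain event data with a
# smaller learning rate; the penultimate layer then serves as the in-domain feature.
model.fc = nn.Linear(model.fc.in_features, 20)        # e.g. 20 MED event classes
model = train(model, dummy_loader(32, 20), epochs=1, lr=0.001)
```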

We have implemented most of the proposed methods in the CMU-Elamp system, and they are among the major reasons for its leading performance in the TRECVID MED competitions from 2011 to 2015, the most representative evaluation for MED. Our governing aim, however, has been to uncover enduring insights that can be widely applied. Given the complexity of our task and the significance of these improvements, we believe that our algorithms and the lessons derived from them generalize to other tasks. Indeed, our methods have already been used by other researchers on tasks such as medical video analysis and image segmentation.

History

Date

2017-05-11

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Dr. Alexander G. Hauptmann
