
Leveraging Context for Multi-Label Action Recognition and Detection in Video

Posted on 05.11.2020, 20:33 by Joao Antunes Martins
This thesis addresses video-based multi-person, multi-label, spatiotemporal action detection and recognition. This is a challenging problem because each person can perform several actions at the same time (e.g., talking and walking) while other actors simultaneously perform different actions. We claim that these are problems where the use of contextual information (e.g., semantic descriptions of the scene) may lead to significant performance improvements. In this work, we develop several approaches to tackle this problem and validate them on challenging datasets. We propose a framework to integrate and test multiple sources of contextual information in video-based multi-person, multi-label, spatiotemporal action detection and recognition. We highlight six contributions, collected in three publications (at different stages of publication at the time of this writing).

The first contribution is the proposed Multisource Video Classification (MVC) framework, which allows the combination of several sources of context information, of which we consider four types: actor-centric input filtering (a way to focus attention on the actor under analysis while still gathering appearance information from the neighborhood), semantic neighbor context (a way to inform the model of the actions performed by nearby agents), object detection (how objects interacting with the actor can inform about its action), and pose data (how high-level features extracted from the actor can help the classification process). The second contribution is a foveated approach to actor-centric filtering for input selection that weights the appearance information in a decreasing way, from the center to the periphery of the actor bounding box. The third contribution is a novel encoding for the semantic neighbor context and its custom classifier with spatial and temporal dependence. The fourth is a custom Hybrid Sigmoid-Softmax loss function for the multi-class/multi-label case, which combines the cross-entropy loss typical of multi-class problems with the sum-of-sigmoids loss used in the multi-label case. The fifth is the application of the developed methods to a challenging dataset with a large number of videos with multiple agents performing multiple actions, with 80 heterogeneous and highly unbalanced classes. To allow research with reasonable computing power, we have created mini-AVA, a partition of AVA that maintains temporal continuity and class distribution at only one tenth of the dataset size. The sixth contribution is a collection of ablation studies on alternative actor-centric filters and semantic neighbor context classifiers.

From this research we achieve a relative mAP improvement of 18.8% using our foveated actor-centric filtering, a relative mAP improvement of 5% using our semantic neighbor context embedding and models, and a relative mAP improvement of 12.6% using our custom Hybrid Sigmoid-Softmax loss.
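The foveated actor-centric filtering described above can be illustrated with a minimal sketch: a weight mask centered on the actor bounding box that decays from the center toward the periphery. The Gaussian fall-off, the `sigma_scale` parameter, and the function name are illustrative assumptions; the thesis's exact weighting profile may differ.

```python
import numpy as np

def foveated_filter(frame, box, sigma_scale=0.5):
    """Weight pixel intensities with a Gaussian centered on the actor box.

    frame: (H, W, C) float array; box: (x1, y1, x2, y2) actor bounding box.
    Appearance is kept strongest at the box center and fades toward the
    periphery, so some neighborhood context is retained. Illustrative
    sketch only; the thesis's exact fall-off may differ.
    """
    H, W = frame.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Spread proportional to box size: larger actors keep a wider fovea.
    sx = sigma_scale * max(x2 - x1, 1.0)
    sy = sigma_scale * max(y2 - y1, 1.0)
    ys, xs = np.mgrid[0:H, 0:W]
    weights = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return frame * weights[..., None]
```

For example, applying the filter to an all-ones frame leaves the actor's center untouched while attenuating distant pixels, which is the "focus attention but keep neighborhood appearance" behavior the abstract describes.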
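The Hybrid Sigmoid-Softmax loss combines a softmax cross-entropy term (multi-class) with a sum-of-sigmoids binary cross-entropy term (multi-label). A plausible sketch is a weighted combination of the two; the weighting scheme `alpha`, the normalization of the cross-entropy term, and the function name are assumptions, not the thesis's exact formulation.

```python
import numpy as np

def hybrid_sigmoid_softmax_loss(logits, targets, alpha=0.5):
    """Blend softmax cross-entropy with a sum of per-class sigmoid losses.

    logits: (C,) raw scores; targets: (C,) multi-hot labels in {0, 1}.
    alpha weighs the multi-class term against the multi-label term.
    Illustrative sketch; the thesis's combination rule may differ.
    """
    # Softmax cross-entropy averaged over the positive classes (multi-class term).
    z = logits - logits.max()                    # shift for numerical stability
    log_softmax = z - np.log(np.exp(z).sum())
    ce = -(targets * log_softmax).sum() / max(targets.sum(), 1.0)

    # Sum-of-sigmoids binary cross-entropy (multi-label term).
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps)).sum()

    return alpha * ce + (1.0 - alpha) * bce
```

Setting `alpha=1.0` recovers a pure multi-class loss and `alpha=0.0` a pure multi-label loss, so the hybrid interpolates between the two regimes.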




Department

Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Daniel P. Siewiorek
  • Asim Smailagic
  • Alexandre Bernardino
