Towards Object Detection in the Real World
Object detection is one of the most fundamental tasks in computer vision. It aims to localize and classify instances of semantic objects of certain classes in digital images. Object detection serves as a crucial step for many downstream vision tasks such as action recognition, face analysis, instance segmentation, object re-identification, and retail scene understanding. It has therefore been carefully studied by the computer vision community for decades. Thanks to advances in deep neural networks and well-annotated, challenging datasets, object detection algorithms have greatly improved. However, object detectors are still far from robust when deployed in real-world AI applications. Their performance can drop dramatically under the challenging conditions introduced by the varying nature of real-world data. We summarize most of this variation in three aspects: appearance variation, scale variation, and availability variation. In extreme cases where multiple variations co-exist, the failure of an object detector may even crash the entire AI system. The focus of this thesis is to develop solutions that address these three types of data variation. For appearance variation, we study the effect of context information on the detection of the human face, one of the most common objects. We propose an explicit contextual reasoning module for the detection network to capture the local information surrounding the face. For scale variation, we start with the anchor-based formulation of object detection, in which we theoretically investigate the anchor-object matching mechanism. This inspires us to propose several better designs of robust anchors. We then discover the inherent limitations of anchor-based detection, which leads us to reformulate detection from an anchor-free perspective. We propose advanced techniques for dynamic feature selection, following the principle that less is more. For availability variation, we address the inherent long-tail distribution of real-world data by studying object detection in the few-shot setting, in which rare classes have only a few annotated objects available while common classes dominate the dataset with abundant labeled samples. Given the limited visual information for the rare classes, we propose semantic relation reasoning with prior knowledge from natural language, which takes advantage of the constant relationship between common and rare classes regardless of data availability. We thoroughly analyze the effect of the proposed techniques through experiments on challenging real-world datasets such as WiderFace, VOC, and COCO. Comparisons with previous state-of-the-art methods demonstrate the superiority of our methods.
History
Date
2021-08-20
Degree Type
- Dissertation
Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)