Contextual Recurrent Level Set Networks and Recurrent Residual Networks for Semantic Labeling
Semantic labeling is becoming more and more popular among researchers in computer vision and machine learning. Many applications, such as autonomous driving, tracking, indoor navigation, augmented reality systems, semantic searching, medical imaging are on the rise, requiring more accurate and efficient segmentation mechanisms. In recent years, deep learning approaches based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have dramatically emerged as the dominant paradigm for solving many problems in computer vision and machine learning. The main focus of this thesis is to investigate robust approaches that can tackle the challenging semantic labeling tasks including semantic instance segmentation and scene understanding. In the first approach, we convert the classic variational Level Set method to a learnable deep framework by proposing a novel definition of contour evolution named Recurrent Level Set (RLS). The proposed RLS employs Gated Recurrent Units to solve the energy minimization of a variational Level Set functional. The curve deformation processes in RLS is formulated as a hidden state evolution procedure and is updated by minimizing an energy functional composed of fitting forces and contour length. We show that by sharing the convolutional features in a fully end-to-end trainable framework, RLS is able to be extended to Contextual Recurrent Level Set (CRLS) Networks to address semantic segmentation in the wild problem. The experimental results have shown that our proposed RLS improves both computational time and segmentation accuracy against the classic variational Level Set-based methods whereas the fully end-to-end system CRLS achieves competitive performance compared to the state-of-the-art semantic segmentation approaches on PAS CAL VOC 2012 and MS COCO 2014 databases. The second proposed approach, Contextual Recurrent Residual Networks (CRRN), inherits all the merits of sequence learning information and residual learning in order to simultaneously model long-range contextual infor- mation and learn powerful visual representation within a single deep network. Our proposed CRRN deep network consists of three parts corresponding to sequential input data, sequential output data and hidden state as in a recurrent network. Each unit in hidden state is designed as a combination of two components: a context-based component via sequence learning and a visualbased component via residual learning. That means, each hidden unit in our proposed CRRN simultaneously (1) learns long-range contextual dependencies via a context-based component. The relationship between the current unit and the previous units is performed as sequential information under an undirected cyclic graph (UCG) and (2) provides powerful encoded visual representation via residual component which contains blocks of convolution and/or batch normalization layers equipped with an identity skip connection. Furthermore, unlike previous scene labeling approaches [1, 2, 3], our method is not only able to exploit the long-range context and visual representation but also formed under a fully-end-to-end trainable system that effectively leads to the optimal model. In contrast to other existing deep learning networks which are based on pretrained models, our fully-end-to-end CRRN is completely trained from scratch. The experiments are conducted on four challenging scene labeling datasets, i.e. SiftFlow, CamVid, Stanford background, and SUN datasets, and compared against various state-of-the-art scene labeling methods.