Robust Pose Invariant Face Recognition Using 3D Thin Plate Spline Spatial Transformer Networks
In recent years, face recognition has advanced with incredible speed thanks to the advent of deep learning, large-scale datasets, and improvements in GPU computing. While many of these methods claim to be able to match faces from images captured in the wild, they still seem to perform poorly when matching non-frontal faces to frontal ones, which is the practical scenario faced by law enforcement every day in the processing of criminal cases. Learning these large pose variations implicitly is a very hard problem, both from a deep neural network modeling perspective and because of the lack of structured datasets for training and evaluating these models. Because existing datasets are often made up of celebrity images found online, both the training and the evaluation sets carry a large bias in the types of images they contain. Perhaps the largest bias is in the distribution of the pose of the faces. Most celebrity images are captured from a frontal or near-frontal view, which has traditionally been the easiest pose for face recognition. Most importantly, because both training and evaluation datasets share this bias, artificially high results have been reported. The goal of this thesis is to design a system that takes advantage of the large amount of data already available while still performing robust face recognition across large pose variations. We propose that the most efficient way to do this is to transform and reduce the entire pose distribution to just the frontal faces by re-rendering the off-angle faces from a frontal viewpoint. By doing this, the mismatch between the training, evaluation, and real-world multi-modal distributions on pose will be eliminated. To solve this problem, we must explicitly understand and model the 3D structure of faces, since faces are not planar objects.
This 3D model of the face must be generated from a single 2D image, since that is all that is usually available in a recognition scenario. This is also the hardest scenario, and it is often sidestepped by using temporal fusion to perform some kind of data reconstruction. By improving the performance of the models in this worst-case scenario, we maintain high accuracy on single images and can always improve further by utilizing temporal information later. To achieve this, we first design a new method of 3D facial alignment and modeling from a single 2D image using our 3D Thin Plate Spline Spatial Transformer Networks (3DTPS-STN). We evaluate this method against several previous methods on the Annotated Facial Landmarks in the Wild (AFLW) dataset and the synthetic AFLW2000-3D dataset and show that our method achieves very high performance on both at a much faster speed. Using the implicit knowledge of pose extracted from the 3D modeling, we also confirm the intuition that most recognition datasets in use have a heavy bias towards frontal faces. We then show how the 3D models created by the 3DTPS-STN method can be used to frontalize the face from any angle and, by a careful selection of the face region, generate a more stable face image across all poses. Finally, we train a 28-layer ResNet, a common face recognition framework, on these faces, show that this model can outperform all comparable models on the CMU Multi-PIE dataset, and present a detailed analysis on other datasets.