Learning to Learn for Small Sample Visual Recognition
Understanding how humans and machines recognize novel visual concepts from few examples remains a fundamental challenge. Humans are remarkably able to grasp a new concept and make meaningful generalization from just few examples. By contrast, state-ofthe- art machine learning techniques and visual recognition systems typically require thousands of training examples and often break down if the training sample set is too small. This dissertation aims to endow visual recognition systems with low-shot learning ability, so that they learn consistently well on data of different sample sizes. Our key insight is that the visual world is well structured and highly predictable not only in data and feature spaces but also in task and model spaces. Such structures and regularities enable the systems to learn how to learn new recognition tasks rapidly by reusing previous experiences. This philosophy of learning to learn, or meta-learning, is one of the underlying tenets towards versatile agents that can continually learn a wide variety of tasks throughout their lifetimes. In this spirit, we address key technical challenges and explore complementary perspectives. We begin by learning from extremely limited data (e.g., one-shot learning). We cast the problem as supervised knowledge distillation and explore structures within model pairs. We introduce a meta-network that operates on the space of model parameters and encodes a generic transformation from “student” models learned from few samples to “teacher” models learned from large enough sample sets. By learning a series of transformations as more training data is gradually added, we further capture a notion of model dynamics to facilitate long-tail recognition with categories of different sample sizes. Moreover, by viewing the meta-network as an effective model adaptation strategy, we combine it with learning a generic model initialization and extend the use in few-shot human motion prediction tasks. To further decouple a recognition model from ties to a specific set of categories, we introduce self-supervision using meta-data. We expose the model to a large amount of unlabeled real-world images through an unsupervised meta-training phase. By learning diverse sets of low-density separators across auxiliary pseudo-classes, we capture a more generic, richer description of the visual world. Since they are informative across different categories, we alternatively use the low-density separators to constitute an “off-the-shelf” library as external memory, enabling generation of new models on-the-fly for a variety of tasks, including object detection, hypothesis transfer learning, domain adaptation, and image retrieval. By doing so, we have essentially leveraged structures within a large collection of models. We them move on to learning from a medium sized number of examples and explore structures within an evolving model when learning from continuously changing data streams and tasks. We rethink the dominant knowledge transfer paradigm that fine-tunes a fixedsize pre-trained model on new labeled target data. Inspired by developmental learning, we progressively grow a convolutional neural network with increased model capacity, which significantly outperforms classic fine-tuning approaches. Furthermore, we address unsupervised fine-tuning by transferring knowledge from a discriminative to a generative model on unlabeled target data. We thus make progress towards a lifelong learning process. From a different perspective, humans can imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts and leveraging joint structures in both data and task spaces might help recognition systems perform better low-shot learning. We then combine a meta-learner with a “hallucinator” that produces additional training examples, and optimize both models jointly, leading to significant performance gains. Finally, combining these approaches, we suggest a broader picture of learning to learn predictive structures through exploration and exploitation.