Carnegie Mellon University

Visual Knowledge Learning

thesis
posted on 2022-12-02, 20:45, authored by Xinlei Chen

Understanding images requires rich background knowledge that is rarely written down and is hard for current computers to acquire. The traditional approach to overcoming this lack of knowledge in computer vision has been to summarize it manually in the form of labels or annotations. While such efforts are impressive, they suffer from two critical issues when applied to recognition tasks: scalability and usefulness.

This Ph.D. thesis has made progress toward solving both issues. Instead of manually labeling everything, we develop systems and approaches that teach computers visual knowledge in a more automatic and scalable way. Specifically, we let them learn by looking at images returned by web search engines. We show that even with traditional, imperfect computer vision and natural language technologies, it is nevertheless possible to acquire various types of explicit visual knowledge at a large scale, and that the system can potentially improve as it builds on its previous iterations.
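To make the iterative web-supervised loop concrete, the following is a minimal Python sketch of the bootstrapping pattern described above. Here web_search and confidence are hypothetical stand-ins (random placeholders) for a real image search API and a classifier's score; none of these names come from the thesis itself.

    import random

    def web_search(concept, n=20):
        # Hypothetical placeholder for querying a web image search engine.
        return [f"{concept}_img_{i}" for i in range(n)]

    def confidence(image, concept):
        # Hypothetical placeholder for a classifier trained on current labels.
        return random.random()

    def bootstrap(concepts, rounds=3, threshold=0.8):
        # Seed each concept with raw (noisy) web search results.
        labeled = {c: set(web_search(c)) for c in concepts}
        for _ in range(rounds):
            # Each round, (re)train on the current labels (elided here) and
            # keep only new images the model itself trusts, so the system
            # builds on its previous iterations.
            for c in concepts:
                labeled[c] |= {im for im in web_search(c, n=50)
                               if confidence(im, c) > threshold}
        return labeled

    pool = bootstrap(["car", "bicycle"])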

Moreover, by adapting end-to-end methods that train deep convolutional networks directly on Internet images, we verify that the intermediate vectorized layers can serve as convenient and generalizable implicit knowledge representations for visual recognition, even with noisy supervision signals. Such a representation, while simple, can not only be transformed into discrete relationships as explicit knowledge, but also be exploited to accomplish complex structured tasks such as caption generation.
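As an illustration of this implicit-representation idea, the PyTorch sketch below extracts the penultimate, vectorized activations of a pretrained convolutional network. The ResNet-50 backbone is an assumption made for illustration, not the exact architecture used in the thesis.

    import torch
    from torchvision import models

    # Pretrained CNN; ResNet-50 is a stand-in for whichever network is
    # trained end-to-end on Internet images.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.eval()

    # Drop the classification head: the pooled intermediate activations act
    # as the implicit, generalizable knowledge representation.
    extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

    with torch.no_grad():
        batch = torch.randn(4, 3, 224, 224)        # stand-in image batch
        features = extractor(batch).flatten(1)     # shape: (4, 2048)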

Finally, we develop reasoning frameworks that use visual knowledge. To this end, we combine implicit and explicit knowledge into a single pipeline: the former is especially effective when abundant data is available, while the latter offers supervision, model explainability, and alternative help when few examples exist. As one building block, we present a local, spatial memory that stores instances while preserving the intrinsic layout of the image. To leverage explicit knowledge, we additionally introduce a graph-based module for global reasoning. Both are shown to enhance the reasoning ability of current vision systems.
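The graph-based module can be pictured as propagating per-region features over an adjacency matrix derived from explicit knowledge. The sketch below is a generic single-step graph convolution in PyTorch, an assumed simplification rather than the thesis's exact module; all names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphReasoning(nn.Module):
        # One graph-convolution step over per-region features.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, adj):
            # x: (regions, dim) features; adj: (regions, regions) edges from
            # an explicit knowledge graph. Row-normalize so each region
            # aggregates a weighted average of its neighbors, then transform.
            adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
            return F.relu(self.proj(adj @ x))

    # Usage: 5 regions with 256-d features and a self-looped adjacency.
    x = torch.randn(5, 256)
    adj = torch.eye(5) + (torch.rand(5, 5) > 0.5).float()
    out = GraphReasoning(256)(x, adj)   # (5, 256) knowledge-aware features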

History

Date

2018-02-02

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Abhinav Gupta
