Neural Networks for Linguistic Structured Prediction and Their Interpretability
Linguistic structured prediction, such as sequence labeling, syntactic and semantic parsing, and coreference resolution, is one of the first stages in deep language understanding. Its importance has been well recognized in the natural language processing community, and its outputs feed a wide range of downstream tasks.
Most traditional high-performance linguistic structured prediction models are linear statistical models, including Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), which rely heavily on hand-crafted features and task-specific resources. Such task-specific knowledge, however, is costly to develop, making these models difficult to adapt to new tasks or new domains. In the past few years, non-linear neural networks with distributed word representations as input have been broadly applied to NLP problems with great success. By taking distributed representations as inputs, these systems learn hidden representations directly from data instead of relying on manually designed features.
Despite the impressive empirical success of applying neural networks to linguistic structured prediction tasks, at least two major problems remain: 1) there is no consistent architecture, at least for shared components, across different structured prediction tasks that can be trained in a truly end-to-end setting; 2) the end-to-end training paradigm comes at the expense of model interpretability: understanding the roles of different parts of a deep neural network is difficult.
In this thesis, we discuss these two major problems in current neural models and attempt to provide solutions to address them. In the first part of the thesis, we introduce a consistent neural architecture for the encoding component, named BLSTM-CNNs, that is shared across different structured prediction tasks. It is a truly end-to-end model requiring no task-specific resources, feature engineering, or data pre-processing beyond word embeddings pre-trained on unlabeled corpora. Thus, our model can be easily applied to a wide range of structured prediction tasks across different languages and domains. We apply this encoding architecture, combined with different structured output layers, to tasks including sequence labeling and graph-based and transition-based dependency parsing, achieving state-of-the-art performance.
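To make the encoder concrete, the following is a minimal numpy sketch of a BLSTM-CNNs-style forward pass: a character-level CNN produces a fixed-size feature per word, which is concatenated with the word embedding and fed through a bidirectional recurrence. All dimensions and weights here are hypothetical, and a plain tanh recurrence stands in for the LSTM cell to keep the sketch short; this is an illustration of the architecture's shape, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (not taken from the thesis)
CHAR_VOCAB, CHAR_DIM, N_FILTERS, WIN = 50, 8, 16, 3
WORD_DIM, HIDDEN = 20, 32

char_emb = rng.standard_normal((CHAR_VOCAB, CHAR_DIM))
conv_w = rng.standard_normal((N_FILTERS, WIN * CHAR_DIM))

def char_cnn(char_ids):
    """Character-level CNN: embed characters, convolve over windows,
    then max-pool over time to get one fixed-size vector per word."""
    x = char_emb[char_ids]                              # (n_chars, CHAR_DIM)
    pad = np.zeros((WIN // 2, CHAR_DIM))
    x = np.vstack([pad, x, pad])                        # pad for full windows
    windows = np.stack([x[i:i + WIN].ravel() for i in range(len(char_ids))])
    return np.tanh(windows @ conv_w.T).max(axis=0)      # (N_FILTERS,)

# Simplified recurrent cell (a tanh RNN standing in for the LSTM cell).
W_ih = 0.1 * rng.standard_normal((HIDDEN, WORD_DIM + N_FILTERS))
W_hh = 0.1 * rng.standard_normal((HIDDEN, HIDDEN))

def recur(inputs, reverse=False):
    h, out = np.zeros(HIDDEN), []
    for x in (inputs[::-1] if reverse else inputs):
        h = np.tanh(W_ih @ x + W_hh @ h)
        out.append(h)
    return out[::-1] if reverse else out

def encode(word_vecs, char_id_seqs):
    """Concatenate each word embedding with its char-CNN feature, run the
    recurrence in both directions, and concatenate the two hidden states."""
    inputs = [np.concatenate([w, char_cnn(c)])
              for w, c in zip(word_vecs, char_id_seqs)]
    fwd, bwd = recur(inputs), recur(inputs, reverse=True)
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Toy sentence: 4 words with random embeddings and random character ids.
words = [rng.standard_normal(WORD_DIM) for _ in range(4)]
chars = [rng.integers(0, CHAR_VOCAB, size=int(n)) for n in rng.integers(2, 7, size=4)]
states = encode(words, chars)
print(len(states), states[0].shape)   # one vector of size 2 * HIDDEN per token
```

Because the per-token output depends only on the sentence and pre-trained embeddings, the same encoder can be topped with a CRF layer for sequence labeling or with a parsing-specific output layer, which is the design point of the shared architecture.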
In the second part of this thesis, we use probing methods to investigate the learning properties of deep neural networks, with dependency parsing as a test bed. We first apply probes to neural dependency parsing models and demonstrate that probes with different expressiveness lead to inconsistent observations. Based on these findings, we propose to interpret the performance of probing tasks with two separate metrics associated with probe expressiveness: capacity and accessibility. Capacity measures how much information has been encoded, while accessibility measures how easily that information can be detected. We then conduct systematic experiments to illustrate two learning properties of deep neural networks: (i) laziness, storing information in a way that requires minimal effort; and (ii) targetedness, filtering out of internal representations information that is unnecessary for the target task.
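The gap between capacity and accessibility can be seen in a toy numpy sketch (entirely synthetic, not the thesis's experiments): a made-up "representation" encodes a binary feature in an XOR-like, non-linear way, so the information is fully present but not linearly accessible. A linear probe then fails to detect it, while a more expressive probe, here fixed random tanh features followed by a linear readout standing in for a trained MLP probe, recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "hidden representations": 2-D points clustered at the XOR corners.
# The probed property (the XOR label) is fully encoded, but not linearly.
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])
reps = np.repeat(corners, 10, axis=0) + 0.05 * rng.standard_normal((40, 2))
y = np.repeat(labels, 10)

def fit_linear_readout(feats, targets):
    """Least-squares linear readout with a bias term, thresholded at 0.5."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return (X @ w > 0.5).astype(int)

# Probe 1: a linear probe applied directly to the representations.
linear_acc = (fit_linear_readout(reps, y) == y).mean()

# Probe 2: a more expressive probe -- fixed random tanh features
# followed by the same linear readout (a stand-in for an MLP probe).
proj = rng.standard_normal((2, 64))
expressive_acc = (fit_linear_readout(np.tanh(reps @ proj), y) == y).mean()

print(f"linear probe accuracy:     {linear_acc:.2f}")
print(f"expressive probe accuracy: {expressive_acc:.2f}")
```

In the two-metric reading, the expressive probe's success shows the feature's capacity is high (it is encoded), while the linear probe's failure shows its accessibility is low, which is why conclusions drawn from a single probe family can be inconsistent.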
- Language Technologies Institute
- Doctor of Philosophy (PhD)