Adapting to the Long Tail in Language Understanding
Advances in deep learning, especially self-supervised representation learning, have produced models that reach human parity on many benchmark datasets covering a variety of natural language understanding (NLU) tasks. However, benchmark datasets are constructed from naturally occurring text and are no exception to Zipf's law: they contain a small proportion of highly frequent cases and a long tail of less frequent ones. Benchmark-driven evaluation and model development therefore favor NLU models that perform well on the head, sidelining domains and phenomena that are underrepresented.
In this thesis, we adopt a two-level conceptualization of the long tail: (i) a macro level defined by broad dimensions of linguistic variation such as language, genre, and topic, and (ii) a micro level defined by the presence or absence of specific linguistic phenomena such as numeracy and deixis. With this conceptualization in mind, we address three research questions about the applicability of domain adaptation to the long tail: (i) how can we best adapt between macro-level dimensions? (ii) how can we best handle micro-level phenomena? (iii) how can we evaluate performance on the long tail?
For adaptation at the macro level (low-resource domains), we propose: (i) likelihood-based instance weighting, an unsupervised adaptation technique that uses language model likelihoods to estimate source-target similarity, and (ii) domain-aware query sampling, an embedding-similarity criterion that improves data efficiency during active learning. For micro-level adaptation (low-resource phenomena), we present an integrated architecture that incorporates knowledge and rules, represented as ILP constraints, into neural model training via a structured SVM framework. Finally, for long-tail evaluation, we develop an evaluation paradigm called "stress tests", which identifies micro-level long-tail phenomena that models fail on by supplementing benchmark evaluation with non-identically distributed, phenomenon-focused test-only datasets.
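To make the first idea concrete, likelihood-based instance weighting can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it substitutes add-one-smoothed unigram language models for real ones, and the Moore-Lewis-style cross-entropy-difference weighting, corpora, and function names are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab_size):
    # Add-one smoothed unigram log-likelihood of a token sequence.
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def instance_weights(source_texts, target_corpus, general_corpus):
    """Weight each source instance by its likelihood ratio under a
    target-domain LM versus a general-domain LM (a cross-entropy-
    difference criterion); higher weight means more target-like."""
    tgt_counts = Counter(t for s in target_corpus for t in s.split())
    gen_counts = Counter(t for s in general_corpus for t in s.split())
    vocab_size = len(set(tgt_counts) | set(gen_counts))
    tgt_total = sum(tgt_counts.values())
    gen_total = sum(gen_counts.values())
    weights = []
    for text in source_texts:
        toks = text.split()
        ll_tgt = unigram_logprob(toks, tgt_counts, tgt_total, vocab_size)
        ll_gen = unigram_logprob(toks, gen_counts, gen_total, vocab_size)
        # Length-normalized log-likelihood difference, exponentiated so
        # weights are positive and equal 1 for domain-neutral instances.
        weights.append(math.exp((ll_tgt - ll_gen) / max(len(toks), 1)))
    return weights
```

Source instances whose wording resembles the target domain receive weights above 1 and thus contribute more to the adapted model's training loss.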
Through a series of systematically designed case studies, we analyze and contrast the performance of these proposed techniques with existing transfer learning methods on information extraction and text classification tasks. Our goal is to identify promising categories of methods for the long tail while mapping out their limits. This thesis takes preliminary steps toward aggregating a set of best practices that can facilitate informed selection from an arsenal of strong transfer methods, given a new long tail setting.
- Language Technologies Institute
- Doctor of Philosophy (PhD)