Adapting to the Long Tail in Language Understanding

Naik, Aakanksha

doi:10.1184/R1/24944535.v1

naik, aakanksha - Thesis_1.pdf (5.59 MB)

thesis

posted on 2024-01-17, 21:14 authored by Aakanksha Naik

Advances in deep learning, especially self-supervised representation learning, have produced models that reach human parity on many benchmark datasets covering a variety of natural language understanding (NLU) tasks. However, benchmark datasets are constructed from naturally occurring text and are no exception to Zipf's law: they contain a small proportion of highly frequent cases and a long tail of less frequent cases. Benchmark-driven evaluation and model development favor NLU models that perform well on the head, sidelining domains and phenomena that are underrepresented.

In this thesis, we adopt a two-level conceptualization of the long tail: (i) a macro level, defined by broad dimensions of linguistic variation such as language, genre, and topic, and (ii) a micro level, defined by the presence or absence of specific linguistic phenomena such as numeracy and deixis. With this conceptualization in mind, we address three research questions about the applicability of domain adaptation to the long tail: (i) How can we best adapt between macro-level dimensions? (ii) How can we best handle micro-level phenomena? (iii) How do we evaluate performance on the long tail?

For adaptation at the macro level (low-resource domains), we propose: (i) likelihood-based instance weighting, an unsupervised adaptation technique that uses language model likelihoods to estimate source-target similarity, and (ii) domain-aware query sampling, an embedding similarity-based criterion to improve data efficiency during active learning. For micro-level adaptation (low-resource phenomena), we present an integrated architecture that incorporates knowledge/rules represented as ILP constraints into neural model training using a structured SVM framework. Finally, for long tail evaluation, we develop an evaluation paradigm called “stress tests”, which allows us to identify micro long tail phenomena that models fail on by supplementing benchmark evaluation with evaluation on non-identically distributed, phenomenon-focused, test-only datasets.
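As a rough illustration of the idea behind likelihood-based instance weighting, the sketch below scores source-domain sentences with a language model fit on target-domain text and converts the scores into training weights. It is a minimal sketch under stated assumptions, not the method as implemented in the thesis: the add-one smoothed unigram model, the length normalization, the softmax conversion to weights, and the function names `train_unigram_lm` and `instance_weights` are all hypothetical choices made here for illustration, and the toy sentences are invented.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Fit an add-one smoothed unigram model on tokenized target-domain sentences.

    Assumption: a unigram model stands in for whatever language model is
    actually used; it only serves to make the example self-contained.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab_size))

    return log_prob


def instance_weights(source_sentences, log_prob):
    """Map length-normalized target-LM log-likelihoods to weights that sum to 1.

    Assumption: a softmax over per-token average log-likelihood is one simple
    way to turn similarity scores into weights; other schemes are possible.
    """
    scores = [
        sum(log_prob(tok) for tok in sent) / max(len(sent), 1)
        for sent in source_sentences
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy data (invented): the target domain is clinical text.
target = [
    ["the", "patient", "received", "aspirin"],
    ["the", "patient", "was", "discharged"],
]
source = [
    ["the", "patient", "was", "admitted"],
    ["stocks", "fell", "sharply", "today"],
]

weights = instance_weights(source, train_unigram_lm(target))
```

In this toy setting the first source sentence shares vocabulary with the target text and therefore receives the larger weight; such weights could then scale each source instance's contribution to the training loss.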

Through a series of systematically designed case studies, we analyze and contrast the performance of these proposed techniques with existing transfer learning methods on information extraction and text classification tasks. Our goal is to identify promising categories of methods for the long tail, while mapping out their limits. This thesis takes preliminary steps towards aggregating a series of best practices that can facilitate informed selection from an arsenal of strong transfer methods, given a new long tail setting.

History

Date

2022-04-20

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Carolyn Rosé

Keywords

deep learning; self-supervised representation learning; natural language

Licence

CC BY 4.0
