Carnegie Mellon University

Learning Cross-language and Cross-style Mappings with Limited Supervision

thesis
posted on 2025-04-14, 21:08 authored by Ruochen Xu

Recent natural language processing (NLP) research has increasingly focused on deep learning methods, producing superior results on a wide range of NLP tasks. Deep NLP models are usually built on dense vector representations of the input and can automatically extract multi-scale features from human-annotated data. However, human annotations are expensive and often unevenly distributed across languages, domains, genres, and styles. This thesis focuses on multiple aspects of cross-language and cross-style mapping in text, addressing the limitations of existing methods and improving state-of-the-art results when sufficient labeled data is not available. By developing both task-oriented transfer learning models (e.g., for cross-language classification) and generic methods for mapping among embedded words or sentences, the key contribution of this thesis is a set of novel approaches to leveraging unlabeled text data for effective and efficient mapping across languages or styles.

Chapter 1 outlines the overall theme, the challenges addressed, and the unique contributions of this thesis.

Chapter 2 presents two novel methods for transferring trained text classification models from rich-resource languages to low-resource languages. The first model targets the scenario where only a bilingual dictionary of limited size is available as the linkage between languages. It uses unsupervised word embeddings trained on monolingual data to construct a regularization graph in each language, and a spectral graph propagation algorithm to extend the bilingual dictionary. The second model is a distillation approach over parallel data, in which the teacher network and the student network are classifiers in the source and target languages, respectively. Both models achieved state-of-the-art performance at the time on several benchmark datasets [60, 62, 84, 129].
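The cross-lingual distillation idea above can be illustrated with a minimal sketch (not the thesis implementation): on a parallel sentence pair, the source-language teacher produces a soft label distribution, and the target-language student is trained to match it via soft cross-entropy. The function names and the temperature value here are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Soft cross-entropy between teacher and student class distributions.

    In the cross-lingual setting, teacher_logits come from the
    source-language classifier on one side of a parallel sentence pair,
    and student_logits from the target-language classifier on the
    translation of that sentence.
    """
    p = softmax(teacher_logits, T)            # teacher's soft targets
    log_q = np.log(softmax(student_logits, T))
    return -(p * log_q).sum(axis=-1).mean()   # average over the batch
```

Minimizing this loss drives the student's predicted distribution on the target-language side toward the teacher's, so label information transfers without any target-language annotations.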

Chapter 3 presents an unsupervised approach to mapping monolingual word embeddings across languages. It is the first gradient-based method for optimizing the Sinkhorn distance between two spaces of word embeddings, and has proven to be more accurate and robust than other methods [58, 126]. More importantly, this model achieves state-of-the-art performance without using any bilingual dictionary or parallel data.
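As background for the objective used in this chapter, the Sinkhorn distance between two embedding sets can be computed with the classic Sinkhorn fixed-point iterations. This is a generic sketch of that computation under assumed uniform marginals, not the thesis's optimization procedure (which additionally backpropagates through the distance to learn the cross-lingual mapping).

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost between two point clouds.

    X: (n, d) source embeddings; Y: (m, d) target embeddings.
    Uniform weights on both sides are assumed; reg is the entropy weight.
    """
    n, m = X.shape[0], Y.shape[0]
    # Pairwise squared Euclidean cost matrix, shape (n, m).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / reg)                       # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):                   # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # approximate transport plan
    return (P * C).sum()                       # transported cost
```

Because every step is differentiable, the same computation supports the gradient-based optimization described above: gradients of the distance with respect to a mapping applied to X can be obtained by automatic differentiation.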

Chapter 4 presents new text generation models that transfer the styles or attributes of sentences. We introduce a semi-supervised model that is trained on both paired and unpaired sentences with style labels and achieves state-of-the-art results on a formality transfer task. For the case where no paired sentences are available, we propose a novel unsupervised method that combines the strengths of a neural Seq2Seq model and a search engine, outperforming other competing methods on various datasets.
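The search-engine component of the unsupervised method above can be caricatured by a toy retrieval step: given an input sentence, fetch the most lexically similar sentence from a target-style corpus, which a Seq2Seq editor could then rewrite to preserve the input's content. This sketch uses simple Jaccard word overlap as an assumed stand-in for a real retrieval engine; the function names are illustrative.

```python
def jaccard(a, b):
    # Word-set overlap similarity between two sentences.
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 0.0

def retrieve_template(query, target_style_corpus):
    """Return the target-style sentence most similar to the query.

    A toy stand-in for the search-engine component: the retrieved
    sentence serves as a style template for a downstream Seq2Seq editor.
    """
    return max(target_style_corpus, key=lambda s: jaccard(query, s))
```

For example, retrieving from a small formal-style corpus with an informal query picks out the sentence sharing the most content words, giving the editor a fluent target-style scaffold to work from.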


History

Date

2019-08-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Yiming Yang