Combine and conquer: methods for multitask learning in biology and language
Generalizing beyond an individual task and borrowing knowledge from related tasks are hallmarks of true intelligence: knowing one language makes it easier to learn others, and mastering one sport builds skills that carry over to similar sports. In supervised machine learning, analogous opportunities arise in machine translation between similar languages, modeling the molecular processes of related organisms, predicting links across different types of social networks, extracting information from related data sources, and so on. The benefits of borrowing from related tasks go beyond the ability to generalize. In many supervised learning applications, the main bottleneck is insufficient labeled data (i.e., annotations) to learn a good model, and obtaining additional labels is often expensive and time-consuming. However, related applications with plentiful labeled data are often at hand, and that data can be exploited. Multitask learning [Caruana, 1997] is a family of machine learning methods that addresses this issue by building models from data drawn from multiple problem domains (i.e., ‘tasks’) and exploiting the similarity between them. The goal is to improve performance either on a low-resource task of interest, called the target task, or on all of the tasks involved.
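To make the parameter-sharing intuition concrete, here is a minimal, self-contained sketch (not any of the thesis's actual models) of multitask linear regression in which every task's weights decompose into a shared component plus a task-specific offset; penalizing the offsets pulls all tasks toward the common component, letting low-resource tasks borrow statistical strength from better-resourced ones. All names and hyperparameters are illustrative.

```python
import numpy as np

def multitask_ridge(tasks, lam_shared=0.1, lam_task=1.0, lr=0.01, epochs=500):
    """Toy multitask linear regression.

    Each task t predicts with weights w0 + v[t]. The penalty on
    ||v[t]||^2 acts like a shared prior: tasks with little data stay
    close to the common component w0 learned from all tasks.
    `tasks` is a list of (X, y) pairs with a common feature dimension.
    """
    d = tasks[0][0].shape[1]
    w0 = np.zeros(d)                      # shared weights
    v = np.zeros((len(tasks), d))         # per-task offsets
    for _ in range(epochs):
        for t, (X, y) in enumerate(tasks):
            resid = X @ (w0 + v[t]) - y   # prediction error on task t
            grad = X.T @ resid / len(y)   # data gradient (same for w0 and v[t])
            w0 -= lr * (grad + lam_shared * w0)
            v[t] -= lr * (grad + lam_task * v[t])
    return w0, v
```

Setting `lam_task` high forces the tasks to share almost everything, while setting it low recovers independent per-task models; the interesting regime lies in between.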
This thesis focuses on developing and extending multitask learning models for various types of data. Two diverse applications motivate the methods in this work. The first is modeling infectious diseases via host-pathogen interactions, where we study molecular-level interactions between pathogens such as bacteria and viruses and their hosts (such as humans). The question we address is: can we model host-pathogen interactions better by leveraging data across multiple diseases? Toward this end, we develop new methods to jointly learn models across several hosts and pathogens. The other application we consider, semantic parsing, is the process of mapping a natural-language sentence into a formal representation of its meaning. Since there are several ways to represent meaning, there are several linguistic resources (one per representation), each of which annotates a different text corpus. Here we ask: how can we leverage information from resources with different representations and data distributions? Overall, we explore various mechanisms for sharing information across tasks: enforcing priors, structured similarity, feature augmentation, and instance-level transfer (a sketch of one such mechanism appears below). We show how our models can be interpreted to obtain additional insights into the problems.
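Of the mechanisms just listed, feature augmentation has an especially compact canonical form, in the style of Daumé III's "frustratingly easy" feature augmentation (2007): each instance keeps a shared copy of its features plus a copy in a task-specific slot, after which a single model can be trained on the pooled data. The sketch below is illustrative and not necessarily the exact formulation used in the thesis.

```python
import numpy as np

def augment(X, task_id, n_tasks):
    """Map an (n, d) design matrix to (n, d*(n_tasks+1)): one shared
    block that every task fills in, plus one block per task that only
    that task's instances fill in (all other task blocks stay zero)."""
    n, d = X.shape
    out = np.zeros((n, d * (n_tasks + 1)))
    out[:, :d] = X                       # shared block
    start = d * (task_id + 1)
    out[:, start:start + d] = X          # task-specific block
    return out

# Pool two tasks' data, then train any single-task learner on the result:
X_a, X_b = np.random.randn(6, 3), np.random.randn(4, 3)
X_pool = np.vstack([augment(X_a, 0, 2), augment(X_b, 1, 2)])  # shape (10, 9)
```

Features useful to every task receive weight in the shared block, while task-specific quirks are absorbed by the per-task blocks.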
In terms of impact, we build the first models of host-pathogen interactions for several bacteria and viruses, including the first to involve a plant host, and the methods we develop outperform competing computational methods. Our predictions for the bacterium Salmonella were validated by laboratory experiments, and our model achieves significantly higher recall than other computational models. Since little is known about how plant immune systems work, we exploit data from other hosts; using our model's predictions, we compare two hosts: human and the plant Arabidopsis thaliana. The model we develop for viral pathogens yields interesting insights into pathogen-specific protein sequence structure. Finally, leveraging several linguistic resources leads to impressive gains on the task of frame-semantic role labeling.
History
Date
- 2015-08-12
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)