Feature Learning and Graphical Models for Protein Sequences

Moitra, Subhodeep

doi:10.1184/R1/21671705.v1

2015_subhodee_dissertation.pdf (4.77 MB)

thesis

posted on 2022-12-13, 21:41 authored by Subhodeep Moitra

Evolutionarily related proteins often share similar sequences and structures and are grouped together into entities called protein families. The sequences in a protein family can have complex amino acid distributions encoding evolutionary relationships, physical constraints and functional attributes. Additionally, protein families can contain large numbers of sequences (deep) as well as large numbers of positions (wide). Existing models of protein sequence families make strong assumptions, require prior knowledge or severely limit their own representational power. In this thesis, we study computational methods for the task of learning rich predictive and generative models of protein families.

First, we consider the problem of large-scale feature selection for predictive models. We address this in the context of a target application: designing drug cocktails against HIV-1 infection. We work with a large dataset consisting of around 70,000 HIV-1 protease and reverse transcriptase sequences. The core challenge in this setting is scaling up while selecting discriminatory features. We accomplish this and, by examining the fitness landscape learned by our predictive models, provide strategies for designing cocktails of drugs that are robust to mutations.

Next, we present a framework for modelling protein families as a series of increasingly complex models using Markov Random Fields (MRFs). We hypothesise that by adding edges and latent variables to the MRF, we can progressively relax model assumptions and increase representational power. We note that latent variable models with cycles fail to learn effective models due to poor approximate inference, thus defeating their purpose. This motivates the need for special architectures that allow efficient inference even in the presence of latent variables.

Next, we extend the utility of the learned models beyond generative metrics. We inspect and interpret the learned features for biological significance by studying allostery in G Protein-Coupled Receptors (GPCRs). We identify networks of co-evolving residues, a minimal binding pocket and long-range interactions, all by learning the structure of an MRF trained on the GPCR protein family.

Finally, we develop the first Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs) for protein sequence families. We demonstrate that these models significantly outperform their MRF counterparts in terms of imputation error. We also consider Boltzmann Machines with sparse topologies and provide a strategy for learning their sparse structures. We note that the sparse Boltzmann Machines perform similarly to MRFs, thus reinforcing our hypothesis that non-sparse Boltzmann Machines are required for modelling the complex relationships inherent in protein families.

History

Date

2015-05-06

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Dr. Christopher James Langmead

Keywords

Large Scale Feature Selection; Protein Families; Markov Random Fields; Boltzmann Machines; G Protein Coupled Receptors

Licence

In Copyright
