Carnegie Mellon University
2015_subhodee_dissertation.pdf (4.77 MB)

Feature Learning and Graphical Models for Protein Sequences

posted on 2022-12-13, 21:41 authored by Subhodeep Moitra

Evolutionarily related proteins often share similar sequences and structures and are grouped into entities called protein families. The sequences in a protein family can have complex amino acid distributions encoding evolutionary relationships, physical constraints and functional attributes. Additionally, protein families can contain large numbers of sequences (deep) as well as large numbers of positions (wide). Existing models of protein sequence families make strong assumptions, require prior knowledge, or severely limit the representational power of the models. In this thesis, we study computational methods for the task of learning rich predictive and generative models of protein families.

First, we consider the problem of large-scale feature selection for predictive models. We address this in the context of a target application: designing drug cocktails against HIV-1 infection. We work with a large dataset consisting of around 70,000 HIV-1 protease and reverse transcriptase sequences. The core challenges in this setting are scaling up to datasets of this size and selecting discriminatory features. We accomplish both and, by examining the fitness landscape learned by our predictive models, provide strategies for designing drug cocktails that are robust to mutations.
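
As a toy illustration of the kind of discriminatory-feature selection involved (the sequences, labels, and scoring rule below are invented for this sketch, not taken from the dissertation), one can score each alignment column by how differently its amino acids are distributed between resistant and susceptible sequences:

```python
# Hypothetical sketch: score alignment columns by how well they
# discriminate resistant (label 1) from susceptible (label 0) sequences.
# Data and the L1-distance scoring rule are made up for illustration.
from collections import Counter

def position_scores(sequences, labels):
    """Per-column L1 distance between the two class-conditional
    amino acid distributions (higher = more discriminatory)."""
    scores = []
    for pos in range(len(sequences[0])):
        res = Counter(s[pos] for s, y in zip(sequences, labels) if y == 1)
        sus = Counter(s[pos] for s, y in zip(sequences, labels) if y == 0)
        n_res = sum(res.values()) or 1
        n_sus = sum(sus.values()) or 1
        alphabet = set(res) | set(sus)
        scores.append(sum(abs(res[a] / n_res - sus[a] / n_sus)
                          for a in alphabet))
    return scores

# Toy alignment: only position 1 separates the two classes.
seqs = ["MKV", "MKI", "MRV", "MRI"]
labs = [0, 0, 1, 1]
scores = position_scores(seqs, labs)
best = max(range(len(scores)), key=scores.__getitem__)
```

At the scale of ~70,000 sequences, a real pipeline would of course use sparse encodings and regularised predictive models rather than this per-column heuristic.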

Next, we present a framework for modelling protein families as a series of increasingly complex models using Markov Random Fields (MRFs). We hypothesise that by adding edges and latent variables to the MRF, we can progressively relax model assumptions and increase representational power. We note that latent variable models with cycles fail to learn effective models due to poor approximate inference, thus defeating their purpose. This motivates the need for special architectures which allow efficient inference even in the presence of latent variables.
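
The pairwise-MRF view can be sketched with a deliberately tiny model (two columns, a two-letter alphabet, hand-picked potentials, all invented for illustration): node potentials capture per-column amino acid preferences, and each added edge couples a pair of columns, which is exactly how representational power grows in the framework above:

```python
# Minimal pairwise MRF over sequences; parameters are made up.
# Node potentials: per-column preferences. Edge potentials: couplings.
import itertools
import math

ALPHABET = "AR"                       # toy two-letter amino acid alphabet
NODES = {0: {"A": 1.0, "R": 0.0},
         1: {"A": 0.0, "R": 1.0}}
EDGES = {(0, 1): {("A", "R"): 2.0}}   # coupling favouring A at 0 with R at 1

def score(seq):
    """Unnormalised log-probability of a sequence under the MRF."""
    s = sum(NODES[i].get(a, 0.0) for i, a in enumerate(seq))
    for (i, j), table in EDGES.items():
        s += table.get((seq[i], seq[j]), 0.0)
    return s

def probability(seq):
    """Exact normalisation, feasible only for tiny models like this one."""
    z = sum(math.exp(score("".join(s)))
            for s in itertools.product(ALPHABET, repeat=len(seq)))
    return math.exp(score(seq)) / z
```

The exhaustive partition function here is what becomes intractable at real protein lengths, which is why approximate inference, and its failure modes in loopy latent-variable models, matters.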

Next, we extend the utility of the learned models beyond generative metrics. We introspect and interpret the learned features for biological significance by studying allostery in G Protein Coupled Receptors (GPCRs). We identify networks of co-evolving residues, a minimal binding pocket and long-range interactions, all by learning the structure of an MRF trained on the GPCR protein family.
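
One simple, illustrative way to surface co-evolving column pairs (a classical baseline, not the MRF structure-learning approach the dissertation uses) is mutual information between alignment columns:

```python
# Mutual information between two alignment columns; columns that
# co-vary strongly are candidates for co-evolving residue pairs.
# The toy alignment is invented for illustration.
import math
from collections import Counter

def mutual_information(col_a, col_b):
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

# Toy alignment: columns 0 and 1 co-vary perfectly; column 2 is noise.
msa = ["ARA", "ARG", "GCA", "GCG"]
cols = list(zip(*msa))
mi01 = mutual_information(cols[0], cols[1])
mi02 = mutual_information(cols[0], cols[2])
```

Unlike MRF structure learning, pairwise mutual information cannot distinguish direct couplings from correlations transmitted through intermediate residues, which is one reason graphical-model approaches are preferred for this task.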

Finally, we develop the first Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs) for protein sequence families. We demonstrate that these models significantly outperform their MRF counterparts in terms of imputation error. We also consider Boltzmann Machines with sparse topologies and provide a strategy for learning their sparse structures. We note that the sparse Boltzmann Machines perform similarly to MRFs, reinforcing our hypothesis that non-sparse Boltzmann Machines are required for modelling the complex relationships inherent in protein families.
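
RBM-style imputation can be sketched in miniature (the weights below are hand-picked, not trained, and the model is far smaller than any real protein RBM): clamp the observed visible units, compute mean-field hidden activations, then reconstruct the missing unit from the hiddens:

```python
# Toy mean-field imputation with a 3-visible / 2-hidden binary RBM.
# Weights and biases are invented; a real model would be trained.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

W = [[2.0, -2.0],      # visible-to-hidden weights (3 visible, 2 hidden)
     [2.0, -2.0],
     [2.0, -2.0]]
B_H = [0.0, 0.0]       # hidden biases
B_V = [0.0, 0.0, 0.0]  # visible biases

def impute(v, missing):
    """Mean-field reconstruction of one missing visible unit."""
    v = list(v)
    v[missing] = 0.5   # neutral initialisation for the unknown unit
    h = [sigmoid(b + sum(v[i] * W[i][j] for i in range(3)))
         for j, b in enumerate(B_H)]
    return sigmoid(B_V[missing] + sum(W[missing][j] * h[j] for j in range(2)))
```

Imputation error, the metric on which the RBMs and DBMs outperform their MRF counterparts, is measured by hiding observed positions in this way and comparing the reconstruction against the true amino acid.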




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Dr. Christopher James Langmead