Carnegie Mellon University
Computational Methods for Analyzing the Architecture and Evolution of the Regulatory Genome

thesis
posted on 2025-04-18, 20:00 authored by Pradipta Ray

One goal of this thesis is to explore supervised motif detection in regulatory sequences by maximally utilizing the inherent “grammar” or structure of the cis-regulatory modules. We achieve this goal by using hierarchical and generalized Hidden Markov Models (HMMs) in a Bayesian setting.

Hierarchical HMMs capture correlations among binding sites where they exist, and can model flanking regions specific to different kinds of binding sites. Generalized HMMs model spacer distances between motifs. A Bayesian framework ensures that whatever prior knowledge we have about the architecture (possible correlations between types of binding sites, etc.) can be incorporated into the model through priors on the parameters. The work is presented in detail in Chapter 2 [106].
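The role of an explicit spacer-length distribution can be illustrated with a toy example. The sketch below is not the thesis model: the two position weight matrices, the uniform background, and the spacer-length probabilities are all made-up values, and the architecture is fixed to "motif A, spacer, motif B". It only shows how a generalized (explicit-duration) spacer state lets the distance between two motifs contribute to the score of a candidate module.

```python
import math

# Hypothetical toy position weight matrices (one dict per motif position).
PWM_A = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
PWM_B = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background for spacer and flanking bases.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Explicit spacer-length distribution (assumed values). An ordinary HMM
# state would imply a geometric length; here any distribution is allowed.
SPACER_LENGTH_PROB = {1: 0.1, 2: 0.6, 3: 0.2, 4: 0.1}


def pwm_log_prob(pwm, segment):
    """Log-probability of a segment under a position weight matrix."""
    return sum(math.log(column[base]) for column, base in zip(pwm, segment))


def background_log_prob(segment):
    """Log-probability of a segment under the background model."""
    return sum(math.log(BACKGROUND[base]) for base in segment)


def best_module(seq):
    """Return (log_prob, start, spacer_length) of the best A-spacer-B placement."""
    best = None
    len_a, len_b = len(PWM_A), len(PWM_B)
    for start in range(len(seq)):
        for spacer_len, p_len in SPACER_LENGTH_PROB.items():
            spacer_start = start + len_a
            b_start = spacer_start + spacer_len
            end = b_start + len_b
            if end > len(seq):
                continue
            score = (
                background_log_prob(seq[:start])
                + pwm_log_prob(PWM_A, seq[start:spacer_start])
                + math.log(p_len)
                + background_log_prob(seq[spacer_start:b_start])
                + pwm_log_prob(PWM_B, seq[b_start:end])
                + background_log_prob(seq[end:])
            )
            if best is None or score > best[0]:
                best = (score, start, spacer_len)
    return best
```

For example, `best_module("GGACGCATTGG")` places motif A at offset 2 with a spacer of length 2, because both motifs match well there and a length-2 spacer is the most probable under the assumed distribution.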

Another goal of this thesis is to explore supervised motif detection using comparative genomic data (multiple sequence alignments), with a specific focus on accounting for the phenomenon of functional turnover. Functional turnover is a phenomenon in which orthologous sequences, even across closely related species, may have varying functionality due to gain or loss of function in the specific subsequence in question. Functional turnover is one of the biggest confounding factors plaguing comparative genomic analyses, and we developed a generative graphical model that models the multiple sequence alignment as the output of a mixture of phylogenies.

The mixture variables are not drawn from a simple Bernoulli distribution [127], but are themselves the product of a higher level phylogenetic tree modelling the evolution of binary function indicators. The work is presented in detail in Chapter 3 [148].
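The idea of binary function indicators evolving along a tree can be sketched as a small generative procedure. This is only an illustration under stated assumptions: the tree, the species names, and the per-branch gain and loss probabilities are hypothetical, and a single probability per branch stands in for whatever evolutionary model the thesis actually uses. Each leaf's indicator would then select which phylogeny (functional or background) generates that species' row of the alignment.

```python
import random

# Hypothetical toy tree: each node maps to its children; leaves map to [].
TREE = {
    "root": ["anc1", "sp_C"],
    "anc1": ["sp_A", "sp_B"],
    "sp_A": [],
    "sp_B": [],
    "sp_C": [],
}


def sample_function_indicators(tree, root, p_root, p_gain, p_loss, rng):
    """Sample a binary 'functional' indicator at every node, top-down.

    p_root: probability that the root is functional.
    p_gain: per-branch probability that a non-functional parent has a
            functional child (assumed identical on every branch).
    p_loss: per-branch probability that a functional parent has a
            non-functional child.
    """
    state = {root: rng.random() < p_root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree[node]:
            if state[node]:
                state[child] = not (rng.random() < p_loss)
            else:
                state[child] = rng.random() < p_gain
            stack.append(child)
    return state


def leaf_indicators(tree, state):
    """Keep only the indicators of the observed species (the leaves)."""
    return {node: value for node, value in state.items() if not tree[node]}
```

Calling `leaf_indicators(TREE, sample_function_indicators(TREE, "root", 0.5, 0.1, 0.1, random.Random(0)))` yields one functional/non-functional flag per species. Because the flags are inherited along branches, closely related species tend to share the same flag, which independent Bernoulli draws would not capture.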

A third goal of this thesis is to analyze diverse sources of evidence and determine which genetic and epigenetic features correlate well with binding site locations, and to use such information to create a discriminative model for supervised prediction of binding sites. For this purpose we use the discriminative framework of a conditional random field (CRF), which assigns a weight to each genetic or epigenetic feature or “score”. We explored evolutionary features, annotation of transcribed and translated regions, features like GC content related to chromatin stability, and epigenetic features like nucleosome binding affinity. The work is presented in detail in Chapter 4 [57]. The resulting tool, DISCOVER, aims to be the standard tool for integrative analysis based on conditional random fields, with the ability to integrate differing datasets like epigenetic marks, transcription factor binding, genomic information, and evolutionary context.
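How a CRF combines weighted features can be sketched with a two-label linear-chain model and Viterbi decoding. Everything specific here is an assumption made for illustration: the feature names, the weights, the transition scores, and the restriction to a linear chain are hypothetical and are not taken from DISCOVER. In a real model the weights would be learned from annotated training data.

```python
LABELS = ("background", "site")

# Hypothetical per-label feature weights (made-up values).
FEATURE_WEIGHTS = {
    "site": {"bias": -1.0, "conservation": 4.0, "nucleosome_affinity": -1.5},
    "background": {"bias": 0.0, "conservation": 0.0, "nucleosome_affinity": 0.0},
}

# Hypothetical transition weights: staying in a label is mildly rewarded.
TRANSITION_WEIGHTS = {
    ("background", "background"): 0.5,
    ("background", "site"): -0.5,
    ("site", "site"): 0.5,
    ("site", "background"): -0.5,
}


def node_score(label, features):
    """Weighted sum of one position's feature values for a given label."""
    weights = FEATURE_WEIGHTS[label]
    return weights["bias"] + sum(
        weights[name] * value for name, value in features.items()
    )


def viterbi_decode(positions):
    """Return the highest-scoring label sequence for a list of feature dicts."""
    scores = {label: node_score(label, positions[0]) for label in LABELS}
    backpointers = []
    for features in positions[1:]:
        new_scores, pointers = {}, {}
        for label in LABELS:
            prev = max(
                LABELS,
                key=lambda p: scores[p] + TRANSITION_WEIGHTS[(p, label)],
            )
            new_scores[label] = (
                scores[prev]
                + TRANSITION_WEIGHTS[(prev, label)]
                + node_score(label, features)
            )
            pointers[label] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda label: scores[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


example_positions = [
    {"conservation": 0.10, "nucleosome_affinity": 0.80},
    {"conservation": 0.90, "nucleosome_affinity": 0.10},
    {"conservation": 0.95, "nucleosome_affinity": 0.05},
    {"conservation": 0.10, "nucleosome_affinity": 0.90},
]
```

With these made-up weights, `viterbi_decode(example_positions)` labels the two highly conserved, nucleosome-depleted positions in the middle as `site` and the two flanking positions as `background`.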

Obtaining a deeper understanding of the evolution of the regulatory genome is crucial both for building generative models that account for the evolution of regulatory regions, like CSMET [148] and EMnEM [127], and for analyzing which evolutionary features may prove discriminative for motif finding in discriminative models like DISCOVER [57]. A final goal of this thesis is to model the evolutionary dynamics of regulatory regions. We modelled co-evolving regions inside cis-regulatory modules by estimating evolutionary parameters in different parts of regulatory regions and applying spectral clustering to them. A further aim was to analyze selectional forces in the regulatory genome by identifying which k-mers are preferentially present in regulatory regions across species, modelling regulatory regions as evolving mixtures of stochastic dictionaries. We explored the predictive ability of the mixture components in our stochastic dictionaries, and how the evolution of such stochastic dictionaries can be tracked across species. This work is presented in detail in Chapter 5, with preliminary work having been presented in [147].
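As a much simpler stand-in for the stochastic-dictionary mixture model, the notion of k-mers being "preferentially present" in regulatory regions can be illustrated with plain k-mer counting and a smoothed log-odds ratio. The function names, the pseudocount, and the example sequences below are hypothetical, and this sketch does not model evolution across species at all.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_log_odds(regulatory, background, k, pseudocount=1.0):
    """Log-odds of each k-mer's frequency in regulatory vs. background sets.

    Positive values mean the k-mer is relatively more frequent in the
    regulatory sequences. The pseudocount (an assumed smoothing choice)
    avoids taking the log of zero for k-mers seen in only one set.
    """
    reg = kmer_counts(regulatory, k)
    bg = kmer_counts(background, k)
    vocab = set(reg) | set(bg)
    reg_total = sum(reg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {
        kmer: math.log((reg[kmer] + pseudocount) / reg_total)
        - math.log((bg[kmer] + pseudocount) / bg_total)
        for kmer in vocab
    }
```

For instance, `kmer_log_odds(["TATAAT", "GTATAA"], ["GCGCGC", "CCGGCC"], k=4)` gives `TATA` a positive score and `GCGC` a negative one.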

This thesis provides novel statistical frameworks for identifying regulatory regions, and analyzing them in terms of their architecture, function, evolutionary properties and correlation with other genomic and epigenomic features in a computationally optimal and statistically sound way.

History

Date

2013-05-13

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Eric P. Xing
  • Veronica F. Hinman