Phylogenetic Inference for Multidomain Proteins
In this thesis, I present a model of multidomain evolution with associated algorithms and software for phylogenetic analysis of multidomain families, as well as applications of this novel methodology to case studies and to the human genome.
Phylogenetic analysis is of central importance to understanding the origins and evolution of life on earth. In biomedical research, molecular phylogenetics has proved an essential tool for practical applications. Current molecular phylogenetic methods are not equipped, however, to model many of the unique characteristics of multidomain families. Genes that encode this large and important class of proteins are characterized by a mosaic of sequence fragments that encode structural or functional modules, called domains. Multidomain families evolve via domain shuffling, a process that includes insertion, internal duplication, and deletion of domains. This versatile evolutionary mechanism played a transformative role in major evolutionary transitions, including the emergence of multicellular animals and the vertebrate immune system.
Multidomain families are ill-suited to current methods for phylogeny reconstruction because of their mosaic composition. Different regions of the same protein may have different evolutionary histories. Moreover, a protein may contain domains that also occur in otherwise unrelated proteins. These attributes pose substantial obstacles for phylogenetic methods that require a multiple sequence alignment as input. In addition, current methods do not incorporate a model of domain shuffling and hence cannot infer the events that occurred in the history of the family. I address this problem by treating a multidomain family as a set of co-evolving domains, each with its own history. If the family is evolving by vertical descent from a conserved set of ancestral domains, then all constituent domains will have the same phylogenetic history. Disagreement between domain tree topologies is evidence that the family evolved through processes other than speciation and gene duplication. My algorithms exploit this information to reconstruct the history of domain shuffling in the family, as well as the timing of these events and the ancestral domain composition. I have implemented these algorithms in software that outputs the most parsimonious history of events for each domain family. The software also reconstructs a composite family history, including duplications, insertions, and losses of all constituent domains, together with the ancestral domain composition.
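The observation that disagreement between domain tree topologies signals domain shuffling can be illustrated with a minimal sketch. This is not the reconciliation algorithm developed in the thesis; it only shows the underlying test for incongruence. It assumes, for simplicity, that each protein contains exactly one instance of each of two domains, so that both domain trees can be labeled by protein name. The nested-tuple tree encoding, the function names, and the protein labels P1 to P4 are all hypothetical.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Collect the leaf set of every internal node of a rooted tree."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def topological_conflict(tree_a, tree_b):
    """Return the clades present in exactly one of the two domain trees.

    An empty result means the two rooted topologies are identical, which is
    what vertical descent from a shared ancestral architecture predicts.
    """
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both domain trees must cover the same proteins")
    return clades(tree_a) ^ clades(tree_b)


# Hypothetical trees for two domains that co-occur in proteins P1..P4.
domain_x_tree = ((("P1", "P2"), "P3"), "P4")
domain_y_tree = ((("P1", "P3"), "P2"), "P4")

conflict = topological_conflict(domain_x_tree, domain_y_tree)
for clade in sorted(conflict, key=sorted):
    print(sorted(clade))
```

In this toy example the two trees disagree on whether P1 groups with P2 or with P3, so the conflict set is non-empty. A full analysis would go further, as described above, and infer which events (duplication, insertion, loss) most parsimoniously explain the disagreement.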
My approach is capable of more detailed and accurate reconstructions than the widely used domain architecture model, which ignores sequence variation between domain instances. In contrast, my approach is based on an explicit model of events and captures sequence variation between domain instances. I demonstrate the utility of this method through case studies of notch-related proteins, protein tyrosine kinases, and membrane-associated guanylate kinases. I further present a large-scale analysis of domain shuffling processes through comparison of all pairs of domain families that co-occur in a protein in the human genome. These analyses suggest that (1) substantially more domain shuffling may have occurred than previously thought and (2) it is not uncommon for the same domain architecture to arise more than once through independent events. This stands in contrast to earlier reports that convergent evolution of domain architecture is rare, and it suggests that incorporating sequence variation in evolutionary analyses of multidomain families is crucial for accurate inference.