Rare and Frequent N-grams in Whole-genome Protein Sequences

Ganapathiraju, Madhavi; Klein-Seetharaman, Judith; Rosenfeld, Roni; Carbonell, Jaime G.; Reddy, Raj

doi:10.1184/R1/6608789.v1

journal contribution

Posted on 2004-05-01. Authored by Madhavi Ganapathiraju, Judith Klein-Seetharaman, Roni Rosenfeld, Jaime G. Carbonell, Raj Reddy.

The precise relationship between a protein's primary sequence, its three-dimensional structure, and its function in a complex cellular environment is one of the most fundamental unanswered questions in biology. Unprecedented amounts of genomic and proteomic data create an opportunity to attack the sequence-structure-function mapping problem with data-driven methods. The mapping of biological sequences to the form and function of proteins is conceptually similar to the mapping of words to meaning, an analogy that is being studied by a growing body of research ([1] and references therein). Accordingly, n-gram analysis (the statistical analysis of co-occurrence of words in a text) has been applied to biological sequences using various types of “vocabulary”, for example nucleotides and amino acids. Here, we investigate n-gram statistics in whole-genome sequences to address the following questions: How characteristic is the amino acid n-gram distribution for specific organisms? Do different organisms tend to use different “phrases”? What is the “meaning” of a rare sequence in a protein? The long-term goal is to provide a useful starting point for deriving language models of protein sequences from different organisms, with defined vocabulary, phrase preferences, and grammatical rules.
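To make the idea of amino acid n-gram statistics concrete, here is a minimal Python sketch that counts overlapping n-grams across a set of protein sequences and lists the most and least frequent ones. It is an illustration under stated assumptions, not the authors' method: the function names (`count_ngrams`, `relative_frequencies`, `frequent_and_rare`), the 20-letter alphabet filter, and the toy sequences are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Assumption: restrict the "vocabulary" to the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_ngrams(sequences, n):
    """Count overlapping amino acid n-grams over all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Skip windows containing non-standard symbols such as 'X'.
            if all(ch in AMINO_ACIDS for ch in gram):
                counts[gram] += 1
    return counts


def relative_frequencies(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def frequent_and_rare(counts, k=3):
    """Return the k most frequent and k least frequent observed n-grams."""
    ranked = counts.most_common()
    return ranked[:k], ranked[-k:][::-1]


if __name__ == "__main__":
    # Made-up toy strings, not real proteome data.
    toy_proteome = ["MKVLA", "MKVMK"]
    counts = count_ngrams(toy_proteome, n=2)
    frequent, rare = frequent_and_rare(counts, k=2)
    print("most frequent:", frequent)
    print("least frequent:", rare)
```

Note that this sketch ranks only n-grams that actually occur; n-grams that never appear in the input are not reported. Comparing organisms would mean running the same counting on each organism's proteome and comparing the resulting frequency tables.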

Keywords

probabilistic modeling; biology-language analogy; protein folding; Information and Computing Sciences not elsewhere classified

Licence

In Copyright
