Posted on 2004-05-01, 00:00. Authored by Madhavi Ganapathiraju, Judith Klein-Seetharaman, Roni Rosenfeld, Jaime G. Carbonell, Raj Reddy
The precise relationship between a primary protein sequence, its three-dimensional structure and its
function in a complex cellular environment is one of the most fundamental unanswered questions in
biology. Unprecedented amounts of genomic and proteomic data create an opportunity for attacking the
sequence-structure-function mapping problem with data-driven methods. The mapping of biological
sequences to form and function of proteins is conceptually similar to the mapping of words to meaning.
A growing body of research explores this analogy ([1] and references therein). Accordingly, n-gram
analysis (statistical analysis of the co-occurrence of words in a text) has been applied to biological
sequences, using various types of “vocabulary”, for example nucleotides and amino acids. Here, we
investigate n-gram statistics in whole-genome sequences to address the following questions: How
characteristic is the amino acid n-gram distribution for specific organisms? Do different organisms tend to
use different “phrases”? What is the “meaning” of a rare sequence in a protein? The long-term goal is to
provide a starting point for deriving language models, with defined vocabulary, phrase preferences,
and grammatical rules, for the protein sequences of different organisms.
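
To make the idea of amino acid n-gram statistics concrete, the following is a minimal sketch, not the authors' implementation. It treats each protein as a string of one-letter amino acid codes, counts overlapping n-grams, and compares which n-grams occur in two sets of sequences. The function names (`count_ngrams`, `relative_frequencies`) and the toy sequences are hypothetical and chosen only for illustration; a real analysis would read whole-genome protein sets instead.

```python
from collections import Counter


def count_ngrams(sequences, n):
    """Count overlapping n-grams over a collection of protein sequences.

    n-grams never span two sequences; sequences shorter than n contribute nothing.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw n-gram counts into relative frequencies (empty dict if no counts)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


# Toy, made-up sequences standing in for the proteins of two organisms (assumption).
organism_a = ["MKVLAAGLLK", "MKKLLAAG"]
organism_b = ["MSTNPKPQRK", "MSTNQRK"]

freq_a = relative_frequencies(count_ngrams(organism_a, 2))
freq_b = relative_frequencies(count_ngrams(organism_b, 2))

# 2-grams ("phrases") seen in one toy organism but not in the other.
only_in_a = sorted(set(freq_a) - set(freq_b))
only_in_b = sorted(set(freq_b) - set(freq_a))
print("only in A:", only_in_a)
print("only in B:", only_in_b)
```

Comparing the two frequency tables is one simple way to ask whether organisms prefer different “phrases”; n-grams with very low frequency in a given organism would be candidates for the “rare sequences” mentioned above.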