Document Classification of Protein Sequences
The need for accurate, automated protein classification methods continues to grow as advances in biotechnology uncover new proteins at a rapid rate. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among their members; yet they are an important subject of pharmacological research, being the target of approximately 60% of current drugs (Muller, 2000). A comparison of BLAST, k-NN, HMM and SVM with alignment-based features by Karchin et al. (2002) suggested that classifiers as complex as SVMs are needed to attain high accuracy in GPCR subfamily classification. Here, by analogy with document classification, we applied Decision Tree and Naïve Bayes classifiers with chi-square feature selection on n-gram counts to the GPCR family and subfamily classification tasks. Using the dataset and evaluation protocol of the previous study, we found that the Naïve Bayes classifier surpassed the reported accuracy of SVM by 4.8% and 6.1% in level I and level II subfamily classification, reaching accuracies of 93.2% and 92.4%, respectively. The Decision Tree, while inferior to SVM, still outperformed HMM in both level I and level II subfamily classification. Moreover, the n-grams selected by chi-square feature selection show evidence of biological importance. Thus, the document classification approach yields a simpler, more accurate and more interpretable classifier.
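To make the pipeline concrete, the following is a minimal sketch of n-gram counting, chi-square feature selection and Naïve Bayes classification on toy data. It is an illustration under stated assumptions, not the implementation used in this study: the n-gram length (2), the presence/absence form of the chi-square statistic, the multinomial model with add-one smoothing, the function names, and the toy sequences and labels are all hypothetical choices made for the example.

```python
from collections import Counter
import math

def ngram_counts(seq, n=2):
    """Count overlapping n-grams (length-n substrings) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def chi_square(docs, labels, feature, target):
    """Chi-square statistic of the 2x2 table:
    n-gram present/absent versus label == target / label != target."""
    a = b = c = d = 0
    for counts, label in zip(docs, labels):
        present = feature in counts
        if present and label == target:
            a += 1
        elif present:
            b += 1
        elif label == target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k n-grams whose best per-class chi-square score is highest."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(docs, labels, f, c) for c in classes)
             for f in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(vocab, key=lambda f: (-score[f], f))[:k])

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    totals = {c: Counter() for c in classes}
    for counts, label in zip(docs, labels):
        for f, v in counts.items():
            if f in features:
                totals[label][f] += v
    loglik = {}
    for c in classes:
        denom = sum(totals[c].values()) + len(features)
        loglik[c] = {f: math.log((totals[c][f] + 1) / denom) for f in features}
    return prior, loglik

def predict(model, counts):
    """Return the class with the highest log posterior score."""
    prior, loglik = model
    def score(c):
        return prior[c] + sum(v * loglik[c][f]
                              for f, v in counts.items() if f in loglik[c])
    return max(prior, key=score)

# Hypothetical toy data: short strings stand in for protein sequences,
# and "x" / "y" stand in for two subfamilies.
train_seqs = ["AAAA", "AAAT", "AATA", "CCCC", "CCCG", "CGCC"]
train_labels = ["x", "x", "x", "y", "y", "y"]

docs = [ngram_counts(s) for s in train_seqs]
features = select_features(docs, train_labels, k=2)
model = train_nb(docs, train_labels, features)
```

Calling `predict(model, ngram_counts(seq))` on a new sequence then returns the predicted label. Because the selected n-grams are plain substrings, they can be inspected directly, which is the sense in which such a classifier is interpretable.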