An Efficient CNN-based Classification on G-protein
Coupled Receptors Using TF-IDF and N-gram
Man Li, Cheng Ling¹,∗ and Jingyang Gao¹,∗
¹ College of Information Science and Technology,
Beijing University of Chemical Technology,
Beijing, China.
∗ CL: s0897918@gmail.com; ∗ JG: gaojy@buct.edu.cn
Abstract—Protein sequence classification is increasingly crucial
in the current era of “biological information sciences”, in which
researchers rely on functional genomics and proteomics technologies
to predict the functions of large numbers of newly discovered proteins.
This has sparked interest in methods that do not rely
on traditional sequence alignment but instead use machine learning
approaches. In this paper, we present a Convolutional Neural
Network (CNN) based method for classifying G-protein Coupled
Receptors (GPCRs) at different levels.
The method is implemented in conjunction with an improved
feature extraction method and a TF-IDF feature weighting strategy.
Experimental results indicate that the proposed method improves
significantly on previous methods, attaining accuracies
of up to 98.34%, 98.13% and 96.47% at the
family level, subfamily level I and subfamily level II, respectively.
Compared with other well-known classification methods for
GPCRs, the proposed method reduces the classification error rate
by at least 55.14% (family level), 72.86% (level I) and
52.63% (level II).
Index Terms—Protein Sequence Classification, Convolutional
Neural Network, G-protein Coupled Receptors
I. INTRODUCTION
Protein sequence classification plays a critical role in the
biological sciences. Because advances in biotechnology have drastically
increased the number of newly discovered proteins, developing efficient
and accurate methods for protein classification has become
an imperative goal of proteomics. Various methods have been
developed for protein sequence classification. Broadly, they
fall into two categories: most are based on sequence alignment
and motifs, and the others rely on machine learning algorithms. The
earliest methodology is sequence alignment. A score
matrix is built for a pair of sequences, in which each entry
corresponds to the similarity score of the relevant positions
of the two sequences. The sequence alignment problem then
becomes one of finding the best alignment path through the score
matrix. Early alignment methods aimed to find
the best global alignment. The Needleman-Wunsch
dynamic programming algorithm [1] is one such
algorithm; it calculates the global similarity between
query and database sequences. Since it is possible that the
newly discovered sequences match existing ones only regionally,
searching for local alignments is also reasonable and
effective. Based on this consideration, another widely used
dynamic programming algorithm, the Smith-Waterman
algorithm [2], was developed. It performs sequence
alignment by searching for locally similar regions between two
sequences. Although these traditional dynamic programming algorithms
are relatively precise, the major challenge in applying them
to a database-wide search is that they are time
consuming and computationally expensive.
To solve this problem, heuristic search algorithms,
such as BLAST [3] and FASTA [4], were developed.
They search for short sequence segments and extend only those
that meet a criterion into larger regions of similarity. Compared
with dynamic programming algorithms, BLAST and FASTA are
more efficient and more widely used. All the algorithms mentioned
above perform pairwise sequence alignment; in addition,
multiple sequence alignment tools, such as ClustalW
[5], BLOCKMAKER [6] and T-Coffee [7], are also frequently
employed. Details of these methods are beyond the scope of
this article and will not be covered here.
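To make the distinction between global and local alignment concrete, the following is a minimal sketch of the score-matrix recurrences behind Needleman-Wunsch and Smith-Waterman style alignment. The function names and the scoring values (match = 1, mismatch = -1, linear gap penalty = -1) are illustrative assumptions, not parameters taken from [1] or [2]; practical tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score of the best end-to-end alignment.

    Scoring values are illustrative assumptions, not those of [1].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # The global score is read from the bottom-right cell.
    return score[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score of the best locally similar region.

    Scoring values are illustrative assumptions, not those of [2].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at zero lets an alignment restart anywhere.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    # The local score is the maximum over all cells.
    return best


if __name__ == "__main__":
    # Two toy sequences that share only the central region "ACGT".
    query, reference = "XXACGTYY", "ZZACGTWW"
    print("global:", global_alignment_score(query, reference))
    print("local:", local_alignment_score(query, reference))
```

Both functions fill a table of size (len(a)+1) x (len(b)+1), so running either against every sequence in a database scales with the product of the sequence lengths, which is the cost that heuristic tools such as BLAST and FASTA avoid.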
Alignment is the common theme among the methodologies outlined
above. The fundamental principle is to align a query
sequence to reference sequences and assign it to the class to
which the best-matching reference sequence belongs. However, a
fatal flaw [8] of this methodology is that it often yields
unreliable alignments when the similarity between the aligned
sequences is less than 40% [9][10]. This has
sparked interest in finding more suitable algorithms. Recently,
machine learning algorithms [11] have attracted extensive attention
from researchers and have been applied in various scientific fields.
To address sequence classification, a significant
number of machine learning based methods have been
proposed. Li et al. [12] predicted transmembrane proteins
using only protein sequence information via n-grams with a
random forest classifier, obtaining a maximum
accuracy of 95.6%. Dongardive et al. [13] proposed a
K-Nearest Neighbor (KNN) based method and
conducted experiments on 717 sequences [14][15]. The results
showed that the procedure with the cosine measure and
15 neighbors gave a classification accuracy of up
to 84%. Bandyopadhyay et al. [16] developed a variable-length
fuzzy genetic clustering algorithm to find prototypes for each
superfamily and used a nearest neighbor algorithm for
classification, obtaining an accuracy of up to 81.3%
for three superfamilies. Iqbal et al. [17] proposed an encoding
technique with a decision tree classification algorithm, which
2017 IEEE Symposium on Computers and Communications (ISCC)
978-1-5386-1629-1/17/$31.00 ©2017 IEEE