CNN-Based Classification of G-Protein Coupled Receptors Using TF-IDF and N-gram

Research paper


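"This paper presents a convolutional neural network (CNN) based method for classifying G-protein coupled receptors using TF-IDF and N-gram features, with the aim of improving the efficiency and accuracy of protein sequence classification. The authors are from the College of Information Science and Technology, Beijing University of Chemical Technology. The study reports that the method performs well at the different levels of GPCR classification, with accuracies of 98.34%, 98.13% and 9..."

In the current era of bioinformatics, advances in functional genomics and proteomics have made it essential to predict the functions of large numbers of new proteins. Traditional sequence alignment methods can be limiting on large-scale data, so researchers have turned to machine learning, and in particular to deep learning techniques such as convolutional neural networks (CNNs). G-protein coupled receptors (GPCRs) are a class of proteins found widely on cell membranes. They play a key role in signal transduction and are involved in many physiological and pathological processes.

The proposed method combines TF-IDF (Term Frequency-Inverse Document Frequency) and N-gram techniques to extract features from protein sequences. TF-IDF is a common text-mining technique for assessing how important a term is to a document. It combines term frequency with inverse document frequency, so terms that are frequent in one document but uncommon across the collection receive high weights. An N-gram model is a statistical language model that captures local patterns in a sequence by analyzing runs of n consecutive characters, here amino acids.

In the CNN model, the TF-IDF and N-gram features serve as the initial representation at the input layer. CNNs learn and extract features automatically, which makes them well suited to sequence data. Through convolutional, pooling and fully connected layers, a CNN can learn hierarchical, abstract features from protein sequences. The improved feature extraction method may also include preprocessing steps such as normalization or standardization, as well as specific filter designs that better capture GPCR features.

The experimental results demonstrate the effectiveness of the method: compared with existing methods, classification accuracy improves significantly, reaching up to 98.34%, 98.13% and 9... (the figure is incomplete because the source text is truncated). This indicates that a CNN model combining TF-IDF and N-gram features performs very well on GPCR classification and has considerable potential for understanding the relationship between GPCR structure and function and for drug discovery.

The study offers a new perspective on protein sequence classification: using machine learning and deep learning tools, in particular TF-IDF and N-gram features. The approach improves classification efficiency and reduces the dependence on domain expertise, which helps advance the automation of bioinformatics.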

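The N-gram and TF-IDF feature extraction described in the summary above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `tfidf`, the choice of n = 2, the plain `log(N / df)` form of the inverse document frequency, and the toy sequences are all assumptions made here for the example.

```python
import math
from collections import Counter


def ngrams(sequence, n):
    """Return all overlapping n-grams (length-n substrings) of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def tfidf(sequences, n=2):
    """Return one {n-gram: weight} dict per sequence.

    tf  = occurrences of the n-gram in the sequence / n-grams in the sequence
    idf = log(N / df), with N sequences and df sequences containing the n-gram
    (an assumed, unsmoothed variant; libraries often add smoothing terms).
    """
    counts = [Counter(ngrams(s, n)) for s in sequences]

    # Document frequency: in how many sequences each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    total = len(sequences)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append(
            {g: (k / length) * math.log(total / df[g]) for g, k in c.items()}
        )
    return vectors


# Toy sequences made up for illustration only (not real GPCRs).
toy = ["MKTAYIAK", "MKTLLLAK", "GGSAYIAG"]
weights = tfidf(toy, n=2)
```

In this sketch an n-gram that occurs in only one sequence (such as `TA` in the first toy sequence) receives a larger weight than one shared by several sequences (such as `MK`), which is the behaviour the summary attributes to TF-IDF.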

An Efficient CNN-based Classification on G-protein Coupled Receptors Using TF-IDF and N-gram

Man Li, Cheng Ling¹,∗ and Jingyang Gao¹,∗

College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China.

∗CL: s0897918@gmail.com; ∗JG: gaojy@buct.edu.cn

Abstract—Protein sequence classification is increasingly crucial in the current era of "biological information sciences", in which researchers work intensively on functional genomics and proteomics technologies to predict the functions of large numbers of new proteins. This has sparked interest in methods that rely on machine learning rather than on traditional sequence alignment. In this paper, we present a Convolutional Neural Network (CNN) based method for classifying G-protein Coupled Receptors (GPCRs) at different levels. The method is implemented in conjunction with an improved feature extraction method and a TF-IDF feature weighting strategy. Experimental results indicate that the proposed method significantly improves on previous methods, attaining an accuracy of up to 98.34%, 98.13% and 96.47% at the family level, subfamily level I and subfamily level II, respectively. Compared with other well-known classification methods for GPCRs, the proposed method reduces the classification error rate by at least 55.14% (family level), 72.86% (level I) and 52.63% (level II).

Index Terms—Protein Sequence Classification, Convolutional Neural Network, G-protein Coupled Receptors
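To make the error-rate figures in the abstract concrete, the short sketch below converts each reported accuracy into an error rate and computes the competing error rate that the stated reduction would imply. It assumes the reductions are relative (new error = old error × (1 − reduction)), which the abstract does not state explicitly; the helper name `implied_baseline_error` is hypothetical, and the resulting baselines are derived values, not figures reported in the paper.

```python
def implied_baseline_error(accuracy_pct, reduction_pct):
    """Baseline error rate (in %) implied by a relative error-rate reduction.

    Assumes: new_error = baseline_error * (1 - reduction_pct / 100).
    """
    error = 100.0 - accuracy_pct
    return error / (1.0 - reduction_pct / 100.0)


# (accuracy %, stated minimum error-rate reduction %) taken from the abstract.
reported = {
    "family": (98.34, 55.14),
    "subfamily I": (98.13, 72.86),
    "subfamily II": (96.47, 52.63),
}

# Derived under the relative-reduction assumption; not reported in the paper.
implied = {
    level: implied_baseline_error(acc, red)
    for level, (acc, red) in reported.items()
}
```

Under this assumption, an accuracy of 98.34% corresponds to an error rate of 1.66%, and a 55.14% reduction would place the best competing method's error rate at roughly 3.7% at the family level.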

I. INTRODUCTION

Protein sequence classification plays a critical role in the biological sciences. Because advances in biotechnology have drastically increased the number of new proteins, developing efficient and accurate methodologies for protein classification has become an imperative goal of proteomics. Various methods have been developed for protein sequence classification. Broadly, the methodologies fall into two categories: most methods are based on sequence alignment and motifs, and the others rely on machine learning algorithms. The earliest methodology is sequence alignment. A score matrix is built from a pair of sequences, where each matrix value is the similarity score of the corresponding positions in the two sequences. The sequence alignment problem then becomes one of finding the best alignment path through the score matrix. Early sequence alignment aimed to find the best global alignment. The Needleman-Wunsch dynamic programming algorithm [1] is one such algorithm; it calculates the global similarity between query and database sequences. Since newly discovered sequences may match existing ones only in certain regions, searching for local alignments is also reasonable and incisive. Based on this consideration, another widely used dynamic programming algorithm, the Smith-Waterman algorithm [2], was developed. It performs sequence alignment by searching for locally similar regions between two sequences. Although the traditional dynamic programming algorithms are relatively precise, the major challenge in applying them to a database-wide search is that they are time-consuming and computationally very expensive. To solve this problem, heuristic search algorithms such as BLAST [3] and FASTA [4] were developed. They search for short sequence segments and extend only those that meet a criterion into larger regions of similarity. Compared with dynamic programming algorithms, BLAST and FASTA are more efficient and more widely used. All the algorithms mentioned above perform pair-wise sequence alignment; in addition, multiple sequence alignment tools such as ClustalW [5], BLOCKMAKER [6] and T-Coffee [7] are also frequently employed. Details of these methods are beyond the scope of this article and will not be covered here.
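The difference between the global (Needleman-Wunsch) and local (Smith-Waterman) dynamic programming formulations described above can be illustrated with a small sketch. It is a simplified teaching example, not code from the paper: the function name `alignment_score`, the linear gap penalty and the scoring values (match +1, mismatch −1, gap −1) are assumptions chosen here, whereas practical tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Score the best alignment of sequences a and b by dynamic programming.

    local=False gives the Needleman-Wunsch (global) score;
    local=True gives the Smith-Waterman (local) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Global alignment penalizes leading gaps; local alignment starts at 0.
    if not local:
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell

    # Global: bottom-right cell. Local: best cell anywhere in the matrix.
    return best if local else score[-1][-1]


query = "GATTACA"
target = "XXGATTACAYY"
global_score = alignment_score(target, query)             # pays for 4 gaps
local_score = alignment_score(target, query, local=True)  # ignores the flanks
```

Both variants fill a matrix of size (len(a)+1) × (len(b)+1), which is why a database-wide search with them is expensive and why heuristics such as BLAST and FASTA were introduced.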

Alignment is the common theme among the methodologies outlined above. The fundamental principle is to align a query sequence to reference sequences and assign it to the class of the best-matching reference sequence. However, a fatal flaw [8] of this methodology is that alignments are often unreliable when the similarity between the aligned sequences is below 40% [9][10]. This has sparked interest in finding better algorithms. Recently, machine learning algorithms [11] have attracted extensive attention from scholars and have been applied in various scientific fields. To address sequence classification, a significant number of machine learning based methods have been presented. Li et al. [12] predicted transmembrane proteins using only protein sequence information via n-grams with a random forest classifier, obtaining a maximum accuracy of 95.6%. Dongardive et al. [13] proposed a K-Nearest Neighbor (KNN) based method and conducted experiments on 717 sequences [14][15]. The results revealed that the procedure with the cosine measure and 15 neighbors gave a classification accuracy of up to 84%. Bandyopadhyay et al. [16] developed a variable-length fuzzy genetic clustering algorithm to find prototypes for each superfamily and used a nearest neighbor algorithm for the classification, obtaining an accuracy of up to 81.3% for three superfamilies. Iqbal et al. [17] proposed an encoding technique with a decision tree classification algorithm, which

2017 IEEE Symposium on Computers and Communications (ISCC)
