Sequence-Based Prediction of RNA-Binding Proteins Using mRMR and Random Forest

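PDF | 1.33 MB | Updated 2024-08-27

"This paper presents a sequence-based method for predicting RNA-binding proteins that uses a random forest together with minimum redundancy maximum relevance (mRMR) feature selection."

Predicting RNA-binding proteins is a challenging task in bioinformatics. These proteins play essential roles in the cell, such as regulating gene expression and participating in signal transduction, and although many studies have addressed the problem, prediction accuracy still leaves room for improvement. The paper proposes combining the random forest algorithm with mRMR feature selection to improve that accuracy.

Random forest is an ensemble learning method that builds many decision trees and aggregates their outputs (a majority vote in classification) to improve predictive performance. It handles large numbers of features and is comparatively resistant to overfitting. mRMR is a feature selection strategy that looks for a set of features that are highly relevant to the target while keeping the redundancy among them as low as possible. Combining the two helps extract the most representative and discriminative features and so improves the model's predictive power.

In the study, the authors take amino acid sequences as input and apply mRMR to select the amino acid attributes that are most valuable for prediction. These attributes may include physicochemical properties of the amino acids and information on relative sequence position. After feature selection, a random forest model is trained to classify whether a protein is an RNA-binding protein. This makes it easier to identify proteins with RNA-binding ability, which helps researchers understand their function and plan experimental validation.

The paper's contribution is an accurate prediction tool that can support protein function annotation, studies of disease association, and drug target discovery. The approach may also generalize to other prediction problems for biological macromolecules, such as DNA-binding proteins or protein-protein interactions.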

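The greedy selection idea behind mRMR can be sketched in a few lines. This is a minimal illustration, assuming discrete feature values and the mutual-information-difference criterion (relevance minus mean redundancy). The function names `mutual_information` and `mrmr_select` and the toy data are hypothetical, and the paper's own mRMR configuration is not described in this excerpt.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def mrmr_select(features, labels, k):
    """Greedily pick k column indices from a list of discrete feature columns."""
    relevance = [mutual_information(col, labels) for col in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            # Mean mutual information with the features already chosen.
            redundancy = sum(
                mutual_information(features[j], features[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): column 1 duplicates column 0, column 2 is
# independent of column 0, and the label is 1 when column 0 or column 2 is 1.
labels = [0, 0, 1, 1, 1, 1, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
chosen = mrmr_select(features, labels, 2)
print(chosen)
```

On this toy data the selection contains column 2 and only one of the two identical columns, because a duplicate's redundancy with an already selected column outweighs its relevance.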
BioMed Research International 3

correlation of RNA-binding residues in the query protein, respectively. Furthermore, the BP(2) formula represents the relevance of two RNA-binding residues at different distances from 1 to N − 1 and takes into account the fact that the correlation value between two residues is smaller when the distance is larger, which supports the rationality of the definition.

We also defined two nonbinding propensities for nonbinding proteins. The definitions of NBP(1) and NBP(2) are similar to the definitions of BP(1) and BP(2). Consider

$$\mathrm{NBP}^{(1)} = \frac{1}{N} \sum_{i=1}^{n} \frac{\mathrm{RI}(i)}{10}, \tag{3}$$

where N and n are the number of amino acids and the number of nonbinding residues in this protein, respectively. RI(i) is the reliability index of the prediction result of the ith nonbinding residue obtained from PRBR. Consider

$$\mathrm{NBP}^{(2)} = \sum_{i=1}^{N-1} (N - i + 1) \sum_{k=1}^{n(i)} \frac{\overline{\mathrm{RI}}(k)}{N - 1}, \tag{4}$$

where N is the number of amino acids in this protein, n(i) is the number of pairs of nonbinding residues at a distance i, and $\overline{\mathrm{RI}}(k)$ is the average value of the reliability index for nonbinding residue k and nonbinding residue k + i.

NBP(1) and NBP(2) describe the appearance and the correlation of nonbinding residues in the query protein, respectively, and are analogous to BP(1) and BP(2). We also used the reliability index because the prediction results for nonbinding residues are applied in those formulas.
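The two ingredients of these propensities, a reliability-weighted count of predicted residues and the number of residue pairs at each distance, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the reliability index lies on a 0 to 10 scale and that the first propensity is the sum of RI/10 divided by the protein length. The names `nbp1` and `pair_counts` and the example values are hypothetical.

```python
def nbp1(reliability, n_residues):
    """Sum of RI/10 over predicted nonbinding residues, divided by protein length.

    The 0-10 RI scale and this normalization are assumptions.
    """
    return sum(ri / 10 for ri in reliability) / n_residues

def pair_counts(positions):
    """n(i): number of pairs of nonbinding residues separated by distance i."""
    counts = {}
    for idx, a in enumerate(positions):
        for b in positions[idx + 1:]:
            counts[b - a] = counts.get(b - a, 0) + 1
    return counts

positions = [0, 1, 3]      # hypothetical sorted indices of predicted nonbinding residues
reliability = [8, 6, 10]   # hypothetical RI values for those residues
value = nbp1(reliability, n_residues=5)
counts = pair_counts(positions)
print(value, counts)
```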

2.2.2. Evolutionary Information Combined with Physicochemical Properties (EIPP). Evolutionary information in the form of a position-specific scoring matrix (PSSM) has been used successfully to represent proteins in many applications, such as prediction of DNA-binding residues [16–21] and RNA-binding residues [15, 22, 23]. Here, PSSM profiles were generated using the PSI-BLAST program [24] to search the nonredundant (NR) database through three iterations, with 0.001 as the E-value cutoff for multiple sequence alignment. The PSSM scoring matrix has 20 × N elements, where N is the length of the protein. However, different proteins may have different numbers of amino acids. Therefore, the PSSM could not be used directly as a feature in the prediction work because the machine learning methods require the input feature to have a fixed length. We therefore generated a PSSM-400, a 400-dimensional vector derived from the PSSM. PSSM-400 is composed of the occurrences of each type of amino acid corresponding to each type of amino acid in the sequences. We pooled all rows that belonged to the same amino acid in this PSSM to form a new matrix. We then converted each new matrix to a vector by adding all the normalized values in each column of the new matrix. This produced a 20-dimensional vector for each new matrix, and the 20 vectors together form the PSSM-400.
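The pooling step described above can be sketched as follows. This is a minimal illustration, assuming the PSSM scores have already been normalized (the normalization itself is not specified in this passage) and that the columns follow the amino acid order commonly used in PSI-BLAST ASCII PSSM files. The function name `pssm_400` and the example values are hypothetical.

```python
# Assumed column order of the PSSM (the order used in PSI-BLAST ASCII PSSM files).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def pssm_400(sequence, pssm):
    """Pool PSSM rows by residue type and sum each column.

    sequence: protein sequence of length N (one-letter codes).
    pssm: N rows of 20 already-normalized scores, one row per residue.
    Returns a 20 x 20 nested list; flattening it gives 400 values.
    """
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    pooled = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        k = index.get(residue)
        if k is None:  # skip nonstandard residues such as X
            continue
        for i, score in enumerate(row):
            pooled[k][i] += score
    return pooled

# Tiny example: two alanine rows are pooled together, arginine has one row.
sequence = "AAR"
pssm = [[0.5] * 20, [0.25] * 20, [1.0] * 20]
matrix = pssm_400(sequence, pssm)
vector = [v for row in matrix for v in row]
print(len(vector))
```

The result has a fixed length of 400 regardless of the protein length, which is what makes it usable as classifier input.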

The physicochemical property feature has been used effectively in many fields, such as the identification of DNA/RNA-binding proteins [7, 9, 14, 25, 26] and the identification and prediction of protein-protein interactions [27]. Thus, an EIPP was generated by merging the 20 amino acid columns of the PSSM-400 into a single column containing the information for a certain physicochemical property. Six physicochemical properties that we used successfully in previous works [15] were considered for combining with PSSM-400 to generate the EIPP: the pKa values of the amino group, the pKa values of the carboxyl group [28], the molecular mass [6], the lowest free energy [29], the Balaban index [30], and the Wiener index [31]. The entry $E_{ak}$ of the kth type of amino acid in a protein sequence for a certain physicochemical property in EIPP was calculated with

$$E_{ak} = \sum_{i=1}^{20} P_a(i)\, M_k(i), \tag{5}$$

where a is the index of a certain physicochemical property, k is the index of the type of amino acid in the query protein sequence, i is the index of the type of naïve amino acid, $M_k(i)$ is the normalized value of the ith type of naïve amino acid for the kth type of amino acid in the protein sequence in the PSSM-400, and $P_a(i)$ is the normalized value of physicochemical property a for the ith type of amino acid. Therefore, the vector size of the EIPP feature is 6 × 20.
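The weighted sum in equation (5) amounts to a small matrix product: each EIPP entry sums, over the amino acid types, a normalized property value times the corresponding PSSM-400 value. The sketch below illustrates this with reduced, made-up dimensions. The function name `eipp` and all numbers are hypothetical, and real inputs would be a 6 × 20 property table and a 20 × 20 PSSM-400.

```python
def eipp(properties, pssm400):
    """Return E with E[a][k] = sum over i of properties[a][i] * pssm400[k][i]."""
    return [
        [sum(p * m for p, m in zip(prop, row)) for row in pssm400]
        for prop in properties
    ]

# Hypothetical toy input: 2 properties and 3 amino acid types
# (the paper uses 6 properties and 20 types).
properties = [[0.1, 0.2, 0.3], [1.0, 0.0, 0.5]]
pssm400 = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
features = eipp(properties, pssm400)
flat = [v for row in features for v in row]
print(len(flat))
```

With six properties and 20 amino acid types, the flattened result has 6 × 20 = 120 values.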

2.2.3. Conjoint Triad (CT). Electrostatic and hydrophobic interactions influence protein-nucleic acid interactions and may be reflected by the dipoles and volumes of the side chains of amino acids, respectively. Based on the dipoles and volumes of the side chains, the 20 kinds of amino acids could be clustered into seven classes [32]. Considering that disulfide bonds have no special effect on protein-nucleic acid interactions, the unique amino acid cysteine in the seventh class was put back into the third class in this study. Therefore, the 20 kinds of amino acids were clustered into six classes as follows. Class a: Ala, Gly, and Val; Class b: Ile, Leu, Phe, and Pro; Class c: Tyr, Met, Thr, Ser, and Cys; Class d: His, Asn, Gln, and Trp; Class e: Arg and Lys; and Class f: Asp and Glu. Following a feature construction method similar to that used in [32], a protein is described by the conjoint triad feature with 6 × 6 × 6 = 216 dimensions, where each component of the feature vector is the frequency of the corresponding triad.
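The six-class conjoint triad encoding can be sketched as follows. This is an illustrative version, assuming that "frequency" means the triad count divided by the number of valid three-residue windows; the method in [32] may normalize differently. The names `CLASSES`, `TRIADS`, and `conjoint_triad` are hypothetical.

```python
from itertools import product

# The six residue classes listed in the text, in one-letter codes.
CLASSES = {
    "a": "AGV",
    "b": "ILFP",
    "c": "YMTSC",
    "d": "HNQW",
    "e": "RK",
    "f": "DE",
}
RESIDUE_CLASS = {aa: c for c, members in CLASSES.items() for aa in members}
TRIADS = ["".join(t) for t in product("abcdef", repeat=3)]  # 6 * 6 * 6 = 216

def conjoint_triad(sequence):
    """Return 216 triad frequencies for a protein sequence."""
    counts = dict.fromkeys(TRIADS, 0)
    total = 0
    for j in range(len(sequence) - 2):
        window = sequence[j:j + 3]
        # Windows containing a nonstandard residue are skipped.
        if all(aa in RESIDUE_CLASS for aa in window):
            counts["".join(RESIDUE_CLASS[aa] for aa in window)] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRIADS]

feature = conjoint_triad("AGVRK")  # windows: AGV, GVR, VRK
print(len(feature))
```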

As mentioned above, for each query protein, the size of the combined feature vector is 4 + 120 + 216 = 340 (the four propensities, the EIPP, and the CT).

2.3. Algorithms to Classify and Measure a Classifier's Performance. The random forest (RF) algorithm [12] is a classification algorithm that uses an ensemble of tree-structured classifiers; it has been used successfully in many data classification applications and achieves high performance. The random forest R package [33] was used to implement the RF algorithm.

To evaluate the performance of the classifier, a 5-fold cross-validation procedure on the training dataset was used in this research. During the procedure, we randomly divided the data instances into five parts. Four of these parts were input into the RF to establish a model for classification,
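The random five-way split can be sketched as follows. This is a minimal illustration that only produces the index partitions; it assumes, as in standard cross-validation, that the fifth part is held out for testing in each round. The function name `five_fold_splits`, the seed, and the instance count are hypothetical, and the classifier itself (the R random forest package in the paper) is not reproduced here.

```python
import random

def five_fold_splits(n_instances, seed=0):
    """Yield (train_indices, test_indices) pairs for 5-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)          # random assignment to parts
    folds = [indices[f::5] for f in range(5)]     # five nearly equal parts
    for f in range(5):
        test = folds[f]
        train = [idx for g in range(5) if g != f for idx in folds[g]]
        yield train, test

splits = list(five_fold_splits(23))
for train, test in splits:
    print(len(train), len(test))
```

Each instance appears in exactly one test part, so every instance is predicted once by a model that did not see it during training.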
