Cost-Sensitive Rotation Forest Algorithm: A New Method for Gene Expression Data Classification


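"A cost-sensitive rotation forest algorithm for gene expression data classification"

Gene expression data analysis is central to current bioinformatics because it helps researchers identify disease-related gene patterns and supports disease diagnosis. Traditional classification algorithms tend to focus on classification accuracy alone and ignore classification cost. This research paper proposes a cost-sensitive rotation forest algorithm aimed specifically at gene expression data classification.

Rotation forest is an ensemble learning method that improves classification performance by combining multiple decision trees, each trained on a different rotation of the feature space. Building the trees on different feature subspaces increases the diversity of the ensemble, which in turn improves accuracy. For gene expression data, however, classification errors can have serious consequences, such as a wrong or missed diagnosis, so classification cost becomes especially important.

The proposed method considers three kinds of classification cost: misclassification cost, test cost and rejection cost. Misclassification cost is the penalty for assigning a sample to the wrong class. Test cost is the expense of acquiring the gene expression data, which can be considerable with high-throughput sequencing. Rejection cost is the penalty incurred when the system cannot determine the class of a sample and declines to classify it. By integrating these costs into the rotation forest algorithm, the method weighs classification decisions so as to reduce the total cost.

The authors first describe the implementation of the cost-sensitive rotation forest algorithm, including how cost information enters decision tree construction and how an optimization strategy minimizes the overall cost. They then run experiments on several gene expression datasets to validate the algorithm. The results indicate that, compared with the standard rotation forest and other classification algorithms, the cost-sensitive rotation forest maintains or improves classification accuracy while significantly reducing the total classification cost. The paper also discusses the limitations of the algorithm and possible improvements, such as estimating per-class costs more precisely and adapting to other types of data and applications.

The work is a useful reference for researchers who need efficient, economical gene expression data classification under limited resources. In summary, the algorithm accounts for both classification accuracy and the economic cost of real-world use, which matters for biomedical decision-making and for efficient disease diagnosis.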
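To make the trade-off between misclassification cost and rejection cost concrete, here is a minimal sketch of a minimum-expected-cost decision rule with a reject option. It is not the paper's algorithm: the function name `min_cost_decision`, the class labels and every cost value are assumptions chosen for illustration, and test cost is left out for brevity.

```python
from typing import Dict, Optional


def min_cost_decision(
    posteriors: Dict[str, float],
    misclass_cost: Dict[str, Dict[str, float]],
    reject_cost: float,
) -> Optional[str]:
    """Return the label with the lowest expected cost, or None to reject.

    misclass_cost[true][predicted] is the cost of predicting `predicted`
    when the real class is `true`.
    """
    expected = {
        predicted: sum(
            posteriors[true] * misclass_cost[true][predicted]
            for true in posteriors
        )
        for predicted in posteriors
    }
    best = min(expected, key=expected.get)
    # Reject when even the cheapest label costs more than declining to classify.
    return best if expected[best] <= reject_cost else None


# Hypothetical costs: missing a tumor is ten times worse than a false alarm.
COSTS = {
    "tumor": {"tumor": 0.0, "normal": 10.0},
    "normal": {"tumor": 1.0, "normal": 0.0},
}

# "tumor" wins despite its lower probability, because a miss is expensive.
print(min_cost_decision({"tumor": 0.2, "normal": 0.8}, COSTS, reject_cost=1.5))
# With a cheap reject option and an uncertain sample, the rule declines.
print(min_cost_decision({"tumor": 0.5, "normal": 0.5}, COSTS, reject_cost=0.3))
```

The first call shows why a cost-sensitive classifier can disagree with an accuracy-driven one: the most probable class is not always the cheapest decision.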

Contents lists available at ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

A cost-sensitive rotation forest algorithm for gene expression data classification

Huijuan Lu, Lei Yang, Ke Yan⁎, Yu Xue, Zhigang Gao

College of Information Engineering, China Jiliang University, 258 Xueyuan Street, Hangzhou 310018, China

Nanjing University of Information Science & Technology, Nanjing 210044, China

College of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China

ARTICLE INFO

Keywords:

Gene expression data

Rotation forest

Cost-sensitive

Misclassification cost

Rejection cost

Test cost

ABSTRACT

Existing work shows that the rotation forest algorithm has competitive performance in terms of classification accuracy for gene expression data. However, most existing work focuses only on classification accuracy and neglects classification costs. In this study, we propose a cost-sensitive rotation forest algorithm for gene expression data classification. Three classification costs, namely misclassification cost, test cost and rejection cost, are embedded into the rotation forest algorithm. This extension is named the cost-sensitive rotation forest algorithm. Experimental results show that the cost-sensitive rotation forest algorithm effectively reduces the classification cost and makes the classification results more reliable.

1. Introduction

Increasing environmental pollution has made cancer the most common fatal disease worldwide in the current century [1]. The rapid development of the Internet and database management technologies makes automated cancer diagnosis possible [2]. In bioinformatics, various data mining and machine learning techniques have been proposed to assist cancer diagnosis at the molecular level [3–7]. With the advent of the DNA microarray, biologists believe that the classification of gene expression data provides important information for cancer diagnosis [8–13]. Current research on gene expression data classification focuses on classification accuracy, generalization ability, the complexity and understandability of the algorithm, the stability of the classifiers, etc. However, it is usually difficult for traditional classifiers to achieve high and stable classification results because of three difficulties of gene expression data classification, namely high dimensionality, imbalanced and noisy data, and small sample size [14]. Besides classification accuracy, classification cost should also be considered in gene expression data classification problems [15]. In this work, we take classification cost into consideration and introduce a series of cost-sensitive learning algorithms to overcome the difficulties of gene expression data classification.

Machine learning techniques, such as neural networks [16–18] and support vector machines [19], are widely used in gene expression data classification, medical diagnostic analysis, industrial data analysis, etc., because of their high classification accuracy. The decision tree (DT) is a conventional machine learning model that has a tree structure and can be rewritten as a set of 'if-else' rules. Each branch of a DT represents a class of samples with common characteristics in the feature space. There are many extensions of the DT algorithm, such as EG2, ID3, C4.5, CART, etc. [20].
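The equivalence between a tree and a rule set can be shown with a small sketch. The tree below is entirely hypothetical (the gene names, thresholds and class labels are invented), and the helpers `predict` and `to_rules` are illustrative names, not part of any algorithm cited here.

```python
from typing import Dict, List, Sequence, Union

# A node is either a class label (leaf) or a dict describing one threshold test.
Node = Union[str, dict]

# Hypothetical tree: gene names and thresholds are invented for illustration.
TREE: Node = {
    "feature": "gene_A",
    "threshold": 2.5,
    "low": "normal",
    "high": {
        "feature": "gene_B",
        "threshold": 1.0,
        "low": "normal",
        "high": "tumor",
    },
}


def predict(node: Node, sample: Dict[str, float]) -> str:
    """Walk from the root to a leaf, following one threshold test per level."""
    while not isinstance(node, str):
        branch = "high" if sample[node["feature"]] > node["threshold"] else "low"
        node = node[branch]
    return node


def to_rules(node: Node, conditions: Sequence[str] = ()) -> List[str]:
    """Flatten the tree into one 'if ... then ...' rule per leaf."""
    if isinstance(node, str):
        return ["if " + (" and ".join(conditions) or "true") + " then " + node]
    feature, threshold = node["feature"], node["threshold"]
    return to_rules(
        node["low"], [*conditions, f"{feature} <= {threshold}"]
    ) + to_rules(node["high"], [*conditions, f"{feature} > {threshold}"])


for rule in to_rules(TREE):
    print(rule)
```

Each root-to-leaf path becomes one rule, so the three leaves of this tree yield three rules.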

Random forest (RF) [21] is an ensemble classifier that consists of multiple DTs built through a random splitting of the feature space. The rotation forest (RoF) is developed based on RF by assembling multiple DTs [22]. It segments the feature space into subspaces, extracts the most important features from each subset and repeats the process to obtain the most distinguishable training dataset and base classifiers for different feature subspaces. Recently, Hosseinzadeh and Eftekhari [23] proposed a high-performance RoF for imbalanced data classification by adding the fuzzy C-means clustering method to the RoF classification process. Fang et al. [24] utilized two improved RoF algorithms to classify highly imbalanced data. Experiments showed satisfactory results on AUC, the most widely used evaluation criterion for imbalanced data.
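The feature-space segmentation step of RoF can be sketched as follows. This is a simplified illustration of the commonly described procedure (random feature subsets, principal component analysis on a bootstrap sample of each subset, and a block-structured rotation matrix), not necessarily the exact variant used in the works cited above. The function name `rotation_matrix`, the 75% sampling fraction and the synthetic data are assumptions.

```python
import numpy as np


def rotation_matrix(
    X: np.ndarray,
    n_subsets: int,
    rng: np.random.Generator,
    sample_frac: float = 0.75,
) -> np.ndarray:
    """Build one rotation matrix from PCA on random, disjoint feature subsets."""
    n_samples, n_features = X.shape
    rotation = np.zeros((n_features, n_features))
    # Randomly partition the feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    for cols in subsets:
        # Bootstrap a fraction of the rows so each base tree sees different axes.
        n_rows = max(2, int(sample_frac * n_samples))
        rows = rng.choice(n_samples, size=n_rows, replace=True)
        block = X[np.ix_(rows, cols)]
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        # Eigenvectors of the covariance matrix are the principal axes;
        # all of them are kept, so no information is discarded.
        _, axes = np.linalg.eigh(cov)
        rotation[np.ix_(cols, cols)] = axes
    return rotation


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))  # synthetic stand-in for expression data
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rotated = X @ R  # one base tree would be trained on this rotated data
print(X_rotated.shape)
```

In a full ensemble, a fresh rotation matrix would be drawn for every base tree, which is what makes the trees differ from one another.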

The classification costs mainly consist of three types, namely misclassification cost, test cost and rejection cost [25]. Hybrid learning methods extend conventional classifiers by embedding new factors into the algorithms [26,27]. By embedding classification cost into traditional classifiers, many cost-sensitive classification models have been proposed [28]. Zidelmal et al. [29] embedded classification cost into the support vector machine (SVM) to classify ECG beats and achieved an average accuracy of 97.2% with no rejection and 98.8% for

http://dx.doi.org/10.1016/j.neucom.2016.09.077

Received 16 May 2016; Received in revised form 25 July 2016; Accepted 5 September 2016

⁎ Corresponding author. E-mail address: yanke@cjlu.edu.cn (K. Yan).

Neurocomputing 228 (2017) 270–276

Available online 02 November 2016

