A matched-pairs feature selection method for differential gene expression analysis that accounts for tumor purity

Research paper


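Matched-pairs feature selection in gene expression data analysis

Feature selection is an important step in gene expression data analysis, and many feature selection methods have been proposed. Existing methods, however, often overlook the effect of tumor purity. To address this, the authors propose a new matched-pairs feature selection method that takes tumor purity into account. Its main idea is to analyze matched tumor and normal samples as pairs in order to identify differential gene expression patterns between them, while accounting for tumor purity so that the selected features better reflect the biology underlying the expression data.

The method can be described in four steps. First, gene expression data are collected and preprocessed, including normalization and feature scaling. Second, a matched-pairs analysis is performed to identify differential gene expression patterns between tumor and normal samples. Third, the tumor purity of each sample is taken into account so that the selected features remain biologically meaningful. Finally, the selected features are evaluated to confirm that they meet the needs of the analysis.

The method has potential applications across gene expression data analysis, for example in tumor research, gene therapy, and personalized medicine.

Advantages:

* It accounts for tumor purity, which improves the accuracy of feature selection.
* It can identify potential biomarkers for cancer diagnosis and treatment.
* It can be applied across different areas of gene expression data analysis.

Drawbacks:

* It requires substantial computing resources and storage.
* It requires expertise in both bioinformatics and computer science.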
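The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the statistic defined in the paper: the function names `purity_weighted_paired_stat` and `select_top_genes`, the log2 preprocessing, and the choice to weight each tumor/normal pair by its purity are all assumptions made here for illustration.

```python
import numpy as np


def purity_weighted_paired_stat(tumor, normal, purity):
    """Return one t-like statistic per gene from matched tumor/normal pairs.

    tumor, normal: arrays of shape (genes, pairs) with non-negative expression.
    purity: array of shape (pairs,) with values in (0, 1].
    Assumption (illustrative only): purer pairs get proportionally more weight.
    """
    # Step 1: simple preprocessing (log2 transform with a pseudocount).
    tumor = np.log2(np.asarray(tumor, dtype=float) + 1.0)
    normal = np.log2(np.asarray(normal, dtype=float) + 1.0)

    # Step 2: matched-pairs analysis works on within-pair differences.
    diff = tumor - normal

    # Step 3: bring in tumor purity as normalized per-pair weights.
    weights = np.asarray(purity, dtype=float)
    weights = weights / weights.sum()
    mean = diff @ weights
    var = ((diff - mean[:, None]) ** 2) @ weights

    # Kish effective sample size for the weighted mean.
    n_eff = 1.0 / np.sum(weights ** 2)
    std_err = np.sqrt(var / n_eff)
    return mean / (std_err + 1e-12)  # small constant avoids division by zero


def select_top_genes(stats, k):
    """Step 4 (ranking part): indices of the k genes with the largest |statistic|."""
    return np.argsort(-np.abs(stats))[:k]
```

A real evaluation step would also compare the selected genes against held-out data or known markers; the ranking here only shows where such a check would plug in.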


Contents lists available at ScienceDirect

Mathematical Biosciences

journal homepage: www.elsevier.com/locate/mbs

A novel matched-pairs feature selection method considering with tumor purity for differential gene expression analyses

Liang Sen (a,b,c), Yang Sen, Liang Dayang, Ma Jiechao, Tian Yuan (a,e), Zhao Jing, Zhang Xu, Xu Ying (a,b,g), Wang Yan (a,b,⁎)

Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China

Cancer Systems Biology Center, China-Japan Union Hospital, Jilin University, Changchun 130033, China

Advanced Institute, Infervision, Beijing 100000, China

School of Mechatronics Engineering, Nanchang University, Nanchang 330031, China

School of Artificial Intelligence, Jilin University, Changchun 130012, China

Sanford Research, Sioux Falls, SD 57104, USA

Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA

ARTICLE INFO

Keywords:

Feature selection

Tumor purity

Test statistic

Gene expression analyses

Matched case-control design

ABSTRACT

Tissue-based gene expression data analyses, while most powerful, represent a significantly more challenging problem than cell-based gene expression data analyses, even for the simplest differential gene expression analyses. Whether a gene is determined to be differentially expressed in tumor vs. non-tumorous control tissues depends not only on the two expression values but also on the percentage of the tissue cells that are tumor cells, i.e., the tumor purity. We developed a novel matched-pairs feature selection method that takes tumor purity fully into consideration when deciding whether a gene is differentially expressed in tumor vs. control experiments, and that is simple, effective, and accurate. To evaluate the validity and performance of the method, we compared it with four published methods using both simulated datasets and actual cancer tissue datasets and found that our method achieved better performance, with higher sensitivity and specificity than the other methods. Our method is a matched-pairs feature selection method for gene expression analysis under a matched case-control design that takes tumor purity information into consideration, and it can lay a foundation for further development to meet other gene expression analysis needs.

1. Introduction

Compared to the traditional cell-based gene expression data collected under laboratory conditions, tissue-based gene expression data analyses enable researchers to study cancer evolution directly in the actual cancer-forming environment. In addition, collecting tissue-level gene expression data directly, rather than separating tissues into distinct cell types and then performing single-cell sequencing, provides a considerably more efficient and economically more feasible approach for large-scale tumor data analyses. However, the approach poses a significant challenge to bioinformatics researchers, since the collected data are compositions of gene expressions from multiple cell types. At the forefront of the analysis of such data is the issue of tumor purity, i.e., the percentage of tissue cells that are tumor cells, because the meaning of observed gene expression data changes with this percentage.
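To make the purity effect concrete, here is a minimal Python sketch under a simple linear mixing assumption: a tumor tissue sample is treated as a blend of tumor-cell and normal-cell expression in proportion to the purity. This two-component model and both function names are assumptions introduced for illustration; they are not taken from the paper.

```python
def observed_tumor_expression(tumor_cell_expr, normal_cell_expr, purity):
    """Expression measured in a tumor tissue sample under linear mixing.

    Assumption (illustrative): observed = purity * tumor + (1 - purity) * normal.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    return purity * tumor_cell_expr + (1.0 - purity) * normal_cell_expr


def purity_corrected_difference(observed_tumor, matched_normal, purity):
    """Recover the tumor-cell vs. normal-cell difference from a matched pair.

    Under the same mixing assumption, observed - normal = purity * (tumor - normal),
    so dividing by the purity undoes the attenuation.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    return (observed_tumor - matched_normal) / purity
```

For example, with tumor-cell expression 10, normal-cell expression 6, and purity 0.5, the tissue sample would read 8, so the apparent tumor vs. normal difference is 2 even though the true cell-level difference is 4. The same gene measured in samples of different purity can therefore look more or less differentially expressed.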

The tumor purity issue has been well recognized as a technical issue that needs to be solved before reliable information can be derived from tissue-based expression data [1,2]. Aran et al. recently gave a systematic pan-cancer analysis on tumor purity [3] and found that some immunotherapy gene signatures were not detected by traditional differential expression analysis, but became detectable when tumor purity was taken into consideration. Different types of information have been employed in the published methods for tumor purity estimation, including gene expression data (ESTIMATE [4]), somatic copy-number variation data (ABSOLUTE [5], THetA [6] and others [7]), somatic

https://doi.org/10.1016/j.mbs.2019.02.007

Received 17 January 2019; Received in revised form 21 February 2019; Accepted 22 February 2019

⁎ Corresponding author at: Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China.

E-mail addresses: hawkcoder@gmail.com (S. Liang), yangsen18@mails.jlu.edu.cn (S. Yang), 5910116336@email.ncu.edu.cn (D. Liang), mjiechao@infervision.com (J. Ma), tianyuan12@mails.jlu.edu.cn (Y. Tian), zj1228@gmail.com (J. Zhao), xuhzhang@outlook.com (X. Zhang), xyn@uga.edu (Y. Xu), wy6868@jlu.edu.cn (Y. Wang).

Mathematical Biosciences 311 (2019) 39–48

Available online 27 February 2019
