Mathematical Biosciences
journal homepage: www.elsevier.com/locate/mbs
A novel matched-pairs feature selection method considering with tumor
purity for differential gene expression analyses
Liang Sen (a,b,c), Yang Sen (a), Liang Dayang (d), Ma Jiechao (c), Tian Yuan (a,e), Zhao Jing (f), Zhang Xu (g), Xu Ying (a,b,g), Wang Yan (a,b,⁎)
(a) Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China
(b) Cancer Systems Biology Center, China-Japan Union Hospital, Jilin University, Changchun 130033, China
(c) Advanced Institute, Infervision, Beijing 100000, China
(d) School of Mechatronics Engineering, Nanchang University, Nanchang 330031, China
(e) School of Artificial Intelligence, Jilin University, Changchun 130012, China
(f) Sanford Research, Sioux Falls, SD 57104, USA
(g) Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
ARTICLE INFO
Keywords:
Feature selection
Tumor purity
Test statistic
Gene expression analyses
Matched case-control design
ABSTRACT
Tissue-based gene expression data analyses, while most powerful, are considerably more challenging than
cell-based gene expression data analyses, even for the simplest differential gene expression analyses.
Whether a gene is determined to be differentially expressed in tumor vs. non-tumorous control tissues
depends not only on the two expression values but also on the percentage of the tissue cells that are tumor
cells, i.e., the tumor purity. We developed a novel matched-pairs feature selection method that takes tumor
purity fully into account when deciding whether a gene is differentially expressed in tumor vs. control
experiments, and that is simple, effective, and accurate. To evaluate the validity and performance of the
method, we compared it with four published methods on both simulated datasets and actual cancer tissue
datasets and found that our method achieved higher sensitivity and specificity than the other methods. Our
method is a matched-pairs feature selection method for gene expression analysis under a matched
case-control design that takes tumor purity information into account, and it can lay a foundation for the
further development of methods for other gene expression analysis needs.
1. Introduction
Compared to the traditional cell-based gene expression data collected
under laboratory conditions, tissue-based gene expression data
analyses enable researchers to study cancer evolution directly in the
actual cancer-forming environment. In addition, collecting tissue-level
gene expression data directly, without first separating tissues into
distinct cell types and then performing single-cell sequencing, provides
a considerably more efficient and economically more feasible approach
for large-scale tumor data analyses. However, the approach poses a
significant challenge to bioinformatics researchers since the collected
data are compositions of gene expressions from multiple cell types. At
the forefront of the analysis of such data is the issue of tumor purity,
i.e., the percentage of tissue cells that are tumor cells, because the
meaning of the observed gene expression data changes with this
percentage.
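To illustrate why the meaning of an observed expression value depends on tumor purity, the following minimal sketch assumes a simple linear mixing model, in which the expression measured in a bulk tumor sample is a purity-weighted average of the expression in pure tumor cells and in non-tumor cells. This mixing model, the function names, and the numerical values are illustrative assumptions only; the sketch is not the feature selection method proposed in this paper.

```python
def observed_expression(tumor_expr, normal_expr, purity):
    """Bulk expression of one gene under an ASSUMED linear mixing model.

    tumor_expr  -- expression in pure tumor cells (hypothetical value)
    normal_expr -- expression in non-tumor cells (hypothetical value)
    purity      -- fraction of cells in the sample that are tumor cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    return purity * tumor_expr + (1.0 - purity) * normal_expr


def observed_fold_change(tumor_expr, normal_expr, purity):
    """Fold change seen when the bulk tumor sample is compared with a
    control sample that contains only non-tumor cells."""
    return observed_expression(tumor_expr, normal_expr, purity) / normal_expr


if __name__ == "__main__":
    # Hypothetical gene: four-fold higher in pure tumor cells (8.0 vs. 2.0).
    for purity in (1.0, 0.6, 0.3):
        fc = observed_fold_change(8.0, 2.0, purity)
        print(f"purity={purity:.1f}  observed fold change={fc:.2f}")
```

Under these assumptions, a true four-fold difference appears as roughly a 2.8-fold difference at 60% purity and a 1.9-fold difference at 30% purity, so a fixed fold-change cutoff would treat the same gene differently depending on the purity of the sample.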
The tumor purity issue has been well recognized as a technical issue
that needs to be solved before reliable information can be derived from
tissue-based expression data [1,2]. Aran et al. recently presented a
systematic pan-cancer analysis of tumor purity [3] and found that some
immunotherapy gene signatures were not detected by traditional
differential expression analysis but became detectable when tumor purity
was taken into consideration. Different types of information have been
employed in the published methods for tumor purity estimation,
including gene expression data (ESTIMATE [4]), somatic copy-number
variation data (ABSOLUTE [5], THetA [6] and others [7]), somatic
https://doi.org/10.1016/j.mbs.2019.02.007
Received 17 January 2019; Received in revised form 21 February 2019; Accepted 22 February 2019
⁎ Corresponding author at: Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun 130012, China.
E-mail addresses: hawkcoder@gmail.com (S. Liang), yangsen18@mails.jlu.edu.cn (S. Yang), 5910116336@email.ncu.edu.cn (D. Liang),
mjiechao@infervision.com (J. Ma), tianyuan12@mails.jlu.edu.cn (Y. Tian), zj1228@gmail.com (J. Zhao), xuhzhang@outlook.com (X. Zhang), xyn@uga.edu (Y. Xu),
wy6868@jlu.edu.cn (Y. Wang).
Mathematical Biosciences 311 (2019) 39–48
Available online 27 February 2019
0025-5564/ © 2019 Elsevier Inc. All rights reserved.