Mini Review
A Review of Matched-pairs Feature Selection Methods for Gene
Expression Data Analysis
Sen Liang (a), Anjun Ma (b, c), Sen Yang (a), Yan Wang (a, ⁎), Qin Ma (b, c, ⁎⁎)
(a) Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
(b) Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD 57007, USA
(c) BioSNTR, Brookings, SD, USA
Article info
Article history:
Received 18 September 2017
Received in revised form 14 February 2018
Accepted 19 February 2018
Available online 25 February 2018
Abstract
With the rapid accumulation of gene expression data from various technologies, e.g., microarray, RNA-sequencing (RNA-seq), and single-cell RNA-seq, it is necessary to carry out dimensionality reduction and feature (signature gene) selection to make sense of such high-dimensional data. These computational methods significantly facilitate further data analysis and interpretation, such as gene function enrichment analysis, cancer biomarker detection, and drug target identification in precision medicine. Although numerous methods have been developed for feature selection in bioinformatics, it is still a challenge to choose the appropriate method for a specific problem and to obtain the most reasonable feature ranking. Meanwhile, paired gene expression data under the matched case-control design (MCCD) are becoming increasingly popular; such data have often been used in multi-omics integration studies and may increase feature selection efficiency by offsetting similar distributions of confounding features. However, feature selection methods designed specifically for paired data, termed matched-pairs feature selection (MPFS), have not matured in parallel. In this review, we compare the performance of 10 feature selection methods (eight MPFS methods and two traditional unpaired methods) on two real datasets by applying three classification methods, and analyze the algorithmic complexity of these methods by running their programs. This review aims to summarize and comprehensively present MPFS in such a way that readers can easily understand its characteristics and get a clue for selecting the appropriate methods for their analyses.
© 2018 Liang et al. Published by Elsevier B.V. on behalf of the Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Keywords:
Matched-pairs feature selection
Matched case-control design
Paired data
Gene expression
Contents
1. Introduction
2. Feature Selection Techniques
2.1. Unpaired Feature Selection Methods
2.2. A Different Perspective of Feature Selection By Data Properties
3. Matched-pairs Feature Selection
3.1. Problem Description
3.2. Methods Survey
3.2.1. Test Statistic for MPFS
3.2.2. Conditional Logistic Regression for MPFS
3.2.3. Boosting Strategy for MPFS
4. Experimental Validation
5. Discussion
6. Conclusion
Computational and Structural Biotechnology Journal 16 (2018) 88–97
⁎ Corresponding author.
⁎⁎ Correspondence to: Q. Ma, Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, Department of Mathematics and Statistics,
South Dakota State University, Brookings, SD 57007, USA.
E-mail addresses: wy6868@jlu.edu.cn (Y. Wang), qin.ma@sdstate.edu (Q. Ma).
https://doi.org/10.1016/j.csbj.2018.02.005
ISSN 2001-0370
journal homepage: www.elsevier.com/locate/csbj