Vol. 31 no. 4 2015, pages 572–580
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btu679
Systems biology Advance Access publication October 16, 2014
jNMFMA: a joint non-negative matrix factorization meta-analysis
of transcriptomics data
Hong-Qiang Wang¹,*, Chun-Hou Zheng² and Xing-Ming Zhao³
¹Machine Intelligence and Computational Biology Lab, Hefei Institutes of Physical Science, Chinese Academy of Science,
Hefei 230031, China, ²College of Electrical Engineering and Automation, Anhui University, Hefei 230031, China and
³Department of Computer Science, School of Electronics and Information Engineering, Tongji University, Shanghai
201804, China
Associate Editor: Jonathan Wren
ABSTRACT
Motivation: The tremendous amount of omics data being accumulated
poses a pressing challenge: meta-analyzing heterogeneous data
to mine new biological knowledge. Most existing methods treat
each gene independently, which often results in high false positive
rates when detecting differentially expressed genes (DEGs). To our
knowledge, little or no effort has been devoted to methods that
consider the dependence structures underlying transcriptomics data
for DEG identification in a meta-analysis context.
Results: This article proposes a new meta-analysis method for iden-
tification of DEGs based on joint non-negative matrix factorization
(jNMFMA). We mathematically extend non-negative matrix factoriza-
tion (NMF) to a joint version (jNMF), which is used to simultaneously
decompose multiple transcriptomics data matrices into one common
submatrix plus multiple individual submatrices. By the jNMF, the
dependence structures underlying transcriptomics data can be inter-
rogated and utilized, while the high-dimensional transcriptomics data
are mapped into a low-dimensional space spanned by metagenes that
represent hidden biological signals. jNMFMA finally identifies DEGs as
genes that are associated with differentially expressed metagenes.
The ability to extract dependence structures makes jNMFMA
more efficient and robust in identifying DEGs in a meta-analysis context.
Furthermore, jNMFMA is flexible enough to identify DEGs that are
consistent among various types of omics data, e.g. gene expression
and DNA methylation. Experimental results on both simulated data
and real-world cancer data demonstrate the effectiveness of jNMFMA
and its superior performance over other popular approaches.
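As a rough illustration of the joint factorization idea described above, the following Python sketch factors several non-negative matrices X_i as W H_i with one shared factor W and one individual factor H_i per dataset, using standard multiplicative updates for the summed squared error. It is not the authors' R implementation: the choice of which factor is shared, the loss, the update rule and the names `joint_nmf` and `joint_loss` are all assumptions made for illustration.

```python
import numpy as np


def joint_nmf(matrices, rank, n_iter=200, eps=1e-9, seed=0):
    """Illustrative joint NMF: X_i ~ W @ H_i with a shared non-negative W.

    Assumption: every X_i has the same number of rows (e.g. genes) and
    its own number of columns (e.g. samples of study i).
    """
    rng = np.random.default_rng(seed)
    n_rows = matrices[0].shape[0]
    W = rng.random((n_rows, rank)) + eps
    Hs = [rng.random((rank, X.shape[1])) + eps for X in matrices]
    for _ in range(n_iter):
        # Update each individual factor H_i with the shared W held fixed.
        WtW = W.T @ W
        Hs = [H * (W.T @ X) / (WtW @ H + eps) for X, H in zip(matrices, Hs)]
        # Update the shared factor W using all datasets at once.
        numer = sum(X @ H.T for X, H in zip(matrices, Hs))
        denom = W @ sum(H @ H.T for H in Hs) + eps
        W = W * numer / denom
    return W, Hs


def joint_loss(matrices, W, Hs):
    """Sum of squared Frobenius reconstruction errors over all datasets."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, H in zip(matrices, Hs))
```

A call such as `W, Hs = joint_nmf([X1, X2], rank=3)` returns the shared factor and a list of per-dataset factors; how DEGs are then derived from differentially expressed metagenes is not covered by this sketch.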
Availability and implementation: R code for jNMFMA is available for
non-commercial use via http://micblab.iim.ac.cn/Download/.
Contact: hqwang@ustc.edu
Supplementary information: Supplementary data are available at
Bioinformatics online.
Received on July 10, 2014; revised on September 26, 2014; accepted
on October 10, 2014
1 INTRODUCTION
As high-throughput biotechnologies have become routine tools
in biological and biomedical research, tremendous amounts of
omics data have been generated that provide great opportunities
for deciphering molecular mechanisms of cancer or other
diseases (Jiao et al., 2014; Natrajan and Wilkerson, 2013;
TCGA, 2012; Zhang et al., 2013). Two well-known public gene
expression databases, GEO (www.ncbi.nlm.nih.gov/geo/) and
ArrayExpress (www.ebi.ac.uk/arrayexpress/), hold
transcriptomic data comprising more than a million assays from
more than 30 000 studies. Another valuable resource, the
TCGA project (http://cancergenome.nih.gov/), has released vari-
ous types of omics data for nearly 10 000 cancer patient samples.
Reusing this flood of transcriptomics data through meta-analysis can
reduce sample bias and increase statistical power, and thus allow
for an in-depth understanding of the pathology of cancer or other
diseases at the molecular level (Rung and Brazma, 2013). However, the
key issue of meta-analysis, i.e. capturing consistent but subtle
patterns of gene activity across multiple transcriptomics datasets,
still remains challenging both theoretically and practically.
Differentially expressed genes (DEGs) across studies could
reflect subtle but consistent biological effects and might
be false negatives in individual analyses (Xia et al., 2013).
To efficiently identify DEGs, meta-analysis methods need to
overcome a variety of biological or non-biological variations
introduced by distinct protocols and data platforms used in
individual studies (Rung and Brazma, 2013). In terms of the
information combined, existing meta-analysis methods
can be categorized into three classes: P-value-based, effect
size-based and rank-based, each of which deals with non-specific
variations at a different level of the data. Among them, the
P-value-based approach is statistically the most intuitive and allows
topic-related associations from different studies to be standardized
to a common scale of significance (Li and Tseng, 2011). However,
the performance of P-value-based methods depends heavily on
the underlying method used to calculate P-values in the individual
analyses (Tseng et al., 2012). In contrast, effect size-based
methods estimate effect sizes and synthesize them directly
across studies using a t-statistic-like model.
Because the effect size provides a direct measure of
differential expression, effect size-based methods tend to be more
efficient in detecting DEGs than P-value-based methods (Hong
and Breitling, 2008). Two types of effect size models
can be used for meta-analysis of transcriptomics data: the
fixed-effect model (FEM) and the random-effect model (REM),
which differ in whether between-study variation is assumed to be negligible.
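To make the FEM/REM distinction concrete, the Python sketch below combines per-study effect sizes for a single gene. It assumes inverse-variance weighting for the FEM and the DerSimonian-Laird moment estimator of the between-study variance for the REM; these are common textbook choices and are not necessarily the estimators used in the works cited here. The function names are hypothetical.

```python
import numpy as np


def fixed_effect(effects, variances):
    """FEM: inverse-variance weighted mean; between-study variance is ignored."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(weights * effects) / np.sum(weights)
    return pooled, 1.0 / np.sum(weights)  # pooled effect and its variance


def random_effect(effects, variances):
    """REM with the DerSimonian-Laird estimate of between-study variance tau2."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = 1.0 / variances
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures heterogeneity of the effects around the FEM estimate.
    q = np.sum(weights * (effects - pooled_fixed) ** 2)
    k = effects.size
    c = np.sum(weights) - np.sum(weights ** 2) / np.sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Each study's variance is inflated by tau2 before re-weighting.
    weights_star = 1.0 / (variances + tau2)
    pooled = np.sum(weights_star * effects) / np.sum(weights_star)
    return pooled, 1.0 / np.sum(weights_star), tau2
```

When the studies agree, the estimated tau2 is zero and the two models coincide; when they disagree, the REM reports a larger variance for the pooled effect, which is the practical consequence of not treating between-study variation as negligible.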
Generally, effect size-based methods suffer from unreliable
error estimates due to improper distribution assumption
*To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com