inference of ceRNA interactions mediated via phosphatase
and tensin homolog (PTEN). However, the parameters required
by this model, such as the degradation and transcription rates
and the rates of association and dissociation of miRNA/ceRNA
complexes [24], are too difficult to measure for most miRNAs
and lncRNAs. Therefore, it is not feasible to use this kinetic
model extensively for the inference of miRNA-lncRNA
interactions. Increasing evidence [30, 31] has demonstrated
that lncRNAs are also presumably co-regulated in expression
networks, and that multiple lncRNAs could be involved in
biological regulation processes by synergistically interacting
with particular miRNA clusters. Accordingly, the expression
pattern of the lncRNA-lncRNA synergistic network has recently
attracted increasing attention.
In this work, we develop a group-preference Bayesian
collaborative filtering model called GBCF to produce a top-k
probability ranking list for an individual miRNA or lncRNA,
based on the known miRNA-lncRNA interaction network derived
from the lncRNASNP database. Since the known miRNA-lncRNA
interactions in the lncRNASNP database are all positive,
negative samples are relatively hard to collect. The
prediction task is therefore a semi-supervised one in which
only the known interactions are treated as positive samples.
A semi-supervised formulation can also make use of side
information that benefits the prediction performance. In
particular, we first propose a local scoring scheme to
alleviate the prediction bias caused by the imbalance of the
known miRNA-lncRNA interaction network. Under this scoring
scheme, we implemented both leave-one-out cross validation
(LOOCV) and k-fold cross validation to evaluate the prediction
performance of the proposed model. The experimental results
demonstrate that GBCF obtains reliable prediction performance
and achieves a higher AUC (area under the ROC curve) of 0.9193
than a few representative classical classifiers and the
state-of-the-art model EPLMI [32]. GBCF obtained average AUCs
of 0.8354 ± 0.0079, 0.8615 ± 0.0078 and 0.8928 ± 0.0082 under
2-fold, 5-fold and 10-fold cross validation, respectively. To
better describe the similarities among miRNAs and lncRNAs, we
leveraged three diverse types of biological information, i.e.,
expression profiles, coding-non-coding co-expression networks
and sequence data. Using a series of 5-fold cross validations
and correlation analysis of RNA clusters, the experimental
comparison demonstrated that miRNA and lncRNA similarity
should be measured by biological function-based and
expression profile-based correlations, respectively.
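As a rough illustration of the evaluation protocol described above, the following Python sketch splits the known positive pairs into k folds and computes a rank-based AUC of held-out positives against unobserved pairs, summarised as mean and standard deviation over folds. The toy identifiers, the `degree_scorer` stand-in (which is not GBCF) and all function names are assumptions made for this example only.

```python
import random
from statistics import mean, stdev

def auc(pos_scores, unl_scores):
    """Fraction of (held-out positive, unobserved) score pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for u in unl_scores:
            if p > u:
                wins += 1.0
            elif p == u:
                wins += 0.5
    return wins / (len(pos_scores) * len(unl_scores))

def k_fold_splits(pairs, k, seed=0):
    """Shuffle the known positive pairs and deal them into k near-equal folds."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def degree_scorer(train_pairs):
    """Hypothetical stand-in scorer (NOT GBCF): score a pair by its lncRNA's training degree."""
    counts = {}
    for _, lnc in train_pairs:
        counts[lnc] = counts.get(lnc, 0) + 1
    return lambda mir, lnc: counts.get(lnc, 0)

def cross_validate(pairs, mirnas, lncrnas, k, seed=0):
    """Return (mean AUC, standard deviation) over k folds, using positives only as labels."""
    known = set(pairs)
    unobserved = [(m, l) for m in mirnas for l in lncrnas if (m, l) not in known]
    fold_aucs = []
    for fold in k_fold_splits(pairs, k, seed):
        held_out = set(fold)
        train = [p for p in pairs if p not in held_out]
        score = degree_scorer(train)
        fold_aucs.append(auc([score(m, l) for m, l in fold],
                             [score(m, l) for m, l in unobserved]))
    return mean(fold_aucs), stdev(fold_aucs)

# Toy data (made up for illustration): 4 miRNAs, 5 lncRNAs, 10 known interactions.
mirnas = ["m1", "m2", "m3", "m4"]
lncrnas = ["l1", "l2", "l3", "l4", "l5"]
pairs = [("m1", "l1"), ("m1", "l2"), ("m1", "l3"), ("m2", "l1"), ("m2", "l2"),
         ("m3", "l1"), ("m3", "l4"), ("m4", "l1"), ("m4", "l2"), ("m4", "l5")]

avg, sd = cross_validate(pairs, mirnas, lncrnas, k=5)
print(f"5-fold AUC: {avg:.4f} +/- {sd:.4f}")
```

Because no verified negatives exist, the unobserved pairs play the role of negatives here, which is one common convention for positive-only link prediction; the paper's own protocol may differ in detail.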
Results
Experimental results in cross validation
Using LOOCV, we compared GBCF with a few classical
classifiers [33–36] as well as the state-of-the-art model
EPLMI [30] as baselines. Note that all the compared models
were built on the same information source as GBCF. EPLMI is a
two-way diffusion model first proposed for the prediction of
large-scale miRNA-lncRNA interactions. Unlike GBCF, EPLMI
adopts a global scoring scheme to rank the most likely novel
miRNA-lncRNA interactions among all unobserved samples. We
also tried to explore the potential of these classical
classifiers from different perspectives. For example, Katz can
be categorized as a network-based measure that calculates the
similarity of nodes in a bipartite graph. Singular-value
decomposition (SVD) is used to decompose the known interaction
network into three relatively smaller matrices for the
construction of a probability matrix. The latent factor model
(LFM) aims to explain observed associations in terms of two
latent factors (also called hidden variables), which are
iteratively optimized so that their matrix product serves as
the probability matrix. Since the GBCF model adopts a specific
group-preference Bayesian collaborative filtering (CF)
technique, we also compared it with typical lncRNA-based and
miRNA-based CF models.
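To make two of these baselines concrete, the Python sketch below scores a toy bipartite interaction matrix with a truncated Katz measure and with a generic miRNA-based neighbourhood CF. It is an illustration under stated assumptions, not the configuration used in the paper: the damping factor `beta`, the walk-length cut-off `max_len`, the cosine similarity and the toy matrix are all chosen here for the example.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def katz_bipartite(b, beta=0.1, max_len=3):
    """Truncated Katz score between miRNAs (rows) and lncRNAs (columns).

    In a bipartite graph only odd-length walks connect a miRNA to a lncRNA,
    so the sum runs over lengths 1, 3, 5, ... up to max_len (assumed cut-off).
    """
    bt = transpose(b)
    walk = [row[:] for row in b]                 # number of walks of length 1
    scores = [[0.0] * len(b[0]) for _ in b]
    length = 1
    while length <= max_len:
        weight = beta ** length
        for i, row in enumerate(walk):
            for j, count in enumerate(row):
                scores[i][j] += weight * count
        walk = matmul(matmul(walk, bt), b)       # extend every walk by two steps
        length += 2
    return scores

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def mirna_based_cf(b):
    """Score (miRNA i, lncRNA j) by the similarity-weighted votes of the other miRNAs."""
    n = len(b)
    sims = [[cosine(b[i], b[k]) for k in range(n)] for i in range(n)]
    return [[sum(sims[i][k] * b[k][j] for k in range(n) if k != i)
             for j in range(len(b[0]))] for i in range(n)]

# Toy interaction matrix (made up): 2 miRNAs x 3 lncRNAs.
B = [[1, 1, 0],
     [0, 1, 1]]

katz = katz_bipartite(B)
cf = mirna_based_cf(B)
# The unobserved pair (miRNA 0, lncRNA 2) is reached by one walk of length 3.
print(round(katz[0][2], 6), round(cf[0][2], 6))  # 0.001 0.5
```

Both scorers assign a non-zero score to the unobserved pair only because the two miRNAs share lncRNA 1, which is the "preference borrowed from other RNAs" idea that neighbourhood methods rely on.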
The performance comparison via LOOCV is shown in Fig. 1.
Among these models, GBCF achieves the best prediction
performance with the highest AUC value of 0.9193. The
miRNA-based CF, lncRNA-based CF, EPLMI, SVD-based model and
basic LFM obtain AUC values of 0.9089, 0.8880, 0.8847, 0.8402
and 0.8680, respectively. It is noteworthy that the CF-based
models seem to perform better than the others. This could be
attributed to their capability of automatically collecting
extrinsic preferences from other RNAs. Although the EPLMI
model still maintains reasonable prediction accuracy, the
local ranking scheme limits its performance to a certain
extent. Because GBCF is developed from earlier recommender
system approaches, it is more efficient than EPLMI at dealing
with the sparse dataset. In short, the LOOCV results
demonstrate the reliability of GBCF.
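The difference between ranking locally and globally can be seen in a single LOOCV step. The Python sketch below takes one plausible reading of the two schemes: the local rank compares a held-out pair only with the unobserved lncRNAs of the same miRNA, while the global rank compares it with every unobserved pair. The score matrix is made up for illustration and does not come from GBCF or EPLMI, and the function names are hypothetical.

```python
def local_rank(scores, train, held_out):
    """Rank of the held-out pair among unobserved lncRNAs of the same miRNA.

    Returns (rank, pool size); rank 1 is best. `train` has the held-out entry set to 0.
    """
    m, l = held_out
    target = scores[m][l]
    rivals = [scores[m][j] for j, seen in enumerate(train[m]) if not seen and j != l]
    return 1 + sum(s > target for s in rivals), len(rivals) + 1

def global_rank(scores, train, held_out):
    """Rank of the held-out pair among all unobserved miRNA-lncRNA pairs."""
    m, l = held_out
    target = scores[m][l]
    rivals = [scores[i][j]
              for i, row in enumerate(train)
              for j, seen in enumerate(row)
              if not seen and (i, j) != held_out]
    return 1 + sum(s > target for s in rivals), len(rivals) + 1

# Toy training matrix (3 miRNAs x 4 lncRNAs) after hiding the known pair (0, 1).
train = [[1, 0, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1]]
# Hypothetical predicted scores; miRNA 1 receives uniformly high scores.
scores = [[0.9, 0.6, 0.2, 0.10],
          [0.8, 0.7, 0.5, 0.65],
          [0.3, 0.4, 0.9, 0.80]]

held_out = (0, 1)
print(local_rank(scores, train, held_out))   # (1, 3)
print(global_rank(scores, train, held_out))  # (2, 7)
```

In this toy case the held-out pair tops the list for its own miRNA but is overtaken globally by an unobserved pair of a miRNA whose scores are higher across the board, which is the kind of effect an imbalanced interaction network can produce.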
Insufficient training samples would greatly affect the
prediction accuracy (sparsity = 2.49%). To evaluate the
performance of GBCF under different sparsity levels, 2-fold,
5-fold and 10-fold cross validations were conducted. As shown
in Table 1, the GBCF model achieves an average AUC of
0.8354 ± 0.0079 when the number of training samples drops to
a half. In addition, the results suggest that the GBCF model
is strongly robust to different levels of training data
sparsity. We also used 5-fold cross validation to assess the
performance of GBCF with lncRNA-based group preference
instead. The resulting average AUC of 0.8612 ± 0.0080
suggests that miRNA- and lncRNA-based group preferences
contribute equally to the prediction performance of GBCF.
Considering the complex competition mechanisms in the ceRNA
network and the lack of investigation into the competition
patterns for sequestering miRNAs, we provided the top-50
ranking lists of
Huang et al. BMC Medical Genomics 2018, 11(Suppl 6):113 Page 19 of 112