Danaher et al. [25] proposed the fused graphical Lasso (FGL) by applying the fused Lasso penalty [30]
\[
P(\Theta^1, \ldots, \Theta^K) = \sum_{k < k'} \sum_{i \neq j} \big|\theta^k_{ij} - \theta^{k'}_{ij}\big|
\]
to (1), which encourages the $K$ precision matrices to have identical element values. In addition, they also proposed the group graphical Lasso (GGL) by applying the group Lasso penalty [31]
\[
P(\Theta^1, \ldots, \Theta^K) = \sum_{i \neq j} \Big( \sum_{k=1}^{K} (\theta^k_{ij})^2 \Big)^{1/2}
\]
to (1), which encourages the $K$ precision matrices to have a common pattern of sparsity.
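To make the two penalties concrete, a minimal NumPy sketch is given below, assuming the $K$ precision matrices are supplied as a list of $p \times p$ arrays; the helper names are ours, not code from [25], [30], or [31].

```python
import numpy as np

def fgl_penalty(thetas):
    """Fused Lasso penalty: sum over pairs k < k' and off-diagonal entries
    (i, j) of |theta^k_ij - theta^k'_ij|."""
    K = len(thetas)
    p = thetas[0].shape[0]
    off = ~np.eye(p, dtype=bool)                  # mask selecting i != j
    return sum(np.abs(thetas[k] - thetas[l])[off].sum()
               for k in range(K) for l in range(k + 1, K))

def ggl_penalty(thetas):
    """Group Lasso penalty: sum over off-diagonal entries (i, j) of the
    l2 norm of (theta^1_ij, ..., theta^K_ij) taken across the K matrices."""
    stacked = np.stack(thetas)                    # shape (K, p, p)
    off = ~np.eye(stacked.shape[1], dtype=bool)
    return np.sqrt((stacked ** 2).sum(axis=0))[off].sum()
```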
B. Node-Based Joint Graphical Lasso
First, we review the RCON given in [26].
Definition 1: The RCON induced by a matrix norm $\|\cdot\|$ is defined as
\[
\Omega(\Theta^1, \Theta^2, \ldots, \Theta^N) = \min_{V^1, \ldots, V^N} \left\| \begin{bmatrix} V^1 \\ V^2 \\ \vdots \\ V^N \end{bmatrix} \right\| \quad \text{s.t. } \Theta^n = V^n + (V^n)^T \text{ for } n = 1, 2, \ldots, N.
\]
Indeed, $\Omega(\cdot)$ is a norm for all matrix norms $\|\cdot\|$; thus, it is convex. In this paper, we consider only a particular class of RCON, where $\|\cdot\|$ is an $\ell_1/\ell_r$ norm, given by $\|V\| = \sum_{j=1}^{p} \|V_j\|_r$, with $V = [V_1, V_2, \ldots, V_p]$ and $1 \le r \le \infty$. In the following, $\Omega(\cdot)$ is denoted by $\Omega_r(\cdot)$.
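Since the RCON is defined through a minimization over the matrices $V^n$, evaluating it requires solving a small convex program. The following is a rough sketch for the $\ell_1/\ell_2$ case ($r = 2$) using CVXPY; the helper name rcon is ours and is not code from [26].

```python
import cvxpy as cp

def rcon(thetas):
    """Omega_2(Theta^1, ..., Theta^N): minimize the sum of column l2 norms
    of the stacked [V^1; ...; V^N] subject to Theta^n = V^n + (V^n)^T."""
    p = thetas[0].shape[0]
    Vs = [cp.Variable((p, p)) for _ in thetas]
    stacked = cp.vstack(Vs)                        # (N*p) x p block matrix
    objective = cp.Minimize(cp.sum(cp.norm(stacked, 2, axis=0)))
    constraints = [V + V.T == T for V, T in zip(Vs, thetas)]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return problem.value
```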
Based on the RCON, the perturbed-node joint graphical Lasso (PNJGL) [26] is proposed to detect the perturbed nodes$^{2}$ among multiple networks using the structure penalty
\[
P(\Theta^1, \ldots, \Theta^K) = \sum_{k < k'} \Omega_r(\Theta^k - \Theta^{k'}).
\]
When $r = 2$ or $r = \infty$, PNJGL encourages the differences between the precision matrices $\{\Theta^k\}_{k=1}^{K}$ to be supported on a union of a few rows and the corresponding columns, which can be interpreted as a set of perturbed nodes across the $K$ networks.
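Using the rcon helper sketched above, the PNJGL structure penalty is simply a sum of RCONs of pairwise differences; a minimal illustration (not the authors' implementation):

```python
def pnjgl_penalty(thetas):
    """PNJGL structure penalty: sum of Omega_r(Theta^k - Theta^k') over k < k'."""
    K = len(thetas)
    return sum(rcon([thetas[k] - thetas[l]])
               for k in range(K) for l in range(k + 1, K))
```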
Moreover, the common hub (cohub) node joint graphical Lasso (CNJGL) [26] is proposed to detect the cohub nodes$^{3}$ among multiple networks using the structure penalty
\[
P(\Theta^1, \ldots, \Theta^K) = \Omega_r\big(\Theta^1 - \mathrm{diag}(\Theta^1), \ldots, \Theta^K - \mathrm{diag}(\Theta^K)\big).
\]
CNJGL encourages the supports of $\{\Theta^k\}_{k=1}^{K}$ to be the same and to form a union of a few rows and the corresponding columns shared by the $K$ precision matrices, which can be interpreted as a set of common hub nodes among the $K$ networks.
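Analogously, the CNJGL structure penalty applies the RCON jointly to the $K$ matrices with their diagonals removed; a short sketch reusing the rcon helper above:

```python
import numpy as np

def cnjgl_penalty(thetas):
    """CNJGL structure penalty:
    Omega_r(Theta^1 - diag(Theta^1), ..., Theta^K - diag(Theta^K))."""
    return rcon([T - np.diag(np.diag(T)) for T in thetas])
```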
$^{2}$The perturbed nodes are those whose connectivity pattern to the other nodes differs across the multiple networks [26].
$^{3}$The cohub nodes serve as hubs in each of the multiple networks [26].
III. JOINT MATRIX GRAPHICAL MODELS
In this section, we propose the joint matrix graphical
Lasso for learning multiple MGGMs sharing the same matrix
variable under distinct conditions. Depending on the application, we propose the edge-based and the node-based joint matrix graphical Lasso, respectively.
A. Problem Formulation
Suppose that a random matrix $Y \in \mathbb{R}^{p \times q}$ follows the matrix normal distribution $\mathcal{MN}_{p,q}(M, \Sigma, \Psi)$, whose density function is defined as
\[
P(Y \mid M, \Sigma, \Psi) = \frac{1}{(2\pi)^{pq/2} |\Sigma|^{q/2} |\Psi|^{p/2}} \exp\!\Big( -\frac{1}{2}\,\mathrm{tr}\big( (Y - M)^T \Sigma^{-1} (Y - M) \Psi^{-1} \big) \Big)
\]
where $M \in \mathbb{R}^{p \times q}$ is the mean matrix; $\Sigma \in \mathbb{R}^{p \times p}$ and $\Psi \in \mathbb{R}^{q \times q}$ are the row and column covariance matrices, respectively. The row precision matrix $\Sigma^{-1}$ encodes the conditional independence among the rows of the matrix variable, while the column precision matrix $\Psi^{-1}$ encodes that among the columns.
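For concreteness, a naive NumPy sketch of this log-density under the parameterization above (assuming Sigma and Psi are symmetric positive definite; no attempt is made at numerical robustness, and the function name is ours):

```python
import numpy as np

def matrix_normal_logpdf(Y, M, Sigma, Psi):
    """log P(Y | M, Sigma, Psi) with row covariance Sigma (p x p) and
    column covariance Psi (q x q), following the density above."""
    p, q = Y.shape
    R = Y - M
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_Psi = np.linalg.slogdet(Psi)
    # tr((Y - M)^T Sigma^{-1} (Y - M) Psi^{-1})
    quad = np.trace(np.linalg.solve(Psi, R.T @ np.linalg.solve(Sigma, R)))
    return -0.5 * (p * q * np.log(2 * np.pi)
                   + q * logdet_Sigma + p * logdet_Psi + quad)
```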
Further suppose that $Y^k_i \in \mathbb{R}^{p \times q}$ ($i = 1, 2, \ldots, n_k$) are sampled i.i.d. from $\mathcal{MN}_{p,q}(O_{p \times q}, \Sigma^k, \Psi^k)$ for $k = 1, 2, \ldots, K$ with $K \ge 2$. Here $n_k$ is the number of samples in the $k$th class, and the features are shared among the $K$ classes. For convenience, let $A^k = (\Sigma^k)^{-1}$ and $B^k = (\Psi^k)^{-1}$ ($k = 1, 2, \ldots, K$), and further let $\{A\} = \{A^1, \ldots, A^K\}$ and $\{B\} = \{B^1, \ldots, B^K\}$.
Then the negative log likelihood for the data takes the form
\[
L(\{A\}, \{B\}) = \sum_{k=1}^{K} \left[ \frac{1}{n_k p q} \sum_{l=1}^{n_k} \mathrm{tr}\big( A^k Y^k_l B^k (Y^k_l)^T \big) - \frac{1}{p} \log|A^k| - \frac{1}{q} \log|B^k| \right].
\]
Meanwhile, we can also consider the weighted negative
log likelihood. Clearly, minimizing L({A}, {B}) leads to the
maximum likelihood estimates (MLEs). However, the MLEs
are usually dense. The $\ell_1$-regularization has been employed to induce sparsity, resulting in sparse precision estimation.
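A minimal sketch of the scaled negative log likelihood $L(\{A\}, \{B\})$ above, assuming the samples of class $k$ are stored as an array Ys[k] of shape (n_k, p, q); the function name is ours.

```python
import numpy as np

def neg_log_likelihood(As, Bs, Ys):
    """L({A}, {B}): for each class k, (1/(n_k p q)) sum_l tr(A^k Y_l^k B^k (Y_l^k)^T)
    minus (1/p) log|A^k| minus (1/q) log|B^k|, summed over k."""
    total = 0.0
    for A, B, Y in zip(As, Bs, Ys):
        n_k, p, q = Y.shape
        trace_term = sum(np.trace(A @ Yl @ B @ Yl.T) for Yl in Y)
        total += (trace_term / (n_k * p * q)
                  - np.linalg.slogdet(A)[1] / p
                  - np.linalg.slogdet(B)[1] / q)
    return total
```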
In this paper, we propose the joint matrix graphical Lasso for estimating multiple MGGMs by minimizing the penalized negative log likelihood
\[
\min_{\{A^k \in S^p_{++},\, B^k \in S^q_{++}\}_{k=1}^{K}} \; L(\{A\}, \{B\}) + \lambda_1 \sum_{k=1}^{K} \|A^k\|_1 + \rho_1 \sum_{k=1}^{K} \|B^k\|_1 + \lambda_2 P_1(\{A\}) + \rho_2 P_2(\{B\}) \quad (2)
\]
where $\lambda_1$, $\lambda_2$, $\rho_1$, and $\rho_2$ are nonnegative tuning parameters. Here, $P_1(\{A\})$ and $P_2(\{B\})$ are convex structure penalty functions, which aim at preserving the common structures in the row and the column precision matrices, respectively.
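Putting the pieces together, the penalized objective in (2) can be sketched as follows, reusing the neg_log_likelihood helper above; P1 and P2 stand for whichever structure penalties are chosen, and this illustrates only the objective value, not the authors' optimization algorithm.

```python
import numpy as np

def joint_objective(As, Bs, Ys, lam1, rho1, lam2, rho2, P1, P2):
    """Value of the penalized negative log likelihood in (2)."""
    l1_A = sum(np.abs(A).sum() for A in As)        # sum_k ||A^k||_1
    l1_B = sum(np.abs(B).sum() for B in Bs)        # sum_k ||B^k||_1
    return (neg_log_likelihood(As, Bs, Ys)
            + lam1 * l1_A + rho1 * l1_B
            + lam2 * P1(As) + rho2 * P2(Bs))
```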
When $A^k = I_p$ or $B^k = I_q$ ($k = 1, 2, \ldots, K$), our model reduces to