condition (Lemma A.3), where we denote $f_x(z, D) \triangleq L_x(z, D) + h(z)$. One immediate observation is that $\lambda \geq \|D^T x\|_\infty \Leftrightarrow 0 \in \arg\min f_x(z, D)$. We assume $\lambda < \|D^T x\|_\infty$ and that the solution to (1) is unique. Sufficient conditions for uniqueness in the overcomplete case (i.e., $p > m$) are extensively studied in the literature [49, 50, 51]. Tibshirani showed that the solution is unique with probability one if the entries of $D$ are drawn from a continuous probability distribution [51] (Assumption 2.3). We argue that as long as the data $x \in \mathcal{X}$ are sampled from a continuous distribution, this assumption holds for the entire learning process. The assumption has been previously considered in analyses of unfolded sparse coding networks [48] and can be extended to $\ell_1$-regularized optimization problems [51, 52].
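To see the equivalence above, recall the lasso optimality condition: $0 \in \arg\min f_x(z, D)$ if and only if $0 \in \partial f_x(0, D)$. Using $\nabla_1 L_x(0, D) = -D^T x$ and the standard $\ell_1$ subdifferential $\partial h(0) = \lambda[-1, 1]^p$, a one-line derivation gives

$$0 \in -D^T x + \lambda[-1,1]^p \;\Longleftrightarrow\; D^T x \in \lambda[-1,1]^p \;\Longleftrightarrow\; \|D^T x\|_\infty \leq \lambda.$$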
Assumption 2.3 (Lasso uniqueness). The entries of the dictionary $D$ are continuously distributed. Hence, the minimizer of (1) is unique, i.e., $\{\hat{z}\} = \arg\min f_x(z, D)$ with probability one.
Lemma 2.1 states the fixed-point property of the encoder recursion [53]. Given the definitions of Lipschitz and Lipschitz-differentiable functions (Definitions A.1 and A.2), the loss $L$ and the function $h$ satisfy the following Lipschitz properties, which play an important role in our analysis.
Lemma 2.1 (Fixed-point property of lasso). Given Assumption 2.3, we have $0 \in \nabla_1 L(\hat{z}, D) + \partial h(\hat{z})$. The minimizer is a fixed point of the mapping, i.e., $\hat{z} = P_{\alpha h}(\hat{z} - \alpha \nabla_1 L(\hat{z}, D)) = \Phi(\hat{z})$ [53].
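For the least-squares loss and $h(z) = \lambda\|z\|_1$, the mapping $\Phi$ is a single ISTA step, with $P_{\alpha h}$ the entrywise soft-thresholding operator. The following is a minimal NumPy sketch of iterating $\Phi$ until the fixed point $\hat{z}$ is reached numerically; the problem sizes, step size $\alpha = 1/\|D\|_2^2$, and tolerance are illustrative assumptions, not values from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    # P_{tau * ||.||_1}: proximal operator of the l1 norm, applied entrywise.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(x, D, lam, alpha=None, t_max=5000, tol=1e-10):
    # Iterate z <- Phi(z) = P_{alpha h}(z - alpha * grad_1 L_x(z, D)),
    # where grad_1 L_x(z, D) = D^T (Dz - x) for the least-squares loss.
    if alpha is None:
        alpha = 1.0 / np.linalg.norm(D, 2) ** 2  # 1/L_1, ensures convergence
    z = np.zeros(D.shape[1])
    for _ in range(t_max):
        z_next = soft_threshold(z - alpha * D.T @ (D @ z - x), alpha * lam)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))   # overcomplete: p = 50 > m = 20
x = rng.standard_normal(20)
lam = 0.5
z_hat = ista(x, D, lam)

# Fixed-point property (Lemma 2.1): Phi(z_hat) == z_hat up to tolerance.
alpha = 1.0 / np.linalg.norm(D, 2) ** 2
residual = soft_threshold(z_hat - alpha * D.T @ (D @ z_hat - x), alpha * lam) - z_hat
print(np.max(np.abs(residual)))  # ~1e-10
```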
Lemma 2.2 (Lipschitz-differentiable least squares). Given $L_x(z, D) = \frac{1}{2}\|x - Dz\|_2^2$, $D$, and Assumption 2.2, the loss is Lipschitz differentiable. Let $L_1$ and $L_2$ denote the Lipschitz constants of the first derivatives $\nabla_1 L_x(z, D)$ and $\nabla_2 L_x(z, D)$, and $L_{11}$ and $L_{21}$ the Lipschitz constants of the second derivatives $\nabla^2_{11} L_x(z, D)$ and $\nabla^2_{21} L_x(z, D)$, all w.r.t. $z$. Let $\nabla_1 L_x(z, D)$ be $L_D$-Lipschitz w.r.t. $D$.
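For this least-squares loss, the derivatives have closed forms from which candidate constants can be read off; a sketch (the paper's exact constants may additionally absorb the bounds from Assumption 2.2):

$$\nabla_1 L_x(z, D) = D^T(Dz - x), \qquad \nabla_2 L_x(z, D) = (Dz - x)z^T, \qquad \nabla^2_{11} L_x(z, D) = D^T D.$$

For instance, $\|\nabla_1 L_x(z, D) - \nabla_1 L_x(z', D)\|_2 = \|D^T D(z - z')\|_2 \leq \|D\|_2^2 \|z - z'\|_2$, so one may take $L_1 = \|D\|_2^2$; the remaining constants follow similarly once $x$, $z$, and $D$ are bounded.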
Lemma 2.3 (Lipschitz proximal). Given $h(z) = \lambda\|z\|_1$, its proximal operator has a bounded subderivative, i.e., $\|\partial P_h(z)\|_2 \leq c_{\mathrm{prox}}$.
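For the $\ell_1$ penalty, $P_h$ is the entrywise soft-thresholding operator, so the bound is explicit (here $c_{\mathrm{prox}} = 1$ suffices; the paper's constant may be defined more generally):

$$P_h(z)_j = \mathrm{sign}(z_j)\max(|z_j| - \lambda, 0), \qquad \partial P_h(z) = \mathrm{diag}(d), \quad d_j \in \begin{cases}\{1\} & |z_j| > \lambda \\ [0,1] & |z_j| = \lambda \\ \{0\} & |z_j| < \lambda,\end{cases}$$

so that $\|\partial P_h(z)\|_2 \leq 1$.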
3 Main Results
The gradients defined in PUDLE (Algorithm 2) can be compared against the local direction at each update of classical alternating minimization (Algorithm 1). Assuming there are infinite samples, i.e.,

$$\text{Best local direction:}\qquad \hat{g} \triangleq \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \nabla_2 L_{x^i}(\hat{z}^i, D) = \mathbb{E}_{x\in\mathcal{X}}\big[\nabla_2 L_x(\hat{z}, D)\big] \qquad (5)$$

where $\hat{z} = \arg\min_{z\in\mathbb{R}^p} L_x(z, D) + h(z)$. Additionally, to assess the estimators for model recovery, and hence dictionary learning, we compare them against gradients that point towards $D^*$ and $\tilde{D}$, namely

$$\text{Desired gradient for } D^*:\qquad g^* \triangleq \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \nabla_2 L_{x^i}(z^{i*}, D) = \mathbb{E}_{x\in\mathcal{X}}\big[\nabla_2 L_x(z^*, D)\big]$$
$$\text{Desired gradient for } \tilde{D}:\qquad \tilde{g} \triangleq \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \nabla_2 L_{x^i}(\tilde{z}^i, D) = \mathbb{E}_{x\in\mathcal{X}}\big[\nabla_2 L_x(\tilde{z}, D)\big] \qquad (6)$$

where $z^* = \arg\min_{z\in\mathbb{R}^p} L_x(z, D^*) + h(z)$. To see why the above are desired directions, we highlight that $(z^*, D^*)$ is a critical point of the expected risk. Hence, given the current $D$, to reach the critical point by gradient descent, we move towards the direction minimizing $\mathbb{E}_{x\in\mathcal{X}}[L_x(z^*, D)]$. Similarly, $(\tilde{z}, \tilde{D})$ is a critical point of the loss $L$, which also reaches zero for data following the model (3). Hence, to reach $\tilde{D} \in \arg\min_{D\in\mathcal{D}} \mathbb{E}_{x\in\mathcal{X}}[L_x(\tilde{z}, D)]$, we move towards the direction minimizing the loss in expectation. Given these directions, we analyze the errors of the gradients $g_t^{\mathrm{dec}}$, $g_t^{\mathrm{ae\text{-}lasso}}$, and $g_t^{\mathrm{ae\text{-}ls}}$, assuming infinite samples. In this regard, we first study the forward pass.
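To make (5) concrete: with finitely many samples, $\hat{g}$ is approximated by the empirical average of the dictionary gradient evaluated at exact lasso solutions. A minimal sketch, reusing the ista solver from the sketch above and the closed form $\nabla_2 L_x(z, D) = (Dz - x)z^T$; the batch size is an illustrative assumption:

```python
def empirical_g_hat(X, D, lam):
    # Finite-n approximation of (5): average nabla_2 L_{x^i}(z_hat^i, D),
    # where each z_hat^i is the exact lasso solution for sample x^i.
    g = np.zeros_like(D)
    for x_i in X:
        z_hat_i = ista(x_i, D, lam)
        g += np.outer(D @ z_hat_i - x_i, z_hat_i)  # nabla_2 L_x(z_hat, D)
    return g / len(X)

X = rng.standard_normal((100, 20))      # n = 100 samples of dimension m = 20
g_hat = empirical_g_hat(X, D, lam=0.5)  # -> E_x[nabla_2 L_x(z_hat, D)] as n grows
```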
3.1 Forward pass
We show two convergence results in the forward pass, one for the code $z_t$ and another for its Jacobian, i.e.,

Definition 3.1 (Code Jacobian). Given $D$, the Jacobian of $z_t$ is defined as $J_t \triangleq \frac{\partial z_t}{\partial D}$.
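A minimal sketch of Definition 3.1: computing $J_t$ for an encoder of $t$ unfolded ISTA iterations by central finite differences over the entries of $D$ (an illustrative stand-in for automatic differentiation through the forward pass; it reuses soft_threshold, x, and D from the sketches above, and the values of t and eps are assumptions):

```python
def unfold(x, D, lam, t, alpha):
    # z_t: the encoder's forward pass, t unfolded ISTA iterations.
    z = np.zeros(D.shape[1])
    for _ in range(t):
        z = soft_threshold(z - alpha * D.T @ (D @ z - x), alpha * lam)
    return z

def code_jacobian(x, D, lam, t, alpha, eps=1e-6):
    # J_t = dz_t/dD, flattened to shape (p, m*p), by central differences.
    m, p = D.shape
    J = np.zeros((p, m * p))
    for k in range(m * p):
        E = np.zeros_like(D)
        E.flat[k] = eps
        J[:, k] = (unfold(x, D + E, lam, t, alpha)
                   - unfold(x, D - E, lam, t, alpha)) / (2 * eps)
    return J

alpha = 1.0 / np.linalg.norm(D, 2) ** 2  # step size fixed at the unperturbed D
J_50 = code_jacobian(x, D, lam=0.5, t=50, alpha=alpha)
# As t grows, z_t -> z_hat and J_t -> J_hat, at the rates bounded in this section.
```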
The forward-pass analysis gives upper bounds on the error between $z_t$ and $\hat{z}$ and the error between $J_t$ and $\hat{J}$ as a function of the number of unfolded iterations $t$. We will require these errors in Section 3.2, where we analyze the gradient estimation errors. Similar to [28], the error associated with $g_t^{\mathrm{dec}}$ depends on the convergence of the code. Unlike $g_t^{\mathrm{dec}}$, the convergence of backpropagation with the gradient estimates $g_t^{\mathrm{ae\text{-}lasso}}$ and $g_t^{\mathrm{ae\text{-}ls}}$ relies on the convergence properties of both the code and the Jacobian [31]. Our forward-pass theory builds on studies by Gilbert on the convergence of variables and their derivatives in an iterative process governed by a smooth operator [54]. Moreover, Hale et al. studied the convergence of fixed-point iterations for $\ell_1$-regularized optimization problems [55]. In Proposition 3.1, we re-state a result from [55] on support selection.