• For j = 1 . . . d, set
\[
w_j^t = w_j^{t-1} + \alpha_t \times \frac{\partial}{\partial w_j} L(w^{t-1})
\]
where $\alpha_t > 0$ is some stepsize, and $\frac{\partial}{\partial w_j} L(w^{t-1})$ is the derivative of $L$ with respect to $w_j$.
3. Return the final parameters $w^T$.
Thus at each iteration we calculate the gradient at the current point $w^{t-1}$, and move some distance in the direction of the gradient.
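To make the iteration concrete, here is a minimal sketch of the loop in Python/NumPy. The gradient function grad_L, the constant stepsize, and the zero initialization are illustrative assumptions, not prescriptions from these notes.

```python
import numpy as np

def gradient_ascent(grad_L, d, T=100, alpha=0.1):
    """Gradient ascent on L(w): w^t = w^{t-1} + alpha_t * (dL/dw)(w^{t-1}).

    grad_L : hypothetical callable returning the gradient of L at a point w
    d      : dimensionality of the parameter vector w
    T      : number of iterations
    alpha  : constant stepsize alpha_t > 0 (a decaying schedule could be used)
    """
    w = np.zeros(d)                    # initialize w^0 = 0
    for t in range(1, T + 1):
        w = w + alpha * grad_L(w)      # move in the direction of the gradient
    return w                           # the final parameters w^T
```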
In practice, more sophisticated optimization methods are used: one common choice is L-BFGS, a quasi-Newton method. We won't go into the details of these optimization methods in the course: the good news is that good software packages are available for methods such as L-BFGS. Implementations of L-BFGS will generally require us to calculate the value of the objective function $L(w)$, and the value of the partial derivatives, $\frac{\partial}{\partial w_j} L(w)$, at any point $w$. Fortunately, this will be easy to do.
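As a sketch of how such a package is typically used, the call below hands SciPy's L-BFGS implementation exactly these two quantities. Since scipy.optimize.minimize minimizes, we negate the objective and gradient; the functions L and grad_L are assumed to be defined as above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_lbfgs(L, grad_L, d):
    """L-BFGS needs only the objective L(w) and its gradient at any point w.

    L, grad_L : assumed callables for the log-likelihood and its gradient
    d         : dimensionality of the parameter vector w
    """
    result = minimize(
        fun=lambda w: -L(w),          # minimize -L(w), i.e. maximize L(w)
        x0=np.zeros(d),               # initial parameters w^0 = 0
        jac=lambda w: -grad_L(w),     # gradient of the negated objective
        method="L-BFGS-B",
    )
    return result.x                   # the optimized parameter vector
```

Supplying the exact gradient (rather than leaving the optimizer to approximate it numerically) is what makes the easy-to-compute partial derivatives below so useful.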
So what form do the partial derivatives take? A little bit of calculus gives
\[
\frac{\partial}{\partial w_j} L(w) = \sum_i \phi_j(x_i, y_i) - \sum_i \sum_y p(y \mid x_i; w)\, \phi_j(x_i, y)
\]
The first sum in the expression, $\sum_i \phi_j(x_i, y_i)$, is the sum of the $j$'th feature value $\phi_j(x_i, y_i)$ across the labeled examples $\{(x_i, y_i)\}_{i=1}^n$. The second sum again involves a sum over the training examples, but for each training example we calculate the expected feature value, $\sum_y p(y \mid x_i; w)\, \phi_j(x_i, y)$. Note that this expectation is taken with respect to the distribution $p(y \mid x_i; w)$ under the current parameter values $w$.
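A direct translation of this expression into code might look as follows, assuming the log-linear form $p(y \mid x; w) \propto \exp(w \cdot \phi(x, y))$ used for these models, a finite label set, and a feature function phi(x, y) returning a NumPy vector; all names here are hypothetical.

```python
import numpy as np

def grad_log_likelihood(w, data, labels, phi):
    """Gradient of L(w): empirical feature counts minus expected counts.

    data   : list of (x_i, y_i) training pairs
    labels : list of all possible labels y (assumed finite)
    phi    : feature function phi(x, y) -> NumPy vector of length d
    """
    grad = np.zeros_like(w)
    for x_i, y_i in data:
        # First sum: the feature vector at the observed pair (x_i, y_i).
        grad += phi(x_i, y_i)
        # p(y | x_i; w) under the log-linear model: softmax of w . phi(x_i, y).
        scores = np.array([w @ phi(x_i, y) for y in labels])
        scores -= scores.max()                      # for numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        # Second sum: subtract the expected feature value under p(y | x_i; w).
        for p_y, y in zip(probs, labels):
            grad -= p_y * phi(x_i, y)
    return grad
```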
Regularized log-likelihood. In many applications, it has been shown to be highly beneficial to modify the log-likelihood function to include an additional regularization term. The modified criterion is then
\[
L(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i; w) - \frac{\lambda}{2} \|w\|^2
\]
where $\|w\|^2 = \sum_j w_j^2$, and $\lambda > 0$ is a parameter dictating the strength of the regularization term. We will again choose our parameter values to be
\[
w^* = \arg\max_{w \in \mathbb{R}^d} L(w)
\]
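To use this criterion with an optimizer such as L-BFGS, both the objective and its gradient need the extra term; since $\frac{\partial}{\partial w_j}\left(-\frac{\lambda}{2}\|w\|^2\right) = -\lambda w_j$, a sketch of the adjustment (reusing the hypothetical functions from the earlier sketches) is:

```python
import numpy as np

def regularized_objective(w, log_likelihood, lam):
    # L(w) = sum_i log p(y_i | x_i; w) - (lambda / 2) * ||w||^2
    return log_likelihood(w) - 0.5 * lam * np.dot(w, w)

def regularized_gradient(w, grad_log_likelihood, lam):
    # The regularizer contributes -lambda * w_j to each partial derivative.
    return grad_log_likelihood(w) - lam * w
```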