2.4. Nuisance parameters and hierarchical priors
In most scientific estimation problems, certain parameters are of central interest while other parameters serve only as a means to an end. In statistics these "uninteresting" parameters are called nuisance parameters. Sigworth (1998) treated the EM transformation variables (e.g., the rotations and translations that align images) as nuisance parameters, since they are used only transiently to estimate the parameter of interest, the reference structure. When the SNR is low, the estimates of the nuisance parameters can be highly uncertain. Since we ultimately do not care about the particular values of the nuisance parameters, it would be useful to somehow account for, and perhaps mitigate, the uncertainty in their values.
A general statistical method for dealing with nuisance parameters is to treat them as random variables with their own PDF. For example, Sigworth recognized that the image transformation variables could themselves be considered random variables, and he proposed a bivariate Gaussian distribution for the x,y coordinate (translation) transformation variables:
p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y}) = \frac{1}{2\pi\, \sigma_x \sigma_y} \exp\left[ -\frac{\| x_i - \hat{x} \|^2}{2\sigma_x^2} - \frac{\| y_i - \hat{y} \|^2}{2\sigma_y^2} \right]    (9)
where {x̂, ŷ, σ_x, σ_y} are the means and standard deviations of the x,y coordinates. Model parameters for the Euler angles can also be introduced if their distribution is non-uniform.
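As a concrete illustration, the prior in Eq. (9) is simply the product of two independent univariate Gaussians and is straightforward to evaluate. The following minimal sketch computes it for a single image's translation; the function name and argument conventions are ours, not taken from any published code.

import numpy as np

def translation_prior(x_i, y_i, x_hat, y_hat, sigma_x, sigma_y):
    # Bivariate Gaussian prior of Eq. (9) for one image's translation (x_i, y_i),
    # with independent x and y components parameterized by the means (x_hat, y_hat)
    # and standard deviations (sigma_x, sigma_y).
    norm = 1.0 / (2.0 * np.pi * sigma_x * sigma_y)
    return norm * np.exp(-(x_i - x_hat) ** 2 / (2.0 * sigma_x ** 2)
                         - (y_i - y_hat) ** 2 / (2.0 * sigma_y ** 2))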
A PDF for parameters is called a prior. In this case Eq. (9) is specifically referred to as a hierarchical prior, since we now have a statistical model with a hierarchy of distributions: a PDF for the data, given certain parameters, supplemented by a higher-level PDF for some of the parameters. The parameters of the hierarchical prior (e.g., σ_x and σ_y in Eq. (9)) may be called hierarchical parameters, or hyperparameters, to distinguish them from the parameters of the pure likelihood function.
Given a hierarchical prior for the φ_i parameters, the likelihoods in Eqs. (2) and (4) can then be augmented to construct an extended likelihood function p(X_i, φ_i | Θ) by multiplying the normal likelihood by the hierarchical prior:

p(X_i, \phi_i \mid \Theta) = p(X_i \mid \Theta, \phi_i)\, p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y})    (10)
= \frac{1}{\left( \sqrt{2\pi}\, \sigma_i \right)^{M}} \exp\left[ -\frac{\| X_i - P(\phi_i; A) \|^2}{2\sigma_i^2} \right] p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y})    (11)
where Θ = {A, σ, x̂, ŷ, σ_x, σ_y} is the augmented set of all model parameters associated with reference structure A. Note that Eqs. (10) and (11) correspond to 3D versions of Eqs. (11) and (12) of Sigworth, using our notation. The full hierarchical joint likelihood of a set of images is thus:
p(X, \phi \mid \Theta) = \frac{1}{\left( \sqrt{2\pi} \right)^{MN}} \prod_i^{N} \left( \sigma_i^{-M} \exp\left[ -\frac{\| X_i - P(\phi_i; A) \|^2}{2\sigma_i^2} \right] p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y}) \right)    (12)
with corresponding log-likelihood:
\ln[p(X, \phi \mid \Theta)] = -\frac{MN}{2} \ln(2\pi) - \frac{1}{2} \sum_i^{N} \frac{\| X_i - P(\phi_i; A) \|^2}{\sigma_i^2} - M \sum_i^{N} \ln \sigma_i + \sum_i^{N} \ln\left[ p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y}) \right]    (13)
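For readers who prefer code to notation, Eq. (13) can be written as a short routine. In the sketch below the projection operator P(φ_i; A) and the hierarchical prior are passed in as callables, since their details (interpolation scheme, CTF handling, the exact form of the prior) lie outside the scope of the equation; all names are illustrative rather than taken from any particular package.

import numpy as np

def log_extended_likelihood(X, phi, sigma, A, project, prior):
    # X: (N, M) array of images, each flattened to M pixels
    # phi: length-N sequence of per-image transformation parameters
    # sigma: (N,) array of per-image noise standard deviations
    # project(phi_i, A): returns the (M,) projection P(phi_i; A) of reference A
    # prior(phi_i): returns p(phi_i | sigma_x, sigma_y, x_hat, y_hat)
    N, M = X.shape
    loglik = -0.5 * M * N * np.log(2.0 * np.pi)
    for i in range(N):
        residual = X[i] - project(phi[i], A)
        loglik -= 0.5 * np.dot(residual, residual) / sigma[i] ** 2  # data misfit term
        loglik -= M * np.log(sigma[i])                              # noise normalization
        loglik += np.log(prior(phi[i]))                             # hierarchical prior term
    return loglik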
Other hierarchical priors can be added in a similar fashion to describe the distribution of other parameters, for example defocus
(Chen et al., 2009) and magnification. The form of the additional
distributions is often assumed to be Gaussian. Other authors may
refer to an extended likelihood as a regularized likelihood, penalized
likelihood, or a hierarchical likelihood. An extended likelihood as in
Eq. (11) is also a joint likelihood, as it is equivalent to the joint
PDF of the data and the parameters given the hyperparameters.
Given a hierarchical statistical model and a corresponding extended likelihood, there are several different ways to proceed with parameter estimation. When the hyperparameters of the hierarchical prior distributions are estimated from the data using variants of ML methodology, such techniques are referred to as extended likelihood or empirical Bayes. There are two main ML variants: (a) to
maximize the extended likelihood directly, and (b) to maximize
the marginal likelihood, in which the nuisance parameters have
been integrated out.
2.5. Maximization of the joint extended likelihood
The extended likelihood can be maximized over all unknown
parameters simultaneously, including both the parameters and
the hyperparameters in the optimization. In practice, this is usually
done using an iterative algorithm, in which each parameter is maximized in turn, conditional on the current optimal values of all
other parameters. However, any multi-parameter optimization
method may be used.
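A schematic of this coordinate-wise scheme is sketched below. The update rules for the reference, the transformations, and the noise parameters are left as caller-supplied functions, since they depend on the specific model; only the hyperparameter update, which for the Gaussian prior of Eq. (9) reduces to the sample mean and standard deviation of the current translation estimates, is written out. This is an illustration of the general structure, not a description of any particular program.

import numpy as np

def update_hyperparameters(phi):
    # ML estimates of the hyperparameters of the Gaussian prior in Eq. (9),
    # given the current (N, 2) array of translation estimates (x_i, y_i).
    x_hat, y_hat = phi.mean(axis=0)
    sigma_x, sigma_y = phi.std(axis=0)
    return x_hat, y_hat, sigma_x, sigma_y

def maximize_joint_extended_likelihood(X, params, update_steps, n_iter=20):
    # params: dict of current values, e.g. {'A': ..., 'phi': ..., 'sigma': ..., 'hyper': ...}
    # update_steps: list of (name, fn) pairs, where fn(X, params) returns the
    # conditional maximizer of that parameter block given the data and the
    # current values of all other blocks.
    for _ in range(n_iter):
        for name, fn in update_steps:
            params[name] = fn(X, params)
    return params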
Maximization of the joint extended likelihood aims to find the joint point estimates of the "best" values for all parameters simultaneously. However, this method may not work well when the hierarchical prior is diffuse or multimodal. Direct maximization of the joint likelihood works best when the prior PDF for the hyperparameters is smooth and highly peaked.
2.6. Maximization of the marginal likelihood
Alternatively, when the nuisance parameters are highly uncertain, it may be desirable to completely eliminate them from the analysis, while taking into account the uncertainty in their values. This is accomplished by integrating them out of the extended likelihood, resulting in a marginal likelihood function. For example, we can eliminate φ_i from the extended likelihood function in Eqs. (10) and (11) by integrating over its distribution:
p(X_i \mid \Theta) = \int_{\phi_i} p(X_i \mid \Theta, \phi_i)\, p(\phi_i \mid \sigma_x, \sigma_y, \hat{x}, \hat{y})\, d\phi_i    (14)
which results in a marginal PDF that is independent of φ_i. This is the approach taken by Sigworth (1998), who integrates out the transformation parameters and maximizes the marginal likelihood function over A and σ.
In practice there are several choices for accomplishing the marginalization. In the simplest cases an analytical solution can be obtained. Usually we are not so lucky and must resort to numerical methods such as brute-force integration, the Expectation–Maximization algorithm, or some combination of the two (Scheres, 2012a; Sigworth, 1998).
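As an illustration of the brute-force option, the integral in Eq. (14) can be approximated by a weighted sum over a discrete grid of candidate transformations. The sketch below assumes the projection operator and the prior are supplied by the caller, and the grid and its cell volume are arbitrary choices; in a real implementation the sum would be accumulated in the log domain (log-sum-exp) to avoid numerical underflow.

import numpy as np

def marginal_likelihood_i(X_i, A, sigma_i, project, prior, phi_grid, cell_volume):
    # Riemann-sum approximation of Eq. (14): p(X_i | Theta) is approximated by
    # summing p(X_i | Theta, phi) p(phi | ...) over a discrete grid of phi values,
    # each weighted by the volume of one grid cell.
    M = X_i.size
    norm = (np.sqrt(2.0 * np.pi) * sigma_i) ** (-M)
    total = 0.0
    for phi in phi_grid:
        residual = X_i - project(phi, A)
        data_term = np.exp(-0.5 * np.dot(residual, residual) / sigma_i ** 2)
        total += norm * data_term * prior(phi) * cell_volume
    return total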
2.7. Expectation–Maximization of the marginal likelihood
The Expectation–Maximization algorithm (normally abbreviated as EM, but we will avoid that here) finds the parameter values that maximize the marginal likelihood using a mathematical trick that only requires the (non-integrated) joint likelihood. In its most general form, the algorithm cycles between two steps: (a) the "expectation step", in which one finds the expected logarithm of the joint likelihood function, where the expectation is taken over the nuisance parameters (e.g., φ in Eq. (11)), conditional on the current values of the other parameters and the data, and