straightforward to verify that $C^{-1/2} = \Lambda_C^{-1/2} U_C^\top$ is a valid inverse square root
of $C$. An alternative is the symmetric matrix $C^{-1/2} = U_C \Lambda_C^{-1/2} U_C^\top$, which we
will use to visualize whitened data.
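For concreteness, both whiteners can be sketched in a few lines of NumPy. This is our own illustration, not code from any particular library; compute_whiteners and its variable names are ours, and a full-rank covariance is assumed (rank-deficient covariances would require eigenvalue truncation).

import numpy as np

def compute_whiteners(cov):
    # Eigendecompose C = U_C Lambda_C U_C^T (eigh assumes a symmetric matrix).
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = 1.0 / np.sqrt(eigvals)  # assumes cov is full rank
    # Plain whitener: Lambda_C^{-1/2} U_C^T.
    whitener = inv_sqrt[:, np.newaxis] * eigvecs.T
    # Symmetric whitener: U_C Lambda_C^{-1/2} U_C^T.
    whitener_sym = eigvecs @ (inv_sqrt[:, np.newaxis] * eigvecs.T)
    return whitener, whitener_sym

# Sanity check: whitened data have identity covariance.
rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 1000))
C = Y @ Y.T / Y.shape[1]
W, W_sym = compute_whiteners(C)
assert np.allclose(W @ C @ W.T, np.eye(5), atol=1e-10)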
To reduce redundancy: Eq. (2) reveals that minimum-norm estimates
(MNE) actually implement what is known as Tikhonov regularization
(Tikhonov and Arsenin, 1977) or Ridge regression in the field of statistical
learning (Hoerl and Kennard, 1970). As a consequence, if the gain matrix
and the data are appropriately whitened, the general conditions of statistical
regression models apply to the magnetoencephalography and electroen-
cephalography (M/EEG) inverse problem. Minimum-norm estimates
therefore rely on the specification of the noise covariance matrix, which
needs to be estimated from the data. In other words, the quality of
the inverse solution depends on the quality of the covariance estimate.
This holds true for most other source localization methods, in particular
beamformers such as LCMV and DICS (Veen et al., 1997; Gross et al.,
2001). It also applies to MNE variants such as dynamic statistical
parametric mapping (dSPM) (Dale et al., 2000) or standardized low resolution
brain electromagnetic tomography (sLORETA) (Pascual-Marqui, 2002),
as well as to other distributed models such as minimum-current estimates
(MCE) (Uutela et al., 1999) or mixed-norm estimates (MxNE)
(Gramfort et al., 2012; Gramfort et al., 2013a). Covariance estimation can
therefore not be considered a problem local to any single method.
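To make the connection with Ridge regression explicit, here is a minimal sketch under the assumption of a whitened gain matrix G and whitened data; the function name and the regularization parameter lam are ours, not the paper's.

import numpy as np

def minimum_norm_inverse(G, lam):
    # Ridge/Tikhonov form of the minimum-norm inverse operator:
    # solves min_X ||Y - G X||^2 + lam ||X||^2, whose solution
    # X = G^T (G G^T + lam I)^{-1} Y is exactly Ridge regression.
    # G is the (whitened) gain matrix, shape (n_sensors, n_sources).
    n_sensors = G.shape[0]
    return G.T @ np.linalg.inv(G @ G.T + lam * np.eye(n_sensors))

# Usage on whitened data: X_hat = minimum_norm_inverse(G_white, lam=1.0) @ Y_white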
Covariance estimation
Model selection using cross-validation
The noise covariance estimator is typically applied to segments of
M/EEG data that were not used to estimate the noise covariance and
that typically include both brain signals and noise. Its quality can
hence be assessed by investigating how well the model describes new
data. This idea of assessing model quality on unseen data is put into
practice by aggregating results over random partitions of the data, and
is referred to as cross-validation. Since the data are assumed to follow a
multivariate Gaussian distribution, parameterized by a covariance ma-
trix C, the log-likelihood of some data Y reads:
$\mathcal{L}(Y \mid C) = -\frac{1}{2T}\,\operatorname{Trace}\left(Y Y^\top C^{-1}\right) - \frac{1}{2}\log\left((2\pi)^N \det(C)\right).$ (3)
The higher this quantity on unseen data, the more appropriate the
estimated noise covariance C and the higher its success at spatially
whitening the data. The log-likelihood hence allows us to select the
best noise covariance estimator out of a given set of models using
cross-validation with left-out data. In the following we will discuss po-
tentially relevant candidate strategies to estimate covariance matrices
on M/EEG data.
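As an illustration, Eq. (3) and its use for model selection could be sketched as follows; this is our own NumPy code, not the authors' implementation.

import numpy as np

def log_likelihood(Y, C):
    # Eq. (3): average Gaussian log-likelihood of zero-mean data Y
    # (shape N x T) under the covariance model C.
    N, T = Y.shape
    _, logdet = np.linalg.slogdet(C)  # numerically stable log(det(C))
    quad = np.trace((Y @ Y.T) @ np.linalg.inv(C)) / (2.0 * T)
    return -quad - 0.5 * (N * np.log(2.0 * np.pi) + logdet)

# Cross-validation: fit each candidate covariance on training segments,
# then keep the model scoring highest on left-out segments, e.g.:
# best_C = max(candidates, key=lambda C: log_likelihood(Y_heldout, C))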
Empirical covariance and regularization
The empirical covariance matrix can be computed by $C = \frac{1}{T} Y Y^\top$,
where Y contains the data of size N × T. With a sufficient number of ob-
servations (T large), the sample covariance, which can be derived from
maximum likelihood, is a good estimator of the true covariance. Typically,
a noise covariance is computed on baseline segments preceding
stimulation or, for MEG, on empty-room measurements during which
no subject is present. The latter is, however, not possible for electroen-
cephalography (EEG) recordings, for which the covariance estimation
relies on data segments considered not relevant for the task, typically
during baseline. Biological artifacts often contaminate the data, leading
to outlier samples, and sometimes the data statistics change over
time, for example due to changes in environmental noise or changes
in head position. If in such situations only a limited number of samples
is available, the empirical covariance tends to suffer from high variance.
The estimate is then noisy and unreliable for further analysis.
One typical way to reduce the variance of the covariance estimator is
to apply diagonal loading. It consists of inflating the diagonal by a
hand-selected constant, which attenuates the relative weight of the
off-diagonal elements that correspond to inter-sensor covariance:

$C' = C + \alpha I, \quad \alpha > 0.$ (4)
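A minimal sketch of Eq. (4) in NumPy; the function and variable names are ours.

import numpy as np

def diagonally_loaded_cov(Y, alpha):
    # Empirical covariance with diagonal loading, Eq. (4);
    # Y is the N x T data matrix, alpha > 0 is hand-selected.
    N, T = Y.shape
    C = (Y @ Y.T) / T
    return C + alpha * np.eye(N)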
The value α is the regularization parameter. This diagonal weighting
of the covariance stabilizes MNE-like estimates by reducing the vari-
ance. However, the introduced bias amounts to assuming a stronger un-
correlated noise level, which leads to underestimated amplitudes in the
source estimates. This especially applies to dSPM and sLORETA, where
the noise variance is used to rescale MNE estimates and convert
them to statistical quantities such as F or T statistics. When used in
beamformers, such a regularization of the data covariance matrices
tends to increase the point spread function of the spatial filters and
smear the estimates (Woolrich et al., 2011). In addition, hand-set regu-
larization raises a new problem, which is how to choose the value of α.
Shrinkage models
An improvement over the hand-selected regularization or shrinkage
approach introduced in the section called Covariance estimation is pro-
vided by the Ledoit–Wolf (LW) shrinkage model (Ledoit and Wolf,
2004). This covariance model constitutes an optimal weighted average
of the invariant identity matrix and the empirical covariance matrix
(Eq. (5)). The LW covariance estimate $C_{LW}$ takes the form:
$C_{LW} = (1 - \alpha)\, C + \alpha \mu I,$ (5)
where I stands for the identity matrix, μ is the mean of the diagonal el-
ements of C, and α is called the shrinkage parameter. The contribution of
Ledoit and Wolf (2004) is to provide a formula to compute the optimal
value for α. The solution is derived from the values of N, the number of
dimensions, and T, the number of samples. It is provided in closed form
and minimizes the mean squared error between the estimator and the
population covariance. The underlying assumption of the LW estimator
is that the data are i.i.d. (independent and identically distributed),
which, as we will see below, is not a valid assumption for M/EEG data.
However, Ledoit and Wolf (2004) have shown that the optimal shrinkage
parameter guarantees $C_{LW}$ to be well conditioned: matrix inversion is
numerically stable, and more stable than with the empirical covariance.
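For reference, the LW estimator is available off the shelf, for instance in scikit-learn; a brief sketch on synthetic stand-in data:

import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
Y = rng.standard_normal((60, 500))  # stand-in for an N x T sensor array

# scikit-learn expects samples as rows, hence the transpose.
lw = LedoitWolf(assume_centered=True).fit(Y.T)
C_lw = lw.covariance_   # shrunk covariance, Eq. (5)
alpha = lw.shrinkage_   # closed-form optimal shrinkage parameter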
A data-driven extension to the Ledoit–Wolf estimator can be moti-
vated by Eq. (5). Instead of using the Ledoit–Wolf formula to compute
α, cross-validated likelihood estimation on unseen data can be used to
compare a range of α values and select the optimal regularization
parameter. The optimal α is then the one yielding the covariance
estimator with the maximum likelihood on unseen data.
Throughout the manuscript, models with a data-driven shrinkage coeffi-
cient as in Eq. (5) will be referred to as SC.
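Such a data-driven selection can be sketched with scikit-learn, whose ShrunkCovariance implements Eq. (5) and whose score method returns the Gaussian log-likelihood of held-out data; the synthetic data here are a stand-in for real recordings.

import numpy as np
from sklearn.covariance import ShrunkCovariance
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
Y = rng.standard_normal((60, 500))  # stand-in for an N x T sensor array

# Cross-validated grid search over the shrinkage parameter alpha;
# the score used internally is the held-out log-likelihood, as in Eq. (3).
grid = GridSearchCV(ShrunkCovariance(assume_centered=True),
                    {"shrinkage": np.linspace(0.01, 0.99, 30)}, cv=5)
grid.fit(Y.T)  # samples as rows
C_sc = grid.best_estimator_.covariance_
alpha_opt = grid.best_params_["shrinkage"]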
Probabilistic principal component analysis (PPCA)
M/EEG measurements are obtained by sensor recordings at various
locations in space. They include signals from the brain but also artifacts.
Such signals and artifacts yield spatially structured patterns on the sen-
sor array. For example, a source in the brain that is well modeled
by an equivalent current dipole (ECD) produces a dipolar pattern on the
sensors. If this dipole does not rotate, due to the physics of the forward
problem, the signal space spanned by this ECD is of dimension one. The
signal space is thus said to be of rank one. Both sources in the brain and
artifacts share this property of generating low-rank signals on the sen-
sors. This is, for example, what justifies the use of signal space projection
(SSP) (Uusitalo and Ilmoniemi, 1997). The idea behind SSP is that the
noise subspace includes artifact-related sources of low dimensionality
and that it is approximately orthogonal to the subspace spanned by
the brain signals of interest. Projecting the data onto the orthogonal
complement of the noise subspace therefore removes artifacts and
denoises the data.
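A minimal sketch of this projection, assuming the columns of noise_patterns (a name we introduce here) already span the identified artifact subspace:

import numpy as np

def ssp_projector(noise_patterns):
    # Projector onto the orthogonal complement of the noise subspace.
    # noise_patterns: (N, K) matrix whose K columns span the artifact
    # subspace; applying P removes those components: Y_clean = P @ Y.
    U, _ = np.linalg.qr(noise_patterns)  # orthonormal basis, shape (N, K)
    return np.eye(noise_patterns.shape[0]) - U @ U.T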
Principal component analysis (PCA) is a statistical method that is
built on this idea of a low-rank signal space. When using classical PCA