S = |Q_D - \mu_H| / \sigma_H    (2.1)
The significance is properly a dimensionless quantity, but it is natural to call the units of S "sigmas." Thus, one might
speak of a two sigma effect as not especially significant, but ten sigmas as extremely significant. If the distribution
of statistic values is gaussian (and numerical experiments indicate that this is often a reasonable approximation),
then the p-value associated with a significance S is given by p = erfc(S/\sqrt{2}); this is the probability of observing a
significance S or larger if the null hypothesis is true.
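As a sketch (not part of the original text), the significance and its gaussian-approximation p-value can be computed from a collection of surrogate statistics; the function name and the use of NumPy are illustrative assumptions:

```python
import math
import numpy as np

def significance(q_data, q_surrogates):
    # S = |Q_D - mu_H| / sigma_H, Eq. (2.1): mu_H and sigma_H are the
    # mean and standard deviation of the statistic over the surrogates.
    mu_h = np.mean(q_surrogates)
    sigma_h = np.std(q_surrogates, ddof=1)
    s = abs(q_data - mu_h) / sigma_h
    # Under the gaussian approximation, p = erfc(S / sqrt(2)).
    p = math.erfc(s / math.sqrt(2.0))
    return s, p

# A two sigma effect corresponds to p of roughly 0.05:
s, p = significance(4.0, [1.0, 2.0, 3.0])  # mu_H = 2, sigma_H = 1
```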
If computational effort really were not a consideration, then a more robust way to define significance would be
directly in terms of p-values with rank statistics. In particular, if the observed time series has a statistic which is in
the lower one percentile of all the surrogate statistics (and at least a hundred surrogates would be needed to make
this determination), then a (two-sided) p-value of p = 0.02 could be quoted.
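The rank-based alternative can be sketched as follows; treating the lower and upper tails symmetrically to form a two-sided p-value is an illustrative convention, not prescribed by the text:

```python
import numpy as np

def rank_p_value(q_data, q_surrogates):
    q = np.asarray(q_surrogates)
    # Fractions of surrogate statistics at least as extreme as the data,
    # in each direction.
    lo = np.mean(q <= q_data)
    hi = np.mean(q >= q_data)
    # Two-sided p-value from the more extreme tail.
    return 2.0 * min(lo, hi)

# With 100 surrogates and the data statistic in the lowest percentile
# (only one surrogate value at or below it), p = 0.02:
p = rank_p_value(0.0, np.arange(100.0))
```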
2.2 Hierarchy of null hypotheses
The null hypothesis defines the nature of the candidate process which may or may not adequately explain the data.
Our null hypotheses usually specify that certain properties of the original data are preserved -- such as mean and
variance -- but that there is no further structure in the time series. The surrogate data is then generated to mimic
these preserved features but to otherwise be random. There is some latitude in choosing which features ought to be
preserved: certainly mean and variance, and possibly also the Fourier power spectrum. If the raw data is discretized
to integer values, then the surrogate data should be similarly discretized.
Ultimately we envision a hierarchy (perhaps even a hierarchical tree) of null hypotheses against which time series
might be compared. Beginning with the simplest hypotheses, and increasing in generality, the following sections
outline some of the possibilities that we have considered.
2.2.1 Temporally uncorrelated noise
The null hypothesis of no temporal correlations is of particular interest in circumstances (e.g., stock market returns,
or outcomes on a roulette wheel) where any correlation at all can potentially be exploited for profit. The simplest
null hypothesis in this case is that the observed data is fully described by independent and identically distributed
(IID) gaussian random variables. Surrogate data in this case are readily generated from a standard pseudorandom
number generator, normalized to the mean and variance of the original data.
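A minimal sketch of such a generator (the function name and the NumPy random generator are assumptions for illustration):

```python
import numpy as np

def gaussian_surrogate(x, rng=None):
    # IID gaussian draws, rescaled to the mean and variance of the data.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    return x.mean() + x.std() * rng.standard_normal(x.size)
```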
A clever extension of this approach was used by Scheinkman and LeBaron [15] in an analysis of stock market
returns. To test the hypothesis of IID noise with arbitrary amplitude distribution, they generated surrogate data by
shuffling the time-order of the original time series. This more closely mimics the original data, but it destroys any
temporal correlations that may have been in the data.
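The shuffled surrogate preserves the amplitude distribution exactly while destroying temporal correlations; a sketch, with the function name as an assumption:

```python
import numpy as np

def shuffle_surrogate(x, rng=None):
    # A random permutation of the original series: exactly the same
    # values, hence the same amplitude distribution, but randomized
    # time order.
    rng = np.random.default_rng() if rng is None else rng
    return rng.permutation(x)
```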
2.2.2 Ornstein-Uhlenbeck noise
For most physical systems, it is usually obvious that there are temporal correlations, but the nature of these correlations may not be so clear. The simplest case of non-IID noise is given by the Ornstein-Uhlenbeck process [16]. For
a discrete time series, this can be produced by
z_t = a_0 + a_1 z_{t-1} + \sigma e_t    (2.2)
where e_t is uncorrelated gaussian noise of unit variance. The coefficients a_0, a_1, and \sigma collectively determine the
mean, variance, and autocorrelation time of the time series. In fact, the autocorrelation function is exponential in
this case:
A(\tau) = \frac{\langle z_t z_{t-\tau} \rangle - \langle z_t \rangle^2}{\langle z_t^2 \rangle - \langle z_t \rangle^2} = e^{-\lambda |\tau|}    (2.3)

where \langle \cdot \rangle denotes an average over time t, and \lambda = -\log a_1.
To make surrogate data sets, the mean \mu, variance v, and first autocorrelation A(1) are estimated from the
original time series; from these the coefficients are fit: a_1 = A(1), a_0 = \mu(1 - a_1), and \sigma^2 = v(1 - a_1^2). Finally, one
generates the surrogate data by iterating Eq. (2.2), using a pseudorandom number generator for the unit variance
gaussian e_t.
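Putting the fit and the iteration together, an Ornstein-Uhlenbeck surrogate generator might look like this; the initial condition z_0 = \mu is an assumption (one could instead discard an initial transient):

```python
import numpy as np

def ou_surrogate(x, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Estimate mean, variance, and lag-one autocorrelation from the data.
    mu, v = x.mean(), x.var()
    a1 = np.mean((x[:-1] - mu) * (x[1:] - mu)) / v
    # Fit the coefficients: a_1 = A(1), a_0 = mu(1 - a_1),
    # sigma^2 = v(1 - a_1^2).
    a0 = mu * (1.0 - a1)
    sigma = np.sqrt(v * (1.0 - a1**2))
    # Iterate Eq. (2.2) with unit-variance gaussian noise e_t,
    # starting from the mean (an assumed initial condition).
    z = np.empty(x.size)
    z[0] = mu
    for t in range(1, x.size):
        z[t] = a0 + a1 * z[t - 1] + sigma * rng.standard_normal()
    return z
```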