a reliable estimate for large sample sizes, as we saw in the example it can be
statistically unreliable for small n, in which case it serves more as a summary
of the sample data than as a precise estimate of θ.
If our interest lies more in obtaining an estimate of θ than in summarizing
our sample data, we may want to consider estimators of the form
$$\hat{\theta} = \frac{n}{n+w}\,\bar{y} + \frac{w}{n+w}\,\theta_0,$$
where $\theta_0$ represents a “best guess” at the true value of θ and w represents a
degree of confidence in the guess. If the sample size is large, then $\bar{y}$ is a reliable
estimate of θ. The estimator $\hat{\theta}$ takes advantage of this by having its weights
on $\bar{y}$ and $\theta_0$ go to one and zero, respectively, as n increases. As a result, the
statistical properties of $\bar{y}$ and $\hat{\theta}$ are essentially the same for large n. However,
for small n the variability of $\bar{y}$ might be more than our uncertainty about $\theta_0$.
In this case, using $\hat{\theta}$ allows us to combine the data with prior information to
stabilize our estimation of θ.
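The behavior of the weights can be checked numerically. The following is a minimal sketch, not code from the text: the function name `shrinkage_estimate` and the values chosen for the sample mean, the prior guess and w are hypothetical and serve only to illustrate how the estimate moves from the prior guess toward the sample mean as n grows.

```python
def shrinkage_estimate(ybar, n, theta0, w):
    """Weighted average of the sample mean ybar and a prior guess theta0.

    The weight on ybar is n / (n + w); the weight on theta0 is w / (n + w).
    """
    weight_data = n / (n + w)
    weight_prior = w / (n + w)
    return weight_data * ybar + weight_prior * theta0


# Hypothetical values chosen for illustration only.
ybar = 0.0     # observed sample mean
theta0 = 0.10  # prior "best guess" at theta
w = 5.0        # degree of confidence in the guess

for n in (5, 20, 100, 10_000):
    estimate = shrinkage_estimate(ybar, n, theta0, w)
    print(f"n = {n:6d}  weight on ybar = {n / (n + w):.4f}  estimate = {estimate:.5f}")
```

With w held fixed, the weight on the sample mean approaches one as n increases, so the estimate is pulled toward the prior guess only when the sample is small.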
These properties of $\hat{\theta}$ for both large and small n suggest that it is a useful
estimate of θ for a broad range of n. In Section 5.4 we will confirm this by
showing that, under some conditions, $\hat{\theta}$ outperforms $\bar{y}$ as an estimator of θ for
all values of n. As we saw in the infection rate example and will see again in
later chapters, $\hat{\theta}$ can be interpreted as a Bayesian estimator using a certain
class of prior distributions. Even if a particular prior distribution p(θ) does not
exactly reflect our prior information, the corresponding posterior distribution
p(θ|y) can still be a useful means of providing stable inference and estimation
for situations in which the sample size is low.
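One standard instance of this interpretation, offered here as an illustrative assumption rather than a restatement of the text, is a beta(a, b) prior combined with binomial data. The posterior mean is then (a + y)/(a + b + n), which matches the weighted-average form with $\theta_0 = a/(a+b)$ and w = a + b. The sketch below checks that identity numerically; the function names and the values of a, b, n and y are hypothetical.

```python
def posterior_mean_beta_binomial(y, n, a, b):
    """Posterior mean of theta under a beta(a, b) prior and y successes in n trials."""
    return (a + y) / (a + b + n)


def shrinkage_estimate(ybar, n, theta0, w):
    """Weighted average of the sample mean and the prior guess."""
    return n / (n + w) * ybar + w / (n + w) * theta0


# Hypothetical prior and data, for illustration only.
a, b = 2.0, 20.0
n, y = 20, 0

theta0 = a / (a + b)  # prior mean plays the role of the "best guess"
w = a + b             # prior "sample size" plays the role of the confidence

direct = posterior_mean_beta_binomial(y, n, a, b)
weighted = shrinkage_estimate(y / n, n, theta0, w)
print(direct, weighted)
```

The two computations agree up to floating-point error, which is the sense in which the weighted-average estimator is a posterior mean under this class of priors.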
1.2.2 Building a predictive model
In Chapter 9 we will discuss an example in which our task is to build a
predictive model of diabetes progression as a function of 64 baseline explanatory
variables such as age, sex and body mass index. Here we give a brief synopsis of
that example. We will first estimate the parameters in a regression model
using a “training” dataset consisting of measurements from 342 patients. We will
then evaluate the predictive performance of the estimated regression model
using a separate “test” dataset of 100 patients.
Sampling model and parameter space
Letting $Y_i$ be the diabetes progression of subject i and $x_i = (x_{i,1}, \ldots, x_{i,64})$
be the explanatory variables, we will consider linear regression models of the
form
$$Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{64} x_{i,64} + \sigma \epsilon_i.$$
The sixty-five unknown parameters in this model are the vector of regression
coefficients $\beta = (\beta_1, \ldots, \beta_{64})$ as well as σ, the standard deviation of the error
term. The parameter space is 64-dimensional Euclidean space for β and the
positive real line for σ.
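The training and test workflow for a model of this form can be sketched as follows. This is a rough illustration under stated assumptions, not the analysis of Chapter 9: the data are simulated rather than the diabetes measurements, the "true" coefficients and error standard deviation are made up, standard normal errors are assumed, and ordinary least squares is used as a simple stand-in for whatever estimation method is applied later.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n_train, n_test = 64, 342, 100  # dimensions taken from the example

# Hypothetical "true" parameters, used only to simulate data.
beta_true = rng.normal(0.0, 0.5, size=p)
sigma_true = 1.0


def simulate(n):
    """Draw explanatory variables and responses from the linear sampling model."""
    X = rng.normal(size=(n, p))
    y = X @ beta_true + sigma_true * rng.normal(size=n)
    return X, y


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Estimate the 64 regression coefficients by least squares on the training data.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Estimate sigma from the training residuals (n - p degrees of freedom).
resid = y_train - X_train @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n_train - p))

# Evaluate predictive performance on the held-out test data.
test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)

print(f"estimated sigma: {sigma_hat:.3f}")
print(f"test mean squared error: {test_mse:.3f}")
```

Here `beta_hat` has 64 entries and `sigma_hat` is a single positive number, mirroring the sixty-five unknown parameters of the model. The test-set error is computed only from patients not used in fitting, which is the point of keeping the two datasets separate.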