Machine Learning 8
Where y is a normally distributed vector of responses [target] with
mean µ and constant variance σ
2
. X is a typical model matrix, i.e. a
matrix of predictor variables and in which the first column is a vec-
tor of 1s for the intercept [bias
5
], and β is the vector of coefficients
5
Yes, you will see ’bias’ refer to an
intercept, and also mean something
entirely different in our discussion of
bias vs. variance.
[weights] corresponding to the intercept and predictors in the model.
What might be given less focus in applied courses however is how
often it won’t be the best tool for the job or even applicable in the form
it is presented. Because of this many applied researchers are still ham-
mering screws with it, even as the explosion of statistical techniques
of the past quarter century has rendered obsolete many current intro-
ductory statistical texts that are written for disciplines. Even so, the
concepts one gains in learning the standard linear model are general-
izable, and even a few modifications of it, while still maintaining the
basic design, can render it still very effective in situations where it is
appropriate.
Typically in fitting [learning] a model we tend to talk about R-
squared and statistical significance of the coefficients for a small
number of predictors. For our purposes, let the focus instead be on
the residual sum of squares
6
with an eye towards its reduction and
6
∑
(y − f (x))
2
where f (x) is a function
of the model predictors, and in this
context a linear combination of them
(Xβ).
model comparison. We will not have a situation in which we are only
considering one model fit, and so must find one that reduces the sum
of the squared errors but without unnecessary complexity and overfit-
ting, concepts we’ll return to later. Furthermore, we will be much more
concerned with the model fit on new data [generalization].
Logistic Regression
Logistic regression is often used where the response is categorical in
nature, usually with binary outcome in which some event occurs or
does not occur [label]. One could still use the standard linear model
here, but you could end up with nonsensical predictions that fall out-
side the 0-1 range regarding the probability of the event occurring, to
go along with other shortcomings. Furthermore, it is no more effort
nor is any understanding lost in using a logistic regression over the
linear probability model. It is also good to keep logistic regression in
mind as we discuss other classification approaches later on.
Logistic regression is also typically covered in an introduction to
statistics for applied disciplines because of the pervasiveness of binary
responses, or responses that have been made as such
7
. Like the stan-
7
It is generally a bad idea to discretize
continuous variables, especially the
dependent variable. However contextual
issues, e.g. disease diagnosis, might
warrant it.
dard linear model, just a few modifications can enable one to use it to
provide better performance, particularly with new data. The gist is,
it is not the case that we have to abandon familiar tools in the move
toward a machine learning perspective.