CHAPTER 2 ■ MACHINE LEARNING FUNDAMENTALS
unseen data, that is, $(x_i, y_i) \notin S$ and $(x_i, y_i) \in U$. We measure performance over this task as the error over unseen data,

$$E(f, D, U) = \frac{\big|\{(x_i, y_i) \in U : f(x_i) \neq y_i\}\big|}{|U|}.$$
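To make this definition concrete, here is a minimal Python sketch (not from the text; the classifier `f` and the data are invented purely for illustration) that computes the error as the fraction of misclassified examples in U:

```python
# Illustrative sketch: computing E(f, D, U) as the fraction of
# examples in the unseen data U that f misclassifies.

def error(f, U):
    """Fraction of (x, y) pairs in U where the prediction f(x) != y."""
    return sum(1 for x, y in U if f(x) != y) / len(U)

# A toy classifier: predict +1 if the first feature is positive, else -1.
f = lambda x: 1 if x[0] > 0 else -1

# Toy unseen data: the third example is misclassified by f.
U = [((0.5, 1.0), 1), ((-0.3, 2.0), -1), ((0.1, -1.0), -1)]
print(error(f, U))  # 1 of 3 examples misclassified -> 0.333...
```

Note that the error depends only on f's predictions over U; the seen data S enters the picture only through how f was generated.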
We now have a precise definition of the task, which is to categorize data into one of two categories
(y = ±1) based on some seen data S by generating f. We measure performance (and improvement in
performance) using the error E (f,D,U) over unseen data U. The size of the seen data |S| is the conceptual
equivalent of experience. In this context, we want to develop algorithms that generate such functions
f (commonly referred to as models). In general, the field of machine learning studies the development of
algorithms that produce models making predictions over unseen data, for this and other formal tasks (we
will introduce multiple such tasks later in the chapter). Note that x is commonly referred to as the input
(or input variable) and y is referred to as the output (or output variable).
As with any other discipline in computer science, the computational characteristics of such algorithms
are an important facet, but in addition we would also like to have a model f that achieves a low error
E(f, D, U) with as small a |S| as possible.
Let us now relate this abstract but precise definition to a real-world problem so that our abstractions
are grounded. Let us say an e-commerce web site wants to customize the landing page for registered users
to show them the products the users might be interested in buying. The web site has historical data on users
and would like to implement this as a feature so as to increase sales. Let us now see how this real-world
problem maps onto the abstract problem of binary classification we described earlier.
The first thing that one might notice is that given a particular user and a particular product, one wants
to predict whether the user will buy the product. Since this is the value to be predicted, it maps onto y = ±1,
where we will let the value of y = +1 denote the prediction that the user will buy the product and we will
denote y = –1 as the prediction that the user does not buy the product. Note that there is no particular
reason for picking these values; we could have swapped them (letting y = +1 denote the does-not-buy case
and y = –1 the buy case) and there would be no difference. We simply use y = ±1 to denote the two classes of
interest when categorizing data. Next, let us assume that we can somehow represent the attributes of the product
and the user’s buying and browsing history as $x \in \mathbb{R}^n$. This step is referred to as feature engineering in
machine learning and we will cover it later in the chapter. For now, it suffices to say that we are able to
generate such a mapping. Thus, we have historical data of what the users browsed and bought, attributes of
a product, and whether the user bought the product or not, mapped onto $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Now,
based on this data, we would like to generate a function or a model $f : x \to y$, which we can use to determine
which products a particular user will buy, and use this to populate the landing page. We can measure how
well the model is doing on unseen data by populating the landing page for users, seeing whether they buy
the products or not, and evaluating the error E(f, D, U).
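As a rough sketch of this seen/unseen workflow (with invented feature vectors, and a simple nearest-neighbor rule standing in for the model f; a real system would use engineered features and a learned model), one could write:

```python
# Hypothetical sketch: seen data S generates a model f, which is then
# evaluated on unseen data U. Feature vectors here are made up.
import math

S = [((1.0, 0.0), 1), ((0.9, 0.2), 1),    # seen: bought
     ((0.0, 1.0), -1), ((0.1, 0.9), -1)]  # seen: did not buy
U = [((0.8, 0.1), 1), ((0.2, 1.1), -1)]   # unseen

def f(x):
    """Predict the label of the nearest seen example (1-nearest neighbor)."""
    nearest = min(S, key=lambda s: math.dist(x, s[0]))
    return nearest[1]

E = sum(1 for x, y in U if f(x) != y) / len(U)
print(E)  # both unseen examples classified correctly -> 0.0
```

The point of the sketch is the separation of roles: S is the experience the model is generated from, and U is what the error E(f, D, U) is measured on.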
Regression
Let us introduce another task, namely regression. Here, we have data of the form $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, and our task is to generate a computational procedure that implements the function $f : x \to y$
. Note that instead of the prediction being a binary class label y = ±1, as in binary classification, we have a
real-valued prediction. We measure performance over this task as the root mean squared error (RMSE) over
unseen data,