CHAPTER 2 ■ MACHINE LEARNING FUNDAMENTALS

unseen data, that is, $(x_i, y_i) \notin S$ and $(x_i, y_i) \in U$. We measure performance over this task as the error over unseen data,

$$E(f, D, U) = \frac{1}{|U|} \sum_{(x_i, y_i) \in U} \big[ f(x_i) \neq y_i \big].$$

We now have a precise definition of the task, which is to categorize data into one of two categories (y = ±1) based on some seen data S by generating f. We measure performance (and improvement in performance) using the error E(f, D, U) over unseen data U. The size of the seen data |S| is the conceptual equivalent of experience. In this context, we want to develop algorithms that generate such functions f (commonly referred to as models). In general, the field of machine learning studies the development of algorithms that produce models that make predictions over unseen data, for this and other formal tasks (we will introduce multiple such tasks later in the chapter). Note that x is commonly referred to as the input/input variable and y as the output/output variable.

As with any other discipline in computer science, the computational characteristics of such algorithms are an important facet; in addition, we would like a model f that achieves as low an error E(f, D, U) as possible with as small a |S| as possible.
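As a minimal sketch of this error measure, the snippet below computes the fraction of unseen examples that a model f misclassifies. The one-feature threshold model and the data U are made-up illustrations, not from the text.

```python
def error(f, U):
    """E(f, D, U): fraction of unseen examples (x, y) in U where f(x) != y."""
    return sum(1 for x, y in U if f(x) != y) / len(U)

# Hypothetical model: predict +1 if the single feature is positive, else -1.
f = lambda x: 1 if x > 0 else -1

# Hypothetical unseen data U: (x, y) pairs with labels y in {+1, -1}.
U = [(0.5, 1), (-1.2, -1), (2.0, 1), (0.3, -1)]

print(error(f, U))  # one of the four examples is misclassified -> 0.25
```

Note that this is the zero-one error: each unseen example contributes either 0 (correct) or 1 (incorrect), and the sum is normalized by |U|.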

Let us now relate this abstract but precise definition to a real-world problem so that our abstractions are grounded. Let us say an e-commerce web site wants to customize the landing page for registered users to show them the products the users might be interested in buying. The web site has historical data on users and would like to implement this as a feature so as to increase sales. Let us now see how this real-world problem maps onto the abstract problem of binary classification we described earlier.

The first thing that one might notice is that given a particular user and a particular product, one wants

to predict whether the user will buy the product. Since this is the value to be predicted, it maps onto y = ±1,

where we will let the value of y = +1 denote the prediction that the user will buy the product and we will

denote y = –1 as the prediction that the user does not buy the product. Note that there is no particular reason for picking these values; we could have swapped them (letting y = +1 denote the does-not-buy case and y = –1 the buy case) and there would be no difference. We just use y = ±1 to denote the two classes of

interest to categorize data. Next, let us assume that we can somehow represent the attributes of the product

and the user’s buying and browsing history as $x \in \mathbb{R}^n$. This step is referred to as feature engineering in

machine learning and we will cover it later in the chapter. For now, it suffices to say that we are able to

generate such a mapping. Thus, we have historical data of what the users browsed and bought, attributes of

a product and whether the user bought the product or not mapped onto $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Now,

based on this data, we would like to generate a function or a model $f : x \to y$, which we can use to determine

which products a particular user will buy and use this to populate the landing page for users. We can

measure how well the model is doing on unseen data by populating the landing page for users, seeing whether they buy the products or not, and evaluating the error E(f, D, U).
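To make this mapping concrete, here is a minimal end-to-end sketch in pure Python: made-up (user, product) feature vectors stand in for the engineered features, a simple nearest-neighbor rule stands in for the learning algorithm (the text does not prescribe one), and the error is evaluated on held-out unseen data.

```python
import math

# Hypothetical feature vectors x in R^2 for (user, product) pairs, e.g.
# (past purchases in this category, minutes spent on the product page),
# with y = +1 meaning "bought" and y = -1 meaning "did not buy".
S = [((3, 12.0), 1), ((0, 0.5), -1), ((5, 8.0), 1), ((1, 1.0), -1)]  # seen data
U = [((4, 10.0), 1), ((0, 2.0), -1)]                                 # unseen data

def generate_f(S):
    """Generate a model f from seen data S: a 1-nearest-neighbor rule."""
    def f(x):
        _, label = min(S, key=lambda pair: math.dist(pair[0], x))
        return label
    return f

def error(f, U):
    """E(f, D, U): fraction of unseen examples that f misclassifies."""
    return sum(1 for x, y in U if f(x) != y) / len(U)

f = generate_f(S)
print(error(f, U))  # -> 0.0 on this toy unseen set
```

The nearest-neighbor rule here is only one possible choice of learning algorithm; the point is the pipeline: seen data S produces f, and f is judged by its error on unseen data U.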

Regression

Let us introduce another task, namely regression. Here, we have data of the form $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, and our task is to generate a computational procedure that implements the function $f : x \to y$

. Note that instead of the prediction being a binary class label y = ±1, as in binary classification, we have a real-valued prediction. We measure performance over this task as the root mean squared error (RMSE) over

unseen data,