Machine Learning in Business: An Introduction to Data Science

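"Machine Learning in Business: An Introduction" is an introductory data science book by John C. Hull that focuses on applications of machine learning in business. The second edition was published in 2020 and aims to introduce readers to the world of data science. The author begins by describing the scope of the book and its backup material, giving readers an overview of the whole. He then describes the different types of machine learning models, including supervised and unsupervised learning, and explains why validation and testing matter. Data cleaning is an indispensable step in machine learning; the author stresses its effect on model performance and briefly covers the role of Bayes' theorem in probabilistic reasoning.

In the chapter on unsupervised learning, Hull explains feature scaling, a key preprocessing step that puts different features on the same scale. He also discusses the k-means algorithm in depth, a common clustering method that divides a data set into groups. Choosing a suitable value of k (the number of clusters) is a challenge, and the author mentions several strategies for doing so. He also examines the problem of high-dimensional data (the curse of dimensionality) and its effect on analysis, and presents a country risk example. The chapter also covers other clustering algorithms and principal components analysis (PCA), which are important tools in unsupervised learning.

In the supervised learning part, Hull concentrates on linear regression and logistic regression, two models widely used for prediction and classification. For linear regression, he distinguishes the single-feature and multi-feature cases and discusses how to handle categorical features. Regularization is a technique for preventing over-fitting, and Ridge and Lasso regression are introduced as variants of it. Logistic regression is used for binary classification problems; the book may cover its basic concepts, formulas, and applications.

At the end of each chapter, Hull provides a summary, short concept questions, and exercises to help readers consolidate what they have learned and apply it in practice. The book suits students and professionals who want to understand how machine learning is applied in a business setting, and it combines theory and practice to lead readers into the world of data-driven decision making.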

4 Chapter 1

inspire some readers to learn more and develop their abilities in this area. Data science may well prove to be the most rewarding and exciting profession in the 21st century.

To use machine learning effectively you have to understand how the underlying algorithms work. It is tempting to learn a programming language such as Python and apply various packages to your data without really understanding what the packages are doing or even how the results should be interpreted. This would be a bit like a finance specialist using the Black-Scholes-Merton model to value options without understanding where it comes from or its limitations.

The objective of this book is to explain the algorithms underlying machine learning so that the results from using the algorithms can be assessed knowledgeably. Anyone who is serious about using machine learning will want to learn a language such as Python for which many packages have been developed. This book takes the unusual approach of using both Excel and Python to provide backup material. This is because it is anticipated that some readers will, at least initially, be much more comfortable with Excel than with Python.

The backup material can be found on the author’s website:

www-2.rotman.utoronto.ca/~hull

Readers can start by focusing on the Excel worksheets and then move to Python as they become more comfortable with it. Python will enable them to use machine learning packages, handle data sets that are too large for Excel, and benefit from Python’s faster processing speeds.

1.2 Types of Machine Learning Models

There are four main categories of machine learning models:

• Supervised learning
• Unsupervised learning
• Semi-supervised learning
• Reinforcement learning

Supervised learning is concerned with using data to make predictions. In the next section, we will show how a simple regression model can be used to predict salaries. This is an example of supervised learning. In Chapter 3, we will consider how a similar model can be used to predict house prices. We can distinguish between supervised learning models that are used to predict a variable that can take a continuum of values (such as an individual’s salary or the price of a house) and models that are used for classification. Classification models are very common in machine learning. As an example, we will later look at an application of machine learning where potential borrowers are classified as acceptable or unacceptable credit risks.

Unsupervised learning is concerned with recognizing patterns in data. The main objective is not to forecast a particular variable. Rather it is to understand the environment represented by the data better. Consider a company that markets a range of products to consumers. Data on consumer purchases could be used to determine the characteristics of the customers who buy different products. This in turn could influence the way the products are advertised. As we will see in Chapter 2, clustering is the main tool used in unsupervised learning.

The data for supervised learning contains what are referred to as features and labels. The labels are the values of the target that is to be predicted. The features are the variables from which the predictions are to be made. For example, when predicting the price of a house the features could be the square feet of living space, the number of bedrooms, the number of bathrooms, the size of the garage, whether the basement is finished, and so on. The label would be the house price. The data for unsupervised learning consists of features but no labels because the model is being used to identify patterns, not to forecast something. We could use an unsupervised learning model to understand the houses that exist in a certain neighborhood without trying to predict prices. We might find that there is a cluster of houses with 1,500 to 2,000 square feet of living space, three bedrooms, and a one-car garage and another cluster of houses with 5,000 to 6,000 square feet of living area, six bedrooms, and a two-car garage.
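The distinction between features and labels can be made concrete with a small sketch. The three houses, their prices, and the helper name `split_features_and_labels` are invented for illustration and are not taken from the book.

```python
# Hypothetical house records; every number here is made up for illustration.
houses = [
    {"living_area_sqft": 1800, "bedrooms": 3, "garage_cars": 1, "price": 310_000},
    {"living_area_sqft": 1650, "bedrooms": 3, "garage_cars": 1, "price": 295_000},
    {"living_area_sqft": 5400, "bedrooms": 6, "garage_cars": 2, "price": 1_150_000},
]

FEATURE_NAMES = ("living_area_sqft", "bedrooms", "garage_cars")
LABEL_NAME = "price"


def split_features_and_labels(rows, feature_names, label_name):
    """Return one feature vector per row and the matching list of labels."""
    features = [[row[name] for name in feature_names] for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels


X, y = split_features_and_labels(houses, FEATURE_NAMES, LABEL_NAME)

# A supervised model would be trained on the pairs (X, y).
# An unsupervised model would be given X only; y is simply not used.
print(X)
print(y)
```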

Semi-supervised learning is a cross between supervised and unsupervised learning. It arises when we are trying to predict something and we have some data with labels (i.e., values for the target) and some (usually much more) unlabeled data. It might be thought that the unlabeled data is useless, but this is not necessarily the case. The unlabeled data can be used in conjunction with the labeled data to produce clusters which help prediction. For example, suppose we are interested in predicting whether a customer will purchase a particular product from features such as age, income level, and so on. Suppose further that we have a small amount of labeled data (i.e., data which indicates the features of customers as well as whether they bought or did not buy the product) and a much larger amount of unlabeled data (i.e., data which indicates the features of potential customers, but does not indicate whether they bought the product). We can apply unsupervised learning to use the features to cluster potential customers. Imagine a simple situation where:

• There are two clusters, A and B, in the full data set.
• The purchasers from the labeled data all correspond to points in Cluster A while the non-purchasers from the labeled data all correspond to points in the other Cluster B.

We might reasonably classify all individuals in Cluster A as buyers and all individuals in Cluster B as non-buyers.
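The cluster-then-label idea can be sketched in a few lines of Python. This is a minimal illustration, not the book's method: the text does not say which clustering algorithm is used at this point, so plain k-means with hand-picked starting centers is an assumption, and the eight customers (age in decades, income in tens of thousands of dollars) and the two known labels are invented.

```python
from collections import Counter


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]


def k_means(points, initial_centers, iterations=20):
    """Plain k-means from given starting centers; returns a cluster index per point."""
    centers = list(initial_centers)
    for _ in range(iterations):
        assignment = assign(points, centers)
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign(points, centers)


def label_clusters(assignment, known_labels):
    """Give each cluster the majority label among its labeled members.

    known_labels maps a point index to its label. A cluster that contains
    no labeled point gets no entry in the result.
    """
    votes = {}
    for index, label in known_labels.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Hypothetical customers: (age in decades, income in $10,000s). Invented data.
points = [(2.2, 3.5), (2.8, 4.2), (2.5, 3.9), (3.0, 4.5),
          (5.2, 8.8), (5.8, 9.5), (5.5, 9.1), (6.0, 8.6)]
known_labels = {1: "buyer", 6: "non-buyer"}  # only two customers are labeled

assignment = k_means(points, initial_centers=[points[0], points[-1]])
cluster_label = label_clusters(assignment, known_labels)
predicted = [cluster_label[cluster] for cluster in assignment]
print(predicted)
```

The unlabeled points do real work here: they determine where the clusters lie, and the handful of labels only decides which cluster is called what.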

Human beings use semi-supervised learning. Imagine that you do not know the names “cat” and “dog,” but are observant. You notice two distinct clusters of domestic pets in your neighborhood. Finally, someone points at two particular animals and tells you one is a cat while the other is a dog. You will then have no difficulty in using semi-supervised learning to apply the labels to all the other animals you have seen. If humans use semi-supervised learning in this way, it should come as no surprise that machines can do so as well. Many machine learning algorithms are based on studying the ways our brains process data.

The final type of machine learning, reinforcement learning, is concerned with situations where a series of decisions is to be taken. The environment is typically changing in an uncertain way as the decisions are being taken. Driverless cars use reinforcement learning algorithms. The algorithms underlie the programs mentioned earlier for playing games such as Go and chess. They are also used for some trading and hedging decisions. We will discuss reinforcement learning in Chapter 7.

1.3 Validation and Testing

When a data set is used for forecasting or determining a decision strategy, there is a danger that the machine learning model will work well for the data set, but will not generalize well to other data. An obvious point is that it is important that the data used in a machine learning model be representative of the situations to which the model is to be applied. For example, using data for a region where customers have a high income to predict the national sales for a product is likely to give biased results.

As statisticians have realized for a long time, it is also important to test a model out-of-sample. By this we mean that the model should be tested on data that is different from the sample data used to determine the parameters of the model.


Data scientists are typically not just interested in testing one model. They typically try several different models, choose between them, and then test the accuracy of the chosen model. For this, they need three data sets:

• a training set
• a validation set
• a test set

The training set is used to determine the parameters of the models that are under consideration. The validation set is used to determine how well each of the models generalizes to a different data set. The test set is held back to provide a measure of the accuracy of the chosen model.
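The division of labor between the three sets can be sketched as follows. This is only an illustration under stated assumptions: the two candidate models (a constant and a straight line), the use of root-mean-squared error as the accuracy measure, the function names, and all of the data are choices made for this sketch, not details from the book.

```python
import math


def fit_mean(xs, ys):
    """Candidate A: ignore the feature and always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate B: least-squares straight line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)


def rmse(model, xs, ys):
    """Root-mean-squared prediction error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def select_and_test(candidates, train, validation, test):
    # 1. Training set: determine the parameters of every candidate model.
    fitted = {name: fit(*train) for name, fit in candidates.items()}
    # 2. Validation set: choose the candidate that generalizes best.
    best = min(fitted, key=lambda name: rmse(fitted[name], *validation))
    # 3. Test set: used once, to measure the accuracy of the chosen model.
    return best, rmse(fitted[best], *test)


# Hypothetical (ages, salaries in $ thousands); invented for this sketch.
train = ([25, 32, 41, 50, 58], [90, 130, 170, 215, 250])
validation = ([28, 45, 60], [105, 190, 265])
test = ([30, 38, 55], [120, 155, 240])

candidates = {"mean": fit_mean, "line": fit_line}
best_name, test_rmse = select_and_test(candidates, train, validation, test)
print(best_name, round(test_rmse, 1))
```

Because the test set plays no part in fitting or in choosing between candidates, the error measured on it is not flattered by either step.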

We will illustrate this with a simple example. Suppose that we are interested in predicting the salaries of people working in a particular profession in a certain part of the United States from their age. We collect data on a random sample of 30 individuals. (This is a very small data set created to provide a simple example. The data sets used in machine learning are many times larger than this.) The first ten observations (referred to in machine learning as instances) will be used to form the training set. The next ten observations will be used to form the validation set and the final ten observations will be used to form the test set.
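The split described here is positional, so it can be sketched with list slicing. The function name `split_in_order` is hypothetical, and the integers 0 to 29 merely stand in for the 30 sampled individuals.

```python
def split_in_order(instances, n_train, n_validation):
    """Split a sequence into training, validation, and test sets by position."""
    train = instances[:n_train]
    validation = instances[n_train:n_train + n_validation]
    test = instances[n_train + n_validation:]  # everything that is left
    return train, validation, test


# Stand-ins for the 30 individuals; only their positions matter here.
observations = list(range(30))
train, validation, test = split_in_order(observations, n_train=10, n_validation=10)
print(len(train), len(validation), len(test))
```

Taking the observations in order is reasonable only because the sample is already random. If the data were sorted, for example by age, it would need to be shuffled before splitting.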

The training set is shown in Table 1.1 and plotted in Figure 1.1. It is tempting to choose a model that fits the training set really well. Some experimentation shows that a polynomial of degree five does this. This is the model:

$$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + b_4 X^4 + b_5 X^5$$

where Y is salary and X is age. The result of fitting the polynomial to the data is shown in Figure 1.2. Details of all analyses carried out are at

www-2.rotman.utoronto.ca/~hull
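For readers who want to see what such a fit involves, here is a minimal pure-Python sketch of a least-squares polynomial fit, together with the root-mean-squared error of the fitted values on the training data. Several things are assumptions of this sketch rather than the book's procedure: the ten (age, salary) pairs are invented and are not Table 1.1, the fit is done through the normal equations, and age is standardized first for numerical stability. A polynomial of degree five in standardized age is still a polynomial of degree five in X, so the model form is unchanged.

```python
import math


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns (coeffs, mean, scale); coeffs[k] multiplies z**k, where
    z = (x - mean) / scale is the standardized input.
    """
    n = len(xs)
    mean = sum(xs) / n
    scale = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    zs = [(x - mean) / scale for x in xs]
    m = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[i][k] = z_i ** k.
    aug = [[sum(z ** (r + c) for z in zs) for c in range(m)]
           + [sum((z ** r) * y for z, y in zip(zs, ys))]
           for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aug[r][m] - tail) / aug[r][r]
    return coeffs, mean, scale


def predict(model, x):
    coeffs, mean, scale = model
    z = (x - mean) / scale
    return sum(b * z ** k for k, b in enumerate(coeffs))


def rmse(model, xs, ys):
    """Root-mean-squared difference between predicted and actual values."""
    return math.sqrt(sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical training data (NOT the book's Table 1.1): age and salary in dollars.
ages = [24, 28, 31, 36, 41, 44, 49, 53, 58, 63]
salaries = [95_000, 110_000, 150_000, 210_000, 260_000,
            250_000, 275_000, 255_000, 250_000, 270_000]

model = fit_polynomial(ages, salaries, degree=5)
training_rmse = rmse(model, ages, salaries)
print(round(training_rmse))
```

A low error on the training data shows only that the curve passes close to the points it was fitted to. It says nothing yet about how the model behaves on other data.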

The model provides a good fit to the data. The standard deviation of the difference between the salary given by the model and the actual salary for the ten individuals in the training data set, which is referred to as the root-mean-squared error (rmse), is $12,902. However, common sense would suggest that we may have over-fitted the data. (This is because the curve in Figure 1.2 seems unrealistic. It declines, increases, declines, and then increases again as age increases.) We need to check the model out-of-sample. To use the language of data science, we need

