Introduction to Python Machine Learning: Starting from Scratch


"Python机器学习入门教程" 这是一本面向初学者的Python机器学习指南，旨在帮助读者从零开始了解和掌握机器学习的基本概念和技术。书中的内容涵盖了机器学习的定义、分类以及Python在机器学习中的应用优势。在第一章中，作者介绍了机器学习的基本概念。机器学习是人工智能的一个分支，它允许系统通过经验学习和改进，而不是通过预先编程的方式来执行任务。机器学习与传统的编程方式不同，后者依赖于明确的指令集，而前者则侧重于数据和模式识别。机器学习主要分为两大类：监督学习和无监督学习。监督学习是指有标签的数据集被用于训练模型，如回归分析和分类；无监督学习则是在没有标签的情况下寻找数据中的模式，如聚类。第二章深入探讨了数据清洗和预处理的重要性。数据清洗涉及到处理噪声数据（不准确或错误的数据）、缺失数据和不一致数据。对于缺失数据，作者提供了案例研究，展示了如何在萨克拉门托房地产交易数据中进行缺失数据的修复。数据预处理包括数据集成（将来自不同源的数据统一起来）、数据转换（将数据转化为适合模型的形式）和数据降维（减少特征数量以降低复杂性）。此外，还讲解了交叉验证中k折技术的应用，包括k值选择和如何用Python实现折叠过程。第三章聚焦于监督学习，特别是回归分析和分类。回归分析是一种预测连续变量的方法，如线性回归，它通过拟合最佳直线来预测目标变量。书中介绍了如何使用相关性测试来评估模型的性能。在分类部分，重点介绍了决策树，这是一种基于特征的重要性和信息增益构建的树状模型。作者不仅解释了决策树的基本原理，还指导读者如何在Python中构建和可视化基础决策树。第四章介绍了无监督学习中的聚类方法，尤其是k-means算法。k-means是一种迭代算法，旨在将数据点分配到k个不同的簇中，以最小化簇内差异并最大化簇间差异。讨论了算法的偏见和方差问题，这些都是影响聚类效果的关键因素。这本书为读者提供了一个全面的Python机器学习入门路径，涵盖了从数据预处理到模型构建和评估的关键步骤，是初学者学习这一领域的理想资源。通过阅读和实践书中的例子，读者可以逐步建立起对机器学习的理解，并掌握使用Python进行实际项目的基本技能。

1.4BenefitsofApplyingPythoninMachineLearningProgramming

For machine learning algorithms, we need a programming language that is

understandableand clearfor large portionof dataresearchers andscientists. A

languagewithlibrariesthatareusefulfordifferenttypesofworkandinmatrix

mathinspecificwillbepreferable.Moreover,itisofaverygoodadvantageto

usea languagewith a largenumberof activedevelopers.These featuresmake

thearrowpointtothePythonasthebestchoice.ThemainadvantagesofPython

canbesummarizedinthefollowingpoints:

-Hasclearsyntax.

-Makestextmanipulationextremelyeasy.

-AlargenumberofpeopleandcommunitiesusePython.

- Possibility of programing in different styles: object-oriented, procedural,

functional,etc.

-Idealforprocessingnon-numericdata.

-AbilitytoextractdatafromHTML.

- Common in the scientific and also the financial communities. Therefore,

there is a seamless connection between the two fields especially in the

machinelearningasthefinancialfieldisoneofthemainsourcesofthe

datasets.

-ContainsanumberofusefulscientificlibrariessuchasSciPyandNumPy

whichenablesustoperformvectorandmatrixoperations.Theinstallation

of Python and also adding these libraries and other, are shown in

AppendixA.



2.1DataScrubbing

Datascrubbing,alsocalleddatacleansing,isaveryimportantstepbefore

applyinganymachinelearningalgorithm.Datasetarenormallydrawnfromreal

world sources which produce large amounts of messy datasets. Examples of

sources are statistics, or massive amounts of records generated from

organizationsthatworkindata-intensivefieldssuchasbanking,communication

systems,insurance,ortransportation.Thesemessydatamighthavenoisyentries,

missingdataorhaverecordsthatcontradictswitheachother.Thereareseveral

reasonsforwhythismayhappen.Noiseusuallyaffectsthedatabecauseofthe

hardware limitations and problems, as these can act as noise sources. For

example, if a blood sample is tested by a medical device that encounters a

problem,thedevicemeasurementsmaybeaffectedandvaryoneachrunforthe

samesample.Ifthedeviceisconnectedtoadatabaseviaanetwork,theunstable

readings may be transferred automatically to the database regardless of the

erroneousthathappened.Noisy datacanalsobe generatedbyhumanwhen he

makes faults. Missing data problem, on the other hand, may occur due to

technical issues, such as a server or a network hang during data transfer, or

because of manual data entry. These errors combined; noisy data and missing

data problems, may lead also to a third class of messy data, called: data

contradiction. Based on that, the data scrubbing step is very important before

startingthelearningprocess.

2.1.1NoisyData

When there is a difference between the model and the measurements,

thereexistsa“noise”thatcausesanerrororvariance.Examplesofmethodsthat

dealwith noisy datasetsare asfollows

[3],[4]

:Binningmethods:In this method,

the data is sorted and then smoothed by considering its neighborhood or

surroundingvalues.Thisiscalledalocalsmoothing.

Clustering:Thismethodisusefulindetectingandremovingtheoutliers.

Clusteringeachsetofsimilarvaluesmeansthatthevalueslocatedinonecluster

aredifferentfromthoseintheotherclusters.Theremainingoutlierscanthenbe

easilydetected.

MachinelearningAlgorithms:Regression,oneof the basicsupervised

machine learning algorithms, can be used to smooth data by fitting it into

regressionfunctions.Thisalgorithmwillbediscussedindetailsinchapterthree.

Human inspection: Human can interfere to manually detect and

eliminateoutliersorsmoothnoise.

In section, 2.3, details on how to deal and eliminate noisy data will be

presented.

2.1.2MissingData

Thefollowings,aresomecommonwaystodealwiththemissingdata

[5]

:Ignore

thetuple:This method of healinga dataset with missing values isconsidered

efficient, when a record of data contains many attributes with missing values.

However,when theamountof missingvaluesperattribute variessignificantly,

this scheme is no more effective. Efficiency of this method decreases when a

regulation of data is produced by an undefined source to the machine. The

machinewilldealwiththiskindofdataasmissingdata.Forexample,whena

patientisadmittedaccordingtounusualcondition.

Determineandfillinthemissingvaluemanually:Thisschememight

have high accuracy. However, it is infeasible mainly in terms of time

consumption.Also,itistiringandtedious.

MakingExpectations:Usually,therearewaysthatonecanusetopredict

amissingattribute.Forinstance,theaverageofthevaluesnearbyofthemissing

values. Although this way can cause a bias in the data because of mis-

predictions,butitcanbeutilizedtocheckandcompareitsresults,totheresults

obtainedbythefirstmethod,ignoringthetuples.

Inferring-basedalgorithms:Thiskindofalgorithmsareemployedwhen

amissingvalueisfilledbythemostprobablevalue.Examplesalgorithmsare:
