Exploring Machine Learning: A Look at "Machine Learning for Hackers"

PDF · 23.08 MB · Updated 2024-07-24

"Machine Learning for Hackers 是一本由 Drew Conway 和 John Myles White 合著的机器学习经典书籍，涵盖了PCA（主成分分析），MDS（多维尺度分析），回归分析和最近邻等重要概念和技术。这本书面向对机器学习感兴趣的程序员和数据分析师，旨在通过实际案例帮助读者理解并应用机器学习方法。" 《Machine Learning for Hackers》这本书深入浅出地介绍了机器学习的基础和实践应用，是针对具有编程背景但可能没有深厚统计学基础的读者设计的。PCA（主成分分析）是一种常用的数据降维技术，它通过线性变换将原始数据转换为一组各维度线性无关的表示，从而简化数据，降低复杂度，同时保留大部分信息。在书中，作者会讲解如何运用PCA来处理高维数据，理解变量之间的关系，并进行数据可视化。 MDS（多维尺度分析）则是一种非线性的降维方法，用于将高维数据映射到低维空间中，保持对象之间的相似性或距离。这在地理空间分析、文本挖掘等领域中非常有用。作者可能会介绍如何使用MDS来发现数据的内在结构和模式。回归分析是预测模型的一种，通过找到因变量与一个或多个自变量之间的数学关系，来预测未知的因变量值。书中可能涵盖线性回归、逻辑回归等经典方法，以及如何处理缺失值、异常值和多重共线性等问题。最近邻（K-Nearest Neighbors, K-NN）是监督学习中的一个重要算法，适用于分类和回归问题。K-NN的基本思想是根据样本的特征，找出训练集中与其最相似的K个样本，然后根据这些样本的类别或数值来预测新样本的类别或数值。书中会介绍如何实施K-NN算法，包括选择合适的K值、距离度量方法以及优化策略。此外，本书可能还会涉及数据预处理、特征工程、模型评估和调优等相关主题。作者通过实例和代码示例，引导读者动手实践，提高对机器学习的理解和应用能力。无论是对黑客文化感兴趣的数据科学家，还是希望提升自己技能的IT专业人士，都能从中受益匪浅。

CHAPTER 1

Using R

Machine learning exists at the intersection of traditional mathematics and statistics with software engineering and computer science. In this book, we will describe several tools from traditional statistics that allow you to make sense of the world. Statistics has almost always been concerned with learning something interpretable from data, whereas machine learning has been concerned with turning data into something practical and usable. This contrast makes it easier to understand the term machine learning: Machine learning is concerned with teaching computers something about the world, so that they can use that knowledge to perform other tasks. In contrast, statistics is more concerned with developing tools for teaching humans something about the world, so that they can think more clearly about the world in order to make better decisions.

In machine learning, the learning occurs by extracting as much information from the data as possible (or reasonable) through algorithms that parse the basic structure of the data and distinguish the signal from the noise. After they have found the signal, or pattern, the algorithms simply decide that everything else that’s left over is noise. For that reason, machine learning techniques are also referred to as pattern recognition algorithms. We can “train” our machines to learn about how data is generated in a given context, which allows us to use these algorithms to automate many useful tasks. This is where the term training set comes from, referring to the set of data used to build a machine learning process. The notion of observing data, learning from it, and then automating some process of recognition is at the heart of machine learning and forms the primary arc of this book. Two particularly important types of patterns constitute the core problems we’ll provide you with tools to solve: the problem of classification and the problem of regression, which will be introduced over the course of this book.

In this book, we assume a relatively high degree of knowledge in basic programming techniques and algorithmic paradigms. That said, R remains a relatively niche language, even among experienced programmers. In an effort to establish the same starting point for everyone, this chapter provides some basic information on how to get started using the R language. Later in the chapter we will provide an extended case study for working with data in R.

This chapter does not provide a complete introduction to the R programming language. As you might expect, no such introduction could fit into a single book chapter. Instead, this chapter is meant to prepare the reader for the tasks associated with doing machine learning in R, specifically the process of loading, exploring, cleaning, and analyzing data. There are many excellent resources on R that discuss language fundamentals such as data types, arithmetic concepts, and coding best practices. Insofar as those topics are relevant to the case studies presented here, we will touch on all of these issues; however, there will be no explicit discussion of these topics. For those interested in reviewing these topics, many of these resources are listed in Table 1-3.

If you have never seen the R language and its syntax before, we highly recommend going through this introduction to get some exposure. Unlike other high-level scripting languages, such as Python or Ruby, R has a unique and somewhat prickly syntax and tends to have a steeper learning curve than other languages. If you have used R before but not in the context of machine learning, there is still value in taking the time to go through this review before moving on to the case studies.

R for Machine Learning

R is a language and environment for statistical computing and graphics.... R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

—The R Project for Statistical Computing, http://www.r-project.org/

The best thing about R is that it was developed by statisticians. The worst thing about R is that...it was developed by statisticians.

—Bo Cowgill, Google, Inc.

R is an extremely powerful language for manipulating and analyzing data. Its meteoric rise in popularity within the data science and machine learning communities has made it the de facto lingua franca for analytics. R’s success in the data analysis community stems from two factors described in the preceding epigraphs: R provides most of the technical power that statisticians require built into the default language, and R has been supported by a community of statisticians who are also open source devotees.

There are many technical advantages afforded by a language designed specifically for statistical computing. As the description from the R Project notes, the language provides an open source bridge to S, which contains many highly specialized statistical operations as base functions. For example, to perform a basic linear regression in R, one must simply pass the data to the lm function, which then returns an object containing detailed information about the regression (coefficients, standard errors, residual values, etc.). This data can then be visualized by passing the results to the plot function, which is designed to visualize the results of this analysis.
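
As a rough sketch of that workflow, the snippet below fits a regression with lm and hands the result to plot. The choice of R's built-in `cars` data set (stopping distance versus speed) is an assumption made for illustration, and the plot is written to a temporary PDF only so that the example also runs outside an interactive session.

```r
# Fit a simple linear regression on the built-in 'cars' data set.
fit <- lm(dist ~ speed, data = cars)

# The returned object carries the pieces mentioned in the text.
print(coef(fit))               # estimated intercept and slope
print(coef(summary(fit)))      # estimates with standard errors, t and p values
print(head(residuals(fit)))    # first few residual values

# plot() dispatches to the method for "lm" objects and draws diagnostics.
pdf(file = tempfile(fileext = ".pdf"))  # throwaway device for non-interactive runs
plot(fit, which = 1)                    # residuals versus fitted values
invisible(dev.off())
```

In an interactive session the `pdf()` and `dev.off()` lines can be dropped, and `plot(fit)` steps through the full set of diagnostic plots.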

In other languages with large scientific computing communities, such as Python, duplicating the functionality of lm requires the use of several third-party libraries to represent the data (NumPy), perform the analysis (SciPy), and visualize the results (matplotlib). As we will see in the following chapters, such sophisticated analyses can be performed with a single line of code in R.

In addition, as in other scientific computing environments, the fundamental data type in R is a vector. Vectors can be aggregated and organized in various ways, but at the core, all data is represented this way. This relatively rigid perspective on data structures can be limiting, but is also logical given the application of the language. The most frequently used data structure in R is the data frame, which can be thought of as a matrix with attributes, an internally defined “spreadsheet” structure, or relational database-like structure in the core of the language. Fundamentally, a data frame is simply a column-wise aggregation of vectors that R affords specific functionality to, which makes it ideal for working with any manner of data.
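
A short sketch may make the relationship between vectors and data frames concrete. The variable names and values below are hypothetical and chosen only for illustration.

```r
# Even a single number is a vector of length one.
print(is.vector(42))
print(length(42))

# Two vectors of equal length (hypothetical values).
person <- c("a", "b", "c")
height <- c(1.62, 1.75, 1.80)

# A data frame aggregates equal-length vectors column-wise.
people <- data.frame(person = person, height = height,
                     stringsAsFactors = FALSE)

# Each column is still an ordinary vector, so arithmetic is vectorized.
print(is.vector(people$height))
people$height_cm <- people$height * 100

str(people)   # compact view of the columns and their types
```

Internally a data frame is a list of columns, which is why `people$height` and `people[["height"]]` both return the underlying vector.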

For all of its power, R also has its disadvantages. R does not scale well with large data, and although there have been many efforts to address this problem, it remains a serious issue. For the purposes of the case studies we will review, however, this will not be an issue. The data sets we will use are relatively small, and all of the systems we will build are prototypes or proof-of-concept models. This distinction is important because if your intention is to build enterprise-level machine learning systems at the Google or Facebook scale, then R is not the right solution. In fact, companies like Google and Facebook often use R as their “data sandbox” to play with data and experiment with new machine learning methods. If one of those experiments bears fruit, then the engineers will attempt to replicate the functionality designed in R in a more appropriate language, such as C.

This ethos of experimentation has also engendered a great sense of community around the language. The social advantages of R hinge on this large and growing community of experts using and contributing to the language. As Bo Cowgill alludes to, R was born out of statisticians’ desire to have a computing environment that met their specific needs. Many R users, therefore, are experts in their various fields. This includes an extremely diverse set of disciplines, including mathematics, statistics, biology, chemistry, physics, psychology, economics, and political science, to name a few. This community of experts has built a massive collection of packages on top of the extensive base functions in R. At the time of writing, CRAN, the R repository for packages, contained over 2,800 packages. In the case studies that follow, we will use many of the most popular packages, but this will only scratch the surface of what is possible with R.

Finally, although the latter portion of Cowgill’s statement may seem a bit menacing, it further highlights the strength of the R community. As we will see, the R language has a particularly odd syntax that is rife with coding “gotchas” that can drive away even experienced developers. But all grammatical grievances with a language can eventually be overcome, especially for persistent hackers. What is more difficult for nonstatisticians is the liberal assumption of familiarity with statistical and mathematical methods built into R functions. Using the lm function as an example, if you had never performed a linear regression, you would not know to look for coefficients, standard errors, or residual values in the results. Nor would you know how to interpret those results.

But because the language is open source, you are always able to look at the code of a function to see exactly what it is doing. Part of what we will attempt to accomplish with this book is to explore many of these functions in the context of machine learning, but that exploration will ultimately address only a tiny subset of what you can do in R.
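
One way to act on this, sketched here with lm as the example, is to print a function's source from the R session and to list what a fitted object contains. The calls used are standard base R utilities; the `cars` data set is an assumption chosen for illustration.

```r
# Typing a function's name without parentheses prints its R source;
# print() makes that explicit so it also works inside a script.
print(lm)

# The same source can be captured as text, for example to search it.
src <- deparse(lm)
print(length(src))        # number of source lines

# plot() is a generic: methods() lists its implementations, and
# getAnywhere() retrieves one even when it is not exported.
print(methods(plot))
print(getAnywhere("plot.lm"))

# names() shows which components a fitted model object holds.
fit <- lm(dist ~ speed, data = cars)
print(names(fit))
```

Functions implemented in compiled code show only a call such as `.Internal(...)` or `.Call(...)` when printed, so for those the R-level source is only part of the picture.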

Fortunately, the R community is full of people willing to help you understand not only the language, but also the methods implemented in it. Table 1-1 lists some of the best places to start.

Table 1-1. Community resources for R help

| Resource | Location | Description |
| --- | --- | --- |
| RSeek | http://rseek.org/ | When the core development team decided to create an open source version of S and call it R, they had not considered how hard it would be to search for documents related to a single-letter language on the Web. This specialized search tool attempts to alleviate this problem by providing a focused portal to R documentation and information. |
| Official R mailing lists | http://www.r-project.org/mail.html | There are several mailing lists dedicated to the R language, including announcements, packages, development, and, of course, help. Many of the language’s core developers frequent these lists, and responses are often quick and terse. |
| StackOverflow | http://stackoverflow.com/questions/tagged/r | Hackers will know StackOverflow.com as one of the premier web resources for coding tips in any language, and the R tag is no exception. Thanks to the efforts of several prominent R community members, there is an active and vibrant collection of experts adding and answering R questions on StackOverflow. |
| #rstats Twitter hashtag | http://search.twitter.com/search?q=%23rstats | There is also a very active community of R users on Twitter, and they have designated the #rstats hashtag as their signifier. The thread is a great place to find links to useful resources, find experts in the language, and post questions, as long as they can fit into 140 characters! |
| R-Bloggers | http://www.r-bloggers.com/ | There are hundreds of people blogging about how they use R in their research, work, or just for fun. R-bloggers.com aggregates these blogs and provides a single source for all things related to R in the blogosphere, and it is a great place to learn by example. |
| Video Rchive | http://www.vcasmo.com/user/drewconway | As the R community grows, so too does the number of regional meetups and gatherings related to the language. The Rchive attempts to document the presentations and tutorials given at these meetings by posting videos and slides, and now contains presentations from community members all over the world. |
