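A Practical Machine Learning Guide for Hacker Programmers

Machine Learning for Hackers, by Drew Conway and John Myles White, is written for experienced programmers who want to learn how to handle data with machine learning and statistical tools in the era of big data. Rather than taking the traditional math-heavy approach, the book offers a practical path built on hands-on case studies and black-box solutions. It works through a range of problems, covering approaches that work and approaches that do not, so that readers can recognize when the challenge in front of them matches a classical statistical problem. Along the way, the authors show how classical statistical tools can be applied to real problems across different machine learning settings. This makes the book suitable for programmers in the private, public, and academic sectors, whether they work on data mining, predictive analytics, or automated decision-making. The book focuses on machine learning in R, a popular programming language that is widely used in data analysis. Through practical projects and examples, readers learn how to build models, tune algorithms, and interpret results. The book also includes practical tips and strategies for quickly understanding and working with complex data sets. Machine Learning for Hackers emphasizes practical experience as well as theory, helping readers stay competitive in a fast-changing IT industry and use machine learning to work more efficiently and solve problems. For hackers who want to step beyond traditional programming into machine learning, it is an indispensable guide.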

CHAPTER 1

Using R

Machine learning exists at the intersection of traditional mathematics and statistics with software engineering and computer science. In this book, we will describe several tools from traditional statistics that allow you to make sense of the world. Statistics has almost always been concerned with learning something interpretable from data, whereas machine learning has been concerned with turning data into something practical and usable. This contrast makes it easier to understand the term machine learning: Machine learning is concerned with teaching computers something about the world, so that they can use that knowledge to perform other tasks. In contrast, statistics is more concerned with developing tools for teaching humans something about the world, so that they can think more clearly about the world in order to make better decisions.

In machine learning, the learning occurs by extracting as much information from the data as possible (or reasonable) through algorithms that parse the basic structure of the data and distinguish the signal from the noise. After they have found the signal, or pattern, the algorithms simply decide that everything else that’s left over is noise. For that reason, machine learning techniques are also referred to as pattern recognition algorithms. We can “train” our machines to learn about how data is generated in a given context, which allows us to use these algorithms to automate many useful tasks. This is where the term training set comes from, referring to the set of data used to build a machine learning process. The notion of observing data, learning from it, and then automating some process of recognition is at the heart of machine learning and forms the primary arc of this book. Two particularly important types of patterns constitute the core problems we’ll provide you with tools to solve: the problem of classification and the problem of regression, which will be introduced over the course of this book.

In this book, we assume a relatively high degree of knowledge in basic programming techniques and algorithmic paradigms. That said, R remains a relatively niche language, even among experienced programmers. In an effort to establish the same starting point for everyone, this chapter provides some basic information on how to get started using the R language. Later in the chapter we will provide an extended case study for working with data in R.

This chapter does not provide a complete introduction to the R programming language. As you might expect, no such introduction could fit into a single book chapter. Instead, this chapter is meant to prepare the reader for the tasks associated with doing machine learning in R, specifically the process of loading, exploring, cleaning, and analyzing data. There are many excellent resources on R that discuss language fundamentals such as data types, arithmetic concepts, and coding best practices. In so far as those topics are relevant to the case studies presented here, we will touch on all of these issues; however, there will be no explicit discussion of these topics. For those interested in reviewing these topics, many of these resources are listed in Table 1-3.

If you have never seen the R language and its syntax before, we highly recommend going through this introduction to get some exposure. Unlike other high-level scripting languages, such as Python or Ruby, R has a unique and somewhat prickly syntax and tends to have a steeper learning curve than other languages. If you have used R before but not in the context of machine learning, there is still value in taking the time to go through this review before moving on to the case studies.

R for Machine Learning

R is a language and environment for statistical computing and graphics.... R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

—The R Project for Statistical Computing, http://www.r-project.org/

The best thing about R is that it was developed by statisticians. The worst thing about R is that...it was developed by statisticians.

—Bo Cowgill, Google, Inc.

R is an extremely powerful language for manipulating and analyzing data. Its meteoric rise in popularity within the data science and machine learning communities has made it the de facto lingua franca for analytics. R’s success in the data analysis community stems from two factors described in the preceding epigraphs: R provides most of the technical power that statisticians require built into the default language, and R has been supported by a community of statisticians who are also open source devotees.

There are many technical advantages afforded by a language designed specifically for statistical computing. As the description from the R Project notes, the language provides an open source bridge to S, which contains many highly specialized statistical operations as base functions. For example, to perform a basic linear regression in R, one must simply pass the data to the lm function, which then returns an object containing detailed information about the regression (coefficients, standard errors, residual values, etc.). This data can then be visualized by passing the results to the plot function, which is designed to visualize the results of this analysis.
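As a minimal sketch of the workflow just described, the following fits a regression with lm and hands the result to plot. The built-in cars data set and the formula dist ~ speed are illustrative choices made here, not examples taken from the text.

```r
# Fit a simple linear regression on the built-in `cars` data set:
# stopping distance modeled as a function of speed (illustrative choice).
fit <- lm(dist ~ speed, data = cars)

# lm() returns an object of class "lm"; printing it shows the call
# and the fitted coefficients.
print(class(fit))
print(fit)

# Passing the fitted model to plot() draws the standard diagnostic plots.
# A 2x2 layout puts the four default plots on a single page.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

When run non-interactively (for example with Rscript), the plots are typically written to a file such as Rplots.pdf rather than shown on screen.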

In other languages with large scientific computing communities, such as Python, duplicating the functionality of lm requires the use of several third-party libraries to represent the data (NumPy), perform the analysis (SciPy), and visualize the results (matplotlib). As we will see in the following chapters, such sophisticated analyses can be performed with a single line of code in R.

In addition, as in other scientific computing environments, the fundamental data type in R is a vector. Vectors can be aggregated and organized in various ways, but at the core, all data is represented this way. This relatively rigid perspective on data structures can be limiting, but is also logical given the application of the language. The most frequently used data structure in R is the data frame, which can be thought of as a matrix with attributes, an internally defined “spreadsheet” structure, or relational database-like structure in the core of the language. Fundamentally, a data frame is simply a column-wise aggregation of vectors that R affords specific functionality to, which makes it ideal for working with any manner of data.
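The relationship between vectors and data frames can be sketched in a few lines. The column names and values below are invented purely for illustration.

```r
# Three vectors of equal length; the values are made up for illustration.
name   <- c("a", "b", "c")
height <- c(1.62, 1.75, 1.80)
adult  <- c(TRUE, TRUE, FALSE)

# data.frame() binds the vectors together as named columns.
people <- data.frame(name, height, adult, stringsAsFactors = FALSE)

# str() prints one line per column along with its type.
str(people)

# Pulling a single column back out gives an ordinary vector.
print(people$height)
print(is.vector(people$height))

# Rows can be selected with a logical vector, here the `adult` column.
print(people[people$adult, ])
```

Each column keeps its own type (character, numeric, logical), which is what distinguishes a data frame from a plain matrix.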

For all of its power, R also has its disadvantages. R does not scale well with large data, and although there have been many efforts to address this problem, it remains a serious issue. For the purposes of the case studies we will review, however, this will not be an issue. The data sets we will use are relatively small, and all of the systems we will build are prototypes or proof-of-concept models. This distinction is important because if your intention is to build enterprise-level machine learning systems at the Google or Facebook scale, then R is not the right solution. In fact, companies like Google and Facebook often use R as their “data sandbox” to play with data and experiment with new machine learning methods. If one of those experiments bears fruit, then the engineers will attempt to replicate the functionality designed in R in a more appropriate language, such as C.

This ethos of experimentation has also engendered a great sense of community around the language. The social advantages of R hinge on this large and growing community of experts using and contributing to the language. As Bo Cowgill alludes to, R was born out of statisticians’ desire to have a computing environment that met their specific needs. Many R users, therefore, are experts in their various fields. This includes an extremely diverse set of disciplines, including mathematics, statistics, biology, chemistry, physics, psychology, economics, and political science, to name a few. This community of experts has built a massive collection of packages on top of the extensive base functions in R. At the time of writing, CRAN, the R repository for packages, contained over 2,800 packages. In the case studies that follow, we will use many of the most popular packages, but this will only scratch the surface of what is possible with R.

Finally, although the latter portion of Cowgill’s statement may seem a bit menacing, it further highlights the strength of the R community. As we will see, the R language has a particularly odd syntax that is rife with coding “gotchas” that can drive away even experienced developers. But all grammatical grievances with a language can eventually be overcome, especially for persistent hackers. What is more difficult for nonstatisticians is the liberal assumption of familiarity with statistical and mathematical methods built into R functions. Using the lm function as an example, if you had never performed a linear regression, you would not know to look for coefficients, standard errors, or residual values in the results. Nor would you know how to interpret those results.
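To make the point concrete, here is a hedged sketch of where those quantities live in a fitted lm object. It assumes the built-in cars data set (speed in mph, stopping distance in feet), which is an illustrative choice and not an example from the text.

```r
# Illustrative model: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)

# summary() collects the estimates and their standard errors in one table
# with columns Estimate, Std. Error, t value, and Pr(>|t|).
coef_table <- summary(fit)$coefficients
print(coef_table)

# The slope estimates how many additional feet of stopping distance are
# associated with one additional mph of speed; its standard error
# expresses the uncertainty in that estimate.
slope    <- coef_table["speed", "Estimate"]
slope_se <- coef_table["speed", "Std. Error"]
print(c(slope = slope, std_error = slope_se))

# Residuals are the observed values minus the fitted values,
# one per row of the data.
res <- residuals(fit)
print(summary(res))
```

None of these accessors explains what a coefficient or a standard error means, which is exactly the assumed background the passage describes.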

But because the language is open source, you are always able to look at the code of a function to see exactly what it is doing. Part of what we will attempt to accomplish with this book is to explore many of these functions in the context of machine learning, but that exploration will ultimately address only a tiny subset of what you can do in R. Fortunately, the R community is full of people willing to help you understand not only the language, but also the methods implemented in it. Table 1-1 lists some of the best places to start.
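Looking at the code of a function, as mentioned above, can be sketched as follows. The choice of sd and summary.lm as the functions to inspect is an assumption made here for illustration.

```r
# Printing a function (or typing its name without parentheses at the
# console) shows its definition. sd() is a short function written in R.
print(sd)

# formals() lists a function's arguments without running it.
print(names(formals(lm)))

# Methods for generic functions such as summary() are looked up by class.
# getS3method() retrieves the method that is used for "lm" objects.
summary_lm <- getS3method("summary", "lm")

# deparse() turns the function into lines of source text; show the first few.
print(head(deparse(summary_lm), 5))
```

Some base functions hand their work to compiled code, in which case the printed definition shows only a thin R wrapper around the underlying call.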

Table 1-1. Community resources for R help

Resource: RSeek
Location: http://rseek.org/
Description: When the core development team decided to create an open source version of S and call it R, they had not considered how hard it would be to search for documents related to a single-letter language on the Web. This specialized search tool attempts to alleviate this problem by providing a focused portal to R documentation and information.

Resource: Official R mailing lists
Location: http://www.r-project.org/mail.html
Description: There are several mailing lists dedicated to the R language, including announcements, packages, development, and, of course, help. Many of the language’s core developers frequent these lists, and responses are often quick and terse.

Resource: StackOverflow
Location: http://stackoverflow.com/questions/tagged/r
Description: Hackers will know StackOverflow.com as one of the premier web resources for coding tips in any language, and the R tag is no exception. Thanks to the efforts of several prominent R community members, there is an active and vibrant collection of experts adding and answering R questions on StackOverflow.

Resource: #rstats Twitter hashtag
Location: http://search.twitter.com/search?q=%23rstats
Description: There is also a very active community of R users on Twitter, and they have designated the #rstats hash tag as their signifier. The thread is a great place to find links to useful resources, find experts in the language, and post questions, as long as they can fit into 140 characters!

Resource: R-Bloggers
Location: http://www.r-bloggers.com/
Description: There are hundreds of people blogging about how they use R in their research, work, or just for fun. R-bloggers.com aggregates these blogs and provides a single source for all things related to R in the blogosphere, and it is a great place to learn by example.

Resource: Video Rchive
Location: http://www.vcasmo.com/user/drewconway
Description: As the R community grows, so too does the number of regional meetups and gatherings related to the language. The Rchive attempts to document the presentations and tutorials given at these meetings by posting videos and slides, and now contains presentations from community members all over the world.
