Data Mining with R in Practice: Case Studies


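"Data Mining with R: Learning with Case Studies" is a book on data mining techniques that focuses on hands-on practice with the R language. Through a series of case studies it covers the main areas of data mining and provides complete R code, so that readers can understand and apply the techniques in depth.

This article looks at how data mining and R are applied in real cases, along with some core concepts of the field. Data mining is a key branch of modern data analysis: the process of extracting useful information from large, complex datasets. R, a powerful tool for statistical analysis and graphics, has become one of the languages data scientists reach for most often. The R data mining techniques the book may cover include preprocessing, pattern recognition, classification, clustering, and association rule learning.

Preprocessing is the first step in data mining and includes data cleaning, handling missing values, detecting outliers, and transforming variables. Packages such as `dplyr` and `tidyr` make these data manipulations convenient. Descriptive statistics and visualization (for example with `ggplot2`) are also an important part of this stage, because they help reveal the basic characteristics of the data.

Decision trees (`rpart` package), random forests (`randomForest` package), and support vector machines (SVM, `e1071` package) are very common in building predictive models. Classification in the strict sense predicts a discrete target variable; the same packages also provide regression variants for continuous targets.

Clustering is a form of unsupervised learning used to discover the intrinsic structure of data, for example K-means (the `kmeans` function in the `stats` package that ships with R; the `cluster` package adds related methods) and hierarchical clustering (the `hclust` function). Add-on packages also support other approaches, such as spectral clustering and the density-based DBSCAN.

Association rule learning (`arules` package) finds frequent patterns among itemsets and is often used for market basket analysis. By identifying associations among the products customers buy, a company can design more effective marketing strategies.

The book quite possibly also includes time series analysis, using the `forecast` package to model and forecast time series data. It may also touch on text mining, with packages such as `tm` and `SnowballC` used to extract valuable information from text.

Data mining in bioinformatics and medicine is another focus. In the `Bioconductor` project, for example, R is widely used for gene expression analysis and biomarker discovery. Mining electronic health records (EHR) involves issues such as privacy protection, patient grouping, and disease prediction.

Finally, the book may discuss geospatial data mining, such as the geographic information system functionality provided by the `sp` and `rgdal` packages (`rgdal` has since been retired from CRAN in favor of packages such as `sf`), and how to combine GIS with data mining techniques to explore geographic patterns.

"Data Mining with R: Learning with Case Studies" is a comprehensive textbook that teaches through examples, aiming to help readers master the core techniques of data mining and implement them effectively in R. Readers who work through it should be better equipped to make data-driven decisions in practice.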
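As a small illustration of the clustering functions named in the overview, the sketch below runs K-means and hierarchical clustering on R's built-in `iris` data. The choice of dataset, of three clusters, and of average linkage are assumptions made for this example only; none of them comes from the book.

```r
# Built-in iris measurements, without the species label
measurements <- iris[, 1:4]

# K-means with three clusters; set.seed makes the random starts reproducible
set.seed(42)
km <- kmeans(measurements, centers = 3, nstart = 10)

# Agglomerative hierarchical clustering on Euclidean distances
hc <- hclust(dist(measurements), method = "average")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two partitions to see how far they agree
table(kmeans = km$cluster, hclust = hc_groups)
```

Both `kmeans` and `hclust` come from the `stats` package, so this runs on a plain R installation without any add-on packages.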

List of Tables

3.1 A Confusion Matrix for the Prediction of Trading Signals

4.1 A Confusion Matrix for the Illustrative Example

Chapter 1

Introduction

R is a programming language and an environment for statistical computing. It is similar to the S language developed at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. There are versions of R for the Unix, Windows and Mac families of operating systems. Moreover, R runs on different computer architectures like Intel, PowerPC, Alpha systems and Sparc systems. R was initially developed by Ihaka and Gentleman (1996), both from the University of Auckland, New Zealand. The current development of R is carried out by a core team of a dozen people from different institutions around the world. R development takes advantage of a growing community that cooperates in its development due to its open source philosophy. In effect, the source code of every R component is freely available for inspection and/or adaptation. This fact allows you to check and test the reliability of anything you use in R. There are many critics of the open source model. Most of them mention the lack of support as one of the main drawbacks of open source software. This is certainly not the case with R! There are many excellent documents, books and sites that provide free information on R. Moreover, the excellent R-help mailing list is a source of invaluable advice and information, much better than any amount of money could ever buy! There are also searchable mailing list archives that you can (and should!) use before posting a question. More information on these mailing lists can be obtained at the R Web site in the section “Mailing Lists”.

Data mining has to do with the discovery of useful, valid, unexpected, and understandable knowledge from data. These general objectives are obviously shared by other disciplines like statistics, machine learning, or pattern recognition. One of the most important distinguishing issues in data mining is size. With the widespread use of computer technology and information systems, the amount of data available for exploration has increased exponentially. This poses difficult challenges to the standard data analysis disciplines: one has to consider issues like computational efficiency, limited memory resources, interfaces to databases, etc. All these issues turn data mining into a highly interdisciplinary subject involving tasks not only of typical data analysts but also of people working with databases, data visualization in high dimensions, etc.

R has limitations with handling enormous datasets because all computation is carried out in the main memory of the computer. This does not mean that we will not be able to handle these problems. Taking advantage of the highly


R commands are entered at the R command prompt, “>”. Whenever you see this prompt you can interpret it as R waiting for you to enter a command. You type in the commands at the prompt and then press the enter key to ask R to execute them. This may or may not produce some form of output (the result of the command) and then a new prompt appears. At the prompt you may use the arrow keys to browse and edit previously entered commands. This is handy when you want to type commands similar to what you have done before, as you avoid typing them again.
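As a minimal sketch of this interaction, the lines below can be typed at the prompt one at a time (without the “>”). The object name `x` and the values are illustrative assumptions, not taken from the text above. The first command produces no visible output, while the other two print a result before the prompt returns.

```r
# An assignment is executed silently: no output, just a new prompt
x <- 945

# Typing the name of an object prints its value
x

# An expression is evaluated and its result is printed
x * 2 + 10
```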

Still, you can take advantage of the code provided at the book Web site to cut and paste between your browser or editor and the R console, thus avoiding having to type all commands described in the book. This will surely facilitate your learning experience and improve your understanding of its potential.

1.2 A Short Introduction to R

The goal of this section is to provide a brief introduction to the key issues of the R language. We do not assume any familiarity with computer programming. Readers should be able to easily follow the examples presented in this section. Still, if you feel some lack of motivation to continue reading this introductory material, do not worry. You may proceed to the case studies and then return to this introduction as you get more motivated by the concrete applications.

R is a functional language for statistical computation and graphics. It can be seen as a dialect of the S language (developed at AT&T), for which John Chambers was awarded the 1998 Association for Computing Machinery (ACM) Software System Award, whose citation stated that this language “forever altered how people analyze, visualize and manipulate data”.

R can be quite useful just by using it in an interactive fashion at its command line. Still, more advanced uses of the system will lead users to develop their own functions to systematize repetitive tasks, or even to add or change some functionalities of the existing add-on packages, taking advantage of R being open source.
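To make the idea of systematizing a repetitive task concrete, here is a minimal sketch of a user-defined function. The name `standard_error` and the decision to drop missing values are assumptions made for this illustration only.

```r
# Hypothetical helper: standard error of the mean of a numeric vector
standard_error <- function(x) {
  x <- x[!is.na(x)]          # drop missing values before computing
  sqrt(var(x) / length(x))   # sample variance divided by the sample size
}

# Reuse the same function on different data instead of retyping the formula
standard_error(c(2, 4, 6, 8))
standard_error(c(10.5, NA, 9.8, 11.2))
```

Once defined, the function is called like any built-in R function, which is what makes it convenient for computations you repeat often.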

1.2.1 Starting with R

In order to install R on your system, the easiest way is to obtain a binary distribution from the R Web site (http://www.R-project.org), where you can follow the link that takes you to the CRAN (Comprehensive R Archive Network) site to obtain, among other things, the binary distribution for your particular operating system/architecture. If you prefer to build R directly from the sources, you can get instructions on how to do it from CRAN.


After downloading the binary distribution for your operating system you just need to follow the instructions that come with it. In the case of the Windows version, you simply execute the downloaded file (R-2.10.1-win32.exe) and select the options you want in the following menus. In some operating systems you may need to contact your system administrator to fulfill the installation task due to lack of permissions to install software.

To run R in Windows you simply double-click the appropriate icon on your desktop, while in Unix versions you should type R at the operating system prompt. Both will bring up the R console with its prompt “>”.

If you want to quit R you can issue the command q() at the prompt. You will be asked if you want to save the current workspace. You should answer yes only if you want to resume your current analysis later on, at the point where you are leaving it.

Although the set of tools that comes with R is by itself quite powerful, it is natural that you will end up wanting to install some of the large (and growing) set of add-on packages available for R at CRAN. In the Windows version this is easily done through the “Packages” menu. After connecting your computer to the Internet you should select the “Install package from CRAN...” option from this menu. This option will present a list of the packages available at CRAN. You select the one(s) you want, and R will download and install them on your system. In Unix versions, things may be slightly different depending on the graphical capabilities of your R installation. Still, even without selection from menus, the operation is simple. Suppose you want to download the package that provides functions to connect to MySQL databases. The name of this package is RMySQL. You just need to type the following command at the R prompt:

> install.packages('RMySQL')

The install.packages() function has many parameters, among which there is the repos argument that allows you to indicate the nearest CRAN mirror. Still, the first time you run the function in an R session, it will prompt you for the repository you wish to use.
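The repos argument can be sketched as follows. The mirror URL (the CRAN cloud mirror) and the helper name `install_if_missing` are assumptions chosen for this illustration; substitute a mirror near you. The helper only contacts CRAN when the package is not already installed.

```r
# Assumption: the CRAN cloud mirror; replace with a mirror near you
cran_mirror <- "https://cloud.r-project.org"

# Hypothetical helper: install a package only when it is not already present
install_if_missing <- function(pkg, repos = cran_mirror) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Passing repos explicitly avoids the interactive mirror question
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call returns without contacting CRAN
install_if_missing("stats")
```

Calling `install_if_missing('RMySQL')` would follow the same path as the command shown above, with the mirror fixed in advance instead of being asked for.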

One thing that you surely should do is install the package associated with this book, which will give you access to several functions used throughout the book as well as datasets. To install it you proceed as with any other package:

> install.packages('DMwR')

The installer file name (R-2.10.1-win32.exe) changes with newer versions; this is the name for version 2.10.1.

Please note that the install.packages() commands shown here also work in Windows versions, although you may find the use of the menu more practical.

You can get an idea of the functionalities of each of the R packages in the R FAQ (frequently asked questions) at CRAN.

The list of available mirrors can be found at http://cran.r-project.org/mirrors.html.
