Hands-On with R: Exploring Data Science Through Text Mining

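"Hands-On Data Science with R: Text Mining" is a practice-oriented data science text that focuses on text mining with R. Written by Graham Williams, it aims to help readers extract valuable information from large volumes of text such as news articles, books, and emails, much as a person learns new things by reading. Text mining uses automated algorithms to process large bodies of text, beyond what any one person could read. The main topics covered in this chapter are:

1. **Text mining framework**: The chapter begins by introducing `tm`, the main R package for text mining, which provides the basic tools needed to process and analyze text data.
2. **Stemming**: The `SnowballC` package provides the `wordStem()` function, which reduces words to their root or stem. This reduces the variety of the vocabulary and simplifies analysis.
3. **Quantitative discourse analysis**: The `qdap` and `qdapDictionaries` packages support deeper text analysis, such as quantifying features of dialogue or interview transcripts, including topic distribution and sentiment.
4. **Data preparation and pipelines**: The `dplyr` package provides a flexible grammar for cleaning, transforming, and tidying data, with operations chained together using the `%>%` pipe operator.
5. **Color palettes and plotting**: `RColorBrewer` and `ggplot2` are used together to create attractive word frequency plots and other visualizations, and the `scales` package helps format numbers for display in plots.
6. **Correlation analysis**: The `Rgraphviz` package is used to generate association network plots that show relationships between terms, such as co-occurrence networks.

After working through this chapter, readers should be able to carry out the basic steps of text mining in R, including data import, pre-processing, feature extraction, and visualization, and so identify the most valuable information for a given topic or audience. The book also encourages readers to keep exploring in practice and to visit HandsOnDataScience.com for further chapters.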

DRAFT

Data Science with R Hands-On Text Mining

3 Preparing the Corpus

We generally need to perform some pre-processing of the text data to prepare for the text analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming and identifying synonyms. The basic transforms are all available within tm.

getTransformations()

## [1] "removeNumbers" "removePunctuation" "removeWords"

## [4] "stemDocument" "stripWhitespace"

The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed through to tm_map(). We will see an example of that in the next section.
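As a minimal sketch of how tm_map() applies built-in transformations to every document, the following builds a tiny two-document corpus and strips numbers, punctuation and extra whitespace. The sample sentences and the name `demo` are invented for illustration and are not part of the corpus used in this module. It assumes the tm package is installed.

```r
library(tm)

# Hypothetical two-document corpus; the text is made up for illustration.
demo <- VCorpus(VectorSource(c("Random forests, 2014 edition!",
                               "Text mining: 40 pages.")))

# Each tm_map() call applies one transformation to all documents.
demo <- tm_map(demo, removeNumbers)
demo <- tm_map(demo, removePunctuation)
demo <- tm_map(demo, stripWhitespace)

# Pull the transformed text back out of each document.
cleaned <- c(content(demo[[1]]), content(demo[[2]]))
print(cleaned)
```

Removing numbers and punctuation can leave runs of spaces behind, which is why stripWhitespace is applied last in this sketch.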

In the following sections we will apply each of the transformations, one-by-one, to remove unwanted characters from the text.

2013-2014 Graham@togaware.com Module: TextMiningO Page: 7 of 40

3.1 Simple Transforms

We start with some manual special transforms we may want to do. For example, we might want to replace “/”, sometimes used to separate alternative words, with a space. This avoids the two words being run together into one string of characters by the later transformations. We might also replace “@” and “|” with a space, for the same reason.

To create a custom transformation we make use of content_transformer() to create a function to achieve the transformation, and then apply it to the corpus using tm_map().

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, "/")

docs <- tm_map(docs, toSpace, "@")

docs <- tm_map(docs, toSpace, "\\|")

This can be done with a single call:

docs <- tm_map(docs, toSpace, "/|@|\\|")
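To see why the single pattern is equivalent to the three separate calls, here is a small base R sketch that applies the same gsub() substitutions to a plain string. The sample string is invented for illustration and is not taken from the corpus. In the pattern, the unescaped "|" means "or", while "\\|" matches a literal pipe character.

```r
# Made-up sample string containing all three separator characters.
x <- "classification/regression user@example.com a|b"

# Three separate substitutions, one per character.
step <- gsub("/", " ", x)
step <- gsub("@", " ", step)
step <- gsub("\\|", " ", step)

# One substitution using alternation between the three alternatives.
once <- gsub("/|@|\\|", " ", x)

print(once)
print(identical(step, once))
```

Both approaches give the same string here; the single call simply makes one pass over the text instead of three.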

Check the email address in the following output, where the “@” has been replaced by a space.

inspect(docs[16])

## <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

## [[1]]

## <<PlainTextDocument (metadata: 7)>>

## Hybrid weighted random forests for

## classifying very high-dimensional data

## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and

## Yunming Ye1

## 1

## Department of Computer Science, Harbin Institute of Technology Shenzhen Gr...

## School, Shenzhen 518055, China

## 2

## Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, S...

## 518055, China

## Email: amusing002 gmail.com

## Random forests are a popular classification method based on an ensemble of a

## single type of decision trees from subspaces of data. In the literature, t...

## are many different types of decision tree algorithms, including C4.5, CART...

## CHAID. Each type of decision tree algorithm may capture different information

## and structure. This paper proposes a hybrid weighted random forest algorithm,

## simultaneously using a feature weighting method and a hybrid forest method to

## classify very high dimensional data. The hybrid weighted random forest alg...

## can effectively reduce subspace size and improve classification performance

## without increasing the error bound. We conduct a series of experiments on ...

## high dimensional datasets to compare our method with traditional random fo...

....
