R in Practice: Mining Social Media Data and Applying Machine Learning

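"Social Media Mining with R" is a practice-oriented handbook for readers who want to extract business value from social media data. Built around the R language, it explains how to collect data through social media APIs (application programming interfaces) and how to analyze that data with R's machine learning capabilities. The book covers topics ranging from introductory to advanced.

Chapter 1, "Going Viral", introduces sentiment analysis in social media mining and discusses how people communicate in the era of Big Data. It explains what Big Data is and how human sensors (such as social media data) can serve as honest signals. The authors advocate a quantitative approach to analyzing trends and sentiment dynamics on social media, so that readers can gauge how social media content affects a brand or an event.

Chapter 2, "Getting Started with R", is a quick-start guide. It explains why R was chosen, citing its ease of use, flexibility, and rich statistical functionality. It starts with basic operations such as variable assignment and arithmetic, then moves on to functions, arguments, and help documentation. The chapter also covers vectors, sequences, and the basic steps for creating and importing data sets, and it stresses the importance of visualization in R, with examples showing how to improve the presentation and interpretation of data.

Chapter 3, "Mining Twitter with R", focuses on collecting Twitter data and analyzing it at a preliminary level. Readers learn why Twitter was chosen as the object of study and how to obtain and process data from the Twitter API in R. The preliminary analysis may include user behavior analysis and tracking of topic popularity, illustrating the potential of social media data for real-time insight into market trends.

Chapter 4, "Potentials and Pitfalls of Social Media Data", examines the opportunities and challenges of social media data. It covers opinion mining, that is, how to extract subjective views and sentiment from large volumes of user comments and posts, and it reminds readers to watch for data quality problems, privacy issues, and the biases and limitations of analytical results.

With this book, readers can learn how to apply R to social media data analysis and how to avoid common pitfalls, so that this mass of information can support business decisions. Both beginners and experienced data analysts will find practical tools and techniques in it.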

Chapter 1. Going Viral

In this chapter, we introduce readers to the concept of social media mining. We discuss

sentiment analysis, the nature of contemporary online communication, and the facets of

Big Data that allow social media mining to be such a powerful tool. Additionally, we

discuss some of the potential pitfalls of socially generated data and argue for a

quantitative approach to social media mining.

Social media mining using sentiment analysis

People are highly opinionated. We hold opinions about everything from international

politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining,

is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes,

and emotions through written language. Practically speaking, this field allows us to

measure, and thus harness, opinions. Up until the last 40 years or so, opinion mining

hardly existed. This is because opinions were elicited in surveys rather than in text

documents, computers were not powerful enough to store or sort a large amount of

information, and algorithms did not exist to extract opinion information from written

language.
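
As a rough illustration of what measuring opinions from written language can look like, the following minimal sketch counts positive and negative words in a few short messages. It is written in Python purely for illustration (the book itself works in R). The word lists, the `sentiment_score` function, and the sample messages are all hypothetical and are not taken from the book.

```python
import re

# Hypothetical toy lexicons, invented for this illustration only.
POSITIVE = {"good", "great", "love", "fast", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "slow", "cold"}

def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

# Hypothetical sample messages.
messages = [
    "Love this place, great pizza and fast delivery",
    "The pizza was cold and the delivery was slow",
    "I ordered a pizza yesterday",
]

scores = [sentiment_score(message) for message in messages]
for message, score in zip(messages, scores):
    print(f"{score:+d}  {message}")
```

A score above zero suggests a favorable message, below zero an unfavorable one, and zero a message with no opinion words the lexicon recognizes. Word counting of this kind ignores negation, sarcasm, and context, so it should be read as a starting point rather than a finished method.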

The explosion of sentiment-laden content on the Internet, the increase in computing

power, and advances in data mining techniques have turned social data mining into a

thriving academic field and crucial commercial domain. Professor Richard Hamming

famously pushed researchers to ask themselves, "What are the important problems in my

field?" Researchers in the broad area of natural language processing (NLP) cannot

help but list sentiment analysis as one such pressing problem. Sentiment analysis is not

only a prominent and challenging research area, but also a powerful tool currently being

employed in almost every business and social domain. This prominence is due, at least in

part, to the centrality of opinions as both measures and causes of human behavior.

This book is an introduction to social data mining. For us, social data refers to data

generated by people or by their interactions. More specifically, social data for the

purposes of this book will usually refer to data in text form produced by people for other

people's consumption. Data mining is a set of tools and techniques used to describe and

make inferences about data. We approach social data mining with a potent mix of applied

statistics and social science theory. As for tools, we utilize and provide an introduction to

the statistical programming language R.

The book covers important topics and latest developments in the field of social data

mining with many references and resources for continued learning. We hope it will be of

interest to an audience with a wide array of substantive interests from fields such as

marketing, sociology, politics, and sales. We have striven to make it accessible enough to

be useful for beginners while simultaneously directing researchers and practitioners

already active in the field towards resources for further learning. Code and additional

material will be available online at http://socialmediaminingr.com as well as on the

authors' GitHub account, https://github.com/SocialMediaMininginR.

The state of communication

The state of communication section describes the fundamentally altered modes of social

communication fostered by the Internet. The interconnected, social, rapid, and public

exchange of information detailed here underlies the power of social data mining. Now

more than ever before, information can go viral, a phrase first cited as early as 2004.

By changing the manner in which we connect with each other, the Internet changed the

way we interact—communication is now bi-directional and many-to-many. Networks are

now self-organized, and information travels along every dimension, varying

systematically depending on direction and purpose. This new economy with ideas as

currency has impacted nearly every person. More than ever, people rely on context and

information before making decisions or purchases, and by extension, more and more on

peer effects and interactions rather than centralized sources.

The traditional modes of communication are represented mainly by radio and television,

which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years

for television to reach an audience of 50 million, but the Internet did it in just four years

(Gallup).

Not only has the nature of communication changed, but also its scale. There were 50

pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of

the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed

Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The

WWW is the largest, most widely used source of information, with nearly 2.4 billion users

(Wikipedia). Seventy percent of these users use it daily to both contribute and receive

information in order to learn about the world around them and to influence that same

world—constantly organizing information around pieces that reflect their desires.

In today's connected world, many of us are members of at least one, if not more, social

networking service. The influence and reach of social media enterprises such as Facebook

is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly [...]

exclusively rely on close ties within their physical social networks. Social media has both

made our close ties closer and the number of weak ties exponentially greater. Beyond our

denser and larger social networks is a general eagerness to incorporate information from

other networks with similar interests and desires. The increased access to networks of

various types has, in fact, conditioned us to seek even more information; after all,

ignoring available information would constitute irrational behavior.

These fundamental changes to the nature and scope of communication are crucial due to

the importance of ideas in today's economic and social interactions. Today, and in the

future, ideas will be of central importance, especially those ideas that bounce and go viral.

Ideas that go viral are those that resonate and spur on social movements, which may have

political and social purposes or reshape businesses and allow companies such as Nike and

Apple to produce outsized returns on capital. This book introduces readers to the tools

necessary to measure ideas and opinions derived from social data at scale. Along the way,

we'll describe strategies for dealing with Big Data.

What is Big Data?

People create 2.5 quintillion bytes (2.5 × 10^18) of data, or nearly 2.3 million terabytes of

data every day, so much that 90 percent of the data in the world today has been created in

the last two years alone. Furthermore, rather than being a large collection of disparate

data, much of this data flow consists of data on similar things, generating huge data-sets

with billions upon billions of observations. Big Data refers not only to the deluge of data

being generated, but also to the astronomical size of data-sets themselves. Both factors

create challenges and opportunities for data scientists.
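
The daily volume quoted above can be checked with a short unit conversion. The sketch below, in Python for illustration only (the book itself uses R), converts 2.5 quintillion bytes into terabytes under both common definitions. The variable names are ours; the only figure taken from the text is the 2.5 quintillion bytes per day.

```python
BYTES_PER_DAY = 2.5e18  # 2.5 quintillion bytes, the figure quoted in the text

# Decimal terabyte: 10^12 bytes.
tb_decimal = BYTES_PER_DAY / 10**12

# Binary terabyte (tebibyte): 2^40 bytes.
tib_binary = BYTES_PER_DAY / 2**40

print(f"decimal terabytes per day: {tb_decimal:,.0f}")
print(f"binary terabytes per day:  {tib_binary:,.0f}")
```

The "nearly 2.3 million" figure corresponds to the binary definition (2^40 bytes per terabyte). Under the decimal definition, the same volume is 2.5 million terabytes.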

This data comes from everywhere: physical sensors used to gather information, human

sensors such as the social web, transaction records, and cell phone GPS signals to name a

few. This data is not only big but is growing at an increasing rate. The data used in this

book, namely, Twitter data, is no exception. Twitter was launched on March 21, 2006, and

it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1

billion tweets every 2.5 days.
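
To put the two Twitter figures on a common scale, the sketch below converts both into average tweets per day. It is in Python for illustration only (the book itself uses R). The end date of May 22, 2009 is our own derivation, obtained by adding 3 years, 2 months, and 1 day to the launch date; it is not stated in the text.

```python
from datetime import date

LAUNCH = date(2006, 3, 21)
# Launch date plus 3 years, 2 months, and 1 day (derived here, not given in the text).
FIRST_BILLION = date(2009, 5, 22)

days_to_first_billion = (FIRST_BILLION - LAUNCH).days

# Average tweets per day while the first billion accumulated.
early_rate = 1e9 / days_to_first_billion

# Tweets per day at the later pace of 1 billion every 2.5 days.
current_rate = 1e9 / 2.5

speedup = current_rate / early_rate

print(f"days to first billion: {days_to_first_billion}")
print(f"early rate:   {early_rate:,.0f} tweets/day")
print(f"current rate: {current_rate:,.0f} tweets/day")
print(f"speedup:      about {speedup:.0f}x")
```

On these figures, the later pace is several hundred times the average pace of the first three years.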

What proportion of data is Big Data? It turns out that most data-sets are (relatively) small.

This may come as a surprise in light of the contemporary excitement surrounding Big

Data. The reason for the large number of small data-sets is that data that is not socially

generated and publicly displayed is time consuming and expensive to collect. As such,

academics, businesses, and other organizations with data needs tend to collect only the

minimum amount of information necessary to gain purchase on their questions. These

data-sets are usually small and focused and are curated by the organizations that use

them; they usually do not plan on updating or adding fresh data to them. The poor

management of these data often leads to their misplacement, thereby generating dark

data—data that is suspected to exist or ought to exist but is difficult or impossible to find.

The problem of dark data is real and prevalent in the myriad of small, locally collected

data-sets. The utter lack of central management of data in the tail of the data size

distribution invariably causes these sets of data to be forgotten. In spite of the fact that

most data is not big, it is primarily the Big Data sets that exhibit exponential growth,

propelling the number of bytes created by humans upward every day.

Big Data differs substantially from other data not only in its size and velocity, but also in

its scope and density. Big Data is large in scope, that is, it is created by everyone and by

itself and thus is informative about a wide audience. This characteristic makes it very

useful for studying populations, as the inferences we can make generalize to large groups

of people. Compare that with, say, opinions gleaned from a focus group or small survey.

These opinions, while highly accurate and easy to obtain, may or may not be reflective of

the views of the wider public. Thus, Big Data's scope is a real benefit, at least in terms of

generalizing evidence to wide populations.

However, Big Data's density is fairly low. By density, we mean the degree to which Big

Data, and especially social data, is directly applicable to questions we want to answer.

Again, a comparison to small data is useful. Prior to the explosion of Big Data and the

proliferation of tools used to harness it, companies or political campaigns largely used

focus groups or surveys to obtain information about public sentiments relevant to their

endeavors. The focus groups and surveys furnished organizations with data that was

directly applicable to their purpose, and often this data would already be measured with

meaningful units. For instance, respondents would describe how much they liked or

disliked a new product, or rate a political candidate's TV appearances from 1 to 5.

Compare that with social data, where opinion-laden text is buried among terabytes of

unrelated information and comes in a form that must be subjected to analysis just to

generate a measure of the opinion. Thus, the low density of big social data presents unique

challenges to organizations trying to utilize opinion data.

The size and scope of Big Data help us overcome some of the hurdles caused by its low

density. For instance, even though each unique piece of social data may have little

applicability to our particular task, these small bits of information quickly become useful

as we aggregate them across thousands or millions of people. Like the proverbial bundle

of sticks—none of which could support inferences alone—when tied together, these small

bits of information can be a powerful tool for understanding the opinions of the online

populace.
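
The bundle-of-sticks argument can be sketched as a small simulation. The example below is in Python for illustration only (the book itself uses R) and rests on stated assumptions: each post carries the same underlying opinion plus independent, much larger random noise. The values `TRUE_OPINION` and `NOISE_SD` and the function `noisy_post` are hypothetical and chosen only to make the effect visible.

```python
import random
from statistics import fmean

random.seed(42)  # fixed seed so the run is repeatable

TRUE_OPINION = 0.2  # hypothetical average sentiment in the population
NOISE_SD = 1.0      # hypothetical per-post noise, much larger than the signal

def noisy_post() -> float:
    """Simulate one post: the true opinion plus heavy individual noise."""
    return random.gauss(TRUE_OPINION, NOISE_SD)

# Estimate the population opinion from 1, 100, and 100,000 simulated posts.
estimates = {
    n: fmean(noisy_post() for _ in range(n))
    for n in (1, 100, 100_000)
}

for n, estimate in estimates.items():
    print(f"n = {n:>7,}: estimated opinion = {estimate:+.3f}")
```

Under these assumptions, the error of the average shrinks roughly in proportion to 1/sqrt(n), so a single post says little while the aggregate lands close to the underlying value. Real social data violates the independence assumption in many ways, which is one reason the pitfalls discussed in this chapter matter.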
