Data Science: Concepts and Practice (Second Edition): Getting Started Quickly with RapidMiner
Data Science: Concepts and Practice (Second Edition), by Vijay Kotu and Bala Deshpande, is an accessible introduction to data science published under the Morgan Kaufmann imprint of Elsevier (50 Hampshire Street, 5th Floor, Cambridge, MA, United States). It was first published in 2019; copyright is held by Elsevier, with all rights reserved.

The book aims to help readers master the foundations of data science through an easy-to-grasp conceptual framework, with immediate hands-on practice on the RapidMiner platform. The authors stress that in this fast-evolving field, knowledge and best practices change continually as new research findings and technologies emerge, so methods and professional practices may need timely updating; readers should therefore understand the core concepts while staying alert to the latest developments.

The book covers the core concepts of data science, including data preprocessing, statistics, machine learning algorithms, data visualization, model evaluation, and deployment. Readers learn how to collect, clean, transform, and analyze data, and how to build and tune predictive models to solve real problems. It also introduces RapidMiner, a powerful data mining and machine learning tool whose intuitive interface lets beginners put the theory into practice.

By working through the book, readers not only build a solid foundation in data science but also learn how to apply it flexibly in real projects in a constantly changing environment. Regarding copyright: no part of the book may be reproduced or distributed in any form without written permission from the publisher; more information is available via the permissions service on the Elsevier website.

Combining theory with practice while keeping pace with the field, this is an ideal text for anyone entering data science or looking to sharpen their skills.
Machine learning algorithms, also called “learners”, take both the known input and output (training data)
to figure out a model for the program which converts input to output. For
example, many organizations like social media platforms, review sites, or for-
ums are required to moderate posts and remove abusive content. How can
machines be taught to automate the removal of abusive content? The
machines need to be shown examples of both abusive and non-abusive posts
with a clear indication of which one is abusive. The learners will generalize a
pattern based on certain words or sequences of words in order to conclude
whether the overall post is abusive or not. The model can take the form of a
set of “if-then” rules. Once the rules or model have been developed,
machines can start categorizing the disposition of any new posts.
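As a concrete illustration of this paragraph, here is a minimal sketch of such a learner in Python. The handful of posts and labels are hypothetical toy data, and scikit-learn is our choice here rather than anything prescribed by the book (which uses RapidMiner): a decision tree generalizes word-based "if-then" rules from the labeled examples and then categorizes a new, unseen post.

```python
# Minimal sketch of a supervised "learner"; the posts and labels below are
# hypothetical toy data, far too small for a real moderation model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Training data: known inputs (posts) paired with known outputs (labels).
posts = [
    "you are an idiot",           # abusive
    "thanks for the great post",  # not abusive
    "what a stupid idea",         # abusive
    "i found this very helpful",  # not abusive
]
labels = ["abusive", "ok", "abusive", "ok"]

# Turn words into numeric features the learner can generalize over.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)

# A decision tree yields human-readable "if-then" rules over word counts.
model = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Categorize the disposition of a new, unseen post using the learned rules.
new_post = vectorizer.transform(["this is a stupid comment"])
print(model.predict(new_post))  # likely ['abusive'], driven by "stupid"
```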
Data science is the business application of machine learning, artificial intelli-
gence, and other quantitative fields like statistics, visualization, and mathe-
matics. It is an interdisciplinary field that extracts value from data. In the
context of how data science is used today, it relies heavily on machine learn-
ing and is sometimes called data mining. Examples of data science use cases
are: recommendation engines that suggest movies to a particular user, fraud
alert models that detect fraudulent credit card transactions, models that find
the customers most likely to churn next month, or models that predict revenue
for the next quarter.
1.2 WHAT IS DATA SCIENCE?
Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with
thousands of variables. Data science utilizes certain specialized computational
methods in order to discover meaningful and useful structures within a dataset.
The discipline of data science coexists and is closely associated with a number
of related areas such as database systems, data engineering, visualization, data
analysis, experimentation, and business intelligence (BI). We can further define
data science by investigating some of its key features and motivations.
1.2.1 Extracting Meaningful Patterns
Knowledge discovery in databases is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns or
relationships within a dataset in order to make important decisions (Fayyad,
Piatetsky-Shapiro, & Smyth, 1996). Data science involves inference and itera-
tion of many different hypotheses. One of the key aspects of data science is
the process of generalization of patterns from a dataset. The generalization
should be valid, not just for the dataset used to observe the pattern, but also
for new unseen data. Data science is also a process with defined steps, each
with a set of tasks. The term novel indicates that data science is usually
involved in finding previously unknown patterns in data. The ultimate objec-
tive of data science is to find potentially useful conclusions that can be acted
upon by the users of the analysis.
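To make the generalization requirement concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (both our choices for illustration): the pattern is learned on one split of the data and then validated on a held-out split the model has never seen.

```python
# Minimal sketch of checking that a learned pattern generalizes to unseen
# data; the synthetic dataset is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic observational data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out unseen data: the pattern is learned from the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A valid generalization should hold on data the model never observed.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```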
1.2.2 Building Representative Models
In statistics, a model is the representation of a relationship between variables
in a dataset. It describes how one or more variables in the data are related to
other variables. Modeling is a process in which a representative abstraction is
built from the observed dataset. For example, based on credit score, income
level, and requested loan amount, a model can be developed to determine
the interest rate of a loan. For this task, previously known observational data
including credit score, income level, loan amount, and interest rate are
needed.
Fig. 1.3 shows the process of generating a model. Once the represen-
tative model is created, it can be used to predict the value of the interest rate,
based on all the input variables.
Data science is the process of building a representative model that fits the
observational data. This model serves two purposes: on the one hand, it pre-
dicts the output (interest rate) based on the new and unseen set of input
variables (credit score, income level, and loan amount), and on the other
hand, the model can be used to understand the relationship between the
output variable and all the input variables. For example, does income level
really matter in determining the interest rate of a loan? Does income level
matter more than credit score? What happens when income levels double or
if credit score drops by 10 points? A model can be used for both predictive
and explanatory applications.
FIGURE 1.3 Data science models.
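The sketch below illustrates both the predictive and the explanatory use of such a model with a linear regression, assuming scikit-learn and a handful of hypothetical loan records (the book itself builds these models in RapidMiner, and nothing here claims the true relationship is linear):

```python
# Minimal sketch of a representative model for loan interest rates; the
# records and the linear form of the relationship are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Known observational data: credit score, income ($k), loan amount ($k).
X = np.array([
    [720, 90, 250],
    [650, 60, 200],
    [580, 45, 150],
    [700, 80, 300],
    [620, 55, 180],
])
y = np.array([4.2, 5.8, 7.5, 4.9, 6.4])  # interest rate (%)

model = LinearRegression().fit(X, y)

# Predictive use: estimate the rate for a new, unseen applicant.
print("predicted rate:", model.predict([[680, 70, 220]]))

# Explanatory use: coefficients show how much each input matters, e.g. the
# expected change in rate when the credit score drops by 10 points.
print(dict(zip(["credit_score", "income", "loan_amount"], model.coef_)))
print("10-point credit score drop:", -10 * model.coef_[0], "points of rate")
```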
1.2.3 Combination of Statistics, Machine Learning, and
Computing
In the pursuit of extracting useful and relevant information from large data-
sets, data science borrows computational techniques from the disciplines of
statistics, machine learning, experimentation, and database theories. The
algorithms used in data science originate from these disciplines but have
since evolved to adopt more diverse techniques such as parallel computing,
evolutionary computing, linguistics, and behavioral studies. One of the key
ingredients of successful data science is substantial prior knowledge about
the data and the business processes that generate the data, known as subject
matter expertise. Like many quantitative frameworks, data science is an itera-
tive process in which the practitioner gains more information about the pat-
terns and relationships from data in each cycle. Data science also typically
operates on large datasets that need to be stored, processed, and computed.
This is where database techniques along with parallel and distributed com-
puting techniques play an important role in data science.
1.2.4 Learning Algorithms
We can also define data science as a process of discovering previously
unknown patterns in data using automatic iterative methods. The application of
sophisticated learning algorithms for extracting useful patterns from data dif-
ferentiates data science from traditional data analysis techniques. Many of
these algorithms were developed in the past few decades and are a part of
machine learning and artificial intelligence. Some algorithms are based on
the foundations of Bayesian probabilistic theories and regression analysis,
originating hundreds of years ago. These iterative algorithms automate
the process of searching for an optimal solution for a given data problem.
Based on the problem, data science is classified into tasks such as classifica-
tion, association analysis, clustering, and regression. Each data science task
uses specific learning algorithms like decision trees, neural networks, k-near-
est neighbors (k-NN), and k-means clustering, among others. With increased
research on data science, such algorithms are increasing, but a few classic
algorithms remain foundational to many data science applications.
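As a brief illustration of such an iterative search, the sketch below runs k-means clustering (one of the classic algorithms just named) on synthetic data with scikit-learn, again our choice of library: the algorithm repeatedly reassigns points and recomputes cluster centers until the solution stops improving.

```python
# Minimal sketch of an iterative learning algorithm: k-means refines its
# cluster centers cycle by cycle; the blob data are synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# n_iter_ reports how many refinement cycles the search needed before
# converging on a (locally) optimal set of centers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("iterations to converge:", kmeans.n_iter_)
print("cluster centers:\n", kmeans.cluster_centers_)
```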
1.2.5 Associated Fields
While data science covers a wide set of techniques, applications, and disci-
plines, there are a few associated fields that data science heavily relies on.
The techniques used in the steps of a data science process, often mentioned
in conjunction with the term “data science”, are:
- Descriptive statistics: Computing the mean, standard deviation, correlation,
and other descriptive statistics quantifies the aggregate structure of a
dataset. This is essential for understanding the structure of the data and
the relationships within it. Descriptive statistics are used in the exploration
stage of the data science process (a short pandas sketch after this list
illustrates them alongside dimensional slicing and hypothesis testing).
- Exploratory visualization: The process of expressing data in visual
coordinates enables users to find patterns and relationships in the data
and to comprehend large datasets. Like descriptive statistics, these
visualizations are integral to the pre- and post-processing steps of data science.
- Dimensional slicing: Online analytical processing (OLAP) applications,
which are prevalent in organizations, mainly provide information on
the data through dimensional slicing, filtering, and pivoting. OLAP
analysis is enabled by a unique database schema design where the data
are organized as dimensions (e.g., products, regions, dates) and
quantitative facts or measures (e.g., revenue, quantity). With a well-
defined database structure, it is easy to slice the yearly revenue by
product or by a combination of region and product. These techniques are
extremely useful and may unveil patterns in data (e.g., candy sales
decline after Halloween in the United States).
- Hypothesis testing: In confirmatory data analysis, experimental data are
collected to evaluate whether a hypothesis has enough evidence to be
supported or not. There are many types of statistical tests, with a
wide variety of business applications (e.g., A/B testing in
marketing). In general, data science is a process in which many hypotheses
are generated and tested based on observational data. Since data
science algorithms are iterative, solutions can be refined in each step.
- Data engineering: Data engineering is the process of sourcing,
organizing, assembling, storing, and distributing data for effective
analysis and usage. Database engineering, distributed storage and
computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel
computing, extraction, transformation, and loading (ETL) processing, and data
warehousing constitute data engineering techniques. Data engineering
helps source and prepare data for data science learning algorithms.
- Business intelligence: Business intelligence helps organizations consume
data effectively. It supports ad hoc queries over the data without the
need to write technical query commands, and it uses dashboards and
visualizations to communicate facts and trends. Business intelligence
specializes in the secure delivery of information to the right roles and
the distribution of information at scale. Historical trends are usually
reported, but in combination with data science, both the past and the
predicted future can be presented. BI can hold and distribute the results
of data science.
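As promised above, here is a minimal sketch of three of these associated techniques (descriptive statistics, dimensional slicing, and hypothesis testing) on a small hypothetical sales table, using pandas and SciPy rather than the OLAP and BI tools the list refers to:

```python
# Minimal sketch of descriptive statistics, dimensional slicing, and a
# hypothesis test; the sales table is hypothetical toy data.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["candy", "soda", "candy", "soda", "candy", "soda"],
    "revenue": [120.0, 95.0, 80.0, 110.0, 130.0, 105.0],
})

# Descriptive statistics: the aggregate structure of the dataset.
print(df["revenue"].describe())

# Dimensional slicing: revenue pivoted by region and product, OLAP-style.
print(pd.pivot_table(df, values="revenue", index="region",
                     columns="product", aggfunc="sum"))

# Hypothesis testing: do the two regions differ in mean revenue? (A toy
# two-sample t-test; a real A/B test would need far larger samples.)
east = df.loc[df["region"] == "East", "revenue"]
west = df.loc[df["region"] == "West", "revenue"]
t_stat, p_value = stats.ttest_ind(east, west)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```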
1.3 CASE FOR DATA SCIENCE
In the past few decades, a massive accumulation of data has been seen with
the advancement of information technology, connected networks, and the
businesses they enable. This trend is also coupled with a steep decline in data
storage and data processing costs. The applications built on these advance-
ments like online businesses, social networking, and mobile technologies
unleash a large amount of complex, heterogeneous data that are waiting to
be analyzed. Traditional analysis techniques like dimensional slicing, hypoth-
esis testing, and descriptive statistics can only go so far in information dis-
covery. A paradigm is needed to manage the massive volume of data, explore
the inter-relationships of thousands of variables, and deploy machine learn-
ing algorithms to deduce optimal insights from datasets. A set of frameworks,
tools, and techniques are needed to intelligently assist humans to process all
these data and extract valuable information (Piatetsky-Shapiro, Brachman,
Khabaza, Kloesgen, & Simoudis, 1996). Data science is one such paradigm
that can handle large volumes with multiple attributes and deploy complex
algorithms to search for patterns from data. Each key motivation for using
data science techniques is explored here.
1.3.1 Volume
The sheer volume of data captured by organizations is exponentially increas-
ing. The rapid decline in storage costs and advancements in capturing every
transaction and event, combined with the business need to extract as much
leverage as possible using data, creates a strong motivation to store more
data than ever. As data become more granular, the need to use large volume
data to extract information increases. A rapid increase in the volume of data
exposes the limitations of current analysis methodologies. In a few imple-
mentations, the time to create generalization models is critical and data vol-
ume plays a major part in determining the time frame of development and
deployment.
1.3.2 Dimensions
The three characteristics of the Big Data phenomenon are high volume, high
velocity, and high variety. The variety of data relates to the multiple types of
values (numerical, categorical), formats of data (audio files, video files), and
the application of the data (location coordinates, graph data). Every single
record or data point contains multiple attributes or variables to provide con-
text for the record. For example, every user record of an ecommerce site can
contain attributes such as products viewed, products purchased, user demo-
graphics, frequency of purchase, clickstream, etc. Determining the most effec-
tive offer for an ecommerce user can involve computing information across
(The remaining 548 pages of the book are not included in this excerpt.)