Data Science: Concepts and Practice (Second Edition): Getting Started Quickly with RapidMiner
Data Science: Concepts and Practice (Second Edition), by Vijay Kotu and Bala Deshpande, is an accessible introduction to data science published under the Morgan Kaufmann imprint of Elsevier (50 Hampshire Street, 5th Floor, Cambridge, MA, United States). It was first published in 2019; copyright is held by Elsevier, with all rights reserved.

The book aims to help readers master the foundations of data science through an easy-to-grasp conceptual framework, with immediate hands-on practice on the RapidMiner platform. The authors stress that in this fast-evolving field, knowledge and best practices change continually as new research findings and technologies emerge, so methods and professional practices may need timely updating; readers should therefore understand the core concepts while staying alert to the latest developments.

The book covers the core concepts of data science, including data preprocessing, statistics, machine learning algorithms, data visualization, model evaluation, and deployment. Readers learn how to collect, clean, transform, and analyze data, and how to build and tune predictive models to solve real problems. It also introduces RapidMiner, a powerful data mining and machine learning tool whose intuitive interface lets beginners put the theory into practice.

By working through the book, readers not only build a solid foundation in data science but also learn how to apply it flexibly in real projects in a constantly changing environment. Regarding copyright: no part of the book may be reproduced or distributed in any form without written permission from the publisher; more information is available via the permissions service on the Elsevier website.

Combining theory with practice while keeping pace with the field, this is an ideal text for anyone entering data science or looking to sharpen their skills.
Machine learning algorithms, also called “learners”, take both the known input and output (training data)
to figure out a model for the program which converts input to output. For
example, many organizations like social media platforms, review sites, or for-
ums are required to moderate posts and remove abusive content. How can
machines be taught to automate the removal of abusive content? The
machines need to be shown examples of both abusive and non-abusive posts
with a clear indication of which one is abusive. The learners will generalize a
pattern based on certain words or sequences of words in order to conclude
whether the overall post is abusive or not. The model can take the form of a
set of “if-then” rules. Once the rules or model have been developed,
machines can start categorizing the disposition of any new posts.
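As a concrete illustration of this paragraph, here is a minimal sketch of such a learner in Python. The handful of posts and labels are hypothetical toy data, and scikit-learn is our choice here rather than anything prescribed by the book (which uses RapidMiner): a decision tree generalizes word-based "if-then" rules from the labeled examples and then categorizes a new, unseen post.

```python
# Minimal sketch of a supervised "learner"; the posts and labels below are
# hypothetical toy data, far too small for a real moderation model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Training data: known inputs (posts) paired with known outputs (labels).
posts = [
    "you are an idiot",           # abusive
    "thanks for the great post",  # not abusive
    "what a stupid idea",         # abusive
    "i found this very helpful",  # not abusive
]
labels = ["abusive", "ok", "abusive", "ok"]

# Turn words into numeric features the learner can generalize over.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)

# A decision tree yields human-readable "if-then" rules over word counts.
model = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Categorize the disposition of a new, unseen post using the learned rules.
new_post = vectorizer.transform(["this is a stupid comment"])
print(model.predict(new_post))  # likely ['abusive'], driven by "stupid"
```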
Data science is the business application of machine learning, artificial intelli-
gence, and other quantitative fields like statistics, visualization, and mathe-
matics. It is an interdisciplinary field that extracts value from data. In the
context of how data science is used today, it relies heavily on machine learn-
ing and is sometimes called data mining. Examples of data science use cases
are: recommendation engines that suggest movies to a particular user, fraud
alert models that detect fraudulent credit card transactions, models that find
the customers most likely to churn next month, or models that predict revenue
for the next quarter.
1.2 WHAT IS DATA SCIENCE?
Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with
thousands of variables. Data science utilizes certain specialized computational
methods in order to discover meaningful and useful structures within a dataset.
The discipline of data science coexists and is closely associated with a number
of related areas such as database systems, data engineering, visualization, data
analysis, experimentation, and business intelligence (BI). We can further define
data science by investigating some of its key features and motivations.
1.2.1 Extracting Meaningful Patterns
Knowledge discovery in databases is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns or
relationships within a dataset in order to make important decisions (Fayyad,
Piatetsky-Shapiro, & Smyth, 1996). Data science involves inference and itera-
tion of many different hypotheses. One of the key aspects of data science is
the process of generalization of patterns from a dataset. The generalization
should be valid, not just for the dataset used to observe the pattern, but also
for new unseen data. Data science is also a process with defined steps, each
with a set of tasks. The term novel indicates that data science is usually
involved in finding previously unknown patterns in data. The ultimate objec-
tive of data science is to find potentially useful conclusions that can be acted
upon by the users of the analysis.
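To make the generalization requirement concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (both our choices for illustration): the pattern is learned on one split of the data and then validated on a held-out split the model has never seen.

```python
# Minimal sketch of checking that a learned pattern generalizes to unseen
# data; the synthetic dataset is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic observational data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out unseen data: the pattern is learned from the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A valid generalization should hold on data the model never observed.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```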
1.2.2 Building Representative Models
In statistics, a model is the representation of a relationship between variables
in a dataset. It describes how one or more variables in the data are related to
other variables. Modeling is a process in which a representative abstraction is
built from the observed dataset. For example, based on credit score, income
level, and requested loan amount, a model can be developed to determine
the interest rate of a loan. For this task, previously known observational data
including credit score, income level, loan amount, and interest rate are
needed.
Fig. 1.3 shows the process of generating a model. Once the represen-
tative model is created, it can be used to predict the value of the interest rate,
based on all the input variables.
Data science is the process of building a representative model that fits the
observational data. This model serves two purposes: on the one hand, it pre-
dicts the output (interest rate) based on the new and unseen set of input
variables (credit score, income level, and loan amount), and on the other
hand, the model can be used to understand the relationship between the
output variable and all the input variables. For example, does income level
really matter in determining the interest rate of a loan? Does income level
matter more than credit score? What happens when income levels double or
if credit score drops by 10 points? A model can be used for both predictive
and explanatory applications.
FIGURE 1.3 Data science models.
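The sketch below illustrates both the predictive and the explanatory use of such a model with a linear regression, assuming scikit-learn and a handful of hypothetical loan records (the book itself builds these models in RapidMiner, and nothing here claims the true relationship is linear):

```python
# Minimal sketch of a representative model for loan interest rates; the
# records and the linear form of the relationship are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Known observational data: credit score, income ($k), loan amount ($k).
X = np.array([
    [720, 90, 250],
    [650, 60, 200],
    [580, 45, 150],
    [700, 80, 300],
    [620, 55, 180],
])
y = np.array([4.2, 5.8, 7.5, 4.9, 6.4])  # interest rate (%)

model = LinearRegression().fit(X, y)

# Predictive use: estimate the rate for a new, unseen applicant.
print("predicted rate:", model.predict([[680, 70, 220]]))

# Explanatory use: coefficients show how much each input matters, e.g. the
# expected change in rate when the credit score drops by 10 points.
print(dict(zip(["credit_score", "income", "loan_amount"], model.coef_)))
print("10-point credit score drop:", -10 * model.coef_[0], "points of rate")
```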
1.2.3 Combination of Statistics, Machine Learning, and
Computing
In the pursuit of extracting useful and relevant information from large data-
sets, data science borrows computational techniques from the disciplines of
statistics, machine learning, experimentation, and database theories. The
algorithms used in data science originate from these disciplines but have
since evolved to adopt more diverse techniques such as parallel computing,
evolutionary computing, linguistics, and behavioral studies. One of the key
ingredients of successful data science is substantial prior knowledge about
the data and the business processes that generate the data, known as subject
matter expertise. Like many quantitative frameworks, data science is an itera-
tive process in which the practitioner gains more information about the pat-
terns and relationships from data in each cycle. Data science also typically
operates on large datasets that need to be stored, processed, and computed.
This is where database techniques along with parallel and distributed com-
puting techniques play an important role in data science.
1.2.4 Learning Algorithms
We can also define data science as a process of discovering previously
unknown patterns in data using automatic iterative methods. The application of
sophisticated learning algorithms for extracting useful patterns from data dif-
ferentiates data science from traditional data analysis techniques. Many of
these algorithms were developed in the past few decades and are a part of
machine learning and artificial intelligence. Some algorithms are based on
the foundations of Bayesian probabilistic theories and regression analysis,
originating hundreds of years ago. These iterative algorithms automate
the process of searching for an optimal solution for a given data problem.
Based on the problem, data science is classified into tasks such as classifica-
tion, association analysis, clustering, and regression. Each data science task
uses specific learning algorithms like decision trees, neural networks, k-near-
est neighbors (k-NN), and k-means clustering, among others. With increased
research on data science, such algorithms are increasing, but a few classic
algorithms remain foundational to many data science applications.
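As a brief illustration of such an iterative search, the sketch below runs k-means clustering (one of the classic algorithms just named) on synthetic data with scikit-learn, again our choice of library: the algorithm repeatedly reassigns points and recomputes cluster centers until the solution stops improving.

```python
# Minimal sketch of an iterative learning algorithm: k-means refines its
# cluster centers cycle by cycle; the blob data are synthetic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# n_iter_ reports how many refinement cycles the search needed before
# converging on a (locally) optimal set of centers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("iterations to converge:", kmeans.n_iter_)
print("cluster centers:\n", kmeans.cluster_centers_)
```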
1.2.5 Associated Fields
While data science covers a wide set of techniques, applications, and disci-
plines, there are a few associated fields that data science heavily relies on.
The techniques used in the steps of a data science process, often mentioned
in conjunction with the term “data science”, are:
- Descriptive statistics: Computing the mean, standard deviation, correlation,
and other descriptive statistics quantifies the aggregate structure of a
dataset. This is essential for understanding the structure of the data and
the relationships within it. Descriptive statistics are used in the exploration
stage of the data science process (a short pandas sketch after this list
illustrates them alongside dimensional slicing and hypothesis testing).
- Exploratory visualization: The process of expressing data in visual
coordinates enables users to find patterns and relationships in the data
and to comprehend large datasets. Like descriptive statistics, these
visualizations are integral to the pre- and post-processing steps of data science.
- Dimensional slicing: Online analytical processing (OLAP) applications,
which are prevalent in organizations, mainly provide information on
the data through dimensional slicing, filtering, and pivoting. OLAP
analysis is enabled by a unique database schema design where the data
are organized as dimensions (e.g., products, regions, dates) and
quantitative facts or measures (e.g., revenue, quantity). With a well-
defined database structure, it is easy to slice the yearly revenue by
product or by a combination of region and product. These techniques are
extremely useful and may unveil patterns in data (e.g., candy sales
decline after Halloween in the United States).
- Hypothesis testing: In confirmatory data analysis, experimental data are
collected to evaluate whether a hypothesis has enough evidence to be
supported or not. There are many types of statistical tests, with a
wide variety of business applications (e.g., A/B testing in
marketing). In general, data science is a process in which many hypotheses
are generated and tested based on observational data. Since data
science algorithms are iterative, solutions can be refined in each step.
- Data engineering: Data engineering is the process of sourcing,
organizing, assembling, storing, and distributing data for effective
analysis and usage. Database engineering, distributed storage and
computing frameworks (e.g., Apache Hadoop, Spark, Kafka), parallel
computing, extraction, transformation, and loading (ETL) processing, and data
warehousing constitute data engineering techniques. Data engineering
helps source and prepare data for data science learning algorithms.
- Business intelligence: Business intelligence helps organizations consume
data effectively. It supports ad hoc queries over the data without the
need to write technical query commands, and it uses dashboards and
visualizations to communicate facts and trends. Business intelligence
specializes in the secure delivery of information to the right roles and
the distribution of information at scale. Historical trends are usually
reported, but in combination with data science, both the past and the
predicted future can be presented. BI can hold and distribute the results
of data science.
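As promised above, here is a minimal sketch of three of these associated techniques (descriptive statistics, dimensional slicing, and hypothesis testing) on a small hypothetical sales table, using pandas and SciPy rather than the OLAP and BI tools the list refers to:

```python
# Minimal sketch of descriptive statistics, dimensional slicing, and a
# hypothesis test; the sales table is hypothetical toy data.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["candy", "soda", "candy", "soda", "candy", "soda"],
    "revenue": [120.0, 95.0, 80.0, 110.0, 130.0, 105.0],
})

# Descriptive statistics: the aggregate structure of the dataset.
print(df["revenue"].describe())

# Dimensional slicing: revenue pivoted by region and product, OLAP-style.
print(pd.pivot_table(df, values="revenue", index="region",
                     columns="product", aggfunc="sum"))

# Hypothesis testing: do the two regions differ in mean revenue? (A toy
# two-sample t-test; a real A/B test would need far larger samples.)
east = df.loc[df["region"] == "East", "revenue"]
west = df.loc[df["region"] == "West", "revenue"]
t_stat, p_value = stats.ttest_ind(east, west)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```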
1.3 CASE FOR DATA SCIENCE
In the past few decades, a massive accumulation of data has been seen with
the advancement of information technology, connected networks, and the
businesses they enable. This trend is also coupled with a steep decline in data
storage and data processing costs. The applications built on these advance-
ments like online businesses, social networking, and mobile technologies
unleash a large amount of complex, heterogeneous data that are waiting to
be analyzed. Traditional analysis techniques like dimensional slicing, hypoth-
esis testing, and descriptive statistics can only go so far in information dis-
covery. A paradigm is needed to manage the massive volume of data, explore
the inter-relationships of thousands of variables, and deploy machine learn-
ing algorithms to deduce optimal insights from datasets. A set of frameworks,
tools, and techniques are needed to intelligently assist humans to process all
these data and extract valuable information (Piatetsky-Shapiro, Brachman,
Khabaza, Kloesgen, & Simoudis, 1996). Data science is one such paradigm
that can handle large volumes with multiple attributes and deploy complex
algorithms to search for patterns from data. Each key motivation for using
data science techniques is explored here.
1.3.1 Volume
The sheer volume of data captured by organizations is exponentially increas-
ing. The rapid decline in storage costs and advancements in capturing every
transaction and event, combined with the business need to extract as much
leverage as possible using data, creates a strong motivation to store more
data than ever. As data become more granular, the need to use large volume
data to extract information increases. A rapid increase in the volume of data
exposes the limitations of current analysis methodologies. In a few imple-
mentations, the time to create generalization models is critical and data vol-
ume plays a major part in determining the time frame of development and
deployment.
1.3.2 Dimensions
The three characteristics of the Big Data phenomenon are high volume, high
velocity, and high variety. The variety of data relates to the multiple types of
values (numerical, categorical), formats of data (audio files, video files), and
the application of the data (location coordinates, graph data). Every single
record or data point contains multiple attributes or variables to provide con-
text for the record. For example, every user record of an ecommerce site can
contain attributes such as products viewed, products purchased, user demo-
graphics, frequency of purchase, clickstream, etc. Determining the most effec-
tive offer for an ecommerce user can involve computing information across
(The remaining 548 pages of the book are not included in this excerpt.)