Open Source Tools Lead the Way in Data Analysis: An In-Depth Look at Data Analysis with Open Source Tools (《数据之魅》)

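Data Analysis with Open Source Tools (Chinese edition: 《数据之魅：基于开源工具的数据分析》) is a classic, in-depth treatment of large-scale data analysis by Philipp K. Janert. The book focuses on using open source tools for data analysis and shows readers how to mine and interpret large volumes of data effectively in today's information age. It covers a wide range of topics, including but not limited to the use of open source platforms and technologies such as Hadoop, Spark, R, and Python, as well as the core steps of the analysis workflow: data cleaning, preprocessing, modeling, and visualization.

Drawing on extensive experience and deep understanding, Janert combines theory with practice and offers methods that are both practical and easy to follow. The book is clearly structured and rich in case studies. It is suitable for beginners, and it also gives experienced analysts technical depth and practical guidance. Readers learn how to use open source tools for efficient data exploration and data science project development, and how to solve real problems in large-data environments.

The publication date is given as 2011, and the book is published by O'Reilly Media, all rights reserved. It is available in print, with an online edition released alongside it for readers with different needs. The book also credits the editor, production editor, copyeditor, and other professionals whose contributions ensured the quality of the content. The cover and interior design pay attention to detail, and the illustrations were drawn by the author himself, which makes for a livelier read. The printing history shows that the book has drawn wide attention and praise in the industry since its first publication in 2010. For students, engineers, and business analysts interested in data science, this guide to open source tools is an indispensable learning resource that helps them gain a competitive edge in a data-driven world.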

O’Reilly-5980006 master October 28, 2010 22:0

Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it has a name: curiosity. There is always something else to find out and something more to learn. This book is not the last word on the matter; it is merely a snapshot in time: things I knew about and found useful today.

“Works are of value only if they give rise to better ones.”

(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)

Before We Begin

More data analysis efforts seem to go bad because of an excess of sophistication rather than a lack of it.

This may come as a surprise, but it has been my experience again and again. As a consultant, I am often called in when the initial project team has already gotten stuck. Rarely (if ever) does the problem turn out to be that the team did not have the required skills. On the contrary, I usually find that they tried to do something unnecessarily complicated and are now struggling with the consequences of their own invention!

Based on what I have seen, two particular risk areas stand out:

• The use of “statistical” concepts that are only partially understood (and given the relative obscurity of most of statistics, this includes virtually all statistical concepts)

• Complicated (and expensive) black-box solutions when a simple and transparent approach would have worked at least as well or better

I strongly recommend that you make it a habit to avoid all statistical language. Keep it simple and stick to what you know for sure. There is absolutely nothing wrong with speaking of the “range over which points spread,” because this phrase means exactly what it says: the range over which points spread, and only that! Once we start talking about “standard deviations,” this clarity is gone. Are we still talking about the observed width of the distribution? Or are we talking about one specific measure for this width? (The standard deviation is only one of several that are available.) Are we already making an implicit assumption about the nature of the distribution? (The standard deviation is only suitable under certain conditions, which are often not fulfilled in practice.) Or are we even confusing the predictions we could make if these assumptions were true with the actual data? (The moment someone talks about “95 percent anything” we know it’s the latter!)
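
To make the ambiguity concrete, here is a minimal Python sketch (not taken from the book) that computes four different answers to the question "how wide is this data set?". The sample values and the function name `spread_summary` are made up for illustration; only the standard library `statistics` module is assumed.

```python
import statistics


def spread_summary(values):
    """Return several different measures of the width of a data set."""
    ordered = sorted(values)
    # Quartile cut points (requires Python 3.8+, default "exclusive" method).
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    median = statistics.median(ordered)
    # Absolute distance of every point from the median.
    deviations = [abs(v - median) for v in ordered]
    return {
        "range": ordered[-1] - ordered[0],        # observed spread, nothing else
        "stdev": statistics.stdev(ordered),       # sample standard deviation
        "iqr": q3 - q1,                           # width of the middle half
        "mad": statistics.median(deviations),     # median absolute deviation
    }


# Hypothetical sample with one large outlier.
sample = [4, 5, 5, 6, 6, 7, 7, 8, 40]

if __name__ == "__main__":
    for name, value in spread_summary(sample).items():
        print(f"{name}: {value:.2f}")
```

On a sample like this one, the range and the standard deviation are dominated by the single outlier, while the interquartile range and the median absolute deviation barely notice it. None of these numbers is wrong; they answer different questions, which is why the plain phrase is often the safer one.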

I’d also like to remind you not to discard simple methods until they have been proven insufficient. Simple solutions are frequently rather effective: the marginal benefit that more complicated methods can deliver is often quite small (and may be in no reasonable relation to the increased cost). More importantly, simple methods have fewer opportunities to go wrong or to obscure the obvious.

True story: a company was tracking the occurrence of defects over time. Of course, the actual number of defects varied quite a bit from one day to the next, and they were looking for a way to obtain an estimate for the typical number of expected defects. The solution proposed by their IT department involved a compute cluster running a neural network! (I am not making this up.) In fact, a one-line calculation (involving a moving average or single exponential smoothing) is all that was needed.
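
For illustration, the kind of calculation meant here might look like the following Python sketch. It is not the company's actual solution: the daily defect counts, the window length, and the smoothing constant `alpha` are all hypothetical choices.

```python
def moving_average(counts, window):
    """Mean of the most recent `window` observations."""
    recent = counts[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing(counts, alpha):
    """Single exponential smoothing: s = alpha * x + (1 - alpha) * s.

    `alpha` lies between 0 and 1; larger values weight recent days more.
    The first observation is used as the starting value.
    """
    smoothed = counts[0]
    for x in counts[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed


# Hypothetical defect counts for eight consecutive days.
daily_defects = [12, 7, 15, 9, 11, 14, 8, 10]

if __name__ == "__main__":
    print("moving average (4 days):", moving_average(daily_defects, 4))
    print("exponential smoothing:  ", exponential_smoothing(daily_defects, 0.3))
```

Either function gives a single "typical" value that follows the data without reacting to every daily fluctuation. Which one to use, and with what window or `alpha`, is a judgment call that depends on how quickly the underlying defect rate is expected to change.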

I think the primary reason for this tendency to make data analysis projects more complicated than they are is discomfort: discomfort with an unfamiliar problem space and uncertainty about how to proceed. This discomfort and uncertainty creates a desire to bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of course, the opposite is true: the complexities of the “solution” overwhelm the original problem, and nothing gets accomplished.

Data analysis does not have to be all that hard. Although there are situations when elementary methods will no longer be sufficient, they are much less prevalent than you might expect. In the vast majority of cases, curiosity and a healthy dose of common sense will serve you well.

The attitude that I am trying to convey can be summarized in a few points:

Simple is better than complex.

Cheap is better than expensive.

Explicit is better than opaque.

Purpose is more important than process.

Insight is more important than precision.

Understanding is more important than technique.

Think more, work less.

Although I do acknowledge that the items on the right are necessary at times, I will give preference to those on the left whenever possible.

It is in this spirit that I am offering the concepts and techniques that make up the rest of this book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, and email addresses

Constant width

Used to refer to language and script elements

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers,

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)
