A Practical Guide to Data Analysis with Open Source Tools
Updated 2024-07-19
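Data Analysis with Open Source Tools (Chinese edition: 《数据之魅:基于开源工具的数据分析》) by Philipp K. Janert is a practical, in-depth treatment of data analysis. The book concentrates on doing the work with open source tools, which cost little yet are capable enough to serve professional analysts and beginners alike.

The book introduces open source tools for data analysis, such as Python with NumPy and the R language. Through worked examples and code, readers can get started quickly and learn how to apply these tools to process and analyze data, covering tasks such as statistical analysis, data visualization, and machine learning.

Alongside the theory, the book emphasizes hands-on technique, including data cleaning, data transformation, and data modeling, so that readers come away with a complete analysis workflow. The book is published by O'Reilly Media, and its attribution notice gives a copyright year of 2011.

Readers can use the book to improve their own analysis skills and to get a sense of how active the open source community is. It is a useful resource for data scientists and engineers, whether they are newcomers or experienced professionals.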
O’Reilly-5980006 master October 28, 2010 22:0
Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it
has a name: curiosity. There is always something else to find out and something more to
learn. This book is not the last word on the matter; it is merely a snapshot in time: things I
knew about and found useful today.
“Works are of value only if they give rise to better ones.”
(Alexander von Humboldt, writing to Charles Darwin, 18 September 1839)
Before We Begin
More data analysis efforts seem to go bad because of an excess of sophistication rather
than a lack of it.
This may come as a surprise, but it has been my experience again and again. As a
consultant, I am often called in when the initial project team has already gotten stuck.
Rarely (if ever) does the problem turn out to be that the team did not have the required
skills. On the contrary, I usually find that they tried to do something unnecessarily
complicated and are now struggling with the consequences of their own invention!
Based on what I have seen, two particular risk areas stand out:
• The use of “statistical” concepts that are only partially understood (and given the
relative obscurity of most of statistics, this includes virtually all statistical concepts)
• Complicated (and expensive) black-box solutions when a simple and transparent
approach would have worked at least as well or better
I strongly recommend that you make it a habit to avoid all statistical language. Keep it
simple and stick to what you know for sure. There is absolutely nothing wrong with
speaking of the “range over which points spread,” because this phrase means exactly what
it says: the range over which points spread, and only that! Once we start talking about
“standard deviations,” this clarity is gone. Are we still talking about the observed width of
the distribution? Or are we talking about one specific measure for this width? (The
standard deviation is only one of several that are available.) Are we already making an
implicit assumption about the nature of the distribution? (The standard deviation is only
suitable under certain conditions, which are often not fulfilled in practice.) Or are we even
confusing the predictions we could make if these assumptions were true with the actual
data? (The moment someone talks about “95 percent anything” we know it’s the latter!)
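To make the point about "several measures of width" concrete, here is a minimal sketch in Python that computes four of them for the same data. The function name `spread_summary` and the sample values are made up for illustration; they are not taken from the book.

```python
import statistics


def spread_summary(values):
    """Return several common measures of the width of a sample."""
    data = sorted(values)
    # Quartile cut points (Python's default "exclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4)
    med = statistics.median(data)
    # Median absolute deviation: the typical distance from the median.
    mad = statistics.median([abs(x - med) for x in data])
    return {
        "range": data[-1] - data[0],      # the range over which points spread
        "stdev": statistics.stdev(data),  # sample standard deviation
        "iqr": q3 - q1,                   # interquartile range
        "mad": mad,
    }


# Hypothetical sample: nine similar values and one outlier.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]
summary = spread_summary(sample)
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

On a sample like this, the single outlier dominates the range and the standard deviation, while the interquartile range and the median absolute deviation barely register it. All four are legitimate answers to "how wide is the distribution?", which is why naming only one of them can hide an assumption.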
I’d also like to remind you not to discard simple methods until they have been proven
insufficient. Simple solutions are frequently rather effective: the marginal benefit that
more complicated methods can deliver is often quite small (and may be in no reasonable
relation to the increased cost). More importantly, simple methods have fewer
opportunities to go wrong or to obscure the obvious.
xiv PREFACE
True story: a company was tracking the occurrence of defects over time. Of course, the
actual number of defects varied quite a bit from one day to the next, and they were
looking for a way to obtain an estimate for the typical number of expected defects. The
solution proposed by their IT department involved a compute cluster running a neural
network! (I am not making this up.) In fact, a one-line calculation (involving a moving
average or single exponential smoothing) is all that was needed.
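The two techniques named in this story can each be sketched in a few lines of Python. This is an illustrative sketch, not the calculation the company actually used: the function names, the window size, the smoothing factor `alpha`, and the daily defect counts are all assumptions chosen for the example.

```python
def moving_average(values, window):
    """Trailing moving average: one output per full window of observations."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


def exponential_smoothing(values, alpha):
    """Single exponential smoothing.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical daily defect counts.
defects = [12, 15, 9, 14, 30, 11, 13, 10, 16, 12]

print(moving_average(defects, window=5))
print(exponential_smoothing(defects, alpha=0.3))
```

The last element of either result can serve as the estimate of the "typical" number of defects. A larger window, or a smaller `alpha`, smooths more aggressively and reacts more slowly to a genuine change in the defect rate.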
I think the primary reason for this tendency to make data analysis projects more
complicated than they are is discomfort: discomfort with an unfamiliar problem space and
uncertainty about how to proceed. This discomfort and uncertainty creates a desire to
bring in the “big guns”: fancy terminology, heavy machinery, large projects. In reality, of
course, the opposite is true: the complexities of the “solution” overwhelm the original
problem, and nothing gets accomplished.
Data analysis does not have to be all that hard. Although there are situations when
elementary methods will no longer be sufficient, they are much less prevalent than you
might expect. In the vast majority of cases, curiosity and a healthy dose of common sense
will serve you well.
The attitude that I am trying to convey can be summarized in a few points:
Simple is better than complex.
Cheap is better than expensive.
Explicit is better than opaque.
Purpose is more important than process.
Insight is more important than precision.
Understanding is more important than technique.
Think more, work less.
Although I do acknowledge that the items on the right are necessary at times, I will give
preference to those on the left whenever possible.
It is in this spirit that I am offering the concepts and techniques that make up the rest of
this book.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, and email addresses
Constant width
Used to refer to language and script elements
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this
book in your programs and documentation. You do not need to contact us for permission
unless you’re reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require permission.
Selling or distributing a CD-ROM of examples from O’Reilly books does require
permission. Answering a question by citing this book and quoting example code does not
require permission. Incorporating a significant amount of example code from this book
into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Data Analysis with Open Source Tools, by Philipp
K. Janert. Copyright 2011 Philipp K. Janert, 978-0-596-80235-6.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online is an on-demand digital library that lets you easily search
over 7,500 technology and creative reference books and videos to find the
answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites, download
chapters, bookmark key sections, create notes, print out pages, and benefit from tons of
other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O’Reilly and other publishers,
sign up for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://oreilly.com/catalog/9780596802356
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com
For more information about our books, conferences, Resource Centers, and the O’Reilly
Network, see our website at:
http://oreilly.com
Acknowledgments
It was a pleasure to work with O’Reilly on this project. In particular, O’Reilly has been
most accommodating with regard to the technical challenges raised by my need to include
(for an O’Reilly book) an uncommonly large amount of mathematical material in the
manuscript.
Mike Loukides has accompanied this project as the editor since its beginning. I have
enjoyed our conversations about life, the universe, and everything, and I appreciate his
comments about the manuscript—either way.
I’d like to thank several of my friends for their help in bringing this book about:
• Elizabeth Robson, for making the connection
• Austin King, for pointing out the obvious
• Scott White, for suffering my questions gladly
• Richard Kreckel, for much-needed advice
As always, special thanks go to Paul Schrader (Bremen).
The manuscript benefited from the feedback I received from various reviewers. Michael E.
Driscoll, Zachary Kessin, and Austin King read all or parts of the manuscript and provided
valuable comments.
I enjoyed personal correspondence with Joseph Adler, Joe Darcy, Hilary Mason, Stephen
Weston, Scott White, and Brian Zimmer. All very generously provided expert advice on
specific topics.
Particular thanks go to Richard Kreckel, who provided uncommonly detailed and
insightful feedback on most of the manuscript.
During the preparation of this book, the excellent collection at the University of
Washington libraries was an especially valuable resource to me.
Authors usually thank their spouses for their “patience and support” or words to that
effect. Unless one has lived through the actual experience, one cannot fully comprehend
how true this is. Over the last three years, Angela has endured what must have seemed
like a nearly continuous stream of whining, frustration, and desperation—punctuated by
occasional outbursts of exhilaration and grandiosity—all of which before the background
of the self-centered and self-absorbed attitude of a typical author. Her patience and
support were unfailing. It’s her turn now.