A Practical Guide to Data Cleaning in Python: Organizing and Manipulating Data with Ease
Clean Data with Python, written by Megan Squire and published by Packt Publishing in 2015, is a professional guide for IT practitioners covering efficient strategies for cleaning, organizing, and manipulating data in Python. The book targets readers who want to improve their data-handling skills and ensure data quality, moving from the fundamentals of data cleaning through advanced techniques. It opens with basic concepts and then proceeds to hands-on practice, covering topics such as cleaning tools and techniques, outlier detection, missing-value handling, duplicate removal, data type conversion, and data format normalization. Python serves as the primary tool because of its rich data-processing libraries, such as Pandas, NumPy, and SciPy, which play a central role in the cleaning process. The book is copyright 2015 Packt Publishing; no part may be reproduced, stored, or transmitted without permission, and although the publisher has made every effort to ensure accuracy, the contents are provided without warranty and without liability for direct or indirect damages arising from the book. Subscribing to Packt Pub provides access to support files, e-book discounts, and regular updates, and Packt account holders can access some resources for free; preview chapters typically showcase the book's practical methods and give readers an initial feel for the content. Whether you are a beginner or an experienced data analyst, following the book's examples and exercises will help you apply Python to clean data efficiently in real projects, saving time and streamlining your workflow.
Preface
"Pray, Mr. Babbage, if you put into the machine the wrong figures, will the right
answer come out?"
– Charles Babbage (1864)
"Garbage in, garbage out"
– The United States Internal Revenue Service (1963)
"There are no clean datasets."
– Josh Sullivan, Booz Allen VP in Fortune (2015)
In his 1864 collection of essays, Charles Babbage, the inventor of the first calculating
machine, recollects being dumbfounded at the "confusion of ideas" that would
prompt someone to assume that a computer could calculate the correct answer
despite being given the wrong input. Fast-forward another 100 years, and the
tax bureaucracy started patiently explaining "garbage in, garbage out" to express
the idea that even for the all-powerful tax collector, computer processing is still
dependent on the quality of its input. Fast-forward another 50 years to 2015: a
seemingly magical age of machine learning, autocorrect, anticipatory interfaces, and
recommendation systems that know me better than I know myself. Yet, all of these
helpful algorithms still require high-quality data in order to learn properly in the
first place, and we lament "there are no clean datasets".
This book is for anyone who works with data on a regular basis, whether as a data
scientist, data journalist, software developer, or something else. The goal is to teach
practical strategies to quickly and easily bridge the gap between the data we want
and the data we have. We want high-quality, perfect data, but the reality is that most
often, our data falls far short. Whether we are plagued with missing data, data in
the wrong format, data in the wrong location, or anomalies in the data, the result is
often, to paraphrase rapper Notorious B.I.G., "more data, more problems".
Throughout the book, we will envision data cleaning as an important, worthwhile
step in the data science process: easily improved, never ignored. Our goal is to
reframe data cleaning away from being a dreaded, tedious task that we must slog
through in order to get to the real work. Instead, armed with a few tried-and-true
procedures and tools, we will learn that just like in a kitchen, if you wash your
vegetables first, your food will look better, taste better, and be better for you. If you
learn a few proper knife skills, your meat will be more succulent and your vegetables
will be cooked more evenly. The same way that a great chef will have their favorite
knives and culinary traditions, a great data scientist will want to work with the very
best data possible and under the very best conditions.
What this book covers
Chapter 1, Why Do You Need Clean Data? motivates our quest for clean data by
showing the central role of data cleaning in the overall data science process. We
follow with a simple example showing some dirty data from a real-world dataset.
We weigh the pros and cons of each potential cleaning process, and then we describe
how to communicate our cleaning changes to others.
Chapter 2, Fundamentals – Formats, Types, and Encodings, sets up some foundational
knowledge about file formats, compression, and data types, including missing and
empty data and character encodings. Each section has its own examples taken from
real-world datasets. This chapter is important because we will rely on knowledge of
these basic concepts for the rest of the book.
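The kind of character-encoding trouble Chapter 2 addresses can be previewed in a few lines. This is a minimal sketch, not taken from the book: the same bytes decode into very different strings depending on which codec you assume, which is how "mojibake" creeps into datasets.

```python
# The UTF-8 bytes for a simple accented word...
raw = "café".encode("utf-8")      # b'caf\xc3\xa9'

# ...read back with the wrong codec produce garbled text,
wrong = raw.decode("latin-1")     # 'cafÃ©'

# ...while the correct codec recovers the original string.
right = raw.decode("utf-8")       # 'café'

print(wrong)
print(right)
```

Diagnosing which codec a file was actually written in, rather than guessing, is exactly the sort of foundational skill this chapter builds.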
Chapter 3, Workhorses of Clean Data – Spreadsheets and Text Editors, describes how to
get the most data cleaning utility out of two common tools: the text editor and the
spreadsheet. We will cover simple solutions to common problems, including how to
use functions, search and replace, and regular expressions to correct and transform
data. At the end of the chapter, we will put our skills to test using both of these tools
to clean some real-world data regarding universities.
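The search-and-replace and regular-expression work a text editor can do translates directly to Python. Below is a hedged sketch (the messy university rows are invented, not the chapter's actual dataset) of two common fixes: collapsing runs of whitespace and normalizing spacing around delimiters.

```python
import re

# Invented messy rows, similar in spirit to the chapter's exercise.
rows = [
    "Elon  University ,NC",
    "Duke University,  NC ",
]

def clean(row):
    row = re.sub(r"\s{2,}", " ", row)    # collapse runs of whitespace
    row = re.sub(r"\s*,\s*", ",", row)   # normalize spacing around commas
    return row.strip()                   # trim leading/trailing space

print([clean(r) for r in rows])
```

The same two patterns, typed into an editor's regex find-and-replace dialog, produce the same result without writing any code.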
Chapter 4, Speaking the Lingua Franca – Data Conversions, focuses on converting data
from one format to another. This is one of the most important data cleaning tasks,
and it is useful to have a variety of tools at hand to easily complete this task. We
first proceed through each of the different conversions step by step, including back
and forth between common formats such as comma-separated values (CSV), JSON,
and SQL. To put our new data conversion skills into practice, we complete a project
where we download a Facebook friend network and convert it into a few different
formats so that we can visualize its shape.
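One of the chapter's core conversions, CSV to JSON, can be done with the standard library alone. This is a minimal sketch under invented data, not the chapter's Facebook project; note that `csv.DictReader` leaves every value as a string, a detail that matters when the target format is typed.

```python
import csv, io, json

# A tiny invented CSV document held in memory.
csv_text = "name,friends\nalice,42\nbob,7\n"

# Each row becomes a dict keyed by the header line.
records = list(csv.DictReader(io.StringIO(csv_text)))

# Serialize the list of dicts as a JSON array.
as_json = json.dumps(records)
print(as_json)
```

Going the other direction (JSON to CSV, or either format into SQL `INSERT` statements) follows the same read-into-dicts, write-back-out shape.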
Chapter 5, Collecting and Cleaning Data from the Web, describes three different ways to
clean data found inside HTML pages. This chapter presents three popular tools to
pull data elements from within marked-up text, and it also provides the conceptual
foundation to understand other methods besides the specific tools shown here. As
our project for this chapter, we build a set of cleaning procedures to pull data from
web-based discussion forums.
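To give a flavor of pulling data elements out of marked-up text, here is a standard-library sketch; the chapter itself covers more full-featured tools, and the forum HTML below is invented. The parser collects only the text inside `<div class="post">` elements, skipping navigation chrome.

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect text found inside <div class="post"> elements."""
    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post and data.strip():
            self.posts.append(data.strip())

html = ('<div class="post">first reply</div>'
        '<p>nav</p>'
        '<div class="post">second reply</div>')
parser = PostExtractor()
parser.feed(html)
print(parser.posts)
```

The conceptual move, locating data by its surrounding markup rather than by position, is what carries over to whichever extraction tool you end up using.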
Chapter 6, Cleaning Data in PDF Files, introduces several ways to meet this most
stubborn and all-too-common challenge for data cleaners: extracting data that has
been stored in Adobe's Portable Document Format (PDF) files. We first examine
low-cost tools to accomplish this task, then we try a few low-barrier-to-entry tools,
and finally, we experiment with the Adobe non-free software itself. As always, we
use real-world data for our experiments, and this provides a wealth of experience
as we learn to work through problems as they arise.
Chapter 7, RDBMS Cleaning Techniques, uses a publicly available dataset of tweets to
demonstrate numerous strategies to clean data stored in a relational database. The
database shown is MySQL, but many of the concepts, including regular-expression
based text extraction and anomaly detection, are readily applicable to other storage
systems as well.
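The regular-expression-based text extraction the chapter applies to tweet text looks like this in Python; the tweet below is invented, and in practice the same patterns can be driven from inside or alongside the database.

```python
import re

# An invented tweet to extract structured fields from.
tweet = "Cleaning #data with @pythontips is easier than it looks #datascience"

# Hashtags and mentions are word runs prefixed by '#' or '@'.
hashtags = re.findall(r"#(\w+)", tweet)
mentions = re.findall(r"@(\w+)", tweet)

print(hashtags, mentions)
```

Extracted values like these are typically written into their own columns or lookup tables, which is what makes the cleaned data queryable later.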
Chapter 8, Best Practices for Sharing Your Clean Data, describes some strategies to
make your hard work as easy for others to use as possible. Even if you never plan
to share your data with anyone else, the strategies in this chapter will help you stay
organized in your own work, saving you time in the future. This chapter describes
how to create the ideal data package in a variety of formats, how to document your
data, how to choose and attach a license to your data, and also how to publicize your
data so that it can live on if you choose.
Chapter 9, Stack Overflow Project, guides you through a full-length project using a
real-world dataset. We start by posing a set of authentic questions that we can
answer about that dataset. In answering this set of questions, we will complete the
entire data science process introduced in Chapter 1, Why Do You Need Clean Data? and
we will put into practice many of the cleaning processes we learned in the previous
chapters. In addition, because this dataset is so large, we will introduce
a few new techniques to deal with the creation of test datasets.
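One common way to build a small test dataset from a very large one, not necessarily the book's own technique, is a reproducible random sample of row identifiers, as in this sketch:

```python
import random

# Seed the generator so the "random" test set is reproducible.
random.seed(42)

# Stand-in for millions of post ids in the full dataset.
all_ids = range(1, 1_000_001)

# Draw a 1,000-row test dataset without replacement.
test_ids = random.sample(all_ids, 1000)

print(len(test_ids))
```

Working against the small sample keeps each cleaning experiment fast, and the seed means the same sample can be regenerated whenever results need to be checked.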
Chapter 10, Twitter Project, is a full-length project that shows how to perform one of
the hottest and fastest-changing data collection and cleaning tasks out there right
now: mining Twitter. We will show how to find and collect an existing archive
of publicly available tweets on a real-world current event while adhering to legal
restrictions on the usage of the Twitter service. We will answer a simple question
about the dataset while learning how to clean and extract data from JSON, the most
popular format in use right now with API-accessible web data. Finally, we will
design a simple data model for long-term storage of the extracted and cleaned data
and show how to generate some simple visualizations.
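Flattening one tweet's JSON into the handful of columns a simple storage model might keep can be sketched as follows. The field names (`id_str`, `user.screen_name`, `text`) follow Twitter's classic API conventions, and the tweet itself is invented.

```python
import json

# One invented tweet, as the JSON string an API or archive would supply.
raw = ('{"id_str": "123", "text": "hello", '
       '"user": {"screen_name": "alice"}, "lang": "en"}')

tweet = json.loads(raw)

# Keep only the fields the simple data model stores, in column order.
row = (tweet["id_str"], tweet["user"]["screen_name"], tweet["text"])
print(row)
```

Rows in this shape drop straight into a database table or CSV file, which is what makes the long-term storage and later visualization steps straightforward.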
What you need for this book
To complete the projects in this book, you will need the following tools:
• A web browser, Internet access, and a modern operating system.
The browser and operating system should not matter, but access to a
command-line terminal window is ideal (for example, the Terminal
application in OS X). In Chapter 5, Collecting and Cleaning Data from the Web,
one of the three activities relies on a browser-based utility that runs in
the Chrome browser, so keep this in mind if you would like to complete
this activity.
• A text editor, such as Text Wrangler for Mac OSX or Notepad++ for
Windows. Some integrated development environments (IDEs, such as
Eclipse) can also be used as a text editor, but they typically have many
features you will not need.
• A spreadsheet application, such as Microsoft Excel or Google Spreadsheets.
When possible, generic examples are provided that can work on either of
these tools, but in some cases, one or the other is required.
• A Python development environment and the ability to install Python
libraries. I recommend the Enthought Canopy Python environment,
which is available here: https://www.enthought.com/products/canopy/.
• A MySQL 5.5+ server installed and running.
• A web server (running any server software) and PHP 5+ installed.
• A MySQL client interface, either the command-line interface, MySQL
Workbench, or phpMyAdmin (if you have PHP running).
Who this book is for
If you are reading this book, I guess you are probably in one of two groups. One
group is the group of data scientists who already spend a lot of time cleaning data,
but you want to get better at it. You are probably frustrated with the tedium of data
cleaning, and you are looking for ways to speed it up, become more efficient, or just
use different tools to get the job done. In our kitchen metaphor, you are the chef who
just needs to brush up on a few knife skills.
The other group is made up of people doing the data science work but who never
really cared about data cleaning before. But now, you are starting to think that
maybe your results might actually get better if you had a cleaning process. Maybe
the old adage "garbage in, garbage out" is starting to feel a little too real. Maybe you
are interested in sharing your data with others, but you do not feel confident about
the quality of the datasets you are producing. With this book, you will gain enough
confidence to "cook in public" by learning a few tricks and creating new habits that
will ensure a tidy, clean data science environment.
Either way, this book will help you reframe data cleaning away from being a symbol
of drudgery and toward being your hallmark of quality, good taste, style, and
efficiency. You should probably have a bit of programming background, but you do
not have to be great at it. As with most data science projects, a willingness to learn
and experiment as well as a healthy sense of curiosity and a keen attention to detail
are all very important and valued.
Conventions
In this book, you will find a number of text styles that distinguish between different
kinds of information. Here are some examples of these styles and an explanation of
their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"The issue is that open() is not prepared to handle UTF-8 characters."
A block of code is set as follows:
for tweet in stream:
    encoded_tweet = tweet['text'].encode('ascii','ignore')
    print counter, "-", encoded_tweet[0:10]
    f.write(encoded_tweet)