Python数据清洗实战指南：轻松整理与操纵数据

需积分: 16 53 浏览量更新于2024-07-18 1 收藏 5.65MB PDF 举报

"《Clean Data with Python》是一本由Megan Squire撰写的专业书籍，专为IT专业人士设计，旨在帮助读者掌握在Python环境下进行数据清洗、组织和操纵的高效策略。本书内容涵盖了数据清洗的基础至高级技巧，适合那些希望提升数据处理能力，确保数据质量的读者。该书的章节结构清晰，从目录可以看出，它首先会介绍数据清洗的基本概念，然后逐步深入到实际操作层面，可能会涵盖的主题包括但不限于数据清洗的工具和技术、异常值检测、缺失值处理、重复值消除、数据类型转换以及数据格式规范化等。作者以Python语言作为主要工具，因为Python拥有丰富的数据处理库，如Pandas、NumPy和SciPy，这些工具在数据清洗过程中扮演着关键角色。在版权方面，这本书受到2015年Packt Publishing的版权保护，所有内容未经许可不得复制、存储或传输。尽管出版商已尽最大努力确保信息的准确性，但书中的信息并不保证绝对无误，且不承担因本书导致的直接或间接损失的责任。订阅 Packt Pub的好处包括访问支持文件、电子书优惠以及定期更新等内容。对于拥有Packt账户的用户，可以享受免费访问资源。此外，预览章节通常会揭示书中实用的方法和技巧，让读者对内容有一个初步了解。《Clean Data with Python》是一本实用的指南，无论你是初学者还是经验丰富的数据分析师，都能从中找到提高数据处理效率的宝贵知识。通过跟随书中的例子和练习，读者将能够熟练运用Python在实际项目中实现高效的数据清洗，从而节省时间和优化工作流程。"

[ vii ]

Preface

"Pray, Mr. Babbage, if you put into the machine the wrong gures, will the right

answer come out?"

– Charles Babbage (1864)

"Garbage in, garbage out"

– The United States Internal Revenue Service (1963)

"There are no clean datasets."

– Josh Sullivan, Booz Allen VP in Fortune (2015)

In his 1864 collection of essays, Charles Babbage, the inventor of the rst calculating

machine, recollects being dumbfounded at the "confusion of ideas" that would

prompt someone to assume that a computer could calculate the correct answer

despite being given the wrong input. Fast-forward another 100 years, and the

tax bureaucracy started patiently explaining "garbage in, garbage out" to express

the idea that even for the all-powerful tax collector, computer processing is still

dependent on the quality of its input. Fast-forward another 50 years to 2015: a

seemingly magical age of machine learning, autocorrect, anticipatory interfaces, and

recommendation systems that know me better than I know myself. Yet, all of these

helpful algorithms still require high-quality data in order to learn properly in the

rst place, and we lament "there are no clean datasets".

Preface

[ viii ]

This book is for anyone who works with data on a regular basis, whether as a data

scientist, data journalist, software developer, or something else. The goal is to teach

practical strategies to quickly and easily bridge the gap between the data we want

and the data we have. We want high-quality, perfect data, but the reality is that most

often, our data falls far short. Whether we are plagued with missing data, data in

the wrong format, data in the wrong location, or anomalies in the data, the result is

often, to paraphrase rapper Notorious B.I.G., "more data, more problems".

Throughout the book, we will envision data cleaning as an important, worthwhile

step in the data science process: easily improved, never ignored. Our goal is to

reframe data cleaning away from being a dreaded, tedious task that we must slog

through in order to get to the real work. Instead, armed with a few tried-and-true

procedures and tools, we will learn that just like in a kitchen, if you wash your

vegetables rst, your food will look better, taste better, and be better for you. If you

learn a few proper knife skills, your meat will be more succulent and your vegetables

will be cooked more evenly. The same way that a great chef will have their favorite

knives and culinary traditions, a great data scientist will want to work with the very

best data possible and under the very best conditions.

What this book covers

Chapter 1, Why Do You Need Clean Data? motivates our quest for clean data by

showing the central role of data cleaning in the overall data science process. We

follow with a simple example showing some dirty data from a real-world dataset.

We weigh the pros and cons of each potential cleaning process, and then we describe

how to communicate our cleaning changes to others.

Chapter 2, Fundamentals – Formats, Types, and Encodings, sets up some foundational

knowledge about le formats, compression, and data types, including missing and

empty data and character encodings. Each section has its own examples taken from

real-world datasets. This chapter is important because we will rely on knowledge of

these basic concepts for the rest of the book.

Chapter 3, Workhorses of Clean Data – Spreadsheets and Text Editors, describes how to

get the most data cleaning utility out of two common tools: the text editor and the

spreadsheet. We will cover simple solutions to common problems, including how to

use functions, search and replace, and regular expressions to correct and transform

data. At the end of the chapter, we will put our skills to test using both of these tools

to clean some real-world data regarding universities.

Preface

[ ix ]

Chapter 4, Speaking the Lingua Franca – Data Conversions, focuses on converting data

from one format to another. This is one of the most important data cleaning tasks,

and it is useful to have a variety of tools at hand to easily complete this task. We

rst proceed through each of the different conversions step by step, including back

and forth between common formats such as comma-separated values (CSV), JSON,

and SQL. To put our new data conversion skills into practice, we complete a project

where we download a Facebook friend network and convert it into a few different

formats so that we can visualize its shape.

Chapter 5, Collecting and Cleaning Data from the Web, describes three different ways to

clean data found inside HTML pages. This chapter presents three popular tools to

pull data elements from within marked-up text, and it also provides the conceptual

foundation to understand other methods besides the specic tools shown here. As

our project for this chapter, we build a set of cleaning procedures to pull data from

web-based discussion forums.

Chapter 6, Cleaning Data in PDF Files, introduces several ways to meet this most

stubborn and all-too-common challenge for data cleaners: extracting data that has

been stored in Adobe's Portable Document Format (PDF) les. We rst examine

low-cost tools to accomplish this task, then we try a few low-barrier-to-entry tools,

and nally, we experiment with the Adobe non-free software itself. As always, we

use real-world data for our experiments, and this provides a wealth of experience

as we learn to work through problems as they arise.

Chapter 7, RDBMS Cleaning Techniques, uses a publicly available dataset of tweets to

demonstrate numerous strategies to clean data stored in a relational database. The

database shown is MySQL, but many of the concepts, including regular-expression

based text extraction and anomaly detection, are readily applicable to other storage

systems as well.

Chapter 8, Best Practices for Sharing Your Clean Data, describes some strategies to

make your hard work as easy for others to use as possible. Even if you never plan

to share your data with anyone else, the strategies in this chapter will help you stay

organized in your own work, saving you time in the future. This chapter describes

how to create the ideal data package in a variety of formats, how to document your

data, how to choose and attach a license to your data, and also how to publicize your

data so that it can live on if you choose.

Chapter 9, Stack Overow Project, guides you through a full-length project using a

real-world dataset. We start by posing a set of authentic questions that we can

answer about that dataset. In answering this set of questions, we will complete the

entire data science process introduced in Chapter 1, Why Do You Need Clean Data? and

we will put into practice many of the cleaning processes we learned in the previous

chapters. In addition, because this dataset is so large, we will introduce

a few new techniques to deal with the creation of test datasets.

Preface

[ x ]

Chapter 10, Twitter Project, is a full-length project that shows how to perform one of

the hottest and fastest-changing data collection and cleaning tasks out there right

now: mining Twitter. We will show how to nd and collect an existing archive

of publicly available tweets on a real-world current event while adhering to legal

restrictions on the usage of the Twitter service. We will answer a simple question

about the dataset while learning how to clean and extract data from JSON, the most

popular format in use right now with API-accessible web data. Finally, we will

design a simple data model for long-term storage of the extracted and cleaned data

and show how to generate some simple visualizations.

What you need for this book

To complete the projects in this book, you will need the following tools:

• A web browser, Internet access, and a modern operating system.

The browser and operating system should not matter, but access to a

command-line terminal window is ideal (for example, the Terminal

application in OS X). In Chapter 5, Collecting and Cleaning Data from the Web,

one of the three activities relies on a browser-based utility that runs in

the Chrome browser, so keep this in mind if you would like to complete

this activity.

• A text editor, such as Text Wrangler for Mac OSX or Notepad++ for

Windows. Some integrated development environments (IDEs, such as

Eclipse) can also be used as a text editor, but they typically have many

features you will not need.

• A spreadsheet application, such as Microsoft Excel or Google Spreadsheets.

When possible, generic examples are provided that can work on either of

these tools, but in some cases, one or the other is required.

• A Python development environment and the ability to install Python

libraries. I recommend the Enthought Canopy Python environment,

which is available here: https://www.enthought.com/products/canopy/.

• A MySQL 5.5+ server installed and running.

• A web server (running any server software) and PHP 5+ installed.

• A MySQL client interface, either the command-line interface, MySQL

Workbench, or phpMyAdmin (if you have PHP running).

Preface

[ xi ]

Who this book is for

If you are reading this book, I guess you are probably in one of two groups. One

group is the group of data scientists who already spend a lot of time cleaning data,

but you want to get better at it. You are probably frustrated with the tedium of data

cleaning, and you are looking for ways to speed it up, become more efcient, or just

use different tools to get the job done. In our kitchen metaphor, you are the chef who

just needs to brush up on a few knife skills.

The other group is made up of people doing the data science work but who never

really cared about data cleaning before. But now, you are starting to think that

maybe your results might actually get better if you had a cleaning process. Maybe

the old adage "garbage in, garbage out" is starting to feel a little too real. Maybe you

are interested in sharing your data with others, but you do not feel condent about

the quality of the datasets you are producing. With this book, you will gain enough

condence to "cook in public" by learning a few tricks and creating new habits that

will ensure a tidy, clean data science environment.

Either way, this book will help you reframe data cleaning away from being a symbol

of drudgery and toward being your hallmark of quality, good taste, style, and

efciency. You should probably have a bit of programming background, but you do

not have to be great at it. As with most data science projects, a willingness to learn

and experiment as well as a healthy sense of curiosity and a keen attention to detail

are all very important and valued.

Conventions

In this book, you will nd a number of text styles that distinguish between different

kinds of information. Here are some examples of these styles and an explanation of

their meaning.

Code words in text, database table names, folder names, lenames, le extensions,

pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "

The issue is that

open() is not prepared to handle UTF-8 characters."

A block of code is set as follows:

for tweet in stream:

encoded_tweet = tweet['text'].encode('ascii','ignore')

print counter, "-", encoded_tweet[0:10]

f.write(encoded_tweet)

剩余271页未读，继续阅读

weixin_41957720

粉丝: 0
资源: 3

Python数据清洗实战指南：轻松整理与操纵数据

Clean Python: Elegant Coding in Python (True PDF)

Clean Python.pdf

clean_data.py

quick_data_clean manual V1.1.4.pdf

Python_Data+Analytics+and+Visualization-Packt+Publishing(2017).pdf

Clean.Data.1785284010

Python数据处理框架quick_data_clean使用手册

Data Wrangling with Python 2016新书 无水印pdf 0分

Python Machine Learning.pdf 无水印书签修正完美版 2015

Python : end-to-end data analysis : leverage the power of Python to clean

最新资源

Data Wrangling with Python 2016新书无水印pdf 0分