knowledge about file formats, compression, and data types, including missing and empty data and
character encodings. Each section has its own examples taken from real-world datasets. This
chapter is important because we will rely on knowledge of these basic concepts for the rest of the
book.
Chapter 3, Workhorses of Clean Data – Spreadsheets and Text Editors, describes how to get the
most data cleaning utility out of two common tools: the text editor and the spreadsheet. We will cover
simple solutions to common problems, including how to use functions, search and replace, and
regular expressions to correct and transform data. At the end of the chapter, we will put our skills to
the test by using both of these tools to clean some real-world data about universities.
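As a taste of the regular-expression techniques that chapter covers, here is a minimal sketch of search-and-replace cleaning in Python; the messy university names and the specific substitutions are invented for illustration, not taken from the book's dataset:

```python
import re

# Hypothetical messy values for one column of university names
rows = ["Univ. of Texas", "University  of Texas", "UNIVERSITY OF TEXAS"]

def clean_name(name):
    name = re.sub(r"\bUniv\.", "University", name)  # expand the abbreviation
    name = re.sub(r"\s+", " ", name)                # collapse repeated whitespace
    return name.strip().title()                     # normalize capitalization

cleaned = [clean_name(r) for r in rows]
# All three variants now normalize to the same value
```

Each substitution targets one kind of inconsistency, so the rules stay easy to audit and extend.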
Chapter 4, Speaking the Lingua Franca – Data Conversions, focuses on converting data from one
format to another. This is one of the most important data cleaning tasks, and it is useful to have a
variety of tools at hand to easily complete this task. We first proceed through each of the different
conversions step by step, including back and forth between common formats such as comma-
separated values (CSV), JSON, and SQL. To put our new data conversion skills into practice, we
complete a project where we download a Facebook friend network and convert it into a few different
formats so that we can visualize its shape.
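The flavor of those conversions can be sketched in a few lines of Python using only the standard library; the field names and sample rows below are made up for illustration:

```python
import csv
import io
import json

# A tiny CSV document standing in for real exported data
csv_text = "name,friends\nalice,2\nbob,3\n"

# Read each CSV row as a dictionary, then serialize the list as JSON
reader = csv.DictReader(io.StringIO(csv_text))
records = [dict(row) for row in reader]
json_text = json.dumps(records)
```

Note that every value comes out of the CSV reader as a string; deciding when and how to convert types is itself a cleaning decision.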
Chapter 5, Collecting and Cleaning Data from the Web, describes three different ways to clean
data found inside HTML pages. This chapter presents three popular tools to pull data elements from
within marked-up text, and it also provides the conceptual foundation to understand other methods
besides the specific tools shown here. As our project for this chapter, we build a set of cleaning
procedures to pull data from web-based discussion forums.
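The basic idea behind all of those tools can be sketched with Python's built-in parser: walk the markup and keep only the text of the elements you care about. The `class="post"` attribute below is an assumption about a hypothetical forum's markup, not a real site's structure:

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect the text inside <div class="post"> elements."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # "post" is an invented class name for this sketch
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post and data.strip():
            self.posts.append(data.strip())

parser = PostExtractor()
parser.feed('<div class="post">Hello forum</div><p>sidebar text</p>')
# parser.posts keeps the post text and ignores the sidebar
```

Dedicated libraries offer friendlier interfaces, but they rest on the same parse-and-filter idea.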
Chapter 6, Cleaning Data in PDF Files, introduces several ways to meet this most stubborn and all-
too-common challenge for data cleaners: extracting data that has been stored in Adobe's Portable
Document Format (PDF) files. We first examine low-cost tools to accomplish this task, then we try a
few low-barrier-to-entry tools, and finally, we experiment with Adobe's own non-free software. As
always, we use real-world data for our experiments, and this provides a wealth of experience as we
learn to work through problems as they arise.
Chapter 7, RDBMS Cleaning Techniques, uses a publicly available dataset of tweets to
demonstrate numerous strategies to clean data stored in a relational database. The database
shown is MySQL, but many of the concepts, including regular-expression-based text extraction and
anomaly detection, are readily applicable to other storage systems as well.
Chapter 8, Best Practices for Sharing Your Clean Data, describes some strategies to make your
hard work as easy for others to use as possible. Even if you never plan to share your data with
anyone else, the strategies in this chapter will help you stay organized in your own work, saving you
time in the future. This chapter describes how to create the ideal data package in a variety of
formats, how to document your data, how to choose and attach a license to your data, and also how
to publicize your data so that it can live on if you choose.
Chapter 9, Stack Overflow Project, guides you through a full-length project using a real-world
dataset. We start by posing a set of authentic questions that we can answer about that dataset. In
answering this set of questions, we will complete the entire data science process introduced in
Chapter 1, Why Do You Need Clean Data?, and we will put into practice many of the cleaning
processes we learned in the previous chapters. In addition, because this dataset is so large, we will
introduce a few new techniques to deal with the creation of test datasets.
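One common way to create a smaller test dataset from a large one, which gives the flavor of those techniques, is to draw a reproducible random sample of rows; the row data below is a stand-in, and the 1% sample size is an arbitrary choice for this sketch:

```python
import random

# Fix the seed so the same "test dataset" can be rebuilt later
random.seed(42)

all_rows = list(range(100000))             # stand-in for a large dataset
test_rows = random.sample(all_rows, 1000)  # a 1% sample without replacement
```

Pinning the seed matters: it lets collaborators regenerate exactly the same sample when checking each other's results.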