knowledge about file formats, compression, and data types, including missing and empty data and
character encodings. Each section has its own examples taken from real-world datasets. This
chapter is important because we will rely on knowledge of these basic concepts for the rest of the
book.
Chapter 3, Workhorses of Clean Data – Spreadsheets and Text Editors, describes how to get the
most data cleaning utility out of two common tools: the text editor and the spreadsheet. We will cover
simple solutions to common problems, including how to use functions, search and replace, and
regular expressions to correct and transform data. At the end of the chapter, we will put our skills to
the test by using both of these tools to clean some real-world data about universities.
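As a taste of the regular-expression techniques that chapter covers, here is a minimal sketch of search-and-replace cleaning in Python; the messy university names and the specific substitutions are invented for illustration, not taken from the book's dataset:

```python
import re

# Hypothetical messy values for one column of university names
rows = ["Univ. of Texas", "University  of Texas", "UNIVERSITY OF TEXAS"]

def clean_name(name):
    name = re.sub(r"\bUniv\.", "University", name)  # expand the abbreviation
    name = re.sub(r"\s+", " ", name)                # collapse repeated whitespace
    return name.strip().title()                     # normalize capitalization

cleaned = [clean_name(r) for r in rows]
# All three variants now normalize to the same value
```

Each substitution targets one kind of inconsistency, so the rules stay easy to audit and extend.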
Chapter 4, Speaking the Lingua Franca – Data Conversions, focuses on converting data from one
format to another. This is one of the most important data cleaning tasks, and it is useful to have a
variety of tools at hand to easily complete this task. We first proceed through each of the different
conversions step by step, including back and forth between common formats such as comma-
separated values (CSV), JSON, and SQL. To put our new data conversion skills into practice, we
complete a project where we download a Facebook friend network and convert it into a few different
formats so that we can visualize its shape.
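The flavor of those conversions can be sketched in a few lines of Python using only the standard library; the field names and sample rows below are made up for illustration:

```python
import csv
import io
import json

# A tiny CSV document standing in for real exported data
csv_text = "name,friends\nalice,2\nbob,3\n"

# Read each CSV row as a dictionary, then serialize the list as JSON
reader = csv.DictReader(io.StringIO(csv_text))
records = [dict(row) for row in reader]
json_text = json.dumps(records)
```

Note that every value comes out of the CSV reader as a string; deciding when and how to convert types is itself a cleaning decision.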
Chapter 5, Collecting and Cleaning Data from the Web, describes three different ways to clean
data found inside HTML pages. This chapter presents three popular tools to pull data elements from
within marked-up text, and it also provides the conceptual foundation to understand other methods
besides the specific tools shown here. As our project for this chapter, we build a set of cleaning
procedures to pull data from web-based discussion forums.
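The basic idea behind all of those tools can be sketched with Python's built-in parser: walk the markup and keep only the text of the elements you care about. The `class="post"` attribute below is an assumption about a hypothetical forum's markup, not a real site's structure:

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect the text inside <div class="post"> elements."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # "post" is an invented class name for this sketch
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post and data.strip():
            self.posts.append(data.strip())

parser = PostExtractor()
parser.feed('<div class="post">Hello forum</div><p>sidebar text</p>')
# parser.posts keeps the post text and ignores the sidebar
```

Dedicated libraries offer friendlier interfaces, but they rest on the same parse-and-filter idea.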
Chapter 6, Cleaning Data in PDF Files, introduces several ways to meet this most stubborn and all-
too-common challenge for data cleaners: extracting data that has been stored in Adobe's Portable
Document Format (PDF) files. We first examine low-cost tools to accomplish this task, then we try a
few low-barrier-to-entry tools, and finally, we experiment with Adobe's own non-free software. As
always, we use real-world data for our experiments, and this provides a wealth of experience as we
learn to work through problems as they arise.
Chapter 7, RDBMS Cleaning Techniques, uses a publicly available dataset of tweets to
demonstrate numerous strategies to clean data stored in a relational database. The database
shown is MySQL, but many of the concepts, including regular-expression-based text extraction and
anomaly detection, are readily applicable to other storage systems as well.
Chapter 8, Best Practices for Sharing Your Clean Data, describes some strategies to make your
hard work as easy for others to use as possible. Even if you never plan to share your data with
anyone else, the strategies in this chapter will help you stay organized in your own work, saving you
time in the future. This chapter describes how to create the ideal data package in a variety of
formats, how to document your data, how to choose and attach a license to your data, and also how
to publicize your data so that it can live on if you choose.
Chapter 9, Stack Overflow Project, guides you through a full-length project using a real-world
dataset. We start by posing a set of authentic questions that we can answer about that dataset. In
answering this set of questions, we will complete the entire data science process introduced in
Chapter 1, Why Do You Need Clean Data?, and we will put into practice many of the cleaning
processes we learned in the previous chapters. In addition, because this dataset is so large, we will
introduce a few new techniques to deal with the creation of test datasets.
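One common way to create a smaller test dataset from a large one, which gives the flavor of those techniques, is to draw a reproducible random sample of rows; the row data below is a stand-in, and the 1% sample size is an arbitrary choice for this sketch:

```python
import random

# Fix the seed so the same "test dataset" can be rebuilt later
random.seed(42)

all_rows = list(range(100000))             # stand-in for a large dataset
test_rows = random.sample(all_rows, 1000)  # a 1% sample without replacement
```

Pinning the seed matters: it lets collaborators regenerate exactly the same sample when checking each other's results.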