Preface
Chapter 4, Speaking the Lingua Franca – Data Conversions, focuses on converting data
from one format to another. This is one of the most important data cleaning tasks,
and it is useful to have a variety of tools at hand to easily complete this task. We
first proceed through each of the different conversions step by step, including back
and forth between common formats such as comma-separated values (CSV), JSON,
and SQL. To put our new data conversion skills into practice, we complete a project
where we download a Facebook friend network and convert it into a few different
formats so that we can visualize its shape.
Chapter 5, Collecting and Cleaning Data from the Web, describes three different ways to
clean data found inside HTML pages. This chapter presents three popular tools to
pull data elements from within marked-up text, and it also provides the conceptual
foundation to understand other methods besides the specific tools shown here. As
our project for this chapter, we build a set of cleaning procedures to pull data from
web-based discussion forums.
Chapter 6, Cleaning Data in PDF Files, introduces several ways to meet this most
stubborn and all-too-common challenge for data cleaners: extracting data that has
been stored in Adobe's Portable Document Format (PDF) files. We first examine
low-cost tools to accomplish this task, then we try a few low-barrier-to-entry tools,
and finally, we experiment with the Adobe non-free software itself. As always, we
use real-world data for our experiments, and this provides a wealth of experience
as we learn to work through problems as they arise.
Chapter 7, RDBMS Cleaning Techniques, uses a publicly available dataset of tweets to
demonstrate numerous strategies to clean data stored in a relational database. The
database shown is MySQL, but many of the concepts, including regular-expression-based
text extraction and anomaly detection, are readily applicable to other storage
systems as well.
Chapter 8, Best Practices for Sharing Your Clean Data, describes some strategies to
make your hard work as easy for others to use as possible. Even if you never plan
to share your data with anyone else, the strategies in this chapter will help you stay
organized in your own work, saving you time in the future. This chapter describes
how to create the ideal data package in a variety of formats, how to document your
data, how to choose and attach a license to your data, and also how to publicize your
data so that it can live on if you choose.
Chapter 9, Stack Overflow Project, guides you through a full-length project using a
real-world dataset. We start by posing a set of authentic questions that we can
answer about that dataset. In answering this set of questions, we will complete the
entire data science process introduced in Chapter 1, Why Do You Need Clean Data? and
we will put into practice many of the cleaning processes we learned in the previous
chapters. In addition, because this dataset is so large, we will introduce
a few new techniques to deal with the creation of test datasets.