registration process, how many are over 25?” Structuring a data warehouse and
organizing information so that these kinds of questions are easy to answer is a rich
field, but we will mostly avoid its intricacies in this book.
Sometimes, “doing something useful” takes a little extra. SQL may still be core to the
approach, but to work around idiosyncrasies in the data or perform complex analysis,
we need a programming paradigm that is a bit more flexible, a little closer to the
ground, and richer in areas like machine learning and statistics.
These are the kinds of analyses we are going to talk about in this book.
For a long time, open source frameworks like R, the PyData stack, and Octave have
made rapid analysis and model building viable over small data sets. With fewer than
10 lines of code, we can throw together a machine learning model on half a data set
and use it to predict labels on the other half. With a little more effort, we can impute
missing data, experiment with a few models to find the best one, or use the results of
a model as inputs to fit another. What should an equivalent process look like, one that
can leverage clusters of computers to achieve the same outcomes on huge data sets?
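As a concrete illustration, here is roughly what that single-machine workflow looks
like. The sketch assumes scikit-learn (one library from the PyData stack) and one of
its bundled sample data sets, though any of the frameworks above would serve:

    # A minimal single-machine sketch: fit a model on half of a small data
    # set and predict labels for the other half (assumes scikit-learn).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = model.predict(X_test)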
A tempting approach would be to simply extend these frameworks to run on multiple
machines: retain their programming models and rewrite their guts to play well in
distributed settings. However, the challenges of distributed computing require us to
rethink many of the basic assumptions that we rely on in single-node systems. For
example, because data must be partitioned across many nodes on a cluster, algorithms
that have wide data dependencies will suffer from the fact that network transfer rates
are orders of magnitude slower than memory accesses. As the number of machines
working on a problem increases, the probability of a failure increases. These facts
require a programming paradigm that is sensitive to the characteristics of the
underlying system: one that discourages poor choices and makes it easy to write code
that will execute in a highly parallel manner.
Of course, single-machine tools like PyData and R that have come to recent
prominence in the software community are not the only tools used for data analysis.
Scientific fields like genomics that deal with large data sets have been leveraging parallel
computing frameworks for decades. Most people processing data in these fields today
are familiar with a cluster-computing environment called HPC (high-performance
computing). Where the difficulties with PyData and R lie in their inability to scale,
the difficulties with HPC lie in its relatively low level of abstraction and difficulty of
use. For example, to process a large file full of DNA sequencing reads in parallel, we
must manually split it up into smaller files and submit a job for each of those files to
the cluster scheduler. If some of these jobs fail, the user must detect the failure and
manually resubmit them. If the analysis requires all-to-all operations like sorting the
entire data set, the large data set must be streamed through a single node, or the
scientist must resort to lower-level distributed frameworks like MPI, which are
difficult to program without extensive knowledge of C and distributed/networked systems.
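To make the tedium concrete, the manual split-and-submit workflow described above
might look something like the following sketch; the Slurm sbatch command, the chunk
size, and the align_chunk.sh script are all illustrative assumptions rather than part of
any particular pipeline:

    import subprocess

    LINES_PER_CHUNK = 4_000_000  # FASTQ stores each read as four lines of text

    # Manually split the large file of reads into smaller chunk files.
    chunks, buffer = [], []
    with open("reads.fastq") as reads:
        for i, line in enumerate(reads, start=1):
            buffer.append(line)
            if i % LINES_PER_CHUNK == 0:
                name = f"chunk_{len(chunks):04d}.fastq"
                with open(name, "w") as out:
                    out.writelines(buffer)
                chunks.append(name)
                buffer = []
    if buffer:  # leftover reads that did not fill a complete chunk
        name = f"chunk_{len(chunks):04d}.fastq"
        with open(name, "w") as out:
            out.writelines(buffer)
        chunks.append(name)

    # Submit one scheduler job per chunk file. Detecting failed jobs and
    # resubmitting them is still left entirely to the user.
    for chunk in chunks:
        subprocess.run(["sbatch", "align_chunk.sh", chunk], check=True)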