Unfortunately, this trend in hardware stopped around 2005: due to hard limits in heat dissipation,
hardware developers stopped making individual processors faster, and switched toward adding
more parallel CPU cores all running at the same speed. This change meant that suddenly
applications needed to be modified to add parallelism in order to run faster, which set the stage
for new programming models such as Apache Spark.
On top of that, the technologies for storing and collecting data did not slow down appreciably in
2005, when processor speeds did. The cost to store 1 TB of data continues to roughly halve every
14 months, meaning that it is very inexpensive for organizations of all sizes to
store large amounts of data. Moreover, many of the technologies for collecting data (sensors,
cameras, public datasets, etc.) continue to drop in cost and improve in quality. Camera
technology, for example, improves in resolution and drops in cost per pixel every year, to
the point where a 12-megapixel webcam costs only $3 to $4; this has made it inexpensive to
collect a wide range of visual data, whether from people filming video or automated sensors in
an industrial setting. Cameras themselves are also the key sensors in other data collection
devices, such as telescopes and even gene-sequencing machines, driving the cost of these
technologies down as well.
The end result is a world in which collecting data is extremely inexpensive—many organizations
today even consider it negligent not to log data of possible relevance to the business—but
processing it requires large, parallel computations, often on clusters of machines. Moreover, in
this new world, the software developed in the past 50 years cannot automatically scale up, and
neither can the traditional programming models for data processing applications, creating the
need for new ones. It is this world that Apache Spark was built for.
History of Spark
Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first
published the following year in a paper entitled “Spark: Cluster Computing with Working Sets”
by Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica of the
UC Berkeley AMPlab. At the time, Hadoop MapReduce was the dominant parallel programming
engine for clusters, being the first open source system to tackle data-parallel processing on
clusters of thousands of nodes. The AMPlab had worked with multiple early MapReduce users to
understand the benefits and drawbacks of this new programming model, and was therefore able
to synthesize a list of problems across several use cases and begin designing more general
computing platforms. In addition, Zaharia had also worked with Hadoop users at UC Berkeley to
understand their needs for the platform—specifically, teams that were doing large-scale machine
learning using iterative algorithms that need to make multiple passes over the data.
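To make "multiple passes over the data" concrete, here is a minimal sketch, in spark-shell-style Scala, of the kind of iterative job those teams were running; the input file, the parsing, and the toy gradient-descent update are illustrative assumptions rather than code from the original project.

```scala
// Minimal sketch of an iterative workload; assumes the spark-shell, where
// `sc` (a SparkContext) is already defined. The file name and update rule
// are hypothetical.

// Read the dataset once and cache it in memory so that the passes below
// reuse the cached copy instead of rereading storage each time.
val points = sc.textFile("points.txt")
  .map { line =>
    val fields = line.split(",").map(_.toDouble)
    (fields(0), fields(1))          // (feature, label)
  }
  .cache()

// A toy gradient-descent loop: every iteration is another full pass
// over the same cached dataset.
var w = 0.0
for (_ <- 1 to 20) {
  val gradient = points.map { case (x, y) => x * (x * w - y) }.mean()
  w -= 0.1 * gradient
}

println(s"fitted weight: $w")
```

Without the cache() call, every one of the 20 passes would reload and reparse the input from storage, which is essentially the overhead such iterative workloads paid when expressed as a chain of MapReduce jobs.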
Across these conversations, two things were clear. First, cluster computing held tremendous
potential: at every organization that used MapReduce, brand new applications could be built
using the existing data, and many new groups began using the system beyond its initial use cases.
Second, however, the MapReduce engine made it both challenging and inefficient to build large
applications. For example, the typical machine learning algorithm might need to make 10 or 20