by adding more resources is called scale-up, or vertical scaling. The same approach
has been used for years to improve performance: add more capable hardware, and
performance goes up. But this approach can only go so far; at some point, data or query
processing will overwhelm the hardware and you have to upgrade it again. As you scale
up, hardware costs rise, and eventually it is no longer cost effective to upgrade.
Think of a hotdog stand, where replacing a slow hotdog maker with a more
experienced person, one who prepares hotdogs in less time but for higher wages,
improves efficiency. Yet efficiency can be improved only up to a certain point, because
the worker needs time to prepare each hotdog no matter how long the queue is, and
cannot serve the next customer until the current one is served. There is also no control
over customer behavior: customers can customize their orders, and payment takes each
customer a different amount of time. So scaling up takes you only so far before it starts
to bottleneck.
So when your existing resource is fully occupied, add another person to the job at the
same wage instead. That should roughly double performance, scaling throughput linearly
by distributing the work across multiple resources.
The same approach is taken in large-scale data storage and processing scenarios:
you add more commodity hardware to the network to improve performance. But adding
hardware to a network is a bit more complicated than adding more workers to a hotdog
stand. The software has to be aware of the new units of hardware and must support
dividing the processing load across multiple machines. If you let only a single system
process all the data, even when that data is stored on multiple machines, you will
eventually hit a processing-power cap. This means there has to be a way to distribute not
only the data to the new hardware on the network, but also the instructions for processing
that data and returning the results. Generally, a master node instructs all the other nodes
to do the processing and then aggregates the results from each of them. The scale-out
approach is very common in real life; from overcrowded hotdog stands to grocery store
queues, everyone uses it. So in a way, big data problems and their solutions are not so new.
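To make the scatter-gather idea concrete, here is a minimal sketch in Python, assuming a
simple word-count task: a coordinator splits the input into chunks, worker processes handle
their chunks in parallel, and the coordinator merges the partial results. This is only an
illustration of the pattern, not Hadoop code; the task, the chunking strategy, and the worker
count are assumptions chosen for brevity.

# A minimal scatter-gather sketch of the scale-out idea described above:
# a coordinator ("master") splits the data into chunks, hands each chunk
# to a worker process, and then aggregates the partial results.
# This is NOT Hadoop; the word-count task, chunk sizing, and worker count
# are illustrative assumptions only.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk_of_lines):
    """Worker: process one chunk of the data and return a partial result."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

def scale_out_word_count(lines, workers=4):
    """Coordinator: distribute chunks to workers, then merge their results."""
    # Split the input into roughly equal chunks, one per worker.
    chunk_size = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    # Scatter: each worker processes its chunk in parallel.
    with Pool(processes=workers) as pool:
        partial_results = pool.map(count_words, chunks)

    # Gather: the coordinator aggregates the partial results.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print(scale_out_word_count(sample, workers=2))

Hadoop applies the same pattern at a much larger scale: the data lives on many machines,
and the processing instructions are shipped to the nodes where the data resides.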
Apache Hadoop
Apache Hadoop is an open source project and undoubtedly the most widely used framework
for big data solutions. It is a flexible, scalable, and fault-tolerant framework for
handling massive amounts of data. It is called a framework because it is made up of many
components and evolves at a rapid pace. Its components can work together or independently,
as you choose. This book discusses Hadoop and its components in the context of HDInsight,
but all the fundamentals apply to Hadoop in general, too.
A Brief History of Hadoop
In 2003, Google published a paper, “The Google File System,” describing a scalable
distributed file system for large, data-intensive applications. It was followed in
December 2004 by “MapReduce: Simplified Data Processing on Large Clusters.” The ideas
from these papers were put into practice in an open source project, Apache Nutch. Soon
thereafter, a Hadoop subproject was started