Chapter 1
Big Data Technology Landscape
We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing
exponentially. Data generated today is several orders of magnitude larger than what was generated just a few years
ago. The challenge is how to get business value out of this data. This is the problem that big data–related
technologies aim to solve. Therefore, big data has become one of the hottest technology trends over the last
few years. Some of the most active open source projects are related to big data, and the number of these
projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large
established companies are making significant investments in big data technologies.
Although the term “big data” is hot, its definition is vague. People define it in different ways. One
definition relates to the volume of data; another definition relates to the richness of data. Some define big
data as data that is “too big” by traditional standards, whereas others define big data as data that captures
more nuances about the entity it represents. An example of the former would be a dataset whose
volume runs into hundreds of terabytes or petabytes. If this data were stored in a traditional relational database
(RDBMS) table, it would have billions of rows. An example of the latter definition is a dataset with extremely
wide rows. If this data were stored in a relational database table, it would have thousands of columns.
Another popular definition of big data is data characterized by three Vs: volume, velocity, and variety. I just
discussed volume. Velocity means that data is generated at a fast rate. Variety refers to the fact that data can
be unstructured, semi-structured, or multi-structured.
Standard relational databases could not easily handle big data. The core technology for these databases
was designed several decades ago when few organizations had petabytes or even terabytes of data. Today
it is not uncommon for some organizations to generate terabytes of data every day. Both the volume of data
and the rate at which it is generated are exploding. Hence there was a need for new technologies that could
not only process and analyze large volumes of data, but also ingest them at a fast pace.
Other key factors driving big data technologies include scalability, high availability, and fault
tolerance, all at low cost. Technology for processing and analyzing large datasets has been extensively
researched and has long been available in the form of proprietary commercial products. For example, MPP
(massively parallel processing) databases have been around for a while. MPP databases use a “shared-
nothing” architecture, where data is stored and processed across a cluster of nodes. Each node comes with
its own set of CPUs, memory, and disks. They communicate via a network interconnect. Data is partitioned
across a cluster of nodes. There is no contention among the nodes, so they can all process data in parallel.
Examples of such databases include Teradata, Netezza, Greenplum, ParAccel, and Vertica. Teradata was
founded in the late 1970s, and by the 1990s its database was capable of processing terabytes of data. However,
proprietary MPP products are expensive, and not everybody can afford them.
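To make the shared-nothing idea concrete, the following minimal Scala sketch simulates it on a single machine: rows are hash-partitioned across a set of nodes, each node aggregates only its own partition (no shared state, hence no contention), and a coordinator merges the partial results. The Node class and the partitioning scheme here are hypothetical, for illustration only; in a real MPP database each partition lives on separate hardware and the local aggregations run in parallel.

object SharedNothingSketch {
  // Each node owns its partition of the data; nothing is shared between nodes.
  final case class Node(id: Int, rows: Vector[(String, Long)]) {
    // Local aggregation: per-key sums computed entirely from this node's rows.
    def localSums: Map[String, Long] =
      rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val numNodes = 4
    val data = Vector(("a", 1L), ("b", 2L), ("a", 3L), ("c", 4L), ("b", 5L))

    // Partition rows by a hash of the key, as an MPP database would.
    val nodes = data
      .groupBy { case (k, _) => math.abs(k.hashCode) % numNodes }
      .map { case (id, rows) => Node(id, rows) }
      .toVector

    // Each node computes its partial result independently; in a real system
    // this happens in parallel on separate machines. The coordinator then
    // merges the partial sums into the final answer.
    val merged = nodes
      .map(_.localSums)
      .foldLeft(Map.empty[String, Long]) { (acc, partial) =>
        partial.foldLeft(acc) { case (a, (k, v)) =>
          a.updated(k, a.getOrElse(k, 0L) + v)
        }
      }

    println(merged) // e.g. Map(a -> 4, b -> 7, c -> 4)
  }
}

Because no node ever reads another node's data, adding nodes adds capacity almost linearly; this is the property that lets MPP databases, and later systems such as Spark, scale out across a cluster.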
This chapter introduces some of the open source big data–related technologies. Although it may seem
that the technologies covered in this chapter have been randomly picked, they are connected by a common
theme. Either they are used with Spark, or Spark provides a better alternative to them. As you
start using Spark, you are likely to run into these technologies. In addition, familiarity with them will
help you better understand Spark itself, which I introduce in Chapter 3.