Getting Started with Talend Big Data
As you can see, there is a dedicated project for each task that you need to
accomplish in a Hadoop cluster, as explained in the following points:
• HDFS is the main layer where the data is stored. We will see in the
following chapter how to use TOSBD to read and write data in it.
More information can be found at
http://hadoop.apache.org/docs/stable1/hdfs_design.html.
• MapReduce is a framework used to process large amounts of data stored
in HDFS. It relies on a map function that processes key/value pairs and a
reduce function that merges all the values associated with the same key,
as explained in the following publication:
http://research.google.com/archive/mapreduce.html.
• In this book, we will use several high-level projects built on top of HDFS,
such as Pig and Hive, to generate the MapReduce code and manipulate
the data more easily than by coding the MapReduce jobs ourselves.
• Other projects, such as Flume and Sqoop, are used for integration with
industry frameworks and tools, such as an RDBMS in the case of Sqoop.
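To make the map and reduce roles described above more concrete, here is a minimal, single-process sketch of the classic word-count example in plain Python. It does not use Hadoop or TOSBD at all; the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and chosen only for illustration, and the shuffle step merely imitates the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) key/value pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, imitating what the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: merge all the values that share the same key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big cluster", "data node"]
    print(word_count(sample))
```

On a real cluster, the map and reduce functions run in parallel on many nodes over data stored in HDFS; this sketch only shows the shape of the two functions and the key-based grouping between them.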
The deeper you get into Big Data projects, the more skills you need and the more
time it takes to ramp up on the different projects and frameworks. TOSBD helps to
reduce this ramp-up time by providing a comprehensive set of graphical tools that
ease the pain of starting and developing such projects.
Prerequisites for running examples
As described earlier in this chapter, this book explains how to implement Big Data
Hadoop jobs using TOSBD. To follow along, you will need the following technical assets:
• A Windows/Linux/Mac OS machine
• Oracle (Sun) Java JDK 7, which is required to install and run TOSBD and is
available at
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
• Cloudera CDH Quick Start VM, a Hadoop distribution that by default
contains a ready-to-use single-node Apache Hadoop installation, available at
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
• VMware Player or VirtualBox, free for personal use (for Windows and
Linux only), to run the Cloudera VM. They are available at
https://my.vmware.com/en/web/vmware/free#desktop_end_user_computing/vmware_player/6_0
and https://www.virtualbox.org/wiki/Downloads
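Before installing TOSBD, it can be useful to confirm which Java version is on the machine. The following is a small sketch, not part of TOSBD, that runs the standard java -version command and extracts the major version from its output. The helper names (parse_java_major, installed_java_major) are hypothetical, and the parsing assumes the usual output form in which the version appears in double quotes, for example "1.7.0_80" for JDK 7.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` style output.

    Legacy version strings such as "1.7.0_80" map to 7; newer ones such as
    "17.0.2" map to 17. Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> Optional[int]:
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # `java -version` normally writes to stderr; stdout is checked as a fallback.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java was not found on the PATH.")
    else:
        print(f"Detected Java major version: {major}")
```

This only reports what the java command on the PATH resolves to; it does not tell you whether that installation is a full JDK or which Java installation TOSBD will actually pick up.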