with such big data alone [6,9,14]. Attempts at combining Hadoop with several other technologies (such as Apache Kafka) led us to this novel project. The design of the Trinity model is based on the lambda architecture, which provides a platform for processing both batch and real-time data simultaneously. This study claims that, if implemented, the Trinity model would be capable of solving the major issues of big data. Apache Kafka [15], another open-source tool for distributed data that is used alongside Hadoop, exhibits highly desirable performance and has a promising future in real-time applications. It helps our model by queuing the incoming data in memory, from where the data can be consumed by Spark for better results. The lambda architecture [8,16] is used to incorporate the suitable technologies. There is extensive debate regarding which technologies should be incorporated into the lambda architecture for an improved workflow. Our literature review and in-depth analysis led us to choose MapReduce for batch processing and Apache Spark for real-time processing. Storage is always a requirement of any big data application [11,14]; the application should have scalable data storage that allows it to save as much data, of all types, as it requires.
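As a rough illustration of the queuing step described above, the following sketch shows how incoming records could be published to a Kafka topic from which both the batch layer and the speed layer of the lambda architecture can consume; the broker address, topic name, and record contents are placeholder assumptions, not details of the Trinity implementation.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/*
 * Minimal ingestion sketch: incoming records are queued on a Kafka topic,
 * which Spark (speed layer) and Hadoop (batch layer) can later read from.
 * Broker address and topic name are hypothetical placeholders.
 */
public class IngestionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each incoming record is appended to the "sensor-events" topic;
            // a Spark job can subscribe to this topic for real-time processing.
            producer.send(new ProducerRecord<>("sensor-events", "device-42", "{\"temp\": 21.5}"));
        }
    }
}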
3 Analysis of big data platforms
In this section, we conduct an in-depth analysis of
existing big data systems and platforms.
3.1 Hadoop
Apache Hadoop is a Java-based programming framework used for processing huge sets of data [14]. The advantage of Hadoop is that it processes data placed in a distributed computing environment. The Hadoop architecture is used by several large companies such as Facebook, Google, Yahoo, and IBM [6,14]. Fig. 1 shows the two core components of Hadoop [17]. Hadoop has its own file system, the HDFS (Hadoop Distributed File System), which facilitates fast data transfer and tolerates node failures; processing is carried out by its core function, MapReduce. The MapReduce algorithm breaks the big data into small chunks, distributes them across multiple servers/nodes, and then performs operations on them [17,18].
Figure 1 Components of Hadoop
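To make the chunk-and-distribute behavior concrete, the listing below is a minimal word-count mapper written against the standard Hadoop MapReduce API; it is a generic textbook-style illustration under our own assumptions, not code taken from the Trinity project.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Classic word-count mapper: each input split (a chunk of the data that
 * HDFS has distributed across the cluster) is fed to this map function,
 * which emits an intermediate (word, 1) pair for every token it sees.
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }
}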
The HDFS is considered highly fault-tolerant and is designed to be deployed on low-cost hardware [17,18]. It has similarities with other currently used file systems, but its distinguishing features are significant enough to make it a leading architecture in today's industry. Fig. 2 depicts the architecture of the HDFS. The HDFS stores file system metadata and application data separately [17]. The HDFS uses a master/slave scheme: the Hadoop cluster is divided into a NameNode and DataNodes. The NameNode is treated as the master node, and its function is to manage the file system namespace and regulate access to files. There is always exactly one NameNode, whereas there can be innumerable DataNodes, at least one per node in a given cluster. DataNodes are treated as slaves in the architecture and contain the actual application data. Internally, a file is divided into numerous blocks that are stored on the DataNodes [18].
The operations handled by the NameNode include opening, closing, and renaming files or directories [18]. DataNodes are given the responsibility of handling read and write operations; they also manage block replication [18].
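These NameNode and DataNode responsibilities are exposed to applications through Hadoop's FileSystem API. The sketch below, with placeholder paths and a hypothetical NameNode address, shows how a client would create, write, and rename a file on the HDFS: the namespace operations are coordinated by the NameNode, while the written bytes are streamed to DataNodes.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/*
 * Sketch of a client interacting with the HDFS. Namespace operations
 * (create, rename) are handled by the NameNode; the actual block data
 * written here is served by DataNodes. Paths and address are placeholders.
 */
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path tmp = new Path("/data/incoming/events.tmp");
            try (FSDataOutputStream out = fs.create(tmp, true)) {
                out.write("sample record\n".getBytes(StandardCharsets.UTF_8));
            }
            // Renaming is purely a metadata (NameNode) operation.
            fs.rename(tmp, new Path("/data/incoming/events.txt"));
        }
    }
}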
MapReduce is a core functionality of Hadoop. It was first introduced by Google in 2004 [18] with the sole objective of supporting the distributed computing of large data. It is one of the most popular programming models for processing large sets of data located on several servers [6,17]. The users are required to specify a map function that processes a key/value pair in order to obtain another intermedi-