Application Performance Analysis of Distributed File Systems under Cloud
Computing Environment
Tiezhu Zhao
Computer College
Dongguan University of Technology
Dongguan, 523808, P.R. China
tzzhao83@163.com
Zusheng Zhang
Computer College
Dongguan University of Technology
Dongguan, 523808, P.R. China
zushengzhang@163.com
Xin Ao
Computer College
Dongguan University of Technology
Dongguan, 523808, P.R. China
283588024@qq.com
Abstract—The processing efficiency of data-intensive
applications on Hadoop when a general-purpose
distributed file system such as Lustre serves as the backend file
system is not clear. This paper focuses on the
similarities and differences between Lustre and HDFS
(Hadoop Distributed File System). We propose a
Hadoop-Lustre platform and evaluate the performance
differences between Lustre and HDFS using a set of
data-intensive computing benchmarks. Experimental results
indicate that Lustre can reach parity with HDFS, or even
outperform HDFS if a much faster network
interconnect is available. It is therefore necessary to study
non-HDFS distributed file systems to make up for the
performance shortcomings of HDFS in some MapReduce-based
application scenarios.
Keywords-Distributed file system, Hadoop, Lustre, Data-
intensive application
I. INTRODUCTION
Cloud computing has emerged as a new computing
paradigm and has led to the establishment of global data storage
and computation platforms. Cloud storage firms need to construct
global data centers, and the key problem is how to
build an efficient and stable data storage service. The
industry is therefore turning to distributed file
systems for large data center storage. Distributed file systems
can effectively address the problems of mass data storage
and I/O bottlenecks. Lin H.Y. pointed out that distributed file
systems and the MapReduce programming paradigm are the key
enabling technologies for cloud computing [1].
Hadoop is an open-source distributed compute and storage
platform, which implements the MapReduce programming model and
uses HDFS as the backend file system. HDFS
consists of a single NameNode and a number of
DataNodes. It can provide high-throughput access to
application data. The NameNode manages the file system
namespace and regulates access to files by clients.
DataNodes manage storage attached to the nodes that they
run on. HDFS exposes a file system namespace and allows
user data to be stored in files. Internally, a file is split into
one or more blocks and these blocks are stored in a set of
DataNodes. The NameNode executes file system namespace
operations like opening, closing, and renaming files and
directories. It also determines the mapping of blocks to
DataNodes. The DataNodes are responsible for serving read
and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and
replication upon instruction from the NameNode [2].
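The division of labor described above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions, not HDFS code: the class and function names (ToyNameNode, ToyDataNode, write_file, read_file), the 4-byte block size, the replication factor of 2, and the round-robin placement are all hypothetical choices made for readability. Real HDFS uses much larger blocks and its own replica placement policy.

```python
class ToyDataNode:
    """Stores block payloads; knows nothing about the namespace."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # (path, block index) -> bytes

    def write_block(self, path, index, payload):
        self.blocks[(path, index)] = payload

    def read_block(self, path, index):
        return self.blocks[(path, index)]


class ToyNameNode:
    """Holds metadata only: which DataNodes hold each block of a file."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = datanodes
        self.block_size = block_size    # tiny on purpose (assumption)
        self.replication = replication  # should not exceed len(datanodes)
        self.block_map = {}             # path -> list of replica name lists

    def allocate(self, path, length):
        # Ceiling division; an empty file still gets one (empty) block here.
        count = max(1, -(-length // self.block_size))
        names = [d.name for d in self.datanodes]
        # Round-robin placement: a stand-in for the real placement policy.
        self.block_map[path] = [
            [names[(i + r) % len(names)] for r in range(self.replication)]
            for i in range(count)
        ]
        return self.block_map[path]

    def locate(self, path):
        return self.block_map[path]


def write_file(namenode, path, data):
    """Client write: ask the NameNode for placement, then send blocks."""
    by_name = {d.name: d for d in namenode.datanodes}
    size = namenode.block_size
    for i, replicas in enumerate(namenode.allocate(path, len(data))):
        chunk = data[i * size:(i + 1) * size]
        for name in replicas:
            by_name[name].write_block(path, i, chunk)


def read_file(namenode, path):
    """Client read: look up block locations, then fetch the first replica."""
    by_name = {d.name: d for d in namenode.datanodes}
    return b"".join(
        by_name[replicas[0]].read_block(path, i)
        for i, replicas in enumerate(namenode.locate(path))
    )
```

In this sketch the NameNode never touches file contents; it only answers "where are the blocks", while the DataNodes serve the actual reads and writes, which mirrors the separation of metadata and data described in the text.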
In traditional MapReduce environments, input and output
data are stored on the HDFS, with intermediate data stored in
a local, temporary file system on the mapper nodes, and
shuffled as needed to the nodes running the reducer tasks.
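The data flow just described (input read from the file system, intermediate pairs kept near the mappers, then shuffled to reducers) can be sketched with a word count in plain Python. This is an illustrative sketch, not Hadoop's API: the function names, the CRC32-based partitioner, and the use of in-memory lists in place of HDFS input and local temporary files are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_phase(split):
    """Mapper: emit (word, 1) for each word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapper_outputs, num_reducers):
    """Route each intermediate pair to a reducer by hashing its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            # CRC32 gives a partition that is stable across runs.
            target = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[target][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reducer: sum the counts collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(splits, num_reducers=2):
    # Intermediate data: in Hadoop this would sit in local temporary
    # storage on the mapper nodes; here it is just a list in memory.
    intermediate = [map_phase(split) for split in splits]
    result = {}
    for partition in shuffle(intermediate, num_reducers):
        result.update(reduce_phase(partition))
    return result
```

Because every occurrence of a key hashes to the same reducer, each word is counted in exactly one partition, so merging the reducer outputs needs no further aggregation.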
Although Hadoop on HDFS has been widely studied
for several years, the potential performance impact on
Hadoop of non-HDFS file systems, especially in HPC
environments, is not clear. In this paper, we integrate the
Lustre file system into the Hadoop platform and propose a
Hadoop-Lustre platform, which uses the Lustre file system
as the backend file system for Hadoop storage. To
understand the performance characteristics of non-HDFS file
systems, we examine the underlying performance differences
between Lustre and HDFS in a data-intensive application
environment.
The remainder of this paper is organized as follows. We
introduce the related work in Section II. The Hadoop-Lustre
platform is proposed in Section III. We validate the
performance of Lustre and HDFS on Hadoop and discuss the
experimental results in Section IV, and conclude the
paper in Section V.
II. RELATED WORK
In designing and analyzing the performance
parameters of distributed/parallel file systems, it is necessary
to develop an analytical model to determine the potential
performance characteristics. Previous research covers the
following three aspects:
(1) Performance evaluation of distributed file systems
in specific application scenarios. Nathan R. conducted a
survey on the performance characteristics of non-HDFS file
systems on the Hadoop platform [3]. To mitigate striping overhead and
2015 2nd International Conference on Information Science and Control Engineering
978-1-4673-6850-6/15 $31.00 © 2015 IEEE
DOI 10.1109/ICISCE.2015.41
152