Performance Comparison of the Hadoop and Lustre Distributed File Systems in Cloud Environments


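"This research paper examines the application performance of distributed file systems in cloud computing environments, specifically Lustre and the Hadoop Distributed File System (HDFS). The authors build a Hadoop-Lustre platform and use it to compare how Lustre and HDFS perform on data-intensive applications. The experimental results show that, given a faster network interconnect, Lustre can reach parity with HDFS and in some cases outperform it. This suggests that non-HDFS distributed file systems should be studied to make up for the performance shortfalls of HDFS in certain MapReduce tasks."

Distributed file systems are a key technology underpinning big data processing and analysis in cloud computing environments. HDFS, as part of Apache Hadoop, is widely used to store and process large data sets, whereas Lustre is a high-performance distributed file system oriented toward parallel computing and commonly used in scientific and engineering work. The core aim of the paper is a detailed comparison of how the two file systems perform in practical applications.

The authors first raise the question of how efficiently Hadoop handles data-intensive applications when it is backed by a general-purpose distributed file system such as Lustre. HDFS performs well for big data processing, but whether it can match Lustre, which was designed for high-performance computing, is the focus of the paper. To make the comparison, the researchers built a platform that integrates Hadoop with Lustre and measured both file systems with a series of data-intensive computing benchmarks.

The results show that Lustre can reach parity with HDFS in some cases and can surpass it when a high-speed network interconnect is available. This indicates that Lustre may be the better choice for scenarios that need high-speed network transfer and low-latency operations. The paper also stresses the importance of studying non-HDFS distributed file systems, because they may compensate for the performance shortfalls of HDFS in specific MapReduce tasks. This has practical value for improving data processing in cloud computing environments and provides a theoretical basis for optimizing and customizing file systems for particular business needs.

In short, the paper analyzes the performance of distributed file systems in cloud computing environments, proposes a solution that combines Hadoop and Lustre, and emphasizes the value of research on non-HDFS file systems. These findings help in choosing a suitable distributed file system for a given workload and environment, and thereby in improving overall computing efficiency and performance.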

Application Performance Analysis of Distributed File Systems under Cloud Computing Environment

Tiezhu Zhao

Computer College

Dongguan University of Technology

Dongguan, 523808, P.R. China

tzzhao83@163.com

Zusheng Zhang

Computer College

Dongguan University of Technology

Dongguan, 523808, P.R. China

zushengzhang@163.com

Xin Ao

Computer College

Dongguan University of Technology

Dongguan, 523808, P.R. China

283588024@qq.com

Abstract—The processing efficiency of data-intensive applications on Hadoop with a general-purpose distributed file system, such as Lustre, as the backend file system is not clear. This paper focuses on the similarities and differences between Lustre and HDFS (Hadoop Distributed File System). We propose a Hadoop-Lustre platform and evaluate the performance differences between Lustre and HDFS by using a set of data-intensive computing benchmarks. Experimental results indicate that Lustre can reach parity with HDFS, or even outperform HDFS if a much faster network interconnect is available. It is necessary to study non-HDFS distributed file systems to make up for the performance shortcomings of HDFS in some MapReduce-based application scenarios.

Keywords-Distributed file system, Hadoop, Lustre, Data-intensive application

I. INTRODUCTION

Cloud computing has emerged as a new computing paradigm and has led to the establishment of global data storage and computation platforms. Cloud storage firms need to construct global data centers. However, how to build an efficient and stable data storage service is the key problem. Therefore, the industry is adopting distributed file systems for large data center storage. Distributed file systems can effectively address the problems of mass data storage and I/O bottlenecks. Lin H.Y. pointed out that distributed file systems and the MapReduce programming paradigm are the key enabling technologies for cloud computing [1].

Hadoop is an open-source distributed compute and storage platform, which implements the MapReduce programming model and uses HDFS as the backend file system. HDFS consists of a single NameNode and a number of DataNodes. It can provide high-throughput access to application data. The NameNode manages the file system namespace and regulates access to files by clients. DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode [2].
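The division of labor described above can be illustrated with a small in-memory sketch. It is not HDFS code: `ToyNameNode`, `split_into_blocks`, the DataNode names, and the round-robin replica placement are all hypothetical choices made for illustration (real HDFS uses its own placement policy). The sketch only shows the idea that a file is cut into blocks and that the NameNode keeps the block-to-DataNode mapping while holding no file data itself.

```python
def split_into_blocks(file_size: int, block_size: int) -> list[int]:
    """Return the size in bytes of each block of a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it stores metadata only."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        if replication > len(self.datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")
        self.replication = replication
        # path -> list of (block index, block size, [DataNode names])
        self.block_map = {}
        self._next = 0  # round-robin cursor (an assumption, not HDFS policy)

    def create(self, path, file_size, block_size):
        """Record which DataNodes should hold each block of a new file."""
        placements = []
        n = len(self.datanodes)
        for index, size in enumerate(split_into_blocks(file_size, block_size)):
            replicas = [self.datanodes[(self._next + r) % n]
                        for r in range(self.replication)]
            self._next = (self._next + 1) % n
            placements.append((index, size, replicas))
        self.block_map[path] = placements
        return placements


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
    for index, size, replicas in namenode.create("/data/a", 300, 128):
        print(index, size, replicas)
```

In this toy model a client would ask the NameNode where the blocks of a file live and then read or write the bytes directly on the listed DataNodes, which mirrors the metadata and data separation described in the text.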

In traditional MapReduce environments, input and output data are stored on the HDFS, with intermediate data stored in a local, temporary file system on the mapper nodes, and shuffled as needed to the nodes running the reducer tasks.
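The data flow in this paragraph (map, local intermediate data, shuffle, reduce) can be sketched with a single-process word count. This is an illustrative sketch, not the Hadoop API: the function names, the use of plain Python lists in place of files, and the CRC32-based partitioner are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def map_phase(split: str) -> list[tuple[str, int]]:
    """Mapper: emit (word, 1) pairs; stands in for data kept on the mapper's local disk."""
    return [(word, 1) for word in split.split()]


def shuffle(mapper_outputs, num_reducers: int):
    """Route every key to exactly one reducer partition and group its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            # Deterministic partitioner (assumed here; not Hadoop's own hash).
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition) -> dict[str, int]:
    """Reducer: sum the grouped counts for each key in its partition."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(input_splits, num_reducers: int = 2) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence and merge the reducer outputs."""
    intermediate = [map_phase(split) for split in input_splits]
    result = {}
    for partition in shuffle(intermediate, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

In a real cluster the input splits and the final result would live on the backend file system, while the list held in `intermediate` corresponds to the temporary local storage on the mapper nodes.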

Although Hadoop on HDFS has been widely studied for several years, the potential performance impact on Hadoop of non-HDFS file systems, especially in HPC environments, is not clear. In this paper, we integrate the Lustre file system into the Hadoop platform and propose a Hadoop-Lustre platform, which uses the Lustre file system as the backend file system for Hadoop storage. In order to understand the performance characteristics of non-HDFS file systems, we examine the underlying performance differences between Lustre and HDFS in a data-intensive application environment.

The remainder of this paper is organized as follows. We introduce the related work in Section II. The Hadoop-Lustre platform is proposed in Section III. We validate the performance of Lustre and HDFS on Hadoop and discuss the experimental results in Section IV, and conclude the paper in Section V.

II. RELATED WORK

In designing and analyzing the performance parameters of distributed/parallel file systems, it is necessary to develop an analytical model to determine the potential performance characteristics. Previous research includes the following three aspects:

(1) Performance evaluation of distributed file systems in specific application scenarios. Nathan R. conducted a survey on the performance characteristics of non-HDFS file systems with the Hadoop platform [3]. To mitigate striping overhead and

2015 2nd International Conference on Information Science and Control Engineering

DOI 10.1109/ICISCE.2015.41

152

