model as follows:
Map Stage: map(k₁, v₁) ⇒ list(k₂, v₂)
Reduce Stage: reduce(k₂, list(v₂)) ⇒ list(v₃)
In the map stage, map(k, v) reads ⟨k, v⟩ records one by one
from an input split, processes each record, and outputs new
⟨k, v⟩ records. In the reduce stage, the framework groups the
⟨k, v⟩ records into ⟨k, list(v)⟩ by the key k, and then launches
reduce(k, list(v)) to process each ⟨k, list(v)⟩ group. Hadoop
natively supports this programming model, while Dryad and
Spark provide general and user-friendly operators, such as
map(), flatMap(), groupByKey(), reduceByKey(), coGroup(),
and join(), which are built on top of map() and reduce().
Users can also write applications in high-level languages
such as the SQL-like Pig script [10], which is automatically
compiled into map() and reduce(). For optimization, users
can define a mini reduce() named combine(). We regard
combine() as reduce() since they usually share the same code
for aggregation.
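As a hedged illustration of how these operators map onto the model, the classic word count can be written with flatMap(), map(), and reduceByKey() in Spark; the SparkContext setup and HDFS paths below are assumptions, not taken from the original text. Note that reduceByKey() applies the same function on the map side as a combiner, which illustrates why combine() and reduce() usually share the aggregation code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch: word count expressed with Spark's operators
// (application name and HDFS paths are illustrative).
val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
sc.textFile("hdfs://input/wiki")                 // read one line per record
  .flatMap(line => line.split("\\s+"))           // map side: emit the words of each line
  .map(word => (word, 1))                        // emit (k2, v2) = (word, 1)
  .reduceByKey(_ + _)                            // reduce side: sum the counts; the same
                                                 // function also runs map-side as a combiner
  .saveAsTextFile("hdfs://output/wordcount")
```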
Apart from map() and reduce(), users need to write a driver
program (shown in Fig. 2) to submit an application to Spark.
The driver program can also (1) generate and broadcast data to
each task, and (2) collect the tasks' outputs. Therefore, in this
paper, we regard map(), reduce(), and the driver program as user code.
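A minimal driver-program sketch is shown below; it is not the driver in Fig. 2, and the broadcast data and RDD contents are illustrative. It demonstrates the two roles mentioned above: broadcasting data to tasks and collecting the tasks' outputs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged driver-program sketch; the data are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("DriverExample"))

// (1) generate and broadcast data to each task
val stopWords = sc.broadcast(Set("a", "an", "the"))
val filtered = sc.parallelize(Seq("a", "spark", "the", "driver"))
  .filter(word => !stopWords.value.contains(word))

// (2) collect the tasks' outputs back to the driver
val result = filtered.collect()
result.foreach(println)
sc.stop()
```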
B. Dataflow
A distributed data-parallel application consists of one or
more MapReduce jobs. As shown in Fig. 1, a job goes
through a map stage and a reduce stage (a Dryad/Spark
job can go through multiple map and reduce stages connected
as a directed acyclic graph). Each stage contains multiple
map/reduce tasks (i.e., mappers/reducers). For parallelism, the
mappers’ outputs are partitioned and each partition is shuffled
to a corresponding reducer by the framework. Dataflow refers
to the data that flows among mappers and reducers.
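As a hedged sketch of this routing, Spark's default HashPartitioner determines which reducer a ⟨k, v⟩ record is shuffled to; the partition count and key below are illustrative.

```scala
import org.apache.spark.HashPartitioner

// A hedged sketch: with 4 reducers, a map output record's key determines
// the partition (and hence the reducer) it is shuffled to.
val partitioner = new HashPartitioner(4)
val reducerId = partitioner.getPartition("someKey")   // roughly key.hashCode % 4
```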
The major difference between MapReduce and Dryad/Spark
is that Dryad/Spark supports pipelining. With pipelining, map/re-
duce tasks can continuously execute multiple user-defined
functions (e.g., run another map() after a map()) without
storing the intermediate results (e.g., the results of the first map())
to disk. In Spark, users can also explicitly tell the
framework to cache reusable intermediate results in memory
(e.g., outputs of reduce() used for the next job) using cache().
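The following hedged sketch illustrates both mechanisms (paths and functions are illustrative): the two map() calls are pipelined inside one task without materializing the first map()'s output, and cache() keeps the reduce() output in memory for reuse by the next job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged sketch of pipelining and caching; paths and functions are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("PipelineExample"))
val pairs = sc.textFile("hdfs://input/logs")
  .map(line => line.split("\t"))             // first map(): its output is pipelined into
  .map(fields => (fields(0), 1))             // the second map() without touching disk
val counts = pairs.reduceByKey(_ + _).cache()          // keep the reduce() output in memory
val frequent = counts.filter { case (_, n) => n > 10 }.count()    // job 1 reuses `counts`
counts.map { case (k, n) => s"$k\t$n" }.saveAsTextFile("hdfs://output/counts")  // job 2
```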
C. Configurations
The application’s configurations consist of two parts: (1)
Memory-related configurations directly affect memory usage.
For example, the memory limit defines the memory space
(heap size) of map/reduce tasks, and the buffer size defines the
size of the framework's buffers. (2) Dataflow-related configurations
affect the volume of data that flows among mappers and reducers.
For instance, the partition function defines how to partition the
⟨k, v⟩ records output by map(), while the partition number
defines how many partitions will be generated and how many
reducers will be launched.
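A hedged sketch of both kinds of configurations is shown below, using real Spark property names with illustrative values; the Hadoop-1.x counterparts are noted in the comments.

```scala
import org.apache.spark.SparkConf

// A hedged sketch; the values are illustrative, not recommendations.
val conf = new SparkConf()
  // memory-related: heap size of the JVM that runs the tasks
  // (Hadoop-1.x counterpart: mapred.child.java.opts; buffer size example: io.sort.mb)
  .set("spark.executor.memory", "2g")
  // dataflow-related: default partition number, i.e., how many partitions a
  // shuffle produces and how many reducers are launched
  // (Hadoop-1.x counterpart: mapred.reduce.tasks)
  .set("spark.default.parallelism", "64")
```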
III. METHODOLOGY
A. Subjects
We took real-world data-parallel applications that run atop
Apache Hadoop and Apache Spark as our study subjects.
Since there are no dedicated bug repositories for OOM
errors (JIRA mainly covers framework bugs), users usually
post their OOM errors on open forums (e.g., StackOver-
flow.com and the Hadoop/Spark mailing lists). We found a total of
1151 issues by searching for keywords such as "Hadoop out of
memory" and "Spark OOM" in StackOverflow.com, the Hadoop
mailing list [8], the Spark user/dev mailing lists [9], developers'
blogs, and two MapReduce books [14], [15]. We manually re-
viewed each issue and selected only the issues that satisfy: (1)
The issue is a Hadoop/Spark OOM error; 786 issues are
not OOM errors (e.g., they only contain partial keywords such as
"Hadoop Memory"). (2) The OOM error occurs in a Hadoop/Spark
application, not in other service components (e.g., the scheduler
and resource manager). In total, 276 OOM errors were selected.
These errors occur in diverse Hadoop/Spark applications, such
as raw MapReduce/Spark code, Apache Pig [10], Apache Hive
[11], Apache Mahout [16], Cloud9 [17] (a Hadoop toolkit
for text processing), GraphX [18] and MLlib [12]. Based on
the approach in Section B, we identified the root causes of
123 OOM errors (listed in Table I). The root causes of the
other 153 OOM errors are unknown. Therefore, our study
covers only these 123 OOM errors (a.k.a. failures).
B. Root cause and fix pattern identification
For each OOM error, we manually reviewed the user’s error
description and the professional answers given by experts (e.g.,
Hadoop/Spark committers from cloudera.com, experienced
developers from ebay.com, and book authors). Out of the
276 OOM errors, the root causes of 123 errors were identified
in one of the following three scenarios: (1) The experts
identified the root cause and the user accepted the expert's
professional answer. (2) The user identified the root cause
and explained it (e.g., abnormal data, abnormal configurations,
or abnormal code logic) in the error description, asking only
how to fix the error. (3) We identified the cause by reproducing
the error in our cluster and manually analyzing the root cause.
Similar to the root causes, we collected the fix patterns from
42 OOM errors, for which either the experts provided fix methods
or the users reported successful fix methods (25 errors). We then
merged similar fix methods and obtained 11 fix patterns.
C. OOM error reproduction
To fully understand the root causes and fix patterns of OOM
errors, we reproduced 43 OOM errors (35%) that have
detailed data characteristics, reproducible user code, and
OOM stack traces. Since we did not have the same datasets
as the users, we used a public dataset (Wikipedia) and
synthetic datasets (random text and a well-known benchmark
[19]) instead. The experiments were conducted on an 11-
node cluster running Hadoop-1.2 and Spark-1.2. Each node has