A Bloat-Aware Design for Big Data Applications
Yingyi Bu Vinayak Borkar Guoqing Xu Michael J. Carey
Department of Computer Science, University of California, Irvine
{yingyib,vborkar,guoqingx, mjcarey}@ics.uci.edu
Abstract
Over the past decade, the increasing demands on data-driven busi-
ness intelligence have led to the proliferation of large-scale, data-
intensive applications that often have huge amounts of data (often
at terabyte or petabyte scale) to process. An object-oriented pro-
gramming language such as Java is often the developer’s choice for
implementing such applications, primarily due to its quick develop-
ment cycle and rich community resource. While the use of such lan-
guages makes programming easier, significant performance prob-
lems can often be seen — the combination of the inefficiencies in-
herent in a managed run-time system and the impact of the huge
amount of data to be processed in the limited memory space often
leads to memory bloat and performance degradation at a surpris-
ingly early stage.
This paper proposes a bloat-aware design paradigm towards
the development of efficient and scalable Big Data applications in
object-oriented GC enabled languages. To motivate this work, we
first perform a study on the impact of several typical memory bloat
patterns. These patterns are summarized from the user complaints
on the mailing lists of two widely-used open-source Big Data
applications. Next, we discuss our design paradigm to eliminate bloat.
Using examples and real-world experience, we demonstrate that
programming under this paradigm does not incur significant pro-
gramming burden. We have implemented a few common data pro-
cessing tasks both using this design and using the conventional
object-oriented design. Our experimental results show that this new
design paradigm is extremely effective in improving performance
— even for the moderate-size data sets processed, we have ob-
served 2.5×+ performance gains, and the improvement grows sub-
stantially with the size of the data set.
Categories and Subject Descriptors D.3.4 [Programming Lan-
guages]: Processors—Memory management, optimization, run-
time environment; H.4 [Information Systems Applications]: Mis-
cellaneous
General Terms Languages, Design, Performance
Keywords Big Data Applications, Memory Bloat, Design
1. Introduction
Modern computing has entered the era of Big Data. The mas-
sive amounts of information available on the Internet enable com-
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ISMM’13, June 20–21, 2013, Seattle, Washington, USA.
Copyright © 2013 ACM 978-1-4503-2100-6/13/06...$15.00
puter scientists, physicists, economists, mathematicians, political
scientists, bio-informaticists, sociologists, and many others to dis-
cover interesting properties about people, things, and their inter-
actions. Analyzing information from Twitter, Google, Facebook,
Wikipedia, or the Human Genome Project requires the develop-
ment of scalable platforms that can quickly process massive-scale
data. Such frameworks often utilize large numbers of machines in a
cluster or in the cloud to process data in a parallel manner. Typical
data-processing frameworks include data-flow and message pass-
ing runtime systems. A data-flow system (such as MapReduce [19],
Hadoop [10], Hyracks [16], Spark [49], or Storm [40]) uses a dis-
tributed file system to store data and computes results by push-
ing data through a processing pipeline, while a message passing
system (such as Pregel [29] or Giraph [9]) often loads one parti-
tion of data per processing unit (machine, process, or thread) and
sends/receives messages among different units to perform compu-
tations. High-level languages (such as Hive [11], Pig [34], Flume-
Java [18], or AsterixDB [14]) are designed to describe data pro-
cessing at a more abstract level.
An object-oriented programming language such as Java is
often the developer’s choice for implementing data-processing
frameworks. In fact, the Java community has already been the
home of many data-intensive computing infrastructures, such as
Hadoop [10], Hyracks [16], Storm [40], and Giraph [9]. Spark [49]
is written in Scala, but it relies on a Java Virtual Machine (JVM) to
execute. Despite the many development benefits provided by Java,
these applications commonly suffer from severe memory bloat—a
situation where large amounts of memory are used to store infor-
mation not strictly necessary for the execution—that can lead to
significant performance degradation and reduced scalability.
Bloat in such applications stems primarily from a combination
of the inefficient memory usage inherent in the run time of a man-
aged language as well as the processing of huge volumes of data
that can exacerbate the already-existing inefficiencies by orders of
magnitude. As such, Big Data applications are much more vulner-
able to runtime bloat than regular Java applications. As an interest-
ing reference point, our experience shows that the latest (Indigo)
release of the Eclipse framework with 16 large Java projects loaded
can successfully run (without any noticeable lag) on a 2GB heap;
however, a moderate-size application on Giraph [9] with 1GB input
data can easily run out of memory on a 12 GB heap. Due to the in-
creasing popularity of Big Data applications in modern computing,
it is important to understand why these applications are so vulner-
able, how they are affected by runtime bloat, and what changes
should be made to the existing design and implementation princi-
ples in order to make them scalable.
In this paper, we describe a study of memory bloat using
two real-world Big Data applications: Hive [11] and Giraph [9],
where Hive is a large-scale data warehouse software (Apache top-
level project, powering Facebook’s data analytics) built on top
of Hadoop and Giraph is an Apache open-source graph analytics
framework initiated by Yahoo!. Our study shows that freely creat-

ing objects (as encouraged by object-orientation) is the root cause
of the performance bottleneck that prevents these applications from
scaling up to large data sets.
To gain a deep understanding of the bottleneck and how to ef-
fectively optimize it away, we break down the problem of exces-
sive object creation into two different aspects, (1) what is the space
overhead if all data items are represented by Java objects? and (2)
given all these objects, what are the memory management (i.e., GC)
costs in a typical Big Data application? These two questions are
related, respectively, to the spatial impact and the temporal impact
that object creation can have on performance and scalability.
On the one hand, each Java object has a fixed-size header space
to store its type and the information necessary for garbage collec-
tion. What constitutes the space overhead is not just object head-
ers; the other major component is from the pervasive use of object-
oriented data structures that commonly have multiple layers of del-
egations. Such delegation patterns, while simplifying development
tasks, can easily lead to wasteful memory space that stores point-
ers to form data structures, rather than the actual data needed for
the forward execution. Based on a study reported in [32], the frac-
tion of the actual data in an IBM application is only 13% of the
total used space. This impact can be significantly magnified in a
Big Data application that contains a huge number of (relatively
small) data item objects. For such small objects, the space overhead
cannot be easily amortized by the actual data content. The prob-
lem of inefficient memory usage becomes increasingly painful for
highly-parallel data-processing systems because each thread con-
sumes excessive memory resource, leading to increased I/O costs
and reduced concurrency.
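To make these overheads concrete, the following sketch (our illustration, using approximate 64-bit HotSpot sizes rather than measurements from this paper) compares one million long record IDs stored as individually boxed, pointed-to objects against the same data inlined in a primitive array:

```java
// Illustrative sketch: estimated footprint of 1,000,000 record IDs.
// Sizes are approximate 64-bit HotSpot figures: ~16 bytes per boxed
// value (8-byte header + 8-byte payload) plus an 8-byte reference to
// reach it; a long[] pays a single ~16-byte array header overall.
public class PackingSketch {
    static long boxedBytes(long n)     { return n * (16 + 8); }
    static long primitiveBytes(long n) { return 16 + n * 8; }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("boxed:     " + boxedBytes(n) + " bytes");
        System.out.println("primitive: " + primitiveBytes(n) + " bytes");
    }
}
```

Under these assumptions only a third of the boxed layout is actual data, which is the kind of low packing factor quantified in Section 2.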
On the other hand, a typical tracing garbage collection (GC) al-
gorithm periodically traverses the entire live object graph to iden-
tify and reclaim unreachable objects. For non-allocation-intensive
applications, efficient garbage collection algorithms such as a gen-
erational GC can quickly mark reachable objects and reclaim mem-
ory from dead objects, causing negligible interruptions from the
main execution threads. However, once the heap grows to be large
(e.g., a few dozens of GBs) and most objects in the heap are live, a
single GC call can become exceedingly long. In addition, because
the amount of used memory in a Big Data application is often close
to the heap size, GC can be frequently triggered and would even-
tually become the major bottleneck that prevents the main threads
from making satisfactory progress. We observe that in most Big
Data applications, a huge number of objects (representing data to
be processed in the same batch) often have the same lifetime, and
hence it is highly unnecessary for the GC to traverse each individ-
ual object every time to determine whether or not it is reachable.
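The argument above can be put in rough numbers. The following sketch is our back-of-the-envelope illustration (not a profiled measurement from the paper): a tracing collector's mark phase visits every live object, so its work grows with the object count rather than the raw data volume.

```java
// Back-of-the-envelope sketch: how many objects a mark phase must
// visit for n live records, under the two representations discussed.
public class GcTraceCount {
    // n records held as individual objects in a list: the collector
    // traces all n record objects, the list, and its backing array.
    static long tracedAsObjects(long n) { return n + 2; }

    // The same records inlined into one large array: a single object
    // to trace, independent of n.
    static long tracedAsBuffer(long n) { return 1; }
}
```

For a million same-lifetime records, the first layout gives the collector a million objects to traverse on every GC cycle; the second gives it one.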
Switch back to an unmanaged language? Switching back to
an unmanaged language such as C++ appears to be a reasonable
choice. However, our experience with many Big Data applications
(such as Hive, Pig, Jaql, Giraph, or Mahout) suggests that a Big
Data application often exhibits clear distinction between a control
path and a data path. The control path organizes tasks into the
pipeline and performs optimizations while the data path represents
and manipulates data. For example, in a typical Big Data applica-
tion that runs on a shared-nothing cluster, there is often a driver
at the client side that controls the data flow and there are multi-
ple run-time data operators executing data processing algorithms
on each machine. The execution of the driver in the control path
does not touch any actual data. Only the execution of the data op-
erators in the data path manipulates data items. While the data path
creates most of the run-time objects, its development often takes a
very small amount of coding effort, primarily because data process-
ing algorithms (e.g., joining, grouping, sorting, etc.) can be easily
shared and reused across applications.
One study we have performed on seven open-source Big Data
applications shows that the data flow path takes on average 36.8%
of the lines of source code but creates more than 90% of the
objects during execution. Details of this study can be found in
Section 4.3. Following the conventional object-oriented design for
the control path is often harmless; it is the data path that needs a
non-conventional design and an extremely careful implementation.
As the control path takes the majority of the development work,
it is unnecessary to force developers to switch to an unmanaged
language for the whole application where they have to face the (old)
problems of low productivity, less community resource, manual
memory management, and error-prone implementations.
Although the inefficient use of memory in an object-oriented
language is a known problem and has been studied before (e.g.,
in [32]), there does not exist any systematic analysis of its impact
on Big Data applications. In this paper, we study this impact both
analytically and empirically. We argue that the designers of a Big
Data application should strictly follow the following principle: the
number of data objects in the system has to be bounded and can-
not grow proportionally with the size of the data to be processed.
To achieve this, we propose a new design paradigm that advocates
merging small objects in the storage and accessing the merged ob-
jects using data processors. The idea is inspired by an old mem-
ory management technique called page-based record management,
which has been used widely to build database systems. We adopt
the proposed design paradigm in our “build-from-scratch” general-
purpose Big Data processing framework (called Hyracks) at the
application (Java) level, without requiring the modification of the
JVM or the Java compiler. We demonstrate, using both examples
and experience, that writing programs using the proposed design
paradigm does not create much burden for developers.
We have implemented several common data processing tasks
by following both this new design paradigm and the conventional
object-oriented design principle. Our experimental results demon-
strate that the implementations following the new design can scale
to much larger data sizes than those following the conventional
design. We believe that this new design paradigm is valuable in
guiding the future implementations of Big Data applications using
managed object-oriented languages. The observations made in this
paper strongly call for novel optimization techniques targeting Big
Data applications. For example, optimizer designers can develop
automated compiler and/or runtime system support (e.g., within a
JVM) to remove the identified inefficiency patterns in order to pro-
mote the use of object-oriented languages in developing Big Data
applications. Furthermore, future development of benchmark suites
should consider the inclusion of such applications to measure JVM
performance.
Contributions of this paper include:
• an analytical study on the memory usage of common Big Data
processing tasks such as the graph link analysis and the rela-
tional join. We find that the excessive creation of objects to rep-
resent and process data items is the bottleneck that prevents Big
Data applications from scaling up to large datasets (Section 2);
• a bloat-aware design paradigm for the development of highly-
efficient Big Data applications; instead of building a new mem-
ory system to solve the memory issues from scratch, we propose
two application-level optimizations: (1) merging (inlining) a
chunk of small data item objects with the same lifetime into a
few large objects (e.g., a few byte arrays) and (2) manipulating
data by directly accessing the merged objects (i.e., at the binary
level), in order to mitigate the observed memory bloat patterns
(Section 3);
• a set of experimental results (Section 5) that demonstrate sig-
nificant memory and time savings using the design. We report

our experience of programming for real-world data-processing
tasks in the Hyracks platform (Section 4). We compare the per-
formance of several Big Data applications with and without us-
ing the proposed design; the experimental results show that our
optimizations are extremely effective (Section 5).
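To preview the paradigm, the two optimizations above might look like the following sketch. This is our illustration only; the class and method names are hypothetical and are not Hyracks APIs. Many small (id, rank) records are inlined into one shared byte array, and accessor methods read them at the binary level without allocating an object per record.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the design paradigm: (1) small data items
// with the same lifetime are merged into one buffer; (2) data is
// manipulated by accessing the merged buffer directly in binary form.
public class VertexPage {
    static final int RECORD_SIZE = Long.BYTES + Double.BYTES; // 16 bytes

    private final ByteBuffer buf;
    private int count = 0;

    VertexPage(int capacity) {
        buf = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    // Append one (id, rank) record into the shared page.
    void append(long id, double rank) {
        buf.putLong(count * RECORD_SIZE, id);
        buf.putDouble(count * RECORD_SIZE + Long.BYTES, rank);
        count++;
    }

    // Accessors decode fields in place: no object is created per
    // record, so the object count stays bounded as the data grows.
    long idAt(int i)     { return buf.getLong(i * RECORD_SIZE); }
    double rankAt(int i) { return buf.getDouble(i * RECORD_SIZE + Long.BYTES); }
    int size()           { return count; }
}
```

From the collector's perspective the whole page is a single heap object, and all records in it share one lifetime, which is exactly the bounded-object-count property the principle above calls for.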
2. Memory Analysis of Big Data Applications
In this section, we study two popular data-intensive applications,
Giraph [9] and Hive [11], to investigate the impact of creating
Java objects to represent and process data on performance and
scalability. Our analysis drills down to two fundamental problems,
one in space and one in time: (1) large space consumed by object
headers and object references, leading to low packing factor of
the memory, and (2) massive amounts of objects and references,
leading to poor GC performance. We analyze these two problems
using examples from Giraph and Hive, respectively.
2.1 Low Packing Factor
In the Java runtime, each object requires a header space for type
and memory management purposes. An additional space is needed
by an array to store its length. For instance, in the Oracle 64-
bit HotSpot JVM, the header spaces for a regular object and for
an array take 8 and 12 bytes, respectively. In a typical Big Data
application, the heap often contains many small objects (such as
Integers representing record IDs), for which the overhead
incurred by headers cannot be easily amortized by the actual data
content. Space inefficiencies are exacerbated by the pervasive uti-
lization of object-oriented data structures. These data structures of-
ten use multiple levels of delegation to achieve their functionality,
leading to large space storing pointers instead of the actual data. In
order to measure the space inefficiencies introduced by the use of
objects, we employ a metric called packing factor, which is defined
as the maximal amount of actual data that can be accommodated into
a fixed amount of memory. While a similar analysis [32] has been
conducted to understand the health of Java collections, our analysis
is specific to Big Data applications where a huge amount of data
flow through a fixed amount of memory in a batch-by-batch manner.
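As a minimal worked instance of this metric (our arithmetic, using the 8-byte header and 8-byte reference sizes assumed in this section), consider a single double value under the two layouts:

```java
// Packing factor = actual data bytes / total memory bytes consumed.
public class PackingFactor {
    // A double reached through a reference to a boxed object:
    // 8B payload out of (8B header + 8B payload + 8B pointer).
    static double boxedDouble() { return 8.0 / (8 + 8 + 8); }

    // The same value stored inline in a large double[]: the single
    // array header is amortized away, so nearly every byte is data.
    static double inlineDouble() { return 1.0; }
}
```

A heap full of boxed doubles thus packs only about a third of its bytes with actual data.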
To analyze the packing factor for the heap of a typical Big Data
application, we use the PageRank algorithm [35] (i.e., an applica-
tion built on top of Giraph [9]) as a running example. PageRank is
a link analysis algorithm that assigns weights (ranks) to each vertex
in a graph by iteratively computing the weight of each vertex based
on the weight of its inbound neighbors. This algorithm is widely
used to rank web pages in search engines.
We ran PageRank on different open-source Big Data com-
puting systems, including Giraph [9], Spark [49], and Ma-
hout [12], using a 6-rack, 180-machine research cluster. Each ma-
chine has 2 quad-core Intel Xeon E5420 processors and 16GB
RAM. We used a 70GB web graph dataset that has a total of
1,413,511,393 vertices. We found none of the three systems
could successfully process this dataset. They all crashed with
java.lang.OutOfMemoryError, even though the data par-
titioned for each machine (i.e., less than 500MB) should easily fit
into its physical memory.
We found that many real-world developers experienced simi-
lar problems. For example, we saw a number of complaints on
OutOfMemoryError from Giraph’s user mailing list, and there
were 13 bloat-related threads on Giraph’s mailing list during the
past 8 months [1]. In order to locate the bottleneck, we perform a
quantitative analysis using PageRank. Giraph contains an example
[1] http://mail-archives.apache.org/mod_mbox/giraph-user/
Figure 1. Giraph object subgraph rooted at a vertex.
Class            #Objects    Header (bytes)   Pointer (bytes)
Vertex           1           8                40
List             3           24               24
List$Array       3           36               8(m + n)
LongWritable     m + 1       8m + 8           0
DoubleWritable   n + 1       8n + 8           0
Total            m + n + 9   8(m + n) + 84    8(m + n) + 64
Table 1. Numbers of objects per vertex and their space overhead
(in bytes) in PageRank in the Sun 64-bit HotSpot JVM.
implementation of the PageRank algorithm. Part of its data
representation implementation [2] is shown below.
public abstract class EdgeListVertex<
        I extends WritableComparable,
        V extends Writable,
        E extends Writable, M extends Writable>
        extends MutableVertex<I, V, E, M> {
    private I vertexId = null;
    private V vertexValue = null;
    /** indices of its outgoing edges */
    private List<I> destEdgeIndexList;
    /** values of its outgoing edges */
    private List<E> destEdgeValueList;
    /** incoming messages from the previous iteration */
    private List<M> msgList;
    ......
    /** return the edge indices starting from 0 */
    public List<I> getEdgeIndexes() {
        ...
    }
}
Graphs handled in Giraph are labeled (i.e., both their ver-
tices and edges are annotated with values) and their edges
are directional. Class EdgeListVertex represents a graph
vertex. Among its fields, vertexId and vertexValue
store the ID and the value of the vertex, respectively. Fields
destEdgeIndexList and destEdgeValueList reference,
respectively, a list of IDs and a list of values of its outgoing edges.
msgList contains incoming messages sent to the vertex from the
previous iteration. Figure 1 visualizes the Java object subgraph
rooted at an EdgeListVertex object.
In Giraph’s PageRank implementation, the concrete types for
I, V, E, and M are LongWritable, DoubleWritable,
[2] In revision 1232166.