Adaptive Join Plan Generation in Hadoop
For CPS296.1 Course Project
Gang Luo
Duke University
Durham, NC 27705
gang@cs.duke.edu
Liang Dong
Duke University
Durham, NC 27705
liang@cs.duke.edu
ABSTRACT
Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems specifically designed for group-by aggregation tasks rather than cross-table operations. On the other hand, joins in distributed database systems have never been easy either, because data location and skew make join strategies harder to optimize. Fragment-replicate join (map join) can be a clever step toward good performance in some cases, but a dangerous move in others. This paper introduces new techniques for map join that tackle these issues, and proposes a plan generator for the join types currently available.
Categories and Subject Descriptors
H.2 [Database Management]: Plan Generation
General Terms
Theory
Keywords
Hadoop, join operation
1. INTRODUCTION
The amount of data that industry and academia face is already large and keeps increasing, which makes large-scale data processing a pressing issue. Map-Reduce[5] is a popular parallel data processing framework. Its simple programming model allows users to write straightforward programs that run on hundreds of machines simultaneously, and its fault tolerance makes the system robust even on commodity hardware. These features let a Map-Reduce system scale out to a very large size at low cost by adding commodity machines to the cluster, which can greatly reduce job completion times. The open-source implementation of the Map/Reduce framework, namely Hadoop[2], has attracted much attention since its inception.
Even though it seems promising to improve data processing efficiency simply by enlarging the cluster and running jobs on more nodes, it is better to design sophisticated plans that make good use of the Map-Reduce paradigm while avoiding its side effects as much as possible. As one of the most critical operations in data processing, the join is usually more time-consuming than other kinds of work and thus has a greater impact on overall performance. Joining two datasets in Map-Reduce is conceptually simple, as we will show later. However, for the most obvious join method, which reads both tables from disk and shuffles all the data over the network to the reducers, performance can be limited by network speed. When the datasets are very large, network transfer time becomes the bottleneck, lowering the utilization of the computing resources.
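To make the shuffle cost concrete, the following is a minimal sketch of such a reduce-side (repartition) join written against the Hadoop MapReduce Java API. It assumes two comma-separated inputs whose first field is the join key and whose file names begin with "left" and "right"; the class names, the tab-separated tagging scheme, and the omitted job driver are our own illustration rather than a fixed Hadoop convention. Note that every record of both tables travels through the shuffle, which is exactly where the network becomes the bottleneck.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RepartitionJoin {

  // Mapper: emit (join key, tagged record) so the shuffle groups
  // matching rows from both tables at the same reducer.
  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Tag each record with the file it came from.
      String table = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String[] fields = line.toString().split(",", 2);
      ctx.write(new Text(fields[0]), new Text(table + "\t" + fields[1]));
    }
  }

  // Reducer: buffer the rows of each side in memory for one key,
  // then emit their cross product as the joined output.
  public static class JoinReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<>();
      List<String> right = new ArrayList<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].startsWith("left")) left.add(parts[1]);
        else right.add(parts[1]);
      }
      for (String l : left)
        for (String r : right)
          ctx.write(key, new Text(l + "," + r));
    }
  }
}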
Tools and frameworks built on top of Hadoop, for example Pig[6] or Hive[8], spare users the tedious work of programming in Java. Instead, users can write declarative (for Hive) or procedural (for Pig) queries to perform tasks that would take many lines of code in pure Hadoop. However, neither of these tools has fully addressed the problem of join selection: Pig implements fragment-replicate join (called “map join” in this paper) as well as a skew join that can handle skewed tables, but the user is expected to give hints to the compiler indicating which join method the system should use. This is not a good way of handling the problem, since the user may not know what a “map join” is or what the underlying data look like; furthermore, the user may give a wrong hint that hurts performance. Building a plan generator that smartly decides which plan to use would make these tools more attractive.
Our early experience with map join has taught us that it may require more memory than a map task possesses; the consequence is either extremely slow execution or thrashing map tasks. “Advanced join” is another potentially beneficial approach, but the overhead of this kind of join should not be neglected.
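To illustrate why memory is the limiting factor, here is a minimal sketch of a fragment-replicate (map) join mapper using the Hadoop Java API. The small table is loaded into an in-memory hash map during setup() and probed for every record of the large table, so no shuffle or reduce phase is needed. The local file name small_table.csv is an assumed path (for example, a file made available on every node through the Distributed Cache), and the class name is our own. If the small table does not fit in the map task's heap, the task slows down dramatically or fails, which is the danger described above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Build the in-memory hash table; its size is bounded by the map
    // task's heap, not by the cluster.
    try (BufferedReader in =
        new BufferedReader(new FileReader("small_table.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(",", 2);
        smallTable.put(fields[0], fields[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Probe the hash table with each record of the large table.
    String[] fields = line.toString().split(",", 2);
    String match = smallTable.get(fields[0]);
    if (match != null) {
      ctx.write(new Text(fields[0]), new Text(fields[1] + "," + match));
    }
  }
}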
Using the Distributed Cache[1] to copy files to each node could potentially improve performance, but our preliminary experiments suggested otherwise, which is one of the issues this paper tries to resolve. Beyond that, our work focuses on extending the “original” map join implementation so that it works in more cases. More importantly, we will propose a cost-based plan generator for efficient joins