occurs in this period and seriously affects the performance
of MapReduce.
2.2. Data Skew in ARM on MapReduce. Data mining is the
computational process of discovering patterns in large
datasets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database sys-
tems. The overall goal of the data mining process is to extract
information from a dataset and transform it into an under-
standable structure for further use. Data mining nowadays
has become popular in healthcare because of the need for
an efficient analytical methodology to detect unknown and
valuable information in healthcare data [11]. Association is
one of the most vital approaches to data mining used to
determine frequent patterns and other interesting relation-
ships among a set of data items in a repository. Associa-
tion has a significant impact on healthcare in detecting
relationships among diseases, patient statuses, and symp-
toms. Ji et al. used association to discover infrequent causal
relationships in electronic healthcare databases [12]. Patil
et al. [13] used an Apriori algorithm to generate association
rules to classify patients suffering from type 2 diabetes.
Abdullah et al. [14] proposed a modification in an existing
Apriori algorithm to add information to medical bills.
Efficiency is the most important factor in association
mining. Parallel algorithms for ARM are not well suited to
high-dimensional data or large data volumes because they
are susceptible to data placement problems, which lead to
skew [15]. For MapReduce, data skew is an important problem
that adversely affects load balancing in ARM algorithms. MapReduce
partitions the dataset horizontally into blocks of equal size.
However, the number of frequent itemsets generated from
each block can be heavily skewed, that is, while one block
may contribute many frequent itemsets, another may have
very few, implying that the processor responsible for the
latter block is idle most of the time. Another kind of data
skew arises from how widely an itemset is spread: some itemsets
are frequent in many blocks, whereas others are frequent in
only a few. Hence, an ARM algorithm needs good load balancing.
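The block-level skew described above can be illustrated with a small sketch. It is not an Apriori implementation and not the method of any cited work: it enumerates itemsets by brute force, and the helper names, the toy transactions, and the 50% support threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def split_into_blocks(transactions, num_blocks):
    """Horizontally partition the transactions into blocks of (nearly) equal size."""
    size = -(-len(transactions) // num_blocks)  # ceiling division
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]


def locally_frequent_itemsets(block, min_support, max_len=2):
    """Return itemsets of up to max_len items whose support inside the block
    is at least min_support (a fraction of the block size)."""
    counts = Counter()
    for transaction in block:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    threshold = min_support * len(block)
    return {itemset for itemset, count in counts.items() if count >= threshold}


# Hypothetical toy data: four homogeneous transactions, then four unrelated ones.
dense = [["fever", "cough", "fatigue"]] * 4
sparse = [["a"], ["b"], ["c"], ["d"]]
blocks = split_into_blocks(dense + sparse, num_blocks=2)

# Number of locally frequent itemsets produced by each equal-sized block.
per_block = [len(locally_frequent_itemsets(block, min_support=0.5)) for block in blocks]
print(per_block)
```

Both blocks hold four transactions, yet the first yields six locally frequent itemsets and the second none, so a worker assigned the second block would finish almost immediately.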
2.3. Partitioning Skew in MapReduce. In a MapReduce
application, the outputs of map tasks are distributed among
reduce tasks via hash partitioning (by default). In the map
phase, hash partitioning typically applies the function
hash(key) mod R to determine the partition number corresponding
to each key-value pair, where R is the number of
reduce tasks. The hash function is usually adequate to evenly
distribute the data. However, if the keys of the map outputs are not
evenly distributed, hash partitioning can produce partitions of very unequal size. This
phenomenon is referred to as partitioning skew. For example,
in the Inverted Index application, the hash function may
partition intermediate data based on the first letter of a word;
reducers processing more popular letters are assigned a
disproportionate amount of data. Partitioning skew can
occur for the following reasons [16]:
(1) Skewed tuple sizes: The sizes of values in applica-
tions vary significantly, which can lead to uneven
workload distribution.
(2) Skewed key frequencies: Some keys occur more fre-
quently in intermediate data, causing reduce tasks that
process these popular keys to become overloaded.
(3) Skewed execution times: Processing a single, large
key-value pair may require more time than process-
ing multiple small pairs. Even when the partitioning
function perfectly distributes keys across reducers,
the execution times of reduce tasks may differ simply
because the key groups they are assigned contain
significantly more values.
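The first two causes can be reproduced with a minimal sketch of hash partitioning. CRC32 is used here only as a deterministic stand-in for the framework's hash function, and the function names and toy key-value pairs are assumptions for illustration.

```python
import zlib


def partition_of(key, num_reducers):
    """Partition number = hash(key) mod R, with CRC32 as a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_load(pairs, num_reducers):
    """Return per-partition record counts and value sizes in bytes
    for an iterable of (key, value) string pairs."""
    records = [0] * num_reducers
    size_bytes = [0] * num_reducers
    for key, value in pairs:
        p = partition_of(key, num_reducers)
        records[p] += 1
        size_bytes[p] += len(value.encode("utf-8"))
    return records, size_bytes


# Skewed key frequencies: one popular key dominates the intermediate data.
frequent = [("the", "doc1")] * 90 + [(w, "doc1") for w in ("alpha", "beta", "gamma", "delta", "eps")] * 2
records, _ = partition_load(frequent, num_reducers=4)
print("records per partition:", records)

# Skewed tuple sizes: record counts look balanced, but one value is very large.
sized = [("k%d" % i, "v") for i in range(8)] + [("big", "x" * 10_000)]
records2, bytes2 = partition_load(sized, num_reducers=4)
print("records per partition:", records2, "bytes per partition:", bytes2)
```

In the first case at least 90 of the 100 records land in a single partition, because every pair with the same key is sent to the same reducer. In the second case the partition holding the key "big" carries nearly all of the bytes regardless of how the record counts are spread.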
For skewed execution times, we can use domain knowl-
edge when choosing the map output partitioning scheme if
the reduce operation is expensive [17]. However, we focus
on the other two causes, skewed tuple sizes and skewed key
frequencies, which significantly lengthen job execution times
and degrade the performance of MapReduce.
Motivated by the limitations of existing solutions, we use
the partition tuning method to disperse key-value pairs
into virtual partitions and recombine the virtual partitions
when data skew occurs.
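The idea of dispersing pairs into virtual partitions and recombining them can be sketched as follows. This is not the partition tuning algorithm itself, whose details are not given at this point in the text. It assumes more virtual partitions than reducers, takes the per-virtual-partition loads as already measured, and recombines them with a simple greedy rule (largest virtual partition first, onto the currently least-loaded reducer). All names and load values are hypothetical.

```python
import heapq


def recombine(loads, num_reducers):
    """Greedily assign virtual partitions to reducers.

    loads[v] is the measured load (e.g. record count) of virtual partition v.
    Returns (assignment, reducer_loads), where assignment maps v -> reducer.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest virtual partitions first.
    for v in sorted(range(len(loads)), key=lambda i: loads[i], reverse=True):
        total, r = heapq.heappop(heap)
        assignment[v] = r
        heapq.heappush(heap, (total + loads[v], r))
    reducer_loads = [0] * num_reducers
    for v, r in assignment.items():
        reducer_loads[r] += loads[v]
    return assignment, reducer_loads


def naive_loads(loads, num_reducers):
    """Baseline: virtual partition v goes to reducer v mod R."""
    result = [0] * num_reducers
    for v, load in enumerate(loads):
        result[v % num_reducers] += load
    return result


# Hypothetical loads of eight virtual partitions, recombined for two reducers.
loads = [50, 10, 10, 10, 10, 5, 5, 0]
assignment, tuned = recombine(loads, num_reducers=2)
print("naive:", naive_loads(loads, 2), "recombined:", tuned)
```

With these loads the modulo baseline gives the two reducers 75 and 25 records, whereas the greedy recombination gives each 50. Because a virtual partition is never split, all pairs that share a key still reach the same reducer, so a single dominant key remains a lower bound on the heaviest reducer's load.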
Figure 1: MapReduce programming model (input split, map phase, intermediate data, shuffle, reduce phase, output data).
Journal of Healthcare Engineering