32 Journal of Communications and Information Networks
from the real-time load value and the normalized
performance.
• The concept of QoS is introduced into the dis-
tributed file system. We use a tuple composed of the
number of files and total file size of the requested
task as a measure of QoS. The master node sets a
threshold based on the QoS of the requested task to
filter the set of storage nodes that meet the require-
ments of the task.
2 Related work
Load balancing is a higher-level load allocation strat-
egy than load sharing. It must distribute the system
load to each node, eliminate or avoid any load imbal-
ance problems, and optimize the overall performance
of the distributed file system. Load balancing
algorithms can be divided into two categories [11]:
static load balancing [12] and dynamic load
balancing [9,13].
Static load balancing [14,15] is also known as
state-independent balancing. It determines a load
allocation strategy before a task is triggered, meaning
the master node does not consider the real-time load
status of each storage node while processing a
request, but instead makes decisions and assigns
tasks based on known static system information.
The advantages of static load balancing are
that the logic is simple, the overhead is small, and a
task request can be quickly allocated to each storage
node. However, it does not consider the real-time
load of the storage nodes or dynamic changes in the
system state. The task assignments are made blindly
and the accuracy is low, causing task allocation to
be uneven and limiting the system load balance [10].
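The state-independent idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of any cited work: the node names, the `static_assign` helper, and the hash-modulo rule are all assumptions made for the example. The point is only that the decision uses static information (the fixed node list and the task identifier) and never consults node load.

```python
import zlib


def static_assign(task_id: str, nodes: list[str]) -> str:
    """Pick a node from static information only; node load is never consulted."""
    if not nodes:
        raise ValueError("node list must not be empty")
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    index = zlib.crc32(task_id.encode("utf-8")) % len(nodes)
    return nodes[index]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical storage nodes
chosen = static_assign("task-42", nodes)
```

Because the mapping depends only on the task identifier, the same task is always sent to the same node, even if that node is currently overloaded. This is the "blind" assignment the text refers to.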
Dynamic load balancing [10] focuses on the state
information of the system: by analyzing the real-time
load of each storage node, it allocates tasks
to the storage nodes dynamically. The advantages
of dynamic load balancing are that it can adjust the
allocation of tasks in real time based on the load in-
formation of the storage nodes, adapt to changes in
the load state of the system, and that it has excellent
flexibility. However, it also has some disadvantages.
The master node must periodically collect the status
of the storage nodes, necessitating frequent
information exchange between the master node and
storage nodes. This exchange adds network overhead
and wastes network bandwidth [10].
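The trade-off described above can be illustrated with a small Python sketch. It assumes a hypothetical `Master` class that stores the last load value reported by each storage node and assigns each task to the node with the lowest reported load. The class, the node names, and the load values are illustrative assumptions; the `reports_received` counter stands in for the status traffic that dynamic balancing requires.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master node tracking the last load reported by each node."""
    loads: dict = field(default_factory=dict)
    reports_received: int = 0  # counts status messages, i.e. the network overhead

    def report(self, node: str, load: float) -> None:
        # Called periodically by each storage node (a heartbeat-style message).
        self.loads[node] = load
        self.reports_received += 1

    def assign(self) -> str:
        # Dynamic decision: choose the node with the lowest last-reported load.
        if not self.loads:
            raise RuntimeError("no load reports received yet")
        return min(self.loads, key=self.loads.get)


master = Master()
master.report("node-a", 0.80)
master.report("node-b", 0.35)
master.report("node-c", 0.60)
first = master.assign()        # node-b has the lowest reported load
master.report("node-b", 0.95)  # node-b becomes busy and reports again
second = master.assign()       # node-c is now the least loaded
```

The assignment follows the changing load, but only because every node keeps sending reports; four status messages were needed here to make two decisions.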
The weighted rotation scheduling algorithm [16] is
an upgraded version of the round robin algorithm.
The round robin algorithm assigns tasks in turn
to each storage node in the system, so that all work
is handled in a circular pattern. In other words,
each node is given a fixed time slice and performs
a task at its designated time when its turn
comes [17,18]. A scheduling and load balancing
algorithm that considers the capabilities of each
VM, the task length of each requested job, and the
interdependency of multiple tasks was proposed
in [17].
Because plain round robin ignores these factors,
some nodes may encounter heavy loads while others
may have no task requests. This issue can be
mitigated by using a weighted round robin algorithm,
where each node receives a specific number of
requests according to its assigned weight [19,20].
The weighted rotation scheduling algorithm
sets different weighting factors for different
storage nodes based on their processing ability. The
assignment of tasks is based on the weights, and
higher priorities are assigned to storage nodes with
higher weight factors. The algorithm is more effi-
cient when dealing with requests with smaller time
spans, because the load becomes unbalanced when
handling tasks with large time spans.
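The difference between plain and weighted round robin can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the node names and weights are hypothetical, and the weighted variant uses the simplest possible scheme, in which a node with weight w receives w consecutive turns per cycle. Other weighted variants interleave the turns instead.

```python
from itertools import cycle


def round_robin(nodes):
    """Yield nodes in a fixed circular order, ignoring their capacity."""
    return cycle(nodes)


def weighted_round_robin(weights):
    """Yield nodes so that each node appears `weight` times per cycle.

    Simple expansion variant: a node with weight 3 gets 3 consecutive turns.
    """
    expanded = [node for node, w in weights.items() for _ in range(w)]
    if not expanded:
        raise ValueError("at least one node needs a positive weight")
    return cycle(expanded)


weights = {"node-a": 3, "node-b": 1, "node-c": 2}  # hypothetical weights
rr = round_robin(list(weights))
wrr = weighted_round_robin(weights)
rr_first_six = [next(rr) for _ in range(6)]
wrr_first_six = [next(wrr) for _ in range(6)]
```

Over one full cycle of six requests, plain round robin gives each node two requests, whereas the weighted version gives `node-a` three, `node-b` one, and `node-c` two, in proportion to the assigned weights. Neither version looks at how long each task runs, which is why long-running tasks can still unbalance the load.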
In the minimum connection scheduling algorithm [21],
the master node of the distributed file system
detects and records the current number of active
connections of its storage nodes in real time. When
a new request arrives, the master node assigns it to
the storage node with the smallest number of active
connections, and increments the number of active
connections for that node by one. When the task
is completed, the number of active connections is
decremented by one. If the processing capacity of
all the storage nodes in the distributed file system
is the same, the minimum connection scheduling al-
gorithm will distribute requests with large loads to
each storage node in a balanced manner, which is
more efficient. However, the real environment of a