arrivals. In particular, reporting decisions are made locally by each processor, which can take its own heterogeneity into account. Note that data locality is not a problem in the front end, where the servers are responsible
for gathering data from the back end and organizing them into pages. The general distribution of arrival
intervals does not change the analysis in the large system limit as the arrivals at an individual dispatcher
become Poisson. We evaluate the performance of the JIQ algorithm via simulation with a variety of service time distributions corresponding to different application workloads. Since the JIQ algorithm exploits the large scale of the system, a testbed capturing its behavior would need to contain at least hundreds, if not thousands, of servers, which is not available at the moment. We defer the implementation details of the JIQ algorithm to future publications.
Section 2 describes the JIQ algorithm with distributed dispatchers. We analyze the algorithm in the
large system limit with general service time distributions in Section 3 and discuss design extensions and
implementation issues in Section 4. Section 5 compares the performance of the JIQ algorithm with the SQ(2)
algorithm via simulation.
2 The Join-Idle-Queue Algorithm
Consider a system of n parallel homogeneous processors interconnected with commodity network components. There is an array of m dispatchers, with m ≪ n. Requests arrive at the system as a rate-nλ Poisson process.
Each request is directed to a randomly chosen dispatcher, which assigns it to a processor. The service times of requests are assumed i.i.d. with a general distribution B(·) of mean 1. We consider both PS (processor sharing) and FIFO service disciplines.
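To make the arrival model concrete, the following sketch (our illustration; the values of n, m, λ, and the horizon T are arbitrary) checks numerically that directing each request of a rate-nλ Poisson stream to a uniformly random dispatcher leaves each dispatcher with a rate-nλ/m Poisson stream, by Poisson thinning.

```python
import random

random.seed(0)
n, m, lam = 1000, 10, 0.9   # processors, dispatchers, per-processor load (assumed values)
T = 100.0                   # simulation horizon

# Generate the aggregate rate-(n*lam) Poisson arrival process.
t, arrivals = 0.0, []
while True:
    t += random.expovariate(n * lam)
    if t > T:
        break
    arrivals.append(t)

# Each request is routed to a uniformly random dispatcher; by Poisson
# thinning, each dispatcher then sees an independent rate-(n*lam/m) stream.
counts = [0] * m
for _ in arrivals:
    counts[random.randrange(m)] += 1

expected = n * lam * T / m   # = 9000 arrivals per dispatcher on average
print(expected, sum(counts) / m)
```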
The objective of the load balancing algorithm is to provide fast response time at each processor without
incurring excessive communication overhead. In particular, communication overhead on the critical path,
i.e., at the arrival of a request, is to be avoided as it adds to the overall response time. Communication
off the critical path is much less costly as it can ride on heartbeats sent from processors to job dispatchers
signalling the health of the nodes.
2.1 Preliminaries
There are two ways to think about load balancing in a system of parallel processors. As an entire system,
the processors collaborate to adapt quickly to the randomness in the arrival process and service times. On
the other hand, when focusing on a single processor, efficient load balancing changes the arrival rate to
the processor based on the number of jobs in its queue. In particular, it increases the arrival rate to idle
processors and decreases that to processors with a large queue size. This results in shorter busy cycles for
each processor and faster response time.
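To see how a queue-length-dependent arrival rate shortens busy cycles, the sketch below (our illustration, not the paper's model) simulates a single exponential server under two hypothetical regimes: a constant arrival rate of 0.9, and a state-dependent rate of 4.5 when idle and 0.5 when busy, chosen so that both have long-run busy fraction 0.9. For the constant-rate M/M/1 the mean number in system is ρ/(1 − ρ) = 9, while the state-dependent chain gives 1.8.

```python
import random

def simulate(arr_rate, T=200000.0, seed=1):
    """Simulate a single-server queue with exponential unit-rate service,
    where the arrival rate may depend on the queue length q.
    Returns (long-run busy fraction, time-average number in system)."""
    rng = random.Random(seed)
    t, q = 0.0, 0
    area, busy = 0.0, 0.0
    while t < T:
        lam = arr_rate(q)
        rate = lam + (1.0 if q > 0 else 0.0)   # total event rate
        dt = rng.expovariate(rate)
        area += q * dt
        busy += dt if q > 0 else 0.0
        t += dt
        if rng.random() < lam / rate:          # next event is an arrival
            q += 1
        else:                                  # next event is a departure
            q -= 1
    return busy / t, area / t

# (a) constant arrival rate 0.9 (plain M/M/1, load 0.9)
rho_a, n_a = simulate(lambda q: 0.9)
# (b) hypothetical state-dependent rates: boost arrivals to an idle server,
#     throttle a busy one; tuned so the busy fraction is also 0.9
rho_b, n_b = simulate(lambda q: 4.5 if q == 0 else 0.5)
print(rho_a, n_a)   # theory: 0.9 and 9
print(rho_b, n_b)   # theory: 0.9 and 1.8
```

Both servers are busy the same fraction of time, but the state-dependent one holds far fewer jobs on average, i.e., it clears its work in many short busy cycles rather than a few long ones.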
To illustrate the effect of busy-cycle length on response times, compare the two busy cycle patterns on a single processor illustrated in Fig. 2. The letter 'b' denotes 'busy' and the letter 'i' denotes 'idle'. The
two patterns can result from different load balancing schemes in the system. The load is the same for both
patterns, as they share the same mean idle time. However, pattern 2 indicates a much larger arrival rate
than pattern 1 when the processor is idle. This results in shorter busy cycles and a much shorter response
time.
[Figure 2: Pattern 1 shows a few long busy periods separated by idle gaps; Pattern 2 shows many short busy periods frequently interleaved with idle gaps.]
Figure 2: Busy (b) / idle (i) patterns of a processor.
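A concrete (hypothetical) instance of the two patterns: a single FIFO server with deterministic unit service times receives, under pattern 1, a batch of four jobs every 5 time units, and under pattern 2, one job every 1.25 time units. Both impose load 0.8, but pattern 2's shorter busy cycles yield a much shorter mean response time:

```python
def fifo_response_times(arrivals, service=1.0):
    """Response times at a single FIFO server with deterministic service."""
    free_at, resp = 0.0, []
    for a in arrivals:
        start = max(free_at, a)      # wait if the server is still busy
        free_at = start + service
        resp.append(free_at - a)
    return resp

# Pattern 1: batches of 4 jobs every 5 time units (busy 4, idle 1 -> load 0.8)
p1 = [c * 5.0 for c in range(1000) for _ in range(4)]
# Pattern 2: one job every 1.25 time units (busy 1, idle 0.25 -> load 0.8)
p2 = [k * 1.25 for k in range(4000)]

r1 = fifo_response_times(p1)
r2 = fifo_response_times(p2)
print(sum(r1) / len(r1))   # 2.5: jobs queue behind their own batch
print(sum(r2) / len(r2))   # 1.0: every job starts service immediately
```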
Motivated by the above, we compare the rate of arrival to an idle processor, λ₀, for the following three algorithms. With rate-nλ arrivals and n processors of service rate 1, the load on the system is λ. The Random