In order to allow for greater concurrency, we might consider spawning lightweight threads to execute each of the events in the list and then aggregate the results of the various computations. The execution of such an abstraction is presented in Figure 1b. Here, the thread T0 executing chooseAll is suspended until all the results are gathered. New lightweight threads T1 to Tn are created to execute the corresponding events. Each lightweight thread places the result of the event it executes in its corresponding slot in the result list. After all of the threads have completed execution, T0 is resumed with the result of chooseAll.
Even though semantically equivalent to the synchronous implementation, this solution will actually perform worse than the synchronous solution in many scenarios due to the overheads of scheduling, synchronization, and thread creation, even if the threads are implemented as lightweight entities or one/many-shot continuations. The most obvious case where performance degradation occurs is when each of the events encodes a very short computation. In such cases, it takes longer to allocate and schedule the thread than it takes to complete the event. However, there is more at play here: even when some events are long running, the asynchronous solution incurs additional overheads. We can loosely categorize these overheads into three groups:
• Synchronization costs: The creation of a burst of lightweight threads within a short period of time increases contention for shared resources such as channels and scheduler queues.
• Scheduling costs: Besides typical scheduling overheads, the lightweight threads that are created internally by the asynchronous primitive might not be scheduled prior to threads explicitly created by the programmer. In such a scenario, the completion of a primitive that implicitly creates threads for asynchrony is delayed. In the presence of tight interaction between threads, such as synchronous communication, if one of the threads is slow, progress is affected in all of the transitively dependent threads.
• Garbage collection costs: Creating a large number of threads in a short period of time increases allocation and might subsequently trigger a collection. This problem becomes worse in a parallel setting, where multiple mutators might be allocating in parallel.
In this paper, we explore a novel threading mechanism, called parasitic threads, that reduces costs associated with lightweight threading structures. Parasitic threads allow the expression of arbitrary asynchronous computation, while mitigating typical threading costs. Parasitic threads are especially useful to model asynchronous computations that are usually short-lived, but can also be arbitrarily long. Consider once again our chooseAll primitive. It takes an arbitrary list of events, so at any given invocation of chooseAll we do not know a priori whether a particular event is short- or long-lived.
An implementation of chooseAll using parasitic threads is described in Section 2.3.2. Such an implementation, abstractly, delays creating threads unless a parasite performs a blocking action. In practice, this alleviates scheduling and GC costs. Even when parasites block and resume, they are reified into an entity that does not impose synchronization and GC overheads. If the computation they encapsulate is long running, they can be inflated to a lightweight thread at any time during execution.
Importantly, in the implementation of chooseAll with parasitic threads, if all of the events in the list are available before the execution of chooseAll, none of the parasites created for executing individual events will block. In addition to an implementation of chooseAll, we show how parasitic threads can be leveraged at the library level to implement a collection of specialized asynchronous primitives. We believe parasitic threads are a useful runtime technique to accelerate the performance of functional runtime systems.
This paper makes the following contributions:
• The design and implementation of parasitic threads, a novel threading mechanism that allows for the expression of a logical thread of control using raw stack frames. Parasitic threads can easily be inflated into lightweight threads if necessary.
• A formal semantics governing the behavior of parasitic threads.
• A case study leveraging the expressivity of parasitic threads to implement a collection of asynchronous primitives, and illustrating the performance benefits of parasitic threads through a collection of micro-benchmarks.
• A detailed performance analysis of the runtime costs of parasitic threads over a large array of benchmarks, including a full-fledged web server. The performance analysis is performed on two distinct GC schemes to illustrate that parasitic threads are beneficial irrespective of the underlying GC.
The rest of the paper is organized as follows: in Section 2, we present our base runtime system and the design of parasitic threads. In Section 3, we provide a formal characterization of parasitic threads and show their semantic equivalence to classic threads. We discuss salient implementation details in Section 4. We present a case study illustrating the utility of parasitic threads in the construction of asynchronous primitives in Section 5. We present our experimental details and results in Section 6. Related work and concluding remarks are given in Section 7 and Section 8, respectively.
2. System Design
In this section, we describe the salient details of the design of parasitic threads and the relevant characteristics of our runtime system. We envision parasitic threads being used as a fundamental building block for constructing asynchronous primitives. With this in mind, we introduce an API for programming with parasitic threads, and show how to construct an efficient version of the chooseAll primitive introduced earlier.
2.1 Base System
Our runtime system is built on top of lightweight (green) threads and synchronous communication, and leverages a GC. Our threading system multiplexes many lightweight threads on top of a few kernel threads. We leverage one kernel thread for each processor or core of a given system. The kernel threads are also pinned to their processors; hence, the runtime views a kernel thread as a virtual processor. The number of kernel threads is determined statically and is specified by the user. Kernel threads are not created during program execution; instead, all spawn primitives create lightweight threads.
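This M:N arrangement has a rough analogue in Go, where GOMAXPROCS bounds the kernel threads executing goroutines and every spawn creates only a lightweight thread. This is an analogy for illustration, not our runtime; `spawnAndSum` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"runtime"
)

// spawnAndSum spawns n lightweight threads that are multiplexed onto
// a small, statically fixed pool of kernel threads, then collects one
// value from each.
func spawnAndSum(n int) int {
	done := make(chan int)
	for i := 0; i < n; i++ {
		go func(i int) { done <- i }(i) // lightweight thread only
	}
	sum := 0
	for i := 0; i < n; i++ {
		sum += <-done
	}
	return sum
}

func main() {
	// Analogue of the user-specified, static kernel-thread count:
	// at most 2 "virtual processors" run goroutines simultaneously.
	runtime.GOMAXPROCS(2)
	fmt.Println(spawnAndSum(8)) // 0+1+...+7 = 28
}
```

The key property shared with our runtime is that program-level spawns never create kernel threads; the kernel-thread count is fixed up front.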
Threads in our system communicate through synchronous message passing primitives based on PCML [25], a parallel definition of CML. Threads may perform sends or receives to pass data between one another. Such primitives block until a matching communication partner is present. Our system supports two different GC schemes: a stop-the-world collector with parallel allocation and a split-heap parallel GC scheme. Necessary details about both GC designs, with respect to parasites and their implementation, are given in Section 4.
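The blocking, matched-partner behavior of these send/receive primitives corresponds to rendezvous semantics, which Go's unbuffered channels happen to exhibit as well; the following sketch illustrates the semantics only, and `rendezvous` is a hypothetical name:

```go
package main

import "fmt"

// rendezvous demonstrates synchronous message passing: on an
// unbuffered channel, a send blocks until a matching receive is
// present, and vice versa, much like PCML-style send/recv.
func rendezvous() string {
	ch := make(chan string) // unbuffered: synchronous

	go func() {
		// This send suspends the lightweight thread until the
		// matching receive below is performed.
		ch <- "hello"
	}()

	return <-ch // both partners proceed together at the rendezvous
}

func main() {
	fmt.Println(rendezvous()) // hello
}
```

Because neither side can proceed without the other, a slow partner stalls everyone transitively waiting on it, which is precisely the scheduling hazard described in the overhead discussion above.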
2 2011/10/20