At run time each channel is used to transport a finite se-
quence of structured items. This channel abstraction has
several concrete implementations that use shared memory,
TCP pipes, or files temporarily persisted in a file system.
As far as the program in each vertex is concerned, channels
produce and consume heap objects that inherit from a base
type. This means that a vertex program reads and writes its
data in the same way regardless of whether a channel seri-
alizes its data to buffers on a disk or TCP stream, or passes
object pointers directly via shared memory. The Dryad sys-
tem does not include any native data model for serializa-
tion and the concrete type of an item is left entirely up to
applications, which can supply their own serialization and
deserialization routines. This decision allows us to support
applications that operate directly on existing data includ-
ing exported SQL tables and textual log files. In practice
most applications use one of a small set of library item types
that we supply such as newline-terminated text strings and
tuples of base types.
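To make this contract concrete, the following C++ sketch shows
one way a channel item interface and a simple line-record item
might look; the names (ChannelItem, Serialize, Deserialize,
LineRecord) are illustrative assumptions rather than the actual
Dryad interfaces.

  // Illustrative sketch only: these names are hypothetical and are
  // not the real Dryad interfaces.
  #include <cstddef>
  #include <string>
  #include <vector>

  // Base type from which channel items derive. A shared-memory channel
  // can pass pointers to such objects directly; a file- or TCP-based
  // channel instead calls the application-supplied serialization routines.
  class ChannelItem {
  public:
    virtual ~ChannelItem() {}
    // Append a wire-format encoding of this item to *buffer.
    virtual void Serialize(std::vector<char>* buffer) const = 0;
    // Rebuild the item from len bytes at data; returns false on a parse error.
    virtual bool Deserialize(const char* data, std::size_t len) = 0;
  };

  // Example library item: a newline-terminated text line.
  class LineRecord : public ChannelItem {
  public:
    std::string text;
    void Serialize(std::vector<char>* buffer) const override {
      buffer->insert(buffer->end(), text.begin(), text.end());
      buffer->push_back('\n');
    }
    bool Deserialize(const char* data, std::size_t len) override {
      if (len == 0 || data[len - 1] != '\n') return false;
      text.assign(data, len - 1);
      return true;
    }
  };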
A schematic of the Dryad system organization is shown
in Figure 1. A Dryad job is coordinated by a process called
the “job manager” (denoted JM in the figure) that runs
either within the cluster or on a user’s workstation with
network access to the cluster. The job manager contains
the application-specific code to construct the job’s commu-
nication graph along with library code to schedule the work
across the available resources. All data is sent directly be-
tween vertices and thus the job manager is only responsible
for control decisions and is not a bottleneck for any data
transfers.
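As a rough illustration of this division of labor, the following
C++ sketch shows the kind of control-plane state a job manager
might hold for the communication graph; the types and fields here
are hypothetical and do not reflect Dryad's real scheduling code.

  // Hypothetical control-plane state for a job manager; not Dryad's real code.
  #include <string>
  #include <vector>

  enum ChannelKind { FileChannel, TcpPipe, SharedMemoryFifo };

  struct VertexRecord {
    std::string program;     // vertex binary to execute
    int assignedComputer;    // -1 until the scheduler places it on a machine
    bool completed;
  };

  struct EdgeRecord {
    int source, destination; // indices into JobGraph::vertices
    ChannelKind channel;     // how the data plane moves items for this edge
  };

  // The job manager keeps only this bookkeeping; the items themselves flow
  // directly between vertices over their channels and never through the JM.
  struct JobGraph {
    std::vector<VertexRecord> vertices;
    std::vector<EdgeRecord> edges;
  };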
Figure 1: The Dryad system organization. The job manager (JM)
consults the name server (NS) to discover the list of available com-
puters. It maintains the job graph and schedules running vertices (V)
as computers become available using the daemon (D) as a proxy.
Vertices exchange data through files, TCP pipes, or shared-memory
channels. The shaded bar indicates the vertices in the job that are
currently running.
The cluster has a name server (NS) that can be used to
enumerate all the available computers. The name server
also exposes the position of each computer within the net-
work topology so that scheduling decisions can take account
of locality. There is a simple daemon (D) running on each
computer in the cluster that is responsible for creating pro-
cesses on behalf of the job manager. The first time a vertex
(V) is executed on a computer its binary is sent from the job
manager to the daemon and subsequently it is executed from
a cache. The daemon acts as a proxy so that the job man-
ager can communicate with the remote vertices and monitor
the state of the computation and how much data has been
read and written on its channels. It is straightforward to run
a name server and a set of daemons on a user workstation
to simulate a cluster and thus run an entire job locally while
debugging.
A simple task scheduler is used to queue batch jobs. We
use a distributed storage system, not described here, that
shares with the Google File System [21] the property that
large files can be broken into small pieces that are replicated
and distributed across the local disks of the cluster comput-
ers. Dryad also supports the use of NTFS for accessing files
directly on local computers, which can be convenient for
small clusters with low management overhead.
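The following sketch illustrates, in hypothetical terms, the
metadata such a partitioned file might carry and that a scheduler
can consult when placing vertices near their data; none of these
names come from the actual storage system.

  // Hypothetical description of one piece of a large partitioned file:
  // its size and the computers holding replicas of it. A scheduler can
  // prefer to run a vertex on (or near) one of the replica holders.
  #include <string>
  #include <vector>

  struct FilePiece {
    std::string name;                   // made-up piece identifier
    long long sizeBytes;
    std::vector<std::string> replicas;  // computers storing a copy
  };

  struct PartitionedFile {
    std::string name;
    std::vector<FilePiece> pieces;
  };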
2.1 An example SQL query
In this section, we describe a concrete example of a Dryad
application that will be further developed throughout the re-
mainder of the paper. The task we have chosen is representa-
tive of a new class of eScience applications, where scientific
investigation is performed by processing large amounts of
data available in digital form [24]. The database that we
use is derived from the Sloan Digital Sky Survey (SDSS),
available online at http://skyserver.sdss.org.
We chose the most time-consuming query (Q18) from a
published study based on this database [23]. The task is to
identify a “gravitational lens” effect: it finds all the objects
in the database that have neighboring objects within 30 arc
seconds such that at least one of the neighbors has a color
similar to the primary object’s color. The query can be
expressed in SQL as:
select distinct p.objID
from photoObjAll p
  join neighbors n             -- call this join “X”
  on p.objID = n.objID
  and n.objID < n.neighborObjID
  and p.mode = 1
  join photoObjAll l           -- call this join “Y”
  on l.objID = n.neighborObjID
  and l.mode = 1
  and abs((p.u-p.g)-(l.u-l.g))<0.05
  and abs((p.g-p.r)-(l.g-l.r))<0.05
  and abs((p.r-p.i)-(l.r-l.i))<0.05
  and abs((p.i-p.z)-(l.i-l.z))<0.05
There are two tables involved. The first, photoObjAll,
has 354,254,163 records, one for each identified astronomical
object, keyed by a unique identifier objID. These records
also include the object’s color, as a magnitude (logarithmic
brightness) in five bands: u, g, r, i and z. The second table,
neighbors, has 2,803,165,372 records, one for each object
located within 30 arc seconds of another object. The mode
predicates in the query select only “primary” objects. The
< predicate eliminates duplication caused by the neighbors
relationship being symmetric. The outputs of joins “X” and
“Y” are 932,820,679 and 83,798 records respectively, and the
final hash (which implements the distinct clause) emits
83,050 records.
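For reference, the color-similarity condition enforced by join
“Y” requires each pair of adjacent-band color differences to
agree to within 0.05 magnitudes; a minimal C++ sketch of that
predicate, using a hypothetical record struct, is:

  #include <cmath>

  // Hypothetical record holding just the columns the query reads.
  struct PhotoObj {
    long long objID;
    int mode;
    double u, g, r, i, z;  // magnitudes in the five bands
  };

  // True when every adjacent-band color difference matches to within 0.05,
  // mirroring the abs((p.u-p.g)-(l.u-l.g))<0.05 ... terms of join “Y”.
  bool SimilarColor(const PhotoObj& p, const PhotoObj& l) {
    return std::fabs((p.u - p.g) - (l.u - l.g)) < 0.05 &&
           std::fabs((p.g - p.r) - (l.g - l.r)) < 0.05 &&
           std::fabs((p.r - p.i) - (l.r - l.i)) < 0.05 &&
           std::fabs((p.i - p.z) - (l.i - l.z)) < 0.05;
  }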
The query uses only a few columns from the tables (the
complete photoObjAll table contains 2 KBytes per record).
When executed by SQLServer the query uses an index on
photoObjAll keyed by objID with additional columns for
mode, u, g, r, i and z, and an index on neighbors keyed by
objID with an additional neighborObjID column. SQL-
Server reads just these indexes, leaving the remainder of the
tables’ data resting quietly on disk. (In our experimental
setup we in fact omitted unused columns from the table, to
avoid transporting the entire multi-terabyte database across