mantics to the logic running in map/reduce steps and imposed a
sorted and partitioned movement of data between map and reduce
steps [21]. These built-in semantics, ideal in some core use cases,
could be pure overhead in many other scenarios and even unde-
sirable in some. The observation here is the need for an API to
describe the structure of arbitrary DAGs without adding unrelated
semantics to that DAG structure.
Data-plane Customizability. Once the structure of distributed
computation has been defined, there can be a variety of alternative
implementations of the actual logic that executes in that structure.
These could be algorithmic, e.g., different ways of partitioning the data, or hardware-related, e.g., using remote direct memory access (RDMA) where available. In the context of MapReduce, the built-in semantics of the engine make such customizations difficult because they intrude into the implementation of the engine itself. In addition, the monolithic structure of the tasks executing the MapReduce job on the cluster makes plugging in alternative implementations difficult. This motivates making the data transformations and data movements that define the data plane completely customizable. There is a need to model different aspects of task execution in a manner that allows individual aspects of the execution, e.g., reading input or processing data, to be customized easily. Interviews with several members of the Hadoop community confirmed that evolving existing engines (e.g., changing the shuffle behavior in MapReduce) is far from trivial.
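One way to picture such a decomposition is to split a task into independently replaceable aspects. The following sketch is in Python rather than Tez's Java API, and all names in it are hypothetical; it only illustrates how injecting each aspect lets the data plane be swapped without touching the engine:

```python
# Illustrative sketch (not the Tez API): a task decomposed into
# pluggable aspects -- reading input, processing, writing output --
# so each can be replaced without modifying the engine itself.
from abc import ABC, abstractmethod

class TaskInput(ABC):
    """Pluggable input aspect, e.g. local-disk fetch vs. RDMA fetch."""
    @abstractmethod
    def read(self): ...

class TaskProcessor(ABC):
    """Pluggable application logic."""
    @abstractmethod
    def process(self, records): ...

class TaskOutput(ABC):
    """Pluggable output aspect, e.g. sorted-partitioned vs. unsorted."""
    @abstractmethod
    def write(self, records): ...

class Task:
    """The engine only orchestrates; every aspect is injected."""
    def __init__(self, inp: TaskInput, proc: TaskProcessor, out: TaskOutput):
        self.inp, self.proc, self.out = inp, proc, out

    def run(self):
        self.out.write(self.proc.process(self.inp.read()))

# Swapping any one implementation changes the data plane without
# touching Task or the other aspects.
class ListInput(TaskInput):
    def __init__(self, data): self.data = data
    def read(self): return self.data

class UpperCaseProcessor(TaskProcessor):
    def process(self, records): return [r.upper() for r in records]

class CollectOutput(TaskOutput):
    def __init__(self): self.result = []
    def write(self, records): self.result = list(records)

out = CollectOutput()
Task(ListInput(["a", "b"]), UpperCaseProcessor(), out).run()
```

The point of the sketch is only the separation of concerns: a monolithic map or reduce task fuses these aspects together, which is what makes customization intrusive.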
While other frameworks, such as [24, 15, 38], already support a more general notion of DAGs, they share the same limitation as MapReduce: built-in semantics and implementations of the data plane. With Tez we provide a lower-level abstraction that enables such semantics and specialized implementations to be added on top of a basic shared scaffolding.
Late-binding Runtime Optimizations. Applications need to
make late-binding decisions on their data processing logic for per-
formance [13]. The algorithms used, e.g., join strategies and scan mechanisms, could change based on dynamic observation of the data being read. Partition cardinality and work division could change as the
application gets a better understanding of its data and environment.
Hadoop clusters can be very dynamic in their usage and load char-
acteristics. Users and jobs enter and exit the cluster continuously
and have varying resource utilization. This makes it important for
an application to determine its execution characteristics based on
the current state of the cluster. We designed Tez to make this late-binding, on-line decision-making easier to implement by enabling updates to key abstractions at runtime.
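As a concrete illustration of one such late-binding decision, consider choosing the parallelism of a downstream stage from the data sizes actually produced upstream rather than fixing it in the static plan. This sketch is illustrative Python, not Tez code, and the parameter names and defaults are assumptions:

```python
# Illustrative sketch of a late-binding decision (not Tez code):
# pick the number of downstream tasks at runtime from observed
# upstream output sizes, instead of fixing it at plan time.
def choose_parallelism(partition_bytes, target_bytes_per_task=256 * 2**20,
                       max_tasks=1000):
    """Pick a task count so each task handles ~target_bytes_per_task."""
    total = sum(partition_bytes)
    # Ceiling division; at least one task, capped at a configured maximum.
    return max(1, min(max_tasks, -(-total // target_bytes_per_task)))

# With ~1050 MiB of observed output and a 256 MiB target per task,
# five tasks suffice even if the static plan had asked for many more.
observed = [300 * 2**20, 400 * 2**20, 350 * 2**20]
n = choose_parallelism(observed)
```

A static plan must guess these sizes before any data is read; deferring the decision until the sizes are observed avoids both over- and under-partitioning.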
This concludes our overview of historical context and rationale
for building Tez. We now turn to describing the high-level architecture of Tez and provide some insight into its key building blocks.
3. ARCHITECTURE
Apache Tez is designed and implemented with a focus on the
issues discussed above, in summary: 1) expressiveness of the underlying model, 2) customizability of the data plane, and 3) support for runtime optimizations. Instead of building a general-purpose
execution engine, we realize the need for Tez to provide a unifying
framework for creating purpose-built engines that customize data
processing for their specific needs. Tez solves the common yet
hard problem of orchestrating and running a distributed data pro-
cessing application on Hadoop and enables the application to focus
on providing specific semantics and optimizations. There is a clear
separation of concerns between the application layer and the Tez li-
brary layer. Apache Tez provides cluster resource negotiation, fault
tolerance, resource elasticity, security, built-in performance optimizations, and a shared library of ready-to-use components. The
application provides custom application logic, custom data plane
and specialized optimizations.
This leads to three key benefits: 1) amortized development costs
(Hive and Pig completely rewrote their engines using the Tez li-
braries in about 6 months), 2) improved performance (we show in
Section 6 up to 10× performance improvement while using Tez),
and 3) enabling future pipelines that leverage multiple engines to run more efficiently because of a shared substrate.
Tez is composed of a set of core APIs that define the data processing and an orchestration framework to launch it on the cluster. Applications are expected to implement these APIs to provide the execution context to the orchestration framework. It is useful to
think of Tez as a library to create a scaffolding representing the
structure of the data flow, into which the application injects its cus-
tom logic (say operators) and data transfer code (say reading from
remote machine disks). This design is both tactical and strategic: in the long term it keeps Tez application-agnostic, while in the short term it allows existing applications like Hive or Pig to leverage Tez without significant changes to their core operator pipelines. We
begin by describing the DAG API and the Runtime API. These
are the primary application facing interfaces used to describe the
DAG structure of the application and the code to be executed at run-
time. Next we explain support for applying runtime optimizations
to the DAG via an event based control plane using VertexManagers
and DataSourceInitializers. Finally, in Section 4 we describe the
YARN-based orchestration framework used to execute all of this on
a Hadoop cluster. In particular, we will focus on the performance
and production-readiness aspects of the implementation.
3.1 DAG API
The Tez DAG API is exposed to runtime engine builders as an
expressive way to capture the structure of their computation in a
concise way. The class of data processing applications we focus on is naturally represented as DAGs, where data proceeds from data sources towards data sinks while being transformed in intermediate vertices. Tez focuses on acyclic graphs; by assuming deterministic computation at the vertices and deterministic data routing on the edges, we enable re-execution-based fault tolerance, akin to [24], as further explained in Section 4.3. Modeling computations as a DAG is not new, but most systems have hitherto designed DAG APIs in the context of supporting a higher-level engine. Tez is designed with modeling this data-flow graph as its main focus. Using
well-known concepts of vertices and edges the DAG API enables a
clear and concise description of the structure of the computation.
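A minimal model of these concepts can be sketched as follows. This is plain Python, not Tez's Java DAG API, and the class and method names are hypothetical; it shows only the structural idea: named vertices connected by directed edges, with acyclicity checked up front since re-execution-based fault tolerance relies on it:

```python
# Minimal illustrative model (not the Tez API) of the DAG concepts:
# vertices, directed edges, and an acyclicity check via Kahn's
# topological-sort algorithm.
from collections import defaultdict, deque

class DAG:
    def __init__(self):
        self.vertices = set()
        self.edges = defaultdict(set)   # producer -> set of consumers

    def add_vertex(self, name):
        self.vertices.add(name)
        return self

    def add_edge(self, producer, consumer):
        self.edges[producer].add(consumer)
        return self

    def topological_order(self):
        """Return a topological ordering, or raise if the graph has a cycle."""
        indegree = {v: 0 for v in self.vertices}
        for src in self.edges:
            for dst in self.edges[src]:
                indegree[dst] += 1
        ready = deque(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for w in self.edges[v]:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return order

# A join-style flow: two scan vertices feeding one consumer vertex.
dag = (DAG().add_vertex("scan_a").add_vertex("scan_b").add_vertex("join")
            .add_edge("scan_a", "join").add_edge("scan_b", "join"))
order = dag.topological_order()
```

The existence of a topological order is exactly what lets a scheduler re-execute a failed vertex's tasks after re-running only its upstream producers.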
Vertex. A vertex in the DAG API represents a transformation of the data and is one of the steps in processing it. This is where the core application logic is applied to the data. Hence a vertex must be configured with a user-provided processor class that defines the logic to be executed in each task. One ‘vertex’ in the DAG is often executed as a (possibly massive) number of parallel tasks. The definition of a vertex controls such parallelism.
Parallelism is usually determined by the need to process data that
is distributed across machines or by the need to divide a large op-
eration into smaller pieces. The task parallelism of a vertex may be
defined statically during DAG definition but is typically determined
dynamically at runtime.
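The coupling between a vertex, its processor, and its parallelism can be sketched as below. This is a hypothetical Python illustration, not the Tez API; the names `processor_factory`, `resolve_parallelism`, and `spawn_tasks` are invented for exposition:

```python
# Hypothetical sketch (illustrative names, not the Tez API): a vertex
# couples user-provided per-task logic with a parallelism that is
# either fixed at DAG definition or left open until runtime.
class Vertex:
    def __init__(self, name, processor_factory, parallelism=None):
        self.name = name
        self.processor_factory = processor_factory   # user logic per task
        self.parallelism = parallelism               # None = decide at runtime

    def resolve_parallelism(self, num_partitions):
        """Late-bind the task count, e.g. to the observed partition count."""
        if self.parallelism is None:
            self.parallelism = num_partitions
        return self.parallelism

    def spawn_tasks(self):
        """Instantiate one copy of the user processor per task."""
        if self.parallelism is None:
            raise RuntimeError("parallelism not yet resolved")
        return [self.processor_factory() for _ in range(self.parallelism)]

# A vertex whose parallelism is left open until runtime information
# arrives; each task runs the same user-provided logic.
v = Vertex("aggregate", processor_factory=lambda: (lambda rows: sum(rows)))
v.resolve_parallelism(num_partitions=4)
tasks = v.spawn_tasks()
```

Leaving `parallelism` unset until `resolve_parallelism` is called mirrors the point above: the task count of a vertex may be declared statically but is typically bound dynamically.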
Edge. An edge in the graph represents the logical and physical
aspects of data movement between producer and consumer vertices.