mantics to the logic running in map/reduce steps and imposed a
sorted and partitioned movement of data between map and reduce
steps [21]. These built-in semantics, ideal in some core use cases,
could be pure overhead in many other scenarios and even unde-
sirable in some. The observation here is the need for an API to
describe the structure of arbitrary DAGs without adding unrelated
semantics to that DAG structure.
Data-plane Customizability. Once the structure of distributed
computation has been defined, there can be a variety of alternative
implementations of the actual logic that executes in that structure.
These could be algorithmic, e.g., different ways of partitioning the data, or hardware-related, e.g., using remote direct memory access (RDMA) where available. In the context of MapReduce, the built-in semantics of the engine make such customizations difficult because they intrude into the implementation of the engine itself. In addition, the monolithic structure of the tasks executing the MapReduce job on the cluster makes plugging in alternative implementations difficult. This motivates making the data transformations and data movements that define the data plane completely customizable. There is a need to model different aspects of task execution in a manner that allows individual aspects of the execution, e.g., reading input or processing data, to be customized easily. Interviews with several members of the Hadoop community confirmed that evolving existing engines (e.g., changing the shuffle behavior in MapReduce) is far from trivial.
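One way to picture such a decomposition is to split a task into independently replaceable aspects. The following sketch is in Python rather than Tez's Java API, and all names in it are hypothetical; it only illustrates how injecting each aspect lets the data plane be swapped without touching the engine:

```python
# Illustrative sketch (not the Tez API): a task decomposed into
# pluggable aspects -- reading input, processing, writing output --
# so each can be replaced without modifying the engine itself.
from abc import ABC, abstractmethod

class TaskInput(ABC):
    """Pluggable input aspect, e.g. local-disk fetch vs. RDMA fetch."""
    @abstractmethod
    def read(self): ...

class TaskProcessor(ABC):
    """Pluggable application logic."""
    @abstractmethod
    def process(self, records): ...

class TaskOutput(ABC):
    """Pluggable output aspect, e.g. sorted-partitioned vs. unsorted."""
    @abstractmethod
    def write(self, records): ...

class Task:
    """The engine only orchestrates; every aspect is injected."""
    def __init__(self, inp: TaskInput, proc: TaskProcessor, out: TaskOutput):
        self.inp, self.proc, self.out = inp, proc, out

    def run(self):
        self.out.write(self.proc.process(self.inp.read()))

# Swapping any one implementation changes the data plane without
# touching Task or the other aspects.
class ListInput(TaskInput):
    def __init__(self, data): self.data = data
    def read(self): return self.data

class UpperCaseProcessor(TaskProcessor):
    def process(self, records): return [r.upper() for r in records]

class CollectOutput(TaskOutput):
    def __init__(self): self.result = []
    def write(self, records): self.result = list(records)

out = CollectOutput()
Task(ListInput(["a", "b"]), UpperCaseProcessor(), out).run()
```

The point of the sketch is only the separation of concerns: a monolithic map or reduce task fuses these aspects together, which is what makes customization intrusive.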
While other frameworks, such as [24, 15, 38], already support a more general notion of DAGs, they share the same limitation as MapReduce: built-in semantics and implementations of the data plane. With Tez we provide a lower-level abstraction that enables such semantics and specialized implementations to be added on top of a basic shared scaffolding.
Late-binding Runtime Optimizations. Applications need to
make late-binding decisions on their data processing logic for per-
formance [13]. The algorithms used, e.g., join strategies and scan mechanisms, could change based on dynamic observation of the data being read. Partition cardinality and work division could change as the
application gets a better understanding of its data and environment.
Hadoop clusters can be very dynamic in their usage and load char-
acteristics. Users and jobs enter and exit the cluster continuously
and have varying resource utilization. This makes it important for
an application to determine its execution characteristics based on
the current state of the cluster. We designed Tez to make this late-binding, on-line decision-making easier to implement by enabling updates to key abstractions at runtime.
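As a concrete illustration of one such late-binding decision, consider choosing the parallelism of a downstream stage from the data sizes actually produced upstream rather than fixing it in the static plan. This sketch is illustrative Python, not Tez code, and the parameter names and defaults are assumptions:

```python
# Illustrative sketch of a late-binding decision (not Tez code):
# pick the number of downstream tasks at runtime from observed
# upstream output sizes, instead of fixing it at plan time.
def choose_parallelism(partition_bytes, target_bytes_per_task=256 * 2**20,
                       max_tasks=1000):
    """Pick a task count so each task handles ~target_bytes_per_task."""
    total = sum(partition_bytes)
    # Ceiling division; at least one task, capped at a configured maximum.
    return max(1, min(max_tasks, -(-total // target_bytes_per_task)))

# With ~1050 MiB of observed output and a 256 MiB target per task,
# five tasks suffice even if the static plan had asked for many more.
observed = [300 * 2**20, 400 * 2**20, 350 * 2**20]
n = choose_parallelism(observed)
```

A static plan must guess these sizes before any data is read; deferring the decision until the sizes are observed avoids both over- and under-partitioning.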
This concludes our overview of historical context and rationale
for building Tez. We now turn to describing the high-level architecture of Tez and provide some insight into its key building blocks.
3. ARCHITECTURE
Apache Tez is designed and implemented with a focus on the
issues discussed above, in summary: 1) expressiveness of the underlying model, 2) customizability of the data plane, and 3) support for runtime optimizations. Instead of building a general-purpose
execution engine, we realize the need for Tez to provide a unifying
framework for creating purpose-built engines that customize data
processing for their specific needs. Tez solves the common yet
hard problem of orchestrating and running a distributed data pro-
cessing application on Hadoop and enables the application to focus
on providing specific semantics and optimizations. There is a clear
separation of concerns between the application layer and the Tez li-
brary layer. Apache Tez provides cluster resource negotiation, fault
tolerance, resource elasticity, security, built-in performance optimizations, and a shared library of ready-to-use components. The
application provides custom application logic, custom data plane
and specialized optimizations.
This leads to three key benefits: 1) amortized development costs
(Hive and Pig completely rewrote their engines using the Tez li-
braries in about 6 months), 2) improved performance (we show in
Section 6 up to 10× performance improvement while using Tez),
and 3) enabling future pipelines that leverage multiple engines to run more efficiently because of a shared substrate.
Tez is composed of a set of core APIs that define the data processing and an orchestration framework to launch it on the cluster. Applications are expected to implement these APIs to provide the execution context to the orchestration framework. It is useful to
think of Tez as a library to create a scaffolding representing the
structure of the data flow, into which the application injects its cus-
tom logic (say operators) and data transfer code (say reading from
remote machine disks). This design is both tactical and strategic: in the long term it keeps Tez application-agnostic, while in the short term it allows existing applications like Hive or Pig to leverage Tez without significant changes to their core operator pipelines. We
begin by describing the DAG API and the Runtime API. These
are the primary application facing interfaces used to describe the
DAG structure of the application and the code to be executed at run-
time. Next we explain support for applying runtime optimizations
to the DAG via an event based control plane using VertexManagers
and DataSourceInitializers. Finally, in Section 4 we describe the
YARN-based orchestration framework used to execute all of this on
a Hadoop cluster. In particular, we will focus on the performance
and production-readiness aspects of the implementation.
3.1 DAG API
The Tez DAG API is exposed to runtime engine builders as an
expressive way to capture the structure of their computation in a
concise way. The class of data processing applications we focus on is naturally represented as DAGs, where data proceeds from data sources towards data sinks while being transformed in intermediate vertices. Tez focuses on acyclic graphs; by assuming deterministic computation at the vertices and deterministic data routing on the edges, we enable re-execution-based fault tolerance, akin to [24], as further explained in Section 4.3. Modeling computations as a DAG is not new, but most systems have hitherto designed DAG APIs in the context of supporting a higher-level engine. Tez is designed with modeling this data-flow graph as its main focus. Using
well-known concepts of vertices and edges the DAG API enables a
clear and concise description of the structure of the computation.
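A minimal model of these concepts can be sketched as follows. This is plain Python, not Tez's Java DAG API, and the class and method names are hypothetical; it shows only the structural idea: named vertices connected by directed edges, with acyclicity checked up front since re-execution-based fault tolerance relies on it:

```python
# Minimal illustrative model (not the Tez API) of the DAG concepts:
# vertices, directed edges, and an acyclicity check via Kahn's
# topological-sort algorithm.
from collections import defaultdict, deque

class DAG:
    def __init__(self):
        self.vertices = set()
        self.edges = defaultdict(set)   # producer -> set of consumers

    def add_vertex(self, name):
        self.vertices.add(name)
        return self

    def add_edge(self, producer, consumer):
        self.edges[producer].add(consumer)
        return self

    def topological_order(self):
        """Return a topological ordering, or raise if the graph has a cycle."""
        indegree = {v: 0 for v in self.vertices}
        for src in self.edges:
            for dst in self.edges[src]:
                indegree[dst] += 1
        ready = deque(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for w in self.edges[v]:
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return order

# A join-style flow: two scan vertices feeding one consumer vertex.
dag = (DAG().add_vertex("scan_a").add_vertex("scan_b").add_vertex("join")
            .add_edge("scan_a", "join").add_edge("scan_b", "join"))
order = dag.topological_order()
```

The existence of a topological order is exactly what lets a scheduler re-execute a failed vertex's tasks after re-running only its upstream producers.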
Vertex. A vertex in the DAG API represents a transformation of the data and is one of the steps in processing it. This is where the core application logic is applied to the data. Hence a vertex must be configured with a user-provided processor class that defines the logic to be executed in each task. One ‘vertex’ in the DAG is often executed as a (possibly massive) number of parallel tasks. The definition of a vertex controls such parallelism.
Parallelism is usually determined by the need to process data that
is distributed across machines or by the need to divide a large op-
eration into smaller pieces. The task parallelism of a vertex may be
defined statically during DAG definition but is typically determined
dynamically at runtime.
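The coupling between a vertex, its processor, and its parallelism can be sketched as below. This is a hypothetical Python illustration, not the Tez API; the names `processor_factory`, `resolve_parallelism`, and `spawn_tasks` are invented for exposition:

```python
# Hypothetical sketch (illustrative names, not the Tez API): a vertex
# couples user-provided per-task logic with a parallelism that is
# either fixed at DAG definition or left open until runtime.
class Vertex:
    def __init__(self, name, processor_factory, parallelism=None):
        self.name = name
        self.processor_factory = processor_factory   # user logic per task
        self.parallelism = parallelism               # None = decide at runtime

    def resolve_parallelism(self, num_partitions):
        """Late-bind the task count, e.g. to the observed partition count."""
        if self.parallelism is None:
            self.parallelism = num_partitions
        return self.parallelism

    def spawn_tasks(self):
        """Instantiate one copy of the user processor per task."""
        if self.parallelism is None:
            raise RuntimeError("parallelism not yet resolved")
        return [self.processor_factory() for _ in range(self.parallelism)]

# A vertex whose parallelism is left open until runtime information
# arrives; each task runs the same user-provided logic.
v = Vertex("aggregate", processor_factory=lambda: (lambda rows: sum(rows)))
v.resolve_parallelism(num_partitions=4)
tasks = v.spawn_tasks()
```

Leaving `parallelism` unset until `resolve_parallelism` is called mirrors the point above: the task count of a vertex may be declared statically but is typically bound dynamically.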
Edge. An edge in the graph represents the logical and physical
aspects of data movement between producer and consumer vertices.