Hive on Spark EXPLAIN in Detail: Reading the Different Join Types in a Spark Execution Plan

Updated 2024-09-05. PDF, 179 KB.

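In Hive on Spark, the EXPLAIN statement is the main tool for inspecting a query's execution plan. Hive's default execution engine is MapReduce (`hive.execution.engine` set to "mr"); when the property is set to "spark", EXPLAIN shows the plan for the Spark engine instead. The command itself is used exactly as in standard Hive: it still prints the dependency graph and the detailed plan for each stage.

The dependency graph shows how the stages depend on one another. Stages such as Move (data movement) and Stats-Aggr (statistics aggregation) are the same under both engines; only the MapReduce stages are replaced by Spark stages. The plans differ most visibly for joins:

- **Common Join**: on MapReduce, an ordinary join compiles to map and reduce tasks. On Spark, the same join appears as works inside one Spark work DAG, connected by edges that state the shuffle each one needs.
- **Map Join**: the small table is loaded into memory and matched against the big table. In Hive on Spark, a query with a map join is planned as more than one Spark stage.
- **Bucket Map Join**: when both tables are bucketed on the join key, each task only needs the matching buckets of the small table rather than the whole table. Hive on Spark also supports Sorted Merge Bucket Map Join, which merges buckets that are already sorted on the join key.
- **Skew Join**: when a few join keys carry a disproportionate share of the rows, a plain join bottlenecks on those keys. Hive handles the skewed keys separately, so such queries are also planned as multiple Spark stages.

Note that a Hive stage is not the same thing as a Spark stage: one Hive stage can correspond to several Spark stages. Because many map and reduce works fit into a single Spark work, most queries need only one Spark stage and may have fewer Hive stages than on MapReduce. Queries with map joins or skew joins are the exception and produce multiple Spark stages.

In short, EXPLAIN on Hive on Spark shows how a query will actually be executed, which lets developers find and remove likely performance bottlenecks. Knowing how each join type is planned helps in choosing a query strategy that suits the data at hand.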

Hive on Spark EXPLAIN statement



InHive,commandEXPLAINcanbeusedtoshowtheexecutionplanofaquery.The

languagemanualhaslotsofgoodinformation.ForHiveonSpark,thiscommanditselfisnot

changed.Itbehavesthesameasbefore.Itstillshowsthedependencygraph,andplansfor

eachstage.However,ifthequeryengine(hive.execution.engine)issetto“spark”,itshows

theexecutionplanwiththeSparkqueryengine,insteadofthedefault(“mr”)MapReduce

queryengine.



Dependency Graph



Dependencygraphshowsthedependencyrelationshipamongstages.ForHiveonSpark,

thereareSparkstagesinsteadofMapReducestages.Thereisnodifferenceforother

stages,forexample,Movestage,StatsAggrstage,etc..Formostqueries,thereisjustone

SparkstagesincemanymapandreduceworkscanbedoneinoneSparkwork.Therefore,

forasamequery,withHiveonSpark,theremaybelessnumberofstages.Forsomequeries,

therearemultipleSparkstages,forexample,querieswithmapjoin,skewjoin,etc..



OnethingshouldbepointedoutthathereastagemeansaHivestage.Itisverydifferentfrom

thestageconceptinSpark.AHivestagecouldcorrespondtomultiplestagesinSpark.In

Spark,astageusuallymeansagroupoftasksthatcanbeprocessedinoneexecutor.In

Hive,astagecontainsalistofoperationsthatcanbeprocessedinonejob.



Spark Stage Plan



TheplansforeachstageareshownbycommandEXPLAIN,besidesdependencygraph.For

HiveonSpark,theSparkstageisnew.ItreplacestheMapReducestageforHiveon

MapReduce.TheSparkstageshowstheSparkworkgraph,whichisaDAG(directedacyclic

graph).Itcontains:



● DAGname,thenameoftheSparkworkDAG;

● Edges,thatshowsthedependencyrelationshipamongworksinthisDAG;

● Vertices,thatshowstheoperatortreeofeachwork.



Foreachindividualoperatortree,thereisnochangeforHiveonSpark.Thedifferenceis

dependencygraph.ForMapReduce,youcan’thaveareducerwithoutamapper.ForSpark,

that’snotaproblem.Therefore,HiveonSparkcanoptimizetheplanandgetridofthose

mappersnotneeded.



TheedgeinformationisnewforHiveonSpark.ThereisnosuchinformationforMapReduce.

Differentedgetypeindicatesdifferentshufflerequirement.Forexample,
