Spark DataFrame join
Date: 2023-11-03 11:55:06
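Spark can execute a DataFrame join with several different strategies. According to the cited sources, when there is no join hint Spark tries the strategies in a fixed order: Broadcast hash join first, then Shuffle hash join, then Sort merge join, and finally Cartesian join and Broadcast nested loop join. Broadcast hash join is used when one of the tables is small enough to be broadcast. Shuffle hash join is used when the parameter spark.sql.join.preferSortMergeJoin is set to false and one table is small enough to build a hash map from. Sort merge join is used when the join keys are sortable (the keys do not have to be sorted beforehand; Spark sorts both sides itself). Cartesian join is used when the join is an inner join. Broadcast nested loop join has the lowest priority: Spark picks it only when no other strategy is applicable, and it may run out of memory (OOM). Which strategy actually runs for a given query therefore depends on the join type, the join condition, and the size of the data. [1][2][3]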
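The selection order described above can be sketched as a small decision function. This is a simplified model in plain Python, assuming an equi-join with no join hints; `JoinInfo` and `pick_join_strategy` are hypothetical names made up for illustration and are not part of the Spark API, and the real planner checks more conditions than these five flags.

```python
from dataclasses import dataclass


# Illustrative model only: these names are hypothetical, not Spark APIs.
@dataclass
class JoinInfo:
    can_broadcast: bool         # one side is small enough to be broadcast
    prefer_sort_merge: bool     # value of spark.sql.join.preferSortMergeJoin
    small_enough_to_hash: bool  # one side is small enough to build a hash map
    keys_sortable: bool         # the join keys have an ordering
    inner_like: bool            # the join is an inner join


def pick_join_strategy(j: JoinInfo) -> str:
    """Return the first applicable strategy, in the order the text describes."""
    if j.can_broadcast:
        return "broadcast hash join"
    if not j.prefer_sort_merge and j.small_enough_to_hash:
        return "shuffle hash join"
    if j.keys_sortable:
        return "sort merge join"
    if j.inner_like:
        return "cartesian join"
    # Last resort: always applicable, but may run out of memory.
    return "broadcast nested loop join"


if __name__ == "__main__":
    # Two large tables, preferSortMergeJoin left at true, sortable keys.
    big_tables = JoinInfo(
        can_broadcast=False,
        prefer_sort_merge=True,
        small_enough_to_hash=True,
        keys_sortable=True,
        inner_like=True,
    )
    print(pick_join_strategy(big_tables))  # prints "sort merge join"
```

In a real job, the strategy Spark chose can be checked by calling `explain()` on the joined DataFrame and reading the physical plan.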
#### References
- [1][2] [SparkSQL的Join的实现方式 (How SparkSQL implements joins)](https://blog.csdn.net/junkmachine/article/details/126898499)
- [3] [【极简spark教程】DataFrame常用操作 (Minimal Spark tutorial: common DataFrame operations)](https://blog.csdn.net/ljp7759325/article/details/124135234)