A Communication-Efficient Parallel Algorithm for Decision Tree

Qi Meng^{1,∗}, Guolin Ke^{2,∗}, Taifeng Wang^{2}, Wei Chen^{2}, Qiwei Ye^{2}, Zhi-Ming Ma^{3}, Tie-Yan Liu^{2}

^{1} Peking University   ^{2} Microsoft Research   ^{3} Chinese Academy of Mathematics and Systems Science

^{1} qimeng13@pku.edu.cn; ^{2} {Guolin.Ke, taifengw, wche, qiwye, tie-yan.liu}@microsoft.com; ^{3} mazm@amt.ac.cn
Abstract
Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called Parallel Voting Decision Tree (PV-Tree), to tackle this challenge. After partitioning the training data onto a number of (e.g., M) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-k attributes are selected from each machine according to its local data. Then, the globally top-2k attributes are determined by majority voting among these local candidates. Finally, the full-grained histograms of the globally top-2k attributes are collected from the local machines in order to identify the best (most informative) attribute and its split point. PV-Tree achieves a very low communication cost (independent of the total number of attributes) and thus scales out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near-optimal decision tree, since it finds the best attribute with high probability. Our experiments on real-world datasets show that PV-Tree significantly outperforms existing parallel decision tree algorithms in the trade-off between accuracy and efficiency.
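To make the per-iteration flow concrete, the following is a minimal, self-contained sketch of one PV-Tree split decision, not the authors' implementation: the function names, the histogram layout (one row of [count, sum_y, sum_y^2] per bin), and the variance-reduction gain criterion are illustrative assumptions.

```python
# Minimal sketch of one PV-Tree split decision over M data partitions.
from collections import Counter
import numpy as np

def histogram_gain(hist):
    """Best variance-reduction gain over the split points of one attribute.
    `hist` has one row per bin: [count, sum_y, sum_y_squared]."""
    count, sy, syy = hist[:, 0], hist[:, 1], hist[:, 2]
    n, s, ss = count.sum(), sy.sum(), syy.sum()
    total_loss = ss - s * s / max(n, 1e-12)
    best_gain, best_bin = 0.0, None
    cn = cs = css = 0.0
    for b in range(len(hist) - 1):                 # candidate split after bin b
        cn += count[b]; cs += sy[b]; css += syy[b]
        rn, rs, rss = n - cn, s - cs, ss - css
        if cn == 0 or rn == 0:
            continue
        gain = total_loss - (css - cs * cs / cn) - (rss - rs * rs / rn)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin

def local_vote(local_hists, k):
    """Local voting: rank attributes by the gain on this machine's
    local histograms and propose the local top-k attributes."""
    gains = {a: histogram_gain(h)[0] for a, h in local_hists.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]

def global_vote(proposals, k):
    """Global voting: the 2k attributes proposed by the most machines
    become the global candidates."""
    votes = Counter(a for proposal in proposals for a in proposal)
    return [a for a, _ in votes.most_common(2 * k)]

def pv_tree_split(per_machine_hists, k):
    """One PV-Tree iteration: only the proposals and 2k full-grained
    histograms need to cross the network."""
    proposals = [local_vote(h, k) for h in per_machine_hists]
    candidates = global_vote(proposals, k)
    best = None
    for a in candidates:                           # merge only 2k histograms
        merged = sum(h[a] for h in per_machine_hists)
        gain, split_bin = histogram_gain(merged)
        if best is None or gain > best[0]:
            best = (gain, a, split_bin)
    return best                                    # (gain, attribute, split bin)
```

In a distributed run, local_vote would execute on each worker, and only the k proposals per worker plus the merged histograms of the 2k global candidates would be communicated (e.g., via an all-reduce), which is why the communication cost is independent of the total number of attributes.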
1 Introduction

Decision tree [16] is a widely used machine learning algorithm, since it is practically effective and the rules it learns are simple and interpretable. Based on decision tree, researchers have developed further algorithms such as Random Forest (RF) [3] and Gradient Boosting Decision Trees (GBDT) [7], which have demonstrated very promising performance in various learning tasks [5].

In recent years, with the emergence of very big training data (which cannot be held on one single machine), there has been an increasing need to parallelize the training process of decision tree. To this end, there have been two major categories of attempts:²
∗ Denotes equal contribution. This work was done when the first author was visiting Microsoft Research Asia.
² There is another category of works that parallelizes the training of sub-trees once a node is split [15]; it requires the training data to be moved between machines many times and is thus inefficient. Moreover, there are also works that accelerate decision tree construction by using pre-sorting [13, 19, 11] and binning [17, 8, 10], or by employing a shared-memory-processors approach [12, 1]. However, they are out of our scope.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.