Parallel Pareto Optimization for Subset Selection
Chao Qian^{1,2}, Jing-Cheng Shi^{1}, Yang Yu^{1}, Ke Tang^{2}, Zhi-Hua Zhou^{1*}
^1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
^2 UBRI, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
{qianc, shijc, yuy, zhouzh}@lamda.nju.edu.cn, ketang@ustc.edu.cn

* This work was supported by the NSFC (61333014, 61375061, 61329302), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the 2015 Microsoft Research Asia Collaborative Research Program.
Abstract

Subset selection, which selects a few variables from a large set, is a fundamental problem in many areas. The recently emerged Pareto Optimization for Subset Selection (POSS) method is a powerful approximation solver for this problem. However, POSS is not readily parallelizable, restricting its large-scale applications on modern computing architectures. In this paper, we propose PPOSS, a parallel version of POSS. Our theoretical analysis shows that PPOSS has good properties for parallelization while preserving the approximation quality: when the number of processors is limited (less than the total number of variables), the running time of PPOSS can be reduced almost linearly with respect to the number of processors; with increasing number of processors, the running time can be further reduced, eventually to a constant. Empirical studies verify the effectiveness of PPOSS, and moreover suggest that the asynchronous implementation is more efficient with little quality loss.
1 Introduction

Given a total set of n variables, the subset selection problem is to select a subset of size at most k that optimizes some given objective. One origin of this problem is the column subset selection problem [Gu and Eisenstat, 1996], which aims at selecting a few columns of a matrix that capture as much of the matrix as possible. Since then, subset selection has been significantly extended, and numerous applications have emerged, e.g., feature selection, sparse learning, and compressed sensing.
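Formally, writing f for the given objective and V for the set of n variables, a standard formulation consistent with the description above is:

\[
\arg\max_{S \subseteq V,\; |S| \le k} f(S).
\]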
Subset selection is NP-hard in general [Davis et al., 1997]. Much effort has been put into developing polynomial-time approximation algorithms, which fall mainly into two branches: greedy algorithms and convex relaxation methods. Greedy algorithms iteratively add or remove the single variable that best improves the given objective [Gilbert et al., 2003; Tropp, 2004] (a minimal sketch of the forward variant is given below). Albeit widely used in practice, the performance of these algorithms is limited due to their greedy nature.
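As an illustration, here is a minimal Python sketch of forward greedy selection; the set-objective `f` (larger is better) and the variable pool are assumptions for the example, not part of the original paper.

```python
# A minimal sketch of forward greedy subset selection (illustrative only).
# Assumes f: a user-supplied set function to maximize, f(subset) -> float.

def greedy_select(f, variables, k):
    """Greedily add the variable that best improves f, up to k variables."""
    selected = set()
    for _ in range(k):
        # Evaluate the objective with each remaining variable added.
        best_var, best_val = None, float("-inf")
        for v in variables:
            if v in selected:
                continue
            val = f(selected | {v})
            if val > best_val:
                best_var, best_val = v, val
        # Stop early if no variable improves the objective.
        if best_var is None or best_val <= f(selected):
            break
        selected.add(best_var)
    return selected
```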
Convex relaxation methods relax the original problem by replacing the set size constraint (i.e., the ℓ0-norm constraint) with convex constraints, e.g., the ℓ1-norm constraint [Tibshirani, 1996] and the elastic net penalty [Zou and Hastie, 2005]. However, the optimal solutions of the relaxed problem can be far from those of the original problem.
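As a concrete instance of the ℓ1 relaxation, a Lasso-based selection might look as follows; the synthetic data, the alpha value, and the use of scikit-learn are our illustrative choices, not the paper's.

```python
# Illustrative l1 relaxation of subset selection via the Lasso (scikit-learn):
# variables with nonzero coefficients form the selected subset.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))           # 100 samples, 20 candidate variables
y = X[:, :3] @ np.array([1.5, -2.0, 1.0])    # only the first 3 variables matter
y += 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)       # indices of selected variables
print(selected)                              # typically a small subset, e.g. [0 1 2]
```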
Recently, Pareto optimization has been shown to be very powerful for the subset selection problem [Qian et al., 2015c]. The Pareto Optimization for Subset Selection (POSS) method treats subset selection as a bi-objective optimization problem, which requires optimizing the given objective and minimizing the subset size simultaneously. A bi-objective evolutionary algorithm with theoretical guarantees [Yu et al., 2012; Qian et al., 2015a; 2015b] is then applied to solve it. Finally, the best solution satisfying the size constraint is picked out from the solution set produced by POSS; a simplified sketch follows.
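The following is a highly simplified Python sketch of the POSS loop as just described; it illustrates the idea rather than reproducing the authors' exact pseudocode, and `f` (the objective, to be maximized over boolean inclusion vectors) is an assumed user-supplied function.

```python
# Highly simplified sketch of the POSS loop (illustrative only).
import random

def poss(f, n, k, iterations):
    """Bi-objective Pareto optimization: maximize f(s) while minimizing |s|."""
    def size(s):
        return sum(s)

    def dominates(a, b):
        # a is no worse than b on both objectives. (f is re-evaluated here
        # for brevity; a real implementation would cache objective values.)
        return f(a) >= f(b) and size(a) <= size(b)

    population = [[0] * n]                     # archive, seeded with the empty subset
    for _ in range(iterations):
        parent = random.choice(population)
        # Mutate: flip each bit independently with probability 1/n.
        child = [bit ^ (random.random() < 1.0 / n) for bit in parent]
        if not any(dominates(p, child) for p in population):
            # Keep the child and drop archived solutions it dominates.
            population = [p for p in population if not dominates(child, p)]
            population.append(child)
    # Pick the best archived solution satisfying the size constraint.
    return max((s for s in population if size(s) <= k), key=f)
```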
POSS is proved to achieve the best previously known polynomial-time approximation guarantee [Das and Kempe, 2011] on the sparse regression problem [Miller, 2002], a representative example of subset selection. In particular, it can even find an optimal solution on an important subclass of sparse regression [Das and Kempe, 2008]. In addition to the theoretical guarantees, POSS has also achieved significantly better empirical performance than the greedy and relaxation methods.
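For context, the guarantee in question has the following form, where γ denotes the submodularity ratio of the R² objective and OPT the optimal value under the size constraint; this is our hedged restatement of [Das and Kempe, 2011] rather than a formula appearing in this section, and should be checked against the original papers:

\[
R^2_{\hat{S}} \;\ge\; \left(1 - e^{-\gamma}\right) \cdot \mathrm{OPT}.
\]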
POSS requires 2ek²n (where e ≈ 2.71828 is Euler's number) objective function evaluations [Qian et al., 2015c] to achieve a high-quality solution, which can be unsatisfactory from a practical viewpoint when k and n are large. On the other hand, POSS is a sequential algorithm that cannot be readily parallelized, which hinders exploiting modern computing facilities to apply POSS to large-scale real-world problems.
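As an illustrative calculation (our numbers, not the paper's): for n = 10⁴ variables and k = 100,

\[
2ek^2 n \approx 2 \times 2.718 \times 100^2 \times 10^4 \approx 5.4 \times 10^8
\]

objective function evaluations, which is prohibitive when each evaluation is itself costly.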
In this paper, we propose a parallel version of POSS, called PPOSS. Instead of generating one solution at a time (as in POSS), PPOSS generates as many solutions as the number of processors at a time, and thus can be easily parallelized (one possible parallelization is sketched after the list below). More importantly, on subset selection with monotone objective functions, we prove that, while preserving the solution quality,

(1) when the number of processors is limited (less than the number n of variables), the running time of PPOSS can be reduced almost linearly w.r.t. the number of processors;

(2) with increasing number of processors, the running time can be further reduced, eventually to a constant.
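Below is a minimal sketch of the per-iteration parallelism described above, using Python's concurrent.futures; it shows the idea of generating one offspring per processor and evaluating them concurrently, and is an assumption-laden illustration rather than the paper's exact algorithm.

```python
# Illustrative sketch of PPOSS-style parallelism: per iteration, each processor
# mutates and evaluates one offspring; the archive update stays sequential.
import random
from concurrent.futures import ProcessPoolExecutor

def mutate(parent):
    """Flip each bit independently with probability 1/len(parent)."""
    n = len(parent)
    return [bit ^ (random.random() < 1.0 / n) for bit in parent]

def pposs(f, n, k, iterations, num_workers):
    empty = [0] * n
    population = [(empty, f(empty))]           # archive of (solution, f-value)
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(iterations):
            parents = [random.choice(population)[0] for _ in range(num_workers)]
            children = [mutate(p) for p in parents]
            # The costly objective evaluations run in parallel, one per worker;
            # f must be a picklable top-level function for ProcessPoolExecutor.
            values = list(pool.map(f, children))
            for child, val in zip(children, values):
                if not any(fv >= val and sum(s) <= sum(child)
                           for s, fv in population):
                    population = [(s, fv) for s, fv in population
                                  if not (val >= fv and sum(child) <= sum(s))]
                    population.append((child, val))
    feasible = [(s, fv) for s, fv in population if sum(s) <= k]
    return max(feasible, key=lambda t: t[1])[0]
```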