Subset Selection by Pareto Optimization
Chao Qian, Yang Yu, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University
Collaborative Innovation Center of Novel Software Technology and Industrialization
Nanjing 210023, China
{qianc,yuy,zhouzh}@lamda.nju.edu.cn
Abstract
Selecting the optimal subset from a large set of variables is a fundamental problem in various learning tasks such as feature selection, sparse regression, and dictionary learning. In this paper, we propose the POSS approach, which employs evolutionary Pareto optimization to find a small-sized subset with good performance. We prove that for sparse regression, POSS efficiently achieves the best-so-far theoretically guaranteed approximation performance. In particular, for the Exponential Decay subclass, POSS is proven to achieve an optimal solution. Empirical study verifies the theoretical results and exhibits the superiority of POSS over greedy and convex relaxation methods.
1 Introduction
Subset selection is the problem of selecting a subset of size k from a total set of n variables so as to optimize some given criterion. This problem arises in many applications, e.g., feature selection, sparse learning, and compressed sensing. The subset selection problem is, however, generally NP-hard [13, 4]. Previously employed techniques mainly fall into two branches: greedy algorithms and convex relaxation methods. Greedy algorithms iteratively select or abandon the single variable that best improves the current value of the criterion [9, 19]; they are, however, limited by this greedy behavior. Convex relaxation methods usually replace the set size constraint (i.e., the $\ell_0$-norm) with convex constraints, e.g., the $\ell_1$-norm constraint [18] or the elastic net penalty [29], and then find optimal solutions to the relaxed problem, which, however, can be distant from the true optimum.
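As a concrete illustration of the convex relaxation branch, the sketch below replaces the $\ell_0$ size constraint with an $\ell_1$ penalty via the Lasso. It is a minimal example assuming NumPy and scikit-learn are available; the toy data and the penalty strength alpha are illustrative choices, not part of the original formulation.

import numpy as np
from sklearn.linear_model import Lasso

# Toy data: n = 10 observation variables, of which only 3 are relevant.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
w_true = np.zeros(10)
w_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
z = X @ w_true + 0.1 * rng.randn(100)

# l1 relaxation: the penalty strength alpha controls the subset size
# only indirectly; it does not enforce |S| <= k exactly, which is one
# source of the gap to the true optimum mentioned above.
lasso = Lasso(alpha=0.1).fit(X, z)
selected = np.flatnonzero(lasso.coef_)
print("selected variables:", selected)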
Pareto optimization solves a problem by reformulating it as a bi-objective optimization problem and employing a bi-objective evolutionary algorithm; this methodology has developed significantly in recent years, both in theoretical foundation [22, 15] and in applications [16]. This paper proposes the POSS (Pareto Optimization for Subset Selection) method, which treats subset selection as a bi-objective optimization problem that simultaneously optimizes the given criterion and the subset size. To investigate the performance of POSS, we study a representative example of subset selection, sparse regression.
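To make the bi-objective reformulation concrete, below is a simplified Python sketch of the Pareto optimization idea, not the exact POSS procedure analyzed in this paper: the evaluation function f, the iteration budget T, and the bit-wise mutation scheme are illustrative assumptions. It maintains an archive of non-dominated (criterion value, subset size) solutions and finally returns the best archived subset within the size budget.

import numpy as np

def pareto_subset_selection(f, n, k, T=10000, seed=0):
    """Sketch of Pareto optimization for subset selection.

    f: maps a boolean mask of length n to a cost (smaller is better),
    e.g., the mean squared error of a least squares fit.
    """
    rng = np.random.RandomState(seed)
    empty = np.zeros(n, dtype=bool)
    archive = [(f(empty), 0, empty)]  # non-dominated (cost, size, mask)
    for _ in range(T):
        # Mutate a random archived solution: flip each bit w.p. 1/n.
        s = archive[rng.randint(len(archive))][2]
        s_new = np.where(rng.rand(n) < 1.0 / n, ~s, s)
        cost, size = f(s_new), int(s_new.sum())
        # Insert s_new unless some archived solution is at least as
        # good on both objectives; then prune what s_new dominates.
        if not any(c <= cost and z <= size for c, z, _ in archive):
            archive = [(c, z, m) for c, z, m in archive
                       if not (cost <= c and size <= z)]
            archive.append((cost, size, s_new))
    feasible = [(c, z, m) for c, z, m in archive if z <= k]
    return min(feasible, key=lambda t: t[0])[2]

The key difference from a greedy method is that the archive keeps the whole trade-off front between criterion value and subset size, so a temporarily inferior subset can survive and later mutate into a better one.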
The subset selection problem in sparse regression is to best estimate a predictor variable by linear
regression [12], where the quality of estimation is usually measured by the mean squared error, or
equivalently, the squared multiple correlation $R^2$ [6, 11]. Gilbert et al. [9] studied the two-phase approach with orthogonal matching pursuit (OMP) and proved a multiplicative approximation guarantee of $1 + \Theta(\mu k^2)$ for the mean squared error when the coherence $\mu$ (i.e., the maximum correlation between any pair of observation variables) is $O(1/k)$. This approximation bound was later improved by [20, 19]. Under the same small-coherence condition, Das and Kempe [2] analyzed the forward regression (FR) algorithm [12] and obtained an approximation guarantee of $1 - \Theta(\mu k)$ for $R^2$.
These results, however, break down when $\mu \in \omega(1/k)$. By introducing the submodularity ratio $\gamma$, Das and Kempe [3] proved an approximation guarantee of $1 - e^{-\gamma}$ on $R^2$ for the FR algorithm; this guarantee is considered the strongest so far, since it applies under any coherence.
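For intuition about the FR algorithm that these guarantees concern, here is a minimal greedy sketch under the $R^2$ criterion. It assumes NumPy, assumes the predictor z is centered, and naively refits least squares on every candidate subset for clarity.

import numpy as np

def r_squared(X, z, subset):
    """Squared multiple correlation R^2 when regressing z on X[:, subset].

    Assumes z is centered (mean zero), so z @ z is the total sum of squares.
    """
    if not subset:
        return 0.0
    Xs = X[:, subset]
    coef, *_ = np.linalg.lstsq(Xs, z, rcond=None)
    residual = z - Xs @ coef
    return 1.0 - (residual @ residual) / (z @ z)

def forward_regression(X, z, k):
    """Greedily add, k times, the variable that most increases R^2."""
    selected = []
    remaining = set(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: r_squared(X, z, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

The submodularity ratio $\gamma$ of [3] quantifies how close this $R^2$ objective is to being submodular, which is what drives the $1 - e^{-\gamma}$ bound above.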
Note that sparse regression is similar to the problem of sparse recovery [7, 25, 21, 17], but they are for