Statistical Learning with Sparsity: The Lasso and Generalizations


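*Statistical Learning with Sparsity* is a monograph by Trevor Hastie, Robert Tibshirani, and Martin Wainwright on the use of sparsity in statistical learning, in particular the lasso and its generalizations. It is number 143 in the Monographs on Statistics and Applied Probability series; this copy is a corrected version updated to December 19, 2016. The authors dedicate the book to their parents and families. The book covers sparse learning methods from the basics to advanced topics, with the aim of showing how to identify and exploit sparse structure in data. Part of the contents is outlined below.

1. **Lasso for Linear Models**
   - **Introduction**: introduces the basic idea of the lasso, a variant of linear regression that performs variable selection by adding an ℓ1 regularization term.
   - **The Lasso Estimator**: the lasso estimate minimizes the residual sum of squares plus a penalty equal to the sum of the absolute values of the coefficients.
   - **Cross-Validation and Inference**: cross-validation is the main tool for choosing the regularization parameter; the inference part concerns statistical analysis of the model parameters.
   - **Computation of the Lasso Solution**: how to compute the lasso solution, including soft-thresholding for a single predictor, cyclic coordinate descent for multiple predictors, and soft-thresholding in an orthogonal basis.
   - **Degrees of Freedom**: a measure of the complexity of a lasso model, which helps in understanding its explanatory and predictive capacity.
   - **Uniqueness of the Lasso Solutions**: when the lasso solution is unique, and the cases in which multiple solutions can exist.
   - **A Glimpse at the Theory**: a brief look at the theory of the lasso, including its guarantees and properties.
   - **The Nonnegative Garrote**: a regression method related to the lasso that rescales the least-squares coefficients by nonnegative factors.
   - **ℓq Penalties and Bayes Estimates**: other penalties, such as ℓq penalties, and their connection to Bayes estimates.
   - **Some Perspective**: a broader view comparing the strengths and weaknesses of different regularization techniques.
   - **Exercises**: problems for consolidating and applying the material.
2. **Generalized Linear Models**
   - **Introduction**: extends the lasso to generalized linear models, such as logistic regression, to handle non-continuous response variables.

The book is not limited to these chapters. It also includes deeper theoretical analysis, other regularization techniques such as the elastic net, and applications in high-dimensional data analysis, signal processing, bioinformatics, and other fields. It gives readers a systematic account of the central role of sparsity in statistical learning and of how to use these tools on practical problems.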

STATISTICAL LEARNING WITH SPARSITY 5

theory tells us that, if $f(t)$ actually has very low bandwidth, then a small number of (uniform) samples will suffice for recovery. As we will see in the remainder of this article, signal recovery can actually be made possible for a much broader class of signal models.

INCOHERENCE AND THE SENSING OF SPARSE SIGNALS

This section presents the two fundamental premises underlying CS: sparsity and incoherence.

SPARSITY

Many natural signals have concise representations when expressed in a convenient basis. Consider, for example, the image in Figure 1(a) and its wavelet transform in (b). Although nearly all the image pixels have nonzero values, the wavelet coefficients offer a concise summary: most coefficients are small, and the relatively few large coefficients capture most of the information.

Mathematically speaking, we have a vector $f \in \mathbb{R}^n$ (such as the $n$-pixel image in Figure 1) which we expand in an orthonormal basis (such as a wavelet basis) $\Psi = [\psi_1 \cdots \psi_n]$ as follows:

$$f(t) = \sum_{i=1}^{n} x_i \psi_i(t), \qquad (2)$$

where $x$ is the coefficient sequence of $f$, $x_i = \langle f, \psi_i \rangle$. It will be convenient to express $f$ as $\Psi x$ (where $\Psi$ is the $n \times n$ matrix with $\psi_1, \ldots, \psi_n$ as columns). The implication of sparsity is now clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Formally, consider $f_S(t)$ obtained by keeping only the terms corresponding to the $S$ largest values of $(|x_i|)$ in the expansion (2). By definition, $f_S := \Psi x_S$, where here and below, $x_S$ is the vector of coefficients $(x_i)$ with all but the largest $S$ set to zero. This vector is sparse in a strict sense since all but a few of its entries are zero; we will call $S$-sparse such objects with at most $S$ nonzero entries. Since $\Psi$ is an orthonormal basis (or “orthobasis”), we have $\|f - f_S\|_{\ell_2} = \|x - x_S\|_{\ell_2}$, and if $x$ is sparse or compressible in the sense that the sorted magnitudes of the $(x_i)$ decay quickly, then $x$ is well approximated by $x_S$ and, therefore, the error $\|f - f_S\|_{\ell_2}$ is small. In plain terms, one can “throw away” a large fraction of the coefficients without much loss. Figure 1(c) shows an example where the perceptual loss is hardly noticeable from a megapixel image to its approximation obtained by throwing away 97.5% of the coefficients.
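The $S$-term approximation described above can be sketched numerically. This is a minimal illustration, not code from the article: the function name `s_term_approximation`, the random orthobasis, and the decaying test coefficients are all assumptions chosen for the example. It keeps the $S$ largest coefficients and compares the error in the signal domain with the error in the coefficient domain, which should agree because the basis is orthonormal.

```python
import numpy as np

def s_term_approximation(f, Psi, S):
    """Best S-term approximation of f in the orthobasis Psi (columns are psi_i)."""
    x = Psi.T @ f                             # coefficients x_i = <f, psi_i>
    keep = np.argsort(np.abs(x))[::-1][:S]    # indices of the S largest |x_i|
    x_S = np.zeros_like(x)
    x_S[keep] = x[keep]                       # every other coefficient is set to zero
    return Psi @ x_S, x, x_S                  # f_S = Psi x_S

# Hypothetical test signal: a random orthobasis and rapidly decaying coefficients.
rng = np.random.default_rng(0)
n = 64
Psi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # Q factor has orthonormal columns
x_true = rng.choice([-1.0, 1.0], size=n) / (1.0 + np.arange(n)) ** 2
f = Psi @ x_true

f_S, x, x_S = s_term_approximation(f, Psi, S=8)
print("signal-domain error:     ", np.linalg.norm(f - f_S))
print("coefficient-domain error:", np.linalg.norm(x - x_S))
```

A wavelet basis, as in the image example, would play the role of `Psi` in practice; a random orthobasis is used here only to keep the sketch dependency-free.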

This principle is, of course, what underlies most modern lossy coders such as JPEG-2000 [4] and many others, since a simple method for data compression would be to compute $x$ from $f$ and then (adaptively) encode the locations and values of the $S$ significant coefficients. Such a process requires knowledge of all the $n$ coefficients $x$, as the locations of the significant pieces of information may not be known in advance (they are signal dependent); in our example, they tend to be clustered around edges in the image. More generally, sparsity is a fundamental modeling tool which permits efficient fundamental signal processing; e.g., accurate statistical estimation and classification, efficient data compression, and so on. This article is about a more surprising and far-reaching implication, however, which is that sparsity has significant bearings on the acquisition process itself. Sparsity determines how efficiently one can acquire signals nonadaptively.

INCOHERENT SAMPLING

Suppose we are given a pair $(\Phi, \Psi)$ of orthobases of $\mathbb{R}^n$. The first basis $\Phi$ is used for sensing the object $f$ as in (1) and the second is used to represent $f$. The restriction to pairs of orthobases is not essential and will merely simplify our treatment.

DEFINITION 1

The coherence between the sensing basis $\Phi$ and the representation basis $\Psi$ is

$$\mu(\Phi, \Psi) = \sqrt{n} \cdot \max_{1 \le k, j \le n} |\langle \varphi_k, \psi_j \rangle|. \qquad (3)$$

In plain English, the coherence measures the largest correlation between any two elements of $\Phi$ and $\Psi$; see also [5]. If $\Phi$ and $\Psi$ contain correlated elements, the coherence is large. Otherwise, it is small. As for how large and how small, it follows from linear algebra that $\mu(\Phi, \Psi) \in [1, \sqrt{n}]$.
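Definition (3) can be evaluated directly for small bases. The sketch below is illustrative only: the helper name `coherence` is hypothetical, and pairing the spike basis with a unitary discrete Fourier basis is an assumption made here to exhibit the lower end of the range. Pairing a basis with itself exhibits the upper end.

```python
import numpy as np

def coherence(Phi, Psi):
    """mu(Phi, Psi) = sqrt(n) * max_{k,j} |<phi_k, psi_j>|, bases stored as columns."""
    n = Phi.shape[0]
    gram = Phi.conj().T @ Psi          # entry (k, j) is the inner product <phi_k, psi_j>
    return np.sqrt(n) * np.max(np.abs(gram))

n = 16
spikes = np.eye(n)                                # canonical (spike) basis
fourier = np.fft.fft(np.eye(n)) / np.sqrt(n)      # unitary DFT matrix (assumed example)

print("spike vs. Fourier:", coherence(spikes, fourier))   # lower end of [1, sqrt(n)]
print("spike vs. spike:  ", coherence(spikes, spikes))    # upper end of [1, sqrt(n)]
```

Every entry of the unitary DFT matrix has magnitude $1/\sqrt{n}$, which is why that pair attains the minimum coherence.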

Compressive sampling is mainly concerned with low coherence pairs, and we now give examples of such pairs. In our first example, $\Phi$ is the canonical or spike basis $\varphi_k(t) = \delta(t - k)$ and

[FIG1] (a) Original megapixel image with pixel values in the range [0,255] and (b) its wavelet transform coefficients (arranged in random order for enhanced visibility). Relatively few wavelet coefficients capture most of the signal energy; many such images are highly compressible. (c) The reconstruction obtained by zeroing out all the coefficients in the wavelet expansion but the 25,000 largest (pixel values are thresholded to the range [0,255]). The difference with the original picture is hardly noticeable. As we describe in “Undersampling and Sparse Signal Recovery,” this image can be perfectly recovered from just 96,000 incoherent measurements.

IEEE SIGNAL PROCESSING MAGAZINE [23] MARCH 2008

Figure 1.2 (a) Original megapixel image with pixel values in the range [0, 255] and (b) its wavelet transform coefficients (arranged in random order for enhanced visibility). Relatively few wavelet coefficients capture most of the signal energy; many such images are highly compressible. (c) The reconstruction obtained by zeroing out all the coefficients in the wavelet expansion but the 25,000 largest (pixel values are thresholded to the range [0, 255]). The differences from the original picture are hardly noticeable.

ical models and their selection are discussed in Chapter 9 while compressed sensing is the topic of Chapter 10. Finally, a survey of theoretical results for the lasso is given in Chapter 11.

We note that both supervised and unsupervised learning problems are discussed in this book, the former in Chapters 2, 3, 4, and 10, and the latter in Chapters 7 and 8.

Notation

We have adopted a notation to reduce mathematical clutter. Vectors are column vectors by default; hence $\beta \in \mathbb{R}^p$ is a column vector, and its transpose $\beta^T$ is a row vector. All vectors are lower case and non-bold, except $N$-vectors which are bold, where $N$ is the sample size. For example $\mathbf{x}_j$ might be the $N$-vector of observed values for the $j$th variable, and $\mathbf{y}$ the response $N$-vector. All matrices are bold; hence $\mathbf{X}$ might represent the $N \times p$ matrix of observed predictors, and $\boldsymbol{\Theta}$ a $p \times p$ precision matrix. This allows us to use $x_i \in \mathbb{R}^p$ to represent the vector of $p$ features for observation $i$ (i.e., $x_i^T$ is the $i$th row of $\mathbf{X}$), while $\mathbf{x}_k$ is the $k$th column of $\mathbf{X}$, without ambiguity.

THE LASSO FOR LINEAR MODELS

This chapter is devoted to discussion of the lasso, a method that combines the least-squares loss (2.2) with an $\ell_1$-constraint, or bound on the sum of the absolute values of the coefficients. Relative to the least-squares solution, this constraint has the effect of shrinking the coefficients, and even setting some to zero. In this way it provides an automatic way for doing model selection in linear regression. Moreover, unlike some other criteria for model selection, the resulting optimization problem is convex, and can be solved efficiently for large problems.

2.2 The Lasso Estimator

Given a collection of $N$ predictor-response pairs $\{(x_i, y_i)\}_{i=1}^N$, the lasso finds the solution $(\hat\beta_0, \hat\beta)$ to the optimization problem

$$\underset{\beta_0, \beta}{\text{minimize}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t. \qquad (2.3)$$

The constraint $\sum_{j=1}^{p} |\beta_j| \le t$ can be written more compactly as the $\ell_1$-norm constraint $\|\beta\|_1 \le t$. Furthermore, (2.3) is often represented using matrix-vector notation. Let $\mathbf{y} = (y_1, \ldots, y_N)$ denote the $N$-vector of responses, and $\mathbf{X}$ be an $N \times p$ matrix with $x_i \in \mathbb{R}^p$ in its $i$th row, then the optimization problem (2.3) can be re-expressed as

$$\underset{\beta_0, \beta}{\text{minimize}} \left\{ \frac{1}{2N} \|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X}\beta\|_2^2 \right\} \quad \text{subject to} \quad \|\beta\|_1 \le t, \qquad (2.4)$$

where $\mathbf{1}$ is the vector of $N$ ones, and $\|\cdot\|_2$ denotes the usual Euclidean norm on vectors. The bound $t$ is a kind of “budget”: it limits the sum of the absolute values of the parameter estimates. Since a shrunken parameter estimate corresponds to a more heavily-constrained model, this budget limits how well we can fit the data. It must be specified by an external procedure such as cross-validation, which we discuss later in the chapter.

Footnote (on the name “lasso”): A lasso is a long rope with a noose at one end, used to catch horses and cattle. In a figurative sense, the method “lassos” the coefficients of the model. In the original lasso paper (Tibshirani 1996), the name “lasso” was also introduced as an acronym for “Least Absolute Selection and Shrinkage Operator.” Pronunciation: in the US “lasso” tends to be pronounced “lass-oh” (oh as in goat), while in the UK “lass-oo.” In the OED (2nd edition, 1965): “lasso is pronounced lăsoo by those who use it, and by most English people too.”

Typically, we first standardize the predictors $\mathbf{X}$ so that each column is centered ($\frac{1}{N}\sum_{i=1}^{N} x_{ij} = 0$) and has unit variance ($\frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$). Without standardization, the lasso solutions would depend on the units (e.g., feet versus meters) used to measure the predictors. On the other hand, we typically would not standardize if the features were measured in the same units. For convenience, we also assume that the outcome values $y_i$ have been centered, meaning that $\frac{1}{N}\sum_{i=1}^{N} y_i = 0$. These centering conditions are convenient, since they mean that we can omit the intercept term $\beta_0$ in the lasso optimization. Given an optimal lasso solution $\hat\beta$ on the centered data, we can recover the optimal solutions for the uncentered data: $\hat\beta$ is the same, and the intercept $\hat\beta_0$ is given by

$$\hat\beta_0 = \bar{y} - \sum_{j=1}^{p} \bar{x}_j \hat\beta_j,$$

where $\bar{y}$ and $\{\bar{x}_j\}_1^p$ are the original means. For this reason, we omit the intercept $\beta_0$ from the lasso for the remainder of this chapter.
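The centering argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the book's code: the helper names `standardize` and `recover_intercept` and the synthetic data are hypothetical, and ordinary least squares (the unpenalized special case) stands in for the lasso so that the result can be compared against a fit with an explicit intercept column.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it so that (1/N) * sum_i x_ij^2 = 1."""
    Xc = X - X.mean(axis=0)
    scale = np.sqrt((Xc ** 2).mean(axis=0))
    return Xc / scale

def recover_intercept(beta_hat, x_bar, y_bar):
    """beta0_hat = y_bar - sum_j x_bar_j * beta_hat_j."""
    return y_bar - x_bar @ beta_hat

# Hypothetical data with nonzero column means and a true intercept of 1.5.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(loc=5.0, scale=2.0, size=(N, p))
y = 1.5 + X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(N)

# Fit without an intercept on centered data (least squares stands in for the lasso).
x_bar, y_bar = X.mean(axis=0), y.mean()
beta_centered, *_ = np.linalg.lstsq(X - x_bar, y - y_bar, rcond=None)
beta0 = recover_intercept(beta_centered, x_bar, y_bar)

# Fit with an explicit intercept column on the raw data, for comparison.
full, *_ = np.linalg.lstsq(np.column_stack([np.ones(N), X]), y, rcond=None)
print("intercepts:", beta0, full[0])
print("slopes:    ", beta_centered, full[1:])

Z = standardize(X)   # columns now have mean 0 and (1/N) * sum of squares equal to 1
```

Note that the comparison uses centering only. If the columns are also rescaled, the coefficients are expressed in the rescaled units and must be divided by the column scales to return to the original units.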

It is often convenient to rewrite the lasso problem in the so-called Lagrangian form

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \left\{ \frac{1}{2N} \|\mathbf{y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad (2.5)$$

for some $\lambda \ge 0$. By Lagrangian duality, there is a one-to-one correspondence between the constrained problem (2.3) and the Lagrangian form (2.5): for each value of $t$ in the range where the constraint $\|\beta\|_1 \le t$ is active, there is a corresponding value of $\lambda$ that yields the same solution from the Lagrangian form (2.5). Conversely, the solution $\hat\beta_\lambda$ to problem (2.5) solves the bound problem with $t = \|\hat\beta_\lambda\|_1$.

We note that in many descriptions of the lasso, the factor $1/2N$ appearing in (2.3) and (2.5) is replaced by $1/2$ or simply 1. Although this makes no difference in (2.3), and corresponds to a simple reparametrization of $\lambda$ in (2.5), this kind of standardization makes $\lambda$ values comparable for different sample sizes (useful for cross-validation).
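One way to solve the Lagrangian form (2.5) numerically is cyclic coordinate descent with soft-thresholding, the approach named in the chapter outline. The following is a minimal sketch under assumptions: the function names, stopping rule, and synthetic data are illustrative, and the data are assumed to be centered so that no intercept is needed. It also reports the budget $t = \|\hat\beta_\lambda\|_1$ that the computed solution corresponds to.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-10):
    """Cyclic coordinate descent for (1/2N) * ||y - X b||_2^2 + lam * ||b||_1.

    Assumes y and the columns of X are centered, so no intercept is fitted.
    """
    N, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()      # current residual y - X beta
    col_sq = (X ** 2).sum(axis=0) / N              # (1/N) * ||x_j||^2 per column
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                           # skip constant (all-zero) columns
            # (1/N) * <x_j, partial residual>, partial residual = resid + x_j * beta_j
            rho = X[:, j] @ resid / N + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta           # keep the residual in sync
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Hypothetical centered data with two truly nonzero coefficients.
rng = np.random.default_rng(0)
N = 50
X = rng.standard_normal((N, 5))
X -= X.mean(axis=0)
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(N)
y -= y.mean()

beta_hat = lasso_cd(X, y, lam=0.1)
print("beta_hat:", beta_hat)
print("equivalent budget t = ||beta_hat||_1:", np.abs(beta_hat).sum())
```

With the $1/2N$ scaling, any $\lambda$ at or above $\max_j |\langle \mathbf{x}_j, \mathbf{y} \rangle| / N$ gives the all-zero solution, which is a convenient starting point when sweeping over a grid of $\lambda$ values.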

The theory of convex analysis tells us that necessary and sufficient conditions for a solution to problem (2.5) take the form

$$-\frac{1}{N} \langle \mathbf{x}_j, \mathbf{y} - \mathbf{X}\beta \rangle + \lambda s_j = 0, \quad j = 1, \ldots, p. \qquad (2.6)$$

Here each $s_j$ is an unknown quantity equal to $\text{sign}(\beta_j)$ if $\beta_j \ne 0$ and some value lying in $[-1, 1]$ otherwise; that is, it is a subgradient for the absolute value function (see Chapter 5 for details). In other words, the solutions $\hat\beta$ to problem (2.5) are the same as solutions $(\hat\beta, \hat{s})$ to (2.6). This system is a form of the so-called Karush–Kuhn–Tucker (KKT) conditions for problem (2.5). Expressing a problem in subgradient form can be useful for designing

Footnote (on omitting the intercept after centering): This is typically only true for linear regression with squared-error loss; it's not true, for example, for lasso logistic regression.
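The subgradient conditions (2.6) give a direct way to test whether a candidate vector solves (2.5). The sketch below is illustrative and rests on assumptions: the helper `kkt_violation` is a hypothetical name, and the design is constructed so that $\frac{1}{N}\mathbf{X}^T\mathbf{X} = I$, in which case the lasso solution is obtained by soft-thresholding $\frac{1}{N}\mathbf{X}^T\mathbf{y}$ coordinate by coordinate.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def kkt_violation(X, y, beta, lam):
    """Largest violation of conditions (2.6) at beta; zero means beta solves (2.5)."""
    N = X.shape[0]
    g = X.T @ (y - X @ beta) / N                   # (1/N) * <x_j, y - X beta>
    active = beta != 0
    # Nonzero coordinates need g_j = lam * sign(beta_j), i.e. s_j = sign(beta_j).
    v_active = np.abs(g[active] - lam * np.sign(beta[active]))
    # Zero coordinates need |g_j| <= lam, i.e. some s_j in [-1, 1].
    v_inactive = np.maximum(np.abs(g[~active]) - lam, 0.0)
    return float(max(v_active.max(initial=0.0), v_inactive.max(initial=0.0)))

# Hypothetical design with (1/N) * X^T X = I, built from a QR factorization.
rng = np.random.default_rng(1)
N, p = 40, 6
Q, _ = np.linalg.qr(rng.standard_normal((N, p)))
X = np.sqrt(N) * Q
y = rng.standard_normal(N)
lam = 0.1

beta_hat = soft_threshold(X.T @ y / N, lam)        # closed form for this design
print("violation at the solution:  ", kkt_violation(X, y, beta_hat, lam))
print("violation at a shifted point:", kkt_violation(X, y, beta_hat + 0.5, lam))
```

A check of this kind is useful as a solver-independent stopping test: it depends only on the data, the candidate coefficients, and $\lambda$.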
