
Regularization and Variable Selection via the Elastic Net

Hui Zou and Trevor Hastie
Department of Statistics, Stanford University


Outline

• Variable selection problem
• Sparsity by regularization and the lasso
• The elastic net


Variable selection

• Want to build a model using a subset of "predictors"
• Multiple linear regression; logistic regression (GLM); Cox's partial likelihood, ...
  – model selection criteria: AIC, BIC, etc.
  – relatively small p (p is the number of predictors)
  – instability (Breiman, 1996)
• Modern data sets: high-dimensional modeling
  – microarrays (the number of genes ≈ 10,000)
  – image processing
  – document classification
  – ...


Example: Leukemia classification

• Leukemia data, Golub et al., Science 1999.
• There are 38 training samples and 34 test samples, with a total of p = 7129 genes.
• Record the expression of gene j for sample i.
• Tumor type: AML or ALL.
• Golub et al. used a univariate ranking method to select relevant genes.


The p ≫ n problem and grouped selection

• Microarrays: p ≈ 10,000 and n < 100. A typical "large p, small n" problem (West et al. 2001).
• For genes sharing the same biological "pathway", the correlations among them can be high. We think of these genes as forming a group.
• What would an "oracle" do?
  ✔ Variable selection should be built into the procedure.
  ✔ Grouped selection: automatically include a whole group in the model whenever one variable in the group is selected.


Sparsity via ℓ1 penalization

• Wavelet shrinkage and basis pursuit; Donoho et al. (1995)
• Lasso; Tibshirani (1996)
• Least Angle Regression (LARS); Efron, Hastie, Johnstone and Tibshirani (2004)
• COSSO in smoothing spline ANOVA; Lin and Zhang (2003)
• ℓ0 and ℓ1 relation; Donoho et al. (1999, 2004)


Lasso

• Data $(X, y)$: $X$ is the $n \times p$ matrix of standardized predictors; $y$ is the response vector.

$$\min_{\beta} \|y - X\beta\|^2 \quad \text{subject to} \quad \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j| \le t$$

• Bias-variance tradeoff by a continuous shrinkage
• Variable selection by the ℓ1 penalization
• Survival analysis: Cox's partial likelihood + the ℓ1 penalty (Tibshirani 1998)
• Generalized linear models (e.g. logistic regression)
• LARS/Lasso: Efron et al. (2004)
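As a concrete complement to the criterion above, the short sketch below fits a lasso in its Lagrangian form with scikit-learn; the synthetic data, the penalty value, and the use of coordinate descent instead of LARS are assumptions made only for illustration.

```python
# Minimal lasso sketch (Lagrangian form), using scikit-learn's coordinate
# descent solver rather than the LARS algorithm discussed on later slides.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardized predictors
beta_true = np.r_[3.0, 1.5, 0.0, 0.0, 2.0, np.zeros(p - 5)]
y = X @ beta_true + rng.standard_normal(n)

# A larger alpha corresponds to a smaller constraint radius t, so more
# coefficients are shrunk exactly to zero.
fit = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
print("non-zero coefficients at:", np.flatnonzero(fit.coef_))
```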


The limitations of the lasso

• If p > n, the lasso selects at most n variables. The number of selected genes is bounded by the number of samples.
• Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
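The first limitation is easy to verify numerically. The sketch below traces a LARS/lasso path with scikit-learn on a toy p > n design (the design itself is an assumption for illustration) and counts how many variables are ever active.

```python
# Checking the "at most n variables" behaviour of the lasso when p > n.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")   # coefs: (p, n_breakpoints)
max_active = int((coefs != 0).sum(axis=0).max())
print(f"p = {p}, n = {n}, most variables active on the path: {max_active}")
```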


Elastic Net regularization

$$\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1$$

• The ℓ1 part of the penalty generates a sparse model.
• The quadratic part of the penalty
  – removes the limitation on the number of selected variables;
  – encourages the grouping effect;
  – stabilizes the ℓ1 regularization path.
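For experimentation, scikit-learn's ElasticNet solves the same (naive) criterion under a different parameterization. The mapping below from $(\lambda_1, \lambda_2)$ to (alpha, l1_ratio) is an assumption that relies on scikit-learn's documented objective $\frac{1}{2n}\|y - X\beta\|^2 + \alpha\rho\|\beta\|_1 + \frac{\alpha(1-\rho)}{2}\|\beta\|^2$.

```python
# Fitting the elastic net criterion above with scikit-learn.  The lambda ->
# (alpha, l1_ratio) mapping assumes scikit-learn's documented objective; it is
# not part of the slide.
import numpy as np
from sklearn.linear_model import ElasticNet

def enet_fit(X, y, lam1, lam2):
    n = X.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n          # so that alpha * l1_ratio     = lam1 / (2n)
    l1_ratio = lam1 / (lam1 + 2 * lam2)        # and     alpha * (1-l1_ratio) = lam2 / n
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=100_000)
    return model.fit(X, y).coef_               # the (naive) elastic net estimate

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))             # p > n, so the lasso could pick at most 50
y = X[:, :60] @ np.ones(60) + rng.standard_normal(50)
coef = enet_fit(X, y, lam1=20.0, lam2=10.0)
print("variables selected:", np.count_nonzero(coef))
```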


Geometry of the elastic net

The elastic net penalty:

$$J(\beta) = \alpha \|\beta\|^2 + (1 - \alpha) \|\beta\|_1, \qquad \alpha = \frac{\lambda_2}{\lambda_2 + \lambda_1},$$

with the equivalent constrained form $\min_{\beta} \|y - X\beta\|^2$ subject to $J(\beta) \le t$.

[Figure: two-dimensional contours of the ridge, lasso, and elastic net ($\alpha = 0.5$) penalties in the $(\beta_1, \beta_2)$ plane.]

• Singularities at the vertices (necessary for sparsity)
• Strictly convex edges; the strength of convexity varies with $\alpha$ (grouping)


A simple illustration: elastic net vs. lasso

• Two independent "hidden" factors $z_1$ and $z_2$: $z_1 \sim U(0, 20)$, $z_2 \sim U(0, 20)$.
• Generate the response vector $y = z_1 + 0.1\, z_2 + N(0, 1)$.
• Suppose we only observe the predictors
  $x_1 = z_1 + \epsilon_1$,  $x_2 = -z_1 + \epsilon_2$,  $x_3 = z_1 + \epsilon_3$,
  $x_4 = z_2 + \epsilon_4$,  $x_5 = -z_2 + \epsilon_5$,  $x_6 = z_2 + \epsilon_6$.
• Fit the model on $(X, y)$.
• An "oracle" would identify $x_1$, $x_2$, and $x_3$ (the $z_1$ group) as the most important variables.
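The sketch below reproduces this setup; the noise scale for the $\epsilon_i$ and the specific penalty values are assumptions, since the slide does not give them.

```python
# The two-hidden-factor illustration: lasso vs. elastic net variable selection.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 100
z1, z2 = rng.uniform(0, 20, n), rng.uniform(0, 20, n)
y = z1 + 0.1 * z2 + rng.standard_normal(n)

eps = rng.normal(scale=0.25, size=(n, 6))                 # measurement noise (scale assumed)
X = np.column_stack([z1, -z1, z1, z2, -z2, z2]) + eps     # x1..x6 as defined above

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("lasso selects x:", np.flatnonzero(lasso.coef_) + 1)        # typically one of x1-x3
print("elastic net selects x:", np.flatnonzero(enet.coef_) + 1)   # tends to keep the z1 group
```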


[Figure: solution paths for the simulated example. Left panel: Lasso; right panel: Elastic Net (lambda = 0.5). Standardized coefficients for predictors 1-6 plotted against $s = |\beta|/\max|\beta|$.]


[Figure: a second comparison of the Lasso (left panel) and the Elastic Net with lambda = 0.5 (right panel) solution paths; standardized coefficients plotted against $s = |\beta|/\max|\beta|$.]


Results on the grouping effect

Regression: let $\rho_{ij} = \mathrm{cor}(x_i, x_j)$. Suppose $\hat{\beta}_i(\lambda_1)\hat{\beta}_j(\lambda_1) > 0$; then

$$\frac{1}{|y|}\,\bigl|\hat{\beta}_i(\lambda_1) - \hat{\beta}_j(\lambda_1)\bigr| \;\le\; \frac{\sqrt{2}}{\lambda_2}\sqrt{1 - \rho_{ij}}.$$

Classification: let $\phi$ be a margin-based loss function, i.e., $\phi(y, f) = \phi(yf)$ with $y \in \{1, -1\}$. Consider

$$\hat{\beta} = \arg\min_{\beta}\; \sum_{k=1}^{n} \phi\bigl(y_k x_k^T \beta\bigr) + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1.$$

Assume that $\phi$ is Lipschitz, i.e., $|\phi(t_1) - \phi(t_2)| \le M |t_1 - t_2|$. Then for every pair $(i, j)$,

$$\bigl|\hat{\beta}_i - \hat{\beta}_j\bigr| \;\le\; \frac{M}{\lambda_2} \sum_{k=1}^{n} |x_{k,i} - x_{k,j}| \;\le\; \frac{\sqrt{2}\,M}{\lambda_2}\sqrt{1 - \rho_{ij}}.$$


Elastic net with scaling correction

Define $\hat{\beta}_{\text{enet}} = (1 + \lambda_2)\,\hat{\beta}$.

• This keeps the grouping effect and overcomes the double shrinkage caused by the quadratic penalty.
• Consider $\hat{\Sigma} = X^T X$ and $\hat{\Sigma}_{\lambda_2} = (1 - \gamma)\hat{\Sigma} + \gamma I$ with $\gamma = \frac{\lambda_2}{1 + \lambda_2}$. $\hat{\Sigma}_{\lambda_2}$ is a shrunken estimate of the correlation matrix of the predictors.
• Decomposition of the ridge operator: $\hat{\beta}_{\text{ridge}} = \frac{1}{1 + \lambda_2}\hat{\Sigma}_{\lambda_2}^{-1} X^T y$.
• We can show that
  $$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\; \beta^T \hat{\Sigma} \beta - 2 y^T X \beta + \lambda_1 \|\beta\|_1,$$
  $$\hat{\beta}_{\text{enet}} = \arg\min_{\beta}\; \beta^T \hat{\Sigma}_{\lambda_2} \beta - 2 y^T X \beta + \lambda_1 \|\beta\|_1.$$
• With orthogonal predictors, $\hat{\beta}_{\text{enet}}$ reduces to the (minimax optimal) soft-thresholding estimator.
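A small sketch of the correction in code: fit the naive elastic net first (here via scikit-learn, using the same parameter-mapping assumption as in the earlier sketch), then rescale by $(1 + \lambda_2)$.

```python
# Naive elastic net fit followed by the (1 + lambda2) scaling correction.
import numpy as np
from sklearn.linear_model import ElasticNet

def naive_enet(X, y, lam1, lam2):
    n = X.shape[0]
    model = ElasticNet(alpha=lam1 / (2 * n) + lam2 / n,
                       l1_ratio=lam1 / (lam1 + 2 * lam2),
                       fit_intercept=False, max_iter=100_000)
    return model.fit(X, y).coef_

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 40))
y = X[:, :5] @ np.full(5, 2.0) + rng.standard_normal(80)

lam1, lam2 = 10.0, 1.0
beta_naive = naive_enet(X, y, lam1, lam2)
beta_enet = (1 + lam2) * beta_naive        # undoes the extra ridge-type shrinkage
print(np.round(beta_naive[:5], 2))
print(np.round(beta_enet[:5], 2))
```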


Computation

• The elastic net solution path is piecewise linear.
• Given a fixed $\lambda_2$, a stage-wise algorithm called LARS-EN efficiently solves the entire elastic net solution path.
  – At step k, efficiently update or downdate the Cholesky factorization of $X_{A_{k-1}}^T X_{A_{k-1}} + \lambda_2 I$, where $A_k$ is the active set at step k.
  – Only record the non-zero coefficients and the active set at each LARS-EN step.
  – Early stopping, especially in the $p \gg n$ problem.
• R package: elasticnet
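The LARS-EN implementation referred to here is the R package elasticnet. As a rough stand-in, the sketch below traces the same kind of piecewise-linear path with scikit-learn's LARS/lasso solver by appending $\sqrt{\lambda_2}\,I$ rows to $X$ (an augmented-data identity that is not stated on this slide, so treat it as an assumption) and then applying the $(1+\lambda_2)$ rescaling.

```python
# Tracing an elastic net path for fixed lambda2 with ordinary LARS, using the
# identity ||y - Xb||^2 + lambda2*||b||^2 = ||y* - X*b||^2 with
# X* = [X; sqrt(lambda2) I] and y* = [y; 0].  This is not the LARS-EN code
# from the slide (that is the R package "elasticnet"); it only illustrates the
# piecewise-linear path.
import numpy as np
from sklearn.linear_model import lars_path

def enet_path_via_lars(X, y, lam2):
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    alphas, active, coefs = lars_path(X_aug, y_aug, method="lasso")
    return alphas, (1 + lam2) * coefs          # rescaled coefficients at each breakpoint

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 100))             # p >> n, as in the microarray setting
y = X[:, :10] @ np.ones(10) + rng.standard_normal(30)

alphas, coefs = enet_path_via_lars(X, y, lam2=1.0)
print("breakpoints:", len(alphas),
      "| most variables active:", int((coefs != 0).sum(axis=0).max()))  # can exceed n
```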


Simulation example 1: 50 data sets consisting of 20/20/200 observations and 8 predictors. $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)$ and $\sigma = 3$; $\mathrm{cor}(x_i, x_j) = 0.5^{|i-j|}$.

Simulation example 2: same as example 1, except $\beta_j = 0.85$ for all $j$.

Simulation example 3: 50 data sets consisting of 100/100/400 observations and 40 predictors. $\beta = (\underbrace{0,\dots,0}_{10}, \underbrace{2,\dots,2}_{10}, \underbrace{0,\dots,0}_{10}, \underbrace{2,\dots,2}_{10})$ and $\sigma = 15$; $\mathrm{cor}(x_i, x_j) = 0.5$ for all $i, j$.

Simulation example 4: 50 data sets consisting of 50/50/400 observations and 40 predictors. $\beta = (\underbrace{3,\dots,3}_{15}, \underbrace{0,\dots,0}_{25})$ and $\sigma = 15$, with

$$x_i = Z_1 + \epsilon^x_i,\quad Z_1 \sim N(0, 1),\quad i = 1, \dots, 5,$$
$$x_i = Z_2 + \epsilon^x_i,\quad Z_2 \sim N(0, 1),\quad i = 6, \dots, 10,$$
$$x_i = Z_3 + \epsilon^x_i,\quad Z_3 \sim N(0, 1),\quad i = 11, \dots, 15,$$
$$x_i \sim N(0, 1) \text{ i.i.d.},\quad i = 16, \dots, 40.$$
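A sketch of the example 4 design is below; the within-group noise scale for $\epsilon^x_i$ is an assumption, since the slide only specifies the latent $N(0,1)$ factors and $\sigma = 15$.

```python
# Generating one data set from simulation example 4: three groups of five
# correlated predictors plus 25 independent noise predictors.
import numpy as np

def simulate_example4(n, rng):
    Z = rng.standard_normal((n, 3))                    # Z1, Z2, Z3 ~ N(0, 1)
    eps_x = 0.1 * rng.standard_normal((n, 15))         # epsilon^x (scale assumed)
    X_groups = np.repeat(Z, 5, axis=1) + eps_x         # x1..x15
    X_noise = rng.standard_normal((n, 25))             # x16..x40, i.i.d. N(0, 1)
    X = np.hstack([X_groups, X_noise])
    beta = np.r_[np.full(15, 3.0), np.zeros(25)]
    y = X @ beta + 15.0 * rng.standard_normal(n)       # sigma = 15
    return X, y

X, y = simulate_example4(50, np.random.default_rng(0))
print(X.shape, y.shape)                                # (50, 40) (50,)
```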


Median MSE for the simulated examples

Method         Ex.1          Ex.2          Ex.3          Ex.4
Ridge          4.49 (0.46)   2.84 (0.27)   39.5 (1.80)   64.5 (4.78)
Lasso          3.06 (0.31)   3.87 (0.38)   65.0 (2.82)   46.6 (3.96)
Elastic Net    2.51 (0.29)   3.16 (0.27)   56.6 (1.75)   34.5 (1.64)
No re-scaling  5.70 (0.41)   2.73 (0.23)   41.0 (2.13)   45.9 (3.72)

Variable selection results

Method         Ex.1   Ex.2   Ex.3   Ex.4
Lasso          5      6      24     11
Elastic Net    6      7      27     16


Leukemia classification example

Method        10-fold CV error   Test error   No. of genes
Golub UR      3/38               4/34         50
SVM RFE       2/38               1/34         31
PLR RFE       2/38               1/34         26
NSC           2/38               2/34         21
Elastic Net   2/38               0/34         45

UR: univariate ranking (Golub et al. 1999)
RFE: recursive feature elimination (Guyon et al. 2002)
SVM: support vector machine (Guyon et al. 2002)
PLR: penalized logistic regression (Zhu and Hastie 2004)
NSC: nearest shrunken centroids (Tibshirani et al. 2002)


[Figure: Leukemia classification, misclassification error (10-fold CV and test). One panel, "early stopping at 200 steps": error versus the number of LARS-EN steps (0 to 200). Second panel, "the whole elastic net paths": error versus $s$, with $s(\text{steps}=200) = 0.50$ marked.]


Effective degrees of freedom

• The effective df describes the model complexity.
• df is very useful in estimating the prediction accuracy of the fitted model.
• df is well studied for linear smoothers: $\hat{\mu} = S y$, $df(\hat{\mu}) = \mathrm{tr}(S)$.
• For the ℓ1-penalized methods, the non-linear nature makes the analysis difficult.
• Conjecture by Efron et al. (2004): starting at step 0, let $m_k$ be the index of the last model in the lasso sequence containing exactly $k$ predictors. Then $df(m_k) \approx k$.


Elastic Net: degrees of freedom

• $df = E[\widehat{df}\,]$, where $\widehat{df}$ is an unbiased estimate of $df$, and
  $$\widehat{df} = \mathrm{tr}\bigl(H_{\lambda_2}(\mathcal{A})\bigr),$$
  where $\mathcal{A}$ is the active set and
  $$H_{\lambda_2}(\mathcal{A}) = X_{\mathcal{A}} \bigl(X_{\mathcal{A}}^T X_{\mathcal{A}} + \lambda_2 I\bigr)^{-1} X_{\mathcal{A}}^T.$$
• For the lasso ($\lambda_2 = 0$), $\widehat{df}(\text{lasso})$ = the number of nonzero coefficients.
• Proof: SURE + LARS + convex analysis.
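The estimate is simple to compute once the active set is known; the sketch below is a direct transcription of the trace formula, with a toy design and active set chosen only for illustration.

```python
# Unbiased df estimate: tr(H_lambda2(A)) with
# H = X_A (X_A^T X_A + lambda2 I)^{-1} X_A^T.
import numpy as np

def df_estimate(X, coef, lam2):
    A = np.flatnonzero(coef)                  # active set
    if A.size == 0:
        return 0.0
    XA = X[:, A]
    H = XA @ np.linalg.solve(XA.T @ XA + lam2 * np.eye(A.size), XA.T)
    return float(np.trace(H))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))
coef = np.array([1.0, 0.0, 2.0, 0.0, 0.0, 0.5, 0.0, 0.0])   # toy fit with 3 active variables
print(round(df_estimate(X, coef, lam2=0.0), 6))   # = 3: the lasso case counts non-zeros
print(round(df_estimate(X, coef, lam2=2.0), 6))   # < 3: the ridge part lowers the df
```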


Elastic Net: other applications

• Sparse PCA
  – Obtain (modified) principal components with sparse loadings.
• Kernel elastic net
  – Generate a class of kernel machines with support vectors.


Sparse PCA

• $X_{n \times p}$, and $x_i$ is the $i$-th row vector of $X$.
• $\alpha$ and $\beta$ are $p$-vectors.

SPCA: the leading sparse PC

$$\min_{\alpha, \beta}\; \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \quad \text{subject to } \|\alpha\|^2 = 1,$$

with $\hat{v} = \hat{\beta}/\|\hat{\beta}\|$ giving the loadings.

• A large $\lambda_1$ generates sparse loadings.
• The equivalence theorem: consider SPCA with $\lambda_1 = 0$.
  1. For all $\lambda_2 > 0$, SPCA ≡ PCA.
  2. When $p > n$, SPCA ≡ PCA if and only if $\lambda_2 > 0$.


Sparse PCA (cont.)

• $A_{p \times k} = [\alpha_1, \dots, \alpha_k]$ and $B_{p \times k} = [\beta_1, \dots, \beta_k]$.

SPCA: the first k sparse PCs

$$\min_{A, B}\; \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda_2 \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1j} \|\beta_j\|_1 \quad \text{subject to } A^T A = I_{k \times k}.$$

Let $\hat{v}_j = \hat{\beta}_j / \|\hat{\beta}_j\|$ for $j = 1, \dots, k$.

• Solution:
  – B given A: k independent elastic net problems.
  – A given B: exact solution by SVD.


SPCA algorithm

1. Let A start at V[, 1:k], the loadings of the first k ordinary principal components.
2. Given a fixed $A = [\alpha_1, \dots, \alpha_k]$, solve the following elastic net problem for $j = 1, 2, \dots, k$:
   $$\beta_j = \arg\min_{\beta}\; (\alpha_j - \beta)^T X^T X (\alpha_j - \beta) + \lambda_2 \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1.$$
3. For a fixed $B = [\beta_1, \dots, \beta_k]$, compute the SVD of $X^T X B = U D V^T$, then update $A = U V^T$.
4. Repeat steps 2–3 until convergence.
5. Normalization: $\hat{v}_j = \beta_j / \|\beta_j\|$, $j = 1, \dots, k$.
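A sketch of this alternating algorithm is below. The B-step solves step 2 as an elastic net regression of $X\alpha_j$ on $X$; the penalty values use scikit-learn's (alpha, l1_ratio) parameterization rather than the slide's lambdas, and the data and convergence rule are assumptions.

```python
# Alternating SPCA sketch: elastic net B-step, SVD A-step.
import numpy as np
from sklearn.linear_model import ElasticNet

def spca(X, k, alpha=0.05, l1_ratio=0.9, n_iter=200, tol=1e-6):
    p = X.shape[1]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                      # step 1: ordinary PC loadings (p x k)
    B = np.zeros((p, k))
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                      fit_intercept=False, max_iter=50_000)
    for _ in range(n_iter):
        B_old = B.copy()
        for j in range(k):                            # step 2: k elastic net problems
            B[:, j] = enet.fit(X, X @ A[:, j]).coef_
        U, _, Wt = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U @ Wt                                    # step 3: SVD of X^T X B, A = U V^T
        if np.max(np.abs(B - B_old)) < tol:           # step 4: repeat until convergence
            break
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0                           # guard against all-zero loadings
    return B / norms                                  # step 5: normalized sparse loadings

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
print(np.round(spca(X, k=2), 2))
```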


Sparse PCA: pitprops data example

• There are 13 measured variables. The data were first introduced by Jeffers (1967), who tried to interpret the first 6 principal components.
• A classic example showing the difficulty of interpreting principal components.
• The original data have 180 observations; the 13 × 13 sample correlation matrix is sufficient for our analysis.


PCA vs. SPCA on the pitprops data

Loadings of the first three ordinary principal components, with the percentage of variance explained:

Variable   PC 1    PC 2    PC 3
topdiam    -.404    .218   -.207
length     -.406    .186   -.235
moist      -.124    .541    .141
testsg     -.173    .456    .352
ovensg     -.057   -.170    .481
ringtop    -.284   -.014    .475
ringbut    -.400   -.190    .253
bowmax     -.294   -.189   -.243
bowdist    -.357    .017   -.208
whorls     -.379   -.248   -.119
clear       .011    .205   -.070
knots       .115    .343    .092
diaknot     .113    .309   -.326
variance    32.4    18.3    14.4

SPCA produces three sparse PCs, with adjusted variances 28.0, 14.0 and 13.3. Each variable now contributes to only a few components; for example, topdiam, length, ringbut, bowmax, bowdist and whorls carry sparse loadings -.477, -.476, -.250, -.344, -.416 and -.400, while moist and testsg carry .785 and .620.


Kernel Machines

• Binary classification: $y \in \{1, -1\}$.
• Take a margin-based loss function $\phi(y, f) = \phi(yf)$.
• A kernel matrix $K_{i,j} = k(x_i, x_j)$. We consider $\hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i k(x_i, x)$ with

$$\hat{\alpha} = \arg\min_{\alpha}\; \frac{1}{n}\sum_{i=1}^{n} \phi\Bigl(y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i)\Bigr) + \lambda_2\, \alpha^T K \alpha.$$

• SVMs use $\phi(y, f) = (1 - yf)_+$, the hinge loss (Wahba, 2000).
  ✔ maximizes the margin
  ✔ directly approximates the Bayes rule (Lin, 2002)
  ✔ only a fraction of the $\alpha_i$ are non-zero: support vectors
  ✖ no estimate of $p(y|x)$


Kernel elastic net

• Take $\phi(y, f) = \log(1 + \exp(-yf))$. We consider $\hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i k(x_i, x)$ with

$$\hat{\alpha} = \arg\min_{\alpha}\; \frac{1}{n}\sum_{i=1}^{n} \phi\Bigl(y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i)\Bigr) + \lambda_2\, \alpha^T K \alpha + \lambda_1 \sum_{i=1}^{n} |\alpha_i|.$$

  ✔ estimates $p(y|x)$
• KLR ($\lambda_1 = 0$): no support vectors
  ✔ a large $\lambda_1$ generates genuine support vectors
  ✔ combines margin maximization with boosting
  – $\lambda_1$ is the main tuning parameter: the regularization method in boosting (Rosset, Zhu and Hastie, 2004).
  – small positive $\lambda_2$: the limiting solution ($\lambda_1 \to 0$) is close to the margin-maximizing classifier.
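The sketch below is one way to fit this criterion: a proximal-gradient (ISTA) loop on the penalized kernel logistic regression, with the gradient taken from the objective above. The RBF kernel, the step-size bound, the penalty values, and the toy data are all assumptions, not part of the slide.

```python
# Kernel elastic net sketch: l1 + l2 penalized kernel logistic regression
# fitted by proximal gradient (ISTA).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def kernel_enet(X, y, lam1=0.01, lam2=0.1, gamma=0.5, n_iter=2000):
    """y in {-1, +1}; returns alpha, one coefficient per training sample."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = len(y)
    Knorm = np.linalg.norm(K, 2)
    step = 1.0 / (Knorm**2 / (4 * n) + 2 * lam2 * Knorm)     # 1/L for the smooth part
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha
        sig = 1.0 / (1.0 + np.exp(np.clip(y * f, -50, 50)))  # sigma(-y * f)
        grad = -(K @ (y * sig)) / n + 2.0 * lam2 * (K @ alpha)
        alpha = soft_threshold(alpha - step * grad, step * lam1)
    return alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)               # a toy non-linear rule
alpha = kernel_enet(X, y)
print("support vectors (non-zero alpha):", int(np.count_nonzero(alpha)), "of", len(y))
```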


Summary

• The elastic net performs simultaneous regularization and variable selection.
• Ability to perform grouped selection.
• Appropriate for the p ≫ n problem.
• Analytical results on the df of the elastic net/lasso.
• Interesting implications in other areas: sparse PCA and new kernel machines with support vectors.


References

• Zou, H. and Hastie, T. (2004). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B. To appear.
• Zou, H., Hastie, T. and Tibshirani, R. (2004). Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics. Tentatively accepted.
• Zou, H., Hastie, T. and Tibshirani, R. (2004). On the "Degrees of Freedom" of the Lasso. Submitted to the Annals of Statistics.

http://www-stat.stanford.edu/~hzou
