Sparsity and statistical inference in high-dimensional settings
Martin Wainwright
Department of Statistics, and Department of Electrical Engineering and Computer Science, UC Berkeley, California, USA
wainwrig@{stat,eecs}.berkeley.edu
ICML 2008, Helsinki, Finland. Workshop on sparsity and variable selection
Introduction
classical asymptotic theory in statistical inference:
– number of observations n → +∞
– model dimension p stays fixed
contrast with many modern applications and models:
– systems frequently large (p ≈ 10^3 to 10^8), and observations costly
– interesting consequences: might have p = Θ(n) or even p ≫ n
distinct issues in high-dimensional inference:
– performance limits of specific polynomial-complexity methods?
– fundamental (information-theoretic) limits that apply to any procedure?
some form of complexity regularization is required to avoid the curse of dimensionality (e.g., sparsity)
Different sparsity models
Given a signal vector β* ∈ R^p:
Hard-sparsity model: Support:
$$S(\beta^*) = \{\, i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \,\}.$$
Exactly k non-zero elements: |S(β*)| = k.
Weak ℓq-sparsity models: For q ∈ [0, 1), define the ℓq-ball:
$$B_q(C) = \Bigl\{\, u \in \mathbb{R}^p \;\Big|\; \sum_{i=1}^p |u_i|^q \le C \,\Bigr\}.$$
Require that β* ∈ B_q(C) for some q ∈ [0, 1).
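To make the two notions concrete, here is a minimal numpy sketch; the example vector and the values of q and C are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical 10-dimensional signal with k = 3 nonzero entries (hard sparsity)
beta = np.zeros(10)
beta[[1, 4, 7]] = [2.0, -0.5, 1.3]

support = np.flatnonzero(beta)          # S(beta) = {i : beta_i != 0}
k = support.size                        # |S(beta)| = 3

# Weak ell_q sparsity: membership in the ell_q ball B_q(C)
q, C = 0.5, 5.0
lq_mass = np.sum(np.abs(beta) ** q)     # sum_i |beta_i|^q
print(support, k, lq_mass, lq_mass <= C)
```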
ℓ1-norm relaxation
Problem: ℓq "ball" constraints are non-convex (computationally intractable in general).
[Figure: the ℓq ball (q < 1) next to the ℓ1 ball.]
Standard relaxation: relax the non-convex constraint to the nearest convex norm, namely the ℓ1 norm.
Compressed sensing
[Figure: Y (n-vector) = X (n × p matrix) times β* (p-vector with k nonzero and p − k zero entries).]
Set-up: Under-determined sparse linear system: Y = Xβ*, with ‖β*‖₀ = k ≪ p.
Goal: Recover β* using n ≪ p measurements.
Some relevant work: Chen et al., 1998; Donoho & Huo, 2001; Elad & Bruckstein, 2002; Candes et al., 2005; Candes & Tao, 2005; Donoho, 2005; Donoho & Tanner, 2006; Baron et al., 2006; Baraniuk, 2007; Indyk et al., 2008.
Sparse log-linear models
[Figure: Y (n-vector) = f(X β*, W), with X an n × p design matrix, β* having k nonzero entries, and noise W.]
Set-up: General log-linear observation model: $(Y_i \mid \vec{X}_i) \sim \mathbb{P}\bigl[\vec{X}_i^T \beta^*\bigr]$.
Goal: Estimate β* (various loss functions possible).
Some relevant work: Tibshirani, 1996; Chen et al., 1998; Ng, 2004; Tropp, 2004; Fuchs, 2005; Candes & Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Wainwright, 2006; Rocha et al., 2007; van de Geer, 2007; Ji & Carin, 2007; Tsybakov et al., 2008.
Graphical model selection
let G = (V, E) be an undirected graph on p = |V| vertices
consider a p-dimensional random vector $\vec{Z}$ with density
$$p(\vec{Z}; \beta) \;\propto\; \exp\Bigl\{ \sum_{i \in V} \bigl(\beta_i Z_i + \beta_{ii} Z_i^2\bigr) \;+\; \sum_{(i,j) \in E} \beta_{ij} Z_i Z_j \Bigr\}.$$
identify the underlying graph structure given n samples of $\vec{Z}$
Some relevant work: Abbeel et al., 2006; Meinshausen & Buhlmann, 2006; Rothman et al., 2008; Ravikumar et al., 2006, 2008; Johnson et al., 2007; Santhanam & Wainwright, 2008; Banerjee et al., 2008.
Sparse principal components analysis
[Figure: Σ = Z Zᵀ + D, with hatched blocks indicating the sparse leading eigenspace Z.]
Set-up: Covariance matrix Σ = Z Zᵀ + D, where the leading eigenspace Z has sparse columns.
Goal: Produce an estimate Ẑ based on samples X⁽ⁱ⁾ with covariance matrix Σ.
Some relevant work: Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Moghaddam et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008.
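As one deliberately simple illustration of this set-up, the sketch below implements diagonal thresholding in the spirit of Johnstone & Lu: restrict to coordinates with large sample variance, then take the leading eigenvector of the reduced sample covariance. The threshold and toy data are assumptions for the example, not the methods of the other cited papers.

```python
import numpy as np

def diag_threshold_pca(X, tau):
    """Sketch of diagonal thresholding: leading eigenvector of the sample
    covariance restricted to coordinates whose sample variance exceeds tau."""
    p = X.shape[1]
    keep = np.flatnonzero(X.var(axis=0) > tau)        # candidate support of Z
    w, V = np.linalg.eigh(np.cov(X[:, keep], rowvar=False))
    z_hat = np.zeros(p)
    z_hat[keep] = V[:, -1]                            # leading eigenvector, re-embedded
    return z_hat

# toy spiked model: Sigma = 3 z z^T + I, with z supported on the first 5 coordinates
rng = np.random.default_rng(0)
p, n, k = 50, 500, 5
z = np.zeros(p); z[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p)) + np.sqrt(3.0) * rng.standard_normal((n, 1)) * z
print(np.flatnonzero(np.abs(diag_threshold_pca(X, tau=1.3)) > 0.05))
```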
Low-rank matrix approximation
Set-up: given matrix Γ ∈ R^{p×p} of rank k ≪ p.
low rank is equivalent to sparsity in the eigenvalues: {λ₁(Γ), λ₂(Γ), ..., λ_p(Γ)} has k non-zeros.
Observations: For i = 1, 2, ..., n:
$$Y_i \;=\; \operatorname{trace}\bigl(X_i^T \Gamma\bigr) + W_i \;=\; \operatorname{trace}\bigl(X_i^T\, U^T \operatorname{diag}\{\lambda_1(\Gamma), \ldots, \lambda_p(\Gamma)\}\, U\bigr) + W_i.$$
Goal: Recover matrix Γ (exactly or approximately).
Some relevant work: Recht et al., 2007; Bach, 2008; Meka et al., ICML 2008.
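A small numpy sketch of this observation model (all quantities synthetic; this only generates the trace measurements, it does not solve the recovery problem):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 20, 3, 50

# Rank-k matrix Gamma = U^T diag(lambda_1, ..., lambda_p) U with only k nonzero eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
lam = np.zeros(p); lam[:k] = rng.uniform(1.0, 2.0, size=k)
Gamma = U.T @ np.diag(lam) @ U

# Observations Y_i = trace(X_i^T Gamma) + W_i, using trace(X_i^T Gamma) = <X_i, Gamma>
Xs = rng.standard_normal((n, p, p))
W = 0.1 * rng.standard_normal(n)
Y = np.einsum('ijk,jk->i', Xs, Gamma) + W
print(np.linalg.matrix_rank(Gamma), Y[:3])
```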
Compressed sensing
Basis pursuit (ℓ1-relaxation, linear programming) (Chen, Donoho & Saunders, 1998)
Given noiseless observations Y = Xβ*, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.
ℓ0 problem:
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad Y = X\beta.$$
ℓ1 relaxation (LP₁):
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad Y = X\beta.$$
Question: When does the solution of the LP relaxation recover the sparsest solution (optimal for ℓ0)?
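A minimal sketch of LP₁ using a generic LP solver (not the algorithms of the cited papers): split β = β⁺ − β⁻ with β⁺, β⁻ ≥ 0, so that ‖β‖₁ becomes a linear objective and Y = Xβ a linear equality constraint. Problem sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, Y):
    """Solve min ||beta||_1 s.t. Y = X beta via the split beta = bp - bm, bp, bm >= 0."""
    n, p = X.shape
    c = np.ones(2 * p)                      # objective: sum(bp) + sum(bm) = ||beta||_1
    A_eq = np.hstack([X, -X])               # X bp - X bm = Y
    res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=(0, None))
    return res.x[:p] - res.x[p:]

# toy check: recover a 3-sparse vector from n = 40 < p = 100 Gaussian measurements
rng = np.random.default_rng(0)
n, p, k = 40, 100, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
beta_hat = basis_pursuit(X, X @ beta_star)
print(np.allclose(beta_hat, beta_star, atol=1e-4))
```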
Theoretical guarantees for basis pursuit
Exact recovery:
earlier results on deterministic measurement matrices (Donoho & Huo, 2001; Elad & Bruckstein, 2002; Feuer & Nemirovski, 2003)
later results on random measurement matrices (Donoho, 2005; Candes & Tao, 2005; Donoho & Tanner, 2006):
– standard Gaussian ensemble: X ∈ R^{n×p} with X_{ij} ∼ N(0, 1) i.i.d.
– signal β* is k-sparse, with k = αp, for some α ∈ (0, 1/2)
– then there exists a constant γ = f(α) < 1 such that n = γp observations suffice for BP to recover β*
Approximate recovery: (Donoho, 2006; Candes et al., 2006; Tropp, 2006; Gilbert et al., 2006)
say the linear system Y = Xβ has a near-sparse solution β*
the basis pursuit LP (ℓ1-relaxation) approximately recovers the best k-term approximation to β* (e.g., in ℓ2 norm)
LP duality and almost-spherical sections
basis pursuit LP recovers the support set S = {i | βᵢ* ≠ 0} if and only if there exists λ ∈ Rⁿ such that
$$X_S^T \lambda = \operatorname{sign}(\beta^*_S), \qquad \bigl|X_{S^c}^T \lambda\bigr| < 1 \ \text{(elementwise)}.$$
after some further algebraic manipulation, it is sufficient to show that
$$\underbrace{\|\vec{e}\|_2 \;\le\; \|\vec{e}\|_1}_{\text{trivial}} \;\le\; \underbrace{f\Bigl(\tfrac{|S|}{n}\Bigr)\,\|\vec{e}\|_2}_{\text{non-trivial}} \qquad \forall\, \vec{e} \in \mathbb{R}^p \ \text{s.t.}\ X_S^T \vec{e} = 0.$$
random slices of the ℓ1 ball are almost spherical (Dvoretsky, 1976; Milman & Schechtman, 1986)
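A numerical sketch of this certificate: take λ as the minimum-norm solution of X_Sᵀλ = sign(β*_S) and check strict feasibility on S^c. This tests one particular candidate witness, so it is a sufficient check only; the helper name and toy ensemble are illustrative.

```python
import numpy as np

def dual_witness_holds(X, beta_star, tol=1e-10):
    """Check the LP-duality certificate: exists lambda with X_S^T lam = sign(beta_S*)
    and |X_{S^c}^T lam| < 1, using the minimum-norm candidate lambda."""
    S = np.flatnonzero(np.abs(beta_star) > tol)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    lam, *_ = np.linalg.lstsq(X[:, S].T, np.sign(beta_star[S]), rcond=None)
    return np.max(np.abs(X[:, Sc].T @ lam)) < 1.0

rng = np.random.default_rng(0)
n, p, k = 80, 200, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
print(dual_witness_holds(X, beta_star))
```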
Design of measurement matrices
restricted isometry property (RIP) ≡ controlled eigenspectrum of X_Sᵀ X_S:
– sufficient condition for recovery (Candes & Tao, 2005)
– link to Johnson-Lindenstrauss and random projections (Baraniuk et al., 2007)
random sub-Gaussian matrices have desirable properties with high probability, but explicit constructions or "learned" matrices?
lots of on-going work:
– sparse graph methods and expanders, inspired by coding theory (e.g., Cormode & Muthukrishnan, 2005; Gilbert et al., 2006; Sarvotham et al., 2006; Wu & Hassibi, 2007; Guruswami & Lee, 2008)
– learning design matrices (e.g., Weiss et al., 2007; Seeger et al., 2008)
– sparse random projections (e.g., Achlioptas, 2001; Wang et al., 2007)
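An illustrative numpy check of the "controlled eigenspectrum" property for a standard Gaussian design: singular values of randomly chosen n × k submatrices X_S concentrate around √n. Sampling random subsets is only a heuristic probe; genuine RIP is a statement about all subsets of size k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, trials = 100, 400, 10, 200

X = rng.standard_normal((n, p))
extremes = []
for _ in range(trials):
    S = rng.choice(p, size=k, replace=False)
    s = np.linalg.svd(X[:, S], compute_uv=False)   # spectrum of the submatrix X_S
    extremes.append((s.min(), s.max()))

extremes = np.array(extremes) / np.sqrt(n)         # RIP-like if these stay close to 1
print("min sigma/sqrt(n):", extremes[:, 0].min(), " max sigma/sqrt(n):", extremes[:, 1].max())
```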
Sparse linear regression (noisy observations)
ℓ1-relaxed QP: (Tibshirani, 1996; Chen et al., 1998)
Given noisy observations Y = Xβ* + W, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.
ℓ0 problem:
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_0 \le C.$$
ℓ1 relaxation (Lasso):
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le C.$$
Question: In what sense does the QP solution β̂ approximate the true β*?
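A minimal sketch with scikit-learn's Lasso, which solves the equivalent penalized form min ‖Y − Xβ‖₂²/(2n) + α‖β‖₁ rather than the constrained form above; the problem sizes and the choice of α are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, sigma = 200, 256, 8, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = 1.0
Y = X @ beta_star + sigma * rng.standard_normal(n)

alpha = sigma * np.sqrt(2 * np.log(p) / n)      # theory-inspired level, illustrative
beta_hat = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y).coef_

print("ell_2 error :", np.linalg.norm(beta_hat - beta_star))
print("true support:", sorted(np.flatnonzero(beta_star)))
print("est. support:", sorted(np.flatnonzero(np.abs(beta_hat) > 1e-3)))
```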
Various loss functions
Given an estimate β̂, how should it be compared to the truth β*?
Prediction: When using the model to predict a new sample Y′:
average loss R(β̂, β*) = E[ℓ(Y′; X′, β̂)], where (Y′, X′) ∼ P_{β*};
e.g., mean-squared error E[(Y′ − X′ᵀ β̂)²].
Estimation: When the signal β* is of primary interest (e.g., compressed sensing):
assess the behavior of ℓ_a-metrics ‖β̂ − β*‖_a for some a ≥ 1; bounds in expectation, or in probability.
Signed support recovery: When the support S(β*) = {i | βᵢ* ≠ 0} is of primary interest (e.g., variable selection; graphical models):
0–1 loss: I[S±(β̂) ≠ S±(β*)] = 1 if the signed supports differ, and 0 otherwise.
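A small numpy sketch computing the three criteria for the linear-Gaussian case; the population prediction risk is approximated by a large fresh sample, and the names and noise level are illustrative.

```python
import numpy as np

def compare_to_truth(beta_hat, beta_star, sigma=0.5, n_new=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p = beta_star.size

    # Prediction: mean-squared error on fresh draws (Y', X') ~ P_{beta*}
    X_new = rng.standard_normal((n_new, p))
    Y_new = X_new @ beta_star + sigma * rng.standard_normal(n_new)
    pred_mse = np.mean((Y_new - X_new @ beta_hat) ** 2)

    # Estimation: ell_2 error
    est_err = np.linalg.norm(beta_hat - beta_star)

    # Signed support recovery: 0-1 loss
    signed = lambda b: [(i, np.sign(b[i])) for i in np.flatnonzero(b)]
    zero_one = int(signed(beta_hat) != signed(beta_star))

    return pred_mse, est_err, zero_one
```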
Some geometric intuition
[Figure: (a) Favourable setting; (b) Poorly conditioned.]
Empirical behavior of Lasso for variable selection
[Figure: probability of success vs. number of samples n (0 to 600), for p = 128, 256, 512; ρ = 0.10, linear sparsity.]
Empirical behavior: Appropriately rescaled (Wainwright, 2006)
[Figure: probability of success vs. control parameter θ (0 to 2), for p = 128, 256, 512, with thresholds T_low and T_up marked; ρ = 0.10, linear sparsity.]
control parameter for rescaling: θ(n, p, β*) = n / f(p, β*) for some complexity function f
Theoretical guarantees for Lasso
say X ∈ R^{n×p} has i.i.d. rows, drawn from a (sub-)Gaussian ensemble with covariance Λ, and the signal β* ∈ R^p has k non-zeros
solve the ℓ1-regularized QP (Lasso):
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda_n \|\beta\|_1.$$
Prediction E[(Y′ − Xᵀβ̂)²]: (e.g., Greenshtein & Ritov, 2003; Candes & Tao, 2006; van de Geer, 2007)
Estimation error ‖β̂ − β*‖₂: (e.g., Donoho, 2006; Candes & Tao, 2006; Meinshausen & Yu, 2007; Koltchinskii, 2006; Tsybakov et al., 2008; Zhang, 2007)
Variable selection: (e.g., Meinshausen & Buhlmann, 2006; Wainwright, 2006; Zhao & Yu, 2006)
there exist constants θ_ℓ(Λ) ≤ 1 ≤ θ_u(Λ) such that:
– if n / (2k log(p − k)) > θ_u, then the Lasso recovers the support of β* with probability greater than 1 − exp(−c log p);
– conversely, if n / (2k log(p − k)) < θ_ℓ, then the Lasso fails w.p. one.
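A simulation sketch of this scaling: estimate the empirical probability of exact signed-support recovery as a function of the control parameter θ = n / (2k log(p − k)). The regularization level and problem sizes are heuristic choices for illustration, not the constants of the theorem.

```python
import numpy as np
from sklearn.linear_model import Lasso

def success_prob(theta, p=128, k=8, sigma=0.5, trials=20, seed=0):
    """Empirical probability of exact signed-support recovery at control parameter theta."""
    rng = np.random.default_rng(seed)
    n = int(np.ceil(theta * 2 * k * np.log(p - k)))
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = 1.0
        Y = X @ beta_star + sigma * rng.standard_normal(n)
        alpha = sigma * np.sqrt(2 * np.log(p) / n)            # heuristic level
        beta_hat = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y).coef_
        recovered = np.sign(np.where(np.abs(beta_hat) > 1e-3, beta_hat, 0.0))
        hits += np.array_equal(recovered, np.sign(beta_star))
    return hits / trials

for theta in (0.5, 1.0, 1.5, 2.0):
    print(theta, success_prob(theta))
```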
Sparsity and graphical model selection
graphical models based on a "sparse" factorization of the distribution:
$$\mathbb{P}(x; \theta) \;=\; \frac{1}{Z(\theta)} \exp\Bigl\{ \sum_{(s,t) \in E} \theta_{st} x_s x_t \Bigr\}.$$
searching over graphs with at most k edges is equivalent to ‖θ‖₀ ≤ k.
given i.i.d. samples X⁽¹⁾, ..., X⁽ⁿ⁾, regularized version of maximum likelihood:
ℓ0 problem:
$$\max_{\theta} \ \frac{1}{n} \sum_{i=1}^n \log \mathbb{P}(X^{(i)}; \theta) \quad \text{s.t.} \quad \|\theta\|_0 \le C.$$
ℓ1 relaxation:
$$\max_{\theta} \ \frac{1}{n} \sum_{i=1}^n \log \mathbb{P}(X^{(i)}; \theta) \quad \text{s.t.} \quad \|\theta\|_1 \le C.$$
calculation of the likelihood: intractable for general discrete models
Reduction to neighborhood selection
[Figure: graph on nodes X₁, ..., X₆, with the neighborhood of X₁ highlighted.]
conditioned on (x₂, ..., x_p), variable X₁ ∈ {−1, +1} has conditional distribution (e.g., for binary variables):
$$\mathbb{P}_{\theta^*}(X_1 = 1 \mid x_2, \ldots, x_p) \;=\; \Bigl[\,1 + \exp\Bigl(-\sum_{j \in N(1)} \theta^*_{1j}\, x_j\Bigr)\Bigr]^{-1}.$$
the neighborhood structure can be recovered by local regressions (Meinshausen & Buhlmann, 2006; Ravikumar et al., 2006, 2008)
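A minimal sketch of this reduction using ℓ1-penalized logistic regression from scikit-learn, run node-by-node on ±1 samples. The Gibbs-sampled chain graph, the penalty level C, and the coefficient threshold are illustrative assumptions, not the exact estimator or constants analyzed in the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhoods(X, C=0.05, thresh=0.1):
    """For each node s, fit an l1-penalized logistic regression of X_s on the other
    variables; coefficients above the threshold give the estimated neighborhood N(s)."""
    n, p = X.shape
    hoods = {}
    for s in range(p):
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", fit_intercept=False)
        clf.fit(X[:, others], X[:, s])
        hoods[s] = set(others[np.abs(clf.coef_.ravel()) > thresh])
    return hoods

# toy data: Gibbs sampling from a chain Ising model X_1 - X_2 - ... - X_6, coupling 0.8
rng = np.random.default_rng(0)
p, n, theta = 6, 2000, 0.8
X = np.where(rng.random((n, p)) < 0.5, -1, 1)
for _ in range(30):                                   # crude Gibbs sweeps
    for s in range(p):
        field = theta * (X[:, s - 1] if s > 0 else 0) + theta * (X[:, s + 1] if s < p - 1 else 0)
        X[:, s] = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-2 * field)), 1, -1)
print(estimate_neighborhoods(X))
```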
Phase transition behavior
[Figure: probability of success vs. control parameter, for p = 64, 100, 225. Left panel: 4-nearest-neighbor grid (mixed); right panel: star graph, logarithmic neighbors.]
Set-up: Binary random vector (X₁, ..., X_p); graphical model with maximum degree d.
Sufficient conditions: ℓ1-neighborhood regression recovers the graph structure w.h.p. with n = Ω(d³ log(p − d)) samples. (Ravikumar et al., NIPS 2006)
Block-structured and hierarchical regularization
Structured regularization: (Rocha et al., 2006; Yuan & Lin, 2006; Obozinski et al., 2007; Meier et al., 2007; Bach, 2008; Roth & Fischer, 2008; Negahban & Wainwright, 2008)
[Figure: (a) ℓ1/ℓ2 norm; (b) ℓ1/ℓ∞ norm.]
Motivation:
– data with shared sparsity (e.g., β₁ ≠ 0 ⟺ β₂ ≠ 0): simultaneous recovery, data fusion, time series, etc.
– data with hierarchical sparsity (e.g., β₁ ≠ 0 ⟹ β₂ ≠ 0)
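A tiny numpy sketch of the two block norms pictured above, for a coefficient vector split into illustrative groups:

```python
import numpy as np

def block_norm(beta, groups, inner):
    """l1/l_inner block norm: sum over groups of the inner norm of each block."""
    return sum(np.linalg.norm(beta[g], ord=inner) for g in groups)

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print("l1/l2  :", block_norm(beta, groups, 2))        # sum of group l2 norms
print("l1/linf:", block_norm(beta, groups, np.inf))   # sum of group max-abs values
```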
Information-theoretic analysis
Various results on ℓ1-regularization: but are these "good" results?
Core issue: Disregarding computational complexity, what are the fundamental limits in high-dimensional learning?
– can be addressed with information-theoretic methods
– communication channel: sends information about {sparse vector, graph, low-rank matrix} to the user
Possible insights:
– show optimality of ℓ1-methods (as good as any other algorithm)
– conversely, reveal sub-optimality of ℓ1-methods
Open question: When ℓ1 is sub-optimal, do there exist alternative polynomial-time methods with optimal scaling?
Geometric intuition
[Figure: the truth β*, a near-by model at distance d₁, and a far-away model at distance d₂.]
Error probability controlled by two competing quantities:
– number of models with non-overlap t: $N(t) = \binom{k}{k-t}\binom{p-k}{t}$
– distinguishability of models with overlap t
Near-by: $d_1(1) \approx \beta_{\min}^2$, with $N(1) = \binom{k}{1}\binom{p-k}{1}$.
Far-away: $d_2(k) \approx \log(1 + k\,\beta_{\min}^2)$, with $N(k) = \binom{p-k}{k}$.
Information theory: necessary/sufficient conditions
Set-up: Given a k-sparse vector β*; minimum value β_min := min_{i∈S} |βᵢ*|.
Necessary conditions: (Baron et al., 2006; Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008; Fletcher et al., 2008; Wang et al., 2008)
Wang et al. (2008): for any design X with i.i.d. rows and covariance Σ:
$$n \;>\; \max\Biggl\{ \underbrace{\frac{\log\binom{p}{k}}{\tfrac{1}{2}\log\bigl[1 + c_1(\Sigma)\,k\,\beta_{\min}^2\bigr]}}_{\text{Far-away subsets}}, \;\; \underbrace{\frac{\log(p-k)}{c_2(\Sigma)\,\beta_{\min}^2}}_{\text{Near-by subsets}} \Biggr\}.$$
Sufficient conditions for the oracle decoder: (Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008)
Wainwright (2007): for the standard Gaussian ensemble, the oracle decoder succeeds for
$$n \;>\; \Omega\Bigl( \max\Bigl\{ \log\binom{p}{k},\; \frac{\log(p-k)}{\beta_{\min}^2} \Bigr\} \Bigr).$$
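A small numpy/scipy sketch that evaluates the two competing terms of the necessary condition, with the unspecified constants c₁(Σ), c₂(Σ) set to 1 purely for illustration:

```python
import numpy as np
from scipy.special import gammaln

def log_binom(p, k):
    """log of the binomial coefficient C(p, k), computed stably."""
    return gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)

def necessary_terms(p, k, beta_min, c1=1.0, c2=1.0):
    far_away = log_binom(p, k) / (0.5 * np.log(1.0 + c1 * k * beta_min ** 2))
    near_by = np.log(p - k) / (c2 * beta_min ** 2)
    return far_away, near_by, max(far_away, near_by)

# compare which term dominates for a weak vs. strong minimum signal
print(necessary_terms(p=1000, k=10, beta_min=0.1))
print(necessary_terms(p=1000, k=10, beta_min=2.0))
```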
Summary and open questions
ℓ1-regularization and sparsity:
– natural polynomial-time relaxation of difficult ℓq problems (q ∈ [0, 1))
– attractive theoretical guarantees (high-dimensional scaling)
– various connections to Bayesian procedures
– lots of exciting applications (e.g., compressed sensing [Baraniuk, next talk]; audio/images/video [Saul, last talk])
moving beyond ℓ1, and other open questions:
– fundamental limits of optimal procedures (ℓq, q < 1, etc.)
– block and hierarchical regularization: when does more computation yield better results?
– Bayesian versions, and adaptive procedures (learning designs)
– other forms of complexity regularization for high-dimensional data: manifolds, dimensionality reduction, etc.