Sparsity and statistical inference in high-dimensional settings

Martin Wainwright
Department of Statistics, and Department of Electrical Engineering and Computer Science
UC Berkeley, California, USA
wainwrig@{stat,eecs}.berkeley.edu

ICML 2008, Helsinki, Finland: Workshop on sparsity and variable selection


Introduction

• classical asymptotic theory in statistical inference:
  – number of observations n → +∞
  – model dimension p stays fixed

• contrast with many modern applications and models:
  – systems frequently large (p ≈ 10^3–10^8), and observations costly
  – interesting consequences: might have p = Θ(n) or even p ≫ n

• distinct issues in high-dimensional inference:
  – performance limits of specific polynomial-complexity methods?
  – fundamental (information-theoretic) limits that apply to any procedure?

• some form of complexity regularization is required to avoid the curse of dimensionality (e.g., sparsity)

Different sparsity models

Given signal vector β* ∈ R^p:

Hard-sparsity model: Support

    S(β*) = {i ∈ {1, ..., p} : β*_i ≠ 0}.

Exactly k non-zero elements:

    |S(β*)| = k.

Weak ℓ_q-sparsity models: For q ∈ [0, 1), define the ℓ_q-ball

    B_q(C_q) = { u ∈ R^p : Σ_{i=1}^p |u_i|^q ≤ C_q }.

Require that β* ∈ B_q(C_q) for some q ∈ [0, 1).
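The toy sketch below (values and the use of NumPy are my own illustrative choices, not from the talk) makes the two definitions concrete: the hard-sparsity support and its size, and a membership check for the weak ℓ_q ball.

```python
import numpy as np

# Toy sketch of the two sparsity models above (values are illustrative).
beta = np.array([0.0, 1.5, 0.0, -0.2, 0.0, 3.0])

# Hard sparsity: support S(beta) = {i : beta_i != 0} and its cardinality k.
support = np.flatnonzero(beta)
k = support.size

# Weak ell_q sparsity: membership in the ell_q "ball" B_q(C_q), for q in (0, 1).
def in_weak_lq_ball(u, q, C_q):
    return np.sum(np.abs(u) ** q) <= C_q

print(support, k, in_weak_lq_ball(beta, q=0.5, C_q=5.0))
```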

ℓ1-norm relaxation

Problem: ℓ_q "ball" constraints are non-convex (computationally intractable in general).

[Figure: the non-convex ℓ_q ball (q < 1) and its convex relaxation, the ℓ1 ball.]

Standard relaxation: Relax the non-convex constraints to the nearest convex norm, namely the ℓ1 norm.

Compressed sensing

[Figure: Y (n-vector) = X (n × p matrix) times β* (p-vector with k non-zero and p − k zero entries).]

Set-up: Under-determined sparse linear system: Y = Xβ*, with ‖β*‖_0 = k ≪ p.

Goal: Recover β* using n ≪ p measurements.

Some relevant work: Chen et al., 1998; Huo & Donoho, 2001; Elad & Bruckstein, 2002; Candes et al., 2005; Candes/Tao, 2005; Donoho, 2005; Donoho & Tanner, 2006; Baron et al., 2006; Baraniuk, 2007; Indyk et al., 2008

Sparse log-linear models

[Figure: Y (n-vector) = f( X (n × p matrix) · β* (p-vector with k non-zero entries), W (n-vector of noise) ).]

Set-up: General log-linear observation model: (Y_i | X_i) ∼ P[X_i^T β*].

Goal: Estimate β* (various loss functions possible).

Some relevant work: Tibshirani, 1996; Chen et al., 1998; Ng, 2004; Tropp, 2004; Fuchs, 2005; Candes & Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Wainwright, 2006; Rocha et al., 2007; van de Geer, 2007; Ji & Carin, 2007; Tsybakov et al., 2008.

Graphical model selection

• let G = (V, E) be an undirected graph on p = |V| vertices

• consider a p-dimensional random vector Z with density (see the sketch below):

    p(Z; β) ∝ exp{ Σ_{i∈V} (β_i Z_i + β_ii Z_i²) + Σ_{(i,j)∈E} β_ij Z_i Z_j }.

• given n samples of Z, identify the underlying graph structure

Some relevant work: Abbeel et al., 2006; Meinshausen & Buhlmann, 2006; Rothman et al., 2008; Ravikumar et al., 2006, 2008; Johnson et al., 2007; Santhanam & Wainwright, 2008; Banerjee et al., 2008
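As a hedged illustration of the factorization above, the sketch below evaluates the unnormalized log-density for a small toy graph; the graph, parameter values, and function names are assumptions made for the example, not from the talk.

```python
import numpy as np

# Unnormalized log-density of the model above (up to the log-partition term):
# log p(Z; beta) = sum_i (beta_i Z_i + beta_ii Z_i^2) + sum_{(i,j) in E} beta_ij Z_i Z_j.
def unnormalized_log_density(z, beta_node, beta_diag, beta_edge, edges):
    node_terms = np.dot(beta_node, z) + np.dot(beta_diag, z ** 2)
    edge_terms = sum(beta_edge[(i, j)] * z[i] * z[j] for (i, j) in edges)
    return node_terms + edge_terms

# Toy 4-node graph whose edges form a cycle (all values are made up).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
beta_node = np.array([0.1, -0.2, 0.0, 0.3])
beta_diag = np.array([-0.5, -0.5, -0.5, -0.5])
beta_edge = {e: 0.4 for e in edges}

z = np.array([1.0, -1.0, 1.0, 1.0])
print(unnormalized_log_density(z, beta_node, beta_diag, beta_edge, edges))
```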

Sparse principal components analysis

[Figure: covariance matrix Σ (p × p) = Z Zᵀ + D, with Z having sparse columns and D diagonal.]

Set-up: Covariance matrix Σ = Z Zᵀ + D, where the leading eigenspace Z has sparse columns.

Goal: Produce an estimate Ẑ based on samples X^(i) with covariance matrix Σ.

Some relevant work: Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Moghaddam et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008.
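A minimal sketch of the set-up, assuming a single sparse spike and a crude thresholded-PCA heuristic; this is illustrative only, and is not one of the specific estimators from the works cited above. The threshold 2/√n is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 100, 200, 5

# Single sparse, unit-norm leading direction z and covariance Sigma = 4 z z^T + I.
z = np.zeros(p)
z[:k] = 1.0 / np.sqrt(k)
Sigma = 4.0 * np.outer(z, z) + np.eye(p)

# Draw samples, form the sample covariance, and take its leading eigenvector.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(S)
v = eigvecs[:, -1]

# Crude hard threshold to enforce sparsity, then renormalize.
v_hat = np.where(np.abs(v) > 2.0 / np.sqrt(n), v, 0.0)
v_hat /= np.linalg.norm(v_hat)

print("overlap |<v_hat, z>| =", abs(v_hat @ z))
```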

Low-rank matrix approximation

Set-up:
• given matrix Γ ∈ R^{p×p} (rank k ≪ p)
• low rank is equivalent to sparsity in the eigenvalues:

    {λ_1(Γ), λ_2(Γ), ..., λ_p(Γ)} has k non-zeros.

Observations: For i = 1, 2, ..., n:

    Y_i = trace(X_i^T Γ) + W_i
        = trace( X_i^T U^T diag{λ_1(Γ), ..., λ_p(Γ)} U ) + W_i.

Goal: Recover the matrix Γ (exactly or approximately).

Some relevant work: Recht et al., 2007; Bach, 2008; Meka et al., ICML 2008
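A small sketch of the trace-measurement observation model above; the sizes, the noise level, and the Gaussian choice of the X_i are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 30, 3, 200
sigma = 0.1

# Rank-k ground truth Gamma = A B^T.
A = rng.standard_normal((p, k))
B = rng.standard_normal((p, k))
Gamma = A @ B.T

# Observations Y_i = trace(X_i^T Gamma) + W_i with Gaussian X_i and noise W_i;
# trace(X^T Gamma) = sum_{j,k} X_{jk} Gamma_{jk}, computed via einsum.
X = rng.standard_normal((n, p, p))
Y = np.einsum('ijk,jk->i', X, Gamma) + sigma * rng.standard_normal(n)

print(Y[:5])
```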

Compressed sensing

Basis pursuit (ℓ1-relaxation, linear programming)        (Chen, Donoho & Saunders, 1998)

Given noiseless observations Y = Xβ*, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.

ℓ0 problem:

    min_{β ∈ R^p} ‖β‖_0   s.t.   Y = Xβ.

ℓ1 relaxation (LP1):

    min_{β ∈ R^p} ‖β‖_1   s.t.   Y = Xβ.

Question: When does the solution of the LP relaxation recover the sparsest solution (optimal for ℓ0)?
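A hedged sketch of basis pursuit: the ℓ1 problem is rewritten as a linear program by splitting β into non-negative parts, and solved here with SciPy's linprog. The Gaussian ensemble, problem sizes, and solver are illustrative choices, not prescribed by the talk; when n is comfortably larger than k log p, the recovered β̂ typically matches β* up to numerical precision.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, k = 40, 100, 5

X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
Y = X @ beta_star                        # noiseless observations

# min ||beta||_1 s.t. X beta = Y, with beta = beta_plus - beta_minus, both >= 0.
c = np.ones(2 * p)                       # objective: sum of beta_plus + beta_minus
A_eq = np.hstack([X, -X])                # X (beta_plus - beta_minus) = Y
res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=[(0, None)] * (2 * p), method="highs")

beta_hat = res.x[:p] - res.x[p:]
print("recovery error:", np.linalg.norm(beta_hat - beta_star))
```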

Theoretical guarantees for basis pursuit

Exact recovery:
• earlier results on deterministic measurement matrices (Donoho & Huo, 2001; Elad & Bruckstein, 2002; Feuer & Nemirovski, 2003)
• later results on random measurement matrices (Donoho, 2005; Candes/Tao, 2005; Donoho & Tanner, 2006)
  – standard Gaussian ensemble: X ∈ R^{n×p} with X_ij ∼ N(0, 1) i.i.d.
  – signal β* is k-sparse, with k = αp for some α ∈ (0, 1/2)
  – then there exists a constant γ = f(α) < 1 such that n = γp observations suffice for BP to recover β*

Approximate recovery: (Donoho, 2006; Candes et al., 2006; Tropp, 2006; Gilbert et al., 2006)
• say the linear system Y = Xβ has a near-sparse solution β*
• basis pursuit LP (ℓ1-relaxation) approximately recovers the best k-term approximation to β* (e.g., in ℓ2 norm)

LP duality and almost-spherical sections

• basis pursuit LP recovers the support set S = {i : β*_i ≠ 0} if and only if there exists λ ∈ R^n such that X_S^T λ = sign(β*_S) and |X_{S^c}^T λ| < 1 (elementwise)
• after some further algebraic manipulation, it is sufficient to show that for all e ∈ R^p with X_S^T e = 0,

    ‖e‖_2 ≤ ‖e‖_1 ≤ f(|S|/n) ‖e‖_2,

  where the left inequality is trivial and the right one is the non-trivial part
• random slices of the ℓ1 ball are almost spherical (Dvoretsky, 1976; Milman & Schechtman, 1986)
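The first bullet suggests a direct numerical check: build the minimum-norm candidate certificate λ = X_S (X_S^T X_S)^{-1} sign(β*_S), which satisfies X_S^T λ = sign(β*_S) by construction, and test whether |X_j^T λ| < 1 off the support. The sketch below does this for a random Gaussian design with illustrative sizes; the least-squares λ is only one convenient candidate, so the check is sufficient rather than necessary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 80, 200, 5

X = rng.standard_normal((n, p))
S = np.arange(k)                              # support, w.l.o.g. the first k coordinates
sign_S = rng.choice([-1.0, 1.0], size=k)      # sign pattern sign(beta*_S)

X_S, X_Sc = X[:, S], X[:, k:]
# Minimum-norm lambda with X_S^T lambda = sign(beta*_S).
lam = X_S @ np.linalg.solve(X_S.T @ X_S, sign_S)

print("max over S^c of |X_j^T lambda| =", np.max(np.abs(X_Sc.T @ lam)))  # want < 1
```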

Design of measurement matrices

• restricted isometry property (RIP) ≡ controlled eigenspectrum of X_S^T X_S: a sufficient condition for recovery (Candes & Tao, 2005)
• link to Johnson-Lindenstrauss and random projections (Baraniuk et al., 2007)
• random sub-Gaussian matrices have desirable properties with high probability, but what about explicit constructions or "learned" matrices?
• lots of on-going work:
  – sparse graph methods and expanders (inspired by coding theory) (e.g., Cormode & Muthukrishnan, 2005; Gilbert et al., 2006; Sarvotham et al., 2006; Wu & Hassibi, 2007; Guruswami & Lee, 2008)
  – learning design matrices (e.g., Weiss et al., 2007; Seeger et al., 2008)
  – sparse random projections (e.g., Achlioptas, 2001; Wang et al., 2007); see the sketch after this list
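As a hedged illustration of the last item, the sketch below forms a sparse random projection in the style of Achlioptas (entries +1, 0, −1 with probabilities 1/6, 2/3, 1/6, suitably rescaled) and checks approximate norm preservation on one vector; the dimensions and the check itself are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5000

# Entries in {+1, 0, -1} with probabilities {1/6, 2/3, 1/6}, scaled so that
# E||Phi x||^2 = ||x||^2; about two thirds of the entries are exactly zero.
vals = rng.choice([1.0, 0.0, -1.0], size=(n, p), p=[1 / 6, 2 / 3, 1 / 6])
Phi = np.sqrt(3.0 / n) * vals

x = rng.standard_normal(p)
print("||x||     =", np.linalg.norm(x))
print("||Phi x|| =", np.linalg.norm(Phi @ x))   # approximately equal w.h.p.
```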

Sparse linear regression (noisy observations)

ℓ1-relaxed QP:        (Tibshirani, 1996; Chen et al., 1998)

Given noisy observations Y = Xβ* + W, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.

ℓ0 problem:

    min_{β ∈ R^p} ‖Y − Xβ‖_2²   s.t.   ‖β‖_0 ≤ C.

ℓ1 relaxation (Lasso):

    min_{β ∈ R^p} ‖Y − Xβ‖_2²   s.t.   ‖β‖_1 ≤ C.

Question: In what sense does the QP solution β̂ approximate the true β*?
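A brief sketch of the Lasso in its penalized (Lagrangian) form, which is equivalent to the constrained QP above for a suitable value of C; scikit-learn, the problem sizes, and the particular α are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, k, sigma = 100, 200, 5, 0.5

X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:k] = 2.0
Y = X @ beta_star + sigma * rng.standard_normal(n)

# Penalized form (1/2n)||Y - X beta||_2^2 + alpha ||beta||_1, as solved by sklearn.
lasso = Lasso(alpha=0.1).fit(X, Y)
beta_hat = lasso.coef_

print("ell_2 estimation error:", np.linalg.norm(beta_hat - beta_star))
print("estimated support:", np.flatnonzero(beta_hat))
```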

Various loss functions

Given an estimate β̂, how to compare it to the truth β*?

Prediction: When using the model to predict a new sample Y′:
• average loss: R(β̂, β*) = E[ℓ(Y′; X′, β̂)], where (Y′, X′) ∼ P_{β*}
• e.g., mean-squared error E[(Y′ − X′ᵀ β̂)²]

Estimation: When the signal β* is of primary interest (e.g., compressed sensing):
• assess the behavior of ℓ_a metrics ‖β̂ − β*‖_a for some a ≥ 1
• bounds in expectation, or in probability

Signed support recovery: When the support S(β*) = {i : β*_i ≠ 0} is of primary interest (e.g., variable selection; graphical models):
• 0–1 loss: I[S_±(β̂) ≠ S_±(β*)] = 1 if the signed supports differ, and 0 otherwise
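The sketch below evaluates the three criteria for a made-up estimate β̂: empirical prediction MSE on fresh samples, the ℓ2 estimation error, and the 0–1 loss on the signed support. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_new, sigma = 50, 1000, 1.0

beta_star = np.zeros(p); beta_star[:3] = [1.0, -2.0, 0.5]
beta_hat  = np.zeros(p); beta_hat[:3]  = [0.9, -1.8, 0.0]

# Prediction loss: empirical MSE on fresh samples (Y', X') ~ P_{beta*}.
X_new = rng.standard_normal((n_new, p))
Y_new = X_new @ beta_star + sigma * rng.standard_normal(n_new)
mse = np.mean((Y_new - X_new @ beta_hat) ** 2)

# Estimation loss: ell_2 metric ||beta_hat - beta*||_2.
l2_error = np.linalg.norm(beta_hat - beta_star)

# Signed support recovery: 0-1 loss comparing the sign patterns.
zero_one = int(not np.array_equal(np.sign(beta_hat), np.sign(beta_star)))

print(mse, l2_error, zero_one)
```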

Some geometric intuition

[Figure: (a) a favourable setting; (b) a poorly conditioned setting.]

Empirical behavior of Lasso for variable selection

[Figure: probability of success versus the number of samples n (0 to 600), titled "ρ = 0.10; Linear", with curves for p = 128, 256, 512.]

Empirical behavior: Appropriately rescaled        (Wainwright, 2006)

[Figure: probability of success versus the control parameter θ (0 to 2), titled "ρ = 0.10; Linear", with thresholds T_low and T_up marked; the curves for p = 128, 256, 512 align after rescaling.]

• control parameter for rescaling: θ(n, p, β*) = n / f(p, β*) for some complexity function f (a simulation sketch follows)
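A hedged simulation sketch of this rescaling, using the complexity function f(p, β*) = 2k log(p − k) that appears on the next slide. The problem sizes, the regularization level, the use of scikit-learn's Lasso, and the exact-support-recovery success criterion are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
p, k, sigma = 128, 8, 0.5
true_support = np.arange(k)

def success_prob(theta, trials=20):
    # Rescaled sample size: n = theta * 2 k log(p - k).
    n = int(theta * 2 * k * np.log(p - k))
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        beta_star = np.zeros(p); beta_star[true_support] = 1.0
        Y = X @ beta_star + sigma * rng.standard_normal(n)
        lam = 2 * sigma * np.sqrt(np.log(p) / n)      # heuristic regularization level
        beta_hat = Lasso(alpha=lam).fit(X, Y).coef_
        hits += np.array_equal(np.flatnonzero(beta_hat), true_support)
    return hits / trials

for theta in [0.5, 1.0, 1.5, 2.0]:
    print(theta, success_prob(theta))
```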

Theoretical guarantees for Lasso

• say X ∈ R^{n×p} has i.i.d. rows, drawn from a (sub-)Gaussian ensemble with covariance Λ, and the signal β* ∈ R^p has k non-zeros
• solve the ℓ1-regularized QP (Lasso):

    min_{β ∈ R^p} ‖Y − Xβ‖_2² + λ_n ‖β‖_1.

Prediction E[(Y′ − Xᵀβ̂)²]: (e.g., Greenshtein/Ritov, 2003; Candes/Tao, 2006; van de Geer, 2007)

Estimation error ‖β̂ − β*‖_2: (e.g., Donoho, 2006; Candes/Tao, 2006; Meinshausen & Yu, 2007; Koltchinskii, 2006; Tsybakov et al., 2008; Zhang, 2007)

Variable selection: (e.g., Meinshausen/Buhlmann, 2006; Wainwright, 2006; Zhao/Yu, 2006)
• there exist constants θ_ℓ(Λ) ≤ 1 ≤ θ_u(Λ) such that:
  – if n / (2k log(p − k)) > θ_u, then the Lasso recovers the support of β* with probability greater than 1 − exp(−c log p)
  – conversely, if n / (2k log(p − k)) < θ_ℓ, then the Lasso fails with probability one

Sparsity and graphical model selection

• graphical models based on a "sparse" factorization of the distribution:

    P(x; θ) = (1/Z(θ)) exp{ Σ_{(s,t)∈E} θ_st x_s x_t }.

• searching over graphs with at most k edges is equivalent to ‖θ‖_0 ≤ k
• given i.i.d. samples X^(1), ..., X^(n), regularized versions of maximum likelihood:

  ℓ0 problem:

      max_θ (1/n) Σ_{i=1}^n log P(X^(i); θ)   s.t.   ‖θ‖_0 ≤ C.

  ℓ1 relaxation:

      max_θ (1/n) Σ_{i=1}^n log P(X^(i); θ)   s.t.   ‖θ‖_1 ≤ C.

• calculation of the likelihood: intractable for general discrete models

Reduction to neighborhood selection

[Figure: a small undirected graph on nodes X1, ..., X6, highlighting the neighborhood N(1) of node X1.]

• conditioned on (x_2, ..., x_p), the variable X_1 ∈ {−1, +1} has the conditional distribution (e.g., for binary variables)

    P_θ*(X_1 = 1 | x_2, ..., x_p) = [ 1 + exp( −Σ_{j∈N(1)} θ*_1j x_j ) ]^{−1},

  i.e., a logistic regression of X_1 on the other variables (see the sketch below)
• the neighborhood structure can be recovered by local regressions (Meinshausen & Buhlmann, 2006; Ravikumar et al., 2006, 2008)
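A hedged sketch of one such local regression: fit an ℓ1-penalized logistic regression of X_1 on the remaining variables and read off the non-zero coefficients as the estimated neighborhood. The data-generating process, sizes, and regularization level below are invented for illustration and are not the setting of the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 500, 10

X = rng.choice([-1.0, 1.0], size=(n, p))
# Artificially plant a dependence of X_1 on X_2 and X_3 (columns 1 and 2 here).
logits = 1.2 * X[:, 1] + 0.8 * X[:, 2]
X[:, 0] = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-2 * logits)), 1.0, -1.0)

# ell_1-penalized logistic regression of X_1 on the other variables;
# the non-zero coefficients estimate the neighborhood N(1).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X[:, 1:], X[:, 0])
neighborhood = 1 + np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)  # shift back to column indices

print("estimated neighborhood of X_1 (column indices):", neighborhood)
```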

Phase transition behavior

[Figure: probability of success versus the control parameter for two graph families, a star graph with logarithmic neighborhood size and a 4-nearest-neighbor grid with mixed couplings, each with curves for p = 64, 100, 225.]

Set-up: Binary random vector (X_1, ..., X_p); graphical model with maximum degree d.

Sufficient conditions: ℓ1-neighborhood regression recovers the graph structure w.h.p. with n = Ω(d³ log(p − d)) samples.        (Ravikumar et al., NIPS 2006)

Block-structured and hierarchical regularization

Structured regularization: (Rocha et al., 2006; Yuan/Lin, 2006; Obozinski et al., 2007; Meier et al., 2007; Bach, 2008; Roth/Fischer, 2008; Negahban/Wainwright, 2008)

[Figure: (a) unit ball of the ℓ1/ℓ2 norm; (b) unit ball of the ℓ1/ℓ∞ norm.]

Motivation:
• data with shared sparsity (e.g., β_1 ≠ 0 ⟺ β_2 ≠ 0): simultaneous recovery, data fusion, time series, etc. (the block norms are sketched below)
• data with hierarchical sparsity (e.g., β_1 ≠ 0 ⟹ β_2 ≠ 0)
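For concreteness, a tiny sketch of the two block norms pictured above, computed for an arbitrary grouping of coordinates; the grouping and the values are made up.

```python
import numpy as np

def l1_l2_norm(beta, groups):
    # ell_1/ell_2: sum over groups of the ell_2 norm of each block.
    return sum(np.linalg.norm(beta[g]) for g in groups)

def l1_linf_norm(beta, groups):
    # ell_1/ell_inf: sum over groups of the ell_infinity norm of each block.
    return sum(np.max(np.abs(beta[g])) for g in groups)

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print("ell_1/ell_2:  ", l1_l2_norm(beta, groups))
print("ell_1/ell_inf:", l1_linf_norm(beta, groups))
```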

Information-theoretic analysis

Various results on ℓ1-regularization: but are these "good" results?

Core issue: Disregarding computational complexity, what are the fundamental limits in high-dimensional learning?
• can be addressed with information-theoretic methods
• communication channel: sends information about a {sparse vector, graph, low-rank matrix} to the user

Possible insights:
• show optimality of ℓ1-methods (as good as any other algorithm)
• conversely, reveal sub-optimality of ℓ1-methods

Open question: When ℓ1 is sub-optimal, do there exist alternative polynomial-time methods with optimal scaling?

Geometric intuition

[Figure: the true model ("Truth") with near-by competing supports at distance d_1 and far-away competing supports at distance d_2.]

Error probability is controlled by two competing quantities:
• the number of models with non-overlap t:  N(t) = (k choose k−t) (p−k choose t)
• the distinguishability of models with overlap t:

    Near-by:   d_1(1) ≈ β_min²,               N(1) = (k choose 1)(p−k choose 1)
    Far-away:  d_2(k) ≈ log(1 + k β_min²),    N(k) = (p−k choose k)

Information-theory: necessary/sufficient conditions

Set-up: Given a k-sparse vector β*; minimum value β_min := min_{i∈S} |β*_i|.

Necessary conditions: (Baron et al., 2006; Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008; Fletcher et al., 2008; Wang et al., 2008)
• Wang et al. (2008): for any design X with i.i.d. rows and covariance Σ:

    n > max{ log(p choose k) / [c_1(Σ) log(1 + c_2(Σ) k β_min²)] ,   log(p − k) / [c_2(Σ) β_min²] },

  where the first term is driven by far-away subsets and the second by near-by subsets.

Sufficient conditions for the oracle decoder: (Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008)
• Wainwright (2007): for the standard Gaussian ensemble, the oracle succeeds for

    n > Ω( max{ log(p choose k) ,  log(p − k) / β_min² } ).
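A small numerical illustration of how these bounds scale; the constants c_1(Σ), c_2(Σ) are set to 1 here, which is an arbitrary assumption, so only the growth with (p, k, β_min) is meaningful.

```python
import numpy as np
from scipy.special import gammaln

def log_binom(p, k):
    # log of the binomial coefficient (p choose k), computed stably.
    return gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)

def necessary_samples(p, k, beta_min, c1=1.0, c2=1.0):
    far_away = log_binom(p, k) / (c1 * np.log(1 + c2 * k * beta_min ** 2))
    near_by = np.log(p - k) / (c2 * beta_min ** 2)
    return max(far_away, near_by)

for beta_min in [0.1, 0.5, 2.0]:
    print(beta_min, necessary_samples(p=1000, k=10, beta_min=beta_min))
```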

Summary and open questions

• ℓ1-regularization and sparsity:
  – natural polynomial-time relaxation of difficult ℓ_q problems (q ∈ [0, 1))
  – attractive theoretical guarantees (high-dimensional scaling)
  – various connections to Bayesian procedures
  – lots of exciting applications (e.g., compressed sensing [Baraniuk, next talk]; audio/images/video [Saul, last talk])

• moving beyond ℓ1, and other open questions:
  – fundamental limits of optimal procedures (ℓ_q, q < 1, etc.)
  – block and hierarchical regularization: when does more computation yield better results?
  – Bayesian versions, and adaptive procedures (learning designs)
  – other forms of complexity regularization for high-dimensional data: manifolds, dimensionality reduction, etc.
