Sparsity and statistical inference in high-dimensional settings
Martin Wainwright
Department of Statistics, and Department of Electrical Engineering and Computer Science, UC Berkeley, California, USA
wainwrig@{stat,eecs}.berkeley.edu
ICML 2008, Helsinki, Finland. Workshop on sparsity and variable selection
Introduction
classical asymptotic theory in statistical inference:
– number of observations n → +∞
– model dimension p stays fixed
contrast with many modern applications and models:
– systems frequently large (p ≈ 10^3 to 10^8), and observations costly
– interesting consequences: might have p = Θ(n) or even p ≫ n
distinct issues in high-dimensional inference:
– performance limits of specific polynomial-complexity methods?
– fundamental (information-theoretic) limits that apply to any procedure?
some form of complexity regularization is required to avoid the curse of dimensionality (e.g., sparsity)
Different sparsity models
Given a signal vector β* ∈ R^p:
Hard-sparsity model: Support:
$$S(\beta^*) = \{\, i \in \{1, \ldots, p\} \mid \beta^*_i \neq 0 \,\}.$$
Exactly k non-zero elements: |S(β*)| = k.
Weak ℓq-sparsity models: For q ∈ [0, 1), define the ℓq-ball:
$$B_q(C) = \Bigl\{\, u \in \mathbb{R}^p \;\Big|\; \sum_{i=1}^p |u_i|^q \le C \,\Bigr\}.$$
Require that β* ∈ B_q(C) for some q ∈ [0, 1).
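To make the two notions concrete, here is a minimal numpy sketch; the example vector and the values of q and C are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical 10-dimensional signal with k = 3 nonzero entries (hard sparsity)
beta = np.zeros(10)
beta[[1, 4, 7]] = [2.0, -0.5, 1.3]

support = np.flatnonzero(beta)          # S(beta) = {i : beta_i != 0}
k = support.size                        # |S(beta)| = 3

# Weak ell_q sparsity: membership in the ell_q ball B_q(C)
q, C = 0.5, 5.0
lq_mass = np.sum(np.abs(beta) ** q)     # sum_i |beta_i|^q
print(support, k, lq_mass, lq_mass <= C)
```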
ℓ1-norm relaxation
Problem: ℓq "ball" constraints are non-convex (computationally intractable in general).
[Figure: the ℓq ball (q < 1) next to the ℓ1 ball.]
Standard relaxation: relax the non-convex constraint to the nearest convex norm, namely the ℓ1 norm.
Compressed sensing
[Figure: Y (n-vector) = X (n × p matrix) times β* (p-vector with k nonzero and p − k zero entries).]
Set-up: Under-determined sparse linear system: Y = Xβ*, with ‖β*‖₀ = k ≪ p.
Goal: Recover β* using n ≪ p measurements.
Some relevant work: Chen et al., 1998; Donoho & Huo, 2001; Elad & Bruckstein, 2002; Candes et al., 2005; Candes & Tao, 2005; Donoho, 2005; Donoho & Tanner, 2006; Baron et al., 2006; Baraniuk, 2007; Indyk et al., 2008.
Sparse log-linear models
[Figure: Y (n-vector) = f(X β*, W), with X an n × p design matrix, β* having k nonzero entries, and noise W.]
Set-up: General log-linear observation model: $(Y_i \mid \vec{X}_i) \sim \mathbb{P}\bigl[\vec{X}_i^T \beta^*\bigr]$.
Goal: Estimate β* (various loss functions possible).
Some relevant work: Tibshirani, 1996; Chen et al., 1998; Ng, 2004; Tropp, 2004; Fuchs, 2005; Candes & Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Wainwright, 2006; Rocha et al., 2007; van de Geer, 2007; Ji & Carin, 2007; Tsybakov et al., 2008.
Graphical model selection
let G = (V, E) be an undirected graph on p = |V| vertices
consider a p-dimensional random vector $\vec{Z}$ with density
$$p(\vec{Z}; \beta) \;\propto\; \exp\Bigl\{ \sum_{i \in V} \bigl(\beta_i Z_i + \beta_{ii} Z_i^2\bigr) \;+\; \sum_{(i,j) \in E} \beta_{ij} Z_i Z_j \Bigr\}.$$
identify the underlying graph structure given n samples of $\vec{Z}$
Some relevant work: Abbeel et al., 2006; Meinshausen & Buhlmann, 2006; Rothman et al., 2008; Ravikumar et al., 2006, 2008; Johnson et al., 2007; Santhanam & Wainwright, 2008; Banerjee et al., 2008.
Sparse principal components analysis
[Figure: Σ = Z Zᵀ + D, with hatched blocks indicating the sparse leading eigenspace Z.]
Set-up: Covariance matrix Σ = Z Zᵀ + D, where the leading eigenspace Z has sparse columns.
Goal: Produce an estimate Ẑ based on samples X⁽ⁱ⁾ with covariance matrix Σ.
Some relevant work: Joliffe et al., 2003; Johnstone & Lu, 2004; Zou et al., 2004; d'Aspremont et al., 2007; Moghaddam et al., 2007; Johnstone & Paul, 2008; Amini & Wainwright, 2008.
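As one deliberately simple illustration of this set-up, the sketch below implements diagonal thresholding in the spirit of Johnstone & Lu: restrict to coordinates with large sample variance, then take the leading eigenvector of the reduced sample covariance. The threshold and toy data are assumptions for the example, not the methods of the other cited papers.

```python
import numpy as np

def diag_threshold_pca(X, tau):
    """Sketch of diagonal thresholding: leading eigenvector of the sample
    covariance restricted to coordinates whose sample variance exceeds tau."""
    p = X.shape[1]
    keep = np.flatnonzero(X.var(axis=0) > tau)        # candidate support of Z
    w, V = np.linalg.eigh(np.cov(X[:, keep], rowvar=False))
    z_hat = np.zeros(p)
    z_hat[keep] = V[:, -1]                            # leading eigenvector, re-embedded
    return z_hat

# toy spiked model: Sigma = 3 z z^T + I, with z supported on the first 5 coordinates
rng = np.random.default_rng(0)
p, n, k = 50, 500, 5
z = np.zeros(p); z[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p)) + np.sqrt(3.0) * rng.standard_normal((n, 1)) * z
print(np.flatnonzero(np.abs(diag_threshold_pca(X, tau=1.3)) > 0.05))
```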
Low-rank matrix approximation
Set-up: given matrix Γ ∈ R^{p×p} of rank k ≪ p.
low rank is equivalent to sparsity in the eigenvalues: {λ₁(Γ), λ₂(Γ), ..., λ_p(Γ)} has k non-zeros.
Observations: For i = 1, 2, ..., n:
$$Y_i \;=\; \operatorname{trace}\bigl(X_i^T \Gamma\bigr) + W_i \;=\; \operatorname{trace}\bigl(X_i^T\, U^T \operatorname{diag}\{\lambda_1(\Gamma), \ldots, \lambda_p(\Gamma)\}\, U\bigr) + W_i.$$
Goal: Recover matrix Γ (exactly or approximately).
Some relevant work: Recht et al., 2007; Bach, 2008; Meka et al., ICML 2008.
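A small numpy sketch of this observation model (all quantities synthetic; this only generates the trace measurements, it does not solve the recovery problem):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 20, 3, 50

# Rank-k matrix Gamma = U^T diag(lambda_1, ..., lambda_p) U with only k nonzero eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
lam = np.zeros(p); lam[:k] = rng.uniform(1.0, 2.0, size=k)
Gamma = U.T @ np.diag(lam) @ U

# Observations Y_i = trace(X_i^T Gamma) + W_i, using trace(X_i^T Gamma) = <X_i, Gamma>
Xs = rng.standard_normal((n, p, p))
W = 0.1 * rng.standard_normal(n)
Y = np.einsum('ijk,jk->i', Xs, Gamma) + W
print(np.linalg.matrix_rank(Gamma), Y[:3])
```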
Compressed sensing
Basis pursuit (ℓ1-relaxation, linear programming) (Chen, Donoho & Saunders, 1998)
Given noiseless observations Y = Xβ*, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.
ℓ0 problem:
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_0 \quad \text{s.t.} \quad Y = X\beta.$$
ℓ1 relaxation (LP₁):
$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad Y = X\beta.$$
Question: When does the solution of the LP relaxation recover the sparsest solution (optimal for ℓ0)?
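A minimal sketch of LP₁ using a generic LP solver (not the algorithms of the cited papers): split β = β⁺ − β⁻ with β⁺, β⁻ ≥ 0, so that ‖β‖₁ becomes a linear objective and Y = Xβ a linear equality constraint. Problem sizes are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, Y):
    """Solve min ||beta||_1 s.t. Y = X beta via the split beta = bp - bm, bp, bm >= 0."""
    n, p = X.shape
    c = np.ones(2 * p)                      # objective: sum(bp) + sum(bm) = ||beta||_1
    A_eq = np.hstack([X, -X])               # X bp - X bm = Y
    res = linprog(c, A_eq=A_eq, b_eq=Y, bounds=(0, None))
    return res.x[:p] - res.x[p:]

# toy check: recover a 3-sparse vector from n = 40 < p = 100 Gaussian measurements
rng = np.random.default_rng(0)
n, p, k = 40, 100, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = rng.standard_normal(k)
beta_hat = basis_pursuit(X, X @ beta_star)
print(np.allclose(beta_hat, beta_star, atol=1e-4))
```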
Theoretical guarantees for basis pursuit
Exact recovery:
earlier results on deterministic measurement matrices (Donoho & Huo, 2001; Elad & Bruckstein, 2002; Feuer & Nemirovski, 2003)
later results on random measurement matrices (Donoho, 2005; Candes & Tao, 2005; Donoho & Tanner, 2006):
– standard Gaussian ensemble: X ∈ R^{n×p} with X_{ij} ∼ N(0, 1) i.i.d.
– signal β* is k-sparse, with k = αp, for some α ∈ (0, 1/2)
– then there exists a constant γ = f(α) < 1 such that n = γp observations suffice for BP to recover β*
Approximate recovery: (Donoho, 2006; Candes et al., 2006; Tropp, 2006; Gilbert et al., 2006)
say the linear system Y = Xβ has a near-sparse solution β*
the basis pursuit LP (ℓ1-relaxation) approximately recovers the best k-term approximation to β* (e.g., in ℓ2 norm)
LP duality and almost-spherical sections
basis pursuit LP recovers the support set S = {i | βᵢ* ≠ 0} if and only if there exists λ ∈ Rⁿ such that
$$X_S^T \lambda = \operatorname{sign}(\beta^*_S), \qquad \bigl|X_{S^c}^T \lambda\bigr| < 1 \ \text{(elementwise)}.$$
after some further algebraic manipulation, it is sufficient to show that
$$\underbrace{\|\vec{e}\|_2 \;\le\; \|\vec{e}\|_1}_{\text{trivial}} \;\le\; \underbrace{f\Bigl(\tfrac{|S|}{n}\Bigr)\,\|\vec{e}\|_2}_{\text{non-trivial}} \qquad \forall\, \vec{e} \in \mathbb{R}^p \ \text{s.t.}\ X_S^T \vec{e} = 0.$$
random slices of the ℓ1 ball are almost spherical (Dvoretsky, 1976; Milman & Schechtman, 1986)
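A numerical sketch of this certificate: take λ as the minimum-norm solution of X_Sᵀλ = sign(β*_S) and check strict feasibility on S^c. This tests one particular candidate witness, so it is a sufficient check only; the helper name and toy ensemble are illustrative.

```python
import numpy as np

def dual_witness_holds(X, beta_star, tol=1e-10):
    """Check the LP-duality certificate: exists lambda with X_S^T lam = sign(beta_S*)
    and |X_{S^c}^T lam| < 1, using the minimum-norm candidate lambda."""
    S = np.flatnonzero(np.abs(beta_star) > tol)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    lam, *_ = np.linalg.lstsq(X[:, S].T, np.sign(beta_star[S]), rcond=None)
    return np.max(np.abs(X[:, Sc].T @ lam)) < 1.0

rng = np.random.default_rng(0)
n, p, k = 80, 200, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
print(dual_witness_holds(X, beta_star))
```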
Design of measurement matrices
restricted isometry property (RIP) ≡ controlled eigenspectrum of X_Sᵀ X_S:
– sufficient condition for recovery (Candes & Tao, 2005)
– link to Johnson-Lindenstrauss and random projections (Baraniuk et al., 2007)
random sub-Gaussian matrices have desirable properties with high probability, but explicit constructions or "learned" matrices?
lots of on-going work:
– sparse graph methods and expanders, inspired by coding theory (e.g., Cormode & Muthukrishnan, 2005; Gilbert et al., 2006; Sarvotham et al., 2006; Wu & Hassibi, 2007; Guruswami & Lee, 2008)
– learning design matrices (e.g., Weiss et al., 2007; Seeger et al., 2008)
– sparse random projections (e.g., Achlioptas, 2001; Wang et al., 2007)
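An illustrative numpy check of the "controlled eigenspectrum" property for a standard Gaussian design: singular values of randomly chosen n × k submatrices X_S concentrate around √n. Sampling random subsets is only a heuristic probe; genuine RIP is a statement about all subsets of size k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, trials = 100, 400, 10, 200

X = rng.standard_normal((n, p))
extremes = []
for _ in range(trials):
    S = rng.choice(p, size=k, replace=False)
    s = np.linalg.svd(X[:, S], compute_uv=False)   # spectrum of the submatrix X_S
    extremes.append((s.min(), s.max()))

extremes = np.array(extremes) / np.sqrt(n)         # RIP-like if these stay close to 1
print("min sigma/sqrt(n):", extremes[:, 0].min(), " max sigma/sqrt(n):", extremes[:, 1].max())
```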
Sparse linear regression (noisy observations)
ℓ1-relaxed QP: (Tibshirani, 1996; Chen et al., 1998)
Given noisy observations Y = Xβ* + W, where β* ∈ R^p is "sparse", and X ∈ R^{n×p} is the measurement matrix.
ℓ0 problem:
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_0 \le C.$$
ℓ1 relaxation (Lasso):
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le C.$$
Question: In what sense does the QP solution β̂ approximate the true β*?
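A minimal sketch with scikit-learn's Lasso, which solves the equivalent penalized form min ‖Y − Xβ‖₂²/(2n) + α‖β‖₁ rather than the constrained form above; the problem sizes and the choice of α are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, sigma = 200, 256, 8, 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = 1.0
Y = X @ beta_star + sigma * rng.standard_normal(n)

alpha = sigma * np.sqrt(2 * np.log(p) / n)      # theory-inspired level, illustrative
beta_hat = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y).coef_

print("ell_2 error :", np.linalg.norm(beta_hat - beta_star))
print("true support:", sorted(np.flatnonzero(beta_star)))
print("est. support:", sorted(np.flatnonzero(np.abs(beta_hat) > 1e-3)))
```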
Various loss functions
Given an estimate β̂, how should it be compared to the truth β*?
Prediction: When using the model to predict a new sample Y′:
average loss R(β̂, β*) = E[ℓ(Y′; X′, β̂)], where (Y′, X′) ∼ P_{β*};
e.g., mean-squared error E[(Y′ − X′ᵀ β̂)²].
Estimation: When the signal β* is of primary interest (e.g., compressed sensing):
assess the behavior of ℓ_a-metrics ‖β̂ − β*‖_a for some a ≥ 1; bounds in expectation, or in probability.
Signed support recovery: When the support S(β*) = {i | βᵢ* ≠ 0} is of primary interest (e.g., variable selection; graphical models):
0–1 loss: I[S±(β̂) ≠ S±(β*)] = 1 if the signed supports differ, and 0 otherwise.
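A small numpy sketch computing the three criteria for the linear-Gaussian case; the population prediction risk is approximated by a large fresh sample, and the names and noise level are illustrative.

```python
import numpy as np

def compare_to_truth(beta_hat, beta_star, sigma=0.5, n_new=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p = beta_star.size

    # Prediction: mean-squared error on fresh draws (Y', X') ~ P_{beta*}
    X_new = rng.standard_normal((n_new, p))
    Y_new = X_new @ beta_star + sigma * rng.standard_normal(n_new)
    pred_mse = np.mean((Y_new - X_new @ beta_hat) ** 2)

    # Estimation: ell_2 error
    est_err = np.linalg.norm(beta_hat - beta_star)

    # Signed support recovery: 0-1 loss
    signed = lambda b: [(i, np.sign(b[i])) for i in np.flatnonzero(b)]
    zero_one = int(signed(beta_hat) != signed(beta_star))

    return pred_mse, est_err, zero_one
```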
Some geometric intuition
[Figure: (a) Favourable setting; (b) Poorly conditioned.]
Empirical behavior of Lasso for variable selection
[Figure: probability of success vs. number of samples n (0 to 600), for p = 128, 256, 512; ρ = 0.10, linear sparsity.]
Empirical behavior: Appropriately rescaled (Wainwright, 2006)
[Figure: probability of success vs. control parameter θ (0 to 2), for p = 128, 256, 512, with thresholds T_low and T_up marked; ρ = 0.10, linear sparsity.]
control parameter for rescaling: θ(n, p, β*) = n / f(p, β*) for some complexity function f
Theoretical guarantees for Lasso
say X ∈ R^{n×p} has i.i.d. rows, drawn from a (sub-)Gaussian ensemble with covariance Λ, and the signal β* ∈ R^p has k non-zeros
solve the ℓ1-regularized QP (Lasso):
$$\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2 + \lambda_n \|\beta\|_1.$$
Prediction E[(Y′ − Xᵀβ̂)²]: (e.g., Greenshtein & Ritov, 2003; Candes & Tao, 2006; van de Geer, 2007)
Estimation error ‖β̂ − β*‖₂: (e.g., Donoho, 2006; Candes & Tao, 2006; Meinshausen & Yu, 2007; Koltchinskii, 2006; Tsybakov et al., 2008; Zhang, 2007)
Variable selection: (e.g., Meinshausen & Buhlmann, 2006; Wainwright, 2006; Zhao & Yu, 2006)
there exist constants θ_ℓ(Λ) ≤ 1 ≤ θ_u(Λ) such that:
– if n / (2k log(p − k)) > θ_u, then the Lasso recovers the support of β* with probability greater than 1 − exp(−c log p);
– conversely, if n / (2k log(p − k)) < θ_ℓ, then the Lasso fails w.p. one.
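A simulation sketch of this scaling: estimate the empirical probability of exact signed-support recovery as a function of the control parameter θ = n / (2k log(p − k)). The regularization level and problem sizes are heuristic choices for illustration, not the constants of the theorem.

```python
import numpy as np
from sklearn.linear_model import Lasso

def success_prob(theta, p=128, k=8, sigma=0.5, trials=20, seed=0):
    """Empirical probability of exact signed-support recovery at control parameter theta."""
    rng = np.random.default_rng(seed)
    n = int(np.ceil(theta * 2 * k * np.log(p - k)))
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        beta_star = np.zeros(p); beta_star[rng.choice(p, k, replace=False)] = 1.0
        Y = X @ beta_star + sigma * rng.standard_normal(n)
        alpha = sigma * np.sqrt(2 * np.log(p) / n)            # heuristic level
        beta_hat = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y).coef_
        recovered = np.sign(np.where(np.abs(beta_hat) > 1e-3, beta_hat, 0.0))
        hits += np.array_equal(recovered, np.sign(beta_star))
    return hits / trials

for theta in (0.5, 1.0, 1.5, 2.0):
    print(theta, success_prob(theta))
```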
Sparsity and graphical model selection
graphical models based on a "sparse" factorization of the distribution:
$$\mathbb{P}(x; \theta) \;=\; \frac{1}{Z(\theta)} \exp\Bigl\{ \sum_{(s,t) \in E} \theta_{st} x_s x_t \Bigr\}.$$
searching over graphs with at most k edges is equivalent to ‖θ‖₀ ≤ k.
given i.i.d. samples X⁽¹⁾, ..., X⁽ⁿ⁾, regularized version of maximum likelihood:
ℓ0 problem:
$$\max_{\theta} \ \frac{1}{n} \sum_{i=1}^n \log \mathbb{P}(X^{(i)}; \theta) \quad \text{s.t.} \quad \|\theta\|_0 \le C.$$
ℓ1 relaxation:
$$\max_{\theta} \ \frac{1}{n} \sum_{i=1}^n \log \mathbb{P}(X^{(i)}; \theta) \quad \text{s.t.} \quad \|\theta\|_1 \le C.$$
calculation of the likelihood: intractable for general discrete models
Reduction to neighborhood selection
[Figure: graph on nodes X₁, ..., X₆, with the neighborhood of X₁ highlighted.]
conditioned on (x₂, ..., x_p), variable X₁ ∈ {−1, +1} has conditional distribution (e.g., for binary variables):
$$\mathbb{P}_{\theta^*}(X_1 = 1 \mid x_2, \ldots, x_p) \;=\; \Bigl[\,1 + \exp\Bigl(-\sum_{j \in N(1)} \theta^*_{1j}\, x_j\Bigr)\Bigr]^{-1}.$$
the neighborhood structure can be recovered by local regressions (Meinshausen & Buhlmann, 2006; Ravikumar et al., 2006, 2008)
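A minimal sketch of this reduction using ℓ1-penalized logistic regression from scikit-learn, run node-by-node on ±1 samples. The Gibbs-sampled chain graph, the penalty level C, and the coefficient threshold are illustrative assumptions, not the exact estimator or constants analyzed in the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhoods(X, C=0.05, thresh=0.1):
    """For each node s, fit an l1-penalized logistic regression of X_s on the other
    variables; coefficients above the threshold give the estimated neighborhood N(s)."""
    n, p = X.shape
    hoods = {}
    for s in range(p):
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", fit_intercept=False)
        clf.fit(X[:, others], X[:, s])
        hoods[s] = set(others[np.abs(clf.coef_.ravel()) > thresh])
    return hoods

# toy data: Gibbs sampling from a chain Ising model X_1 - X_2 - ... - X_6, coupling 0.8
rng = np.random.default_rng(0)
p, n, theta = 6, 2000, 0.8
X = np.where(rng.random((n, p)) < 0.5, -1, 1)
for _ in range(30):                                   # crude Gibbs sweeps
    for s in range(p):
        field = theta * (X[:, s - 1] if s > 0 else 0) + theta * (X[:, s + 1] if s < p - 1 else 0)
        X[:, s] = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-2 * field)), 1, -1)
print(estimate_neighborhoods(X))
```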
Phase transition behavior
[Figure: probability of success vs. control parameter, for p = 64, 100, 225. Left panel: 4-nearest-neighbor grid (mixed); right panel: star graph, logarithmic neighbors.]
Set-up: Binary random vector (X₁, ..., X_p); graphical model with maximum degree d.
Sufficient conditions: ℓ1-neighborhood regression recovers the graph structure w.h.p. with n = Ω(d³ log(p − d)) samples. (Ravikumar et al., NIPS 2006)
Block-structured and hierarchical regularization
Structured regularization: (Rocha et al., 2006; Yuan & Lin, 2006; Obozinski et al., 2007; Meier et al., 2007; Bach, 2008; Roth & Fischer, 2008; Negahban & Wainwright, 2008)
[Figure: (a) ℓ1/ℓ2 norm; (b) ℓ1/ℓ∞ norm.]
Motivation:
– data with shared sparsity (e.g., β₁ ≠ 0 ⟺ β₂ ≠ 0): simultaneous recovery, data fusion, time series, etc.
– data with hierarchical sparsity (e.g., β₁ ≠ 0 ⟹ β₂ ≠ 0)
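A tiny numpy sketch of the two block norms pictured above, for a coefficient vector split into illustrative groups:

```python
import numpy as np

def block_norm(beta, groups, inner):
    """l1/l_inner block norm: sum over groups of the inner norm of each block."""
    return sum(np.linalg.norm(beta[g], ord=inner) for g in groups)

beta = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print("l1/l2  :", block_norm(beta, groups, 2))        # sum of group l2 norms
print("l1/linf:", block_norm(beta, groups, np.inf))   # sum of group max-abs values
```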
Information-theoretic analysis
Various results on ℓ1-regularization: but are these "good" results?
Core issue: Disregarding computational complexity, what are the fundamental limits in high-dimensional learning?
– can be addressed with information-theoretic methods
– communication channel: sends information about {sparse vector, graph, low-rank matrix} to the user
Possible insights:
– show optimality of ℓ1-methods (as good as any other algorithm)
– conversely, reveal sub-optimality of ℓ1-methods
Open question: When ℓ1 is sub-optimal, do there exist alternative polynomial-time methods with optimal scaling?
Geometric intuition
[Figure: the truth β*, a near-by model at distance d₁, and a far-away model at distance d₂.]
Error probability controlled by two competing quantities:
– number of models with non-overlap t: $N(t) = \binom{k}{k-t}\binom{p-k}{t}$
– distinguishability of models with overlap t
Near-by: $d_1(1) \approx \beta_{\min}^2$, with $N(1) = \binom{k}{1}\binom{p-k}{1}$.
Far-away: $d_2(k) \approx \log(1 + k\,\beta_{\min}^2)$, with $N(k) = \binom{p-k}{k}$.
Information theory: necessary/sufficient conditions
Set-up: Given a k-sparse vector β*; minimum value β_min := min_{i∈S} |βᵢ*|.
Necessary conditions: (Baron et al., 2006; Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008; Fletcher et al., 2008; Wang et al., 2008)
Wang et al. (2008): for any design X with i.i.d. rows and covariance Σ:
$$n \;>\; \max\Biggl\{ \underbrace{\frac{\log\binom{p}{k}}{\tfrac{1}{2}\log\bigl[1 + c_1(\Sigma)\,k\,\beta_{\min}^2\bigr]}}_{\text{Far-away subsets}}, \;\; \underbrace{\frac{\log(p-k)}{c_2(\Sigma)\,\beta_{\min}^2}}_{\text{Near-by subsets}} \Biggr\}.$$
Sufficient conditions for the oracle decoder: (Wainwright, 2007; Reeves & Gastpar, 2008; Akcakaya & Tarokh, 2008)
Wainwright (2007): for the standard Gaussian ensemble, the oracle decoder succeeds for
$$n \;>\; \Omega\Bigl( \max\Bigl\{ \log\binom{p}{k},\; \frac{\log(p-k)}{\beta_{\min}^2} \Bigr\} \Bigr).$$
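A small numpy/scipy sketch that evaluates the two competing terms of the necessary condition, with the unspecified constants c₁(Σ), c₂(Σ) set to 1 purely for illustration:

```python
import numpy as np
from scipy.special import gammaln

def log_binom(p, k):
    """log of the binomial coefficient C(p, k), computed stably."""
    return gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)

def necessary_terms(p, k, beta_min, c1=1.0, c2=1.0):
    far_away = log_binom(p, k) / (0.5 * np.log(1.0 + c1 * k * beta_min ** 2))
    near_by = np.log(p - k) / (c2 * beta_min ** 2)
    return far_away, near_by, max(far_away, near_by)

# compare which term dominates for a weak vs. strong minimum signal
print(necessary_terms(p=1000, k=10, beta_min=0.1))
print(necessary_terms(p=1000, k=10, beta_min=2.0))
```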
Summary and open questions
ℓ1-regularization and sparsity:
– natural polynomial-time relaxation of difficult ℓq problems (q ∈ [0, 1))
– attractive theoretical guarantees (high-dimensional scaling)
– various connections to Bayesian procedures
– lots of exciting applications (e.g., compressed sensing [Baraniuk, next talk]; audio/images/video [Saul, last talk])
moving beyond ℓ1, and other open questions:
– fundamental limits of optimal procedures (ℓq, q < 1, etc.)
– block and hierarchical regularization: when does more computation yield better results?
– Bayesian versions, and adaptive procedures (learning designs)
– other forms of complexity regularization for high-dimensional data: manifolds, dimensionality reduction, etc.