SPARSE RECOVERY WITH UNKNOWN VARIANCE: A LASSO-TYPE APPROACH STÉPHANE CHRÉTIEN

AND

SÉBASTIEN DARSES

Abstract. We address the issue of estimating the regression vector β and the variance σ 2 in the generic s-sparse linear model y = Xβ +z, with β ∈ Rp , y ∈ Rn , z ∼ N (0, σ 2 I) and p > n. We propose a new LASSOtype method that jointly estimates β, σ 2 and the relaxation parameter λ by imposing an explicit trade-off constraint between the log-likelihood and ℓ1 -penalization terms. We prove that exact recovery of the support and sign pattern of β holds with probability at least 1 − O(p−α ). Our assumptions, parametrized by α, are similar to the ones proposed in [9] for σ 2 known. The proof relies on a tail decoupling argument with explicit constants and a recent version of the Non-Commutative Bernstein inequality [30]. Our result is then derived from the optimality conditions for the estimators of β and λ. Finally, a thorough analysis of the standard LASSO estimator as a function of λ allows us to construct an efficient Newton scheme for the fast computation of our estimators.

1. Introduction 1.1. Problem statement. The well-known standard Gaussian linear model reads (1.1)

y = Xβ + z,

where X denotes a n×p design matrix, β ∈ Rp is an unknown parameter and the components of the error z are assumed i.i.d. with normal distribution N (0, σ 2 ). The present paper aims at studying this model in the case where the number of covariates is greater than the number of observations, n < p, and the regression vector β and the variance σ 2 are both unknown. The estimation of the parameters in this case is of course impossible without further assumptions on the regression vector β. One such assumption is sparsity, i.e. only a few components of β are different from zero. There has been an great interest in the study of this problem recently. Recovering β when the noise is zero has been extensively studied in the context of Compressed Sensing, a new paradigm for designing observation matrices X allowing exact recovery of β. In this framework, it is now a standard fact that matrices X can be found (e.g. with high probability if drawn from subgaussian i.i.d distributions) such that the number of observations needed to recover β exactly is proportional to s log (p/n). When the variance is known and positive, two popular techniques to estimate the regression vector β are the LASSO [28] and the Dantzig selector [11]. We refer to [3] for a recent simultaneous analysis of these two methods. 1

2

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

The standard LASSO estimator βbλ of β is defined as (1.2)

1 βbλ ∈ argmin ky − Xbk22 + λkbk1 , 2 b∈Rp

where λ > 0 is a regularization parameter controlling the sparsity of the estimated coefficients. Sparse recovery cannot hold without some geometric assumptions on the dictionary (or the design matrix), as recalled in [19] pp. 4–5. One common assumption for the precise study of the statistical performance of these estimators is an incoherence property of the matrix X; this requires the scalar product of the (normalized) columns of X to be very small. Coherence based conditions appeared first in the context of Basis Pursuit for sparse approximation in [12], [15] and [13]. It then had a significant impact on Compressed Sensing; see [26] and [10]. The recent references [3], [6] and [18] contain interesting assumptions on the coherence in our context of interest, i.e. high dimensional p sparse regression. For instance, [3] and [6] require a bound of the order log n/n whereas [18] requires a bound of the order 1/s. The recent paper [9] requires the coherence of X, i.e. the maximum scalar product of two columns of X, to be less than Cµ / log p. This assumption seems to be pretty pertinent since log p really represents the complexity of the problem. In [9], the authors work under the following assumptions: incoherence property, β is s-sparse with s sufficiently small, and the support of β and the sign pattern are independent and uniformly distributed. Then, they proved that βb has the same support and sign pattern as β with probability 1 − p−1 ((2π log p)−1/2 + sp−1 ) − O(p−2 log 2 ). Our work mainly aims at extending the results of [9] to the case where σ 2 is unknown using the generic s-sparse model introduced in [9, p. 288]. The basic idea of our approach is that the fidelity term and the penalization term of the LASSO objective function should have the same order of magnitude. More precisely, the penalization parameter λ will be tuned so as to impose these two terms to be proportional, i.e. their ratio to be equal to a positive b and constant C. The value of λ achieving this property will be denoted by λ b b the LASSO estimator of β corresponding to λ will be our estimator β.

1.2. Our contribution. Given an arbitrary α > 0, our main result specifies a reasonable set of constraints on the number of observations n and the sparsity s so that our LASSO-type method only fails to identify the support and the signs of β with probability at most of the order p−α . As in [9], we require the components of β to be above the noise level. However, the main difference with the case where the variance σ 2 is known is that we also have to impose that the components of β are not too large, the upper bound being a function of the proportion C. This assumption is quite natural because the fidelity term is essentially less sensitive to the magnitude of β than the ℓ1 -norm penalization term. Thus, any preassigned value of the ratio of these two terms cannot be relevant for arbitrary large values of the components of β. The two main ingredients in our proofs are a new version of the NonCommutative Bernstein inequality, proven in [30], and a refined analysis of

SPARSE RECOVERY WITH UNKNOWN VARIANCE

3

tail decoupling for second order Rademacher chaos. This may be of independent interest in various other settings. Moreover, we provide explicit values for all the constants involved in the analysis and we compare them to the ones leading to Theorem 1.3 in [9]. These comparisons seem to confirm that the Non-Commutative Bernstein inequality is a particularly efficient tool for the analysis of the LASSO. The problem of estimating the variance in the sparse regression model has been addressed in only a few references until now. One very interesting work is [2] where AIC, BIC and AMDL based estimators, as well as estimators using a more general complexity penalty, are analyzed in the unknown variance setting. The main advantage of our Balanced-LASSO procedure is to avoid the enumeration of subsets of variables that usually arises in practice when the number of covariates is large. 1.3. Perspectives. One subject of future research would be how to weaken the coherence assumption. One interesting direction is the recent work of Juditsky and Nemirovsky [17] in which a polynomial time verifiable condition for exact recovery in compressed sensing is proposed. Adapting this method to the regression setting with unknown variance would provide a satisfactory answer to this problem. 1.4. Plan of the paper. Our Balanced-LASSO estimator, the main results Theorem 2.4 and Theorem 2.5 , together with the assumptions used throughout the paper are presented in Section 2. We proceed to the proofs of Theorem 2.4 and Theorem 2.5 in Section 3 and Section 4 respectively. Section 5 studies the basic properties (continuity, uniqueness, range of the ratio between fidelity term and the penalization term, monotonicity of this ratio, etc) of the LASSO estimator as a function of the relaxation parameter b is described in Section 6 where some numerical λ. The algorithm finding λ experiments are also presented. The proofs of some technical intermediate results are gathered in the Appendix. 1.5. Notations. 1.5.1. Generalities. When E ⊂ {1, . . . , N }, we denote by |E| the cardinal of E. For I ⊂ {1, . . . , p} and x a vector in Rp , we set xI = (xi )i∈I ∈ R|I| . The usual scalar product is denoted by h·, ·i. The notations for the norms on vectors and matrices are also standard: for any vector x = (xi ) ∈ RN , X X kxk22 = x2i ; kxk1 = |xi | ; kxk∞ = sup |xi |. 1≤i≤N

1≤i≤N

1≤i≤N

For any matrix A, we denote by At its transpose. The set of symmetric real matrices is denoted by Sn . For any matrix A in Rd1 ×d2 , we denote by kAk the operator norm of A. The maximum (resp. minimum) singular value of A is denoted by σmax (resp. σmin (A)). Recall that σmax (A) = kAk and σmin (A)−1 = kA−1 k. We use the Loewner ordering on symmetric real matrices: if A ∈ Sn , 0  A is equivalent to saying that A is positive semidefinite, and A  B stands for 0  B − A.

4

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

The notations N (µ, σ 2 ) (resp. χ2 (ν) and B(ν)) stands for the normal distribution on the real line with mean µ and variance σ 2 (resp. the Chisquare distribution with ν degrees of freedom and the Bernoulli distribution with parameter ν). For any vector b ∈ Rp , b+ (resp. b− ) will dente its non-negative (resp. − non-positive) part, i.e. b = b+ − b− , with b+ j , bj ≥ 0. 1.5.2. Specific notations related to the design matrix X. For I ⊂ {1, . . . , p}, and a matrix X, we denote by XI the submatrix whose columns are indexed by I. We will denote the range of XI by VI and the orthogonal projection onto VI by PVI . The coherence of X, denoted by µ(X), is defined by (1.3)

µ(X) =

max |hXi , Xj i|.

1≤i,j≤p

As in [29], we consider the ’hollow-gram’ matrix H and the selector matrix R = diag(δ): H = X tX − I R = diag(δ),

(1.4) (1.5)

where δ is a vector of length p whose components are i.i.d. random variables following the Bernoulli distribution B(s/p). In a similar fashion, we define Rs = diag(δ (s) ) where δ (s) is a random vector of length, uniformly distributed on the set of all vectors with exactly s components equal to 1 and p − s components equal to 0. 1.5.3. Specific notations for the estimators of β. The Balanced-LASSO estimator of β (resp. σ 2 , λ) introduced in this paper is denoted by βb (resp. b The support of βb is denoted by Tb. In Section 5, we will study the σ b2 , λ). standard LASSO estimator of β as a function of λ. We simply denote by βbλ this estimator and we denoted by Tbλ its support. For the sake of notational simplicity, we set   βbλ βbTbλ := (1.6) . Tbλ

2. A Balanced-LASSO estimator

2.1. Presentation of the estimator. When the variance σ 2 is unknown, one has to find an appropriate modification of this strategy. A possible way to handle this situation is to perform cross validation, a computationally very demanding procedure. Another type of estimator has thus to be proposed. We consider and ℓ1 -penalized log-likelihood estimator where the parameter λ has to be estimated as well by imposing an additional constraint in the likelihood maximization problem. More precisely, the goal of the present paper is to introduce a new estibσ b defined by mator (β, b2 , λ) (2.7)

bσ b ∈ argmin (β, b2 , λ) (b,s,l)∈C

(2.8)

C =



kbk1 n ky − Xbk22 + l 2 + log(s2 ) 2s2 s 2

(b, s, l) ∈ Rp × R+ × R+ , lkbk1 = Cky − Xbk22 ,

SPARSE RECOVERY WITH UNKNOWN VARIANCE

5

where C is a positive constant. b 2 The basic idea is to hold a balance between the fidelity term 21 ky − X βk 2 b b and the penalty term λkβk1 . The main problem with this approach is the one of choosing the constant C. As we will see in Theorem 2.5, fixing C to an arbitrary value determines a set of conditions under which exact recovery of the support and sign pattern of β holds probability at least 1−O(p−α ). In particular, the magnitude of the nonzero coefficients of β plays a particular role: as in [9], one will require that the nonzero components of β be not too small (in fact, slightly above the noise level), but also not too large. As the range of admissible values for the components of β depends on C, C will have to be tuned as a function of our priors on this range. 2.2. Main results. 2.2.1. Preliminary remarks. The main idea behind the analysis of LASSOtype methods is the following. First, the ℓ1 penalty promotes sparsity of the b Since βb is sparse, we may restrict the study to the subvector βbb estimator β. T of β, resp. the submatrix XTb of X, whose components, resp. columns, are indexed by Tb. Moreover, in order for βb to be a good estimator, one clearly needs to prove that XTb is far from being singular. Taking this idea a little further, since Tb is supposed to estimate the true support T of cardinal s, the first kind of result one may ask for is a proof that XT is far from singular for every possible T . Unfortunately, proving such a strong property, based on incoherence only, is out of hope. The idea proposed by Candès and Plan in [9] to overcome this problem is to assume that T is random. Based on this model, the first part of the analysis of LASSO-type methods consists of proving that XT satisfies (2.9)

1 − r0 ≤ σmin (XT ) ≤ σmax (XT ) ≤ 1 + r0 ,

0 < r0 < 1,

with high probability. The method of the proof used in [9] is based on the Non-Commutative Kahane-Kintchine inequalities. In the present paper, we will instead use a recent version of the Non-Commutative Bernstein inequality proposed by Tropp [30], in order to obtain better estimates for the involved constants. The most intuitive conditions to prove (2.9) are (i) T is a random support with uniform distribution on index sets with cardinal s; (ii) s is sufficiently small; (iii) X is sufficiently incoherent. The second part of the analysis consists in proving that the least-squares oracle estimator, which knows the support ahead of time, satisfies the optimality conditions of the Balanced-LASSO estimator with high probability. This will prove that the Balanced-LASSO automatically detects the right support and sign pattern. The proofs of these results highly depend on the quasi-isometry condition (2.9) and similar properties obtained with the same techniques as for (2.9). We also need the sign patter of β to be uniformly distributed, a quite natural assumption already invoked in [9].

6

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

2.2.2. Assumptions. The first so-called Coherence condition deals with the minimum angle between the columns of X. Assumption 2.1. (Range and Coherence condition for X) The matrix X has full rank and its coherence verifies Cµ µ(X) ≤ (2.10) . log p Assumption 2.2. (Generic s-sparse model [9]) (1) The support of β is uniform among all index subsets of {1, . . . , n} with cardinal s, (2) Given T , the sign pattern of βT is random with uniform distribution over {−1, +1}s ,

The last condition concerns the magnitude of the nonzero regression coefficients βj , j ∈ T . Let θ ∈ (0, 1) and ) (p 2(1 + α)(1 + r0 ) p (2.11) , 2 2(1 + α) . κ = max 1 − r0 Let us define

2e−1 n−s √ log p κC(1 + θ)

(2.12)

Mθ,C

=

(2.13)

mθ,C

= max(m1 , m2 , m3 ),

with m1

(2.15)

m2

(2.16)

m3

(2.17) (2.18)

q α n − s + 2 log( p2 ) p = 2 (1 − θ)Cs q √ α √  n − s + 2 log( p2 )  1 + 2C p = 1+ C (1 − r0 )s q √ α s + 2 log( p2 ) p . = θ (1 − r0 )s √

(2.14)

Moreover, set

! 4 p π(n − s) n−s pα

K1,2 = max K3 =



1 + 1+2C C p , p (1 − θ)C (1 − r0 )s 1

!

1 p . θ (1 − r0 )

Finally, let ℓ−1 be the inverse of x 7→ x log x. We can now define the range assumption for the coefficients of β. Assumption 2.3. (Range condition for β) The unknown vector β verifies (2.19)

min |βj | ≥ mθ,C σ,

(2.20)

kβk1 ≤ Mθ,C σ.

i∈T

SPARSE RECOVERY WITH UNKNOWN VARIANCE

7

2.3. First part of the analysis – Invertibility of XT . As discussed in Section 2.2.1, the property that XT is far from singularity is very natural. The following theorem precisely relates the sparsity s, the coherence µ(X) and the parameters of the problem n, p and kXk, under which the quasiisometry property fails for XT with probability at most of the order p−α . Theorem 2.4. Let p, α > 0 such that α log p ≥ 8. Assume that (2.21)

Cs ≤

(2.22)

Cµ ≤

r02 3(1 + α)η η−1 p p Cs 2 + 2η/3

for some arbitrary constants r0 ∈ (0, 1), η > 1 and γ > 0. Let Assumptions 2.1 and 2.2 hold with Cs p s ≤ (2.23) . kXk2 log p Then the following bound holds:  2 × 4 × 324 (2.24) . P kXTt XT − Ik ≥ r0 ≤ pα Notice that α log p ≥ 8 just ensures that (2.24) is non-trivial.

2.4. Second part of the analysis – Support and sign recovery. Our second main result shows that the estimator βb defined by (2.7) recovers the support and sign pattern of β exactly with probability of the order 1−O(p−α ) using similar bounds on the coherence and the sparsity as in [9] and an additional assumption on the magnitude of the coefficients of β. Theorem 2.5. Let p, α > 0 such that α log p ≥ 8. Choose r0 ∈ (0, 1), η > 1 and γ > 0. Set   κ(1 − r0 ) r02 √ , Cs ≤ min (2.25) 3(1 + α)η 2(1 + γ) α + 1 ! γ η−1 p Cs , (2.26) Cµ ≤ min p Cs 2(4 + γ)(α + 1) 2 + 2η/3 (2.27)

Cn ≥ ℓ−1 (8α).

Let Assumptions 2.1, 2.2 and 2.3 hold with (2.28) (2.29)

n ≥ Cn log p   1 p Cs . min , s ≤ log p kXk2 4 max(K1,2 , K3 )2

Then the probability that the estimator βb defined by (2.7–2.8) exactly recovers the support and sign pattern of β is greater than 2604 1− α . p Remark 2.6. The constant 2604 stems from the following decomposition: 2 (poissonization) ×324 (decoupling) ×4 (union bound) +10 (union bound on a set of key ℓ∞ -norms) +2 (χ2 bounds).

8

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

2.5. Important comments. 2.5.1. About β. Basically, Theorem 2.5 tells that the admissible range for the ℓ1 -norm of β ensuring the exact recovery is Range(kβk1 ) = [σ s mθ,C , σ Mθ,C ] . Notice that, given a model defined by the parameters p, n and s, one may want to optimize the constants θ ∈ (0, 1), η ≥ 1 and C > 0 in order to make the range [s mθ,C , Mθ,C ] maximal for a given large probability. Our condition on minj∈T |βj | is similar to the one given in [9]. However, we also need an upper bound to ensure exact recovery. This additional condition reflects the fact that the fidelity term is intuitively less sensitive to the magnitude of the βj ’s than the penalization term is. Indeed, if the nonzero βj ’s were uncontrollably large, then the penalization term would overwhelm the fidelity term and lead to an unreliable estimator. Common sense would predict that in the case of such a loss of balance between fidelity and penalization, the resulting estimator is numerically unstable. Our work shows that it is statistically bad, even if one works with infinite precision, when the variance is unknown. Let us also notice that the larger the number n of observations is, the wider the admissible range for the values of kβk1 is, a highly natural phenomenon. A less intuitive but remarkable fact is that the same phenomenon occurs when r0 decreases to zero (a side effect being to decrease Cs , Cµ and s as well however). 2.5.2. About C. The constant C can be adapted to our prior knowledge on the magnitude of the βj ’s. However, increasing the upper bound on the magnitude of the βj ’s via decreasing the constant C also results in increasing the lower bound. Therefore, C governs a sliding window inside which the coefficients of β can be recovered by the Balanced-LASSO. 2.5.3. About X. It is well known that when X is obtained from a random matrix with i.i.d. standard gaussian random entries by normalizing the columns, we have that kXk2 is of the order p/n, and taking n of the order of log p3 is sufficient for satisfying the Incoherence Assumption 2.1. This instructive example suggests that the upper bound (2.29) on the number s of nonzero components of β will very often reduce to s ≤

Cs p log p kXk2

in practical situations. 2.5.4. About the constants Cs et Cµ . Let us compare the numerical values of these constants to the one obtained in [9]. One of the various constraints on the rate α in [9] is given by the theorem of Tropp in [29]. In this setting, α = 2 log 2 r0 = 1/2,

SPARSE RECOVERY WITH UNKNOWN VARIANCE

9

the author’s choice of 1/2 being unessential. To obtain such a rate α, they need to impose the r.h.s. of (3.15) in [9] to be less than 1/4, that is: p 30Cµ + 13 2Cs ≤

(2.30)

1 . 4

This yields Cs < 1.19 × 10−4 . Let us choose Cs close to this maximum allowed. Then, compute Cµ by (2.30). This yields Cs ≃ 1.18 10−4 ,

Cµ ≃ 1.7 10−3 .

(The additional condition coming from the endpof the proof of [9, Lemma 3 3.5], that is 64C 3/(128 log 2) ≫ 1.7 10−3 .) 2 = 2 log(2), is not limiting since µ

Our theorem allows to choose any rate α > 0. To make a fair comparison, let us also choose α = 1.5 > 2 log 2 and r0 = 1/2. When setting η = 1.1 and γ = 25, we obtain: Cs ≃ 0.03,

Cµ ≃ 5.2 10−3 .

As a second example, take α = 2, r0 = 1/2, η = 1.1 and γ = 33. Then Cs ≃ 0.025,

Cµ ≃ 3.7 10−3 .

3. Proof of Theorem 2.4 In order to study the invertibility condition (i), we want to obtain bounds for the distribution tail of random sub-matrices of H = X t X − I. Recall that R is the selector matrix defined by (1.5). Let R′ be an independent copy of R. Let us recall two basic estimates: (3.31) (3.32)

kHk21→2 ≤ kXk2

kHk2 ≤ kXk4 .

We proceed our study in two steps; We first obtain for the distribution tail of random sub-matrices of H, a general bound determined by several ranges of parameters. Second, we tune these parameters. As a preliminary, let us notice that (3.33)

P (kRs HRs k ≥ r0 ) ≤ 2 P (kRHRk ≥ r0 ) ,

which can be actually proven using the same kind of ’Poissonization argument’ as in Claim (3.29) p.2173 in [9]. 3.1. Step 1 – Deviation of random sub-matrices. Theorem 3.1. For all 4-tuples [r, t, u, v] of parameters such that u2 ≥ s s2 s 4 2 2 2 2 4 p kXk , v ≥ p kXk and r − t ≥ p2 kXk , the following bound holds: (3.34)

P (kRHRk ≥ r) ≤ 2p V(s, [u, v, r, t]),

10

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

where V(s, [r, t, u, v]) = exp

−(u2 − ps kXk4 )2 /2

!

s 6 2 2 p kXk + kXk u /3   2  p 2 s 2 4 /2   − s (r − t ) − p kXk + exp  s  p 6 2 2 2 p kXk + kXk s (r − t )/3



 + exp − s



v 2 − ps kXk2

2

/2

2 2 2 2 p kXk µ(X) + v µ(X) /3

t4 /(2v 2 ) + exp − s 2 2 p u + t /3

!

  

.

Proof. To study the distribution tail of kRHRk, we use a decoupling technique which consists of replacing kRHRk with kRHR′ k. Lemma 3.2. The operator norm of RHR satisfies

 P (kRHRk ≥ t) ≤ 324 P kRHR′ k ≥ t .

(3.35)

Proof. As in [29], let us write

RHR =

X

δj δk (Hjk + Hkj ).

j
Let {ηi } be a sequence of iid Rademacher (±1 symmetric random variable) independent of D := {δi , 1 ≤ i ≤ p}. Following Bourgain and Tzafriri’s [5], and Gine and De La Pena’s [24], we build up from these two methods an auxiliary r.v. for our purpose: 1X ((1 + ηj )(1 − ηk ) + (1 + ηk )(1 − ηj )) δj δk (Hjk + Hkj ) Z = 2 j
(3.36)

= RHR + Y,

where (3.37)

Y =

X

ai ηi +

1≤i≤p

X

bij ηi ηj .

1≤i6=j≤p

In particular, (3.38)

RHR = E [Z|D] . x∗ (Y

Set ξ = ) for some linear form x∗ on Mp (R). Conditionally on D, ξ is an homogeneous chaos of order 2, i.e. of the form (3.37) as mentioned in the Appendix. From Theorem 7.1, one then obtains 1 1 (3.39) ED |ξ|4 4 ≤ 3 ED |ξ|2 2 . Holder’s inequality yields (3.40)

ED |ξ|

(ED |ξ|2 )

1 2



ED |ξ|2

(ED |ξ|4 )

1 2



1 . 32

SPARSE RECOVERY WITH UNKNOWN VARIANCE

11

We then apply Lemma 7.2 (cf Appendix) to (3.36): (3.41)

1 1 = . 4 × 34 324 and taking the expectation, one ob-

P (kZk ≥ kRHRk |D) ≥

Multiplying both sides by 1{kRHRk≥t} tains 1 (3.42) P(kRHRk ≥ t) ≤ P (kZk ≥ t) . 324 But   (3.43) P (kZk ≥ t) = E E 1{kZk≥t} |(ηi ) .

As from now, we can follow the same arguments as in [29, Prop. 2.1], which we recall for the sake of completeness. There is a 0 − 1 vector η ∗ for which 1{kZk≥t} exceeds its expectation over (ηi ). Hence, setting T = {j, ηj∗ = 1},

 

X

 δj δk (Hjk + Hkj ) P (kZk ≥ t) ≤ P 

≥t

j∈T, k∈T c

 

X

 δj δk′ (Hjk + Hkj ) ≤ P 

≥t ,

j∈T, k∈T c

where (δi′ ) is an independent copy of (δi ). Using an identity for block counterdiagonal Hermitian matrices, and re-introducing the missing entries in H, one concludes that (3.44)

P (kZk ≥ t) = P(kRHR′ k ≥ t),

which concludes the proof of the lemma due to (3.42).  Remark 3.3. The previous lemma can be seen as a special case of Theorem 1 p.224 in [23]. Tracking back the various constants involved in this theorem, we obtain the inequality   t 3 ′ (3.45) , P (kRHRk ≥ t) ≤ 10 P kRHR k ≥ 18 which is weaker than our lemma. Instead, we took advantage of the refinement of [5, Prop. 1.9] obtained in [29], and the decoupling techniques for distribution tail in [24]. We then apply the Non-Commutative Bernstein inequality to kRHR′ k. We need three technical lemmas to complete the proof of Theorem 3.1. Lemma 3.4. The following bound holds:     p P kRHR′ k ≥ t ≤ P kRHk2 ≥ u2 + P kRHk2 ≥ (r2 − t2 ) s ! t4 /2 (3.46) . + P (kRHk1→2 ≥ v) + p exp − s 2 2 2 2 p v u + u t /3

12

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

Proof. We have kRHR′ k2 = kRHR′2 HRk. But R′ 2 = R′ , so   (3.47) P kRHR′ k ≥ t = P kRHR′ HRk ≥ t2 .

We will first compute the conditional probability    (3.48) P kRHR′ HRk ≥ t2 | R := E 1{kRHR′ HRk≥t2 } | R . Note that



(3.49)

RHR HR =

p X

δj′ Zj Zjt ,

j=1

where Zj is the j th column of RH, j = 1, . . . , p. In order to apply the noncommutative Bernstein inequality let us define the centered random matrices s (3.50) Aj = δj Zj Zjt − Zj Zjt , j = 1, . . . , p. p Moreover, we have kZj Zjt k = kZj k22 and thus, kAj k ≤ max1≤j≤p kZj k22 . Let us now bound the variance   p p X X Var  Aj  = (3.51) Var (Aj ) , j=1

j=1

(3.52)

=

s p

  p s X Zj Zjt Zj Zjt , 1− p j=1

p



(3.53) Since (3.54)

Pp

t j=1 Zj Zj

 X s max kZj k22 Zj Zjt . p 1≤j≤p j=1

= RH 2 R, we then obtain

 

p X

Var 

≤ s kRHk21→2 kRHk2 .  A j

p

j=1

The non-commutative Bernstein inequality then yields (3.55) where (3.56)

P (A | R) ≤ p exp

−t4 /2 s 2 2 2 2 p kRHk1→2 kRHk + kRHk1→2 t /3

  s ′ 2 2 A = kRHR HR − RH Rk ≥ t . p

Let us now introduce (3.57) (3.58)

B = {kRHk ≥ u} C = {kRHk1→2 ≥ v} .

We have (3.59) (3.60)

P(A) = P(A | B ∪ C)P(B ∪ C) + P(A ∩ B c ∩ C c ) ≤ P(B) + P(C) + P(A ∩ B c ∩ C c ).

!

,

SPARSE RECOVERY WITH UNKNOWN VARIANCE

The following inequality holds on B c ∩ C c : s (3.61) kRHk21→2 kRHk2 + kRHk21→2 t2 /3 ≤ p Using (3.62)

13

s 2 2 v u + u2 t2 /3. p

P(A ∩ B c ∩ C c ) = E [1A∩Bc ∩C c ] = E [P (A | R) 1Bc ∩C c ]

we can deduce   s 2 2 ′ ≤ P (kRHk ≥ u) + P (kRHk1→2 ≥ v) P kRHR HR − RH Rk ≥ t p ! t4 /2 +p exp − s 2 2 2 2 p v u + u t /3 for all t, u, v in R+ . The inequality    s ′ 2 2 2 2 P kRHR HRk ≥ r ≤ P k RH Rk ≥ r − t p   s +P kRHR′ HR − RH 2 Rk ≥ t2 , p for 0 ≤ t ≤ r, concludes the Lemma.



We now have to control the norm of ps RH 2 R, the norm of RH and the column norm of RH. Let us begin with kRHk = kHRk. Lemma 3.5. The following bound holds: P (kHRk > u) ≤ p exp

−(u2 − ps kXk4 )2

s 6 p kXk

+ 4kXk2 (u2 − ps kXk4 )

!

.

Proof. The steps are of course the same as what we have just done in the proof of Lemma 3.2. We have  (3.63) P (kRHk > u) = P kHRk2 > u2  (3.64) = P kHRHk > u2 . The j th column of H is Hj = X t Xj − ej . Moreover,

(3.65)

HRH = 

Set Bj = δj − (3.66) (3.67)

s p



p X

δj Hj Hjt .

j=1

Hj Hjt . Then, kHj Hjt k = kHj k22

≤ kHk21→2 ≤ kXk2 .

Let us now bound the variance. We have    X p p X s s Var  Bj  = Hj Hjt Hj Hjt 1− p p j=1

j=1



s kHk21→2 p

p X j=1

Hj Hjt .

14

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

Thus,

 

p X

Var  Bj 



j=1

(3.68)



(3.69)

s kHk21→2 kHk2 p s kXk6 . p

We finally deduce from the non-commutative Bernstein inequality that !

  4 /2

−u s 2 2 . ≤ p exp s P

HRH − p H ≥ u 6 2 2 p kXk + kXk u /3 A triangular inequality yields the conclusion.



Let us now control the supremum ℓ2 -norm of the columns of RH, i.e. kRHk1→2 . Lemma 3.6. The following bound holds:  (3.70) P (kRHk1→2

 2 2 − s kXk2 v /2 p    ≥ v) ≤ p exp − . s 2 2 2 µ(X) p kXk + v /3 

Proof. This result is actually a consequence of Theorem 3.2 in [29]. We provide below an alternative short proof which uses the non commutative Bernstein inequality. Set (3.71)

M

=

p X

δk diag(Hk Hkt ).

k=1

Note that

kRHk21→2 = kM k.

(3.72) We then have (3.73)

EM

=

(3.74)

kE M k =

(3.75)

kVarM k ≤

s diag(HH t ) p s kHk21→2 p s s kHk21→2 µ(X)2 ≤ kXk2 µ(X)2 . p p

Note that for all k ∈ {1, · · · , p}, (3.76)

kdiag(Hk Hkt )k ≤ µ(X)2 .

We are now able to apply the non commutative Bernstein inequality: !  v 4 /2 2 P kM − E M k ≥ v ≤ p exp − s . 2 2 2 2 p kXk µ(X) + v µ(X) /3

This completes the lemma using the triangular inequality.



This ends the proof of the theorem. 

SPARSE RECOVERY WITH UNKNOWN VARIANCE

15

3.2. Step 2 – Tuning the parameters. We now have to analyze carefully the exponential bounds in Theorem 3.1. A crucial quantity turns out to be s 2 p kXk . Keeping in mind that the hypothesis on the coherence reads µ(X) ≤

(3.77)

Cµ , log p

it is relevant to seek the maximum sparsity s as (3.78)

s =

Cs p , kXk2 log p

where the constants Cµ and Cs will be tuned according to several constraints. Using repeatedly the equality s kXk2 = p

(3.79)

Cs , log p

the previous considerations yield the following first choice for the parameters: (3.80)

r = r0 s2 kXk4 (log p)2 = r02 − LCs2 p2 s ηCs = η kXk2 = p log p s = L′ kXk4 (log p)2 = L′ Cs2 . p

(3.81)

t2 = r02 − L

(3.82)

v2

(3.83)

u2

Hence, we obtain t4 /(2v 2 ) s 2 2 p u + t /3

(3.84)

(3.85)

(3.86)

(3.87)

(u2 − ps kXk4 )2 /2

s 6 p kXk

+ kXk2 u2 /3 2  p 2 2 ) − s kXk4 (r − t /2 s p + kXk2 ps (r2 − t2 )/3 2  v 2 − ps kXk2 /2   µ(X)2 ps kXk2 + v 2 /3

s 6 p kXk

=

=

=



2 r02 − LCs2 3/2 log p ηCs r02 + (3L′ − L)Cs2  2 1 1 − ′ 2 3 L (log p) L′ Cs log p 3 2 1 + L′ (log 2 p)  2 1 1 − 3 L(log p)2 LCs log p 2 1 + L(log3 p)2 (η − 1)2 /2 Cs log p. 1 + η/3 Cµ2

Set to α the minimum desired rate of convergence for P (kRHRk ≥ r0 ). This implies  2  2  1  3/2 2 2 1 − ′ 2 r0 − LCs 3 L (log p) L ′ Cs , min , 3 2 ′ 2  ηC 2 1 + L′ (log p)2  s r0 + (3L − L)Cs  2  1 1 − 2 3 (η − 1) /2 Cs  L(log p)2 ≥ α + 1. LCs , 2 1 + L(log3 p)2 1 + η/3 Cµ2   

16

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

It is natural to choose α+1 , Cs

L = L′ =

(3.88)

and take an infimum bound for log p so that L(log p)2 ≥ 12, to ensure  2 1 1 − 3 L(log p)2 (3.89) ≥ 1. 2 1 + L(log3 p)2 We now want to optimize Cs in terms of t ∈ (0, 1) such that LCs2 ≤ t r02 , i.e. r02 t. α+1

Cs ≤

(3.90)

Writing that the constant in the r.h.s. of (3.84) is greater than α + 1, we obtain 3 (1 − t)2 r02 . (α + 1) η 2 1 + 2t

Cs ≤

(3.91)

n o 2 Since supt∈(0,1) min t, 32 (1−t) ≥ 13 , the finest inequality reads 1+2t Cs ≤

(3.92)

r02 . 3(α + 1)η

We wanted L(log p)2 ≥ 12, thus it is sufficient to set log p ≥ 2 since 2 ≥

2r0 √ . (α + 1) η

Inequalities (3.87) and (3.92) are satisfied when Cµ ≤

(3.93)

η−1 p p Cs , 2 + 2η/3

as required in the assumptions of the theorem.

4. Proof of Theorem 2.5 The proof is divided into several steps. We first provide the description and consequences of the optimality conditions for the standard LASSO estimator as a function of λ. Second, we prove that these optimality conditions are satisfied by a simple and natural oracle estimator. Last but not least, we end with giving the conditions under which the range is nonempty. 4.1. Optimality conditions. A necessary and sufficient optimality condition in (1.2) is that (4.94)

0∈∂

 ky − X βb k2 λ 2

2b σλ2



 kβbλ k1 n 2 ) log(b σ + λ , 2 σ bλ2

SPARSE RECOVERY WITH UNKNOWN VARIANCE

17

where ∂ denotes the sub-differential, which is equivalent to the existence of gλ in ∂k · k1 at βbλ such that

(4.95)

(4.96)



X t (y − X βbλ ) λ + 2 gλ = 0 σ bλ2 σ bλ

λ n ky − X βbλ k22 − 2 3 kβbλ k1 + 3 σ bλ σ bλ σ bλ

= 0.

On the other hand, the sub-differential of k · k1 at βbλ is defined by n o ∂k · k1 (βbλ ) = γ ∈ Rp , γTbλ = sgn(βbTbλ ) and kγTbc k∞ < 1 . λ

Thus, using the fact that y = Xβ+z, we may easily conclude that a necessary and sufficient condition for optimality in (2.7) is the existence of a vector gλ , satisfying gTbλ = sgn(βbTbλ ) and kgTbc k∞ < 1, and such that λ

(4.97)

−X (y − X βbλ ) = λ gλ t

ky − X βbλ k22 + 2λkβbλ k1 . n The following corollary is a direct but important consequence of these previous preliminary remarks. (4.98)

σ bλ2 =

Corollary 4.1. A necessary and sufficient condition for a given random vector b with support T to simultaneously satisfy the two following conditions: (1) b = βbλ , (2) b has the same support T and sign pattern sgn(βT ) as β is that XTt (y − Xb) = −λ sgn(βT )

(4.99)

kXTt c (y − Xb)k∞ < λ.

(4.100)

Proof. The fact that (4.99) and (4.100) are necessary is a straightforward consequence of (4.97). Conversely, assume that (4.99) and (4.100) hold. Set 1 g = − X t (y − Xb). (4.101) λ Using (4.95), we deduce that g belongs to ∂k · k1 (b) and that the support of b is exactly the set T = {j ∈ {1, . . . , p}, |g| = 1}. On the other hand, we have that (4.102) (4.103)

g = sgn(βT ) kgk∞ < 1,

and we may deduce that g is at the same time in the sub-differential of any vector b in Rp with same support and sign pattern as β. Therefore, we have (4.104)

T

= {j ∈ {1, . . . , p}, |gj | = 1} = T,

and we conclude that β and b have the same support. Moreover, the index set T + of the positive components of β and the index set T+ of the positive components of b satisfy (4.105)

T + = {j ∈ {1, . . . , p}, gj = 1} = T+ .

18

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

The same argument implies that the index set T − of the negative components of β equals the index set T− of the negative components of b. To sum up, β and b have the same support and sign pattern and the proof is completed. This moreover implies that (4.99) and (4.100) are the optimality conditions for (1.2) and we obtain that b = βb as announced. 

Remark 4.2. Notice that Theorem 2.5 indirectly implies uniqueness of this estimator with high probability when X is sufficiently incoherent, β is sufficiently sparse and λ is appropriately tuned so that the ratio between the fidelity term and the penalization term is equal to C. However, one should keep in mind that in practice, (i) one has to solve the estimation problem for λ as well and (ii) for an arbitrary value of λ, uniqueness is not guaranteed anymore. Therefore, uniqueness in a more general setting and continuity properties of the estimator as a function of λ are also crucial for the analysis of the practical estimation procedure. Such results are presented in Section 5 below based on a king of general position assumption on the matrix X, a much less restrictive assumption in general than the incoherence assumption.

b We now discuss the next step of 4.2. Oracle estimators for βb and λ. the proof of Theorem 2.5, which consists of studying some sort of oracle estimator for β which enjoys the property of knowing the support T of β ahead of time. This idea was introduced in [9]. As in [9], our goal in this section is to prove that this estimator satisfies the optimality conditions (4.99) and (4.100) of Corollary 4.1 with high probability. Hence, we will obtain that the oracle estimator for β is nothing but βbλ and the support of βbλ is T with high probability. 4.2.1. Enforcing the invertibility assumption. By Theorem 2.4, we have −1 (4.106) k ≤ (1 − r0 )−1 k XTt XT (4.107)

kXT k ≤ (1 + r0 )1/2

with probability greater than 1 − 2 × 4 × 324 p−α . Thus, throughout this section, we will assume that (4.106) and (4.107) hold, i.e. we will reduce all events considered to their intersection with the event that (4.106) and (4.107) are satisfied. b For a 4.2.2. Closed form expressions for the oracle estimators of βb and λ. e one might like to consider the following oracle for β: b given λ, 1 e 1, β ∈ argmax − ky − Xbk22 − λkbk (4.108) 2 b∈B where

B = {b ∈ Rp , supp(b) = T, sgn(b) = sgn(βT )}. However, it is not so easy to derive a closed form expression for β. Therefore, it might be more interesting to consider the following oracle instead 1 e sgn(βT )t b. βe ∈ argmax − ky − Xbk22 − λ (4.109) 2 b∈Rp , supp(b)=T

SPARSE RECOVERY WITH UNKNOWN VARIANCE

Indeed βe satisfies

XTt

(4.110)



 e e sgn(βT ) = 0 y − XT β T − λ

and, we obtain that βe is given by βeT

(4.111)

19

XTt XT

=

−1 

 e sgn(βT ) . XTt y − λ

e is a This formula is the same as in the proof of Th. 1.3 in [9], but here, λ variable. e verifying We now seek λ

1 e sgn(βT )t βeT . ky − XT βeT k22 = C λ 2

(4.112)

Replacing βe by its value (4.111), we obtain  −1  t t 1 e sgn(βT ) k2 X X X y + λ ky − X T T T T 2 2     e sgn(βT )t X t XT −1 X t y − λ e sgn(βT ) . = Cλ T T

Thus,

 1 e T X t XT −1 sgn(βT )k2 = kPV ⊥ y − λX 2 T T 2   −1 e sgn(βT )t X t XT −1 X t y, e2 hsgn(βT ), X t XT sgn(βT )i + C λ −C λ T T T

and using the orthogonality relations, we obtain 1 2

e2 −1 λ sgn(βT )k22 kXT XTt XT 2  e2 hsgn(βT ), X t XT −1 sgn(βT )i −C λ T  e sgn(βT )t X t XT −1 X t y, +C λ T T

kPV ⊥ yk22 + T

=

which is equivalent to  1  2 1 e k X t XT − 2 sgn(βT )k2 (4.113) + C λ 2 T 2 −1 t 1 t t e sgn(βT ) X XT XT y + kPV ⊥ zk22 = 0. −C λ T T 2

The roots of the quadratic equation are (4.114) where

e = λ ∆ =

C sgn(βT )t XTt XT (1 + 2C)k XTt XT

−1

− 1 2

XTt y ±





sgn(βT )k22

,

−1 t 2 XT y C sgn(βT )t XTt XT 1 − −(1 + 2C)k XTt XT 2 sgn(βT )k22 kPV ⊥ zk22 .



T

20

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

Moreover, we have −1 t −1 t XT (XT β + z) XT y = sgn(βT )t XTt XT sgn(βT )t XTt XT −1 t t XT z = sgn(βT ) β + sgn(βT )t XTt XT = kβk1 −1 sgn(βT ), PVT z + PV ⊥ zi. +hXT XTt XT T

Hence,

(4.115) sgn(βT )t XTt XT

−1

XTt y = kβk1 +hXT XTt XT

−1

sgn(βT ), PVT zi.

Note that the Cauchy-Schwarz inequality yields (4.116) − 1 −1 sgn(βT ), PVT zi ≤ k XTt XT 2 sgn(βT )k2 kPVT zk2 . hXT XTt XT

4.2.3. Bounds on kPVT zk2 and kPV ⊥ zk2 . Using some well known properties T of the χ2 distribution recalled in Lemma 7.4 in the Appendix, we obtain that  √  √ P kPVT (z)k2 /σ ≥ s + 2t (4.117) ≤ exp(−t)   √ √ (4.118) ≤ exp(−t) P kPV ⊥ (z)k2 /σ ≥ n − s + 2t T

and

(4.119) P



kPV ⊥ (z)k22 /σ 2 T

≤ u(n − s)





Tune u and t such that

i.e. (4.120)

n−s 2 p (u e/2) 4 . π(n − s)

n−s 2 2 (u e/2) 4 = α , exp(−t) = p p π(n − s)

t = log(pα /2)

and

u=

2 e

!4/(n−s) p π(n − s) . pα

Thus, we obtain that (4.121)

kPV ⊥ zk22 T

σ2



and (4.122)

kPV ⊥ zk22 T

σ2



√

n−s+

2(n − s) e

2 p 2 log(pα /2)

!4/(n−s) p π(n − s) pα

with probability greater than or equal to 1 − 2p−α .

SPARSE RECOVERY WITH UNKNOWN VARIANCE

21

4.2.4. Positivity of the discriminent. We begin with the study of two key − 1 −1 t XT y and k XTt XT 2 sgn(βT )k22 kPV ⊥ zk22 . quantities: sgn(βT )t XTt XT T

XTt y. By (4.106), we have r s sgn(βT )k2 ≤ . 1 − r0

We first study sgn(βT )t XTt XT k XTt XT

(4.123)

− 1 2

−1

Thus, using the lower bound (2.19) from Assumption 2.3 on the non-zero components of β, we deduce s  α ! √  √ p s σ −1 s + 2 log sgn(βT ), PVT zi ≤ √ hXT XTt XT 2 1 − r0 q √ α s + 2 log p2 p ≤ sσmθ,C mθ,C s(1 − r0 ) q √ α s + 2 log p2 p kβk1 . ≤ mθ,C s(1 − r0 )

By virtue of (2.13), we have: q √ α s + 2 log p2 p ϑ := (4.124) mθ,C s(1 − r0 )

≤ θ < 1.

Therefore, from (4.115) we deduce that

(4.125) (1 − ϑ)kβk1 ≤ sgn(βT )t XTt XT

Set

− 1 2

XTt y ≤ (1 + ϑ)kβk1 .

sgn(βT )k22 kPV ⊥ zk22 . We have T q √ α n − s + 2 log p2 p sgn(βT )k2 kPV ⊥ zk2 ≤ kβk1 T mθ,C s(1 − r0 )

Second, we study k XTt XT k XTt XT

− 1

−1

ϑ⊥ kβk1 =

(4.126)

2



q α n − s + 2 log p2 p . mθ,C s(1 − r0 )

Finally, using (4.125) and (4.126), we obtain (4.127)

∆ ≥

Recall from (2.13) that (4.128) which yields (4.129)



1 − 2C C(1 − θ)

 C 2 (1 − ϑ)2 − (1 + 2C)ϑ2⊥ kβk21 . √

q n − s + 2 log √ 1 − r0

pα  2

≤ mθ,C ,

C 2 (1 − ϑ)2 − (1 + 2C)ϑ2⊥ > 0.

22

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

e is well defined. Therefore ∆ > 0, which ensures that λ ˜ First, let us write 4.2.5. Bounds on λ.  √ −1 t  ∆ = C sgn(βT )t XTt XT XT y v u − 1 u (1 + 2C) k XTt XT 2 sgn(βT )k22 kPV ⊥ zk22 u T . ×u  t1 − −1 t 2 t XT y C sgn(βT )t XT XT

√ On one hand, the concavity of the square-root gives 1 − δ ≤ 1 − δ ∈ (0, 1), and we obtain  √ −1 t  XT y ∆ ≤ C sgn(βT )t XTt XT − 1 (1 + 2C)k XTt XT 2 sgn(βT )k22 kPV ⊥ zk22 T − . −1 t t t XT y 2C sgn(βT ) XT XT

δ 2

where

Combining this last equation with (4.114), we obtain that

(4.130)

e ≥ λ

kPV ⊥ zk22 T

−1 t . 2C sgn(βT XTt XT XT y √ On the other hand, we also have 1 − δ ≥ 1 − δ on (0, 1). Thus we can write  √ −1 t  XT y ∆ ≥ C sgn(βT )t XTt XT − 1 (1 + 2C)k XTt XT 2 sgn(βT )k22 kPV ⊥ zk22 T − −1 t XT y C sgn(βT )t XTt XT )t

and combining this last equation with (4.114) and the previous upper bound, we thus obtain kPV ⊥ zk22 T e λ ≤ −1 t . XT y C sgn(βT )t XTt XT Using (4.125), we finally get

(4.131)

kPV ⊥ zk22 T

2C(1 + ϑ) kβk1

e ≤ ≤ λ

kPV ⊥ zk22 T

C(1 − ϑ) kβk1

.

Combining this last equation with (4.122), we obtain:  4 √ √ 2 π(n−s) n−s p (n − s) n − s + 2 log(pα /2) pα e ≤ σ2 ≤ λ . σ2 2C (1 + ϑ)kβk1 C(1 − ϑ)kβk1

Using Assumption 2.3 and (2.12), we thus obtain p e ≥ κ σ log p. λ

SPARSE RECOVERY WITH UNKNOWN VARIANCE

23

4.3. Exact recovery of the support and sign pattern. We now conclude the proof using the strategy announced in the beginning of this section: e satisfy the optimality conditions (i) We prove that the proxies βe and λ b = λ. e of Corollary 4.1, from which we deduce that βb = βe and λ (ii) Since the proxy βe has the right support and sign patterns, we conclude that βb exactly recovers these features as well.

4.3.1. Bounds holding with high probability for the ℓ∞ -norm of some important quantities. To obtain the exact recovery of the support and sign patterns of β, we will need similar bounds as the ones in [9, Section 3.5]. Namely, √ 2(1+α)(1+r0 ) √ t −1 t σ log p (i) k(XT XT ) XT zk∞ ≤ 1−r0 (ii) k(XTt XT )−1 sgn(βT )k∞ ≤ 3 (iii) kXTt c XT (XTt XT )−1 sgn(βT )k∞ ≤ 41 p  √ (iv) kXTt c I − XT (XTt XT )−1 XTt zk∞ ≤ 2(1 + α) σ log p.

When r0 = 12 , these assumptions were proven to hold with high probability in [9] based on previous results due to Tropp [29]. Most of the proofs that these conditions hold with high probability are the same as in [9] up to some slight improvements of the constants. Proposition 4.3. The bounds (i-iv) hold with probability at least 1 −

10 pα .

Proof. See Section 7.3 in the Appendix. Notice that the proof of (ii) requires the conditions γ Cs Cµ ≤ 2(4 + γ)(α + 1) 2 √ Cs ≤ (1 + γ) α + 1 from Theorem 2.5. Moreover, the proof of (iii) requires requires the conditions γ Cs Cµ ≤ 2(4 + γ)(α + 1) (1 − r0 ) √ . Cs ≤ 8(1 + γ) α + 1  4.3.2. βe and β have the same support and sign pattern. First, it is clear that βe and β have the same support. Next, we must prove that βe has the same sign pattern as β. Use Proposition 4.3 to obtain e k(X t XT )−1 sgn(βT )k∞ kβeT − βT k∞ ≤ k(XTt XT )−1 XTt zk∞ + λ T p 2(1 + α)(1 + r0 ) p e σ log p + 3λ. ≤ 1 − r0 e we obtain Using the lower bound (4.132) on λ, (4.132)

e kβeT − βT k∞ ≤ 4λ.

24

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

A sufficient condition to guarantee that the sign pattern is recovered is that this last upper bound be lower than the minimum absolute value of non-zero components of β, i.e. e ≤ min |βj |. 4λ

(4.133)

j∈T

e in (4.132), this is achieved in particular when Using the upper bound on λ 4

√

n−s+

2 p 2 log(pα /2)

C(1 − ϑ)

≤ min |βj | j∈T

kβk1 . σ2

In view of this inequality, and since kβk1 ≥ s minj∈T |βj |, an even stronger sufficient condition is 2 √ p n − s + 2 log(pα /2) minj∈T |βj |2 ≤ . 4 C s(1 − ϑ) σ2 Noting that ϑ ≤ θ, we conclude that this condition is easily implied by Assumption 2.3. e satisfy the optimality conditions. Using the lower bound 4.3.3. βe and λ e e satisfy the optimality condi(4.132) on λ, the proof of the fact βe and λ tions is exactly the same as in [9, Section 3.5]. We repeat the argument for the sake of completeness. On the one hand, by construction, we clearly have

On the other hand,

e = −λ e sgn(βT ). XTt (y − X β)

e X t c XT (X t XT )−1 sgn(βT )k∞ e ∞ = kX t c P ⊥ (z) + λ kXTt c (y − X β)k V T T T t t e ≤ kXT c PV ⊥ (z)k∞ + λ kXT c XT (XTt XT )−1 sgn(βT )k∞ p p 1e ≤ 2(1 + α) σ log p + λ 4 3 e e ≤ λ < λ. 4

Thus, the two parts of the optimality conditions in Corollary 4.1 are satisfied e and we deduce from this observation that βe = βb and λ e = λ. b by βe and λ,

4.3.4. Conclusion. The two preceding sub-sections prove that βb has same support and sign pattern as β as announced. This occurs when (4.117), (4.118), (4.119), (4.106) and (4.107) are satisfied simultaneously. This occurs with probability 1− as announced.

2 × 4 × 324 + 10 + 2 , pα

SPARSE RECOVERY WITH UNKNOWN VARIANCE

25

4.4. Nonempty range for kβk1 . We need to ensure that the range of admissible values for β is sufficiently large. The intuition says that this can be achieved by allowing sufficiently large values of n. In other words, we would like to know if there exists a constant Cn such that n − s ≥ Cn log p ≥ 2 log(pα /2),

(4.134)

and for which we have s mθ,C < Mθ,C . This suffices to know for which values of Cn , (4.134) implies q √ ! 4 α p n − s + 2 log( p2 ) π(n − s) n−s n−s √ (4.135) K1,2 s ≤ √ pα s log p and (4.136)

K3 s



s+

q α 2 log( p2 ) √ s



n−s √ log p

! 4 p π(n − s) n−s pα

for K12 and K3 defined by (2.17) and (2.18). Consider (4.135) first. We take the log of (4.135) and we compute the difference L: 1 L = log (n − s) − log(log p) 2  4 1 + (log(π) + log(n − s)) − α log p (n − s) 2 !! r √ √ pα − log K1,2 s n − s + 2 log( ) 2 ≥

 1 4α 1 2 log (n − s) − log p − log 4K1,2 s log p . 2 n−s 2

Thus, (4.134) implies that L ≥ ≥

 4α 1 1 2 log (Cn log p) − − log 4K1,2 s log p 2 Cn 2 ! 1 p 4α 1 log (Cn ) − + log 2 s log p . 2 Cn 2 4K1,2

2 log p and Therefore, if p/s ≥ 4K12

Cn log(Cn ) ≥ 8 α,

(4.137)

then (4.135) holds, which is the announced bound (2.27) in Theorem 2.5. Turning to (4.136), we can easily repeat the same computations and obtain that it holds for the same bound on Cn if p/s ≥ 4K32 log p. Thus, as announced, n ≥ Cn log p with Cn satisfying (4.137) ensures that s mθ,C < Mθ,C , i.e. the range for β is nonempty, as soon as s and p satisfy p s

≥ 4 max(K12 , K3 )2 log p.

26

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

5. The standard LASSO estimator as a function of λ This section establishes various continuity and monotonicity properties of some important functions of βbλ . Our main goal is to study the function   R+  R+ −→ λkβbλ k1 (5.138) Γ : , λ − 7 →   ky − X βbλ k2 2

b in the Balanced-LASSO which is of crucial importance to find the value of λ estimator. We will prove in particular that Γ is continuous, tends to +∞ when λ tends to zero and is decreasing for λ sufficiently large. For this purpose, let us first introduce the following useful General Position Condition on the matrix X. This condition will play a role in proving that the solution of the LASSO problem is unique for any positive λ (Recall Remark 4.2). Assumption 5.1. (General Position Condition for X) For any b∗ in Rp with support B, |B| ≤ n, and for any C ⊂ {1, . . . , p} such that |C| = n, C 6= B and XC is invertible, we have t t (5.139) sgn(b∗B ) − XB XC−t sgn(XC−1 XB b∗ ) b∗B 6= 0.

This property clearly holds with probability one if the entries of X are independent and have an absolutely continuous density with respect to the Lebesgue measure, a generic situation in statistics where the covariate measurements are usually corrupted by some noise. In the case of a more general type of design, this definition could easily be generalized so as to guarantee that (5.139) fails with probability at most of the order p−α (or is automatically satisfied for a deterministic design). The following notations will be useful. Define L as the cost function: ( ⋆ R+ × Rp −→ R+ 1 (5.140) L : ky − Xbk22 + λkbk1 , (λ, b) 7−→ 2 and for all λ > 0, (5.141)

θ(λ) =

inf L(λ, b).

b∈Rp

5.1. More on the estimator βbλ . We begin with the following useful characterization of the LASSO estimators.

Lemma 5.2. Any solution βbλ of (1.2), i.e. any LASSO estimator, is also a solution of (5.142) min kbk1 s.t. Xb = X βbλ . b∈Rp

Proof. Let β˜λ denote a solution of (5.142). Then, we have

(5.143)

kβ˜λ k1 ≤ kβbλ k1 .

On the other hand, the definition of βbλ implies that 1 1 (5.144) ky − X β˜λ k22 + λkβ˜λ k1 ≥ ky − X βbλ k22 + λkβbλ k1 . 2 2

SPARSE RECOVERY WITH UNKNOWN VARIANCE

Moreover, since X β˜λ = X βbλ , we have 21 ky − X β˜λ k22 = subtracting this equality to (5.144), we obtain that kβbλ k1 ≤ kβ˜λ k1 .

1 2 ky

27

− X βbλ k22 , and

which, combined with (5.143), implies that βb is a solution of (5.142).



The following lemma shows that βbλ has a support of size at most n and gives the expression of βbλ in terms of λ and the submatrix of X indexed by Tb.

Lemma 5.3. Assume that Assumption 5.1 holds. Then, for any λ > 0, the minimization problem (1.2) has a unique solution βbλ , and its support Tbλ ⊂ {1, . . . , p} verifies |Tbλ | ≤ n.

(5.145)

For any λ > 0 such that βbλ = 6 0, the matrix XTbλ is non-singular and we have    βbTbλ = (XTtb XTbλ )−1 XTtb y − λ sgn βbTbλ . (5.146) λ

λ

Proof. Throughout this proof, we denote by the solution set, the set of vectors b which solve (5.142), and not the set of vectors satisfying the constraints of (5.142), as may be sometimes found in the literature. Study of #Tb — Recall that b+ (resp. b− ) be the non-negative (resp. non− positive) part of b, i.e. b = b+ − b− , with b+ j , bj ≥ 0. Then, Lemma 5.2 above equivalently says that βb is a solution of (1.2) if and only if βbλ+ and βbλ− are solutions of p n o X − (5.147) min p b+ + b s.t. Xb+ − Xb− = X βbλ . j j b+ , b− ∈R+

j=1

The remainder of the proof relies on linear programming theory and Assumption 5.1. Notice first that the solution set is compact due to the coercivity of the ℓ1 -norm. Thus, the theory of linear programming [27] ensures that each extreme point of the solution set of (5.147) is completely determined by a "basis" B. In the present setting, for an extreme point b∗ = b∗ + − b∗ − of the solution set of (5.147), the associated basis B ∗ can be written (in a non-unique way) as B ∗ = B ∗+ ∪ B ∗− , |B ∗ | = n, and is such that (i) the square matrix [XB ∗+ , −XB ∗− ] is non singular, (ii) b∗ B ∗ c = 0 and (iii) the couple (b∗ + , b∗ − ) is uniquely determined by the system B ∗+ B ∗−

(5.148)

∗− XB ∗+ b∗ + B ∗+ − XB ∗− b B ∗−

(or equivalently, XB ∗ b∗B ∗ = X βbλ ).

= X βbλ ,

An immediate consequence is that the support of b∗ has cardinal at most n: #Tb ≤ n.

28

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

Uniqueness of βbλ — Let c∗ be another extreme point of the solution set of (5.142) with associated basis C ∗ , i.e. XC ∗ c∗C ∗ = X βbλ . Thus c∗C ∗ = XC−1∗ XB ∗ b∗B ∗ . Moreover, since b∗ and c∗ have the same objective values, sgn(b∗B ∗ )t b∗B ∗ − sgn(c∗C ∗ )t c∗C ∗

(5.149)

= 0,

which implies that (5.150)

t ∗ −t t ∗ sgn(b∗B ∗ ) − XB bB ∗ ∗ X ∗ sgn(cC ∗ ) C

= 0,

and thus contradicts Assumption 5.1. Therefore, (5.142) has a unique solution b∗ . Moreover, combining this result with Lemma 5.2 gives that any solution βbλ of (1.2) satisfies βbλ = b∗ which completes the proof. To prove (5.146), we check that βbλ defined by (5.146) satisfies the optimality conditions. This completes the proof.  5.1.1. Continuity. The following intuitive lemmas address the continuity of βbλ and will be useful for the analysis of the algorithm defined in Section 3.

Lemma 5.4. Let Assumption 5.1 hold. Then, the function θ is concave and non-decreasing.

Proof. Since θ is the infimum of a set of affine functions of the variable λ, it is concave. Moreover, we have θ(λ) = L(λ, βbλ ),

where, by Lemma 5.3, βb is the unique solution of (1.2). Using the filling property [16, Chapter XII], we obtain that ∂θ(λ) is the singleton {kβbλ k1 }. Thus, θ is differentiable and its derivative at λ is given by (5.151)

θ′ (λ) = kβbλ k1 .

Moreover, this last expression shows that θ is nondecreasing.



Lemma 5.5. Let Assumption 5.1 hold. Then, the map  ⋆ R+ −→ Rp λ 7−→ βbλ

is bounded and continuous. Moreover, its ℓ1 -norm is non-increasing. Proof. We naturally divide the proof into three parts: (i) kβbλ k1 is non-increasing – The fact that λ 7−→ kβbλ k1 is non-increasing is an immediate consequence of the concavity of θ. (ii) Boundedness – Notice that using (5.146), we obtain that

kβbλ k1 ≤ max (XSt XS )−1 (XSt y − λδ) 1 . (S,δ)∈Σ

Thus, λ 7−→ βbλ is bounded on any interval of the form (0, M ], with M ∈ (0, +∞). Moreover, since its ℓ1 -norm is non-increasing, it is bounded on (0, ∞).

SPARSE RECOVERY WITH UNKNOWN VARIANCE

29

(iii) Continuity – Assume for contradiction that λ 7−→ βbλ is not continuous at some λ◦ > 0. Using boundedness, we can construct two sequences converging towards βbλ+◦ and βbλ−◦ respectively with βbλ+◦ 6= βbλ−◦ . Since L(λ◦ , ·) is continuous, both limits are optimal solutions of the problem argmin L(λ◦ , b),

(5.152)

b∈Rp

hence contradicting the uniqueness proven in Lemma 5.3 above.  5.2. Variations of the ratio between fidelity and penalty. We have the following trivial result. Lemma 5.6. (Nontriviality of the estimator) Let Σ be the set (5.153) n o Σ = (S, δ); S ⊂ {1, . . . , p}, δ ∈ {−1, 1}|S| , |S| ≤ n, σmin (XS ) > 0 .

The inequality (5.154)

inf

(S,δ)∈Σ

t

(XS XS )−1 (XSt y − λδ)

1

> 0

holds with probability one.

Proof. This is an immediate consequence of the Gaussian distribution of z.  Lemma 5.7. Let Assumptions 2.1 and 5.1 hold. Then, the function   R+  R+ −→ λkβbλ k1 (5.155) Γ : λ − 7 →   ky − X βbλ k2 2

satisfies (5.156)

lim Γ(λ) = +∞. λ↓0

Moreover, there exist τ in and Γ(τ ) = 0.

R∗+

such that Γ is decreasing on the interval (0, τ ]

Proof. We will use repeatedly Lemmas 5.4 and 5.5. Step 1. limλ↓0 Γ(λ) = +∞. We divide this proof into two parts. Step 1.a. We first show that |Tbλ | = n for λ sufficiently small. Let (λk )k∈N be any positive sequence converging to 0. Let β ∗ be any cluster point of the sequence (βbλk )k∈N (recall that this sequence is bounded thanks to Lemma 5.5). Fix ε > 0 and b ∈ Rp . For all k ∈ N, we have L(λk , βbλ ) ≤ L(λk , b). (5.157) k

Since L(λk , ·) is continuous, we can also write for k sufficiently large: L(λk , β ∗ ) ≤ L(λk , βbλ ) + ε. k

Hence,

L(λk , β ∗ ) ≤ L(λk , b) + ε.

30

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

Letting λk → 0, we obtain 1 1 ky − Xβ ∗ k22 ≤ ky − Xbk22 + ε, 2 2 and thus, 1 1 (5.158) ky − Xβ ∗ k22 ≤ infp ky − Xbk22 . b∈R 2 2 Since range(X) = Rn , (5.158) implies 1 ky − Xβ ∗ k22 = 0, 2 and then (5.159) lim ky − X βbλ k2 = 0. 2

λ↓0

Notice further that {b ∈ Rp , |supp(b)| < n} is a finite union of subspaces of Rp , each with dimension n − 1. Thus, 1 m := inf (5.160) ky − Xbk22 > 0, p {b∈R ; |supp(b)|
from which we deduce that |Tbλ | = n.

Step 1.b. Let λ0 > 0 be sufficiently small so that for all λ ≤ λ0 , |Tbλ | = n. Such a λ0 exists due to Step 1.a. Using Lemma 5.3, XTbλ is nonsingular, which implies that (5.162)

PV ⊥ y = 0. Tλ

Thus, using (5.146), we obtain (5.163) (5.164)

  y − X βbλ = PV ⊥ y − λ(XTtb XTbλ )−1 sgn βbTbλ Tλ λ   = −λ(XTtb XTbλ )−1 sgn βbTbλ , λ

which implies that (5.165)

  ky − X βbλ k22 = λ2 k(XTtb XTbλ )−1 sgn βbTbλ k22 . λ

Moreover, Lemma 5.6 combined with (5.146 gives

(5.166) kβbλ k1 > inf (S,δ)∈Σ (XSt XS )−1 (XSt y − λδ) 1 > 0. Hence, for λ ≤ λ0 , (5.167)

Γ(λ) ≥

λm′ . λ2 k(X tb XTbλ )−1 γTλ k22 Tλ

Using the trivial fact that (5.168)

sup k(XSt XS )−1 δk22 < ∞,

(S,δ)∈Σ

the proof is complete.

SPARSE RECOVERY WITH UNKNOWN VARIANCE

31

Step 2. Partitioning (0, +∞) into good intervals. The continuity result of Lemma 5.5 implies that the interval (0, +∞) can be partitioned into subintervals of the type Ik = (λk , λk+1 ], with (i) λ0 = 0 and λk ∈ R⋆+ ∪ {+∞} for k > 0, (ii) the support and sign pattern of βbλ are constant on each ˚ Ik . Notice further that due to Step 1.a, Tbλ 6= ∅ on at least I0 . Let K be the nonempty set n o (5.169) K = k ∈ N, ∀λ ∈ ˚ Ik , βbλ 6= 0 . On any interval Ik , k ∈ K, Lemma 5.3 states that the expression (5.146) for  t βbTbλ holds. Multiplying (5.146) on the left by sgn βbTbλ , we obtain (5.170)

(5.171) Thus

 t kβbλ k1 = sgn βbTbλ (XTtb XTbλ )−1 XTt y λ   t  t −1 b b −λsgn βTbλ (XTb XTbλ ) sgn βTbλ . λ

 t   dkβbλ k1 (λ) = −sgn βbTbλ (XTtb XTbλ )−1 sgn βbTbλ , λ dλ

on R∗+ . Thus, the definition of Σ, we obtain that (5.172)

dkβbλ k1 (λ) ≤ − inf δ t (XSt XS )−1 δ < 0 dλ (S,δ)∈Σ

on each ˚ Ik , k ∈ K and

dkβbλ k1 (λ) = 0 dλ

(5.173)

on each ˚ Ik , k 6∈ K, i.e. on each ˚ Ik such that kβbTbλ k1 = 0 for all λ in Ik , if any such Ik exists. Since λ 7−→ kβbλ k1 is continuous on R∗ , (5.172) implies that +

(i) there exists τ in such that βbτ = 0 (as an easy consequence of the Fundamental Theorem of Calculus and a contradiction). (ii) βbλ = 0 for all λ ≥ τ . R∗+ ,

Hence ∪k∈K Ik is a connected bounded interval. Step 3. Γ is decreasing on (0, τ ). Let us study the function (5.174)

Φ :



R+ R⋆+ −→ λ 7−→ λkβbλ k1 .

We immediately deduce from Step 2 and the definition of the intervals Ik , k ∈ K, that Φ is differentiable on each ˚ Ik , k ∈ K, and using (5.146), its

32

STÉPHANE CHRÉTIEN AND SÉBASTIEN DARSES

derivative on $\mathring I_k$ reads
(5.175) $\quad \frac{d\Phi}{d\lambda}(\lambda) = \|\hat\beta_{\hat T_\lambda}\|_1 - \lambda\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})^t (X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda}) = \|\hat\beta_{\hat T_\lambda}\|_1 - \lambda\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1/2}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2.$
Now, since $\Gamma(\lambda) = \Phi(\lambda)/\|y - X\hat\beta_\lambda\|_2^2$ and $X_{\hat T_\lambda}$ is nonsingular,
(5.176) $\quad \|y - X\hat\beta_\lambda\|_2^2 = \lambda^2\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2$
(5.177) $\quad\phantom{\|y - X\hat\beta_\lambda\|_2^2} \ \ge\ \lambda^2\, n\, \sigma_{\min}\big((X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1}\big)^2 \ >\ 0$
for $\lambda > 0$. Therefore $\Gamma(\lambda) < +\infty$ on $\mathbb R_+^*$, $\Gamma$ is continuous on $I_k$ and differentiable on $\mathring I_k$. Moreover, using (5.165), we have
$$\frac{d\Gamma}{d\lambda}(\lambda) = \frac{\frac{d\Phi}{d\lambda}(\lambda)\,\|y - X\hat\beta_\lambda\|_2^2 - \Phi(\lambda)\,\frac{d\|y - X\hat\beta_\lambda\|_2^2}{d\lambda}(\lambda)}{\|y - X\hat\beta_\lambda\|_2^4} \ =\ \frac{\frac{d\Phi}{d\lambda}(\lambda) - \frac{2\,\Phi(\lambda)}{\lambda}}{\|y - X\hat\beta_\lambda\|_2^2}.$$
Hence, using (5.175) and (5.165),
$$\frac{d\Gamma}{d\lambda}(\lambda) = \frac{-\|\hat\beta_{\hat T_\lambda}\|_1 - \lambda\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1/2}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2}{\|y - X\hat\beta_\lambda\|_2^2} \ \le\ \frac{-\lambda\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1/2}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2}{\lambda^2\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2} \ \le\ -\,\frac{1}{\lambda}\,\frac{\sigma_{\min}\big((X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1/2}\big)^2}{\sigma_{\max}\big((X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1}\big)}$$
on each $\mathring I_k$. We can thus conclude, by the nonsingularity of $X_{\hat T_\lambda}$, that $\Gamma$ is decreasing on $(0,\tau)$, as announced. $\square$


6. The Newton scheme

In this section, we give the implementation details of our estimation procedure and provide the results of some preliminary simulation experiments.

6.1. Algorithms for solving $\lambda\|\hat\beta_\lambda\|_1 = C\,\|y - X\hat\beta_\lambda\|_2^2$. Recall from Section 5.2 that
$$\Gamma(\lambda) = \frac{\lambda\,\|\hat\beta_\lambda\|_1}{\|y - X\hat\beta_\lambda\|_2^2}$$
and
$$\frac{d\Gamma}{d\lambda}(\lambda) = \frac{-\|\hat\beta_{\hat T_\lambda}\|_1 - \lambda\,\big\|(X_{\hat T_\lambda}^t X_{\hat T_\lambda})^{-1/2}\,\mathrm{sgn}(\hat\beta_{\hat T_\lambda})\big\|_2^2}{\|y - X\hat\beta_\lambda\|_2^2}$$

on each $\mathring I_k$, where $\mathring I_k$ is a maximal open interval on which the support of $\hat\beta_\lambda$ is constant, for $k\in K$ and $\cup_{k\in K} I_k = [0,\tau)$. A simple Newton-Raphson procedure is summarized in Algorithm 6.1 below.

Algorithm 1 (Newton-Balanced LASSO algorithm).
Input: $\lambda^{(0)}$, $l = 1$ and $L\in\mathbb N^*$.
while $l \le L$ do
    Compute $\hat\beta_{\lambda^{(l)}}$ as a solution of the LASSO problem
    (6.178) $\quad \hat\beta_{\lambda^{(l)}} \in \operatorname{argmin}_{b\in\mathbb R^p}\ \tfrac12\|y - Xb\|_2^2 + \lambda^{(l)}\|b\|_1$
    Set $\lambda^{(l+1)} = \lambda^{(l)} - \big(\tfrac{d\Gamma}{d\lambda}(\lambda^{(l)})\big)^{-1}\big(\Gamma(\lambda^{(l)}) - C\big)$
    $l \leftarrow l+1$
end while
Set $\hat\lambda^{(L)} = \lambda^{(L)}$, $\hat\beta^{(L)} = \hat\beta_{\lambda^{(L)}}$ and let $\hat\sigma^2_{(L)}$ be given by
(6.179) $\quad \hat\sigma^2_{(L)} = \frac{\|y - X\hat\beta^{(L)}\|_2^2 + 2\,\hat\lambda^{(L)}\,\|\hat\beta^{(L)}\|_1}{n}.$
Output: $\hat\beta^{(L)}$, $\hat\sigma^2_{(L)}$ and $\hat\lambda^{(L)}$.
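For illustration purposes, we give below a minimal Python sketch of Algorithm 6.1. It is not the implementation used in our experiments: the inner LASSO problem (6.178) is solved by a plain ISTA loop instead of the NESTA solver [4], and the function names, tolerances and iteration counts (ista_lasso, newton_balanced_lasso, n_iter, n_newton) are illustrative choices only.

    # Minimal sketch of Algorithm 6.1: Newton iteration on Gamma(lambda) = C.
    # The inner LASSO solver is a basic ISTA loop, not the NESTA code [4];
    # all names and numerical tolerances are illustrative assumptions.
    import numpy as np

    def ista_lasso(X, y, lam, n_iter=2000):
        """Approximately solve argmin_b 0.5*||y - X b||_2^2 + lam*||b||_1 (problem (6.178))."""
        step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the smooth part
        b = np.zeros(X.shape[1])
        for _ in range(n_iter):
            b = b - step * (X.T @ (X @ b - y))        # gradient step on the quadratic term
            b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)   # soft-thresholding
        return b

    def newton_balanced_lasso(X, y, C, lam0, n_newton=30):
        """Newton iteration lambda <- lambda - (Gamma(lambda) - C) / Gamma'(lambda)."""
        n = X.shape[0]
        lam = lam0
        for _ in range(n_newton):
            b = ista_lasso(X, y, lam)
            T = np.flatnonzero(np.abs(b) > 1e-8)      # estimated support of beta_hat_lambda
            r = y - X @ b
            if T.size == 0 or r @ r == 0.0:           # degenerate case: stop the iteration
                break
            gamma = lam * np.abs(b).sum() / (r @ r)   # Gamma(lambda)
            XT = X[:, T]
            s = np.sign(b[T])
            # sgn^t (X_T^t X_T)^{-1} sgn, computed by least squares for robustness
            quad = s @ np.linalg.lstsq(XT.T @ XT, s, rcond=None)[0]
            dgamma = (-np.abs(b[T]).sum() - lam * quad) / (r @ r)   # dGamma/dlambda (Section 6.1)
            lam = max(lam - (gamma - C) / dgamma, 1e-12)            # Newton step, kept positive
        b = ista_lasso(X, y, lam)
        sigma2 = ((y - X @ b) @ (y - X @ b) + 2.0 * lam * np.abs(b).sum()) / n   # estimator (6.179)
        return b, sigma2, lam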

The superlinear local convergence of this procedure is given by the following standard theorem.

Theorem 6.1 (Local convergence). Assume that $\Gamma$ is differentiable at the Balanced LASSO estimator $\hat\lambda$. Assume that $|\hat\lambda - \hat\lambda^{(0)}| < \Delta_\lambda$ for some sufficiently small $\Delta_\lambda$. Then $(\hat\beta^{(L)}, \hat\sigma^2_{(L)}, \hat\lambda^{(L)})$ tends to $(\hat\beta, \hat\sigma^2, \hat\lambda)$ and there exists a constant $C_{NR} > 0$ such that
$$|\hat\lambda^{(L+1)} - \hat\lambda| \ \le\ C_{NR}\,|\hat\lambda^{(L)} - \hat\lambda|^2.$$

Proof. Since $\Gamma$ is supposed to be differentiable at $\hat\lambda$, $\hat\lambda$ belongs to $\mathring I_k$ for some $k$ in $K$. On the other hand, it is clear from (5.146) that $\lambda\mapsto\hat\beta_\lambda$ is twice continuously differentiable in $\mathring I_k$. Therefore, $\Gamma$ is twice continuously differentiable in $\mathring I_k$. Thus, Proposition 10.5.12 in [1, p. 342] applies, after


noticing that $\Delta_\lambda$ can be chosen so that $|\hat\lambda - \hat\lambda^{(L)}| < \Delta_\lambda$ (and thus $\hat\lambda^{(L)}\in\mathring I_k$) for all $L$ in $\mathbb N$. $\square$

Globally convergent schemes are also available. We refer the reader to [25] or [14] for instance.
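Since $\Gamma$ was shown in Section 5 to be continuous and decreasing on $(0,\tau)$, a simple bisection on $\Gamma(\lambda)-C$ already provides a globally convergent, if slower, alternative to the Newton step. The sketch below is our own illustration; it treats $\Gamma$ as a black-box callable (for instance built from the LASSO solver above) and assumes an initial bracket with $\Gamma(\mathrm{lam\_lo}) > C > \Gamma(\mathrm{lam\_hi})$.

    # Illustrative globally convergent fallback: bisection on Gamma(lambda) - C.
    # gamma_fn(lam) is assumed to return Gamma(lam); the bracket [lam_lo, lam_hi]
    # is assumed to satisfy Gamma(lam_lo) > C > Gamma(lam_hi).
    def bisect_balanced(gamma_fn, C, lam_lo, lam_hi, n_iter=60):
        for _ in range(n_iter):
            lam_mid = 0.5 * (lam_lo + lam_hi)
            if gamma_fn(lam_mid) > C:
                lam_lo = lam_mid      # Gamma is decreasing: the solution lies to the right
            else:
                lam_hi = lam_mid
        return 0.5 * (lam_lo + lam_hi)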

6.2. Numerical experiments. Given the parameter $\lambda > 0$, problem (6.178) is a simple quadratic programming problem which can be solved using many different methods. In our experiments, we chose Becker, Bobin and Candès' implementation [4] of Nesterov's algorithm. The experiments were performed with the following set of parameters: $p = 500$, $n = 62$ and $s = 6$. We used the plain Newton method of Algorithm 6.1. We drew the support uniformly at random and we drew the nonzero components of $\beta$ as i.i.d. Gaussian random variables with standard deviation set to a certain level $L$. The results are shown in Figures 1, 2, 3 and 4 below. Exact reconstruction of the support was observed for a sufficiently large value of $L$, as in Figure 4, and the largest components of $\beta$ were recovered when $L$ was set to a smaller value. Newton's algorithm suffered from oscillations when $C$ was set to a too large value, and one should stabilize the algorithm with trust-region steps to circumvent such undesirable features in real-life experiments. In these preliminary experiments, we noticed that the appropriate value for $C$ followed closely the value of $L$, as expected from Remark 2.5.2 above, but several experiments showed that the range of permitted values for the trade-off constant $C$ was quite large in practice. An exhaustive Monte Carlo study of the proposed estimator, together with concrete applications, will be provided in a forthcoming companion paper.
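The snippet below is a usage illustration only, not the code used for the figures: it generates a synthetic instance with the parameters quoted above ($p = 500$, $n = 62$, $s = 6$), Gaussian amplitudes of standard deviation $L$, an arbitrary noise level $\sigma = 1$, and calls the illustrative newton_balanced_lasso sketch of Section 6.1; the column normalization of $X$ is likewise our own choice.

    # Synthetic experiment in the spirit of Section 6.2; assumes the sketch
    # newton_balanced_lasso from Section 6.1 is in scope.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, s, L_amp, C = 500, 62, 6, 5.0, 1.0        # L_amp plays the role of the level L

    X = rng.standard_normal((n, p)) / np.sqrt(n)    # columns approximately unit-normed (our choice)
    support = rng.choice(p, size=s, replace=False)  # support drawn uniformly at random
    beta = np.zeros(p)
    beta[support] = L_amp * rng.standard_normal(s)  # i.i.d. Gaussian nonzero components
    y = X @ beta + rng.standard_normal(n)           # noise level sigma = 1 (arbitrary)

    beta_hat, sigma2_hat, lam_hat = newton_balanced_lasso(X, y, C=C, lam0=1.0)
    print("estimated support:", np.flatnonzero(np.abs(beta_hat) > 1e-3))
    print("true support:     ", np.sort(support))
    print("sigma^2 estimate: ", sigma2_hat)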

[Figure 1. Case C = 1 and L = 5 — Evolution of λ and Γ as a function of Newton's iteration number; reconstruction of β (original signal vs. ℓ1 reconstruction).]

[Figure 2. Case C = 2 and L = 5 — Evolution of λ and Γ as a function of Newton's iteration number; reconstruction of β (original signal vs. ℓ1 reconstruction).]

[Figure 3. Case C = 10 and L = 10 — Evolution of λ and Γ as a function of Newton's iteration number; reconstruction of β (original signal vs. ℓ1 reconstruction).]

[Figure 4. Case C = 20 and L = 20 — Evolution of λ and Γ as a function of Newton's iteration number; reconstruction of β (original signal vs. ℓ1 reconstruction).]


7. Appendix

7.1. Standard inequalities for homogeneous Rademacher chaos. We recall two useful bounds.

Theorem 7.1. [24, Theorem 3.2.2, p. 113] For all $d\in\mathbb N$, let $X$ be a Banach-valued homogeneous chaos of order $d$,
(7.180) $\quad X = \sum_{1\le i_1<\cdots<i_d} X_{i_1\cdots i_d}\,\eta_{i_1}\cdots\eta_{i_d},$
where $\{\eta_i\}$ are i.i.d. $\pm1$ Bernoulli random variables. Let $1 < p < q < \infty$. Then
(7.181) $\quad \big(\mathbb E\|X\|^q\big)^{1/q} \ \le\ \Big(\frac{q-1}{p-1}\Big)^{d/2}\,\big(\mathbb E\|X\|^p\big)^{1/p}.$

Lemma 7.2. [24, Corollary 3.3.8, p. 124] Let $B$ be a Banach space and $B^*$ be its dual. Let $Y$ be a $B$-valued random variable such that, for all $x^*\in B^*$, $x^*(Y)$ is measurable, centered and square integrable. Then, for all $x\in B$,
(7.182) $\quad \mathbb P^*\big(\|x+Y\|\ge\|x\|\big) \ \ge\ \frac14\,\inf_{x^*\in B^*}\ \frac{\big(\mathbb E|x^*(Y)|\big)^2}{\mathbb E|x^*(Y)|^2}.$

7.2. A Non-Commutative Bernstein inequality. We recall a matrix version of Bernstein's inequality recently established in [22] and improved in [30].

Theorem 7.3 (Bernstein's inequality for matrices). Let $X_1,\dots,X_p$ be independent centered symmetric random matrices taking values in $\mathbb R^{d\times d}$. Assume that $\|X_j\|\le B$ for all $j=1,\dots,p$, for some positive real number $B$. Let
(7.183) $\quad S_p = \sum_{j=1}^p X_j$
(7.184) $\quad v_p = \Big\|\sum_{j=1}^p \mathbb E\big[X_j^2\big]\Big\|.$
Then, for all $t\ge 0$,
(7.185) $\quad \mathbb P\big(\|S_p\|\ge t\big) \ \le\ d\,\exp\Big(-\frac{t^2/2}{v_p + Bt/3}\Big).$
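As a quick numerical illustration of (7.185) (this example is ours and not part of the original text), one can compare the bound with an empirical tail probability for a toy ensemble of bounded, centered random matrices:

    # Toy Monte Carlo check of the matrix Bernstein bound (7.185).
    # The ensemble below (rescaled random symmetric sign matrices) is an arbitrary choice.
    import numpy as np

    rng = np.random.default_rng(1)
    d, p, B, trials, t = 5, 200, 0.2, 2000, 4.0

    def sample_X():
        A = rng.choice([-1.0, 1.0], size=(d, d))
        A = (A + A.T) / 2.0                        # symmetric, mean zero, entries in {-1, 0, 1}
        return B * A / np.linalg.norm(A, 2)        # rescaled so that ||X_j|| = B

    # v_p = || sum_j E[X_j^2] ||, estimated by simulation (the X_j are i.i.d. here)
    EX2 = np.zeros((d, d))
    for _ in range(5000):
        M = sample_X()
        EX2 += M @ M
    EX2 /= 5000
    v_p = p * np.linalg.norm(EX2, 2)

    hits = 0
    for _ in range(trials):
        S = sum(sample_X() for _ in range(p))
        hits += np.linalg.norm(S, 2) >= t
    bound = d * np.exp(-(t ** 2 / 2.0) / (v_p + B * t / 3.0))
    print(f"empirical P(||S_p|| >= t) = {hits / trials:.4f}  <=  Bernstein bound {bound:.4f}")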

7.3. Proof of Proposition 4.3. Let us introduce the event:
(7.186) $\quad E = \big\{\|X_T^tX_T - I\|\le r_0\big\}.$
The proofs that Conditions (i) and (ii) hold with high probability are trivial modifications of the ones given in [9], up to the constants. The proofs that Conditions (iii) and (iv) hold with high probability can be performed using the following by-product inequality from our Lemma 3.6:
(7.187) $\quad \mathbb P\Big(\|RH\|_{1\to2} \ \ge\ \frac{\kappa}{\sqrt{\log p}}\Big) \ \le\ p\,\exp\Big(-\frac{(\kappa^2 - C_s)/2}{C_\mu\,(\kappa^2 + 3C_s)}\Big),$


instead of using [9, Lemma 3.5] and [9, Lemma 3.6]. All the proofs are moreover based on the very simple inequality
(7.188) $\quad \mathbb P(A) = \mathbb P(A\cap E\cap B) + \mathbb P\big(A\cap(E^c\cup B^c)\big)$
(7.189) $\quad\phantom{\mathbb P(A)} \ \le\ \mathbb E\big[\mathbb P(A\mid R)\,\mathbf 1_{E\cap B}\big] + \mathbb P(E^c) + \mathbb P(B^c),$
and the inequality, for a given vector $W$:
(7.190) $\quad \mathbb P\big(|\langle W, X\rangle| > t\big) \ \le\ 2\,e^{-t^2/(2\|W\|_2^2)},$

which holds true for sub-Gaussian random vectors with independent components having Bernoulli or standard Gaussian distribution, for instance.

7.3.1. Condition (i). Here, let $W_i$ be the $i$th row of $(X_T^tX_T)^{-1}X_T^t$. Since $\langle W_i, z\rangle \sim \mathcal N(0,\sigma^2\|W_i\|_2^2)$, we have from (7.190) and the union bound:
(7.191) $\quad \mathbb P\Big(\max_{i\in T}|\langle W_i, z\rangle| > t\Big) \ \le\ 2s\,e^{-t^2/(2\sigma^2\max_i\|W_i\|_2^2)}.$
Note that on $E$:
(7.192) $\quad \max_{i\in T}\|W_i\|_2 \ \le\ \|(X_T^tX_T)^{-1}\|\,\|X_T^t\| \ \le\ \frac{\sqrt{1+r_0}}{1-r_0}.$
One then obtains
(7.193) $\quad \mathbb P\Big(\|(X_T^tX_T)^{-1}X_T^t z\|_\infty \ \le\ \sigma\kappa\sqrt{\log p}\Big) \ \ge\ 1 - \frac{2}{p^{\alpha}},$
with
(7.194) $\quad \kappa = \frac{\sqrt{2(1+\alpha)(1+r_0)}}{1-r_0}.$
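For the reader's convenience, we spell out the short computation behind this choice of $\kappa$ (this is our own restatement, based on (7.191) and (7.192) as reconstructed above): taking $t = \sigma\kappa\sqrt{\log p}$ in (7.191) and using (7.192),
$$\mathbb P\Big(\max_{i\in T}|\langle W_i, z\rangle| > \sigma\kappa\sqrt{\log p}\Big) \ \le\ 2s\,\exp\Big(-\frac{\kappa^2(1-r_0)^2}{2(1+r_0)}\,\log p\Big) \ =\ 2s\,p^{-(1+\alpha)} \ \le\ \frac{2}{p^{\alpha}},$$
since $\kappa^2 = 2(1+\alpha)(1+r_0)/(1-r_0)^2$ by (7.194) and $s\le p$, which yields (7.193).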

7.3.2. Condition (ii). Let us now show that estimate (ii) holds with high probability. This is an actual consequence of our Lemma 3.6. First, as in [9, p. 2171] and [9, Lemma 3.3, p. 2166], we write the inequality
(7.195) $\quad \|(X_T^tX_T)^{-1}\mathrm{sgn}(\beta_T)\|_\infty \ \le\ 1 + \max_{i\in T}|\langle W_i, \mathrm{sgn}(\beta_T)\rangle|,$
where $W_i$ is the $i$th row (or column) of $(X_T^tX_T)^{-1} - I$. Set
(7.196) $\quad A = \Big\{\max_{i\in J}|\langle W_i, \mathrm{sgn}(\beta_T)\rangle| \ \ge\ 2\Big\}.$
Hoeffding's inequality yields:
(7.197) $\quad \mathbb P(A\mid R) \ \le\ 2|J|\,\exp\Big(-\frac{2^2}{2\max_{i\in J}\|W_i\|_2^2}\Big).$
As in [9, pp. 2171-2172], we write
(7.198) $\quad \|W_i\|_2 \ \le\ \frac{\|RHRe_i\|_2}{1-\|RHR\|},$
and thus, on $E$,
$$\|W_i\|_2 \ \le\ \frac{\|RHRe_i\|_2}{1-r_0} \ \le\ \frac{\|RHR\|_{1\to2}}{1-r_0} \ \le\ \frac{\|RH\|_{1\to2}}{1-r_0}.$$
Set
(7.199) $\quad B = \Big\{\|RH\|_{1\to2} \ \le\ \frac{\kappa}{\sqrt{\log p}}\Big\},$
with $\kappa = (\gamma+1)C_s$. Then, we can easily show using (7.187) that $\mathbb P(B^c) \le \frac{1}{p^{\alpha}}$ is true under the condition
(7.200) $\quad C_\mu \ \le\ \frac{\gamma}{2(4+\gamma)(\alpha+1)}\,C_s.$
Moreover,
(7.201) $\quad \mathbb E\big[\mathbb P(A\mid R)\,\mathbf 1_{E\cap B}\big] \ \le\ \frac{1}{p^{\alpha}}$
is an easy consequence of
(7.202) $\quad C_s \ \le\ \frac{2}{(1+\gamma)\sqrt{\alpha+1}}.$
Therefore, under these two last conditions, the event
$$\|(X_T^tX_T)^{-1}\mathrm{sgn}(\beta_T)\|_\infty \ \le\ 1 + 2 = 3$$
holds with probability at least $1 - \frac{3}{p^{\alpha}}$.

7.3.3. Condition (iv). If one now sets $W_i$ as the $i$th row of $X_{T^c}^t\big(I - X_T(X_T^tX_T)^{-1}X_T^t\big)$ and notes that, on $E$, for any $i\in T^c$,
(7.203) $\quad \|W_i\|_2 \ \le\ \|X_i\|_2 = 1,$
then:
$$\mathbb P\Big(\big\|X_{T^c}^t\big(I - X_T(X_T^tX_T)^{-1}X_T^t\big)z\big\|_\infty \ \le\ \sigma\kappa\sqrt{\log p}\Big) \ \ge\ 1 - \frac{2}{p^{\alpha}},$$
with
(7.204) $\quad \kappa = \sqrt{2(1+\alpha)}.$

7.3.4. Condition (iii). Here, $W_i = (X_T^tX_T)^{-1}X_T^tX_i$. Set
(7.205) $\quad B = \Big\{\|RH\|_{1\to2} \ \le\ \frac{\kappa}{\sqrt{\log p}}\Big\},$
with $\kappa = (\gamma+1)C_s$. Note that on $E\cap B$:
(7.206) $\quad \max_{i\in T^c}\|W_i\|_2 \ \le\ \frac{\kappa}{(1-r_0)\sqrt{\log p}}.$


Using (7.187) again, simple computations show that the conditions
(7.207) $\quad C_\mu \ \le\ \frac{\gamma}{2(4+\gamma)(\alpha+1)}\,C_s$
(7.208) $\quad C_s \ \le\ \frac{1-r_0}{8(1+\gamma)\sqrt{\alpha+1}}$
suffice to imply that
(7.209) $\quad \mathbb P\Big(\big\|X_{T^c}^t X_T(X_T^tX_T)^{-1}\mathrm{sgn}(\beta_T)\big\|_\infty \ \le\ \frac34\Big) \ \ge\ 1 - \frac{1}{p^{\alpha}}.$

7.4. Some properties of the χ² distribution. We recall the following useful bounds for the $\chi^2(\nu)$ distribution with $\nu$ degrees of freedom.

Lemma 7.4. The following bounds hold:
(7.210) $\quad \mathbb P\big(\chi(\nu) \ge \sqrt{\nu} + \sqrt{2t}\big) \ \le\ \exp(-t),$
(7.211) $\quad \mathbb P\big(\chi^2(\nu) \le u\nu\big) \ \le\ \frac{2}{\sqrt{\pi\nu}}\,\big(u\,e/2\big)^{\nu/4}.$

Proof. For the first statement, see e.g. [21]. For the second statement, recall that
(7.212) $\quad \mathbb P\big(\chi^2(\nu)\le u\nu\big) = \frac{1}{\Gamma(\frac\nu2)}\int_0^{\frac{u\nu}{2}} t^{\frac\nu2-1}\,e^{-t}\,dt$
(7.213) $\quad\phantom{\mathbb P\big(\chi^2(\nu)\le u\nu\big)} = \frac{1}{\Gamma(\frac\nu2)}\int_0^{\frac{u\nu}{2}} t^{\frac\nu2-1-\alpha}\,t^{\alpha}e^{-t}\,dt.$
Since $\max_{t\in\mathbb R_+} t^{\alpha}e^{-t} = (\alpha/e)^{\alpha}$, attained at $t=\alpha$, we obtain that
$$\mathbb P\big(\chi^2(\nu)\le u\nu\big) \ \le\ \frac{(\alpha/e)^{\alpha}}{\Gamma(\frac\nu2)}\int_0^{\frac{u\nu}{2}} t^{\frac\nu2-1-\alpha}\,dt \ =\ \frac{(\alpha/e)^{\alpha}}{(\frac\nu2-\alpha)\,\Gamma(\frac\nu2)}\,\Big(\frac{u\nu}{2}\Big)^{\frac\nu2-\alpha}.$$
Take for instance $\alpha = \frac\nu4$ and obtain
(7.214) $\quad \mathbb P\big(\chi^2(\nu)\le u\nu\big) \ \le\ \frac{(\nu/4e)^{\nu/4}}{\frac\nu4\,\Gamma(\frac\nu2)}\,\Big(\frac{u\nu}{2}\Big)^{\frac\nu4}.$
On the other hand, we have
(7.215) $\quad \Gamma(z) \ \ge\ \sqrt{2\pi}\,\frac{z^{z-\frac12}}{e^{z}},$
and then,
(7.216) $\quad \frac{(\nu/4e)^{\nu/4}}{\frac\nu4\,\Gamma(\frac\nu2)} \ \le\ \frac{2}{\sqrt{\pi\nu}}\,\Big(\frac e2\Big)^{\frac\nu4}\,\Big(\frac\nu2\Big)^{-\frac\nu4}.$
Hence,
$$\mathbb P\big(\chi^2(\nu)\le u\nu\big) \ \le\ \frac{2}{\sqrt{\pi\nu}}\,\Big(\frac e2\Big)^{\frac\nu4}\,\Big(\frac\nu2\Big)^{-\frac\nu4}\,\Big(\frac{u\nu}{2}\Big)^{\frac\nu4} \ =\ \frac{2}{\sqrt{\pi\nu}}\,\big(u\,e/2\big)^{\frac\nu4},$$
as desired. $\square$

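As a quick sanity check of Lemma 7.4 (our own illustration, not part of the original manuscript), both bounds can be compared with the exact χ² tail probabilities computed by SciPy:

    # Numerical illustration of the two chi-square bounds of Lemma 7.4.
    import numpy as np
    from scipy.stats import chi2

    nu, t, u = 50, 3.0, 0.2    # arbitrary illustrative values

    # Bound (7.210): P( chi(nu) >= sqrt(nu) + sqrt(2 t) ) <= exp(-t)
    threshold = (np.sqrt(nu) + np.sqrt(2.0 * t)) ** 2
    print(chi2.sf(threshold, df=nu), "<=", np.exp(-t))

    # Bound (7.211): P( chi^2(nu) <= u * nu ) <= 2 / sqrt(pi * nu) * (u * e / 2)^(nu / 4)
    print(chi2.cdf(u * nu, df=nu), "<=", 2.0 / np.sqrt(np.pi * nu) * (u * np.e / 2.0) ** (nu / 4.0))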

Acknowledgment

The first author is grateful to E. Candès for helpful discussions while staying at Caltech. The second author would like to thank the people in Statistics at LATP for their helpful remarks and references.

References

1. Allaire, G., Numerical analysis and optimization. Oxford Science Publications, Numerical Mathematics and Scientific Computation Series (2007).
2. Baraud, Y., Giraud, C. and Huet, S., Gaussian model selection with an unknown variance. Ann. Statist. 37 (2009), no. 2, 630-672.
3. Bickel, P. J., Ritov, Y. and Tsybakov, A. B., Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37 (2009), no. 4, 1705-1732.
4. Becker, S., Bobin, J. and Candès, E. J., NESTA: a fast and accurate first-order method for sparse recovery. In press, SIAM J. on Imaging Sciences.
5. Bourgain, J. and Tzafriri, L., Invertibility of "large" submatrices with applications to the geometry of Banach spaces and harmonic analysis. Israel J. Math. 57 (1987), no. 2, 137-224.
6. Bunea, F., Tsybakov, A. and Wegkamp, M., Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1 (2007), 169-194.
7. Candès, E. J., The restricted isometry property and its implications for compressed sensing. C. R. Math. Acad. Sci. Paris 346 (2008), no. 9-10, 589-592.
8. Candès, E. J., Modern statistical estimation via oracle inequalities. Acta Numer. 15 (2006), 257-325.
9. Candès, E. J. and Plan, Y., Near-ideal model selection by ℓ1 minimization. Ann. Statist. 37 (2009), no. 5A, 2145-2177.
10. Candès, E. and Romberg, J., Sparsity and incoherence in compressive sampling. Inverse Problems 23 (2007), no. 3, 969-985.
11. Candès, E. J. and Tao, T., The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35 (2007), no. 6, 2313-2351.
12. Donoho, D. L. and Huo, X., Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 (2001), 2845-2862.
13. Donoho, D. L. and Elad, M., Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. USA 100 (2003), 2197-2202.
14. Du, D.-Z., Qi, L. and Womersley, R., Recent advances in nonsmooth optimization. World Scientific (1995).
15. Elad, M. and Bruckstein, A. M., A generalized uncertainty principle and sparse representation in pairs of R^N bases. IEEE Trans. Inform. Theory 48 (2002), 2558-2567.
16. Hiriart-Urruty, J.-B. and Lemaréchal, C., Convex Analysis and Minimization Algorithms II. Advanced theory and bundle methods. Grundlehren der Mathematischen Wissenschaften 306, Springer Verlag.
17. Juditsky, A. and Nemirovski, A. S., On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization. ArXiv:0809.2650, 2008.
18. Kerkyacharian, G., Mougeot, M., Picard, D. and Tribouley, K., Learning out of leaders. Multiscale, Nonlinear and Adaptive Approximation, 295-324, Springer, Berlin, 2009.
19. Koltchinskii, V., Sparse recovery in convex hulls via entropy penalization. Ann. Statist. 37 (2009), no. 3, 1332-1359.
20. Ledoux, M. and Talagrand, M., Probability in Banach Spaces. Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete (3), 23, Springer-Verlag, Berlin, 1991.
21. Massart, P., Concentration Inequalities and Model Selection. Lectures from the 33rd Summer School on Probability Theory in Saint-Flour. Lecture Notes in Mathematics 1896, Springer Verlag (2007).
22. Oliveira, R. I., Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. ArXiv:0911.0600, 2010.
23. de la Peña, V. H. and Montgomery-Smith, S. J., Bounds on the tail probability of U-statistics and quadratic forms. Bull. Amer. Math. Soc. (N.S.) 31 (1994), no. 2, 223-227.
24. de la Peña, V. H. and Giné, E., Decoupling. From Dependence to Independence. Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and its Applications, Springer-Verlag, New York, 1999.
25. Qi, L. and Sun, D., A survey of some nonsmooth equations and smoothing Newton methods. Progress in Optimization, 121-146, Appl. Optim., Kluwer Acad. Publ., Dordrecht (1999).
26. Rudelson, M. and Vershynin, R., Geometric approach to error correcting codes and reconstruction of signals. Int. Math. Res. Not. 64 (2005), 4019-4041.
27. Schrijver, A., Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics, John Wiley & Sons, Chichester, 1986.
28. Tibshirani, R., Regression shrinkage and selection via the LASSO. J. R. Statist. Soc. Ser. B 58 (1996), no. 1, 267-288.
29. Tropp, J. A., Norms of random submatrices and sparse approximation. C. R. Math. Acad. Sci. Paris 346 (2008), no. 23-24, 1271-1274.
30. Tropp, J. A., User friendly tail bounds for sums of random matrices. http://arxiv.org/abs/1004.4389, 2010.

Laboratoire de Mathématiques, UMR 6623, Université de Franche-Comté, 16 route de Gray, 25030 Besançon, France
E-mail address: [email protected]

LATP, UMR 6632, Université de Provence, Technopôle Château-Gombert, 39 rue Joliot Curie, 13453 Marseille Cedex 13, France
E-mail address: [email protected]
