The growth rate of significant regressors for high ...


Statistics and Probability Letters 83 (2013) 1969–1972


The growth rate of significant regressors for high dimensional data

Qi Zheng, Colin Gallagher, K.B. Kulasekera ∗

Department of Mathematical Sciences, Clemson University, Clemson, SC 29634-0975, United States

Article history: Received 25 April 2013; accepted 25 April 2013; available online 6 May 2013

Abstract

We give a new consistency proof for high-dimensional quantile regression estimators. A consequence of this proof is that the number of significant regressors can grow at a rate $s \log^2(s) = o(n)$. To the best of our knowledge, this is the fastest rate achieved for high-dimensional quantile regression.

© 2013 Elsevier B.V. All rights reserved.

Keywords: High dimension; Quantile regression; Increasing rate

1. Introduction

Over the last decade, problems arising in genetics, signal processing and many other fields have created a great demand for the analysis of high-dimensional sparse regression models:

$$y_i = z_i^T \beta^* + \epsilon_i, \qquad i = 1, \dots, n, \tag{1}$$

where the $y_i$ are random variables, the $z_i$ are $p \times 1$ independent random covariate vectors, and the $\epsilon_i$ are identically distributed random errors with mean 0, independent of the $z_i$. In this context, we are interested in estimating the vector of regression coefficients $\beta^* = (\beta_1^*, \dots, \beta_p^*)^T$ when $\beta^*$ is sparse, in the sense that only $s \ll p$ of its components are non-zero, and both $s$ and $p$ may depend on the sample size.

Various methods have been developed to simultaneously identify the unknown model and estimate the corresponding coefficients. These include SCAD (Fan and Peng, 2004), the bridge estimator (Huang et al., 2008a), the adaptive lasso (Huang et al., 2008b), the L1 penalized quantile regression estimator (Belloni and Chernozhukov, 2011), and the Dantzig selector (Candes and Tao, 2007). These authors show that the respective estimators achieve, or get very close to, the $\sqrt{n/s}$ rate, which is the oracle rate obtained by knowing the underlying model in advance (Fan and Li, 2001).

For M-estimators with absolutely continuous score functions, Portnoy (1984) showed that $s \log^2(s)$ can grow as fast as $o(n)$. Welsh (1989) considered M-estimators with discontinuous score functions and showed that in this case $s$ can grow as fast as $o(n^{1/3})$. He and Shao (2000) improved the growth rate to $s \log^3(s) = o(n)$. This raises the question of whether a faster growth rate is attainable for quantile regression, so that quantile regression can be carried out in more complex applications.

In this note we give a new proof of consistency of the quantile regression estimator by exploiting common moment assumptions on the covariates. The main contribution is arguably the new proof method, although the allowable growth rate of the number of predictor variables is also improved slightly. We show that the number of significant variables can grow as fast as $s \log^2(s) = o(n)$. This is a significant improvement upon Welsh (1989), and a very slight improvement upon the result of He and Shao (2000). This rate gives more insight into high-dimensional sparse models and allows them to be used in more applications. To the best of our knowledge, this rate has not been shown elsewhere in the high-dimensional regression literature.

∗ Corresponding author: K.B. Kulasekera.

0167-7152/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.spl.2013.04.029

2. Main results

Since we consider the oracle rate, we assume that $p = s$ throughout the rest of this article. The quantile regression oracle estimator is constructed by minimizing

$$Q_\tau(\theta) = \sum_{i=1}^{n} \rho_\tau(y_i - x_i^T\theta),$$

where $\rho_\tau(t) = \tau 1(t > 0)t - (1 - \tau)1(t \le 0)t$ is the check function for some quantile index $\tau$, and $\theta = (q_\tau, \beta^T)^T \in \mathbb{R}^{s+1}$. We consider the following commonly assumed conditions:

C1 (Sampling and smoothness). For any value $z$ in the support of $z_i$, the conditional density $f_{y|z}(y|z)$ is continuously differentiable at each $y \in \mathbb{R}$, and $f_{y|z}(y|z)$ and $\frac{\partial}{\partial y} f_{y|z}(y|z)$ are bounded in absolute value by constants $f$ and $\bar f'$, uniformly in $y \in \mathbb{R}$ and $z$ in the support of $z_i$.

C2 (Eigenvalues of covariate matrix). $c_1 < \lambda_{\min}(E[x_i x_i^T]) < \lambda_{\max}(E[x_i x_i^T]) < c_2$, where $x_i = (1, z_i^T)^T$. Moreover,

$$q := \frac{3 f^{3/2}}{8 \bar f'} \inf_{\delta \in \mathbb{R}^{s+1},\, \delta \ne 0} \frac{E[|x_i^T\delta|^2]^{3/2}}{E[|x_i^T\delta|^3]} > 0.$$

C3 (Moments of covariates). Covariates satisfy the Cramér condition

$$E[|z_{ij}|^k] \le \frac{C_m M^{k-2} k!}{2}$$

for some constants $C_m$, $M$, all $k \ge 2$ and all $j = 1, \dots, s$.
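As a concrete reading of the definitions above, the following minimal sketch evaluates the check function $\rho_\tau$ and the objective $Q_\tau(\theta)$ in plain Python. The function names `check_loss` and `objective` and the toy data are hypothetical, chosen only for illustration; the intercept-only design ($x_i = (1)$, so $\theta = (q_\tau)$) is an assumption that makes the minimizer easy to inspect by hand.

```python
def check_loss(t, tau):
    """Check function rho_tau(t) = tau*1(t>0)*t - (1-tau)*1(t<=0)*t."""
    return tau * t if t > 0 else -(1.0 - tau) * t


def objective(theta, xs, ys, tau):
    """Q_tau(theta): sum of check losses of the residuals y_i - x_i^T theta."""
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(xj * tj for xj, tj in zip(x, theta))
        total += check_loss(y - fitted, tau)
    return total


# Toy intercept-only data (hypothetical): each x_i is (1,), so theta = (q_tau,).
ys = [1.0, 2.0, 3.0, 4.0, 10.0]
xs = [(1.0,)] * len(ys)

# Evaluate Q_0.5 at each observed value; among these candidates the
# smallest objective value occurs at the sample median.
candidates = [1.0, 2.0, 3.0, 4.0, 10.0]
values = {q: objective((q,), xs, ys, tau=0.5) for q in candidates}
best = min(values, key=values.get)
print(best, values[best])
```

With $\tau = 0.5$ the check loss is half the absolute residual, so the candidate with the smallest objective is the sample median; other values of `tau` shift the minimizer toward the corresponding sample quantile.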

Condition C3 allows us to apply Bernstein's inequality to control the tail probabilities arising in the quantile regression. In addition, C3 implies $\sum_{i=1}^{n} E\|x_i\|^2 = O(ns)$, which is essential for establishing the oracle consistency rate.

Theorem 2.1. Suppose that conditions C1, C2, and C3 are satisfied. For any $s \log^2(s) = o(n)$, the oracle quantile regression estimator $\hat\beta$ has the consistency rate $\sqrt{n/s}$, that is,

$$\|\hat\beta - \beta^*\| = O_p\left(\sqrt{\frac{s}{n}}\right).$$

Theorem 2.1 implies that the oracle consistency rate $\sqrt{n/s}$ can be achieved when the number of covariates grows as fast as $s \log^2(s)/n \to 0$. Since score functions for quantile regression are discontinuous, it relaxes the required rate $s^3 (\log n)^{2+\gamma} = o(n)$ in Welsh (1989) and the rate $s \log^3(s) = o(n)$ in He and Shao (2000). It is slightly slower than the growth rate $s \log(s) = o(n)$ required for absolutely continuous score functions.
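To get a feel for how the three growth conditions compare, the sketch below tabulates the quantities that each condition requires to vanish. The regime $s = \sqrt{n}$, the value $\gamma = 0.1$, and the function names are assumptions made purely for illustration; the rates themselves are the ones quoted above.

```python
import math


def ratio_this_note(s, n):
    """s log^2(s) / n, required to vanish in Theorem 2.1."""
    return s * math.log(s) ** 2 / n


def ratio_he_shao(s, n):
    """s log^3(s) / n, the rate attributed to He and Shao (2000)."""
    return s * math.log(s) ** 3 / n


def ratio_welsh(s, n, gamma=0.1):
    """s^3 (log n)^(2+gamma) / n, the rate attributed to Welsh (1989).
    gamma = 0.1 is an arbitrary illustrative value."""
    return s ** 3 * math.log(n) ** (2 + gamma) / n


# Hypothetical regime for illustration: s grows like sqrt(n).
rows = []
for n in (10**4, 10**6, 10**8, 10**10):
    s = math.isqrt(n)
    rows.append((n, s, ratio_this_note(s, n), ratio_he_shao(s, n), ratio_welsh(s, n)))

for n, s, r2, r3, rw in rows:
    print(f"n={n:>11d} s={s:>6d} log2-ratio={r2:.4g} log3-ratio={r3:.4g} Welsh-ratio={rw:.4g}")
```

In this regime the first two ratios shrink with $n$ while the Welsh-type ratio grows, which is the sense in which the earlier condition is more restrictive. The difference between the $\log^2$ and $\log^3$ conditions only shows when $s$ is within logarithmic factors of $n$.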

Remark 2.1. Though we only show the largest growth rate for quantile regression to attain $\sqrt{n/s}$ consistency, with minor modifications the technique may be applied to the proofs of Welsh (1989) by using more complicated stochastic equicontinuity arguments (see e.g. Bickel, 1975).

3. Conclusion

In this paper, we consider the allowable growth rate of the number of significant regressors in high-dimensional sparse models. It is shown that, under commonly assumed conditions, the number of true regressors may grow as fast as $s \log^2(s)/n \to 0$, which relaxes dimension restrictions for high-dimensional sparse models and hence makes them more widely applicable. Our results can be extended to generalized M-estimators.

Appendix

Proof of Theorem 2.1. We want to show that for any $\epsilon > 0$, there exists a sufficiently large constant $C$ such that

$$P\left( \inf_{\|\delta\| = C} Q_\tau\left(\theta^* + \sqrt{\frac{s}{n}}\,\delta\right) > Q_\tau(\theta^*) \right) > 1 - \epsilon, \tag{2}$$


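where $\delta \in \mathbb{R}^{s+1}$ and $\|\delta\| = C$. Since the objective function $Q_\tau(\theta)$ is convex, the inequality (2) implies that, with probability at least $1 - \epsilon$, the oracle quantile estimator lies in the shrinking ball $\{\theta^* + \sqrt{s/n}\,\delta : \delta \in \mathbb{R}^{s+1}, \|\delta\| \le C\}$. This provides the consistency result immediately. To verify (2), write $\theta_\tau^*$ for the true parameter $\theta^*$ at quantile index $\tau$ and consider

$$Q_\tau\left(\theta_\tau^* + \sqrt{\frac{s}{n}}\,\delta\right) - Q_\tau(\theta_\tau^*) = \sum_{i=1}^{n} \left[ \rho_\tau\left(y_i - x_i^T\left(\theta_\tau^* + \sqrt{\frac{s}{n}}\,\delta\right)\right) - \rho_\tau(y_i - x_i^T\theta_\tau^*) \right]. \tag{3}$$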

According to Knight (1998), for any $x \ne 0$, we have

$$|x - y| - |x| = -y[1(x > 0) - 1(x < 0)] + 2\int_0^y [1(x < t) - 1(x < 0)]\,dt.$$

Then, since $\rho_\tau(u) = |u|/2 + (\tau - 1/2)u$,

$$\rho_\tau(x - y) - \rho_\tau(x) = y[1(x < 0) - \tau] + \int_0^y [1(x < t) - 1(x < 0)]\,dt.$$
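The check-function identity can be sanity-checked numerically. The sketch below tests it in the form $\rho_\tau(x - y) - \rho_\tau(x) = y[1(x < 0) - \tau] + \int_0^y [1(x < t) - 1(x < 0)]\,dt$, which is what follows from Knight's identity together with $\rho_\tau(u) = |u|/2 + (\tau - 1/2)u$. The midpoint-rule integrator, the sampling ranges, and all function names are assumptions made for illustration only.

```python
import random


def check_loss(u, tau):
    """Check function rho_tau(u)."""
    return tau * u if u > 0 else -(1.0 - tau) * u


def knight_integral(x, y, steps=2000):
    """Midpoint-rule approximation of the signed integral
    int_0^y [1(x < t) - 1(x < 0)] dt."""
    h = y / steps  # negative when y < 0, so the result is signed
    base = 1.0 if x < 0 else 0.0
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += (1.0 if x < t else 0.0) - base
    return total * h


def identity_gap(x, y, tau):
    """Absolute difference between the two sides of the identity."""
    lhs = check_loss(x - y, tau) - check_loss(x, tau)
    rhs = y * ((1.0 if x < 0 else 0.0) - tau) + knight_integral(x, y)
    return abs(lhs - rhs)


rng = random.Random(0)
worst_gap = max(
    identity_gap(rng.uniform(-2, 2), rng.uniform(-2, 2), rng.uniform(0.05, 0.95))
    for _ in range(100)
)
print(worst_gap)
```

The integrand is a step function with at most one jump on the integration range, so the midpoint rule is off by at most about one step width; the reported gap should therefore be of the order of `2 / steps` or smaller.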

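Hence, (3) can be written as

$$\sqrt{\frac{s}{n}} \sum_{i=1}^{n} x_i^T\delta\,[1(y_i < x_i^T\theta_\tau^*) - \tau] + \sum_{i=1}^{n} \int_0^{\sqrt{s}\,x_i^T\delta/\sqrt{n}} [1(y_i < x_i^T\theta_\tau^* + t) - 1(y_i < x_i^T\theta_\tau^*)]\,dt := \sqrt{\frac{s}{n}}\,T_1 + T_2.$$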

Using independence, the Cauchy–Schwarz inequality and condition C3, $E[T_1^2] \le n\tau(1 - \tau)E\|x_i\|^2\|\delta\|^2 \le ns\tau(1 - \tau)C_m C^2$. Then applying Chebyshev's inequality, we see that for any constant $k$

$$P\left(\sqrt{\frac{s}{n}}\,|T_1| > ksC^2\right) \le \frac{\tau(1 - \tau)C_m}{k^2 C^2}. \tag{4}$$
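For readers who want the intermediate step behind (4), the bound follows by applying Chebyshev's inequality to $\sqrt{s/n}\,T_1$ and inserting the second-moment bound $E[T_1^2] \le ns\tau(1 - \tau)C_m C^2$ stated above. The display is a worked restatement and uses no symbols beyond those already defined.

```latex
\begin{align*}
P\left(\sqrt{\tfrac{s}{n}}\,|T_1| > ksC^2\right)
  &\le \frac{(s/n)\,E[T_1^2]}{k^2 s^2 C^4} \\
  &\le \frac{(s/n)\,ns\,\tau(1-\tau)\,C_m C^2}{k^2 s^2 C^4} \\
  &= \frac{\tau(1-\tau)\,C_m}{k^2 C^2}.
\end{align*}
```

The right-hand side does not depend on $n$ or $s$ and can be made arbitrarily small by taking $C$ large.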

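Next, we deal with $T_2$. The goal is to show that, in probability,

$$T_2 \ge \frac{1}{4} f(q_\tau^*) c_1 C_m C^2.$$

Since $V(X) \le E[X^2]$, we obtain that

$$V[T_2] \le nE\left[\left(\int_0^{\sqrt{s/n}\,x_i^T\delta} [1(y_i - x_i^T\theta_\tau^* < t) - 1(y_i - x_i^T\theta_\tau^* < 0)]\,dt\right)^2\right].$$

Then given an $\eta > 0$ we have

$$nE\left[\left(\int_0^{\sqrt{s/n}\,x_i^T\delta} [1(y_i - x_i^T\theta_\tau^* < t) - 1(y_i - x_i^T\theta_\tau^* < 0)]\,dt\right)^2 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| > \eta\right)\right]$$

$$\le 4sE\left[(x_i^T\delta)^2\, 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| > \eta\right)\right] \le 4sE[|x_i^T\delta|^3]^{2/3}\, P\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| > \eta\right)^{1/3},$$

where the last line follows from Hölder's inequality. Under condition C2,

$$E[|x_i^T\delta|^3] \le \frac{3 f^{3/2} E[|x_i^T\delta|^2]^{3/2}}{8 \bar f' q}. \tag{5}$$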

Applying Bernstein's inequality (Lemma 2.2.11 of Van Der Vaart and Wellner (1996)),

$$P\left(|x_i^T\delta| > \frac{\eta\sqrt{n}}{\sqrt{s}}\right) \le 2\exp\left(\frac{-\eta^2 n}{2s(C^2 C_m + MC\eta\sqrt{n}/\sqrt{s})}\right). \tag{6}$$
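As an illustration of the kind of tail bound used in (6), the sketch below compares a Monte Carlo tail estimate for a weighted sum of independent covariates with the Bernstein bound $2\exp(-x^2/(2(v + Mx)))$. Everything specific here is an assumption for illustration: the covariates are taken to be Uniform$(-1, 1)$, the intercept term is omitted so that the sum has mean zero, $\delta$ has equal entries with $\|\delta\| = C$, and the constants `v` and `M` are those valid for bounded summands (for $|Y| \le b$, $E|Y|^k \le b^{k-2}E[Y^2] \le k!\,b^{k-2}E[Y^2]/2$), not the constants $C_m$ and $M$ of condition C3.

```python
import math
import random


def bernstein_bound(x, v, M):
    """Right-hand side 2*exp(-x^2 / (2*(v + M*x))) of Bernstein's inequality."""
    return 2.0 * math.exp(-x * x / (2.0 * (v + M * x)))


def empirical_tail(x, delta, reps, rng):
    """Monte Carlo estimate of P(|sum_j delta_j z_j| > x), z_j ~ Uniform(-1, 1)."""
    hits = 0
    for _ in range(reps):
        total = sum(d * rng.uniform(-1.0, 1.0) for d in delta)
        if abs(total) > x:
            hits += 1
    return hits / reps


rng = random.Random(0)
s, C = 50, 1.0
delta = [C / math.sqrt(s)] * s        # equal weights with ||delta|| = C
v = sum(d * d for d in delta) / 3.0   # sum of variances; Var(Uniform(-1, 1)) = 1/3
M = max(abs(d) for d in delta)        # each summand is bounded by |delta_j|

x = 1.0
tail = empirical_tail(x, delta, reps=5000, rng=rng)
bound = bernstein_bound(x, v, M)
print(tail, bound)
```

The bound is not tight at moderate thresholds. Its value in the proof comes from the exponential decay in the threshold $\eta\sqrt{n}/\sqrt{s}$, which is what lets the factor $s^2$ in the subsequent bound be absorbed.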

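Combining bounds (5) and (6) yields:

$$4sE[|x_i^T\delta|^3]^{2/3}\, P\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| > \eta\right)^{1/3} \le \frac{3^{2/3}\, 2^{1/3} f}{(\bar f' q)^{2/3}}\, C_m C^2 s^2 \exp\left(\frac{-\eta\sqrt{n}}{6MC\sqrt{s}}\right),$$

which converges to 0 if $\eta$ satisfies (A1) $\eta\sqrt{n}/\sqrt{s} \to \infty$ and (A2) $\log(s) = o\left(\frac{\eta\sqrt{n}}{12MC\sqrt{s}}\right)$. On the other hand,

$$nE\left[\left(\int_0^{\sqrt{s/n}\,x_i^T\delta} [1(y_i - x_i^T\theta_\tau^* < t) - 1(y_i - x_i^T\theta_\tau^* < 0)]\,dt\right)^2 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| \le \eta\right)\right]$$

$$\le 2n\eta E\left[\int_0^{\sqrt{s/n}\,|x_i^T\delta|} [1(y_i - x_i^T\theta_\tau^* < t) - 1(y_i - x_i^T\theta_\tau^* < 0)]\,dt\; 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| \le \eta\right)\right]$$

$$= 2n\eta E\left[\int_0^{\sqrt{s/n}\,|x_i^T\delta|} [F(q_\tau^* + t) - F(q_\tau^*)]\,dt\; 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| \le \eta\right)\right].$$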

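If $\eta$ is close to 0, we can find a constant $C_\tau$ such that $F(q_\tau^* + t) - F(q_\tau^*) \le C_\tau f(q_\tau^*)t$ for all $|t| < \eta$. Thus, we obtain

$$2n\eta E\left[\int_0^{\sqrt{s/n}\,|x_i^T\delta|} [F(q_\tau^* + t) - F(q_\tau^*)]\,dt\; 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| \le \eta\right)\right]$$

$$\le 2C_\tau f(q_\tau^*)\eta n E\left[\int_0^{\sqrt{s/n}\,|x_i^T\delta|} t\,dt\; 1\left(\sqrt{\frac{s}{n}}\,|x_i^T\delta| \le \eta\right)\right] \le C_\tau f(q_\tau^*) c_2 C^2 \eta s.$$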

Thus, if $\eta$ satisfies (A1), (A2) and (A3) $\eta \to 0$, then $V[T_2] \le C_\tau f(q_\tau^*) c_2 C^2 \eta s$. By Chebyshev's inequality, we have

$$P(|T_2 - E[T_2]| > \sqrt{s}) \le C_\tau f(q_\tau^*) c_2 C^2 \eta. \tag{7}$$

Using the Cauchy–Schwarz inequality, assumption C2, and an argument similar to the one used to bound $V(T_2)$, we can show that for $n$ sufficiently large

$$E[T_2] = nE\left[\int_0^{\sqrt{s/n}\,x_i^T\delta} [1(y_i - x_i^T\theta_\tau^* < t) - 1(y_i - x_i^T\theta_\tau^* < 0)]\,dt\right] \ge \frac{1}{2} f(q_\tau^*) c_1 C^2 s. \tag{8}$$

Combining (7) with (8), we see that for sufficiently large $C$ and sufficiently small $\eta$, $T_2 > \frac{1}{4} f(q_\tau^*) c_1 C_m C^2$ with probability at least $1 - \epsilon/2$. This, coupled with (4), implies that (3) is positive with probability at least $1 - \epsilon$, so (2) is satisfied. Combining (A1)–(A3), we see that we only require $s \log^2(s) = o(n)$. This completes the proof.
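The last step, that (A1)–(A3) together amount to $s \log^2(s) = o(n)$, can be spelled out as follows. This is a sketch under the reading that $\eta = \eta_n$ is a free tuning sequence; the shorthand $a_n$ and the particular choice of $\eta_n$ are introduced here for illustration and are not taken from the original proof.

```latex
\begin{align*}
\text{(A2)}:\quad & \log(s) = o\!\left(\eta_n \sqrt{n/s}\right)
  \;\Longleftrightarrow\; a_n := \sqrt{s/n}\,\log(s) = o(\eta_n), \\
\text{(A3)}:\quad & \eta_n \to 0 .
\end{align*}
% A sequence \eta_n with a_n = o(\eta_n) and \eta_n \to 0 exists
% if and only if a_n \to 0, that is, s \log^2(s) / n \to 0.
% One admissible choice is
\[
  \eta_n = a_n^{1/2} = \left(\sqrt{s/n}\,\log(s)\right)^{1/2},
  \qquad \text{so that } a_n / \eta_n = a_n^{1/2} \to 0 .
\]
% (A1), \eta_n \sqrt{n/s} \to \infty, then follows from (A2)
% as soon as \log(s) is bounded away from zero (s \ge 2).
```

The constant $12MC$ in (A2) does not affect this equivalence, since it only rescales $\eta_n$ by a fixed factor.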

References

Belloni, A., Chernozhukov, V., 2011. l1 penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39, 82–130.
Bickel, P.J., 1975. One-step Huber estimates in the linear model. Journal of the American Statistical Association 70, 428–433.
Candes, E., Tao, T., 2007. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics 35, 2313–2351.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan, J., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32, 928–961.
He, X., Shao, Q., 2000. On parameters of increasing dimensions. Journal of Multivariate Analysis 73, 120–135.
Huang, J., Horowitz, J.L., Ma, S., 2008a. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587–613.
Huang, J., Ma, S., Zhang, C., 2008b. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica 18, 1603–1618.
Knight, K., 1998. Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics 26, 755–770.
Portnoy, S., 1984. Asymptotic behavior of M-estimators of p regression parameter when p^2/n is large. I. Consistency. The Annals of Statistics 13, 1402–1417.
Van Der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes. Springer, New York, MR1385671.
Welsh, A.H., 1989. On M-processes and M-estimation. The Annals of Statistics 17, 337–361.
