Inference in Approximately Sparse Correlated Random Effects Probit Models

Jeffrey M. Wooldridge∗

Ying Zhu†

Updated on July 3, 2017 (Forthcoming in The Journal of Business and Economic Statistics)

Abstract. We propose a simple procedure based on an existing "debiased" l1-regularized method for inference on the average partial effects (APEs) in approximately sparse probit and fractional probit models with panel data, where the number of time periods is fixed and small relative to the number of cross-sectional observations. Our method is computationally simple and does not suffer from the incidental parameters problems that come from attempting to estimate as a parameter the unobserved heterogeneity for each cross-sectional unit. Further, it is robust to arbitrary serial dependence in the underlying idiosyncratic errors. Our theoretical results illustrate that inference concerning APEs is more challenging than inference about fixed and low-dimensional parameters, as the former concerns deriving the asymptotic normality for sample averages of linear functions of a potentially large set of components in our estimator when a series approximation for the conditional mean of the unobserved heterogeneity is considered. Insights on the applicability and implications of other existing Lasso-based inference procedures for our problem are provided. We apply the debiasing method to estimate the effects of spending on test pass rates. Our results show that spending has a positive and statistically significant average partial effect; moreover, the effect is comparable to that found using standard parametric methods.

JEL Classification: C14, C23, C25, C55
Keywords: Nonlinear panel data models; Correlated random effects probit; Partial effects; High-dimensional statistics and inference; l1-Regularized Quasi-Maximum Likelihood Estimation

1 Introduction

Probably the most commonly used nonlinear panel data models are those for binary responses Yit ∈ {0, 1}, with the most common specification being

Yit = 1{Wit β0 + αi + υit > 0}, t = 1, ..., T, i = 1, ..., n, (1)

where Wit is a vector of observable covariates, αi is an unobserved random variable – the unobserved heterogeneity – and β0 is a vector of unknown parameters. The {υit : t = 1, ..., T} are the idiosyncratic errors for population unit i. More recently, many empirical researchers are interested

∗ [email protected]; Michigan State University, Department of Economics
† [email protected]; Michigan State University, Department of Economics

in estimating models for a fractional response variable Yit ∈ [0, 1], where Yit may be continuous or take on certain values with positive probability. In this case, we are interested in a conditional mean:

E(Yit | Wit, αi) = Φ(Wit β0 + αi), t = 1, ..., T, i = 1, ..., n. (2)

Under the standard assumptions for the unobserved effects probit model, the conditional mean in (2) also holds in the binary response model (1), and so we can treat both binary and fractional responses in the same framework. We are interested in the case where the number of time periods, T, is small relative to the number of cross-sectional observations, n. Therefore, our asymptotic analysis will be for fixed T with n increasing to infinity.

Our approach for modeling the unobserved heterogeneity αi in (2) is based on the concept of correlated random effects (CRE) – as exposited, for example, in Wooldridge (2010, Chapter 13). This approach views the unobserved effect, αi, as a random variable that may be correlated with the history Wi := {Wit : t = 1, ..., T}. Typically, αi is modeled as αi = Vi γ0 + ηi, where Vi is a set of functions of Wi and ηi is the heterogeneity left unexplained. In this paper, we consider the case ηi | Wi ∼ N(0, ση²), which makes the conditional mean of the probit form after averaging out ηi. Compared with the traditional random effects framework, which assumes independence between αi and Wi, the CRE approach is clearly more general. The CRE approach does not suffer from the incidental parameters problem that comes from attempting to estimate as a parameter the unobserved heterogeneity for each cross-sectional unit. Moreover, it can be applied to models with substantial heterogeneity, whereas so-called "fixed effects" approaches are limited by the number of time periods.
Most importantly, the "fixed effects" approach, with large n and small T, does not allow us to estimate functions (such as average partial effects) that involve the conditional mean of the heterogeneity. Compared to procedures that require obtaining the joint distribution of the underlying data for each cross-sectional observation over time, the CRE approach in conjunction with pooled estimation is much simpler computationally. Also, as discussed in Wooldridge (2010), the CRE approach identifies average partial effects (APEs) without any restrictions on the serial dependence in the data. The data can be strongly dependent over time provided the cross-sectional dimension is reasonably large.

The typical approach to CRE models is to specify a simple relationship between αi and time-constant functions of {Wit}. A leading case is to use a linear function of the time averages, W̄i = T⁻¹ Σ_{t=1}^T Wit, which was proposed by Mundlak (1978) in the linear model. Chamberlain (1982) suggested a linear model for the entire history, Wi. But these are just simple possibilities, and there are countless others. For example, we could compute variances and covariances of the elements in Wit. We could break the time interval into, say, G groups, and use statistics computed within the G groups. For flexibility in the functional form, we might want to include polynomials in any set of original functions. One possible approach to allow for a flexible conditional mean functional form is to use series estimation once a given set of functions of the covariates has been specified. The drawback to this approach is that one must specify which functions of the covariates are to be added as the sample size grows, and no economic theory can provide guidance. In this paper we show how to apply high-dimensional selection methods in a CRE probit model, where we allow substantial flexibility in how the heterogeneity relates to the history of the covariates.
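The flexible functions of the covariate history just described – Mundlak time averages, Chamberlain's full history, within-unit variances – can be assembled mechanically from the panel. A minimal numpy sketch; the function name and the particular choice of functions are ours, for illustration only:

```python
import numpy as np

def cre_regressors(W):
    """Build CRE functions V_i from a balanced panel W of shape (n, T, d):
    a constant, Mundlak time averages, Chamberlain's entire history,
    and within-unit variances over time (an illustrative choice)."""
    n, T, d = W.shape
    w_bar = W.mean(axis=1)            # Mundlak time averages, shape (n, d)
    history = W.reshape(n, T * d)     # Chamberlain: the full history W_i
    w_var = W.var(axis=1)             # within-unit variances over time
    return np.hstack([np.ones((n, 1)), w_bar, history, w_var])

# usage: with d = 3 covariates over T = 4 periods, V_i has 1 + d + T*d + d columns
rng = np.random.default_rng(0)
W = rng.normal(size=(100, 4, 3))
V = cre_regressors(W)
```

The dimension of V grows with the richness of the chosen functions, which is exactly the high-dimensional setting the paper allows for.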
However, it is worth emphasizing that our inference procedure and theory do not require the selector to achieve perfect model selection. We must emphasize that this paper only aims to relax the parametric restriction on the conditional mean of the unobserved heterogeneity and does not attempt to relax the probit functional form, as the former is likely the most serious restriction. Simulation results in Li and Zheng (2008) suggest that, for obtaining partial effects, the estimates are not overly sensitive to the normality


assumption. However, the findings of Arellano and Bonhomme (2009) and Rabe-Hesketh and Skrondal (2013) suggest that misspecification of the conditional mean can cause substantial bias, even in the partial effects. Relaxing the shape of the distribution is clearly worthwhile and has been done in the work of Altonji and Matzkin (2005). Nevertheless, Altonji and Matzkin assume, at a minimum, that the distribution of the heterogeneity given the history of covariates is exchangeable. Time averages and variances over time satisfy exchangeability, but many functions – such as individual-specific trends – do not. Moreover, the Altonji-Matzkin estimation approach is complicated and not especially appealing to empirical researchers. In comparison, our approach is computationally attractive and does not require an exchangeability assumption.

Our main contribution is to propose valid inference for the average partial effect (APE, or average marginal effect) with respect to any policy variable. If the policy variable of interest indicates treatment status, the corresponding APE is known as the average treatment effect in the literature. When a nonlinear probability model is used to analyze the importance of policy interventions, the APEs are more sensible measures of the magnitudes of effects than the individual parameters themselves (which convey little information beyond their signs). Unlike in a linear model, where the APEs coincide with the coefficients, inference on the APEs in a nonlinear model like probit requires establishing distributional results that involve the estimator for the unknown conditional mean function of the unobserved heterogeneity.
This problem can be reduced to deriving the asymptotic normality for sample averages of linear functions of a potentially large set of components in our estimator for primary parameters (associated with policy-related variables) as well as nuisance parameters (associated with approximating terms) when a series approximation for the conditional mean of the unobserved heterogeneity is considered. This fact renders some of the existing procedures inapplicable, particularly those relying on partialling out the effects of high-dimensional controls to obtain inference results on low-dimensional parameters (e.g., Belloni, Chernozhukov, and Hansen, 2014). Instead, we adopt a different approach based on constructing a "debiased" version of the l1-regularized pooled quasi-maximum likelihood estimator.

Our procedure is motivated by Javanmard and Montanari (2014), who focus on the linear regression model Yi = Xi θ* + ui and inference on a single parameter with cross-sectional data. Given an initial Lasso estimate θ̂ of θ*, Javanmard and Montanari (2014) add a correction term to θ̂ to remove the bias introduced by the regularization. A key step in their procedure lies in searching for an approximation M of the inverse of the population Hessian E[Xiᵀ Xi] and establishing a finite-sample bound on |M(n⁻¹ Σ_{i=1}^n Xiᵀ Xi) − I|_∞, where |·|_∞ is the elementwise l∞-norm and I is the identity matrix. This approximation step becomes more involved in the analysis of a probit-like model E(Yi | Xi) = Φ(Xi θ*), as the inverse of the population Hessian depends on the p-dimensional coefficient vector θ* (where p is allowed to exceed n), and consequently its approximation, M, will inevitably rely on the initial Lasso estimate θ̂ of θ*. Letting Ĥ(θ̂) denote the sample Hessian of interest evaluated at θ̂, we use a discretization argument and a metric entropy result to derive a finite-sample bound on |M Ĥ(θ̂) − I|_∞. In contrast to Javanmard and Montanari (2014), where the asymptotic normality is derived by conditioning on the covariates X, our results incorporate the additional randomness in both X and θ̂, as our M also depends on θ̂ (clearly, in the case of linear models as in Javanmard and Montanari, M would only depend on X). It turns out doing so requires imposing a growth condition on the number of non-zero components in θ̂, J(θ̂).

It is possible to trade off the condition on J(θ̂) with a sparsity assumption on the inverse of the population Hessian by applying a different debiasing method proposed by van de Geer, Bühlmann, Ritov, and Dezeure (2014) as well as Zhang and Zhang (2014). We choose not to adopt their

method as the sparsity assumption on the inverse of the population Hessian does not seem natural for the applications underlying this paper. However, like the analysis in our paper, some of the results in van de Geer, et al. (2014) do not require conditioning on the Xs and can be extended to certain nonlinear probability models such as the fractional probit. This motivates us to compare the conditions in our results with those in van de Geer, et al. To do so, we abstract our analysis from the panel data setting and make the comparisons specific to problems concerning cross-sectional data and inference about a single parameter.

We apply one of the debiasing methods to estimate the effects of spending on math pass rates for fourth graders in Michigan between 1995 and 2001. Our proposed model includes the full set of Chamberlain's regressors and the interactions between all variables and year dummies. This specification is much more general and flexible than the fractional probit model proposed in Papke and Wooldridge (2008), which only includes the classical Mundlak time averages to control for the correlated random effects. Our results show that spending has a positive and statistically significant average partial effect on pass rates. The finding on spending is consistent with the story in Papke and Wooldridge (2008), which uses the pooled quasi-maximum likelihood procedure. In terms of magnitudes, the estimates for the effect of spending based on the debiasing method are also comparable to those in Papke and Wooldridge (2008).

In terms of organization, we first introduce notation that is used throughout the paper. We then present in Section 2 formal statements of the correlated random effects probit and fractional probit models. The inference procedure is introduced in Section 3.
We present the theoretical guarantees of the inference procedure in Section 4, where we also compare our results with those in Javanmard and Montanari (2014) and van de Geer, et al. (2014). In Section 5 we apply the procedure from Section 3 to estimate the effects of spending on math pass rates. Section 6 provides directions for future research and concludes the paper. The proofs are collected in Sections A and B.

Notation. The lq-norm of a p-dimensional vector ∆ is denoted by |∆|_q, 1 ≤ q ≤ ∞, where |∆|_q := (Σ_{j=1}^p |∆j|^q)^{1/q} when 1 ≤ q < ∞ and |∆|_q := max_{j=1,...,p} |∆j| when q = ∞. For a p × p matrix H ∈ R^{p×p}, write |H|_∞ := max_{i,j} |Hij| for the elementwise l∞-norm of H. For a square matrix H, denote its minimum eigenvalue by λ_min(H). For a vector ∆ ∈ R^p, let J(∆) = {j ∈ {1, ..., p} | ∆j ≠ 0} be its support, i.e., the set of indices corresponding to its non-zero components ∆j. The cardinality of a set J ⊆ {1, ..., p} is denoted by |J|, and if J = ∅, |J| = 0. For a set of indices A ⊆ {1, ..., p} with cardinality |A|, denote by ∆_A the sub-vector (sub-rows or sub-columns) of ∆ formed by the indices in A. Also, let H_{j·} denote the jth row of H and H_{·j} the jth column of H. Moreover, define |∆|_0 = Σ_{j=1}^p 1{∆j ≠ 0}. For functions f(n) and g(n), write f(n) ≳ g(n) to mean that f(n) ≥ c g(n) for a universal constant c ∈ (0, ∞); similarly, write f(n) ≲ g(n) to mean that f(n) ≤ c′ g(n) for a universal constant c′ ∈ (0, ∞); and write f(n) ≍ g(n) when f(n) ≳ g(n) and f(n) ≲ g(n) hold simultaneously. Also denote max{a, b} by a ∨ b and min{a, b} by a ∧ b. The standard normal c.d.f. and p.d.f. are denoted by Φ(·) and φ(·), respectively. As a general rule for this paper, all the c constants denote universal positive constants that are independent of n; their specific values may change from place to place.

2 Correlated random effects probit

It is helpful to begin with a description of the CRE probit model. For simplicity, we assume a balanced panel data set in this paper. In general, elements of the vector Wit in (1) can be both time constant and time varying, but for now it is notationally convenient to consider only time-varying covariates. In the following, we provide a list of assumptions that are used for the CRE probit model.

Assumption 2.1: (i) The draws (Yi, Wi), i = 1, ..., n, are independently and identically distributed. (ii) αi = Vi γ0 + ηi, where Vi is a set of functions of Wi (including a constant term) and ηi | Wi ∼ N(0, ση²). (iii) υit | Wit, αi ∼ N(0, 1) for all t = 1, ..., T. (iv) E(Yit | Wi, αi) = E(Yit | Wit, αi).

Missing from Assumption 2.1 is a clear statement of identification conditions. Those will become clear in Section 4 after pooled estimation of a probit model in high-dimensional settings has been treated. Given (1) and Assumption 2.1, we can write

Yit = 1{Xit θ0 + uit > 0}, (3)

where Xit = [Wit, Vi] and uit = ηi + υit, and the p-dimensional coefficient vector is θ0 = [β0ᵀ, γ0ᵀ]ᵀ. Under the normality and strict exogeneity assumptions,

P(Yit = 1 | Wi) = P(ηi + υit > −Wit β0 − Vi γ0 | Wi) = Φ(Wit β* + Vi γ*) = Φ(Xit θ*), (4)

where θ* = θ0 / √(1 + ση²) is a set of scaled parameters. Note that θ* is the vector we can hope to estimate; we cannot identify θ0 and ση² separately. However, as discussed in Wooldridge (2010, Chapter 15) in the standard setting with a small number of regressors, θ* is the quantity that indexes the APEs.

For applications with fractional response variables Yit ∈ [0, 1], we only use the conditional mean assumption

E(Yit | Wi) = Φ(Xit θ*), t = 1, ..., T. (5)

Therefore, our analysis applies to the so-called fractional probit model with unobserved heterogeneity proposed by Papke and Wooldridge (2008). That paper uses a standard CRE approach to allow the heterogeneity to be correlated with time-varying covariates and results in an equation like (5). Thus, the current paper extends the fractional probit model to high-dimensional settings, allowing p, the dimension of θ*, to exceed the sample size n. In the applications concerning panel data with a fixed number of time periods considered in this paper, the high dimensionality of θ* arises from γ*. Allowing the dimension of γ* to grow with, and possibly exceed, n provides us the flexibility to include a large number of approximating terms in Vi. On the other hand, we point out that the method and theory provided in this paper can be extended to cases where a large number of time-varying functions of the Wit are included.
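The attenuation θ* = θ0 / √(1 + ση²) in (4) can be checked by simulation: holding Xit fixed, the response frequency generated by (3) matches Φ(Xit θ*). A small sketch under parameter values of our own choosing:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta0 = np.array([1.0, -0.5])              # unscaled coefficients (illustrative)
sigma_eta = 0.8                             # sd of eta_i
theta_star = theta0 / np.sqrt(1 + sigma_eta**2)   # the identified scaled parameters

# hold X_it fixed and draw Y_it = 1{X_it theta0 + eta_i + upsilon_it > 0}
x = np.array([0.3, 1.2])
R = 200_000
u = sigma_eta * rng.normal(size=R) + rng.normal(size=R)   # eta_i + upsilon_it
y = (x @ theta0 + u > 0).astype(float)

freq = y.mean()                  # simulated response probability
target = norm.cdf(x @ theta_star)   # Phi(x theta*), the probit response probability
```

The two quantities agree up to Monte Carlo error, illustrating that only the scaled vector θ* is recoverable from the conditional response probabilities.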

3 Inference procedure

3.1 Pooled quasi-maximum likelihood estimation

A convenient and computationally simple method for estimating the parameters and APEs in CRE probit models – whether applied to binary or fractional outcomes – is pooled quasi-maximum likelihood estimation (QMLE). (The "quasi" refers to the case where Yit is not binary.) In fact, pooled QMLE can be used in a variety of panel data settings, even if we have no interest in an underlying CRE. In particular, we can start with the conditional expectation that has a probit form as in (5). With fixed T, we can obtain a consistent and √n-asymptotically normal estimator of θ* in the classical (finite p) settings, along with various partial effects, by solving the problem

max_{θ∈R^p} (1/n) Σ_{i=1}^n Σ_{t=1}^T {(1 − Yit) log[1 − Φ(Xit θ)] + Yit log[Φ(Xit θ)]}, (6)

where Σ_{t=1}^T {(1 − Yit) log[1 − Φ(Xit θ)] + Yit log[Φ(Xit θ)]} is the pooled (or partial) quasi-log likelihood function for cross-sectional unit i. Wooldridge (2010, Section 13.8) discusses estimation and inference using pooled MLEs and QMLEs generally. Papke and Wooldridge (2008) discuss why solving this problem produces a consistent estimator under (5) only (and mild regularity conditions). Because the estimator is obtained by treating the data as if it were one long cross section, computation is typically straightforward. Of course, in general the time series dependence in the data must be accounted for when estimating asymptotic variances, which are then used in test statistics and confidence intervals. A general "sandwich" estimator is available, and this estimator accounts for both serial dependence of unknown form and the fact that the probit log likelihood is only a quasi-log likelihood in the fractional response case. See Wooldridge (2010, Section 13.8) for a general discussion.

A joint MLE, which requires obtaining the joint distribution of (Yi1, ..., YiT) conditional on (Xi1, ..., XiT), can be computationally much more difficult, and we may not even have enough assumptions to obtain the joint distribution. In any case, the joint MLE will generally require more assumptions to consistently estimate the parameters (and the benefit from the additional assumptions and computational burden is more asymptotic efficiency). A joint MLE for fractional responses seems especially challenging in cases of practical interest. Therefore, we focus on the pooled (Q)MLE. As we argue in the next subsection, pooled QMLE can be modified to allow cases with very high dimensional parameters provided we add a penalty.
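The pooled objective (6) treats the panel as one long cross section, so it can be maximized with any smooth optimizer. A minimal sketch (not the authors' code; data-generating choices are ours):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def pooled_qmle(Y, X):
    """Pooled (quasi-)MLE for E(Y_it | X_it) = Phi(X_it theta): maximize the
    Bernoulli quasi-log likelihood (6) over the stacked (i, t) observations.
    Works for binary or fractional Y in [0, 1]."""
    def neg_qll(theta):
        p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)
        return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))
    res = minimize(neg_qll, np.zeros(X.shape[1]), method="BFGS")
    return res.x

# usage on simulated probit data: the estimate should be near the truth
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
theta_true = np.array([0.2, 0.7])
Y = (X @ theta_true + rng.normal(size=5000) > 0).astype(float)
theta_hat = pooled_qmle(Y, X)
```

With panel data, the rows of X and Y would simply be stacked over i and t; the serial dependence matters only for the variance estimator, not for the point estimate.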

3.2 L1-regularized pooled QMLE and inference

Let θ̂ be a first-step estimator of the p-dimensional coefficient vector θ* = θ0 / √(1 + ση²). In this paper, we choose the first-step estimator to be the l1-regularized conditional quasi-maximum likelihood estimator

θ̂ ∈ arg min_{θ∈R^p} { (−1/n) Σ_{i=1}^n Σ_{t=1}^T {(1 − Yit) log[1 − Φ(Xit θ)] + Yit log[Φ(Xit θ)]} + λn |θ|_1 }, (7)

where λn ≥ 0 is a regularization parameter. Let us consider a finite set A1 ⊆ {1, ..., p} of indices associated with θ*, where the cardinality of A1 is denoted by |A1|. For example, the set A1 may be a singleton corresponding to some treatment "effect". Let A = J(θ̂) ∪ A1 and denote the sample "Hessian" associated with A by

Ĥ^A = (1/n) Σ_{i=1}^n Σ_{t=1}^T H_it^A(θ̂) = (1/n) Σ_{i=1}^n Σ_{t=1}^T f(Xit θ̂) X_it^{TA} X_it^A, (8)

where f(u) = u φ(u)/Φ(u) + (φ(u)/Φ(u))². Let the score be

s_i^A(θ̂) = Σ_{t=1}^T s_it^A(θ̂) = Σ_{t=1}^T φ(Xit θ̂) X_it^{TA} (Yit − Φ(Xit θ̂)) / [Φ(Xit θ̂)(1 − Φ(Xit θ̂))] (9)

for each i. Let M^A be a solution that satisfies the following feasibility problem:

|M^A Ĥ^A − I^A|_∞ ≤ µn, (10)

where µn is a non-negative tuning parameter and I^A is the |A| × |A| identity matrix. There are many computationally efficient algorithms for solving programs like this, for example, the interior point method (see, e.g., Bertsimas and Tsitsiklis, 1997; Boyd and Vandenberghe, 2004). Motivated by the idea of "debiasing" in Javanmard and Montanari (2014) used for inference in sparse linear regression models, we define the second-step estimator θ̃^A as follows:

θ̃^A = θ̂^A + (M^A/n) Σ_{i=1}^n Σ_{t=1}^T s_it^A(θ̂). (11)
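In the well-conditioned finite-p case discussed below, M^A = (Ĥ^A)⁻¹ satisfies (10) with µn = 0, and (11) reduces to a Newton-Raphson-type correction. A sketch of that special case (the general high-dimensional case would replace the matrix inverse with a solver for the l∞ feasibility program; all names are ours):

```python
import numpy as np
from scipy.stats import norm

def debias(theta_hat, Y, X, A):
    """Second-step estimator (11) with M^A = inv(H^A), which satisfies the
    feasibility constraint (10) with mu_n = 0 when H^A is well conditioned.
    A is the index set J(theta_hat) union A1 (here, column indices of X)."""
    XA = X[:, A]
    z = X @ theta_hat
    phi = norm.pdf(z)
    Phi = np.clip(norm.cdf(z), 1e-10, 1 - 1e-10)
    f = z * phi / Phi + (phi / Phi) ** 2                  # f(u) in (8)
    H = (XA * f[:, None]).T @ XA / len(Y)                 # sample Hessian (8)
    s = (XA * (phi * (Y - Phi) / (Phi * (1 - Phi)))[:, None]).mean(axis=0)  # avg. score (9)
    return theta_hat[A] + np.linalg.inv(H) @ s            # debiased estimate, as in (11)

# usage: one debiasing step pulls a deliberately biased start toward the truth
rng = np.random.default_rng(3)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.3, 0.8])
Y = (X @ theta_true + rng.normal(size=n) > 0).astype(float)
theta_init = theta_true + np.array([0.3, -0.3])
theta_tilde = debias(theta_init, Y, X, np.array([0, 1]))
```

In the regularized setting, θ_init would be the Lasso estimate from (7) and the correction removes the shrinkage bias on the coordinates in A.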

An estimator of the APE with respect to the (continuous) jth covariate (j ∈ {1, ..., p}) can be obtained by (1/n) Σ_{i=1}^n θ̃j φ(Xit θ̃), where θ̃ = (θ̃^A, 0^{Aᶜ}) with A = J(θ̂) ∪ A1 and θ̃^A is obtained from (11). If the jth covariate is a binary variable indicating treatment status, then the estimator becomes (1/n) Σ_{i=1}^n [Φ(Xit^{(1)} θ̃) − Φ(Xit^{(0)} θ̃)], where Φ(Xit^{(1)} θ̃) is evaluated at Xitj = 1 and Φ(Xit^{(0)} θ̃) is evaluated at Xitj = 0.

Some intuition
Let us introduce the shorthand notations s_i^A(θ*) = Σ_{t=1}^T s_it^A(θ*) and H_i^A(θ) = Σ_{t=1}^T H_it^A(θ), defined in a similar way as Σ_{t=1}^T s_it^A(θ̂) in (9) and Σ_{t=1}^T H_it^A(θ̂) in (8), respectively; in particular, we let H_i^{*A} = Σ_{t=1}^T H_it^A(θ*), Ĥ_i^A = Σ_{t=1}^T H_it^A(θ̂), and H̄_i^A = Σ_{t=1}^T H_it^A(θ̄). Suppose EH_i^{*A} and EĤ_i^A have full rank. We first consider a simple estimator based on (11) in the classical finite-p settings, where we simply choose A = {1, ..., p}. If Ĥ^A is invertible, then we can set M^A = (Ĥ^A)⁻¹, and (11) is simply a Newton-Raphson iteration as in the so-called "One-Step Theorem" (Newey and McFadden, 1994), which is used to gain asymptotic efficiency. More generally, we can search for an M^A such that (10) is satisfied for some "small" positive µn and M^A ≈ (EĤ^A)⁻¹. To see how θ̃^A allows us to obtain the asymptotic normality for a finite set A1 ⊆ {1, ..., p} of elements in θ̃^A, note that a mean-value expansion of s_it^A(θ̂) around θ* and some simple algebraic manipulations of (11) yield

θ̃^A − θ^{*A}
= (M^A/n) Σ_{i=1}^n s_i^A(θ*) + [I − (M^A/n) Σ_{i=1}^n H̄_i^A](θ̂^A − θ^{*A}) − (M^A/n) Σ_{i=1}^n H̄_i^{AAᶜ} θ^{*Aᶜ}
= [EH_i^{*A}]⁻¹ (1/n) Σ_{i=1}^n s_i^A(θ*) (12)
+ [M^A − (EĤ_i^A)⁻¹] (1/n) Σ_{i=1}^n s_i^A(θ*) + [I − (M^A/n) Σ_{i=1}^n H̄_i^A](θ̂^A − θ^{*A}) − (M^A/n) Σ_{i=1}^n H̄_i^{AAᶜ} θ^{*Aᶜ} (13)
+ [(EĤ_i^A)⁻¹ − (EH_i^{*A})⁻¹] (1/n) Σ_{i=1}^n s_i^A(θ*), (14)

where [H̄_i^{AAᶜ}]_{j·} = [Σ_{t=1}^T f(Xit θ̄^{(j)}) X_it^{TA} X_it^{Aᶜ}]_{j·} and θ̄^{(j)} (j ∈ A) is some intermediate value between θ̂ and θ* (possibly differing across the rows of the Hessian matrix). For any finite set A1 of indices in A, if we can show that, as n → ∞: (a) there exists a solution M^A to (10) such that the A1-subrows of (13) multiplied by √n are of smaller order than those of (12) multiplied by √n, and (b) the A1-subrows of (14) multiplied by √n are of smaller order than those of (12) multiplied by √n, then √n(θ̃^{A1} − θ^{*A1}) has the same asymptotic distribution

as the A1-subrows of (12) multiplied by √n. Note that even if θ̂^{A1} = 0, the elements in θ̃^{A1} are non-zero by (11). Consequently, our procedure allows us to obtain the asymptotic distribution of θ̃^{A1} and conduct inference on functions of θ^{*A1}. Establishing the asymptotic normality for the estimator of the APEs is more delicate, as it concerns deriving limiting distributions for sample averages of linear functions of a potentially large set of estimates for primary parameters (associated with policy-related variables) as well as nuisance parameters (associated with approximating terms).
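The two APE estimators described in this section are one-liners once θ̃ and the design matrix are in hand; a numpy sketch (function names are ours):

```python
import numpy as np
from scipy.stats import norm

def ape_continuous(theta_tilde, X, j):
    """APE of a continuous covariate j: average of theta_j * phi(X theta)."""
    return np.mean(theta_tilde[j] * norm.pdf(X @ theta_tilde))

def ape_binary(theta_tilde, X, j):
    """APE of a binary covariate j: average of Phi(.) at x_j = 1 minus x_j = 0."""
    X1, X0 = X.copy(), X.copy()
    X1[:, j], X0[:, j] = 1.0, 0.0
    return np.mean(norm.cdf(X1 @ theta_tilde) - norm.cdf(X0 @ theta_tilde))

# usage: with theta = (0, 1) and a standard normal covariate, the continuous
# APE estimates E[phi(X theta)] and the binary APE equals Phi(1) - Phi(0)
rng = np.random.default_rng(4)
Xd = np.column_stack([np.ones(100_000), rng.normal(size=100_000)])
theta = np.array([0.0, 1.0])
ape_c = ape_continuous(theta, Xd, 1)
ape_b = ape_binary(theta, Xd, 1)
```

The delicate part, as the text notes, is not computing these averages but deriving their limiting distributions, since θ̃ enters through a potentially large number of coordinates.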

4 Theoretical results

In this section, we establish the asymptotic normality of a finite set of components in the second-step estimator obtained by (11), as well as the estimator for the APE with respect to the jth covariate (j ∈ {1, ..., p}). For notational simplicity, in the theoretical results presented below, we assume the regime of interest is p ≥ n or p ≫ n. The modification to allow p ≪ n is trivial. Before stating the main theorems, we first present the asymptotic bounds for |θ̂ − θ*|_2 and |θ̂ − θ*|_1 and illustrate the role of these bounds in obtaining our inference results. To focus on the main point of this paper, we assume that |θ*|_1 ≤ 1 in the analysis to avoid unnecessary complications. For a random variable V, we denote its sub-Exponential norm by |V|_Ψ := sup_{r≥1} r⁻¹ (E|V|^r)^{1/r} and its sub-Gaussian norm by |V|_{Ψ1} := sup_{r≥1} r^{-1/2} (E|V|^r)^{1/r}. For the following lemma, we also define σ̃ = max_{j=1,...,p} |Σ_{t=1}^T |Xitj||_{ψ1}, Sτ := {j ∈ {1, ..., p} : |θj*| > τ}, and B̄n = √(k log p / n) + (|θ*_{Sτᶜ}|_1 √(log p / n))^{1/2}, where k = |Sτ|.

Lemma 4.1: Suppose Assumption 2.1 holds and the random matrix Xt consists of bounded components for all t = 1, ..., T. Assume λ_min(ΣXt) ≥ κL ≳ 1, where ΣXt = E[Xitᵀ Xit]. If θ* is exactly sparse with at most k non-zero coefficients such that k = O(n / log p), we let τ = 0; otherwise we let τ = λn/κL. Suppose |θ*|_1 √(log p / n) = O(1) when τ ≠ 0. If θ̂ solves program (7) with λn = c0 σ̃ √(log p / n) for some universal constant c0 > 0, then

|θ̂ − θ*|_2 = Op(B̄n), (15)
|θ̂ − θ*|_1 = Op(√k B̄n + |θ*_{Sτᶜ}|_1). (16)

By the definition of a sub-Gaussian random variable (e.g., Vershynin, 2012), boundedness of Xt guarantees that σ̃ ≲ 1. Recalling the second term in (13), bound (16) plays an important role in ensuring that the A1-subrows of this term multiplied by √n are negligible asymptotically relative to the A1-subrows of the leading term [EH^{*A}]⁻¹ (1/n) Σ_{i=1}^n s_i^A(θ*) in (12) multiplied by √n. To see this, note that

|[I − (M^A/n) Σ_{i=1}^n H̄_i^A](θ̂^A − θ^{*A})|_∞ ≤ |I − (M^A/n) Σ_{i=1}^n H̄_i^A|_∞ |θ̂^A − θ^{*A}|_1.

4.1 Asymptotic normality

We now proceed with the results on asymptotic normality. The following definitions and assumptions are needed for the next two theorems.

Definitions. Let the set

S_{r1,r2} := {θ ∈ R^p | |θ − θ*|_1 ≤ r1, |θ − θ*|_2 ≤ r2},

where r1 = c2(√k B̄n + |θ*_{Sτᶜ}|_1), the upper bound in (16), and r2 = c1 B̄n, the upper bound in (15), and c1 and c2 are some sufficiently large universal positive constants. Also, for a given set of indices A ⊆ {1, ..., p} and j, j′ ∈ A, define

T_{1,tjj′}^A(θ) = (1/n) Σ_{i=1}^n [[EH^A(θ)]⁻¹ X_it^{TA} X_it^A]²_{jj′},
T_{2,tjj′}^A(θ) = (1/n) Σ_{i=1}^n [[EH^A(θ)]⁻¹ E(f′(Xit θ) X_it^{TA} X_it^A) [EH^A(θ)]⁻¹ X_it^{TA} X_it^A]²_{jj′},

where EH^A(θ) := E[H_i^A(θ)], f(u) = u φ(u)/Φ(u) + (φ(u)/Φ(u))², f′(u) denotes the first derivative of f evaluated at u, and [H]_{jj′} denotes the jj′th entry of a matrix H.

Assumption 4.2 (Local Identification): Given the set of indices A ⊆ {1, ..., p}, for some parameter κ > 0, λ_min(EH^A(θ)) ≥ κ for all θ ∈ S_{r1,r2}.

Remark. In terms of the set of indices A ⊆ {1, ..., p}, Assumption 4.2 requires that the population "Hessian" associated with A and evaluated at any θ belonging to a local neighborhood S_{r1,r2} of θ* have full rank. By the definition of a sub-Exponential random variable (e.g., Vershynin, 2012), boundedness of Xt together with Assumption 4.2 ensures that there exist parameters σ1^A, σ2^A, σ3^A, σ4^A > 0 such that, for all θ ∈ S_{r1,r2},

σ1^A ≥ max_{j,j′∈A} |Σ_{t=1}^T [[EH^A(θ)]⁻¹ X_it^{TA} X_it^A]_{jj′}|_Ψ,
σ2^A ≥ max_{j∈A} |Σ_{t=1}^T [[EH^A(θ)]⁻¹ X_it^{TA}]_{j·}|_{Ψ1},
σ3^A ≥ max_{j′∈A, j∈{1,...,p}, t=1,...,T} |[[EH^A(θ)]⁻¹ X_it^{TA} X_it]²_{jj′}|_Ψ, (17)
σ4^A ≥ max_{j′∈A, j∈{1,...,p}, t=1,...,T} |[[EH^A(θ)]⁻¹ E(f′(Xit θ) X_it^{TA} X_it^A) [EH^A(θ)]⁻¹ X_it^{TA} X_it]²_{jj′}|_Ψ.

Assumption 4.3: Recalling the set Sτ defined for Lemma 4.1, ∆ᵀ ΣXt ∆ / |∆|²_2 ≤ κU < ∞ for all ∆ ∈ R^p \ {0} such that |∆_{Sτᶜ}|_1 ≤ 3|∆_{Sτ}|_1 + 4|θ*_{Sτᶜ}|_1, and all t = 1, ..., T; for any fixed unit vector ∆, |Xit ∆|_Ψ ≤ σXt.

Our first inference result concerns the asymptotic normality of a finite set of components in the second-step estimator θ̃^A obtained by (11). For notational convenience, we denote

E1 = √(σ3^A ∨ max_{jj′} E[sup_{θ∈S_{r1,r2}} T_{1,tjj′}^A(θ)]),
E2 = √(σ4^A ∨ max_{jj′} E[sup_{θ∈S_{r1,r2}} T_{2,tjj′}^A(θ)]),

and

C1* = Σ_{t=1}^T (max{κU, σXt, E1²} + max{κU, σXt, E2²}).

Theorem 4.1: Given the set of indices A = J(θ̂) ∪ A1 for a finite set A1 of our interest, suppose: (i) Assumptions 4.2 and 4.3 and the conditions in Lemma 4.1 hold; (ii) |θ*_{Aᶜ}|_1 ∨ |θ*_{Sτᶜ}|_1 ≲ √(log p) / n^{1/2+ς} for some constant ς > 0, and log p / n^ς = o(1); (iii) C1* (k ∨ 1)(log p)^{3/2} / √n = o(1) and σ1^A √(log p / n) ≲ B̄n; (iv) we choose µn = c σ1^A √(log p / n) in (10) for some universal constant c > 0; (v) the cardinality of A, |A| = op(√(n / ((k ∨ 1) log p))). Then, for the second-step estimator θ̃^A in (11), we have

√n (θ̃^{A1} − θ^{*A1}) →d N(0, Σ^{A1}), (18)

where θ̃^{A1} denotes the A1-subcomponent of θ̃^A and Σ^{A1} denotes the |A1| × |A1| sub-matrix of

Σ := var([EH^{*A}]⁻¹ Σ_{t=1}^T s_it^A(θ*)).

Note that the partial debiasing step (11) (a Newton-Raphson-like1 iteration) yields the same asymptotic distribution as a post-Lasso approach where inference is based upon the estimates returned by (6) with the regressors in A = J(θ̂) ∪ A1. The condition |θ*_{Aᶜ}|_1 ≲ √(log p) / n^{1/2+ς} in Theorem 4.1 ensures that the A1-subrows of the last term −(M^A/n) Σ_{i=1}^n H̄_i^{AAᶜ} θ^{*Aᶜ} in (13) multiplied by √n are negligible asymptotically relative to the A1-subrows of the leading term [EH^{*A}]⁻¹ (1/n) Σ_{i=1}^n s_i^A(θ*) in (12) multiplied by √n. The condition |θ*_{Sτᶜ}|_1 ≲ √(log p) / n^{1/2+ς} is needed in the proof that determines the scale of µn (see Lemma A1).

We now turn to our main result, which establishes the asymptotic normality of the estimator for the APE with respect to the (continuous) jth covariate. A similar result can be established for the APE estimator (1/n) Σ_{i=1}^n [Φ(Xit^{(1)} θ̃) − Φ(Xit^{(0)} θ̃)] if the jth covariate is a binary variable indicating treatment status, where Φ(Xit^{(1)} θ̃) is evaluated at Xitj = 1 and Φ(Xit^{(0)} θ̃) is evaluated at Xitj = 0.

For Theorem 4.2, we denote

ξ_i^A := ([[EH_i^{*A}]⁻¹]_{j·} Σ_{t=1}^T s_it^{*A}, ([EH_i^{*A}]⁻¹ Σ_{t=1}^T s_it^{*A})ᵀ)ᵀ,
a := (E[φ(Xit θ*)], E[θj* φ′(Xit θ*) X_it^A])ᵀ.

Theorem 4.2: For a finite set A1 of our interest, let A = J(θ̂) ∪ A1 and θ̃ = (θ̃^A, 0^{Aᶜ}) with θ̃^A obtained from debiasing θ̂^A in (11). Suppose |θ*_{Aᶜ}|_1 ∨ |θ*_{Sτᶜ}|_1 ≲ √(log p) / n^{1/2+ς} for some constant ς > 0.

1 The "like" here means we are taking a Newton-Raphson iteration based on Ĥ^A = (1/n) Σ_{i=1}^n Σ_{t=1}^T f(Xit θ̂) X_it^{TA} X_it^A instead of the |A| × |A| submatrix of (1/n) Σ_{i=1}^n Σ_{t=1}^T f(Xit θ̂) X_itᵀ X_it associated with the set A.

Also assume |θ*|²_1 √(log p / n) = o(1), σ1^A √(log p / n) ≲ B̄n, |A| = op((n / ((k ∨ 1) log p))^{1/4}), and the following conditions hold:

(|a|²_1 |A|² / var(Σ_{i=1}^n aᵀ ξ_i^A)) log(|a|²_1 |A|² / var(aᵀ ξ_i^A)) log n = Op(1), (19)

√n (var(Σ_{i=1}^n aᵀ ξ_i^A))^{-1/2} (C1* (k ∨ 1)(log p)^{3/2} / √n ∨ √(log p) / n^{1/2+ς}) = op(1), (20)

where C1* is defined in Theorem 4.1. For the estimator of the APE with respect to the (continuous) jth covariate (j ∈ {1, ..., p}) and t ∈ {1, ..., T}, under conditions (i) and (iv) of Theorem 4.1, we have

(var(Σ_{i=1}^n aᵀ ξ_i^A))^{-1/2} Σ_{i=1}^n [θ̃j φ(Xit θ̃) − θj* φ(Xit θ*)] →d N(0, 1). (21)

Condition (19) imposes restrictions on the growth rate of |J(θ̂)| (i.e., the number of non-zero coefficients in θ̂), since A = J(θ̂) ∪ A1 and A1 is a finite set. Let us consider a special case where var(aᵀ ξ_i^A) ≍ Σ_l var(a_l ξ_il^A) ≍ |A| and |a|²_1 ≍ |A|². Then the pre-multiplier (var(Σ_{i=1}^n aᵀ ξ_i^A))^{-1/2} required for asymptotic normality in Theorem 4.2 is of order no greater than 1/√(n(|J(θ̂)| ∨ 1)), and condition (20) implies that

C1* (k ∨ 1)(log p)^{3/2} / √(n(|J(θ̂)| ∨ 1)) ∨ √(log p) / (n^ς √(n(|J(θ̂)| ∨ 1))) = op(1).

Moreover, (19) implies that |J(θ̂)|³ (log |J(θ̂)|³ + log log n) log n = Op(n), which is satisfied under the condition |A| = op((n / ((k ∨ 1) log p))^{1/4}).

For Theorem 4.2, note that the restriction |A| = op((n / ((k ∨ 1) log p))^{1/4}) is stronger than the one, |A| = op(√(n / ((k ∨ 1) log p))), in Theorem 4.1. This is not surprising, as the APEs involve linear combinations of the coefficient vector θ*. If the parameter k (defined prior to Lemma 4.1) is finite and |J(θ̂)| = Op(k ∨ 1) with high probability2, then √n-asymptotic normality is achieved by our APE estimator.

4.2  Estimators of the asymptotic variances

A "sandwich" estimator for Σ^{A₁} in (18) is [Σ̂^A]_{A₁} = [ M^A B̂^A M^{AT} ]_{A₁}, where M^A is obtained from (10) and

B̂^A = (1/n) Σ_{i=1}^n Σ_{t=1}^T Σ_{t'=1}^T s_it^A(θ̂) s_it'^A(θ̂)^T.

An estimator of E[ (a^T ξ_i^A)² ] in (21) is Λ̂_t^j = (1/n) Σ_{i=1}^n ( â^T ξ̂_i^A )², where

ξ̂_i^A := ( M_{j·}^A Σ_{t=1}^T s_it^A(θ̂) , ( M^A Σ_{t=1}^T s_it^A(θ̂) )^T )^T,
â := ( (1/n) Σ_{i=1}^n φ(X_itθ̂) , ( (1/n) Σ_{i=1}^n θ̂_j φ'(X_itθ̂) X_it^A )^T )^T.

² Indeed, |J(θ̂)| = O_p(k∨1) can be ensured under a "sparse eigenvalue condition" (Bickel, et al., 2009; Belloni and Chernozhukov, 2013).


Provided that (k∨1) log p / n = o(1), the consistency of [Σ̂^A]_{A₁} (and consequently of Λ̂_t^j) requires |J(θ̂)| = O_p( ( n/((k∨1) log p) )^{1/4 − ε} ) (where ε ∈ (0, 1/4]), as each component in the matrix Σ̂^A = M^A B̂^A M^{AT} is a sum of approximately |J(θ̂)|² terms, with each term converging to its population counterpart at the rate √( (k∨1) log p / n ). Note that the restrictions on |J(θ̂)| required for consistency of the covariance estimator and for asymptotic normality of our APE estimator in Theorem 4.2 coincide.

If we assume that the score (evaluated at θ*) is serially uncorrelated, i.e., E[ s_it*^A s_it'*^{AT} ] = 0 for all t ≠ t', then E[ Σ_{t=1}^T Σ_{t'=1}^T s_it^A(θ*) s_it'^A(θ*)^T ] = −E[ Σ_{t=1}^T H_it^A(θ*) ], and consequently Σ^A = [ E Σ_{t=1}^T H_it^A(θ*) ]^{-1}, which can be estimated consistently by M^A without any further restriction on |J(θ̂)| (i.e., Assumption (v) in Theorem 4.1 would suffice); on the other hand, to obtain a consistent estimator of E[ (a^T ξ_i^A)² ], we still require |J(θ̂)| = o_p( ( n/((k∨1) log p) )^{1/4} ), as Λ̂_t^j = (1/n) Σ_{i=1}^n ( â^T ξ̂_i^A )² is a sum of approximately |J(θ̂)|² terms, with each term converging to its population counterpart at the rate √( (k∨1) log p / n ).

As a side comment, if the unknown function of interest belongs to the αth-order Sobolev class and a classical series estimation procedure is adopted, it appears that the condition α > 2 is needed. To see this, recall that the optimal rate of convergence for a series-type estimator for these classes of functions is (1/n)^{α/(2α+1)}, and the corresponding number of basis terms included to achieve such a rate is n^{1/(2α+1)}. In order for the covariance estimators to be consistent, we need (1/n)^{α/(2α+1)} · n^{2/(2α+1)} = o(1), which holds when α > 2.
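In matrix form, the variance estimators above reduce to a few array operations. The sketch below is a minimal numpy rendering with placeholder inputs (simulated per-unit-per-period scores standing in for s_it(θ̂), and an identity placeholder for M^A), not the quantities from the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, k = 200, 5, 4                    # units, periods, |A| selected regressors

s = rng.standard_normal((n, T, k))     # placeholder scores s_it(theta_hat)
M = np.eye(k)                          # placeholder for M^A from (10)

# B_hat^A = (1/n) sum_i (sum_t s_it)(sum_t s_it)^T: summing over t within
# each unit i before taking the outer product reproduces the double sum
# over t and t', which is what makes the estimator robust to arbitrary
# serial dependence in the scores.
s_i = s.sum(axis=1)                    # (n, k)
B_hat = s_i.T @ s_i / n

Sigma_hat = M @ B_hat @ M.T            # sandwich: M^A B_hat^A M^A'
se_0 = np.sqrt(Sigma_hat[0, 0] / n)    # s.e. of the first coefficient in A
```

Clustering at the unit level in this way (rather than treating the nT score vectors as independent) is the array-level counterpart of the double sum over t and t' in B̂^A.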

4.3  Comparison with the linear case in Javanmard and Montanari (2014)

In what follows, we abstract our analysis from the panel data setting and make comparisons with Javanmard and Montanari (2014) by looking at problems concerning cross-sectional data and inference about a single parameter. For the case of a sparse linear regression Y_i = X_iθ* + u_i with cross-sectional data, Javanmard and Montanari let A = {1,...,p} in (10) and debias all components of θ̂ in (11), which leads to the following expansion:

θ̃ − θ* = (M/n) Σ_{i=1}^n X_i^T (Y_i − X_iθ*) + [ I − (M/n) Σ_{i=1}^n X_i^T X_i ] (θ̂ − θ*).

Consequently, as long as the asymptotic normality is derived by conditioning on X (which is the case for Javanmard and Montanari, 2014), Assumptions (ii) and (v) in Theorem 4.1 can be dropped. In the current setting, it makes little sense to condition on the sequence of covariates, as we are explicitly allowing correlation between the heterogeneity and the covariates. Therefore, we incorporate the additional randomness in both X and θ̂ (note that our M would also depend on θ̂ for probit-like models, whereas for the linear models considered in Javanmard and Montanari, M depends only on X). In our context, we have to impose a restriction on the number of non-zero coefficients in the Lasso, |J(θ̂)| = o_p( √( n/((k∨1) log p) ) ), Assumption (v) in Theorem 4.1. Such a restriction is essentially necessary for our analysis, as each row in (13)-(14) involves multiplications of vectors whose dimensions depend on |J(θ̂)|. It is possible to trade off the condition |J(θ̂)| = o_p( √( n/((k∨1) log p) ) ) against another assumption by applying a different method for approximating the inverse of the Hessian, as we discuss in the subsequent section.
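The linear expansion above is easy to reproduce numerically. The sketch below is a simplified illustration, not Javanmard and Montanari's procedure: with p < n we can take M to be the exact inverse of the sample Gram matrix (in which case the debiased estimator coincides with OLS), whereas their method obtains each row of M from a convex program so that it also works when p > n.

```python
import numpy as np

rng = np.random.default_rng(0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).mean(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

n, p = 500, 20
theta_star = np.zeros(p)
theta_star[:3] = [1.0, -0.5, 0.25]                  # sparse truth
X = rng.standard_normal((n, p))
y = X @ theta_star + rng.standard_normal(n)

theta_hat = lasso_cd(X, y, lam=0.5 * np.sqrt(np.log(p) / n))

# Debias: theta_tilde = theta_hat + (M/n) X'(y - X theta_hat).
M = np.linalg.inv(X.T @ X / n)                      # exact inverse, p < n only
theta_tilde = theta_hat + M @ X.T @ (y - X @ theta_hat) / n
```

The correction term (M/n) X'(y − Xθ̂) undoes the shrinkage bias of the Lasso on every coordinate, which is what makes coordinate-wise normal inference possible.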

4.4  Comparison with the debiasing procedure via nodewise Lasso

van de Geer, Bühlmann, Ritov, and Dezeure (2014) as well as Zhang and Zhang (2014) propose an alternative procedure for approximating the inverse of the Hessian by applying the Lasso p times, once for each regression problem of X_j on X_{−j} (the design submatrix without the jth column). We adopted the method of Javanmard and Montanari (2014) because it appears more natural for panel data applications concerning probit functions. Later in the course of this research, we realized that van de Geer, et al. (2014) also contains inference results regarding a single parameter in generalized linear models with cross-sectional data; moreover, like the analysis in our paper, some of their results do not require conditioning on the Xs (while the conditioning argument is used in Javanmard and Montanari, 2014). Therefore, comparing the conditions in our results (Theorem 4.1) with those in van de Geer, et al. (2014) is both interesting and worthwhile. To do so, we again abstract our analysis from the panel data setting and make the comparisons specific to problems concerning cross-sectional data and inference about a single parameter.

First, our results do not require the inverse of the population Hessian matrix to be sparse, which is imposed in van de Geer, et al. (2014). On the other hand, we do have to restrict the growth rate of |J(θ̂)|. To be precise, van de Geer, et al. assume that the off-diagonal entries in the inverse of the Hessian are s-sparse, where s = o( √n / log p ); note that this assumption can be related to an s-sparse node regression of X_j on X_{−j} for every j = 1,...,p. Under such a condition, they show that |M_{j·} − H_{j·}*^{-1}|₁ = o_p(1), where M_{j·} is the jth row of their inverse Hessian approximation (see Corollary 3.1 in van de Geer, et al.). To arrive at a similar result without imposing the sparsity condition on the off-diagonal entries of (H*)^{-1}, we instead assume |J(θ̂)| = o_p( √( n/((k∨1) log p) ) ) (i.e., Assumption (v) in Theorem 4.1); note that this assumption can be related to a "sparse eigenvalue condition" in the literature (e.g., Belloni and Chernozhukov, 2013).

We are not confident that the sparsity structure for the inverse of the Hessian is a reasonable assumption for the economic applications underlying this paper. However, there are theoretical benefits to imposing such a condition and using the nodewise Lasso to obtain the inverse Hessian approximation: it is useful for obtaining a limit theory with certain uniform properties (e.g., local uniformity at least); in addition, besides Assumption (v), Assumption (ii) in Theorem 4.1 can also be dropped. We must emphasize that even with a sparsity condition on the inverse of the Hessian, applying the debiasing procedure via nodewise Lasso would not get rid of Assumption (ii) in Theorem 4.2 for the results on the APE estimators (except for the case where the dimension of θ* increases with n at a sufficiently small rate). This observation highlights the theoretical difficulty in the inference of APEs (as opposed to a single or low-dimensional parameter), which essentially arises from the fact that establishing the asymptotic normality of the APE estimators requires deriving limiting distributions for sample averages of linear functions of a potentially large set of estimates of primary parameters as well as nuisance parameters when a series approximation for the conditional mean of the unobserved heterogeneity is considered.

Nevertheless, the additional s-sparsity condition (with s = o( √n / log p )) on the inverse of the Hessian and the debiasing procedure via nodewise Lasso can help weaken the restriction on |J(θ̂)| in Theorem 4.2. In particular, we can replace the condition |J(θ̂)| = o_p( ( n/((k∨1) log p) )^{1/4} ) with |J(θ̂)| = o_p( √( n/((k∨1) log p) ) ).

5  Application to test pass rates

We apply the debiasing method to estimate the effects of spending on math test pass rates for fourth graders in Michigan after funding for schools was changed from a local, property-tax based system to a statewide system supported primarily through a higher sales tax (and lottery profits) in 1994. We use the district-level data from Papke and Wooldridge (2008), which includes 501 school districts for the years 1995 through 2001. A detailed description of this data set is provided in Papke and Wooldridge (2008). The response variable, math4, is the fraction of fourth graders passing the Michigan Education Assessment Program (MEAP) fourth-grade math test in the district. The explanatory variables in our model include the same spending measure used in Papke and Wooldridge (in logarithmic form, log(avgrexp)), district enrollment (log(enroll)), the fraction of students eligible for the free and reduced-price lunch programs (lunch, a proxy for poverty levels), year dummies from 1996 to 2001 (year), and the interactions between the year dummies and the three time-varying variables. These explanatory variables correspond to the time-varying regressors, W, in (4). The policy variable is log(avgrexp); the other variables are included as controls. To control for the random effects that are correlated with log(avgrexp), log(enroll), and lunch, we include their time averages and Chamberlain's regressors from 1996 to 2001. These explanatory variables correspond to the time-constant controls, V, in (4). To summarize, our specification for Xitθ* in (5) takes the following form:

Xitθ* = α + β1 log(avgrexpit) + β2 log(enrollit) + β3 lunchit + δt
      + π1t log(avgrexpit) + π2t log(enrollit) + π3t lunchit
      + γ1 log(avgrexpi) + γ2 log(enrolli) + γ3 lunchi
      + γ4 log(avgrexpi,96) + ... + γ9 log(avgrexpi,01)
      + γ10 log(enrolli,96) + ... + γ15 log(enrolli,01)
      + γ16 lunchi,96 + ... + γ21 lunchi,01,    (22)

which amounts to a total of p = 49 regressors including the intercept, while n = 501 is the number of cross-sectional observations used to estimate (22). To implement the debiasing method via (10)-(11), we consider

A = J(β̂2, β̂3, δ̂, π̂1, π̂2, π̂3, γ̂) ∪ A1,    (23)

where A1 is the set of indices corresponding to {α̂, β̂1}. For comparison purposes, we also consider

B = J(δ̂, π̂1, π̂2, π̂3, γ̂) ∪ B1,    (24)

where B1 is the set of indices corresponding to {α̂, β̂1, β̂2, β̂3}. In applications like the current one, which is essentially a policy analysis, one can make a case for forcing the policy variable or variables into the parameter vector (whether they are selected by the Lasso or not) at the debiasing step, as in (23) or (24). In (24), we also force log(enroll) and lunch into the parameter vector at the debiasing step. In any case, our interest is squarely on the average partial effect of log(avgrexp) and not on the effects of the other variables.

We consider several λn ≥ 0.55 √(log p / n) in (7). For λn in this range and the choice of A in (23) or B in (24), it suffices to set μn = 0 in (10) and consequently M^S = [Ĥ^S]^{-1} (S = A or S = B) in (11), where Ĥ^S is the sample Hessian defined in (8). In addition to the debiasing method, we consider another high-dimensional procedure, the so-called Post Lasso, which follows the same first-step estimation (7) and then solves (6) with only the regressors in A or B. To conduct inference, we simply invoke the asymptotic normality theory from the classical finite-p settings by applying a Taylor series expansion to the first-order condition of the loss function in (6).

Table 1 reports the first-step estimates obtained from (7) under three choices of λnj = cj √(log p / n), where c1 = 0.55, c2 = 0.6, and c3 = 0.65. Variables that are not reported in Table 1 simply have coefficients estimated to be 0 by the Lasso. Tables 2-3 report the second-step parameter estimate (β̃1) and the APE estimate of log(avgrexp), along with the standard errors (se), given by the debiasing method and the Post Lasso method, under λnj, j = 1, 2, 3, respectively. In particular, Table 2 is based on A in (23) and Table 3 is based on B in (24). To compare the high-dimensional methods with the classical inference procedure, we apply (6) to estimate the following two specifications, where Xitθ* in (5) takes the forms

Xitθ* = α + β1 log(avgrexpit) + β2 log(enrollit) + β3 lunchit + δt,    (25)

Xitθ* = α + β1 log(avgrexpit) + β2 log(enrollit) + β3 lunchit + δt
      + γ1 log(avgrexpi) + γ2 log(enrolli) + γ3 lunchi.    (26)

The results via the classical methods are reported in Table 4, where columns "Classical" are for (25) and columns "Classical Mundlak" are for (26). The APE estimates and their standard errors are computed for each t = 95, ..., 01; the reported APE estimates and standard errors in Tables 2-4 are simply the time averages.

The results regarding spending in Tables 2-3 agree with the finding in Papke and Wooldridge (2008), which applies the pooled QMLE to estimate (26) ("Classical Mundlak" in Table 4); namely, spending has a positive and statistically significant average partial effect on math pass rates. A ten percent increase in average spending increases the pass rate by about three percentage points, on average. In terms of magnitudes, overall, the estimates of the effect of spending based on the high-dimensional methods are more comparable to those based on "Classical Mundlak" than to those based on "Classical". Under λn2 and λn3, the debiasing method and the "Classical Mundlak" method yield very similar estimates. As the penalty level increases, the first-step parameter estimate associated with spending in Table 1, the corresponding debiased estimate, and the APE estimate in Tables 2-3 decrease; in contrast, the Post Lasso estimates increase as the penalty level increases. Under λn3, the debiasing method and the Post Lasso method yield almost identical estimates.
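The two-step procedure used in this application (an l1-penalized pooled probit, a Newton-type debiasing step on the selected set, and then the APE) can be sketched in a few lines. This is a minimal illustration on simulated binary-response data, not the authors' code: the penalty level, the proximal-gradient solver, and the information-matrix Hessian weight are all simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, T, p = 400, 4, 30
theta_star = np.zeros(p)
theta_star[0], theta_star[1] = 0.8, -0.5            # sparse truth; target j = 0

X = rng.standard_normal((n, T, p))
Y = ((X @ theta_star + rng.standard_normal((n, T))) > 0).astype(float)
Xf, Yf = X.reshape(n * T, p), Y.reshape(n * T)      # pool the (i, t) observations

def nll_grad(theta):
    """Gradient of the pooled probit negative quasi-log-likelihood, divided by n."""
    eta = Xf @ theta
    Phi = np.clip(norm.cdf(eta), 1e-10, 1 - 1e-10)
    score = norm.pdf(eta) * (Yf - Phi) / (Phi * (1 - Phi))
    return -(Xf * score[:, None]).sum(axis=0) / n

# Step 1: l1-penalized pooled probit via proximal gradient (ISTA).
lam = 0.6 * np.sqrt(np.log(p) / n)
step = 1.0 / np.linalg.eigvalsh(Xf.T @ Xf / n).max()
theta = np.zeros(p)
for _ in range(300):
    z = theta - step * nll_grad(theta)
    theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

# Step 2: one Newton-type debiasing step on A = (selected set) U (target index).
A = np.union1d(np.flatnonzero(theta), [0])
eta = Xf @ theta
w = norm.pdf(eta) ** 2 / np.clip(norm.cdf(eta) * (1 - norm.cdf(eta)), 1e-10, None)
H_A = (Xf[:, A] * w[:, None]).T @ Xf[:, A] / n      # sample Hessian block on A
theta_tilde = theta.copy()
theta_tilde[A] = theta[A] - np.linalg.solve(H_A, nll_grad(theta)[A])

# APE of the (continuous) covariate j = 0, averaged over all (i, t).
ape = (norm.pdf(X @ theta_tilde) * theta_tilde[0]).mean()
```

Forcing the target index into A before the Newton step mirrors forcing the policy variable into the debiasing step in (23)-(24), whether or not it is selected by the Lasso.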

               λn1        λn2        λn3
log(avgrexp)    0.152      0.105      0.056
year97         −0.013      0          0
year00          0.001      0          0
lunch99        −0.238     −0.211     −0.185
lunch01        −0.355     −0.341     −0.328
year96·lunch   −0.081     −0.029      0
year97·lunch   −0.236     −0.213     −0.159

Table 1: First-step parameter estimates from the Lasso


       Debiasing Method                           Post Lasso
       β̃1      se(β̃1)   APE      se(APE)        β̃1      se(β̃1)   APE      se(APE)
λn1    1.046    0.313    0.384    0.068          0.634    0.083    0.215    0.028
λn2    0.855    0.138    0.296    0.042          0.683    0.084    0.232    0.028
λn3    0.752    0.101    0.256    0.037          0.778    0.086    0.265    0.029

Table 2: Second-step estimates and standard errors for spending, with A

       Debiasing Method                           Post Lasso
       β̃1      se(β̃1)   APE      se(APE)        β̃1      se(β̃1)   APE      se(APE)
λn1    1.079    0.317    0.396    0.069          0.653    0.087    0.221    0.029
λn2    0.880    0.137    0.304    0.042          0.703    0.089    0.238    0.030
λn3    0.773    0.104    0.263    0.039          0.797    0.091    0.271    0.031

Table 3: Second-step estimates and standard errors for spending, with B

       Classical Mundlak                          Classical
       β̃1      se(β̃1)   APE      se(APE)        β̃1      se(β̃1)   APE      se(APE)
       0.881    0.229    0.297    0.077          0.333    0.094    0.112    0.032

Table 4: Estimates and standard errors for spending

6  Conclusion and future directions

We have proposed a simple method for inference of the average partial effects (APEs) in probit and fractional probit models with panel data where the number of time periods is fixed and small relative to the number of cross-sectional observations. In particular, our procedure allows us to model the conditional mean of the unobserved effect with a much larger number of approximating terms relative to the classical approaches for a probit model. It also extends the pooled quasi-maximum likelihood estimator of the fractional probit model in Papke and Wooldridge (2008) to high-dimensional settings. We apply the debiasing method to estimate the effects of spending on test pass rates. Our results show that spending has a positive and statistically significant average partial effect, with magnitudes similar to those found by Papke and Wooldridge (2008).

It is natural to think the approach here can be extended to other nonlinear models. Conceptually there is no issue, as pooled quasi-maximum likelihood is applicable to many kinds of response variables, including count variables and nonnegative continuous variables. It may be possible to verify the regularity conditions when Yit is unbounded, so that pooled Poisson estimation of CRE models can be used. One can also argue that allowing for extensions of the probit response function, such as a heteroskedastic probit model, would be worth pursuing.

Our approach is easily extended to unbalanced panels, provided we assume that selection is appropriately ignorable. In the context of fully parametric CRE models, Wooldridge (2016) shows how to modify pooled QMLEs to allow selection to be correlated with observed covariates and unobserved heterogeneity, but not idiosyncratic shocks. Therefore, in the context of equation (4), selection can be correlated with {Wit : t = 1, ..., T} and ηi, but not with {υit : t = 1, ..., T}.
This is very standard in analyzing unbalanced panels, and is implicitly assumed when fixed effects estimation is used on unbalanced panels. Again, conceptually the extension is straightforward, as one should allow for a heteroskedastic-probit function rather than just a standard probit response function. The reason to allow a heteroskedasticity function is that the variance of the heterogeneity would change at least as a function of the number of time periods, and possibly the other covariates as well. Now, we would take Vi to be functions of {(Sit, SitWit) : t = 1, ..., T}, where Sit is a binary selection indicator that determines whether we observe a complete case for unit i in time period t. Functions include the time averages of covariates of the complete cases and also functions of the selection indicators themselves, such as the number of time periods observed for unit i, say Ti. Once the Vi have been chosen, the objective function in equation (7) is simply multiplied by the selection indicator, Sit, and the probit response function is replaced with a probit with heteroskedasticity. Estimation is somewhat more challenging but quite feasible. Without the penalty it would be pooled heteroskedastic probit estimation where the variance function depends on a relatively small number of elements of Vi (see Wooldridge 2016). The challenge is deriving the asymptotic properties of the estimator with additional nonlinearity in the estimation, under high dimensionality and sparsity.
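The modification just described (multiplying the pooled objective by the selection indicator Sit and letting the scale depend on Vi) can be sketched as follows; the exponential variance function exp(Viδ) and all names are illustrative assumptions, not the paper's specification:

```python
import numpy as np
from scipy.stats import norm

def pooled_hetprobit_nll(theta, delta, X, V, Y, S):
    """Selection-weighted pooled heteroskedastic-probit objective (no penalty).

    X: (n, T, p) covariates; V: (n, q) unit-level terms built from the
    selection history (e.g. complete-case time averages, T_i); Y: (n, T)
    responses; S: (n, T) binary selection indicators.
    """
    eta = X @ theta / np.exp(V @ delta)[:, None]   # heteroskedastic index
    Phi = np.clip(norm.cdf(eta), 1e-10, 1 - 1e-10)
    ll = Y * np.log(Phi) + (1.0 - Y) * np.log(1.0 - Phi)
    return -(S * ll).sum() / X.shape[0]            # unobserved (i, t) cells drop out

# Quick check: outcomes in unselected periods cannot affect the objective.
rng = np.random.default_rng(0)
n, T, p, q = 50, 4, 3, 2
X, V = rng.standard_normal((n, T, p)), rng.standard_normal((n, q))
theta, delta = rng.standard_normal(p) * 0.5, np.zeros(q)
Y = rng.integers(0, 2, (n, T)).astype(float)
S = np.ones((n, T)); S[:, -1] = 0.0                # last period never observed
Y2 = Y.copy(); Y2[:, -1] = 1.0 - Y2[:, -1]         # flip the unobserved outcomes
assert pooled_hetprobit_nll(theta, delta, X, V, Y, S) == \
       pooled_hetprobit_nll(theta, delta, X, V, Y2, S)
```

Because incomplete cases are zeroed out rather than dropped, the arrays keep a rectangular shape and the penalized estimation goes through unchanged.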

A  Main proofs
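Several arguments below use the truncated-normal variance identity 1 − var(u | u ≤ x) = xφ(x)/Φ(x) + (φ(x)/Φ(x))², together with the fact that this quantity lies strictly in (0, 1). A quick Monte Carlo check of the identity (illustrative only):

```python
import numpy as np
from scipy.stats import norm

def f(u):
    """1 - var(e | e <= u) for e ~ N(0,1): u*phi/Phi + (phi/Phi)^2."""
    lam = norm.pdf(u) / norm.cdf(u)
    return u * lam + lam ** 2

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)
for u in (-1.0, 0.5, 2.0):
    mc = 1.0 - z[z <= u].var()          # Monte Carlo truncated variance
    assert abs(mc - f(u)) < 5e-3        # matches the closed form
    assert 0.0 < f(u) < 1.0             # the Hessian weight is in (0, 1)
```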

Proof of Lemma 4.1. We can write (7) as

L_n(θ) = (1/n) Σ_{i=1}^n Σ_{t=1}^T −{ (1 − Y_it) log[1 − Φ(X_itθ)] + Y_it log[Φ(X_itθ)] } + λ_n |θ|₁.

Define the following quantity:

δL_n(θ*, Δ) := L_n(θ* + Δ) − L_n(θ*) − ⟨ (1/n) Σ_{i=1}^n Σ_{t=1}^T s_it(θ*), Δ ⟩.

To prove Lemma 4.1, we verify conditions (G1) and (G2) in Negahban, et al. (2010), as well as show the following two steps. Step 1: δL_n(θ*, Δ̂) satisfies the restricted strong convexity (RSC) condition defined in Negahban, et al. (2010), where Δ̂ = θ̂ − θ*. Step 2:

| (1/n) Σ_{i=1}^n Σ_{t=1}^T s_it(θ*) |_∞ ≲ √( σ̃² log p / n )

with high probability, where σ̃ = max_{j=1,...,p} | Σ_{t=1}^T |X_itj| |_{ψ₁}. Then we can apply Theorem 1 in Negahban, et al. (2010) to obtain

|θ̂ − θ*|₂ ≤ max{ δ, c' √k λ_n/κ_L + c' √( λ_n |θ*_{S_τ^c}|₁ / κ_L ) },    (27)

where δ = cB̄_n and λ_n ≥ 2 | (1/n) Σ_{i=1}^n Σ_{t=1}^T s_it(θ*) |_∞.

Condition (G2) holds trivially since we are applying the l₁-regularization. To verify condition (G1), note that L_n''(θ) is

(1/n) Σ_{i=1}^n Σ_{t=1}^T { 1(Y_it = 1)[ 1 − var(u_it | u_it ≤ X_itθ) ] + 1(Y_it = 0)[ 1 − var(u_it | u_it ≥ −X_itθ) ] } X_it^T X_it.

Since the unconditional variance is normalized to 1 and truncation always reduces variances (Greene, 2003), the scalar term in the expression above is strictly positive.

Step 1: For the case where θ* is exactly sparse with at most k non-zero coefficients such that k ≲ n/log p and τ = 0, under the assumption λ_min(Σ_{X_t}) ≥ κ_L > 0, Proposition 2 along with Lemma 1 in Negahban, et al. (2010) yields the RSC of δL_n(θ*, Δ̂). To show the approximately sparse case, where we set τ = cλ_n/κ_L for a positive universal constant c, we apply the deviation inequality in Lemma 3 from Negahban, et al. (2010) and show that

0 ≥ L_n(θ* + Δ̂) − L_n(θ*) + λ_n [ |(θ*_{S_τ} + Δ̂_{S_τ}, θ*_{S_τ^c} + Δ̂_{S_τ^c})|₁ − |θ*_{S_τ}|₁ − |θ*_{S_τ^c}|₁ ]
  ≥ (λ_n/2) [ −3|Δ̂_{S_τ}|₁ + |Δ̂_{S_τ^c}|₁ − 4|θ*_{S_τ^c}|₁ ].

Consequently,

|Δ̂|₁ ≤ 4|Δ̂_{S_τ}|₁ + 4|θ*_{S_τ^c}|₁ ≤ 4√k |Δ̂|₂ + 4|θ*_{S_τ^c}|₁,    (28)

where k = |S_τ|. For the case where τ ≠ 0, we upper bound the cardinality k in terms of the threshold τ and the l₁-ball with "radius" R = |θ*|₁. Note that we have

R = Σ_{j=1}^p |θ_j*| ≥ Σ_{j∈S_τ} |θ_j*| ≥ kτ,    (29)

and therefore k ≤ τ^{-1}R. Then, combining the bound in Proposition 2 in Negahban, et al. (2010) with

|Δ|₁ ≤ 4|Δ_{S_τ}|₁ + 4|θ*_{S_τ^c}|₁ ≤ 4√k |Δ|₂ + 4|θ*_{S_τ^c}|₁ ≤ 4√(τ^{-1}R) |Δ|₂ + 4|θ*_{S_τ^c}|₁

yields

δL_n(θ*, Δ) ≥ |Δ|₂² { c₁κ_L − b₀ √( (log p/n) Rτ^{-1} ) } − b₁ |θ*_{S_τ^c}|₁ √(log p/n) |Δ|₂

for some constants b₀, b₁ > 0. With the choice of δ* ≍ ( κ_L^{-1} |θ*_{S_τ^c}|₁ √(log p/n) )^{1/2}, if √( Rτ^{-1} log p/n ) ≲ κ_L, we have

δL_n(θ*, Δ) ≥ c'κ_L { |Δ|₂² − |Δ|₂²/2 } = c''κ_L |Δ|₂²

for any Δ such that |Δ|₂ ≥ δ* (if |Δ|₂ < δ*, bound (27) holds trivially). For the case where τ = 0, we have

δL_n(θ*, Δ) ≥ |Δ|₂² { c₁κ_L − b₀ √(k log p/n) } ≥ c''κ_L |Δ|₂²

as long as b₀ √(k log p/n) ≲ κ_L.

Step 2: In the following, we show that |(1/n) Σ_{i=1}^n Σ_{t=1}^T s_it(θ*)|_∞ ≲ √(σ̃² log p/n) with high probability. Note that

Σ_{t=1}^T s_itj(θ*) = − Σ_{t=1}^T φ(X_itθ*) X_itj ( Y_it − Φ(X_itθ*) ) / [ Φ(X_itθ*)(1 − Φ(X_itθ*)) ].

Recall that the random matrix X_t consists of bounded random variables for all t = 1,...,T and |θ*|₁ ≲ 1. Then φ(X_itθ*)/[ Φ(X_itθ*)(1 − Φ(X_itθ*)) ] ≤ c̄, and by the definition of the sub-Gaussian norm of a random variable V, |V|_{Ψ₁} := sup_{r≥1} r^{-1/2} (E|V|^r)^{1/r}, we can show that the random variable Σ_{t=1}^T s_itj(θ*) is zero-mean sub-Gaussian with parameter at most 2c̄ max_{j=1,...,p} | Σ_{t=1}^T |X_itj| |_{ψ₁}. Applying a standard sub-Gaussian tail bound yields

| (1/n) Σ_{i=1}^n Σ_{t=1}^T s_it(θ*) |_∞ ≤ c₁ σ̃ √(log p/n)

with probability at least 1 − O(1/p), where c₁ > 0 is a universal constant. Combining Step 1 and Step 2 yields

|θ̂ − θ*|₂ ≤ max{ δ, (c₁σ̃/κ_L) √(k log p/n) + c₁ ( (σ̃/κ_L) |θ*_{S_τ^c}|₁ √(log p/n) )^{1/2} } := max{ δ, c₁B̄_n }

with probability at least 1 − O(1/p), for some universal constant c₁ > 0. Setting δ in Assumption 4.1 to δ = cB̄_n, where c₁ > c > 0, yields (15), and applying (28) yields (16). □

Remark. In the following, we provide an upper bound on |X_tΔ̂|₂²/n, which will be used in the proofs later. The second bound in Lemma B2 implies

|X_tΔ̂|₂²/n ≤ (3κ_U/2) |Δ̂|₂² + α' (log p/n) |Δ̂|₁²
  ≤ (3κ_U/2) |Δ̂|₂² + c'' α' (log p/n) τ^{-1}R |Δ̂|₂² + c' α' (log p/n) |θ*_{S_τ^c}|₁²
  ≤ ((3 + ς')κ_U/2) |Δ̂|₂²    (30)
for a ς' > 0, where α' is a constant depending on κ_L and κ_U, and the last inequality follows from the conditions (log p/n) τ^{-1}R ≲ 3κ_U/2 and (log p/n) |θ*_{S_τ^c}|₁² = o(1).

Proof of Theorem 4.1. By the expansion in (12)-(14), it suffices to show that, for any finite set A₁ of parameters in A, as n → ∞: (a) there exists a solution M^A to (10) such that the A₁-subrows of (13) multiplied by √n are of smaller order than those of (12) multiplied by √n, and (b) the A₁-subrows of (14) multiplied by √n are of smaller order than those of (12) multiplied by √n; then √n( θ̃^{A₁} − θ*^{A₁} ) has the same asymptotic distribution as the A₁-subrows of (12) multiplied by √n. By Lemma A2, we have

√n [ ( I^A − (M^A/n) Σ_{i=1}^n Σ_{t=1}^T H̄_it^A )( θ̂^A − θ*^A ) ]^{A₁} = o_p(1).

Furthermore, note that

| [EĤ^A]^{-1} EH*^A − I^A |_∞
  ≤ | [EĤ^A]^{-1} (1/n) Σ_{i=1}^n Ĥ_i^A − I^A |_∞ + | [EĤ^A]^{-1} (1/n) Σ_{i=1}^n ( Ĥ_i^A − H_i*^A ) |_∞ + | [EĤ^A]^{-1} ( EH*^A − (1/n) Σ_{i=1}^n H_i*^A ) |_∞
  = O_p( √( (k∨1) log p / n ) ),

where an argument similar to the one used to show (44) is applied to bound the second and third terms above, and the bound on the first term follows from Lemma A1. Consequently, if |A| = o_p( √( n/((k∨1) log p) ) ), we can apply arguments similar to those in Lemma A3 with ξ_ij^A = [EH*^A]^{-1} Σ_{t=1}^T s_it^A(θ*) and obtain

[ ( [EĤ^A]^{-1} − [EH*^A]^{-1} ) (1/√n) Σ_{i=1}^n Σ_{t=1}^T s_it^A(θ*) ]^{A₁}
  = [ ( [EĤ^A]^{-1} EH*^A − I^A ) [EH*^A]^{-1} (1/√n) Σ_{i=1}^n Σ_{t=1}^T s_it^A(θ*) ]^{A₁} = o_p(1).

Using the fact established above and that | M^A EĤ^A − I^A |_∞ = O_p( √(log p/n) ) by (45), we obtain

[ ( M^A − [EĤ^A]^{-1} ) (1/√n) Σ_{i=1}^n Σ_{t=1}^T s_it^A(θ*) ]^{A₁}
  = [ ( M^A EĤ^A − I^A )( [EĤ^A]^{-1} − [EH*^A]^{-1} + [EH*^A]^{-1} ) (1/√n) Σ_{i=1}^n Σ_{t=1}^T s_it^A(θ*) ]^{A₁} = o_p(1),

where the last line follows since |A| = o_p( √(n/log p) ). For the third term in (13), with the fact that | (M^A/n) Σ_{i=1}^n H̄_i^{AA^c} |_∞ = O_p(1) by Lemma A4, under the condition |θ*_{A^c}|₁ ≲ √(log p)/n^{1/2+ς} in Theorem 4.1, we conclude that the term √n (M^A/n) Σ_{i=1}^n H̄_i^{AA^c} θ*^{A^c} =
o_p(1). Now, applying Lemma A3 and the Cramer-Wold theorem yields the claim in Theorem 4.1. □

Proof of Theorem 4.2. By the construction θ̃ = (θ̃^A, 0^{A^c}), with A = J(θ̂) ∪ A₁ and θ̃^A obtained from debiasing θ̂^A in (11), we have

[ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} Σ_{i=1}^n [ θ̃_j φ(X_itθ̃) − θ_j* φ(X_itθ*) ]
  = [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} Σ_{i=1}^n { [ θ̃_j φ(X_itθ̃) − θ_j* φ(X_itθ̃) ] + [ θ_j* φ(X_itθ̃) − θ_j* φ(X_itθ*) ] }    (31)
  = [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} Σ_{i=1}^n { (θ̃_j − θ_j*) E[φ(X_itθ*)] + (θ̃^A − θ*^A)^T E[ θ_j* φ'(X_itθ*) X_it^{AT} ] } + REM.

Note that (12)-(14) and (31) imply

[ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} Σ_{i=1}^n [ θ̃_j φ(X_itθ̃) − θ_j* φ(X_itθ*) ]
  = [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} { E[φ(X_itθ*)] [EH*^A]_{j·}^{-1} Σ_{i=1}^n s_i^A(θ*) }    (32)
  + [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} { E[ θ_j* φ'(X_itθ*) X_it^A ] [EH*^A]^{-1} Σ_{i=1}^n s_i^A(θ*) } + REM,    (33)

where

REM = REM₁ + REM₂
  + [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n φ(X_itθ̃) − E[φ(X_itθ*)] ] [EH*^A]_{j·}^{-1} Σ_{i=1}^n s_i^A(θ*)
  + [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^A − E[ θ_j* φ'(X_itθ*) X_it^A ] ] [EH*^A]^{-1} Σ_{i=1}^n s_i^A(θ*)
  − n [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} θ*^{A^cT} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^{A^cT} ],

and

REM₁ = [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n φ(X_itθ̃) ] ( M^A − [EĤ^A]^{-1} )_{j·} Σ_{i=1}^n s_i^A(θ*)
  + n [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n φ(X_itθ̃) ] [ ( I^A − (M^A/n) Σ_{i=1}^n H̄_i^A )( θ̂^A − θ*^A ) ]_{j·}
  + n [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n φ(X_itθ̃) ] [ (M^A/n) Σ_{i=1}^n H̄_i^{AA^c} θ*^{A^c} ]_{j·}
  + [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n φ(X_itθ̃) ] ( [EĤ^A]^{-1} − [EH*^A]^{-1} )_{j·} Σ_{i=1}^n s_i^A(θ*),

REM₂ = [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^A ] ( M^A − [EĤ^A]^{-1} ) Σ_{i=1}^n s_i^A(θ*)
  + n [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^A ] ( I^A − (M^A/n) Σ_{i=1}^n H̄_i^A )( θ̂^A − θ*^A )
  + n [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^A ] (M^A/n) Σ_{i=1}^n H̄_i^{AA^c} θ*^{A^c}
  + [ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} [ (1/n) Σ_{i=1}^n θ_j* φ'(X_itθ') X_it^A ] ( [EĤ^A]^{-1} − [EH*^A]^{-1} ) Σ_{i=1}^n s_i^A(θ*),

for some intermediate value θ' between θ* and θ̃. By Lemmas A2 and A5 and the arguments in the proof of Theorem 4.1, under the conditions in Theorem 4.2 we can show that REM = o_p(1), so REM is negligible asymptotically relative to the terms in (32) and (33). We now apply Lemma A3 with A = Ã = J(θ̂) ∪ A₁ and

ξ_i^A := ( [EH_i*^A]_{j·}^{-1} Σ_{t=1}^T s_it*^A , ( [EH_i*^A]^{-1} Σ_{t=1}^T s_it*^A )^T )^T,
a := ( E[φ(X_itθ*)] , E[ θ_j* φ'(X_itθ*) X_it^A ]^T )^T,

which yields

[ var( Σ_{i=1}^n a^Tξ_i^A ) ]^{-1/2} Σ_{i=1}^n a^Tξ_i^A →d N(0, 1). □

Lemma A1. Suppose |θ*_{S_τ^c}|₁ ≲ √(log p)/n^{1/2+ς}. For a set of indices A ⊆ {1,...,p}, define

E₁ := σ₃^A ∨ max_{j,j'} √( E sup_{θ ∈ S_{r₁,r₂}} T^A_{1,tjj'}(θ) ),
E₂ := σ₄^A ∨ max_{j,j'} √( E sup_{θ ∈ S_{r₁,r₂}} T^A_{2,tjj'}(θ) ),
C₁* := Σ_{t=1}^T [ (|θ*|₁ ∨ 1) max{ κ_U, σ_{X_t}, E₁² } + max{ κ_U, σ_{X_t}, E₂² } ],

where

T^A_{1,tjj'}(θ) = (1/n) Σ_{i=1}^n ( [EH^A(θ)]^{-1} X_it^{TA} X_it^A )²_{jj'},
T^A_{2,tjj'}(θ) = (1/n) Σ_{i=1}^n ( [EH^A(θ)]^{-1} E[ Σ_{t=1}^T f'(X_itθ) X_it^{TA} X_it^A ] [EH^A(θ)]^{-1} X_it^{TA} X_it^A )²_{jj'},

and EH^A(θ) := E[ (1/n) Σ_{i=1}^n H_i^A(θ) ]. Let μ* := | I^A − [EĤ^A]^{-1} (1/n) Σ_{i=1}^n Σ_{t=1}^T Ĥ_it^A |_∞, and let E₁(c) be the event that μ* ≤ μ_n = c σ₁^A √(log p/n) for a constant c > 0. Then, if C₁* B̄_n² ≲ σ₁^A √(log p/n), we have P[E₁(c)] ≥ 1 − c₁ exp(−c₂ log p).
Proof. Define the set Sr1 , r2 := {θ ∈ Rp | |θ − θ∗ |1 ≤ r1 , |θ − θ∗ |2 ≤ r2 } , √





¯n + |θ∗ c |1 , the upper bound in (16), and r2 = c1 B ¯n , the upper bound in (15). kB where r1 = c2 Sτ Also define the function I(Yit = 1) [1 − var (uit | uit ≤ u)] + I(Yit = 0) [1 − var (uit | uit ≥ −u)] = 1 − var (uit | uit ≤ u) = u

φ(u) + Φ(u)



φ(u) Φ(u)

2

= f (u),

where the second and third lines follow from the properties of a standard normal distribution. To show Lemma A1, we show that A Σ − I A



:=

sup θ∈Sr1 , r2

h i−1 A A EH A (θ) H (θ) − I

s

≤ cσ1A ∞

log p . n

0

We first show that ΣA is a sub-Exponential variable for j, j = 1, ..., p. In particular, for θ ∈ Sr1 , r2 , jj 0 we write  " " ## T X t=1

f (Xit θ)  E

T X

−1

f (Xit θ)XitT A XitA

t=1

XitT A XitA  jj 0

22

θA = Uijj 0.

Note that  " "  r  r1 T ##−1 T X X   f (Xit θ)XitT A XitA XitT A XitA   = sup r−1 E  f (Xit θ)  E r≥1 t=1 t=1 jj 0  r  r1 T " " T ##−1 X   X f (Xit θ)XitT A XitA XitT A XitA  ≤ sup r−1 E  E r≥1 0 t=1 t=1 jj

θA Uijj 0

ψ

≤ σ1A , where the second inequality follows from 0 < f (u) < 1 for all u ∈ R for probit together and the last inequality follows from (17). Applying Lemma B1 yields ! io h n A A A A P max Σjj 0 − Ijj 0 − E Σjj 0 − Ijj 0 ≥ u ≤ c1 exp −c2 n j, j 0

u2 u ∧ 2 σ1A σ1A

!

!

+ 2 log p .

(34)





A := ΣA − I A and using a classical symmetrization argument, we obtain Denote Tjj jj 0 0 jj 0

   " " T ##−1 T n 1 X X X   ≤ 2EX,  sup f (Xit θ)XitT A XitA XitT A XitA   i  f (XitT θ) E θ∈Sr1 , r2 n i=1 t=1 t=1 jj 0 # " n 1 X A := 2EX, sup i gjj 0 (Xi ; θ) θ∈Sr , r n 

A ETjj 0

1

i=1

2

where $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables, independent of $\{X_i\}_{i=1}^n$. For any $\delta \in (0, 1]$, $\frac{\delta^2}{2} \leq \delta$. We construct a $\frac{\delta^2}{2}$-covering set of $S_{r_1, r_2}$ in the $l_2$-norm with covering number denoted by $N_n(\frac{\delta^2}{2}, l_2; S_{r_1, r_2}) := N$. For any $\theta \in S_{r_1, r_2}$, we can find a $\theta^l$ in the covering set such that $|\theta - \theta^l|_2 \leq \frac{\delta^2}{2}$. Let $h(u) := \frac{\phi(u)}{\Phi(u)}$. Note that $\epsilon_i \in \{-1, 1\}$ for all $i = 1, \ldots, n$, $h(u) > 0$ and $f(u) \in (0, 1)$ for all $u \in \mathbb{R}$. Without much loss of generality, we consider the case where $X_t$ consists of bounded random variables supported on $[-1, 1]$. Therefore, for $\breve\theta \in [\theta, \theta^l]$, since $|X_{it}|_\infty \leq 1$, $|X_{it}\breve\theta| \leq |X_{it}|_\infty |\breve\theta|_1 \precsim |\theta^*|_1$ if $\sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \precsim (|\theta^*|_1 \vee 1)$, and $h(X_{it}^T\breve\theta) \precsim (|\theta^*|_1 \vee 1)$. We have

$\left| g^A_{jj'}(X_i; \theta) - g^A_{jj'}(X_i; \theta^l) \right|$

$= \left| \sum_{t=1}^T X_{it}(\theta - \theta^l)\, X_{it}\breve\theta\, h'(X_{it}\breve\theta)\, \Gamma^A_{1,jj'it}(\breve\theta) + \sum_{t=1}^T X_{it}(\theta - \theta^l)\, h(X_{it}\breve\theta)\, \Gamma^A_{1,jj'it}(\breve\theta) + \sum_{t=1}^T X_{it}(\theta - \theta^l)\, 2h(X_{it}\breve\theta)h'(X_{it}\breve\theta)\, \Gamma^A_{1,jj'it}(\breve\theta) + \sum_{t=1}^T X_{it}(\theta - \theta^l)\, f(X_{it}\breve\theta)\, \Gamma^A_{2,jj'it}(\breve\theta) \right|$

for some intermediate value $\breve\theta$ between $\theta$ and $\theta^l$, where

$\Gamma^A_{1,jj'it}(\breve\theta) = \left( \left[ EH^A(\breve\theta) \right]^{-1} X_{it}^{TA} X_{it}^A \right)_{jj'},$

$\Gamma^A_{2,jj'it}(\breve\theta) = \left( \left[ EH^A(\breve\theta) \right]^{-1} E\left[ \frac{1}{n}\sum_{i=1}^n \sum_{t=1}^T f'(X_{it}\breve\theta) X_{it}^{TA} X_{it}^A \right] \left[ EH^A(\breve\theta) \right]^{-1} X_{it}^{TA} X_{it}^A \right)_{jj'}.$

We define

$T^A_{1,tjj'}(\theta) = \frac{1}{n}\sum_{i=1}^n \left( \left[ EH^A(\theta) \right]^{-1} X_{it}^{TA} X_{it}^A \right)^2_{jj'},$

$T^A_{2,tjj'}(\theta) = \frac{1}{n}\sum_{i=1}^n \left( \left[ EH^A(\theta) \right]^{-1} E\left[ \sum_{t=1}^T f'(X_{it}\theta) X_{it}^{TA} X_{it}^A \right] \left[ EH^A(\theta) \right]^{-1} X_{it}^{TA} X_{it}^A \right)^2_{jj'}.$

We now upper bound $T^A_{1,tjj'}(\breve\theta)$ and $T^A_{2,tjj'}(\breve\theta)$. Under Assumption 4.2 and the boundedness of $X_t$, it can be easily verified that each of the summands in $T^A_{1,tjj'}(\breve\theta)$ and $T^A_{2,tjj'}(\breve\theta)$, respectively, is sub-Exponential. Therefore, we have

$P\left[ \left| T^A_{1,tjj'}(\breve\theta) - E\, T^A_{1,tjj'}(\breve\theta) \right| \geq u \right] \leq c_5 \exp\left( -c_6 n \left( \frac{u^2}{\sigma_{3A}^2} \wedge \frac{u}{\sigma_{3A}} \right) \right), \qquad (35)$

$P\left[ \left| T^A_{2,tjj'}(\breve\theta) - E\, T^A_{2,tjj'}(\breve\theta) \right| \geq u \right] \leq c_7 \exp\left( -c_8 n \left( \frac{u^2}{\sigma_{4A}^2} \wedge \frac{u}{\sigma_{4A}} \right) \right). \qquad (36)$

We also have

$\left| \frac{1}{n}\sum_{i=1}^n \epsilon_i g^A_{jj'}(X_i; \theta) \right| \leq \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i g^A_{jj'}(X_i; \theta^l) \right| + \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i \left( g^A_{jj'}(X_i; \theta) - g^A_{jj'}(X_i; \theta^l) \right) \right|$

$\leq c' \sum_{t=1}^T |\theta^*|_1 \left| X_t(\theta - \theta^l) \right|_n \left( \sqrt{T^A_{1,tjj'}(\breve\theta)} + \sqrt{T^A_{1,tjj'}(\breve\theta)} + \sqrt{T^A_{2,tjj'}(\breve\theta)} \right) \qquad (37)$

$\quad + \max_{l=1,\ldots,N} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i g^A_{jj'}(X_i; \theta^l) \right|. \qquad (38)$

We first upper bound the expectation of the term in (37). Let us fix some $\tau \geq \max\{\kappa_U, \sigma_{X_t}, E_1^2\}$, $\tau_1 \geq (\kappa_U \vee \sigma_{X_t})$, and $\tau_2 \geq E_1^2$. Let $\Delta$ be a unit vector. Since $|X_t\Delta|_n$ and $T^A_{1,tjj'}(\theta)$ (for all $\theta \in S_{r_1, r_2}$) are both nonnegative, we have

$E\left[ \max_{j,j'} |X_t\Delta|_n \sqrt{T^A_{1,tjj'}(\breve\theta)} \right] \leq \tau + \int_\tau^\infty P\left( \max_{j,j'} |X_t\Delta|_n \sqrt{T^A_{1,tjj'}(\breve\theta)} > v \right) dv$

$\leq \tau + \sum_{j,j'} \int_\tau^\infty P\left( |X_t\Delta|_n \sqrt{T^A_{1,tjj'}(\breve\theta)} > v \right) dv$

$\leq \tau + \sum_{j,j'} \int_{\tau_1}^\infty P\left( |X_t\Delta|_n^2 > v \right) dv + \sum_{j,j'} \int_{\tau_2}^\infty P\left( T^A_{1,tjj'}(\breve\theta) > v \right) dv$

$\leq \tau + c_1 p^2 \left\{ \int_{\tau_1}^\infty \exp\left( -c_4 n \frac{v - (\kappa_U \vee \sigma_{X_t})}{\sigma_{X_t}} \right) dv + \int_{\tau_2}^\infty \exp\left( -c_6 n \frac{v - E_1^2}{\sigma_{3A}} \right) dv \right\}.$

The last inequality follows from (35) and

$P\left( |X_t\Delta|_n^2 > (\kappa_U \vee \sigma_{X_t}) + u \right) \leq c_3 \exp\left( -c_4 n \left( \frac{u}{\sigma_{X_t}} \wedge \frac{u^2}{\sigma_{X_t}^2} \right) \right) \qquad (39)$

for any unit vector $\Delta$, where we use the boundedness of $X_t$ and (30). We apply a change of variable $u = v - (\kappa_U \vee \sigma_{X_t})$ and $u = v - E_1^2$, provided $v \geq \tau_1 \geq (\kappa_U \vee \sigma_{X_t})$ and $v \geq \tau_2 \geq E_1^2$, respectively. Integrating yields the following bound

$E\left[ \max_{j,j'} \sup_{\theta \in S_{r_1, r_2}} \left| X_t(\theta - \theta^l) \right|_n \sqrt{T^A_{1,tjj'}(\theta)} \right] \leq c_3' \frac{\delta^2}{2} \sqrt{\max\{\kappa_U, \sigma_{X_t}, E_1^2\}} + o\left( \frac{\delta^2}{2} \right), \qquad (40)$

where we use the fact that $\theta^l$ is in the $\frac{\delta^2}{2}$-covering set of $S_{r_1, r_2}$ in the $l_2$-norm. To upper bound the term $|X_t(\theta - \theta^l)|_n \sqrt{T^A_{2,tjj'}(\breve\theta)}$, we follow the same argument above with (36) in the place of (35) and obtain

$E\left[ \max_{j,j'} \sup_{\theta \in S_{r_1, r_2}} \left| X_t(\theta - \theta^l) \right|_n \sqrt{T^A_{2,tjj'}(\theta)} \right] \leq c_3'' \frac{\delta^2}{2} \sqrt{\max\{\kappa_U, \sigma_{X_t}, E_2^2\}} + o\left( \frac{\delta^2}{2} \right). \qquad (41)$
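The truncation-plus-tail-integration device above rests on the identity $E[Z] = \int_0^\infty P(Z > v)\,dv$ for nonnegative $Z$, which gives $E[\max_j Z_j] \leq \tau + \sum_j \int_\tau^\infty P(Z_j > v)\,dv$. A minimal numerical sketch with exponential variables (chosen only because their tail integral is available in closed form, not because they appear in the proof):

```python
import math
import random

random.seed(2)
m = 50
# For Z_1, ..., Z_m i.i.d. Exp(1): E[max_j Z_j] = H_m (the m-th harmonic number),
# while the truncation bound with tau = log m gives tau + m * exp(-tau) = log m + 1.
H = sum(1 / i for i in range(1, m + 1))
tau = math.log(m)
bound = tau + m * math.exp(-tau)

reps = 20000
emp = sum(max(random.expovariate(1) for _ in range(m)) for _ in range(reps)) / reps
print(emp, H, bound)
```

The simulated expectation of the maximum sits just below the truncation bound, as the argument predicts.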

We now upper bound the unconditional expectation of the term in (38), the “Rademacher complexity” associated with $T^A_{jj'}$. Conditioning on $X$, it can be easily verified that $\frac{1}{n}\sum_{i=1}^n \epsilon_i g^A_{jj'}(X_i; \theta^l)$ is sub-Gaussian with parameter $\frac{1}{n^2}\sum_{i=1}^n \left[ g^A_{jj'}(X_i; \theta^l) \right]^2$, where $\frac{1}{n}\sum_{i=1}^n \left[ g^A_{jj'}(X_i; \theta^l) \right]^2 = O(1)$ for all $l = 1, \ldots, N$. Applying the Dudley entropy integral bound as well as the upper bound on $\log N_n(u; S_{r_1, r_2})$ in Lemma B3, we have, for any $j, j' \in A$,

$E\left[ \max_{l=1,\ldots,N} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i g^A_{jj'}(X_i; \theta^l) \right| \right] \precsim \left( \sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \right) \sqrt{\frac{\log p}{n}}. \qquad (42)$
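The finite-maximum step implicit in (42) (a maximum over the $N$ covering points) can be illustrated with Massart's finite-class lemma: for $N$ fixed vectors $g^l \in \{-1,+1\}^n$, $E[\max_l |\frac{1}{n}\sum_i \epsilon_i g^l_i|] \leq \sqrt{2\log(2N)/n}$. A simulation sketch (with an arbitrary synthetic "function class", not the paper's $g^A_{jj'}$):

```python
import math
import random

random.seed(3)
n, N = 400, 100
# fixed sign vectors standing in for the covering-set functions evaluated at the data
G = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(N)]

trials = 50
avg = 0.0
for _ in range(trials):
    eps = [random.choice((-1, 1)) for _ in range(n)]       # Rademacher draws
    avg += max(abs(sum(e * g[i] for i, e in enumerate(eps))) / n for g in G)
avg /= trials

massart = math.sqrt(2 * math.log(2 * N) / n)
print(avg, massart)   # the empirical maximum respects the sqrt(log N / n) scaling
```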





Under the condition $|\theta^*_{S_\tau^c}|_1 \precsim \frac{\sqrt{\log p}}{n^{1/2+\varsigma}}$, combining bound (42), (40)–(41) with $\delta^2 \asymp \bar{B}_n^2$, and (34) with $u = c\sigma_{1A}\sqrt{\frac{\log p}{n}}$ yields

$\max_{j,j'} \left| \Sigma^A_{jj'} - I^A_{jj'} \right| \leq c_0 C_1^* \bar{B}_n^2 + c\sigma_{1A}\sqrt{\frac{\log p}{n}}$

with probability at least $1 - c_1\exp(-c_2\log p)$. Under the condition $C_1^* \bar{B}_n^2 \precsim \sigma_{1A}\sqrt{\frac{\log p}{n}}$, the claim in Lemma A1 follows. $\square$

Lemma A2. Suppose $|A| = o_p\left( \sqrt{\frac{n}{\log p}} \right)$ and $\sigma_{1A}\sqrt{\frac{\log p}{n}} \precsim \bar{B}_n$.

(a) If $C_1^* \sqrt{n}\,\bar{B}_n \left( \sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \right) = o(1)$, then we have

$\sqrt{n}\left[ I^{A_1} - M^{A_1} \frac{1}{n}\sum_{i=1}^n \bar{H}_i^A \right] \left( \hat\theta^A - \theta^{*A} \right) = o_p(1).$

(b) If $C_1^* \sqrt{\frac{n}{\mathrm{var}\left( \sum_{i=1}^n a^T\xi_i^A \right)}}\, \sqrt{n}\,\bar{B}_n \left( \sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \right) = o(1)$, then we have

$\sqrt{\frac{n}{\mathrm{var}\left( \sum_{i=1}^n a^T\xi_i^A \right)}} \left[ I^{A_1} - M^{A_1} \frac{1}{n}\sum_{i=1}^n \bar{H}_i^A \right] \left( \hat\theta^A - \theta^{*A} \right) = o_p(1),$

where $a$ and $\xi_i^A$ are specified in Theorem 4.2.

Proof. We show part (a) below; part (b) follows from the same argument with the $\sqrt{n}$-scaling replaced by $\left[ n^{-1}\mathrm{var}\left( \sum_{i=1}^n a^T\xi_i^A \right) \right]^{-1/2}$. Note that we have

$\left\| \sqrt{n}\left[ I^{A_1} - M^{A_1}\frac{1}{n}\sum_{i=1}^n \bar{H}_i^A \right]\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty$

$\leq \left\| \sqrt{n}\left[ I^{A_1} - M^{A_1}\frac{1}{n}\sum_{i=1}^n \hat{H}_i^A \right]\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty + \left\| M^{A_1}\sqrt{n}\,\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right)\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty$

$\leq \left\| \sqrt{n}\left[ I^{A_1} - M^{A_1}\frac{1}{n}\sum_{i=1}^n \hat{H}_i^A \right]\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty + \left\| \sqrt{n}\left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right)\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty$

$\quad + \left\| \sqrt{n}\left[ M^{A_1} - \left[ E\hat{H}^A \right]^{-1} \right]\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right)\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty.$

By Lemma A1 and setting $\mu_n = c\sigma_{1A}\sqrt{\frac{\log p}{n}}$ in (10), we are guaranteed that

$\left\| M^A \frac{1}{n}\sum_{i=1}^n \hat{H}_i^A - I^A \right\|_\infty \leq c\sigma_{1A}\sqrt{\frac{\log p}{n}} \qquad (43)$

for any solution $M^A$ satisfying (10). Therefore, by (16) of Lemma 4.1,

$\sqrt{n}\left| \left[ M^{A_1}\frac{1}{n}\sum_{i=1}^n \hat{H}_i^A - I^{A_1} \right]\left( \hat\theta^A - \theta^{*A} \right) \right| \leq c_0 \sigma_{1A}\sqrt{\log p}\left( \sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \right).$

It suffices to show that $\left\| \left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right) \right\|_\infty = O_p\left( \bar{B}_n \right)$. In particular, the argument used to show Lemma A1 can be applied to bound

$\left\{ \frac{1}{n}\sum_{i=1}^n \left( \left[ E\hat{H}^A \right]^{-1}\left( \bar{H}_i^A - \hat{H}_i^A \right) - E\left[ \left[ E\hat{H}^A \right]^{-1}\left( \bar{H}_i^A - \hat{H}_i^A \right) \right] \right) \right\}_{jj'}.$

As a result, we have

$\left\| \left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right) \right\|_\infty \leq c\, C_2^* \bar{B}_n \qquad (44)$

with probability at least $1 - O\left( \frac{1}{p} \right)$, where $C_2^* = \sum_{t=1}^T (|\theta^*|_1 \vee 1)\max\{\kappa_U, \sigma_{X_t}, E_1^2\}$. Applying the bound on $|\hat\theta^A - \theta^{*A}|_1$ from Lemma 4.1, under the condition $C_2^* \sqrt{n}\,\bar{B}_n\left( \sqrt{k}\bar{B}_n + |\theta^*_{S_\tau^c}|_1 \right) = o(1)$, we have

$\sqrt{n}\left| \left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right)\left( \hat\theta^A - \theta^{*A} \right) \right| = o_p(1).$

By bound (43) and the triangle inequality, we have

$\left\| M^A\left[ \frac{1}{n}\sum_{i=1}^n \hat{H}_i^A - E\hat{H}^A \right] \right\|_\infty + \left\| M^A\frac{1}{n}\sum_{i=1}^n \hat{H}_i^A - I^A \right\|_\infty = O_p\left( \sqrt{\frac{\log p}{n}} \right);$

since $|A| = o_p\left( \sqrt{\frac{n}{\log p}} \right)$, we must have

$\left\| M^A E\hat{H}^A - I^A \right\|_\infty = O_p\left( \sqrt{\frac{\log p}{n}} \right) \qquad (45)$

and therefore, by (44),

$\left\| \frac{1}{n}\sum_{i=1}^n \left[ M^{A_1} - \left[ E\hat{H}^A \right]^{-1} \right]\left( \bar{H}_i^A - \hat{H}_i^A \right) \right\|_\infty = \left\| \left[ M^A E\hat{H}^A - I^A \right]\left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^A - \hat{H}_i^A \right) \right\|_\infty = o_p\left( C_2^* \bar{B}_n \right).$

As a result,

$\sqrt{n}\left| \frac{1}{n}\sum_{i=1}^n \left[ M^{A_1} - \left[ E\hat{H}^A \right]^{-1} \right]\left( \bar{H}_i^A - \hat{H}_i^A \right)\left( \hat\theta^A - \theta^{*A} \right) \right| = o_p(1). \qquad \square$



Lemma A3. Let Assumption 2.1 hold. Suppose the random matrix $X_t$ consists of bounded random variables for all $t = 1, \ldots, T$. Given a set of indices $A \subseteq \{1, \ldots, p\}$ with cardinality $|A|$ and a nonrandom matrix $\Sigma$ of dimension $|A| \times |A|$, for a subset $\tilde{A} \subseteq A$, let $\xi_i^{\tilde{A}} = (\xi_{ij}^{\tilde{A}})_{j \in \tilde{A}}$ with $\xi_{ij}^{\tilde{A}} := \Sigma_{j\cdot}\sum_{t=1}^T s_{it}^{*A}$. For any $a \in \mathbb{R}^{|\tilde{A}|}$, if

$\frac{|a|_1^2 |A|^2}{\mathrm{var}\left( \sum_{i=1}^n a^T\xi_i^{\tilde{A}} \right)} = O_p(1) \quad \text{and} \quad \frac{|a|_1^2 |A|^2 \log n}{\mathrm{var}\left( \sum_{i=1}^n a^T\xi_i^{\tilde{A}} \right)} = o(1), \qquad (46)$

then for every $\varepsilon > 0$, the Lindeberg condition

$\lim_{n \to \infty} \frac{1}{n\varrho_n^{\tilde{A}}}\sum_{i=1}^n E\left[ \left( a^T\xi_i^{\tilde{A}} \right)^2 I\left( \left| a^T\xi_i^{\tilde{A}} \right| > \varepsilon\sqrt{n\varrho_n^{\tilde{A}}} \right) \right] = 0$

holds almost surely (where $\varrho_n^{\tilde{A}} := \frac{1}{n}\sum_{i=1}^n \mathrm{var}\left( a^T\xi_i^{\tilde{A}} \right)$), and $\frac{1}{\sqrt{n\varrho_n^{\tilde{A}}}}\sum_{i=1}^n a^T\xi_i^{\tilde{A}} \stackrel{d}{\to} N(0, 1)$.

Proof. To show the claim in Lemma A3, note that the summands $a^T\xi_i^{\tilde{A}}$ have zero mean and are independent. Let $\bar{Y}_{it} := Y_{it} - \Phi(X_{it}\theta^*)$ and $b_{it} := \frac{\phi(X_{it}\theta^*)}{\Phi(X_{it}\theta^*)\left(1 - \Phi(X_{it}\theta^*)\right)}$. By boundedness of $X_{it}$ and $|\theta^*|_1 \precsim 1$, $\underline{b} \leq b_{it} \leq \bar{b}$ for all $i$ and $t$. Also, $\max_{j \in \tilde{A},\, t=1,\ldots,T} \left| \Sigma_{j\cdot} X_t^{TA} \right| = O(|A|)$. Therefore, $\left| a^T\xi_i^{\tilde{A}} \right| \leq c|a|_1 \bar{b}|A|\left| \sum_{t=1}^T \bar{Y}_{it} \right|$ for some constant $c > 0$. As a result,

$E\left[ \left( a^T\xi_i^{\tilde{A}} \right)^2 I\left( \left| a^T\xi_i^{\tilde{A}} \right| > \varepsilon\sqrt{n\varrho_n^{\tilde{A}}} \right) \right]$

$\leq c_2 E\left[ |a|_1^2 |A|^2 \bar{b}^2 \left( \sum_{t=1}^T \bar{Y}_{it} \right)^2 I\left( |A|^{-1}\left| \sum_{t=1}^T \bar{Y}_{it} \right| > c_1\varepsilon\bar{b}^{-1}|a|_1^{-1}|A|^{-2}\sqrt{n\varrho_n^{\tilde{A}}} \right) \right]$

$\leq c_2 |a|_1^2 |A|^2 \bar{b}^2 \sqrt{E\left[ \left( \sum_{t=1}^T \bar{Y}_{it} \right)^4 \right]} \sqrt{P\left( |A|^{-1}\left| \sum_{t=1}^T \bar{Y}_{it} \right| > c_1\varepsilon\bar{b}^{-1}|a|_1^{-1}|A|^{-2}\sqrt{n\varrho_n^{\tilde{A}}} \right)}$

$\leq c_0 |a|_1^2 |A|^2 \bar{b}^2 \exp\left( -c_3\varepsilon^2\bar{b}^{-2}|a|_1^{-2}\, n\varrho_n^{\tilde{A}}\, |A|^{-2} T^{-2} + \log T \right),$

where we have applied a Cauchy–Schwarz inequality. Consequently, condition (46) implies that the Lindeberg condition holds. An application of the Lindeberg CLT yields $\frac{1}{\sqrt{n\varrho_n^{\tilde{A}}}}\sum_{i=1}^n a^T\xi_i^{\tilde{A}} \stackrel{d}{\to} N(0, 1)$. $\square$
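The conclusion of Lemma A3 — asymptotic normality of a normalized sum of independent but non-identically distributed, bounded summands — can be visualized with a small simulation (hypothetical bounded summands standing in for $a^T\xi_i^{\tilde{A}}$):

```python
import math
import random

random.seed(4)
n = 400
scales = [0.5 + (i % 3) for i in range(n)]            # heterogeneous scales
var_sum = sum(s * s / 3 for s in scales)              # Var(Uniform[-1,1]) = 1/3

reps = 4000
zs = [sum(s * random.uniform(-1, 1) for s in scales) / math.sqrt(var_sum)
      for _ in range(reps)]

mean = sum(zs) / reps
sd = math.sqrt(sum(z * z for z in zs) / reps - mean ** 2)
cover = sum(1 for z in zs if abs(z) <= 1.96) / reps   # should be close to 0.95
print(mean, sd, cover)
```

Despite the heterogeneity, the normalized sum behaves like a standard normal, as the Lindeberg CLT predicts.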

Lemma A4. Suppose Assumption 4.2 and the conditions in Lemma 4.1 hold. Assume $|A| = o_p\left( \sqrt{\frac{n}{\log p}} \right)$. Then

$\left\| \frac{1}{n}\sum_{i=1}^n \sum_{t=1}^T \sum_{t'=1}^T s_{it}^A(\hat\theta)s_{it'}^A(\hat\theta)^T - E\left[ \frac{1}{n}\sum_{i=1}^n \sum_{t=1}^T \sum_{t'=1}^T s_{it}^A(\theta^*)s_{it'}^A(\theta^*)^T \right] \right\|_\infty = o_p(1)$

and $\left\| M^A \frac{1}{n}\sum_{i=1}^n \bar{H}_i^{AA^c} \right\|_\infty = O_p(1)$.

Proof. For the first claim, it suffices to show that

$\left\| E\left[ \frac{1}{n}\sum_{i=1}^n s_{it}^A(\hat\theta)s_{it'}^A(\hat\theta)^T \right] - E\left[ \frac{1}{n}\sum_{i=1}^n s_{it}^A(\theta^*)s_{it'}^A(\theta^*)^T \right] \right\|_\infty = O(\bar{B}_n)$

and

$\sup_{\theta \in S_{r_1, r_2}} \left\| \frac{1}{n}\sum_{i=1}^n s_{it}^A(\theta)s_{it'}^A(\theta)^T - E\left[ \frac{1}{n}\sum_{i=1}^n s_{it}^A(\theta)s_{it'}^A(\theta)^T \right] \right\|_\infty = O_p\left( \sqrt{\frac{\log p}{n}} \right).$

For the second claim, we use the following decomposition:

$M^A \frac{1}{n}\sum_{i=1}^n \bar{H}_i^{AA^c} = \left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \left( \bar{H}_i^{AA^c} - H_i^{*AA^c} \right) + \left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n H_i^{*AA^c} + \left[ M^A E\hat{H}^A - I^A \right]\left[ E\hat{H}^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n \bar{H}_i^{AA^c}.$
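The decomposition used for the second claim is a purely algebraic identity: for any conformable $M$, invertible $E$, and matrices $X$, $X^*$, one has $MX = E^{-1}(X - X^*) + E^{-1}X^* + (ME - I)E^{-1}X$. A quick numerical check with random $2 \times 2$ matrices (hypothetical values, standard library only):

```python
import random

random.seed(6)

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv(E):
    # closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = E
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

rnd = lambda: [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
I = [[1.0, 0.0], [0.0, 1.0]]

M, X, Xs = rnd(), rnd(), rnd()
E = add(rnd(), [[3.0, 0.0], [0.0, 3.0]])   # diagonally dominant, hence invertible
Ei = inv(E)

lhs = mul(M, X)
rhs = add(add(mul(Ei, sub(X, Xs)), mul(Ei, Xs)), mul(sub(mul(M, E), I), mul(Ei, X)))
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err)   # zero up to floating-point round-off
```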

The remaining proofs follow closely the arguments for Lemma A1 and Lemma A2, and we omit the details to avoid repetition. $\square$

Lemma A5. Suppose Assumption 4.2 and the conditions in Lemma 4.1 hold. If $|A| = o_p\left( \sqrt{\frac{n}{(k \vee 1)\log p}} \right)$, we have $\frac{1}{n}\sum_{i=1}^n \left[ \theta_j^*\phi'(X_{it}\theta') - \theta_j^*\phi'(X_{it}\theta^*) \right] = o_p(1)$, where $\theta'$ is some intermediate value between $\theta^*$ and $\tilde\theta$.

Proof. For some intermediate value $\mathring\theta$ between $\theta^*$ and $\theta'$, we have

$\frac{1}{n}\sum_{i=1}^n \left[ \theta_j^*\phi'(X_{it}\theta') - \theta_j^*\phi'(X_{it}\theta^*) \right] = \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}\left( \theta' - \theta^* \right)$

$\leq \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^A \right|_1 \left\| \left[ EH_i^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n s_i^{*A}(\theta^*) \right\|_\infty$

$\quad + \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^A \right|_1 \left\| \left[ M^A E\hat{H}^A - I^A \right]\left[ EH_i^A \right]^{-1}\frac{1}{n}\sum_{i=1}^n s_i^{*A}(\theta^*) \right\|_\infty$

$\quad + \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^A \right|_1 \left\| M^A\left[ I^A - \frac{1}{n}\sum_{i=1}^n \bar{H}_i^A \right]\left( \hat\theta^A - \theta^{*A} \right) \right\|_\infty$

$\quad + \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^A \right|_1 \left\| M^A\frac{1}{n}\sum_{i=1}^n \bar{H}_i^{AA^c}\theta^{*A^c} \right\|_\infty + \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^{A^c}\theta^{*A^c} \right|$

$\quad + \left| \frac{1}{n}\sum_{i=1}^n \theta_j^*\phi''(X_{it}\mathring\theta)X_{it}^A \right|_1 \left\| \left[ E\hat{H}^A \right]^{-1}\left( \left[ E\hat{H}^A \right]\left[ EH_i^A \right]^{-1} - I^A \right)\frac{1}{n}\sum_{i=1}^n s_i^{*A}(\theta^*) \right\|_\infty.$

To bound the terms above, we can apply the facts established in the previous proofs together with the condition $|A| = o_p\left( \sqrt{\frac{n}{(k \vee 1)\log p}} \right)$. $\square$

B

Additional lemmas

Lemma B1. Let $\{X_i\}_{i=1}^n$ be independent sub-Exponential random variables with parameter at most $\sigma$. Then we have

$P\left[ \left| \sum_{i=1}^n X_i - E\sum_{i=1}^n X_i \right| \geq n\varepsilon \right] \leq 2\exp\left( -c_0 n\min\left\{ \frac{\varepsilon^2}{\sigma^2}, \frac{\varepsilon}{\sigma} \right\} \right).$

Let $X \in \mathbb{R}^{n \times p_1}$ be a sub-Gaussian matrix with parameters $(\Sigma_X, \sigma_X^2)$. For any fixed (unit) vector $\Delta \in \mathbb{R}^{p_1}$, we have

$P\left[ \left| \frac{|X\Delta|_2^2}{n} - E\left[ \frac{|X\Delta|_2^2}{n} \right] \right| \geq \varepsilon \right] \leq 2\exp\left( -c_1 n\min\left\{ \frac{\varepsilon^2}{\sigma_X^4}, \frac{\varepsilon}{\sigma_X^2} \right\} \right).$

Moreover, if $Y \in \mathbb{R}^{n \times p_2}$ is a sub-Gaussian matrix with parameters $(\Sigma_Y, \sigma_Y^2)$, then

$P\left[ \left\| \frac{Y^T X}{n} - E\left( Y_i^T X_i \right) \right\|_\infty \geq \varepsilon \right] \leq 6\exp\left( -c_2 n\min\left\{ \frac{\varepsilon^2}{\sigma_X^2\sigma_Y^2}, \frac{\varepsilon}{\sigma_X\sigma_Y} \right\} + \log p_1 + \log p_2 \right),$

where $X_i$ and $Y_i$ are the $i$th rows of $X$ and $Y$, respectively.

Remark. This lemma is based on Lemma 5.14 and Corollary 5.17 in Vershynin (2012), as well as Lemma 14 in Loh and Wainwright (2012). $\square$

Lemma B2. Let $X \in \mathbb{R}^{n \times p_1}$ be a sub-Gaussian matrix with parameters $(\Sigma_X, \sigma_X^2)$. We have

$\frac{|X\Delta|_2^2}{n} \geq \frac{\underline{\alpha}}{2}|\Delta|_2^2 - \alpha'\frac{\log p_1}{n}|\Delta|_1^2 \quad \text{for all } \Delta \in \mathbb{R}^{p_1},$

$\frac{|X\Delta|_2^2}{n} \leq \frac{3\bar{\alpha}}{2}|\Delta|_2^2 + \alpha'\frac{\log p_1}{n}|\Delta|_1^2 \quad \text{for all } \Delta \in \mathbb{R}^{p_1},$

with probability at least $1 - c_1\exp(-c_2 n)$, where $\underline{\alpha}$, $\bar{\alpha}$, and $\alpha'$ only depend on $\Sigma_X$ and $\sigma_X$.

Remark. This lemma is Lemma 13 in Loh and Wainwright (2012). $\square$

Definition (Covering numbers). For a metric space consisting of a set $\mathcal{X}$ and a metric $\rho: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$: a $t$-covering of $\mathcal{X}$ with respect to $\rho$ is a set $\{\beta^1, \ldots, \beta^N\} \subset \mathcal{X}$ such that for all $\beta \in \mathcal{X}$, there exists some $i \in \{1, \ldots, N\}$ with $\rho(\beta, \beta^i) \leq t$. The $t$-covering number $N(t; \mathcal{X}, \rho)$ is the cardinality of the smallest $t$-covering.

Lemma B3. For $q \in (0, 1]$, let

$\mathbb{B}_q^d(R_q) := \left\{ \theta' \in \mathbb{R}^d : |\theta'|_q^q = \sum_{j=1}^d |\theta_j'|^q \leq R_q \right\}$

and let $N_2(t; \mathbb{B}_q^d(R_q))$ be the $t$-covering number of the set $\mathbb{B}_q^d(R_q)$ in the $l_2$-norm. Then there is a universal constant $c$ such that

$\log N_2(t; \mathbb{B}_q^d(R_q)) \leq c R_q^{\frac{2}{2-q}}\left( \frac{1}{t} \right)^{\frac{2q}{2-q}}\log d \quad \text{for all } t \in \left( 0, R_q^{\frac{1}{q}} \right).$

Remark. These bounds are obtained by inverting known results on (dyadic) entropy numbers (e.g., Schütt, 1984; Guédon and Litvak, 2000; Kühn, 2001) of $l_q$-“balls,” as in the proof of Lemma 2 in Raskutti, Wainwright, and Yu (2011). $\square$
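The covering-number machinery of Lemma B3 can be made concrete with a greedy covering construction: admit any point farther than $t$ from all current centers as a new center, so that afterwards every point lies within $t$ of some center. The toy sketch below covers points sampled from the $l_1$ ball in $\mathbb{R}^2$ (i.e., $q = 1$, $R_q = 1$); it illustrates the notion of a $t$-covering, not the lemma's bound itself:

```python
import math
import random

random.seed(5)

def greedy_cover(points, t):
    # any point farther than t from all current centers becomes a new center,
    # so afterwards every point is within t of some center
    centers = []
    for p in points:
        if all(math.dist(p, c) > t for c in centers):
            centers.append(p)
    return centers

pts = []
while len(pts) < 2000:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x) + abs(y) <= 1:                # rejection sampling from the l1 ball
        pts.append((x, y))

t = 0.2
centers = greedy_cover(pts, t)
print(len(centers))
```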


References

Altonji, J. G. and R. L. Matzkin (2005). “Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors.” Econometrica, 73, 1053-1102.
Arellano, M. and S. Bonhomme (2009). “Robust Priors in Nonlinear Panel Data Models.” Econometrica, 77, 489-536.
Belloni, A. and V. Chernozhukov (2013). “Least Squares after Model Selection in High-Dimensional Sparse Models.” Bernoulli, 19, 521-547.
Belloni, A., V. Chernozhukov, and C. Hansen (2014). “Inference on Treatment Effects after Selection amongst High-Dimensional Controls.” Review of Economic Studies, 81, 608-650.
Bertsimas, D. and J. Tsitsiklis (1997). Introduction to Linear Optimization. Athena Scientific.
Bickel, P., Y. Ritov, and A. B. Tsybakov (2009). “Simultaneous Analysis of Lasso and Dantzig Selector.” The Annals of Statistics, 37, 1705-1732.
Blundell, R. and J. L. Powell (2003). “Endogeneity in Nonparametric and Semiparametric Regression Models.” In Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress, M. Dewatripont, L. P. Hansen, and S. J. Turnovsky, eds. Cambridge: Cambridge University Press, 2, 312-357.
Blundell, R. and J. L. Powell (2004). “Endogeneity in Semiparametric Binary Response Models.” Review of Economic Studies, 71, 655-679.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press, Cambridge.
Bühlmann, P. and S. A. van de Geer (2011). Statistics for High-Dimensional Data. Springer, New York.
Chamberlain, G. (1982). “Multivariate Regression Models for Panel Data.” Journal of Econometrics, 18, 5-46.
Chamberlain, G. (1984). “Panel Data.” In Handbook of Econometrics, Z. Griliches and M. D. Intriligator, eds. 2, 1248-1318. Amsterdam: North Holland.
Chernozhukov, V., I. Fernández-Val, J. Hahn, and W. K. Newey (2009). “Identification and Estimation of Marginal Effects in Nonlinear Panel Models.” Mimeo, Boston University Department of Economics.
Dudley, R. M. (1967). “The Sizes of Compact Subsets of Hilbert Space and Continuity of Gaussian Processes.” Journal of Functional Analysis, 1, 290-330.
Fan, J., J. Lv, and L. Qi (2011). “Sparse High-Dimensional Models in Economics.” Annual Review of Economics, 3, 291-317.
Fan, Y. and R. Li (2012). “Variable Selection in Linear Mixed Effects Models.” Annals of Statistics, 40, 2043-2068.
Fernández-Val, I. (2008). “Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models.” Mimeo, Boston University, Department of Economics.
Greene, W. H. (2003). Econometric Analysis. Pearson Education India.
Guédon, O. and A. E. Litvak (2000). “Euclidean Projections of a p-Convex Body.” In Geometric Aspects of Functional Analysis, 95-108. Springer-Verlag.
Hoderlein, S., E. Mammen, and K. Yu (2011). “Non-Parametric Models in Binary Choice Fixed Effects Panel Data.” The Econometrics Journal, 14, 351-367.
Javanmard, A. and A. Montanari (2014). “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research, 15, 2869-2909.
Koltchinskii, V. (2006). “Local Rademacher Complexities and Oracle Inequalities in Risk Minimization.” The Annals of Statistics, 34, 2593-2656.
Kühn, T. (2001). “A Lower Estimate for Entropy Numbers.” Journal of Approximation Theory, 110, 120-124.
Li, T. and X. Zheng (2008). “Semiparametric Bayesian Inference for Dynamic Tobit Panel Data Models with Unobserved Heterogeneity.” Journal of Applied Econometrics, 23, 699-728.
Loh, P. and M. Wainwright (2012). “High-Dimensional Regression with Noisy and Missing Data: Provable Guarantees with Non-convexity.” The Annals of Statistics, 40, 1637-1664.
Loh, P. and M. Wainwright (2013). “Regularized M-estimators with Nonconvexity: Statistical and Algorithmic Theory for Local Optima.” NIPS, NV.
Newey, W. and D. McFadden (1994). “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, D. McFadden and R. Engle, eds. 4, 2113-2245. Elsevier.
Mundlak, Y. (1978). “On the Pooling of Time Series and Cross Section Data.” Econometrica, 46, 69-85.
Negahban, S., P. Ravikumar, M. J. Wainwright, and B. Yu (2012). “A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers.” Statistical Science, 27, 538-557. 2010 version: arXiv:1010.2731v1.
Papke, L. E. and J. M. Wooldridge (2008). “Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates.” Journal of Econometrics, 145, 121-133.
Rabe-Hesketh, S. and A. Skrondal (2013). “Avoiding Biased Versions of Wooldridge’s Simple Solution to the Initial Conditions Problem.” Economics Letters, 120, 346-349.
Raskutti, G., M. J. Wainwright, and B. Yu (2011). “Minimax Rates of Estimation for High-Dimensional Linear Regression over lq-Balls.” IEEE Transactions on Information Theory, 57, 6976-6994.
Schütt, C. (1984). “Entropy Numbers of Diagonal Operators between Symmetric Banach Spaces.” Journal of Approximation Theory, 40, 121-128.
Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B, 58, 267-288.
van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014). “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models.” The Annals of Statistics, 42, 1166-1202.
Vershynin, R. (2012). “Introduction to the Non-Asymptotic Analysis of Random Matrices.” In Eldar, Y. and G. Kutyniok, eds., Compressed Sensing: Theory and Applications, 210-268. Cambridge.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge.
Wooldridge, J. M. (2016). “Correlated Random Effects Models with Unbalanced Panels.” Working paper, Michigan State University, Department of Economics.
Zhang, C.-H. and S. S. Zhang (2014). “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models.” Journal of the Royal Statistical Society, Series B, 76, 217-242.
