
Nonparametric Estimation of Triangular Simultaneous Equations Models under Weak Identification

Sukjin Han*
Department of Economics
University of Texas at Austin
[email protected]

October 6, 2015

Abstract

This paper analyzes the problem of weak instruments on identification, estimation, and inference in a simple nonparametric model of a triangular system. The paper derives a necessary and sufficient rank condition for identification, based on which weak identification is established. Then, nonparametric weak instruments are defined as a sequence of reduced-form functions where the associated rank shrinks to zero. The problem of weak instruments is characterized as concurvity and shown to be similar to the ill-posed inverse problem, which motivates the introduction of a regularization scheme. The paper proposes a penalized series estimation method to alleviate the effects of weak instruments and shows that it achieves desirable asymptotic properties. The findings of this paper provide useful implications for empirical work. To illustrate them, Monte Carlo results are presented and an empirical example is given in which the effect of class size on test scores is estimated nonparametrically.

Keywords: Triangular models, nonparametric identification, weak identification, weak instruments, series estimation, inverse problem, regularization, concurvity.

JEL Classification Numbers: C13, C14, C36.

* I am very grateful to my advisors, Donald Andrews and Edward Vytlacil, and committee members, Xiaohong Chen and Yuichi Kitamura, for their inspiration, guidance and support. I am deeply indebted to Donald Andrews for his thoughtful advice throughout the project. The earlier version of this paper has benefited from discussions with Joseph Altonji, Ivan Canay, Philip Haile, Keisuke Hirano, Han Hong, Joel Horowitz, Seokbae Simon Lee, Oliver Linton, Whitney Newey, Byoung Park, Peter Phillips, Andres Santos, and Alex Torgovitsky. I gratefully acknowledge financial support from a Carl Arvid Anderson Prize from the Cowles Foundation. I also thank the seminar participants at Yale, UT Austin, Chicago Booth, Notre Dame, SUNY Albany, Sogang, SKKU, and Yonsei, as well as the participants at NASM and Cowles Summer Conference.

1 Introduction

Instrumental variables (IVs) are widely used in empirical research to identify and estimate models with endogenous explanatory variables. In linear simultaneous equations models, it is well known that standard asymptotic approximations break down when instruments are weak in the sense that the (partial) correlation between the instruments and endogenous variables is weak. The consequences of and solutions for weak instruments in linear settings have been extensively studied in the literature over the past two decades; see, e.g., Bound et al. (1995), Staiger and Stock (1997), Dufour (1997), Kleibergen (2002, 2005), Moreira (2003), Stock and Yogo (2005), and Andrews and Stock (2007), among others. Weak instruments in nonlinear parametric models have recently been studied in the literature in the context of weak identification by, e.g., Stock and Wright (2000), Kleibergen (2005), Andrews and Cheng (2012), Andrews and Mikusheva (2015a,b), Andrews and Guggenberger (2015), and Han and McCloskey (2015).

One might expect that nonparametric models with endogenous explanatory variables will generally require stronger identification power than parametric models, as there is an infinite number of unknown parameters to identify, and hence, stronger instruments may be required.¹ Despite the problem's importance and the growing popularity of nonparametric models, weak instruments in nonparametric settings have not received much attention.² Furthermore, surprisingly little attention has been paid to the consequences of weak instruments in empirical research using nonparametric models; see below for references. Part of the neglect is due to the existing complications embedded in nonparametric models.

In a simple nonparametric framework, this paper analyzes the problem of weak instruments on identification, estimation, and inference, and proposes an estimation strategy to mitigate the effect.
Identification results are obtained so that the concept of weak identification can subsequently be introduced via localization. The problem of weak instruments is characterized as concurvity and is shown to be similar to the ill-posed inverse problem. An estimation method is proposed through regularization and the resulting estimators are shown to have desirable asymptotic properties even when instruments are possibly weak.

As a nonparametric framework, we consider a triangular simultaneous equations model. The specification of weak instruments is intuitive in the triangular model because it has an explicit reduced-form relationship. Additionally, clear interpretation of the effect of weak instruments can be made through a specific structure produced by the control function approach. To make our analysis succinct, we specify additive errors in the model. This particular model is considered in Newey et al. (1999) (NPV) and Pinkse (2000) in a situation without weak instruments. Although relatively recent developments in nonparametric triangular models contribute to models with nonseparable errors (e.g., Imbens and Newey (2009), Kasy (2014)), such flexibility complicates the exposition of the main results of this paper. For instance, the control function employed in Imbens and Newey (2009) requires large variation in instruments, and hence discussing weak instruments (i.e., weak association between endogenous variables and instruments or little variation in instruments) in such a context requires more care. Also, having a form analogous to its popular parametric counterpart, the model with additive errors is broadly used in applied research such as Blundell and Duncan (1998), Yatchew and No (2001), Lyssiotou et al. (2004), Dustmann and Meghir (2005), Skinner et al. (2005), Blundell et al. (2008), Del Bono and Weber (2008), Frazer (2008), Mazzocco (2012), Coe et al. (2012), Breza (2012), Henderson et al. (2013), Chay and Munshi (2014), and Koster et al. (2014).

¹ This conjecture is shown to be true in the setting considered in this paper; see Theorem 5.1 and Corollary 5.2.

² Chesher (2003, 2007) mentions the issue of weak instruments in applying his key identification condition in the empirical example of Angrist and Krueger (1991). Blundell et al. (2007) determine whether weak instruments are present in the Engel curve dataset of their empirical section. They do this by applying the Stock and Yogo (2005) test developed in linear models to their reduced form, which is linearized by sieve approximation. Darolles et al. (2011) briefly discuss weak instruments that are indirectly characterized within their source condition.

One of the contributions of this paper is that it derives novel identification results in nonparametric triangular models that complement the existing results in the literature. With a mild support condition, we show that a particular rank condition is necessary and sufficient for the identification of the structural relationship. This rank condition is substantially weaker than what is established in NPV. The rank condition covers economically relevant situations such as outcomes resulting from corner solutions or kink points in certain economic models. More importantly, deriving such a rank condition is the key to establishing the notion of weak identification. Since the condition is minimal, a “slight violation” of it has a binding effect on identification, hence resulting in weak identification.
To characterize weak identification, we consider a drifting sequence of reduced-form functions that converges to a non-identification region, namely, a space of reduced-form functions that violate the rank condition for identification. Under this localization, the signal diminishes relative to the noise in the system, and hence, the model is weakly identified. A particular rate is designated relative to the sample size, which effectively measures the strength of the instruments, so that it appears in asymptotic results for the estimator of the structural function. The concept of nonparametric weak instruments generalizes the concept of weak instruments in linear models such as in Staiger and Stock (1997).

In the nonparametric control function framework, the problem of weak instruments becomes a nonparametric analogue of a multicollinearity problem known as concurvity (Hastie and Tibshirani (1986)). Once the endogeneity is controlled by a control function, the model can be rewritten as an additive nonparametric regression, where the endogenous variables and reduced-form errors comprise two regressors, and weak instruments result in the variation of the former regressor being mainly driven by the variation of the latter. This problem of concurvity is related to the ill-posed inverse problem inherent in other nonparametric models with endogeneity or, in general, to settings where smoothing operators are involved; see Carrasco et al. (2007) for a survey of inverse problems. Although the sources of ill-posedness in the two problems are different, there is sufficient similarity that the regularization methods used in the literature to solve the ill-posed inverse problem can be introduced to our problem. Due to the problems' distinct features, however, among the regularization methods, only penalization (i.e., Tikhonov-type regularization) alleviates the effect of weak instruments.

This paper proposes a penalized series estimator for the structural function and establishes its asymptotic properties. We develop a modified version of the standard $L_2$ penalization to control the penalty bias. Our results on the rate of convergence of the estimator suggest that, without penalization, weak instruments characterized as concurvity slow down the overall convergence rate, exacerbating bias and variance “symmetrically.” We then show that a faster convergence rate is achieved with penalization, while the penalty bias is dominated by the standard approximation bias. In showing the gain from penalization, this paper derives the decay rates of coefficients of series expansions, which are related to source conditions (Engl et al. (1996)) in the literature on ill-posed inverse problems. The corresponding rate, which is assumed to be part of a source condition, is a rather abstract smoothness condition and is agnostic about dimensionality; see, e.g., Hall and Horowitz (2005). In contrast to the literature, we derive the decay rates from a conventional smoothness condition stated as differentiability and also incorporate dimensionality. Along with the convergence rate results, we derive consistency and asymptotic normality with mildly weak instruments.

The problem of concurvity in additive nonparametric models is also recognized in the literature, where different estimation methods are proposed to address the problem, e.g., the backfitting methods (Linton (1997), Nielsen and Sperlich (2005)) and the integration method (Jiang et al. (2010)). See Sperlich et al. (1999) for further discussions of those methods in the context of correlated designs (i.e., covariates).
In the present paper, where an additive model results from a triangular model accompanied with the control function approach, the problem of concurvity is addressed in a more direct manner via penalization. In addition, although the main conclusions of this paper do not depend on the choice of nonparametric estimation method, using series estimation in our penalization procedure is also justified in the context of design density. In situations where the joint density of $x$ and $v$ becomes singular, such as in our case with weak instruments, it is known that series and local linear estimators are less sensitive than conventional kernel estimators; see, e.g., Hengartner and Linton (1996) and Imbens and Newey (2009) for related discussions.

Another possible nonparametric framework in which to examine the problem of weak instruments is a nonparametric IV (NPIV) model (Newey and Powell (2003), Hall and Horowitz (2005) and Blundell et al. (2007), among others). It is well known that, even with strong instruments, the ill-posed inverse problem arises in this type of model due to the typical involvement of an integral equation. This ill-posed inverse problem can be characterized as the eigenvalues of the integral operator converging to zero (Horowitz (2011)). Unlike in a triangular model, the absence of an explicit reduced-form relationship forces weak instruments in this setting also to be characterized as the eigenvalues converging to zero, albeit in a different way than that for the ill-posed inverse problem; see Section B.5 in the Appendix for details of this discussion. Therefore, in this model, the performance of the estimator can deteriorate severely as the problem is “doubly ill-posed.”³ Further, it may also be hard to separate the effects of the two in asymptotic theory.

As a related recent work, Freyberger (2015) provides a framework by which the completeness condition can be tested in an NPIV model. Instead of using a drifting sequence of distributions, he indirectly defines weak instruments as a failure of a restricted version of the completeness condition. While he applies his framework to test weak instruments, our focus is on estimation and inference of the function of interest in a different nonparametric model with a more explicit definition of weak instruments.

The findings of this paper provide useful implications for empirical work. First, when estimating a nonparametric structural function, the results of IV estimation and subsequent inference can be misleading even when the instruments are strong in terms of conventional criteria for linear models.⁴ Second, the symmetric effect of weak instruments on bias and variance implies that the bias–variance trade-off is the same across different strengths of instruments, and hence, weak instruments cannot be alleviated by exploiting the trade-off. Third, penalization on the other hand can alleviate weak instruments by significantly reducing variance and sometimes bias as well. Fourth, there is a trade-off between the smoothness of the structural function (or the dimensionality of its argument) and the requirement of strong instruments. Fifth, if a triangular model along with its assumptions is considered to be reasonable, it makes the data more informative about the relationship of interest than an NPIV model does, which is an attractive feature especially in the presence of weak instruments. Sixth, although a linear first-stage reduced form is commonly used in applied research (e.g., in NPV, Blundell and Duncan (1998), Blundell et al. (1998), Dustmann and Meghir (2005), Coe et al. (2012), and Henderson et al. (2013)), the strength of instruments can be improved by having a nonparametric reduced form so that the nonlinear relationship between the endogenous variable and instruments can be fully exploited. The last point is related to the identification results of this paper.

In Section 8, we apply the findings of this paper to an empirical example, where we nonparametrically estimate the effect of class size on students' test scores. In this section, we also compare the estimates with those of Horowitz (2011)'s NPIV model.

The rest of the paper is organized as follows. Section 2 introduces the model and obtains new identification results. Section 3 discusses weak identification and Section 4 relates the weak instrument problem to the ill-posed inverse problem and defines our penalized series estimator. Sections 5 and 6 establish the rate of convergence and consistency of the penalized series estimator and the asymptotic normality of some functionals of it. Section 7 presents the Monte Carlo simulation results. Section 8 discusses the empirical application. Finally, Section 9 concludes.

³ In Section 8, we illustrate this point in an empirical application by comparing estimates calculated from the triangular and NPIV models.

⁴ For instance, in Coe et al. (2012), the first-stage F-statistic value that is reported is (sometimes barely) in favor of strong instruments, but the judgement is based on the criterion for linear models. The majority of empirical works referenced above do not report first-stage results.

2 Identification

We consider a nonparametric triangular simultaneous equations model

$$y = g_0(x, z_1) + \varepsilon, \qquad x = \Pi_0(z) + v, \tag{2.1a}$$

$$E[\varepsilon \mid v, z] = E[\varepsilon \mid v] \ \text{a.s.}, \qquad E[v \mid z] = 0 \ \text{a.s.}, \tag{2.1b}$$

where $g_0(\cdot, \cdot)$ is an unknown structural function of interest, $\Pi_0(\cdot)$ is an unknown reduced-form function, $x$ is a $d_x$-vector of endogenous variables, $z = (z_1, z_2)$ is a $(d_{z_1} + d_{z_2})$-vector of exogenous variables, and $z_2$ is a vector of excluded instruments. The stochastic assumptions (2.1b) are more general than the assumption of full independence between $(\varepsilon, v)$ and $z$ and $E[v] = 0$. Following the control function approach,

$$E[y \mid x, z] = g_0(x, z_1) + E[\varepsilon \mid \Pi_0(z) + v, z] = g_0(x, z_1) + E[\varepsilon \mid v] = g_0(x, z_1) + \lambda_0(v), \tag{2.2}$$

where $\lambda_0(v) = E[\varepsilon \mid v]$ and the second equality is from the first part of (2.1b). In effect, we capture endogeneity ($E[\varepsilon \mid x, z] \neq 0$) by an unknown function $\lambda_0(v)$, which serves as a control function.
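The two-step logic behind (2.2) can be illustrated with a small simulation. The sketch below is only an illustration under assumed ingredients that are not in the paper: a linear reduced form, a quadratic structural function $g_0(x) = x + 0.5x^2$, a linear control function $E[\varepsilon \mid v] = 0.8v$, and low-order polynomial series terms. The function names are hypothetical.

```python
import numpy as np

def simulate_triangular(n, pi=1.0, rho=0.8, seed=0):
    """Draw (y, x, z) from y = g0(x) + eps, x = pi*z + v, with E[eps | v] = rho*v."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                     # excluded instrument
    v = rng.normal(size=n)                     # reduced-form error
    eps = rho * v + 0.5 * rng.normal(size=n)   # structural error, correlated with v
    x = pi * z + v                             # reduced form (assumed linear here)
    y = x + 0.5 * x**2 + eps                   # assumed g0(x) = x + 0.5 x^2
    return y, x, z

def ols(design, y):
    """Least-squares coefficients for the given design matrix."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

def control_function_fit(y, x, z):
    # Step 1: first-stage regression of x on z; the residuals estimate v.
    first = np.column_stack([np.ones_like(z), z])
    v_hat = x - first @ ols(first, x)
    # Step 2: additive regression of y on series terms in x and in v_hat.
    second = np.column_stack([np.ones_like(x), x, x**2, v_hat])
    return ols(second, y)

def naive_fit(y, x):
    # Ignores endogeneity: regress y on the series terms in x only.
    return ols(np.column_stack([np.ones_like(x), x, x**2]), y)

y, x, z = simulate_triangular(5000)
beta_cf = control_function_fit(y, x, z)   # order: const, x, x^2, v_hat
beta_naive = naive_fit(y, x)              # order: const, x, x^2
print("control function:", np.round(beta_cf, 3))
print("naive:           ", np.round(beta_naive, 3))
```

In this design the naive regression loads part of the variation of $v$ onto $x$, whereas including the first-stage residual as an additional regressor separates the two components.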

Another intuition for this approach is that once $v$ is controlled for or conditioned on, the only variation of $x$ comes from the exogenous variation of $z$. Based on equation (2.2) we establish identification, weak identification, and estimation results.

First, we obtain identification results that complement the results of NPV. For useful comparisons, we first restate the identification condition of NPV, which is written in terms of $\Pi(\cdot)$. Given (2.2), the identification of $g_0(x, z_1)$ is achieved if one can separately vary $(x, z_1)$ and $v$ in $g(x, z_1) + \lambda(v)$. Since $x = \Pi_0(z) + v$, a suitable condition on $\Pi_0(\cdot)$ will guarantee this via the separate variation of $z$ and $v$. In light of this intuition, NPV propose the following identification condition.

Proposition 2.1 (Theorem 2.3 in NPV) If $g(x, z_1)$, $\lambda(v)$, and $\Pi(z)$ are differentiable, the boundary of the support of $(z, v)$ has probability zero, and

$$\Pr\left[\operatorname{rank}\left(\frac{\partial \Pi_0(z)}{\partial z_2'}\right) = d_x\right] = 1, \tag{2.3}$$

then $g_0(x, z_1)$ is identified up to an additive constant.

The identification condition can be seen as a nonparametric generalization of the rank condition. One can readily show that the order condition ($d_{z_2} \geq d_x$) is incorporated in this rank condition. Note that this condition is only a sufficient condition, which suggests that the model can possibly be identified with a relaxed rank condition. This observation motivates our identification analysis. We find a necessary and sufficient rank condition for identification by introducing a mild support condition. The identification analysis of this section is also important for our later purpose of defining the notion of weak identification.

Henceforth, in order to keep our presentation succinct, we focus on the case where the included exogenous variable $z_1$ is dropped from model (2.1) and $z = z_2$. With $z_1$ included, all the results of this paper readily follow similar lines; e.g., the identification analysis follows conditional on $z_1$. We first state and discuss the assumptions that we impose.

Assumption ID1 The functions $g(x)$, $\lambda(v)$, and $\Pi(z)$ are continuously differentiable in their arguments.

This condition is also assumed in Proposition 2.1 above. Before we state a key additional assumption for identification, we first define the supports that are associated with $x$ and $z$. Let $\mathcal{X} \subset \mathbb{R}^{d_x}$ and $\mathcal{Z} \subset \mathbb{R}^{d_z}$ be the marginal supports of $x$ and $z$, respectively. Also, let $\mathcal{X}_z$ be the conditional support of $x$ given $z \in \mathcal{Z}$. We partition $\mathcal{Z}$ into two regions: one where the rank condition is satisfied, i.e., where $z$ is relevant, and its complement.

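Definition 2.1 (Relevant set) Let $\mathcal{Z}^r$ be the subset of $\mathcal{Z}$ defined by

$$\mathcal{Z}^r = \mathcal{Z}^r(\Pi_0(\cdot)) = \left\{ z \in \mathcal{Z} : \operatorname{rank}\left(\frac{\partial \Pi_0(z)}{\partial z'}\right) = d_x \right\}.$$

Let $\mathcal{Z}^0 = \mathcal{Z} \setminus \mathcal{Z}^r$ be the complement of the relevant set. Let $\mathcal{X}^r$ be the subset of $\mathcal{X}$ defined by $\mathcal{X}^r = \{x \in \mathcal{X}_z : z \in \mathcal{Z}^r\}$.

Given the definitions, we introduce an additional support condition.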

Assumption ID2 The supports $\mathcal{X}$ and $\mathcal{X}^r$ differ only on a set of probability zero, i.e., $\Pr[x \in \mathcal{X} \setminus \mathcal{X}^r] = 0$.

Intuitively, when $z$ is in the relevant set, $x = \Pi(z) + v$ varies as $z$ varies, and therefore, the support of $x$ corresponding to the relevant set is large. Assumption ID2 assures that the corresponding support is large enough to almost surely cover the entire support of $x$. ID2 is not as strong as it may appear to be. Below, we show this by providing mild sufficient conditions for ID2. If we identify $g_0(x)$ for any $x \in \mathcal{X}^r$, then we achieve identification of $g_0(x)$ by Assumption ID2.⁵

Now, in order to identify $g_0(x)$ for $x \in \mathcal{X}^r$, we need a rank condition, which will be minimal. The following is the identification result:

Theorem 2.2 In model (2.1), suppose Assumptions ID1 and ID2 hold. Then, $g_0(x)$ is identified on $\mathcal{X}$ up to an additive constant if and only if

$$\Pr\left[\operatorname{rank}\left(\frac{\partial \Pi_0(z)}{\partial z'}\right) = d_x\right] > 0. \tag{2.4}$$

⁵ The support on which an unknown function is identified is usually left implicit in the literature. To make it more explicit, $g_0(x)$ is identified if $g_0(x)$ is identified on the support of $x$ almost surely.

This and all subsequent proofs can be found in the Appendix. The rank condition (2.4) is necessary and sufficient. By Definition 2.1, it can alternatively be written as $\Pr[z \in \mathcal{Z}^r] > 0$. The condition is substantially weaker than (2.3) in Proposition 2.1, which is $\Pr[z \in \mathcal{Z}^r] = 1$ (with $z = z_2$). That is, Theorem 2.2 extends the result of NPV in the sense that when $\mathcal{Z}^r = \mathcal{Z}$, ID2 is trivially satisfied with $\mathcal{X} = \mathcal{X}^r$. Theorem 2.2 shows that it is enough for identification of $g_0(x)$ to have any fixed positive probability with which the rank condition is satisfied.⁶ This condition can be seen as the local rank condition as in Chesher (2003), and we achieve global identification with a local rank condition. Although this gain comes from having the additional support condition, this support condition is shown below to be mild, and the trade-off is appealing given the later purpose of building a weak identification notion. Even without Assumption ID2, maintaining the other assumptions of Theorem 2.2, we still achieve identification of $g_0(x)$, but only on the set $\mathcal{X}^r$.
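To make the gap between $\Pr[z \in \mathcal{Z}^r] = 1$ and $\Pr[z \in \mathcal{Z}^r] > 0$ concrete, the following sketch uses a hypothetical kinked reduced form that is flat on half of the instrument's support. Everything in it (the function `reduced_form`, the uniform distributions, the finite-difference check) is an assumption made for illustration and is not taken from the paper. It estimates the probability of the relevant set and performs a crude range-based check in the spirit of Assumption ID2.

```python
import numpy as np

def reduced_form(z):
    """Hypothetical kinked reduced form: flat for z <= 0, increasing for z > 0."""
    return np.maximum(z, 0.0)

def relevant(z, h=1e-4, tol=1e-8):
    """Flag draws where a central finite difference of the reduced form is nonzero.
    With scalar x and z this corresponds to rank(dPi/dz) = d_x = 1."""
    slope = (reduced_form(z + h) - reduced_form(z - h)) / (2 * h)
    return np.abs(slope) > tol

rng = np.random.default_rng(0)
n = 100_000
z = rng.uniform(-1.0, 1.0, size=n)
v = rng.uniform(-0.5, 0.5, size=n)     # support of v does not depend on z
x = reduced_form(z) + v

in_zr = relevant(z)
prob_relevant = in_zr.mean()           # estimate of Pr[z in Z^r]

# Range-based support check (adequate here because both supports are intervals):
# x generated by irrelevant z should fall inside the range of x generated by relevant z.
x_rel, x_irr = x[in_zr], x[~in_zr]
covered = np.mean((x_irr >= x_rel.min()) & (x_irr <= x_rel.max()))

print("estimated Pr[z in Z^r]:", round(prob_relevant, 3))
print("share of irrelevant-z draws of x covered:", round(covered, 3))
```

In this design the sufficient condition (2.3) fails because the rank condition holds only with probability about one half, while (2.4) holds and the values of $x$ generated by irrelevant $z$ are also generated by relevant $z$.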

Lastly, in order to identify the level of $g_0(x)$, we need to introduce some normalization as in NPV. Either $E[\varepsilon] = 0$ or $\lambda_0(\bar{v}) = \bar{\lambda}$ suffices to pin down $g_0(x)$. With the latter normalization, it follows that $g_0(x) = E[y \mid x, v = \bar{v}] - \bar{\lambda}$, which we apply in estimation as it is convenient to implement.

The following is a set of sufficient conditions for Assumption ID2. Let $\mathcal{V}_z$ be the conditional support of $v$ given $z \in \mathcal{Z}$.

Assumption ID2′ Either (a) or (b) holds.
(a) (i) $x$ is univariate and $x$ and $v$ are continuously distributed, (ii) $\mathcal{Z}$ is a Cartesian product of connected intervals, and (iii) $\mathcal{V}_z = \mathcal{V}_{\tilde{z}}$ for all $z, \tilde{z} \in \mathcal{Z}^0$;
(b) $\mathcal{V}_z = \mathbb{R}^{d_x}$ for all $z \in \mathcal{Z}$.

Lemma 2.1 Under Assumption ID1, Assumption ID2′ implies Assumption ID2.

In Assumption ID2′, the continuity of the random variables is closely related to the support condition in Proposition 2.1 that the boundary of the support of $(z, v)$ has probability zero. For example, when $z$ or $v$ is discrete, their condition does not hold. Assumption ID2′(a)(i) assumes that the endogenous variable is univariate, which is the most empirically relevant case in nonparametric models. An additional condition is required with multivariate $x$, which is omitted in this paper. Even under ID2′(a)(i), however, the exogenous covariate $z_1$ in $g(x, z_1)$, which is omitted in the discussion, can still be a vector. ID2′(a)(ii) and (iii) are rather mild. ID2′(a)(ii) assumes that $z$ has a connected support, which in turn requires that the excluded instruments vary smoothly. The assumptions on the continuity of the random variables and the connectedness of $\mathcal{Z}$ are also useful in deriving the asymptotic theory of the series estimator considered in this paper; see Assumption B below.

ID2′(a)(iii) means that the conditional support of $v$ given $z$ is invariant when $z$ is in $\mathcal{Z}^0$. This support invariance condition is the key to obtaining a rank condition that is considerably weaker than that of NPV. Our support invariance condition is different from the support invariance condition introduced in Imbens and Newey (2009). Using the notation of the present paper, Imbens and Newey (2009) require that the support of $v$ conditional on $x$ equals the marginal support of $v$, which inevitably requires a large support of $z$. On the other hand, ID2′(a)(iii) requires that the support of $v$ conditional on $z$ is invariant (for $z \in \mathcal{Z}^0$), and therefore imposes no restriction on the support of $z$. Also, the conditional support does not have to equal the marginal support of $v$ here. ID2′(a)(iii), along with the control function assumptions (2.1b), is a weaker orthogonality condition for $z$ than the full independence condition $z \perp v$.⁷ Note that $\mathcal{V}_z = \{x - \Pi_0(z) : x \in \mathcal{X}_z\}$. Therefore, ID2′(a)(iii) equivalently means that $\mathcal{X}_z$ is invariant for $z$ such that $E[x \mid z] = \text{const}$. This condition can be checked from the data.

⁶ A similar condition appears in the identification analysis of Hoderlein (2009), where endogenous semiparametric binary choice models are considered in the presence of heteroskedasticity.

Figure 1: Identification under Assumption ID2′(a), univariate $z$ and no $z_1$.

Given ID2′(b) that $v$ has a full conditional support, ID2 is trivially satisfied and no additional restriction is imposed on the joint support of $z$ and $v$. ID2′(b) also does not require univariate $x$ or the connectedness of $\mathcal{Z}$. This assumption on $\mathcal{V}_z$ is satisfied with, for example, a normally distributed error term (conditional on regressors).

⁷ With ID2′(a)(iii), the heteroskedasticity of $v$, which is allowed by having $E[\varepsilon \mid v, z] = E[\varepsilon \mid v]$, may or may not be restricted.

Figure 1 illustrates the intuition of the identification proof under ID2′(a) in a simple case where $z$ is univariate. With ID2′(b), the analysis is even more straightforward; see the proof of Lemma 2.1 in the Appendix. In the figure, the local rank condition (2.4) ensures global identification of $g_0(x)$. The intuition is as follows. First, by $\partial E[y \mid v, z]/\partial z = (\partial g_0(x)/\partial x) \cdot (\partial \Pi_0(z)/\partial z)$ and the rank condition, $g_0(x)$ is locally identified on $x$ corresponding to a point of $z$ in the relevant set $\mathcal{Z}^r$. As such a point of $z$ varies within $\mathcal{Z}^r$, the $x$ corresponding to it also varies enough to cover almost the entire support of $x$. At the same time, for any $x$ corresponding to an irrelevant $z$ (i.e., $z$ outside of $\mathcal{Z}^r$), one can always find $z$ inside of $\mathcal{Z}^r$ that gives the same value of such an $x$. The probability $\Pr[z \in \mathcal{Z}^r]$ being small but bounded away from zero only affects the efficiency of estimators in the estimation stage. This issue is related to the weak identification concept discussed later; see Section 3.
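The local step in this intuition can be spelled out for the scalar case ($d_x = d_z = 1$, no $z_1$), writing $\lambda_0(v) = E[\varepsilon \mid v]$ for the control function. The display below is an illustrative restatement that uses only the model and the control function representation; it is not a substitute for the formal proof in the Appendix.

```latex
\begin{align*}
E[y \mid v, z] &= g_0(\Pi_0(z) + v) + \lambda_0(v), \\
\frac{\partial E[y \mid v, z]}{\partial z}
  &= \frac{\partial g_0(x)}{\partial x}\bigg|_{x = \Pi_0(z) + v}
     \cdot \frac{\partial \Pi_0(z)}{\partial z}, \\
\frac{\partial g_0(x)}{\partial x}\bigg|_{x = \Pi_0(z) + v}
  &= \frac{\partial E[y \mid v, z] / \partial z}{\partial \Pi_0(z) / \partial z}
  \qquad \text{for } z \in \mathcal{Z}^r .
\end{align*}
```

The last line is well defined exactly where the slope of the reduced form is nonzero, and integrating the identified derivative over $x$ recovers $g_0$ up to an additive constant, in line with the statement of Theorem 2.2.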

Note that the strength of identification of $g_0(x)$ is different for different subsets of $\mathcal{X}$. For instance, identification must be strong in a subset of $\mathcal{X}$ corresponding to a subset of $\mathcal{Z}$ where $\Pi_0(\cdot)$ is steep. In addition, $g_0(x)$ is over-identified on a subset of $\mathcal{X}$ that corresponds to multiple subsets of $\mathcal{Z}$ where $\Pi_0(\cdot)$ has a nonzero slope, since each association of $x$ and $z$ contributes to identification. This discussion implies that the shape of $\Pi_0(\cdot)$ provides useful information on the strength of identification in different parts of the domain of $g_0(x)$.

Lastly, it is worth mentioning that the separable structure of the reduced form along with ID2′(a)(iii) allows “global extrapolation” in a manner that is analogous to that in a linear model. With a linear model for the reduced form, the local rank condition (2.4) is a global rank condition. That is, the linearity of the function contributes to globally extrapolating the reduced-form relationship. Likewise, the identification results of this paper imply that although the reduced-form function is nonparametric, the way that the additive error interacts with the invariant support enables the global extrapolation of the relationship.

The identification results of this section apply to economically relevant situations. Let $x$ be an economic agent's optimal decision induced by an economic model and $z$ be a set of exogenous components in the model that affects the decision $x$. One is interested in a nonlinear effect of the optimal choice on a certain outcome $y$ in the model. We present two situations where the resulting $\Pr[z \in \mathcal{Z}^r]$ is strictly less than unity in this economic problem: (a) $x$ is realized as a corner solution beyond a certain range of $z$. In a returns-to-schooling example, $x$ can be the schooling decision of a potential worker, $z$ the tuition cost or distance to school, and $y$ the future earnings. When the tuition cost or the distance to school exceeds a certain threshold, such an instrument may no longer affect the decision to go to school. (b) The budget set has kink points. In a labor supply curve example, $x$ is the before-tax income, which is determined by the labor supply decision, $z$ the worker's characteristics that shift her utility function, and $y$ the wage. If an income tax schedule has kink points, then the $x$ realized at such points may be invariant to shifts of the utility function. The identification results of this paper imply that even in these situations, the returns to schooling or the labor supply curve can be fully identified nonparametrically as long as $\Pr[z \in \mathcal{Z}^r] > 0$.

3 Weak Identification

The structure of the joint distribution of $x$ and $z$ that contributes to the identification of $g(\cdot)$ is discussed in the previous section. Specifically, the rank condition (2.4) imposes a minimal restriction for identification on the shape of the conditional mean function $E[x \mid z] = \Pi(z)$. This necessity result suggests that a “slight violation” of (2.4) will result in weak identification of $g(\cdot)$. Note that this approach will not be successful with (2.3) of NPV, since violating the condition, i.e., $\Pr[\operatorname{rank}(\partial \Pi_0(z)/\partial z') = d_x] < 1$, can still result in strong identification.

In this section, we formally construct the notion of weak identification via localization. We define nonparametric weak instruments as a drifting sequence of reduced-form functions that are localized around a function with no identification power. Such a sequence of models or drifting data-generating process (Davidson and MacKinnon (1993)) is introduced to define weak instruments relative to the sample size $n$. As a result, the strength of instruments is represented in terms of the rate of localization, and hence, it can eventually be reflected in the local asymptotics of the estimator of $g_0(\cdot)$.

Let $\mathcal{C}(\mathcal{Z})$ be the class of conditional mean functions $\Pi(\cdot)$ on $\mathcal{Z}$ that are bounded, Lipschitz and continuously differentiable. Note that the derivative of $\Pi(\cdot)$ is bounded by the Lipschitz condition.

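Define a non-identification region $\mathcal{C}_0(\mathcal{Z})$ as a class of functions that satisfy the lack-of-identification condition, namely, the contraposition of the necessity part of the rank condition:⁸ $\mathcal{C}_0(\mathcal{Z}) = \{\Pi(\cdot) \in \mathcal{C}(\mathcal{Z}) : \Pr[\operatorname{rank}(\partial \Pi(z)/\partial z') < d_x] = 1\}$. Define an identification region as $\mathcal{C}_1(\mathcal{Z}) = \mathcal{C}(\mathcal{Z}) \setminus \mathcal{C}_0(\mathcal{Z})$.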

We consider a sequence of triangular models $y = g_0(x) + \varepsilon$ and $x = \Pi_n(z) + v$ with corresponding stochastic assumptions, where $\Pi_n(\cdot)$ is a drifting sequence of functions in $\mathcal{C}_1(\mathcal{Z})$ that drifts in a certain sense to a function $\bar\Pi(\cdot)$ in $\mathcal{C}_0(\mathcal{Z})$. Although $g(x)$ is identified with $\Pi_n(\cdot) \in \mathcal{C}_1(\mathcal{Z})$ for any fixed $n$ by Theorem 2.2, $g(x)$ is only weakly identified as $\Pi_n(\cdot)$ drifts toward $\bar\Pi(\cdot)$. Namely, the noise (i.e., $v$) contributes more than the signal (i.e., $\Pi_n(z)$) to the total variation of $x \in \{\Pi_n(z) + v : z \in \mathcal{Z}, v \in \mathcal{V}\}$ as $n \to \infty$. In order to facilitate a meaningful asymptotic theory in which the effect of weak instruments is reflected, we further proceed by considering a specific sequence of $\Pi_n(\cdot)$.

Assumption L (Localization) For some $\delta > 0$, the true reduced-form function $\Pi_n(\cdot)$ satisfies the following. For some $\tilde\Pi(\cdot) \in \mathcal{C}_1(\mathcal{Z})$ that does not depend on $n$ and for $z \in \mathcal{Z}$,
$$\frac{\partial \Pi_n(z)}{\partial z'} = n^{-\delta} \cdot \frac{\partial \tilde\Pi(z)}{\partial z'} + o_p(n^{-\delta}).$$

Assumption L is equivalent to
$$\Pi_n(z) = n^{-\delta} \cdot \tilde\Pi(z) + c + o_p(n^{-\delta}) \tag{3.1}$$
for some constant vector $c$. This specification of a uniformly convergent sequence over $\mathcal{Z}$ can be justified by our identification analysis: if the convergence occurs only in some part of $\mathcal{Z}$, its limit would not result in the lack of identification by (2.4). The “local nesting” device in (3.1) is also used in Stock and Wright (2000) and Jun and Pinkse (2012) among others. In contrast to these papers, the value of $\delta$ measures the strength of identification here and is not specified to be 1/2.⁹ Unlike a linear reduced form, to characterize weak instruments in a more general nonparametric reduced form, we need to control the complete behavior of the reduced-form function, and the derivation of local asymptotic theory seems to be more demanding. Nevertheless, the particular sequence considered in Assumption L makes the weak instrument asymptotic theory straightforward while embracing the most interesting local alternatives against non-identification.

⁸ The lack of identification condition is satisfied either when the order condition fails ($d_z < d_x$), or when $z$ are jointly irrelevant for one or more of $x$, almost everywhere in their support.

⁹ It would be interesting to have different rates across columns or rows of $\partial\tilde\Pi(\cdot)/\partial z'$. One can also consider different rates for different elements of the matrix. The analyses in these cases can analogously be done by slight modifications of the arguments.
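The localization in Assumption L can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical $\tilde\Pi(z) = \sin(\pi z)$, $z$ uniform on $[-1, 1]$, standard normal $v$, and $c = 0$, and it reports the share of the sample variation of $x$ that comes from the signal $\Pi_n(z) = n^{-\delta}\tilde\Pi(z)$.

```python
import numpy as np

def signal_share(n, delta, rng):
    """Share of var(x) due to the signal when Pi_n(z) = n**(-delta) * Pi_tilde(z)."""
    z = rng.uniform(-1.0, 1.0, size=n)     # instrument (assumed distribution)
    v = rng.normal(size=n)                 # reduced-form error (assumed N(0, 1))
    pi_tilde = np.sin(np.pi * z)           # hypothetical fixed reduced form
    signal = n ** (-delta) * pi_tilde      # localized reduced form with c = 0
    x = signal + v
    return signal.var() / x.var()

rng = np.random.default_rng(0)
# The signal share shrinks as n grows, so the noise v dominates x.
shares = [signal_share(n, delta=0.25, rng=rng) for n in (100, 10_000, 1_000_000)]
print(shares)
```

Under these assumptions the variance of the signal is proportional to $n^{-2\delta}$, so a larger $\delta$ makes the share vanish faster.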

4 Estimation

Once the endogeneity is controlled by the control function in (2.2), the problem becomes one of estimating the additive nonparametric regression function $E[y|x,z] = g_0(x) + \lambda_0(v)$. In a weak instrument environment, however, we face a nonstandard problem called concurvity: $x = \Pi_n(z) + v \to v$ a.s. as $n \to \infty$ under the weak instrument specification (3.1) of Assumption L with $c = 0$ as normalization. With a series representation $g_0(x) + \lambda_0(v) = \sum_{j=1}^{\infty}\{\beta_{1j}p_j(x) + \beta_{2j}p_j(v)\}$, where the $p_j(\cdot)$'s are the approximating functions, it becomes a familiar problem of multicollinearity as $p_j(x) \to p_j(v)$ a.s. for all $j$. More precisely, $p_j(x) - p_j(v) = O_{a.s.}(n^{-\delta})$ by the mean value expansion
$$p_j(v) = p_j\big(x - n^{-\delta}\tilde\Pi(z)\big) = p_j(x) - n^{-\delta}\tilde\Pi(z)\,\partial p_j(\tilde x)/\partial x$$
with an intermediate value $\tilde x$.¹⁰ Alternatively, by plugging $p_j(v)$ back into the series, we obtain
$$E[y|x,z] = \sum_{j=1}^{\infty}\Big\{(\beta_{1j} + \beta_{2j})\,p_j(x) - \beta_{2j}\,n^{-\delta}\tilde\Pi(z)\,\partial p_j(\tilde x)/\partial x\Big\},$$
and $n^{-\delta}\tilde\Pi(z)\,\partial p_j(\tilde x)/\partial x$ can be seen as a regressor whose variation shrinks as $n \to \infty$.
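The multicollinearity that concurvity induces can be seen in the conditioning of the second-moment matrix of the series terms. The sketch below is illustrative only: it assumes a hypothetical $\tilde\Pi(z) = z$, raw power-series terms instead of orthogonal polynomials, and made-up distributions for $z$ and $v$. It compares the condition number of the Gram matrix of $(p_j(x), p_j(v))_j$ for a fixed reduced form ($\delta = 0$) and a localized one ($\delta > 0$).

```python
import numpy as np

def gram_condition_number(n, delta, degree=3, seed=0):
    """Condition number of P'P/n with P = [x, ..., x^degree, v, ..., v^degree]."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0, size=n)
    v = rng.normal(size=n)
    x = n ** (-delta) * z + v              # hypothetical Pi_tilde(z) = z, c = 0
    cols = [x ** j for j in range(1, degree + 1)]
    cols += [v ** j for j in range(1, degree + 1)]
    P = np.column_stack(cols)
    return np.linalg.cond(P.T @ P / n)

cond_fixed = gram_condition_number(5000, delta=0.0)      # non-drifting reduced form
cond_localized = gram_condition_number(5000, delta=0.4)  # weak instrument
print(cond_fixed, cond_localized)
```

Because $p_j(x) - p_j(v)$ is of order $n^{-\delta}$, the localized design is much closer to singular than the fixed one at the same sample size.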

This feature is reminiscent of the ill-posed inverse problem in the literature, although the sources of ill-posedness in the two problems are different. The ill-posed inverse problem is a function space inversion problem that commonly occurs, e.g., in estimating a standard NPIV model of $y = g_0(x) + \varepsilon$ with endogenous $x$ and $E[\varepsilon|z] = 0$. Again by writing $g_0(x) = \sum_{j=1}^{\infty}\beta_{1j}p_j(x)$, it follows that $E[y|z] = \sum_{j=1}^{\infty}\beta_{1j}E[p_j(x)|z]$. Analogous to the weak instruments problem, the variation of the regressor $E[p_j(x)|z]$ shrinks since $E\big[E[p_j(x)|z]^2\big] \to 0$ as $j \to \infty$ (Kress (1999, p. 235)). Blundell and Powell (2003, p. 321) also acknowledge that the ill-posed inverse problem is a functional analogue to the multicollinearity problem in a classical linear regression model.

In order to tackle the ill-posed inverse problem, two approaches are taken in the NPIV literature: the truncation method and the penalization method. Using the foregoing example, the truncation method is to regularize the problem by truncating the infinite sum, that is, by considering $j \le J_n$ for some $J_n < \infty$ for all $n$. In this way, the variation of the regressor $E[p_j(x)|z]$ is prevented from shrinking. The penalization method is used to directly control the behavior of the coefficients $\beta_{1j}$ for all $j < \infty$ by penalizing them while maintaining the original infinite-dimensional approximation. In Chen and Pouzo (2012) closely related concepts are used with different terminologies: minimizing a criterion over finite sieve space and minimizing a criterion over infinite sieve space with a Tikhonov-type penalty. They introduce the penalized sieve minimum distance estimator, which essentially incorporates both the truncation and penalization methods. See also, e.g., Newey and Powell (2003), Ai and Chen (2003), Blundell et al. (2007), Hall and Horowitz (2005), Horowitz and Lee (2007) for regularization methods used in the NPIV framework.

¹⁰ This expansion is useful later in deriving the local asymptotic results in Sections 5 and 6.

The regularization approach for multicollinearity in linear regression models is a biased estimation method called ridge regression (Hoerl and Kennard (1970)), which can be seen as a penalization method. Given the connections between the weak instruments, concurvity, and ill-posed inverse problems, the regularization methods used in the realm of research concerning inverse problems (Carrasco et al. (2007)) are suitable for use with weak instruments. In this paper, we introduce the penalization scheme. The nature of our problem is such that the truncation method does not work properly. Unlike in the ill-posed inverse problem, the estimators of $\beta_{1j}$ and $\beta_{2j}$ above can still be unstable even after truncating the series since we still have $p_j(x) \to p_j(v)$ a.s. for $j \le J < \infty$ as $n \to \infty$. On the other hand, the penalization directly controls the behavior of the $\beta_{1j}$'s and $\beta_{2j}$'s, and hence, it successfully regularizes the weak instrument problem.

We propose a penalized series estimation procedure for $h_0(w) = g_0(x) + \lambda_0(v)$ where $w = (x, v)$.

We choose to use series estimation rather than other nonparametric methods as it is more suitable in our particular framework. Because $x \to v$ a.s., the joint density of $w$ becomes concentrated along a lower dimensional manifold as $n$ tends to infinity. As mentioned in the introduction, series estimators are less sensitive to this problem of singular density. See Assumption B below for related discussions. Furthermore, with series estimation, it is easy to impose the additivity of $h_0(\cdot)$ and to characterize the problem of weak instruments as a multicollinearity problem.

Since the reduced-form error $v$ is unobserved, the procedure takes two steps. In the first stage, we estimate the reduced form $\Pi_n(\cdot)$ using a standard series estimation method and obtain the residual $\hat v$. In the second stage, we estimate the structural function $h_0(\cdot)$ using a penalized series estimation method with $\hat w = (x, \hat v)$ as the regressors.

The theory that follows uses orthogonal polynomials as approximating functions in series estimation. The orthogonal polynomials considered in this paper are the Chebyshev and Legendre polynomials. The results of this paper also follow analogously using Fourier series. For detailed descriptions of power series in general, see Newey (1997).

Let $\{(y_i, x_i, z_i)\}_{i=1}^{n}$ be the data with $n$ observations, and let $r^L(z_i) = (r_1(z_i), ..., r_L(z_i))'$ be a vector of approximating functions of order $L$ for the first stage. Define the $n \times L$ matrix $R = (r^L(z_1), ..., r^L(z_n))'$. Then, regressing $x_i$ on $r^L(z_i)$ gives $\hat\Pi(\cdot) = r^L(\cdot)'\hat\gamma$ where $\hat\gamma = (R'R)^{-1}R'(x_1, ..., x_n)'$, and we obtain $\hat v_i = x_i - \hat\Pi(z_i)$. Define a vector of approximating functions of order $K$ for the second stage as $p^K(w) = (p_1(w), ..., p_K(w))'$. To reflect the additive structure of $h_0(\cdot)$, there are no interaction terms between the approximating functions for $g_0(\cdot)$ and those for $\lambda_0(\cdot)$ in this vector; see the Appendix for an explicit expression. This structure is the key to the dimension reduction featured in the asymptotic results. Denote the $n \times K$ matrix of approximating functions as $\hat P = (p^K(\hat w_1), ..., p^K(\hat w_n))'$ where $\hat w_i = (x_i, \hat v_i)$. Note that $L = L(n)$ and $K = K(n)$ grow with $n$, which implies that the size of the matrix $\hat P$ as well as that of $R$ also grows with $n$.
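The two-step construction of the regressors can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it uses raw power series in place of Chebyshev or Legendre polynomials, univariate $x$ and $z$, and hypothetical helper names (`power_basis`, `first_stage_residuals`, `additive_basis`). The second-stage basis is taken to be a constant plus `K1` terms in $x$ and `K2` terms in $\hat v$, with no interaction terms.

```python
import numpy as np

def power_basis(u, order, include_constant=True):
    """Columns u^0 (optional), u^1, ..., u^order for a 1-d array u."""
    start = 0 if include_constant else 1
    return np.column_stack([u ** j for j in range(start, order + 1)])

def first_stage_residuals(x, z, L):
    """Series regression of x on r^L(z); returns v_hat = x - Pi_hat(z)."""
    R = power_basis(z, L - 1)                         # n x L matrix R
    gamma_hat, *_ = np.linalg.lstsq(R, x, rcond=None)
    return x - R @ gamma_hat

def additive_basis(x, v_hat, K1, K2):
    """Additive basis in w_hat = (x, v_hat): no x-by-v interaction terms."""
    return np.column_stack([
        np.ones_like(x),                                  # common constant
        power_basis(x, K1, include_constant=False),       # terms for g_0
        power_basis(v_hat, K2, include_constant=False),   # terms for lambda_0
    ])
```

The matrix returned by `additive_basis` plays the role of $\hat P$ in the second stage; how the constant is counted in $K$ is a choice made here for illustration.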

We define a penalized series estimator:
$$\hat h_\tau(w) = p^K(w)'\hat\beta_\tau, \tag{4.1}$$
where the “interim” estimator $\hat\beta_\tau$ optimizes a penalizing criterion function
$$\hat\beta_\tau = \arg\min_{\tilde\beta \in \mathbb{R}^K}\big(y - \hat P\tilde\beta\big)'\big(y - \hat P\tilde\beta\big)/n + \tau_n\tilde\beta' D_{K^*}\tilde\beta, \tag{4.2}$$
where $y = (y_1, ..., y_n)'$, $D_{K^*} = \mathrm{Diag}\{0, ..., 0, 1, ..., 1\}$ is a diagonal matrix that satisfies $\tilde\beta' D_{K^*}\tilde\beta = \sum_{j=K^*+1}^{K}\tilde\beta_j^2$, and $\tau_n \ge 0$ is the penalization parameter. In (4.2), the standard $L_2$ penalty is tailored to meet our purpose. We introduce such a modification and allow $K^* = K^*(n)$ to grow with $n$ to gain sufficient control over the bias created by the penalization in this nonparametric weak instrument setting; see Assumption D below for related discussions. Note that the penalty term $\tau_n\tilde\beta' D_{K^*}\tilde\beta$ penalizes the coefficients on the higher-order series, which effectively imposes smoothness restrictions on $h_0(\cdot)$. For the same purpose of controlling the bias, $\tau_n$ is assumed to converge to zero. The optimization problem (4.2) yields a closed form solution:
$$\hat\beta_\tau = \big(\hat P'\hat P + n\tau_n D_{K^*}\big)^{-1}\hat P' y.$$
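The closed-form solution of the penalized criterion, $\hat\beta_\tau = (\hat P'\hat P + n\tau_n D_{K^*})^{-1}\hat P' y$, can be computed directly. The sketch below is a minimal illustration; the function name and argument names are hypothetical, and the basis matrix is assumed to be ordered so that the last $K - K^*$ columns are the ones to be penalized.

```python
import numpy as np

def penalized_series_coefficients(P_hat, y, tau, K_star):
    """Solve (P'P + n * tau * D) beta = P'y with D = Diag{0,...,0,1,...,1}."""
    n, K = P_hat.shape
    d = np.zeros(K)
    d[K_star:] = 1.0                       # penalize coefficients K*+1, ..., K
    D = np.diag(d)
    lhs = P_hat.T @ P_hat + n * tau * D
    # A linear solve is used instead of forming the inverse explicitly.
    return np.linalg.solve(lhs, P_hat.T @ y)
```

With `tau = 0` the function reduces to ordinary least squares on $\hat P$, which is only well behaved when $\hat P'\hat P$ is far from singular.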

The multicollinearity feature discussed above is manifested here by the fact that the matrix $\hat P'\hat P$ is nearly singular under Assumption L, since the two columns of $\hat P$ become nearly identical. The term $n\tau_n D_{K^*}$ mitigates such singularity, without which the performance of the estimator of $h(\cdot)$ would deteriorate severely.¹¹ The relative effects of weak instruments and penalization determine the asymptotic performance of $\hat h_\tau(\cdot)$. Given $\hat h_\tau(\cdot)$, with the normalization that $\lambda(\bar v) = \bar\lambda$, we have
$$\hat g_\tau(x) = \hat h_\tau(x, \bar v) - \bar\lambda.$$

5 Consistency and Rate of Convergence

First, we state the regularity conditions and key preliminary results under which we find the rate of convergence of the penalized series estimator introduced in the previous section. Let $X = (x, z)$.

Assumption A $\{(y_i, x_i, z_i) : i = 1, 2, ...\}$ are i.i.d. and $\operatorname{var}(x|z)$ and $\operatorname{var}(y|X)$ are bounded functions of $z$ and $X$, respectively.

¹¹ One can easily see that a similar feature can be found in a linear setting analogous to the present triangular model on applying a control function approach there. Note that the control function estimator is equivalent to the usual two-stage least squares (TSLS) estimator in such a linear model. Therefore, the problem of weak instruments with the TSLS estimator analyzed in Staiger and Stock (1997) can be seen as a multicollinearity problem with the “control function estimator.” In linear settings, however, the introduction of a regularization method discussed in the present paper is not appealing, since it creates the well-known biased estimator of ridge regression. In contrast, we do not directly interpret $\hat\beta_\tau$ in the current nonparametric setting, since it is only an interim estimator calculated to obtain $\hat h_\tau(\cdot)$. More importantly, the overall bias of $\hat h_\tau(\cdot)$ is unlikely to be worsened in the sense that the additional bias introduced by penalization can be dominated by the existing series estimation bias. This is, in fact, proved to be true in the convergence rate by letting $K^*$ grow with $n$ as $K$ does.

Assumption B $(z, v)$ is continuously distributed with density that is bounded away from zero on $\mathcal{Z} \times \mathcal{V}$, and $\mathcal{Z} \times \mathcal{V}$ is a Cartesian product of compact, connected intervals.

Assumption B is useful to bound below and above the eigenvalues of the “transformed” second moment matrix of approximating functions. This condition is worthy of discussion in the context of identification and weak identification. Let $f_u$ and $f_w$ denote the density functions of $u = (z, v)$ and $w = (x, v)$, respectively. An identification condition like Assumption ID2′ in Section 2 is embodied in Assumption B. To see this, note that $f_u$ being bounded away from zero means that there is no functional relationship between $z$ and $v$, which in turn implies Assumption ID2′(a)(iii).¹² On the other hand, an assumption written in terms of $f_w$ like Assumption 2 in NPV (p. 574) cannot be imposed here. Observe that $w = (\Pi_n(z) + v, v)$ depends on the behavior of $\Pi_n(\cdot)$, and hence, $f_w$ is not bounded away from zero uniformly over $n$ under Assumption L, but instead approaches a singular density. Technically, making use of the mean value expansion and a transformation matrix (see the Appendix), an assumption is made in terms of $f_u$, which is not affected by weak instruments, and the effect of weak instruments can be handled separately in the asymptotics proof. Note that the assumption of a Cartesian product of supports, namely $\mathcal{Z} \times \mathcal{V}$, and its compactness can be replaced by introducing a trimming function as in NPV that ensures bounded rectangular supports. Assumption B can be weakened to hold only for some component of the distribution of $z$. Some components of $z$ can be allowed to be discrete as long as they have finite supports.

Next, Assumption C is a smoothness assumption on the structural and reduced-form functions. For a generic dimension $d$ and a $d$-vector $\mu$ of nonnegative integers, let $|\mu| = \sum_{l=1}^{d}\mu_l$. Define the derivative $\partial^\mu g(x) = \partial^{|\mu|}g(x)/\partial x_1^{\mu_1}\partial x_2^{\mu_2}\cdots\partial x_{d_x}^{\mu_{d_x}}$ of order $|\mu|$. Let $\omega(\cdot)$ be a positive continuous weight function on $\mathcal{X}$ and let $\|g\|_{\mathcal{X},\omega} = \int_{\mathcal{X}}\big|\partial^{(1,...,1)}g(x)\big|\,\omega(x)\,dx$ be a weighted seminorm of $g(\cdot)$; for a univariate example of this seminorm, see (4.5) of Trefethen (2008). Let $\mathcal{W}$ be the support of $w = (x, v)$.

Assumption C $g_0(x)$ and $\lambda_0(v)$ are Lipschitz and absolutely continuously differentiable of order $s$ on $\mathcal{W}$, and $\|\partial^\mu g_0\|_{\mathcal{X},\omega} + \|\partial^\mu\lambda_0\|_{\mathcal{V},\omega}$ is bounded by a fixed constant for $|\mu| = s$. $\Pi_n(z)$ is bounded, Lipschitz, and absolutely continuously differentiable of order $s_\pi$ on $\mathcal{Z}$, and for the $m$-th element $\Pi_{n,m}(z)$, $\|\partial^\mu\Pi_{n,m}\|_{\mathcal{Z},\omega}$ is bounded by a fixed constant for $|\mu| = s_\pi$, and for all $n$ and $m \le d_x$.

As shown in the following lemma, this assumption on differentiability and bounded variation ensures that the coefficients on the series expansions have particular decay rates and that the approximation errors shrink at particular rates as the number of approximating functions increases. Recall that $p_j(w)$ and $r_j(z)$ are (the tensor products of) orthogonal polynomials on $\mathcal{W}$ and $\mathcal{Z}$, respectively. We present results for the Chebyshev polynomials here, and results for the Legendre polynomials can be found in the Appendix.

¹² The definition of a functional relationship can be found, e.g., in NPV (p. 568).

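Lemma 5.1 Under Assumptions B and C, the series expansions $\sum_{j=1}^{\infty}\beta_j p_j(w)$ of $h_0(w)$ and $\sum_{j=1}^{\infty}\gamma_{nj}r_j(z)$ of $\Pi_n(z)$ for all $n$ are uniformly convergent, and for a generic positive constant $c$ and for all $n$,

(i) $|\beta_j| \le c\,j^{-s/d_x - 1}$ and $|\gamma_{nl}| \le c\,l^{-s_\pi/d_z - 1}$;

(ii) $\sup_{w\in\mathcal{W}}\big|h_0(w) - \sum_{j=1}^{\tilde K}\beta_j p_j(w)\big| \le c\,\tilde K^{-s/d_x}$ and $\sup_{z\in\mathcal{Z}}\big|\Pi_n(z) - \sum_{j=1}^{\tilde L}\gamma_{nj}r_j(z)\big| \le c\,\tilde L^{-s_\pi/d_z}$

for all positive integers $\tilde K$ and $\tilde L$.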

Lemma 5.1(i) is closely related to source conditions used in the NPIV literature, where the decay rate of coefficients is assumed to be part of them. For example, Hall and Horowitz (2005) assume that the rate is $O(j^{-b})$ for some constant $b > 1/2$. Darolles et al. (2011) incorporate a similar assumption within a source condition. See also Chen and Pouzo (2012). Such an assumption is an abstract assumption on smoothness and is agnostic about dimensionality. This paper does not assume but explicitly derives the decay rate from a conventional smoothness condition of Assumption C. In this way, Lemma 5.1 connects the abstract smoothness condition (i.e., a source condition) with a more standard smoothness concept (i.e., differentiability). Note that the result is also informative about dimensionality. This result is novel in the literature and is useful later in showing that the rate of the penalty bias is dominated by the rate of the approximation bias. The approximation error bound of (ii) is implied from (i).¹³ In Lemma 5.1(i) and hence in (ii), the dimension of $w$ is reduced to the dimension of $x$ as the additive structure of $h_0(w)$ is exploited. Relevant discussion can be found in Assumption A.2 and p. 472 of Andrews and Whang (1990) and in Theorem 3.2 of Powell (1981, p. 26).

The next regularity condition restricts the rate of growth of the numbers $K$ and $L$ of the approximating functions and $K^*$ in the penalty term. Let $a_n \asymp b_n$ denote that $a_n/b_n$ is bounded below and above by constants that are independent of $n$.

Assumption D $n^{2\delta}K^{7/2}\big(\sqrt{L/n} + L^{-s_\pi/d_z}\big) \to 0$, $L^3/n \to 0$, and $n^{\cdots}K^{6} \to 0$. Also, $K > 2(K^* + 1)$ for any $n$ and $K^* \asymp K$.

The conditions on $K$ and $L$ are more restrictive than the corresponding assumption for the power series in NPV (Assumption 4, p. 575) where weak instruments are not considered. The last inequality requires that $K$ be sufficiently larger than $K^*$ so that the penalty term $\tau_n\tilde\beta' D_{K^*}\tilde\beta$ plays a sufficient role. Technically, this condition allows the derivation of the asymptotic results to be similar to a simpler case of a fixed $K^*$.

Now, we provide results for the rate of convergence in probability of the penalized series estimator $\hat h_\tau(w)$ in terms of $L_2$ and uniform distance. Let $F_w(w) = F(w)$ be the distribution function of $w$.

Theorem 5.1 Suppose Assumptions A–D and L are satisfied. Let $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\}$. Then
$$\left\{\int\big[\hat h_\tau(w) - h_0(w)\big]^2 dF(w)\right\}^{1/2} = O_p\Big(R_n\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big).$$
Also,
$$\sup_{w\in\mathcal{W}}\big|\hat h_\tau(w) - h_0(w)\big| = O_p\Big(R_n \cdot K\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big).$$

¹³ No result analogous to (i) is derived in Lorentz (1986) when showing (ii), but a different approach is used under an assumption similar to Assumption C of the present paper. See Theorem 8 of Lorentz (1986, p. 90).

Suppose there is no penalization ($\tau_n = 0$). Then with $R_n = n^{\delta}$, Theorem 5.1 provides the rates of convergence of the unpenalized series estimator $\hat h(\cdot)$. For example, with $\|\cdot\|_{L_2}$ denoting the $L_2$ norm above,
$$\big\|\hat h - h_0\big\|_{L_2} = O_p\Big(n^{\delta}\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big). \tag{5.1}$$

Compared to the strong instrument case of NPV (Lemma 4.1, p. 575), the rate deteriorates by the leading $n^{\delta}$ rate, the weak instrument rate. Note that the terms $\sqrt{K/n}$ and $K^{-s/d_x}$ correspond to the variance and bias of the second stage estimator, respectively, and $\sqrt{L/n}$ and $L^{-s_\pi/d_z}$ are those of the first stage estimator. The latter rates appear here due to the fact that the residuals $\hat v_i$ are generated regressors obtained from the first-stage nonparametric estimation.

The way that $n^{\delta}$ enters into the rate implies that the effect of weak instruments (hence concurvity) exacerbates not only the variance but also the bias. This is different from a linear case where multicollinearity only results in imprecise estimates but does not introduce bias. Moreover, the symmetric effect of weak instruments on bias and variance implies that the problem of weak instruments cannot be resolved by the choice of the number of terms in the series estimator. This is also related to the discussion in Section 4 that the truncation method does not work as a regularization method for weak instruments.

More importantly, in the case where penalization is in operation ($\tau_n > 0$), the way that $R_n$ enters into the convergence rates implies that penalization can reduce both bias and variance by the same mechanism working in an opposite direction to the effect of weak instruments. Explicitly, when the effect of penalization dominates the effect of weak instruments, i.e., when $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\} = \tau_n^{-1/2}$, penalization plays a role and yields the rate
$$\big\|\hat h_\tau - h_0\big\|_{L_2} = O_p\Big(\tau_n^{-1/2}\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big). \tag{5.2}$$
Here, the overall rate is improved since the multiplying rate $\tau_n^{-1/2}$ is of smaller order than the multiplying rate $n^{\delta}$ of the previous case. Interestingly, the additional bias term introduced by penalization, namely, the penalty bias, is not present in the rate as it is shown to be dominated by the approximation bias term $K^{-s/d_x}$. This result is because of Lemma 5.1; see the Appendix.

When the effect of weak instruments dominates the effect of penalization, the rate becomes (5.1).

Next, we find the optimal $L_2$ convergence rate. For a more concrete comparison between the rates $n^{\delta}$ and $\tau_n^{-1/2}$, let $\tau_n = n^{-2\delta_\tau}$ for some $\delta_\tau > 0$. For example, the larger $\delta_\tau$ is, the faster the penalization parameter converges to zero, and hence, the smaller the effect of penalization is.

Corollary 5.2 (Consistency) Suppose the Assumptions of Theorem 5.1 are satisfied. Let $K = O(n^{1/(1+2s/d_x)})$ and $L = O(n^{1/(1+2s_\pi/d_z)})$. Then $\|\hat h_\tau - h_0\|_{L_2} = O_p(n^{-q}) = o_p(1)$, where
$$q = \min\left\{\frac{s}{d_x + 2s}, \frac{s_\pi}{d_z + 2s_\pi}\right\} - \min\{\delta, \delta_\tau\}.$$

Given the choice of $K$ and $L$ in the corollary, Assumption D implies that
$$\delta < \frac{1}{4} - \frac{7}{4(1 + 2s/d_x)} - \frac{1}{4(1 + 2s_\pi/d_z)}.$$
It can be readily shown that $q > 0$ from this bound on $\delta$. This bound, however, requires the value of $\delta$ to be small to ensure consistency. For example, when the nonparametric functions are smooth ($s = s_\pi = \infty$), it should be that $\delta < 1/4$. With weak instruments, optimal rates in the sense of Stone (1982) are not attainable. The uniform convergence rate does not attain Stone's (1982) bound even without the weak instrument factor (Newey (1997, p. 151)) and hence is not discussed here.

Corollary 5.2 has several implications. Let $d_x = d_z = d$ and $s = s_\pi$ for simplicity so that
$$q = \frac{s}{d + 2s} - \min\{\delta, \delta_\tau\}.$$
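The interplay between smoothness, dimension, instrument strength, and penalization in the rate exponent can be tabulated with a few lines of code. The sketch below takes $q = \min\{s/(d_x + 2s),\, s_\pi/(d_z + 2s_\pi)\} - \min\{\delta, \delta_\tau\}$ and the upper bound $\delta < 1/4 - 7/(4(1 + 2s/d_x)) - 1/(4(1 + 2s_\pi/d_z))$ as given; the function names and the numerical inputs are hypothetical.

```python
def rate_exponent(s, d_x, s_pi, d_z, delta, delta_tau):
    """Exponent q in the L2 rate O_p(n^{-q}) of the penalized series estimator."""
    smoothness_term = min(s / (d_x + 2 * s), s_pi / (d_z + 2 * s_pi))
    return smoothness_term - min(delta, delta_tau)

def delta_upper_bound(s, d_x, s_pi, d_z):
    """Upper bound on delta under K = n^{1/(1+2s/d_x)}, L = n^{1/(1+2s_pi/d_z)}."""
    return 0.25 - 7 / (4 * (1 + 2 * s / d_x)) - 1 / (4 * (1 + 2 * s_pi / d_z))

# Illustrative values: univariate x and z, eight derivatives each.
bound = delta_upper_bound(s=8, d_x=1, s_pi=8, d_z=1)
q_weak_dominates = rate_exponent(8, 1, 8, 1, delta=0.1, delta_tau=0.5)
q_penalty_dominates = rate_exponent(8, 1, 8, 1, delta=0.1, delta_tau=0.05)
print(bound, q_weak_dominates, q_penalty_dominates)
```

In this example the exponent is larger when $\delta_\tau < \delta$, which matches the statement that penalization yields a faster rate once its effect dominates.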

Consider a weak instruments-dominating case of $\delta < \delta_\tau$. When the structural function is less smooth or has a high dimensional argument (i.e., small $s$ or large $d$, and hence, small $\frac{s}{d+2s}$), instruments should not be too weak (i.e., small $\delta$) to achieve the same optimal rate (i.e., holding $q$ fixed). This implies a trade-off between the smoothness of the structural function (or the dimensionality) and the required strength of instruments. This, in turn, implies that the weak instrument problem can be mitigated with some smoothness restrictions, which is in fact one of our justifications for introducing the penalization method. Once the penalization effect is dominating ($\delta_\tau < \delta$), $q$ increases and a faster rate is achieved.

When implementing the penalized series estimator in practice, there remains the issue of choosing tuning parameters, namely, the penalization parameter $\tau_n$ and the orders $K$ and $L$ of the series. In the simulations, we present results with a few chosen values of $\tau$, $K$, and $L$. A data-driven procedure such as the cross-validation method can also be used (Arlot and Celisse (2010)). This method is used for choosing $\tau$ in the empirical section below. There is, however, no optimal method in the literature for choosing the tuning parameters (Blundell et al. (2007, pp. 1636–1637)).

It would be interesting to further investigate the sensitivity of the penalized estimator to the choice of $\tau$. Denote $\hat Q = \hat P'\hat P/n$ and let $D_{K^*} = I$ for simplicity. Recall that $\hat h_\tau(\cdot) = p^K(\cdot)'\hat\beta_\tau = p^K(\cdot)'(\hat Q + \tau I)^{-1}\hat P' y/n$ is a function of $\tau$. To determine the sensitivity of $\hat h_\tau(\cdot)$ to $\tau$, we consider the sharp bound on $\lambda_{\max}\big((\hat Q + \tau I)^{-1}\big)$, which is $\{\lambda_{\min}(\hat Q) + \tau\}^{-1}$. As a measure of sensitivity, we calculate the absolute value of the first derivative of the bound:
$$\left|\frac{\partial\big[\lambda_{\min}(\hat Q) + \tau\big]^{-1}}{\partial\tau}\right| = \big\{\lambda_{\min}(\hat Q) + \tau\big\}^{-2}. \tag{5.3}$$
Note that the sensitivity is a decreasing function of $\lambda_{\min}(\hat Q)$. That is, as instruments become weaker (i.e., $\lambda_{\min}(\hat Q)$ becomes smaller because of increased singularity), the performance of $\hat h_\tau(\cdot)$ becomes more sensitive to a change in $\tau$. This has certain implications in practice.

Theorem 5.1 leads to the following theorem, which focuses on the rate of convergence of the structural estimator $\hat g_\tau(\cdot)$ after subtracting the constant term which is not identified.
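The sensitivity measure $\{\lambda_{\min}(\hat Q) + \tau\}^{-2}$ is straightforward to evaluate. The following sketch is illustrative only: the two matrices are made-up stand-ins for $\hat Q$ with a moderate and a nearly zero smallest eigenvalue, and the function names are hypothetical.

```python
import numpy as np

def inverse_norm_bound(Q_hat, tau):
    """Largest eigenvalue of (Q_hat + tau * I)^{-1}, i.e. 1 / (lambda_min + tau)."""
    lam_min = np.linalg.eigvalsh(Q_hat)[0]   # eigvalsh returns ascending order
    return 1.0 / (lam_min + tau)

def sensitivity(Q_hat, tau):
    """Absolute derivative of the bound with respect to tau."""
    lam_min = np.linalg.eigvalsh(Q_hat)[0]
    return (lam_min + tau) ** (-2)

Q_strong = np.diag([1.0, 0.5])    # hypothetical well-conditioned case
Q_weak = np.diag([1.0, 0.01])     # hypothetical nearly singular case
print(sensitivity(Q_strong, 0.01), sensitivity(Q_weak, 0.01))
```

The nearly singular case reacts far more strongly to a small change in $\tau$, which is the practical concern raised in the text.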

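Theorem 5.3 Suppose Assumptions A–D and L are satisfied. Let $R_n = \min\{n^{\delta}, \tau_n^{-1/2}\}$. For $\hat\Delta(x) = \hat g_\tau(x) - g_0(x)$,
$$\left\{\int\Big[\hat\Delta(x) - \int\hat\Delta(x)\,dF_w\Big]^2 dF_w\right\}^{1/2} = O_p\Big(R_n\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big).$$
Also, if $\hat g_\tau(x) = \hat h_\tau(x, \bar v) - \bar\lambda$ and $\bar\lambda = \lambda_0(\bar v)$, then
$$\sup_{x\in\mathcal{X}}\big|\hat g_\tau(x) - g_0(x)\big| = O_p\Big(R_n K\big(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\big)\Big).$$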

The optimal rate results for $\hat g_\tau(\cdot)$ and the related analyses can be followed analogously, and we omit them here. The convergence rate is net of the constant term. We can further assume $E[\varepsilon] = 0$ to identify the constant.

Before closing this section, we discuss one of the practical implications of the identification and asymptotic results of this paper. In applied research that uses nonparametric triangular models, a linear specification of the reduced form is largely prevalent; see, for example, NPV, Blundell and Duncan (1998), Blundell et al. (1998), Yatchew and No (2001), Lyssiotou et al. (2004), Dustmann and Meghir (2005), and Del Bono and Weber (2008). While cases are rare where a linear reduced-form relationship is justified by economic theory, linear specification is introduced to avoid the curse of dimensionality with many covariates at hand, or for an ad hoc reason that it is easy to implement, or simply because the nonparametric structural equation is of primary interest. The convergence rates in Theorems 5.1 and 5.3 are simplified with a linear reduced form, since the nonparametric rate of the first stage ($\sqrt{L/n} + L^{-s_\pi/d_z}$) disappears as the parametric rate ($1/\sqrt{n}$) is dominated by the second-stage nonparametric rate ($\sqrt{K/n} + K^{-s/d_x}$). When the reduced form is linearly specified, however, any true nonlinear relationship is “flattened out,” and the situation is more likely to have the problem of weak instruments, let alone the problem of misspecification.

On the other hand, one can achieve a significant gain in the performance of the estimator by nonparametrically estimating the relationship between $x$ and $z$. According to the rank condition (2.4), i.e., $\Pr[\partial\Pi_0(z)/\partial z \ne 0] > 0$ with univariate $x$ and $z$, a small region where $x$ and $z$ are relevant contributes to the identification of $g(\cdot)$. This implies that identification power can be enhanced by exploiting the entire nonlinear relationship between $x$ and $z$. This phenomenon may be interpreted in terms of the “optimal instruments” in the GMM settings of Amemiya (1977); see also Newey (1990) and Jun and Pinkse (2012). Note that the nonparametric first stage estimation is not likely to worsen the overall convergence rate of the estimator, since the nonparametric rate from the second stage is already present.

6 Asymptotic Distributions

In this section, we establish the asymptotic normality of the linear functionals of the penalized series estimator $\hat h_\tau(\cdot)$. We consider linear functionals of $h_0(\cdot)$ that include $h_0(\cdot)$ at a certain value (i.e., $h_0(\bar w)$) and the weighted average derivative of $h_0(\cdot)$ (i.e., $\int\vartheta(w)[\partial h_0(w)/\partial x]\,dw$). The linear functionals of $h = h_0(\cdot)$ are denoted as $a(h)$. Then, the estimator $\hat\theta_\tau = a(\hat h_\tau)$ of $\theta_0 = a(h)$ is the natural “plug-in” estimator. Let $A = (a(p_{1K}), a(p_{2K}), ..., a(p_{KK}))$, where $p_{jK}(\cdot)$ is an element of $p^K(\cdot)$. Then,
$$\hat\theta_\tau = a(\hat h_\tau) = a\big(p^K(w)'\hat\beta_\tau\big) = A\hat\beta_\tau,$$
which implies that $\hat\theta_\tau$ is a linear combination of the two-step penalized least squares estimator $\hat\beta_\tau$. Then the following variance estimator of $a(\hat h_\tau)$ can be naturally defined:
$$\hat V_\tau = A\hat Q_\tau^{-1}\Big(\hat\Sigma_\tau + \hat H_\tau\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H_\tau'\Big)\hat Q_\tau^{-1}A',$$
$$\hat\Sigma_\tau = \sum_{i=1}^{n}p^K(\hat w_i)p^K(\hat w_i)'\big[y_i - \hat h_\tau(\hat w_i)\big]^2/n, \qquad \hat\Sigma_1 = \sum_{i=1}^{n}\hat v_i^2\,r^L(z_i)r^L(z_i)'/n,$$
$$\hat H_\tau = \sum_{i=1}^{n}p^K(\hat w_i)\Big\{\big[\partial\hat h_\tau(\hat w_i)/\partial w\big]'\,\partial w(X_i, \hat\Pi(z_i))/\partial\pi\Big\}r^L(z_i)'/n, \qquad \hat Q_1 = R'R/n,$$
where $X$ is a vector of variables that includes $x$ and $z$, and $w(X, \pi)$ is a vector of functions of $X$ and $\pi$ where $\pi$ is a possible value of $\Pi(z)$.

The following are additional regularity conditions for the asymptotic normality of $a(\hat h_\tau)$. Let $\eta = y - h$.
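For the simplest linear functional, evaluation at a point $a(h) = h(\bar w)$, the row vector $A$ is the basis evaluated at $\bar w$, and the plug-in estimator is $A\hat\beta_\tau$. The sketch below is only an illustration: it assumes a hypothetical additive power-series basis ordered as a constant, then `K1` powers of $x$, then `K2` powers of $v$, and the coefficient vector is made up.

```python
import numpy as np

def plug_in_point_value(beta_hat, x_bar, v_bar, K1, K2):
    """theta_hat = A @ beta_hat for a(h) = h(x_bar, v_bar) in an additive basis."""
    A = np.concatenate((
        [1.0],                                       # constant term
        [x_bar ** j for j in range(1, K1 + 1)],      # terms in x
        [v_bar ** j for j in range(1, K2 + 1)],      # terms in v
    ))
    return A @ beta_hat, A

# Made-up coefficients: h(x, v) = 1 + 2x + 0*x^2 + 3v.
beta_example = np.array([1.0, 2.0, 0.0, 3.0])
theta_hat, A = plug_in_point_value(beta_example, x_bar=0.5, v_bar=1.0, K1=2, K2=1)
print(theta_hat)
```

Because $A$ does not depend on the data once the basis is fixed, the estimator is linear in the coefficients, which is what makes the sandwich form of the variance estimator natural.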

Assumption E $\sigma^2(X) = \operatorname{var}(y|X)$ is bounded away from zero, $E[\eta^4|X]$ is bounded, and $E[\|v\|^4|X]$ is bounded. Also, $h_0(w)$ is twice continuously differentiable in $v$ with bounded first and second derivatives.

This assumption strengthens the boundedness of conditional second moments in Assumption A. For the next assumption, let $|h|_r = \max_{|\mu|\le r}\sup_{w\in\mathcal{W}}|\partial^\mu h(w)|$. Also let $\tilde p^K(w)$ be a generic vector of approximating functions.

Assumption F $a(h)$ is a scalar, $|a(h)| \le |h|_r$ for some $r \ge 0$, and there exists $\beta_K$ such that as $K \to \infty$, $a(\tilde p^{K\prime}\beta_K)$ is bounded away from zero while $E\big[\big(\tilde p^K(w)'\beta_K\big)^2\big] \to 0$.

This assumption includes the case of $h$ at a certain value. The next condition restricts the rate of growth of $K$ and $L$ and the rate of convergence of $\tau_n$.

Assumption G $\sqrt{n}K^{-s/d_x} \to 0$, $\sqrt{n}L^{-s_\pi/d_z} \to 0$, and $n^{\delta+\frac{1}{2}}\tau_n \to 0$; also, $n^{4(\delta-\frac{1}{6})}K^4L^{1/2} \to 0$, $n^{3(\delta-\frac{1}{4})}(K^3 + L^3) \to 0$, and $n^{\delta-\frac{1}{2}}\big[K^3L^{3/2} + K^4L^{1/2}(K + L)^{1/2}\big] \to 0$.

Assumption G introduces “overfitting.” The first two conditions, $\sqrt{n}K^{-s/d_x} \to 0$ and $\sqrt{n}L^{-s_\pi/d_z} \to 0$, imply overfitting in that the bias ($K^{-s/d_x}$) shrinks faster than $1/\sqrt{n}$, the usual rate of standard deviation of the estimator. The same feature is found in the corresponding assumption in NPV (Assumption 8, p. 582), while the overall conditions on $K$ and $L$ in Assumption G are stronger than that in NPV. At the same time, $\tau_n$ needs to converge fast enough to satisfy $n^{\delta+\frac{1}{2}}\tau_n \to 0$, which is stronger than what is required in the previous section. This condition also implies overfitting in that it requires less penalization, hence implicitly imposing less smoothness. The value of $\delta$ that satisfies Assumption G needs to be small, i.e., the instruments must be mildly weak, to obtain the asymptotic normality of $\hat\theta_\tau$.

Theorem 6.1 If Assumptions A–G and L are satisfied, then $\hat\theta_\tau = \theta_0 + O_p\big(n^{\delta-\frac{1}{2}}K^{1+2r}\big)$ and
$$\sqrt{n}\,\hat V_\tau^{-1/2}(\hat\theta_\tau - \theta_0) \to_d N(0, 1).$$

The proofs of this theorem and the following corollary can be found in Appendix B.4. In addition to asymptotic normality, the results also provide the bound on the convergence rate of $\hat\theta_\tau$. The bound on the rate achieved here is slower than that of NPV. The fact that the rate is slower (i.e., the multiplier of $(\hat\theta_\tau - \theta_0)$ is of smaller order) in the case of weak instruments can be seen as the effective sample size being small.

Under the following assumption, $\sqrt{n}$-consistency is achieved. Let $\tilde p^{*K}(z, v)$ be a “transformation” of $\tilde p^K(w)$ purged of the weak instruments effect (see the Appendix).

Assumption H There exists $\nu(w)$ and $\alpha_K$ such that $E\big[\|\nu(w)\|^2\big] < \infty$, $a(h) = E[\nu(w)h_0(w)]$, $a(p_j) = E[\nu(w)p_j(w)]$, and $E\big[\big\|\nu(\Pi_n(z) + v, v) - \tilde p^{*K}(z, v)'\alpha_K\big\|^2\big] \to 0$ as $K \to \infty$.

This assumption includes the case of the weighted average derivative of $h$, in which case $\nu(w) = -f_w(w)^{-1}\partial\vartheta(w)/\partial w$. Let $\rho(z) = E[\nu(w)\,\partial h_0(w)/\partial v'\,|\,z]$. Under this assumption, the asymptotic variance of $\hat\theta$ in this case can be expressed as
$$\bar V = E\big[\nu(w)\nu(w)'\operatorname{var}(y|X)\big] + E\big[\rho(z)\operatorname{var}(x|z)\rho(z)'\big].$$

Corollary 6.2 If Assumptions A–E, G, H and L are satisfied, then
$$\sqrt{n}(\hat\theta_\tau - \theta_0) \to_d N(0, \bar V), \qquad \hat V_\tau \to_p \bar V.$$

In this case, instruments can be regarded as nearly strong in that $\sqrt{n}$-consistency is achieved as if instruments are strong, as in NPV.

There still remain issues when the results of Theorem 6.1 and Corollary 6.2 are used for inference, e.g., for constructing pointwise asymptotic confidence intervals. As long as nuisance parameters are present, such an inferential procedure may depend on the strength of instruments or on the choice of the penalization parameter. Developing a robust procedure against weak instruments in nonparametric models is beyond the scope of our paper, and we leave it to future research.

Lastly, one closely related work to the asymptotic results of this paper is Jiang et al. (2010), where a model for DNA microarray data is transformed into an additive nonparametric model with highly correlated regressors. A more obvious example in their paper is a time series model such as $y_t = m_0 + \sum_{j=1}^{d}m_j(x_{t-j})$ where the lagged variables of $x_t$ are highly correlated. The authors establish pointwise asymptotic normality for local linear and integral estimators. Their findings show the way that the correlated regressors affect bias and variance. Once our problem is transformed to an additive nonparametric regression with concurvity, we can also develop weak instrument asymptotic theory for local linear and integral estimators of the functionals of $h_0(\cdot)$. Their results can be applied after taking into account the generated regressors ($\hat v_i$) that are involved in our problem.

7 Monte Carlo Simulations

In this section, we document the problems of weak instruments in nonparametric estimation and investigate the finite sample performance of the penalized estimator. We are particularly interested in the finite sample gain in terms of the bias, variance, and mean squared errors (MSE) of the penalized series estimators defined in Section 4 (“penalized IV (PIV) estimators”) relative to those of the unpenalized series estimators (“IV estimators”) for a wide range of strength of instruments. We also compare the IV estimators with series estimators that ignore endogeneity (“least squares (LS) estimators”). LS estimators are calculated by nonparametrically estimating the outcome equation without considering the reduced-form equation. We consider the following data generating process:

$$y = \Phi\left(\frac{x - \mu_x}{\sigma_x}\right) + \varepsilon, \qquad x = \pi_1 + z\pi + v,$$

where $y$, $x$, and $z$ are univariate, $z \sim N(\mu_z, \sigma_z^2)$ with $\mu_z = 0$ and $\sigma_z^2 = 1$, and $(\varepsilon, v)' \sim N(0, \Sigma)$ with $\Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$. Note that $|\rho|$ measures the degree of endogeneity, and we consider $\rho \in \{0.2, 0.5, 0.95\}$. The sample $\{z_i, \varepsilon_i, v_i\}$ is i.i.d. with size $n = 1000$. The number of simulation repetitions is $s \in \{500, 1000\}$.
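As a minimal sketch of how one sample from this design could be drawn, the following Python snippet simulates $(y, x, z)$ for given $\pi$ and $\rho$. The function name `simulate_dgp`, its arguments, and the seed handling are illustrative and not taken from the paper; the outcome function is assumed here to be the standard normal distribution function evaluated at the standardized regressor.

```python
import math
import numpy as np

def simulate_dgp(n, pi, rho, mu_x=2.0, mu_z=0.0, sigma_z=1.0, seed=0):
    """Draw one sample (y, x, z) from a triangular design (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_z, sigma_z, size=n)
    # (eps, v) are jointly normal with unit variances and correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    eps, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    pi_1 = mu_x - pi * mu_z                      # intercept, so that E[x] = mu_x
    x = pi_1 + pi * z + v
    sigma_x = math.sqrt(pi ** 2 * sigma_z ** 2 + 1.0)
    # ASSUMPTION: the outcome function is the standard normal CDF.
    standardized = (x - mu_x) / sigma_x
    g0 = 0.5 * (1.0 + np.vectorize(math.erf)(standardized / math.sqrt(2.0)))
    y = g0 + eps
    return y, x, z
```

For instance, `simulate_dgp(1000, pi=0.5, rho=0.5)` returns three arrays of length 1000; calling it with different seeds would give independent replications.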

We consider different strengths of the instrument by considering different values of $\pi$. Let the intercept $\pi_1 = \mu_x - \pi\mu_z$ with $\mu_x = 2$ so that $E[x] = \mu_x$ does not depend on the choice of $\pi$. Note that $\sigma_x^2 = \pi^2\sigma_z^2 + 1$ still depends on $\pi$, which is reasonable since the signal contributed to the total variation of $x$ is a function of $\pi$. More specifically, to measure the strength of the instrument, we define the concentration parameter (Stock and Yogo (2005)):

$$\mu^2 = \frac{\pi^2 \sum_{i=1}^{n} z_i^2}{\sigma_v^2}.$$

Note that since the dimension of $z$ is one, the concentration parameter value and the first-stage $F$-statistic are similar to each other. For example, in Staiger and Stock (1997), for $F = 30.53$ (strong instrument), a 97.5% confidence interval for $\mu^2$ is $[17.3, 45.8]$, and for $F = 4.747$ (weak instrument), a confidence interval for $\mu^2$ is $[2.26, 5.64]$. The candidate values of $\mu^2$ are $\{4, 8, 16, 32, 64, 128, 256\}$, which range from a weak to a strong instrument in the conventional sense.¹⁴ As for the penalization parameter $\tau$, we consider candidate values of $\{0.001, 0.005, 0.01, 0.05, 0.1\}$. We choose $K^* \in \{1, 3, 5\}$.
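To see how a candidate value of $\mu^2$ maps into a first-stage coefficient, the definition of the concentration parameter can be inverted for $\pi$. The sketch below does this with $\sigma_v^2 = 1$ by default; the helper names are hypothetical, and whether the simulations fix $\pi$ from the realized $\sum_i z_i^2$ in exactly this way is an assumption.

```python
import math

def concentration_parameter(pi, z, sigma_v=1.0):
    """mu^2 = pi^2 * sum_i z_i^2 / sigma_v^2 for a scalar instrument."""
    return pi ** 2 * sum(zi ** 2 for zi in z) / sigma_v ** 2

def pi_for_target(mu2, z, sigma_v=1.0):
    """Positive first-stage coefficient that yields the target mu^2 (hypothetical helper)."""
    return math.sqrt(mu2 * sigma_v ** 2 / sum(zi ** 2 for zi in z))

# Toy instrument values, chosen only so that the arithmetic is easy to follow.
z_toy = [1.0, -1.0, 2.0, 0.0]            # sum of squares is 6
pi_toy = pi_for_target(24.0, z_toy)      # sqrt(24 / 6) = 2
```

With $n = 1000$ and $\sigma_z^2 = 1$, $\sum_i z_i^2$ is close to 1000, so $\mu^2 = 16$ corresponds to $\pi$ of roughly $0.13$.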

The approximating functions used for $g_0(x)$ and $\lambda_0(v)$ are polynomials with different choices of $(K_1, K_2)$, where $K_1$ is the number of terms for $g_0(\cdot)$, $K_2$ for $\lambda_0(\cdot)$, and $K = K_1 + K_2$. Since $g_0(\cdot)$ and $\lambda_0(\cdot)$ are separately identified only up to an additive constant, we introduce the normalization $\lambda_0(1) = \rho$, where $\rho$ is chosen because of the joint normality of $(\varepsilon, v)$. Then, $g_0(x) = h_0(x, 1) - \rho$,

where $h(x, v) = g(x) + \lambda(v)$.

In the first part of the simulation, we calculate $\hat{g}_\tau(\cdot)$ and $\hat{g}_0(\cdot)$, the penalized and unpenalized IV estimates, respectively, and compare their performances. For different strengths of the instrument, we compute estimates with different values of the penalization parameter. We choose $K_1 = K_2 = 6$ and $\rho = 0.5$.¹⁵ As one might expect, the choice of orders of the series is not significant as long as we are only interested in comparing $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$. Figures 2–5 present some representative results. Results with different values of $\mu^2$ and $\tau$ are similar and hence are omitted to save space.

In Figure 2, we plot the mean of $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$ with concentration parameter $\mu^2 = 16$ and penalization parameter $\tau = 0.001$. The plots for the unpenalized estimate indicate that with the given strength of the instrument, the variance is very large, which implies that functions with any trends can fit within the 0.025–0.975 quantile ranges; it indicates that the bias is also large. In Figure 2, the graph for the penalized estimate shows that the penalization significantly reduces the variance so that the quantile range implies the upward trend of the true $g_0(\cdot)$. Note that the bias of $\hat{g}_\tau(\cdot)$ is no larger than that of $\hat{g}(\cdot)$. Although $\mu^2 = 16$ is considered to be strong according to the conventional criterion, this range of the concentration parameter value can be seen as the case where the instrument is “nonparametrically” weak in the sense that the penalization induces a significant difference between $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$.

Figure 3 is drawn with $\mu^2 = 256$, while all else remains the same. In this case, the penalization induces no significant difference between $\hat{g}_\tau(\cdot)$ and $\hat{g}(\cdot)$. This can be seen as the case where the instrument is “nonparametrically” strong. It is noteworthy that the bias of the penalized estimate is no larger than that of the unpenalized one even in this case.

¹⁴ The simulation results seem to be unstable when $\mu^2 = 4$ (presumably because instruments in this range are severely weak in nonparametric settings), and hence caution is needed when interpreting them.

¹⁵ Because of the bivariate normal assumption for $(\varepsilon, v)'$, we implicitly impose linearity in the function $E[\varepsilon|v] = \lambda(v)$. Although $K_2$ being smaller than $K_1$ would better reflect the fact that $\lambda_0(\cdot)$ is smoother than $g_0(\cdot)$, we assume that we are agnostic about such knowledge.

Figures 4–5 present similar plots but with penalization parameter $\tau = 0.005$. Figure 4 shows that with a larger value of $\tau$ than the previous case, the variance is significantly reduced, while the biases of the two estimates are comparable to each other. The change in the patterns of the graphs from Figure 4 to 5 is similar to that in the previous case. Furthermore, the comparison between Figures 2–3 and Figures 4–5 shows that the results are more sensitive to the change of $\tau$ in the weak instrument case than in the strong instrument case. This provides evidence for the theoretical discussion on sensitivity; see (5.3) in Section 5.

The fact that the penalized and unpenalized estimates differ significantly when the instrument is weak has a practical implication: Practitioners can be informed about whether the instrument they are using is worryingly weak by comparing penalized series estimates with unpenalized estimates. A similar approach can be found in the linear weak instruments literature; for example, the biased TSLS estimates and the approximately median-unbiased LIML estimates of Staiger and Stock (1997) can be compared to detect weak instruments.

Table 1 reports the integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators and LS estimators of $g_0(\cdot)$. The LS estimates are calculated by series estimation of the outcome equation (with order $K_1$), ignoring the endogeneity. We also calculate the relative integrated MSE for comparisons. We use $K_1 = K_2 = 6$ and $\rho = 0.5$ as before. Results with different choices of orders $K_1$ and $K_2$ between 3 and 10 and a different degree of endogeneity $\rho$ in $\{0.2, 0.95\}$ show similar patterns. Note that the usual bias and variance trade-offs are present as the order of the series changes.

As the instrument becomes weaker, the bias and variance of the unpenalized IV ($\tau = 0$) increase, with a proportionally greater increase in variance. The integrated MSE ratios between the IV and LS estimators ($MSE_{IV}/MSE_{LS}$) indicate the relative performance of the IV estimator compared to the LS estimator. A ratio larger than unity implies that IV performs worse than LS. In the table, the IV estimator does poorly in terms of MSE even when $\mu^2 = 16$, which is in the range of conventionally strong instruments; therefore, this can be considered as the case where the instrument is nonparametrically weak. The rest of the results in Table 1 are for the penalized IV (PIV) estimator $\hat{g}_\tau(\cdot)$. Overall the variance is reduced significantly compared to that of IV without sacrificing much bias. In the case of $\tau = 0.001$, the variance is reduced for the entire range of instrument strength, and the bias is no larger and is reduced when the instrument is weak. This provides evidence for the theoretical discussion in Section 5 that the penalty bias can be dominated by the existing series estimation bias. This feature diminishes as the increased value of $\tau$ introduces more bias. The integrated MSE ratios between PIV and IV ($MSE_{PIV}/MSE_{IV}$) in Table 1 suggest that PIV outperforms IV in terms of MSE for all the values of $\tau$ considered here. For example, when $\mu^2 = 8$, the MSE of PIV with $\tau = 0.001$ is only about 4.57% of that of IV, while the bias of PIV is only about 20% of that of IV. Obviously, these results imply that PIV performs substantially better than LS, unlike the previous case of IV.

Table 2 presents similar results for $K^* = 3$. Interestingly, while the variance is uniformly increased, increasing $K^*$ does not seem to control the penalty bias in a monotonic fashion. Compared to the results of Table 1, there are cases where the bias is reduced, but these cases are for some strength of instruments and are mostly observed in the case where $\tau = 0.005$ or $0.01$.

The simulation results can be summarized as follows. Even with a strong instrument in a conventional sense, unpenalized IV estimators do poorly in terms of mean squared errors compared to LS estimators. Variance seems to be a bigger problem, but bias is also worrisome. Penalization alleviates much of the variance problem induced by the weak instrument, and it also works well in terms of bias for relatively weak instruments and for some values of the penalization parameter.
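The integrated squared bias, integrated variance, and integrated MSE reported in the tables can be computed from the simulation replications along the following lines. This is only a sketch: the function names are hypothetical, and evaluating the estimates on a fixed grid and integrating with the trapezoidal rule are assumptions, since the integration scheme is not described here.

```python
import numpy as np

def _trapezoid(values, grid):
    """Trapezoidal rule for values observed on an increasing grid."""
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def integrated_metrics(estimates, truth, grid):
    """Integrated squared bias, variance, and MSE of a function estimator.

    estimates: array of shape (replications, grid points)
    truth:     true function values on the same grid
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    sq_bias = (estimates.mean(axis=0) - truth) ** 2
    variance = estimates.var(axis=0)              # pointwise Monte Carlo variance
    mse = ((estimates - truth) ** 2).mean(axis=0)
    return (_trapezoid(sq_bias, grid),
            _trapezoid(variance, grid),
            _trapezoid(mse, grid))
```

Ratios such as $MSE_{PIV}/MSE_{IV}$ then follow by dividing the third output for one estimator by that for the other.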

8 Application: Effect of Class Size

To illustrate our approach and apply the theoretical findings, we nonparametrically estimate the effect of class size on students’ test scores. Estimating the effect of class size has been an interesting topic in the schooling literature, since among school inputs that affect students’ performance, class size is thought to be easier to manipulate. Angrist and Lavy (1999) analyze the effect of class size on students’ reading and math scores in Israeli primary schools. With linear models, they find that the estimated effect is negative in most of the specifications they consider.

This specific empirical application is chosen for the following reasons: (i) Although Angrist and Lavy (1999) use an instrument that is considered to be strong for their parametric model, it may not be sufficiently strong when applied in a nonparametric specification of the relationship; see below for details. (ii) The instrument is continuous in this example and presents a nonlinear relationship with the endogenous variable; see Figure 1 in Angrist and Lavy (1999). (iii) We also compare estimates calculated from our triangular model and the NPIV model in Horowitz (2011), where the same example is considered.

In this section, we investigate whether the results of Angrist and Lavy (1999) are driven by their parametric assumptions. It is also more reasonable to allow a nonlinear effect of class size, since it is unlikely that the marginal effect is constant across class-size levels. We nonparametrically extend their linear model: For school $s$ and class $c$,

$$\mathit{score}_{sc} = g(\mathit{classize}_{sc}, \mathit{disadv}_{sc}) + \alpha_s + \varepsilon_{sc},$$

where $\mathit{score}_{sc}$ is the average test score within class, $\mathit{classize}_{sc}$ the class size, $\mathit{disadv}_{sc}$ the fraction of disadvantaged students, and $\alpha_s$ an unobserved school-specific effect. Note that this model allows for different patterns for different subgroups of school/class characteristics (here, $\mathit{disadv}_{sc}$). Class size is endogenous because it results from choices made by parents, schooling providers or legislatures, and hence is correlated with other determinants of student achievement. Angrist and Lavy (1999) use Maimonides’ rule on maximum class size in Israeli schools to construct an IV. According to the rule, class size increases one-for-one with enrollment until 40 students are enrolled, but when 41 students are enrolled, the class size is dropped to an average of 20.5 students. Similarly, classes are split when enrollment reaches 80, 120, 160, and so on, so that each class does not exceed 40. This rule can be expressed by a nonlinear function of enrollment, which produces the IV (denoted as $f_{sc}$ following their notation): $f_{sc} = e_s / \{\mathrm{int}((e_s - 1)/40) + 1\}$, where $e_s$ is the beginning-of-the-year enrollment count. This rule generates discontinuity in the enrollment/class-size relationship, which serves as exogenous variation. Note that with the sample around the discontinuity points, IV exogeneity is more credible in addressing the endogeneity issue.

The dataset we use is the 1991 Israel Central Bureau of Statistics survey of Israeli public schools from Angrist and Lavy (1999). We only consider fourth graders. The sample size is $n = 2019$ for the full sample and 650 for the discontinuity sample. Given a linear reduced form, first stage tests have $F = 191.66$ with the discontinuity sample ($\pm 7$ students around the discontinuity points) and $F = 2150.4$ with the full sample. Lessons from the theoretical analyses of the present paper suggest that an instrument that is strong in a conventional sense ($F = 191.66$) can still be weak in nonparametric estimation of the class-size effect, and a nonparametric reduced form can enhance identification power. We consider the following nonparametric reduced form

$$\mathit{classize}_{sc} = \Pi(f_{sc}, \mathit{disadv}_{sc}) + v_{sc}.$$

The sample is clustered, an aspect which is reflected in $\alpha_s$ of the outcome equation. Hence, we use the block bootstrap when computing standard errors and take schools as bootstrap sampling units to preserve within-cluster (school) correlation. This produces cluster-robust standard errors. We use $b = 500$ bootstrap repetitions.

With the same example and dataset (only the full sample), Horowitz (2011, Section 5.2) uses the model and assumptions of the NPIV approach to nonparametrically estimate the effect of class size. To solve the ill-posed inverse problem, he conducts regularization by replacing the operator with a finite-dimensional approximation. First, we compare the NPIV estimate of Horowitz (2011) with the IV estimate obtained by the control function approach of this paper. Figure 8 (Figure 3 in Horowitz (2011)) is the NPIV estimate of the function of class size ($g(\cdot, \cdot)$) for $\mathit{disadv} = 1.5(\%)$ with the full

sample. The solid line is the estimate of $g$ and the dots show the cluster-robust 95% confidence band. As noted in his paper (p. 374), “the result suggests that the data and the instrumental variable assumption, by themselves, are uninformative about the form of any dependence of test scores on class size.”

Figure 9 is the (unpenalized) IV estimate calculated with the full sample using the triangular model (2.1) and the control function approach. Although not entirely flexible, the nonparametric reduced form above is justified for use in the comparison with the NPIV estimate, since the NPIV approach does not specify any reduced-form relationship. The sample, the orders of the series, and the value of $\mathit{disadv}$ are identical to those for the NPIV estimate. The dashed lines in the figure indicate the cluster-robust 95% confidence band. The result clearly presents a nonlinear shape of the effect of class size and suggests that the marginal effect diminishes as class size increases. The overall trend seems to be negative, which is consistent with the results of Angrist and Lavy (1999).

It is important to note that the control function and NPIV approaches maintain different sets of assumptions. For example, in terms of orthogonality conditions for IV, assumptions (2.1b) are neither stronger nor weaker than $E[\varepsilon|z] = 0$, the orthogonality condition introduced in the NPIV model; only if $v \perp z$ is assumed, then $E[\varepsilon|v, z] = E[\varepsilon|v]$ with $E[\varepsilon] = 0$ implies $E[\varepsilon|z] = 0$. Therefore,

this comparison does not imply that one estimate performs better than the other. It does, however, imply that if the triangular model and control function assumptions are considered to be reasonable, they make the data informative about the relationship of interest. One of the control function assumptions, $E[\varepsilon_{sc}|v_{sc}, f_{sc}] = E[\varepsilon_{sc}|v_{sc}]$, is satisfied if $(\varepsilon_{sc}, v_{sc})$ are jointly independent of $f_{sc}$. Here, $\varepsilon_{sc}$ captures the factors that determine the average test scores of a class other than the class size

and the fraction of disadvantaged students, and $v_{sc}$ captures other factors that are correlated with enrollment. Moreover, since the NPIV approach suffers from the ill-posed inverse problem even without the problem of weak instruments, the control function approach may be a more appealing framework than the NPIV approach in the possible presence of weak instruments. In Section B.5 in the Appendix, we further discuss weak instruments in NPIV models.

We proceed by calculating the penalized IV estimates from the proposed estimation method of this paper. For all cases below, we find estimates for $\mathit{disadv} = 1.5(\%)$ as before. To better justify the usage of our method in this part, we use the discontinuity sample where the instrument is possibly weak in this nonparametric setting. For the penalization parameter $\tau$, we use cross-validation (CV) to choose a value from among $\{0.005, 0.01, 0.015, 0.02, 0.05\}$. Specifically, we use the 10-fold CV; see Arlot and Celisse (2010) for details. The following results of cross-validation (Table 3) suggest that $\tau = 0.015$ is the MSE-minimizing value. We penalize $\beta_{-1} = (\beta_2, \beta_3, ..., \beta_K)$ by choosing $K^* = 1$.
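A rough sketch of how a 10-fold cross-validation over candidate values of $\tau$ might be organized is given below. It is not the paper's implementation: the function names are hypothetical, the regressor matrix `P` is taken as given (in the application it would contain first-stage residuals, which this sketch does not re-estimate within folds), and the penalty matrix is assumed to be diagonal with zeros in its first $K^*$ entries and ones elsewhere.

```python
import numpy as np

def penalized_fit(P, y, tau, k_star=1):
    """Penalized least squares: solve (P'P/n + tau * D) beta = P'y/n."""
    n, K = P.shape
    # ASSUMPTION: D leaves the first k_star coefficients unpenalized.
    D = np.diag((np.arange(K) >= k_star).astype(float))
    return np.linalg.solve(P.T @ P / n + tau * D, P.T @ y / n)

def kfold_cv_tau(P, y, taus, n_folds=10, k_star=1, seed=0):
    """Return the tau with the smallest out-of-fold MSE, and all scores."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = {}
    for tau in taus:
        sse = 0.0
        for hold in folds:
            train = np.setdiff1d(np.arange(n), hold)
            beta = penalized_fit(P[train], y[train], tau, k_star)
            sse += float(np.sum((y[hold] - P[hold] @ beta) ** 2))
        scores[tau] = sse / n
    best = min(scores, key=scores.get)
    return best, scores
```

With clustered data such as schools, one would probably want to assign whole clusters to folds rather than individual observations; the sketch ignores this for brevity.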

Figure 10 depicts the penalized and unpenalized IV estimates. There is a certain difference in the estimates, but the amount is small. It is possible that either the instrument is not very weak in this example or that cross-validation chooses a smaller value of $\tau$. Similar to before, the results suggest a nonlinear effect of class size with the overall negative trend.
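The instrument based on Maimonides’ rule described earlier in this section is simple to compute from enrollment. A minimal sketch follows, with a hypothetical function name and the cap of 40 exposed as a parameter; integer enrollment of at least one is assumed.

```python
def maimonides_rule(enrollment, cap=40):
    """Predicted class size f = e / (int((e - 1) / cap) + 1)."""
    n_classes = (enrollment - 1) // cap + 1   # number of classes implied by the rule
    return enrollment / n_classes

# Predicted class size around the first two discontinuity points.
example = {e: maimonides_rule(e) for e in (40, 41, 80, 81)}
```

The values in `example` drop from 40 to 20.5 when enrollment moves from 40 to 41, and from 40 to 27 when it moves from 80 to 81, which is the discontinuity exploited as exogenous variation.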

9 Conclusions

This paper analyzes identification, estimation, and inference in a nonparametric triangular model in the presence of weak instruments and proposes an estimation strategy to mitigate the effect. The findings and implications of this paper are not restricted to the present additively separable triangular models. The results can be adapted to the nonparametric limited dependent variable framework of Das et al. (2003) and Blundell and Powell (2004). Weak instruments can also be studied in other nonparametric models with endogeneity, such as the IV quantile regression model of Chernozhukov and Hansen (2005) and Lee (2007).

The results of this paper are directly applicable in several semiparametric specifications of the model of this paper. With a large number of covariates, one can consider a semiparametric outcome equation or reduced form that is additively nonparametric in some components and parametric in others. One can also consider a single-index model for one equation or the other. As more structure is imposed on the model, the identification condition of Section 2 and the regularity condition of Sections 5 and 6 can be weakened. When the reduced form is of a single-index structure $\Pi(z'\gamma)$, the strength of instruments is determined by the combination of $\partial\Pi(\cdot)/\partial(z'\gamma)$ and $\gamma$.

Subsequent research can consider two specification tests: a test for the relevance of the instruments and a test for endogeneity. These tests can be conducted by adapting the existing literature on specification tests where the test statistics can be constructed using the series estimators of this paper; see, e.g., Hong and White (1995). Testing whether instruments are relevant can be conducted with the nonparametric reduced-form estimate $\hat{\Pi}(\cdot)$. A possible null hypothesis is $H_0 : \Pr\{\Pi(z) = \text{const.}\} = 1$, which is motivated by our rank condition for identification. Testing whether the model suffers from endogeneity can be conducted with the control function estimate $\hat{\lambda}(v) = \hat{h}(w) - \hat{g}(x)$. A possible null hypothesis is $H_0 : \Pr\{\lambda(v) = \text{const.}\} = 1$. In this case, one needs to take into account the generated regressors $\hat{v}$ when using existing results on the specification test. Constructing a test for instrument weakness would be more demanding than the above-mentioned tests. Developing inference procedures that are robust to identification of arbitrary strength is also an important research question.

A Appendix I

This section (Appendix I) provides the proofs of the theorems and lemmas in Sections 2 and 5. The next section (Appendix II) contains some of the proofs of lemmas introduced in Appendix I and the proofs of the theorem and corollary in Section 6. The last subsection of Appendix II discusses weak instruments in an NPIV model. Throughout the Appendices, for notational simplicity, we suppress the subscript “0” for the true functions and, when no confusion arises, the subscript “n” of the true reduced-form function in Assumption L and of the coefficients on its expansion.

A.1 Proofs in Identification Analysis (Section 2)

In this subsection, we prove Lemma 2.1 and Theorem 2.2. In order to prove the sufficiency of Assumption ID2′ for ID2 (Lemma 2.1), we first introduce a preliminary lemma. For nonempty sets $A$ and $B$, define the following set

$$A + B = \{a + b : (a, b) \in A \times B\}. \tag{A.1}$$

Then, the following rules are useful in proving the lemma. For nonempty sets $A$, $B$, and $C$,

$$A + B = B + A \quad \text{(commutative)} \tag{Rule 1}$$

$$A + (B \cup C) = (A + B) \cup (A + C) \quad \text{(distributive 1)} \tag{Rule 2}$$

$$A + (B \cap C) = (A + B) \cap (A + C) \quad \text{(distributive 2)} \tag{Rule 3}$$

$$(A + B)^c \subset A + B^c, \tag{Rule 4}$$

where the last rule is less obvious than the others but can be shown to hold. The distributive rules of Rule 2 and 3 do not hold when the operators are switched; e.g., $A \cup (B + C) \neq (A \cup B) + (A \cup C)$. Recall that $\mathcal{Z}^r = \{z \in \mathcal{Z} : \mathrm{rank}(\partial\Pi(z)/\partial z') = d_x\}$ and $\mathcal{Z}^0 = \mathcal{Z} \backslash \mathcal{Z}^r$. Let $\lambda_{Leb}$ denote a Lebesgue measure on $\mathbb{R}^{d_x}$, and $\partial\mathcal{V}$ and $\mathrm{int}(\mathcal{V})$ denote the boundary and interior of $\mathcal{V}$, respectively.
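Since (Rule 4) is stated without proof, a short argument is sketched here for completeness. It assumes that complements are taken in the ambient space $\mathbb{R}^{d_x}$ and uses only that $A$ is nonempty, as maintained above.

```latex
% Claim (Rule 4): (A + B)^c \subset A + B^c for a nonempty set A.
\begin{proof}
Let $x \in (A + B)^c$ and fix any $a \in A$, which exists because $A \neq \emptyset$.
If $x - a \in B$, then $x = a + (x - a) \in A + B$, a contradiction.
Hence $x - a \in B^c$, and therefore $x = a + (x - a) \in A + B^c$.
\end{proof}
```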

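Lemma A.1 Suppose Assumptions ID1 and ID2′(a)(i) and (ii) hold. Suppose $\mathcal{Z}^r \neq \emptyset$ and $\mathcal{Z}^0 \neq \emptyset$. Then, (a) $\{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\} \subset \mathcal{X}^r$, and (b) $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$ and $\partial\mathcal{V}$ is countable.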

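We prove the main lemma first.

Proof of Lemma 2.1: When $\mathcal{Z}^r = \emptyset$ or $\mathcal{Z}^r = \mathcal{Z}$ we trivially have $\mathcal{X}^r = \mathcal{X}$. Suppose $\mathcal{Z}^r \neq \emptyset$ and $\mathcal{Z}^0 \neq \emptyset$. First, under Assumption ID2′(b) that $\mathcal{V} = \mathbb{R}^{d_x}$, we have the conclusion by

$$\mathcal{X}^r = \{\Pi(z) + v : z \in \mathcal{Z}^r, v \in \mathbb{R}^{d_x}\} = \mathbb{R}^{d_x} = \{\Pi(z) + v : z \in \mathcal{Z}, v \in \mathbb{R}^{d_x}\} = \mathcal{X}.$$

Now suppose Assumption ID2′(a) holds. By Assumption ID2′(a)(iii), for $z \in \mathcal{Z}^0$, the joint support of $(z, v)$ is $\mathcal{Z}^0 \times \mathcal{V}$. Hence

$$\{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\} = \{\Pi(z) + v : (z, v) \in \mathcal{Z}^0 \times \mathrm{int}(\mathcal{V})\} = \Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}).$$

But by Lemma A.1(a), $\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}) \subset \mathcal{X}^r$ or contrapositively, $\mathcal{X} \backslash \mathcal{X}^r \subset (\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}))^c$. Also, by (Rule 4), $(\Pi(\mathcal{Z}^0) + \mathrm{int}(\mathcal{V}))^c \subset \Pi(\mathcal{Z}^0) + \partial\mathcal{V}$. Therefore,

$$\mathcal{X} \backslash \mathcal{X}^r \subset \Pi(\mathcal{Z}^0) + \partial\mathcal{V}. \tag{A.2}$$

Let $\partial\mathcal{V} = \{\nu_1, \nu_2, ..., \nu_k, ...\} = \cup_{k=1}^{\infty}\{\nu_k\}$ by Lemma A.1(b). Then,

$$\lambda_{Leb}(\Pi(\mathcal{Z}^0) + \partial\mathcal{V}) = \lambda_{Leb}\left(\Pi(\mathcal{Z}^0) + (\cup_{k=1}^{\infty}\{\nu_k\})\right) = \lambda_{Leb}\left(\cup_{k=1}^{\infty}(\Pi(\mathcal{Z}^0) + \{\nu_k\})\right) \le \sum_{k=1}^{\infty}\lambda_{Leb}(\Pi(\mathcal{Z}^0) + \{\nu_k\}) = \sum_{k=1}^{\infty}\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0,$$

where the second equality is from (Rule 2) and the third equality by the property of Lebesgue measure. The last equality is by Lemma A.1(b) that $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$. Since $x$ is continuously distributed, by (A.2), $\Pr[x \in \mathcal{X} \backslash \mathcal{X}^r] \le \Pr[x \in \Pi(\mathcal{Z}^0) + \partial\mathcal{V}] = 0$. $\square$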

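In the following proofs, we explicitly distinguish the r.v.’s from their realizations. Let $\xi$, $\zeta$, and $\nu$ denote the realizations of $x$, $z$, and $v$, respectively. We now prove Lemma A.1. The result of the following lemma is needed:

Lemma A.2 Under the assumptions of Lemma A.1, $\mathcal{Z}^0$ is closed in $\mathcal{Z}$.

Proof of Lemma A.2: Consider any $\cup_{n=1}^{\infty}\{\zeta_n\} \subset \mathcal{Z}^0 \subset \mathcal{Z}$ such that $\lim \zeta_n = \bar{\zeta} \in \mathcal{Z}$. Then for each $n$, $\partial\Pi(\zeta_n)/\partial\zeta' = (0, 0, ..., 0) = 0$ by the definition of $\mathcal{Z}^0$, and

$$\partial\Pi(\bar{\zeta})/\partial\zeta' = \partial\Pi(\lim_{n\to\infty}\zeta_n)/\partial\zeta' = \lim_{n\to\infty}\partial\Pi(\zeta_n)/\partial\zeta' = 0,$$

where the second equality is by continuity of $\partial\Pi(\cdot)/\partial\zeta'$. Therefore, $\bar{\zeta} \in \mathcal{Z}^0$, and hence $\mathcal{Z}^0$ is closed. $\square$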

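Proof of Lemma A.1(a): First, we claim that for any $\pi \in \Pi(\mathcal{Z}^0)$ there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$.

By Proposition 4.21(a) of Lee (2011, p. 92), for any space $S$, the path components of $S$ form a partition of $S$. Note that a path component of $S$ is a maximal nonempty path connected subset of $S$. Then for $\mathcal{Z}^0 \subset \mathbb{R}^{d_z}$, we have $\mathcal{Z}^0 = \cup_{\iota \in I}\mathcal{Z}^0_\iota$ where partitions $\mathcal{Z}^0_\iota$ are path components. Note that, since $\mathcal{Z}^0_\iota$ is path connected, for any $\zeta$ and $\tilde{\zeta}$ in $\mathcal{Z}^0_\iota$, there exists a path in $\mathcal{Z}^0_\iota$, namely, a piecewise continuously differentiable function $\gamma : [0, 1] \to \mathcal{Z}^0_\iota$ such that $\gamma(0) = \zeta$ and $\gamma(1) = \tilde{\zeta}$. Note that $\{\gamma(t) : t \in [0, 1]\} \subset \mathcal{Z}^0_\iota$. Consider a composite function $\Pi \circ \gamma : [0, 1] \to \Pi(\mathcal{Z}^0_\iota) \subset \mathbb{R}^{d_x}$. Then, $\Pi(\gamma(\cdot))$ is differentiable, and by the mean value theorem, there exists $t^* \in [0, 1]$ such that

$$\Pi(\gamma(1)) - \Pi(\gamma(0)) = \frac{\partial\Pi(\gamma(t^*))}{\partial t}(1 - 0) = \frac{\partial\Pi(\gamma(t^*))}{\partial\zeta'}\frac{\partial\gamma(t^*)}{\partial t}.$$

Note that $\partial\Pi(\gamma(t^*))/\partial\zeta' = 0_{d_x \times d_z}$ since $\gamma(t^*) \in \mathcal{Z}^0_\iota \subset \mathcal{Z}^0$ and $d_x = 1$. This implies that $\Pi(\gamma(1)) = \Pi(\gamma(0))$, or $\Pi(\zeta) = \Pi(\tilde{\zeta})$. Therefore for any $\zeta \in \mathcal{Z}^0_\iota$,

$$\Pi(\zeta) = c_\iota \tag{A.3}$$

for some constant $c_\iota$.

Also, since $\mathcal{Z}^0$ is closed by Lemma A.2, $\mathcal{Z}^0_\iota$ is closed for each $\iota$. That is, $\mathcal{Z}^0$ is partitioned to a closed disjoint union of $\mathcal{Z}^0_\iota$’s. But Assumption ID2′(a)(ii) says $\mathcal{Z}$ is a connected set in Euclidean space (i.e., $\mathbb{R}^{d_z}$). Therefore, for each $\iota \in I$, $\mathcal{Z}^0_\iota$ must contain accumulation points of $\mathcal{Z}^r$ (Taylor (1965, p. 76)). Now, for any $\pi = \Pi(\zeta) \in \Pi(\mathcal{Z}^0)$, it satisfies that $\zeta \in \mathcal{Z}^0_\iota$ for some $\iota \in I$. Let $\zeta_c \in \mathcal{Z}^0_\iota$ be an accumulation point of $\mathcal{Z}^r$, that is, there exists $\cup_{n=1}^{\infty}\{\zeta_n\} \subset \mathcal{Z}^r$ such that $\lim_{n\to\infty}\zeta_n = \zeta_c$. Then, it follows that

$$\pi = \Pi(\zeta) = c_\iota = \Pi(\zeta_c) = \Pi(\lim_{n\to\infty}\zeta_n) = \lim_{n\to\infty}\Pi(\zeta_n),$$

where the second and third equalities are from (A.3) and the fourth by continuity of $\Pi(\cdot)$. Let $\pi_n = \Pi(\zeta_n)$, then $\pi_n \in \Pi(\mathcal{Z}^r)$ for every $n \ge 1$. Therefore, we can conclude that for any $\pi \in \Pi(\mathcal{Z}^0)$, there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$.

Next, we prove that if $\xi \in \{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\}$ then $\xi \in \mathcal{X}^r$. Suppose $\xi \in \{\Pi(z) + v : z \in \mathcal{Z}^0, v \in \mathrm{int}(\mathcal{V})\}$, i.e., $\xi = \pi + \nu$ for $\pi \in \Pi(\mathcal{Z}^0)$ and $\nu \in \mathrm{int}(\mathcal{V})$. Then, by the result above, there exists $\cup_{n=1}^{\infty}\{\pi_n\} \subset \Pi(\mathcal{Z}^r)$ such that $\lim_{n\to\infty}\pi_n = \pi$. Define a sequence $\nu_n = \xi - \pi_n$ for $n \ge 1$. Notice that $\nu_n$ is not necessarily in $\mathcal{V}$. But, $\nu_n = (\pi + \nu) - \pi_n = \nu + (\pi - \pi_n)$, hence $\lim_{n\to\infty}\nu_n = \nu$. Since $\nu \in \mathrm{int}(\mathcal{V})$, there exists an open neighborhood $B_\varepsilon(\nu)$ of $\nu$ for some $\varepsilon$ such that $B_\varepsilon(\nu) \subset \mathrm{int}(\mathcal{V})$. Also, by the fact that $\lim_n \nu_n = \nu$, there exists $N_\varepsilon$ such that for all $n \ge N_\varepsilon$, $\nu_n \in B_\varepsilon(\nu)$. Therefore, by conveniently taking $n = N_\varepsilon$, $\xi$ satisfies that

$$\xi = \pi_{N_\varepsilon} + \nu_{N_\varepsilon},$$

where $\pi_{N_\varepsilon} \in \Pi(\mathcal{Z}^r)$ and $\nu_{N_\varepsilon} \in B_\varepsilon(\nu) \subset \mathrm{int}(\mathcal{V}) \subset \mathcal{V}$. That is, $\xi \in \mathcal{X}^r$. $\square$

Proof of Lemma A.1(b): Recall $d_x = 1$. Note that $\mathcal{V} \subset \mathbb{R}$ can be expressed by a union of disjoint intervals. Since we are able to choose a rational number in each interval, the union is a countable union. But note that each interval has at most two end points which are the boundary of it. Therefore $\partial\mathcal{V}$ is countable.

To prove that $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$, we use the following proposition:

Proposition A.1 (Lusin-Saks, Corollary 6.1.3 in Garg (1998)) Let $X$ be a normed vector space. Let $f : X \to \mathbb{R}$ and $E \subset X$. If at each point of $E$ at least one unilateral derivative of $f$ is zero, then $\lambda_{Leb}(f(E)) = 0$.

Note that $\mathcal{Z}^0$ is the support where $\partial\Pi(z)/\partial z_k = 0$ for $k \le d_z$. Therefore, its bilateral (directional) derivative $D_\alpha\Pi(z)$ in the direction $\alpha = (\alpha_1, \alpha_2, ..., \alpha_{d_z})'$ satisfies $D_\alpha\Pi(z) = \sum_{k=1}^{d_z}\alpha_k \cdot \partial\Pi(z)/\partial z_k = 0$. Since the bilateral derivative is zero, each unilateral derivative is also zero; see, e.g., Giorgi et al. (2004, p. 94) for the definitions of various derivatives. Then, by Proposition A.1, $\lambda_{Leb}(\Pi(\mathcal{Z}^0)) = 0$. $\square$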

Proof of Theorem 2.2: Consider equation (2.2) with $z = z_2$,

$$E[y|x, z] = E[y|v, z] = g(\Pi(z) + v) + \lambda(v), \tag{2.2}$$

and note that the conditional expectations and $\Pi(\cdot)$ are consistently estimable, and $v$ can also be estimated. By differentiating both sides of (2.2) with respect to $z$, we have

$$\frac{\partial E[y|v, z]}{\partial z'} = \frac{\partial g(x)}{\partial x'} \cdot \frac{\partial\Pi(z)}{\partial z'}. \tag{A.4}$$

Now, suppose $\Pr[z \in \mathcal{Z}^r] > 0$. For any fixed value $\bar{z}$ such that $\bar{z} \in \mathcal{Z}^r$, we have $\mathrm{rank}(\partial\Pi(\bar{z})/\partial z') = d_x$ by definition, hence the system of equations (A.4) has a unique solution $\partial g(x)/\partial x'$ for $x$ in the conditional support $\mathcal{X}_{\bar{z}}$. That is, $\partial g(x)/\partial x'$ is locally identified for $x \in \mathcal{X}_{\bar{z}}$. Now, since the above argument is true for any $z \in \mathcal{Z}^r$, we have that $\partial g(x)/\partial x'$ is identified on $x \in \mathcal{X}^r$. Now by Assumption ID2, the difference between $\mathcal{X}^r$ and $\mathcal{X}$ has probability zero, thus $\partial g(x)/\partial x'$ is identified. Once $\partial g(x)/\partial x'$ is identified, we can identify $\partial\lambda(v)/\partial v'$ by differentiating (2.2) with respect to $v$:

$$\frac{\partial E[y|v, z]}{\partial v'} = \frac{\partial g(x)}{\partial x'} + \frac{\partial\lambda(v)}{\partial v'}.$$

Next, we prove the necessity part of the theorem. Suppose $\Pr[z \in \mathcal{Z}^r] = 0$. This implies $\Pr[z \in \mathcal{Z}^0] = 1$, but since $\mathcal{Z}^0$ is closed $\mathcal{Z}^0 = \mathcal{Z}$. Therefore, for any $z \in \mathcal{Z}$, the system of equations (A.4) either has multiple solutions or no solution. Therefore $g(\Pi(z) + v)$ is not identified. $\square$
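The role of the rank condition in (A.4) can be illustrated numerically in the scalar case. The sketch below uses made-up functions (`g`, `lam`, and `Pi` are purely hypothetical choices, not from the paper) and shows that the derivative of $g$ is recovered from the conditional mean wherever $\partial\Pi(z)/\partial z \neq 0$, and cannot be recovered where it vanishes.

```python
import math

def g(x):            # hypothetical structural function
    return math.sin(x)

def lam(v):          # hypothetical control function
    return 0.5 * v

def Pi(z):           # hypothetical reduced form; its derivative is zero at z = 0
    return z ** 3

def dPi(z):          # analytic derivative of Pi
    return 3.0 * z ** 2

def cond_mean(v, z):
    """E[y | v, z] = g(Pi(z) + v) + lam(v), as in equation (2.2)."""
    return g(Pi(z) + v) + lam(v)

def recover_g_prime(v, z, h=1e-5):
    """Solve (A.4) for dg/dx at x = Pi(z) + v; None if the rank condition fails."""
    slope = dPi(z)
    if slope == 0.0:
        return None
    d_mean_dz = (cond_mean(v, z + h) - cond_mean(v, z - h)) / (2.0 * h)
    return d_mean_dz / slope
```

At $(v, z) = (0.3, 1)$ the function returns a value close to $\cos(1.3)$, the true derivative of the hypothetical $g$ at $x = 1.3$, while at $z = 0$ it returns `None` because (A.4) carries no information about $\partial g/\partial x$ there.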

A.2 Key Theoretical Step and Technical Assumptions (Section 5)

For asymptotic theory, we require a key preliminary step to separate out the weak instrument factor from the second moment matrices of interest. In order to do so, we first introduce a transformation matrix and apply the mean value expansion to those matrices. Define a vector of approximating functions of orders $K = K_1 + K_2 + 1$ for the second stage,

$$p^K(w) = (1, p_1(x), ..., p_{K_1}(x), p_1(v), ..., p_{K_2}(v))' = \left[1 \;\vdots\; p^{K_1}(x)' \;\vdots\; p^{K_2}(v)'\right]',$$

where $p^{K_1}(x)$ and $p^{K_2}(v)$ are vectors of approximating functions for $g_0(\cdot)$ and $\lambda_0(\cdot)$ of orders $K_1$ and $K_2$, respectively. Note that this rewrites $p^K(w) = (p_1(w), ..., p_K(w))'$ of the main body for expositional convenience in illustrating the step of this subsection. When not stated explicitly, $p^K(w)$ should be read as its generic expression (such as in the proof of Lemma A.4). Since $g_0(\cdot)$ and $\lambda_0(\cdot)$ can only be separately identified up to a constant, when estimating $h_0(\cdot)$, we include only one constant function. Define a $K \times K$ sample second moment matrix

$$\hat{Q} = \frac{\hat{P}'\hat{P}}{n} = \frac{\sum_{i=1}^{n} p^K(\hat{w}_i)p^K(\hat{w}_i)'}{n}. \tag{A.5}$$

Then, $\hat{\beta}_\tau = (\hat{Q} + \tau_n D_{K^*})^{-1}\hat{P}'y/n$.
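To make the estimator concrete, the following is a minimal two-step sketch with a polynomial basis: first-stage series residuals are plugged into the second-stage regressors, and the penalized coefficients are obtained from the closed form above. The function names, the power-series basis, and in particular the form of the penalty matrix (diagonal, with zeros in its first $K^*$ entries and ones elsewhere) are assumptions made for illustration only.

```python
import numpy as np

def poly_basis(u, order):
    """Columns u, u^2, ..., u^order (no constant term)."""
    return np.column_stack([u ** k for k in range(1, order + 1)])

def penalized_control_function_fit(y, x, z, K1=3, K2=3, L=3, tau=0.01, k_star=1):
    """Two-step penalized series estimator (illustrative sketch)."""
    n = len(y)
    # First stage: series regression of x on z; keep the residuals v_hat.
    R = np.column_stack([np.ones(n), poly_basis(z, L)])
    gamma_hat = np.linalg.lstsq(R, x, rcond=None)[0]
    v_hat = x - R @ gamma_hat
    # Second stage regressors: one constant, terms in x, terms in v_hat.
    P_hat = np.column_stack([np.ones(n), poly_basis(x, K1), poly_basis(v_hat, K2)])
    K = P_hat.shape[1]
    Q_hat = P_hat.T @ P_hat / n
    # ASSUMPTION: penalty matrix leaves the first k_star coefficients unpenalized.
    D = np.diag((np.arange(K) >= k_star).astype(float))
    beta_tau = np.linalg.solve(Q_hat + tau * D, P_hat.T @ y / n)
    return beta_tau, v_hat
```

With `tau=0` this reduces to the unpenalized series estimator, whose second moment matrix becomes nearly singular when the first stage is weak; a positive `tau` keeps the linear system well conditioned.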

For the rest of this subsection, we consider univariate $x$ for simplicity. This corresponds to Assumption ID2′(a). The analysis can also be generalized to the case of a vector $x$ by using multivariate mean value expansion; see Appendix B.3 for discussion of this generalization. Note that $z$ is still a vector. Under Assumption L, $\Pi_n(\cdot) = n^{-\delta}\tilde{\Pi}(\cdot)$ after applying a normalization $c = 0$ and suppressing $o_p(n^{-\delta})$ for simplicity in (3.1). Omitting $o_p(n^{-\delta})$ does not affect the asymptotic results developed in the paper. Assume that the approximating function $p_j(x_i)$ is twice differentiable for all $j$, and for $r \in \{1, 2\}$ define its $r$-th derivative as $\partial^r p_j(x) = d^r p_j(x)/dx^r$.

By mean value expanding each element of $p^{K_1}(x_i)$ around $v_i$, we have, for $j \le K_1$,

$$p_j(x_i) = p_j(n^{-\delta}\tilde{\Pi}(z_i) + v_i) = p_j(v_i) + n^{-\delta}\tilde{\Pi}(z_i)\partial p_j(\tilde{v}_i), \tag{A.6}$$

where $\tilde{v}_i$ is a value between $x_i$ and $v_i$. Define $\partial^r p^{K_1}(x) = [\partial^r p_1(x), \partial^r p_2(x), ..., \partial^r p_{K_1}(x)]'$. Then, by (A.6) the vector of regressors $p^K(w_i)$ for estimating $h(\cdot)$ can be written as

$$p^K(w_i)' = \left[1 \;\vdots\; p^{K_1}(x_i)' \;\vdots\; p^{K_2}(v_i)'\right] = \left[1 \;\vdots\; p^{K_1}(v_i)' + n^{-\delta}\tilde{\Pi}(z_i)\partial p^{K_1}(\tilde{v}_i)' \;\vdots\; p^{K_2}(v_i)'\right]. \tag{A.7}$$

Let $\kappa = K_1 = K_2 = (K - 1)/2$. Again, $K_1$, $K_2$, $L$, $K$, and $\kappa$ all depend on $n$. Note that $K \asymp K_1 \asymp K_2 \asymp \kappa$. This setting can be justified by the assumption that the functions $g_0(\cdot)$ and $\lambda_0(\cdot)$ have the same smoothness, which is in fact imposed in Assumption C. The general case of $K_1 \neq K_2$ and multivariate $x$ is discussed in Appendix B.3. Now we choose a transformation matrix $T_n$ to be

$$T_n = \begin{bmatrix} 1 & 0_{1\times\kappa} & 0_{1\times\kappa} \\ 0_{\kappa\times 1} & n^{\delta}I_\kappa & 0_{\kappa\times\kappa} \\ 0_{\kappa\times 1} & -n^{\delta}I_\kappa & I_\kappa \end{bmatrix}.$$

After multiplying $T_n$ on both sides of (A.7), the weak instrument factor is separated from $p^K(w_i)'$: with $u_i = (z_i, v_i)$,

$$p^K(w_i)'T_n = \left[1 \;\vdots\; p^{\kappa}(v_i)' + n^{-\delta}\tilde{\Pi}(z_i)\partial p^{\kappa}(\tilde{v}_i)' \;\vdots\; p^{\kappa}(v_i)'\right] \cdot T_n = \left[1 \;\vdots\; \tilde{\Pi}(z_i)\partial p^{\kappa}(\tilde{v}_i)' \;\vdots\; p^{\kappa}(v_i)'\right] = p^{*K}(u_i)' + m_i^{K\prime}, \tag{A.8}$$

where $p^{*K}(u_i)' = \left[1 \;\vdots\; \tilde{\Pi}(z_i)\partial p^{\kappa}(v_i)' \;\vdots\; p^{\kappa}(v_i)'\right]$ and $m_i^{K\prime} = \left[0 \;\vdots\; \tilde{\Pi}(z_i)\left(\partial p^{\kappa}(\tilde{v}_i)' - \partial p^{\kappa}(v_i)'\right) \;\vdots\; (0_{\kappa\times 1})'\right]$.

To illustrate the role of this linear transformation, rewrite the original vector of regressors in (A.7) as

$$p^K(w_i)' = p^K(w_i)'T_nT_n^{-1} = \left(p^{*K}(u_i) + m_i^K\right)'T_n^{-1}.$$

Ignoring the remainder vector $m_i^K$ which is shown to be asymptotically negligible below, the original vector $p^K(w_i)$ is separated into two components $p^{*K}(u_i)$ and $T_n^{-1}$. Note that $p^{*K}(u_i)$ is not affected by the weak instruments and can be seen as a new set of regressors.¹⁶ We apply this result to $\hat{Q}$ (below) and to the population second moment matrix

$$Q = E\left[p^K(w_i)p^K(w_i)'\right]. \tag{A.9}$$

¹⁶ For justification that $p^{*K}(u_i)$ can be regarded as regressors, see Assumption B in Section 5 and Assumption B† below.
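The effect of the transformation in (A.8) can be checked numerically. The sketch below builds $T_n$ for a power-series basis and verifies that $p^K(w_i)'T_n$ is close to $p^{*K}(u_i)'$ when $n$ is large. The basis $p_j(u) = u^j$, the function names, and the numerical values of $\tilde{\Pi}(z_i)$, $v_i$, $n$, and $\delta$ are all illustrative assumptions.

```python
import numpy as np

def transformation_matrix(kappa, n, delta):
    """T_n with blocks ordered as (constant, terms in x, terms in v)."""
    scale = float(n) ** delta
    eye = np.eye(kappa)
    col0 = np.zeros((kappa, 1))
    row0 = np.zeros((1, kappa))
    sq0 = np.zeros((kappa, kappa))
    return np.block([
        [np.ones((1, 1)), row0, row0],
        [col0, scale * eye, sq0],
        [col0, -scale * eye, eye],
    ])

def power_basis(u, kappa):
    """p_j(u) = u**j for j = 1, ..., kappa (illustrative basis)."""
    return np.array([u ** j for j in range(1, kappa + 1)])

def power_basis_derivative(u, kappa):
    """First derivatives of the power basis."""
    return np.array([j * u ** (j - 1) for j in range(1, kappa + 1)])

kappa, n, delta = 3, 10 ** 6, 0.5
pi_tilde, v = 1.5, 0.7                     # hypothetical Pi_tilde(z_i) and v_i
x = n ** (-delta) * pi_tilde + v           # weak first stage under localization
regressors = np.concatenate(([1.0], power_basis(x, kappa), power_basis(v, kappa)))
transformed = regressors @ transformation_matrix(kappa, n, delta)
# Limit vector p*K(u_i): free of the weak-instrument factor n**(-delta).
limit = np.concatenate(([1.0],
                        pi_tilde * power_basis_derivative(v, kappa),
                        power_basis(v, kappa)))
```

The gap between `transformed` and `limit` is the remainder $m_i^K$, which shrinks as $n$ grows because $\tilde{v}_i$ approaches $v_i$.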

By equations (A.9) and (A.8), it follows

$$T_n'QT_n = E\left[T_n'p^K(w_i)p^K(w_i)'T_n\right] = Q^* + E\left[m_i^K p^{*K}(u_i)'\right] + E\left[p^{*K}(u_i)m_i^{K\prime}\right] + E\left[m_i^K m_i^{K\prime}\right], \tag{A.10}$$

where the newly defined $Q^* = E\left[p^{*K}(u_i)p^{*K}(u_i)'\right]$ is the population second moment matrix with the new regressors. Also, define

$$\tilde{Q} = \frac{P'P}{n},$$

where $P = (p^K(w_1), ..., p^K(w_n))'$ is the $n \times K$ matrix which is similar to $\hat{P}$ but with unobservable $v$ instead of residual $\hat{v}$. In terms of the notations, it is useful to note that “$*$” denotes matrix $P$, $Q$, or vector $p^K(\cdot)$ that are free from the weak instrument effect. Also, “$\sim$” is for $P$ or $Q$ with $v_i$ and “$\hat{\ }$” is for ones with $\hat{v}_i$. Furthermore, since $\tilde{\Pi}(\cdot) \in C_1(\mathcal{Z})$ can have nonempty $\mathcal{Z}^0$ as a subset of its domain, we define

$$Q^{r*} = E\left[p^{*K}(u_i)p^{*K}(u_i)' \,|\, z_i \in \mathcal{Z}^r\right], \qquad Q^{0*} = E\left[p^{*K}(u_i)p^{*K}(u_i)' \,|\, z_i \in \mathcal{Z}^0\right].$$

Also define the second moment matrix for the first-stage estimation as $Q_1 = E\left[r^L(z_i)r^L(z_i)'\right]$.

Assumptions B, C, D, and L of the main body serve as sufficient conditions for high-level assumptions that are directly used in the proofs for the convergence rate. In this section, we state Assumptions B†, C†, and D†, and Appendix B.1 proves that they are implied by Assumptions B, C, D, and L. For a symmetric matrix $B$, let $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ denote the minimum and maximum eigenvalues of $B$, respectively, and $\det(B)$ the determinant of $B$.

Assumption B† (i) $\lambda_{\min}(Q^{r*})$ is bounded away from zero for all $K(n)$ and $\lambda_{\min}(Q_1)$ is bounded away from zero for all $L(n)$; (ii) $\lambda_{\max}(Q^*)$ and $\lambda_{\max}(Q)$ are bounded by a fixed constant, for all $K(n)$, and $\lambda_{\max}(Q_1)$ bounded by a fixed constant, for all $L(n)$; (iii) The diagonal elements $q_{jj}$ of $Q$ are bounded away from zero for $j \le K^*$.

Assumption C† For all $n$, (i) $|\beta_j| \le C j^{-s/d_x - 1}$ and $|\gamma_{nl}| \le C l^{-s_\pi/d_z - 1}$; (ii) for all positive integers $\tilde K$ and $\tilde L$,

\[ \sup_{w \in W}\Big| h_0(w) - \sum_{j=1}^{\tilde K} \beta_j p_j(w) \Big| \le C \tilde K^{-s/d_x} \quad\text{and}\quad \sup_{z \in Z}\Big| \Pi_n(z) - \sum_{j=1}^{\tilde L} \gamma_{nj} r_j(z) \Big| \le C \tilde L^{-s_\pi/d_z}. \]

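For the next assumption let $\zeta_r^v(\kappa)$ and $\xi_r(L)$ satisfy

\[ \max_{|\mu| \le r}\, \sup_{v \in V} \|\partial^\mu p^\kappa(v)\| \le \zeta_r^v(\kappa), \qquad \max_{|\mu| \le r}\, \sup_{z \in Z} \|\partial^\mu r^L(z)\| \le \xi_r(L), \]

which impose nonstochastic uniform bounds on the vectors of approximating functions. Let $\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}$.

Assumption D† (i) $n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0$; (ii) $n^{-\delta}\zeta_2^v(\kappa)\kappa^{1/2} \to 0$; (iii) $\xi_0(L)^2 L/n \to 0$; also, (iv) $K > 2(K^* + 1)$ for any $n$ and $K^* \asymp K$.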

A.3


Key Lemmas for Asymptotic Theory (Section 5)

Lemmas A.3 and A.4 of this subsection describe the subsequent step and obtain the orders of magnitude of the eigenvalues of the second moment matrices in terms of the weak instrument and penalization rates. In proving these lemmas, we frequently apply two useful mathematical lemmas (Lemmas B.1 and B.2) that are stated and proved in Appendix B.2. For any matrix $A$, let the matrix norm be the Euclidean norm $\|A\| = \sqrt{\mathrm{tr}(A'A)}$. The following lemmas are the main results of this subsection.

Lemma A.3 Suppose Assumptions ID, A, B†, D†, and L are satisfied. Then, (a) $\lambda_{\max}(Q^{-1}) = O(n^{2\delta})$ and (b) $\lambda_{\max}(\hat Q^{-1}) = O_p(n^{2\delta})$.

We only present the proof of (a) here. The proof of (b), which uses a similar strategy but involves more work, can be found in Appendix B.2.

Lemma A.4 Suppose Assumptions ID, A, B†, D†, and L are satisfied. Then,

\[ \lambda_{\max}(\hat Q_\tau^{-1}) = O_p\left(\min\left\{\lambda_{\max}(\hat Q^{-1}),\ \tau_n^{-1}\right\}\right). \tag{A.11} \]

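In all proofs, let $C$ denote a generic positive constant that may be different in different uses. TR, CS, MK are the triangle inequality, Cauchy-Schwarz inequality and Markov inequality, respectively.

Preliminary derivations for the proofs of Lemmas A.3 and A.4 and Theorem 5.1: Before proving the lemmas and theorems below, it is useful to list the implications of Assumption D†(i) that are used in the proofs. Define

\[ \Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}, \qquad \Delta_{\hat Q} = \zeta_1^v(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi, \qquad \Delta_{\tilde Q} = \sqrt{\zeta_1^v(\kappa)^2\kappa/n}. \]

Note that $\sqrt{\kappa/n} \to 0$ by $n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0$ of Assumption D†(i). Also

\[ n^{2\delta}\Delta_{\hat Q} = n^{2\delta}\left\{\zeta_1^v(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi\right\} \to 0, \qquad n^{2\delta}\Delta_{\tilde Q} = n^{2\delta}\sqrt{\zeta_1^v(\kappa)^2\kappa/n} \le C n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)/\sqrt{n} \to 0 \tag{A.12} \]

and

\[ n^{2\delta}\zeta_1^v(\kappa)^2\Delta_\pi^2/n \to 0 \tag{A.13} \]

by $n^{\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi \to 0$ and $n^{2\delta}\zeta_1^v(\kappa)\Delta_\pi \to 0$, which are implied by (A.12). □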

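Proof of Lemma A.3(a): Let $p_i^* = p^{*K}(u_i)$ and $m_i = m_i^K$ for brevity. Recall (A.10) that $T_n'QT_n = Q^* + E[m_i p_i^{*\prime}] + E[p_i^* m_i'] + E[m_i m_i']$. Then,

\[ \|T_n'QT_n - Q^*\| \le 2E\|m_i\|\,\|p_i^*\| + E\|m_i\|^2 \le 2\left(E\|m_i\|^2\right)^{1/2}\left(E\|p_i^*\|^2\right)^{1/2} + E\|m_i\|^2 \]

by CS. But $m_i = m_i^K = [\,0 \;\vdots\; \tilde\Pi(z_i)(\partial p^\kappa(\tilde v_i)' - \partial p^\kappa(v_i)') \;\vdots\; (0_{\kappa\times 1})'\,]'$, where $\tilde v$ is the intermediate value between $x$ and $v$. Then, by mean value expanding $\partial p^\kappa(\tilde v_i)$ around $v_i$ and $|\tilde v_i - v_i| \le |x_i - v_i|$, we have

\[ \|m_i\|^2 = \left\|\tilde\Pi(z_i)\,\partial^2 p^\kappa(\bar v_i)(\tilde v_i - v_i)\right\|^2 \le |\tilde\Pi(z_i)|^2\,\zeta_2^v(\kappa)^2\,|x_i - v_i|^2 = n^{-2\delta}|\tilde\Pi(z_i)|^4\,\zeta_2^v(\kappa)^2 \le C n^{-2\delta}\zeta_2^v(\kappa)^2, \tag{A.14} \]

where $\bar v$ is the intermediate value between $v$ and $\tilde v$, and by Assumption L that $\sup_z |\tilde\Pi(z_i)| < \infty$. Therefore

\[ E\|m_i\|^2 \le C n^{-2\delta}\zeta_2^v(\kappa)^2. \tag{A.15} \]

Then, by Assumption B†(ii),

\[ E\left[p_i^{*\prime}p_i^*\right] = \mathrm{tr}(Q^*) \le \mathrm{tr}(I_{2\kappa})\,\lambda_{\max}(Q^*) \le C\cdot\kappa. \tag{A.16} \]

Therefore, by combining (A.15) and (A.16) it follows

\[ \|T_n'QT_n - Q^*\| \le O\left(\kappa^{1/2}n^{-\delta}\zeta_2^v(\kappa)\right) + O\left(n^{-2\delta}\zeta_2^v(\kappa)^2\right) = o(1) \tag{A.17} \]

by Assumption D†(ii), which shows that all the remainder terms are negligible. Now, by Lemma B.1, we have

\[ \left|\lambda_{\min}(T_n'QT_n) - \lambda_{\min}(Q^*)\right| \le \|T_n'QT_n - Q^*\|. \tag{A.18} \]

Combine the results (A.17) and (A.18) to have $\lambda_{\min}(T_n'QT_n) = \lambda_{\min}(Q^*) + o(1)$. But note that, with simpler notations $p_1 = \Pr[z \in Z^r]$ and $p_0 = \Pr[z \in Z^0]$, we have $Q^* = p_1 Q^{r*} + p_0 Q^{0*}$. Then, by a variant of Lemma B.1 (with the fact that $\lambda_1(-B) = -\lambda_k(B)$ for any symmetric matrix $B$), it follows that

\[ \lambda_{\min}(Q^*) \ge p_1\cdot\lambda_{\min}(Q^{r*}) + p_0\cdot\lambda_{\min}(Q^{0*}) = p_1\cdot\lambda_{\min}(Q^{r*}), \]

because $\lambda_{\min}(Q^{0*}) = 0$. Since $p_1 > 0$, it holds that $p_1\cdot\lambda_{\min}(Q^{r*}) \ge c > 0$ for all $K(n)$ by Assumption B†(i). Therefore,

\[ \lambda_{\min}(T_n'QT_n) \ge c > 0. \tag{A.19} \]

Let

\[ T_{0n} = \begin{bmatrix} n^\delta & 0 \\ -n^\delta & 1 \end{bmatrix}, \quad\text{so that}\quad T_n = \begin{bmatrix} 1 & 0_{1\times 2\kappa} \\ 0_{2\kappa\times 1} & T_{0n}\otimes I_\kappa \end{bmatrix}. \]

Then, by solving

\[ \begin{vmatrix} n^\delta - \tilde\lambda & 0 \\ -n^\delta & 1 - \tilde\lambda \end{vmatrix} = 0, \]

we have $\tilde\lambda = n^\delta$ or $1$ for eigenvalues of $T_{0n}$, and since $\lambda_{\max}(I_\kappa) = 1$, it follows

\[ \lambda_{\max}(T_n) = \lambda_{\max}(T_{0n}) = O(n^\delta). \tag{A.20} \]

Note that $\lambda_{\max}(T_nT_n') = O(n^{2\delta})$ by Lemma B.2. Since (A.19) implies $\lambda_{\max}((T_n'QT_n)^{-1}) = O(1)$, it follows

\[ \lambda_{\max}(Q^{-1}) = \lambda_{\max}\left(T_n(T_n'QT_n)^{-1}T_n'\right) \le O(1)\,\lambda_{\max}(T_nT_n') = O(n^{2\delta}) \]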

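by applying Lemma B.2 again. □

Proof of Lemma A.4: In proving this lemma, we use generic $K$ and $K^*$ for expositional convenience. To deal with the singularity in $\tau_n D_{K^*}$, define a matrix $\Lambda_{K^*}$ whose elements $\lambda_{jk}$ satisfy the following: For $j \le K^*$ and $k \ne j$, $\lambda_{jk} = \hat q_{jk}$, $\lambda_{kj} = \hat q_{kj}$, and $\lambda_{jj} = \hat q_{jj}/2$, where $\hat q_{jk}$ is an element in $\hat Q$; and for $j, k > K^*$, $\lambda_{jk} = 0$. Write

\[ \lambda_{\max}(\hat Q_\tau^{-1}) = \frac{1}{\lambda_{\min}(\hat Q + \tau_n D_{K^*})} = \frac{1}{\lambda_{\min}(\hat Q - \Lambda_{K^*} + \Lambda_{K^*} + \tau_n D_{K^*})} \le \frac{1}{\lambda_{\min}(\hat Q - \Lambda_{K^*}) + \lambda_{\min}(\Lambda_{K^*} + \tau_n D_{K^*})}. \]

Note that $\lambda_{\min}(\Lambda_{K^*} + \tau_n D_{K^*}) = \tau_n\lambda_{\min}(\tau_n^{-1}\Lambda_{K^*} + D_{K^*})$. By Assumption D†(iv), we have that $\lambda_{\min}(\tau_n^{-1}\Lambda_{K^*} + D_{K^*})$ is either $1$ or $\tau_n^{-1}\hat q_{jj}/2$ ($j \le K^*$) for sufficiently large $n$. This is because

\[ \det\left(\tau_n^{-1}\Lambda_{K^*} + D_{K^*} - \lambda\cdot I_K\right) = \prod_{j=1}^{K^*}\left(\tau_n^{-1}\hat q_{jj}/2 - \lambda_j\right)\prod_{j=K^*+1}^{K}\left(1 - \lambda_j\right), \]

and in calculating the determinant, note that off-diagonal, reverse diagonal, and reverse off-diagonal multiplications are all zero when $K > 2(K^* + 1)$ for any $n$. More precisely, all off-diagonal multiplications are zero for $K > 2K^* + 1$; all reverse-diagonal multiplications are zero for $K > 2K^* + 1$; all reverse off-diagonal multiplications are zero for $K > 2K^* + 2$. Therefore, $\lambda_{\min}(\Lambda_{K^*} + \tau_n D_{K^*}) = C\tau_n$. Next, let $\hat Q_{K^*}$ be the lower block of the block diagonal matrix $\hat Q - \Lambda_{K^*}$. By the fact that the eigenvalues of a block diagonal matrix are those of each block, we have that

\[ \lambda_{\max}\left((\hat Q - \Lambda_{K^*})^{-1}\right) = \max\left\{2/\hat q_{11},\ 2/\hat q_{22},\ ...,\ 2/\hat q_{K^*K^*},\ \lambda_{\max}(\hat Q_{K^*}^{-1})\right\}. \]

By Assumption B†(iii) that $q_{jj}$ for $j \le K^*$ are bounded away from zero and since $\lambda_{\min}(\hat Q_{K^*}) \to_p 0$ as $n \to \infty$, for sufficiently large $n$, we have that $\lambda_{\min}(\hat Q_{K^*}) < \hat q_{jj}/2$ or $\lambda_{\max}(\hat Q_{K^*}^{-1}) > 2/\hat q_{jj}$ for all $j \le K^*(n)$. Consequently,

\[ \lambda_{\max}(\hat Q_\tau^{-1}) \le \min\left\{\lambda_{\max}\left((\hat Q - \Lambda_{K^*})^{-1}\right),\ \tilde C\tau_n^{-1}\right\} \le \min\left\{\lambda_{\max}(\hat Q_{K^*}^{-1}),\ \tilde C\tau_n^{-1}\right\} = O_p\left(\min\left\{n^{2\delta},\ \tau_n^{-1}\right\}\right), \]

where $\lambda_{\max}(\hat Q_{K^*}^{-1}) = O_p(n^{2\delta})$ can be similarly shown as the proof of $\lambda_{\max}(\hat Q^{-1}) = O_p(n^{2\delta})$ by accordingly redefining $T_n$. □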
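The effect of the penalty on the largest eigenvalue of the inverse second moment matrix can be seen in a toy example. The sketch below is not the paper's estimator: it assumes a $2 \times 2$ matrix $Q = [[1, r], [r, 1]]$, where $r$ close to one mimics near-singularity (concurvity) under weak instruments, and a hypothetical penalty matrix $D = \mathrm{diag}(0, 1)$. It compares $\lambda_{\max}(Q^{-1})$ with $\lambda_{\max}((Q + \tau D)^{-1})$ using the closed-form eigenvalues of a symmetric $2 \times 2$ matrix.

```python
import math


def sym2_eigenvalues(a, b, d):
    """Eigenvalues (min, max) of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    mid = (a + d) / 2.0
    rad = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mid - rad, mid + rad


def max_eig_of_inverse(a, b, d):
    """lambda_max(M^{-1}) = 1 / lambda_min(M) for a positive definite M."""
    lam_min, _ = sym2_eigenvalues(a, b, d)
    return 1.0 / lam_min


def penalized_bound_demo(r, tau):
    """Compare Q = [[1, r], [r, 1]] with Q + tau * D, where D = diag(0, 1) (toy choice)."""
    unpenalized = max_eig_of_inverse(1.0, r, 1.0)       # equals 1 / (1 - r) for 0 < r < 1
    penalized = max_eig_of_inverse(1.0, r, 1.0 + tau)
    return unpenalized, penalized


if __name__ == "__main__":
    for r in (0.9, 0.99, 0.999):
        print(r, penalized_bound_demo(r, tau=0.1))
```

As $r$ approaches one the unpenalized value diverges, while the penalized value stays below a constant multiple of $\tau^{-1}$, which is the qualitative content of the $\min\{\cdot,\ \tau_n^{-1}\}$ bound in Lemma A.4.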

A.4

Decaying Rates of Coefficients and Approximation Errors (Section 5)

Proof of Lemma 5.1: Let $\|h\|_\infty = \sup_w |h(w)|$ for simplicity. We provide a proof for the expansion of $h(w)$, and a similar proof can be followed for $\Pi_n(z)$. Let $d$ be a generic dimension of $w$ for now. Let $p_j(w) = \prod_{l=1}^d p_{\lambda_l(j)}(w_l)$ be the tensor product of orthogonal polynomials, where $\lambda_l(j)$ is the order of each univariate polynomial, and let $\omega(w) = \prod_{l=1}^d \omega_l(w_l)$ be the associated weight. Here let $\lambda(j) = (\lambda_1(j), \lambda_2(j), ..., \lambda_d(j))$ be the multi-index with norm $|\lambda(j)| = \sum_{l=1}^d \lambda_l(j)$, which is monotonically increasing in $j$, and $\{\lambda(j)\}_{j=1}^\infty$ are distinct such vectors. Also, $\lambda_l(j) \asymp \bar\lambda(j)$ for all $l$ for some increasing sequence $\bar\lambda(j)$. Let $\beta_j = \int_W h(w)p_j(w)\omega(w)\,dw$ and denote $h_K(w) = \sum_{j=1}^K \beta_j p_j(w)$. By Assumption C that $h(\cdot)$ is Lipschitz, $h_K(\cdot)$ is uniformly convergent to $h(\cdot)$ (Trefethen (2008, p. 75)).

We derive the decay rate of the coefficients on the Chebyshev polynomials. By Assumption B, let $W = \prod_{l=1}^d W_l$ where $W_l$ is the marginal support of $w_l$ and, without loss of generality, let $W_l = [-1, 1]$ for all $l$. Also let $\omega_l(w_l) = 1/\sqrt{1 - w_l^2}$ for all $l$. Then $\|h\|_{W,\omega}$ becomes the Chebyshev-weighted seminorm denoted as $\|h\|_T$. First, for a given $l \in \{1, ..., d\}$, we pay our attention to $h(w)$ as a univariate function of $w_l$ on $W_l$ in

\[ \beta_j = \int_{W_{-l}}\left[\int_{W_l} h(w_l, w_{-l})\,p_{\lambda_l(j)}(w_l)\,\omega_l(w_l)\,dw_l\right]\prod_{m \ne l} p_{\lambda_m(j)}(w_m)\,\omega_m(w_m)\,dw_{-l}, \]

where $w_{-l}$ denotes the remaining vector. Let $\mu$ be such that $|\mu| = \sum_{l=1}^d \mu_l = s$. For a given $w_{-l} \in W_{-l}$, by Assumption C, $h(w_l, w_{-l})$ as a function of $w_l$ is $\mu_l$-times (and hence $(\mu_l - 1)$-times) absolutely continuously differentiable. Suppose also that $\|\partial^{\mu_l}h(\cdot, w_{-l})\|_T$ is bounded for $|\mu_l| = \mu_l$.

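Then, one can apply the univariate results of Trefethen (2008, Theorem 4.2) to yield

\[ \int h(w_l, w_{-l})\,p_{\lambda_l(j)}(w_l)\,\omega_l(w_l)\,dw_l = O\left(\lambda_l(j)^{-\mu_l - 1}\right). \]

Likewise, under Assumption C that $h(w)$ is $s$-times (and hence $(\mu_l - 1)$-times with respect to $w_l$ for each $l$) absolutely continuously differentiable and $\|\partial^\mu h\|_T$ is bounded for $|\mu| = s$, we can apply the integration by parts in his proof further with respect to $w_{-l}$ as well, which yields that

\[ \beta_j = \int_W h(w)p_j(w)\omega(w)\,dw = O\left(\lambda_1(j)^{-\mu_1 - 1}\,\lambda_2(j)^{-\mu_2 - 1}\cdots\lambda_d(j)^{-\mu_d - 1}\right). \]

To illustrate this, suppose $d = 2$ and $s_1 = s_2 = 0$ for simplicity. Let $T_{\lambda_l(j)}(w_l)$ denote the univariate Chebyshev polynomial of order $\lambda_l(j)$. Then,

\[ \begin{aligned} \beta_j &= \left(\frac{2}{\pi}\right)^2\int_{-1}^{1}\int_{-1}^{1} h(w_1, w_2)\,\frac{T_{\lambda_1(j)}(w_1)}{\sqrt{1 - w_1^2}}\,\frac{T_{\lambda_2(j)}(w_2)}{\sqrt{1 - w_2^2}}\,dw_1\,dw_2 \\ &= \left(\frac{2}{\pi}\right)^2\int_0^\pi\int_0^\pi h(\cos\theta_1, \cos\theta_2)\cos(\lambda_1(j)\theta_1)\cos(\lambda_2(j)\theta_2)\,d\theta_1\,d\theta_2 \\ &= \left(\frac{2}{\pi}\right)^2\frac{1}{\lambda_1(j)\lambda_2(j)}\int_0^\pi\int_0^\pi h_{12}(\cos\theta_1, \cos\theta_2)\sin(\lambda_1(j)\theta_1)\sin\theta_1\,\sin(\lambda_2(j)\theta_2)\sin\theta_2\,d\theta_1\,d\theta_2, \end{aligned} \]

where the last equality is by iteratively applying integration by parts with respect to each argument and where the boundary terms vanish because of the sine functions. Applying $2\sin\theta'\sin\theta'' = \cos(\theta' - \theta'') - \cos(\theta' + \theta'')$, we have

\[ \begin{aligned} |\beta_j| &\le \left(\frac{2}{\pi}\right)^2\left|\int_0^\pi\int_0^\pi h_{12}(\cos\theta_1, \cos\theta_2)\,\frac{1}{\lambda_1(j)\lambda_2(j)}\,\frac{\cos((\lambda_1(j) - 1)\theta_1) - \cos((\lambda_1(j) + 1)\theta_1)}{2}\times\frac{\cos((\lambda_2(j) - 1)\theta_2) - \cos((\lambda_2(j) + 1)\theta_2)}{2}\,d\theta_1\,d\theta_2\right| \\ &\le \left(\frac{2}{\pi}\right)^2\frac{\|h(w)\|_T}{\lambda_1(j)\lambda_2(j)}\left\|\frac{\cos((\lambda_1(j) - 1)\theta_1) - \cos((\lambda_1(j) + 1)\theta_1)}{2}\right\|_\infty\left\|\frac{\cos((\lambda_2(j) - 1)\theta_2) - \cos((\lambda_2(j) + 1)\theta_2)}{2}\right\|_\infty \le \frac{C}{\lambda_1(j)\lambda_2(j)}, \end{aligned} \]

where the final bound holds because the $\|\cdot\|_\infty$ terms are bounded by 1 and by Assumption C for $s = s_1 + s_2 = 0$. A similar argument follows with general $d$ and $s$ by iterating the procedure. Note that $j = \prod_{l=1}^d(\lambda_l(j) + 1)$, which manifests the curse of dimensionality. Since $j \asymp (\bar\lambda(j))^d$ or $j^{1/d} \asymp \bar\lambda(j)$,

\[ \beta_j = O\left(j^{-(s + d)/d}\right) = O\left(j^{-s/d - 1}\right). \]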

Additionally, because of the additivity of $h(\cdot)$, we have reduced dimensionality such that $d = d_x$ and $j/2 = \prod_{l=1}^{d_x}(\lambda_l(j) + 1)$. Therefore, $\beta_j = O(j^{-s/d_x - 1})$.

Now we calculate the approximation error. Since $\|h_K - h_{\tilde K}\|_\infty = \|\sum_{j=\tilde K+1}^{K}\beta_j p_j(w)\|_\infty \le C\sum_{j=\tilde K+1}^{K}|\beta_j|$ by Assumption B†(ii) that $\|p_j\|_\infty \le C$, for all $K$,

\[ \|h - h_{\tilde K}\|_\infty \le \|h - h_K\|_\infty + \|h_K - h_{\tilde K}\|_\infty \le \|h - h_K\|_\infty + C\sum_{j=\tilde K+1}^{K}|\beta_j|. \]

Then, based on the results above that $\beta_j = O(j^{-s/d_x - 1})$ and $\lim_{K\to\infty}\|h - h_K\|_\infty = 0$,

\[ \|h - h_{\tilde K}\|_\infty \le \lim_{K\to\infty}\|h - h_K\|_\infty + C\lim_{K\to\infty}\sum_{j=\tilde K+1}^{K}|\beta_j| \le \tilde C\sum_{j=\tilde K+1}^{\infty} j^{-s/d_x - 1} \le \tilde C\int_{\tilde K}^{\infty} x^{-s/d_x - 1}\,dx = \tilde{\tilde C}\,\tilde K^{-s/d_x}. \]

□

Lemma 5.1 delivers the case of the Chebyshev polynomials. For the Legendre polynomials, similar results follow by extending Theorem 2.1 of Wang and Xiang (2012) as above under the same assumptions of Lemma 5.1. For example, $\beta_j = \int_W h(w)p_j(w)\omega(w)\,dw = O(j^{-s/d - 1/2})$ where $p_j(w)$ is the tensor product of the Legendre polynomials. We omit the full results.
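The univariate decay rate behind Lemma 5.1 can be illustrated numerically. The sketch below is an illustration only, under assumptions that are not in the paper: the test function $h(w) = |w|$ on $[-1, 1]$ (Lipschitz, with a derivative of bounded variation) and a midpoint quadrature with a hypothetical number of nodes. It approximates the Chebyshev coefficients $a_k = (2/\pi)\int_0^\pi h(\cos\theta)\cos(k\theta)\,d\theta$ and prints $a_k k^2$, which should remain bounded if the coefficients decay like $k^{-2}$, in line with the $j^{-s/d - 1}$ rate for $s = 1$ and $d = 1$.

```python
import math


def chebyshev_coefficient(h, k, num_nodes=4000):
    """Approximate a_k = (2/pi) * int_0^pi h(cos t) cos(k t) dt by the midpoint rule."""
    total = 0.0
    for m in range(num_nodes):
        theta = math.pi * (m + 0.5) / num_nodes
        total += h(math.cos(theta)) * math.cos(k * theta)
    # (2/pi) * (pi/num_nodes) * sum simplifies to 2 * sum / num_nodes
    return 2.0 * total / num_nodes


if __name__ == "__main__":
    for k in (2, 4, 8, 16, 32):
        a_k = chebyshev_coefficient(abs, k)
        print(k, a_k, a_k * k ** 2)
```

For this particular function the even coefficients have the closed form $a_{2m} = (4/\pi)(-1)^{m+1}/(4m^2 - 1)$ and the odd ones vanish, so $|a_k| k^2$ settles near $4/\pi$.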

A.5

Proofs of Rate of Convergence (Section 5)

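Given the results of the lemmas above, we derive the rates of convergence. We first derive the rate of convergence of the unpenalized series estimator $\hat h(\cdot)$ defined as $\hat h(w) = p^K(w)'\hat\beta$, where $\hat\beta = (\hat P'\hat P)^{-1}\hat P'y$. Then, we prove the main theorem with the penalized estimator $\hat h_\tau(\cdot)$ defined in Section 4.

Lemma A.5 Suppose Assumptions A–D, and L are satisfied. Then,

\[ \|\hat h - h_0\|_{L_2} = O_p\left(n^\delta\left(\sqrt{K/n} + K^{-s/d_x} + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right). \]

Proof of Lemma A.5: Let $\beta$ be such that $\beta = (\beta_1, ..., \beta_K)'$. Then, by TR of the $L_2$ norm (first inequality),

\[ \begin{aligned} \|\hat h - h_0\|_{L_2} &= \left\{\int\left[\hat h(w) - h_0(w)\right]^2 dF(w)\right\}^{1/2} \\ &\le \left\{\int\left[p^K(w)'(\hat\beta - \beta)\right]^2 dF(w)\right\}^{1/2} + \left\{\int\left[p^K(w)'\beta - h_0(w)\right]^2 dF(w)\right\}^{1/2} \\ &= \left\{(\hat\beta - \beta)'\,E\,p^K(w)p^K(w)'\,(\hat\beta - \beta)\right\}^{1/2} + O(K^{-s/d_x}) \\ &\le C\|\hat\beta - \beta\| + O(K^{-s/d_x}) \end{aligned} \]

by Assumption B†(ii) and using Lemma B.2 (last step). As $\hat\beta - \beta = (\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta)$, it follows that

\[ \begin{aligned} \|\hat\beta - \beta\|^2 &= (\hat\beta - \beta)'(\hat\beta - \beta) \\ &= (y - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}(\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta) \\ &= (y - \hat P\beta)'\hat P\,\hat Q^{-1/2}\hat Q^{-1}\hat Q^{-1/2}\,\hat P'(y - \hat P\beta)/n^2 \\ &\le O_p(n^{2\delta})\,(y - \hat P\beta)'\hat P\,\hat Q^{-1}\,\hat P'(y - \hat P\beta)/n^2 \\ &= O_p(n^{2\delta})\,(y - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}\hat P'(y - \hat P\beta)/n \end{aligned} \]

by Lemma B.2 and Lemma A.3(b) (the inequality).

Let $h = (h(w_1), ..., h(w_n))'$ and $\tilde h = (h(\hat w_1), ..., h(\hat w_n))'$. Also let $\eta_i = y_i - h_0(w_i)$ and $\eta = (\eta_1, ..., \eta_n)'$. Let $W = (w_1, ..., w_n)'$, then $E[y_i|W] = h_0(w_i)$ which implies $E[\eta_i|W] = 0$. Also similar to the proof of Lemma A1 in NPV (p. 594), by Assumption A, we have $E[\eta_i^2|W]$ being bounded and $E[\eta_i\eta_j|W] = 0$ for $i \ne j$, where the expectation is taken for $y$. Then, given that $y - \hat P\beta = (y - h) + (h - \tilde h) + (\tilde h - \hat P\beta)$, we have, by TR,

\[ \|\hat\beta - \beta\| = O_p(n^\delta)\left\|\hat Q^{-1/2}\hat P'(y - \hat P\beta)/n\right\| \le O_p(n^\delta)\left\{\left\|\hat Q^{-1/2}\hat P'\eta/n\right\| + \left\|\hat Q^{-1/2}\hat P'(h - \tilde h)/n\right\| + \left\|\hat Q^{-1/2}\hat P'(\tilde h - \hat P\beta)/n\right\|\right\}. \tag{A.21} \]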

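For the first term of equation (A.21), consider

\[ E\left[\left\|(PT_n - P^*)'\eta/n\right\|^2 \,\Big|\, W\right] = E\left[\left\|M'\eta/n\right\|^2 \,\Big|\, W\right] \le C\frac{1}{n^2}\sum_i\|m_i\|^2 = O_p\left(n^{-2\delta - 1}\zeta_2^v(\kappa)^2\right) = o_p(1) \]

by (A.14) and $o_p(1)$ is implied by Assumption D†(ii). Therefore, by MK,

\[ \left\|(PT_n - P^*)'\eta/n\right\| = o_p(1). \tag{A.22} \]

Also,

\[ E\left[\left\|(\hat PT_n - PT_n)'\eta/n\right\|^2 \,\Big|\, W\right] \le C\frac{1}{n^2}\sum_i\left\|(\hat p_i - p_i)'T_n\right\|^2 \le C\lambda_{\max}(T_n)^2\frac{1}{n^2}\sum_i\|\hat p_i - p_i\|^2 \le O(n^{2\delta})\,O_p\left(\zeta_1^v(\kappa)^2\Delta_\pi^2\right)\frac{1}{n} = O_p\left(n^{2\delta}\zeta_1^v(\kappa)^2\Delta_\pi^2/n\right) \tag{A.23} \]

by (A.20) and (B.6), and hence

\[ \left\|(\hat PT_n - PT_n)'\eta/n\right\| = o_p(1) \tag{A.24} \]

by Assumption D†(i) (or (A.13)) and MK. Also

\[ E\left\|P^{*\prime}\eta/n\right\|^2 = E\left[E\left[\left\|P^{*\prime}\eta/n\right\|^2 \,\Big|\, W\right]\right] = E\left[\sum_i p_i^{*\prime}p_i^*\,E[\eta_i^2|W]/n^2\right] \le C\frac{1}{n^2}\sum_i E\left[p_i^{*\prime}p_i^*\right] = C\,\mathrm{tr}(Q^*)/n = O(\kappa/n) \]

by Assumption A (first ineq.) and equation (A.16) (last eq.). By MK, this implies

\[ \left\|P^{*\prime}\eta/n\right\| \le O_p\left(\sqrt{\kappa/n}\right). \tag{A.25} \]

Hence by TR with (A.22), (A.24), and (A.25),

\[ \left\|T_n'\hat P'\eta/n\right\| \le \left\|(\hat PT_n - PT_n)'\eta/n\right\| + \left\|(PT_n - P^*)'\eta/n\right\| + \left\|P^{*\prime}\eta/n\right\| \le O_p\left(\sqrt{\kappa/n}\right). \]

Therefore, the first term of (A.21) becomes

\[ \left\|\hat Q^{-1/2}\hat P'\eta/n\right\|^2 = \left(\frac{\eta'\hat PT_n}{n}\right)\left(T_n'\hat QT_n\right)^{-1}\left(\frac{T_n'\hat P'\eta}{n}\right) \le O_p(1)\left\|T_n'\hat P'\eta/n\right\|^2 = O_p(\kappa/n) \tag{A.26} \]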

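by Lemma B.2 and (B.10).

Because $I - \hat P(\hat P'\hat P)^{-1}\hat P'$ is a projection matrix, hence is p.s.d, the second term of (A.21) becomes

\[ \begin{aligned} \left\|\hat Q^{-1/2}\hat P'(h - \tilde h)/n\right\|^2 &= (h - \tilde h)'\hat P(\hat P'\hat P)^{-1}\hat P'(h - \tilde h)/n \le (h - \tilde h)'(h - \tilde h)/n \\ &= \sum_i\left(h(w_i) - h(\hat w_i)\right)^2/n = \sum_i\left(\lambda(v_i) - \lambda(\hat v_i)\right)^2/n \\ &\le C\sum_i|v_i - \hat v_i|^2/n = O_p(\Delta_\pi^2) \end{aligned} \tag{A.27} \]

by (B.5) (last eq.) and Assumption C (Lipschitz continuity of $\lambda(v)$) (last ineq.). This term is a result of the generated regressors $\hat v$ from the first-stage estimation, and hence follows the rate of the first-stage series estimation ($\Delta_\pi^2$). Similarly, the last term is

\[ \begin{aligned} \left\|\hat Q^{-1/2}\hat P'(\tilde h - \hat P\beta)/n\right\|^2 &= (\tilde h - \hat P\beta)'\hat P(\hat P'\hat P)^{-1}\hat P'(\tilde h - \hat P\beta)/n \le (\tilde h - \hat P\beta)'(\tilde h - \hat P\beta)/n \\ &= \sum_i\left(h(\hat w_i) - p^K(\hat w_i)'\beta\right)^2/n = O_p(K^{-2s/d_x}) \end{aligned} \tag{A.28} \]

by Assumption C†. Therefore, by combining (A.26), (A.27), and (A.28),

\[ \|\hat\beta - \beta\| \le O_p(n^\delta)\left[O_p\left(\sqrt{\kappa/n}\right) + O_p(\Delta_\pi) + O(K^{-s/d_x})\right]. \]

Consequently, since $\kappa \asymp K$,

\[ \|\hat h - h_0\|_{L_2} \le O_p(n^\delta)\left[O_p\left(\sqrt{K/n}\right) + O(K^{-s/d_x}) + O_p(\Delta_\pi)\right] + O(K^{-s/d_x}) \]

and we have the conclusion of the lemma. □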

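Proof of Theorem 5.1: Recall $\hat Q_\tau = (\hat P'\hat P + n\tau_nD_{K^*})/n = \hat Q + \tau_nD_{K^*}$. Define $\hat P_\# = \hat P + n\tau_n\hat P(\hat P'\hat P)^{-1}D_{K^*}$. Note that the penalty bias emerges as $\hat P_\# \ne \hat P$. Consider

\[ \begin{aligned} \|\hat\beta_\tau - \beta\|^2 &= (\hat\beta_\tau - \beta)'(\hat\beta_\tau - \beta) \\ &= (y - \hat P_\#\beta)'\hat P(\hat P'\hat P + n\tau_nD_{K^*})^{-1}(\hat P'\hat P + n\tau_nD_{K^*})^{-1}\hat P'(y - \hat P_\#\beta) \\ &= (y - \hat P_\#\beta)'\hat P\,\hat Q_\tau^{-1/2}\hat Q_\tau^{-1}\hat Q_\tau^{-1/2}\,\hat P'(y - \hat P_\#\beta)/n^2 \\ &\le \lambda_{\max}(\hat Q_\tau^{-1})\left\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P_\#\beta)/n\right\|^2. \end{aligned} \]

Then, first note that by Lemma A.4,

\[ \lambda_{\max}(\hat Q_\tau^{-1}) = O_p\left(\min\left\{n^{2\delta},\ \tau_n^{-1}\right\}\right). \tag{A.29} \]

Hence $\|\hat\beta_\tau - \beta\| \le O_p(\min\{n^\delta, \tau_n^{-1/2}\})\,\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P_\#\beta)/n\|$. But, by TR,

\[ \begin{aligned} \left\|\hat Q_\tau^{-1/2}\hat P'(y - \hat P_\#\beta)/n\right\| &\le \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P_\#\beta)/n\right\| \\ &= \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'\left(h - \hat P\beta - n\tau_n\hat P(\hat P'\hat P)^{-1}D_{K^*}\beta\right)/n\right\| \\ &\le \left\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\right\| + \left\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P\beta)/n\right\| + \tau_n\left\|\hat Q_\tau^{-1/2}D_{K^*}\beta\right\|. \end{aligned} \]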

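For the first and second terms, note that $c'\hat P\hat Q_\tau^{-1}\hat P'c \le c'\hat P\hat Q^{-1}\hat P'c$ for any vector $c$, since $(\hat Q^{-1} - \hat Q_\tau^{-1})$ is p.s.d. Therefore, by (A.26), (A.27), and (A.28) in Lemma A.5, we have $\|\hat Q_\tau^{-1/2}\hat P'(y - h)/n\| = O_p(\sqrt{K/n})$ and $\|\hat Q_\tau^{-1/2}\hat P'(h - \hat P\beta)/n\| = O_p(K^{-s/d_x} + \Delta_\pi)$. The third term (squared) is

\[ \left\|\tau_n\hat Q_\tau^{-1/2}D_{K^*}\beta\right\|^2 = \tau_n^2\,\beta'D_{K^*}\hat Q_\tau^{-1}D_{K^*}\beta \le \tau_n^2\,\lambda_{\max}(\hat Q_\tau^{-1})\,\beta'D_{K^*}\beta \le \tau_n^2\,\lambda_{\max}(\hat Q_\tau^{-1})\sum_{j=K^*+1}^{K}\beta_j^2. \]

Then, by Assumption C†,

\[ \sum_{j=K^*+1}^{K}\beta_j^2 \le \sum_{j=K^*+1}^{\infty}Cj^{-2s/d_x - 2} \le C\int_{K^*}^{\infty}x^{-2s/d_x - 2}\,dx = \tilde CK^{*\,-2s/d_x - 1}. \]

Therefore, $\tau_n\|\hat Q_\tau^{-1/2}D_{K^*}\beta\| = O_p(\tau_nK^{*\,-s/d_x - 1/2}\min\{n^\delta, \tau_n^{-1/2}\})$. Consequently, analogous to the proof of Lemma A.5,

\[ \|\hat h_\tau - h_0\|_{L_2} = O_p\left(R_n\left(\sqrt{K/n} + K^{-s/d_x} + \tau_nK^{*\,-s/d_x - 1/2}R_n + \sqrt{L/n} + L^{-s_\pi/d_z}\right)\right). \]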
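The tail-sum step used for the penalty bias, $\sum_{j > K^*} j^{-2s/d_x - 2} \le \int_{K^*}^\infty x^{-2s/d_x - 2}\,dx$, can be checked numerically. The snippet below is a minimal sketch with hypothetical values of $s/d_x$, $K^*$, and $K$ (none of them from the paper); it compares the finite tail sum with the closed-form integral bound $K^{*\,-2s/d_x - 1}/(2s/d_x + 1)$.

```python
def tail_sum(exponent, start, stop):
    """Partial tail sum of j**(-exponent) over j = start + 1, ..., stop."""
    return sum(j ** (-exponent) for j in range(start + 1, stop + 1))


def integral_bound(exponent, start):
    """int_start^inf x**(-exponent) dx = start**(1 - exponent) / (exponent - 1), for exponent > 1."""
    return start ** (1.0 - exponent) / (exponent - 1.0)


def penalty_bias_tail(s_over_dx, k_star, k):
    """Tail sum of the coefficient bounds j**(-2 s/d_x - 2) and its integral bound."""
    exponent = 2.0 * s_over_dx + 2.0
    return tail_sum(exponent, k_star, k), integral_bound(exponent, k_star)


if __name__ == "__main__":
    for k_star in (10, 20, 40):
        print(k_star, penalty_bias_tail(s_over_dx=1.0, k_star=k_star, k=1000))
```

Doubling $K^*$ shrinks the bound by the factor $2^{-2s/d_x - 1}$, which is the $K^{*\,-2s/d_x - 1}$ rate that enters the penalty bias term.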

Note that here, the condition that the penalty bias is dominated by the approximation bias in the penalization dominating case is the following: $(K^*/K)^{-s/d_x}(\tau_nK^{*\,-1})^{1/2} \to c$ for $c \in [0, 1)$, which is guaranteed by Assumption D that $K^* \asymp K$. This proves the first part of the theorem. The conclusion of the second part follows from

\[ \left\|\hat h_\tau(w) - h_0(w)\right\|_\infty \le \left\|p^K(w)'\beta - h_0(w)\right\|_\infty + \left\|p^K(w)'(\hat\beta_\tau - \beta)\right\|_\infty \le O(K^{-s/d_x}) + \zeta_0^v(K)\,\|\hat\beta_\tau - \beta\|. \]

□

Proof of Theorem 5.3: The proof follows directly from the proofs of Theorems 4.2 and 4.3 of NPV (p. 602). As for notation, we use $v$ instead of $u$ of NPV and the other notation is identical. □

B

Appendix II

This section (Appendix II) contains the remaining proofs of the lemmas introduced in Appendix I and the proofs of the theorem and corollary in Section 6. It also briefly discusses the generalization of the analyses in Subsections A.2 and A.3. Lastly, it discusses weak instruments in a NPIV model.

B.1

Proofs of Sufficiency of Assumptions B, C, D, and L

Proof that B and L imply B†: For simplicity, assume $\Pr[z \in Z^r(\tilde\Pi)] = 1$ and we prove after replacing $Q^{r*}$ with $Q^*$ in Assumption B†(i). Note that $\tilde\Pi(\cdot)$ is piecewise one-to-one. Here, we prove the case where $\tilde\Pi(\cdot)$ is one-to-one. The general cases where $\tilde\Pi(\cdot)$ is piecewise one-to-one or where $0 < \Pr[z \in Z^r(\tilde\Pi)] < 1$ can be followed by conditioning on $z$ in an appropriate subset of $Z$.

Consider the change of variables of $u = (z, v)$ into $\tilde u = (\tilde z, \tilde v)$ where $\tilde z = \tilde\Pi(z)$ and $\tilde v = v$. Then, it follows that $p^{*K}(u_i) = [\,1 \;\vdots\; \tilde z_i\,\partial p^\kappa(v_i)' \;\vdots\; p^\kappa(v_i)'\,]' = p^K(\tilde u_i)$ where $p^K(\tilde u_i)$ is one particular form of a vector of approximating functions as specified in NPV (pp. 572–573). Moreover, the joint density of $\tilde u$ is

\[ f_{\tilde u}(\tilde z, \tilde v) = f_u(\tilde\Pi^{-1}(\tilde z), \tilde v)\cdot\left|\det\begin{bmatrix} \partial\tilde\Pi^{-1}(\tilde z)/\partial\tilde z & 0 \\ 0 & 1 \end{bmatrix}\right| = f_u(\tilde\Pi^{-1}(\tilde z), \tilde v)\cdot\left|\frac{\partial\tilde\Pi^{-1}(\tilde z)}{\partial\tilde z}\right|. \]

Since $\partial\tilde\Pi^{-1}(\tilde z)/\partial\tilde z \ne 0$ by $\tilde\Pi \in C_1(Z)$ (bounded derivative) and $f_u$ is bounded away from zero and the support of $u$ is compact by Assumption B, the support of $\tilde u$ is also compact and $f_{\tilde u}$ is also bounded away from zero. Then, by the proof of Theorem 4 in Newey (1997, p. 167), $\lambda_{\min}(E\,p^K(\tilde u_i)p^K(\tilde u_i)')$ is bounded away from zero. Therefore $\lambda_{\min}(Q^*)$ is bounded away from zero for all $\kappa$.

As for $Q_1$ that does not depend on the effect of weak instruments, the density of $z$ being bounded away from zero implies that $\lambda_{\min}(Q_1)$ is bounded away from zero for all $L$ by Newey (1997, p. 167). The maximum eigenvalues of $Q^*$, $Q$ and $Q_1$ are bounded by fixed constants by the fact that the polynomials or splines are defined on bounded sets. Lastly, note that $q_{jj}$ is either $E[p_j(x)^2]$ or $E[p_j(v)^2]$. But by Assumption B, the density of $v$ is bounded below, and hence $E[p_j(v)^2]$ is bounded below. Similarly, $E[p_j(x)^2]$ is also bounded below since eventually $x$ converges to $v$ by Assumption L. □

Proof that B and C imply C†: The results follow from Lemma 5.1. □

Proof that D implies D†: It follows from Newey (1997, p. 157) that in the polynomial case

\[ \zeta_r^v(K) = K^{1 + 2r}. \tag{B.1} \]

The same result holds for $\xi_r(L)$. □
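To see what (B.1) implies for Assumption D†(i), one can track the exponent of $n$ in $n^{2\delta}\kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi$. The sketch below rests on assumptions made only for illustration: power-law tuning choices $\kappa = n^a$ and $L = n^b$ (hypothetical, not prescribed in the text), the polynomial bound $\zeta_1^v(\kappa) = \kappa^3$ from (B.1), and $\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}$. The condition holds when the returned exponent is negative.

```python
def delta_pi_exponent(b, s_pi_over_dz):
    """Exponent of n in Delta_pi = sqrt(L/n) + L**(-s_pi/d_z) when L = n**b (larger term dominates)."""
    return max((b - 1.0) / 2.0, -b * s_pi_over_dz)


def d_dagger_i_exponent(delta, a, b, s_pi_over_dz):
    """Exponent of n in n^(2 delta) * kappa^(1/2) * zeta_1(kappa) * Delta_pi.

    Assumes kappa = n**a, L = n**b, and the polynomial-case bound zeta_1(kappa) = kappa**3.
    """
    return 2.0 * delta + 0.5 * a + 3.0 * a + delta_pi_exponent(b, s_pi_over_dz)


if __name__ == "__main__":
    for delta in (0.05, 0.15, 0.25):
        print(delta, d_dagger_i_exponent(delta, a=0.05, b=0.2, s_pi_over_dz=4.0))
```

In this parametrization a weaker instrument (larger $\delta$) has to be offset by a slower growth of $\kappa$ for the exponent to remain negative.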

B.2

Proofs of Technical Lemmas and Lemma A.3(b)

The following are mathematical lemmas that are useful in proving Lemmas A.3 and A.4 in the Appendix. Again, for any matrix $A$, let the matrix norm be the Euclidean norm $\|A\| = \sqrt{\mathrm{tr}(A'A)}$.

Lemma B.1 For symmetric $k \times k$ matrices $A$ and $B$, let $\lambda_j(A)$ and $\lambda_j(B)$ denote their $j$th eigenvalues such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$. Then, the following inequality holds: For $1 \le j \le k$,

\[ |\lambda_j(A) - \lambda_j(B)| \le |\lambda_1(A - B)| \le \|A - B\|. \]

By taking $j = k$ and $j = 1$, note that Lemma B.1 implies $|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \|A - B\|$ and $|\lambda_{\max}(A) - \lambda_{\max}(B)| \le \|A - B\|$, respectively, which will be useful in several proofs below.
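The eigenvalue perturbation bound of Lemma B.1 can be checked on small examples. The following is a minimal sketch restricted to symmetric $2 \times 2$ matrices, where the eigenvalues have a closed form; the matrices used are arbitrary illustrative values, not objects from the paper. It compares $\max_j|\lambda_j(A) - \lambda_j(B)|$ with the Euclidean norm $\|A - B\|$.

```python
import math


def sym2_eigs_desc(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in decreasing order."""
    mid = (a + d) / 2.0
    rad = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mid + rad, mid - rad


def euclidean_norm_sym2(a, b, d):
    """Euclidean norm sqrt(tr(M'M)) of the symmetric matrix [[a, b], [b, d]]."""
    return math.sqrt(a * a + 2.0 * b * b + d * d)


def eigen_gap_and_bound(A, B):
    """Return (max_j |lambda_j(A) - lambda_j(B)|, ||A - B||) for matrices given as (a, b, d)."""
    eig_a = sym2_eigs_desc(*A)
    eig_b = sym2_eigs_desc(*B)
    gap = max(abs(x - y) for x, y in zip(eig_a, eig_b))
    diff = tuple(x - y for x, y in zip(A, B))
    return gap, euclidean_norm_sym2(*diff)


if __name__ == "__main__":
    print(eigen_gap_and_bound((2.0, 0.5, 1.0), (1.8, 0.3, 1.1)))
```

The first returned number should never exceed the second, whichever symmetric pair is supplied.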

Lemma B.2 If a $K(n) \times K(n)$ symmetric random sequence of matrices $A_n$ satisfies $\lambda_{\max}(A_n) = O_p(n^\delta)$, then $\|B_nA_n\| \le \|B_n\|\,O_p(n^\delta)$ for a given sequence of matrices $B_n$.

Another useful corollary of this lemma is that, for $\lambda_{\max}(A_n) = O_p(n^\delta)$ and sequences of vectors $b_n$ and $c_n$, we have $b_n'A_nc_n \le O_p(n^\delta)\,b_n'c_n$.

Proof of Lemma B.1: We provide slightly more general results. Firstly, by Weyl (1912), for symmetric $k \times k$ matrices $C$ and $D$

\[ \lambda_{i+j-1}(C + D) \le \lambda_i(C) + \lambda_j(D), \tag{B.2} \]

where $i + j - 1 \le k$. As for the second inequality, we prove

\[ |\lambda_j(D)| \le \|D\|, \tag{B.3} \]

for $1 \le j \le k$. Note that, for any $k \times 1$ vector $a$ such that $\|a\| = 1$,

\[ (a'Da)^2 = \mathrm{tr}(a'Daa'Da) = \mathrm{tr}(DDaa') \le \mathrm{tr}(DD)\,\mathrm{tr}(aa') = \mathrm{tr}(DD). \]

Since $\lambda_j(D) = a'Da$ for some $a$ with $\|a\| = 1$, taking the square root on both sides of the inequality gives the desired result. Now, in inequalities (B.2) and (B.3), take $j = 1$, $C = B$, and $D = A - B$ and we have the conclusions. □

Proof of Lemma B.2: Let $A_n$ have eigenvalue decomposition $A_n = UDU^{-1}$. Then,

\[ \|B_nA_n\|^2 = \mathrm{tr}\left(B_nA_nA_nB_n'\right) = \mathrm{tr}\left(B_nUD^2U^{-1}B_n'\right) \le \mathrm{tr}\left(B_nUU^{-1}B_n'\right)\cdot\lambda_{\max}(A_n)^2 = \|B_n\|^2\,O_p(n^\delta)^2. \]

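□

Proof of Lemma A.3(b): Recall $\Delta_\pi = \sqrt{L/n} + L^{-s_\pi/d_z}$, $\Delta_{\hat Q} = \zeta_1^v(\kappa)^2\Delta_\pi^2 + \kappa^{1/2}\zeta_1^v(\kappa)\Delta_\pi$, and $\Delta_{\tilde Q} = \sqrt{\{\zeta_1^v(\kappa)^2 + \zeta_0^v(\kappa)^2\}K/n}$. Let $p_i = p^K(w_i)$ for brevity, whose element is denoted as $p_k(w_i)$ for $k = 1, ..., K$. Consider

\[ \begin{aligned} E\|\tilde Q - Q\|^2 &= E\left\{\sum_j\sum_k\left(\frac{\sum_i p_ip_i'}{n} - E\left[p_ip_i'\right]\right)_{jk}^2\right\} \\ &= \sum_j\sum_k E\left\{\left(\frac{\sum_i p_k(w_i)p_j(w_i)}{n} - E\left[p_k(w_i)p_j(w_i)\right]\right)^2\right\} \\ &\le \sum_j\sum_k\frac{1}{n}E\left[p_k(w_i)^2p_j(w_i)^2\right] = \frac{1}{n}E\left[p_i'p_i\,p_i'p_i\right] \end{aligned} \]

by Assumption A (last ineq.). We bound the fourth moment with the bounds of a second moment and the following bound. With $w = (x, v)$,

\[ \max_{|\mu| \le r}\,\sup_{w \in W}\left\|\partial^\mu p^K(w)\right\|^2 \le \sup_{x \in X}\left\|\partial^r p^\kappa(x)\right\|^2 + \sup_{v \in V}\left\|\partial^r p^\kappa(v)\right\|^2 \le \zeta_r^v(\kappa)^2 + \zeta_r^v(\kappa)^2 = 2\zeta_r^v(\kappa)^2. \]

By Assumption B†(ii), the second moment $E[p_i'p_i] = \mathrm{tr}(Q) \le \mathrm{tr}(I_{2\kappa})\lambda_{\max}(Q) \le C\cdot\kappa$. Hence

\[ E\|\tilde Q - Q\|^2 \le \frac{1}{n}E\left[p_i'p_i\,p_i'p_i\right] \le O\left(\zeta_1^v(\kappa)^2\kappa/n\right). \]

Therefore, by MK, $\|\tilde Q - Q\|^2 = O_p(\zeta_1^v(\kappa)^2\kappa/n) = O_p(\Delta_{\tilde Q}^2)$. Now, by Lemma B.2,

\[ \left\|T_n'(\tilde Q - Q)T_n\right\| \le \lambda_{\max}(T_n)^2\,\|\tilde Q - Q\| \le O(n^{2\delta})\,O_p(\Delta_{\tilde Q}) = o_p(1) \tag{B.4} \]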

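by Assumption D†(i) and (A.20).

Let $\hat p_i = p^K(\hat w_i) = [\,1 \;\vdots\; p^\kappa(x_i)' \;\vdots\; p^\kappa(\hat v_i)'\,]'$ and by mean value expansion

\[ \hat p_i = [\,1 \;\vdots\; p^\kappa(x_i)' \;\vdots\; p^\kappa(v_i)' + \partial p^\kappa(\bar v_i)'(\hat v_i - v_i)\,]', \]

where $\bar v$ is the intermediate value. Since

\[ \frac{1}{n}\sum_i|\hat v_i - v_i|^2 = \frac{1}{n}\sum_i\left|\Pi_n(z_i) - \hat\Pi(z_i)\right|^2 = O_p(\Delta_\pi^2), \tag{B.5} \]

we have

\[ \frac{1}{n}\sum_i\|\hat p_i - p_i\|^2 = \frac{1}{n}\sum_i\left\{\left\|p^\kappa(x_i) - p^\kappa(x_i)\right\|^2 + \left\|\partial p^\kappa(\bar v_i)(\hat v_i - v_i)\right\|^2\right\} \le C\zeta_1^v(\kappa)^2\frac{1}{n}\sum_i|\hat v_i - v_i|^2 \le O_p\left(\zeta_1^v(\kappa)^2\Delta_\pi^2\right). \tag{B.6} \]

Also, by MK,

\[ \Pr\left[\|p_i\|^2 > \varepsilon\right] \le \frac{E\|p_i\|^2}{\varepsilon} = C\cdot\mathrm{tr}(Q) \le C\cdot\mathrm{tr}(I_{2\kappa})\,\lambda_{\max}(Q) = O(\kappa), \tag{B.7} \]

hence $\sum_{i=1}^n\|p_i\|^2/n = O_p(\kappa)$. Then, by TR, CS and by combining (B.6) and (B.7)

\[ \begin{aligned} \|\hat Q - \tilde Q\| &= \left\|\frac{1}{n}\sum_{i=1}^n\left(\hat p_i\hat p_i' - p_ip_i'\right)\right\| \le \frac{1}{n}\sum_{i=1}^n\left\|(\hat p_i - p_i)(\hat p_i - p_i)' + p_i(\hat p_i - p_i)' + (\hat p_i - p_i)p_i'\right\| \\ &\le \frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2 + 2\frac{1}{n}\sum_{i=1}^n\|p_i\|\,\|\hat p_i - p_i\| \\ &\le \frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2 + 2\left(\frac{1}{n}\sum_{i=1}^n\|p_i\|^2\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n\|\hat p_i - p_i\|^2\right)^{1/2} \\ &= O_p\left(\zeta_1^v(\kappa)^2\Delta_\pi^2\right) + O_p(\kappa^{1/2})\,O_p\left(\zeta_1^v(\kappa)\Delta_\pi\right) = O_p(\Delta_{\hat Q}). \end{aligned} \]

Thus, by Lemma B.2,

\[ \left\|T_n'(\hat Q - \tilde Q)T_n\right\| \le O(n^{2\delta})\,O_p(\Delta_{\hat Q}) = o_p(1) \tag{B.8} \]

by Assumption D†(i) and (A.20). Therefore, by combining (B.4) and (B.8) and by TR,

\[ \left\|T_n'\hat QT_n - T_n'QT_n\right\| \le \left\|T_n'\hat QT_n - T_n'\tilde QT_n\right\| + \left\|T_n'\tilde QT_n - T_n'QT_n\right\| = o_p(1). \tag{B.9} \]

Now by TR, (B.9) and (A.17) give $\|T_n'\hat QT_n - Q^*\| \le \|T_n'\hat QT_n - T_n'QT_n\| + \|T_n'QT_n - Q^*\| = o_p(1)$. Also, by Lemma B.1, we have $|\lambda_{\min}(T_n'\hat QT_n) - \lambda_{\min}(Q^*)| \le \|T_n'\hat QT_n - Q^*\|$. Combine the results to have $\lambda_{\min}(T_n'\hat QT_n) = \lambda_{\min}(Q^*) + o_p(1)$. But, by Assumption B†(i), we have

\[ \lambda_{\min}(T_n'\hat QT_n) \ge c > 0, \tag{B.10} \]

w.p. $\to 1$ as $n \to \infty$. Then, similarly to the proof of Lemma A.3(a), we have

\[ \lambda_{\max}(\hat Q^{-1}) = \lambda_{\max}\left(T_n(T_n'\hat QT_n)^{-1}T_n'\right) \le O_p(1)\,\lambda_{\max}(T_nT_n') = O_p(n^{2\delta}). \]

□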

B.3

Generalization

The general case of $K_1 \ne K_2$ for Subsections A.2 and A.3 can be incorporated by the following argument. Recall that $K = K_1 + K_2 + 1$ and let $\kappa = \min\{K_1, K_2\}$. Note that we have $K \asymp \kappa$. Suppose $K_1 \le K_2$ without loss of generality, then we can rewrite $p^K(w)$ as

\[ \begin{aligned} p^K(w) &= (1, p_1(x), ..., p_{K_1}(x), p_1(v), ..., p_{K_2}(v))' \\ &= (1, p_1(x), ..., p_{K_1}(x), p_1(v), ..., p_{K_1}(v), p_{K_1+1}(v), ..., p_{K_2}(v))' \\ &= [\,1 \;\vdots\; p^{K_1}(x)' \;\vdots\; p^{K_1}(v)' \;\vdots\; (p_{K_1+1}(v), ..., p_{K_2}(v))\,]' = [\,p^{2K_1+1}(w)' \;\vdots\; d^{K_2-K_1}(v)'\,]', \end{aligned} \]

where $d^{K_2-K_1}(v) = (p_{K_1+1}(v), ..., p_{K_2}(v))'$. Let $\tilde p_i = p^{2\kappa+1}(w_i)$ and $d_i = d^{K_2-K_1}(v_i)$. Then,

\[ Q = E\left[p^K(w)p^K(w)'\right] = \begin{bmatrix} E[\tilde p_i\tilde p_i'] & E[\tilde p_id_i'] \\ E[d_i\tilde p_i'] & E[d_id_i'] \end{bmatrix}. \]

Note that the partition $E[d_id_i']$ is the usual second moment matrix free from the weak instruments. Therefore, the singularity of $Q$ is determined by the remaining three. Note that $\tilde p_i$ is subject to the weak instruments as before and after mean value expanding it, the usual transformation matrix can be applied. Then, by applying the inverse of a partitioned matrix, the maximum eigenvalue of $Q^{-1}$ can eventually be expressed in terms of the $n^\delta$ rate of Assumption L. A similar argument holds for $K_1 \ge K_2$.

We can also have a more general case of multivariate $x$ for the results of Subsections A.2 and A.3 including the proof of Lemma A.4. Define

\[ \partial^\mu p_j(x) = \frac{\partial^{|\mu|}p_j(x)}{\partial x_1^{\mu_1}\partial x_2^{\mu_2}\cdots\partial x_{d_x}^{\mu_{d_x}}}, \]

where $|\mu| = \sum_{t=1}^{d_x}\mu_t$, and $\partial^\mu p^K(x) = [\partial^\mu p_1(x), ..., \partial^\mu p_K(x)]'$. Then, multivariate mean value expansion can be applied using the derivative $\partial^\mu p_j(x)$. One can also have a simpler proof by considering mean value expansion of just one element of $x$ and applying a similar proof as in the univariate case. That is, $p_j(v) = p_j(x - \Pi_n(z)) = p_j(x) - \Pi_{m,n}(z)\,\partial p_j(\tilde x)/\partial x_m$ where $\tilde x$ is an intermediate value and $\Pi_{m,n}(z)$ is the $m$-th element of $\Pi_n(z)$.

B.4

Proof of Asymptotic Normality (Section 6)

Assumption G in Section 6 implies the following technical assumption. Note that $\zeta_r(K) \le \zeta_r^v(K)$ by definition.

Assumption G† The following converge to zero as $n \to \infty$: $\sqrt{n}K^{-s/d_x}$, $\sqrt{n}L^{-s_\pi/d_z}$, $\sqrt{n}\tau_nn^\delta$, $\zeta_0^v(K)L/\sqrt{n}$, $n^\delta L^{1/2}\{L\zeta_1^v(\kappa) + K^{1/2}\xi(L)\}/\sqrt{n}$, $n^{3\delta}K\zeta_1^v(K)L^{1/2}/\sqrt{n}$, $n^{4\delta}\{\zeta_0^v(K)^2K + \xi(L)^2L\}/n$, and $n^\delta\zeta_0^v(K)L^{1/2}\zeta_1^v(K)(K + L)^{1/2}/\sqrt{n}$.

Proof that Assumption G implies Assumption G† : By (B.1), ⇣rv (K) = K 1+2r and simi-


larly, ⇠(L) = L. Therefore, p ⇣0v (K)L/ n = n n o p n L1/2 L⇣1v () + K 1/2 ⇠(L) / n = n

1/2

=n

p n3 K⇣1v (K)L1/2 / n = n3(

KL

n o L1/2 LK 3 + K 1/2 L n o 1 3/2 2L K 3 + K 1/2  Cn 1 2

1 ) 6

1 2

K 3 L3/2

K 4 L1/2

1

n4

⇣0v (K)2 K + ⇠(L)2 L /n = n4( 4 ) K 3 + L3 p 1 n ⇣0v (K)L1/2 ⇣1v (K)(K + L)1/2 / n  n 2 K 4 L1/2 (K + L)1/2 . Then, the results of Assumption G† follow. ⇤ Preliminary derivations for the proof of Theorem 6.1 : Define Q H

First, n3 K 1/2 we have

⇡

p + ⌧n K, p = L1/2 ⇣1v () ⇡ + K 1/2 ⇠(L)/ n, =

ˆ Q

+

˜, Q

=

Q⌧

Q

h

Therefore, Assumption n L1/2

H

n3 ⇣1v ()

⇡

n p o = n L1/2 L1/2 ⇣1v () ⇡ + K 1/2 ⇠(L)/ n n p p o = O(n L1/2 L1/2 ⇣1v ()L1/2 / n + K 1/2 ⇠(L)/ n ) = o(1) p = Op (n3 K⇣1v (K)L1/2 / n) = op (1)

by G† . These results imply n ⇣1v (K)2 n3 1/2

ˆ Q

n3 1/2

˜ Q

2 ⇡

= O(n ⇣1v (K)2 L/n) = o(1) and also

n o = n3 1/2 ⇣1v ()2 2⇡ + 1/2 ⇣1v () ⇡ ! 0 q p = n3 1/2 {⇣1v ()2 + ⇣0v ()2 } K/n  Cn3 ⇣1v ()/ n ! 0

p p ! 0 and n3 ⇣1v () ⇡ ! 0. Also, since n⌧n n ! 0 and n K/ n ! p p = n⌧n n · n K/ n ! 0, consequently, n2 1/2 Q⌧ ! 0. Also G† assumes

since n3 1/2 ⇣1v ()2 0 imply n2 K⌧n

⇡ ).

p p nK s/dx ! 0 and nL s⇡ /dz ! 0, Q ! 0. Also note that, by p p 1/2 pnL s⇡ /dz ) = O(L1/2 /pn) and L/n). h = Rn ( K/n + p v p † 2 v G implies n⇣0 (K) ⇡ = O(⇣0 (K)L/ n) = o(1). Also

! 0 implies n2 p = (L1/2 / n)(1 + L Q

p = ⇠(L)L1/2 / n p = Rn ( K/n + K s/dx + Q1

2 ⇡

O n4 1 ⇣0v (K)2 K + ⇠(L)2 L = o(1). And L1/2 p n⌧n n = o(1) implies Rn = n , ⇣0v (K)L1/2 ⇣1v (K) by G† , which in turn gives ⇣0v (K)

h

h

Q1

! 0 is also implied by G† . Also, since

p = n ⇣0v (K)L1/2 ⇣1v (K)(K + L)1/2 / n ! 0

! 0. Then, we also have n2 50

Q

+ ⇣0v (K)2 K/n ! 0. Lastly,

$\sqrt{n}\tau_nn^\delta \to 0$ implies $n^{2\delta}\tau_n \to 0$ since $n^\delta/\sqrt{n} \to 0$, and hence $R_n = n^\delta$. □

Proof of Theorem 6.1: The proofs of Theorem 6.1 and Corollary 6.2 are a modification of the proof of Theorem 5.1 in NPV (with their trimming function being an identity function), taking into account that instruments are weak. We use the components established in the proof of the convergence rate, which are distinct from NPV. The rest of the notation is the same as that of NPV. Let "MVE" abbreviate mean value expansion. Let $X$ be a vector of variables that includes $x$ and $z$, and $w(X, \pi)$ a vector of functions of $X$ and $\pi$. Note that $w(\cdot, \cdot)$ is a vector of transformation functions of regressors and, trivially, $w(X, \pi) = (x, x - \Pi_n(z))$. Recall

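\[ \hat V_\tau = A\hat Q_\tau^{-1}\left(\hat\Sigma_\tau + \hat H_\tau\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H_\tau'\right)\hat Q_\tau^{-1}A', \qquad \hat H_\tau = \sum_i\hat p_i\left\{\left[\partial\hat h_\tau(\hat w_i)/\partial w\right]'\partial w(X_i, \hat\Pi(z_i))/\partial\pi\right\}r_i'/n, \]

$\hat Q = \hat P'\hat P/n$, $\hat Q_\tau = \hat Q + \tau_nI$, $Q = E[p_ip_i']$, $\hat Q_1 = R'R/n$, $\hat\Sigma_\tau = \sum_i\hat p_i\hat p_i'[y_i - \hat h_\tau(\hat w_i)]^2/n$, and $\hat\Sigma_1 = \sum_i\hat v_i^2r_ir_i'/n$ where $r_i = r^L(z_i)$. Then, define

\[ V = AQ^{-1}\left(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H'\right)Q^{-1}A', \qquad \Sigma = E\left[p_ip_i'\,\mathrm{var}(y_i|X_i)\right], \qquad H = E\left[p_i\left\{\left[\partial h(w_i)/\partial w\right]'\partial w(X_i, \Pi_n(z_i))/\partial\pi\right\}r_i'\right], \]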

where V does not depend on ⌧ . Note that H is a channel through which the first-stage estimation error kicks into the variance of the estimator of h(·) in the outcome equation. We first prove

p

nV

For notational simplicity, let F = V

1/2

1/2 .

(✓ˆ⌧

✓0 ) !d N (0, 1).

˜ = (h(w Let h = (h(w1 ), ..., h(wn ))0 and h ˆ1 ), ..., h(w ˆn ))0 .

h0 (wi ) and ⌘ = (⌘1 , ..., ⌘n )0 . Let ⇧n = (⇧n (z1 ), ..., ⇧n (zn ))0 , vi = xi ⇧n (zi ), ⇥ ⇤ p and U = (v1 , ..., vn )0 . Once we prove that F AQ 1 = O(n ), nF a(pK0 ) a(h0 ) = op (1), p ˜ Pˆ# ) = op (1), F AQ ˜ pn = F AQ 1 ⇥ HR0 U/pn + op (1), ˆ ⌧ 1 Pˆ 0 (h h)/ nF A(Pˆ 0 Pˆ + n⌧n I) 1 Pˆ 0 (h Also let ⌘i = yi


ˆ ⌧ 1 Pˆ 0 ⌘/pn = F AQ and F AQ p

nV

1/2

⇣

✓ˆ⌧

Then, for any vector

1P ˆ 0 ⌘/pn

+ op (1) below, then we will have,

⌘ p ⇣ ⌘ ˆ ⌧ ) a(h0 ) ✓0 = nF a(h ⇣ ⌘ p = nF a(pK0 ˆ⌧ ) a(pK0 ) + a(pK0 ) a(h0 ) p p = nF A ˆ⌧ nF A + op (1) p p = nF A(Pˆ 0 Pˆ + n⌧n I) 1 Pˆ 0 (h + ⌘) nF A(Pˆ 0 Pˆ + n⌧n I) p ˜ Pˆ# ˜) + op (1) + nF A(Pˆ 0 Pˆ + n⌧n I) 1 Pˆ 0 (h p p ˜ ˆ ⌧ 1 Pˆ 0 ⌘/ n F AQ ˆ ⌧ 1 Pˆ 0 (h h)/ = F AQ n + op (1) p p = F AQ 1 (P 0 ⌘/ n + HR0 U/ n) + op (1). 0 F AQ 1 [p

with k k = 1, let Zin =

i ⌘i

1

˜ Pˆ 0 h

(B.11)

p + Hri ui ] / n. Note Zin is i.i.d. for

each n. Also EZin = 0, var(Zin ) = 1/n (recall ⌃ = E[pi p0i var(yi |Xi )]). Furthermore, F AQ 1H

O(n ) and F AQ

 C F AQ

1

= O(n ) by CI

1

=

HH 0 p.s.d, so that, for any " > 0,

⇥ ⇤ ⇥ ⇤ 2 nE 1 {|Zin | > "} Zin = n"2 E 1 {|Zin /"| > 1} (Zin /")2 ⇥ ⇤  n"2 E (Zin /")4 h i n"2 4  2 4 k k4 F AQ 1 {kpi k2 E kpi k2 E[⌘i4 |Xi ] n " h i + kri k2 E kri k2 E[u4i |zi ] } n o  CO(n4 ) ⇣0v (K)2 E kpi k2 + ⇠(L)2 E kri k2 /n  CO(n4 ) ⇣0v (K)2 tr(Q) + ⇠(L)2 tr(Q1 ) /n  O(n4 by G† . Then,

p

nF (✓ˆ⌧

1

⇣0v (K)2 K + ⇠(L)2 L ) = o(1)

✓0 ) !d N (0, 1) by the Lindeberg-Feller theorem and (B.11).

Now, we proceed with detailed proofs. For simplicity as before, the remainder of the proof will be given for the scalar ⇧(z) case. Under Assumption F and by CS, |a(hK )| = |A kAk k

CI =

Kk

= C kAk EhK

min (Q

1 )I

 Q

1/2 (w)2

1.

And also since

Assumption E, we have ⌃ V

AQ

so kAk ! 1. Since 2 (X)

1)

min (Q

K|



is bounded away from zero,

= var(y|X) is bounded away from zero by

CQ. Hence 1

⌃Q

1

A0

CAQ

1

QQ

Therefore, it follows that F is bounded.


1

A0

CAQ

1

A0

C˜ kAk2 ,

(B.12)

Also, by the previous proofs on the convergence rate, ˆ⌧ Q

ˆ Q  Q

ˆ Q + ⌧n I  Q

˜ + Q ˜ Q + k⌧n Ik Q p  Op ( Qˆ ) + Op ( Q˜ ) + O(⌧n K) = Op ( Q⌧ )

ˆ 1 I = Op (⇠(L)L1/2 /pn) = Op ( Q ). Furthermore similarly to and, by letting Q1 = I, Q 1 P 0 ¯ ¯ the proofs above, with H = pˆi d(Xi )ri /n, H H = Op ( H ) = op (1) as in NPV, where ⇥ 1/2 ⇤ 2 1/2 ⇠(L)/pn. Now, by (B.12) ⇣1 (K) + ⇣0 (K)⇠(L) ⇡ +K H = L F AQ

2

1/2

1

= tr(F AQ

A0 F )  tr(CF V F ) = C.

By Assumption G† , Rn = n and hence by (A.29) and Lemmas A.3 and A.4, and

max (Q

1)

ˆ

max (Q⌧

1)

= Op (n2 )

= O(n2 ). And Then,

F AQ

1

= F AQ

1/2

Q

1/2



ˆ 1  BQ ˆ Note that for any matrix B, B Q ⌧ ˆ ⌧ 1  F A0 Q ˆ F A0 Q

1

max (Q

1 1/2

)

ˆ , since (Q

1

 F AQ

1

Q)

1/2

 CO(n ).

ˆ 1 ) is p.s.d. Hence Q ⌧

1

+ F AQ

ˆ  CO(n ) + CO(n )Op (n2 ) Q = O(n ) + Op (n3

F AQ

1

Q

⇣

⌘ ˆ Q Q

ˆ Q

1

= O(n ) + op (1) = Op (n )

by Assumption G† . Also ˆ 1/2 F AQ ⌧

2

ˆ  F A0 Q

1/2

 C + F A0 Q

2

1/2

 F AQ

1

ˆ Q

2

+ tr(F AQ

ˆ F A0 Q

Q

1

1

⇣

ˆ Q

 O(n )Op (

⌘ ˆ Q Q

1

Q )O(n

A0 F ) ) = op (1).

Firstly, by Assumptions C† and G† , p

⇥ nF a(pK0 )

a(h0 )

⇤

=

p

⇥ nF a(pK0

p  C nK

s/dx

h0 )

⇤

= op (1).




p

n |F | sup pK (w)0 w

h0 (w)

Secondly, ˜ ˆ ⌧ 1 Pˆ 0 (h F AQ

p p p ˆ ⌧ 1 Pˆ 0 / n Pˆ# ˜)/ n  F AQ n sup pK (w)0 ˜ W

h0 (w)

p ˆ 1 / n + n⌧n F AQ ⌧ p p ˆ ⌧ 1/2 ˆ⌧ 1 k k  F AQ nO(K s/dx ) + n⌧n F AQ p p  op (1)O( nK s/dx ) + Op ( n⌧n n ) = op (1)

by G† . Thirdly, let

= ( 1 , ...,

L)

0

and di = d(Xi ). By a second order MVE of each h(w ˆi ) around

wi ˆ ⌧ 1 Pˆ 0 (h F AQ

h X p ˜ ˆ⌧ 1 ˆ i) h)/ n = F AQ pˆi di ⇧(z i

i p ⇧n (zi ) / n + ⇢ˆ

p ˆ⌧ H ¯Q ˆ 1 R0 U/ n + F AQ ˆ ⌧ 1H ¯Q ˆ 1 R0 (⇧n = F AQ 1 1 X ⇥ 0 ⇤ p 1 ˆ⌧ + F AQ pˆi di ri ⇧n (zi ) / n + ⇢ˆ. 1

p R0 )/ n

i

2 p ˆ ⌧ 1/2 ⇣ v (K) P ⇧(z ˆ i ) ⇧n (zi ) /n = op (1)Op (pn⇣ v (K) 2⇡ ) = op (1). But kˆ ⇢k  C n F AQ 0 0 i 1 ¯0 ¯ ˆ Also, by di being bounded and nH Q1 H being equal to the matrix sum of squares from the ¯Q ˆ 1H ¯ 0  P pˆi pˆ0 d2 /n  C Q ˆ  CQ ˆ ⌧ . Therefore, multivariate regression of pˆi di on ri , H i i 1 i

ˆ ⌧ 1H ¯Q ˆ 1 R0 (⇧n F AQ 1

p p p ˆ ⌧ 1H ¯Q ˆ 1 R0 / n R0 )/ n  F AQ n sup ⇧n (z) 1 n

rL (z)0

Z

⇣

⌘o1/2 p ˆ ⌧ 1H ¯Q ˆ 1Q ˆ 1Q ˆ 1H ¯ 0Q ˆ ⌧ 1 A0 F 0  tr F AQ O( nL s⇡ /dz ) 1 1 n ⇣ ⌘o1/2 p ˆ 1Q ˆ⌧ Q ˆ 1 A0 F 0  C tr F AQ O( nL s⇡ /dz ) ⌧ ⌧ p p ˆ ⌧ 1/2 O( nL s⇡ /dz ) = op (1)Op ( nL s⇡ /dz ) = op (1)  C F AQ

by G† . Similarly, ˆ 1 F AQ ⌧

X i

pˆi di [ri0

p p ˆ 1/2 O( nL ⇧n (zi )]/ n  C F AQ ⌧

s⇡ /dz

p ) = op (1)Op ( nL

s⇡ /dz

) = op (1).

ˆ ⌧ 1H ¯Q ˆ 1 R0 U/pn. Note that E kR0 U/pnk2 = tr(⌃1 )  Ctr(IL )  Next, we consider the term F AQ 1 p L by E[u2 |z] bounded, so by MR, kR0 U/ nk = Op (L1/2 ). Also, we have ˆ 1H ¯Q ˆ 1  Op (1) F AQ ˆ 1/2 = op (1). F AQ ⌧ ⌧ 1


Therefore ˆ ⌧ 1 H( ¯ Q ˆ 1 F AQ 1

p ˆ ⌧ 1H ¯Q ˆ 1 I)R0 U/ n  F AQ 1 = op (1)Op (

ˆ1 Q

Q1 )Op (L

p R0 U/ n

I

1/2

) = op (1)

by G† . Similarly, ˆ 1 (H ¯ F AQ ⌧

p ˆ 1 H)R0 U/ n  F AQ ⌧

¯ H

= Op (n )Op (

p R0 U/ n

H

H )Op (L

1/2

) = op (1)

by G† . Note that HH 0 is the population matrix mean-square of the regression of pi di on ri so p 2 p that HH 0  C, it follows that E kHR0 U/ nk = tr(H⌃1 H 0 )  CK and therefore, kHR0 U/ nk = Op (K 1/2 ). And Then, ˆ⌧ 1 F A(Q

Q

1

p )HR0 U/ n 

ˆ

max (Q⌧

1

) F AQ

= O(n2 )O(n )Op ( ˆ 1 Pˆ 0 (h Combining the results above and by TR, F AQ ⌧

1

Q

Q⌧ )Op (K

ˆ⌧ Q 1/2

˜ pn = F AQ h)/

p HR0 U/ n

) = op (1). 1 HR0 U/pn

+ op (1).

Lastly, similar to (A.23) in the convergence rate part, p Pˆ )0 ⌘/ n = Op (n ⇣1v (K)2

ˆ ⌧ 1/2 (P Q

2 ⇡)

= op (1)

by G† (and by (A.6) of NPV), which implies p ˆ ⌧ 1/2 P )0 ⌘/ n  F AQ

ˆ ⌧ 1 (Pˆ F AQ

ˆ ⌧ 1/2 (P Q

p Pˆ )0 ⌘/ n = op (1)op (1) = op (1).

Also, by E[⌘|X] = 0, 

p 2 ˆ 1 Q 1 )P 0 ⌘/ n |Xn F A(Q ⌧ ⇣ hX i ⌘ ˆ 1 Q 1 )A0 F ˆ ⌧ 1 Q 1)  tr F A(Q pi p0i var(yi |Xi )/n (Q ⌧ ⇣ ⌘ ⇣ ˆ 1 Q 1 )Q ˆ ⌧ (Q ˆ 1 Q 1 )A0 F = Ctr F AQ 1 (Q ˆ⌧  Ctr F A(Q ⌧ ⌧

E

 Op (n2 ) F AQ

1 2

ˆ⌧ Q

Q

2

 Op (n2

Q⌧ )

2

ˆ 1 (Q ˆ⌧ Q)Q ⌧

= op (1)

by G† . Combining all of the previous results and by TR, p

h ˆ⌧ ) nF a(h

i a(h0 ) = F AQ

1

p p (P 0 ⌘/ n + HR0 U/ n) + op (1).


Q)Q

1

A0 F

⌘

⌃ + HQ1 1 ⌃1 Q1 1 H 0 Q

1

Next, recall V = AQ

1 A0 .

H⌃1 H 0 is p.s.d (that is

Note CI

H⌃1 H 0  CI using Q1 = I) and since var(y|X) is bounded by Assumption E, ⌃  CQ. Therefore, V = AQ

1

 AQ

1

⌃ + HQ1 1 ⌃1 Q1 1 H 0 Q

A0 = AQ

A0 = C AQ

1

(CQ + CI) Q

1

2

1/2

⌃ + H⌃1 H 0 Q

1

+ C AQ

1 2

1

A0

. 1,

Note that it is reasonable that the first-stage estimation error is not cancelled out with Q the first-stage regressors does not su↵er multicollinearity. Recall And kAk2  a(pK0 A0 )  ApK 2

C kAk = O(⇣r O(n2

⇣r

(K)2 ). ✓ˆ⌧

(K)2 ).

r

a(pK0

) = A so

 ⇣r (K) kAk so kAk  ⇣r (K). Hence AQ

Likewise, AQ

1 2

= O(n ⇣r

(K)2 ).

Thus, V = O(⇣r

a(pK0 A0 )

1/2 2

= AA0 .

= AQ

(K)2 )+O(n2

⇣r

since

1 A0

(K)2 )



=

Therefore,

p p ✓0 = Op (V 1/2 / n) = Op (n ⇣r (K)/ n) = Op (n

1 2

⇣r (K))  Op (n

1 2

⇣rv (K)).

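This result for scalar $a(h)$ covers the case of Assumption F. Now we can prove

$$\sqrt{n}\hat V_\tau^{-1/2}(\hat\theta_\tau - \theta_0) \to_d N(0, 1)$$

by showing $F\hat V_\tau F - 1 \to_p 0$. Then, $V^{-1}\hat V_\tau \to_p 1$, so that

$$\sqrt{n}\hat V_\tau^{-1/2}(\hat\theta_\tau - \theta_0) = \sqrt{n}V^{-1/2}(\hat\theta_\tau - \theta_0)/(V^{-1}\hat V_\tau)^{1/2} \to_d N(0, 1).$$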

The rest of the proof follows directly from the relevant part of the proof of NPV (pp. 600–601), except that in our case $Q \neq I$ because of weak instruments. Therefore, the following replaces the corresponding parts in the proof: For any matrix $B$, we have $\|B\Sigma\| \le C\|BQ\|$ by $\Sigma \le CQ$. Therefore,

ˆ ⌧ 1 ⌃Q ˆ⌧ 1 F A(Q ˆ 1  F A(Q ⌧

1

Q

ˆ 1 (Q  F AQ ⌧ ˆ 1  F AQ ⌧

2

ˆ 1  C F AQ ⌧  Op (n2 )Op (

1

Q

⌃Q

1

(Q

1

ˆ 1 ⌃(Q ⌧

ˆ 1 A0 F 0 + F AQ ⌃Q ⌧

ˆ ⌧ )Q Q

(Q

Q⌧ )

)A0 F 0

ˆ 1 A0 F 0 + F AQ )⌃Q ⌧

ˆ ⌧ )Q Q

2

1

1

⌃ + F AQ

ˆ ⌧ )Q Q

1

1

1

+ Op (n2 )Op (

Q⌧ )

⌃Q

⌃Q

Q + C F AQ

Q

1

1

1

1

)A0 F 0 ˆ ⌧ )Q ˆ 1 A0 F 0 Q ⌧

(Q

ˆ⌧ ) Q

(Q

QQ

1

(Q

ˆ 1 F AQ ⌧ ˆ⌧ ) Q

ˆ 1 F AQ ⌧

= op (1)

by Assumption G† . Also note that in our proof, Q⌧ is introduced by penalization but the treatment ˆ 1C Q ˆ 1B  BQ ˆ 1C Q ˆ 1 B for any matrix is the same as above. Specifically, one can apply B Q ⌧

⌧

B and C of corresponding orders. Also, recall ⇣r (K)  ⇣rv (K) and paper. That is, by

⇣0v (K)

h,

n2

Q

+

⇣0v (K)2 K/n


, and

Q are redefined in this h and v 1/2 v ⇣0 (K)L ⇣1 (K) h converging to zero,

we can prove the following: ˆi ˜ Q ˆ ⌧ 1 A0 F 0  Ctr(D) ˆ max h ⌃)

ˆ ⌧ 1 (⌃ ˆ F AQ ˆ 1 (⌃ ˜ F AQ ⌧

hi  Op (1)Op (⇣0v (K)

in

2

ˆ 1 A0 F 0  F AQ ˆ 1 ⌃)Q ⌧ ⌧

ˆ H

¯ C H =

n X i=1

˜ ⌃

2

⌃  Op (n2 )Op (

2

kˆ pi k kri k /n

!1/2

Op (⇣0v (K)L1/2 )Op (⇣1v (K)

n X

dˆi

Q

= op (1),

+ ⇣0v (K)2 K/n) = op (1),

2

di /n

i=1

h)

h)

!1/2

= op (1).

The rest of the proof thus follows. ⇤ Proof of Corollary 6.2: Now instead, suppose Assumption H is satisfied. For the functional c0 a(g),

this assumption is satisfied with ⌫(w) replaced by c0 ⌫(w). Therefore, it suffices to prove the ⇥ ⇤ result for scalar ⌫(w). Then, A = a(pK ) = E ⌫(wi )pK (wi )0 . Let ⌫K (w) = AQ 1 pK (w), which 1⇥K

is (transpose of) mean square projection of ⌫(·) on approximating functions pK (w). Note that Q 1 ⇥ ⇤ is singular with weak instruments. Therefore, define A⇤ = a(p⇤K ) = E ⌫(wi )p⇤K (ui )0 . Also, let ⇤ (u) = A⇤ Q⇤ ⌫K

functions

1 p⇤K (u),

p⇤K (u).

which is (transpose of) mean square projection of ⌫(·) on approximating ⇤ k2  E ⌫ ⌫K

Then, E k⌫

⌫K = E[⌫(wi )pK (wi )0 ]Q

⇤ E k⌫K

2

! 0 by the assumption. But,

p (w) = E[⌫(wi )pK (wi )0 ]Tn (Tn0 QTn )

1 K

= E[⌫(wi )p⇤K (ui )0 ](Tn0 QTn ) Let R⇤ = Tn0 QTn

0 p⇤K ↵K

1 ⇤K

p

1

Tn0 pK (w)

(u).

⇤ 0 0 Q⇤ = E [mi p⇤0 i ] + E [pi mi ] + E [mi mi ]. Then, we have

⌫K k2 = E A⇤ [Q⇤

(Tn0 QTn )

1

1

]p⇤K (ui )

 C kA⇤ k2 kR⇤ k2 E p⇤K (ui )

2

2

= E A⇤ Q⇤ 1 R⇤ (Tn0 QTn ) 1 p⇤K (ui ) ⇣ ⌘2  C˜ kA⇤ k2 2E kmi k kp⇤i k + E kmi k2 .

2

But by CS and Assumptions B and H, a(p⇤j ) = E[⌫(wi )p⇤j (ui )]  E[v(wi )2 ]E[pj (u⇤i )2 ] < 1, and therefore kA⇤ k2 = a(p⇤K0 ) and by Assumption D (n n

2

⇣2v ()2 K 1/2 = n ⇣

2

K6

2

= O(K). Then, by using previous results of (A.15) and (A.16),

! 0) which implies that n

⇣2v ()1/2 K 1/2 = n

⌫K k2 ! 0. Given

⇤ ⇣2v (K)2 K 1/2 converge to zero, we conclude that E k⌫K

E k⌫

⌫K k

2

⌘1/2

⇣

 E k⌫

we have the desired result that E k⌫

⇤ 2 ⌫K k

⌘1/2

⇣

+ E

⇤ k⌫K

⌫K k

2

⌘1/2

⇣2v (K)K and

,

⌫K k2 ! 0. Let d(X) = [@h(w)/@w]0 @w(X, ⇧0 (z))/@⇡,


$b_{KL}(z) = E\left[d(X_i)\nu_K(w_i)r^L(z_i)'\right]r^L(z)$ and $b_L(z) = E\left[d(X_i)\nu(w_i)r^L(z_i)'\right]r^L(z)$. Then,

$$E\left[\|b_{KL}(z_i) - b_L(z_i)\|^2\right] \le E\left[d(X_i)^2\|\nu_K(w_i) - \nu(w_i)\|^2\right] \le CE\left[\|\nu_K(w_i) - \nu(w_i)\|^2\right] \to 0$$

as $K \to \infty$. Furthermore, by Assumption E, $E\left[\|b_L(z_i) - \rho(z_i)\|^2\right] \to 0$ as $L \to \infty$, where $\rho(z)$ is a matrix of projections of elements of $\nu(w)d(X)$ on $\mathcal{L}$, which is the set of limit points of $r^L(z)'\gamma_L$. Therefore (as in (A.10) of NPV), by Assumption E,

$$V = E\left[\nu_K(w_i)\nu_K(w_i)'\sigma^2(X_i)\right] + E\left[b_{KL}(z_i)\,\mathrm{var}(x_i|z_i)\,b_{KL}(z_i)'\right] \to \bar V,$$

where $\bar V = E\left[\nu(w_i)\nu(w_i)'\sigma^2(X_i)\right] + E\left[\rho(z_i)\,\mathrm{var}(x_i|z_i)\,\rho(z_i)'\right]$. This shows $F$ is bounded. Then, given $\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, 1)$ from the proof of Theorem 6.1, the conclusion follows from $F^{-1} \to \bar V^{1/2}$ so that $F^{-1}\sqrt{n}F(\hat\theta_\tau - \theta_0) \to_d N(0, \bar V)$. □

B.5

Weak Instruments in Nonparametric IV Models

The purpose of this subsection is to illustrate the subtle difference between weak instruments and the intrinsic ill-posed inverse problem in an NPIV model. The characterization of weak instruments in this section parallels the main approach of this paper, in that we consider a sequence of distributions in defining weak instruments. This exercise is related to Darolles et al. (2011), where weak instruments are characterized within their source condition. Alternatively, in the same NPIV model, Freyberger (2015) defines a notion of weak instruments by associating it with the bias of the estimator resulting from a failure of (a restricted version of) the completeness condition. Consider

$$Y = g(X) + U, \quad E[U|Z] = 0, \quad \text{or equivalently,} \quad E[Y - g(X)|Z] = 0, \qquad \text{(B.13)}$$

where $X$ is the endogenous variable, $Z$ is the instrument, and $g$ is the parameter of interest. All variables are scalar and the support of $(X, Z)$ is $[-1/2, 1/2]^2$. In terms of notation, we explicitly distinguish a random variable from its realized value in this subsection. Here, the problem of weak instruments can be characterized as the moment condition not being sensitive to $g(\cdot)$, which in fact is also a feature of the ill-posed inverse problem. As is shown below, however, there is a subtle difference between the two. The moment condition (B.13) can be written as

$$(Tg)(w) = r(w) \qquad \text{(B.14)}$$

with a linear operator $T$ and appropriate $r(\cdot)$ defined below, following Horowitz (2011). For the


sake of our discussion, assume that X is generated according to X = Z + V, where V has a symmetric distribution and Z ? V . It is important to note that this relationship is unknown to the econometrician; otherwise, the model becomes a triangular model. From this specification, we write the operator T and its eigenvalues i.e.,

as functions of . When X ? Z,

j ’s

= 0, we have the lack of identification and T is singular. We find that in this case, some

of the eigenvalues are zero. The idea here is to separate singular. Then, one can show the convergence of groups when

! 0, while both satisfy

j

j ’s

j ’s

that are zero and nonzero when T is

exhibits di↵erent patterns across these two

! 0 when j ! 1.

Specifically, define $T$ after symmetrizing (B.13): For $h \in L^2[-1/2, 1/2]$,

$$(Th)(x) \equiv \int_{-1/2}^{1/2} h(w)\tau(x, w)\,dw, \qquad \text{(B.15)}$$

where $\tau(x, w) \equiv \int_{-1/2}^{1/2} f_{XZ}(x, z)f_{XZ}(w, z)\,dz$, assuming $f_{XZ}(\cdot, \cdot)$ is bounded. Then (B.15) can be written as (B.14) by assuming $g \in L^2[-1/2, 1/2]$, where $r(x) \equiv \int_{-1/2}^{1/2} E[Y|Z = z]f_{XZ}(x, z)\,dz$. Since $f_{XZ}$ is bounded, we know that $r \in L^2[-1/2, 1/2]$. Also, T has eigenvalues and eigenfunctions { j

j

: j = 1, 2, ...} such that

T and k j k = 1, h j ,

ki

=

j

j j,

= 0 for j 6= k. Then an appropriate condition for identification (such as

completeness) will guarantee that T is nonsingular. That is, inverse, so that g = T

1 r.

j

> 0 (since T is p.s.d) and T has an

Also, since T is a compact operator, for j

j

j+1

8j,

! 0.

To characterize weak instruments, we want to consider those

j ’s

that are zero when

the lack of identification case). Note that FX|Z (x|z) = Pr[Z + V  x|Z = z] = Pr[V  x by Z?V , and hence fX|Z (x|z) = fV (x ⌧ (x, w) =

ˆ

z ) or fXZ (x, z) = fV (x fV (x

z )fV (w


z ]

z )fZ (z). Then

z )fZ (z)2 dz,

= 0 (i.e.,

where we suppress the range [ 1/2, 1/2] of the integral. Consider the case where ⌧ (x, w) =

ˆ

fV (x)fV (w)fZ (z)2 dz = kfZ k2 fV (x)fV (w).

Consider the following eigenvalue equation (i.e., T ˆ

of fV (·) around

) (x),

= kfZ k2 kfV k2 . Note that

= kfZ k2 kfV k2 is the only

= 0, and therefore, T is singular. Note

holds with the sequence { , 0, 0, 0, ...}. Now consider the case

=

kfZ k2 fV (x)fV (w) (w)dw =

which implies (x) = fV (x)/ kfV k and nonzero eigenvalue of T when

6= 0. Recall ⌧ (x, w) ⌘

= 0 yields

fV (x This approximate is justified by

= 0: We have

´

fV (x

z)fV (w

j

! 0 (as j ! 1) trivially

z)fZ (z)2 dz. The expansion

zfV0 (x) + o( ).

z) = fV (x)

being close to zero with weak instruments. Henceforth, suppress

V in fV . Then the eigenvalue equation, after some algebra, becomes ⇥

V0 hf, i

⇤ V1 hf 0 , i f (x)

⇥

2

V1 hf, i

2

where V0 = kfZ k2 , V1 = kzfZ k2 , and V2 = z 2 fZ

⇤ V2 hf 0 , i f 0 (x) + o(

2

)=

(x),

(B.16)

are finite. This implies that the eigenfunction

is of the form

(x) = af (x) + bf 0 (x) + o( for some constants a and b. Suppress o(

2)

2

)

(B.17)

for simplicity. Substituting (B.17) into (B.16), we have

h i h i 2 a = a V0 kf k2 V1 hf 0 , f i + b V0 hf, f 0 i V1 f 0 , h i h 2 2 b = a V1 kf k2 V2 hf 0 , f i b V1 hf, f 0 i V2 f 0 and the solutions for this quadratic equation of

2

i

,

are

2

2 = V0 kf k2 2 V1 hf 0 , f i + 2 V2 f 0 r⇣ ⌘2 ± V0 kf k2 2 V1 hf 0 , f i + 2 V2 kf 0 k2 where V0 V2

V12

0 and kf k2 kf 0 k2

hf, f 0 i2

4

2 (V

0 V2

V12 )(kf k2 kf 0 k2

hf, f 0 i2 ),

0 by the Cauchy-Schwarz inequality. Then, observe


that the eigenvalues have “two types” in terms of their limits when 1

= where

1

+ O( ) ±

p ( 2

1

+ O(

p

))2

= kfZ k2 kfV k2 is the nonzero eigenvalue when

=

(

1

! 0:

p + O( ) , p O( )

= 0. Motivated from this, we define the

following sets of eigenvalues:

L1 = {

j

2L:

j

L0 = {

j

2L:

j

p + O( )} , p = O( )} , =

1

where L is the set of all eigenvalues. In conclusion, the sequences

When

! 0 for a given j,

a given ,

j(

) ! 0 for

j

j(

! 0 and j ! 1 play di↵erent roles in the convergence of

)!

1

for

j

2 L1 and

j(

) ! 0 for

j(

j:

) 2 L0 ; when j ! 1 for

2 L1 [ L0 = L. One additional note: Although weak instruments and

the ill-posed inverse problem entail di↵erent patterns of convergence of the eigenvalues, the use of regularization methods may solve both problems simultaneously.
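The two patterns of eigenvalue behavior discussed in this subsection can be seen in a small numerical experiment. The sketch below is an illustration under stated assumptions rather than part of the paper's argument: it discretizes the kernel $\int f_V(x - cz)f_V(w - cz)f_Z(z)^2\,dz$ on a midpoint grid, takes $f_Z$ to be uniform on $[-1/2, 1/2]$ and $f_V$ to be an unnormalized Gaussian-shaped symmetric function, and writes the coefficient on $Z$ in the relation generating $X$ as `c`. The names `operator_eigenvalues`, `c`, `m`, and `sigma` are hypothetical.

```python
import numpy as np

def operator_eigenvalues(c, m=200, sigma=0.2):
    """Eigenvalues (descending) of a midpoint-rule discretization of T.

    Kernel: tau(x, w) = int f_V(x - c z) f_V(w - c z) f_Z(z)^2 dz on [-1/2, 1/2].
    Assumptions (illustrative only): f_Z is uniform on [-1/2, 1/2] and f_V is an
    unnormalized Gaussian-shaped symmetric function with scale `sigma`.
    """
    grid = (np.arange(m) + 0.5) / m - 0.5          # midpoints of [-1/2, 1/2]
    dx = 1.0 / m
    f_v = lambda t: np.exp(-t**2 / (2.0 * sigma**2))
    f_z = np.ones(m)                               # uniform density on the interval
    # G[i, k] = f_V(x_i - c z_k) * f_Z(z_k)
    G = f_v(grid[:, None] - c * grid[None, :]) * f_z[None, :]
    kernel = (G @ G.T) * dx                        # integrate over z
    eig = np.linalg.eigvalsh(kernel * dx)          # discretized operator, symmetric p.s.d.
    return eig[::-1]

lam_0 = operator_eigenvalues(0.0)       # no identification: rank-one kernel
lam_small = operator_eigenvalues(0.05)  # weaker instrument
lam_larger = operator_eigenvalues(0.10) # stronger instrument
```

In this design, `lam_0` has a single nonzero entry equal to the discretized $\|f_Z\|^2\|f_V\|^2$. For small positive `c`, the leading eigenvalue stays near that value, while the second eigenvalue is positive and shrinks toward zero as `c` shrinks.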

References

Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71 (6), 1795–1843. 4

Amemiya, T., 1977. The maximum likelihood and the nonlinear three-stage least squares estimator in the general nonlinear simultaneous equation model. Econometrica, 955–968. 5

Andrews, D. W., Guggenberger, P., 2015. Identification- and singularity-robust inference for moment condition models. 1

Andrews, D. W. K., Cheng, X., 2012. Estimation and inference with weak, semi-strong, and strong identification. Econometrica 80 (5), 2153–2211. 1

Andrews, D. W. K., Stock, J. H., 2007. Inference with weak instruments. In: Advances in Econometrics: Proceedings of the Ninth World Congress of the Econometric Society. 1

Andrews, D. W. K., Whang, Y.-J., 1990. Additive interactive regression models: circumvention of the curse of dimensionality. Econometric Theory 6 (04), 466–479. 5

Andrews, I., Mikusheva, A., 2015a. Conditional inference with a functional nuisance parameter, Unpublished Manuscript, Department of Economics, Massachusetts Institute of Technology. 1


Andrews, I., Mikusheva, A., 2015b. A geometric approach to nonlinear econometric models, Unpublished Manuscript, Department of Economics, Massachusetts Institute of Technology. 1

Angrist, J. D., Krueger, A. B., 1991. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics 106 (4), 979–1014. 2

Angrist, J. D., Lavy, V., 1999. Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics 114 (2), 533–575. 8

Arlot, S., Celisse, A., 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79. 5, 8

Blundell, R., Browning, M., Crawford, I., 2008. Best nonparametric bounds on demand responses. Econometrica 76 (6), 1227–1262. 1

Blundell, R., Chen, X., Kristensen, D., 2007. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75 (6), 1613–1669. 2, 1, 4, 5

Blundell, R., Duncan, A., 1998. Kernel regression in empirical microeconomics. Journal of Human Resources 33, 62–87. 1, 5

Blundell, R., Duncan, A., Pendakur, K., 1998. Semiparametric estimation and consumer demand. Journal of Applied Econometrics 13 (5), 435–461. 1, 5

Blundell, R., Powell, J. L., 2003. Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs 36, 312–357. 4

Blundell, R. W., Powell, J. L., 2004. Endogeneity in semiparametric binary response models. The Review of Economic Studies 71 (3), 655–679. 9

Bound, J., Jaeger, D. A., Baker, R. M., 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90 (430), 443–450. 1

Breza, E., 2012. Peer effects and loan repayment: Evidence from the Krishna default crisis. Job Market Paper MIT. 1

Carrasco, M., Florens, J.-P., Renault, E., 2007. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. Handbook of Econometrics 6, 5633–5751. 1, 4

Chay, K., Munshi, K., 2014. Black networks after emancipation: Evidence from reconstruction and the great migration, Unpublished working paper. 1


Chen, X., Pouzo, D., 2012. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica 80 (1), 277–321. 4, 5

Chernozhukov, V., Hansen, C., 2005. An IV model of quantile treatment effects. Econometrica 73 (1), 245–261. 9

Chesher, A., 2003. Identification in nonseparable models. Econometrica 71 (5), 1405–1441. 2, 2

Chesher, A., 2007. Instrumental values. Journal of Econometrics 139 (1), 15–34. 2

Coe, N. B., von Gaudecker, H.-M., Lindeboom, M., Maurer, J., 2012. The effect of retirement on cognitive functioning. Health Economics 21 (8), 913–927. 1, 4

Darolles, S., Fan, Y., Florens, J.-P., Renault, E., 2011. Nonparametric instrumental regression. Econometrica 79 (5), 1541–1565. 2, 5, B.5

Das, M., Newey, W. K., Vella, F., 2003. Nonparametric estimation of sample selection models. The Review of Economic Studies 70 (1), 33–58. 9

Davidson, R., MacKinnon, J. G., 1993. Estimation and inference in econometrics. OUP Catalogue. 3

Del Bono, E., Weber, A., 2008. Do wages compensate for anticipated working time restrictions? Evidence from seasonal employment in Austria. Journal of Labor Economics 26 (1), 181–221. 1, 5

Dufour, J.-M., 1997. Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica, 1365–1387. 1

Dustmann, C., Meghir, C., 2005. Wages, experience and seniority. The Review of Economic Studies 72 (1), 77–108. 1, 5

Engl, H. W., Hanke, M., Neubauer, A., 1996. Regularization of inverse problems. Vol. 375. Springer. 1

Frazer, G., 2008. Used-clothing donations and apparel production in Africa. The Economic Journal 118 (532), 1764–1784. 1

Freyberger, J., 2015. On completeness and consistency in nonparametric instrumental variable models. Working Paper. 1, B.5

Garg, K. M., 1998. Theory of differentiation. Wiley. A.1

Giorgi, G., Guerraggio, A., Thierfelder, J., 2004. Mathematics of Optimization: Smooth and Nonsmooth Case. Elsevier. A.1


Hall, P., Horowitz, J. L., 2005. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics 33 (6), 2904–2929. 1, 4, 5

Han, S., McCloskey, A., 2015. Estimation and inference with a (nearly) singular Jacobian, Unpublished Manuscript, University of Texas at Austin and Brown University. 1

Hastie, T., Tibshirani, R., 1986. Generalized additive models. Statistical Science, 297–310. 1

Henderson, D. J., Papageorgiou, C., Parmeter, C. F., 2013. Who benefits from financial development? New methods, new evidence. European Economic Review 63, 47–67. 1

Hengartner, N. W., Linton, O. B., 1996. Nonparametric regression estimation at design poles and zeros. Canadian Journal of Statistics 24 (4), 583–591. 1

Hoderlein, S., 2009. Endogenous semiparametric binary choice models with heteroscedasticity, cemmap working paper. 6

Hoerl, A. E., Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67. 4

Hong, Y., White, H., 1995. Consistent specification testing via nonparametric series regression. Econometrica, 1133–1159. 9

Horowitz, J. L., 2011. Applied nonparametric instrumental variables estimation. Econometrica 79 (2), 347–394. 1, 8, B.5

Horowitz, J. L., Lee, S., 2007. Nonparametric instrumental variables estimation of a quantile regression model. Econometrica 75 (4), 1191–1208. 4

Imbens, G. W., Newey, W. K., 2009. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77 (5), 1481–1512. 1, 2

Jiang, J., Fan, Y., Fan, J., 2010. Estimation in additive models with highly or nonhighly correlated covariates. The Annals of Statistics 38 (3), 1403–1432. 1, 6

Jun, S. J., Pinkse, J., 2012. Testing under weak identification with conditional moment restrictions. Econometric Theory 28 (6), 1229. 3, 5

Kasy, M., 2014. Instrumental variables with unrestricted heterogeneity and continuous treatment. The Review of Economic Studies 81 (4), 1614–1636. 1

Kleibergen, F., 2002. Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica 70 (5), 1781–1803. 1


Kleibergen, F., 2005. Testing parameters in GMM without assuming that they are identified. Econometrica 73 (4), 1103–1123. 1

Koster, H. R., Ommeren, J., Rietveld, P., 2014. Agglomeration economies and productivity: a structural estimation approach using commercial rents. Economica 81 (321), 63–85. 1

Kress, R., 1999. Linear integral equations. Vol. 82. Springer. 4

Lee, J. M., 2011. Topological spaces. In: Introduction to Topological Manifolds. Springer, pp. 19–48. A.1

Lee, S., 2007. Endogeneity in quantile regression models: A control function approach. Journal of Econometrics 141 (2), 1131–1158. 9

Linton, O. B., 1997. Miscellanea: Efficient estimation of additive nonparametric regression models. Biometrika 84 (2), 469–473. 1

Lorentz, G. G., 1986. Approximation of functions. Chelsea Publishing Company, New York. 13

Lyssiotou, P., Pashardes, P., Stengos, T., 2004. Estimates of the black economy based on consumer demand approaches. The Economic Journal 114 (497), 622–640. 1, 5

Mazzocco, M., 2012. Testing efficient risk sharing with heterogeneous risk preferences. The American Economic Review 102 (1), 428–468. 1

Moreira, M. J., 2003. A conditional likelihood ratio test for structural models. Econometrica 71 (4), 1027–1048. 1

Newey, W. K., 1990. Efficient instrumental variables estimation of nonlinear models. Econometrica, 809–837. 5

Newey, W. K., 1997. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79 (1), 147–168. 4, 5, B.1

Newey, W. K., Powell, J. L., 2003. Instrumental variable estimation of nonparametric models. Econometrica 71 (5), 1565–1578. 1, 4

Newey, W. K., Powell, J. L., Vella, F., 1999. Nonparametric estimation of triangular simultaneous equations models. Econometrica 67 (3), 565–603. 1

Nielsen, J. P., Sperlich, S., 2005. Smooth backfitting in practice. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), 43–61. 1

Pinkse, J., 2000. Nonparametric two-step regression estimation when regressors and error are dependent. Canadian Journal of Statistics 28 (2), 289–300. 1

Powell, M. J. D., 1981. Approximation theory and methods. Cambridge University Press. 5

Skinner, J. S., Fisher, E. S., Wennberg, J., 2005. The efficiency of Medicare. In: Analyses in the Economics of Aging. University of Chicago Press, pp. 129–160. 1

Sperlich, S., Linton, O. B., Härdle, W., 1999. Integration and backfitting methods in additive models-finite sample properties and comparison. Test 8 (2), 419–458. 1

Staiger, D., Stock, J. H., 1997. Instrumental variables regression with weak instruments. Econometrica 65 (3), 557–586. 1, 11, 7

Stock, J. H., Wright, J. H., 2000. GMM with weak identification. Econometrica 68 (5), 1055–1096. 1, 3

Stock, J. H., Yogo, M., 2005. Testing for weak instruments in linear IV regression. Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 80–108. 1, 2, 7

Stone, C. J., 1982. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 1040–1053. 5

Taylor, A. E., 1965. General theory of functions and integration. Dover Publications. A.1

Trefethen, L. N., 2008. Is Gauss quadrature better than Clenshaw-Curtis? SIAM (Society for Industrial and Applied Mathematics) Review 50 (1), 67–87. 5, A.4

Wang, H., Xiang, S., 2012. On the convergence rates of Legendre approximation. Mathematics of Computation 81 (278), 861–877. A.4

Weyl, H., 1912. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen 71 (4), 441–479. B.2

Yatchew, A., No, J. A., 2001. Household gasoline demand in Canada. Econometrica 69 (6), 1697–1709. 1, 5


K* = 1

                     µ² = 4         8        16        32        64       128       256
τ = 0
  Bias²              0.4236    0.0325    0.0043    0.0003    0.0002    0.0000    0.0000
  Var              943.0304    9.4828    0.3767    0.1145    0.0870    0.0311    0.0168
  MSE              943.4540    9.5153    0.3810    0.1148    0.0872    0.0311    0.0169
  MSE_IV/MSE_LS   3711.5352   35.4269    1.5029    0.4751    0.3598    0.1491    0.0904
τ = 0.001
  Bias²              0.0400    0.0064    0.0009    0.0004    0.0002    0.0000    0.0000
  Var                1.9542    0.4287    0.2029    0.0915    0.0773    0.0297    0.0163
  MSE                1.9943    0.4351    0.2038    0.0918    0.0775    0.0298    0.0163
  MSE_PIV/MSE_IV     0.0021    0.0457    0.5350    0.8001    0.8890    0.9564    0.9677
τ = 0.005
  Bias²              0.1299    0.0759    0.0303    0.0113    0.0034    0.0016    0.0004
  Var                0.2470    0.1313    0.1258    0.0647    0.0369    0.0436    0.0153
  MSE                0.3768    0.2072    0.1561    0.0759    0.0403    0.0452    0.0158
  MSE_PIV/MSE_IV     0.0001    0.1417    0.4796    0.5039    0.7093    0.6305    0.9049
τ = 0.01
  Bias²              0.1887    0.1269    0.0771    0.0314    0.0143    0.0038    0.0015
  Var                0.2063    0.0870    0.0700    0.0415    0.0346    0.0240    0.0164
  MSE                0.3951    0.2139    0.1470    0.0729    0.0489    0.0278    0.0180
  MSE_PIV/MSE_IV     0.0011    0.1069    0.5116    0.6535    0.7909    0.8231    0.9073

Table 1: Integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators $\hat g_\tau(\cdot)$ and $\hat g(\cdot)$.
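As a reading aid, integrated MSE equals integrated squared bias plus integrated variance, so the tabulated rows can be checked against this identity. The short sketch below does so for the unpenalized ($\tau = 0$) block of Table 1; the array names are illustrative, and the numbers are copied from the table.

```python
import numpy as np

# Unpenalized (tau = 0) block of Table 1, columns mu^2 = 4, 8, ..., 256.
mu2 = np.array([4, 8, 16, 32, 64, 128, 256])
bias2 = np.array([0.4236, 0.0325, 0.0043, 0.0003, 0.0002, 0.0000, 0.0000])
var = np.array([943.0304, 9.4828, 0.3767, 0.1145, 0.0870, 0.0311, 0.0168])
mse = np.array([943.4540, 9.5153, 0.3810, 0.1148, 0.0872, 0.0311, 0.0169])

# Discrepancy between Bias^2 + Var and the reported MSE; nonzero only through
# rounding in the fourth decimal place.
gap = np.abs(bias2 + var - mse)
```

The same check can be repeated for the penalized blocks by swapping in the corresponding rows.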


K* = 3

                     µ² = 4         8        16        32        64       128       256
τ = 0.001
  Bias²              0.0634    0.0022    0.0054    0.0004    0.0002    0.0002    0.0000
  Var               21.6485    5.7572    0.2994    0.0975    0.0758    0.0343    0.0305
  MSE               21.7118    5.7594    0.3048    0.0979    0.0760    0.0344    0.0306
  MSE_PIV/MSE_IV     0.8103    0.9757    0.9097    0.9518    0.9446    0.9674    0.9870
τ = 0.005
  Bias²              0.0015    0.1859    0.0020    0.0011    0.0002    0.0001    0.0001
  Var               55.4223  286.3584    0.2392    0.0997    0.0520    0.0375    0.0287
  MSE               55.4238  286.5443    0.2412    0.1008    0.0522    0.0376    0.0288
  MSE_PIV/MSE_IV     0.5546    0.8021    0.8498    0.8506    0.8761    0.7105    0.9570
τ = 0.01
  Bias²              0.0139    0.2316    0.0037    0.0002    0.0001    0.0003    0.0000
  Var               82.1557  146.8736    0.2420    0.0991    0.0506    0.0500    0.0158
  MSE               82.1696  147.1052    0.2457    0.0993    0.0507    0.0503    0.0158
  MSE_PIV/MSE_IV     0.5646    1.0198    0.7376    0.8525    0.8328    0.8605    0.9538

Table 2: Integrated squared bias, integrated variance, and integrated MSE of the penalized and unpenalized IV estimators $\hat g_\tau(\cdot)$ and $\hat g(\cdot)$.

  τ        CV Value
  0.015    37.267
  0.05     37.246
  0.1      37.286
  0.15     37.330
  0.2      37.373

Table 3: Cross-validation values for the choice of τ.
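Choosing τ by cross-validation amounts to picking the grid value with the smallest criterion. A minimal sketch using the tabulated values (the variable names are illustrative, and the grid is only the five values reported in Table 3):

```python
# Grid of penalty values and their cross-validation criteria, copied from Table 3.
tau_grid = [0.015, 0.05, 0.1, 0.15, 0.2]
cv_values = [37.267, 37.246, 37.286, 37.330, 37.373]

# Pair each criterion with its tau and take the pair with the smallest criterion.
best_cv, best_tau = min(zip(cv_values, tau_grid))
```

On this grid the criterion is fairly flat, so the selected value should be read as a guide rather than a sharp optimum.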


Figure 2: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a weak instrument, τ = 0.001. The (blue) dotted-dash line is the true $g_0(\cdot)$. The (black) solid line is the (simulated) mean of $\hat g(\cdot)$ with the dotted band representing the 0.025-0.975 quantile ranges. Note that the difference between $g_0(\cdot)$ and the mean of $\hat g(\cdot)$ is the (simulated) bias. The (red) solid line is the mean of $\hat g_\tau(\cdot)$ with the dashed 0.025-0.975 quantile ranges.

Figure 3: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a strong instrument, τ = 0.001.


Figure 4: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a weak instrument, τ = 0.005.

Figure 5: Penalized versus unpenalized estimators ($\hat g_\tau(\cdot)$ vs. $\hat g(\cdot)$) with a strong instrument, τ = 0.005.


Figure 6: NPIV estimates from Horowitz (2011), full sample (n = 2019), 95% confidence band

Figure 7: Unpenalized IV estimates with nonparametric first-stage equations, full sample (n = 2019), 95% confidence band


Figure 8: Penalized IV estimates with the discontinuity sample (n = 650, F = 191.66).
