Estimation of Counterfactual Distributions with a Continuous Endogenous Treatment

Santiago Pereda-Fernández∗
Banca d’Italia

September 20, 2016

Abstract

Policy makers are often interested in the distributional effects of a policy. In this paper I propose a method to estimate the counterfactual distribution of an outcome variable when the treatment is endogenous, continuous, and its effect is heterogeneous. The counterfactuals considered in this paper are those in which the change in treatment intensity can be correlated with the individual effects. I characterize the outcome and the treatment with a triangular system of equations in which the unobservables are related by a copula that captures the endogeneity of the treatment. This copula is nonparametrically identified by inverting the quantile processes that determine the outcome and the treatment. Both processes are estimated using existing quantile regression methods, and I propose a parametric and a nonparametric estimator of the copula. To illustrate these methods, I estimate several counterfactual distributions of the birth weight of children, had their mothers smoked less during pregnancy.

Keywords: Copula, counterfactual distribution, endogeneity, policy analysis, quantile regression, unconditional distributional effects

JEL classification: C31, C36

∗ Banca d’Italia, Via Nazionale 91, 00184 Roma, Italy. I would like to thank Manuel Arellano, Stéphane Bonhomme, Domenico Depalo, Bryan Graham, Giuseppe Ilardi, Michael Jansson, James Powell, Demian Pouzo, Enrique Sentana, and seminar participants at CEMFI and University of California, Berkeley for their helpful comments and discussion. All remaining errors are my own. I can be reached via email at [email protected]


1 Introduction

Estimating the effect of a policy is usually one of the main objectives of experiments in economics. When the treatment is binary, the population is naturally split into treated and untreated individuals, and two of the main statistics of interest are the average treatment effect on the treated and on the untreated. However, these two statistics provide an incomplete picture, since they do not capture the distributional effects of the policy.¹ When the treatment is continuous, the population is not divided into treated and untreated, and in principle everyone could have a different treatment intensity. Even so, the policy maker could be interested in similar questions regarding the distributional effects of a policy on a selected subpopulation.

The aim of this paper is to provide estimators of the counterfactual distribution of a continuous outcome variable when the treatment is continuous, endogenous, and has a heterogeneous effect. Unlike the exogenous case, individuals self-select into treatment intensity, and the policy maker may not be able to fully enforce the treatment for everyone. Hence, I focus on counterfactuals in which the change in treatment intensity is correlated with the individual effect. For example, setting a minimum amount of treatment for everyone would not affect the treatment intensity choices of individuals above that level, but it would increase the intensity of those below that level, who would be differently affected by the change in the treatment. Similarly, if the treatment intensity were halved for everyone, the decrease would be larger for those with a higher individual effect.

The model is characterized by a triangular system of equations in which both the outcome and the treatment depend monotonically on a single unobservable variable each. These unobservables isolate the endogeneity of the treatment, which is modeled with a copula.
Despite not being the main object of interest for the policy maker, this copula can nevertheless be quite informative about the potential effects of a particular policy. To see this, consider the extreme case in which there is perfect correlation: for a given set of covariates, individuals with the lowest treatment intensity are those whose treatment effect is the lowest, and increasing their treatment intensity would only have a modest effect. On the other hand, if there is little to no correlation, even individuals with a low treatment intensity could benefit a lot from an increase in the treatment.

¹ A policy maker interested in inequality would also like to know what the impact of the policy would be on the variance, a particular quantile of the distribution, or any measure of inequality such as the Gini index.

I discuss the identification and the estimation of the components of the distribution of an outcome variable $Y$: its conditional distribution, which depends on both equations of the triangular system, the distribution of the instrument, and the copula. Then, I use the estimators of these functionals to estimate the counterfactual distribution of the outcome. These estimators of the counterfactual distribution can in turn be used to estimate other functionals, such as the unconditional quantile treatment effect or the Gini index.

To estimate the two functions that characterize the triangular model, I use existing quantile methods, whereas for the copula I propose two estimators: one parametric and one nonparametric. Both of them require the inversion of the quantile processes that make up the triangular model, and they achieve the $\sqrt{n}$ convergence rate, which is also achieved by the estimators of the unconditional distribution based on them. Then, I combine these estimators to obtain the estimator of the counterfactual distribution, and I show its asymptotic properties.

The identification of this type of triangular model has received a lot of attention in the literature, using either instrumental variable or control function approaches. Early works include Chesher (2003) and Imbens and Newey (2009), who study the nonparametric identification of nonseparable models using a control function approach. Other papers propose methods that could be referred to as semiparametric, which do not suffer from the curse of dimensionality, such as Jun (2009), or Lee (2007), who assumes the model to be separable.
Alternatively, Ma and Koenker (2006) propose a parametric version of the model of Chesher (2003). D’Haultfœuille and Février (2015) and Torgovitsky (2015) are recent papers that establish the identification of nonseparable triangular models when the support of the instrument is small, for which they use the monotonicity of the unobservables. On the other hand, Hoderlein and Mammen (2007) discuss the identification of such models without monotonicity, and Kasy (2014) considers the case in which the heterogeneity can be multidimensional.

This paper is also related to the literature on the estimation of unconditional counterfactual distributions. Machado and Mata (2005) and Melly (2006) proposed estimators of such effects based on quantile regression when the treatment is exogenous, which Chernozhukov et al. (2013) generalized by proposing a method to estimate any functional of interest, given an initial estimator of the conditional quantile curve or the conditional distribution function. The estimator I propose extends these methods to the presence of endogeneity based on an instrumental variables approach, similarly to Pereda-Fernández (2010). On the other hand, Martinez-Sanchis et al. (2012) adapted Melly (2006) to the presence of endogeneity using a control function approach. Firpo et al. (2009) proposed a different method to estimate distributional effects under exogeneity, based on the reweighting of the influence function rather than on quantile regression methods as in this paper. Frölich and Melly (2013) proposed a nonparametric estimator of the unconditional quantile treatment effect for the subpopulation of compliers when the treatment is an endogenous binary variable. In this paper, however, both the outcome and the treatment are continuous, which allows me to nonparametrically identify the copula that captures the endogeneity of the treatment.

Examples of empirical works that fit into the framework presented in this paper include the impact of education on earnings (Card, 2001) and on adult mortality (Lleras-Muney, 2005), the effect of family income on scholastic achievement (Dahl and Lochner, 2012), the impact of class size on scholastic achievement (Angrist and Lavy, 1999) or on long-term outcomes (Fredriksson et al., 2013), the effect of the quality of institutions on income (Acemoglu et al., 2012), and the effect of smoking during pregnancy on the child’s birth weight (Evans and Ringel, 1999).
These studies could benefit from studying the distributional effects of an intervention that results in a different assignment of the amount of treatment for the whole population.

I illustrate these methods using data on the birth weight of children whose mothers smoked during pregnancy. Following Evans and Ringel (1999), I instrument cigarette consumption during pregnancy with the state tax as a percentage of the final price. I estimate the distributional effect of smoking an extra daily cigarette, and then carry out two counterfactuals: in the first I reduce the number of cigarettes smoked to one half of what was actually reported, and in the second I limit the maximum daily number of cigarettes smoked to ten. The results show that such reductions in cigarette consumption during pregnancy would increase the average birth weight. However, this effect would be heterogeneous, and in particular it would substantially reduce the number of newborns with low and very low birth weight, i.e. those who weigh less than 2,500 and 1,500 grams at birth, respectively.

The rest of the paper is organized as follows. In section 2 I discuss the identification of the functionals of interest and the counterfactual distributions. In section 3 I propose two estimation methods based on different assumptions on the copula. In section 4 I carry out a Monte Carlo experiment, and in section 5 I apply the methodology presented in this paper to the estimation of the effect of smoking during pregnancy on birth weight. Finally, section 6 concludes.

2 Identification

Let $Y$ be the outcome variable of interest, $X \equiv (X_1, X_2')'$ be the vector composed of the treatment, $X_1$, and a set of exogenous covariates, $X_2$, $Z \equiv (Z_1, X_2')'$ be the vector composed of the instrumental variable, $Z_1$, and the exogenous covariates, and $U$ and $V$ be uniformly distributed random variables that are not observed by the econometrician, to which I refer as the conditional ranks. These variables form the following triangular model:

$$Y = g(X_1, X_2, U) \qquad (1)$$

$$X_1 = h(Z_1, X_2, V) \qquad (2)$$

$$U, V \mid Z \sim C_{UV|X_2} \qquad (3)$$

where $g(\cdot,\cdot,\cdot)$ and $h(\cdot,\cdot,\cdot)$ are nonseparable and strictly increasing in their last argument, and $C_{UV|X_2}$ is the copula of $(U, V)$, conditional on the vector of exogenous covariates.² The Skorohod representation allows us to isolate the endogeneity of the treatment, captured by the copula, from the structural equations of the outcome and the treatment.³ In this model, $h(\cdot,\cdot,\cdot)$ represents the conditional quantile function (CQF) of $X_1$, which satisfies $P(X_1 \le h(Z_1, X_2, \tau) \mid Z_1, X_2) = \tau$, and $g(\cdot,\cdot,\cdot)$ is the structural quantile function (SQF) of $Y$, which satisfies $P(Y \le g(X_1, X_2, \tau) \mid Z_1, X_2) = \tau$ and is different from its CQF.⁴

The copula of the conditional ranks, while not itself the object of interest of the policy maker, can nevertheless be informative. For example, if the copula is highly correlated, an intervention that increases the treatment of those with a low level of treatment would have a modest effect. On the other hand, if the copula has little correlation, such an intervention would have a small effect on some individuals and a large effect on others.

² By definition, a copula is the multivariate distribution of $(U_1, \dots, U_d)$ whose marginal distributions are uniform on the unit interval. Sklar (1959) showed that for any multivariate distribution of the continuously distributed variables $X_1, \dots, X_d$ there exists a unique cdf $C$ such that $P(X_1 \le x_1, \dots, X_d \le x_d) = C(F_1(x_1), \dots, F_d(x_d))$.

³ The Skorohod representation states that a random variable $\varphi_i$ can be written in terms of its quantile function: $\varphi_i = q(U_i)$, where $U_i \sim U(0, 1)$.

⁴ See Chernozhukov and Hansen (2013) for a more detailed discussion of the difference between the SQF and the CQF of $Y$.

2.1 Identification of the Structural Functions

Before focusing on the counterfactual distribution of $Y$, I consider its actual distribution and discuss the identification of the different components upon which it depends. Denote the conditional distribution of $Y$ by $F_{Y|Z}$, and the conditional copula by $C_{U|VX_2}$. The former can be expressed as a function of the latter:

$$
\begin{aligned}
F_{Y|Z}(y|z) &= \int_{[0,1]^2} 1\left(g(h(z_1, x_2, v), x_2, u) \le y\right) dC_{UV|X_2}(u, v|x_2) \\
&= \int_{[0,1]^2} 1\left(u \le g^{-1}(h(z_1, x_2, v), x_2, y)\right) dC_{U|VX_2}(u|v, x_2)\, dv \\
&= \int_0^1 C_{U|VX_2}\left(g^{-1}(h(z_1, x_2, v), x_2, y) \mid v, x_2\right) dv \qquad (4)
\end{aligned}
$$

where $1(\cdot)$ denotes the indicator function. Therefore, conditional on $(V, Z)$, there is a bijection between $Y$ and $U$, given by $F_{Y|ZV}(y|z, v) = C_{U|VX_2}\left(g^{-1}(h(z_1, x_2, v), x_2, y) \mid v, x_2\right)$. To better understand this relation, consider the exogenous case. If $U$ and $V$ were independent of each other, conditional on $Z$, the conditional copula would simplify to $C_{U|VX_2}(u|v, x_2) = u$. Moreover, by equation 2, when $Z$ and $V$ are known, so is $X_1$. Thus, it follows that $F_{Y|ZV}(y|z, v) = g^{-1}(h(z_1, x_2, v), x_2, y) = g^{-1}(x_1, x_2, y) \equiv u$, whereas under endogeneity $F_{Y|ZV}(y|z, v) = C_{U|VX_2}(u|v, x_2) \ne u$.

Hence, the identification of $F_{Y|Z}$ requires the identification of three components: the SQF of $Y$, the CQF of $X_1$, and the copula of $(U, V)$ conditional on $X_2$. Identification of $g(\cdot,\cdot,\cdot)$ has been an active area of research, with recent works by D’Haultfœuille and Février (2015) and Torgovitsky (2015) establishing its identification with a continuous treatment even when the support of the instrument $Z_1$ consists of two points. Although the assumptions required for identification are different, both of them require strict monotonicity and continuity of both $g(\cdot,\cdot,\cdot)$ and $h(\cdot,\cdot,\cdot)$, strong exogeneity of the instrument, and a normalization of $U$. The identification of $h(\cdot,\cdot,\cdot)$ was established by Matzkin (2003), and it follows from the normalization that $V$ is uniformly distributed, the strict monotonicity, and the continuity of $h(\cdot,\cdot,\cdot)$. As for the identification of the copula, it is obtained by inverting the SQF and the CQF, which is possible by the continuity and the monotonicity of both functions in their last argument:

$$U = g^{-1}(X_1, X_2, Y)$$

$$V = h^{-1}(Z_1, X_2, X_1)$$

Hence, it follows that $C_{UV|X_2}(u, v|x_2) = P(U \le u, V \le v \mid z)$.⁵ Finally, to obtain the unconditional distribution of $Y$, integrate $F_{Y|Z}$ over the actual distribution of $Z$:

$$F_Y(y) = \int_{\mathcal{Z}} F_{Y|Z}(y|z)\, dF_Z(z) \qquad (5)$$

⁵ Arellano and Bonhomme (2015) also use a copula in a quantile context to model sample selection. Because they model the relation between a continuous outcome variable and a binary selection variable, they rely on parametric assumptions to identify the copula. Moreover, despite the technical similarities, their interpretation is different.
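As a numerical illustration of equation 4, the following Python sketch compares the single integral over $v$ with a brute-force simulation of the triangular model. Everything specific in it is an assumption made for illustration and does not come from the paper: the functional forms of `g` and `h`, the absence of covariates $X_2$, the Clayton copula with parameter 2, and all function names.

```python
import random

THETA = 2.0  # Clayton parameter, assumed for illustration only


def h(z1, v):
    """Hypothetical first stage, strictly increasing in v for z1 > 0."""
    return z1 * (1.0 + v)


def g(x1, u):
    """Hypothetical outcome equation, strictly increasing in u for x1 > 0."""
    return x1 * (0.5 + u)


def g_inv(x1, y):
    """Inverse of g in its last argument, clipped to the unit interval."""
    return min(1.0, max(0.0, y / x1 - 0.5))


def clayton_cond_cdf(u, v, theta=THETA):
    """Conditional copula C_{U|V}(u|v) of the Clayton family."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    s = u ** -theta + v ** -theta - 1.0
    return v ** (-(theta + 1.0)) * s ** (-(theta + 1.0) / theta)


def sample_clayton(rng, theta=THETA):
    """Draw (u, v) by inverting the conditional copula at a uniform draw w."""
    v = 1.0 - rng.random()  # lies in (0, 1]
    w = 1.0 - rng.random()
    u = (v ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v


def cdf_by_formula(y, z1, n_grid=2000):
    """Last line of equation 4, approximated with a midpoint rule over v."""
    total = 0.0
    for k in range(n_grid):
        v = (k + 0.5) / n_grid
        total += clayton_cond_cdf(g_inv(h(z1, v), y), v)
    return total / n_grid


def cdf_by_simulation(y, z1, n_draws=50000, seed=0):
    """First line of equation 4: share of simulated outcomes at or below y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        u, v = sample_clayton(rng)
        hits += g(h(z1, v), u) <= y
    return hits / n_draws


if __name__ == "__main__":
    print(cdf_by_formula(3.0, 2.0), cdf_by_simulation(3.0, 2.0))
```

Under these assumptions the two numbers should agree up to simulation and grid error, which is the content of the equality between the first and the last line of equation 4.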

2.2 Identification of the Counterfactual Distributions of Y

In this paper I focus on the type of counterfactuals in which the treatment is not randomly assigned.⁶ To better understand them, let us consider two illustrative examples. Imagine a policy maker interested in increasing workers’ income through an increase in the minimum level of compulsory schooling. In this counterfactual scenario, those who would attain a level of education above that minimum without the intervention would have the same level of education, but there would be an increase for those below it. Since there is self-selection into schooling, the increase in income for those students below the threshold would be smaller than if those above it increased their education level.

Alternatively, consider the case of a policy maker who wants to estimate the effect that halving the number of cigarettes smoked during pregnancy would have on the birth weight of newborns. Since smoking could be correlated with other unobserved bad habits, such a policy would have a different effect on women who smoked a different number of cigarettes, and it could increase the birth weight of those who would have had a low birth weight by less than that of the rest of the children.

Such counterfactuals involve a structural change in the relation between the endogenous variable and the outcome. Hence, the treatment is no longer determined by equation 2, but by $h^{cf}(z_1, x_2, v)$. In the two previous examples, the new structural function would be given by $h^{cf}(z_1, x_2, v) = \max\left\{h(z_1, x_2, v), x^{min}\right\}$, where $x^{min}$ would be the new minimum schooling age, and by $h^{cf}(z_1, x_2, v) = \frac{1}{2} h(z_1, x_2, v)$, respectively. Mathematically, the counterfactual distribution of the outcome variable is given by

$$F_Y^{cf}(y) = \int_{\mathcal{Z}} \int_{[0,1]^2} 1\left(g\left(h^{cf}(z_1, x_2, v), x_2, u\right) \le y\right) dC_{UV|X_2}(u, v|x_2)\, dF_Z(z) \qquad (6)$$
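The two counterfactual treatment rules and equation 6 can be sketched by simulation as follows. This is an illustrative sketch only: the forms of `g` and `h`, the uniform distribution of the instrument, the Clayton copula with parameter 2, and the threshold of 3 are assumptions chosen for the example, not taken from the paper.

```python
import random


def h(z1, v):
    """Hypothetical first stage, increasing in v."""
    return z1 * (1.0 + v)


def g(x1, u):
    """Hypothetical outcome equation, increasing in u and in x1."""
    return x1 * (0.5 + u)


def h_cf_minimum(z1, v, x_min):
    """Counterfactual rule: a compulsory minimum treatment x_min."""
    return max(h(z1, v), x_min)


def h_cf_halved(z1, v):
    """Counterfactual rule: everyone's treatment is halved."""
    return 0.5 * h(z1, v)


def draw_ranks(rng, theta=2.0):
    """Draw (u, v) from a Clayton copula by conditional inversion."""
    v = 1.0 - rng.random()
    w = 1.0 - rng.random()
    u = (v ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v


def simulate_cdf(y, treatment, n_draws=20000, seed=1):
    """Monte Carlo version of equation 6 for a given treatment rule."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        z1 = rng.uniform(1.0, 3.0)  # assumed distribution of the instrument
        u, v = draw_ranks(rng)      # the copula is left unchanged by the policy
        hits += g(treatment(z1, v), u) <= y
    return hits / n_draws


if __name__ == "__main__":
    print("actual ", simulate_cdf(4.0, h))
    print("minimum", simulate_cdf(4.0, lambda z1, v: h_cf_minimum(z1, v, 3.0)))
    print("halved ", simulate_cdf(4.0, h_cf_halved))
```

The point of the design is that only the treatment rule changes between runs, while the draws of $(u, v)$ and of the instrument are the same, so each individual keeps her conditional ranks under the counterfactual.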

3 Estimation

To keep the notation compact, define $S_Y(u|z, v) \equiv g(h(z_1, x_2, v), x_2, u)$, $S_Y^{cf}(u|z, v) \equiv g\left(h^{cf}(z_1, x_2, v), x_2, u\right)$, and $u(y, z, v) \equiv C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z, v) \mid v, x_2\right)$.

⁶ See appendix C for the case in which the policy maker can randomly assign either the treatment or just the instrument, after which individuals endogenously choose their treatment intensity.

3.1 Baseline Estimator

Let the following assumptions hold:

Assumption 1. $g(x_1, x_2, u) = x'\beta(u)$ and $h(z_1, x_2, v) = z'\gamma(v)$, where $\beta(\cdot)$ and $\gamma(\cdot)$ are continuous and $g(\cdot,\cdot,\cdot)$ and $h(\cdot,\cdot,\cdot)$ are strictly increasing in their last argument.

Assumption 2. $(y_i, x_{1i}, x_{2i}', z_{1i})'$ are iid for $i = 1, \dots, n$, defined on the probability space $(\Omega, \mathcal{F}, P)$, and take values in a compact set.

Assumption 3. $Y$ and $X_1$ have conditional densities that are bounded from above and away from zero, a.s. on compact sets $\mathcal{Y}$ and $\mathcal{X}_1$, respectively. The copula $C_{UV|X_2}(u, v|x_2)$ is bounded above and away from zero on $[0, 1]^2$, and it is uniformly continuous and differentiable with respect to its arguments a.e. Moreover, the marginals are uniformly distributed on the unit interval and therefore $c_{UV|X_2}(u, v|x_2) = c_{U|VX_2}(u|v, x_2) = c_{V|UX_2}(v|u, x_2)$.

Assumption 4. For all $(\tau, \theta)$, $\left(\beta(\tau)', \gamma(\theta)'\right)' \in \operatorname{int} B \times G$, where $B \times G$ is compact and convex.

Assumption 5. Define

$$\Pi(\beta, \iota, \gamma, \tau, \theta) \equiv E\begin{bmatrix} \left(\tau - 1\left(Y < X'\beta + \Phi(\tau, Z)'\iota\right)\right) \Psi(\tau, Z) \\ \left(\theta - 1\left(X_1 < Z'\gamma\right)\right) Z \end{bmatrix}$$

$$\Pi(\beta, \gamma, \tau, \theta) \equiv E\begin{bmatrix} \left(\tau - 1\left(Y < X'\beta\right)\right) \Psi(\tau, Z) \\ \left(\theta - 1\left(X_1 < Z'\gamma\right)\right) Z \end{bmatrix}$$

where $\Psi(\tau, Z) \equiv \left(\Phi(\tau, Z)', X_2'\right)'$ and $\Phi(\tau, Z)$ is a vector of transformations of the instruments. The Jacobian matrices $\frac{\partial}{\partial(\beta', \gamma')} \Pi(\beta, \gamma, \tau, \theta)$ and $\frac{\partial}{\partial(\beta_2', \iota', \gamma')} \Pi(\beta, \iota, \gamma, \tau, \theta)$ are continuous and have full rank, uniformly over $B \times I \times G \times T \times C$, and the image of $B \times G$ under the mapping $(\beta, \gamma) \mapsto \Pi(\beta, \gamma, \tau, \theta)$ is simply connected.⁷

Assumption 6. With probability approaching one, the function $\hat{\Phi}(\tau, z) \in \mathcal{F}$ and $\hat{\Phi}(\tau, z) \to_p \Phi(\tau, z)$ uniformly in $(\tau, z)$ over compact sets, where $\Phi(\tau, z) \in \mathcal{F}$; the functions $f(\tau, z) \in \mathcal{F}$ are uniformly smooth functions in $z$ with the uniform smoothness order $\omega > \dim\left((x_1, z')'\right)/2$, and moreover $\|f(\tau', z) - f(\tau, z)\| < C|\tau - \tau'|^a$, $C > 0$, $a > 0$, for all $(z, \tau, \tau')$.

Assumption 7. The counterfactual function $h^{cf}(z_1, x_2, v)$ is Hadamard differentiable with respect to $\gamma(v)$, and its Hadamard derivative is denoted by $\nabla_\gamma h^{cf}(z_1, x_2, v)$.

Assumption 8. The copula $C_{UV|X_2}(U, V|x_2; \xi)$ is known up to the vector of parameters $\xi \in \operatorname{int}(R)$, where $R$ is bounded and of finite dimension. Moreover, its pdf, denoted by $c(u, v|x_2; \xi)$, is three times continuously differentiable with respect to its arguments on $[0, 1]^2$.⁸

Assumption 1 imposes linearity on the two quantile processes of the triangular system, which greatly simplifies the computation of the estimator. This assumption is restrictive; for instance, it requires regressors to take either positive or negative values, but not both, as that would make the process non-monotonic.⁹ Moreover, the estimation of the second stage equation does not require linearity of the first stage equation (Chernozhukov and Hansen, 2005), but the linearity of both equations permits the estimator of the counterfactual distribution to be asymptotically linear, and thus to attain the $\sqrt{n}$ convergence rate. The linearity of the first stage equation is relaxed in appendix E, where I discuss the properties of the estimator when the equation is unknown. Assumptions 2 to 6 are regularity conditions required for the asymptotic Gaussianity of the Instrumental Variables Quantile Regression (IVQR; Chernozhukov and Hansen, 2005) and Quantile Regression (QR; Koenker and Bassett, 1978) estimators.¹⁰ Assumption 7 imposes some smoothness on the counterfactual treatment, which is required to apply the functional delta method to the estimator of the counterfactual distribution. For example, if $h^{cf}(z_1, x_2, v) = \max\left\{h(z_1, x_2, v), x^{min}\right\}$, then $\nabla_\gamma h^{cf}(z_1, x_2, v) = 1\left(h(z_1, x_2, v) \ge x^{min}\right) z'$. Finally, assumption 8 is a parametric assumption on the copula that allows us to obtain the joint asymptotic distribution of the SQF and the copula estimators.

⁷ $\iota$ denotes the parameter $\gamma$ in Chernozhukov and Hansen (2006).

⁸ Notice that this rules out the cases of perfect correlation, since in those either $P(U = u|V = v) = 1(u = v)$ or $P(U = u|V = v) = 1(u = 1 - v)$, implying that the joint pdf takes a value of zero in a dense subspace of $[0, 1]^2$.

⁹ See Koenker (2005) for further details.

¹⁰ Chernozhukov and Hansen (2005) no longer constitutes the state of the art in the identification of a triangular model. Torgovitsky (2015) and D’Haultfœuille and Février (2015) are the two most recent contributions to this literature and, as far as I know, no estimator of the structural quantile function is based on the identification results of these two papers. Proposing such an estimator that is also easily implementable in current applied research is beyond the scope of this paper.

Let $\hat{\beta}(\cdot)$ and $\hat{\gamma}(\cdot)$ be the IVQR and QR estimators of the parameters of the processes defined in assumption 1.¹¹ The estimator of the counterfactual SQF, conditional on $Z$ and $V$, is given by

$$\hat{S}_Y^{cf}(u|z, v) \equiv \hat{x}^{cf}(v)'\hat{\beta}(u) = \begin{pmatrix} \hat{h}^{cf}(z_1, x_2, v) \\ x_2 \end{pmatrix}' \hat{\beta}(u) \qquad (7)$$

The estimation of the copula requires the inversion of the quantile processes to obtain the estimated values of the conditional ranks for each individual, given by $\hat{v}_i = \int_0^1 1\left(z_i'\hat{\gamma}(v) \le x_{1i}\right) dv$ and $\hat{u}_i = \int_0^1 1\left(x_i'\hat{\beta}(u) \le y_i\right) du$. Then, the parameter vector of the copula, $\xi$, is estimated by maximizing the following pseudo-likelihood function:

$$\hat{\xi} = \arg\max_{\xi} \frac{1}{n} \sum_{j=1}^n \log\left(c(u_j, v_j|x_2; \xi)\right) + \frac{1}{n} \sum_{j=1}^n \log\left(\frac{c(\hat{u}_j, \hat{v}_j|x_2; \xi)}{c(u_j, v_j|x_2; \xi)}\right) \qquad (8)$$

The first term in equation 8 is the log likelihood function. However, because the actual values of the conditional ranks that enter the copula are not observed, the function that is maximized differs from the actual log likelihood function by the second term. To obtain the estimator of the copula, replace $\xi$ by $\hat{\xi}$:

$$\hat{C}_{UV|X_2}(u, v|x_2) \equiv C_{UV|X_2}\left(u, v|x_2; \hat{\xi}\right) \qquad (9)$$

¹¹ For simplicity, in this paper I assume constant weights in the estimation. Generalizing the estimation to have non-constant weights is straightforward.

The asymptotic distribution of the parametric estimator of the copula, together with that of the structural quantile function, is given by:

Proposition 1. Under assumptions 1 to 8, the joint asymptotic distribution of $\hat{S}_Y^{cf}(u|z, v)$ and $\hat{C}_{U|VX_2}(u|v, x_2)$ is given by

$$\sqrt{n} \begin{pmatrix} \hat{S}_Y^{cf}(u|z, v) - S_Y^{cf}(u|z, v) \\ \hat{C}_{U|VX_2}(u|v, x_2) - C_{U|VX_2}(u|v, x_2) \end{pmatrix} \Rightarrow G_N(u, v, z)$$

where $G_N(u, v, z) \equiv N(u, v, z)\, G_M(u, v)$ is a Gaussian process, $G_M(u, v)$ is the joint asymptotic distribution of the IVQR and QR estimators and the estimator of the copula parameters, a Gaussian process with zero mean and covariance matrix $\Sigma_M(u, v, \tilde{u}, \tilde{v})$, and

$$N(u, v, z) \equiv \begin{bmatrix} x^{cf}(v)' & \beta_1(u) \nabla_\gamma h^{cf}(z_1, x_2, v) & 0 \\ 0 & 0 & \frac{\partial}{\partial \xi} C_{U|VX_2}(u|v, x_2; \xi) \end{bmatrix}.$$

The process $G_N(u, v, z)$ has zero mean and covariance function given by:

$$\Sigma_N(u, v, z, \tilde{u}, \tilde{v}, \tilde{z}) \equiv N(u, v, z)\, \Sigma_M(u, v, \tilde{u}, \tilde{v})\, N(\tilde{u}, \tilde{v}, \tilde{z})'$$

Proof. See appendix A.1.

The estimator of $F_Y^{cf}(y)$ is the sample analog of equation 6:¹²

$$
\begin{aligned}
\hat{F}_Y^{cf}(y) &= \frac{1}{n} \sum_{i=1}^n \int_{[0,1]^2} 1\left(\hat{S}_Y^{cf}(u|z_i, v) \le y\right) d\hat{C}_{UV|X_2}(u, v|x_{2i}) \\
&= \frac{1}{n} \sum_{i=1}^n \int_0^1 \hat{C}_{U|VX_2}\left(\hat{S}_Y^{cf,-1}(y|z_i, v) \mid v, x_{2i}\right) dv \qquad (10)
\end{aligned}
$$

The following theorem characterizes its asymptotic distribution:

Theorem 1. Let assumptions 1 to 8 hold. The asymptotic distribution of $\hat{F}_Y^{cf}$ is given by

$$\sqrt{n}\left(\hat{F}_Y^{cf}(y) - F_Y^{cf}(y)\right) \Rightarrow G_O(y)$$

where $G_O(y) \equiv \int_{\mathcal{Z}} \int_0^1 O(y, v, z)\, G_N(u(y, z, v), v, z)\, dv\, dF_Z(z)$ is a Gaussian process, and $O(y, v, z) = \begin{bmatrix} -f^{cf}_{Y|ZV}(y|z, v) & 1 \end{bmatrix}$. $G_O(y)$ has zero mean and covariance function given by:

$$\Sigma_O(y, \tilde{y}) = \int_{\mathcal{Z}^2} \int_{[0,1]^2} O(y, v, z)\, \Sigma_N\left(u(y, z, v), v, z, u(\tilde{y}, \tilde{z}, \tilde{v}), \tilde{v}, \tilde{z}\right) O(\tilde{y}, \tilde{v}, \tilde{z})'\, dv\, d\tilde{v}\, dF_Z(z)\, dF_Z(\tilde{z})$$

Proof. See appendix A.2.

The previous result holds true even if the estimators of the SQF and the conditional copula differ from those presented in this paper, as long as they are jointly asymptotically Gaussian and converge at the parametric rate. Estimation of the asymptotic variance of $\hat{F}_Y^{cf}$ is feasible, but computationally cumbersome: many of these variances need to be computed for a large number of values, which makes it particularly impractical. For example, $\Sigma_O(y, \tilde{y})$ would need to be computed for every possible combination of $(y_i, y_j)$, for $i, j = 1, \dots, n$, and $\Sigma_N(u, v, z, \tilde{u}, \tilde{v}, \tilde{z})$ would require the computation of an even larger number of combinations of the arguments upon which it depends. Despite that, an estimator of the asymptotic variance is presented in appendix D.

¹² With this estimator, it is straightforward to estimate any other function that depends on $F_Y^{cf}(y)$ by plugging in this estimator as in Chernozhukov et al. (2013).
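The rank inversion and the pseudo-likelihood step of this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the true quantile functions stand in for the IVQR and QR estimates, there are no covariates $X_2$, the copula is assumed to be Clayton, the maximization is a plain grid search, and ranks are clipped away from 0 and 1 so that the log density is finite. All function names are hypothetical.

```python
import math
import random


def estimate_rank(value, fitted_quantile, n_grid=999):
    """Approximate the integral of 1(q(t) <= value) over t on an even grid."""
    hits = sum(fitted_quantile((k + 1) / (n_grid + 1)) <= value
               for k in range(n_grid))
    eps = 0.5 / n_grid  # assumption: keep ranks strictly inside (0, 1)
    return min(1.0 - eps, max(eps, hits / n_grid))


def clayton_log_density(u, v, theta):
    """Log pdf of the Clayton copula, used here as the parametric family."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log(1.0 + theta)
            - (1.0 + theta) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))


def fit_clayton(u_hat, v_hat, candidates):
    """Maximize the pseudo-likelihood at the estimated ranks by grid search."""
    def pseudo_loglik(theta):
        return sum(clayton_log_density(u, v, theta)
                   for u, v in zip(u_hat, v_hat))
    return max(candidates, key=pseudo_loglik)


def simulate_and_fit(n=500, theta=2.0, seed=0):
    """Simulate a toy triangular model, invert it, and estimate theta."""
    rng = random.Random(seed)
    u_hat, v_hat = [], []
    for _ in range(n):
        z1 = rng.uniform(1.0, 3.0)
        v = 1.0 - rng.random()
        w = 1.0 - rng.random()
        u = (v ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        x1 = z1 * (1.0 + v)  # hypothetical first stage
        y = x1 * (0.5 + u)   # hypothetical outcome equation
        # True quantile functions stand in for the fitted quantile processes.
        v_hat.append(estimate_rank(x1, lambda t: z1 * (1.0 + t)))
        u_hat.append(estimate_rank(y, lambda t: x1 * (0.5 + t)))
    candidates = [0.1 * k for k in range(1, 61)]
    return fit_clayton(u_hat, v_hat, candidates)


if __name__ == "__main__":
    print("estimated Clayton parameter:", simulate_and_fit())
```

In an application the two lambdas would be replaced by the fitted processes $z_i'\hat{\gamma}(\cdot)$ and $x_i'\hat{\beta}(\cdot)$ evaluated on the quantile grid used in estimation.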

3.2 Nonparametric Estimator of the Copula

Assumption 8 is restrictive, as it imposes a parametric distribution on the unobserved vector of conditional ranks. Since the copula can be nonparametrically identified, consider the following assumption, which does not require the copula to be parametric:

Assumption 9. The copula $C_{UV}(u, v)$ is independent of $X_2$.

In words, the copula is left unspecified, but it is constant for all values of $X_2$. This could be further relaxed, but the estimator of the copula would then converge at a rate slower than $\sqrt{n}$. The nonparametric estimator of the copula is then the empirical copula based on the estimated values of the conditional ranks $(\hat{u}_i, \hat{v}_i)$:

$$\check{C}_{UV}(u, v) \equiv \frac{1}{n} \sum_{i=1}^n 1(\hat{u}_i \le u)\, 1(\hat{v}_i \le v) \qquad (11)$$

The estimator of the counterfactual distribution of $Y$, conditional on $Z$, is obtained by plugging in the nonparametric estimator of the copula:

$$\check{F}^{cf}_{Y|Z}(y|z) \equiv \int_{[0,1]^2} 1\left(\hat{x}^{cf}(v)'\hat{\beta}(u) \le y\right) d\check{C}_{UV}(u, v) = \frac{1}{n} \sum_{j=1}^n 1\left(\hat{x}^{cf}(\hat{v}_j)'\hat{\beta}(\hat{u}_j) \le y\right) \qquad (12)$$

This estimator does not require calculating weights that depend on the copula. Hence, the estimator of the unconditional counterfactual distribution, which is a double sum given by $\check{F}_Y^{cf}(y) \equiv \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n 1\left(\hat{x}_i^{cf}(\hat{v}_j)'\hat{\beta}(\hat{u}_j) \le y\right)$, is fast to compute.

The uniform convergence of $\check{C}_{UV}(u, v)$ can be shown to be at rate $\sqrt{n}$, which in turn allows one to show that $\check{F}^{cf}_{Y|Z}(y|z)$ is indeed uniformly consistent at that rate. However, it is not possible to obtain the asymptotic Gaussianity by the usual arguments: because of the nonlinearity of the indicator function, it is not possible to use the extended continuous mapping theorem, which is required because the conditional ranks are not observed. This issue could be overcome by using a smooth function that converges uniformly to the indicator function, but even in that case it would not be possible to establish the asymptotic normality as in theorem 1, since the estimator of the conditional copula would converge at a rate slower than $\sqrt{n}$.¹³ Nevertheless, $\check{F}^{cf}_{Y|Z}(y|z)$ is a uniformly consistent estimator, as shown by the following proposition:

Proposition 2. Let assumptions 1 to 7, and 9 hold. Then,

$$\sup_{y,z} \sqrt{n} \left| \check{F}^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z) \right| = O_p(1)$$

Proof. See appendix A.3.
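The empirical copula of equation 11 and the double-sum estimator of the unconditional counterfactual distribution can be sketched as follows. The functions `treatment_cf` and `outcome` are hypothetical stand-ins for $\hat{h}^{cf}$ and for $x'\hat{\beta}(u)$, covariates $X_2$ are omitted, and the tiny data set is made up purely so that the result can be checked by hand.

```python
def empirical_copula(u_hat, v_hat, u, v):
    """Equation 11: share of observations with both ranks below (u, v)."""
    n = len(u_hat)
    return sum(1 for a, b in zip(u_hat, v_hat) if a <= u and b <= v) / n


def counterfactual_cdf(y, z1_values, u_hat, v_hat, treatment_cf, outcome):
    """Double sum over instrument values (i) and estimated rank pairs (j)."""
    hits = 0
    for z1 in z1_values:
        for u_j, v_j in zip(u_hat, v_hat):
            hits += outcome(treatment_cf(z1, v_j), u_j) <= y
    return hits / (len(z1_values) * len(u_hat))


# Made-up inputs: two observations, treatment halved in the counterfactual.
u_hat = [0.2, 0.8]
v_hat = [0.3, 0.6]
z1_values = [1.0, 2.0]


def halved(z1, v):
    """Hypothetical counterfactual treatment: half of z1 * (1 + v)."""
    return 0.5 * z1 * (1.0 + v)


def outcome(x1, u):
    """Hypothetical fitted structural quantile function."""
    return x1 * (0.5 + u)


if __name__ == "__main__":
    print(empirical_copula(u_hat, v_hat, 0.5, 0.5))
    print(counterfactual_cdf(1.0, z1_values, u_hat, v_hat, halved, outcome))
```

Because each rank pair $(\hat{u}_j, \hat{v}_j)$ is kept together inside the inner loop, the dependence between the two ranks is carried over to the counterfactual without computing any copula weights.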

3.3 Discussion of Alternative Methods and their Validity

An alternative estimator of the counterfactual distribution of $Y$ would be the one proposed by Chernozhukov et al. (2013), based on the IVQR estimator and the counterfactual distribution of the treatment. However, this alternative estimator is biased as long as the counterfactual treatment intensity is correlated with the treatment effect. Hence, this method would be appropriate if the policy maker could assign treatment intensity at random or if treatment intensity were the same for everyone, but not if individuals have the ability to choose the treatment intensity.

Another possibility is to estimate the triangular model using a control function approach, and then estimate the counterfactual distribution of $Y$ based on these estimates. For example, Lee (2007) proposes a control function quantile regression estimator for the following triangular model:

$$Y = X\beta(\tau) + Z_1'\gamma(\tau) + U$$

$$X = \mu(\alpha) + Z'\pi(\alpha) + V$$

The identification of this model is based on different conditions than those considered in this paper. In particular, Lee (2007) assumes that $Q_{U|XZ}(\tau|x, z) = Q_{U|V}(\tau|v) \equiv \lambda_\tau(v)$, so this model and the baseline model of this paper are not nested. In his model, the distribution of $U$ is independent of both the endogenous treatment and the remaining covariates, once we control for $V$. Further, he imposes additivity of a function of $V$ in the second stage equation, and therefore the joint distribution of $U$ and $V$ is not the copula given in equation 3.

Martinez-Sanchis et al. (2012) propose an estimator of the unconditional distribution of $Y$ based on Lee (2007).¹⁴ This estimator can consistently estimate the actual distribution of $Y$, and the counterfactual distribution when the distribution of $Z$ is changed, but it cannot consistently estimate the counterfactual distributions considered in this paper: by definition, $U$ and $V$ are heteroskedastic in the covariates, and the structural change in the determination of the treatment implies a different conditional distribution of $(U, V)$ given $Z$, which is not captured by the estimated values of $(U, V)$. On the other hand, the copula approach proposed in this paper is invariant to such counterfactuals, and therefore appropriate to estimate the distribution of $Y$.

¹³ See appendix B.8.

¹⁴ Note that Martinez-Sanchis et al. (2012) do not show the asymptotic distribution of their estimator.

4 Monte Carlo

To evaluate the finite sample performance of the estimator, I carried out a simulation study with the following data generating process:

$$X_{1i} = Z_{1i}\gamma_1(v_i) + X_{2i}\gamma_2(v_i) + \gamma_3(v_i)$$

$$Y_i = X_{1i}\beta_1(u_i) + X_{2i}\beta_2(u_i) + \beta_3(u_i)$$

where the parameters are given by $\gamma(\theta) = \left(2 + \theta,\; 1 + 1.5\log(1+\theta),\; F_{t_5}^{-1}(\theta)\right)'$ and $\beta(\tau) = \left(1 + 2\tan(\tau),\; 1 + 2(\tau - 0.5)^3,\; \Phi^{-1}(\tau)\right)'$, the instrument and the exogenous variables are drawn from $Z_{1i} \sim U(1,10)$, $X_{2i} \sim U(10,15)$, and the copula is drawn from $(u_i, v_i) \sim \text{Clayton}(2)$. The sample size equals n = 2000, the number of repetitions is R = 1000, and the quantile grid for both the first and second stage equations was made out of 99 evenly spaced quantiles.

Figure 1 shows the scatter plot of the estimated values of the conditional ranks and their actual values, since they are required to estimate the copula. Both estimates are reasonably close to the 45 degree line, and therefore to their true values, but the estimates of $v_i$ are more accurate than those of $u_i$.

Figure 2 compares the performance of the different estimators of the actual distribution: the estimator with the parametric copula, the estimator with the nonparametric copula, the estimator proposed by Martinez-Sanchis et al. (2012) (MMK), and the estimator proposed by Chernozhukov et al. (2013) (CFM). As shown in table 1, the first three estimators approximate the actual distribution reasonably well, particularly the estimator with the parametric copula. On the other hand, the estimator proposed by Chernozhukov et al. (2013) is biased in a large part of the distribution. All four estimators have a similar precision, although the parametric estimator performs slightly worse than the other three. Notice that part of the difference between the estimates and the actual distribution is due to the density of the grid used to approximate the unit interval. Increasing the number of quantiles used in the estimation to approximate the integrals results in a better approximation, particularly at the tails.15
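The data generating process above can be sketched in a few lines. This is only an illustrative simulation, not the code used in the paper: the function names (`t5_cdf`, `t5_quantile`, `simulate`) are hypothetical, the Clayton ranks are drawn by conditional inversion, and the third component of $\gamma$ is assumed to be the quantile function of a Student t distribution with 5 degrees of freedom, inverted here by bisection on its closed-form cdf.

```python
import math
from statistics import NormalDist

import numpy as np


def t5_cdf(x):
    """Closed-form cdf of a Student t distribution with 5 degrees of freedom."""
    a = np.arctan(x / math.sqrt(5.0))
    c = np.cos(a)
    return 0.5 + (a + np.sin(a) * (c + (2.0 / 3.0) * c**3)) / math.pi


def t5_quantile(p, lo=-1e3, hi=1e3, iters=80):
    """Invert t5_cdf by bisection (works on scalars and arrays)."""
    p = np.asarray(p, dtype=float)
    lo = np.full_like(p, lo)
    hi = np.full_like(p, hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        below = t5_cdf(mid) < p
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    return 0.5 * (lo + hi)


def simulate(n=2000, theta=2.0, seed=0):
    """Draw one sample from the Monte Carlo design (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    # Clayton(theta) ranks by conditional inversion: u is the outcome rank,
    # v is the treatment rank.
    u = np.clip(rng.uniform(size=n), eps, 1 - eps)
    w = np.clip(rng.uniform(size=n), eps, 1 - eps)
    v = ((w ** (-theta / (1 + theta)) - 1) * u ** (-theta) + 1) ** (-1 / theta)
    v = np.clip(v, eps, 1 - eps)
    z1 = rng.uniform(1, 10, n)    # instrument
    x2 = rng.uniform(10, 15, n)   # exogenous covariate
    # First stage: treatment as a function of the rank v.
    x1 = z1 * (2 + v) + x2 * (1 + 1.5 * np.log1p(v)) + t5_quantile(v)
    # Second stage: outcome as a function of the rank u.
    inv_norm = NormalDist().inv_cdf
    b3 = np.array([inv_norm(p) for p in u])
    y = x1 * (1 + 2 * np.tan(u)) + x2 * (1 + 2 * (u - 0.5) ** 3) + b3
    return {"y": y, "x1": x1, "x2": x2, "z1": z1, "u": u, "v": v}
```

Calling `simulate(seed=r)` for r = 0, ..., R − 1 would give the R independent samples on which the estimators are compared.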

Figure 1: Estimated values of the conditional ranks

[Two scatter plots: the estimated rank $\hat u_i$ against the true rank $u$, and the estimated rank $\hat v_i$ against the true rank $v$. All axes range from 0 to 1.]

Notes: Each graph is the scatterplot of the true values of the conditional ranks and their estimated values, for one repetition of the Monte Carlo.

Figure 2: Unconditional cdf estimators

[Four panels: Parametric, Nonparametric, MMK, and CFM. Horizontal axes range from 50 to 200; vertical axes range from 0 to 1.]

Notes: In each of the four graphs, the solid line represents the actual distribution of Y, the dashed line represents the median (pointwise) across repetitions of the estimator, and the dotted lines represent the 2.5 and 97.5 percentiles (pointwise) across repetitions.

Now consider a counterfactual in which the policy maker sets a compulsory minimum treatment: $x_1 = \max\{z'\gamma(v), 40\}$. Figure 3 shows, for each of the four estimators, the difference between the counterfactual and the actual distributions, i.e., the unconditional distributional effect of the policy. The estimators proposed in this paper do a good job at estimating the counterfactual distribution, with the true difference of the distributions being inside the 95% confidence bands. However, the same is not true for the other two estimators: the estimator proposed by Martinez-Sanchis et al. (2012) is biased, particularly at the lower tail of the distribution, whereas the estimator proposed by Chernozhukov et al. (2013) is substantially biased everywhere, with the sign of the bias being different at different parts of the distribution. To understand this, notice that the latter is the only estimator that does not take into account the endogeneity of the treatment, substantially underestimating the actual distributional effect of the policy. This is confirmed by the difference between the actual counterfactual distribution and the median estimate across repetitions in table 2. Regarding their accuracy, it is very similar for all four estimators.

15 Results available upon request.

Table 1: Fit of the actual distributions

| | Parametric | Nonparametric | MMK | CFM |
|---|---|---|---|---|
| $\int_{\mathcal{Y}} \lvert Q_{0.5}(\hat F_Y(y)) - F_Y(y) \rvert \, dy$ | 0.002 | 0.012 | 0.008 | 0.029 |
| $\sup_y \lvert \hat F_Y(y) - F_Y(y) \rvert$ | 0.005 | 0.020 | 0.011 | 0.060 |
| $\int_{\mathcal{Y}} \nabla_{0.025}^{0.975} Q(\hat F_Y(y)) \, dy$ | 0.023 | 0.020 | 0.019 | 0.020 |
| $\sup_y \nabla_{0.025}^{0.975} Q(\hat F_Y(y))$ | 0.039 | 0.041 | 0.040 | 0.048 |

Notes: The first row represents the integral of the difference between the median across repetitions of the estimated cdf and the true cdf; the second row represents the maximum of this difference; the third and fourth rows represent the same differences between the 97.5 and 2.5 percentiles.

Figure 3: Difference between the actual and counterfactual unconditional cdf estimators

[Four panels: Parametric, Nonparametric, MMK, and CFM. Horizontal axes range from 50 to 200; vertical axes range from −0.25 to −0.05.]

Notes: In each of the four graphs, the solid line represents the true difference between the counterfactual and the actual distributions of Y, the dashed line represents the median (pointwise) across repetitions of the estimator, and the dotted lines represent the 2.5 and 97.5 percentiles (pointwise) across repetitions.
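The four summary rows reported in tables 1 and 2 can be computed from the Monte Carlo output along the following lines. This is a minimal sketch under stated assumptions: `cdf_draws` is a hypothetical (R × G) array holding the estimated cdf from each of the R repetitions on a common grid `y_grid`, the integrals are approximated with the trapezoidal rule, and no normalization by the range of y is applied (the paper does not say whether one is used).

```python
import numpy as np


def trapezoid(values, grid):
    """Trapezoidal approximation of the integral of `values` over `grid`."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))


def fit_statistics(cdf_draws, true_cdf, y_grid):
    """Summaries of bias and dispersion of a cdf estimator across repetitions.

    cdf_draws: (R, G) array, one estimated cdf per repetition.
    true_cdf:  (G,) array with the true cdf on the same grid.
    y_grid:    (G,) increasing grid of outcome values.
    """
    median = np.median(cdf_draws, axis=0)                    # pointwise median
    lo, hi = np.percentile(cdf_draws, [2.5, 97.5], axis=0)   # pointwise band
    abs_err = np.abs(median - true_cdf)
    width = hi - lo
    return {
        "int_abs_median_error": trapezoid(abs_err, y_grid),
        "sup_abs_median_error": float(abs_err.max()),
        "int_band_width": trapezoid(width, y_grid),
        "sup_band_width": float(width.max()),
    }
```

For table 2 the same function would be applied to the estimated and true differences between the counterfactual and actual cdfs rather than to the cdfs themselves.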


Table 2: Fit of the difference between the actual and counterfactual distributions

| | Parametric | Nonparametric | MMK | CFM |
|---|---|---|---|---|
| $\int_{\mathcal{Y}} \lvert \hat F_Y(y) - F_Y(y) \rvert \, dy$ | 0.001 | 0.001 | 0.010 | 0.021 |
| $\sup_y \lvert \hat F_Y(y) - F_Y(y) \rvert$ | 0.003 | 0.003 | 0.060 | 0.054 |
| $\int_{\mathcal{Y}} \nabla_{0.025}^{0.975} Q(\hat F_Y(y)) \, dy$ | 0.022 | 0.020 | 0.019 | 0.024 |
| $\sup_y \nabla_{0.025}^{0.975} Q(\hat F_Y(y))$ | 0.053 | 0.058 | 0.057 | 0.057 |

Notes: The first row represents the integral of the difference between the median across repetitions of the estimated counterfactual cdf and the true cdf; the second row represents the maximum of this difference; the third and fourth rows represent the same differences between the 97.5 and 2.5 percentiles.

5

Empirical Application

Consider the estimation of the counterfactual distribution of the birth weight of children whose mothers smoked during pregnancy, had they smoked less. It is particularly interesting to the policy maker to estimate the proportion of newborns with Low Birth Weight (LBW, below 2,500 grams) and Very Low Birth Weight (VLBW, below 1,500 grams), since newborns falling into these categories have a higher chance of developing several medical conditions later in life, including problems in cognitive development and chronic diseases (Case and Paxson, 2009; Almond and Currie, 2011).

The data I use combines the 1990 Natality Data from the National Vital Statistics System of the National Center for Health Statistics, which records every birth in the United States during 1990, and the Tax Burden of Tobacco, which includes the percentage of state taxes over the final price of cigarettes, which is the instrument I use.16 The covariates I include in the regression are a quadratic polynomial of the mother’s age, the number of years of education, the number of gestation weeks, and dummy variables for the race of the mother (1 if black), the marital status (1 if married), and the sex of the newborn (1 if female). I restrict the data to firstborn children of mothers aged 18-35 who smoked during pregnancy, and who were either of black or white race. Moreover, I exclude multiple births, since they have a higher chance of having a lower birth weight. This leaves us with 144,478 births, and table 3 presents some descriptive statistics of the variables used. Standard errors for the IVQR estimates were computed following Chernozhukov and Hansen (2005), and for the remaining estimates using the bootstrap with 200 repetitions.

16 For a more detailed description of the datasets, see Evans and Ringel (1999).

Table 3: Descriptive statistics

| Variable | Mean | S.D. |
|---|---|---|
| Birth weight | 3188.82 | 531.93 |
| Mean number of daily cigarettes smoked | 11.87 | 7.60 |
| State tax as a percentage of price | 0.19 | 0.08 |
| Age | 23.59 | 4.48 |
| Black | 0.09 | 0.28 |
| Years of Education | 11.99 | 1.72 |
| Married | 0.56 | 0.50 |
| Gestation weeks | 39.49 | 2.42 |
| Female baby | 0.49 | 0.50 |
| N | 144,478 | |

Figure 4 shows the IVQR estimates, which display a fair amount of heterogeneity. In particular, increasing the average daily number of cigarettes smoked during pregnancy could result in a decrease of the weight of the newborn of up to 50 grams per cigarette for the lower quantiles, but as little as 10 grams for the upper quantiles. Mother’s age has an inverse u-shaped effect, and young or old mothers tend to have children that weigh less. Also, children of black mothers tend to weigh 200 to 300 grams less than their white counterparts. Education has a positive effect only for the quantiles around and above the median, whereas the number of gestation weeks has an unambiguous positive and constant effect. Married women tend to have heavier children, although this effect is more marked for the lower quantiles. Finally, female newborns are lighter than their male counterparts by about 100 to 150 grams.

Regarding the estimates of the copula parameters, I allowed the copula to be different for white and black mothers, finding a positively correlated copula for both of them with a higher correlation for white mothers.17 The fact that the correlation is positive implies that, for both demographic groups, those mothers with a propensity to smoke more are also those with a propensity to give birth to heavier children. However, for black women this is not so marked. Table 4 shows the estimates of the parameters of the four parametric copulas considered in the estimation.

17 I did a similar analysis for other variables, such as married versus unmarried mothers, mothers with college studies versus no college studies, young mothers versus old mothers, or male versus female babies, and found no substantial differences.

Figure 4: IVQR estimates

[Nine panels, one per coefficient, each plotted against the quantile index (0.2 to 0.8).]

Notes: From left to right, the IVQR estimates shown are: in the first row, cigarettes consumption, mother’s age, and mother’s age to the square; in the second row, race, years of education, and married; in the third row, gestation weeks, sex of the baby, and the constant term.

Table 4: Copula parameter estimates and counterfactual changes in the distribution

| | Gaussian | Clayton | Frank | Gumbel |
|---|---|---|---|---|
| All | 0.28 (0.04) | 0.37 (0.06) | 1.54 (0.22) | 1.20 (0.04) |
| White | 0.29 (0.04) | 0.37 (0.06) | 1.57 (0.23) | 1.21 (0.04) |
| Black | 0.23 (0.03) | 0.28 (0.05) | 1.26 (0.15) | 1.16 (0.03) |

Notes: Bootstrapped standard errors in parentheses.
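The parameters in table 4 live on different scales, so they are not directly comparable across copula families. One common way to put them on a common footing is Kendall's tau. The sketch below uses the textbook closed forms and assumes the standard parametrizations (the Gaussian parameter is taken to be the correlation coefficient); the paper does not state which parametrization it uses, so this is an illustration rather than a reproduction of its results. The function names are hypothetical.

```python
import math


def debye1(theta, steps=10_000):
    """First-order Debye function D1(theta) = (1/theta) * int_0^theta t/(e^t - 1) dt,
    approximated with the midpoint rule."""
    h = theta / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += t / math.expm1(t)
    return total * h / theta


def kendall_tau(family, par):
    """Kendall's tau implied by a copula parameter (standard parametrizations)."""
    if family == "gaussian":   # par is the correlation coefficient
        return 2.0 / math.pi * math.asin(par)
    if family == "clayton":    # par > 0
        return par / (par + 2.0)
    if family == "gumbel":     # par >= 1
        return 1.0 - 1.0 / par
    if family == "frank":      # par > 0 in this sketch
        return 1.0 - 4.0 / par * (1.0 - debye1(par))
    raise ValueError(f"unknown copula family: {family}")


# Point estimates for the full sample ("All" row of table 4).
for family, par in [("gaussian", 0.28), ("clayton", 0.37),
                    ("frank", 1.54), ("gumbel", 1.20)]:
    print(family, round(kendall_tau(family, par), 3))
```

Under these assumptions the four families imply broadly similar, moderately positive rank dependence, which is in line with the qualitative reading of the table in the text.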

Before presenting the estimates of the counterfactual distributions, let us compare the fit of the estimator of the unconditional cdf of children’s birth weight when the copula is estimated both parametrically and nonparametrically. The parametric estimator uses a Gumbel copula, since it had the best fit among the parametric copulas considered.18 The first column of figure 5 plots both estimators, as well as the 95% confidence bands, which were computed using the bootstrap. Clearly, the fit of the estimator based on the nonparametric estimator of the copula is the best. The empirical cdf always lies inside the confidence bands, and it is very close to the estimator $\check F_Y(y)$.

Figure 5: Unconditional cdf of Y

[Six panels. Rows: Parametric and Nonparametric. Columns: Actual CDF, Counterfactual 1 CDF, and Counterfactual 2 CDF. Horizontal axes range from 2000 to 4000.]

Notes: For the actual CDF, the solid blue line represents the estimator of the unconditional distribution, the red dotted-dashed line represents the empirical CDF, and the dotted green lines represent the 95% bootstrapped confidence bands. For the counterfactual CDFs, the solid blue line represents the difference between the estimated counterfactual distribution and the actual distribution, and the dotted green lines represent the 95% bootstrapped confidence bands.

Now consider the following two counterfactuals:

1. Each mother smokes half as much as she actually did.
2. Each mother smokes at most an average of ten cigarettes per day.

Because the change in the treatment intensity is correlated with the individual effect, the methods presented in this paper can consistently estimate the counterfactual distributions. The change in the distribution with respect to the actual distribution is displayed in the second and third columns of figure 5. Both counterfactuals substantially increase the average birth weight, although this change is far from homogeneous, and the first counterfactual has a larger impact than the second one. In particular, the largest change in the distribution takes place on newborns who otherwise would have weighed around 3,000 grams.

For the subpopulations of interest, i.e. LBW and VLBW children, whose proportions equal 9.1% and 0.6% in the actual distribution, the counterfactuals would be effective in reducing their numbers. In the first counterfactual, these proportions are estimated to decrease by 4.6 and 0.5 percentage points, respectively, both when the copula is parametric and when it is nonparametric. In the second counterfactual, the changes would be more modest, particularly when the copula used is nonparametric, in which case the decrease would be roughly one half of the previous one (2.5 and 0.2 percentage points, respectively), whereas with the parametric copula they would equal 3.3 and 0.5 percentage points. Hence, policies aiming at reducing smoking during pregnancy could have an important impact in reducing the percentage of children with low birth weight.

18 See appendix F for further details.
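The two counterfactual treatment rules are simple transformations of the observed smoking intensity, and the statistic of interest is a share below a weight threshold. A minimal sketch, with hypothetical function names and toy numbers that are not taken from the data, may help fix ideas:

```python
import numpy as np


def halve(cigarettes):
    """Counterfactual 1: every mother smokes half as much as she actually did."""
    return 0.5 * np.asarray(cigarettes, dtype=float)


def cap(cigarettes, limit=10.0):
    """Counterfactual 2: every mother smokes at most `limit` cigarettes per day."""
    return np.minimum(np.asarray(cigarettes, dtype=float), limit)


def share_below(weights, threshold):
    """Share of newborns weighing less than `threshold` grams."""
    return float(np.mean(np.asarray(weights, dtype=float) < threshold))


# Toy illustration: the change in treatment depends on the initial intensity.
cigs = np.array([2.0, 10.0, 20.0, 40.0])
print(cigs - halve(cigs))   # reduction grows with baseline smoking
print(cigs - cap(cigs))     # only mothers above the cap are affected
```

In both rules the size of the change varies across mothers with their baseline intensity, which is why the change in treatment is correlated with the individual effect. The counterfactual birth weights themselves would come from the estimated outcome quantile process and copula, which this sketch does not attempt to reproduce.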

6

Conclusions

In this paper I propose an estimator of counterfactual unconditional distribution functions in the presence of an endogenous continuous treatment with heterogeneous effects. This estimator is based on the estimators of the quantile processes that characterize a triangular system of equations, and the estimator of the copula of the conditional ranks, which captures the endogeneity of the treatment. The latter is nonparametrically identified by inverting the quantile processes of the triangular system, and it can be estimated either parametrically, resulting in an estimator that is asymptotically Gaussian with the usual $\sqrt{n}$ convergence rate, or nonparametrically, using the empirical cdf of the estimated values of the conditional ranks. The counterfactuals I consider involve changes in the intensity of the treatment that are correlated with the marginal effect on each individual, i.e., that are not randomly assigned.

I show the finite sample performance of these estimators in a Monte Carlo simulation, comparing them to alternative estimators of the unconditional distribution of the outcome variable. As an empirical application I estimate the effect of smoking during pregnancy on newborns’ birth weight, and I carry out two counterfactuals in which I respectively reduce the number of cigarettes smoked during pregnancy to one half of the actual quantity, and limit their consumption to a maximum of ten per day. Both counterfactuals would increase the birth weight of newborns, though this effect would be heterogeneous, and would substantially reduce the percentage of newborns with low birth weight.

References

Acemoglu, D., S. Johnson, and J. A. Robinson (2012). The colonial origins of comparative development: An empirical investigation: Reply. The American Economic Review 102 (6), 3077–3110.

Almond, D. and J. Currie (2011). Killing me softly: The fetal origins hypothesis. The Journal of Economic Perspectives 25 (3), 153–172.

Andrews, D. W. (1994). Empirical process methods in econometrics. Handbook of Econometrics 4, 2247–2294.

Angrist, J. D. and V. Lavy (1999). Using Maimonides’ rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics 114 (2), 533–575.

Arellano, M. and S. Bonhomme (2015). Quantile selection models. Technical report, Mimeo.

Card, D. (2001). Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica 69 (5), 1127–1160.

Case, A. and C. Paxson (2009). Early life health and cognitive function in old age. The American Economic Review 99 (2), 104.

Chernozhukov, V., I. Fernández-Val, and B. Melly (2013). Inference on counterfactual distributions. Econometrica 81 (6), 2205–2268.

Chernozhukov, V. and C. Hansen (2005). An IV model of quantile treatment effects. Econometrica 73 (1), 245–261.

Chernozhukov, V. and C. Hansen (2006). Instrumental quantile regression inference for structural and treatment effect models. Journal of Econometrics 132 (2), 491–525.

Chernozhukov, V. and C. Hansen (2013). Quantile models with endogeneity. Annual Review of Economics 5, 57–81.

Chesher, A. (2003). Identification in nonseparable models. Econometrica 71 (5), 1405–1441.

Dahl, G. B. and L. Lochner (2012). The impact of family income on child achievement: Evidence from the earned income tax credit. The American Economic Review 102 (5), 1927–1956.

D’Haultfœuille, X. and P. Février (2015). Identification of nonseparable triangular models with discrete instruments. Econometrica 83 (3), 1199–1210.

Einmahl, U., D. M. Mason, et al. (2005). Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics 33 (3), 1380–1403.

Evans, W. N. and J. S. Ringel (1999). Can higher cigarette taxes improve birth outcomes? Journal of Public Economics 72 (1), 135–154.

Firpo, S., N. M. Fortin, and T. Lemieux (2009). Unconditional quantile regressions. Econometrica 77 (3), 953–973.

Fredriksson, P., B. Ockert, and H. Oosterbeek (2013). Long-term effects of class size. The Quarterly Journal of Economics 128 (1), 249–285.

Frölich, M. and B. Melly (2013). Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics 31 (3), 346–357.

Hoderlein, S. and E. Mammen (2007). Identification of marginal effects in nonseparable models without monotonicity. Econometrica 75 (5), 1513–1518.

Ichimura, H. and W. K. Newey (2015). The influence function of semiparametric estimators. Technical report, Mimeo.

Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77 (5), 1481–1512.

Jun, S. J. (2009). Local structural quantile effects in a model with a nonseparable control variable. Journal of Econometrics 151 (1), 82–97.

Kasy, M. (2014). Instrumental variables with unrestricted heterogeneity and continuous treatment. The Review of Economic Studies 81 (4), 1614–1636.

Koenker, R. (2005). Quantile regression, Volume 38 of Econometric Society Monograph Series. Cambridge University Press.

Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica: Journal of the Econometric Society 46, 33–50.

Lee, S. (2007). Endogeneity in quantile regression models: A control function approach. Journal of Econometrics 141 (2), 1131–1158.

Lleras-Muney, A. (2005). The relationship between education and adult mortality in the United States. The Review of Economic Studies 72 (1), 189–221.

Ma, L. and R. Koenker (2006). Quantile regression methods for recursive structural equation models. Journal of Econometrics 134 (2), 471–506.

Machado, J. A. and J. Mata (2005). Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics 20 (4), 445–465.

Martinez-Sanchis, E., J. Mora, and I. Kandemir (2012). Counterfactual distributions of wages via quantile regression with endogeneity. Computational Statistics & Data Analysis 56 (11), 3212–3229.

Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica 71 (5), 1339–1375.

Melly, B. (2006). Estimation of counterfactual distributions using quantile regression. Review of Labor Economics 68 (4), 543–572.

Newey, W. K. (1991). Uniform convergence in probability and stochastic equicontinuity. Econometrica: Journal of the Econometric Society 59 (4), 1161–1167.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica 62 (6), 1349–1382.

Pereda-Fernández, S. (2010). Quantile regression discontinuity: Estimating the effect of class size on scholastic achievement. Technical report, Master Thesis CEMFI No. 1002.

Powell, J. L. (1986). Censored regression quantiles. Journal of Econometrics 32 (1), 143–155.

Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Istitut de Statistique de l’Universitè de Paris 8, 229–231.

Torgovitsky, A. (2015). Identification of nonseparable models using instruments with small support. Econometrica 83 (3), 1185–1197.

van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.

van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes With Applications to Statistics. Springer.

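Let $W \equiv (Y, X_1, X_2, Z_1)$. The following notation is used throughout the appendix:19

$$f \mapsto \mathbb{E}_n[f(W)] \equiv \frac{1}{n}\sum_{i=1}^{n} f(W_i)$$

$$f \mapsto \mathbb{G}_n[f(W)] \equiv \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left(f(W_i) - E(f(W_i))\right)$$

$$\hat f(W,\beta,\iota,\gamma,\tau,\theta) \equiv \begin{pmatrix} \varphi_\tau\left(Y - X'\beta - \hat\Phi(\tau)'\iota\right)\hat\Psi(\tau) \\ \varphi_\theta\left(X_1 - Z'\gamma\right) Z \end{pmatrix}$$

$$f(W,\beta,\iota,\gamma,\tau,\theta) \equiv \begin{pmatrix} \varphi_\tau\left(Y - X'\beta - \Phi(\tau)'\iota\right)\Psi(\tau) \\ \varphi_\theta\left(X_1 - Z'\gamma\right) Z \end{pmatrix}$$

$$\hat g(W,\beta,\iota,\gamma,\tau,\theta) \equiv \begin{pmatrix} \rho_\tau\left(Y - X'\beta - \hat\Phi(\tau)'\iota\right)\hat\Psi(\tau) \\ \rho_\theta\left(X_1 - Z'\gamma\right) Z \end{pmatrix}$$

$$g(W,\beta,\iota,\gamma,\tau,\theta) \equiv \begin{pmatrix} \rho_\tau\left(Y - X'\beta - \Phi(\tau)'\iota\right)\Psi(\tau) \\ \rho_\theta\left(X_1 - Z'\gamma\right) Z \end{pmatrix}$$

$$Q_n(\beta,\iota,\gamma,\tau,\theta) \equiv \mathbb{E}_n\left[\hat g(Y,W,\beta,\iota,\gamma,\tau,\theta)\right]$$

$$Q(\beta,\iota,\gamma,\tau,\theta) \equiv E\left[g(Y,W,\beta,\iota,\gamma,\tau,\theta)\right]$$

$\varepsilon = Y - X'\beta$, $\varepsilon(\tau) = Y - X'\beta(\tau)$, $\hat\varepsilon(\tau) = Y - X'\hat\beta(\tau)$, $\eta = X_1 - Z'\gamma$, $\eta(\theta) = X_1 - Z'\gamma(\theta)$, $\hat\eta(\theta) = X_1 - Z'\hat\gamma(\theta)$, $\Psi(\tau) \equiv \left(\Phi(\tau)', X_2'\right)'$, $\hat\Psi(\tau) \equiv \left(\hat\Phi(\tau)', X_2'\right)'$, $\Phi(\tau) \equiv \Phi(\tau, Z)$, $\hat\Phi(\tau) \equiv \hat\Phi(\tau, Z)$, $\varphi_\tau(u) \equiv \left(1(u < 0) - \tau\right)$, $\rho_\tau(u) \equiv \left(\tau - 1(u < 0)\right)u$, and $\ell_j(u,v,\xi) \equiv \log\left(c(u,v|x_{2j};\xi)\right)$.

19 Some of this notation is standard in the literature on empirical processes. See van der Vaart (2000).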
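To make the roles of $\rho_\tau$ and $\varphi_\tau$ concrete, the following minimal sketch codes both functions exactly as defined above and checks, on a toy sample, that minimizing the sum of $\rho_\tau$ over a constant recovers a sample quantile. The helper `check_loss_quantile` is hypothetical and uses a brute-force search over the sample points; it is only meant as an illustration, not as the estimation routine used in the paper.

```python
import numpy as np


def phi(u, tau):
    """phi_tau(u) = 1(u < 0) - tau, as defined in the appendix notation."""
    u = np.asarray(u, dtype=float)
    return (u < 0).astype(float) - tau


def rho(u, tau):
    """Check function rho_tau(u) = (tau - 1(u < 0)) * u."""
    u = np.asarray(u, dtype=float)
    return (tau - (u < 0)) * u


def check_loss_quantile(y, tau):
    """Sample point minimizing sum_i rho_tau(y_i - q) over q (brute force, O(n^2))."""
    y = np.asarray(y, dtype=float)
    losses = np.array([rho(y - q, tau).sum() for q in y])
    return y[np.argmin(losses)]


y = np.arange(1.0, 101.0)
print(check_loss_quantile(y, 0.5))    # a sample median
print(check_loss_quantile(y, 0.25))   # a sample first quartile
```

At the minimizer, the average of $\varphi_\tau$ evaluated at the residuals is approximately zero, which is the sample analogue of the moment conditions stacked in $f$.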


A

Mathematical Proofs

A.1

Proof of Proposition 1

Start by expanding $\sqrt{n}\left(\hat S_Y^{cf}(u|z,v) - S_Y^{cf}(u|z,v)\right)$ around $(\gamma(v), \beta(u))$:

$$\sqrt{n}\left(\hat S_Y^{cf}(u|z,v) - S_Y^{cf}(u|z,v)\right) = \sqrt{n}\left[x^{cf}(v)'\left(\hat\beta(u) - \beta(u)\right) + \left(\hat x^{cf}(v) - x^{cf}(v)\right)'\hat\beta(u)\right]$$

The only component of the vector $\hat x^{cf}(v)$ that differs from $x^{cf}(v)$ is the first one, and by lemma 3, $\hat\beta(u) - \beta(u) = o_P^*\left(\frac{1}{\sqrt{n}}\right)$. Consequently,

$$\sqrt{n}\left(\hat x^{cf}(v) - x^{cf}(v)\right)'\hat\beta(u) = \sqrt{n}\,\beta_1(u)\left(\hat h^{cf}(z_1,x_2,v) - h^{cf}(z_1,x_2,v)\right) + o_P^*(1)$$

$$= \sqrt{n}\,\beta_1(u)\,\nabla_\gamma h^{cf}(z_1,x_2,v)\left(\hat\gamma(v) - \gamma(v)\right) + o_P^*(1)$$

where the last equality follows by assumption 7, the mean value theorem, and the uniform consistency of the QR estimator. For the estimator of the copula I have that

$$\sqrt{n}\left(\hat C_{U|VX_2}(u|v,x_2) - C_{U|VX_2}(u|v,x_2)\right) \equiv \sqrt{n}\left(C_{U|VX_2}(u|v,x_2;\hat\xi) - C_{U|VX_2}(u|v,x_2;\xi)\right)$$

$$= \frac{\partial}{\partial\xi} C_{U|VX_2}(u|v,x_2;\bar\xi)\,\sqrt{n}\left(\hat\xi - \xi\right)$$

$$= \frac{\partial}{\partial\xi} C_{U|VX_2}(u|v,x_2;\xi)\,\sqrt{n}\left(\hat\xi - \xi\right) + o_P^*(1)$$

where $\bar\xi$ is an intermediate value between $\hat\xi$ and $\xi$, the first equality follows by a mean value expansion around $\xi$, and the second by the consistency of $\hat\xi$ and the continuous mapping theorem. Together with the previous result, apply the functional delta method to lemma 6 and it follows that

$$\sqrt{n}\begin{pmatrix} \hat S_Y^{cf}(u|z,v) - S_Y^{cf}(u|z,v) \\ \hat C_{U|VX_2}(u|v,x_2) - C_{U|VX_2}(u|v,x_2) \end{pmatrix} \Rightarrow \mathbb{G}_N(u,v,z) = N(u,v,z)\,\mathbb{G}_M(u,v) \qquad \square$$

A.2

Proof of Theorem 1

Begin by showing the asymptotic distribution of $\hat F^{cf}_{Y|Z}(y|z)$, for which I add and subtract the unfeasible estimator $\tilde F^{cf}_{Y|Z}(y|z) \equiv \int_{[0,1]^2} 1\left(\hat S_Y^{cf}(u|z,v) \leq y\right) dC_{UV|X_2}(u,v|x_2)$:

$$\sqrt{n}\left(\hat F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right) = \sqrt{n}\left(\hat F^{cf}_{Y|Z}(y|z) - \tilde F^{cf}_{Y|Z}(y|z)\right) + \sqrt{n}\left(\tilde F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right)$$

The first term can be expressed as

$$\sqrt{n}\left(\hat F^{cf}_{Y|Z}(y|z) - \tilde F^{cf}_{Y|Z}(y|z)\right) = \sqrt{n}\int_0^1 \left[\hat C_{U|VX_2}\left(\hat S_Y^{cf,-1}(y|z,v)|v,x_2\right) - C_{U|VX_2}\left(\hat S_Y^{cf,-1}(y|z,v)|v,x_2\right)\right] dv$$

$$= \sqrt{n}\int_0^1 \left[\hat C_{U|VX_2}\left(S_Y^{cf,-1}(y|z,v)|v,x_2\right) - C_{U|VX_2}\left(S_Y^{cf,-1}(y|z,v)|v,x_2\right)\right] dv + o_P^*(1)$$

where I have used the extended continuous mapping theorem, the uniform continuity of $C_{UV|X_2}(u,v|x_2)$, and the uniform consistency of $\hat S_Y^{cf}(y|z,v)$. As for the second term, by proposition 1, lemmas 1 and 2, the functional chain rule, the extended continuous mapping theorem, and the functional delta method,

$$\sqrt{n}\left(\tilde F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right) = \sqrt{n}\int_0^1 \left[C_{U|VX_2}\left(\hat S_Y^{cf,-1}(y|z,v)|v,x_2\right) - C_{U|VX_2}\left(S_Y^{cf,-1}(y|z,v)|v,x_2\right)\right] dv$$

$$= -\sqrt{n}\int_0^1 f^{cf}_{Y|ZV}(y|z,v)\left[\hat S_Y^{cf}(u(y,z,v)|z,v) - S_Y^{cf}(u(y,z,v)|z,v)\right] dv + o_P^*(1)$$

Therefore,

$$\sqrt{n}\left(\hat F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right) \Rightarrow \int_0^1 N(y,z,v)\,\mathbb{G}_M(u(y,z,v),z,v)\,dv$$

By assumption 2 and the functional delta method,

$$\sqrt{n}\left(\hat F^{cf}_{Y}(y) - F^{cf}_{Y}(y)\right) \Rightarrow \int_{\mathcal{Z}}\int_0^1 N(y,z,v)\,\mathbb{G}_M(u(y,z,v),z,v)\,dv\,dF_Z(z) \qquad \square$$

A.3

Proof of Proposition 2

$$\sup_{y,z}\sqrt{n}\left|\check F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right|$$

$$\leq \sup_{y,z}\sqrt{n}\left|\check F^{cf}_{Y|Z}(y|z) - \tilde F^{cf}_{Y|Z}(y|z)\right| + \sup_{y,z}\sqrt{n}\left|\tilde F^{cf}_{Y|Z}(y|z) - F^{cf}_{Y|Z}(y|z)\right|$$

$$= \sup_{y,z}\sqrt{n}\left|\int_0^1 1\left(\hat S_Y^{cf}(u|z,v) \leq y\right) d\left(\check C_{UV}(u,v) - C_{UV}(u,v)\right)\right| + O_P^*(1)$$

$$\leq \sqrt{n}\left|\int_0^1 d\left(\check C_{UV}(u,v) - C_{UV}(u,v)\right)\right| + O_P^*(1)$$

$$\leq \sup_{u,v}\sqrt{n}\left|\check C_{UV}(u,v) - C_{UV}(u,v)\right| + O_P^*(1) = O_P^*(1)$$

where the first inequality follows from the triangle inequality, the first equality from the definition of the estimators and the uniform consistency of $\tilde F^{cf}_{Y|Z}(y|z)$ shown in theorem 1, the second inequality from the fact that the indicator function is no larger than one, the third inequality by taking the supremum of the difference, and the last equality by lemma 7. $\square$
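The nonparametric copula estimator $\check C_{UV}$ that appears in this bound is described in the paper as the empirical cdf of the estimated conditional ranks. A minimal sketch of such an estimator is given below, assuming the ranks have already been estimated; the function name and the toy inputs are hypothetical and only serve to show the computation.

```python
import numpy as np


def empirical_copula(u_hat, v_hat, u, v):
    """Empirical cdf of the estimated ranks evaluated at the point (u, v):
    the share of observations with u_hat_i <= u and v_hat_i <= v."""
    u_hat = np.asarray(u_hat, dtype=float)
    v_hat = np.asarray(v_hat, dtype=float)
    return float(np.mean((u_hat <= u) & (v_hat <= v)))


# Toy estimated ranks for three observations.
u_hat = [0.1, 0.5, 0.9]
v_hat = [0.2, 0.4, 0.8]
print(empirical_copula(u_hat, v_hat, 0.5, 0.5))
```

Evaluating this function on a grid of (u, v) values gives the step function against which the indicator in the bound is integrated.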


B

Auxiliary Lemmas

B.1

Hadamard Derivative of FY |ZV (y|z, v) with Respect to SY (u|z, v)

Lemma 1. Define $F_{Y|ZV}(y|z,v,h_t) \equiv \int_0^1 1\left(S_Y(u|z,v) + t h_t(u|z,v) \leq y\right) dC_{U|VX_2}(u|v,x_2)$. Under assumption 3, as $t \searrow 0$,

$$D_{h_t}(y|z,v,h_t) = \frac{F_{Y|ZV}(y|z,v,h_t) - F_{Y|ZV}(y|z,v)}{t} \to D_h(y|z,v)$$

where $D_h(y|z,v) \equiv -f_{Y|ZV}(y|z,v)\, h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right)$. The convergence holds uniformly in any compact subset of $\mathcal{YZV}$ for any $h_t$: $\|h_t - h\|_\infty \to 0$, where $\mathcal{YZV} \equiv \{(y,z,v) : y \in \mathcal{Y}_z, z \in \mathcal{Z}, v \in [0,1]\}$ and $h_t \in \ell^\infty(\mathcal{UZV})$ and $h \in C(\mathcal{UZV})$.

Proof. $\forall \delta > 0$ $\exists \epsilon > 0$ such that if $u \in B_\epsilon\left(C^{-1}_{U|VX_2}\left(F_Y(y|z,v)|v,x_2\right)\right)$ and $t \geq 0$ small enough,

$$1\left(S_Y(u|z,v) + t h_t(u|z,v) \leq y\right) \leq 1\left(S_Y(u|z,v) + t\left[h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right) - \delta\right] \leq y\right)$$

and if $u \notin B_\epsilon\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)\right)$,

$$1\left(S_Y(u|z,v) + t h_t(u|z,v) \leq y\right) = 1\left(S_Y(u|z,v) \leq y\right)$$

So for small enough $t \geq 0$,

$$\frac{1}{t}\int_0^1 \left[1\left(S_Y(u|z,v) + t h_t(u|z,v) \leq y\right) - 1\left(S_Y(u|z,v) \leq y\right)\right] C_{U|VX_2}(u|v,x_2)\,du$$

$$\leq \frac{1}{t}\int_{B_\epsilon(F_Y(y|z,v))} \left[1\left(S_Y(u|z,v) + t h_t(u|z,v) \leq y\right) - 1\left(S_Y(u|z,v) \leq y\right)\right] C_{U|VX_2}(u|v,x_2)\,du \qquad (13)$$

Let $\tilde y = S_Y(u|z,v)$, so that $u = C^{-1}_{U|VX_2}\left(F_{Y|ZV}(\tilde y|z,v)|v,x_2\right)$, and let $J$ be the image of $B_\epsilon\left(F_{Y|ZV}(y|z,v)\right)$ under $u \to S_Y(u|z,v)$. Then, equation 13 equals

$$\frac{1}{t}\int_{J \cap \left[y,\; y - t\left(h\left(F_{Y|ZV}(y|z,v)|z,v\right) - \delta\right)\right]} f_{Y|ZV}(\tilde y|z,v)\,d\tilde y$$

For fixed $\epsilon$ and $t \searrow 0$,

$$J \cap \left[y,\; y - t\left(h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right) - \delta\right)\right] = \left[y,\; y - t\left(h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right) - \delta\right)\right]$$

and $f_{Y|ZV}(\tilde y|z,v) \to f_{Y|ZV}(y|z,v)$ as $F_{Y|ZV}(\tilde y|z,v) \to F_{Y|ZV}(y|z,v)$. Therefore, the right hand term in equation 13 is no greater than

$$-f_{Y|ZV}(y|z,v)\left(h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right) - \delta\right) + o(1)$$

Similarly, $-f_{Y|ZV}(y|z,v)\left(h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right) + \delta\right) + o(1)$ bounds equation 13 from below. Since $\delta$ can be arbitrarily small, the result follows.

To show uniformity of this result, apply Lemma 5 in Chernozhukov et al. (2013). Let $(y,z,v) \in K$, where $K$ is a compact subset of $\mathcal{YZV}$. Take a sequence $(y_t, z_t, v_t)$ in $K$ that converges to $(y,z,v) \in K$. Since the function

$$(y,z,v) \mapsto -f_Y(y|z,v)\, h\left(C^{-1}_{U|VX_2}\left(F_{Y|ZV}(y|z,v)|v,x_2\right)|z,v\right)$$

is uniformly continuous on $K$, it follows that the preceding argument applies to this sequence. This result follows by the assumed continuity of $h(u|z,v)$, $F_{Y|ZV}(y|z,v)$ and $f_{Y|ZV}(y|z,v)$ in all of their arguments, and the compactness of $K$.

B.2

Hadamard Derivative of FY |Z (y|z) with Respect to FY |ZV (y|z, v)

Lemma 2. Define $F_{Y|Z}(y|z,h_t) \equiv \int_0^1 \left(F_{Y|ZV}(y|z,v) + t h_t(y|z,v)\right) dv$. As $t \searrow 0$,

$$D_{h_t}(y|z,h_t) = \frac{F_{Y|Z}(y|z,h_t) - F_{Y|Z}(y|z)}{t} \to D_h(y|z)$$

where $D_h(y|z) \equiv \int_0^1 h(y|z,v)\,dv$. The convergence holds uniformly in any compact subset of $\mathcal{YZ} \equiv \{(y,z) : y \in \mathcal{Y}_z, z \in \mathcal{Z}\}$ for any $h_t$: $\|h_t - h\|_\infty \to 0$, where $h_t \in \ell^\infty(\mathcal{UZ})$ and $h \in C(\mathcal{UZ})$.

Proof.

$$D_{h_t}(y|z,h_t) = \frac{F_{Y|Z}(y|z,h_t) - F_{Y|Z}(y|z)}{t} = \frac{1}{t}\int_0^1 \left(F_{Y|ZV}(y|z,v) + t h_t(y|z,v) - F_{Y|ZV}(y|z,v)\right) dv = \int_0^1 h_t(y|z,v)\,dv \to \int_0^1 h(y|z,v)\,dv$$

B.3

Asymptotic Distribution of the IVQR and QR Estimators

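The proof of the following lemma is an extension of the proof in Chernozhukov and Hansen (2006) to account for the joint distribution of both estimators.

Lemma 3. Let $\hat\gamma(v)$ and $\hat\beta(u)$ denote the conditional QR and conditional IVQR estimators of quantiles $v$ and $u$ of equations 2 and 1, respectively. Under assumptions 1 to 6, their joint asymptotic distribution is given by:

$$\sqrt{n}\begin{pmatrix} \hat\beta(u) - \beta(u) \\ \hat\gamma(v) - \gamma(v) \end{pmatrix} \Rightarrow \mathbb{G}_J(u,v)$$

where $\mathbb{G}_J(u,v)$ is a zero-mean Gaussian process with covariance function $\Sigma_J(u,v,\tilde u,\tilde v)$, given by:

$$\Sigma_J(u,v,\tilde u,\tilde v) = \begin{pmatrix} \Sigma_J^{11}(u,\tilde u) & \Sigma_J^{21}(u,\tilde v)' \\ \Sigma_J^{21}(\tilde u,v) & \Sigma_J^{22}(v,\tilde v) \end{pmatrix}$$

where

$$\Sigma_J^{11}(u,\tilde u) \equiv J(u)^{-1}\left(u \vee \tilde u - u\tilde u\right) E\left[\Psi(u,z)\Psi(\tilde u,z)'\right] J(\tilde u)^{-1}$$

$$\Sigma_J^{21}(\tilde u,v) \equiv H(v)^{-1} E\left[\left(1\left(y \leq x'\beta(\tilde u)\right) 1\left(x_1 \leq z'\gamma(v)\right) - \tilde u v\right) z\Psi(\tilde u,z)'\right] J(\tilde u)^{-1}$$

$$\Sigma_J^{22}(v,\tilde v) \equiv H(v)^{-1}\left(v \vee \tilde v - v\tilde v\right) E\left[zz'\right] H(\tilde v)^{-1}$$

$$H(v) \equiv E\left[f_{X_1}\left(z'\gamma(v)|z\right) zz'\right]$$

$$J(u) \equiv E\left[f_Y\left(x'\beta(u)|x,z_1\right) \Psi(u,z) x'\right]$$

Proof. Step 1 (Consistency) By assumption 3, $Q(\beta,\iota,\gamma,\tau,\theta)$ is continuous over $\mathcal{B} \times \mathcal{I} \times \mathcal{G} \times \mathcal{T} \times \mathcal{C}$. Furthermore, by lemma 5, $\sup_{(\beta,\iota,\gamma) \in \mathcal{B} \times \mathcal{I} \times \mathcal{G}} \left\|Q_n(\beta,\iota,\gamma,\tau,\theta) - Q(\beta,\iota,\gamma,\tau,\theta)\right\| \overset{p}{\to} 0$. By lemma 4, I have uniform convergence of $\sup_{(\beta_1,\tau,\theta) \in \mathcal{B}_1 \times \mathcal{T} \times \mathcal{C}} \left\|\hat\vartheta(\beta_1,\tau,\theta) - \vartheta(\beta_1,\tau,\theta)\right\| \overset{p}{\to} 0$, which by lemma 4 implies that $\sup_{(\beta_1,\tau) \in \mathcal{B}_1 \times \mathcal{T}} \left|\left\|\hat\iota(\beta_1,\tau)\right\|_{B_1(\tau)} - \left\|\iota(\beta_1,\tau)\right\|_{B_1(\tau)}\right| \overset{p}{\to} 0$. By lemma 4, $\sup_{\tau \in \mathcal{T}} \left\|\hat\beta_1(\tau) - \beta_1(\tau)\right\| \overset{p}{\to} 0$, and therefore $\sup_{\tau \in \mathcal{T}} \left\|\hat\beta_2(\tau) - \beta_2(\tau)\right\| \overset{p}{\to} 0$, $\sup_{\tau \in \mathcal{T}} \left\|\hat\iota\left(\hat\beta_1(\tau),\tau\right) - 0\right\| \overset{p}{\to} 0$ and $\sup_{\theta \in \mathcal{C}} \left\|\hat\gamma(\theta) - \gamma(\theta)\right\| \overset{p}{\to} 0$.

Step 2 (Asymptotics) Consider a collection of closed balls $B_{\delta_n}(\beta_1(\tau))$ centered at $\beta_1(\tau)$ $\forall\tau$, $\delta_n$ independent of $\tau$ and $\delta_n \to 0$ slowly enough. Let $\beta_{1n}(\tau)$ be any value inside $B_{\delta_n}(\beta_1(\tau))$. By Theorem 3.3 in Koenker and Bassett (1978),

$$O\left(\frac{1}{\sqrt{n}}\right) = \sqrt{n}\,\mathbb{E}_n \hat f\left(W, \beta_{1n}(\cdot), \hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right)$$

By lemma 5, the following expansion holds for any $\sup_{\tau \in \mathcal{T}} \left\|\beta_{1n}(\tau) - \beta_1(\tau)\right\| \overset{p}{\to} 0$:

$$O\left(\frac{1}{\sqrt{n}}\right) = \mathbb{G}_n \hat f\left(W, \beta_{1n}(\cdot), \hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right) + \sqrt{n}\,E \hat f\left(W, \beta_{1n}(\cdot), \hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right)$$

$$= \mathbb{G}_n \hat f\left(W, \beta_1(\cdot), \vartheta\left(\beta_1(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right) + o_P(1) + \sqrt{n}\,E \hat f\left(W, \beta_{1n}(\cdot), \hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right) \quad \text{in } \ell^\infty(\mathcal{T} \times \mathcal{C})$$

$$= \mathbb{G}_n \hat f\left(W, \beta_1(\cdot), \vartheta\left(\beta_1(\cdot),\cdot,\cdot\cdot\right),\cdot,\cdot\cdot\right) + o_P(1) + \left(J_\vartheta(\cdot,\cdot\cdot) + o_P(1)\right)\sqrt{n}\left(\hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right) - \vartheta(\cdot,\cdot\cdot)\right) + \left(J_{\beta_1}(\cdot) + o_P(1)\right)\sqrt{n}\left(\beta_{1n}(\cdot) - \beta_1(\cdot)\right) \quad \text{in } \ell^\infty(\mathcal{T} \times \mathcal{C})$$

where

$$J_\vartheta(\cdot,\cdot\cdot) \equiv \frac{\partial}{\partial\left(\beta_2',\iota',\gamma'\right)} E\begin{pmatrix} \varphi_\cdot\left(Y - X_1\beta_1(\cdot) - X_2'\beta_2 - \Phi(\cdot)'\iota\right)\Psi(\cdot) \\ \varphi_{\cdot\cdot}\left(X_1 - Z'\gamma\right) z \end{pmatrix}\Bigg|_{\vartheta = \vartheta(\cdot,\cdot\cdot)}$$

$$J_{\beta_1}(\cdot) \equiv \begin{pmatrix} \frac{\partial}{\partial\beta_1} E\left[\varphi_\cdot\left(Y - X_1\beta_1 - X_2'\beta_2(\cdot)\right)\Psi(\cdot)\right]\Big|_{\beta_1 = \beta_1(\cdot)} \\ 0_{\dim(X) \times 1} \end{pmatrix}$$

For any $\sup_{\tau \in \mathcal{T}} \left\|\beta_{1n}(\tau) - \beta_1(\tau)\right\| \overset{p}{\to} 0$,

$$\sqrt{n}\left(\hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right) - \vartheta(\cdot,\cdot\cdot)\right) = -J_\vartheta^{-1}(\cdot,\cdot\cdot)\,\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) - J_\vartheta^{-1}(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\left[1 + o_P(1)\right]\sqrt{n}\left(\beta_{1n}(\cdot) - \beta_1(\cdot)\right) + o_P(1)$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$. So I have

$$\sqrt{n}\left(\hat\iota\left(\beta_{1n}(\cdot),\cdot\right) - 0\right) = -\bar J_\iota(\cdot,\cdot\cdot)\,\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) - \bar J_\iota(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\left[1 + o_P(1)\right]\sqrt{n}\left(\beta_{1n}(\cdot) - \beta_1(\cdot)\right) + o_P(1)$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$, where $\left[\bar J_{\beta_2}(\cdot,\cdot\cdot)' : \bar J_\iota(\cdot,\cdot\cdot)' : \bar J_\gamma(\cdot,\cdot\cdot)'\right]'$ is the conformable partition of $J_\vartheta^{-1}(\cdot,\cdot\cdot)$. By step 1, wp $\to 1$,

$$\hat\beta_1(\tau) = \arg\inf_{\beta_{1n}(\tau) \in B_{\delta_n}(\beta_1(\tau))} \left\|\hat\iota\left(\beta_{1n}(\tau),\tau\right)\right\|_{B_1(\tau)} \quad \forall\tau \in \mathcal{T}$$

By lemma 5, $\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) = O_p(1)$, so it follows that

$$\sqrt{n}\left\|\hat\iota\left(\beta_{1n}(\cdot),\cdot\right)\right\|_{B_1(\cdot)} = \left\|O_p(1) - \bar J_\iota(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\left[1 + o_P(1)\right]\sqrt{n}\left(\beta_{1n}(\cdot) - \beta_1(\cdot)\right)\right\|_{B_1(\cdot)}$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$. Hence,

$$\sqrt{n}\left(\hat\beta_1(\cdot) - \beta_1(\cdot)\right) = \arg\inf_{\mu \in \mathbb{R}} \left\|-\bar J_\iota(\cdot,\cdot\cdot)\,\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) - \bar J_\iota(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\mu\right\|_{B_1(\cdot)} + o_P(1)$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$. So jointly in $\ell^\infty(\mathcal{T} \times \mathcal{C})$

$$\sqrt{n}\left(\hat\beta_1(\cdot) - \beta_1(\cdot)\right) = -\left(J_{\beta_1}(\cdot)'\bar J_\iota(\cdot,\cdot\cdot)' B_1(\cdot) \bar J_\iota(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\right)^{-1} \cdot \left(J_{\beta_1}(\cdot)'\bar J_\iota(\cdot,\cdot\cdot)' B_1(\cdot) \bar J_\iota(\cdot,\cdot\cdot)\right)\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) + o_P(1) = O_p(1)$$

$$\sqrt{n}\left(\hat\vartheta\left(\hat\beta_1(\cdot),\cdot,\cdot\cdot\right) - \vartheta(\cdot,\cdot\cdot)\right) = -J_\vartheta^{-1}(\cdot,\cdot\cdot)\left[I - J_{\beta_1}(\cdot)\left(J_{\beta_1}(\cdot)'\bar J_\iota(\cdot,\cdot\cdot)' B_1(\cdot) \bar J_\iota(\cdot,\cdot\cdot) J_{\beta_1}(\cdot)\right)^{-1} \cdot \left(J_{\beta_1}(\cdot)'\bar J_\iota(\cdot,\cdot\cdot)' B_1(\cdot) \bar J_\iota(\cdot,\cdot\cdot)\right)\right]\mathbb{G}_n f\left(Y,W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) + o_P(1) = O_P(1) \qquad (14)$$

Due to invertibility of $J_{\beta_1}(\tau)\bar J_\iota(\tau,\theta)$,

$$\sqrt{n}\left(\hat\iota\left(\hat\beta_1(\cdot),\cdot\right) - 0\right) = -\bar J_\iota(\cdot,\cdot\cdot)\left[I - J_{\beta_1}(\cdot)\left(J_{\beta_1}(\cdot)'\bar J_\iota(\cdot,\cdot\cdot)'\right)^{-1}\bar J_\iota(\cdot,\cdot\cdot)\right]\mathbb{G}_n f\left(W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) + o_P(1) = 0 \times O_P(1) + o_P(1) \qquad (15)$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$. Because $\left(\beta_{1n}(\cdot), \hat\vartheta\left(\beta_{1n}(\cdot),\cdot,\cdot\cdot\right)\right) = \left(\hat\beta_1(\cdot), \hat\beta_2(\cdot), 0 + o_P\left(\frac{1}{\sqrt{n}}\right), \hat\gamma(\cdot\cdot)\right)$, if I substitute it into the expansion, I have:

$$-\mathbb{G}_n f\left(W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) = \begin{pmatrix} J(\cdot) & 0_{\dim(X)} \\ 0_{\dim(X)} & H(\cdot\cdot) \end{pmatrix}\sqrt{n}\begin{pmatrix} \hat\beta(\cdot) - \beta(\cdot) \\ \hat\gamma(\cdot\cdot) - \gamma(\cdot\cdot) \end{pmatrix} + o_P(1)$$

in $\ell^\infty(\mathcal{T} \times \mathcal{C})$. By lemma 5, $\mathbb{G}_n f\left(W,\beta_1(\cdot),\vartheta(\cdot,\cdot\cdot),\cdot,\cdot\cdot\right) \Rightarrow \mathbb{G}_G(\cdot,\cdot\cdot)$ in $\ell^\infty(\mathcal{T} \times \mathcal{C})$, a Gaussian process with covariance function $S(\tau,\theta,\tau',\theta') = E\left[\mathbb{G}_G(\tau,\theta)\mathbb{G}_G(\tau',\theta')'\right]$, which yields

$$\sqrt{n}\begin{pmatrix} \hat\beta(\cdot) - \beta(\cdot) \\ \hat\gamma(\cdot\cdot) - \gamma(\cdot\cdot) \end{pmatrix} \Rightarrow -\begin{pmatrix} J(\cdot)^{-1} & 0_{\dim(X)} \\ 0_{\dim(X)} & H(\cdot\cdot)^{-1} \end{pmatrix}\mathbb{G}_G(\cdot,\cdot\cdot) = \mathbb{G}_J(\tau,\theta) \quad \text{in } \ell^\infty(\mathcal{T} \times \mathcal{C})$$

B.4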

Argmax Process

Lemma 4. (Chernozhukov and Hansen, 2006) Suppose that uniformly in $\pi$ in a compact set $\Pi$ and for a compact set $K$ (i) $Z_n(\pi)$ is s.t. $Q_n(Z_n(\pi)|\pi) \ge \sup_{z\in K} Q_n(z|\pi) - \epsilon_n$, $\epsilon_n \searrow 0$; $Z_n(\pi)\in K$ wp → 1, (ii) $Z_\infty(\pi) \equiv \arg\sup_{z\in K} Q_\infty(z|\pi)$ is a uniquely defined continuous process in $\ell^\infty(\Pi)$, (iii) $Q_n(\cdot|\cdot) \overset{p}{\to} Q_\infty(\cdot|\cdot)$ in $\ell^\infty(K\times\Pi)$, where $Q_\infty(\cdot|\cdot)$ is continuous. Then $Z_n(\cdot) = Z_\infty(\cdot) + o_P(1)$ in $\ell^\infty(\Pi)$.

Proof. See Chernozhukov and Hansen (2006).


B.5

Stochastic Expansion

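Lemma 5. Under assumptions 1 to 6, the following statements hold:

1. $\sup_{(\beta,\iota,\gamma)\in\mathcal{B}\times\mathcal{I}\times\mathcal{G}} \left|E_n[\hat g(W,\beta,\iota,\gamma,\tau,\theta)] - E[g(W,\beta,\iota,\gamma,\tau,\theta)]\right| = o_P(1)$

2. $\mathbb{G}_n f(W,\beta(\cdot),0,\gamma(\cdot\cdot),\cdot,\cdot\cdot) \Rightarrow \mathbb{G}_G(\cdot,\cdot\cdot)$ in $\ell^\infty(\mathcal{T}\times\mathcal{C})$, where $\mathbb{G}_G$ is a Gaussian process with covariance function $S(\tau,\theta,\tau',\theta')$ defined below in the proof. Furthermore, for any $(\hat\beta,\hat\iota,\hat\gamma)$ such that $\sup_{(\tau,\theta)\in\mathcal{T}\times\mathcal{C}} \left\|(\hat\beta(\tau),\hat\iota(\tau),\hat\gamma(\theta)) - (\beta(\tau),0,\gamma(\theta))\right\| = o_P(1)$,

$$\sup_{(\tau,\theta)\in\mathcal{T}\times\mathcal{C}} \left\|\mathbb{G}_n \hat f(W,\hat\beta(\tau),\hat\iota(\tau),\hat\gamma(\theta),\tau,\theta) - \mathbb{G}_n f(W,\beta(\tau),0,\gamma(\theta),\tau,\theta)\right\| = o_P(1)$$

Proof. Let $\pi = (\beta,\iota,\gamma)$ and $\Pi = \mathcal{B}\times\mathcal{I}\times\mathcal{G}$, where $\mathcal{I}$ is a closed ball around 0. Define the class of functions $\mathcal{H}$ as

$$\mathcal{H} \equiv \left\{ h = (\Phi,\Psi,\pi,\tau,\theta) \mapsto \begin{pmatrix}\varphi_\tau\left(Y - X'\beta - \Phi(Z)'\iota\right)\Psi(Z)\\ \varphi_\theta\left(X_1 - Z'\gamma\right)Z\end{pmatrix},\ \pi\in\Pi,\ \Phi,\Psi\in\mathcal{F}\right\}$$

where $\mathcal{F}$ is the class of uniformly smooth functions in $z$ with the uniform smoothness order $\omega > \frac{\dim(w)}{2}$ and $\|f(\tau',z) - f(\tau,z)\| < C(\tau-\tau')^a$, $C>0$, $a>0$ $\forall(z,\tau,\tau')$ $\forall f\in\mathcal{F}$. $\mathcal{H}$ is Donsker, and the bracketing number of $\mathcal{F}$, by Corollary 2.7.4 in van der Vaart and Wellner (1996), satisfies

$$\log N_{[\cdot]}(\epsilon,\mathcal{F},L_2(P)) = O\left(\epsilon^{-\dim(z)/\omega}\right) = O\left(\epsilon^{-2-\delta'}\right)$$

for some $\delta'<0$. Therefore, $\mathcal{F}$ is Donsker with a constant envelope. By Corollary 2.7.4 in van der Vaart and Wellner (1996), the bracketing number of

$$\mathcal{D}_1 \equiv \left\{(\Phi,\pi)\mapsto X'\beta + \Phi(X,Z)'\iota,\ \pi\in\Pi,\ \Phi\in\mathcal{F}\right\}$$

satisfies

$$\log N_{[\cdot]}(\epsilon,\mathcal{D}_1,L_2(P)) = O\left(\epsilon^{-\dim(w)/\omega}\right) = O\left(\epsilon^{-2-\delta''}\right)$$

for some $\delta''<0$. Also, by Corollary 2.7.4 in van der Vaart and Wellner (1996), the bracketing number of

$$\mathcal{D}_2 \equiv \left\{\pi\mapsto Z'\gamma,\ \pi\in\Pi\right\}$$

satisfies

$$\log N_{[\cdot]}(\epsilon,\mathcal{D}_2,L_2(P)) = O\left(\epsilon^{-\dim(z)/\omega}\right) = O\left(\epsilon^{-2-\delta'''}\right)$$

for some $\delta'''<0$ such that $\delta'''<\delta''$. Since the indicator function is bounded and monotone, and the density functions $f_{Y|X_1Z}(y)$ and $f_{X_1|Z}(x_1)$ are bounded by assumption 3, the bracketing number of

$$\mathcal{E} \equiv \left\{(\Phi,\pi)\mapsto 1\left(Y < X'\beta + \Phi(X,Z)'\iota\right) + 1\left(X_1 < Z'\gamma\right),\ \pi\in\Pi,\ \Phi\in\mathcal{F}\right\}$$

satisfies

$$\log N_{[\cdot]}(\epsilon,\mathcal{E},L_2(P)) = O\left(\epsilon^{-2-\delta''}\right)$$

Since $\mathcal{E}$ has a constant envelope, it is Donsker. Let $\mathcal{T}\equiv\{\tau\mapsto\tau\}$ and $\mathcal{C}\equiv\{\theta\mapsto\theta\}$. Then I have that $\mathcal{H}\equiv\mathcal{T}\times\mathcal{F} + \mathcal{C}\times\mathcal{F} - \mathcal{E}\times\mathcal{F}$. Since $\mathcal{H}$ is Lipschitz over $(\mathcal{T}\times\mathcal{C}\times\mathcal{F}\times\mathcal{E})$, it follows that it is Donsker by Theorem 2.10.6 in van der Vaart and Wellner (1996). Define

$$h \equiv (\Phi,\Psi,\pi,\tau,\theta)\mapsto \mathbb{G}_n\begin{pmatrix}\varphi_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\Psi(Z)\\ \varphi_\theta(\eta)Z\end{pmatrix}$$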

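$h$ is Donsker in $\ell^\infty(\mathcal{H})$. Consider the process

$$(\tau,\theta)\mapsto \mathbb{G}_n\begin{pmatrix}\varphi_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\Psi(Z)\\ \varphi_\theta(\eta)Z\end{pmatrix}$$

By the uniform Hölder continuity of $(\tau,\theta)\mapsto\left(\tau,\beta(\tau)',\Phi(\tau,Z)',\Psi(\tau,Z)',\theta,\gamma(\theta)'\right)'$ in $(\tau,\theta)$ with respect to the supremum norm, it is also Donsker in $\ell^\infty(\mathcal{H})$. Therefore, I have

$$\mathbb{G}_n\begin{pmatrix}\varphi_\cdot(\varepsilon(\cdot))\Psi(\cdot,Z)\\ \varphi_{\cdot\cdot}(\eta(\cdot\cdot))Z\end{pmatrix} \Rightarrow \mathbb{G}_G(\cdot,\cdot\cdot)$$

with covariance function

$$S(\tau,\theta,\tilde\tau,\tilde\theta) = E\left[\mathbb{G}_G(\tau,\theta)\mathbb{G}_G(\tilde\tau,\tilde\theta)'\right] \equiv \begin{pmatrix} S^{11}(\tau,\tilde\tau) & S^{21}(\tau,\tilde\theta)'\\ S^{21}(\tilde\tau,\theta) & S^{22}(\theta,\tilde\theta)\end{pmatrix}$$

where

$$S^{11}(\tau,\tilde\tau) = (\tau\wedge\tilde\tau - \tau\tilde\tau)\,E\left[\Psi(\tau,Z)\Psi(\tilde\tau,Z)'\right]$$

$$S^{21}(\tilde\tau,\theta) = E\left[\left(1(y\le x'\beta(\tilde\tau))\,1(x_1\le z'\gamma(\theta)) - \tilde\tau\theta\right)Z\Psi(\tilde\tau,Z)'\right]$$

$$S^{22}(\theta,\tilde\theta) = (\theta\wedge\tilde\theta - \theta\tilde\theta)\,E\left[ZZ'\right]$$

Since $\hat\Psi(\cdot)\overset{p}{\to}\Psi(\cdot)$ and $\hat\Phi(\cdot)\overset{p}{\to}\Phi(\cdot)$ uniformly over compact sets, and $\hat\pi(\tau,\theta)\overset{p}{\to}\pi(\tau,\theta)$ uniformly in $(\tau,\theta)$, $\delta_n \equiv \sup_{(\tau,\theta)\in\mathcal{T}\times\mathcal{C}} \xi(h'(\tau,\theta),h(\tau,\theta))\big|_{h'(\tau,\theta)=\hat h(\tau,\theta)} \overset{p}{\to} 0$ by assumptions 5 and 6, where

$$\xi(h,\tilde h) \equiv \sqrt{E\left\|\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\Psi(Z)\\ \rho_\theta(\eta)Z\end{pmatrix} - \begin{pmatrix}\rho_{\tilde\tau}\left(\tilde\varepsilon - \tilde\Phi(Z)'\tilde\iota\right)\tilde\Psi(Z)\\ \rho_{\tilde\theta}(\tilde\eta)Z\end{pmatrix}\right\|^2}$$

As $\delta_n\overset{p}{\to}0$,

$$\sup_{(\tau,\theta)}\left\|\mathbb{G}_n\begin{pmatrix}\rho_\tau\left(\hat\varepsilon(\tau) - \hat\Phi(\tau,Z)'\hat\iota(\tau)\right)\hat\Psi(\tau,Z)\\ \rho_\theta(\hat\eta(\theta))Z\end{pmatrix} - \mathbb{G}_n\begin{pmatrix}\rho_\tau\left(\varepsilon(\tau) - \Phi(\tau,Z)'\iota(\tau)\right)\Psi(\tau,Z)\\ \rho_\theta(\eta(\theta))Z\end{pmatrix}\right\|$$

$$\le \sup_{\xi(h,\tilde h)\le\delta_n,\ h,\tilde h\in\mathcal{H}}\left\|\mathbb{G}_n\begin{pmatrix}\rho_\tau\left(\tilde\varepsilon - \tilde\Phi(Z)'\tilde\iota\right)\tilde\Psi(Z)\\ \rho_\theta(\tilde\eta)Z\end{pmatrix} - \mathbb{G}_n\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\Psi(Z)\\ \rho_\theta(\eta)Z\end{pmatrix}\right\| = o_P(1)$$

by stochastic equicontinuity of $h\mapsto\mathbb{G}_n\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\Psi(Z)\\ \rho_\theta(\eta)Z\end{pmatrix}$, which proves claim 2.

To prove claim 1, define

$$\mathcal{A} \equiv \left\{(\Phi,\beta,\iota,\gamma,\tau,\theta)\mapsto\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix}\right\}$$

This class of functions is uniformly Lipschitz over $(\mathcal{F}\times\mathcal{B}\times\mathcal{I}\times\mathcal{G}\times\mathcal{T}\times\mathcal{C})$ and bounded by assumption 1, so by Theorem 2.10.6 in van der Vaart and Wellner (1996), $\mathcal{A}$ is Donsker. Therefore, the following Uniform Law of Large Numbers holds:

$$\sup_{h\in\mathcal{H}}\left\|E_n\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix} - E\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix}\right\| \overset{p}{\to} 0$$

which gives

$$\sup_{(\beta,\iota,\gamma,\tau,\theta)}\left\|E_n\begin{pmatrix}\rho_\tau\left(\varepsilon - \tilde\Phi(\tau,Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix} - E\begin{pmatrix}\rho_\tau\left(\varepsilon - \tilde\Phi(\tau,Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix}\right\|_{\tilde\Phi=\hat\Phi} \overset{p}{\to} 0$$

By uniform consistency of $\hat\Phi(\cdot)$ and assumption 6, I have that

$$\sup_{(\beta,\iota,\gamma,\tau,\theta)}\left\|E\begin{pmatrix}\rho_\tau\left(\varepsilon - \tilde\Phi(\tau,Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix}\Bigg|_{\tilde\Phi=\hat\Phi} - E\begin{pmatrix}\rho_\tau\left(\varepsilon - \Phi(\tau,Z)'\iota\right)\\ \rho_\theta(\eta)\end{pmatrix}\right\| \overset{p}{\to} 0$$

which implies claim 1.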

B.6

Asymptotic Distribution of the IVQR and QR Estimators and the Estimator of the Copula Parameters

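Lemma 6. Under assumptions 1 to 8,

$$\sqrt{n}\begin{pmatrix}\hat\beta(u)-\beta(u)\\ \hat\gamma(v)-\gamma(v)\\ \hat\xi-\xi\end{pmatrix} \Rightarrow \mathbb{G}_M(u,v)$$

where $\mathbb{G}_M(u,v)$ is a Gaussian process with covariance matrix $\Sigma_M(u,v,\tilde u,\tilde v)$ equal to

$$\Sigma_M(u,v,\tilde u,\tilde v) \equiv \begin{pmatrix}\Sigma_J(u,v,\tilde u,\tilde v) & \Sigma_M^{21}(u,\tilde v)'\\ \Sigma_M^{21}(\tilde u,v) & \Sigma_\xi\end{pmatrix}$$

$$\Sigma_M^{21}(u,v) = H_1^{-1}E\left[\frac{\partial\ell_j(u_j,v_j,\xi)}{\partial\xi}\begin{pmatrix}\left(1(y_j\le x_j'\beta(u))-u\right)\Psi(u,z_j)'J(u)^{-1} & \left(1(x_{1j}\le z_j'\gamma(v))-v\right)z_j'H(v)^{-1}\end{pmatrix}\right] + H_1^{-1}E\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial(u,v)}M(y_j,x_{1j},z_j)\Sigma_J(u,v,u_j,v_j)\right]$$

where the expectation is taken with respect to $F_Z(z_j)\,C_{UV|X_2}(u_j,v_j|x_{2j};\xi)$, and where I have used

$$M(y_j,x_{1j},z_j) \equiv -\begin{pmatrix} g_Y(y_j|x_j)x_j' & 0\\ 0 & f_{X_1}(x_{1j}|z_j)z_j'\end{pmatrix},\qquad \Sigma_\xi \equiv H_1^{-1}(H_1+H_2)H_1^{-1},$$

$$H_1 \equiv E\left[-\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right]$$

$$H_2 \equiv E\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial(u,v)}M(y_j,x_{1j},z_j)\Sigma_J(u_j,v_j,u_h,v_h)M(y_h,x_{1h},z_h)'\frac{\partial^2\ell_h(u_h,v_h,\xi)}{\partial(u,v)'\,\partial\xi'}\right]$$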

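Proof. Begin by writing $\hat\xi$ in terms of the influence function. To do so, apply the mean value theorem to the score:

$$0 = E_n\left[\frac{\partial\ell_j(\hat u_j,\hat v_j,\hat\xi)}{\partial\xi}\right] = E_n\left[\frac{\partial\ell_j(\hat u_j,\hat v_j,\xi)}{\partial\xi}\right] + E_n\left[\frac{\partial^2\ell_j(\hat u_j,\hat v_j,\bar\xi)}{\partial\xi\,\partial\xi'}\right](\hat\xi-\xi)$$

where $\bar\xi$ lies between $\hat\xi$ and $\xi$. Rearranging the previous equation yields

$$\sqrt{n}\,(\hat\xi-\xi) = -\left[E_n\left[\frac{\partial^2\ell_j(\hat u_j,\hat v_j,\bar\xi)}{\partial\xi\,\partial\xi'}\right]\right]^{-1}\sqrt{n}\,E_n\left[\frac{\partial\ell_j(\hat u_j,\hat v_j,\xi)}{\partial\xi}\right] \tag{16}$$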

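Now show the uniform convergence of the Hessian:

$$\left\|\frac{\partial^2\ell_j(\hat u_j,\hat v_j,\bar\xi)}{\partial\xi\,\partial\xi'} - \frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right\| = \left\|\nabla^3\ell_j(\bar u_j,\bar v_j,\bar{\bar\xi})\begin{pmatrix}\hat u_j-u_j\\ \hat v_j-v_j\\ \bar\xi-\xi\end{pmatrix}\right\| \le \left\|\nabla^3\ell_j(\bar u_j,\bar v_j,\bar{\bar\xi})\right\|\left\|\begin{pmatrix}\hat u_j-u_j\\ \hat v_j-v_j\\ \bar\xi-\xi\end{pmatrix}\right\| \le K\cdot o_P^*(1) = o_P^*(1)$$

where $\nabla^3\ell_j(u,v,\xi)$ is a three dimensional array whose $(i,j,k)$ element is the partial derivative of $\log(c(u,v|x_j;\xi))$ with respect to the $i$th element of $\xi$, the $j$th element of $\xi$ and the $k$th element of $(u,v,\xi')'$. The first equality follows by the mean value theorem, and the last inequality follows by assumptions 3 and 8. Using this result,

$$\left\|E_n\left[\frac{\partial^2\ell_j(\hat u_j,\hat v_j,\bar\xi)}{\partial\xi\,\partial\xi'}\right] - E\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right]\right\| \le \left\|E_n\left[\frac{\partial^2\ell_j(\hat u_j,\hat v_j,\bar\xi)}{\partial\xi\,\partial\xi'} - \frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right]\right\| + \left\|E_n\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right] - E\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial\xi'}\right]\right\| = o_P^*(1)$$

where the inequality follows by the triangle inequality, the first term is $o_P^*(1)$ by the argument above, and the second term is $o_P^*(1)$ by the uniform law of large numbers. Then, show the asymptotic distribution of $\sqrt{n}\,E_n\left[\frac{\partial\ell_j(\hat u_j,\hat v_j,\xi)}{\partial\xi}\right]$. Apply the mean value theorem to $(\hat u_j,\hat v_j)$ for all $j=1,...,n$:

$$\sqrt{n}\,E_n\left[\frac{\partial\ell_j(\hat u_j,\hat v_j,\xi)}{\partial\xi}\right] = \mathbb{G}_n\left[\frac{\partial\ell_j(u_j,v_j,\xi)}{\partial\xi}\right] + \sqrt{n}\,E_n\left[\frac{\partial^2\ell_j(\bar u_j,\bar v_j,\xi)}{\partial\xi\,\partial(u,v)}\begin{pmatrix}\hat u_j-u_j\\ \hat v_j-v_j\end{pmatrix}\right] \tag{17}$$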

The first term is simply the usual term that appears in the maximization of the log likelihood function, and the second term takes into account that $(u_j,v_j)$ are estimated, but not observed. Leaving aside the first term and focusing on the second, it follows that

$$\sqrt{n}\begin{pmatrix}\hat u_j-u_j\\ \hat v_j-v_j\end{pmatrix} = \sqrt{n}\begin{pmatrix}\int_0^1 1\left(\hat S_Y(u|x_j)\le y_j\right)du - \int_0^1 1\left(S_Y(u|x_j)\le y_j\right)du\\ \int_0^1 1\left(\hat h(z_{1j},x_{2j},v)\le x_{1j}\right)dv - \int_0^1 1\left(h(z_{1j},x_{2j},v)\le x_{1j}\right)dv\end{pmatrix}$$

Let $G_Y(y|x)\equiv\int_0^1 1\left(S_Y(u|x)\le y\right)du = S_Y^{-1}(y|x)$, and $g_Y(y|x)\equiv\frac{\partial}{\partial y}G_Y(y|x)$.20 By Lemma 4 in Chernozhukov et al. (2013),

$$\sqrt{n}\begin{pmatrix}\hat u_j-u_j\\ \hat v_j-v_j\end{pmatrix} = \sqrt{n}\,M(y_j,x_{1j},z_j)\begin{pmatrix}\hat\beta(u_j)-\beta(u_j)\\ \hat\gamma(v_j)-\gamma(v_j)\end{pmatrix} + o_P^*(1) \tag{18}$$

where the $*$ denotes that the convergence in probability is uniform in $(u_j,v_j)$. By the extended continuous mapping theorem, assumption 8, and the uniform consistency of $(\hat u_j,\hat v_j)$, it follows that

$$\frac{\partial^2\ell_j(\bar u_j,\bar v_j,\xi)}{\partial\xi\,\partial(u,v)} = \frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial(u,v)} + o_P^*(1)$$

By the information equality, the asymptotic variance of the first term equals $H_1$. After some tedious algebra, it is possible to show that the asymptotic variance of the second term equals $H_2$, and the asymptotic covariance equals zero. Hence, using this result and equation 16, I can rewrite the estimators as

$$\sqrt{n}\begin{pmatrix}\hat\beta(u)-\beta(u)\\ \hat\gamma(v)-\gamma(v)\\ \hat\xi-\xi\end{pmatrix} = \begin{pmatrix}\sqrt{n}\begin{pmatrix}\hat\beta(u)-\beta(u)\\ \hat\gamma(v)-\gamma(v)\end{pmatrix}\\[1ex] H_1^{-1}E_n\left[\frac{\partial^2\ell_j(u_j,v_j,\xi)}{\partial\xi\,\partial(u,v)}M(y_j,x_{1j},z_j)\sqrt{n}\begin{pmatrix}\hat\beta(u_j)-\beta(u_j)\\ \hat\gamma(v_j)-\gamma(v_j)\end{pmatrix}\right] + H_1^{-1}\mathbb{G}_n\left[\frac{\partial\ell_j(u_j,v_j,\xi)}{\partial\xi}\right]\end{pmatrix} + o_P^*(1)$$

By lemma 3, the extended continuous mapping theorem, and the functional delta method, it follows that

$$\sqrt{n}\begin{pmatrix}\hat\beta(u)-\beta(u)\\ \hat\gamma(v)-\gamma(v)\\ \hat\xi-\xi\end{pmatrix} \Rightarrow \mathbb{G}_M(u,v)$$

20 These would be the conditional cdf and pdf of $Y$ if $U$ and $X$ were independent. These functions are different from the actual conditional cdf and pdf of $Y$, which are given by $F_Y(y|x)\equiv\int_0^1 1\left(S_Y(u|x)\le y\right)f(u|x)\,du$, and $f_Y(y|x)\equiv\frac{\partial}{\partial y}F_Y(y|x)$. Under endogeneity $f(u|x)\ne 1$, and hence $G_Y\ne F_Y$. Even though the actual data is not going to depend on $G_Y$, the way $u_j$ is identified makes it convenient for inference.

B.7

Uniform consistency of CˇU V (u, v)

Lemma 7. Let assumptions 2, 3, and 9 hold, and $(\hat u_i,\hat v_i)$ be uniformly consistent estimators for $(u_i,v_i)$. Then, $\sqrt{n}\sup_{u,v}\left|\check C_{UV}(u,v) - C_{UV}(u,v)\right| = O_P(1)$.

Proof. Define $\tilde C_{UV}(u,v)\equiv E_n[1(u_i\le u)1(v_i\le v)]$ and split the proof into showing that the probability limit of $\tilde C_{UV}(u,v)$ and $\check C_{UV}(u,v)$ is the same, and then that $\tilde C_{UV}(u,v)$ is a consistent estimator of $C_{UV}(u,v)$.

$$\sqrt{n}\left|\check C_{UV}(u,v) - \tilde C_{UV}(u,v)\right| = \frac{1}{\sqrt{n}}\left|\sum_{i=1}^n 1(\hat u_i\le u)1(\hat v_i\le v) - 1(u_i\le u)1(v_i\le v)\right| \le \mathbb{G}_n\left|1(\hat u_i\le u) - 1(u_i\le u)\right| + \mathbb{G}_n\left|1(\hat v_i\le v) - 1(v_i\le v)\right|$$

Consider the sequence $s_n$ that satisfies $s_n\to 0$ and $s_n\sqrt{n}\to\infty$ as $n\to\infty$.

$$\sup_u P\left(\sqrt{n}\left|1(\hat u_i\le u) - 1(u_i\le u)\right| > \varepsilon\right) = \sup_u P\left(1(\hat u_i\le u)\ne 1(u_i\le u)\right) \le \sup_u P\left(|u_i-u|\le s_n\right) + P\left(|u_i-\hat u_i|>s_n\right) \le 2s_n + P\left(|u_i-\hat u_i|>s_n\right)$$

Take limits to conclude that $\lim_{n\to\infty}\sup_u P\left(\sqrt{n}\left|1(\hat u_i\le u) - 1(u_i\le u)\right| > \varepsilon\right) = 0$. By a parallel argument, $\lim_{n\to\infty}\sup_v P\left(\sqrt{n}\left|1(\hat v_i\le v) - 1(v_i\le v)\right| > \varepsilon\right) = 0$. Consequently, $\lim_{n\to\infty}\sup_{u,v} P\left(\sqrt{n}\left|\check C_{UV}(u,v) - \tilde C_{UV}(u,v)\right| > \varepsilon\right) = 0$.

As for the second step, consider the class $\mathcal{C}_{UV}\equiv\{\{(x_1,x_2): x_1\le u, x_2\le v\}, u,v\in[0,1]\}$. This is a VC class with VC dimension $V(\mathcal{C}_{UV}) = 3$. Therefore, by Theorem 2.6.4 in van der Vaart and Wellner (1996), its covering number is bounded: $N(\varepsilon,\mathcal{C}_{UV},L_2(P)) \le 3\cdot 4^3 K e^3\varepsilon^{-4} < \infty$ for some constant $K$ and $0<\varepsilon<1$. By Theorem 2.5.2 in van der Vaart and Wellner (1996), it is P-Donsker, so $\sqrt{n}\sup_{u,v}\left|\tilde C_{UV}(u,v) - C_{UV}(u,v)\right| = O_P(1)$. Hence,

$$\sqrt{n}\left(\check C_{UV}(u,v) - C_{UV}(u,v)\right) = \sqrt{n}\left(\check C_{UV}(u,v) - \tilde C_{UV}(u,v)\right) + \sqrt{n}\left(\tilde C_{UV}(u,v) - C_{UV}(u,v)\right) = O_P^*(1)$$
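As a computational aside, the proof of Lemma 7 treats $\check C_{UV}(u,v)$ as the sample share of observations whose estimated ranks satisfy $\hat u_i\le u$ and $\hat v_i\le v$. A minimal Python sketch of that share follows, assuming the estimated ranks are already available as plain lists; the function name `empirical_copula` and the toy values are hypothetical and not taken from the paper.

```python
def empirical_copula(u_hat, v_hat, u, v):
    """Share of observations with u_hat_i <= u and v_hat_i <= v."""
    n = len(u_hat)
    hits = sum(1 for ui, vi in zip(u_hat, v_hat) if ui <= u and vi <= v)
    return hits / n


# Toy estimated ranks (hypothetical values, for illustration only).
u_hat = [0.1, 0.6, 0.9, 0.4]
v_hat = [0.2, 0.3, 0.95, 0.8]

# Two of the four pairs satisfy both inequalities at (u, v) = (0.7, 0.5).
c_check = empirical_copula(u_hat, v_hat, 0.7, 0.5)
```

The indicator form is what prevents a direct appeal to the continuous mapping theorem in the proof, since a small change in an estimated rank can flip an observation in or out of the count.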

B.8

Uniform consistency of CˇU |V (u|v)

Consider the estimator $\check C_{UV}(u,v)$ defined by equation 11. This estimator can be seen as the integration over $[0,1]$ of a nonparametric estimator of the conditional copula distribution $C_{U|V}(u|v)$, given by

$$\check C_{U|V}(u|v) \equiv \frac{H_n+1}{n}\sum_{i=1}^n 1(\hat u_i\le u)\,1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right)$$

where $H_n$ denotes the number of evenly spaced quantiles that are used in the estimation of the quantile process $h(z_1,x_2,v)$,21 and $\underline\theta(v)$ and $\bar\theta(v)$ are defined as $\{\max_i\theta_i:\theta_i<v\}$ and $\{\min_i\theta_i:\theta_i\ge v\}$. It can be checked that $\check C_{UV}(u,v) = \frac{1}{H_n+1}\sum_{h=0}^{(H_n+1)\bar\theta(v)}\check C_{U|V}(u|\theta_h)$.

21 These quantiles are denoted by $0=\theta_0,\theta_1,...,\theta_{H_n},\theta_{H_n+1}=1$.

Geometrically, I am splitting the $[0,1]$ interval into $H_n+1$ intervals of equal length, and each $v_i$ belongs to exactly one of these intervals almost surely. The probability of $v_i$ being in any of these intervals is equal to $\frac{1}{H_n+1}$, since $v_i\sim U(0,1)$. $H_n$ is the (inverse of the) bandwidth of this kernel estimator, and $H_n\to\infty$ as $n\to\infty$. For each of the cells, compute the conditional distribution of the copula. The following lemma establishes the uniform consistency of this conditional estimator of the copula, which, unlike the unconditional estimator, converges at a rate slower than $\sqrt{n}$.

Lemma 8. Let assumptions 2, 3, and 9 hold, $(\hat u_i,\hat v_i)$ converge uniformly in probability to $(u_i,v_i)$ at a rate $\sqrt{n}$, $H_n\to\infty$, and $\frac{na_n}{\log(n)}\to\infty$ as $n\to\infty$, where $a_n\le\frac{1}{H_n}$. Then, $\sup_{u,v}\left|\check C_{U|V}(u|v) - C_{U|V}(u|v)\right| = o_P(1)$.

Proof. The proof is split into two steps: first show the consistency of the infeasible estimator $\tilde C_{U|V}(u|v)\equiv\frac{H_n+1}{n}\sum_{i=1}^n 1(u_i\le u)\,1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)$, and then show that $\check C_{U|V}(u|v)$ and $\tilde C_{U|V}(u|v)$ converge to the same limit.

Consider the class $\mathcal{C}_{UV}\equiv\{\{(x_1,x_2): x_1\le u, v_l\le x_2<v_u\}, u,v_l,v_u\in[0,1], v_l<v_u\}$. It is a VC class with VC dimension $V(\mathcal{C}_{UV}) = 4$. Therefore, by Theorem 2.6.4 in van der Vaart and Wellner (1996), its covering number is bounded: $N(\varepsilon,\mathcal{C}_{UV},L_2(P))\le 4^5Ke^4\varepsilon^{-6}<\infty$ for some constant $K$ and $0<\varepsilon<1$. By Corollary 1 in Einmahl et al. (2005)

$$\lim_{n\to\infty}\ \sup_{a_n\le\frac{1}{H_n+1}\le b_n}\ \sup_{(u,v)\in[0,1]^2}\left|\tilde C_{U|V}(u|v) - C_{U|V}(u|v)\right| = 0$$

This result implies that $\sup_{(u,v)\in[0,1]^2}\left|\tilde C_{U|V}(u|v) - C_{U|V}(u|v)\right| = o_P(1)$.

Regarding the second step, notice that it is not possible to apply the extended continuous mapping theorem to conclude that if $\hat u_i\overset{p}{\to}u_i$ and $\hat v_i\overset{p}{\to}v_i$, then $1(\hat u_i\le u)\overset{p}{\to}1(u_i\le u)$ or $1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right)\overset{p}{\to}1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)$ uniformly in $(u,v)$. Hence, a different argument is required for the proof:

$$\sup_{u,v}P\left(\left|1(\hat u_i\le u)\,1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right) - 1(u_i\le u)\,1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)\right|\ge\varepsilon r_n\right)$$

$$\le \sup_u P\left(\left|1(\hat u_i\le u) - 1(u_i\le u)\right|\ge\varepsilon r_n\right) + \sup_v P\left(\left|1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right) - 1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)\right|\ge\varepsilon r_n\right)$$

Examine the convergence of each term separately:

$$\sup_u P\left(\left|1(\hat u_i\le u) - 1(u_i\le u)\right|>\varepsilon r_n\right) = \sup_u P\left(1(\hat u_i\le u)\ne 1(u_i\le u)\right) \le \sup_u P\left(|u_i-u|\le s_n\right) + P\left(|u_i-\hat u_i|>s_n\right) \le 2s_n + P\left(|u_i-\hat u_i|>s_n\right)$$

where $s_n$ is a sequence that satisfies $s_n\to 0$ and $\sqrt{n}s_n\to\infty$ as $n\to\infty$. The second inequality follows from the fact that $u_i\sim U(0,1)$. Since $u_i-\hat u_i = O_P^*\left(\frac{1}{\sqrt{n}}\right)$, it follows that, if $\sqrt{n}s_n\to\infty$ as $n\to\infty$, $u_i-\hat u_i = o_P^*(s_n)$, and therefore

$$\lim_{n\to\infty}\sup_u P\left(\left|1(\hat u_i\le u) - 1(u_i\le u)\right|>\varepsilon r_n\right) = 0$$

Similarly,

$$\sup_v P\left(\left|1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right) - 1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)\right|\ge\varepsilon r_n\right) = \sup_v P\left(1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right)\ne 1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)\right)$$

$$\le \sup_v P\left(|v_i-\underline\theta(v)|\le s_n\right) + \sup_v P\left(|v_i-\bar\theta(v)|\le s_n\right) + P\left(|v_i-\hat v_i|>s_n\right) \le 4s_n + P\left(|v_i-\hat v_i|>s_n\right)$$

And under the same conditions as before, I get that

$$\lim_{n\to\infty}P\left(\left|1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right) - 1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)\right|\ge\varepsilon r_n\right) = 0$$

Hence, $1(\hat u_i\le u)\,1\left(\underline\theta(v)\le\hat v_i<\bar\theta(v)\right) - 1(u_i\le u)\,1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right) = o_P^*(r_n)$. As a consequence, $\sup_{(u,v)}\left|\check C_{U|V}(u|v) - \tilde C_{U|V}(u|v)\right| = o_P(H_nr_n)$. Since $r_n$ can converge to zero at any speed, it follows that the previous quantity equals $o_P(1)$ if $H_n\to\infty$ at any speed. By the triangle inequality,

$$\left|\check C_{U|V}(u|v) - C_{U|V}(u|v)\right| \le \left|\check C_{U|V}(u|v) - \tilde C_{U|V}(u|v)\right| + \left|\tilde C_{U|V}(u|v) - C_{U|V}(u|v)\right|$$

Therefore, it follows that $\sup_{u,v}\left|\check C_{U|V}(u|v) - C_{U|V}(u|v)\right| = o_P(1)$.

Some remarks are in order. First of all, this lemma limits the rate of growth of the number of cells of the unit interval, which has to satisfy $H_n = o_P\left(\frac{n}{\log(n)}\right)$. This, however, does not imply that the estimator achieves the maximum possible convergence rate because of the kernel choice, $K(v_i,v,H_n)\equiv(H_n+1)\,1\left(\underline\theta(v)\le v_i<\bar\theta(v)\right)$. This kernel is not symmetric around zero; a symmetric kernel would improve the convergence rate of the estimator. Furthermore, it depends on two nonlinear functions of $v$, $\underline\theta(v)$ and $\bar\theta(v)$, which means that one cannot use a Taylor expansion around $v$ to establish the asymptotic normality of this estimator.

If instead of using indicator functions, one used functions that are (uniformly) smooth in $u$ and $v$, then one could use the extended continuous mapping theorem. Consequently, it would be possible to estimate $C_{U|V}(u|v)$ by $\grave C_{U|V}(u|v) = \frac{1}{nh_n}\sum_{i=1}^n\grave f(u_i,u,n)\,\grave K\left(\frac{v_i-v}{h_n}\right)$, where $\grave f(u_i,u,n)$ is a function that is uniformly smooth in $u$ and that converges to $1(u_i\le u)$ as $n\to\infty$, and $\grave K\left(\frac{v_i-v}{h_n}\right)$ is a kernel function that is continuous in its argument and that, in order to improve the convergence rate, is symmetric around zero. Studying the asymptotic properties of such an estimator is beyond the scope of this paper.
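To make the cell construction of this section concrete, the following Python sketch computes the binned conditional estimator $\frac{H_n+1}{n}\sum_i 1(\hat u_i\le u)\,1(\underline\theta(v)\le\hat v_i<\bar\theta(v))$ on an evenly spaced grid $\theta_h = h/(H_n+1)$. It is an illustration under stated assumptions: the helper names `cell_bounds` and `conditional_copula`, the clipping of the cell index at the edges of the unit interval, and the toy ranks are all hypothetical and not part of the paper.

```python
import math


def cell_bounds(v, H):
    """Grid points h / (H + 1) bracketing v: the largest point below v and the
    smallest point at or above v (assumed evenly spaced grid)."""
    k = math.ceil(v * (H + 1))   # index of the smallest grid point >= v
    k = min(max(k, 1), H + 1)    # assumption: keep the cell inside [0, 1]
    return (k - 1) / (H + 1), k / (H + 1)


def conditional_copula(u_hat, v_hat, u, v, H):
    """(H + 1) / n times the number of observations with u_hat_i <= u whose
    v_hat_i falls in the cell that contains v."""
    lower, upper = cell_bounds(v, H)
    n = len(u_hat)
    hits = sum(1 for ui, vi in zip(u_hat, v_hat)
               if ui <= u and lower <= vi < upper)
    return (H + 1) * hits / n


# Toy ranks (hypothetical): four observations fall in each of the two cells.
u_hat = [0.1, 0.2, 0.7, 0.9, 0.3, 0.5, 0.8, 0.6]
v_hat = [0.05, 0.3, 0.35, 0.45, 0.6, 0.7, 0.9, 0.55]

low_cell = conditional_copula(u_hat, v_hat, 0.5, 0.3, H=1)
high_cell = conditional_copula(u_hat, v_hat, 0.65, 0.8, H=1)
```

With `H=1` the unit interval is split into two cells, so each call only uses the observations whose second rank lies in the same half as `v`. This is why the effective sample size per cell shrinks as `H` grows, in line with the slower convergence rate noted in the text.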


C

Alternative Counterfactuals

If the policy maker is able to randomly assign the treatment, then equation 1 and the independence between $X$ and $U$ allow one to use the estimator proposed by Chernozhukov et al. (2013) to get the counterfactual distribution of $Y$. Denote the counterfactual treatment of individual $i$ by $x_i^{cf}$. The estimator of the counterfactual distribution of $Y$ equals

$$\hat F_Y^{cf}(y) = \frac{1}{N}\sum_{i=1}^N\int_0^1 1\left(\hat S_Y\left(u|x_i^{cf}\right)\le y\right)du$$
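The randomly assigned case can be sketched numerically as follows. This is an illustration only: it assumes a linear quantile process, $\hat S_Y(u|x) = x'\hat\beta(u)$, and replaces the integral over $u$ with an equally weighted average over a grid of quantile indices. The function name `counterfactual_cdf`, the grid, and the toy coefficients are hypothetical.

```python
def counterfactual_cdf(y, x_cf, beta_grid):
    """Average over individuals and over grid points u_k of the indicator
    1(x_cf_i' beta(u_k) <= y), approximating the integral over u."""
    total = 0
    for x in x_cf:
        for beta in beta_grid:
            fitted = sum(xj * bj for xj, bj in zip(x, beta))
            if fitted <= y:
                total += 1
    return total / (len(x_cf) * len(beta_grid))


# Hypothetical quantile process on the midpoints u_k = (k + 0.5) / K:
# intercept 10 * u_k and a constant slope of 2 on the treatment.
K = 10
beta_grid = [(10 * (k + 0.5) / K, 2.0) for k in range(K)]

# Two individuals with counterfactual treatments 0 and 5 (first entry: constant).
x_cf = [(1.0, 0.0), (1.0, 5.0)]

f_at_5 = counterfactual_cdf(5.0, x_cf, beta_grid)
f_at_12 = counterfactual_cdf(12.0, x_cf, beta_grid)
```

In this toy setup the first individual's outcome is uniform on (0, 10) and the second's on (10, 20), so the counterfactual cdf is an equally weighted mixture of the two.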

On the other hand, if the policy maker is not able to directly assign the treatment, but can affect the distribution of $Z$, it is also possible to estimate the counterfactual distribution of $Y$ using Chernozhukov et al. (2013), but using an estimator of the conditional quantile function of $Y$ on $Z$, denoted by $\hat Q_Y(u|z)$:

$$\hat F_Y^{cf}(y) = \frac{1}{N}\sum_{i=1}^N\int_0^1 1\left(\hat Q_Y\left(u|z_i^{cf}\right)\le y\right)du$$

where $z_i^{cf}$ denotes the counterfactual value of $Z$ of individual $i$. This strategy presents the shortcoming of requiring the nonparametric estimation of the conditional quantile function, since in general it does not coincide with $g(\cdot,\cdot,\cdot)$ or $h(\cdot,\cdot,\cdot)$. For example, if both functions are linear quantile processes, the conditional quantile function of $Y$ on $Z$ is non-linear in general. Hence, feasible alternatives are Martinez-Sanchis et al. (2012), or to use these two functions, along with the copula:

$$\hat F_Y^{cf}(y) = \frac{1}{N}\sum_{i=1}^N\int_{[0,1]^2}1\left(\hat S_Y\left(u|z_i^{cf},v\right)\le y\right)d\hat C_{UV|X_2}\left(u,v|x_{2i}^{cf}\right)$$


D

Estimator of the Asymptotic Variance of FˆY (y)

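Begin by estimating the asymptotic variance of the estimator given by equation 10:

$$\hat\Sigma_O(y,\tilde y) = \frac{1}{n^2H_n^2}\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^{H_n}\sum_{h=1}^{H_n}\hat O(y,z_i,v_k)\,\hat\Sigma_N\left(u(y,z_i,v_k),z_i,v_k,u(\tilde y,z_j,v_h),z_j,v_h\right)\hat O(\tilde y,z_j,v_h)'$$

where $\hat O(y,z,v) = \begin{pmatrix}-\hat f_{Y|ZV}(y|z,v) & 1\end{pmatrix}$ and

$$\hat f_{Y|ZV}(y|z,v) = \sum_{k=1}^{K_n}\frac{(\tau_{k+1}-\tau_k)\,c_{U|VX_2}\left(\tau_k|v,x_2;\hat\xi\right)}{\hat x(v)'\left(\hat\beta(\tau_{k+1})-\hat\beta(\tau_k)\right)}\,1\left(\hat x(v)'\hat\beta(\tau_k)\le y\le\hat x(v)'\hat\beta(\tau_{k+1})\right)$$

If the SQF and the copula are estimated by equations 7 and 9, then the central term of the variance equals

$$\hat\Sigma_N(u,v,\tilde u,\tilde v) = \hat N(u,v,\tilde u,\tilde v)\,\hat\Sigma_M(u,v,\tilde u,\tilde v)\,\hat N(u,v,\tilde u,\tilde v)'$$

where

$$\hat N(u,v,z) = \begin{pmatrix}\hat x(v)' & \hat\beta_1(u)z' & 0\\ 0 & 0 & \frac{\partial}{\partial\xi}C_{U|VX_2}\left(u|v,x_2;\hat\xi\right)'\end{pmatrix}$$

and

$$\hat\Sigma_M(u,v,\tilde u,\tilde v) = \begin{pmatrix}\hat\Sigma_J(u,v,\tilde u,\tilde v) & \hat\Sigma_M^{21}(u,\tilde v)'\\ \hat\Sigma_M^{21}(\tilde u,v) & \hat\Sigma_\xi\end{pmatrix}$$

$$\hat\Sigma_\xi = \hat H_1^{-1}\left(\hat H_1+\hat H_2\right)\hat H_1^{-1}$$

$$\hat H_1 = -\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell_i\left(\hat u_i,\hat v_i,\hat\xi\right)}{\partial\xi\,\partial\xi'}$$

$$\hat H_2 = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n\frac{\partial^2\ell_i\left(\hat u_i,\hat v_i,\hat\xi\right)}{\partial\xi\,\partial(u,v)}\hat M(y_i,x_{1i},z_i)\,\hat\Sigma_J(\hat u_i,\hat v_i,\hat u_j,\hat v_j)\,\hat M(y_j,x_{1j},z_j)'\frac{\partial^2\ell_j\left(\hat u_j,\hat v_j,\hat\xi\right)}{\partial(u,v)'\,\partial\xi'}$$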

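$$\hat M(y,x_1,z) = -\begin{pmatrix}\hat g_Y(y|x)x' & 0\\ 0 & \hat f_{X_1}(x_1|z)z'\end{pmatrix}$$

$$\hat g_Y(y|x) = \sum_{k=1}^{K_n}\frac{\tau_{k+1}-\tau_k}{x'\left(\hat\beta(\tau_{k+1})-\hat\beta(\tau_k)\right)}\,1\left(x'\hat\beta(\tau_k)\le y\le x'\hat\beta(\tau_{k+1})\right)$$

$$\hat f_{X_1}(x_1|z) = \sum_{h=1}^{H_n}\frac{\theta_{h+1}-\theta_h}{z'\left(\hat\gamma(\theta_{h+1})-\hat\gamma(\theta_h)\right)}\,1\left(z'\hat\gamma(\theta_h)\le x_1\le z'\hat\gamma(\theta_{h+1})\right)$$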
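The two density estimators in this appendix share one construction: on each cell of the quantile grid, the density is approximated by the change in the quantile index divided by the change in the fitted quantile. The Python sketch below applies it to the first stage density $\hat f_{X_1}(x_1|z)$, assuming the estimated coefficients $\hat\gamma(\theta_h)$ are available on a grid. The function name `density_from_quantiles` and the toy quantile process are hypothetical.

```python
def density_from_quantiles(x1, z, gamma_grid, theta_grid):
    """Sum over cells of (theta_{h+1} - theta_h) / (z'gamma_{h+1} - z'gamma_h)
    for the cells whose fitted quantiles bracket x1."""
    fitted = [sum(zj * gj for zj, gj in zip(z, g)) for g in gamma_grid]
    dens = 0.0
    for h in range(len(theta_grid) - 1):
        lo, hi = fitted[h], fitted[h + 1]
        # Both inequalities are weak, as in the formula, so a value of x1 that
        # sits exactly on a knot is counted in two adjacent cells.
        if hi > lo and lo <= x1 <= hi:
            dens += (theta_grid[h + 1] - theta_grid[h]) / (hi - lo)
    return dens


# Hypothetical first stage: z is a constant and gamma(theta) = 4 * theta, so the
# treatment is uniform on (0, 4) with density 0.25.
theta_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
gamma_grid = [(4.0 * t,) for t in theta_grid]

inside = density_from_quantiles(1.5, (1.0,), gamma_grid, theta_grid)
outside = density_from_quantiles(5.0, (1.0,), gamma_grid, theta_grid)
```

The guard `hi > lo` skips cells where the fitted quantiles do not increase, which can happen in finite samples when estimated quantile curves cross. That guard is a design choice of this sketch rather than something stated in the text.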

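$$\hat\Sigma_M^{21}(u,v) = \hat H_1^{-1}\frac{1}{n}\sum_{i=1}^n\frac{\partial\ell_i\left(\hat u_i,\hat v_i,\hat\xi\right)}{\partial\xi}\begin{pmatrix}\left(1\left(y_i\le x_i'\hat\beta(u)\right)-u\right)\hat\Psi(u,z_i)'\hat J(u)^{-1} & \left(1\left(x_{1i}\le z_i'\hat\gamma(v)\right)-v\right)z_i'\hat H(v)^{-1}\end{pmatrix} + \hat H_1^{-1}\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell_i\left(\hat u_i,\hat v_i,\hat\xi\right)}{\partial\xi\,\partial(u,v)}\hat M(y_i,x_{1i},z_i)\,\hat\Sigma_J(u,v,\hat u_i,\hat v_i)$$

$$\hat\Sigma_J(u,v,\tilde u,\tilde v) = \begin{pmatrix}\hat\Sigma_J^{11}(u,\tilde u) & \hat\Sigma_J^{21}(u,\tilde v)'\\ \hat\Sigma_J^{21}(\tilde u,v) & \hat\Sigma_J^{22}(v,\tilde v)\end{pmatrix}$$

$$\hat\Sigma_J^{11}(u,\tilde u) = \hat J(u)^{-1}\left(\min\{u,\tilde u\}-u\tilde u\right)\frac{1}{n}\sum_{i=1}^n\hat\Psi(u,z_i)\hat\Psi(\tilde u,z_i)'\,\hat J(\tilde u)^{-1}$$

$$\hat\Sigma_J^{21}(u,v) = \hat H(v)^{-1}\frac{1}{n}\sum_{i=1}^n\left(1\left(y_i\le x_i'\hat\beta(u)\right)1\left(x_{1i}\le z_i'\hat\gamma(v)\right)-uv\right)z_i\hat\Psi(u,z_i)'\,\hat J(u)^{-1}$$

$$\hat\Sigma_J^{22}(v,\tilde v) = \hat H(v)^{-1}\left(\min\{v,\tilde v\}-v\tilde v\right)\frac{1}{n}\sum_{i=1}^n z_iz_i'\,\hat H(\tilde v)^{-1}$$

and the matrices $\hat J(u)$ and $\hat H(v)$ are estimated using the Powell (1986) estimator:

$$\hat J(u) = \frac{1}{2nh_n}\sum_{i=1}^n 1\left(|\hat\varepsilon_i(u)|\le h_n\right)\left(\hat\Phi(u,z_i),x_{2i}'\right)'\left(\hat\Phi(u,z_i),x_{2i}'\right)$$

$$\hat H(v) = \frac{1}{2nh_n}\sum_{i=1}^n 1\left(|\hat\eta_i(v)|\le h_n\right)z_iz_i'$$

for some appropriately chosen bandwidth $h_n$.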
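The Powell (1986) estimator of $\hat H(v)$ is a uniform-kernel average of outer products over the observations whose residual lies within one bandwidth of zero. A minimal Python sketch of $\frac{1}{2nh_n}\sum_i 1(|\hat\eta_i(v)|\le h_n)z_iz_i'$ follows, assuming the residuals at a fixed $v$ have already been computed. The function name `powell_matrix`, the bandwidth value, and the data are hypothetical.

```python
def powell_matrix(residuals, z, h):
    """(1 / (2 n h)) * sum over i of 1(|residual_i| <= h) * z_i z_i'."""
    n = len(residuals)
    d = len(z[0])
    out = [[0.0] * d for _ in range(d)]
    for e, zi in zip(residuals, z):
        if abs(e) <= h:                      # uniform kernel window
            for a in range(d):
                for b in range(d):
                    out[a][b] += zi[a] * zi[b]
    scale = 1.0 / (2 * n * h)
    return [[scale * val for val in row] for row in out]


# Toy residuals and regressors (hypothetical); only the first and third
# observations fall inside the bandwidth h = 0.25.
residuals = [0.05, -0.5, 0.1, 2.0]
z = [(1.0, 2.0), (1.0, 0.0), (1.0, -1.0), (1.0, 3.0)]

H_hat = powell_matrix(residuals, z, h=0.25)
```

The same routine would serve for $\hat J(u)$ by passing the residuals $\hat\varepsilon_i(u)$ and replacing the outer product with the appropriate regressor vectors. The choice of `h` is left open here, as it is in the text.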


E

Nonparametric First Stage Equation

Assume that the first stage equation is given by the unknown function $h(\cdot,\cdot,\cdot)$, which is not linear in the covariates as in assumption 1. The IVQR estimator $\hat\beta(\cdot)$ does not require the linearity of $h$ to be consistent and asymptotically Gaussian at the $\sqrt{n}$ convergence rate. However, the estimators of the counterfactual distributions $\hat F_Y^{cf}(y)$ and $\check F_Y^{cf}(y)$ use an estimator of $h$ as an input, which has an impact on their asymptotic convergence.

Denote the estimator of $h(z_1,x_2,v)$ by $\hat h\equiv\hat h(z_1,x_2,v)$, and the estimator of the counterfactual distribution by $\hat F_Y^{cf}(y,\hat h)$. Newey (1991, 1994), Andrews (1994), and Ichimura and Newey (2015) have already studied such estimators, establishing conditions under which they are consistent at the $\sqrt{n}$ convergence rate. One of these conditions is the asymptotic linearity of the semiparametric estimator, i.e. it can be rewritten as the sum of the sample average of an influence function whose variance is finite and a stochastic term that is $o_P\left(\frac{1}{\sqrt{n}}\right)$. For the counterfactual estimators considered in this paper, which are non-linear functions of $h$, asymptotic linearity requires the estimator $\hat h$ to converge at a rate faster than $n^{\frac{1}{4}}$. Even when the dimension of $X_2$ is small, most popular nonparametric estimators of $h$ would typically converge at a slower rate. For example, the Nadaraya-Watson kernel regression estimator would converge at most at the $n^{\frac{1}{\dim(X_2)+4}}$ rate unless one is willing to use higher-order kernels, which are not typically used in empirical work. Consequently, although it is possible to use nonparametric estimators of the first stage equation that would allow the estimator of the counterfactual distribution of $Y$ to be asymptotically linear, deriving sufficient conditions for these nonparametric estimators lies beyond the scope of this paper.

In any case, to check the sensitivity of the estimates of the distribution of $Y$ to a non-linear first stage equation when it is estimated by a linear quantile process, I carried out a Monte Carlo analysis. The specification is the same as the one in section 4, but the first stage equation is a nonlinear monotonic transformation of the original first stage equation: $h(z_1,x_2,v) = 8 + 55\cdot\Phi\left(\frac{z'\gamma(v)-35}{9}\right)$. The new distribution of the treatment has roughly the same support and mean, but a different shape. Then, I show the estimates of the actual distribution of the four estimators considered.
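The transformation used in this Monte Carlo exercise can be sketched in a few lines of Python. Two assumptions are made here and should be read as such: the formula is taken to be $8 + 55\cdot\Phi((z'\gamma(v)-35)/9)$, and $\Phi$ is taken to be the standard normal cdf. The function names are hypothetical.

```python
import math


def std_normal_cdf(t):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def nonlinear_first_stage(linear_index):
    """Assumed transformation 8 + 55 * Phi((index - 35) / 9) of the original
    linear first stage index z'gamma(v)."""
    return 8.0 + 55.0 * std_normal_cdf((linear_index - 35.0) / 9.0)


# Values of the transformed treatment at a low, central and high index.
low, mid, high = (nonlinear_first_stage(t) for t in (0.0, 35.0, 70.0))
```

Under these assumptions the map is strictly increasing and bounded between 8 and 63, so it preserves the ranking of individuals in the first stage while compressing the tails of the treatment distribution.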

The results are presented in figure 6 and table 5. Relative to the Monte Carlo results in section 4, having a nonparametric first stage equation results in a slight bias of the actual distribution estimators. This bias is not uniformly distributed across quantiles, and it varies for each estimator. The bias is slightly smaller for the estimator proposed by Martinez-Sanchis et al. (2012), though it is not substantially different from the bias of the two estimators proposed in this paper. In terms of precision, the results are roughly the same as those found in the main Monte Carlo exercise.

Figure 6: Unconditional cdf estimators
[Figure with four panels: Parametric, Nonparametric, MMK, CFM. Vertical axis from 0 to 1; horizontal axis ticks at 100 and 200.]

Notes: In each of the four graphs, the solid line represents the actual distribution of Y, the dashed line represents the median (pointwise) across repetitions of the estimator, and the dotted lines represent the 2.5 and 97.5 percentiles (pointwise) across repetitions.

Table 5: Fit of the actual distributions

                                                       Parametric  Nonparametric  MMK    CFM
$\int |Q_{0.5}(\hat F_Y(y)) - F_Y(y)|\,dy$              0.017       0.014          0.009  0.027
$\sup_y |Q_{0.5}(\hat F_Y(y)) - F_Y(y)|$                0.030       0.023          0.018  0.056
$\int \nabla_{0.025}^{0.975} Q(\hat F_Y(y))\,dy$        0.030       0.024          0.023  0.029
$\sup_y \nabla_{0.025}^{0.975} Q(\hat F_Y(y))$          0.039       0.040          0.040  0.050

Notes: The first row represents the integral of the difference between the median across repetitions of the estimated counterfactual cdf and the true cdf; the second row represents the maximum of this difference; the third and fourth rows represent the same differences between the 97.5 and 2.5 percentiles.
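The first two fit measures described in these notes can be computed from simulated cdf estimates as in the Python sketch below. It assumes that each Monte Carlo repetition delivers the estimated cdf on a common grid of $y$ values, that the gap is measured in absolute value, and that the integral is approximated by the trapezoid rule. The function name `fit_measures` and the toy draws are hypothetical.

```python
import statistics


def fit_measures(cdf_draws, true_cdf, y_grid):
    """Integrated and maximum absolute gap between the pointwise median of the
    estimated cdfs across repetitions and the true cdf."""
    median_cdf = [statistics.median(draw[k] for draw in cdf_draws)
                  for k in range(len(y_grid))]
    gap = [abs(m - t) for m, t in zip(median_cdf, true_cdf)]
    # Trapezoid rule over the grid of y values.
    integral = sum(0.5 * (gap[k] + gap[k + 1]) * (y_grid[k + 1] - y_grid[k])
                   for k in range(len(y_grid) - 1))
    return integral, max(gap)


# Three hypothetical repetitions on a three-point grid.
y_grid = [0.0, 1.0, 2.0]
true_cdf = [0.0, 0.5, 1.0]
cdf_draws = [[0.0, 0.4, 1.0], [0.0, 0.6, 1.0], [0.0, 0.7, 1.0]]

integrated_gap, max_gap = fit_measures(cdf_draws, true_cdf, y_grid)
```

The third and fourth rows of the table would replace the median with the pointwise difference between the 97.5 and 2.5 percentiles across repetitions; that step is omitted here.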

The estimation of the counterfactual distribution yields a different picture: as in the main Monte Carlo analysis, the estimators proposed by Martinez-Sanchis et al. (2012) and Chernozhukov et al. (2013) are biased. The parametric estimator is also slightly biased, although the size of the bias is smaller than for the other two estimators. Finally, the best fit is provided by the nonparametric estimator proposed in this paper.

Figure 7: Difference between the actual and counterfactual unconditional cdf estimators
[Figure with four panels: Nonparametric, Parametric, CFM, MMK. Vertical axis from −0.4 to 0; horizontal axis ticks at 100 and 200.]

Notes: In each of the four graphs, the solid line represents the actual distribution of Y, the dashed line represents the median (pointwise) across repetitions of the estimator, and the dotted lines represent the 2.5 and 97.5 percentiles (pointwise) across repetitions.

F

Fit of the Parametric Copulas to the Data

Table 6 compares the performance of the estimators based on different copulas. Among the parametric copulas, the Gumbel copula has the best fit, both in terms of the maximum difference between the estimated actual distribution and the empirical cdf, and the mean difference between the two (although it is tied with the Gaussian copula in this category). However, the fit of the estimator based on the nonparametric estimator of the copula is remarkably better than the fit of any of the estimators based on the parametric copulas. Consequently, I report the estimates based on the Gumbel and the nonparametric copula in the main text.

Table 6: Fit of the Copula Distributions

                                          Gaussian  Clayton  Frank   Gumbel  Nonparametric
$\int |\hat F_Y(y) - F_Y(y)|\,dy$          0.0166    0.0169   0.0177  0.0167  0.0062
$\sup_y |\hat F_Y(y) - F_Y(y)|$            0.0371    0.0373   0.0372  0.0354  0.0208

Notes: The first row represents the integral of the difference between the median across repetitions of the estimated counterfactual cdf and the true cdf; the second row represents the maximum of this difference.
