Estimation and Inference for Linear Models with Two-Way Fixed Effects and Sparsely Matched Data

Appendix and Supplementary Material

Valentin Verdier∗

April 21, 2017

∗ Assistant Professor, Department of Economics, University of North Carolina, Chapel Hill, NC 27599, United States. Tel.: +1 919-966-3962. E-mail address: [email protected]


To simplify notation, I will use $C$ to denote a generic constant such that $C < \infty$ throughout this appendix, so that if a quantity $a_n$ is less than some arbitrarily large constant for all $n$, I will always write $a_n \le C \ \forall\, n$, independently of what the arbitrarily large constant actually is. Similarly, $c$ will denote a generic constant such that $c > 0$.

A Proof of Lemma 1 and of its Corollaries

I provide proofs of Lemma 1 and of its corollaries before providing proofs of Propositions 1-3, as the second result in Corollary 2 is used in the proofs of Propositions 1 and 2.

A.1 Proof of Lemma 1

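By the Frisch-Waugh theorem, $M_{g_n} = M_{g_{n,1}} - P_{M_{g_{n,1}} g_{n,2}}$, where $g_{n,1} = [1[i = j]]^{j \in \mathcal{N}}_{i \in \mathcal{N}, t \in \mathcal{T}}$ and $g_{n,2} = [1[d_{it} = d]]^{d \in \mathcal{D}_n}_{i \in \mathcal{N}, t \in \mathcal{T}}$, so that $g_n = [g_{n,1}, g_{n,2}]$, and $P_{M_{g_{n,1}} g_{n,2}} = M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}$.

We have:

$$M_{g_{n,1}} g_{n,2} = \Big[ 1[d_{it} = d] - \frac{1}{T} \sum_{s \in \mathcal{T}} 1[d_{is} = d] \Big]^{d \in \mathcal{D}_n}_{i \in \mathcal{N}, t \in \mathcal{T}} \tag{A.1}$$

so that:

$$g_{n,2}' M_{g_{n,1}} g_{n,2} = \mathrm{diag}\Big(\Big[\sum_{i,t} 1[d_{it} = d]\Big]_{d \in \mathcal{D}_n}\Big) - \Big[\frac{1}{T} \sum_{i,s,t} 1[d_{it} = d, d_{is} = d']\Big]^{d' \in \mathcal{D}_n}_{d \in \mathcal{D}_n} \tag{A.2}$$

Define $N_{n,d} = \sum_{i,t} 1[d_{it} = d]$ and $N_{n,dd'} = \frac{1}{T} \sum_{i,s,t} 1[d_{it} = d, d_{is} = d']$ so that:

$$g_{n,2}' M_{g_{n,1}} g_{n,2} = \mathrm{diag}([N_{n,d}]_{d \in \mathcal{D}_n}) - [N_{n,dd'}]^{d' \in \mathcal{D}_n}_{d \in \mathcal{D}_n} \tag{A.3}$$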

For any value of $n$, consider the weighted undirected graph of teachers:

$$G_n = \{\mathcal{D}_n, [N_{n,dd'}]^{d' \in \mathcal{D}_n}_{d \in \mathcal{D}_n}\} \tag{A.4}$$

Note that if this graph is composed of several connected components, then $g_{n,2}' M_{g_{n,1}} g_{n,2}$ would be block diagonal (after permutation), with each block corresponding to one of these components, and so would $P_{M_{g_{n,1}} g_{n,2}}$, with each block corresponding to the values of $\{i, t\} \in \mathcal{N} \times \mathcal{T}$ that are associated with each of these components. Therefore we can consider each connected component of $G_n$ separately. Without loss of generality we consider the case where $G_n$ is fully connected, i.e. such that any pair of nodes $d, d' \in \mathcal{D}_n$ is connected by a path in $G_n$.

Define the matrices $D_n = \mathrm{diag}(\{N_{n,d}\}_{d \in \mathcal{D}_n})$ and $P_n = D_n^{-1} [N_{n,dd'}]^{d' \in \mathcal{D}_n}_{d \in \mathcal{D}_n}$. Note that $N_{n,dd'} \ge 0$ and $\sum_{d' \in \mathcal{D}_n} N_{n,dd'} = N_{n,d}$ since $\sum_{d' \in \mathcal{D}_n} 1[d_{it} = d'] = 1$. Therefore $P_n$ is a valid probability transition matrix that can be used to describe a Markov chain on $\mathcal{D}_n$. The graph corresponding to the transition matrix $P_n$, $G_n$, is fully connected, and the diagonal elements of $P_n$, $\frac{N_{n,dd}}{N_{n,d}}$, are strictly positive, so that $P_n$ is noncyclic ergodic.

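Hence by Theorem 9-49 of Kemeny et al. (1976), $P_n$ describes a normal Markov chain, so that $\mu' \sum_{r=0}^{\infty} P_n^r$ and $\sum_{r=0}^{\infty} P_n^r f$ exist for any vectors $\mu$ and $f$ with $\mu' j = 0$ and $\alpha' f = 0$, where $j$ is an $N_n \times 1$ vector of ones and $\alpha = [N_{n,d}]_{d \in \mathcal{D}_n}$.

One can easily verify that $M_{g_{n,1}} g_{n,2}\, j = 0$ and $\alpha' D_n^{-1} g_{n,2}' M_{g_{n,1}} = 0$. Hence $M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r$ and $\sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}}$ exist.

$$M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} = M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r (I_{N_n} - P_n)(g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} \tag{A.5}$$

$$= M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} \tag{A.6}$$

$$= M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} \tag{A.7}$$

Define the stochastic process $\xi_\tau^{(n)}$ as a Markov chain on $\mathcal{D}_n$ with transition probability matrix $P_n$. Define ${}^{d''}G^{(n)}_{dd'} = E_d(\text{number of visits of } \xi_\tau^{(n)} \text{ to } d' \text{ before reaching } d'')$, where $E_d(.)$ denotes an expected value conditional on being in the initial state $d$.

Note that $M_{g_{n,1}} g_{n,2} = \big[\frac{1}{T} \sum_{s \in \mathcal{T}} (1[d_{it} = d] - 1[d_{is} = d])\big]^{d \in \mathcal{D}_n}_{i \in \mathcal{N}, t \in \mathcal{T}}$ and $D_n^{-1} g_{n,2}' M_{g_{n,1}} = \big[\frac{1}{N_{n,d}} \frac{1}{T} \sum_{s \in \mathcal{T}} (1[d_{it} = d] - 1[d_{is} = d])\big]^{i \in \mathcal{N}, t \in \mathcal{T}}_{d \in \mathcal{D}_n}$.

Hence, using Proposition 9-12 of Kemeny et al. (1976):

$$M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} = \Big[\frac{1}{T^2} \sum_{s,s' \in \mathcal{T}} \frac{1}{N_{n,d_{it}}} \big({}^{d_{is}}G^{(n)}_{d_{i't'} d_{it}} - {}^{d_{is}}G^{(n)}_{d_{i's'} d_{it}}\big)\Big]^{i' \in \mathcal{N}, t' \in \mathcal{T}}_{i \in \mathcal{N}, t \in \mathcal{T}} \tag{A.8}$$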

This completes the proof of Lemma 1.
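The series representation in (A.7) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the four-student, two-period panel `d` is hypothetical, the helper names (`matmul`, `transpose`, `max_abs_diff`) are not from the paper, and the infinite sum is truncated at 200 terms. It builds $M_{g_{n,1}} g_{n,2}$ from (A.1), forms $P_n$ from the counts $N_{n,d}$ and $N_{n,dd'}$, and then verifies that the truncated series behaves like the orthogonal projector onto the column space of $M_{g_{n,1}} g_{n,2}$ (it is symmetric and leaves $M_{g_{n,1}} g_{n,2}$ unchanged).

```python
# Hypothetical toy panel: d[i][t] is the teacher of student i in period t,
# with teachers labelled 0, ..., Nn - 1. The match graph is connected.
d = [[0, 1], [1, 2], [2, 0], [0, 0]]
n, T = len(d), len(d[0])
Nn = 1 + max(max(row) for row in d)
obs = [(i, t) for i in range(n) for t in range(T)]  # observations {i, t}


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# (A.1): rows of M_{g_{n,1}} g_{n,2}.
A = [[(d[i][t] == k) - sum(d[i][s] == k for s in range(T)) / T for k in range(Nn)]
     for i, t in obs]

# N_{n,d} and N_{n,dd'} as defined before (A.3), and P_n = D_n^{-1} [N_{n,dd'}].
N_d = [sum(d[i][t] == k for i, t in obs) for k in range(Nn)]
N_dd = [[sum((d[i][t] == k) * (d[i][s] == m) for i, t in obs for s in range(T)) / T
         for m in range(Nn)] for k in range(Nn)]
P = [[N_dd[k][m] / N_d[k] for m in range(Nn)] for k in range(Nn)]

# (A.3): g'Mg computed directly and from the counts.
S = matmul(transpose(A), A)
S_from_counts = [[(N_d[k] if k == m else 0) - N_dd[k][m] for m in range(Nn)]
                 for k in range(Nn)]

# (A.7): accumulate (M g) P^r term by term. Each term shrinks geometrically
# because the rows of M g sum to zero.
acc = [[0.0] * Nn for _ in obs]
term = A
for _ in range(200):  # truncation point is an arbitrary choice
    acc = [[x + y for x, y in zip(ra, rt)] for ra, rt in zip(acc, term)]
    term = matmul(term, P)
D_inv_At = [[A[r][k] / N_d[k] for r in range(len(obs))] for k in range(Nn)]
proj = matmul(acc, D_inv_At)  # truncated version of the right-hand side of (A.7)

# Diagonal of M_{g_n} = M_{g_{n,1}} - proj.
diag_M = [1 - 1 / T - proj[r][r] for r in range(len(obs))]
```

Checking symmetry together with `proj @ A == A` is enough here because `proj` is by construction of the form `A @ X`, so its range cannot exceed the column space of `A`.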

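A.2 Proof of Corollary 1

From Lemma 1, we have:

$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}} \big({}^{d_{is}}G^{(n)}_{d_{it} d_{it}} - {}^{d_{is}}G^{(n)}_{d_{is'} d_{it}}\big) \tag{A.9}$$

Since $\xi_\tau^{(n)}$ is a Markov chain, we have ${}^{d''}G^{(n)}_{dd'} = {}^{d''}H^{(n)}_{dd'}\, {}^{d''}G^{(n)}_{d'd'}$. Therefore:

$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G^{(n)}_{d_{it} d_{it}} \big(1 - {}^{d_{is}}H^{(n)}_{d_{is'} d_{it}}\big)$$

$$= 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G^{(n)}_{d_{it} d_{it}}\, {}^{d_{it}}H^{(n)}_{d_{is'} d_{is}}$$

$$= 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s' : d_{is} \neq d_{it},\, d_{is'} \neq d_{it}} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G^{(n)}_{d_{it} d_{it}}\, {}^{d_{it}}H^{(n)}_{d_{is'} d_{is}}$$

Let ${}^{d''}\bar{H}^{(n)}_{dd'} = P_d(\xi_\tau^{(n)} \text{ reaches } d' \text{ before } d'' \text{ for } \tau > 0)$. Note that ${}^{d_{is}}\bar{H}^{(n)}_{d_{it} d_{it}} < 1$ since $N_{n,d_{it} d_{is}} > 0$. Since $\xi_\tau^{(n)}$ is a Markov chain:

$${}^{d_{is}}G^{(n)}_{d_{it} d_{it}} = \sum_{r=1}^{\infty} r \big({}^{d_{is}}\bar{H}^{(n)}_{d_{it} d_{it}}\big)^{r-1} \big(1 - {}^{d_{is}}\bar{H}^{(n)}_{d_{it} d_{it}}\big) = \frac{1}{1 - {}^{d_{is}}\bar{H}^{(n)}_{d_{it} d_{it}}} = \frac{1}{{}^{d_{it}}\bar{H}^{(n)}_{d_{it} d_{is}}}$$

We also have ${}^{d''}\bar{H}^{(n)}_{dd'} = \sum_{\delta \in \mathcal{D}_n} \frac{N_{n,d\delta}}{N_{n,d}}\, {}^{d''}H^{(n)}_{\delta d'}$, so that:

$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s' : d_{is} \neq d_{it},\, d_{is'} \neq d_{it}} \frac{{}^{d_{it}}H^{(n)}_{d_{is'} d_{is}}}{\sum_{d \in \mathcal{D}_n} N_{n,d_{it} d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}$$

which completes the proof of the first result in Corollary 1.

For the second result in Corollary 1, we can rewrite:

$$\sum_{s' \in \mathcal{T} : d_{is'} \neq d_{it}} {}^{d_{it}}H^{(n)}_{d_{is'} d_{is}} = \sum_{d \in \mathcal{D}_n : d \neq d_{it}} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}} \tag{A.10}$$

and $T N_{n,dd'} = \sum_{j \in \mathcal{N}} N^{(j)}_{n,d} N^{(j)}_{n,d'}$, so that:

$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T} \sum_{s : d_{is} \neq d_{it}} \frac{\sum_{d \in \mathcal{D}_n : d \neq d_{it}} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N^{(j)}_{n,d_{it}} N^{(j)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}$$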

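Case 1: $N^{(i)}_{n,d_{it}} \ge 3$

Since $N^{(j)}_{n,d_{it}} N^{(j)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}} \ge 0$:^1

$$M_{g_n}[\{i,t\},\{i,t\}] \ge 1 - \frac{1}{T} - \frac{1}{T} \sum_{s : d_{is} \neq d_{it}} \frac{\sum_{d \in \mathcal{D}_n : d \neq d_{it}} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}{\sum_{d \in \mathcal{D}_n : d \neq d_{it}} N^{(i)}_{n,d_{it}} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}$$

$$= 1 - \frac{1}{T} - \frac{1}{T} \frac{T - N^{(i)}_{n,d_{it}}}{N^{(i)}_{n,d_{it}}} = \frac{N^{(i)}_{n,d_{it}} - 1}{N^{(i)}_{n,d_{it}}} \ge \frac{1}{2} + c$$

^1 As noted above, I use $c$ as a generic small constant.

Case 2: $N^{(i)}_{n,d_{it}} = 2$

$\frac{N^{(i)}_{d_{is}}}{\sum_{j \in \mathcal{N}} N^{(j)}_{d_{it}} N^{(j)}_{d_{is}}} < \frac{1}{2}$ for some $s \in \mathcal{T}$ implies that $\exists\, j \in \mathcal{N}$, $j \neq i$, such that $N^{(j)}_{n,d_{it}} N^{(j)}_{n,d_{is}} \ge 1$. Therefore we have:

$$\frac{\sum_{d \in \mathcal{D}_n} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N^{(j)}_{n,d_{it}} N^{(j)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}} \le \frac{\sum_{d \in \mathcal{D}_n} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}}}{\sum_{d \in \mathcal{D}_n} N^{(i)}_{n,d_{it}} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}} + 1}$$

$$= \frac{1}{N^{(i)}_{n,d_{it}}} \Bigg(1 - \frac{\frac{1}{N^{(i)}_{n,d_{it}}}}{\sum_{d \in \mathcal{D}_n} N^{(i)}_{n,d}\, {}^{d_{it}}H^{(n)}_{d d_{is}} + \frac{1}{N^{(i)}_{n,d_{it}}}}\Bigg) \le \frac{1}{N^{(i)}_{n,d_{it}}} \Bigg(1 - \frac{\frac{1}{N^{(i)}_{n,d_{it}}}}{T + \frac{1}{N^{(i)}_{n,d_{it}}}}\Bigg) = \frac{1}{2} \frac{2T}{2T + 1}$$

so that:

$$M_{g_n}[\{i,t\},\{i,t\}] \ge 1 - \frac{1}{T} - \frac{T - 3}{T} \frac{1}{2} - \frac{1}{T} \frac{1}{2} \frac{2T}{2T + 1} \ge \frac{1}{2} + c$$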

(i)

Case 3: Nn,dit = 1 (i)

Let s1 ∈ T be such that 1 T −2 2 T −1

2T

(j)

P

(j)

Nn,d

is1 (j) (j) j∈N Ndit Ndis 1

(i)

(j)

(i)

1 T −2 2 T −1 .

<

P

1

(j)

Ndit Ndis − c since Nn,dis , Ndit , Ndis are discrete and either 1 (i) (j) (j) 1 T −2 P or Nn,dis ≤ 2 T −1 j∈N Ndit Ndis − 1. j∈N

T −1 T −2

Note that this implies Nn,dis (j)

P

j∈N

(j)

Ndit Ndis ≤ 1

1

1

(i)

Let κn,it = maxs∈T By convention let P

(j)

j∈N

(j)

Nn,d

is P (j) (j) j∈N Ndit Ndis

0 0

.

≡ 0. ∀ s ∈ T s.t. dis 6= dit , since

P

j∈N

(j)

(j)

(i)

Nn,dit Nn,d > 0 if Nn,d > 0 and

(n)

Nn,dit Nn,d dit Hddis ≥ 0:

(i) dit (n) d∈Dn Nn,d Hddis P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis

P

P

=

d∈Dn

P

j∈N

(j)

(i)

Nn,d

(j)

Nn,dit Nn,d P

(j)

j∈N

(j)

Nn,d Nn,d it

(j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis P P (j) (j) dit (n) Hddis d∈Dn j∈N Nn,dit Nn,d κn,it P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis

P

= κn,it


P

dit H (n) ddis

2

Let K = 2 (TT−1) −2 + 1. If

P

P P

d∈Dn

(j)

P

d∈Dn

j∈N

(j)

(n)

Nn,dit Nn,d dit Hddis ≥ K, we also have: 1

(i)

(n)

Nn,d dit Hddis

d∈Dn

1

T −1 K

(j) (j) dit (n) j∈N Nn,dit Nn,d Hddis1

P

(A.11)

so that:

Mgn [{i, t}, {i, t}] ≥ 1 − ≥

If

P

d∈Dn

(j)

d∈Dn

1 +c 2

(n)

j∈N

Nn,dit Nn,d dit Hddis ≤ K, we have:

d∈Dn

Nn,d dit Hddis

P

1

(i)

P P

(j)

1 T −2 1 T −1 − κn,it − T T T K

(n)

1

(j) (j) dit (n) j∈N Nn,dit Nn,d Hddis1

P

P (i) (i) (n) Nn,dis + d∈Dn :d6=dis Nn,d dit Hddis 1 1 1 = P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis 1

1

≤P

d∈Dn

(j)

P

j∈N

(j)

(n)

Nn,dit Nn,d dit Hddis

×

1

1 T − 2 X (j) (j) ( Nn,dit Nn,dis − c+ 1 2T −1 j∈N

X

κn,it

d∈Dn :d6=dis1

X

(j)

(j)

1

j∈N

1T −2 − 2T −1 P

d∈Dn

c P

j∈N

so that:

1 T −1 1 c − κn,it + T T TK

1 +c 2

which concludes the proof of the second result of Corollary 1.


(j)

(j)

(n)

Nn,dit Nn,d dit Hddis

1T −2 c ≤ − 2T −1 K

Mgn [{i, t}, {i, t}] ≥ 1 −

(n)

Nn,dit Nn,d dit Hddis )

1

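For the third result of Corollary 1, we can write:

$$\frac{1}{2} \Delta g_{n,2}' \Delta g_{n,2} = g_{n,2}' M_{g_{n,1}} g_{n,2} \tag{A.12}$$

so that:

$$\Delta g_{n,2} (\Delta g_{n,2}' \Delta g_{n,2})^{-} \Delta g_{n,2}' = \frac{1}{2} \Delta g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} \Delta g_{n,2}' \tag{A.13}$$

By the results in the proof of Lemma 1, we have:

$$\Delta g_{n,2} (\Delta g_{n,2}' \Delta g_{n,2})^{-} \Delta g_{n,2}' = \Big[\frac{1}{2} \frac{1}{N_{n,d_{i1}}} \big({}^{d_{i2}}G^{(n)}_{d_{i'1} d_{i1}} - {}^{d_{i2}}G^{(n)}_{d_{i'2} d_{i1}}\big)\Big]^{i' \in \mathcal{N}}_{i \in \mathcal{N}} \tag{A.14}$$

which concludes the proof for the first part of this result. In addition the above implies:

$$M_{\Delta g_{n,2}}[i,i] = 1 - \frac{1}{2} \frac{1}{N_{n,d_{i1}}}\, {}^{d_{i2}}G^{(n)}_{d_{i1} d_{i1}} \tag{A.15}$$

Note that ${}^{d}G^{(n)}_{dd} = 0$, so that $M_{\Delta g_{n,2}}[i,i] = 1 > \frac{1}{2}$ if $N^{(i)}_{d_{i1}} = 2$.

Otherwise, as in the proof of the first result of Corollary 1, we have:

$$M_{\Delta g_{n,2}}[i,i] = 1 - \frac{1}{2} \frac{1}{\sum_{d \in \mathcal{D}_n} N_{n,d_{i1} d}\, {}^{d_{i1}}H^{(n)}_{d d_{i2}}} = 1 - \frac{1}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N^{(j)}_{d_{i1}} N^{(j)}_{d}\, {}^{d_{i1}}H^{(n)}_{d d_{i2}}} \ge 1 - \frac{1}{\sum_{j \in \mathcal{N}} N^{(j)}_{d_{i1}} N^{(j)}_{d_{i2}}}$$

A.3 Proof of Corollary 2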

For the first result of Corollary 2, note that

$$\|M_{g_n}\|_\infty \ge \|M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}\|_\infty - \|M_{g_{n,1}}\|_\infty = \|M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}\|_\infty - 2\frac{T - 1}{T}$$

As in the proof of Corollary 1, ${}^{d''}G^{(n)}_{dd'} = {}^{d''}H^{(n)}_{dd'}\, {}^{d''}G^{(n)}_{d'd'}$. In addition, ${}^{d''}G^{(n)}_{d'd'} \ge 1$ for $d' \neq d''$ and $\frac{1}{N_{n,d}} \ge \frac{1}{C}$.

Therefore for $T = 2$ and $s \neq t$, $s' \neq t'$:

0

||Mgn,1 gn,2 (gn,2 Mgn,1 gn,2 )− gn,2 Mgn,1 ||∞ X

= maxi∈N ,t∈T

|

i0 ∈N ,t0 ∈T

1 dis (n) dis (n) (n) Gdit dit ( Hd 0 0 dit − dis Hd 0 0 dit )| Nn,dit i s i t

0

=

1 maxi∈N ,t∈T C

0

(n)

(n)

− dis Hd 0

(A.17)

|

(A.18)

1[di0 t0 = d3 , di0 s0 = d4 ]|dis Hd3 dit − dis Hd4 dit |

X 0

0 dit

(n)

(n)

|dis Hd 0

0d i t it

0

i s

− dis Hd 0

i s

0 dit

i ∈N ,t ∈T (n)

(n)

(A.19)

(n)

(n)

(A.20)

(n)

(A.21)

0

d3 ,d4 ∈Dn i ∈N ,t ∈T

=

(n)

Gdit dit |dis Hd 0

0d i t it

Nn,dit

1 maxi∈N ,t∈T C

X

dis

|

0

i ∈N ,t ∈T

X

1

X

= maxi∈N ,t∈T

(A.16)

1 maxi∈N ,t∈T C

X

|dis Hd3 dit − dis Hd4 dit |

d3 ,d4 ∈Dn :Nn,d3 ,d4 >0

1 maxd1 ,d2 ∈Dn :Nn,d1 ,d2 >0 C

X

(n)

|d2 Hd3 d1 − d2 Hd4 d1 |

d3 ,d4 ∈Dn :Nn,d3 ,d4 >0

With this inequality established, we can describe the assignment sequence which provides the example needed to prove the first result of Corollary 2. Let $T = 2$ and for $K = 1, 2, \ldots$, let $n = 2^{K+2} - 3$ and $N_n = 3 \cdot 2^K - 2$.

For $t = 1$, assign $i = 1, 2, 3$ to $d_{it} = 1$, assign $i = 4, 5$ to $d_{it} = 2$, $i = 6, 7$ to $d_{it} = 3$, and so on up to $d_{it} = 2^K - 1$. Assign $i = n, n - 1$ to $d_{it} = N_n$, $i = n - 2, n - 3$ to $d_{it} = N_n - 1$, and so on up to $d_{it} = N_n - (2^K - 2)$.

For $t = 2$, assign $i = 1$ to $d_{it} = N_n$, assign $i = 2$ to $d_{it} = 2$, $i = 3$ to $d_{it} = 3$, $i = 4$ to $d_{it} = 4$, $i = 5$ to $d_{it} = 5$ and so on up to $d_{it} = 2^{K+1} - 1$. Assign $i = n$ to $d_{it} = N_n - 1$, $i = n - 1$ to $d_{it} = N_n - 2$, $i = n - 2$ to $d_{it} = N_n - 3$, and so on up to $d_{it} = 2^K$.

Figure 1 below represents $G_n$ for several values of $n$. We can assign each value of $d$ to a "level" of $G_n$: $d = 1$ to $k_d = 1$, $d = 2, 3$ to $k_d = 2$, $d = 4, 5, 6, 7$ to $k_d = 3$, and so on up to $d = 2^K, \ldots, 2^{K+1} - 1$ to $k_d = K + 1$. Then we can assign $d = N_n$ to $k_d = 2K + 1$, $d = N_n - 1, N_n - 2$ to $k_d = 2K$, $d = N_n - 3, N_n - 4, N_n - 5, N_n - 6$ to $k_d = 2K - 1$, and so on up to $d = N_n - 3 \cdot 2^{K-1} + 2, \ldots, N_n - (2^K - 2)$ to $k_d = K + 2$.
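The assignment sequence described above can be written out explicitly. The following is a minimal sketch, not the author's code: the function names `build_assignment` and `teacher_loads` are hypothetical, and students and teachers are labelled from 1 as in the text. It constructs $d_{it}$ for a given $K$ and counts the number of observations matched to each teacher, which makes it easy to confirm that every student is assigned in both periods, that all $N_n$ teachers are used, and that no teacher is matched to more than three observations, so that the data remain sparsely matched as $K$ grows.

```python
from collections import Counter


def build_assignment(K):
    """Teacher assignment d[(i, t)] for T = 2, following the construction above."""
    n = 2 ** (K + 2) - 3          # number of students
    Nn = 3 * 2 ** K - 2           # number of teachers
    d = {}
    # t = 1, lower tree: students 1, 2, 3 share teacher 1, then pairs of students.
    for i in (1, 2, 3):
        d[(i, 1)] = 1
    i = 4
    for teacher in range(2, 2 ** K):                    # teachers 2, ..., 2^K - 1
        d[(i, 1)] = d[(i + 1, 1)] = teacher
        i += 2
    # t = 1, upper tree: pairs of students counted down from n.
    i = n
    for teacher in range(Nn, Nn - (2 ** K - 1), -1):    # Nn, ..., Nn - (2^K - 2)
        d[(i, 1)] = d[(i - 1, 1)] = teacher
        i -= 2
    # t = 2: student 1 moves to teacher Nn, students 2, ..., 2^(K+1) - 1 get
    # teacher i, and students n, n - 1, ... get teachers Nn - 1, Nn - 2, ...
    d[(1, 2)] = Nn
    for i in range(2, 2 ** (K + 1)):
        d[(i, 2)] = i
    for m in range(n - 2 ** (K + 1) + 1):
        d[(n - m, 2)] = Nn - 1 - m
    return n, Nn, d


def teacher_loads(d):
    """N_{n,d}: number of student-period observations matched to each teacher."""
    return Counter(d.values())
```

For example, `build_assignment(2)` gives the $n = 13$ case shown in panel (a) of Figure 1, and `teacher_loads` applied to its third return value gives the counts $N_{n,d}$.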

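[Figure 1: plots of $G_n$ for different values of $n$. Panels: (a) $n = 13$, $K = 2$; (b) $n = 29$, $K = 3$.]

By symmetry, ${}^{N_n}H^{(n)}_{d1} = {}^{N_n}H^{(n)}_{d'1}$ $\forall\, d, d'$ with $k_d = k_{d'}$. Define $x_k = {}^{N_n}H^{(n)}_{d1}$ for some $d$ with $k_d = k$. We have $x_1 = 1$. By symmetry we also have $x_{K+1} = \frac{1}{2}$.

Since $\xi_\tau^{(n)}$ is a first-order Markov chain, we also have:

$$x_k = \frac{1}{3} x_{k-1} + \frac{2}{3} x_{k+1} \quad \forall\, k = 2, \ldots, K \tag{A.22}$$

This system of equations is solved by:

$$x_k = \frac{1}{2^k - 1} + 2\, \frac{2^{k-1} - 1}{2^k - 1}\, x_{k+1} \quad \forall\, k = 1, \ldots, K \tag{A.23}$$

$$x_{K+1} = \frac{1}{2} \tag{A.24}$$

so that:

$$2^K (x_K - x_{K+1}) = 2^K \Big(\frac{1}{2} \frac{2^K}{2^K - 1} - \frac{1}{2}\Big) = \frac{2^{K-1}}{2^K - 1}$$

and $\forall\, k = 1, \ldots, K - 1$:

$$|x_k - x_{k+1}| = x_k - x_{k+1} \tag{A.25}$$

$$= \frac{1}{2^k - 1} - \frac{1}{2^k - 1} x_{k+1} \tag{A.26}$$

which leads to:

$$2^k |x_k - x_{k+1}| = \frac{2^k}{2^k - 1} - \frac{2^k}{2^k - 1} x_{k+1} \tag{A.27}$$

$$= \frac{2^k}{2^k - 1} - \frac{2^k}{2^k - 1} \frac{1}{2^{k+1} - 1} - 2\, \frac{2^k}{2^k - 1} \frac{2^k - 1}{2^{k+1} - 1}\, x_{k+2} \tag{A.28}$$

$$= \frac{2^{k+1}}{2^{k+1} - 1} - \frac{2^{k+1}}{2^{k+1} - 1} x_{k+2} \tag{A.29}$$

$$= 2^{k+1} |x_{k+1} - x_{k+2}| \tag{A.30}$$

Hence:

$$\sum_{k=1}^{K} 2^k (x_k - x_{k+1}) = K 2^K (x_K - x_{K+1}) \tag{A.31}$$

$$= K \frac{2^{K-1}}{2^K - 1} \tag{A.32}$$

Hence, for $i = 1$ and $t = 1$ or $t = 2$, $s \neq t$:

$$\sum_{d, d' \in \mathcal{D}_n : N_{n,dd'} > 0} \big|{}^{d_{is}}H^{(n)}_{d d_{it}} - {}^{d_{is}}H^{(n)}_{d' d_{it}}\big| = 2 \sum_{k=1}^{K} 2^k (x_k - x_{k+1}) \tag{A.33}$$

$$= K \frac{2^K}{2^K - 1} \tag{A.34}$$

which grows unboundedly as $K \to \infty$, i.e. as $n \to \infty$. This establishes the first result of Corollary 2.

The second result of Corollary 2 follows directly from Lemma 1 when replacing $\mathcal{D}_n$ with a set of finitely bounded cardinality, $\mathcal{D}_n^{(I_n)} = \{d \in \mathcal{D}_n : d_{it} = d, i \in I_n, t \in \mathcal{T}\}$, and when $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, so that $\frac{N_{n,dd'}}{N_{n,d}} \ge \frac{1}{TC}$ $\forall\, d, d' \in \mathcal{D}_n$ s.t. $N_{n,dd'} > 0$, since, as in the proof of Lemma 1, we can treat $G_n^{(I_n)} = \{\mathcal{D}_n^{(I_n)}, [N_{n,dd'}]^{d' \in \mathcal{D}_n^{(I_n)}}_{d \in \mathcal{D}_n^{(I_n)}}\}$ as fully connected without loss of generality and $E_d(\text{number of visits of } \xi_\tau^{(n)} \text{ to } d' \text{ before reaching } d'')$ is finite $\forall\, d, d', d'' \in \mathcal{D}_n^{(I_n)}$ for noncyclic ergodic Markov chains with finite support.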
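The closed form used for the first result, (A.23)-(A.24), and the resulting sum in (A.31)-(A.32) can be verified with exact rational arithmetic. The sketch below is illustrative only; the function names `x_levels` and `weighted_sum` are hypothetical. It builds $x_k$ backwards from $x_{K+1} = \frac{1}{2}$ using (A.23), which makes it possible to confirm that the recursion (A.22) and the boundary value $x_1 = 1$ hold, and that $\sum_{k=1}^{K} 2^k (x_k - x_{k+1})$ equals $K \frac{2^{K-1}}{2^K - 1}$.

```python
from fractions import Fraction


def x_levels(K):
    """Return {k: x_k} for k = 1, ..., K + 1 using (A.23) and (A.24)."""
    x = {K + 1: Fraction(1, 2)}
    for k in range(K, 0, -1):
        x[k] = Fraction(1, 2 ** k - 1) + 2 * Fraction(2 ** (k - 1) - 1, 2 ** k - 1) * x[k + 1]
    return x


def weighted_sum(K):
    """Left-hand side of (A.31): sum over k of 2^k (x_k - x_{k+1})."""
    x = x_levels(K)
    return sum(2 ** k * (x[k] - x[k + 1]) for k in range(1, K + 1))
```

Since `weighted_sum(K)` is roughly $K/2$ for large $K$, the quantity in (A.34) grows linearly in $K$, that is logarithmically in $n$.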

A.4 Variance of Estimation Error for Realizations of Unobserved Heterogeneity under Homoscedasticity and Uncorrelation

Although estimation of $\beta_0$ is the only object of interest in the rest of the paper, it can be pointed out that Lemma 1 can also be used to bound the variance of estimation error for particular realizations of $e_d - e_{d'}$, $d, d' \in \mathcal{D}_n$, when transitory shocks are homoscedastic and serially and cross-sectionally uncorrelated as in Jochmans and Weidner (2016).^2

^2 Note that estimation error for $e_d$ instead of $e_d - e_{d'}$, $d, d' \in \mathcal{D}_n$, is irrelevant since $\{e_d\}_{d \in \mathcal{D}_n}$ is only determined up to a normalization within each connected subcomponent of $G_n$.

As in Jochmans and Weidner (2016) consider the case where $\beta_0 = 0$ (or alternatively treat the estimation of $\beta_0$ as asymptotically ignorable). Define $\hat{e}_n = (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} u_n$ and $\hat{e}_{n,d}$ to be the element of $\hat{e}_n$ corresponding to $d \in \mathcal{D}_n$.

Corollary 3. If $Var(u_n | X) = \sigma_u^2 I_{nT}$, for $d, d' \in \mathcal{D}_n$, $d \neq d'$, connected by a path of any length in $G_n$:

$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{dd} = \sigma_u^2 \frac{1}{\sum_{d''} N_{n,dd''}\, {}^{d}H^{(n)}_{d''d'}}$$

For $d, d' \in \mathcal{D}_n$ connected by a path of length one in $G_n$, $d \neq d'$:

$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) \le \sigma_u^2 \frac{1}{N_{n,dd'}} \tag{A.35}$$

Proof. Under the assumptions of Corollary 3, we have:

$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2\, [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}'\, (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-}\, [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}$$

As in the proof of Lemma 1, for $d$ and $d'$ connected in $G_n$ by a path of any length, we can rewrite:

$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2\, [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}' \Big[\frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{\delta d}\Big]_{\delta \in \mathcal{D}_n}$$

$$= \sigma_u^2 \Big(\frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{dd} - \frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{d'd}\Big) = \sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{dd}$$

where the last equality follows from the definition of ${}^{d''}G^{(n)}_{dd'}$.

As in the proof of Corollary 1, we have, for $d \neq d'$:

$$\frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{dd} = \frac{1}{\sum_{d'' \in \mathcal{D}_n} N_{n,dd''}\, {}^{d}H^{(n)}_{d''d'}} \tag{A.36}$$

and for $d, d'$ such that $N_{n,dd'} > 0$:

$$\sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G^{(n)}_{dd} \le \sigma_u^2 \frac{1}{N_{n,dd'}\, {}^{d}H^{(n)}_{d'd'}} = \sigma_u^2 \frac{1}{N_{n,dd'}}$$

which completes the proof.

Note that the bound given in the second part of Corollary 3 can be generalized to nodes $d, d' \in \mathcal{D}_n$ separated by any distance in $G_n$ and is sharp since the comparison will hold with equality if all of the paths from $d$ to $d'$ in $G_n$ include the edge $N_{n,dd'}$. The bound is also finitely bounded with sparsely matched data,^3 unlike the bound given in Theorem 6 of Jochmans and Weidner (2016).^4 Note that one could also use the best linear unbiased estimator property of ordinary least squares under homoscedasticity and serial and cross-sectional uncorrelation of transitory shocks to obtain similar bounds for $Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'}))$ as in Corollary 3.

^3 The bound extended to nodes $d, d' \in \mathcal{D}_n$ separated by a finitely bounded distance in $G_n$ would also be finitely bounded.

^4 The bound given in Jochmans and Weidner (2016) does not depend on distance and relies on the inverse of $\lambda_{2,n}$, the smallest positive eigenvalue of $I_n - D_n^{-\frac{1}{2}} P_n D_n^{\frac{1}{2}}$. In general $\lambda_{2,n} \to 0$ as $n \to \infty$ with sparsely matched data ($N_{n,d} \le C < \infty$ $\forall\, d \in \mathcal{D}_n$, $\forall\, n$).
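Both parts of Corollary 3 can be illustrated on a small weighted graph. The sketch below is an illustration under stated assumptions, not part of the original proof: the $3 \times 3$ matrix `N` of weights $N_{n,dd'}$ is a hypothetical example, the helper names are invented, and $\sigma_u^2$ is normalized to one. The quadratic form in $(g_{n,2}' M_{g_{n,1}} g_{n,2})^{-}$ is computed exactly by fixing the fixed effect of $d'$ at zero and solving the reduced system, which is one valid choice of generalized inverse for contrasts $e_d - e_{d'}$. The expected number of visits ${}^{d'}G^{(n)}_{dd}$ is approximated separately by summing powers of $P_n$ restricted to states other than $d'$.

```python
from fractions import Fraction as F

# Hypothetical weights N_{n,dd'}; row sums are N_{n,d}.
N = [[F(3), F(1, 2), F(1, 2)],
     [F(1, 2), F(1), F(1, 2)],
     [F(1, 2), F(1, 2), F(1)]]
size = len(N)
N_d = [sum(row) for row in N]
# g'Mg = diag(N_d) - [N_dd'], as in (A.3).
S = [[(N_d[a] if a == b else 0) - N[a][b] for b in range(size)] for a in range(size)]


def solve(A, b):
    """Gauss-Jordan elimination with exact fractions (A assumed nonsingular)."""
    m = len(A)
    M = [list(A[r]) + [b[r]] for r in range(m)]
    for col in range(m):
        piv = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]


def variance_factor(d, dp):
    """(e_d - e_d')' (g'Mg)^- (e_d - e_d'), normalizing the effect of d' to zero."""
    keep = [a for a in range(size) if a != dp]
    S_red = [[S[a][b] for b in keep] for a in keep]
    rhs = [F(1) if a == d else F(0) for a in keep]
    return solve(S_red, rhs)[keep.index(d)]


def expected_visits(d, dp, iters=400):
    """Approximate {}^{d'}G_{dd}: expected visits to d, starting at d, before reaching d'."""
    keep = [a for a in range(size) if a != dp]
    Q = [[float(N[a][b] / N_d[a]) for b in keep] for a in keep]
    row = [1.0 if a == d else 0.0 for a in keep]
    total = 0.0
    for _ in range(iters):  # truncated sum of powers of Q
        total += row[keep.index(d)]
        row = [sum(row[a] * Q[a][b] for a in range(len(keep))) for b in range(len(keep))]
    return total
```

In this example every pair of distinct teachers is directly connected with weight $\frac{1}{2}$, so the bound in (A.35) is $2\sigma_u^2$ for every pair, and `variance_factor` can be compared against it directly.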

B Asymptotic Properties under Strict Exogeneity

In this section, the proofs of Propositions 1-3 are given. For notational simplicity I consider the case where $\dim(x_{it}) = 1$. All proofs are easily extended to the case of multidimensional covariates.

Define $\ddot{z}_n = M_n \tilde{z}_n$ and $\ddot{z}_{it}$ to be the element of $\ddot{z}_n$ corresponding to observation $\{i, t\}$.^5

^5 For notational simplicity I drop the subscript $n$ here, writing $\ddot{z}_{it}$ instead of $\ddot{z}_{n,it}$.

I shorten the term near-epoch dependent to NED. A definition of near-epoch dependence is given by Definition 1 in Jenish and Prucha (2012).

When Assumption 3 is used, denote by $p_i \in \mathcal{P}_n$ the set in $\mathcal{P}_n$ that contains $i$.

When Assumption 4 is used, let $\tilde{\rho}(A, B) = \inf_{\{i,l\} \in A, \{j,l'\} \in B} \tilde{\rho}(\{i,l\}, \{j,l'\})$ for $A, B \subseteq \tilde{Q}$.

B.1 Preliminary Results

Firstly, note that Assumption 2-c and Jensen's inequality imply $|\tilde{z}_{it}|^{4+\delta} \le B$ since $\tilde{z}_{it} = E(x_{it} | Z)$.

From Corollary 2 and Minkowski's inequality, we also have $\|M_n\|_\infty \le C$ under $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $\forall\, n$. This implies $|\ddot{z}_{it}|^{4+\delta} \le CB$ by Minkowski's inequality.^6

^6 As noted above, I use $C$ as a generic large constant so that we write, for instance, $|\ddot{z}_{it}|^{4+\delta} \le CB$ when it can be shown that $|\ddot{z}_{it}|^{4+\delta} \le C^{4+\delta} B$.

Also note that $\ddot{z}_{it} = \sum_{\sigma \in \mathcal{S}_n : i \in \sigma} M_{g_n^{(\sigma)}}[\{i,t\}, .]\, \tilde{z}_n^{(\sigma)}$ if $\exists\, \sigma \in \mathcal{S}_n$ such that $i \in \sigma$, and $\ddot{z}_{it} = 0$ otherwise. Together with $|\ddot{z}_{it}|^{4+\delta} \le CB$ and $\max_{i,i' \in s, s \in \mathcal{S}_n, t,t' \in \mathcal{T}} \tilde{\rho}(\{i, l_{it}\}, \{i', l_{i't'}\}) \le C$ (Assumption 4-d), this implies that the random field $\{\ddot{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde{Q}$ equipped with semi-metric $\tilde{\rho}$, with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ and constant NED scaling factors when Assumption 4 is used.

The next lemma establishes that using the semi-metric $\tilde{\rho}$ defined in Assumption 4 leads to the same cardinalities for basic sets in $\tilde{Q}$ as if $\rho$ were used with $Q$ (up to an additive constant).

Therefore the results in Jenish and Prucha (2009) and Jenish and Prucha (2012) can be used in our slightly modified setting where the semi-metric $\tilde{\rho}$ is used on the lattice $\tilde{Q}$ instead of the metric $\rho$ on $Q$.

Consider the basic sets and cardinalities of basic sets defined in Lemma A.1 of Jenish and Prucha (2009) for $U, V \subset Q$, $U \cap V = \emptyset$, $l \in U$:

$$B_l(h) = \{l' \in Q : \rho(l, l') \le h\} \tag{B.1}$$

$$N_l(m_1, m_2, m_3) = \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq U \text{ with } l \in A, B \subseteq V, \text{ and } \exists\, l' \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.2}$$

and the corresponding basic sets and cardinalities of basic sets in $\tilde{Q}$, for $\tilde{U} = \{\{j, l'\} \in \tilde{Q} : l' \in U\}$, $\tilde{V} = \{\{j, l'\} \in \tilde{Q} : l' \in V\}$, $\{i, l\} \in \tilde{U}$:^7

$$\tilde{B}_{il}(h) = \{\{j, l'\} \in \tilde{Q} : \tilde{\rho}(\{i,l\}, \{j,l'\}) \le h\} \tag{B.3}$$

$$\tilde{N}_{il}(m_1, m_2, m_3) = \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i,l\} \in A, B \subseteq \tilde{V}, \text{ and } \exists\, \{j,l'\} \in B \text{ with } m_3 \le \tilde{\rho}(\{i,l\}, \{j,l'\}) < m_3 + 1\} \tag{B.4}$$

^7 Note that $\tilde{U} \cap \tilde{V} = \emptyset$. We also have $\#\tilde{V} = \#V$ and $\#\tilde{U} = \#U$ from Assumption 4-a and the definitions of $Q$ and $\tilde{Q}$.

Lemma 3. $\forall\, m_1, m_2, m_3 \in \mathbb{N}$, $U, V \subset Q$, $U \cap V = \emptyset$, $\{i, l\} \in \tilde{U}$:

$$\#\tilde{B}_{il}(h) \le \#B_l(h) + T \tag{B.5}$$

$$\tilde{N}_{il}(m_1, m_2, m_3) \le N_l(m_1, m_2, m_3) \tag{B.6}$$

Proof. The first inequality follows from

$$\tilde{B}_{il}(h) = \{\{j, l'\} \in \tilde{Q} : \rho(l, l') \le h \text{ or } j = i\} \tag{B.7}$$

$$\subseteq \{\{j, l'\} \in \tilde{Q} : l' \in B_l(h)\} \cup \{\{i, l'\} \in \tilde{Q} : l' \in Q\} \tag{B.8}$$

and $\#(A \cup B) \le \#A + \#B$, $\#\{\{j, l'\} \in \tilde{Q} : l' \in B_l(h)\} = \#B_l(h)$, $\#\{\{i, l'\} \in \tilde{Q} : l' \in Q\} = T$.

The second inequality follows from

$$\tilde{\rho}(\{i,l\}, \{j,l'\}) = \begin{cases} 0 & \text{if } i = j \\ \rho(l, l') & \text{otherwise} \end{cases} \tag{B.9}$$

and $m_3 \ge 1$, so that:

$$\#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i,l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j,l'\} \in B \text{ with } m_3 \le \tilde{\rho}(\{i,l\}, \{j,l'\}) < m_3 + 1\}$$

$$= \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i,l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j,l'\} \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1, i \neq j\} \tag{B.10}$$

$$\le \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i,l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j,l'\} \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.11}$$

$$= \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq U \text{ with } l \in A, B \subseteq V \text{ and } \exists\, l' \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.12}$$

The next two lemmas show convergence results that will be used in the proofs of Propositions 1 and 2.

Lemma 4. Under Assumption 2, and Assumption 3 or Assumption 4, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:

$$\frac{1}{n} \tilde{z}_n' M_n x_n - A_n \overset{p}{\to} 0 \tag{B.13}$$

where $A_n = \frac{1}{n} E(\tilde{z}_n' M_n \tilde{z}_n)$.

Proof. Note that by definition of $\tilde{z}_n$, $A_n = E(\frac{1}{n} \tilde{z}_n' M_n x_n)$.

Proof using Assumption 3

Under Assumption 3, we show convergence in mean-squared error, which implies convergence in probability.

Note that under $\mathcal{P}_n$ forming a partition of $\mathcal{N}$ and $\{\{d : d_{it} = d, i \in p, t \in \mathcal{T}\}\}_{p \in \mathcal{P}_n}$ forming a partition of $\mathcal{D}_n$, $\forall\, \sigma \in \mathcal{S}_n$, $M_{g_n^{(\sigma)}}$ will be block-diagonal, with each block corresponding to an element in $\mathcal{P}_n$, so that $\ddot{z}_{it}$ is a function of $\{\tilde{z}_{js}\}_{j \in p_i, s \in \mathcal{T}}$ only.

Jointly with $\forall\, p, p' \in \mathcal{P}_n$, $p \neq p'$, $F(\{x_{it}, z_{it}\}_{i \in p, t \in \mathcal{T}}, \{x_{it}, z_{it}\}_{i \in p', t \in \mathcal{T}}) = F(\{x_{it}, z_{it}\}_{i \in p, t \in \mathcal{T}})\, F(\{x_{it}, z_{it}\}_{i \in p', t \in \mathcal{T}})$, this implies:

$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big) = \frac{1}{n^2} \sum_{\{i,t\} \in \mathcal{N} \times \mathcal{T}}\ \sum_{j \in p_i, s \in \mathcal{T}} E(\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}) \tag{B.14}$$

Under Assumption 2-c we have $|\ddot{z}_{it}|^{4+\delta} \le CB$ as noted earlier in this section, and $E(|\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}|\ |Z) \le CB$ by Hölder's inequality. Since $E(B) < C$ by Assumption 2-c, we have $E(|\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}|) \le C$ and $|E(\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js})| \le C$.

Hence, under Assumptions 2 and 3:

$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big) = O\Big(\frac{1}{n^2}\, n \max_{i \in \mathcal{N}} \#p_i\Big) \tag{B.15}$$

$$= O\Big(\frac{1}{\#\mathcal{P}_n} \frac{\max_{i \in \mathcal{N}} \#p_i}{\min_{i \in \mathcal{N}} \#p_i}\Big) \tag{B.16}$$

$$= O\Big(\frac{1}{\#\mathcal{P}_n}\Big) \tag{B.17}$$

$$= o(1) \tag{B.18}$$

where the third and fourth equalities follow from Assumption 3.

Proof using Assumption 4

Under Assumption 4 we prove the result of this lemma by using the $L_1$-law of large numbers found in Jenish and Prucha (2012), which implies convergence in probability.

As noted at the beginning of this Section, $\{\ddot{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde{Q}$ equipped with semi-metric $\tilde{\rho}$, with NED coefficients and NED scaling factors satisfying the conditions of Theorem 1 of Jenish and Prucha (2012).

$E(|\ddot{z}_{it} x_{it}|^{1+\delta}) \le C$ is obtained by Hölder's inequality, Assumption 2-c, and Jensen's inequality as in the first part of this proof.

Assumption 4-b guarantees that $\{x_{it}, \tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is an $\alpha$-mixing random field on $\tilde{Q}$ equipped with semi-metric $\tilde{\rho}$, with $\alpha$-mixing coefficients satisfying the condition of Theorem 1 of Jenish and Prucha (2012).

Finally Lemma 3 shows that the same cardinalities for basic sets are obtained on $\tilde{Q}$ with $\tilde{\rho}$ as are obtained on $Q$ with $\rho$, so that Assumption 4-a in this paper can be used in the place of Assumption 1 of Jenish and Prucha (2012).

Therefore all of the conditions for Theorem 1 of Jenish and Prucha (2012) are verified, and we can invoke it to obtain:

$$E\Big(\Big|\frac{1}{n} \tilde{z}_n' M_n x_n - E\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big)\Big|\Big) \to 0 \tag{B.19}$$

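Lemma 5. Under (2.3) and (2.4) when $z_{it} = z_{is}$ $\forall\, t, s \in \mathcal{T}$, and Assumptions 1-3, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:

$$\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z) - V_n \overset{a.s.}{\to} 0 \tag{B.20}$$

where $V_n = \frac{1}{n} Var(\tilde{z}_n' M_n u_n)$.

Proof. We have $E(\tilde{z}_n' M_n u_n | Z) = 0$ and, under Assumptions 1 and 3:

$$\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z) = \frac{1}{\#\mathcal{P}_n} \sum_{p \in \mathcal{P}_n} \frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j\, \vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js} E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}}) \tag{B.21}$$

As in the proof of Lemma 4, $\ddot{z}_{it}$ is a function of $\{\tilde{z}_{js}\}_{j \in p_i, s \in \mathcal{T}}$ only.

Therefore under Assumption 3 one only needs to verify the Markov condition in order to invoke the strong law of large numbers for independent observations given by Theorem 3.7 in White (2001). By Minkowski's inequality:

$$E\Big(\Big|\frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j\, \vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js} E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\Big|^{1+\delta}\Big)^{\frac{1}{1+\delta}}$$

$$\le \frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j\, \vee\, t = s, d_{it} = d_{jt}} E\big(|\ddot{z}_{it} \ddot{z}_{js} E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})|^{1+\delta}\big)^{\frac{1}{1+\delta}} \tag{B.22}$$

By Jensen's inequality, Hölder's inequality, and Assumption 2-c:

$$E\big(|\ddot{z}_{it} \ddot{z}_{js} E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})|^{1+\delta}\big)^{\frac{1}{1+\delta}} \le C \tag{B.23}$$

Therefore, since $N_{n,d} \le C$:

$$E\Big(\Big|\frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j\, \vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js} E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\Big|^{1+\delta}\Big)^{\frac{1}{1+\delta}} \le C \frac{\#\mathcal{P}_n}{n} \sum_{i \in p} CT \tag{B.24}$$

$$\le C \frac{\#\mathcal{P}_n\, \#p}{n} \tag{B.25}$$

Under Assumption 3, we have:

$$\frac{\#\mathcal{P}_n\, \#p}{n} = \frac{\#\mathcal{P}_n\, \#p}{\sum_{p' \in \mathcal{P}_n} \#p'} \tag{B.26}$$

$$\le \frac{\max_{p' \in \mathcal{P}_n} \#p'}{\min_{p' \in \mathcal{P}_n} \#p'} \tag{B.27}$$

$$\le C \tag{B.28}$$

Hence Markov's condition is satisfied and the desired result is obtained.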

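Finally the last lemma shows that Assumptions 1 and 4 imply that $u_{it}$ is $\alpha$-mixing on $\tilde{Q}$, using $\tilde{\rho}$ as a measure of distance.

Lemma 6. Under Assumptions 1 and 4, $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, t \in \mathcal{T}, \sigma \in \mathcal{S}_n, n \in \mathbb{N}\}$ is an $\alpha$-mixing random field on the lattice $\tilde{Q}$ equipped with the semi-metric $\tilde{\rho}$, with $\alpha$-mixing coefficients satisfying Assumption 3 of Jenish and Prucha (2012).

Proof. Let $\mathcal{U}$ and $\mathcal{B}$ be two $\sigma$-algebras of $\mathcal{F}(\{\tilde{z}_{it}, z_{it}\} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N})$, define:

$$\alpha(\mathcal{U}, \mathcal{B}) = \sup_{A \in \mathcal{U}, B \in \mathcal{B}} |P(AB) - P(A)P(B)| \tag{B.29}$$

For $U \subseteq \tilde{Q}$, let $\mathcal{F}_n(U) = \mathcal{F}(\{\tilde{z}_{it}, z_{it}\} : i \in \mathcal{N}, t \in \mathcal{T}, \{i, l_{it}\} \in U)$ to define:

$$\alpha_n(U, V) = \alpha(\mathcal{F}_n(U), \mathcal{F}_n(V)) \tag{B.30}$$

for $U, V \subseteq \tilde{Q}$ and:

$$\bar{\alpha}(c_1, c_2, r) = \sup_n \sup_{U,V} (\alpha_n(U, V) : \#U \le c_1, \#V \le c_2, \tilde{\rho}(U, V) \ge r) \tag{B.31}$$

Assumption 4-b) implies:

$$\bar{\alpha}(c_1, c_2, r) \le \varphi(c_1, c_2)\, \hat{\alpha}(r) \tag{B.32}$$

$$\varphi(c_1, c_2) = (c_1 + c_2)^\tau,\ \tau \ge 0 \tag{B.33}$$

$$\sum_{r=1}^{\infty} r^{q(\tau_* + 1) - 1}\, \hat{\alpha}^{\frac{\delta}{2(2+\delta)}}(r) < \infty,\ \delta > 0,\ \tau_* = \delta\tau/(2 + \delta) \tag{B.34}$$

Define $\alpha_{zu}$, $\mathcal{F}_{zu,n}$, $\alpha_{zu,n}$ and $\bar{\alpha}_{zu}$ similarly as above but for the random field $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, t \in \mathcal{T}, \sigma \in \mathcal{S}_n, n \in \mathbb{N}\}$.

For $A \in \mathcal{F}_{zu,n}(U)$, $B \in \mathcal{F}_{zu,n}(V)$, $U, V \subset \tilde{Q}$, $\tilde{\rho}(U, V) > 2C$, $i \in \sigma$, $i' \in \sigma'$, $\sigma, \sigma' \in \mathcal{S}_n$, $t, t' \in \mathcal{T}$, $\{i, l_{it}\} \in U$, $\{i', l_{i't'}\} \in V$, under Assumption 4-d we have $i \neq i'$, and $d_{it} \neq d_{i't}$ if $t = t'$.

Therefore under Assumption 1:

$$P(AB | Z) = P(A | Z)\, P(B | Z) \tag{B.35}$$

Let $E = \{\{i', t'\} : \exists\, \{i, l_{it}\} \in U \text{ with } i = i', \text{ or } t = t' \text{ and } d_{it} = d_{i't'}, i \in \sigma, \sigma \in \mathcal{S}_n\}$. Under Assumption 4-c we also have:

$$P(A | Z) = P(A | \{z_{it}\}_{\{i,t\} \in E}, \{\tilde{z}_{it}\}_{\{i,t\} \in \mathcal{N} \times \mathcal{T} : \{i, l_{it}\} \in U}) \tag{B.36}$$

Under Assumption 4-d, $\mathcal{F}(\{z_{it}\}_{i,t \in E}, \{\tilde{z}_{it}\}_{\{i,t\} \in \mathcal{N} \times \mathcal{T} : \{i, l_{it}\} \in U}) = \mathcal{F}_n(U_{zd})$ where $U_{zd} \subseteq \tilde{Q}$ is such that $\max_{\{i,l\} \in U_{zd}} \min_{\{i',l'\} \in U} \tilde{\rho}(\{i,l\}, \{i',l'\}) \le C$.

Similarly for $P(B | Z)$, which generates a $\sigma$-algebra $\mathcal{F}_n(V_{zd})$ where $V_{zd} \subseteq \tilde{Q}$ is such that $\max_{\{i,l\} \in V_{zd}} \min_{\{i',l'\} \in V} \tilde{\rho}(\{i,l\}, \{i',l'\}) \le C$.

Therefore, by the law of iterated expectations, we have:

$$P(AB) - P(A)P(B) = Cov(X_1, X_2) \tag{B.37}$$

where $X_1$ and $X_2$ are random variables such that $\mathcal{F}(X_1) = \mathcal{F}_n(U_{zd})$ and $\mathcal{F}(X_2) = \mathcal{F}_n(V_{zd})$ and $\tilde{\rho}(U_{zd}, V_{zd}) \ge \tilde{\rho}(U, V) - 2C$.

$0 \le X_1, X_2 \le 1$, so that by Lemma A.2 of Jenish and Prucha (2012) and Assumption 4-b:

$$Cov(X_1, X_2) \le 4\alpha(\mathcal{F}_n(U_{zd}), \mathcal{F}_n(V_{zd})) \tag{B.38}$$

$$\le 4(\#U_{zd} + \#V_{zd})^\tau\, \hat{\alpha}(\tilde{\rho}(U, V) - 2C) \tag{B.39}$$

Since $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$ and $\#\{\{i', t'\} : i = i'\} = T$ $\forall\, i \in \mathcal{N}$, we have $\#U_{zd} \le CT \cdot \#U$, and similarly for $\#V_{zd}$. Defining $\hat{\alpha}_u(r) = \hat{\alpha}(r - 2C)$, we have:

$$|P(AB) - P(A)P(B)| \le C(\#U + \#V)^\tau\, \hat{\alpha}_u(\tilde{\rho}(U, V)) \tag{B.40}$$

which yields the desired result since one can easily verify that $\hat{\alpha}_u$ satisfies

$$\sum_{r=1}^{\infty} r^{q(\tau_* + 1) - 1}\, \hat{\alpha}_u^{\frac{\delta}{2(2+\delta)}}(r) < \infty \tag{B.41}$$

as long as (B.34) is satisfied.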

B.2 Proof of Proposition 1

Note that:

$$\hat{\beta}_n = \beta_0 + \Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big)^{-1} \frac{1}{n} \tilde{z}_n' M_n u_n \tag{B.42}$$

B.2.1 Proof of Consistency

There are several ways of proving consistency of $\hat{\beta}_n$. Here we will show convergence in mean-squared error, which implies convergence in probability.

Under (2.3) and (2.4):

$$E\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = 0 \tag{B.43}$$

Under Assumption 1:

$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = \frac{1}{n^2} \sum_{i,t \in \mathcal{N} \times \mathcal{T}}\ \sum_{j,s :\, i = j\, \vee\, t = s, d_{it} = d_{jt}} E(\ddot{z}_{it} \ddot{z}_{js} u_{it} u_{js}) \tag{B.44}$$

Assumption 2-c, $|\ddot{z}_{it}|^{4+\delta} \le CB$, Hölder's inequality, and Jensen's inequality yield $|E(\ddot{z}_{it} \ddot{z}_{js} u_{it} u_{js})| \le C$. Therefore, since $\#\{\{j, s\} \in \mathcal{N} \times \mathcal{T} : i = j\, \vee\, t = s, d_{it} = d_{jt}\} \le CT$ $\forall\, \{i, t\} \in \mathcal{N} \times \mathcal{T}$, we have:

$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = O\Big(\frac{1}{n^2}\, n\Big) \tag{B.45}$$

$$= o(1) \tag{B.46}$$

Combining this result with Lemma 4, Assumption 2-a), and the continuous mapping theorem, we obtain consistency of $\hat{\beta}_n$.

B.2.2 Proof of Asymptotic Normality

From Lemma 4 and the asymptotic equivalence lemma, it suffices to show that:

$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.47}$$

Note that, even though we have a martingale difference sequence in each time period, so that one could show $\big(\frac{1}{n} Var(\sum_{i \in \mathcal{N}} \ddot{z}_{it} u_{it})\big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}} \ddot{z}_{it} u_{it} \overset{d}{\to} N(0, 1)$ for each $t \in \mathcal{T}$ by using a central limit theorem for martingale difference sequences, joint normality of these quantities is not guaranteed since cluster membership varies over time ($d_{it} \neq d_{is}$ is possible for $t \neq s$).

Here we split the proof depending on whether Assumption 3 or Assumption 4 is used.

Asymptotic Normality under Assumption 3

In this part of the proof, we first show asymptotic normality of $\frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n$ conditional on $Z$, before invoking a law of large numbers and the dominated convergence theorem to yield the desired result.

$$\frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n = \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}, t \in \mathcal{T}} \ddot{z}_{it} u_{it} \tag{B.48}$$

For the purpose of this proof, define the lattice $L = \mathbb{N}^T$. Define the location of a cross-sectional observation $i$ on $L$ to be $k_i = [d_{i1}, \ldots, d_{iT}]$. Define the metric $\varrho$ on $L$ to be $\varrho(k, k') = \sum_{t=1}^{T} 1[k[t] \neq k'[t]]$, $\forall\, k, k' \in L$. One can easily verify that $\varrho$ is a valid metric, i.e. satisfies positivity, definiteness, symmetry, and the triangle inequality.

In order to show asymptotic normality conditional on $Z$, we will verify that the conditions for Theorem 1 of Jenish and Prucha (2009) hold using the lattice $L$ equipped with metric $\varrho$.

Note that under $N_{n,d} \le C$, $\#\{j \in \mathcal{N} : \varrho(\{d_{it}\}_{t \in \mathcal{T}}, \{d_{jt}\}_{t \in \mathcal{T}}) < T\} \le CT$ $\forall\, i \in \mathcal{N}$. Hence the conditions on cardinality of Lemma A.1 in Jenish and Prucha (2009) clearly hold for the lattice $L$ equipped with the metric $\varrho$, which will stand in lieu of Assumption 1 of Jenish and Prucha (2009).

In addition, Assumption 1 implies $\{\{u_{it}\}_{t \in \mathcal{T}}\}_{i : k_i \in A} \perp \{\{u_{it}\}_{t \in \mathcal{T}}\}_{i : k_i \in B}\ |\ Z$ if $\varrho(A, B) \ge T$, $A, B \subset L$. Hence since independence trivially implies $\alpha$-mixing, Assumption 3 of Jenish and Prucha (2009) is also satisfied.

As was noted previously, Assumption 2-c implies $E(|\ddot{z}_{it} u_{it}|^{2+\delta} | Z) \le C$, so that Assumption 2 of Jenish and Prucha (2009) is satisfied.

From Lemma 5 and Assumption 2-b, we have $\liminf_{n \to \infty} \lambda_{\min}(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)) \ge c > 0$ a.e., so that Assumption 5 of Jenish and Prucha (2009) is satisfied.

Therefore we can invoke Theorem 1 of Jenish and Prucha (2009) to obtain:

$$\Big(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)\Big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1)\ |\ Z, \text{ a.e.} \tag{B.49}$$

By the dominated convergence theorem we also have:

$$\Big(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)\Big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.50}$$

And from Lemma 5 and the asymptotic equivalence lemma, we have:

$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.51}$$
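The metric $\varrho$ used in this part of the proof counts the number of periods in which two students have different teachers, i.e. it is the Hamming distance between teacher histories. As an illustration only, the sketch below implements it and enumerates a small grid of histories so that the metric axioms can be checked by brute force. The function name `varrho` and the grid of three teacher labels over three periods are assumptions made for the example.

```python
from itertools import product


def varrho(k, k_prime):
    """Number of periods t in which the two teacher histories differ."""
    return sum(a != b for a, b in zip(k, k_prime))


# Hypothetical grid: all histories with T = 3 periods and teacher labels 0, 1, 2.
histories = list(product(range(3), repeat=3))


def shares_a_match(k_i, k_j):
    """True when the two students share a teacher in at least one period."""
    return varrho(k_i, k_j) < len(k_i)
```

Two students are at distance $T$ exactly when they never share a teacher in the same period, which is the case in which Assumption 1 is used above to obtain conditional independence of their errors.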

Asymptotic Normality under Assumption 4 Write: 1 0 1 √ z ˜n Mn un = √ n n 1 =√ n

X

z¨it uit

(B.52)

i∈N ,t∈T

X

z¨it uit

(B.53)

i∈N ,t∈T :i∈σ,σ∈Sn

where the last equality follows from z¨it = 0 if @ σ ∈ Sn s.t. i ∈ σ. It is noted at the beginning of Section B.1 that the random field {¨ zit : i ∈ N , t ∈ T , n ∈ N} ˜ equipped with ρ˜ with NED is L2 -NED on the random field {˜ zit : i ∈ N , t ∈ T , n ∈ N} on Q coefficients equal to zero after a fixed distance and constant NED scaling factors. In addition, Lemma 6 shows that {{˜ zit , uit } : i ∈ σ, σ ∈ Sn , t ∈ T , n ∈ N} is an α-mixing ˜ equipped with the semi-metric ρ˜ with α-mixing coefficients satisfying random field on Q Assumption 3 of Jenish and Prucha (2012).


Therefore the random field $\{\ddot z_{it} u_{it} : i \in \sigma, \sigma \in \mathcal{S}_n, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the $\alpha$-mixing random field $\{\{\tilde z_{it}, u_{it}\} : i \in \sigma, \sigma \in \mathcal{S}_n, t \in \mathcal{T}, n \in \mathbb{N}\}$ with NED coefficients and scaling factors satisfying Assumptions 4-c and 4-d of Jenish and Prucha (2012) and $\alpha$-mixing coefficients satisfying Assumption 3 of Jenish and Prucha (2012). Assumption 2-b guarantees that Assumption 4-b of Jenish and Prucha (2012) holds. As was noted previously, Assumption 2-c implies $E(|\ddot z_{it} u_{it}|^{2+\delta}) \leq C$, so that Assumption 4-a of Jenish and Prucha (2012) is satisfied. As was noted previously, Lemma 3 also shows that the same cardinalities for basic sets are obtained on $\tilde Q$ with $\tilde\rho$ as are obtained on $Q$ with $\rho$, so that Assumption 4-a in this paper can be used in lieu of Assumption 1 of Jenish and Prucha (2012). Therefore all conditions for applying Theorem 2 of Jenish and Prucha (2012) are met, and we have:

$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}}\, \tilde z_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.54}$$

B.3

Proof of Proposition 2

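The first result of Proposition 2 is shown by Lemma 4.

Let $\hat V_n = \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} \hat v_n^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1)} \hat v_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)}$. We need to show that $\hat V_n - V_n = o_p(1)$.

Since $M_{g_n^{(\sigma)}} g_n^{(\sigma)} = 0$, we have:

$$\begin{aligned}
\hat V_n &= \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} u_n^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1)} u_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)\, \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} x_n^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1)} u_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)\, \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} u_n^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1)} x_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)^2\, \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} x_n^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1)} x_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\equiv T_{n,1} + (\hat\beta_n - \beta_0) T_{n,2} + (\hat\beta_n - \beta_0) T_{n,3} + (\hat\beta_n - \beta_0)^2 T_{n,4}
\end{aligned}$$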
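To make the structure of $\hat V_n$ concrete, the following minimal sketch computes a variance estimate of this form for a scalar regressor: each group contributes a score, and scores are multiplied only across overlapping groups. The function names, the dictionary layout, and the use of the Moore-Penrose inverse as the generalized inverse are assumptions made for illustration, not the author's implementation.

```python
import numpy as np


def annihilator(g):
    """M_g = I - g (g'g)^- g', with the Moore-Penrose inverse as (g'g)^-."""
    return np.eye(g.shape[0]) - g @ np.linalg.pinv(g.T @ g) @ g.T


def group_score(z, g, v):
    """Scalar score z' M_g v for one group (z and v are 1-D arrays)."""
    return float(z @ annihilator(g) @ v)


def overlap_variance(scores, neighbours, n):
    """(1/n) * sum over groups s1 of score[s1] * sum of score[s2], s2 in U(s1).

    scores:     dict mapping a group label to its scalar score
    neighbours: dict mapping a group label to the labels of overlapping groups
                (hypothetical stand-in for U_n(sigma); it should contain the
                group itself)
    """
    total = 0.0
    for s1, score in scores.items():
        total += score * sum(scores[s2] for s2 in neighbours[s1])
    return total / n


# Hypothetical example with three groups; "b" overlaps with both "a" and "c".
scores = {"a": 1.0, "b": 2.0, "c": -0.5}
neighbours = {"a": ["a", "b"], "b": ["a", "b", "c"], "c": ["b", "c"]}
v_hat = overlap_variance(scores, neighbours, n=10)
```

When every group overlaps only with itself, the double sum collapses to a sum of squared group scores, which is the usual cluster-robust form.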

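For $\sigma \in \mathcal{S}_n$, let $\ddot z_n^{(\sigma)} = M_{g_n^{(\sigma)}} \tilde z_n^{(\sigma)}$ and let $\ddot z_{it}^{(\sigma)}$ be the element of $\ddot z_n^{(\sigma)}$ corresponding to observation $\{i, t\} \in \sigma \times \mathcal{T}$.$^{8}$ We have:

$$T_{n,1} = \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}}\ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in \mathcal{U}_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js} \tag{B.55}$$

Under Assumption 1, $E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js}) = 0$ if $1[i = j, \text{ or } d_{it} = d_{jt} \text{ and } t = s] = 0$. Since $1[i = j, \text{ or } d_{it} = d_{jt} \text{ and } t = s] \leq 1[\sigma' \in \mathcal{U}_n(\sigma)]$ for $i \in \sigma$, $j \in \sigma'$, we have:

$$\begin{aligned}
V_n &= \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}}\ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[i = j \text{ or } (d_{it} = d_{jt} \text{ and } t = s)]\, E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js}) \\
&= \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}}\ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in \mathcal{U}_n(\sigma)]\, E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js})
\end{aligned}$$

$\forall\, \{i, t\} \in \mathcal{N} \times \mathcal{T}$, under the conditions of Proposition 2, we have:

$$\#\{j \in \mathcal{N}, s \in \mathcal{T}, \sigma, \sigma' \in \mathcal{S}_n : i \in \sigma, j \in \sigma', \sigma' \in \mathcal{U}_n(\sigma)\} \leq C \tag{B.56}$$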

so that as in the preliminary results of this section, Assumption 2-c, Minkowski's inequality and Jensen's inequality can be combined to obtain:

$$E\Big(\Big|\sum_{\sigma\in\mathcal{S}_n:\ i\in\sigma}\ \sum_{\sigma'\in\mathcal{U}_n(\sigma)}\ \sum_{j\in\sigma',\, s\in\mathcal{T}} \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js}\Big|^{1+\delta}\Big)^{\frac{1}{1+\delta}} \leq C \tag{B.57}$$

Under Assumption 3, we can then show that $T_{n,1} - V_n = o_p(1)$ by using Theorem 3.7 of White (2001) as in the proof of Lemma 5. Under Assumption 4 we have:

$$\max_{j\in\mathcal{N},\, s\in\mathcal{T}:\ \sigma,\sigma'\in\mathcal{S}_n,\ i\in\sigma,\ j\in\sigma',\ \sigma'\in\mathcal{U}_n(\sigma)} \rho(l_{it}, l_{js}) \leq C \tag{B.58}$$

Under Assumption 4, we can then show that $T_{n,1} - V_n = o_p(1)$ by using Theorem 1 of Jenish and Prucha (2012) as in the proof of Lemma 4.

8 As for $\ddot z_{it}$, I drop the subscript $n$ for notational simplicity here, writing $\ddot z_{it}^{(\sigma)}$ instead of $\ddot z_{n,it}^{(\sigma)}$.

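Under the conditions of Proposition 2 and Assumption 2-c we also have:

$$E\Big(\Big|\sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in \mathcal{U}_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} x_{it} u_{js}\Big|^{1+\delta}\Big) \leq C \tag{B.59}$$

and:

$$E\Big(\Big|\sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in \mathcal{U}_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} x_{it} x_{js}\Big|^{1+\delta}\Big) \leq C \tag{B.60}$$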

Therefore we can use Markov’s inequality to obtain Tn,2 = Op (1), Tn,3 = Op (1), Tn,4 = Op (1). Proposition 1 shows that βˆn − β0 = op (1), so that applying the continuous mapping theorem concludes the proof.

B.4

Proof of Proposition 3

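If $\mathcal{S}_n$ is a partition of $\mathcal{N}$ and $\{\{d_{it}\}_{i\in\sigma, t\in\mathcal{T}}\}_{\sigma\in\mathcal{S}_n}$ is a partition of $\mathcal{D}_n$, we have $M_n = M_{g_n}$. Under serial and cross-sectional uncorrelation we also have:

$$\begin{aligned}
B_n^{-1} W_n B_n^{-1} - A_n^{-1} V_n A_n^{-1}
&= B_n^{-1} W_n B_n^{-1} - E\big(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\big)^{-1} && \text{(B.61)} \\
&= E\big(\tfrac{1}{n} \acute z_n' M_{g_n} x_n\big)^{-1} E\big(\tfrac{1}{n} \acute z_n' M_{g_n} \acute z_n\big) E\big(\tfrac{1}{n} x_n' M_{g_n} \acute z_n\big)^{-1} - E\big(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\big)^{-1} && \text{(B.62)} \\
&= E\big(\tfrac{1}{n} \acute z_n' M_{g_n} \tilde z_n\big)^{-1} E\big(\tfrac{1}{n} \acute z_n' M_{g_n} \acute z_n\big) E\big(\tfrac{1}{n} \tilde z_n' M_{g_n} \acute z_n\big)^{-1} - E\big(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\big)^{-1} && \text{(B.63)} \\
&= E\big(\tfrac{1}{n} v_n' M_{g_n} v_n\big) && \text{(B.64)}
\end{aligned}$$

where $v_n \equiv \acute z_n E(\tilde z_n' M_{g_n} \acute z_n)^{-1} - \tilde z_n E(\tilde z_n' M_{g_n} \tilde z_n)^{-1}$. $M_{g_n}$ is symmetric idempotent and the expected value of a quadratic form is positive semi-definite, so that the desired result is obtained.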
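The last step relies on the projection residual maker being symmetric and idempotent, so that any quadratic form in it is positive semi-definite. The sketch below checks this numerically on a small simulated design; the matrices `g` and `v` are hypothetical, and the Moore-Penrose inverse stands in for the generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 0/1 design with a duplicated column, so that g'g is singular
# and a generalized inverse is actually needed.
g = rng.integers(0, 2, size=(12, 4)).astype(float)
g[:, 3] = g[:, 0]

# M_g = I - g (g'g)^- g'
M = np.eye(12) - g @ np.linalg.pinv(g.T @ g) @ g.T

# Quadratic form v' M v for an arbitrary 12 x 2 matrix v.
v = rng.normal(size=(12, 2))
Q = v.T @ M @ v

# Eigenvalues of the symmetrized quadratic form; all should be >= 0
# up to rounding error.
eigenvalues = np.linalg.eigvalsh((Q + Q.T) / 2)
```

Because M = M'M when M is symmetric idempotent, v'Mv = (Mv)'(Mv), which is a Gram matrix and hence positive semi-definite.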


C

Results under Sequential Exogeneity

C.1

Proof of Lemma 2

I use a recursive argument to derive moment conditions that exhaust the information for estimating $\beta_0$ that is contained in (2.3) and (2.4) when instrumental variables are sequentially exogenous, i.e. $z_{it} \subseteq z_{is}$ for $s \geq t$.

Define $y_{n,t} = [y_{it}]_{i\in\mathcal{N}}$, $x_{n,t} = [x_{it}]_{i\in\mathcal{N}}$, $g_{2,n,t} = [1[d_{it} = d]]_{i\in\mathcal{N}}^{d\in\mathcal{D}_n}$.

(2.3) and (2.4) for t = T are equivalent to:

E(cn |ZT ) = E(yn,T − xn,T β0 − g2,n,T en |ZT )

(C.1)

Define $\dot y_{n,t} = y_{n,t} - \frac{1}{T-t+1} \sum_{s=t}^{T} y_{n,s}$, $\dot x_{n,t} = x_{n,t} - \frac{1}{T-t+1} \sum_{s=t}^{T} x_{n,s}$, $\dot g_{2,n,t} = g_{2,n,t} - \frac{1}{T-t+1} \sum_{s=t}^{T} g_{2,n,s}$, and $m_{n,t}(\beta) = \dot y_{n,t} - \dot x_{n,t} \beta$ for $t = 1, ..., T - 1$.

Since no restriction is imposed on $E(c_n | Z_T)$, (C.1) contains no information for estimating $\beta_0$, and the information for estimating $\beta_0$ contained in (2.3) and (2.4) is equivalent to the information contained in:

E(mn,t (β0 ) − g˙ 2,n,t en |Zt ) = 0 ∀ t = 1, ..., T − 1

(C.2)
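The transformation used to define the dotted variables above subtracts from each period's value the mean of the current and all later periods. A minimal sketch, assuming the data are stored as an array with one row per cross-sectional unit and one column per period (the function name `forward_demean` is hypothetical):

```python
import numpy as np


def forward_demean(y):
    """Subtract from column t the mean of columns t, ..., T.

    y has shape (n, T); column index t - 1 holds period t. The last column
    of the result is identically zero, so only periods 1, ..., T - 1 carry
    information.
    """
    n, T = y.shape
    # Reverse cumulative sum: entry (i, t) is the sum of y[i, s] over s >= t.
    tail_sums = np.cumsum(y[:, ::-1], axis=1)[:, ::-1]
    # Number of periods entering each mean: T - t + 1 for t = 1, ..., T.
    counts = np.arange(T, 0, -1)
    return y - tail_sums / counts


y = np.array([[1.0, 2.0, 6.0],
              [0.0, 4.0, 4.0]])
y_dot = forward_demean(y)

# Adding a unit-specific constant (a stand-in for c_i) leaves the result unchanged.
c = np.array([[10.0], [-3.0]])
y_dot_shifted = forward_demean(y + c)
```

The invariance to the added constant is what removes the individual effect from the transformed moment conditions.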

With one-way unobserved heterogeneity only, a similar result is stated in Chamberlain (1992) for instance. With two-way unobserved heterogeneity, we cannot use (C.2) for estimation directly since $e_n$ cannot be treated as a vector of parameters that can be estimated consistently.

For $t = T - 1$, (C.2) is equivalent to:

$$\begin{aligned}
E(e_n | Z_{T-1}) &= (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' E(m_{n,T-1}(\beta_0) | Z_{T-1}) \\
&\quad + (I_{N_n} - (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' \dot g_{2,n,T-1})\, \xi_{n,T-1}
\end{aligned} \tag{C.3}$$

$$E(M_{\dot g_{2,n,T-1}} m_{n,T-1}(\beta_0) | Z_{T-1}) = 0 \tag{C.4}$$

where $\xi_{n,T-1}$ is an unrestricted $N_n \times 1$ vector. Since $E(e_n | Z_{T-1})$ is unrestricted, no information for estimating $\beta_0$ is found in (C.3), and the information for estimating $\beta_0$ found in (C.2) for $t = T - 1$ is equivalent to the information found in (C.4).

Define $a_{n,T-1}(\beta) = (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' m_{n,T-1}(\beta)$ and:

$$B_{n,T-1} = I_{N_n} - (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' \dot g_{2,n,T-1} \tag{C.5}$$

The information for estimating β0 found in (C.2) for t = T − 1, T − 2 is equivalent to the information contained in (C.4) and:

E(mn,T −2 (β0 ) − g˙ 2,n,T −2 an,T −1 (β0 ) −g˙ 2,n,T −2 Bn,T −1 ξn,T −1 |ZT −2 ) = 0

(C.6)

Define $\ddot g_{2,n,T-2} = \dot g_{2,n,T-2} B_{n,T-1}$. (C.6) in turn is equivalent to:

$$\begin{aligned}
E(\xi_{n,T-1} | Z_{T-2}) &= (\ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})^{-} \ddot g_{2,n,T-2}' E(m_{n,T-2}(\beta_0) - \dot g_{2,n,T-2} a_{n,T-1}(\beta_0) | Z_{T-2}) \\
&\quad + (I_{N_n} - (\ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})^{-} \ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})\, \xi_{n,T-2}
\end{aligned} \tag{C.7}$$

$$E(M_{\ddot g_{2,n,T-2}} (m_{n,T-2}(\beta_0) - \dot g_{2,n,T-2} a_{n,T-1}(\beta_0)) | Z_{T-2}) = 0 \tag{C.8}$$

where $\xi_{n,T-2}$ is an unrestricted $N_n \times 1$ vector. Since $E(\xi_{n,T-1} | Z_{T-2})$ is unrestricted, the information for estimating $\beta_0$ found in (C.2) for $t = T - 1, T - 2$ is equivalent to the information contained in (C.4) and (C.8).

Define:

$$\ddot g_{2,n,T} = 0 \tag{C.9}$$

$$B_{n,T} = I_{N_n} \tag{C.10}$$

and for $t = T - 1, ..., 1$:

$$\ddot g_{2,n,t} = \dot g_{2,n,t} B_{n,t+1} \tag{C.11}$$

$$B_{n,t} = B_{n,t+1} (I_{N_n} - (\ddot g_{2,n,t+1}' \ddot g_{2,n,t+1})^{-} \ddot g_{2,n,t+1}' \ddot g_{2,n,t+1}) \tag{C.12}$$

as well as $a_{n,T}(\beta_0) = 0$ and for $t = T - 1, ..., 1$:

$$a_{n,t}(\beta_0) = a_{n,t+1}(\beta_0) + B_{n,t+1} (\ddot g_{2,n,t}' \ddot g_{2,n,t})^{-} \ddot g_{2,n,t}' (m_{n,t}(\beta_0) - \dot g_{2,n,t} a_{n,t+1}(\beta_0)) \tag{C.13}$$

Then, by recursion, the information for estimating β0 contained in (2.3) and (2.4) is equivalent to the information contained in:

E(Mg¨2,n,t (mn,t (β0 ) − g˙ 2,n,t an,t+1 (β0 ))|Zt ) = 0 ∀ t = 1, ..., T − 1

(C.14)

Define $y_n^t = [y_{n,t'}]_{t'=T,...,t}$, $x_n^t = [x_{n,t'}]_{t'=T,...,t}$, $g_n^t = [g_{n,t'}]_{t'=T,...,t}$. It only remains to show that (C.14) is equivalent to:

$$E(y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t \beta_0) | Z_t) = 0 \quad \forall\, t = 1, ..., T - 1 \tag{C.15}$$

This is shown in the next subsection.

C.2

Equivalence of Moment Conditions

To shorten notation, define mn,t,0 = mn,t (β0 ) and an,t,0 = an,t (β0 ). Here I show that:

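$$E(M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t} a_{n,t+1,0}) | Z_t) = 0 \quad \forall\, t = 1, ..., T - 1 \tag{C.16}$$

is equivalent to:

$$E(y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t \beta_0) | Z_t) = 0 \quad \forall\, t = 1, ..., T - 1 \tag{C.17}$$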

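Since (C.17) is a set of moment conditions implied by (2.3) and (2.4) that can be used for estimating $\beta_0$, and (C.16) contains all information in (2.3) and (2.4) relevant for estimating $\beta_0$, then (C.17) is implied by (C.16). Therefore it only remains to show that (C.17) implies (C.16) to conclude the proof.

By the Frisch-Waugh theorem:

$$\begin{aligned}
& y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t \beta_0) \\
&= \dot y_{n,t} - \dot x_{n,t}\beta_0 - \dot g_{2,n,t} \big((g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})\big)^{-} \times \\
&\qquad (g_n^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (y_n^t - j_{T-t+1} \otimes \bar y_{n,t} - (x_n^t - j_{T-t+1} \otimes \bar x_{n,t})\beta_0)
\end{aligned}$$

where $\otimes$ is the Kronecker product, $j_k$ is a $k \times 1$ vector of ones, $\bar g_{2,n,t} = \frac{1}{T-t+1} \sum_{s=t}^{T} g_{2,n,s}$, $\bar x_{n,t} = \frac{1}{T-t+1} \sum_{s=t}^{T} x_{n,s}$, $\bar y_{n,t} = \frac{1}{T-t+1} \sum_{s=t}^{T} y_{n,s}$.

Define $\dot g_{2,n}^t = [\dot g_{2,n,s}]_{s\geq t}$, $\dot y_n^t = [\dot y_{n,s}]_{s\geq t}$, $\dot x_n^t = [\dot x_{n,s}]_{s\geq t}$ and $m_{n,0}^t = \dot y_n^t - \dot x_n^t \beta_0$.

One can show that:

$$\begin{aligned}
& \dot g_{2,n,t} \big((g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})\big)^{-} \times \\
&\qquad (g_n^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (y_n^t - j_{T-t+1} \otimes \bar y_{n,t} - (x_n^t - j_{T-t+1} \otimes \bar x_{n,t})\beta_0) \\
&= \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t\prime} m_{n,0}^t
\end{aligned} \tag{C.18}$$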

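and we can write:

$$\begin{aligned}
& m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t\prime} m_{n,0}^t \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.19)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} \\
&\quad - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.20)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} \\
&\quad - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t - \dot g_{2,n,t}' \dot g_{2,n,t}) (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.21)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) && \text{(C.22)}
\end{aligned}$$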

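Note that:

$$\begin{aligned}
\dot g_{2,n,t}' (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}')
&= \dot g_{2,n,t}' - \dot g_{2,n,t}' \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.23)} \\
&= \dot g_{2,n,t}' - \dot g_{2,n}^{t\prime} \dot g_{2,n}^t (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.24)} \\
&\quad + \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.25)} \\
&= \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.26)}
\end{aligned}$$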

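so that:

$$\begin{aligned}
& \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n,t}' (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') \\
&= \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}'
\end{aligned} \tag{C.27}$$

Therefore (C.22) premultiplied by $\dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n,t}'$ is equal to:

$$\dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.28}$$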

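It can be shown that $B_{n,t+1} = I_{N_n} - (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1}$.

Recall that by definition $\ddot g_{2,n,t} = \dot g_{2,n,t} B_{n,t+1}$. Then subtracting (C.28) from (C.22) gives:

$$(I_n - \ddot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.29}$$

so that premultiplying by $M_{\ddot g_{2,n,t}}$ gives:

$$M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.30}$$

Therefore (C.17) implies:

$$E(M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) | Z_t) = 0 \quad \forall\, t = 1, ..., T - 1 \tag{C.31}$$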

It only remains to show that (C.31) implies (C.16).

For $t = T - 1$, (C.31) and (C.16) are identical since $a_{n,T}(\beta_0) = 0$ and $(\dot g_{2,n}^{T\prime} \dot g_{2,n}^T)^{-} \dot g_{2,n}^{T\prime} m_{n,0}^T = 0$ since $\dot g_{2,n}^T = 0$ and $m_{n,0}^T = 0$.

In addition $a_{n,T-1,0} = (\dot g_{2,n}^{T-1\prime} \dot g_{2,n}^{T-1})^{-} \dot g_{2,n}^{T-1\prime} m_{n,0}^{T-1}$ by definition.

The result can be proven by recursion. Assume that (C.31) for $t = T - 1, ..., s + 1$ implies (C.16) for $t = T - 1, ..., s + 1$ and:

$$E(a_{n,s+1,0} | Z_{s+1}) = E((\dot g_{2,n}^{s+1\prime} \dot g_{2,n}^{s+1})^{-} \dot g_{2,n}^{s+1\prime} m_{n,0}^{s+1} | Z_{s+1}) + B_{n,s+1} \xi_{s+1} \tag{C.32}$$

where $\xi_{s+1}$ is an unrestricted $N_n \times 1$ vector. Then clearly (C.31) for $t = T - 1, ..., s$ implies (C.16) for $t = s$ since $\ddot g_{2,n,s} \equiv \dot g_{2,n,s} B_{n,s+1}$. (C.16) for $t = s$ also implies:

E(mn,s,0 |Zs ) = E(Pg¨2,n,s mn,s,0 + Mg¨2,n,s g˙ 2,n,s an,s+1,0 |Zs )

(C.33)

= E(Pg¨2,n,s (mn,s,0 − g˙ 2,n,s an,s+1,0 ) + g˙ 2,n,s an,s+1,0 |Zs )

(C.34)

= E(g˙ 2,n,s an,s,0 |Zs )

(C.35)

where the last equality follows from the definitions of $a_{n,s}(\beta)$ and of $\ddot g_{2,n,s}$. So (C.31) for $t = T - 1, ..., s$ implies:

$$E(\dot g_{2,n,s}' m_{n,s,0} | Z_s) = E(\dot g_{2,n,s}' \dot g_{2,n,s} a_{n,s,0} | Z_s) \tag{C.36}$$

By the same derivation as we did above for $t = s$, (C.16) for $t = T - 1, ..., s + 1$ implies:

$$\begin{aligned}
E(\dot g_{2,n,t}' m_{n,t,0} | Z_s) &= E(\dot g_{2,n,t}' \dot g_{2,n,t} a_{n,t,0} | Z_s) && \text{(C.37)} \\
&= E(\dot g_{2,n,t}' \dot g_{2,n,t} a_{n,s,0} | Z_s) && \text{(C.38)}
\end{aligned}$$

where the last equality follows from $\dot g_{2,n,t} B_{n,t'} = 0$ for $t' \leq t$ and $a_{n,t,0} = a_{n,t+1,0} + B_{n,t+1} (\ddot g_{2,n,t}' \ddot g_{2,n,t})^{-} \ddot g_{2,n,t}' (m_{n,t,0} - \dot g_{2,n,t} a_{n,t+1,0})$. Therefore (C.31) for $t = T - 1, ..., s$ implies:

$$E(\dot g_{2,n}^{s\prime} \dot g_{2,n}^s a_{n,s,0} | Z_s) = E(\dot g_{2,n}^{s\prime} m_{n,0}^s | Z_s) \tag{C.39}$$

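which in turn implies:

$$\begin{aligned}
E(a_{n,s,0} | Z_s) &= E((\dot g_{2,n}^{s\prime} \dot g_{2,n}^s)^{-} \dot g_{2,n}^{s\prime} m_{n,0}^s | Z_s) && \text{(C.40)} \\
&\quad + (I - (\dot g_{2,n}^{s\prime} \dot g_{2,n}^s)^{-} \dot g_{2,n}^{s\prime} \dot g_{2,n}^s)\, \xi_s && \text{(C.41)} \\
&= E((\dot g_{2,n}^{s\prime} \dot g_{2,n}^s)^{-} \dot g_{2,n}^{s\prime} m_{n,0}^s | Z_s) && \text{(C.42)} \\
&\quad + B_{n,s}\, \xi_s && \text{(C.43)}
\end{aligned}$$

where $\xi_s$ is an unrestricted vector. This concludes the proof.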

C.3

Equivalence for Estimation

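Equation (4.3) in the main text follows from:

$$(I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}')\, \Sigma_{n,t}\, (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') = (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') \tag{C.44}$$

by definition of $\Sigma_{n,t}$ and:

$$\begin{aligned}
& (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n] \\
&= [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime} + g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}' g_{n,t} (g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']
\end{aligned}$$

and $g_{n,t}' g_{n,t} = g_n^{t\prime} g_n^t - g_n^{t+1\prime} g_n^{t+1}$, so that:

$$\begin{aligned}
& (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n] \\
&= [-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']
\end{aligned}$$

In this subsection I also show why $\hat\beta_n$ is defined by (4.4) in section 4.2. We can rewrite:

$$\begin{aligned}
\ddot z_{n,t}' \dot x_{n,t}
&= \tilde z_n^{t\prime} \begin{pmatrix} -g_n^{t+1} (g_n^{t+1\prime} g_n^{t+1})^{-} g_{n,t}' \\ I_n \end{pmatrix} [-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']\, x_n^t \\
&= \tilde z_n^{t\prime} \begin{pmatrix} g_n^{t+1} (g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime} - g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} & -g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' \\ -g_{n,t} (g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} & I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}' \end{pmatrix} x_n^t
\end{aligned}$$

where $\tilde z_n^t = E([x_{n,s}]_{s\geq t} | Z_t)$. Define $\tilde z_{n,t}$ and $\tilde z_{n,t}^{t+1}$ to be the last $n \times 1$ and the first $n(T - t) \times 1$ blocks of $\tilde z_n^t$ respectively.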

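We have:

$$\begin{aligned}
\sum_{t\leq T-1} \ddot z_{n,t}' \dot x_{n,t}
&= \sum_{t\leq T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\leq T-1} \tilde z_{n,t}' g_{n,t} (g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} x_n^{t+1} \\
&\quad - \sum_{t\leq T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\leq T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} ((g_n^{t+1\prime} g_n^{t+1})^{-} - (g_n^{t\prime} g_n^t)^{-}) g_n^{t+1\prime} x_n^{t+1} \\
&= \sum_{t\leq T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\leq T-1} \sum_{s\geq t} \tilde z_{n,t}' g_{n,t} (g_n^{t\prime} g_n^t)^{-} g_{n,s}' x_{n,s} \\
&\quad - \sum_{t\leq T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\leq T-1} \sum_{s\geq t} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} ((g_n^{t+1\prime} g_n^{t+1})^{-} - (g_n^{t\prime} g_n^t)^{-}) g_{n,s}' x_{n,s} \\
&= -\sum_{s\leq T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&\quad + \sum_{s\leq T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,T}' x_{n,T} \\
&\quad + \sum_{t\leq T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\leq T-1} \sum_{s\leq t} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t} \\
&\quad - \sum_{t\leq T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\leq T-1} \sum_{s\leq t} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,t}' x_{n,s} \\
&= \sum_{t\in\mathcal{T}} A_t
\end{aligned}$$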

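where:

$$\begin{aligned}
A_T &= -\sum_{s\leq T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&\quad + \sum_{s\leq T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,T}' x_{n,T} \\
&= -\sum_{s\leq T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&\quad + \tilde z_{n,T-1}^{T\prime} x_{n,T} + \sum_{s\leq T-2} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s+1\prime} g_n^{s+1})^{-} g_{n,T}' x_{n,T} \\
&\quad - \sum_{s\leq T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s\prime} g_n^s)^{-} x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T} + \sum_{s\leq T-2} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s+1\prime} g_n^{s+1})^{-} g_{n,T}' x_{n,T} \\
&\quad - \sum_{s\leq T-1} \tilde z_n^{s\prime} g_n^s (g_n^{s\prime} g_n^s)^{-} x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T} + \sum_{s\leq T-1} \tilde z_{n,s-1}^{s\prime} g_n^s (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&\quad - \sum_{s\leq T-1} \tilde z_n^{s\prime} g_n^s (g_n^{s\prime} g_n^s)^{-} x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T} + \sum_{s\leq T-1} (\tilde z_{n,s-1}^s - \tilde z_n^s)' g_n^s (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T}
\end{aligned}$$

where the fourth equality follows from a summation index change and $\tilde z_{n,0}^1 \equiv 0$.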

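Similarly we can show that, for $t \leq T - 1$:

$$\begin{aligned}
A_t &= \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{s\leq t} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t} \\
&\quad - \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{s\leq t} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,t}' x_{n,s} \\
&= \tilde z_{n,t}' x_{n,t} + \sum_{s\leq t} (\tilde z_{n,s-1}^s - \tilde z_n^s)' g_n^s (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t}
\end{aligned}$$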

Therefore we see that the estimator defined by (4.4) is indeed the pooled regression version of the estimator defined by (4.3).

C.4

Preliminary Results for the Proofs of Section 4.2

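As in section B of this appendix, I consider for simplicity the case where $\dim(x_{it}) = 1$.

Define $\breve z_{it}$ to be the element of $\breve z_{n,t}$ corresponding to observation $i$.$^{9}$ Define $\tilde z_{i,t,t'} = E(x_{it} | Z_{t'})$ for $t' \leq t$. When Assumption 7 is used, denote by $p_i \in \mathcal{P}_n$ the set in $\mathcal{P}_n$ that contains $i$.

As in the previous section, Assumption 6-c and Jensen's inequality imply $|\tilde z_{i,t,t'}|^{4+\delta} \leq E(B_{it} | Z_{t'})$ $\forall\, t' \leq t$, $i \in \mathcal{N}$. As in section B of this appendix, this last result combined with $N_{n,d} \leq C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d \in \mathcal{D}_n : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \leq C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma\} \leq C$ $\forall\, i \in \mathcal{N}$, $\forall\, n$, also imply $|\breve z_{it}|^{4+\delta} \leq C \sum_{s\geq t} E(B_{is} | Z_t)$.

Also note that $\breve z_{it}$ is a function of $\{\tilde z_{js} : \exists\, \sigma \in \mathcal{S}_n \text{ s.t. } j, i \in \sigma, s \in \mathcal{T}\}$ only. Together with $|\breve z_{it}|^{4+\delta} \leq C \sum_{s\geq t} E(B_{is} | Z_t)$, $E(B_{it}) \leq C$, and Assumption 4-d, this implies that the random field $\{\breve z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde Q$ equipped with semi-metric $\tilde\rho$, with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ since $\max_{i,i'\in\sigma,\, \sigma\in\mathcal{S}_n,\, t,t'\in\mathcal{T}} \rho(l_{it}, l_{i't'}) \leq C$, and constant NED scaling factors.

9 For notational simplicity I drop the $n$ subscript and write $\breve z_{it}$ instead of $\breve z_{n,it}$.

Lemma 7. Under Assumption 6, and Assumption 7 or Assumption 8, as $n \to \infty$ while $N_{n,d} \leq C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \leq C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \leq C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:

$$\frac{1}{n} \sum_t \breve z_{n,t}' x_{n,t} - A_n \overset{p}{\to} 0 \tag{C.45}$$

where $A_n = \frac{1}{n} E(\sum_t \breve z_{n,t}' x_{n,t})$.

Proof. The proof is identical to the proof of Lemma 4 in part B of this appendix.

Lemma 8. Under Assumptions 5, 6, and Assumption 7 or Assumption 8, as $n \to \infty$ while $N_{n,d} \leq C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \leq C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \leq C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:

$$\frac{1}{n} \sum_{i\in\mathcal{N}}\ \sum_{j\in\mathcal{N}:\ d_{it}=d_{jt}} \breve z_{i,t} \breve z_{j,t} u_{i,t} u_{j,t} - V_t \overset{p}{\to} 0 \tag{C.46}$$

where $V_t = \lim_{n\to\infty} \frac{1}{n} Var(\breve z_{n,t}' u_{n,t})$.

Proof.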

Proof using Assumption 7

Under Assumptions 6 and 7, we can use Hölder's inequality, Minkowski's inequality, and Jensen's inequality to verify Markov's condition as in the proof of Lemma 5 and obtain the desired result.

Proof using Assumption 8

Define $\bar V_{n,t} = \frac{1}{n} \sum_{i\in\mathcal{N}} \sum_{j\in\mathcal{N}:\ d_{jt}=d_{it}} \breve z_{it} \breve z_{jt} E(u_{it} u_{jt} | \{z_{i't}\}_{i'\in\mathcal{N}:\ d_{i't}=d_{it}})$.

First we show that $\bar V_{n,t} - V_t \overset{p}{\to} 0$. Under Assumption 4-d, $E(u_{it} u_{jt} | \{z_{i't}\}_{i'\in\mathcal{N}:\ d_{i't}=d_{it}})$ is $L_2$-NED on the random field $\{z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde Q$ equipped with semi-metric $\tilde\rho$ with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ and constant NED scaling factors. As in the proof of Lemma 4, we can verify that the rest of the conditions for Theorem 1 of Jenish and Prucha (2012) hold, so that $\bar V_{n,t} - V_t \overset{p}{\to} 0$.

Secondly note that Assumption 6-c implies that Assumption 1-a of Kuersteiner and Prucha (2013) holds. Finally note that Assumption 5 implies that Assumption 2 of Kuersteiner and Prucha (2013) holds. Therefore all of the conditions for Lemma 1 of Kuersteiner and Prucha (2013) to apply are met, so that we have:

$$\frac{1}{n} \sum_{i\in\mathcal{N}}\ \sum_{j\in\mathcal{N}:\ d_{it}=d_{jt}} \breve z_{i,t} \breve z_{j,t} u_{i,t} u_{j,t} - V_t \overset{p}{\to} 0 \tag{C.47}$$

C.5

Proof of Proposition 4

Consistency is obtained from the result of Lemma 7, the continuous mapping theorem, Assumption 6-a, and:

$$\frac{1}{n} \sum_t \breve z_{n,t}' u_{n,t} \overset{p}{\to} 0 \tag{C.48}$$

which can be shown by showing convergence in mean-squared error as in the proof of Proposition 1.

Asymptotic normality is obtained from the result of Lemma 7, the asymptotic equivalence lemma, Assumption 6-a, and:

$$V^{-\frac{1}{2}} \frac{1}{n} \sum_t \breve z_{n,t}' u_{n,t} \overset{d}{\to} N(0, I) \tag{C.49}$$

In order to show this last result, we use the central limit theorem given by Theorem 2 of Kuersteiner and Prucha (2013). As in the proof of Lemma 8, Assumption 6-c implies that Assumption 1-a of Kuersteiner and Prucha (2013) holds and Assumption 5 implies that Assumption 2 of Kuersteiner and Prucha (2013) holds. Lemma 8 guarantees that Assumption 1-c of Kuersteiner and Prucha (2013) holds. Therefore all conditions for Theorem 2 of Kuersteiner and Prucha (2013) are met and we obtain the desired result.

C.6

Proof of Proposition 5

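The first result of Proposition 5 is shown by Lemma 7. For the second result of Proposition 5, define:

$$\hat V_n = \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \sum_{t_1\in\mathcal{T}} \breve z_{n,t_1}^{(\sigma_1)\prime} \hat v_{n,t_1}^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1),\, t_2\in\mathcal{T}} \hat v_{n,t_2}^{(\sigma_2)\prime} \breve z_{n,t_2}^{(\sigma_2)} \tag{C.50}$$

We need to show that $\hat V_n - V = o_p(1)$. Note that $\sum_{t=1}^{T} \breve z_{n,t}^{(\sigma)} g_{n,t}^{(\sigma)} = 0$, so that, defining:

$$T_{n,1} = \frac{1}{n} \sum_{\sigma_1\in\mathcal{S}_n} \sum_{t_1\in\mathcal{T}} \breve z_{n,t_1}^{(\sigma_1)\prime} u_{n,t_1}^{(\sigma_1)} \sum_{\sigma_2\in\mathcal{U}_n(\sigma_1),\, t_2\in\mathcal{T}} u_{n,t_2}^{(\sigma_2)\prime} \breve z_{n,t_2}^{(\sigma_2)} \tag{C.51}$$

we can show as in the proof of Proposition 2 that:

$$\hat V_n - T_{n,1} = o_p(1) \tag{C.52}$$

Define $\breve z_{it}^{(\sigma)}$ to be the element of $\breve z_n^{(\sigma)}$ corresponding to observation $\{i, t\}$ for $i \in \sigma$. We have:

$$T_{n,1} = \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}}\ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}}\ \sum_{\sigma,\sigma'\in\mathcal{S}_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in \mathcal{U}_n(\sigma)]\, \breve z_{it}^{(\sigma)} \breve z_{js}^{(\sigma')} u_{it} u_{js} \tag{C.53}$$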

We can use the same arguments as in the proof of Proposition 2 to show that Tn,1 − Vn = op (1) can be obtained from the conditions of Proposition 5 under Assumption 7 or Assumption 8 as in the proof of Lemma 8.


C.7

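Proof of Proposition 6

Define $M_{n,t} = [-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']$ and recall the definition $\Sigma_{n,t} = (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}')^{-}$. Under the assumptions of Proposition 6, we have:

$$\begin{aligned}
A_n &= \frac{1}{n} \sum_t E(\tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t) && \text{(C.54)} \\
V &= \frac{1}{n} \sum_t E(\tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t) && \text{(C.55)} \\
&= A_n && \text{(C.56)}
\end{aligned}$$

The second equality follows from $\Sigma_{n,t} = (M_{n,t} M_{n,t}')^{-}$ and from $M_{n,t} g_n^t = 0$, so that for $s < t$:

$$M_{n,t}\, g_n^t (g_n^{s\prime} g_n^s)^{-} g_{n,s}' = 0 \tag{C.57}$$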

This last result implies that the moment conditions given by (C.15) are indeed serially uncorrelated under the assumptions of homoscedasticity and serial and cross-sectional uncorrelation of Proposition 6, which is why it is advantageous to work from these moment conditions in order to define a locally efficient estimator rather than from the moment conditions (C.14), which are not serially uncorrelated under the assumptions of Proposition 6.


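Therefore, under the assumptions of Proposition 6, we have:

$$\begin{aligned}
B_n^{-1} W_n B_n^{-1} - A_n^{-1} V_n A_n^{-1}
&= B_n^{-1} W_n B_n^{-1} - E\Big(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t\Big)^{-1} && \text{(C.58)} \\
&= E\Big(\frac{1}{n} \sum_t \acute z_{n,t}' M_{n,t} \tilde z_n^t\Big)^{-1} \times && \text{(C.59)} \\
&\quad\ E\Big(\frac{1}{n} \sum_t \acute z_{n,t}' M_{n,t} (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') M_{n,t}' \acute z_{n,t}\Big) \times && \text{(C.60)} \\
&\quad\ E\Big(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \acute z_{n,t}\Big)^{-1} && \text{(C.61)} \\
&\quad - E\Big(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t\Big)^{-1} && \text{(C.62)} \\
&= E\Big(\frac{1}{n} \sum_t v_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') v_{n,t}\Big) && \text{(C.63)}
\end{aligned}$$

where $v_{n,t} \equiv \acute z_{n,t}\, E(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \acute z_{n,t})^{-1} - \Sigma_{n,t} M_{n,t} \tilde z_n^t\, E(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t)^{-1}$. The last equality follows from:

$$\begin{aligned}
(I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') \Sigma_{n,t} M_{n,t} &= M_{n,t} M_{n,t}' (M_{n,t} M_{n,t}')^{-} M_{n,t} && \text{(C.64)} \\
&= M_{n,t} && \text{(C.65)}
\end{aligned}$$

This concludes the proof since $I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}' = M_{n,t} M_{n,t}'$ and positive definiteness is preserved by sums and expected values.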

D

Effect of Class Size on Student Achievement: Details and Additional Results

D.1

Details of the Estimation Methods

D.1.1

Dynamic model with all sources of unobserved heterogeneity

In order to implement the estimator and the associated standard errors of section 4.2, $\mathcal{S}_n$ and $\tilde z_n^{t(\sigma)}$ need to be defined.

Let $\mathcal{A}_n$ be the set of all schools in the dataset (1,369 schools). Let $a_{it}$ denote the school attended by student $i$ in year $t$ ($school_{it}$ in the main text). For each pair of schools $a, a' \in \mathcal{A}_n$, let $\#_{a,a'} = \sum_{i,t\in\mathcal{N}\times\mathcal{T}} 1[\exists\, s \in \mathcal{T} : a_{it} = a, a_{is} = a']$. For any $a \in \mathcal{A}_n$, let $a_1(a) = \arg\max_{a'\in\mathcal{A}_n,\, a'\neq a} \#_{a,a'}$ and $a_2(a) = \arg\max_{a'\in\mathcal{A}_n,\, a'\neq a,\, a'\neq a_1(a)} \#_{a,a'}$.

In this empirical application, $\mathcal{S}_n$ was defined to be $\{\{i \in \mathcal{N} : a_{it} \in \{a, a_1(a), a_2(a)\}\ \forall\, t \in \mathcal{T}\}\}_{a\in\mathcal{A}_n}$, i.e. for each school $a$, the set of students who attended only school $a$ or one of the two schools with the most student transfers to or from school $a$, for grades 4 and 5. This definition of $\mathcal{S}_n$ resulted in 1,369 groups with an average number of teachers per group of approximately 31 and an average number of overlapping groups, $\#\mathcal{U}_n(\sigma)$, per group $\sigma \in \mathcal{S}_n$, of approximately 8, so that the conditions in Sections 3 and 4 that $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\}$ and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\}$ be relatively small seem appropriate here.$^{10}$

In this empirical application, $y_{it-1}$ is the only sequentially exogenous covariate, the other covariates (grade-year indicators and the polynomial in class size) being treated as strictly exogenous.
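The grouping rule described above can be sketched as follows. This is an illustration only: the input layout (a dictionary from student to the tuple of schools attended), the function name, the tie-breaking rule, and the choice to ignore partner schools with a zero count are all assumptions, not details taken from the paper.

```python
from collections import Counter


def build_groups(schools):
    """Build one group of students per school.

    schools: dict mapping a student id to a tuple of school ids, one per period.
    Returns a dict mapping each school a to the set of students whose schools
    all lie in {a, a1(a), a2(a)}, where a1(a) and a2(a) are the two schools
    with the largest pair counts with a.
    """
    # pair_counts[(a, b)]: number of (student, period) pairs with school a in
    # that period for a student who attends school b in some period.
    pair_counts = Counter()
    for path in schools.values():
        visited = set(path)
        for a in path:
            for b in visited - {a}:
                pair_counts[(a, b)] += 1

    all_schools = sorted({a for path in schools.values() for a in path})
    groups = {}
    for a in all_schools:
        # Assumption: ties are broken by school id, zero-count partners dropped.
        partners = sorted(
            (b for b in all_schools if b != a and pair_counts[(a, b)] > 0),
            key=lambda b: (-pair_counts[(a, b)], b),
        )[:2]
        allowed = {a, *partners}
        groups[a] = {i for i, path in schools.items() if set(path) <= allowed}
    return groups


# Hypothetical example with two periods and four schools.
schools = {
    1: ("A", "A"), 2: ("A", "B"), 3: ("A", "B"), 4: ("A", "C"),
    5: ("A", "C"), 6: ("A", "D"), 7: ("D", "D"),
}
groups = build_groups(schools)
```

Groups built this way overlap (student 1 above belongs to every group that contains school A), which is why the standard errors involve cross-products between overlapping groups.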

In order to form $\tilde z_n^{t(\sigma)}$ we thus only need to replace $E(y_{it-1} | Y^{t-2}, X)$ by an estimated auxiliary model since here only two time periods, grades 4 and 5, are available for each student. This was obtained by a sequence of grade-year specific ordinary least squares regressions of $y_{it-1}$ on $y_{it-2}$, $\frac{1}{\#_{d_{it-2},t-2}} \sum_{j: d_{jt-2} = d_{it-2}} y_{jt-2}$, $\frac{1}{\#_{d_{it-1},t-1}} \sum_{j: d_{jt-1} = d_{it-1}} y_{jt-2}$, $\frac{1}{\#_{d_{it},t}} \sum_{j: d_{jt} = d_{it}} y_{jt-2}$, $cs_{it-2}$, $cs_{it-1}$, $cs_{it}$ where $\#_{d,t} = \#\{i \in \mathcal{N} : d_{it} = d\}$.

D.1.2

Dynamic model with student unobserved heterogeneity only

This estimator was calculated as in Section 5.2 and in the section above but replacing $g_n$ with $[1[i = j]]_{i\in\mathcal{N},\, t\in\mathcal{T}}^{j\in\mathcal{N}}$ and $\mathcal{S}_n$ with $\{\mathcal{N}\}$.

10

Similarly as was noted in Section 3.1, one could also have defined Sn = {{i ∈ N : ait = a ∀ t ∈ T }}a∈An (the collection of students who are “stayers” at the school level for each school) so that Un (σ) = {σ} ∀ σ ∈ Sn and standard errors could be computed without having to compute cross-products, although generally this definition of Sn could lead to a smaller effective sample size and a loss in efficiency.


D.1.3

Dynamic model with teacher and school unobserved heterogeneity only, treated as finite dimensional parameters that can be estimated consistently

This estimator was calculated by a pooled regression of yit on yit−1 , grade-year indicators, the polynomial in class size, and indicator variables for each school and teacher.

D.1.4

Dynamic model with no unobserved heterogeneity

This estimator was calculated by a pooled regression of yit on yit−1 , grade-year indicators and the polynomial in class size.

D.1.5

Static model and model in gains

For the static model (ρ0 assumed to be equal to zero) or the model in gains (ρ0 assumed to be equal to one), the resulting model has covariates that are all treated as strictly exogenous. Therefore the model with all sources of unobserved heterogeneity is estimated as in section D.1.1 but without having to use an auxiliary model for predicting the covariates based on the instrumental variables, and replacing the dependent variable with ∆yit for the model in gains. The model with unobserved heterogeneity indexed by student only is estimated by one-way fixed effects estimation. The models with teacher and school unobserved heterogeneity only and with no unobserved heterogeneity are estimated by ordinary least squares regressions, simply excluding yit−1 from the list of covariates.

D.2

Classical Measurement Error in Test Score

In this section I discuss results obtained with estimators for a model as in (6.1) and (6.2) in the main text, but where test scores measure student achievement up to an additive error term which is independent of the explanatory variables and independent across subjects. This type of measurement error was considered in Andrabi et al. (2011) and Verdier (2016) for models with one-way unobserved heterogeneity only.


The resulting model is:

$$y^*_{it} = \alpha_{grade_{it},t} + \rho_0 y^*_{it-1} + x_{it}\beta_0 + c_i + e_{d_{it}} + f_{school_{it}} + u_{it} \tag{D.1}$$

$$y_{it} = y^*_{it} + \epsilon_{it} \tag{D.2}$$

$$E(u_{it} | Y^{other,t-1}, X) = 0 \tag{D.3}$$

$$E(\epsilon_{it} | Y^{other,t}, X) = 0 \tag{D.4}$$

where $y^*_{it}$ is student $i$'s achievement in year $t$ in mathematics, $y_{it}$ is the student's test score in mathematics, so that $\epsilon_{it}$ is measurement error, $Y^{other,t} = \{y^{other}_{is}\}_{i\in\mathcal{N},\, s=1,...,t}$ and $y^{other}_{it}$ is student $i$'s test score in reading in year $t$. All other variables are defined as in the main text. The definitions are simply exchanged when reading is of interest instead of mathematics.

Therefore we can rewrite:

$$y_{it} = \alpha_{grade_{it},t} + \rho_0 y_{it-1} + x_{it}\beta_0 + c_i + e_{d_{it}} + f_{school_{it}} + u_{it} + \epsilon_{it} - \rho_0 \epsilon_{it-1} \tag{D.5}$$

$$E(u_{it} + \epsilon_{it} - \rho_0 \epsilon_{it-1} | Y^{other,t-1}, X) = 0 \tag{D.6}$$
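The step from the latent-achievement model to (D.5) and (D.6) can be spelled out as follows. This is a sketch that reads (D.4) as a conditional mean-zero restriction on the measurement error and uses the fact that the conditioning set for year $t - 1$ is contained in the one for year $t$.

```latex
% Substitute y^*_{it} = y_{it} - \epsilon_{it} and
% y^*_{it-1} = y_{it-1} - \epsilon_{it-1} from (D.2) into (D.1):
\begin{align*}
y_{it} - \epsilon_{it}
  &= \alpha_{grade_{it},t} + \rho_0 (y_{it-1} - \epsilon_{it-1}) + x_{it}\beta_0
     + c_i + e_{d_{it}} + f_{school_{it}} + u_{it} \\
\Rightarrow\quad
y_{it}
  &= \alpha_{grade_{it},t} + \rho_0 y_{it-1} + x_{it}\beta_0
     + c_i + e_{d_{it}} + f_{school_{it}}
     + u_{it} + \epsilon_{it} - \rho_0 \epsilon_{it-1}.
\end{align*}
% Each component of the composite error has conditional mean zero:
\begin{align*}
E(u_{it} \mid Y^{other,t-1}, X) &= 0
  && \text{by (D.3)}, \\
E(\epsilon_{it} \mid Y^{other,t-1}, X)
  &= E\big(E(\epsilon_{it} \mid Y^{other,t}, X) \mid Y^{other,t-1}, X\big) = 0
  && \text{by (D.4) and iterated expectations}, \\
E(\epsilon_{it-1} \mid Y^{other,t-1}, X) &= 0
  && \text{by (D.4) applied to year } t-1.
\end{align*}
```

The composite error is correlated with the lagged mathematics score through the lagged measurement error, but not with lagged reading scores, which is why the latter can serve as instruments.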

so that estimation of this model can proceed as in the main text but using past test scores in reading as instruments for past test scores in mathematics and vice versa. Similarly, for the dynamic model with student-level unobserved heterogeneity only, estimation is done as in the main text but using past test scores in reading as instruments for past test scores in mathematics and vice versa, as in Andrabi et al. (2011) and Verdier (2016). For the dynamic model with teacher- and school-level unobserved heterogeneity only, treated as finite-dimensional parameters that can be estimated consistently, and for the dynamic model without any unobserved heterogeneity, estimation is not done by ordinary least squares regression as in the main text but rather by instrumental variable regression, using past test scores in reading as instruments for past test scores in mathematics and vice versa.

Note that for the model in gains (ρ0 is assumed to be one) or the static model (ρ0 is assumed to be zero), the presence of measurement error of the type considered here does not invalidate the estimators considered in the main text, so the results for these models are unchanged in this section.

The results are presented in Tables 1 and 2 of the Appendix, and additional visualizations are shown in Figures 2-4 of the Appendix. With the model with measurement error considered in this section, the effect of class size on student achievement is estimated to be higher than in the main text, and the strength of persistence in the effect of past educational inputs is also estimated to be higher (for mathematics for instance, ρ0 is estimated to be 0.315 here instead of 0.094 in the main text).

The effect of class size is still estimated to be higher when using a model that includes all sources of unobserved heterogeneity rather than a model that excludes student or teacher and school level unobserved heterogeneity (Figure 3). However, the difference between the estimates of the effect of class size corresponding to the model with all sources of unobserved heterogeneity and the model with teacher and school level unobserved heterogeneity only is smaller here than in the main text.

Since persistence is estimated to be stronger here than in the main text, the difference between the results obtained for mathematics with a static model and with a dynamic model is larger than in the main text, and the difference between the results obtained with a model in gains and with a dynamic model is smaller than in the main text (Figure 4).

D.3

Results for an Estimation Sample with Class Size between Five and Thirty Students Instead of Ten to Thirty Students

Tables 3 to 9 and Figures 5 to 11 show the same results as in section 6 of the paper and section D.2 of this appendix for a larger estimation sample where class size is restricted to be between five and thirty students instead of being restricted to be between ten and thirty students.


References

Andrabi, T., J. Das, A. Ijaz Khwaja, and T. Zajonc (2011, July). Do Value-Added Estimates Add Value? Accounting for Learning Dynamics. American Economic Journal: Applied Economics 3 (3), 29–54.

Chamberlain, G. (1992, January). Comment: Sequential Moment Restrictions in Panel Data. Journal of Business & Economic Statistics 10 (1), 20–26.

Jenish, N. and I. R. Prucha (2009, May). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics 150 (1), 86–98.

Jenish, N. and I. R. Prucha (2012, September). On spatial processes and asymptotic inference under near-epoch dependence. Journal of Econometrics 170 (1), 178–190.

Jochmans, K. and M. Weidner (2016). Fixed-Effects Regression on Network Data. Working Paper.

Kemeny, J. G., J. L. Snell, and A. W. Knapp (1976). Denumerable Markov Chains (second ed.), Volume 40 of Graduate Texts in Mathematics. Springer New York.

Kuersteiner, G. M. and I. R. Prucha (2013, June). Limit theory for panel data models with cross sectional dependence and sequential exogeneity. Journal of Econometrics 174 (2), 107–126.

Verdier, V. (2016, January). Estimation of Dynamic Panel Data Models with Cross-Sectional Dependence: Using Cluster Dependence for Efficiency. Journal of Applied Econometrics 31 (1), 85–105.

White, H. (2001). Asymptotic theory for econometricians. San Diego: Academic Press.


Table 1: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics, model with measurement error.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.315 (0.034) | 0.324 (0.029) | 0.928 (0.001) | 0.936 (0.002) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.371 (0.181) | 0.095 (0.110) | 0.416 (0.133) | −0.012 (0.135) |
| twenty | 0.730 (0.172) | 0.468 (0.103) | 0.676 (0.124) | 0.177 (0.124) |
| fifteen | 0.931 (0.199) | 0.523 (0.120) | 0.839 (0.142) | 0.364 (0.142) |
| ten | 0.539 (0.241) | 0.058 (0.139) | 0.444 (0.169) | −0.295 (0.159) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.
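As a reading aid, an estimate and its standard error can be turned into a pointwise ninety-five percent confidence interval. The sketch below assumes a normal approximation (estimate plus or minus about 1.96 standard errors), which is an assumption made here for illustration; the helper name `pointwise_ci` is hypothetical. It uses the first Table 1 entry for a reduction to twenty students, 0.730 (0.172).

```python
from statistics import NormalDist

def pointwise_ci(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    return estimate - z * std_error, estimate + z * std_error

# First Table 1 entry for a class size reduction from thirty to twenty.
lower, upper = pointwise_ci(0.730, 0.172)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Under this approximation the interval is roughly 0.393 to 1.067 test score units, so it excludes zero.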


Table 2: Estimates of persistence and of the effect of class size reductions on student achievement in reading, model with measurement error.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.446 (0.017) | 0.476 (0.012) | 0.832 (0.001) | 0.839 (0.001) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.078 (0.175) | −0.133 (0.102) | 0.009 (0.107) | −0.266 (0.091) |
| twenty | 0.227 (0.172) | 0.075 (0.093) | 0.178 (0.101) | −0.200 (0.082) |
| fifteen | 0.302 (0.190) | 0.172 (0.107) | 0.235 (0.114) | −0.124 (0.093) |
| ten | 0.397 (0.208) | 0.106 (0.119) | 0.263 (0.124) | −0.235 (0.101) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.
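The two-way clustered standard errors mentioned in the table notes can be sketched with the standard construction: the variance clustered by student, plus the variance clustered by teacher, minus the variance clustered by student-teacher pair. The source does not spell out its implementation, so the code below is an assumption for illustration only. It applies to a plain least squares regression on simulated data, omits any small-sample correction, and all names (`cluster_sum_outer`, `two_way_cluster_cov`, `student_ids`, `teacher_ids`) are hypothetical.

```python
import numpy as np

def cluster_sum_outer(scores, ids):
    """Sum the score vectors within each cluster, then add up their outer products."""
    _, inverse = np.unique(ids, return_inverse=True)
    inverse = inverse.ravel()
    sums = np.zeros((inverse.max() + 1, scores.shape[1]))
    np.add.at(sums, inverse, scores)
    return sums.T @ sums

def two_way_cluster_cov(X, resid, student_ids, teacher_ids):
    """Two-way cluster-robust covariance of OLS coefficients (no small-sample correction)."""
    scores = X * resid[:, None]
    _, s = np.unique(student_ids, return_inverse=True)
    _, t = np.unique(teacher_ids, return_inverse=True)
    s, t = s.ravel(), t.ravel()
    pair_ids = s * (t.max() + 1) + t  # one id per (student, teacher) pair
    meat = (cluster_sum_outer(scores, s)
            + cluster_sum_outer(scores, t)
            - cluster_sum_outer(scores, pair_ids))
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread

# Simulated toy data (not the paper's data).
rng = np.random.default_rng(0)
n = 500
student_ids = rng.integers(0, 100, size=n)
teacher_ids = rng.integers(0, 20, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
cov = two_way_cluster_cov(X, resid, student_ids, teacher_ids)
```

When every observation is its own cluster in both dimensions, the three terms coincide and the formula reduces to the usual heteroskedasticity-robust covariance, which gives a simple check of the construction. The resulting matrix is not guaranteed to be positive semi-definite in finite samples.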


Figure 2: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals, model with measurement error. Panel (a): Mathematics.

Figure 3: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity, model with measurement error. Panel (a): Mathematics.

Figure 4: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence, model with measurement error. Panel (a): Mathematics.


Results when restricting estimation sample to class sizes between 5 and 30 students instead of between 10 and 30 students.

Table 3: Description of the estimation sample.

| | Mathematics | Reading |
|---|---|---|
| Students | 238,997 | 238,997 |
| Teachers | 12,569 | 13,125 |
| Years | 2009 to 2012 | 2009 to 2012 |
| Grades | 4 and 5 | 4 and 5 |
| Test score, average | 354.92 | 349.05 |
| Test score, standard dev. | 9.39 | 9.55 |
| Class size, average | 21.51 | 20.34 |
| Class size, standard dev. | 4.55 | 5.66 |
| Average test score by class size | | |
| 5-9 students | 352.88 | 347.40 |
| 10-14 students | 353.35 | 347.95 |
| 15-19 students | 353.29 | 347.51 |
| 20-24 students | 354.83 | 349.06 |
| 25-30 students | 356.75 | 351.04 |


Table 4: Observed frequencies of transition between class sizes, Mathematics.

Students.

| | 5-9 | 10-14 | 15-19 | 20-24 | 25-30 | Total |
|---|---|---|---|---|---|---|
| 5-9 | 446 | 486 | 1,282 | 2,622 | 1,507 | 6,343 |
| 10-14 | 326 | 1,878 | 3,758 | 8,005 | 5,231 | 19,198 |
| 15-19 | 688 | 2,948 | 15,422 | 24,399 | 9,870 | 53,327 |
| 20-24 | 907 | 3,643 | 14,890 | 56,979 | 30,743 | 107,162 |
| 25-30 | 354 | 1,614 | 3,711 | 18,032 | 29,256 | 52,967 |
| Total | 2,721 | 10,569 | 39,063 | 110,037 | 76,607 | 238,997 |

Teachers, 2009 to 2010, 2010 to 2011, 2011 to 2012.

| | 5-9 | 10-14 | 15-19 | 20-24 | 25-30 | Total |
|---|---|---|---|---|---|---|
| 5-9 | 63 | 46 | 92 | 117 | 59 | 377 |
| 10-14 | 36 | 162 | 279 | 338 | 164 | 979 |
| 15-19 | 46 | 195 | 1,038 | 1,294 | 387 | 2,960 |
| 20-24 | 48 | 211 | 919 | 2,774 | 1,328 | 5,280 |
| 25-30 | 26 | 73 | 210 | 1,042 | 1,244 | 2,595 |
| Total | 219 | 687 | 2,538 | 5,565 | 3,182 | 12,191 |


Table 5: Observed frequencies of transition between class sizes, Reading.

Students.

| | 5-9 | 10-14 | 15-19 | 20-24 | 25-30 | Total |
|---|---|---|---|---|---|---|
| 5-9 | 3,171 | 3,823 | 3,902 | 6,910 | 3,931 | 21,737 |
| 10-14 | 2,474 | 5,077 | 5,990 | 11,202 | 6,531 | 31,274 |
| 15-19 | 1,600 | 3,551 | 12,837 | 20,823 | 8,646 | 47,457 |
| 20-24 | 1,732 | 3,926 | 12,862 | 47,244 | 25,410 | 91,174 |
| 25-30 | 469 | 1,659 | 3,407 | 15,410 | 26,410 | 47,355 |
| Total | 9,446 | 18,036 | 38,998 | 101,589 | 70,928 | 238,997 |

Teachers, 2009 to 2010, 2010 to 2011, 2011 to 2012.

| | 5-9 | 10-14 | 15-19 | 20-24 | 25-30 | Total |
|---|---|---|---|---|---|---|
| 5-9 | 271 | 245 | 278 | 415 | 188 | 1,397 |
| 10-14 | 134 | 329 | 430 | 588 | 243 | 1,724 |
| 15-19 | 115 | 202 | 788 | 1,085 | 340 | 2,530 |
| 20-24 | 107 | 213 | 716 | 2,129 | 1,112 | 4,277 |
| 25-30 | 23 | 62 | 166 | 873 | 1,079 | 2,203 |
| Total | 650 | 1,051 | 2,378 | 5,090 | 2,962 | 12,131 |


Table 6: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.094 (0.009) | 0.109 (0.006) | 0.670 (0.001) | 0.684 (0.001) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.318 (0.141) | 0.095 (0.088) | 0.189 (0.106) | −0.314 (0.119) |
| twenty | 0.611 (0.145) | 0.396 (0.084) | 0.373 (0.103) | −0.238 (0.114) |
| fifteen | 0.742 (0.148) | 0.421 (0.086) | 0.423 (0.106) | −0.215 (0.115) |
| ten | 0.692 (0.176) | 0.289 (0.106) | 0.387 (0.132) | −0.343 (0.141) |
| five | 0.559 (0.248) | 0.721 (0.173) | 0.483 (0.191) | −0.376 (0.196) |
| Model in Gains | | | | |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.389 (0.227) | 0.078 (0.136) | 0.303 (0.119) | −0.059 (0.119) |
| twenty | 0.777 (0.226) | 0.552 (0.132) | 0.632 (0.116) | 0.298 (0.114) |
| fifteen | 1.000 (0.237) | 0.609 (0.135) | 0.743 (0.119) | 0.369 (0.115) |
| ten | 0.994 (0.294) | 0.385 (0.166) | 0.660 (0.148) | 0.146 (0.142) |
| five | 0.799 (0.407) | 0.960 (0.280) | 0.680 (0.209) | 0.327 (0.203) |
| Static Model | | | | |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.310 (0.136) | 0.097 (0.085) | −0.149 (0.228) | −1.610 (0.280) |
| twenty | 0.593 (0.141) | 0.377 (0.081) | −0.671 (0.221) | −3.312 (0.269) |
| fifteen | 0.715 (0.143) | 0.398 (0.083) | −0.985 (0.228) | −3.704 (0.275) |
| ten | 0.661 (0.169) | 0.278 (0.102) | −0.861 (0.276) | −3.275 (0.339) |
| five | 0.534 (0.238) | 0.692 (0.165) | −0.425 (0.369) | −4.402 (0.498) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.

Table 7: Estimates of persistence and of the effect of class size reductions on student achievement in reading.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.225 (0.006) | 0.256 (0.004) | 0.690 (0.001) | 0.719 (0.001) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.111 (0.139) | −0.129 (0.078) | 0.185 (0.134) | −0.381 (0.127) |
| twenty | 0.215 (0.144) | 0.040 (0.076) | 0.194 (0.139) | −0.582 (0.130) |
| fifteen | 0.292 (0.142) | 0.110 (0.075) | 0.195 (0.132) | −0.531 (0.124) |
| ten | 0.295 (0.161) | 0.031 (0.089) | 0.210 (0.143) | −0.384 (0.134) |
| five | 0.147 (0.170) | 0.102 (0.105) | 0.115 (0.149) | −0.529 (0.138) |
| Model in Gains | | | | |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.089 (0.215) | −0.249 (0.122) | 0.139 (0.109) | −0.084 (0.088) |
| twenty | 0.161 (0.223) | −0.081 (0.119) | 0.376 (0.108) | 0.342 (0.083) |
| fifteen | 0.293 (0.219) | 0.035 (0.116) | 0.501 (0.106) | 0.448 (0.080) |
| ten | 0.368 (0.249) | 0.008 (0.139) | 0.475 (0.123) | 0.252 (0.097) |
| five | 0.071 (0.278) | 0.121 (0.162) | 0.424 (0.135) | 0.613 (0.108) |
| Static Model | | | | |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.117 (0.128) | −0.088 (0.070) | −0.276 (0.211) | −1.455 (0.243) |
| twenty | 0.230 (0.132) | 0.082 (0.068) | −0.769 (0.211) | −3.328 (0.238) |
| fifteen | 0.291 (0.130) | 0.136 (0.066) | −0.993 (0.203) | −3.558 (0.229) |
| ten | 0.273 (0.148) | 0.039 (0.079) | −0.944 (0.229) | −2.716 (0.269) |
| five | 0.169 (0.152) | 0.095 (0.094) | −1.091 (0.239) | −3.996 (0.284) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.

Figure 5: Average test score by class size. Panel (a): Mathematics.

Figure 6: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals. Panel (a): Mathematics.

Figure 7: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity. Panel (a): Mathematics.

Figure 8: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence. Panel (a): Mathematics.

Table 8: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics, model with measurement error.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.297 (0.033) | 0.300 (0.027) | 0.927 (0.001) | 0.935 (0.002) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.333 (0.154) | 0.091 (0.096) | 0.270 (0.113) | −0.160 (0.118) |
| twenty | 0.648 (0.158) | 0.429 (0.092) | 0.538 (0.110) | 0.065 (0.113) |
| fifteen | 0.800 (0.163) | 0.461 (0.094) | 0.617 (0.113) | 0.105 (0.114) |
| ten | 0.760 (0.195) | 0.310 (0.116) | 0.550 (0.141) | −0.075 (0.140) |
| five | 0.613 (0.275) | 0.772 (0.191) | 0.600 (0.202) | 0.021 (0.199) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.


Table 9: Estimates of persistence and of the effect of class size reductions on student achievement in reading, model with measurement error.

| | Three-way unobserved heterogeneity | Only student unobserved heterogeneity | No student unobserved heterogeneity | No unobserved heterogeneity |
|---|---|---|---|---|
| Dynamic Model | | | | |
| Persistence | 0.442 (0.015) | 0.472 (0.011) | 0.832 (0.001) | 0.838 (0.001) |
| Class size reduction from thirty to: | | | | |
| twenty-five | 0.105 (0.156) | −0.164 (0.089) | 0.069 (0.093) | −0.306 (0.079) |
| twenty | 0.200 (0.161) | 0.005 (0.086) | 0.183 (0.092) | −0.252 (0.076) |
| fifteen | 0.292 (0.159) | 0.088 (0.085) | 0.250 (0.090) | −0.200 (0.073) |
| ten | 0.315 (0.181) | 0.024 (0.101) | 0.236 (0.106) | −0.228 (0.087) |
| five | 0.126 (0.195) | 0.108 (0.119) | 0.169 (0.117) | −0.132 (0.096) |

Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.


Figure 9: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals, model with measurement error. Panel (a): Mathematics.

Figure 10: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity, model with measurement error. Panel (a): Mathematics.

Figure 11: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence, model with measurement error. Panel (a): Mathematics.

