Estimation and Inference for Linear Models with Two-Way Fixed Effects and Sparsely Matched Data
Appendix and Supplementary Material Valentin Verdier∗ April 21, 2017
∗ Assistant Professor, Department of Economics, University of North Carolina, Chapel Hill, NC 27599, United States. Tel.: +1 9199663962. Email address:
[email protected]
To simplify notation, I will use $C$ to denote a generic constant such that $C < \infty$ throughout this appendix, so that if a quantity $a_n$ is less than some arbitrary large constant for all $n$, I will always write $a_n \le C$ $\forall\, n$, independently of what the arbitrary large constant actually is. Similarly, $c$ will denote a generic constant such that $c > 0$.
A
Proof of Lemma 1 and of its Corollaries
I provide proofs of Lemma 1 and of its corollaries before providing proofs of Propositions 1-3, as the second result in Corollary 2 is used in the proofs of Propositions 1 and 2.
A.1
Proof of Lemma 1
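By the Frisch-Waugh theorem, $M_{g_n} = M_{g_{n,1}} - P_{M_{g_{n,1}} g_{n,2}}$ where $g_{n,1} = [1[i = j]]_{i \in \mathcal{N}, t \in \mathcal{T}}^{j \in \mathcal{N}}$, $g_{n,2} = [1[d_{it} = d]]_{i \in \mathcal{N}, t \in \mathcal{T}}^{d \in \mathcal{D}_n}$, so that $g_n = [g_{n,1}, g_{n,2}]$, and $P_{M_{g_{n,1}} g_{n,2}} = M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}$.
We have:
$$M_{g_{n,1}} g_{n,2} = \Big[1[d_{it} = d] - \frac{1}{T} \sum_{s \in \mathcal{T}} 1[d_{is} = d]\Big]_{i \in \mathcal{N}, t \in \mathcal{T}}^{d \in \mathcal{D}_n} \tag{A.1}$$
so that:
$$g_{n,2}' M_{g_{n,1}} g_{n,2} = \mathrm{diag}\Big(\Big[\sum_{i,t} 1[d_{it} = d]\Big]_{d \in \mathcal{D}_n}\Big) - \Big[\frac{1}{T} \sum_{i,s,t} 1[d_{it} = d, d_{is} = d']\Big]_{d \in \mathcal{D}_n}^{d' \in \mathcal{D}_n} \tag{A.2}$$
Define $N_{n,d} = \sum_{i,t} 1[d_{it} = d]$ and $N_{n,dd'} = \frac{1}{T} \sum_{i,s,t} 1[d_{it} = d, d_{is} = d']$ so that:
$$g_{n,2}' M_{g_{n,1}} g_{n,2} = \mathrm{diag}([N_{n,d}]_{d \in \mathcal{D}_n}) - [N_{n,dd'}]_{d \in \mathcal{D}_n}^{d' \in \mathcal{D}_n} \tag{A.3}$$
For any value of $n$, consider the weighted undirected graph of teachers:
$$G_n = \{\mathcal{D}_n, [N_{n,dd'}]_{d \in \mathcal{D}_n}^{d' \in \mathcal{D}_n}\} \tag{A.4}$$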
Note that if this graph is composed of several connected components, then $g_{n,2}' M_{g_{n,1}} g_{n,2}$ would be block diagonal (after permutation), with each block corresponding to one of these components, and so would $P_{M_{g_{n,1}} g_{n,2}}$, with each block corresponding to the values of $\{i, t\} \in \mathcal{N} \times \mathcal{T}$ that are associated with each of these components. Therefore we can consider each connected component of $G_n$ separately. Without loss of generality we consider the case where $G_n$ is fully connected, i.e. such that any pair of nodes $d, d' \in \mathcal{D}_n$ is connected by a path in $G_n$.
Define the matrices $D_n = \mathrm{diag}(\{N_{n,d}\}_{d \in \mathcal{D}_n})$ and $P_n = D_n^{-1} [N_{n,dd'}]_{d \in \mathcal{D}_n}^{d' \in \mathcal{D}_n}$. Note that $N_{n,dd'} \ge 0$ and $\sum_{d' \in \mathcal{D}_n} N_{n,dd'} = N_{n,d}$ since $\sum_{d' \in \mathcal{D}_n} 1[d_{it} = d'] = 1$. Therefore $P_n$ is a valid probability transition matrix that can be used to describe a Markov chain on $\mathcal{D}_n$. The graph corresponding to the transition matrix $P_n$, $G_n$, is fully connected, and the diagonal elements of $P_n$, $\frac{N_{n,dd}}{N_{n,d}}$, are strictly positive, so that $P_n$ is noncyclic ergodic.
Hence by Theorem 9-49 of Kemeny et al. (1976), $P_n$ describes a normal Markov chain, so that $\mu' \sum_{r=0}^{\infty} P_n^r$ and $\sum_{r=0}^{\infty} P_n^r f$ exist for any vectors $\mu$ and $f$ with $\mu' j = 0$ and $\alpha' f = 0$, where $j$ is a $N_n \times 1$ vector of ones and $\alpha = [N_{n,d}]_{d \in \mathcal{D}_n}$.
One can easily verify that $M_{g_{n,1}} g_{n,2}\, j = 0$ and $\alpha' D_n^{-1} g_{n,2}' M_{g_{n,1}} = 0$. Hence $M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r$ and $\sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}}$ exist.
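In addition:
$$M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} = M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r (I_{N_n} - P_n)(g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} \tag{A.5}$$
$$= M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} \tag{A.6}$$
$$= M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} \tag{A.7}$$
Define the stochastic process $\xi_\tau^{(n)}$ as a Markov chain on $\mathcal{D}_n$ with transition probability matrix $P_n$. Define ${}^{d''}G_{dd'}^{(n)} = E_d(\text{number of visits of } \xi_\tau^{(n)} \text{ to } d' \text{ before reaching } d'')$, where $E_d(\cdot)$ denotes an expected value conditional on being in the initial state $d$.
Note that $M_{g_{n,1}} g_{n,2} = \big[\frac{1}{T} \sum_{s \in \mathcal{T}} (1[d_{it} = d] - 1[d_{is} = d])\big]_{i \in \mathcal{N}, t \in \mathcal{T}}^{d \in \mathcal{D}_n}$ and $D_n^{-1} g_{n,2}' M_{g_{n,1}} = \big[\frac{1}{N_{n,d}} \frac{1}{T} \sum_{s \in \mathcal{T}} (1[d_{it} = d] - 1[d_{is} = d])\big]_{d \in \mathcal{D}_n}^{i \in \mathcal{N}, t \in \mathcal{T}}$.
Hence, using Proposition 9-12 of Kemeny et al. (1976):
$$M_{g_{n,1}} g_{n,2} \sum_{r=0}^{\infty} P_n^r D_n^{-1} g_{n,2}' M_{g_{n,1}} = \Big[\frac{1}{T^2} \sum_{s,s' \in \mathcal{T}} \frac{1}{N_{n,d_{it}}} \big({}^{d_{is}}G_{d_{i't'} d_{it}}^{(n)} - {}^{d_{is}}G_{d_{i's'} d_{it}}^{(n)}\big)\Big]_{i \in \mathcal{N}, t \in \mathcal{T}}^{i' \in \mathcal{N}, t' \in \mathcal{T}} \tag{A.8}$$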
This completes the proof of Lemma 1.
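The objects used in this proof can be illustrated on a toy assignment. The sketch below is only an illustration under stated assumptions: the function name `transition_matrix` and the example assignment are hypothetical and do not come from the paper. It builds $N_{n,d}$, $N_{n,dd'}$ and $P_n = D_n^{-1}[N_{n,dd'}]$ from a map of units to their teachers in each period, using exact rational arithmetic.

```python
from collections import defaultdict
from fractions import Fraction


def transition_matrix(assign, T):
    """Build N_{n,d}, N_{n,dd'} and P_n from a toy assignment.

    `assign` maps each unit i to the list [d_i1, ..., d_iT] of teachers
    it is matched to (a hypothetical input format chosen for this sketch).
    """
    N_d = defaultdict(int)        # N_{n,d}   = sum_{i,t} 1[d_it = d]
    N_dd = defaultdict(Fraction)  # N_{n,dd'} = (1/T) sum_{i,s,t} 1[d_it = d, d_is = d']
    for teachers in assign.values():
        for d in teachers:            # index t
            N_d[d] += 1
            for d2 in teachers:       # index s
                N_dd[(d, d2)] += Fraction(1, T)
    # P_n = D_n^{-1} [N_{n,dd'}]: divide each row d by N_{n,d}
    P = {(d, d2): w / N_d[d] for (d, d2), w in N_dd.items()}
    return dict(N_d), dict(N_dd), P


# Toy example with T = 2: unit 1 moves from teacher 1 to 2, unit 2 stays with 1,
# unit 3 moves from teacher 2 to 3.
N_d, N_dd, P = transition_matrix({1: [1, 2], 2: [1, 1], 3: [2, 3]}, T=2)
row_sums = {d: sum(p for (a, _), p in P.items() if a == d) for d in N_d}
```

In this example every row of `P` sums to one and every diagonal entry is strictly positive, which are the two properties the proof uses to treat $P_n$ as the transition matrix of a noncyclic ergodic chain.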
A.2
Proof of Corollary 1
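From Lemma 1, we have:
$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}} \big({}^{d_{is}}G_{d_{it} d_{it}}^{(n)} - {}^{d_{is}}G_{d_{is'} d_{it}}^{(n)}\big) \tag{A.9}$$
Since $\xi_\tau^{(n)}$ is a Markov chain, we have ${}^{d''}G_{dd'}^{(n)} = {}^{d''}H_{dd'}^{(n)}\, {}^{d''}G_{d'd'}^{(n)}$. Therefore:
$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G_{d_{it} d_{it}}^{(n)} \big(1 - {}^{d_{is}}H_{d_{is'} d_{it}}^{(n)}\big)$$
$$= 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s'} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G_{d_{it} d_{it}}^{(n)}\, {}^{d_{it}}H_{d_{is'} d_{is}}^{(n)}$$
$$= 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s' : d_{is} \ne d_{it},\, d_{is'} \ne d_{it}} \frac{1}{N_{n,d_{it}}}\, {}^{d_{is}}G_{d_{it} d_{it}}^{(n)}\, {}^{d_{it}}H_{d_{is'} d_{is}}^{(n)}$$
Let ${}^{d''}\bar{H}_{dd'}^{(n)} = P_d(\xi_\tau^{(n)} \text{ reaches } d' \text{ before } d'' \text{ for } \tau > 0)$. Note that ${}^{d_{is}}\bar{H}_{d_{it} d_{it}}^{(n)} < 1$ since $N_{n,d_{it} d_{is}} > 0$. Since $\xi_\tau^{(n)}$ is a Markov chain:
$${}^{d_{is}}G_{d_{it} d_{it}}^{(n)} = \sum_{r=1}^{\infty} r \big({}^{d_{is}}\bar{H}_{d_{it} d_{it}}^{(n)}\big)^{r-1} \big(1 - {}^{d_{is}}\bar{H}_{d_{it} d_{it}}^{(n)}\big) = \frac{1}{1 - {}^{d_{is}}\bar{H}_{d_{it} d_{it}}^{(n)}} = \frac{1}{{}^{d_{it}}\bar{H}_{d_{it} d_{is}}^{(n)}}$$
We also have ${}^{d''}\bar{H}_{dd'}^{(n)} = \sum_{\delta \in \mathcal{D}_n} \frac{N_{n,d\delta}}{N_{n,d}}\, {}^{d''}H_{\delta d'}^{(n)}$, so that:
$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T^2} \sum_{s,s' : d_{is} \ne d_{it},\, d_{is'} \ne d_{it}} \frac{{}^{d_{it}}H_{d_{is'} d_{is}}^{(n)}}{\sum_{d \in \mathcal{D}_n} N_{n,d_{it} d}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}$$
which completes the proof of the first result in Corollary 1.
For the second result in Corollary 1, we can rewrite:
$$\sum_{s' \in \mathcal{T} : d_{is'} \ne d_{it}} {}^{d_{it}}H_{d_{is'} d_{is}}^{(n)} = \sum_{d \in \mathcal{D}_n : d \ne d_{it}} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)} \tag{A.10}$$
and $T N_{n,dd'} = \sum_{j \in \mathcal{N}} N_{n,d}^{(j)} N_{n,d'}^{(j)}$, so that:
$$M_{g_n}[\{i,t\},\{i,t\}] = 1 - \frac{1}{T} - \frac{1}{T} \sum_{s : d_{is} \ne d_{it}} \frac{\sum_{d \in \mathcal{D}_n : d \ne d_{it}} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N_{n,d_{it}}^{(j)} N_{n,d}^{(j)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}$$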
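Case 1: $N_{n,d_{it}}^{(i)} \ge 3$
Since $N_{n,d_{it}}^{(j)} N_{n,d}^{(j)}\, {}^{d_{it}}H_{d d_{is}}^{(n)} \ge 0$:^1
$$M_{g_n}[\{i,t\},\{i,t\}] \ge 1 - \frac{1}{T} - \frac{1}{T} \sum_{s : d_{is} \ne d_{it}} \frac{\sum_{d \in \mathcal{D}_n : d \ne d_{it}} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}{\sum_{d \in \mathcal{D}_n : d \ne d_{it}} N_{n,d_{it}}^{(i)} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}$$
$$= 1 - \frac{1}{T} - \frac{1}{T} \frac{T - N_{n,d_{it}}^{(i)}}{N_{n,d_{it}}^{(i)}} = \frac{N_{n,d_{it}}^{(i)} - 1}{N_{n,d_{it}}^{(i)}} \ge \frac{1}{2} + c$$
^1 As noted above, I use $c$ as a generic small constant.
Case 2: $N_{n,d_{it}}^{(i)} = 2$
$\frac{N_{d_{is}}^{(i)}}{\sum_{j \in \mathcal{N}} N_{d_{it}}^{(j)} N_{d_{is}}^{(j)}} < \frac{1}{2}$ for some $s \in \mathcal{T}$ implies that $\exists\, j \in \mathcal{N}$, $j \ne i$ such that $N_{n,d_{it}}^{(j)} N_{n,d_{is}}^{(j)} \ge 1$.
Therefore we have:
$$\frac{\sum_{d \in \mathcal{D}_n} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N_{n,d_{it}}^{(j)} N_{n,d}^{(j)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}} \le \frac{\sum_{d \in \mathcal{D}_n} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)}}{\sum_{d \in \mathcal{D}_n} N_{n,d_{it}}^{(i)} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)} + 1}$$
$$= \frac{1}{N_{n,d_{it}}^{(i)}} \Bigg(1 - \frac{\frac{1}{N_{n,d_{it}}^{(i)}}}{\sum_{d \in \mathcal{D}_n} N_{n,d}^{(i)}\, {}^{d_{it}}H_{d d_{is}}^{(n)} + \frac{1}{N_{n,d_{it}}^{(i)}}}\Bigg) \le \frac{1}{N_{n,d_{it}}^{(i)}} \Bigg(1 - \frac{\frac{1}{N_{n,d_{it}}^{(i)}}}{T + \frac{1}{N_{n,d_{it}}^{(i)}}}\Bigg) = \frac{1}{2} \frac{2T}{2T + 1}$$
so that:
$$M_{g_n}[\{i,t\},\{i,t\}] \ge 1 - \frac{1}{T} - \frac{T - 3}{T} \frac{1}{2} - \frac{1}{T} \frac{1}{2} \frac{2T}{2T + 1} \ge \frac{1}{2} + c$$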
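The arithmetic behind the lower bounds in Cases 1 and 2 can be checked exactly. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, and the Case 2 expression is read as $1 - \frac{1}{T} - \frac{T-3}{T}\frac{1}{2} - \frac{1}{T}\frac{1}{2}\frac{2T}{2T+1}$, which simplifies to $\frac{1}{2} + \frac{1}{2T(2T+1)}$.

```python
from fractions import Fraction


def case1_bound(N_i):
    """Case 1 lower bound (N - 1) / N, where N = N^{(i)}_{n,d_it} >= 3."""
    return 1 - Fraction(1, N_i)


def case2_bound(T):
    """Case 2 lower bound with N^{(i)}_{n,d_it} = 2, as read from the text above."""
    T = Fraction(T)
    half = Fraction(1, 2)
    return 1 - 1 / T - (T - 3) / T * half - (1 / T) * half * (2 * T / (2 * T + 1))


# Both bounds stay strictly above 1/2 for any fixed T, which is what "1/2 + c" encodes.
case2_values = {T: case2_bound(T) for T in range(3, 51)}
```

For fixed $T$ the gap above $\frac{1}{2}$ is a positive constant, so it can be absorbed into the generic constant $c$.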
(i)
Case 3: Nn,dit = 1 (i)
Let s1 ∈ T be such that 1 T −2 2 T −1
2T
(j)
P
(j)
Nn,d
is1 (j) (j) j∈N Ndit Ndis 1
(i)
(j)
(i)
1 T −2 2 T −1 .
<
P
1
(j)
Ndit Ndis − c since Nn,dis , Ndit , Ndis are discrete and either 1 (i) (j) (j) 1 T −2 P or Nn,dis ≤ 2 T −1 j∈N Ndit Ndis − 1. j∈N
T −1 T −2
≤
Note that this implies Nn,dis (j)
P
j∈N
(j)
Ndit Ndis ≤ 1
1
1
(i)
Let κn,it = maxs∈T By convention let P
(j)
j∈N
(j)
Nn,d
is P (j) (j) j∈N Ndit Ndis
0 0
.
≡ 0. ∀ s ∈ T s.t. dis 6= dit , since
P
j∈N
(j)
(j)
(i)
Nn,dit Nn,d > 0 if Nn,d > 0 and
(n)
Nn,dit Nn,d dit Hddis ≥ 0:
(i) dit (n) d∈Dn Nn,d Hddis P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis
P
P
=
≤
d∈Dn
P
j∈N
(j)
(i)
Nn,d
(j)
Nn,dit Nn,d P
(j)
j∈N
(j)
Nn,d Nn,d it
(j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis P P (j) (j) dit (n) Hddis d∈Dn j∈N Nn,dit Nn,d κn,it P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis
P
= κn,it
P
dit H (n) ddis
2
Let K = 2 (TT−1) −2 + 1. If
P
P P
d∈Dn
(j)
P
d∈Dn
j∈N
(j)
(n)
Nn,dit Nn,d dit Hddis ≥ K, we also have: 1
(i)
(n)
Nn,d dit Hddis
d∈Dn
1
T −1 K
≤
(j) (j) dit (n) j∈N Nn,dit Nn,d Hddis1
P
(A.11)
so that:
Mgn [{i, t}, {i, t}] ≥ 1 − ≥
If
P
d∈Dn
(j)
d∈Dn
1 +c 2
(n)
j∈N
Nn,dit Nn,d dit Hddis ≤ K, we have:
d∈Dn
Nn,d dit Hddis
P
1
(i)
P P
(j)
1 T −2 1 T −1 − κn,it − T T T K
(n)
1
(j) (j) dit (n) j∈N Nn,dit Nn,d Hddis1
P
P (i) (i) (n) Nn,dis + d∈Dn :d6=dis Nn,d dit Hddis 1 1 1 = P P (j) (j) dit (n) d∈Dn j∈N Nn,dit Nn,d Hddis 1
1
≤P
d∈Dn
(j)
P
j∈N
(j)
(n)
Nn,dit Nn,d dit Hddis
×
1
1 T − 2 X (j) (j) ( Nn,dit Nn,dis − c+ 1 2T −1 j∈N
X
κn,it
d∈Dn :d6=dis1
≤
X
(j)
(j)
1
j∈N
1T −2 − 2T −1 P
d∈Dn
c P
j∈N
so that:
≥
1 T −1 1 c − κn,it + T T TK
1 +c 2
which concludes the proof of the second result of Corollary 1.
(j)
(j)
(n)
Nn,dit Nn,d dit Hddis
1T −2 c ≤ − 2T −1 K
Mgn [{i, t}, {i, t}] ≥ 1 −
(n)
Nn,dit Nn,d dit Hddis )
1
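For the third result of Corollary 1, we can write:
$$\frac{1}{2} \Delta g_{n,2}' \Delta g_{n,2} = g_{n,2}' M_{g_{n,1}} g_{n,2} \tag{A.12}$$
so that:
$$\Delta g_{n,2} (\Delta g_{n,2}' \Delta g_{n,2})^{-} \Delta g_{n,2}' = \frac{1}{2} \Delta g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} \Delta g_{n,2}' \tag{A.13}$$
By the results in the proof of Lemma 1, we have:
$$\Delta g_{n,2} (\Delta g_{n,2}' \Delta g_{n,2})^{-} \Delta g_{n,2}' = \Big[\frac{1}{2} \frac{1}{N_{n,d_{i1}}} \big({}^{d_{i2}}G_{d_{i'1} d_{i1}}^{(n)} - {}^{d_{i2}}G_{d_{i'2} d_{i1}}^{(n)}\big)\Big]_{i \in \mathcal{N}}^{i' \in \mathcal{N}} \tag{A.14}$$
which concludes the proof for the first part of this result. In addition the above implies:
$$M_{\Delta g_{n,2}}[i,i] = 1 - \frac{1}{2} \frac{1}{N_{n,d_{i1}}}\, {}^{d_{i2}}G_{d_{i1} d_{i1}}^{(n)} \tag{A.15}$$
Note that ${}^{d}G_{dd}^{(n)} = 0$, so that $M_{\Delta g_{n,2}}[i,i] = 1 > \frac{1}{2}$ if $N_{d_{i1}}^{(i)} = 2$.
Otherwise, as in the proof of the first result of Corollary 1, we have:
$$M_{\Delta g_{n,2}}[i,i] = 1 - \frac{1}{2} \frac{1}{\sum_{d \in \mathcal{D}_n} N_{n,d_{i1} d}\, {}^{d_{i1}}H_{d d_{i2}}^{(n)}}$$
$$= 1 - \frac{1}{\sum_{d \in \mathcal{D}_n} \sum_{j \in \mathcal{N}} N_{d_{i1}}^{(j)} N_{d}^{(j)}\, {}^{d_{i1}}H_{d d_{i2}}^{(n)}}$$
$$\ge 1 - \frac{1}{\sum_{j \in \mathcal{N}} N_{d_{i1}}^{(j)} N_{d_{i2}}^{(j)}}$$
A.3
Proof of Corollary 2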
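For the first result of Corollary 2, note that
$$\|M_{g_n}\|_\infty \ge \|M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}\|_\infty - \|M_{g_{n,1}}\|_\infty = \|M_{g_{n,1}} g_{n,2} (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}}\|_\infty - 2\frac{T - 1}{T}$$
As in the proof of Corollary 1, ${}^{d''}G_{dd'}^{(n)} = {}^{d''}H_{dd'}^{(n)}\, {}^{d''}G_{d'd'}^{(n)}$. In addition, ${}^{d''}G_{d'd'}^{(n)} \ge 1$ for $d'' \ne d'$ and $\frac{1}{N_{n,d}} \ge \frac{1}{C}$.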
Therefore for T = 2 and s 6= t, t 6= s : 0
0
Mgn,1 gn,2 (gn,2 Mgn,1 gn,2 )− gn,2 Mgn,1 ∞ X
= maxi∈N ,t∈T

i0 ∈N ,t0 ∈T
1 dis (n) dis (n) (n) Gdit dit ( Hd 0 0 dit − dis Hd 0 0 dit ) Nn,dit i s i t
0
≥
=
1 maxi∈N ,t∈T C
0
(n)
(n)
− dis Hd 0
(A.17)

(A.18)
1[di0 t0 = d3 , di0 s0 = d4 ]dis Hd3 dit − dis Hd4 dit 
X 0
0 dit
(n)
(n)
dis Hd 0
0d i t it
0
i s
− dis Hd 0
i s
0 dit
i ∈N ,t ∈T (n)
(n)
(A.19)
(n)
(n)
(A.20)
(n)
(A.21)
0
d3 ,d4 ∈Dn i ∈N ,t ∈T
≥
=
(n)
Gdit dit dis Hd 0
0d i t it
Nn,dit
1 maxi∈N ,t∈T C
X
dis

0
i ∈N ,t ∈T
X
1
X
= maxi∈N ,t∈T
(A.16)
1 maxi∈N ,t∈T C
X
dis Hd3 dit − dis Hd4 dit 
d3 ,d4 ∈Dn :Nn,d3 ,d4 >0
1 maxd1 ,d2 ∈Dn :Nn,d1 ,d2 >0 C
X
(n)
d2 Hd3 d1 − d2 Hd4 d1 
d3 ,d4 ∈Dn :Nn,d3 ,d4 >0
With this inequality established, we can describe the assignment sequence which provides the example needed to prove the first result of Corollary 2. Let $T = 2$ and for $K = 1, 2, \ldots$, let $n = 2^{K+2} - 3$ and $N_n = 3 \cdot 2^K - 2$.
For $t = 1$, assign $i = 1, 2, 3$ to $d_{it} = 1$, assign $i = 4, 5$ to $d_{it} = 2$, $i = 6, 7$ to $d_{it} = 3$, and so on up to $d_{it} = 2^K - 1$. Assign $i = n, n - 1$ to $d_{it} = N_n$, $i = n - 2, n - 3$ to $d_{it} = N_n - 1$, and so on up to $d_{it} = N_n - (2^K - 2)$.
For $t = 2$, assign $i = 1$ to $d_{it} = N_n$, assign $i = 2$ to $d_{it} = 2$, $i = 3$ to $d_{it} = 3$, $i = 4$ to $d_{it} = 4$, $i = 5$ to $d_{it} = 5$ and so on up to $d_{it} = 2^{K+1} - 1$. Assign $i = n$ to $d_{it} = N_n - 1$, $i = n - 1$ to $d_{it} = N_n - 2$, $i = n - 2$ to $d_{it} = N_n - 3$, and so on up to $d_{it} = 2^K$.
Figure 1 below represents $G_n$ for several values of $n$. We can assign each value of $d$ to a "level" of $G_n$, $d = 1$ to $k_d = 1$, $d = 2, 3$ to $k_d = 2$, $d = 4, 5, 6, 7$ to $k_d = 3$, and so on up to $d = 2^K, \ldots, 2^{K+1} - 1$ to $k_d = K + 1$. Then we can assign $d = N_n$ to $k_d = 2K + 1$, $d = N_n - 1, N_n - 2$ to $k_d = 2K$, $d = N_n - 3, N_n - 4, N_n - 5, N_n - 6$ to $k_d = 2K - 1$, and so on up to $d = N_n - 3 \cdot 2^{K-1} + 2, \ldots, N_n - (2^K - 2)$ to $k_d = K + 2$.
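The assignment sequence described above can be generated mechanically, which makes it easy to inspect the resulting graph for small $K$. The following is a minimal sketch under stated assumptions: `build_assignment` and `teacher_edges` are hypothetical helper names, and the code only encodes the assignment rules as read from the text ($T = 2$, $n = 2^{K+2} - 3$, $N_n = 3 \cdot 2^K - 2$).

```python
from collections import Counter


def build_assignment(K):
    """Return (n, N_n, d) where d[(i, t)] is the teacher of unit i in period t."""
    n = 2 ** (K + 2) - 3
    N = 3 * 2 ** K - 2
    d = {}
    # t = 1: units 1, 2, 3 go to teacher 1, then pairs (4, 5) -> 2, (6, 7) -> 3, ...
    for i in (1, 2, 3):
        d[(i, 1)] = 1
    for teacher in range(2, 2 ** K):              # teachers 2, ..., 2^K - 1
        d[(2 * teacher, 1)] = teacher
        d[(2 * teacher + 1, 1)] = teacher
    # t = 1: pairs (n, n - 1) -> N_n, (n - 2, n - 3) -> N_n - 1, ...
    for m in range(2 ** K - 1):                   # down to teacher N_n - (2^K - 2)
        d[(n - 2 * m, 1)] = N - m
        d[(n - 2 * m - 1, 1)] = N - m
    # t = 2: unit 1 -> N_n, unit i -> teacher i for i = 2, ..., 2^(K+1) - 1
    d[(1, 2)] = N
    for i in range(2, 2 ** (K + 1)):
        d[(i, 2)] = i
    # t = 2: unit n -> N_n - 1, unit n - 1 -> N_n - 2, ..., down to teacher 2^K
    for m in range(2 ** (K + 1) - 2):
        d[(n - m, 2)] = N - 1 - m
    return n, N, d


def teacher_edges(n, d):
    """Edges of G_n between distinct teachers: one edge per unit that moves."""
    return {tuple(sorted((d[(i, 1)], d[(i, 2)]))) for i in range(1, n + 1)
            if d[(i, 1)] != d[(i, 2)]}


n, N, d = build_assignment(2)
edges = teacher_edges(n, d)
load = Counter(d.values())   # N_{n,d}: number of unit-period matches per teacher
```

For $K = 2$ this yields $n = 13$ units and $N_n = 10$ teachers, with no teacher matched to more than three unit-periods, so the data remain sparsely matched as $K$ grows.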
Figure 1: plots of $G_n$ for different values of $n$. (a) $n = 13$, $K = 2$; (b) $n = 29$, $K = 3$.
By symmetry, ${}^{N_n}H_{d1}^{(n)} = {}^{N_n}H_{d'1}^{(n)}$ $\forall\, d, d'$ with $k_d = k_{d'}$; define $x_k = {}^{N_n}H_{d1}^{(n)}$ for some $d$ with $k_d = k$. We have $x_1 = 1$. By symmetry we also have $x_{K+1} = \frac{1}{2}$.
Since $\xi_\tau^{(n)}$ is a first-order Markov chain, we also have:
$$x_k = \frac{1}{3} x_{k-1} + \frac{2}{3} x_{k+1} \quad \forall\, k = 2, \ldots, K \tag{A.22}$$
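This system of equations is solved by:
$$x_k = \frac{1}{2^k - 1} + 2\, \frac{2^{k-1} - 1}{2^k - 1}\, x_{k+1} \quad \forall\, k = 1, \ldots, K \tag{A.23}$$
$$x_{K+1} = \frac{1}{2} \tag{A.24}$$
so that:
$$2^K (x_K - x_{K+1}) = 2^K \Big(\frac{1}{2} \frac{2^K}{2^K - 1} - \frac{1}{2}\Big) = \frac{2^{K-1}}{2^K - 1}$$
and $\forall\, k = 1, \ldots, K - 1$:
$$|x_k - x_{k+1}| = x_k - x_{k+1} \tag{A.25}$$
$$= \frac{1}{2^k - 1} - \frac{1}{2^k - 1}\, x_{k+1} \tag{A.26}$$
which leads to:
$$2^k |x_k - x_{k+1}| = \frac{2^k}{2^k - 1} - \frac{2^k}{2^k - 1}\, x_{k+1} \tag{A.27}$$
$$= \frac{2^k}{2^k - 1} - \frac{2^k}{2^k - 1} \frac{1}{2^{k+1} - 1} - 2\, \frac{2^k}{2^k - 1} \frac{2^k - 1}{2^{k+1} - 1}\, x_{k+2} \tag{A.28}$$
$$= \frac{2^{k+1}}{2^{k+1} - 1} - \frac{2^{k+1}}{2^{k+1} - 1}\, x_{k+2} \tag{A.29}$$
$$= 2^{k+1} |x_{k+1} - x_{k+2}| \tag{A.30}$$
Hence:
$$\sum_{k=1}^{K} 2^k |x_k - x_{k+1}| = K 2^K |x_K - x_{K+1}| \tag{A.31}$$
$$= K\, \frac{2^{K-1}}{2^K - 1} \tag{A.32}$$
Hence, for $i = 1$ and $t = 1$ or $t = 2$, $s \ne t$:
$$\sum_{d,d' \in \mathcal{D}_n : N_{n,dd'} > 0} \big|{}^{d_{is}}H_{d d_{it}}^{(n)} - {}^{d_{is}}H_{d' d_{it}}^{(n)}\big| = 2 \sum_{k=1}^{K} 2^k |x_k - x_{k+1}| \tag{A.33}$$
$$= K\, \frac{2^K}{2^K - 1} \tag{A.34}$$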
which grows unboundedly as $K \to \infty$, i.e. as $n \to \infty$. This establishes the first result of Corollary 2.
The second result of Corollary 2 follows directly from Lemma 1 when replacing $\mathcal{D}_n$ with a set of finitely bounded cardinality, $\mathcal{D}_n^{(I_n)} = \{d \in \mathcal{D}_n : d_{it} = d, i \in I_n, t \in \mathcal{T}\}$ and when $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, so that $\frac{N_{n,dd'}}{N_{n,d}} \ge \frac{1}{TC}$ $\forall\, d, d' \in \mathcal{D}_n$ s.t. $N_{n,dd'} > 0$ since, as in the proof of Lemma 1, we can treat $G_n^{(I_n)} = \{\mathcal{D}_n^{(I_n)}, [N_{n,dd'}]_{d \in \mathcal{D}_n^{(I_n)}}^{d' \in \mathcal{D}_n^{(I_n)}}\}$ as fully connected without loss of generality and $E_d(\text{number of visits of } \xi_\tau^{(n)} \text{ to } d' \text{ before reaching } d'')$ is finite $\forall\, d, d', d'' \in \mathcal{D}_n^{(I_n)}$ for noncyclic ergodic Markov chains with finite support.
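The recursion used for the first result of Corollary 2 is easy to check with exact arithmetic. The sketch below is an illustration only, with hypothetical function names: it computes $x_k$ backwards from $x_{K+1} = \frac{1}{2}$ using (A.23), so that one can confirm $x_1 = 1$, the balance equation (A.22), and the closed form $\sum_{k=1}^{K} 2^k |x_k - x_{k+1}| = K \frac{2^{K-1}}{2^K - 1}$.

```python
from fractions import Fraction


def x_levels(K):
    """Solve (A.23)-(A.24) backwards from x_{K+1} = 1/2; returns {k: x_k}."""
    x = {K + 1: Fraction(1, 2)}
    for k in range(K, 0, -1):
        x[k] = Fraction(1, 2 ** k - 1) + 2 * Fraction(2 ** (k - 1) - 1, 2 ** k - 1) * x[k + 1]
    return x


def weighted_sum(K):
    """Left-hand side of (A.31): sum over k of 2^k |x_k - x_{k+1}|."""
    x = x_levels(K)
    return sum(2 ** k * abs(x[k] - x[k + 1]) for k in range(1, K + 1))


def closed_form(K):
    """Right-hand side of (A.32)."""
    return K * Fraction(2 ** (K - 1), 2 ** K - 1)
```

Since $\frac{2^{K-1}}{2^K - 1} > \frac{1}{2}$, the sum exceeds $K/2$, which is the source of the unbounded growth in $K$.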
A.4
Variance of Estimation Error for Realizations of Unobserved Heterogeneity under Homoscedasticity and Uncorrelation
Although estimation of $\beta_0$ is the only object of interest in the rest of the paper, it can be pointed out that Lemma 1 can also be used to bound the variance of estimation error for particular realizations of $e_d - e_{d'}$, $d, d' \in \mathcal{D}_n$, when transitory shocks are homoscedastic and serially and cross-sectionally uncorrelated as in Jochmans and Weidner (2016).^2
^2 Note that estimation error for $e_d$ instead of $e_d - e_{d'}$, $d, d' \in \mathcal{D}_n$, is irrelevant since $\{e_d\}_{d \in \mathcal{D}_n}$ is only determined up to a normalization within each connected subcomponent of $G_n$.
As in Jochmans and Weidner (2016) consider the case where $\beta_0 = 0$ (or alternatively treat the estimation of $\beta_0$ as asymptotically ignorable). Define $\hat{e}_n = (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} g_{n,2}' M_{g_{n,1}} u_n$ and $\hat{e}_{n,d}$ to be the element of $\hat{e}_n$ corresponding to $d \in \mathcal{D}_n$.
Corollary 3. If $Var(u_n | X) = \sigma_u^2 I_{nT}$, for $d, d' \in \mathcal{D}_n$, $d \ne d'$, connected by a path of any length in $G_n$:
$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G_{dd}^{(n)} = \sigma_u^2 \frac{1}{\sum_{d''} N_{n,dd''}\, {}^{d}H_{d''d'}^{(n)}}$$
For $d, d' \in \mathcal{D}_n$ connected by a path of length one in $G_n$, $d \ne d'$:
$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) \le \sigma_u^2 \frac{1}{N_{n,dd'}} \tag{A.35}$$
Proof. Under the assumptions of Corollary 3, we have:
$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2\, [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}' (g_{n,2}' M_{g_{n,1}} g_{n,2})^{-} [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}$$
As in the proof of Lemma 1, for $d$ and $d'$ connected in $G_n$ by a path of any length, we can rewrite:
$$Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'})) = \sigma_u^2\, [1[d = \delta] - 1[d' = \delta]]_{\delta \in \mathcal{D}_n}' \Big[\frac{1}{N_{n,d}}\, {}^{d'}G_{\delta d}^{(n)}\Big]_{\delta \in \mathcal{D}_n}$$
$$= \sigma_u^2 \Big(\frac{1}{N_{n,d}}\, {}^{d'}G_{dd}^{(n)} - \frac{1}{N_{n,d}}\, {}^{d'}G_{d'd}^{(n)}\Big)$$
$$= \sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G_{dd}^{(n)}$$
where the last equality follows from the definition of ${}^{d''}G_{dd'}^{(n)}$.
As in the proof of Corollary 1, we have, for $d \ne d'$:
$$\frac{1}{N_{n,d}}\, {}^{d'}G_{dd}^{(n)} = \frac{1}{\sum_{d'' \in \mathcal{D}_n} N_{n,dd''}\, {}^{d}H_{d''d'}^{(n)}} \tag{A.36}$$
and for $d, d'$ such that $N_{n,dd'} > 0$:
$$\sigma_u^2 \frac{1}{N_{n,d}}\, {}^{d'}G_{dd}^{(n)} \le \sigma_u^2 \frac{1}{N_{n,dd'}\, {}^{d}H_{d'd'}^{(n)}} = \sigma_u^2 \frac{1}{N_{n,dd'}}$$
which completes the proof.
Note that the bound given in the second part of Corollary 3 can be generalized to nodes $d, d' \in \mathcal{D}_n$ separated by any distance in $G_n$ and is sharp since the comparison will hold with equality if all of the paths from $d$ to $d'$ in $G_n$ include the edge $N_{n,dd'}$. The bound is also finitely bounded with sparsely matched data,^3 unlike the bound given in Theorem 6 of Jochmans and Weidner (2016).^4 Note that one could also use the best linear unbiased estimator property of ordinary least squares under homoscedasticity and serial and cross-sectional uncorrelation of transitory shocks to obtain similar bounds for $Var((\hat{e}_{n,d} - \hat{e}_{n,d'}) - (e_{n,d} - e_{n,d'}))$ as in Corollary 3.
^3 The bound extended to nodes $d, d' \in \mathcal{D}_n$ separated by a finitely bounded distance in $G_n$ would also be finitely bounded.
^4 The bound given in Jochmans and Weidner (2016) does not depend on distance and relies on the inverse of $\lambda_{2,n}$, the smallest positive eigenvalue of $I_n - D_n^{1/2} P_n D_n^{-1/2}$. In general $\lambda_{2,n} \to 0$ as $n \to \infty$ with sparsely matched data ($N_{n,d} \le C < \infty$ $\forall\, d \in \mathcal{D}_n$, $\forall\, n$).
B
Asymptotic Properties under Strict Exogeneity
In this section, the proofs of Propositions 1-3 are given. For notational simplicity I consider the case where $\dim(x_{it}) = 1$. All proofs are easily extended to the case of multidimensional covariates.
Define $\ddot{z}_n = M_n \tilde{z}_n$ and $\ddot{z}_{it}$ to be the element of $\ddot{z}_n$ corresponding to observation $\{i, t\}$.^5
I shorten the term near-epoch dependent to NED. A definition of near-epoch dependence is given by Definition 1 in Jenish and Prucha (2012).
When Assumption 3 is used, denote by $p_i \in \mathcal{P}_n$ the set in $\mathcal{P}_n$ that contains $i$.
When Assumption 4 is used, let $\tilde{\rho}(A, B) = \inf_{\{i,l\} \in A, \{j,l'\} \in B} \tilde{\rho}(\{i, l\}, \{j, l'\})$ for $A, B \subseteq \tilde{Q}$.
B.1
Preliminary Results
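Firstly, note that Assumption 2c and Jensen's inequality imply $\|\tilde{z}_{it}\|_{4+\delta} \le B$ since $\tilde{z}_{it} = E(x_{it} | Z)$. From Corollary 2 and Minkowski's inequality, we also have $\|M_n\|_\infty \le C$ under $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $\forall\, n$. This implies $\|\ddot{z}_{it}\|_{4+\delta} \le CB$ by Minkowski's inequality.^6
Also note that $\ddot{z}_{it} = \sum_{\sigma \in \mathcal{S}_n : i \in \sigma} M_{g_n^{(\sigma)}}[\{i, t\}, .]\, \tilde{z}_n^{(\sigma)}$ if $\exists\, \sigma \in \mathcal{S}_n$ such that $i \in \sigma$, and $\ddot{z}_{it} = 0$ otherwise. Together with $\|\ddot{z}_{it}\|_{4+\delta} \le CB$ and $\max_{i,i' \in s, s \in \mathcal{S}_n, t,t' \in \mathcal{T}} \tilde{\rho}(\{i, l_{it}\}, \{i', l_{i't'}\}) \le C$ (Assumption 4d), this implies that the random field $\{\ddot{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde{Q}$ equipped with semimetric $\tilde{\rho}$, with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ and constant NED scaling factors when Assumption 4 is used.
The next lemma establishes that using the semimetric $\tilde{\rho}$ defined in Assumption 4 leads to the same cardinalities for basic sets in $\tilde{Q}$ as if $\rho$ were used with $Q$ (up to an additive constant).
^5 For notational simplicity I drop the subscript $n$ here, writing $\ddot{z}_{it}$ instead of $\ddot{z}_{n,it}$.
^6 As noted above, I use $C$ as a generic large constant so that we write, for instance, $\|\ddot{z}_{it}\|_{4+\delta} \le CB$ when it can be shown that $\|\ddot{z}_{it}\|_{4+\delta} \le C^{4+\delta} B$.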
Therefore the results in Jenish and Prucha (2009) and Jenish and Prucha (2012) can be used in our slightly modified setting where the semimetric $\tilde{\rho}$ is used on the lattice $\tilde{Q}$ instead of the metric $\rho$ on $Q$.
Consider the basic sets and cardinalities of basic sets defined in Lemma A.1 of Jenish and Prucha (2009) for $U, V \subset Q$, $U \cap V = \emptyset$, $l \in U$:
$$B_l(h) = \{l' \in Q : \rho(l, l') \le h\} \tag{B.1}$$
$$N_l(m_1, m_2, m_3) = \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq U \text{ with } l \in A, B \subseteq V, \text{ and } \exists\, l' \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.2}$$
and the corresponding basic sets and cardinalities of basic sets in $\tilde{Q}$, for $\tilde{U} = \{\{j, l'\} \in \tilde{Q} : l' \in U\}$, $\tilde{V} = \{\{j, l'\} \in \tilde{Q} : l' \in V\}$, $\{i, l\} \in \tilde{U}$:^7
$$\tilde{B}_{il}(h) = \{\{j, l'\} \in \tilde{Q} : \tilde{\rho}(\{i, l\}, \{j, l'\}) \le h\} \tag{B.3}$$
$$\tilde{N}_{il}(m_1, m_2, m_3) = \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i, l\} \in A, B \subseteq \tilde{V}, \text{ and } \exists\, \{j, l'\} \in B \text{ with } m_3 \le \tilde{\rho}(\{i, l\}, \{j, l'\}) < m_3 + 1\} \tag{B.4}$$
^7 Note that $\tilde{U} \cap \tilde{V} = \emptyset$. We also have $\#\tilde{V} = \#V$ and $\#\tilde{U} = \#U$ from Assumption 4a and the definitions of $Q$ and $\tilde{Q}$.
Lemma 3. $\forall\, m_1, m_2, m_3 \in \mathbb{N}$, $U, V \subset Q$, $U \cap V = \emptyset$, $\{i, l\} \in \tilde{U}$:
$$\#\tilde{B}_{il}(h) \le \#B_l(h) + T \tag{B.5}$$
$$\tilde{N}_{il}(m_1, m_2, m_3) \le N_l(m_1, m_2, m_3) \tag{B.6}$$
Proof. The first inequality follows from
$$\tilde{B}_{il}(h) = \{\{j, l'\} \in \tilde{Q} : \rho(l, l') \le h \text{ or } j = i\} \tag{B.7}$$
$$\subseteq \{\{j, l'\} \in \tilde{Q} : l' \in B_l(h)\} \cup \{\{i, l'\} \in \tilde{Q} : l' \in Q\} \tag{B.8}$$
and $\#(A \cup B) \le \#A + \#B$, $\#\{\{j, l'\} \in \tilde{Q} : l' \in B_l(h)\} = \#B_l(h)$, $\#\{\{i, l'\} \in \tilde{Q} : l' \in Q\} = T$.
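The second inequality follows from
$$\tilde{\rho}(\{i, l\}, \{j, l'\}) = \begin{cases} 0 & \text{if } i = j \\ \rho(l, l') & \text{otherwise} \end{cases} \tag{B.9}$$
and $m_3 \ge 1$, so that:
$$\#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i, l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j, l'\} \in B \text{ with } m_3 \le \tilde{\rho}(\{i, l\}, \{j, l'\}) < m_3 + 1\}$$
$$= \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i, l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j, l'\} \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1, i \ne j\} \tag{B.10}$$
$$\le \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq \tilde{U} \text{ with } \{i, l\} \in A, B \subseteq \tilde{V} \text{ and } \exists\, \{j, l'\} \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.11}$$
$$= \#\{(A, B) : \#A = m_1, \#B = m_2, A \subseteq U \text{ with } l \in A, B \subseteq V \text{ and } \exists\, l' \in B \text{ with } m_3 \le \rho(l, l') < m_3 + 1\} \tag{B.12}$$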
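The first inequality of Lemma 3 can be illustrated on a toy lattice. The sketch below rests on assumptions that are not in the paper: locations are distinct integers on a line, $\rho$ is the absolute difference, and the names `rho`, `rho_tilde` and `ball_sizes` are hypothetical. It implements the semimetric of (B.9) and compares the cardinality of $\tilde{B}_{il}(h)$ with that of $B_l(h)$.

```python
def rho(l, m):
    """Metric on the toy lattice Q (integer locations on a line)."""
    return abs(l - m)


def rho_tilde(a, b):
    """Semimetric of (B.9) on pairs (i, l): zero within a unit, rho across units."""
    (i, l), (j, m) = a, b
    return 0 if i == j else rho(l, m)


def ball_sizes(locations, unit, loc, h):
    """Return (#B~_il(h), #B_l(h)) for the point (unit, loc).

    `locations` maps each unit i to its list of T distinct locations.
    """
    Q_tilde = [(i, l) for i, ls in locations.items() for l in ls]
    Q = [l for _, l in Q_tilde]
    big = sum(1 for q in Q_tilde if rho_tilde((unit, loc), q) <= h)
    small = sum(1 for l in Q if rho(loc, l) <= h)
    return big, small


# Three units observed in T = 2 periods, each period at a distinct location.
toy = {1: [0, 10], 2: [1, 11], 3: [5, 20]}
T = 2
example = ball_sizes(toy, unit=1, loc=0, h=1)
```

In this example the ball around $\{1, 0\}$ with $h = 1$ gains one extra point under $\tilde{\rho}$, namely the same unit's other period, which is within the additive allowance $T$ in (B.5).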
The next two lemmas show convergence results that will be used in the proofs of Propositions 1 and 2.
Lemma 4. Under Assumption 2, and Assumption 3 or Assumption 4, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:
$$\frac{1}{n} \tilde{z}_n' M_n x_n - A_n \overset{p}{\to} 0 \tag{B.13}$$
where $A_n = \frac{1}{n} E(\tilde{z}_n' M_n \tilde{z}_n)$.
Proof. Note that by definition of $\tilde{z}_n$, $A_n = E(\frac{1}{n} \tilde{z}_n' M_n x_n)$.
Proof using Assumption 3
Under Assumption 3, we show convergence in mean-squared error, which implies convergence in probability.
Note that under $\mathcal{P}_n$ forming a partition of $\mathcal{N}$ and $\{\{d : d_{it} = d, i \in p, t \in \mathcal{T}\}\}_{p \in \mathcal{P}_n}$ forming a partition of $\mathcal{D}_n$, $\forall\, \sigma \in \mathcal{S}_n$, $M_{g_n^{(\sigma)}}$ will be block-diagonal, with each block corresponding to an element in $\mathcal{P}_n$, so that $\ddot{z}_{it}$ is a function of $\{\tilde{z}_{js}\}_{j \in p_i, s \in \mathcal{T}}$ only.
Jointly with $\forall\, p, p' \in \mathcal{P}_n$, $p \ne p'$, $F(\{x_{it}, z_{it}\}_{i \in p, t \in \mathcal{T}}, \{x_{it}, z_{it}\}_{i \in p', t \in \mathcal{T}}) = F(\{x_{it}, z_{it}\}_{i \in p, t \in \mathcal{T}})\, F(\{x_{it}, z_{it}\}_{i \in p', t \in \mathcal{T}})$ this implies:
$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big) = \frac{1}{n^2} \sum_{\{i,t\} \in \mathcal{N} \times \mathcal{T}}\ \sum_{j \in p_i, s \in \mathcal{T}} E(\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}) \tag{B.14}$$
Under Assumption 2c we have $\|\ddot{z}_{it}\|_{4+\delta} \le CB$ as noted earlier in this section, and $E(|\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}| \,|\, Z) \le CB$ by Hölder's inequality. Since $E(B) < C$ by Assumption 2c, we have $E(|\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js}|) \le C$ and $|E(\ddot{z}_{it} \ddot{z}_{js} x_{it} x_{js})| \le C$.
Hence, under Assumptions 2 and 3:
$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big) = O\Big(\frac{1}{n^2}\, n \max_{i \in \mathcal{N}} \#p_i\Big) \tag{B.15}$$
$$= O\Big(\frac{1}{\#\mathcal{P}_n} \frac{\max_{i \in \mathcal{N}} \#p_i}{\min_{i \in \mathcal{N}} \#p_i}\Big) \tag{B.16}$$
$$= O\Big(\frac{1}{\#\mathcal{P}_n}\Big) \tag{B.17}$$
$$= o(1) \tag{B.18}$$
where the third and fourth equalities follow from Assumption 3.
Proof using Assumption 4
Under Assumption 4 we prove the result of this lemma by using the $L_1$ law of large numbers found in Jenish and Prucha (2012), which implies convergence in probability.
As noted at the beginning of this section, $\{\ddot{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde{Q}$ equipped with semimetric $\tilde{\rho}$ with NED coefficients and NED scaling factors satisfying the conditions of Theorem 1 of Jenish and Prucha (2012).
$E(|\ddot{z}_{it} x_{it}|^{1+\delta}) \le C$ is obtained by Hölder's inequality, Assumption 2c, and Jensen's inequality as in the first part of this proof.
Assumption 4b guarantees that $\{x_{it}, \tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is an α-mixing random field on $\tilde{Q}$ equipped with semimetric $\tilde{\rho}$ with α-mixing coefficients satisfying the condition of Theorem 1 of Jenish and Prucha (2012).
Finally Lemma 3 shows that the same cardinalities for basic sets are obtained on $\tilde{Q}$ with $\tilde{\rho}$ as are obtained on $Q$ with $\rho$, so that Assumption 4a in this paper can be used in the place of Assumption 1 of Jenish and Prucha (2012).
Therefore all of the conditions for Theorem 1 of Jenish and Prucha (2012) are verified, and we can invoke it to obtain:
$$E\Big(\Big|\frac{1}{n} \tilde{z}_n' M_n x_n - E\Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big)\Big|\Big) \to 0 \tag{B.19}$$
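Lemma 5. Under (2.3) and (2.4) when $z_{it} = z_{is}$ $\forall\, t, s \in \mathcal{T}$, and Assumptions 1-3, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in \mathcal{S}_n$, and $\#\{\sigma \in \mathcal{S}_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in \mathcal{D}_n$, $C < \infty$, $\forall\, n$:
$$\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z) - V_n \overset{a.s.}{\to} 0 \tag{B.20}$$
where $V_n = \frac{1}{n} Var(\tilde{z}_n' M_n u_n)$.
Proof. We have $E(\tilde{z}_n' M_n u_n | Z) = 0$ and, under Assumptions 1 and 3:
$$\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z) = \frac{1}{\#\mathcal{P}_n} \sum_{p \in \mathcal{P}_n} \frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j \,\vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js}\, E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}}) \tag{B.21}$$
As in the proof of Lemma 4, $\ddot{z}_{it}$ is a function of $\{\tilde{z}_{js}\}_{j \in p_i, s \in \mathcal{T}}$ only.
Therefore under Assumption 3 one only needs to verify the Markov condition in order to invoke the strong law of large numbers for independent observations given by Theorem 3.7 in White (2001).
By Minkowski's inequality:
$$\Big(E\Big|\frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j \,\vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js}\, E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\Big|^{1+\delta}\Big)^{\frac{1}{1+\delta}}$$
$$\le \frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j \,\vee\, t = s, d_{it} = d_{jt}} \Big(E\big|\ddot{z}_{it} \ddot{z}_{js}\, E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\big|^{1+\delta}\Big)^{\frac{1}{1+\delta}} \tag{B.22}$$
By Jensen's inequality, Hölder's inequality, and Assumption 2c:
$$\Big(E\big|\ddot{z}_{it} \ddot{z}_{js}\, E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\big|^{1+\delta}\Big)^{\frac{1}{1+\delta}} \le C \tag{B.23}$$
Therefore, since $N_{n,d} \le C$:
$$\Big(E\Big|\frac{\#\mathcal{P}_n}{n} \sum_{i \in p}\ \sum_{j,s \in \mathcal{N} \times \mathcal{T},\, i = j \,\vee\, t = s, d_{it} = d_{jt}} \ddot{z}_{it} \ddot{z}_{js}\, E(u_{it} u_{js} | \{z_{i't'}\}_{i' \in p, t' \in \mathcal{T}})\Big|^{1+\delta}\Big)^{\frac{1}{1+\delta}} \le C\, \frac{\#\mathcal{P}_n}{n} \sum_{i \in p} CT \tag{B.24}$$
$$\le C\, \frac{\#\mathcal{P}_n \#p}{n} \tag{B.25}$$
Under Assumption 3, we have:
$$\frac{\#\mathcal{P}_n \#p}{n} = \frac{\#\mathcal{P}_n \#p}{\sum_{p' \in \mathcal{P}_n} \#p'} \tag{B.26}$$
$$\le \frac{\max_{p' \in \mathcal{P}_n} \#p'}{\min_{p' \in \mathcal{P}_n} \#p'} \tag{B.27}$$
$$\le C \tag{B.28}$$
Hence Markov's condition is satisfied and the desired result is obtained.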
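Finally the last lemma shows that Assumptions 1 and 4 imply that $u_{it}$ is α-mixing on $\tilde{Q}$, using $\tilde{\rho}$ as a measure of distance.
Lemma 6. Under Assumptions 1 and 4, $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, t \in \mathcal{T}, \sigma \in \mathcal{S}_n, n \in \mathbb{N}\}$ is an α-mixing random field on the lattice $\tilde{Q}$ equipped with the semimetric $\tilde{\rho}$ with α-mixing coefficients satisfying Assumption 3 of Jenish and Prucha (2012).
Proof. Let $\mathcal{U}$ and $\mathcal{B}$ be two σ-algebras of $\mathcal{F}(\{\tilde{z}_{it}, z_{it}\} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N})$, define:
$$\alpha(\mathcal{U}, \mathcal{B}) = \sup_{A \in \mathcal{U}, B \in \mathcal{B}} |P(AB) - P(A)P(B)| \tag{B.29}$$
For $U \subseteq \tilde{Q}$, let $\mathcal{F}_n(U) = \mathcal{F}(\{\tilde{z}_{it}, z_{it}\} : i \in \mathcal{N}, t \in \mathcal{T}, \{i, l_{it}\} \in U)$ to define:
$$\alpha_n(U, V) = \alpha(\mathcal{F}_n(U), \mathcal{F}_n(V)) \tag{B.30}$$
for $U, V \subseteq \tilde{Q}$ and:
$$\bar{\alpha}(c_1, c_2, r) = \sup_n \sup_{U,V} (\alpha_n(U, V) : \#U \le c_1, \#V \le c_2, \tilde{\rho}(U, V) \ge r) \tag{B.31}$$
Assumption 4b) implies:
$$\bar{\alpha}(c_1, c_2, r) \le \varphi(c_1, c_2)\, \hat{\alpha}(r) \tag{B.32}$$
$$\varphi(c_1, c_2) = (c_1 + c_2)^\tau, \ \tau \ge 0 \tag{B.33}$$
$$\sum_{r=1}^{\infty} r^{q(\tau_* + 1) - 1}\, \hat{\alpha}^{\frac{\delta}{2(2+\delta)}}(r) < \infty, \ \delta > 0, \ \tau_* = \delta\tau/(2 + \delta) \tag{B.34}$$
Define $\alpha_{zu}$, $\mathcal{F}_{zu,n}$, $\alpha_{zu,n}$ and $\bar{\alpha}_{zu}$ similarly as above but for the random field $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, t \in \mathcal{T}, \sigma \in \mathcal{S}_n, n \in \mathbb{N}\}$.
For $A \in \mathcal{F}_{zu,n}(U)$, $B \in \mathcal{F}_{zu,n}(V)$, $U, V \subset \tilde{Q}$, $\tilde{\rho}(U, V) > 2C$, $i \in \sigma$, $i' \in \sigma'$, $\sigma, \sigma' \in \mathcal{S}_n$, $t, t' \in \mathcal{T}$, $\{i, l_{it}\} \in U$, $\{i', l_{i't'}\} \in V$, under Assumption 4d we have $i \ne i'$, and $d_{it} \ne d_{i't'}$ if $t = t'$.
Therefore under Assumption 1:
$$P(AB | Z) = P(A | Z)\, P(B | Z) \tag{B.35}$$
Let $E = \{\{i', t'\} : \exists\, \{i, l_{it}\} \in U \text{ with } i = i', \text{ or } t = t' \text{ and } d_{it} = d_{i't'}, i \in \sigma, \sigma \in \mathcal{S}_n\}$. Under Assumption 4c we also have:
$$P(A | Z) = P(A | \{z_{it}\}_{\{i,t\} \in E}, \{\tilde{z}_{it}\}_{\{i,t\} \in \mathcal{N} \times \mathcal{T} : \{i, l_{it}\} \in U}) \tag{B.36}$$
Under Assumption 4d, $\mathcal{F}(\{z_{it}\}_{i,t \in E}, \{\tilde{z}_{it}\}_{\{i,t\} \in \mathcal{N} \times \mathcal{T} : \{i, l_{it}\} \in U}) = \mathcal{F}_n(U_{zd})$ where $U_{zd} \subseteq \tilde{Q}$ is such that $\max_{\{i,l\} \in U_{zd}} \min_{\{i',l'\} \in U} \tilde{\rho}(\{i, l\}, \{i', l'\}) \le C$.
Similarly for $P(B | Z)$ which generates a σ-algebra $\mathcal{F}_n(V_{zd})$ where $V_{zd} \subseteq \tilde{Q}$ is such that $\max_{\{i,l\} \in V_{zd}} \min_{\{i',l'\} \in V} \tilde{\rho}(\{i, l\}, \{i', l'\}) \le C$.
Therefore, by the law of iterated expectations, we have:
$$P(AB) - P(A)P(B) = Cov(X_1, X_2) \tag{B.37}$$
where $X_1$ and $X_2$ are random variables such that $\mathcal{F}(X_1) = \mathcal{F}_n(U_{zd})$ and $\mathcal{F}(X_2) = \mathcal{F}_n(V_{zd})$ and $\tilde{\rho}(U_{zd}, V_{zd}) \ge \tilde{\rho}(U, V) - 2C$.
$0 \le X_1, X_2 \le 1$, so that by Lemma A.2 of Jenish and Prucha (2012) and Assumption 4b:
$$|Cov(X_1, X_2)| \le 4\alpha(\mathcal{F}_n(U_{zd}), \mathcal{F}_n(V_{zd})) \tag{B.38}$$
$$\le 4(\#U_{zd} + \#V_{zd})^\tau\, \hat{\alpha}(\tilde{\rho}(U, V) - 2C) \tag{B.39}$$
Since $N_{n,d} \le C$ $\forall\, d \in \mathcal{D}_n$ and $\#\{\{i', t'\} : i = i'\} = T$ $\forall\, i \in \mathcal{N}$, we have $\#U_{zd} \le CT \cdot \#U$, and similarly for $\#V_{zd}$.
Defining $\hat{\alpha}_u(r) = \hat{\alpha}(r - 2C)$, we have:
$$|P(AB) - P(A)P(B)| \le C(\#U + \#V)^\tau\, \hat{\alpha}_u(\tilde{\rho}(U, V)) \tag{B.40}$$
which obtains the desired result since one can easily verify that $\hat{\alpha}_u$ satisfies
$$\sum_{r=1}^{\infty} r^{q(\tau_* + 1) - 1}\, \hat{\alpha}_u^{\frac{\delta}{2(2+\delta)}}(r) < \infty \tag{B.41}$$
as long as (B.34) is satisfied.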
B.2
Proof of Proposition 1
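Note that:
$$\hat{\beta}_n = \beta_0 + \Big(\frac{1}{n} \tilde{z}_n' M_n x_n\Big)^{-1} \frac{1}{n} \tilde{z}_n' M_n u_n \tag{B.42}$$
B.2.1
Proof of Consistency
There are several ways of proving consistency of $\hat{\beta}_n$. Here we will show convergence in mean-squared error, which implies convergence in probability.
Under (2.3) and (2.4):
$$E\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = 0 \tag{B.43}$$
Under Assumption 1:
$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = \frac{1}{n^2} \sum_{i,t \in \mathcal{N} \times \mathcal{T}}\ \sum_{j,s :\, i = j \,\vee\, t = s, d_{it} = d_{jt}} E(\ddot{z}_{it} \ddot{z}_{js} u_{it} u_{js}) \tag{B.44}$$
Assumption 2c, $\|\ddot{z}_{it}\|_{4+\delta} \le CB$, Hölder's inequality, and Jensen's inequality obtain $|E(\ddot{z}_{it} \ddot{z}_{js} u_{it} u_{js})| \le C$. Therefore, since $\#\{\{j, s\} \in \mathcal{N} \times \mathcal{T} : i = j \,\vee\, t = s, d_{it} = d_{jt}\} \le CT$ $\forall\, \{i, t\} \in \mathcal{N} \times \mathcal{T}$, we have:
$$Var\Big(\frac{1}{n} \tilde{z}_n' M_n u_n\Big) = O\Big(\frac{1}{n^2}\, n\Big) \tag{B.45}$$
$$= o(1) \tag{B.46}$$
Combining this result with Lemma 4, Assumption 2a), and the continuous mapping theorem, we obtain consistency of $\hat{\beta}_n$.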
B.2.2
Proof of Asymptotic Normality
From Lemma 4 and the asymptotic equivalence lemma, it suffices to show that:
$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.47}$$
Note that, even though we have a martingale difference sequence in each time period, so that one could show $\big(\frac{1}{n} Var(\sum_{i \in \mathcal{N}} \ddot{z}_{it} u_{it})\big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}} \ddot{z}_{it} u_{it} \overset{d}{\to} N(0, 1)$ for each $t \in \mathcal{T}$ by using a central limit theorem for martingale difference sequences, joint normality of these quantities is not guaranteed since cluster membership varies over time ($d_{it} \ne d_{is}$ is possible for $t \ne s$).
Here we split the proof depending on whether Assumption 3 or Assumption 4 is used.
Asymptotic Normality under Assumption 3
In this part of the proof, we first show asymptotic normality of $\frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n$ conditional on $Z$, before invoking a law of large numbers and the dominated convergence theorem to yield the desired result.
$$\frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n = \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}, t \in \mathcal{T}} \ddot{z}_{it} u_{it} \tag{B.48}$$
For the purpose of this proof, define the lattice $L = \mathbb{N}^T$. Define the location of a cross-sectional observation $i$ on $L$ to be $k_i = [d_{i1}, \ldots, d_{iT}]$. Define the metric $\varrho$ on $L$ to be $\varrho(k, k') = \sum_{t=1}^{T} 1[k[t] \ne k'[t]]$, $\forall\, k, k' \in L$. One can easily verify that $\varrho$ is a valid metric, i.e. satisfies positivity, definiteness, symmetry, and the triangle inequality.
In order to show asymptotic normality conditional on $Z$, we will verify that the conditions for Theorem 1 of Jenish and Prucha (2009) hold using the lattice $L$ equipped with metric $\varrho$.
Note that under $N_{n,d} \le C$, $\#\{j \in \mathcal{N} : \varrho(\{d_{it}\}_{t \in \mathcal{T}}, \{d_{jt}\}_{t \in \mathcal{T}}) < T\} \le CT$ $\forall\, i \in \mathcal{N}$. Hence the conditions on cardinality of Lemma A.1 in Jenish and Prucha (2009) clearly hold for the lattice $L$ equipped with the metric $\varrho$, which will stand in lieu of Assumption 1 of Jenish and Prucha (2009).
In addition, Assumption 1 implies $\{\{u_{it}\}_{t \in \mathcal{T}}\}_{i : k_i \in A} \perp \{\{u_{it}\}_{t \in \mathcal{T}}\}_{i : k_i \in B} \,|\, Z$ if $\varrho(A, B) \ge T$, $A, B \subset L$. Hence since independence trivially implies α-mixing, Assumption 3 of Jenish and Prucha (2009) is also satisfied.
As was noted previously, Assumption 2c implies $E(|\ddot{z}_{it} u_{it}|^{2+\delta} | Z) \le C$, so that Assumption 2 of Jenish and Prucha (2009) is satisfied.
From Lemma 5 and Assumption 2b, we have $\liminf_{n \to \infty} \lambda_{min}(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)) \ge c > 0$ a.e., so that Assumption 5 of Jenish and Prucha (2009) is satisfied.
Therefore we can invoke Theorem 1 of Jenish and Prucha (2009) to obtain:
$$\Big(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)\Big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \,|\, Z, \text{ a.e.} \tag{B.49}$$
By the dominated convergence theorem we also have:
$$\Big(\frac{1}{n} Var(\tilde{z}_n' M_n u_n | Z)\Big)^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.50}$$
And from Lemma 5 and the asymptotic equivalence lemma, we have:
$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.51}$$
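The distance used on the lattice $L$ in the argument under Assumption 3 counts the periods in which two units have different teachers, i.e. it is a Hamming distance between teacher sequences. The following sketch, with the hypothetical names `varrho` and `is_metric_on`, checks the metric axioms by brute force on a small set of sequences; it illustrates the "one can easily verify" step and is not a proof.

```python
from itertools import product


def varrho(k, k2):
    """Number of periods t in which the teacher sequences k and k2 differ."""
    return sum(1 for a, b in zip(k, k2) if a != b)


def is_metric_on(points):
    """Brute-force check of positivity, definiteness, symmetry, triangle inequality."""
    for x, y, z in product(points, repeat=3):
        if varrho(x, y) < 0:
            return False
        if (varrho(x, y) == 0) != (x == y):
            return False
        if varrho(x, y) != varrho(y, x):
            return False
        if varrho(x, z) > varrho(x, y) + varrho(y, z):
            return False
    return True


# All teacher sequences of length T = 3 over three teacher labels.
points = list(product(range(3), repeat=3))
ok = is_metric_on(points)
```

Two units are at distance $T$ exactly when they never share a teacher in the same period, which is the separation at which Assumption 1 delivers conditional independence of their shocks.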
Asymptotic Normality under Assumption 4
Write:
$$\frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n = \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}, t \in \mathcal{T}} \ddot{z}_{it} u_{it} \tag{B.52}$$
$$= \frac{1}{\sqrt{n}} \sum_{i \in \mathcal{N}, t \in \mathcal{T} : i \in \sigma, \sigma \in \mathcal{S}_n} \ddot{z}_{it} u_{it} \tag{B.53}$$
where the last equality follows from $\ddot{z}_{it} = 0$ if $\nexists\, \sigma \in \mathcal{S}_n$ s.t. $i \in \sigma$.
It is noted at the beginning of Section B.1 that the random field $\{\ddot{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde{z}_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on $\tilde{Q}$ equipped with $\tilde{\rho}$ with NED coefficients equal to zero after a fixed distance and constant NED scaling factors.
In addition, Lemma 6 shows that $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, \sigma \in \mathcal{S}_n, t \in \mathcal{T}, n \in \mathbb{N}\}$ is an α-mixing random field on $\tilde{Q}$ equipped with the semimetric $\tilde{\rho}$ with α-mixing coefficients satisfying Assumption 3 of Jenish and Prucha (2012).
Therefore the random field $\{\ddot{z}_{it} u_{it} : i \in \sigma, \sigma \in \mathcal{S}_n, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the α-mixing random field $\{\{\tilde{z}_{it}, u_{it}\} : i \in \sigma, \sigma \in \mathcal{S}_n, t \in \mathcal{T}, n \in \mathbb{N}\}$ with NED coefficients and scaling factors satisfying Assumptions 4c and 4d of Jenish and Prucha (2012) and α-mixing coefficients satisfying Assumption 3 of Jenish and Prucha (2012).
Assumption 2b guarantees that Assumption 4b of Jenish and Prucha (2012) holds.
As was noted previously, Assumption 2c implies $E(|\ddot{z}_{it} u_{it}|^{2+\delta}) \le C$, so that Assumption 4a of Jenish and Prucha (2012) is satisfied.
As was noted previously, Lemma 3 also shows that the same cardinalities for basic sets are obtained on $\tilde{Q}$ with $\tilde{\rho}$ as are obtained on $Q$ with $\rho$, so that Assumption 4a in this paper can be used in lieu of Assumption 1 of Jenish and Prucha (2012).
Therefore all conditions for applying Theorem 2 of Jenish and Prucha (2012) are met, and we have:
$$V_n^{-\frac{1}{2}} \frac{1}{\sqrt{n}} \tilde{z}_n' M_n u_n \overset{d}{\to} N(0, 1) \tag{B.54}$$
B.3
Proof of Proposition 2
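The first result of Proposition 2 is shown by Lemma 4.
Let $\hat V_n = \frac{1}{n} \sum_{\sigma_1\in S_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} \hat v_n^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1)} \hat v_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)}$. We need to show that $\hat V_n - V_n = o_p(1)$.
Since $M_{g_n^{(\sigma)}} g_n^{(\sigma)} = 0$, we have:
\[
\begin{aligned}
\hat V_n &= \frac{1}{n} \sum_{\sigma_1\in S_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} u_n^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1)} u_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)\, \frac{1}{n} \sum_{\sigma_1\in S_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} x_n^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1)} u_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)\, \frac{1}{n} \sum_{\sigma_1\in S_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} u_n^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1)} x_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\quad + (\hat\beta_n - \beta_0)^2\, \frac{1}{n} \sum_{\sigma_1\in S_n} \tilde z_n^{(\sigma_1)\prime} M_{g_n^{(\sigma_1)}} x_n^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1)} x_n^{(\sigma_2)\prime} M_{g_n^{(\sigma_2)}} \tilde z_n^{(\sigma_2)} \\
&\equiv T_{n,1} + (\hat\beta_n - \beta_0) T_{n,2} + (\hat\beta_n - \beta_0) T_{n,3} + (\hat\beta_n - \beta_0)^2 T_{n,4}
\end{aligned}
\]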
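For $\sigma \in S_n$, let $\ddot z_n^{(\sigma)} = M_{g_n^{(\sigma)}} \tilde z_n^{(\sigma)}$ and let $\ddot z_{it}^{(\sigma)}$ be the element of $\ddot z_n^{(\sigma)}$ corresponding to observation $\{i,t\} \in \sigma \times \mathcal{T}$.^8 We have:
\[ T_{n,1} = \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}} \ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in U_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js} \tag{B.55} \]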
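Under Assumption 1, $E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js}) = 0$ if $1[i = j, \text{ or } d_{it} = d_{jt} \text{ and } t = s] = 0$. Since $1[i = j, \text{ or } d_{it} = d_{jt} \text{ and } t = s] \le 1[\sigma' \in U_n(\sigma)]$ for $i \in \sigma$, $j \in \sigma'$, we have:
\[
\begin{aligned}
V_n &= \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}} \ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[i = j \text{ or } (d_{it} = d_{jt} \text{ and } t = s)]\, E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js}) \\
&= \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}} \ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in U_n(\sigma)]\, E(\ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js})
\end{aligned}
\]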
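$\forall\, \{i,t\} \in \mathcal{N} \times \mathcal{T}$, under the conditions of Proposition 2, we have:
\[ \#\{j\in\mathcal{N},\ s\in\mathcal{T},\ \sigma,\sigma'\in S_n : i\in\sigma,\ j\in\sigma',\ \sigma'\in U_n(\sigma)\} \le C \tag{B.56} \]
so that as in the preliminary results of this section, Assumption 2c, Minkowski's inequality and Jensen's inequality can be combined to obtain:
\[ E\left( \left| \sum_{\sigma\in S_n:\ i\in\sigma} \ \sum_{\sigma'\in U_n(\sigma)} \ \sum_{j\in\sigma',\, s\in\mathcal{T}} \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} u_{it} u_{js} \right|^{1+\delta} \right)^{\frac{1}{1+\delta}} \le C \tag{B.57} \]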
Under Assumption 3, we can then show that $T_{n,1} - V_n = o_p(1)$ by using Theorem 3.7 of White (2001) as in the proof of Lemma 5.
Under Assumption 4 we have:
\[ \max_{j\in\mathcal{N},\, s\in\mathcal{T}:\ \sigma,\sigma'\in S_n,\ i\in\sigma,\ j\in\sigma',\ \sigma'\in U_n(\sigma)} \rho(l_{it}, l_{js}) \le C \tag{B.58} \]
Under Assumption 4, we can then show that $T_{n,1} - V_n = o_p(1)$ by using Theorem 1 of Jenish and Prucha (2012) as in the proof of Lemma 4.
8 As for $\ddot z_{it}$, I drop the subscript $n$ for notational simplicity here, writing $\ddot z_{it}^{(\sigma)}$ instead of $\ddot z_{n,it}^{(\sigma)}$.
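Under the conditions of Proposition 2 and Assumption 2c we also have:
\[ E\left( \left| \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma'\in U_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} x_{it} u_{js} \right|^{1+\delta} \right) \le C \tag{B.59} \]
and:
\[ E\left( \left| \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma'\in U_n(\sigma)]\, \ddot z_{it}^{(\sigma)} \ddot z_{js}^{(\sigma')} x_{it} x_{js} \right|^{1+\delta} \right) \le C \tag{B.60} \]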
Therefore we can use Markov’s inequality to obtain Tn,2 = Op (1), Tn,3 = Op (1), Tn,4 = Op (1). Proposition 1 shows that βˆn − β0 = op (1), so that applying the continuous mapping theorem concludes the proof.
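As a rough illustration of how an estimator with the structure of $\hat V_n$ can be assembled once group-level scores are available, the following Python sketch sums cross-products of scores over overlapping groups. It assumes dim(x_it) = 1 and that the scalar scores $\tilde z_n^{(\sigma)\prime} M_{g_n^{(\sigma)}} \hat v_n^{(\sigma)}$ have already been computed; the function name, the dictionary inputs and the toy numbers are hypothetical and are not taken from the paper.

```python
def overlap_variance(scores, overlaps, n):
    """Return (1/n) * sum_{s1} scores[s1] * sum_{s2 in overlaps[s1]} scores[s2].

    scores   : dict mapping a group label to its scalar score
    overlaps : dict mapping a group label to the labels of the groups it
               overlaps with (a stand-in for U_n(sigma); it is assumed here
               to contain the group itself)
    n        : number of observations used for scaling
    """
    total = 0.0
    for s1, score1 in scores.items():
        total += score1 * sum(scores[s2] for s2 in overlaps[s1])
    return total / n


# Toy example (hypothetical numbers): groups A and B overlap, C stands alone.
toy_scores = {"A": 1.0, "B": 2.0, "C": -1.0}
toy_overlaps = {"A": {"A", "B"}, "B": {"A", "B"}, "C": {"C"}}
v_hat = overlap_variance(toy_scores, toy_overlaps, n=5)
```

In the toy example the cross-product between groups A and B enters twice, once from each side, which is what distinguishes this form from a sum of squared group scores.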
B.4
Proof of Proposition 3
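If $S_n$ is a partition of $\mathcal{N}$ and $\{\{d_{it}\}_{i\in\sigma, t\in\mathcal{T}}\}_{\sigma\in S_n}$ is a partition of $D_n$, we have $M_n = M_{g_n}$. Under serial and cross-sectional uncorrelation we also have:
\[
\begin{aligned}
B_n^{-1} W_n B_n^{-1} - A_n^{-1} V_n A_n^{-1}
&= B_n^{-1} W_n B_n^{-1} - E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\right)^{-1} && \text{(B.61)} \\
&= E\left(\tfrac{1}{n} \acute z_n' M_{g_n} x_n\right)^{-1} E\left(\tfrac{1}{n} \acute z_n' M_{g_n} \acute z_n\right) E\left(\tfrac{1}{n} x_n' M_{g_n} \acute z_n\right)^{-1} - E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\right)^{-1} && \text{(B.62)} \\
&= E\left(\tfrac{1}{n} \acute z_n' M_{g_n} \tilde z_n\right)^{-1} E\left(\tfrac{1}{n} \acute z_n' M_{g_n} \acute z_n\right) E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \acute z_n\right)^{-1} - E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\right)^{-1} && \text{(B.63)} \\
&= E\left(\tfrac{1}{n} v_n' M_{g_n} v_n\right) && \text{(B.64)}
\end{aligned}
\]
where $v_n \equiv \acute z_n E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \acute z_n\right)^{-1} - \tilde z_n E\left(\tfrac{1}{n} \tilde z_n' M_{g_n} \tilde z_n\right)^{-1}$.
$M_{g_n}$ is symmetric idempotent and the expected value of a quadratic form in a symmetric idempotent matrix is positive semidefinite, so that the desired result is obtained.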
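The step from (B.63) to (B.64) is an algebraic identity that can be checked numerically. The Python sketch below does so for sample analogues, replacing each expectation by its realized value for scalar instruments; this replacement, the random design and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 15

# Hypothetical design g (n x k) and its annihilator M = I - g (g'g)^+ g'.
g = rng.normal(size=(n, k))
M = np.eye(n) - g @ np.linalg.pinv(g)

z_tilde = rng.normal(size=n)             # stands in for the instrument z-tilde
z_acute = z_tilde + rng.normal(size=n)   # stands in for an alternative instrument

a = z_tilde @ M @ z_acute / n
b = z_tilde @ M @ z_tilde / n
w = z_acute @ M @ z_acute / n

lhs = w / a**2 - 1.0 / b                 # sample analogue of (B.63)
v = z_acute / a - z_tilde / b            # sample analogue of v_n
rhs = v @ M @ v / n                      # sample analogue of (B.64)
```

Because `M` is symmetric idempotent, `rhs` is a sum of squares divided by `n`, which is why the difference between the two variance expressions cannot be negative.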
C
Results under Sequential Exogeneity
C.1
Proof of Lemma 2
I use a recursive argument to derive moment conditions that exhaust the information for estimating $\beta_0$ that is contained in (2.3) and (2.4) when instrumental variables are sequentially exogenous, i.e. $z_{it} \subseteq z_{is}$ for $s \ge t$.
Define $y_{n,t} = [y_{it}]_{i\in\mathcal{N}}$, $x_{n,t} = [x_{it}]_{i\in\mathcal{N}}$, $g_{2,n,t} = [1[d_{it} = d]]^{d\in D_n}_{i\in\mathcal{N}}$.
(2.3) and (2.4) for $t = T$ are equivalent to:
\[ E(c_n \mid Z_T) = E(y_{n,T} - x_{n,T}\beta_0 - g_{2,n,T}\, e_n \mid Z_T) \tag{C.1} \]
Define $\dot y_{n,t} = y_{n,t} - \frac{1}{T-t+1}\sum_{s=t}^{T} y_{n,s}$, $\dot x_{n,t} = x_{n,t} - \frac{1}{T-t+1}\sum_{s=t}^{T} x_{n,s}$, $\dot g_{2,n,t} = g_{2,n,t} - \frac{1}{T-t+1}\sum_{s=t}^{T} g_{2,n,s}$, and $m_{n,t}(\beta) = \dot y_{n,t} - \dot x_{n,t}\beta$ for $t = 1, ..., T-1$.
Since no restriction is imposed on $E(c_n \mid Z_T)$, (C.1) contains no information for estimating $\beta_0$, and the information for estimating $\beta_0$ contained in (2.3) and (2.4) is equivalent to the information contained in:
\[ E(m_{n,t}(\beta_0) - \dot g_{2,n,t}\, e_n \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.2} \]
With one-way unobserved heterogeneity only, a similar result is stated in Chamberlain (1992) for instance. With two-way unobserved heterogeneity, we cannot use (C.2) for estimation directly since $e_n$ cannot be treated as a vector of parameters that can be estimated consistently.
For $t = T-1$, (C.2) is equivalent to:
\[
\begin{aligned}
E(e_n \mid Z_{T-1}) &= (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' E(m_{n,T-1}(\beta_0) \mid Z_{T-1}) \\
&\quad + (I_{N_n} - (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' \dot g_{2,n,T-1})\, \xi_{n,T-1} && \text{(C.3)} \\
E(M_{\dot g_{2,n,T-1}} m_{n,T-1}(\beta_0) \mid Z_{T-1}) &= 0 && \text{(C.4)}
\end{aligned}
\]
where $\xi_{n,T-1}$ is an unrestricted $N_n \times 1$ vector.
Since $E(e_n \mid Z_{T-1})$ is unrestricted, no information for estimating $\beta_0$ is found in (C.3), and the information for estimating $\beta_0$ found in (C.2) for $t = T-1$ is equivalent to the information found in (C.4).
Define $a_{n,T-1}(\beta) = (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' m_{n,T-1}(\beta)$ and:
\[ B_{n,T-1} = I_{N_n} - (\dot g_{2,n,T-1}' \dot g_{2,n,T-1})^{-} \dot g_{2,n,T-1}' \dot g_{2,n,T-1} \tag{C.5} \]
The information for estimating $\beta_0$ found in (C.2) for $t = T-1, T-2$ is equivalent to the information contained in (C.4) and:
\[ E(m_{n,T-2}(\beta_0) - \dot g_{2,n,T-2}\, a_{n,T-1}(\beta_0) - \dot g_{2,n,T-2} B_{n,T-1} \xi_{n,T-1} \mid Z_{T-2}) = 0 \tag{C.6} \]
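Define $\ddot g_{2,n,T-2} = \dot g_{2,n,T-2} B_{n,T-1}$.
(C.6) in turn is equivalent to:
\[
\begin{aligned}
E(\xi_{n,T-1} \mid Z_{T-2}) &= (\ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})^{-} \ddot g_{2,n,T-2}' E(m_{n,T-2}(\beta_0) - \dot g_{2,n,T-2}\, a_{n,T-1}(\beta_0) \mid Z_{T-2}) \\
&\quad + (I_{N_n} - (\ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})^{-} \ddot g_{2,n,T-2}' \ddot g_{2,n,T-2})\, \xi_{n,T-2} && \text{(C.7)} \\
E(M_{\ddot g_{2,n,T-2}} (m_{n,T-2}(\beta_0) - \dot g_{2,n,T-2}\, a_{n,T-1}(\beta_0)) \mid Z_{T-2}) &= 0 && \text{(C.8)}
\end{aligned}
\]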
where $\xi_{n,T-2}$ is an unrestricted $N_n \times 1$ vector.
Since $E(\xi_{n,T-1} \mid Z_{T-2})$ is unrestricted, the information for estimating $\beta_0$ found in (C.2) for $t = T-1, T-2$ is equivalent to the information contained in (C.4) and (C.8).
Define:
\[ \ddot g_{2,n,T} = 0 \tag{C.9} \]
\[ B_{n,T} = I_{N_n} \tag{C.10} \]
and for $t = T-1, ..., 1$:
\[ \ddot g_{2,n,t} = \dot g_{2,n,t} B_{n,t+1} \tag{C.11} \]
\[ B_{n,t} = B_{n,t+1} (I_{N_n} - (\ddot g_{2,n,t+1}' \ddot g_{2,n,t+1})^{-} \ddot g_{2,n,t+1}' \ddot g_{2,n,t+1}) \tag{C.12} \]
as well as $a_{n,T}(\beta_0) = 0$ and for $t = T-1, ..., 1$:
\[ a_{n,t}(\beta_0) = a_{n,t+1}(\beta_0) + B_{n,t+1} (\ddot g_{2,n,t}' \ddot g_{2,n,t})^{-} \ddot g_{2,n,t}' (m_{n,t}(\beta_0) - \dot g_{2,n,t}\, a_{n,t+1}(\beta_0)) \tag{C.13} \]
Then, by recursion, the information for estimating $\beta_0$ contained in (2.3) and (2.4) is equivalent to the information contained in:
\[ E(M_{\ddot g_{2,n,t}} (m_{n,t}(\beta_0) - \dot g_{2,n,t}\, a_{n,t+1}(\beta_0)) \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.14} \]
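The backward recursion for the matrices $B_{n,t}$ and $\ddot g_{2,n,t}$ can be sketched numerically as follows. This is only an illustration under stated assumptions: the Moore-Penrose inverse is used as the generalized inverse, the update of $B_{n,t}$ projects out the row space of $\ddot g_{2,n,t}$ (the indexing that reproduces (C.5) at $t = T-1$), teacher assignments are drawn at random, and every function and variable name is hypothetical. The sketch also checks that each $B_{n,t}$ coincides with the projector onto the null space of the stacked forward-demeaned teacher indicators for periods $t, ..., T$.

```python
import numpy as np

def row_space_projector(G, tol=1e-10):
    """Orthogonal projector onto the row space of G, i.e. (G'G)^+ G'G."""
    _, s, vt = np.linalg.svd(G, full_matrices=False)
    v = vt[s > tol].T            # right singular vectors with nonzero singular value
    return v @ v.T

def forward_demean(g_list):
    """g_dot[t] = g[t] minus the mean of g[s] over s >= t (zero in the last period)."""
    T = len(g_list)
    return [g_list[t] - sum(g_list[t:]) / (T - t) for t in range(T)]

def recursion(g_dot):
    """Backward recursion: B_T = I, g_ddot_t = g_dot_t B_{t+1}, then update B_t."""
    T, N = len(g_dot), g_dot[0].shape[1]
    B, g_ddot = [None] * T, [None] * T
    B[T - 1] = np.eye(N)
    g_ddot[T - 1] = np.zeros_like(g_dot[T - 1])
    for t in range(T - 2, -1, -1):
        g_ddot[t] = g_dot[t] @ B[t + 1]
        B[t] = B[t + 1] @ (np.eye(N) - row_space_projector(g_ddot[t]))
    return B, g_ddot

rng = np.random.default_rng(1)
n_students, n_teachers, T = 12, 8, 3
# One student-by-teacher indicator matrix per period (hypothetical assignment).
g2 = [np.eye(n_teachers)[rng.integers(n_teachers, size=n_students)] for _ in range(T)]
g_dot = forward_demean(g2)
B, g_ddot = recursion(g_dot)

# Compare with the projector onto the null space of the stacked g_dot[t:].
matches = [
    np.allclose(B[t], np.eye(n_teachers) - row_space_projector(np.vstack(g_dot[t:])))
    for t in range(T)
]
```

Each entry of `matches` is expected to be `True`, which is the numerical counterpart of the closed form for $B_{n,t+1}$ used in the next subsection.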
Define $y_n^t = [y_{n,t'}]_{t'=T,...,t}$, $x_n^t = [x_{n,t'}]_{t'=T,...,t}$, $g_n^t = [g_{n,t'}]_{t'=T,...,t}$.
It only remains to show that (C.14) is equivalent to:
\[ E(y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t\beta_0) \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.15} \]
This is shown in the next subsection.
C.2
Equivalence of Moment Conditions
To shorten notation, define $m_{n,t,0} = m_{n,t}(\beta_0)$ and $a_{n,t,0} = a_{n,t}(\beta_0)$.
Here I show that:
\[ E(M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t}\, a_{n,t+1,0}) \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.16} \]
is equivalent to:
\[ E(y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t\beta_0) \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.17} \]
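Since (C.17) is a set of moment conditions implied by (2.3) and (2.4) that can be used for estimating $\beta_0$, and (C.16) contains all information in (2.3) and (2.4) relevant for estimating $\beta_0$, (C.17) is implied by (C.16). Therefore it only remains to show that (C.17) implies (C.16) to conclude the proof.
By the Frisch-Waugh theorem:
\[
\begin{aligned}
& y_{n,t} - x_{n,t}\beta_0 - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t\prime} (y_n^t - x_n^t\beta_0) \\
&= \dot y_{n,t} - \dot x_{n,t}\beta_0 - \dot g_{2,n,t} \left( (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t}) \right)^{-} \times \\
&\quad (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (y_n^t - j_{T-t+1} \otimes \bar y_{n,t} - (x_n^t - j_{T-t+1} \otimes \bar x_{n,t})\beta_0)
\end{aligned}
\]
where $\otimes$ is the Kronecker product, $j_k$ is a $k \times 1$ vector of ones, $\bar g_{2,n,t} = \frac{1}{T-t+1}\sum_{s=t}^{T} g_{2,n,s}$, $\bar x_{n,t} = \frac{1}{T-t+1}\sum_{s=t}^{T} x_{n,s}$, $\bar y_{n,t} = \frac{1}{T-t+1}\sum_{s=t}^{T} y_{n,s}$.
Define $\dot g_{2,n}^t = [\dot g_{2,n,s}]_{s\ge t}$, $\dot y_n^t = [\dot y_{n,s}]_{s\ge t}$, $\dot x_n^t = [\dot x_{n,s}]_{s\ge t}$ and $m_{n,0}^t = \dot y_n^t - \dot x_n^t\beta_0$.
One can show that:
\[
\begin{aligned}
& \dot g_{2,n,t} \left( (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t}) \right)^{-} \times \\
&\quad (g_{2,n}^t - j_{T-t+1} \otimes \bar g_{2,n,t})' (y_n^t - j_{T-t+1} \otimes \bar y_{n,t} - (x_n^t - j_{T-t+1} \otimes \bar x_{n,t})\beta_0) \\
&= \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t\prime} m_{n,0}^t
\end{aligned}
\tag{C.18}
\]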
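and we can write:
\[
\begin{aligned}
& m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t\prime} m_{n,0}^t \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.19)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} \\
&\quad - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.20)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') m_{n,t,0} \\
&\quad - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t - \dot g_{2,n,t}' \dot g_{2,n,t}) (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1} && \text{(C.21)} \\
&= (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) && \text{(C.22)}
\end{aligned}
\]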
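Note that:
\[
\begin{aligned}
\dot g_{2,n,t}' (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}')
&= \dot g_{2,n,t}' - \dot g_{2,n,t}' \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.23)} \\
&= \dot g_{2,n,t}' - \dot g_{2,n}^{t\prime} \dot g_{2,n}^t (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.24)} \\
&\quad + \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.25)} \\
&= \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' && \text{(C.26)}
\end{aligned}
\]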
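so that:
\[
\begin{aligned}
& \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n,t}' (I_n - \dot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') \\
&= \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}'
\end{aligned}
\tag{C.27}
\]
Therefore (C.22) premultiplied by $\dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n,t}'$ is equal to:
\[ \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}' (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.28} \]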
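It can be shown that $B_{n,t+1} = I_{N_n} - (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1}$.
Recall that by definition $\ddot g_{2,n,t} = \dot g_{2,n,t} B_{n,t+1}$. Then adding (C.28) to (C.22) gives:
\[ (I_n - \ddot g_{2,n,t} (\dot g_{2,n}^{t\prime} \dot g_{2,n}^t)^{-} \dot g_{2,n,t}') (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.29} \]
so that premultiplying by $M_{\ddot g_{2,n,t}}$ gives:
\[ M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \tag{C.30} \]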
Therefore (C.17) implies:
\[ E(M_{\ddot g_{2,n,t}} (m_{n,t,0} - \dot g_{2,n,t} (\dot g_{2,n}^{t+1\prime} \dot g_{2,n}^{t+1})^{-} \dot g_{2,n}^{t+1\prime} m_{n,0}^{t+1}) \mid Z_t) = 0 \quad \forall\, t = 1, ..., T-1 \tag{C.31} \]
It only remains to show that (C.31) implies (C.16).
For $t = T-1$, (C.31) and (C.16) are identical since $a_{n,T}(\beta_0) = 0$ and $(\dot g_{2,n}^{T\prime} \dot g_{2,n}^{T})^{-} \dot g_{2,n}^{T\prime} m_{n,0}^{T} = 0$ since $\dot g_{2,n}^{T} = 0$ and $m_{n,0}^{T} = 0$.
In addition $a_{n,T-1,0} = (\dot g_{2,n}^{T-1\prime} \dot g_{2,n}^{T-1})^{-} \dot g_{2,n}^{T-1\prime} m_{n,0}^{T-1}$ by definition.
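The result can be proven by recursion. Assume that (C.31) for $t = T-1, ..., s+1$ implies (C.16) for $t = T-1, ..., s+1$ and:
\[ E(a_{n,s+1,0} \mid Z_{s+1}) = E((\dot g_{2,n}^{s+1\prime} \dot g_{2,n}^{s+1})^{-} \dot g_{2,n}^{s+1\prime} m_{n,0}^{s+1} \mid Z_{s+1}) + B_{n,s+1}\, \xi_{s+1} \tag{C.32} \]
where $\xi_{s+1}$ is an unrestricted $N_n \times 1$ vector.
Then clearly (C.31) for $t = T-1, ..., s$ implies (C.16) for $t = s$ since $\ddot g_{2,n,s} \equiv \dot g_{2,n,s} B_{n,s+1}$.
(C.16) for $t = s$ also implies:
\[
\begin{aligned}
E(m_{n,s,0} \mid Z_s) &= E(P_{\ddot g_{2,n,s}} m_{n,s,0} + M_{\ddot g_{2,n,s}} \dot g_{2,n,s}\, a_{n,s+1,0} \mid Z_s) && \text{(C.33)} \\
&= E(P_{\ddot g_{2,n,s}} (m_{n,s,0} - \dot g_{2,n,s}\, a_{n,s+1,0}) + \dot g_{2,n,s}\, a_{n,s+1,0} \mid Z_s) && \text{(C.34)} \\
&= E(\dot g_{2,n,s}\, a_{n,s,0} \mid Z_s) && \text{(C.35)}
\end{aligned}
\]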
where the last equality follows from the definitions of $a_{n,s}(\beta)$ and of $\ddot g_{2,n,s}$.
So (C.31) for $t = T-1, ..., s$ implies:
\[ E(\dot g_{2,n,s}' m_{n,s,0} \mid Z_s) = E(\dot g_{2,n,s}' \dot g_{2,n,s}\, a_{n,s,0} \mid Z_s) \tag{C.36} \]
By the same derivation as we did above for $t = s$, (C.16) for $t = T-1, ..., s+1$ implies:
\[
\begin{aligned}
E(\dot g_{2,n,t}' m_{n,t,0} \mid Z_s) &= E(\dot g_{2,n,t}' \dot g_{2,n,t}\, a_{n,t,0} \mid Z_s) && \text{(C.37)} \\
&= E(\dot g_{2,n,t}' \dot g_{2,n,t}\, a_{n,s,0} \mid Z_s) && \text{(C.38)}
\end{aligned}
\]
where the last equality follows from $\dot g_{2,n,t} B_{n,t'} = 0$ for $t' \le t$ and $a_{n,t,0} = a_{n,t+1,0} + B_{n,t+1} (\ddot g_{2,n,t}' \ddot g_{2,n,t})^{-} \ddot g_{2,n,t}' (m_{n,t,0} - \dot g_{2,n,t}\, a_{n,t+1,0})$.
Therefore (C.31) for $t = T-1, ..., s$ implies:
\[ E(\dot g_{2,n}^{s\prime} \dot g_{2,n}^{s}\, a_{n,s,0} \mid Z_s) = E(\dot g_{2,n}^{s\prime} m_{n,0}^{s} \mid Z_s) \tag{C.39} \]
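which in turn implies:
\[
\begin{aligned}
E(a_{n,s,0} \mid Z_s) &= E((\dot g_{2,n}^{s\prime} \dot g_{2,n}^{s})^{-} \dot g_{2,n}^{s\prime} m_{n,0}^{s} \mid Z_s) && \text{(C.40)} \\
&\quad + (I - (\dot g_{2,n}^{s\prime} \dot g_{2,n}^{s})^{-} \dot g_{2,n}^{s\prime} \dot g_{2,n}^{s})\, \xi_s && \text{(C.41)} \\
&= E((\dot g_{2,n}^{s\prime} \dot g_{2,n}^{s})^{-} \dot g_{2,n}^{s\prime} m_{n,0}^{s} \mid Z_s) && \text{(C.42)} \\
&\quad + B_{n,s}\, \xi_s && \text{(C.43)}
\end{aligned}
\]
where $\xi_s$ is an unrestricted vector. This concludes the proof.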
C.3
Equivalence for Estimation
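Equation (4.3) in the main text follows from:
\[ (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}')\, \Sigma_{n,t}\, (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') = (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') \tag{C.44} \]
by definition of $\Sigma_{n,t}$ and:
\[
\begin{aligned}
& (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n] \\
&= [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime} + g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}' g_{n,t} (g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']
\end{aligned}
\]
and $g_{n,t}' g_{n,t} = g_n^{t\prime} g_n^t - g_n^{t+1\prime} g_n^{t+1}$, so that:
\[
\begin{aligned}
& (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') [-g_{n,t}(g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime},\ I_n] \\
&= [-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']
\end{aligned}
\]
In this subsection I also show why $\hat\beta_n$ is defined by (4.4) in section 4.2. We can rewrite: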
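\[
\begin{aligned}
\ddot z_{n,t}' \dot x_{n,t}
&= \tilde z_n^{t\prime}
\begin{bmatrix} -g_n^{t+1} (g_n^{t+1\prime} g_n^{t+1})^{-} g_{n,t}' \\ I_n \end{bmatrix}
[-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']\, x_n^t \\
&= \tilde z_n^{t\prime}
\begin{bmatrix}
g_n^{t+1} (g_n^{t+1\prime} g_n^{t+1})^{-} g_n^{t+1\prime} - g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} & -g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' \\
-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} & I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}'
\end{bmatrix} x_n^t
\end{aligned}
\]
where $\tilde z_n^t = E([x_{n,s}]_{s\ge t} \mid Z_t)$.
Define $\tilde z_{n,t}$ and $\tilde z_{n,t}^{t+1}$ to be the last $n \times 1$ and the first $n(T-t) \times 1$ blocks of $\tilde z_n^t$ respectively.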
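We have:
\[
\begin{aligned}
\sum_{t\le T-1} \ddot z_{n,t}' \dot x_{n,t}
&= \sum_{t\le T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\le T-1} \tilde z_{n,t}' g_{n,t} (g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime} x_n^{t+1} \\
&\quad - \sum_{t\le T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\le T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} ((g_n^{t+1\prime} g_n^{t+1})^{-} - (g_n^{t\prime} g_n^t)^{-}) g_n^{t+1\prime} x_n^{t+1} \\
&= \sum_{t\le T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\le T-1} \sum_{s\ge t} \tilde z_{n,t}' g_{n,t} (g_n^{t\prime} g_n^t)^{-} g_{n,s}' x_{n,s} \\
&\quad - \sum_{t\le T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\le T-1} \sum_{s\ge t} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} ((g_n^{t+1\prime} g_n^{t+1})^{-} - (g_n^{t\prime} g_n^t)^{-}) g_{n,s}' x_{n,s} \\
&= - \sum_{s\le T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&\quad + \sum_{s\le T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,T}' x_{n,T} \\
&\quad + \sum_{t\le T-1} \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t} \\
&\quad - \sum_{t\le T-1} \sum_{s\le t} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t} \\
&\quad - \sum_{t\le T-1} \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t} \\
&\quad + \sum_{t\le T-1} \sum_{s\le t} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,t}' x_{n,t} \\
&\equiv \sum_{t\in\mathcal{T}} A_t
\end{aligned}
\]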
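where:
\[
\begin{aligned}
A_T &= - \sum_{s\le T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T}
+ \sum_{s\le T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,T}' x_{n,T} \\
&= - \sum_{s\le T-1} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T}
+ \tilde z_{n,T-1}^{T\prime} x_{n,T} \\
&\quad + \sum_{s\le T-2} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s+1\prime} g_n^{s+1})^{-} g_{n,T}' x_{n,T}
- \sum_{s\le T-1} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T}
+ \sum_{s\le T-2} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} (g_n^{s+1\prime} g_n^{s+1})^{-} g_{n,T}' x_{n,T}
- \sum_{s\le T-1} \tilde z_n^{s\prime} g_n^{s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T}
+ \sum_{s\le T-1} \tilde z_{n,s-1}^{s\prime} g_n^{s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T}
- \sum_{s\le T-1} \tilde z_n^{s\prime} g_n^{s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T} \\
&= \tilde z_{n,T-1}^{T\prime} x_{n,T}
+ \sum_{s\le T-1} (\tilde z_{n,s-1}^{s\prime} - \tilde z_n^{s\prime}) g_n^{s} (g_n^{s\prime} g_n^s)^{-} g_{n,T}' x_{n,T}
\end{aligned}
\]
where the fourth equality follows from a summation index change and $\tilde z_{n,0}^{1} \equiv 0$.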
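Similarly we can show that, for $t \le T-1$:
\[
\begin{aligned}
A_t &= \tilde z_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') x_{n,t}
- \sum_{s\le t} \tilde z_{n,s}' g_{n,s} (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t} \\
&\quad - \tilde z_{n,t}^{t+1\prime} g_n^{t+1} (g_n^{t\prime} g_n^t)^{-} g_{n,t}' x_{n,t}
+ \sum_{s\le t} \tilde z_{n,s}^{s+1\prime} g_n^{s+1} ((g_n^{s+1\prime} g_n^{s+1})^{-} - (g_n^{s\prime} g_n^s)^{-}) g_{n,t}' x_{n,t} \\
&= \tilde z_{n,t}' x_{n,t}
+ \sum_{s\le t} (\tilde z_{n,s-1}^{s\prime} - \tilde z_n^{s\prime}) g_n^{s} (g_n^{s\prime} g_n^s)^{-} g_{n,t}' x_{n,t}
\end{aligned}
\]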
Therefore we see that the estimator defined by (4.4) is indeed the pooled regression version of the estimator defined by (4.3).
C.4
Preliminary Results for the Proofs of Section 4.2
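As in section B of this appendix, I consider for simplicity the case where $\dim(x_{it}) = 1$.
Define $\breve z_{it}$ to be the element of $\breve z_{n,t}$ corresponding to observation $i$.^9 Define $\tilde z_{i,t,t'} = E(x_{it} \mid Z_{t'})$ for $t' \le t$. When Assumption 7 is used, denote by $p_i \in P_n$ the set in $P_n$ that contains $i$.
As in the previous section, Assumption 6c and Jensen's inequality imply $|\tilde z_{i,t,t'}|^{4+\delta} \le E(B_{it} \mid Z_{t'})$ $\forall\, t' \le t$, $i \in \mathcal{N}$.
As in section B of this appendix, this last result combined with $N_{n,d} \le C$ $\forall\, d \in D_n$, $\#\{d \in D_n : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in S_n$, and $\#\{\sigma \in S_n : i \in \sigma\} \le C$ $\forall\, i \in \mathcal{N}$, $\forall\, n$, also imply $|\breve z_{it}|^{4+\delta} \le C \sum_{s\ge t} E(B_{is} \mid Z_t)$.
Also note that $\breve z_{it}$ is a function of $\{\tilde z_{js} : \exists\, \sigma \in S_n \text{ s.t. } j, i \in \sigma, s \in \mathcal{T}\}$ only. Together with $|\breve z_{it}|^{4+\delta} \le C \sum_{s\ge t} E(B_{is} \mid Z_t)$, $E(B_{it}) \le C$, and Assumption 4d, this implies that the random field $\{\breve z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ is $L_2$-NED on the random field $\{\tilde z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde Q$ equipped with semimetric $\tilde\rho$, with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ since $\max_{i,i'\in\sigma,\, \sigma\in S_n,\, t,t'\in\mathcal{T}} \rho(l_{it}, l_{i't'}) \le C$, and constant NED scaling factors.
9 For notational simplicity I drop the $n$ subscript and write $\breve z_{it}$ instead of $\breve z_{n,it}$.
Lemma 7. Under Assumption 6, and Assumption 7 or Assumption 8, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in D_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in S_n$, and $\#\{\sigma \in S_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in D_n$, $C < \infty$, $\forall\, n$:
\[ \frac{1}{n} \sum_t \breve z_{n,t}' x_{n,t} - A_n \xrightarrow{p} 0 \tag{C.45} \]
where $A_n = \frac{1}{n} E(\sum_t \breve z_{n,t}' x_{n,t})$.
Proof. The proof is identical to the proof of Lemma 4 in part B of this appendix.
Lemma 8. Under Assumptions 5, 6, and Assumption 7 or Assumption 8, as $n \to \infty$ while $N_{n,d} \le C$ $\forall\, d \in D_n$, $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\} \le C$ $\forall\, \sigma \in S_n$, and $\#\{\sigma \in S_n : i \in \sigma \text{ or } d \in \delta(\sigma)\} \le C$ $\forall\, i \in \mathcal{N}$, $d \in D_n$, $C < \infty$, $\forall\, n$:
\[ \frac{1}{n} \sum_{i\in\mathcal{N}} \ \sum_{j\in\mathcal{N}:\ d_{it} = d_{jt}} \breve z_{i,t} \breve z_{j,t} u_{i,t} u_{j,t} - V_t \xrightarrow{p} 0 \tag{C.46} \]
where $V_t = \lim_{n\to\infty} \frac{1}{n} Var(\breve z_{n,t}' u_{n,t})$.
Proof.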
Proof using Assumption 7
Under Assumptions 6 and 7, we can use Hölder's inequality, Minkowski's inequality, and Jensen's inequality to verify Markov's condition as in the proof of Lemma 5 and obtain the desired result.
Proof using Assumption 8
Define $\bar V_{n,t} = \frac{1}{n} \sum_{i\in\mathcal{N}} \sum_{j\in\mathcal{N}:\ d_{jt} = d_{it}} \breve z_{it} \breve z_{jt}\, E(u_{it} u_{jt} \mid \{z_{i't}\}_{i'\in\mathcal{N}:\ d_{i't} = d_{it}})$.
First, we show that $\bar V_{n,t} - V_t \xrightarrow{p} 0$:
Under Assumption 4d, $E(u_{it} u_{jt} \mid \{z_{i't}\}_{i'\in\mathcal{N}:\ d_{i't} = d_{it}})$ is $L_2$-NED on the random field $\{z_{it} : i \in \mathcal{N}, t \in \mathcal{T}, n \in \mathbb{N}\}$ on the lattice $\tilde Q$ equipped with semimetric $\tilde\rho$ with NED coefficients $\psi(.)$ such that $\psi(r) = 0$ for $r > C$ and constant NED scaling factors. As in the proof of Lemma 4, we can verify that the rest of the conditions for Theorem 1 of Jenish and Prucha (2012) hold, so that $\bar V_{n,t} - V_t \xrightarrow{p} 0$.
Second, note that Assumption 6c implies that Assumption 1a of Kuersteiner and Prucha (2013) holds.
Finally, note that Assumption 5 implies that Assumption 2 of Kuersteiner and Prucha (2013) holds.
Therefore all of the conditions for Lemma 1 of Kuersteiner and Prucha (2013) to apply are met, so that we have:
\[ \frac{1}{n} \sum_{i\in\mathcal{N}} \ \sum_{j\in\mathcal{N}:\ d_{it} = d_{jt}} \breve z_{i,t} \breve z_{j,t} u_{i,t} u_{j,t} - V_t \xrightarrow{p} 0 \tag{C.47} \]
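The double sum over pairs of students who share a teacher in period $t$ does not need to be computed pair by pair, since it equals the sum over teachers of the squared within-teacher total of $\breve z_{it} u_{it}$. The Python sketch below illustrates this for a single period with scalar instruments; the function name and the toy inputs are hypothetical.

```python
from collections import defaultdict

def within_teacher_cross_products(z, u, teacher):
    """(1/n) * sum_i sum_{j: teacher[j] == teacher[i]} z_i z_j u_i u_j.

    The double sum is computed as the sum over teachers of the squared
    within-teacher total of z_i * u_i, which is algebraically identical.
    """
    totals = defaultdict(float)
    for z_i, u_i, d_i in zip(z, u, teacher):
        totals[d_i] += z_i * u_i
    return sum(total ** 2 for total in totals.values()) / len(z)


# Toy inputs (hypothetical): four students, two teachers, one period.
z = [1.0, 2.0, -1.0, 0.5]
u = [0.5, -1.0, 2.0, 4.0]
teacher = ["d1", "d1", "d2", "d2"]
stat = within_teacher_cross_products(z, u, teacher)
```

Grouping by teacher keeps the cost linear in the number of students, which matters when the number of students per teacher is bounded but the number of teachers grows with n.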
C.5
Proof of Proposition 4
Consistency is obtained from the result of Lemma 7, the continuous mapping theorem, Assumption 6a, and:
\[ \frac{1}{n} \sum_t \breve z_{n,t}' u_{n,t} \xrightarrow{p} 0 \tag{C.48} \]
which can be shown by showing convergence in mean-squared error as in the proof of Proposition 1.
Asymptotic normality is obtained from the result of Lemma 7, the asymptotic equivalence lemma, Assumption 6a, and:
\[ V^{-1/2} \frac{1}{\sqrt{n}} \sum_t \breve z_{n,t}' u_{n,t} \xrightarrow{d} N(0, I) \tag{C.49} \]
In order to show this last result, we use the central limit theorem given by Theorem 2 of Kuersteiner and Prucha (2013).
As in the proof of Lemma 8, Assumption 6c implies that Assumption 1a of Kuersteiner and Prucha (2013) holds and Assumption 5 implies that Assumption 2 of Kuersteiner and Prucha (2013) holds.
Lemma 8 guarantees that Assumption 1c of Kuersteiner and Prucha (2013) holds.
Therefore all conditions for Theorem 2 of Kuersteiner and Prucha (2013) are met and we obtain the desired result.
C.6
Proof of Proposition 5
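The first result of Proposition 5 is shown by Lemma 7.
For the second result of Proposition 5, define:
\[ \hat V_n = \frac{1}{n} \sum_{\sigma_1\in S_n} \sum_{t_1\in\mathcal{T}} \breve z_{n,t_1}^{(\sigma_1)\prime} \hat v_{n,t_1}^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1),\, t_2\in\mathcal{T}} \hat v_{n,t_2}^{(\sigma_2)\prime} \breve z_{n,t_2}^{(\sigma_2)} \tag{C.50} \]
We need to show that $\hat V_n - V = o_p(1)$.
Note that $\sum_{t=1}^{T} \breve z_{n,t}^{(\sigma)\prime} g_{n,t}^{(\sigma)} = 0$, so that, defining:
\[ T_{n,1} = \frac{1}{n} \sum_{\sigma_1\in S_n} \sum_{t_1\in\mathcal{T}} \breve z_{n,t_1}^{(\sigma_1)\prime} u_{n,t_1}^{(\sigma_1)} \sum_{\sigma_2\in U_n(\sigma_1),\, t_2\in\mathcal{T}} u_{n,t_2}^{(\sigma_2)\prime} \breve z_{n,t_2}^{(\sigma_2)} \tag{C.51} \]
we can show as in the proof of Proposition 2 that:
\[ \hat V_n - T_{n,1} = o_p(1) \tag{C.52} \]
Define $\breve z_{it}^{(\sigma)}$ to be the element of $\breve z_n^{(\sigma)}$ corresponding to observation $\{i,t\}$ for $i \in \sigma$. We have:
\[ T_{n,1} = \frac{1}{n} \sum_{i\in\mathcal{N},\, t\in\mathcal{T}} \ \sum_{j\in\mathcal{N},\, s\in\mathcal{T}} \ \sum_{\sigma,\sigma'\in S_n:\ i\in\sigma,\ j\in\sigma'} 1[\sigma' \in U_n(\sigma)]\, \breve z_{it}^{(\sigma)} \breve z_{js}^{(\sigma')} u_{it} u_{js} \tag{C.53} \]
We can use the same arguments as in the proof of Proposition 2 to show that $T_{n,1} - V_n = o_p(1)$ can be obtained from the conditions of Proposition 5 under Assumption 7 or Assumption 8 as in the proof of Lemma 8.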
C.7
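Proof of Proposition 6
Define $M_{n,t} = [-g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_n^{t+1\prime},\ I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}']$ and recall the definition $\Sigma_{n,t} = (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}')^{-}$.
Under the assumptions of Proposition 6, we have:
\[
\begin{aligned}
A_n &= \frac{1}{n} \sum_t E(\tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t) && \text{(C.54)} \\
V &= \frac{1}{n} \sum_t E(\tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t) && \text{(C.55)} \\
&= A_n && \text{(C.56)}
\end{aligned}
\]
The second equality follows from $\Sigma_{n,t} = (M_{n,t} M_{n,t}')^{-}$ and from $M_{n,t}\, g_n^t = 0$, so that for $s < t$:
\[ M_{n,t}\, g_n^t (g_n^{s\prime} g_n^s)^{-} g_{n,s}' = 0 \tag{C.57} \]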
This last result implies that the moment conditions given by (C.15) are indeed serially uncorrelated under the assumptions of homoscedasticity and serial and cross-sectional uncorrelation of Proposition 6. This is why it is advantageous to define a locally efficient estimator from these moment conditions rather than from the moment conditions (C.14), which are not serially uncorrelated under the assumptions of Proposition 6.
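Therefore, under the assumptions of Proposition 6, we have:
\[
\begin{aligned}
B_n^{-1} W_n B_n^{-1} - A_n^{-1} V_n A_n^{-1}
&= B_n^{-1} W_n B_n^{-1} - E\left(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t\right)^{-1} && \text{(C.58)} \\
&= E\left(\frac{1}{n} \sum_t \acute z_{n,t}' M_{n,t} \tilde z_n^t\right)^{-1} \times && \text{(C.59)} \\
&\quad E\left(\frac{1}{n} \sum_t \acute z_{n,t}' M_{n,t} (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') M_{n,t}' \acute z_{n,t}\right) \times && \text{(C.60)} \\
&\quad E\left(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \acute z_{n,t}\right)^{-1} && \text{(C.61)} \\
&\quad - E\left(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t\right)^{-1} && \text{(C.62)} \\
&= \frac{1}{n} \sum_t E(v_{n,t}' (I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') v_{n,t}) && \text{(C.63)}
\end{aligned}
\]
where $v_{n,t} \equiv \acute z_{n,t}\, E\left(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \acute z_{n,t}\right)^{-1} - \Sigma_{n,t} M_{n,t} \tilde z_n^t\, E\left(\frac{1}{n} \sum_t \tilde z_n^{t\prime} M_{n,t}' \Sigma_{n,t} M_{n,t} \tilde z_n^t\right)^{-1}$. The last equality follows from:
\[
\begin{aligned}
(I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}') \Sigma_{n,t} M_{n,t} &= M_{n,t} M_{n,t}' (M_{n,t} M_{n,t}')^{-} M_{n,t} && \text{(C.64)} \\
&= M_{n,t} && \text{(C.65)}
\end{aligned}
\]
This concludes the proof since $I_n - g_{n,t}(g_n^{t\prime} g_n^t)^{-} g_{n,t}' = M_{n,t} M_{n,t}'$ and positive definiteness is preserved by sums and expected values.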
D
Effect of Class Size on Student Achievement: Details and Additional Results
D.1
Details of the Estimation Methods
D.1.1
Dynamic model with all sources of unobserved heterogeneity
In order to implement the estimator and the associated standard errors of section 4.2, $S_n$ and $\tilde z_n^{t(\sigma)}$ need to be defined.
Let $A_n$ be the set of all schools in the dataset (1,369 schools). Let $a_{it}$ denote the school attended by student $i$ in year $t$ ($school_{it}$ in the main text). For each pair of schools $a, a' \in A_n$, let $\#_{a,a'} = \sum_{i,t\in\mathcal{N}\times\mathcal{T}} 1[\exists\, s \in \mathcal{T} : a_{it} = a, a_{is} = a']$. For any $a \in A_n$, let $a_1(a) = \arg\max_{a'\in A_n,\, a'\ne a} \#_{a,a'}$ and $a_2(a) = \arg\max_{a'\in A_n,\, a'\ne a,\, a'\ne a_1(a)} \#_{a,a'}$.
In this empirical application, $S_n$ was defined to be $\{\{i \in \mathcal{N} : a_{it} \in \{a, a_1(a), a_2(a)\}\ \forall\, t \in \mathcal{T}\}\}_{a\in A_n}$, i.e. for each school $a$, the set of students who attended only school $a$ or one of the two schools with the most student transfers to or from $a$ in grades 4 and 5.
This definition of $S_n$ resulted in 1,369 groups with an average of approximately 31 teachers per group and an average number of overlapping groups, $\#U_n(\sigma)$, per group $\sigma \in S_n$, of approximately 8, so that the conditions in Sections 3 and 4 that $\#\{d : d_{it} = d, i \in \sigma, t \in \mathcal{T}\}$ and $\#\{\sigma \in S_n : i \in \sigma \text{ or } d \in \delta(\sigma)\}$ be relatively small seem appropriate here.^10
In this empirical application, $y_{it-1}$ is the only sequentially exogenous covariate, the other covariates (grade-year indicators and the polynomial in class size) being treated as strictly exogenous.
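The construction of $S_n$ described above can be sketched in a few lines of Python. This is a toy illustration only: the input format (one list of attended schools per student), the function name and the handling of ties between candidate partner schools (left to the order returned by `Counter.most_common`) are assumptions, not details taken from the empirical application.

```python
from collections import Counter, defaultdict

def build_groups(schools_by_student):
    """schools_by_student: dict student -> list of schools attended, one per year."""
    # counts[a][b]: number of (student, year) pairs with school a in that year
    # and school b in some year, for b != a.
    counts = defaultdict(Counter)
    for path in schools_by_student.values():
        for a in path:
            for b in set(path):
                if b != a:
                    counts[a][b] += 1
    all_schools = {a for path in schools_by_student.values() for a in path}
    groups = {}
    for a in all_schools:
        # The two schools most connected to a through student transfers.
        partners = [b for b, _ in counts[a].most_common(2)]
        allowed = {a, *partners}
        # Students whose schools in every year belong to the allowed set.
        groups[a] = {i for i, path in schools_by_student.items()
                     if set(path) <= allowed}
    return groups


# Toy data (hypothetical): ten students observed for two years.
toy = {
    "s1": ["A", "A"], "s2": ["A", "B"], "s3": ["A", "B"], "s4": ["A", "B"],
    "s5": ["A", "C"], "s6": ["A", "C"], "s7": ["A", "D"], "s8": ["B", "B"],
    "s9": ["C", "D"], "s10": ["D", "D"],
}
groups = build_groups(toy)
```

In the toy data, school A is most connected to schools B and C, so the group indexed by A contains every student who only ever attended A, B or C. Groups indexed by different schools overlap, which is why cross-products between overlapping groups enter the standard errors.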
In order to form $\tilde z_n^{t(\sigma)}$ we thus only need to replace $E(y_{it-1} \mid Y^{t-2}, X)$ by an estimated auxiliary model since here only two time periods, grades 4 and 5, are available for each student.
This was obtained by a sequence of grade-year specific ordinary least squares regressions of $y_{it-1}$ on $y_{it-2}$, $\frac{1}{\#_{d_{it-2},t-2}} \sum_{j:\ d_{jt-2} = d_{it-2}} y_{jt-2}$, $\frac{1}{\#_{d_{it-1},t-1}} \sum_{j:\ d_{jt-1} = d_{it-1}} y_{jt-2}$, $\frac{1}{\#_{d_{it},t}} \sum_{j:\ d_{jt} = d_{it}} y_{jt-2}$, $cs_{it-2}$, $cs_{it-1}$, $cs_{it}$, where $\#_{d,t} = \#\{i \in \mathcal{N} : d_{it} = d\}$.
D.1.2
Dynamic model with student unobserved heterogeneity only
This estimator was calculated as in Section 5.2 and in the section above but replacing $g_n$ with $[1[i = j]]^{j\in\mathcal{N}}_{i\in\mathcal{N},\, t\in\mathcal{T}}$ and $S_n$ with $\{\mathcal{N}\}$.
10
As was noted in Section 3.1, one could also have defined $S_n = \{\{i \in \mathcal{N} : a_{it} = a\ \forall\, t \in \mathcal{T}\}\}_{a\in A_n}$ (the collection of students who are "stayers" at the school level for each school) so that $U_n(\sigma) = \{\sigma\}$ $\forall\, \sigma \in S_n$ and standard errors could be computed without having to compute cross-products, although generally this definition of $S_n$ could lead to a smaller effective sample size and a loss in efficiency.
D.1.3
Dynamic model with teacher and school unobserved heterogeneity only, treated as finite dimensional parameters that can be estimated consistently
This estimator was calculated by a pooled regression of $y_{it}$ on $y_{it-1}$, grade-year indicators, the polynomial in class size, and indicator variables for each school and teacher.
D.1.4
Dynamic model with no unobserved heterogeneity
This estimator was calculated by a pooled regression of $y_{it}$ on $y_{it-1}$, grade-year indicators and the polynomial in class size.
D.1.5
Static model and model in gains
For the static model ($\rho_0$ assumed to be equal to zero) or the model in gains ($\rho_0$ assumed to be equal to one), the resulting model has covariates that are all treated as strictly exogenous. Therefore the model with all sources of unobserved heterogeneity is estimated as in section D.1.1 but without having to use an auxiliary model for predicting the covariates based on the instrumental variables, and replacing the dependent variable with $\Delta y_{it}$ for the model in gains. The model with unobserved heterogeneity indexed by student only is estimated by one-way fixed effects estimation. The models with teacher and school unobserved heterogeneity only and with no unobserved heterogeneity are estimated by ordinary least squares regressions, simply excluding $y_{it-1}$ from the list of covariates.
D.2
Classical Measurement Error in Test Score
In this section I discuss results obtained with estimators for a model as in (6.1) and (6.2) in the main text, but where test scores measure student achievement up to an additive error term which is independent of the explanatory variables and independent across subjects. This type of measurement error was considered in Andrabi et al. (2011) and Verdier (2016) for models with one-way unobserved heterogeneity only.
The resulting model is:
\[ y_{it}^{*} = \alpha_{grade_{it},t} + \rho_0\, y_{it-1}^{*} + x_{it}\beta_0 + c_i + e_{d_{it}} + f_{school_{it}} + u_{it} \tag{D.1} \]
\[ y_{it} = y_{it}^{*} + \epsilon_{it} \tag{D.2} \]
\[ E(u_{it} \mid Y^{other,t-1}, X) = 0 \tag{D.3} \]
\[ E(\epsilon_{it} \mid Y^{other,t}, X) = 0 \tag{D.4} \]
where $y_{it}^{*}$ is student $i$'s achievement in year $t$ in mathematics, $y_{it}$ is the student's test score in mathematics, so that $\epsilon_{it}$ is measurement error, $Y^{other,t} = \{y_{is}^{other}\}_{i\in\mathcal{N},\, s=1,...,t}$ and $y_{it}^{other}$ is student $i$'s test score in reading in year $t$. All other variables are defined as in the main text. The definitions are simply exchanged when reading is of interest instead of mathematics.
Therefore we can rewrite:
\[ y_{it} = \alpha_{grade_{it},t} + \rho_0\, y_{it-1} + x_{it}\beta_0 + c_i + e_{d_{it}} + f_{school_{it}} + u_{it} + \epsilon_{it} - \rho_0\, \epsilon_{it-1} \tag{D.5} \]
\[ E(u_{it} + \epsilon_{it} - \rho_0\, \epsilon_{it-1} \mid Y^{other,t-1}, X) = 0 \tag{D.6} \]
so that estimation of this model can proceed as in the main text but using past test scores in reading as instruments for past test scores in mathematics and vice versa.
Similarly, for this dynamic model with student level unobserved heterogeneity only, estimation is done as in the main text but using past test scores in reading as instruments for past test scores in mathematics and vice versa, as in Andrabi et al. (2011) and Verdier (2016).
For this dynamic model with teacher and school level unobserved heterogeneity only (treated as finite dimensional parameters that can be estimated consistently) and for the model without any unobserved heterogeneity, estimation is not done by ordinary least squares regression as in the main text but rather by instrumental variable regression, using past test scores in reading as instruments for past test scores in mathematics and vice versa.
Note that for the model in gains ($\rho_0$ is assumed to be one) or the static model ($\rho_0$ is assumed to be zero), the presence of measurement error of the type considered here does not invalidate the estimators considered in the main text, so the results for these models are unchanged in this section.
The results are presented in Tables 1 and 2 of the Appendix, and additional visualizations are shown in Figures 2 to 4 of the Appendix.
With the model with measurement error considered in this section, the effect of class size on student achievement is estimated to be higher than in the main text, and the strength of persistence in the effect of past educational inputs is also estimated to be higher (for mathematics for instance, $\rho_0$ is estimated to be 0.315 here instead of 0.094 in the main text).
The effect of class size is still estimated to be higher when using a model that includes all sources of unobserved heterogeneity rather than a model that excludes student or teacher and school level unobserved heterogeneity (Figure 3). However the difference between the estimates of the effect of class size corresponding to the models with all sources of unobserved heterogeneity and with teacher and school level unobserved heterogeneity only is smaller here than in the main text.
Since persistence is estimated to be stronger here than in the main text, the difference between the results obtained for mathematics with a static model and with a dynamic model is larger than in the main text, and the difference between the results obtained with a model in gains and with a dynamic model is smaller than in the main text (Figure 4).
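The role of the other subject's past test score as an instrument can be illustrated with a small simulation. The Python sketch below is a deliberately stripped-down version of (D.1) to (D.6): it has no covariates and no unobserved heterogeneity, a single cross-section, and arbitrary variances, all of which are assumptions made for illustration. It compares the least squares slope on the lagged mathematics score, which is attenuated by measurement error, with the instrumental variable slope that uses the lagged reading score as an instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.3

ach_lag = rng.normal(size=n)                    # latent lagged achievement
ach = rho * ach_lag + rng.normal(size=n)        # latent current achievement
math_lag = ach_lag + rng.normal(size=n)         # lagged math score, with error
math = ach + rng.normal(size=n)                 # current math score, with error
read_lag = 0.8 * ach_lag + rng.normal(size=n)   # lagged reading score (instrument)

def cov(a, b):
    """Sample covariance of two one-dimensional arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

rho_ols = cov(math_lag, math) / cov(math_lag, math_lag)   # attenuated toward zero
rho_iv = cov(read_lag, math) / cov(read_lag, math_lag)    # uses reading as instrument
```

With these variances the least squares slope converges to about half of `rho`, while the instrumental variable slope converges to `rho`, because the measurement error in the lagged mathematics score is independent of the lagged reading score.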
D.3
Results for an Estimation Sample with Class Size between Five and Thirty Students Instead of Ten to Thirty Students
Tables 3 to 9 and Figures 5 to 11 show the same results as in section 6 of the paper and section D.2 of this appendix for a larger estimation sample where class size is restricted to be between five and thirty students instead of being restricted to be between ten and thirty students.
References Andrabi, T., J. Das, A. Ijaz Khwaja, and T. Zajonc (2011, July). Do ValueAdded Estimates Add Value? Accounting for Learning Dynamics. American Economic Journal. Applied Economics 3 (3), 29–54. Chamberlain, G. (1992, January). Comment: Sequential Moment Restrictions in Panel Data. Journal of Business & Economic Statistics 10 (1), 20–26. Jenish, N. and I. R. Prucha (2009, May). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics 150 (1), 86–98. Jenish, N. and I. R. Prucha (2012, September). On spatial processes and asymptotic inference under nearepoch dependence. Journal of Econometrics 170 (1), 178–190. Jochmans, K. and M. Weidner (2016). FixedEffects Regression on Network Data. Working Paper . Kemeny, J. G., J. L. Snell, and A. W. Knapp (1976). Denumerable Markov Chains (second ed.), Volume 40 of Graduate Texts in Mathematics. Springer New York. Kuersteiner, G. M. and I. R. Prucha (2013, June). Limit theory for panel data models with cross sectional dependence and sequential exogeneity. Journal of Econometrics 174 (2), 107–126. Verdier, V. (2016, January). Estimation of Dynamic Panel Data Models with CrossSectional Dependence: Using Cluster Dependence for Efficiency. Journal of Applied Econometrics 31 (1), 85–105. White, H. (2001). Asymptotic theory for econometricians. San Diego: Academic Press.
Table 1: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics, model with measurement error.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.315 (0.034)    0.324 (0.029)    0.928 (0.001)    0.936 (0.002)
Class size reduction from thirty to:
  twenty-five         0.371 (0.181)    0.095 (0.110)    0.416 (0.133)    −0.012 (0.135)
  twenty              0.730 (0.172)    0.468 (0.103)    0.676 (0.124)    0.177 (0.124)
  fifteen             0.931 (0.199)    0.523 (0.120)    0.839 (0.142)    0.364 (0.142)
  ten                 0.539 (0.241)    0.058 (0.139)    0.444 (0.169)    −0.295 (0.159)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
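As a reading aid, the estimates and standard errors in Table 1 can be turned into pointwise ninety-five percent confidence intervals. The sketch below does this for the estimator with three-way unobserved heterogeneity, assuming a normal approximation with critical value 1.96. It is an illustration only, not the procedure used to draw the figures, and the names `estimates`, `Z_95` and `pointwise_ci` are hypothetical.

```python
# Three-way unobserved heterogeneity estimates from Table 1 (mathematics,
# model with measurement error): (estimate, standard error) for a class
# size reduction from thirty to each size.
estimates = {
    "twenty-five": (0.371, 0.181),
    "twenty": (0.730, 0.172),
    "fifteen": (0.931, 0.199),
    "ten": (0.539, 0.241),
}

Z_95 = 1.96  # assumed normal critical value for a two-sided 95% interval


def pointwise_ci(estimate, std_error, z=Z_95):
    """Return the (lower, upper) bounds of estimate +/- z * std_error."""
    return estimate - z * std_error, estimate + z * std_error


intervals = {size: pointwise_ci(est, se) for size, (est, se) in estimates.items()}

for size, (lower, upper) in intervals.items():
    print(f"thirty to {size}: [{lower:.3f}, {upper:.3f}]")
```

The intervals are pointwise: each one covers a single class size reduction and makes no adjustment for looking at several reductions jointly.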
Table 2: Estimates of persistence and of the effect of class size reductions on student achievement in reading, model with measurement error.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.446 (0.017)    0.476 (0.012)    0.832 (0.001)    0.839 (0.001)
Class size reduction from thirty to:
  twenty-five         0.078 (0.175)    −0.133 (0.102)   0.009 (0.107)    −0.266 (0.091)
  twenty              0.227 (0.172)    0.075 (0.093)    0.178 (0.101)    −0.200 (0.082)
  fifteen             0.302 (0.190)    0.172 (0.107)    0.235 (0.114)    −0.124 (0.093)
  ten                 0.397 (0.208)    0.106 (0.119)    0.263 (0.124)    −0.235 (0.101)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
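The notes to Tables 1 and 2 refer to two-way clustered standard errors by student and teacher. A minimal sketch of the usual inclusion-exclusion construction for OLS follows. It is not the code used to produce the tables: the function names, the simulated data and the absence of any small-sample correction are assumptions made for illustration.

```python
import numpy as np


def cluster_meat(X, resid, groups):
    """Sum over clusters g of (X_g' u_g)(X_g' u_g)'."""
    scores = X * resid[:, None]                  # n x k score contributions
    _, codes = np.unique(groups, return_inverse=True)
    sums = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(sums, codes, scores)               # per-cluster score sums
    return sums.T @ sums


def two_way_cluster_se(X, y, student, teacher):
    """OLS coefficients and two-way clustered standard errors (no small-sample correction)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Intersection clusters: unique (student, teacher) pairs.
    pair = np.array([f"{s}|{t}" for s, t in zip(student, teacher)])
    meat = (cluster_meat(X, resid, student)
            + cluster_meat(X, resid, teacher)
            - cluster_meat(X, resid, pair))
    cov = XtX_inv @ meat @ XtX_inv
    # The two-way estimate is not guaranteed to be positive semi-definite,
    # so a diagonal entry can in principle be negative.
    return beta, np.sqrt(np.diag(cov))


# Simulated example with student and teacher components in the error term.
rng = np.random.default_rng(0)
n_students, n_teachers, n_obs = 200, 40, 800
student = rng.integers(0, n_students, n_obs)
teacher = rng.integers(0, n_teachers, n_obs)
x = rng.normal(size=n_obs)
y = (0.5 * x
     + rng.normal(size=n_students)[student]
     + rng.normal(size=n_teachers)[teacher]
     + rng.normal(size=n_obs))
X = np.column_stack([np.ones(n_obs), x])

beta, se = two_way_cluster_se(X, y, student, teacher)
print("coefficients:", beta)
print("two-way clustered standard errors:", se)
```

When the two clustering variables coincide, the three terms collapse to a single one-way clustered term, which gives a simple check on the implementation.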
Figure 2: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals, model with measurement error. Panels: (a) Mathematics, (b) Reading.
Figure 3: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity, model with measurement error. Panels: (a) Mathematics, (b) Reading.
Figure 4: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence, model with measurement error. Panels: (a) Mathematics, (b) Reading.
Results when restricting estimation sample to class sizes between 5 and 30 students instead of between 10 and 30 students.

Table 3: Description of the estimation sample.

                                    Mathematics    Reading
Students                            238,997
Teachers                            12,569         13,125
Years                               2009 to 2012
Grades                              4 and 5
Test scores
  average                           354.92         349.05
  standard dev.                     9.39           9.55
Class size
  average                           21.51          20.34
  standard dev.                     4.55           5.66
Average test score by class size
  5-9 students                      352.88         347.40
  10-14 students                    353.35         347.95
  15-19 students                    353.29         347.51
  20-24 students                    354.83         349.06
  25-30 students                    356.75         351.04
Table 4: Observed frequencies of transition between class sizes, Mathematics.

Students, grade 4 (left) to grade 5 (top).
             5-9    10-14    15-19    20-24    25-30    Total
5-9          446      486    1,282    2,622    1,507    6,343
10-14        326    1,878    3,758    8,005    5,231   19,198
15-19        688    2,948   15,422   24,399    9,870   53,327
20-24        907    3,643   14,890   56,979   30,743  107,162
25-30        354    1,614    3,711   18,032   29,256   52,967
Total      2,721   10,569   39,063  110,037   76,607  238,997

Teachers, 2009 to 2010, 2010 to 2011, 2011 to 2012.
             5-9    10-14    15-19    20-24    25-30    Total
5-9           63       46       92      117       59      377
10-14         36      162      279      338      164      979
15-19         46      195    1,038    1,294      387    2,960
20-24         48      211      919    2,774    1,328    5,280
25-30         26       73      210    1,042    1,244    2,595
Total        219      687    2,538    5,565    3,182   12,191
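One way to read the students panel of Table 4 is through row shares, that is, the fraction of students from each grade 4 class size bin who are observed in each grade 5 bin. The sketch below recomputes the margins of the table and these shares with NumPy. The variable names are illustrative and the row-share summary is an addition for exposition, not a statistic reported in the paper.

```python
import numpy as np

bins = ["5-9", "10-14", "15-19", "20-24", "25-30"]

# Students panel of Table 4 (mathematics): rows are the grade 4 class size
# bin, columns are the grade 5 class size bin.
counts = np.array([
    [446, 486, 1282, 2622, 1507],
    [326, 1878, 3758, 8005, 5231],
    [688, 2948, 15422, 24399, 9870],
    [907, 3643, 14890, 56979, 30743],
    [354, 1614, 3711, 18032, 29256],
])

row_totals = counts.sum(axis=1)   # "Total" column of the table
col_totals = counts.sum(axis=0)   # "Total" row of the table

# Share of students from each grade 4 bin observed in each grade 5 bin.
transition_shares = counts / row_totals[:, None]

# Share of students whose class size bin differs between grades 4 and 5.
share_changing_bin = 1 - np.trace(counts) / counts.sum()

for label, shares in zip(bins, transition_shares):
    print(f"{label:>5}: {np.round(shares, 3)}")
print(f"share changing bin: {share_changing_bin:.3f}")
```

The recomputed margins can be compared directly with the Total row and column of the table, which is a quick consistency check on the transcription.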
Table 5: Observed frequencies of transition between class sizes, Reading.

Students, grade 4 (left) to grade 5 (top).
             5-9    10-14    15-19    20-24    25-30    Total
5-9        3,171    3,823    3,902    6,910    3,931   21,737
10-14      2,474    5,077    5,990   11,202    6,531   31,274
15-19      1,600    3,551   12,837   20,823    8,646   47,457
20-24      1,732    3,926   12,862   47,244   25,410   91,174
25-30        469    1,659    3,407   15,410   26,410   47,355
Total      9,446   18,036   38,998  101,589   70,928  238,997

Teachers, 2009 to 2010, 2010 to 2011, 2011 to 2012.
             5-9    10-14    15-19    20-24    25-30    Total
5-9          271      245      278      415      188    1,397
10-14        134      329      430      588      243    1,724
15-19        115      202      788    1,085      340    2,530
20-24        107      213      716    2,129    1,112    4,277
25-30         23       62      166      873    1,079    2,203
Total        650    1,051    2,378    5,090    2,962   12,131
Table 6: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.094 (0.009)    0.109 (0.006)    0.670 (0.001)    0.684 (0.001)
Class size reduction from thirty to:
  twenty-five         0.318 (0.141)    0.095 (0.088)    0.189 (0.106)    −0.314 (0.119)
  twenty              0.611 (0.145)    0.396 (0.084)    0.373 (0.103)    −0.238 (0.114)
  fifteen             0.742 (0.148)    0.421 (0.086)    0.423 (0.106)    −0.215 (0.115)
  ten                 0.692 (0.176)    0.289 (0.106)    0.387 (0.132)    −0.343 (0.141)
  five                0.559 (0.248)    0.721 (0.173)    0.483 (0.191)    −0.376 (0.196)

Model in Gains
Class size reduction from thirty to:
  twenty-five         0.389 (0.227)    0.078 (0.136)    0.303 (0.119)    −0.059 (0.119)
  twenty              0.777 (0.226)    0.552 (0.132)    0.632 (0.116)    0.298 (0.114)
  fifteen             1.000 (0.237)    0.609 (0.135)    0.743 (0.119)    0.369 (0.115)
  ten                 0.994 (0.294)    0.385 (0.166)    0.660 (0.148)    0.146 (0.142)
  five                0.799 (0.407)    0.960 (0.280)    0.680 (0.209)    0.327 (0.203)

Static Model
Class size reduction from thirty to:
  twenty-five         0.310 (0.136)    0.097 (0.085)    −0.149 (0.228)   −1.610 (0.280)
  twenty              0.593 (0.141)    0.377 (0.081)    −0.671 (0.221)   −3.312 (0.269)
  fifteen             0.715 (0.143)    0.398 (0.083)    −0.985 (0.228)   −3.704 (0.275)
  ten                 0.661 (0.169)    0.278 (0.102)    −0.861 (0.276)   −3.275 (0.339)
  five                0.534 (0.238)    0.692 (0.165)    −0.425 (0.369)   −4.402 (0.498)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
Table 7: Estimates of persistence and of the effect of class size reductions on student achievement in reading.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.225 (0.006)    0.256 (0.004)    0.690 (0.001)    0.719 (0.001)
Class size reduction from thirty to:
  twenty-five         0.111 (0.139)    −0.129 (0.078)   0.185 (0.134)    −0.381 (0.127)
  twenty              0.215 (0.144)    0.040 (0.076)    0.194 (0.139)    −0.582 (0.130)
  fifteen             0.292 (0.142)    0.110 (0.075)    0.195 (0.132)    −0.531 (0.124)
  ten                 0.295 (0.161)    0.031 (0.089)    0.210 (0.143)    −0.384 (0.134)
  five                0.147 (0.170)    0.102 (0.105)    0.115 (0.149)    −0.529 (0.138)

Model in Gains
Class size reduction from thirty to:
  twenty-five         0.089 (0.215)    −0.249 (0.122)   0.139 (0.109)    −0.084 (0.088)
  twenty              0.161 (0.223)    −0.081 (0.119)   0.376 (0.108)    0.342 (0.083)
  fifteen             0.293 (0.219)    0.035 (0.116)    0.501 (0.106)    0.448 (0.080)
  ten                 0.368 (0.249)    0.008 (0.139)    0.475 (0.123)    0.252 (0.097)
  five                0.071 (0.278)    0.121 (0.162)    0.424 (0.135)    0.613 (0.108)

Static Model
Class size reduction from thirty to:
  twenty-five         0.117 (0.128)    −0.088 (0.070)   −0.276 (0.211)   −1.455 (0.243)
  twenty              0.230 (0.132)    0.082 (0.068)    −0.769 (0.211)   −3.328 (0.238)
  fifteen             0.291 (0.130)    0.136 (0.066)    −0.993 (0.203)   −3.558 (0.229)
  ten                 0.273 (0.148)    0.039 (0.079)    −0.944 (0.229)   −2.716 (0.269)
  five                0.169 (0.152)    0.095 (0.094)    −1.091 (0.239)   −3.996 (0.284)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
Figure 5: Average test score by class size. Panels: (a) Mathematics, (b) Reading.
Figure 6: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals. Panels: (a) Mathematics, (b) Reading.
Figure 7: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity. Panels: (a) Mathematics, (b) Reading.
Figure 8: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence. Panels: (a) Mathematics, (b) Reading.
Table 8: Estimates of persistence and of the effect of class size reductions on student achievement in mathematics, model with measurement error.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.297 (0.033)    0.300 (0.027)    0.927 (0.001)    0.935 (0.002)
Class size reduction from thirty to:
  twenty-five         0.333 (0.154)    0.091 (0.096)    0.270 (0.113)    −0.160 (0.118)
  twenty              0.648 (0.158)    0.429 (0.092)    0.538 (0.110)    0.065 (0.113)
  fifteen             0.800 (0.163)    0.461 (0.094)    0.617 (0.113)    0.105 (0.114)
  ten                 0.760 (0.195)    0.310 (0.116)    0.550 (0.141)    −0.075 (0.140)
  five                0.613 (0.275)    0.772 (0.191)    0.600 (0.202)    0.021 (0.199)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
Table 9: Estimates of persistence and of the effect of class size reductions on student achievement in reading, model with measurement error.

                      Three-way        Only student     No student       No
                      unobserved       unobserved       unobserved       unobserved
                      heterogeneity    heterogeneity    heterogeneity    heterogeneity
Dynamic Model
Persistence           0.442 (0.015)    0.472 (0.011)    0.832 (0.001)    0.838 (0.001)
Class size reduction from thirty to:
  twenty-five         0.105 (0.156)    −0.164 (0.089)   0.069 (0.093)    −0.306 (0.079)
  twenty              0.200 (0.161)    0.005 (0.086)    0.183 (0.092)    −0.252 (0.076)
  fifteen             0.292 (0.159)    0.088 (0.085)    0.250 (0.090)    −0.200 (0.073)
  ten                 0.315 (0.181)    0.024 (0.101)    0.236 (0.106)    −0.228 (0.087)
  five                0.126 (0.195)    0.108 (0.119)    0.169 (0.117)    −0.132 (0.096)
Numbers in parentheses are standard errors. Standard errors for the first estimator are calculated as discussed in section 4.2 and are robust to arbitrary dependence indexed by school and student. Standard errors for the other estimators are two-way clustered by student and teacher. Note that there is likely no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. The effects of class size reductions are expressed in the original test score units, as described in the text.
Figure 9: Effect of reducing class size from thirty with pointwise ninety-five percent confidence intervals, model with measurement error. Panels: (a) Mathematics, (b) Reading.
Figure 10: Effect of reducing class size from thirty, estimators for dynamic model with different sources of unobserved heterogeneity, model with measurement error. Panels: (a) Mathematics, (b) Reading.
Figure 11: Effect of reducing class size from thirty, estimators for model with all sources of unobserved heterogeneity and different restrictions on persistence, model with measurement error. Panels: (a) Mathematics, (b) Reading.