A Practical Introduction to Regression Discontinuity Designs

Matias D. Cattaneo∗        Nicolás Idrobo†        Rocío Titiunik‡

May 29, 2017

Monograph prepared for Cambridge Elements: Quantitative and Computational Methods for Social Science, Cambridge University Press, http://www.cambridge.org/us/academic/elements/quantitative-and-computational-methods-social-science

** PRELIMINARY AND INCOMPLETE – COMMENTS WELCOME **

∗ Department of Economics and Department of Statistics, University of Michigan.
† Department of Economics, University of Michigan.
‡ Department of Political Science, University of Michigan.

Contents

Acknowledgments

1 Introduction
   1.1 Software for RD Analysis
   1.2 Running Empirical Example

2 The RD Design: Definition and Taxonomy
   2.1 The Sharp RD Design
   2.2 The Fuzzy RD Design
   2.3 The Kink RD Designs
   2.4 The Multi-Cutoff RD Designs
   2.5 The Multi-Score and Geographic RD Designs
   2.6 The Local Nature of RD Effects
   2.7 Further Readings

3 Graphical Illustration of RD Treatment Effects
   3.1 ES versus QS Bins
   3.2 Choosing the Number of Bins Optimally
       3.2.1 Binning to Trace Out the Underlying Regression Functions
       3.2.2 Binning to Mimic the Variability of the Data
   3.3 Recommendations for Practice
   3.4 Further Readings

4 The Continuity-Based Approach to RD Analysis
   4.1 Local Polynomial Approach: Overview
   4.2 Local Polynomial Point Estimation
       4.2.1 Choice of Kernel Function and Polynomial Order
       4.2.2 Bandwidth Selection and Implementation
       4.2.3 Optimal Point Estimation
       4.2.4 Point Estimation in Practice
   4.3 Local Polynomial Inference
       4.3.1 Using the MSE-Optimal Bandwidth for Inference
       4.3.2 Using Different Bandwidths for Point Estimation and Inference
       4.3.3 Statistical Inference in Practice
   4.4 Extensions: Covariates and Clustering
   4.5 Recommendations for Practice
   4.6 Further Readings

5 The Local Randomization Approach to RD Analysis
   5.1 Local Randomization Approach: Overview
   5.2 Local Randomization Estimation and Inference
       5.2.1 Finite Sample Methods
       5.2.2 Large Sample Methods
       5.2.3 Estimation and Inference in Practice
   5.3 How to Choose the Window
   5.4 When to Use the Local Randomization Approach
   5.5 Recommendations for Practice
   5.6 Further Readings

6 Validation and Falsification of the RD Design
   6.1 Density of Running Variable
   6.2 Covariates and Placebo Outcomes
       6.2.1 Continuity-based Approach
       6.2.2 Local Randomization Approach
   6.3 Other Design-Based Methods
   6.4 Further Readings

7 Empirical Example with Discrete Running Variable
   7.1 The Effect of Academic Probation on Future Academic Achievement
   7.2 Counting the Number of Mass Points in the RD Score
   7.3 Using the Continuity-Based Approach when the Number of Mass Points is Large
   7.4 Interpreting Continuity-Based RD Analysis with Mass Points
   7.5 Local Randomization RD Analysis with Discrete Score
   7.6 Further Readings

8 Final Remarks

Bibliography


Acknowledgments

This monograph collects and expands the instructional materials we prepared for more than 25 short courses and workshops on Regression Discontinuity (RD) methodology taught over the years 2014–2017. These teaching materials were used at the following institutions and programs: the Asian Development Bank, the Philippine Institute for Development Studies, the International Food Policy Research Institute, the ICPSR's Summer Program in Quantitative Methods of Social Research, the Abdul Latif Jameel Poverty Action Lab, the Inter-American Development Bank, the Georgetown Center for Econometric Practice, and the Universidad Católica del Uruguay's Winter School in Methodology and Data Analysis. The materials were also employed for teaching at the undergraduate and graduate level at Brigham Young University, Cornell University, Instituto Tecnológico Autónomo de México, Pennsylvania State University, Pontificia Universidad Católica de Chile, University of Michigan, and Universidad Torcuato Di Tella. We thank all these institutions and programs, as well as their many audiences, for the support, feedback and encouragement we received over the years.

The work collected in this monograph evolved and benefited from many insightful discussions with our present and former collaborators: Sebastián Calonico, Bob Erikson, Juan Carlos Escanciano, Max H. Farrell, Yingjie Feng, Brigham Frandsen, Sebastián Galiani, Michael Jansson, Luke Keele, Marko Klašnja, Xinwei Ma, Kenichi Nagasawa, Brendan Nyhan, Jas Sekhon, Gonzalo Vazquez-Bare, and José Zubizarreta. Their intellectual contribution to our research program on RD designs cannot be overstated, and certainly made this monograph much better than it would have otherwise been. We also thank Alberto Abadie, Josh Angrist, Ivan Canay, Richard Crump, David Drukker, Sebastian Galiani, Guido Imbens, Pat Kline, Justin McCrary, David McKenzie, Doug Miller, Aniceto Orbeta, Zhuan Pei, and Andres Santos for the many stimulating discussions and criticisms we received from them over the years, which also shaped the work presented here in important ways.

The monograph is purposely practical and hence focuses on empirical analysis of RD designs. We do not seek to provide a comprehensive literature review on RD designs nor discuss theoretical aspects in detail. We employ the data of Meyersson (2014) as the main running example throughout the manuscript, and we also use the data of Lindo et al. (2010) as a second empirical illustration. We thank these authors for making their data and codes publicly available. Accompanying the monograph, we provide complete replication codes in both R and Stata. Furthermore, we provide full replication codes for a third empirical illustration using the data of Cattaneo et al. (2015), though this example is not discussed in the text to conserve space and because it is already analyzed in our companion software articles. The general purpose, open-source software used in this monograph, as well as all replication files, can be found at https://sites.google.com/site/rdpackages.

Last but not least, we gratefully acknowledge funding from the National Science Foundation through grant SES-1357561.

1  Introduction

One important goal in the social sciences is to understand the causal effect of a treatment on some outcome of interest. As social scientists, we are interested in questions as varied as the effect of minimum wage increases on unemployment, the role of information dissemination on political participation, the impact of educational reforms on student achievement, and the effects of conditional cash transfers on children's health. The analysis of such effects is relatively straightforward when the treatment of interest is randomly assigned, as this ensures the comparability of units assigned to the treatment and control conditions. However, by their very nature, many interventions of interest to social scientists cannot be randomly assigned for either ethical or practical reasons, and often both. In this context, research designs that allow for the rigorous study of non-experimental interventions are particularly promising. One of them is the regression discontinuity (RD) design, which has emerged as one of the most credible non-experimental strategies for the analysis of causal effects. In the simplest RD design, units are assigned a score, and a treatment is given to those units whose value of the score exceeds a known cutoff and withheld from units whose value of the score is below the cutoff. The key feature of the design is that the probability of receiving the treatment changes abruptly at the known threshold. If units are not able to perfectly "sort" around this threshold, this discontinuous change in the treatment assignment probability can be used to infer the effect of the treatment on an outcome of interest, at least locally, because units with scores barely below the cutoff can be used as counterfactuals for units with scores barely above it.

The first step to employ the RD design in practice is to learn how to recognize it. There are three fundamental components in the RD design—a score, a cutoff, and a treatment. Without these three basic defining features, RD methodology cannot be employed. Therefore, the analysis of the RD design is not always implementable, unlike other non-experimental methods such as those based on regression adjustments or more sophisticated selection-on-observables approaches, which can always be used to describe the relationship between outcomes and treatments after adjusting for observed covariates. Instead, RD is a research design that must possess certain objective features: we can only study causal effects with a RD design when one occurs; the decision to use a RD design is not up to the researcher. The key defining feature of any RD design is that the probability of treatment assignment as a function of the score changes discontinuously at the cutoff—a condition that is directly verifiable. In addition, the RD design comes with an extensive array of falsification and related empirical approaches that can be used to offer empirical support for its validity, making it more plausible in any specific application. These features give the RD design an objective basis for implementation and testing that is usually lacking in other non-experimental empirical strategies, and endow it with superior credibility among observational studies.

The popularity of the RD design has grown markedly over the last decades, and it is now used frequently in Economics, Political Science, Education, Epidemiology, Criminology, and many other disciplines. This recent proliferation of RD applications has been accompanied by great disparities


in how RD analysis is implemented, interpreted, and evaluated. RD applications often differ significantly in how authors estimate the effects of interest, make statistical inferences, present their results, evaluate the plausibility of the underlying assumptions, and interpret the estimated effects. The lack of consensus about best practices for validation, estimation, inference, and interpretation of RD results makes it hard for scholars and policy-makers to judge the plausibility of the evidence and compare results from different RD studies. In this monograph, our goal is to provide an accessible and practical guide for the analysis and interpretation of RD designs that encourages the use of a common set of practices and facilitates the accumulation of RD-based empirical evidence.

In addition to the existence of a treatment assignment rule based on a score and a cutoff, the formal interpretation, estimation and inference of RD treatment effects require several other assumptions. First, we need to define the parameter of interest and provide assumptions under which such a parameter is identifiable—i.e., conditions under which it is uniquely estimable in some objective sense (finite sample or super population). Second, we must impose additional assumptions to ensure that the parameter can be estimated; these assumptions will naturally vary according to the estimation method employed and the parameter under consideration.

In this monograph, we discuss two frameworks for RD analysis that define different parameters of interest, rely on different identification assumptions, and employ different estimation and inference methods. These two alternative frameworks also generate different testable implications, which can be used to assess their validity in specific applications. The first framework we discuss is based on conditions that ensure the smoothness of the regression functions, and is the framework most commonly employed in practice. We call this the standard or continuity-based framework for RD analysis. The second framework we describe is based on conditions that ensure that the treatment can be interpreted as being randomly assigned for units near the cutoff. We call this second setup the local randomization framework for RD analysis. Both setups rely on the notion that units that receive very similar score values on opposite sides of the cutoff ought to be comparable to each other except for their treatment status. The main distinction between the two approaches is how the idea of comparability is formalized: in the continuity-based framework, comparability is conceptualized as continuity of average (or some other feature of) potential outcomes; in the local randomization framework, comparability is conceptualized as conditions that mimic an experimental setting in a neighborhood around the cutoff.

We present each approach separately, discussing in detail the required assumptions, the adequate interpretation of the target parameters, the graphical illustration of the design, the appropriate methods to estimate effects and conduct statistical inference, and the available strategies to evaluate the plausibility of the design. Our presentation of the topics is intentionally geared towards practitioners: our main goal is to clarify conceptual issues in the analysis of RD designs, and offer an accessible guide for applied researchers and policy-makers who wish to implement RD analyses. For this reason, we omit most technical discussions—but provide references for the technically inclined


reader along the way and at the end of each section. To ensure that our discussion is most useful to practitioners, we illustrate all methods with two previously published empirical applications, one of which we use as a running example throughout the sections. Our leading example is a study conducted by Meyersson (2014), who analyzed the effect of Islamic political representation in Turkey's municipal elections on the educational attainment of women. The score in this RD design is the margin of victory of the largest Islamic party in the municipality, a continuous random variable, which makes the example suitable to illustrate both the continuity-based and the local randomization methods. The second example we consider is the study by Lindo et al. (2010), who investigate the effects of placing students on academic probation on their future academic achievement. The score in this second example is the students' Grade Point Average (GPA); since there are many students with the same GPA value, this variable has mass points and is therefore a discrete—rather than continuous—random variable. RD designs with discrete running variables create some difficulties for analysis and interpretation, because the continuity-based methods cannot be applied directly. In the last section of this monograph we use this empirical study to illustrate how to approach the analysis of RD designs with discrete scores. We put special emphasis on discussing issues of neighborhood selection and causal interpretation of estimands.

Full replication codes in R and Stata are available at https://sites.google.com/site/rdpackages/replication. On that website, we also provide full replication codes for two other empirical applications, both following closely the discussion in this monograph. One employs the data on U.S. Senate incumbency advantage originally analyzed by Cattaneo et al. (2015), while the other uses the Head Start data originally analyzed by Ludwig and Miller (2007) and recently employed in Cattaneo et al. (2017d).

To conclude, we emphasize that this monograph is not meant to offer a comprehensive review of the literature on RD designs (though we do offer references to further readings after each topic is presented), but rather only a succinct practical guide for empirical analysis. For early review articles see Imbens and Lemieux (2008) and Lee and Lemieux (2010), and for an edited volume with a contemporaneous overview of the RD literature see Cattaneo and Escanciano (2017). We are currently working on a comprehensive literature review that complements this monograph (Cattaneo and Titiunik, 2017).

1.1  Software for RD Analysis

As already mentioned, we use two empirical applications to illustrate the different RD methods discussed in this monograph. All implementations of these methods are done using two leading statistical software environments in the social sciences: R and Stata. We lead our illustrations with R, but every time we illustrate a method we also present the equivalent Stata command. To be specific, each numerical illustration includes an R command with its output, and the analogous


Stata command that reproduces the same analysis—though we omit the Stata output to avoid repetition. All the RD methods we discuss and illustrate are implemented using various user-developed packages, which are free and available for both R and Stata. The local polynomial methods for continuity-based RD analysis are implemented in the package rdrobust, which is presented and illustrated in three companion software articles: Calonico et al. (2014a), Calonico et al. (2015b) and Calonico et al. (2017d). This package has three functions specifically designed for continuity-based RD analysis: rdbwselect for data-driven bandwidth selection methods, rdrobust for local polynomial point estimation and inference, and rdplot for graphical RD analysis. In addition, the package rddensity, discussed by Cattaneo et al. (2017b), provides manipulation tests of density discontinuity based on local polynomial density estimation methods.

The local randomization methods for RD analysis are implemented in the package rdlocrand, which is presented and illustrated by Cattaneo et al. (2016b). This package has four functions specifically designed for local randomization RD analysis: rdrandinf for randomization-based estimation and inference, rdwinselect for data-driven window selection methods based on predetermined covariates, and rdsensitivity and rdrbounds for different randomization-based sensitivity analyses. These packages are freely available at https://sites.google.com/site/rdpackages/, where related packages, replication files, and methodological materials can also be found.
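As a minimal sketch of how these commands fit together (this snippet is not part of the monograph's replication files), the R code below installs and loads the packages and calls each main function on generic outcome and score vectors y and x, which stand in for the user's own data; the cutoff is left at its default value of zero. The analogous Stata commands share the same names and very similar syntax.

    # Minimal sketch: install and load the RD packages, then call the main
    # functions on generic outcome (y) and score (x) vectors; y and x are
    # placeholders for the user's own data, and the cutoff defaults to zero.
    install.packages(c("rdrobust", "rddensity", "rdlocrand"))
    library(rdrobust)   # rdplot(), rdbwselect(), rdrobust()
    library(rddensity)  # rddensity()
    library(rdlocrand)  # rdrandinf(), rdwinselect(), rdsensitivity(), rdrbounds()

    rdplot(y, x)       # graphical RD analysis (RD plot)
    rdbwselect(y, x)   # data-driven bandwidth selection
    rdrobust(y, x)     # local polynomial point estimation and inference
    rddensity(x)       # manipulation test based on the density of the score
    rdrandinf(y, x)    # local randomization estimation and inference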

1.2  Running Empirical Example

Before concluding this section, we introduce the empirical example that we employ throughout the manuscript, originally analyzed by Meyersson (2014)—henceforth Meyersson. The example is based on a (sharp) RD design in Turkey that studies the impact of having a mayor from an Islamic party on the educational outcomes of women. This study is one of many RD applications based on close elections, as popularized by the original work of Lee (2008).

Meyersson is broadly interested in the effect of Islamic parties' control of local governments on women's rights, in particular on the educational attainment of young women. The methodological challenge is that municipalities where the support for Islamic parties is high enough to result in the election of an Islamic mayor may differ systematically from municipalities where the support for Islamic parties is more tenuous and results in the election of a secular mayor. (For brevity, we refer to a mayor who belongs to one of the Islamic parties as an "Islamic mayor", and to a mayor who belongs to a non-Islamic party as a "secular mayor".) If some of the characteristics on which they differ affect (or are correlated with) the educational outcomes of women, a simple comparison of municipalities with an Islamic versus a secular mayor will be misleading. For example, municipalities where an Islamic mayor wins in 1994 may be on average more religiously conservative than municipalities where a secular mayor is elected. If religious conservatism affects the educational outcomes of women, the naive comparison between municipalities controlled by an Islamic versus


a secular mayor will not successfully isolate the effect of the Islamic party's control of the local government—instead, the effect of interest will be contaminated by differences in the degree of religious conservatism between the two groups.

This challenge is illustrated in Figure 1.1, where we plot the share of young women who complete high school by 2000 against the Islamic margin of victory in the 1994 mayoral elections (more information on these variables is given below). These figures are examples of so-called RD plots, which we discuss in detail in Section 3. In Figure 1.1(a), we show the scatter plot of the raw data (i.e., each point is an observation), superimposing the overall sample mean for each group: treated observations where an Islamic mayor is elected are located to the right of zero, and control observations where a secular mayor is elected are located to the left of zero. The raw comparison reveals a negative average effect: municipalities with an Islamic mayor have on average a lower share of young women who complete high school. Figure 1.1(b) shows the scatter plot for the subset of municipalities where the Islamic margin of victory is within 50 percentage points, a range that includes 83% of the total observations; this second figure superimposes a fourth-order polynomial fit separately on either side of the cutoff. Figure 1.1(b) reveals that the negative effect in Figure 1.1(a) arises because there is a general negative relationship between Islamic vote share and female high school share for the majority of the observations. Thus, a naive comparison of treated and control municipalities will mask systematic differences and may lead to incorrect inferences about the effect of local Islamic political representation.

The RD design can be used in cases such as these to isolate a treatment effect of interest from all other systematic differences between treated and control groups. Under appropriate assumptions, a comparison of municipalities where the Islamic party barely wins the election and municipalities where it barely loses will reveal the causal (local) effect of Islamic party control of the local government on female educational attainment. If parties cannot systematically manipulate the vote share that they obtain, observations just above and just below the cutoff will tend to be comparable in terms of all characteristics with the exception of the party that won the 1994 election. Thus, right at the cutoff, the comparison is free of the complications introduced by systematic observed and unobserved differences between the groups. This strategy is illustrated in Figure 1.1(b), where we see that, despite the negative slope on either side, right near the cutoff the effect of an Islamic victory on the educational attainment of women is positive, in stark contrast to the negative difference-in-means in Figure 1.1(a).

Meyersson's original study employs a RD design to circumvent these methodological challenges and to estimate the causal effect of local Islamic rule. The design is focused exclusively on the 1994 Turkish mayoral elections. Specifically, the unit of analysis is the municipality, and the score is the Islamic margin of victory—defined as the difference between the vote share obtained by the largest Islamic party, and the vote share obtained by the largest secular party opponent. Two Islamic parties competed in the 1994 mayoral elections, Refah and Büyük Birlik Partisi (BBP).
[Figure 1.1: Municipalities with Islamic Mayor vs. Municipalities with Secular Mayor (Meyersson data). Panel (a): Raw Comparison of Means; Panel (b): Local Comparison of Means. Both panels plot the Female High School Share against the Islamic Margin of Victory.]

However, the results essentially capture the effect of a victory by Refah, as the BBP received only 0.94 percent of the national vote and won in only 11 out of the 329 municipalities where an Islamic mayor was elected. As defined, the Islamic margin of victory can be positive or negative, and the cutoff that determines an Islamic party victory is located at zero. Given this setup, the treatment group consists of municipalities that elect a mayor from an Islamic party in 1994, and the control group consists of municipalities that elect a mayor from a secular party.

The main outcome of interest is the school attainment for women who were (potentially) in high school during the period 1994–2000, measured with variables extracted from the 2000 census. The particular outcome we re-analyze is the share of the cohort of women ages 15 to 20 in 2000 who had completed high school by 2000. For brevity, we refer to this outcome variable interchangeably as female high school attainment share, female high school attainment, or high school attainment for women. In order to streamline the computer code for our analysis, we rename the variables in the following way:

• Y: high school attainment for women in 2000, measured as the share of women ages 15 to 20 in 2000 who had completed high school by 2000.

• X: vote margin obtained by the Islamic candidate for mayor in the 1994 Turkish elections, measured as the vote percentage obtained by the Islamic candidate minus the vote percentage obtained by its strongest opponent.

• T: electoral victory of the Islamic candidate in 1994, measured as 1 if the Islamic candidate won the election and 0 if the candidate lost.
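To make the assignment rule concrete, the following minimal sketch (with made-up margins of victory rather than the actual Meyersson data) shows that the treatment indicator T is fully determined by the score X and the zero cutoff.

    # Minimal sketch with made-up scores (not the actual data): the treatment
    # indicator is determined by the Islamic margin of victory and the zero cutoff.
    X <- c(-12.3, -0.4, 0.7, 25.1)   # Islamic margin of victory (score)
    T <- as.numeric(X >= 0)          # 1 = Islamic mayor elected, 0 = secular mayor
    data.frame(X, T)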


The Meyersson dataset also contains several predetermined covariates that we will later use to investigate the plausibility of the RD design and also to illustrate covariate-adjusted estimation methods. The covariates that we include in our analysis are the Islamic vote share in 1994 (vshr_islam1994), the number of parties receiving votes in 1994 (partycount), the logarithm of the population in 1994 (lpop1994), an indicator equal to one if the municipality elected an Islamic party in the previous election in 1989 (i89), a district center indicator (merkezi), a province center indicator (merkezp), a sub-metro center indicator (subbuyuk), and a metro center indicator (buyuk). Table 1.1 presents descriptive statistics for the three RD variables (Y, X, and T), and the municipality-level predetermined covariates.

Table 1.1: Descriptive Statistics for Meyersson

Variable                                          Mean      Median    Std. Dev.   Min.       Max.      Obs.
Y                                                 16.306    15.523    9.584       0.000      68.038    2629
X                                                 -28.141   -31.426   22.115      -100.000   99.051    2629
T                                                 0.120     0.000     0.325       0.000      1.000     2629
Share Men aged 15-20 with High School Education   0.192     0.187     0.077       0.000      0.683     2629
Islamic vote share 1994                           0.139     0.070     0.154       0.000      0.995     2629
Number of parties receiving votes 1994            5.541     5.000     2.192       1.000      14.000    2629
Log Population in 1994                            7.840     7.479     1.188       5.493      15.338    2629
Population share below 19 in 2000                 0.405     0.397     0.083       0.065      0.688     2629
Population share above 60 in 2000                 0.092     0.085     0.040       0.017      0.272     2629
Gender ratio in 2000                              1.073     1.032     0.253       0.750      10.336    2629
Household size in 2000                            5.835     5.274     2.360       2.823      33.634    2629
District center                                   0.345     0.000     0.475       0.000      1.000     2629
Province center                                   0.023     0.000     0.149       0.000      1.000     2629
Sub-metro center                                  0.022     0.000     0.146       0.000      1.000     2629

The outcome of interest (Y) has a minimum of 0 and a maximum of 68.04, with a mean of 16.31. Since this variable measures the share of women between ages 15 and 20 who had completed high school by 2000, these descriptive statistics mean that there is at least one municipality where no women in this age cohort had completed high school by 2000, and that on average 16.31% of women in this cohort had completed high school by the year 2000. The Islamic vote margin (X) ranges from −100 (the Islamic party receives zero votes) to 100 (the Islamic party receives 100% of the vote), and it has a mean of −28.14, implying that on average the Islamic party loses by 28.14 percentage points. The mean of the treatment variable (T) is 0.120, indicating that in 1994 an Islamic mayor was elected in only 12.0% of the municipalities. This small proportion of victories is consistent with the fact that the average Islamic margin of victory is negative, that is, with the Islamic party losing the typical municipal election.
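As an illustration of how such a table can be assembled, the sketch below computes the same summary statistics in R, assuming the Meyersson variables are stored in a data frame named data with columns Y, X, and T; the data frame name is a hypothetical placeholder.

    # Hypothetical sketch: summary statistics as in Table 1.1, assuming a data
    # frame `data` with columns Y, X, and T.
    stats <- function(v) c(Mean = mean(v), Median = median(v),
                           `Std. Dev.` = sd(v), Min. = min(v),
                           Max. = max(v), Obs. = length(v))
    round(t(sapply(data[c("Y", "X", "T")], stats)), 3)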

2  The RD Design: Definition and Taxonomy

In the RD design, all units in the study receive a score, also known as a running variable or index, and a treatment is assigned to those units whose score is above a known cutoff and not assigned to those units whose score is below the cutoff. In our running example based on Meyersson's study, the units are municipalities and the score is the margin of victory of the Islamic party in the 1994 Turkish mayoral elections. The treatment is the Islamic party's electoral victory, and the cutoff is zero: municipalities elect an Islamic mayor when the Islamic vote margin is above zero, and elect a secular mayor otherwise. In the second empirical example we present in Section 7, which is based on Lindo et al. (2010), some students are placed on academic probation if their GPA in a given semester falls below 1.6, and the authors are interested in the effects of probation on future academic performance. In this example, the score is the grade point average of each student, the cutoff is 1.6, and the treatment is being placed on academic probation.

These three components—score, cutoff, and treatment—define the RD design in general, and characterize its most important feature: in the RD design, unlike in other non-experimental studies, the assignment of the treatment follows a rule that is known (at least to the researcher) and hence empirically verifiable. To formalize, we assume that there are n units, indexed by i = 1, 2, . . . , n, each unit has a score or running variable Xi, and x̄ is a known cutoff. Units with Xi ≥ x̄ are assigned to the treatment condition, and units with Xi < x̄ are assigned to the control condition. This assignment, denoted Ti, is defined as Ti = 1(Xi ≥ x̄), where 1(·) is the indicator function.

This treatment assignment rule implies that if we know a unit's score, we know with certainty whether that unit was assigned to the treatment or the control condition. This is a key defining feature of any RD design: the probability of treatment assignment as a function of the score changes discontinuously at the cutoff. Being assigned to the treatment condition, however, is not the same as receiving the treatment. As in experimental and other non-experimental settings, this distinction is important in RD designs because non-compliance introduces complications and typically requires stronger assumptions to learn about treatment effects of interest. We introduce the binary variable Di to denote whether the treatment was actually received by unit i. In a RD design with perfect compliance, also known as the Sharp RD design, units comply perfectly with their assignment, so Di = Ti for all i. In contrast, in a RD design with imperfect compliance, also known as the Fuzzy RD design, we have Di ≠ Ti for some units.

In the remainder of this section, we discuss the basic features of the most common types of RD designs encountered in practice. In addition to the canonical Sharp RD design, which is the focus of this monograph, we discuss important extensions of this canonical RD setup. This includes Fuzzy and Kink RD designs, RD designs with multiple cutoffs, RD designs with multiple scores and geographic RD designs, just to mention a few. We give references to other related designs at the end of this section. We also include a discussion of the local nature of RD-based parameters,


and the resulting limitations to the external validity of any conclusions drawn from RD designs. This monograph focuses almost exclusively on the practical aspects of RD analysis in Sharp RD designs. However, as will become apparent throughout the manuscript, most methodological discussions can be applied or easily extended to many (if not all) of the other RD designs encountered in practice. We will hint at some of these extensions in the upcoming subsections, as we discuss in some detail the other types of RD designs.

2.1  The Sharp RD Design

The Sharp RD design is the canonical RD setup. In this design, all units whose score is above the cutoff are assigned to the treatment condition and actually receive the treatment, and all units whose score is below the cutoff are assigned to the control condition and do not receive the treatment. This stands in contrast to the Fuzzy RD design, where some of the units fail to receive the treatment despite having a score above the cutoff, and/or some units receive the treatment despite having been assigned to the control condition. Every Fuzzy RD design can be analyzed as a Sharp RD design if the "treatment" status is downgraded to "intention-to-treat" status, as is common in experimental settings with imperfect compliance. This fact is also applicable to the other RD settings discussed below.

The difference between the Sharp and Fuzzy RD designs is illustrated in Figure 2.1, where we plot the conditional probability of receiving treatment given the score, P(Di = 1|Xi = x), for different values of the running variable Xi. As shown in Figure 2.1(a), in a Sharp RD design the probability of receiving treatment changes exactly from zero to one at the cutoff. In contrast, in a Fuzzy RD design, the change in the probability of being treated at the cutoff is always less than one. Figure 2.1(b) illustrates a Fuzzy RD design where units with score below the cutoff comply perfectly with the treatment, but compliance with the treatment is imperfect for units with score above the cutoff. This case is sometimes called one-sided non-compliance and, of course, RD designs can (and often will) exhibit two-sided non-compliance, where P(Di = 1|Xi = x) will be neither zero nor one for units with running variable Xi near the cutoff x̄. In the remainder of this subsection, we focus on Sharp RD designs and thus assume that Di = Ti = 1(Xi ≥ x̄) for all units.

Following the causal inference literature, we adopt the potential outcomes framework and assume that each unit has two potential outcomes, Yi(1) and Yi(0), corresponding, respectively, to the outcomes that would be observed under treatment or control. In this framework, treatment effects are defined in terms of comparisons between features of (the distribution of) both potential outcomes, such as their means, variances or quantiles. Although every unit is assumed to have both Yi(1) and Yi(0), these outcomes are called potential because only one of them is observed. If unit i receives the treatment, we will observe Yi(1), the unit's outcome under treatment—and Yi(0) will remain latent or unobserved. Similarly, if i receives the control condition, we will observe Yi(0) but not Yi(1).

[Figure 2.1: Conditional Probability of Receiving Treatment in Sharp vs. Fuzzy RD Designs. Panel (a): Sharp RD; Panel (b): Fuzzy RD (One-Sided). Each panel plots the conditional probability of receiving treatment against the score, with the cutoff marked.]

This results in the fundamental problem of causal inference, and implies that the treatment effect at the individual level is fundamentally unknowable. The observed outcome is

\[
Y_i = (1 - D_i) \cdot Y_i(0) + D_i \cdot Y_i(1) =
\begin{cases}
Y_i(0) & \text{if } X_i < \bar{x} \\
Y_i(1) & \text{if } X_i \geq \bar{x}.
\end{cases}
\]

For now we adopt the usual econometric perspective that sees the data (Yi, Xi), i = 1, . . . , n, as a random sample from a larger population, taking the potential outcomes (Yi(1), Yi(0)) as random variables. We consider an alternative perspective in Section 5 when we discuss inference in the local randomization framework, employing ideas from the classical statistical literature on the analysis of experiments. In later sections we will also augment the basic models to account for pre-intervention covariates and other empirically relevant features, which we omit at this stage to ease exposition and fix ideas.

In the specific context of the Sharp RD design, the fundamental problem of causal inference occurs because we only observe the outcome under control, Yi(0), for units whose score is below the cutoff, and we only observe the outcome under treatment, Yi(1), for those units whose score is above the cutoff. We illustrate this problem in Figure 2.2, which plots the average potential outcomes given the score, E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x], against the score. In statistics, conditional expectation functions such as these are usually called regression functions. As shown in Figure 2.2, the regression function E[Yi(1)|Xi] is observed for values of the score to the right of the cutoff—because when Xi ≥ x̄, the observed outcome Yi is equal to the potential outcome under treatment, Yi(1), for every i. This is represented with the solid red line.

However, to the left of the cutoff, all units are untreated, and therefore E[Yi(1)|Xi] is not observed (represented by a dashed red line). A similar phenomenon occurs for E[Yi(0)|Xi], which is observed for values of the score to the left of the cutoff (solid blue line), Xi < x̄, but unobserved for Xi ≥ x̄ (dashed blue line). Thus, the observed average outcome given the score is

\[
E[Y_i \mid X_i] =
\begin{cases}
E[Y_i(0) \mid X_i] & \text{if } X_i < \bar{x}, \\
E[Y_i(1) \mid X_i] & \text{if } X_i \geq \bar{x}.
\end{cases}
\]

The Sharp RD design exhibits an extreme case of lack of common support, as units in the control group (Di = Ti = 1(Xi ≥ x̄) = 0) and in the treatment group (Di = Ti = 1(Xi ≥ x̄) = 1) cannot have the same value of the running variable Xi. This feature sets RD designs apart from other non-experimental settings, and highlights one of their fundamental characteristics: extrapolation is unavoidable. As we discuss throughout this monograph, a major practical task of empirical work employing RD designs boils down to performing extrapolation in order to compare control and treatment units. This unique feature of RD designs also makes the causal interpretation of some parameters potentially more difficult, though we do not discuss this issue further here as it does not change the practical aspects underlying the analysis of RD designs. See Cattaneo et al. (2017d) for more discussion on this point.

As seen in Figure 2.2, the average treatment effect at a given value of the score, E[Yi(1)|Xi = x] − E[Yi(0)|Xi = x], is the vertical distance between the two regression curves at that value. This distance cannot be directly estimated because we never observe both curves for the same value of x. However, a special situation occurs at the cutoff x̄: this is the only point at which we "almost" observe both curves. To see this, we imagine having units with score exactly equal to x̄, and units with score barely below x̄, that is, with score x̄ − ε for a small and positive ε. The former units would receive treatment, and the latter would receive control. Yet if the values of the average potential outcomes at x̄ are not abruptly different from their values at points near x̄, the units with Xi = x̄ and Xi = x̄ − ε would be very similar except for their treatment status, and we could approximately calculate the vertical distance at x̄ using observed outcomes.

This notion of comparability between units with very similar values of the score but on opposite sides of the cutoff is the fundamental concept on which all RD designs are based, and it was first formalized by Hahn et al. (2001). These authors showed that, among other conditions, if the regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x], seen as functions of x, are continuous at x = x̄, then in a Sharp RD design we have

\[
E[Y_i(1) - Y_i(0) \mid X_i = \bar{x}] \;=\; \lim_{x \downarrow \bar{x}} E[Y_i \mid X_i = x] \;-\; \lim_{x \uparrow \bar{x}} E[Y_i \mid X_i = x]. \tag{2.1}
\]

The result in Equation (2.1) says that, if the average potential outcomes are continuous functions of the score at x̄, the difference between the limits of the treated and control average observed outcomes as the score converges to the cutoff is equal to the average treatment effect at the cutoff.
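To build intuition for Equation (2.1), the sketch below estimates the two one-sided limits with simple linear fits on each side of the cutoff, using only observations within an ad hoc bandwidth; the vectors y and x and the bandwidth h are hypothetical placeholders, and Section 4 presents the principled local polynomial methods (implemented in rdrobust) that should be used in practice.

    # Naive illustration of Equation (2.1), assuming a cutoff at 0 and generic
    # vectors y (outcome) and x (score); h is an ad hoc bandwidth chosen only
    # for illustration. See Section 4 for proper, data-driven implementations.
    h <- 20
    fit.left  <- lm(y ~ x, subset = x <  0 & x >= -h)   # control side of the cutoff
    fit.right <- lm(y ~ x, subset = x >= 0 & x <=  h)   # treatment side of the cutoff

    # The two intercepts approximate the one-sided limits of E[Y|X=x] at x = 0;
    # their difference approximates the Sharp RD treatment effect tau_SRD.
    unname(coef(fit.right)[1] - coef(fit.left)[1])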

[Figure 2.2: RD Treatment Effect in Sharp RD Design. The figure plots the regression functions E[Y(1)|X] and E[Y(0)|X] against the score X; the vertical distance between them at the cutoff is the Sharp RD treatment effect τSRD.]

We call this effect the Sharp RD treatment effect, defined as the left-hand side of Equation (2.1): τSRD ≡ E[Yi(1) − Yi(0)|Xi = x̄]. In words, τSRD captures the (reduced form) treatment effect for units with score values Xi = x̄. This parameter answers the question: what would be the change in average response for control units with score level Xi = x̄ had they received treatment? This treatment effect is, by construction, local in nature and cannot inform us about treatment effects at other levels of the score without additional assumptions. We revisit this point further below when we discuss issues of extrapolation.

2.2  The Fuzzy RD Design

In the Fuzzy RD design, the treatment is assigned based on whether the score exceeds the cutoff x̄, but compliance with treatment is imperfect. As a consequence, the probability of receiving treatment changes at x̄, but not necessarily from 0 to 1. This occurs, for example, when units with score above the cutoff are eligible to participate in a program, but participation is optional. Using our previous notation to distinguish between the treatment assignment, Ti, and the treatment received, Di, in a Fuzzy RD design there are some units for which Ti ≠ Di. Because the treatment received is not always equal to the treatment assigned, the treatment take-up variable Di now has two potential values, Di(1) and Di(0), corresponding, respectively, to the treatment taken when the unit is assigned to the treatment condition and the treatment taken when the unit is assigned to the control condition. The observed treatment taken is Di = Ti · Di(1) + (1 − Ti) · Di(0) and, as occurred for the outcome Yi, the fundamental problem of causal inference now means that we do not observe, for example, whether a unit that was assigned to the treatment condition and took the treatment would have taken the treatment if it had been assigned to the control condition instead. Notice that our notation also imposes additional restrictions on the potential outcomes, sometimes called exclusion restrictions.

Under regularity conditions, the canonical parameter in the Fuzzy RD design, τFRD, is the "ratio" between the Sharp RD parameter τSRD, capturing the intention-to-treat effect, and the average effect of the treatment assignment on treatment take-up, both at the cutoff, that is,

\[
\tau_{\mathrm{FRD}} = \frac{E[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) \mid X_i = \bar{x}]}{E[D_i(1) - D_i(0) \mid X_i = \bar{x}]}.
\]

Under additional conditions, such as monotonicity or local independence, this parameter can be given a causal interpretation similar to the well-known Local Average Treatment Effect (LATE) estimand in experiments with imperfect compliance (and other instrumental variable settings): the Fuzzy RD parameter τFRD can be interpreted as a LATE at the cutoff for "compliers". See, for example, Imbens and Lemieux (2008) and Cattaneo et al. (2016a) for further discussion on the interpretation of LATE-type estimands in RD designs. Regardless of the interpretation attached to τFRD via additional identifying assumptions, this popular parameter is identifiable and estimable from data because

\[
\tau_{\mathrm{FRD}} = \frac{\lim_{x \downarrow \bar{x}} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[Y_i \mid X_i = x]}{\lim_{x \downarrow \bar{x}} E[D_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} E[D_i \mid X_i = x]}
\]

under continuity conditions on the regression functions. Thus, τFRD is the ratio of two Sharp RD effects: the effect of Ti on Yi (the outcome equation or intention-to-treat effect), and the effect of Ti on Di (the treatment equation or take-up effect). This implies that, from a practical perspective, analyzing Fuzzy RD designs is not more difficult than analyzing a ratio of Sharp RD designs, and hence most estimation, inference and testing procedures naturally extend from sharp to fuzzy


settings under standard assumptions. Of course, specific issues for fuzzy RD designs do arise, such as those related to small denominators (called "weak instruments" in the econometrics literature), validity of exclusion restrictions (called "invalid instruments" in the econometrics literature) or interpretation/extrapolation of estimands, just to mention a few. We offer some references to further reading in the context of RD designs connected with these issues at the end of this section. As mentioned before, because of pedagogical and space considerations we focus exclusively on the Sharp RD design. Nevertheless, as will become apparent throughout the presentation, most of the concepts and recommendations we discuss below are directly relevant to the fuzzy case, both because any Fuzzy RD design can be redefined as a Sharp RD design where the treatment of interest is the treatment assignment (intention-to-treat), and because, as just explained, the canonical Fuzzy RD parameter is nothing more than the ratio of two Sharp RD parameters. Practically, the only additional information needed is to specify the variable Di, which does not even need to be binary from an implementation perspective.
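In terms of implementation, the sketch below (with hypothetical vectors y for the outcome, x for the score with cutoff zero, and d for the treatment actually received) shows the one-line fuzzy RD call in rdrobust and, separately, the interpretation of the fuzzy estimand as a ratio of two sharp RD effects; the two calculations need not coincide exactly because the bandwidths are selected separately.

    # Minimal sketch, assuming y (outcome), x (score, cutoff 0) and d (treatment
    # actually received) are available; all names are hypothetical placeholders.
    library(rdrobust)

    rdrobust(y, x, fuzzy = d)           # fuzzy RD in one call

    # Interpretation as a ratio of two sharp RD effects: intention-to-treat
    # effect on the outcome divided by the take-up effect on the treatment.
    itt    <- rdrobust(y, x)$coef[1]    # effect of assignment on the outcome
    takeup <- rdrobust(d, x)$coef[1]    # effect of assignment on treatment received
    itt / takeup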

2.3  The Kink RD Designs

In recent years, researchers have been particularly interested in RD parameters defined in terms of derivatives of the regression functions at the cutoff, as opposed to the levels of the regression functions themselves. Generically, we call the Kink RD design the setting where the goal is to estimate the first derivative of the regression functions, in which case the canonical parameters are

\[
\tau_{\mathrm{SKRD}} = \left.\frac{d}{dx} E[Y_i(1) - Y_i(0) \mid X_i = x]\right|_{x=\bar{x}}
\qquad \text{and} \qquad
\tau_{\mathrm{FKRD}} = \frac{\left.\frac{d}{dx} E[(D_i(1) - D_i(0))(Y_i(1) - Y_i(0)) \mid X_i = x]\right|_{x=\bar{x}}}{\left.\frac{d}{dx} E[D_i(1) - D_i(0) \mid X_i = x]\right|_{x=\bar{x}}}.
\]

The Sharp Kink RD parameter τSKRD corresponds to the first derivative at the cutoff of the average treatment effect function defining τSRD, while the Fuzzy Kink RD parameter τFKRD corresponds to the ratio of the first derivatives at the cutoff of the numerator and denominator entering τFRD. These parameters are written in this form because they emerge generically as the probability limits of the plug-in RD estimators for the first derivatives at the cutoff of different estimable functions. To be more precise, the Kink RD design parameters τSKRD and τFKRD are generically identifiable and estimable from data because

\[
\tau_{\mathrm{SKRD}} = \lim_{x \downarrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x]
\]

and

\[
\tau_{\mathrm{FKRD}} = \frac{\lim_{x \downarrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[Y_i \mid X_i = x]}{\lim_{x \downarrow \bar{x}} \frac{d}{dx} E[D_i \mid X_i = x] - \lim_{x \uparrow \bar{x}} \frac{d}{dx} E[D_i \mid X_i = x]},
\]

under appropriate regularity conditions.

[Figure 2.3: Kink Effects in the RD Design. Panel (a): Kink in Function Levels, plotting E[Y(1)|X] and E[Y(0)|X] against the score; Panel (b): Jump in First Derivatives, plotting ∂E[Y(1)|X]/∂X and ∂E[Y(0)|X]/∂X against the score, with the jump at the cutoff equal to τSKRD.]

Beyond the basic (reduced form) interpretation of the kink parameters, these parameters have featured in other related contexts. For example, τSKRD is of direct interest in Cerulli et al. (2017) for the analysis of local sensitivity of RD treatment effects. Furthermore, the parameter τSKRD up to a known scale and the parameter τFKRD are both of interest in Card et al. (2015) where, under additional assumptions and in a different setting, they are estimated using RD methods. The practical discussion given in this monograph also excludes Kink RD designs for space and pedagogical reasons, but these estimands and estimators are readily available using the methods presented in the upcoming sections. As in the case of Fuzzy RD designs, the researcher only needs to specify additional information: the derivative of interest and, in the fuzzy case, the variable Di. Furthermore, the Regression Kink Designs of Card et al. (2015) are readily available because, from an implementation perspective, the variable Di can be taken to be continuous without loss of generality. See also Card et al. (2017) for further discussion.
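As an illustration of that point, the minimal sketch below computes the kink estimands with rdrobust by specifying the derivative of interest; the vectors y, x, and d are hypothetical placeholders, and a local quadratic fit is requested because the target is a first derivative.

    # Minimal sketch for Kink RD estimands, assuming y (outcome), x (score) and
    # d (treatment received, possibly continuous); names are hypothetical.
    library(rdrobust)

    rdrobust(y, x, deriv = 1, p = 2)             # sharp kink: jump in the first derivative
    rdrobust(y, x, deriv = 1, p = 2, fuzzy = d)  # fuzzy kink (regression kink design)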

2.4  The Multi-Cutoff RD Designs

Another generalization of the RD design that is commonly seen in practice occurs when the treatment is assigned using different cutoff values for different subgroups of units. In the standard RD design, all units face the same cutoff value x̄; as a consequence, the treatment assignment rule is Ti = 1(Xi ≥ x̄) for all units. In contrast, in the Multi-Cutoff RD design different units face different cutoff values.

An example occurs in RD designs where the running variable Xi is a political party's vote share in a district, and the treatment is winning that district's election. When there are only two parties contesting the election, the cutoff for the party of interest to win the election is always 50%, because the party's strongest opponent always obtains (100 − Xi)% of the vote. However, when


there are three or more parties contesting the race and the winner is the party with the highest vote share, the party can win the election barely in many different ways. For example, if there are three parties, the party of interest could barely win with 50% of the vote against two opponents who get, respectively, 49% and 1% of the vote; but it could also barely win with 39% of the vote against two opponents who got 38% and 23%. Indeed, in this context, there is an infinite number of ways in which one party can barely win the election—the party just needs to obtain a barely higher vote share than the vote share obtained by its strongest opponent, whatever value the latter takes. Another common example occurs when a federal program is administered by sub-national units, and each of the units chooses a different cutoff value to determine program eligibility. For example, in order to target households that were most in need in a given locality, the Mexican conditional cash transfer program Progresa determined program eligibility based on a household-level poverty index. In rural areas, the cutoff that determined program eligibility varied geographically, with seven distinct cutoffs used depending on the geographic location of each locality. This type of situation arises in many other contexts where the cutoff for eligibility varies among the units in the analysis. Cattaneo et al. (2016a) introduced an RD framework based on potential outcomes and continuity conditions to analyze Multi-Cutoff RD designs, and established a connection with the most common practice of normalizing-and-pooling the information for empirical implementation. Suppose that the cutoff is a random variable Ci, instead of a known constant, taking on J distinct values C = {c1, c2, . . . , cJ}. The continuous case is discussed below, though in practice it is often hard to implement RD designs with more than a few cutoff points due to data limitations. In a multi-cutoff RD setting, the treatment assignment is generalized to Ti = 1(Xi ≥ Ci), where Ci is a random variable with support C. Of course, the single-cutoff RD design is contained in this generalization when C = {x̄} and thus P[Ci = x̄] = 1, though more generally P[Ci = c] ∈ (0, 1) for each c ∈ C. In multi-cutoff RD settings, one approach commonly used in practice is to normalize the running variable so that all units face the same common cutoff value at zero, and then apply the standard RD design machinery to the normalized score and the common cutoff. To do this, researchers define the normalized score X̃i = Xi − Ci, and pool all observations using the same cutoff of zero for all observations in a standard RD design, with the normalized score used in place of the original score. In this normalizing-and-pooling approach, the treatment assignment indicator is therefore Ti = 1(Xi − Ci ≥ 0) = 1(X̃i ≥ 0) for all units. In the case of the single-cutoff RD design discussed so far, this normalization is achieved without loss of generality as the interpretation of the estimands remains unchanged; only the score (Xi ↦ X̃i) and cutoff (x̄ ↦ 0) change. More generally, the normalize-and-pool strategy employing the score variable X̃i, usually called the normalized (or centered) running variable, changes the parameters already discussed above in an intuitive way: they become weighted averages of RD treatment effects for each cutoff value

Figure 2.4: RD Design with Multiple Cutoffs

underlying the original score variable. For example, the Sharp RD treatment effect now is
\[
\bar{\tau}_{\mathrm{SRD}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid \tilde{X}_i = 0] = \sum_{c \in \mathcal{C}} \tau_{\mathrm{SRD}}(c)\,\omega(c),
\qquad
\omega(c) = \frac{f_{X|C}(c \mid c)\,\mathbb{P}[C_i = c]}{\sum_{c \in \mathcal{C}} f_{X|C}(c \mid c)\,\mathbb{P}[C_i = c]},
\]

with τ¯SRD denoting the normalized-and-pooled sharp RD treatment effect, τSRD (c) denoting the cutoff-specific sharp RD treatment effect, and fX|C (x|c) denoting the conditional density of Xi |Ci . See Cattaneo et al. (2016a) for more details on notation and interpretation, and for analogous results for Fuzzy and Kink RD designs. From a practical perspective, Multi-Cutoff RD designs can be analyzed as a single-cutoff RD design by either normalizing-and-pooling or by considering each cutoff separately. For example, the first approach maps the Multi-Cutoff RD design to a single sharp/fuzzy/kink RD Design, and thus


the discussion in this monograph and/or its extension to fuzzy and kink designs applies directly: under standard assumptions on the normalized score, we have an identification result analogous to that of the standard Sharp RD design, given by
\[
\bar{\tau}_{\mathrm{SRD}} = \lim_{x \downarrow 0} \mathbb{E}[Y_i \mid \tilde{X}_i = x] - \lim_{x \uparrow 0} \mathbb{E}[Y_i \mid \tilde{X}_i = x],
\]
which implies that estimation and inference for τ̄SRD can proceed in the same way as in the standard Sharp RD design with a single cutoff. Alternatively, by considering each subsample Ci = c with c ∈ C, the methods discussed in this monograph can be applied directly to each cutoff point, and then collected for further analysis and interpretation under additional regularity conditions. Either way, as mentioned before, we focus exclusively on the practical aspects of implementing estimation, inference and falsification for the single-cutoff Sharp RD design to conserve space and avoid side discussions.
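To fix ideas, the normalize-and-pool construction takes only a few lines of code. The following R sketch assumes vectors Y (outcome), X (score), and C (the cutoff faced by each unit) are in memory; these names are hypothetical, and the rdplot command used in the last line is introduced in the next section.

> Xtilde = X - C                   # normalized (centered) running variable
> T = as.numeric(Xtilde >= 0)      # pooled treatment assignment with common cutoff 0
> out = rdplot(Y, Xtilde, c = 0)   # single-cutoff RD tools applied to the normalized score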

2.5 The Multi-Score and Geographic RD Designs

Yet another generalization of canonical RD designs occurs when two or more running variables determine treatment assignment, which by construction induces a Multi-Cutoff RD design with infinitely many cutoffs. For example, a grant or scholarship may be given to students who score above a given cutoff in both a mathematics and a language exam. This leads to two running variables—the student’s score in the mathematics exam and her score in the language exam— and two (possibly different) cutoffs. Another popular example is related to geographic boundaries inducing discontinuous treatment assignments. This type of design has been studied in Papay et al. (2011), Reardon and Robinson (2012), and Wong et al. (2013) for generic multi-score RD settings, and in Keele and Titiunik (2015) for geographic RD settings. To allow for multiple running variables, we assume each unit’s score is a vector (instead of a scalar as before) denoted by Xi. When there are two running variables, the score for unit i is Xi = (X1i, X2i), and the treatment assignment is, for example, Ti = 1(X1i ≥ b1) · 1(X2i ≥ b2), where b1 and b2 denote the cutoff points along each of the two dimensions. For simplicity, we assume the potential outcome functions are Yi(1) and Yi(0), which implicitly imposes additional assumptions (e.g., no spill-overs in a geographic setting). See Cattaneo et al. (2016a) for more discussion on this type of restriction on potential outcomes. The parameter of interest changes, as discussed before in the context of Multi-Cutoff RD designs, because there is no longer a single cutoff at which the probability of treatment assignment changes discontinuously. Instead, there is a set of values at which the treatment changes discontinuously. To continue our education example, assume that the scholarship is given to all students who score above 80 in the language exam and above 60 in the mathematics exam, letting X1i denote the language score and X2i the math score, and b1 = 80 and b2 = 60 be the respective cutoffs. According to this hypothetical treatment assignment rule, a student with score xi = (80, 59.9) is assigned to the


control condition, since 1(80 ≥ 80) · 1(59.9 ≥ 60) = 1 · 0 = 0, and misses the treatment only barely—had she scored an additional 1/10 of a point in the mathematics exam, she would have received the scholarship. Without a doubt, this student is very close to the cutoff criteria for receiving the treatment. However, scoring very close to both cutoffs is not the only way for a student to be barely assigned to treatment or control. A student with a perfect 100 score in language would still be barely assigned to control if he scored 59.9 in the mathematics exam, and a student with a perfect math score would be barely assigned to control if she got 79.9 points in the language exam. Thus, with multiple running variables, there is no longer a single cutoff value at which the treatment status of units changes from control to treated. Instead, the discontinuity in the treatment assignment occurs along a boundary of points. This is illustrated graphically in Figure 2.5.

Figure 2.5: Example of RD Design With Multiple Scores: Treated and Control Areas (language score on the horizontal axis, mathematics score on the vertical axis; the boundary between the treated and control areas is formed by the two cutoff lines)

Consider once again for simplicity a sharp RD design (or an intention-to-treat situation). The parameter of interest in the Multi-Score RD design is therefore a generalization of the standard Sharp RD design parameter, where the average treatment effect is calculated at all (or, more empirically relevant, at some) points along the boundary between the treated and control areas, that is, at points where the treatment assignment changes discontinuously from zero to one:
\[
\mathbb{E}[Y_i(1) - Y_i(0) \mid \mathbf{X}_i = \mathbf{b}], \qquad \mathbf{b} \in \mathcal{B},
\]

where B denotes the boundary determining the control and treatment areas. For example, in the hypothetical education example in Figure 2.5, B = {(x1, x2) : (x1 = 80 and x2 ≥ 60) or (x1 ≥ 80 and x2 = 60)}. Although notationally more complicated, conceptually a Multi-Score RD design is very easy to analyze. For example, in the sharp example we are discussing, the identification result is completely analogous to the single-score case:
\[
\tau_{\mathrm{SRD}}(\mathbf{b}) = \lim_{\mathbf{x} \to \mathbf{b};\, \mathbf{x} \in \mathcal{B}_t} \mathbb{E}[Y_i \mid \mathbf{X}_i = \mathbf{x}] \;-\; \lim_{\mathbf{x} \to \mathbf{b};\, \mathbf{x} \in \mathcal{B}_c} \mathbb{E}[Y_i \mid \mathbf{X}_i = \mathbf{x}], \qquad \mathbf{b} \in \mathcal{B},
\]

where Bt and Bc denote the treatment and control areas, respectively. In other words, for each


cutoff point along the boundary, the treatment effect at that point is identifiable by the observed bivariate regression functions for each treatment group, just like in the single-score case. The only conceptually important distinction is that Multi-Score RD designs generate a family or curve of treatment effects τSRD(b), one for each boundary point b ∈ B. For example, two potentially distinct sharp RD treatment effects are τSRD(80, 70) and τSRD(90, 60). An important special case of the RD design with multiple running variables is the Geographic RD design, where the boundary B at which the treatment assignment changes discontinuously is a geographic boundary that separates a geographic treated area from a geographic control area. A typical Geographic RD design is one where the treated and control areas are adjacent administrative units such as counties, districts, municipalities, states, etc., with opposite treatment status. In this case, the boundary at which the treatment status changes discontinuously is the border that separates the adjacent administrative units. For example, some counties in Colorado have all-mail elections where voting can only be conducted by mail and in-person voting is not allowed, while other counties have traditional in-person voting. Where the two types of counties are adjacent, the administrative border between the counties induces a discontinuous treatment assignment between in-person and all-mail voting, and a Geographic RD design can be used to estimate the effect of adopting all-mail elections on voter turnout. This RD design can be formalized as an RD design with two running variables, where the score Xi = (X1i, X2i) contains two coordinates such as latitude and longitude that determine the exact geographic location of unit i. In practice, the score Xi = (X1i, X2i)—that is, the geographic location of each unit in the study—is obtained using Geographic Information Systems (GIS) software, which allows researchers to locate each unit on a map as well as to locate the entire treated and control areas, and all points on the boundary between them. For implementation, in both the geographic and non-geographic cases, there are two main approaches mirroring the discussion for the case of Multi-Cutoff RD designs. One approach is the equivalent of normalizing-and-pooling, while the other approach estimates many RD treatment effects along the boundary. For example, consider first the latter approach in a sharp RD context: the RD effect at a given boundary point b = (b1, b2) ∈ B may be obtained by calculating each unit’s distance to b, and using this one-dimensional distance as the unit’s running variable, giving negative values to control units and positive values to treated units. Letting the distance between a unit’s score Xi and a point x be di(x), we can re-write the above identification result as
\[
\tau_{\mathrm{SRD}}(\mathbf{b}) = \lim_{d \downarrow 0} \mathbb{E}[Y_i \mid d_i(\mathbf{b}) = d] \;-\; \lim_{d \uparrow 0} \mathbb{E}[Y_i \mid d_i(\mathbf{b}) = d], \qquad \mathbf{b} \in \mathcal{B}.
\]

The choice of distance metric di(·) depends on the particular application. A typical choice is the Euclidean distance, $d_i(\mathbf{b}) = \sqrt{(X_{1i} - b_1)^2 + (X_{2i} - b_2)^2}$. In practice, this approach is implemented for a finite collection of evaluation points along the boundary, and all the methods and discussion presented in this monograph can be applied to this case directly, one cutoff at a time. The normalizing-and-pooling approach is also straightforward in the case of Multi-Score RD designs, as


the approach simply pools together all the units close to the boundary and conducts inference as in a single-cutoff RD design. As in the previous cases, we do not elaborate on practical issues for this specific setting to conserve space and because all the main methodological recommendations, codes and discussions apply directly. However, to conclude our discussion, we do highlight an important connection between RD designs with multiple running variables and RD designs with multiple cutoffs. In the Multi-Cutoff RD design, our discussion was based on a discrete set of cutoff points, which would also be the natural setting in Multi-Score RD design applications. In that case, we can map each cutoff point on the boundary to one of the cutoff points in C, and each observation can be assigned a running variable relative to each cutoff point via the distance function. With these two simple modifications, any Multi-Score RD design can be analyzed as a Multi-Cutoff RD design over finitely many cutoff points on the boundary. In particular, this implies that all the conclusions and references given in the previous section apply to this case as well. See the supplemental appendix of Cattaneo et al. (2016a) for more discussion on this idea and further generalizations.
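For concreteness, a minimal R sketch of the distance-based construction at one boundary point is given below. The variable names Y, X1, X2, and Treated, and the evaluation point b, are hypothetical; the treatment indicator is used only to sign the distance, and the rdplot command is introduced in the next section. Repeating the construction for each evaluation point along the boundary yields the Multi-Cutoff analysis described above.

> b = c(80, 60)                                  # one evaluation point on the boundary
> dist = sqrt((X1 - b[1])^2 + (X2 - b[2])^2)     # Euclidean distance from each unit to b
> D = ifelse(Treated == 1, dist, -dist)          # signed distance: positive for treated units
> out = rdplot(Y, D, c = 0)                      # standard single-cutoff analysis at b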

2.6 The Local Nature of RD Effects

All the RD parameters discussed in previous sections can be interpreted as causal in the sense that they capture differences in some feature of the potential outcome under treatment, Yi(1), and the potential outcome under control, Yi(0). However, in contrast to other causal parameters in the potential outcomes framework, these average RD differences are calculated at a single point on the support of a continuous random variable (Xi) and as a result are very local causal effects in nature. From some perspectives, the parameters cannot even be interpreted as causal as they cannot be reproduced via manipulation (i.e., experimentation). Regardless of their status as causal parameters, RD treatment effects tend to have little external validity; that is, they need not be representative of the treatment effects that would occur for units with scores farther away from the cutoff. For example, in the case of the canonical Sharp RD design, the RD effect can be interpreted graphically as the vertical difference between E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] at the point where the score equals the cutoff, x = x̄. In the general case where the average treatment effect varies as a function of the score Xi, as is common in applications, the RD effect may not be informative of the average effect of treatment at values of x different from x̄. For this reason, in the absence of specific (usually restrictive) assumptions about the global shape of the regression functions, the effect recovered by the RD design is only the average effect of treatment for units local to the cutoff, that is, for units with score values Xi = x̄. How much can be learned from such local treatment effects will depend on each particular application. For example, in the scenario illustrated in Figure 2.6(a), the vertical distance between E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] at x = x̄ is considerably higher than at other points, such as x = −100 and x = 100, but the effect is positive everywhere. A more heterogeneous scenario is shown in Figure 2.6(b), where the effect is zero at the cutoff but ranges from positive to negative


at other points. Since in real examples the counterfactual (dotted) regression functions are not observed, it is not possible to know with certainty the degree of external validity of any given RD application. Increasing the external validity of the RD estimates is a topic of very active research and, regardless of the approach taken, will necessarily require more assumptions. For example, extrapolation of RD treatment effects can be done by (i) imposing additional assumptions about the regression functions near the cutoff (Wing and Cook, 2013; Dong and Lewbel, 2015), (ii) imposing local independence assumptions (Angrist and Rokkanen, 2015), or (iii) exploiting specific features of the design such as imperfect compliance (Bertanha and Imbens, 2017) or the presence of multiple cutoffs (Cattaneo et al., 2017c). In this regard, RD designs are not different from experiments as they both require additional assumptions to map internally valid estimates into externally valid ones.

Figure 2.6: Local Nature of RD Effect. (a) Mild Heterogeneity; (b) Severe Heterogeneity.

2.7 Further Readings

For an introduction to causal inference based on potential outcomes see, for example, Imbens and Rubin (2015) and references therein. The RD design was originally proposed by Thistlethwaite and Campbell (1960), and historical as well as early review articles are given by Cook (2008), Imbens and Lemieux (2008) and Lee and Lemieux (2010). Lee (2008) provided an influential contribution to the identification of RD effects and was the first to apply the RD design to elections. The edited volume by Cattaneo and Escanciano (2017) provides a more recent overview of the RD literature and includes several methodological and practical contributions. Most of the literature focuses on average treatment effects, but quantile and distributional RD treatment effects have


also been considered; for example, see Shen and Zhang (2016), Chiang and Sasaki (2017) and references therein. Finally, a recent and related literature on bunching and density discontinuities is summarized by Kleven (2016) and Jales and Yu (2017).

3 Graphical illustration of RD treatment effects

An appealing feature of any RD design is that it can be illustrated graphically. This graphical representation, in combination with the formal approaches to estimation, inference and falsification discussed below, adds transparency to the analysis by plotting all (or a subset of) the observations used for estimation and inference. RD plots also allow researchers to readily summarize the main empirical findings as well as other important features of the work conducted. We now discuss the most transparent and effective methods to plot the RD design and present effects (and later conduct empirical falsification) in RD designs. At first glance, it seems that one should be able to illustrate the relationship between the outcome and the running variable by simply constructing a scatter plot of the observed outcome against the score, clearly identifying the points above and below the cutoff. However, this strategy is rarely useful, as it is often hard to see “jumps” or discontinuities in the outcome-score relationship by simply looking at the raw data. We illustrate this point with the Meyersson application, plotting female high school attainment against the Islamic vote margin using the raw observations. We create this scatter plot in R with the plot command.

> plot(X, Y, xlab = "Running Variable", ylab = "Outcome", col = 1,
+      pch = 20)
> abline(v = 0)

Analogous Stata command
. twoway (scatter Y X, ///
>     mcolor(black) xline(0, lcolor(black))), ///
>     graphregion(color(white)) ytitle(Outcome) ///
>     xtitle(Running Variable)

Every point in Figure 3.1 corresponds to one raw municipality-level observation in the dataset—so there are 2,629 points in the scatter plot (see Table 1.1). Although this plot is helpful to visualize the raw observations, detect outliers, etc., its effectiveness for visualizing the RD design is limited. In the Meyersson application there is empirical evidence that the Islamic party’s victory translates into a small increase in women’s educational attainment. Despite this evidence of a positive RD treatment effect, a jump in the values of the outcome at the cutoff cannot be seen by simply looking at the raw cloud of points around the cutoff in Figure 3.1. In general, raw scatter plots do not allow for easy visualization of the RD effect even when the effect is large. A more useful approach is to aggregate or “smooth” the data before plotting. The typical RD plot presents two summaries: (i) a global polynomial fit, represented by a solid line, and (ii) local sample means, represented by dots. The global polynomial fit is simply a smooth approximation to the unknown regression functions based on a fourth- or fifth-order polynomial regression of the outcome on the score, fitted separately above and below the cutoff, and using the original raw data. In contrast, the local sample means are created by first choosing disjoint (i.e., non-overlapping)


Figure 3.1: Scatter Plot—Meyersson Data

intervals or “bins” of the score, calculating the mean of the outcome for all observations falling within each bin, and then plotting the average outcome in each bin against the midpoint of the bin, which can be interpreted as a non-smooth approximation of the unknown regression functions. The combination of these two ingredients in the same plot allows researchers to visualize the global or overall shape of the regression functions for treated and control observations, while at the same time retaining enough information about the local behavior of the data to observe the RD treatment effect and the variability of the data around the global fit. Importantly, in the standard RD plot, the global polynomial is calculated using the original observations, not the binned observations. For example, using the Meyersson data, if we use 20 bins of equal length on each side of the cutoff, we partition the support of the Islamic margin of victory into 40 disjoint intervals of length 5—recall that a party’s margin of victory ranges theoretically from −100 to 100, and in practice the Islamic margin of victory ranges from −100 to 99.051. Table 3.1 shows the bins and the corresponding average outcomes in this case, where we denote the bins by B−,1, B−,2, . . . , B−,20 (control group) and B+,1, B+,2, . . . , B+,20 (treatment group); that is, using the subscripts − and + to indicate, respectively, bins located to the left and right of the cutoff. In that table, each local sample average is computed as

\[
\bar{Y}_{-,j} = \frac{1}{\#\{X_i \in B_{-,j}\}} \sum_{i:\,X_i \in B_{-,j}} Y_i
\qquad \text{and} \qquad
\bar{Y}_{+,j} = \frac{1}{\#\{X_i \in B_{+,j}\}} \sum_{i:\,X_i \in B_{+,j}} Y_i,
\]
where j = 1, 2, . . . , 20 in this numerical example.


Table 3.1: Partition of Islamic Margin of Victory into 40 Bins of Equal Length—Meyersson Data

Bin                      Average Outcome in Bin    Number of Observations   Group Assignment
B−,20 = [−100, −95)      Ȳ−,20 = 4.6366                        4            Control
B−,19 = [−95, −90)       Ȳ−,19 = 10.8942                       2            Control
...                      ...                                 ...            ...
B−,3  = [−15, −10)       Ȳ−,3  = 17.0525                     162            Control
B−,2  = [−10, −5)        Ȳ−,2  = 12.9518                     149            Control
B−,1  = [−5, 0)          Ȳ−,1  = 13.8267                     148            Control
B+,1  = [0, 5)           Ȳ+,1  = 15.3678                     109            Treatment
B+,2  = [5, 10)          Ȳ+,2  = 13.9640                      83            Treatment
B+,3  = [10, 15)         Ȳ+,3  = 14.5288                      56            Treatment
...                      ...                                 ...            ...
B+,19 = [90, 95)         Ȳ+,19 = NA                            0            Treatment
B+,20 = [95, 100]        Ȳ+,20 = 10.0629                       1            Treatment
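As a rough illustration of how the entries in Table 3.1 could be computed by hand in R, the following sketch assumes the outcome Y and score X from the Meyersson data are in memory (as in the plot command above); the rdplot command used next automates and refines this construction:

> breaks = seq(-100, 100, by = 5)                  # 40 evenly-spaced bins of length 5
> bin = cut(X, breaks = breaks, right = FALSE, include.lowest = TRUE)
> binmean = tapply(Y, bin, mean)                   # average outcome within each bin
> binobs = table(bin)                              # number of observations per bin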

In order to create an RD plot corresponding to the binned outcome means in Table 3.1, with the addition of a fourth-order global polynomial fit estimated separately for treated and control observations, we use the rdplot command:

> out = rdplot(Y, X, nbins = c(20, 20), binselect = "esmv")
> print(out)

Analogous Stata command
. rdplot Y X, nbins(20 20) binselect(esmv) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))


Figure 3.2: RD Plot for Meyersson Data Using 40 Bins of Equal Length

In Figure 3.2, the global fit reveals that the observed regression function seems to be non-linear—particularly on the control side. At the same time, the binned local means let us see the variability around the global fit. The plot also reveals a positive jump at the cutoff: the average share of female high school attainment seems to be higher in those municipalities where the Islamic party obtained a barely positive margin of victory than in those municipalities where the Islamic party lost narrowly. The type of information conveyed by Figures 3.1 and 3.2 is very different. In order to facilitate their comparison, we reproduce them side by side in Figure 3.3. In the raw scatter plot (Figure 3.3(a)), it is difficult to see any systematic pattern, and there is no visible discontinuity in the average outcome at the cutoff. In contrast, when we use 20 bins on each side of the cutoff to bin the data and include the global polynomial fit (Figure 3.3(b)), the plot now allows us to see a discontinuity at the cutoff and to better understand the shape of the underlying regression function over the whole support of the running variable. The differences between the plots clearly show that binning the data may reveal striking patterns that can remain hidden in a simple scatter plot. Since binning leads to such drastic differences, a natural question is how many bins should be chosen, and what kinds of properties are desirable in the chosen bins. In Figure 3.2, we chose 20 bins of equal length on either side of the cutoff—but we could have chosen 10 or 40, a decision that could have affected the conclusions drawn from the plot. Choosing the number and type of bins in an ad-hoc manner compromises the transparency and replicability of the RD plots, and leaves researchers uncertain about the underlying properties of this smoothing strategy. As we now discuss, a more desirable approach is to choose the type and

Figure 3.3: RD Plots—Meyersson Data. (a) Raw Data; (b) Data Binned in 40 Bins of Equal Length.

number of bins in a data-driven, transparent, and optimal way.

3.1 Two Types of Bins: Evenly-Spaced versus Quantile-Spaced Bins

We first discuss two different types of bins that can be used in the construction of RD plots. When we partition the running variable into bins, we may employ bins of equal length as in Table 3.1 or, alternatively, we may employ bins that contain (roughly) the same number of observations but whose length may differ. We refer to these two bin types as evenly-spaced bins and quantile-spaced bins, respectively. In order to define the bins more precisely, we assume that the running variable takes values inside the interval [xl, xu]. In other words, xl is the lowest value and xu the highest value that the score may take. In the Meyersson application, xl = −100 and xu = 100. We continue to use the subscripts + and − to denote treated and control observations, respectively. The bins are constructed separately for treated and control observations; thus, the control bins partition [xl, x̄) into non-overlapping intervals, and the treated bins partition [x̄, xu] into non-overlapping intervals—recall that x̄ is the RD cutoff. We use J− to denote the total number of bins chosen to the left of the cutoff, and J+ to denote the total number of bins chosen to the right of the cutoff. Using this


notation, we can generally define the bins as follows:
\[
\text{Control Bins: } B_{-,j} =
\begin{cases}
[x_l,\, b_{-,1}) & j = 1 \\
[b_{-,j-1},\, b_{-,j}) & j = 2, \cdots, J_- - 1 \\
[b_{-,J_- - 1},\, \bar{x}) & j = J_-
\end{cases}
\qquad
\text{Treated Bins: } B_{+,j} =
\begin{cases}
[\bar{x},\, b_{+,1}) & j = 1 \\
[b_{+,j-1},\, b_{+,j}) & j = 2, \cdots, J_+ - 1 \\
[b_{+,J_+ - 1},\, x_u] & j = J_+,
\end{cases}
\]

with b−,0 < b−,1 < · · · < b−,J− and b+,0 < b+,1 < · · · < b+,J+. In other words, the union of the control and treated bins, B−,1 ∪ B−,2 ∪ · · · ∪ B−,J− ∪ B+,1 ∪ B+,2 ∪ · · · ∪ B+,J+, forms a disjoint partition of the support of the running variable, [xl, xu], centered at the cutoff x̄. Letting X−,(i) and X+,(i) denote the i-th quantile of the control and treatment subsamples, respectively, and ⌊·⌋ denote the floor function, we can now formally define evenly-spaced and quantile-spaced bins.

• Evenly-spaced (ES) bins: non-overlapping intervals that partition the entire support of the running variable, all of the same length within each treatment assignment status:
\[
b_{-,j} = x_l + \frac{j}{J_-}\,(\bar{x} - x_l)
\qquad \text{and} \qquad
b_{+,j} = \bar{x} + \frac{j}{J_+}\,(x_u - \bar{x}).
\]
Note that all ES bins on the control side have length (x̄ − xl)/J− and all bins on the treated side have length (xu − x̄)/J+. If we choose the same number of bins on both sides (J+ = J−), treated and control bins will have the same length if the cutoff is the midpoint of the support, x̄ = (xl + xu)/2.

• Quantile-spaced (QS) bins: non-overlapping intervals that partition the entire support of the running variable, all containing (roughly) the same number of observations within each treatment assignment status:
\[
b_{-,j} = X_{-,(\lfloor j/J_- \rfloor)}
\qquad \text{and} \qquad
b_{+,j} = X_{+,(\lfloor j/J_+ \rfloor)}.
\]
Note that the length of QS bins may differ even within treatment assignment status; the bins will be larger in regions of the support where there are fewer observations. QS bins will be evenly-spaced, for example, in the (unusual) case that the running variable has unique values uniformly spaced out over [xl, xu].

In practical terms, the most important difference between ES and QS bins is the underlying variability of the local mean estimate in every bin. Although ES bins have equal length, if the observations are not uniformly distributed in the support of the running variable [xl, xu], each


bin may contain a different number of observations. This means that in an RD plot with evenly-spaced bins, each of the local means represented by a dot may be computed using a different number of observations and thus may be more or less variable than the other local means in the plot. For example, if there are many more observations near the cutoff than far away from it, the local mean estimates in the farthest bins will be much more variable than the local mean estimates near the cutoff. Thus, when the data is not approximately uniformly distributed in [xl, xu], the dots representing the local means in an evenly-spaced RD plot may not be directly comparable. For example, Table 3.1 shows that there are only 4 observations in [−100, −95), and only 2 observations in [−95, −90); thus, the variance of these local mean estimates is very high because they are constructed with very few observations. In contrast, QS bins contain approximately the same number of observations by construction. Thus, all dots in a quantile-spaced RD plot will be directly comparable in terms of variability. Moreover, a quantile-spaced RD plot has the advantage of providing a quick visual representation of the density of observations over the support of the running variable. For example, if there are very few observations far from the cutoff, an RD plot with quantile-spaced bins will tend to be “empty” near the extremes of [xl, xu], and will quickly convey the message that there are no observations with values of the score near xl or xu. To illustrate the differences between binning strategies, we again use the rdplot command but this time specifying the desired type of bin via the binselect option. We reproduce the full output of rdplot, which includes several descriptive statistics in addition to the actual plot. First, we reproduce the RD plot in Figure 3.2 above, using 20 evenly-spaced bins on each side, including the full output, which we now explain in detail:

> out = rdplot(Y, X, nbins = c(20, 20), binselect = "es")
> print(out)

Call: rdplot(y = Y, x = X, nbins = c(20, 20), binselect = "es")

Method:                        Left      Right
Number of Obs.                 2314        315
Polynomial Order                  4          4
Scale                             2          3
Selected Bins                    20         20
Average Bin Length           5.0000     4.9526
Median Bin Length            5.0000     4.9526
IMSE-optimal bins                11          7
Mimicking Variance bins          40         75

Relative to IMSE-optimal:
Implied scale                1.8182     2.8571
WIMSE variance weight        0.1426     0.0411
WIMSE bias weight            0.8574     0.9589

Analogous Stata command
. rdplot Y X, nbins(20 20) binselect(es) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))

Figure 3.4: 40 Evenly-Spaced Bins

The total number of observations is shown in the top right of the panel, where we can also see the type of weights used to plot the observations. We have 2,629 observations in total, which by default are all given equal or uniform weight, as indicated by Kernel = Uniform. The rest of the output is divided into two columns, one corresponding to observations assigned to control and located to the left of the cutoff (indicated by c in the output), and another corresponding to observations assigned to treatment and located to the right of the cutoff. The top output panel shows that there are 2,314 observations to the left of the cutoff, and 315 to the right of the cutoff, consistent with our descriptive analysis indicating that the Islamic party loses the majority of the electoral races. The third row in the top panel indicates that the global polynomial fit used in the RD plot is of order 4 on both sides of the cutoff. The fourth row indicates the window or bandwidth h where the global polynomial fit was conducted; the global fit uses all observations in [c − h, c) on the control side, and all observations in [c, c + h] on the treated side. By default, all observations to the left of the cutoff are included in the left fit, and all observations to the right of the cutoff are included in the right fit. In this case, because the range of the Islamic margin of victory is [−100, 99.051], the bandwidth on the right is slightly smaller than 100. This occurs because in our data there are no observations where the Islamic party wins an uncontested election. Finally, the last row on the top panel shows the scale selected, which is an optional factor by which the chosen number of bins can be multiplied to either increase or decrease the original


choice; by default, this factor is one and no scaling is performed. The lower output panel shows results on the number and type of bin selected. The top two rows show that we have selected 20 bins to the left of the cutoff, and 20 bins to the right of the cutoff. On the control side, the length of each bin is exactly 5 = (x̄ − xl)/J− = (0 − (−100))/20 = 100/20. However, the actual length of the ES bins to the right of the cutoff is slightly smaller than 5, as the edge of the support on the treated side is 99.051 instead of 100. The actual length of the bins to the right of the cutoff is (xu − x̄)/J+ = (99.051 − 0)/20 = 99.051/20 = 4.9526. We postpone discussion of the five bottom rows

until the next subsection, where we discuss optimal bin number selection. We now compare this plot to an RD plot that also uses 20 bins on each side, but uses quantile-spaced bins instead of evenly-spaced bins by setting the option binselect = "qs".

> out = rdplot(Y, X, nbins = c(20, 20), binselect = "qs")
> print(out)

Call: rdplot(y = Y, x = X, nbins = c(20, 20), binselect = "qs")

Method:                        Left      Right
Number of Obs.                 2314        315
Polynomial Order                  4          4
Scale                             1          1
Selected Bins                    20         20
Average Bin Length           4.9951     4.9575
Median Bin Length            2.9498     1.0106
IMSE-optimal bins                21         14
Mimicking Variance bins          44         41

Relative to IMSE-optimal:
Implied scale                0.9524     1.4286
WIMSE variance weight        0.5365     0.2554
WIMSE bias weight            0.4635     0.7446

Analogous Stata command
. rdplot Y X, nbins(20 20) binselect(qs) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))

For easy comparison, Figure 3.6 reproduces side-by-side the RD plots in Figures 3.4 and 3.5: Figure 3.6(a) reproduces the evenly-spaced RD plot, while Figure 3.6(b) reproduces the quantile-spaced RD plot. Both plots use the Meyersson data and have 20 bins on each side of the cutoff; the only difference between them is the type of bin—ES vs. QS—that is used in each case. One of the main differences between these two plots is that they show where the data is located. In the evenly-spaced RD plot in Figure 3.6(a), there are five bins in the interval [−100, −75] of the


Figure 3.5: 40 Quantile-Spaced Bins

running variable. In contrast, in the quantile-spaced RD plot in Figure 3.6(b), this interval is entirely contained in the first bin. The vast difference in the length of QS and ES bins occurs because, as shown in Table 3.1, there are very few observations near −100, which leads to local mean estimates with high variance. This problem is avoided if we choose quantile-spaced bins, since the procedure ensures that each bin has the same number of observations.
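To see concretely how the two binning schemes differ, the break points for the control side could be computed directly in R as in the following sketch (assuming the score X is in memory, with cutoff 0 and 20 bins per side; quantile() delivers the quantile-spaced break points):

> Xc = X[X < 0]                                                   # control-side observations
> es_breaks = seq(min(Xc), 0, length.out = 21)                    # 20 evenly-spaced bins
> qs_breaks = quantile(Xc, probs = seq(0, 1, length.out = 21))    # 20 quantile-spaced bins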

3.2 Choosing the Number of Bins Optimally

Once a decision to use either quantile-spaced or evenly-spaced bins has been made, which determines the position of the bins, the only remaining choice for practical implementation of RD plots is the total number of bins on either side of the cutoff: the quantities J− and J+. Thus, given a choice of QS or ES bins, a data-driven and automatic RD plot can be produced as long as one has a data-driven and automatic way for selecting J− and J+. Below we discuss two such methods to choose the number of bins. Both methods set up the problem of choosing J− and J+ in an automatic, data-driven way, where the chosen values of J− and J+ are those that either optimize or satisfy a particular criterion. To be more specific, the procedure involves constructing asymptotic expansions of the (integrated) variance and squared bias of the local means under ES or QS bins, and then choosing the values of J− and J+ that either minimize or satisfy particular restrictions on functions of these expansions.


Figure 3.6: RD Plots—Meyersson Data. (a) 40 Evenly-Spaced Bins; (b) 40 Quantile-Spaced Bins.

3.2.1 Binning to Trace Out the Underlying Regression Functions

The first method we consider selects the values of J− and J+ that minimize an asymptotic approximation to the integrated mean-squared error (IMSE) of the local means estimator, that is, the sum of the expansions of the (integrated) variance and squared bias. The resulting choice of bins is therefore IMSE-optimal, implying that the chosen values of J− and J+ optimally balance or “trade off” squared-bias and variance of the local sample means when viewed as a local estimator of the underlying unknown regression functions. If we choose a large number of bins, we have a small bias because the bins are smaller and the local constant fit is better; but this reduction in bias comes at a cost, as increasing the number of bins leads to fewer observations per bin and therefore more variability within bin. The IMSE-optimal J− and J+ are the numbers of bins that balance squared-bias and variance so that the IMSE is (approximately) minimized. By construction, choosing an IMSE-optimal number of bins will result in binned sample means that “trace out” the underlying regression function. For this reason, this method is most useful to assess the overall shape of the regression functions, perhaps to identify potential discontinuities in these functions that occur far from the cutoff. In general, however, an IMSE-optimal number of bins will tend to produce a very smooth plot where the local means nearly overlap with the global polynomial fit. For this reason, this method of choosing bins is not always the most appropriate to capture the local variability of the data near the cutoff.
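As a quick illustration of this trade-off in the Meyersson data, one could compare RD plots with deliberately few and deliberately many evenly-spaced bins; this is only a sketch, and the bin counts below are arbitrary:

> out = rdplot(Y, X, nbins = c(5, 5), binselect = "es")       # few bins: smoother, but coarse local means
> out = rdplot(Y, X, nbins = c(100, 100), binselect = "es")   # many bins: lower bias, but noisier local means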


The IMSE-optimal values of J− and J+ are, respectively,
\[
J_{-}^{\mathrm{IMSE}} = \left\lceil C_{-}^{\mathrm{IMSE}}\, n^{1/3} \right\rceil
\qquad \text{and} \qquad
J_{+}^{\mathrm{IMSE}} = \left\lceil C_{+}^{\mathrm{IMSE}}\, n^{1/3} \right\rceil,
\]

where n is the total number of observations, ⌈·⌉ denotes the ceiling operator, and the exact form of the constants C−IMSE and C+IMSE depends on whether ES or QS bins are used (and on some features of the underlying data generating process). In practice, the unknown constants C−IMSE and C+IMSE are estimated using preliminary, objective data-driven procedures. In order to produce an RD plot that uses an IMSE-optimal number of evenly-spaced bins, we use the command rdplot with the option binselect = "es", but this time omitting the nbins = c(20, 20) option. When the number of bins is omitted, rdplot automatically chooses the number of bins according to the criterion specified with binselect. We now produce an RD plot that uses ES bins and chooses the total number of bins on either side of the cutoff to be IMSE-optimal.

> out = rdplot(Y, X, binselect = "es")
> print(out)

Call: rdplot(y = Y, x = X, binselect = "es")

Method:                        Left      Right
Number of Obs.                 2314        315
Polynomial Order                  4          4
Scale                             1          1
Selected Bins                    11          7
Average Bin Length           9.0909    14.1501
Median Bin Length            9.0909    14.1501
IMSE-optimal bins                11          7
Mimicking Variance bins          40         75

Relative to IMSE-optimal:
Implied scale                1.0000     1.0000
WIMSE variance weight        0.5000     0.5000
WIMSE bias weight            0.5000     0.5000

Analogous Stata command
. rdplot Y X, binselect(es) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))

The output reports both the average length of the bins and the median length of the bins. In the ES case, since each bin has the same length, each bin has length equal to both the average and the median bin length on each side. The IMSE criterion leads to different numbers of ES bins above and below the cutoff. As shown in the Selected Bins row in the bottom panel, the IMSE-optimal


Figure 3.7: IMSE RD Plot with Evenly-Spaced Bins—Meyersson Data

number of bins to the left of the cutoff is 11, while the optimal number of bins above the cutoff is only 7. As a result, the length of the bins above and below the cutoff is different: above the cutoff, each bin has a length of 14.1501 percentage points, while below the cutoff the bins are smaller, with a length of 9.0909. The middle rows show the optimal number of bins according to the IMSE criterion (which coincides with the selected number of bins because this is the criterion chosen) and the mimicking variance criterion, which we discuss below. The bottom three rows show the weights that the chosen bins give to the variance term relative to the bias term in the IMSE objective function—when the IMSE criterion is used, these weights are always equal to 1/2. To produce an RD plot that uses an IMSE-optimal number of quantile-spaced bins, we use the option binselect = "qs" instead of binselect = "es".

> out = rdplot(Y, X, binselect = "qs")
> print(out)

Call: rdplot(y = Y, x = X, binselect = "qs")

Method:                        Left      Right
Number of Obs.                 2314        315
Polynomial Order                  4          4
Scale                             1          1
Selected Bins                    21         14
Average Bin Length           4.7572     7.0821
Median Bin Length            2.8327     1.4289
IMSE-optimal bins                21         14
Mimicking Variance bins          44         41

Relative to IMSE-optimal:
Implied scale                1.0000     1.0000
WIMSE variance weight        0.5000     0.5000
WIMSE bias weight            0.5000     0.5000


Analogous Stata command
. rdplot Y X, binselect(qs) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))

Figure 3.8: IMSE RD Plot with Quantile-Spaced Bins—Meyersson Data

Note that the IMSE-optimal number of QS bins is much larger on both sides, with 21 bins below the cutoff and 14 above it, versus 11 and 7 in the analogous ES plot in Figure 3.7. The average bin length is 4.7572 below the cutoff, and 7.0821 above it. As expected, the median length of the bins is much smaller than the average length on both sides of the cutoff, particularly above. Since there are very few observations where the Islamic vote margin is above 50%, the length of the last bin above the cutoff must be very large in order to ensure that it contains 315/14 ≈ 22 observations. Figure 3.9 reproduces the evenly-spaced and quantile-spaced IMSE-optimal RD plots side-byside. As shown, the ES IMSE-optimal bins produce local means that trace the global polynomial fit closely, and do not reveal much variability near the cutoff. In contrast, the IMSE-optimal QS bins give a better idea of where most of the observations are located, and because there are more bins on each side, they produce a plot that reveals more local variability.

Figure 3.9: IMSE RD Plots—Meyersson Data. (a) Evenly-Spaced Bins; (b) Quantile-Spaced Bins.

3.2.2 Binning to Mimic the Variability of the Data

Since one of the most important roles of the RD plot is to illustrate the behavior of the data near the cutoff, the oversmoothed RD plot that may result from choosing an IMSE-optimal number of bins—particularly when using ES bins—is not always desirable. An alternative is to select bins in a way that guarantees a sufficiently large number of local means so that researchers can easily get a graphical representation of the variability of the data near the cutoff (and elsewhere in the support). The second fully automatic and data-driven method to select the number of bins reaches this goal by selecting the values of J− and J+ so that the binned means have an asymptotic (integrated) variability that is approximately equal to the variability of the raw data. In other words, the number of bins is chosen so that the overall variability of the binned means “mimics” the overall variability in the raw scatter plot of the data. In the Meyersson application, this method involves choosing J− and J+ so that the binned means have a total variability approximately equal to the variability illustrated in Figure 3.1. We refer to these choices of the total number of bins as mimicking-variance (MV) choices. The mimicking-variance values of J− and J+ are
\[
J_{-}^{\mathrm{MV}} = \left\lfloor C_{-}^{\mathrm{MV}}\, \frac{n}{\log(n)^2} \right\rfloor
\qquad \text{and} \qquad
J_{+}^{\mathrm{MV}} = \left\lfloor C_{+}^{\mathrm{MV}}\, \frac{n}{\log(n)^2} \right\rfloor,
\]

where again n is the sample size and the exact form of the constants C−MV and C+MV depends on whether ES or QS bins are used (and some features of the underlying data generating process). These constants are different from those appearing in the IMSE-optimal choices and, in practice, are also estimated using preliminary, objective data-driven procedures. In general, J−MV > J−IMSE and J+MV > J+IMSE. That is, the MV method to select the number of bins

leads to a larger number of bins than the IMSE method, resulting in an RD plot with more dots representing local means and thus giving a better sense of the variability of the data. In order to produce an RD plot with ES bins and an MV total number of bins on either side, we use the option binselect = "esmv".

> out = rdplot(Y, X, binselect = "esmv")
> print(out)

Call: rdplot(y = Y, x = X, binselect = "esmv")

Method:                        Left      Right
Number of Obs.                 2314        315
Polynomial Order                  4          4
Scale                             4         11
Selected Bins                    40         75
Average Bin Length           2.5000     1.3207
Median Bin Length            2.5000     1.3207
IMSE-optimal bins                11          7
Mimicking Variance bins          40         75

Relative to IMSE-optimal:
Implied scale                3.6364    10.7143
WIMSE variance weight        0.0204     0.0008
WIMSE bias weight            0.9796     0.9992


Analogous Stata command
. rdplot Y X, binselect(esmv) ///
>     graph_options(graphregion(color(white)) ///
>     xtitle(Running Variable) ytitle(Outcome))

Figure 3.10: Mimicking Variance RD Plot with Evenly-Spaced Bins—Meyersson Data

This produces a much higher number of bins than we obtained with both ES and QS bins under the IMSE criterion. The MV total number of bins is 40 below the cutoff and 75 above the cutoff, with length 2.5 and 1.3207, respectively. The difference in the chosen number of bins between the IMSE and the MV criteria is dramatic. The middle rows in the bottom panel show the number of bins that would have been produced according to the IMSE criterion (11 and 7) and the number of bins that would have been produced according to the MV criterion (40 and 75). This allows for a quick comparison between both methods. In this example, we see that the MV criterion leads to approximately four times (below the cutoff) and ten times (above the cutoff) as many bins as the number of bins produced by the IMSE criterion. The bottom rows indicate that the chosen number of MV bins on both sides of the cutoff is equivalent to the number of bins that would have been


chosen according to an IMSE criterion where, instead of giving the bias and the variance each a weight of 1/2, the relative weights of the variance and the bias had been, respectively, 0.0204 and 0.9796 below the cutoff, and 0.0008 and 0.9992 above the cutoff. Thus, we see that if we want to justify the MV choice in terms of the IMSE criterion, we must weight the bias much more heavily than the variance.

Finally, to obtain an RD plot that chooses the total number of bins according to the MV criterion but uses QS bins, we use the option binselect = "qsmv".

> out = rdplot(Y, X, binselect = "qsmv")
> print(out)

Call: rdplot(y = Y, x = X, binselect = "qsmv")

Method:
                             Left      Right
Number of Obs.               2314      315
Polynomial Order             4         4
Scale                        2         3
Selected Bins                44        41
Average Bin Length           2.2705    2.4183
Median Bin Length            1.3755    0.5057
IMSE-optimal bins            21        14
Mimicking Variance bins      44        41

Relative to IMSE-optimal:
Implied scale                2.0952    2.9286
WIMSE variance weight        0.0981    0.0383
WIMSE bias weight            0.9019    0.9617

Analogous Stata command
. rdplot Y X, binselect(qsmv) ///
>        graph_options(graphregion(color(white)) ///
>        xtitle(Running Variable) ytitle(Outcome))



Figure 3.11: Mimicking Variance RD Plot with Quantile-Spaced Bins—Meyersson Data

Below the cutoff, the MV number of bins is very similar to the MV choice for ES bins (44 versus 40). However, above the cutoff, the MV number of QS bins is much lower than the MV number of ES bins (41 versus 75). This occurs because, although the range of the running variable is [−100, 99.051], there are very few observations in the intervals [−100, −50] and [50, 100] far from the cutoff, which leads to high variability; at the same time, choosing ES bins forces the length of the bins to be the same everywhere in the support. Thus, in order to produce bins small enough to adequately mimic the overall variability of the scatter plot in regions with few observations, the number of ES bins has to be large. In contrast, QS bins can be short near the cutoff and long away from the cutoff, so they can mimic the overall variability by adapting their length to the density of the data (the short sketch after this paragraph illustrates the idea). Figure 3.12 reproduces the two MV RD plots—one using ES bins and the other using QS bins.
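To see concretely how the two binning schemes adapt to the distribution of the score, the following minimal R sketch computes evenly-spaced and quantile-spaced bin breaks for the control-side scores. It is a hypothetical illustration of the idea only, not the algorithm implemented in rdplot.

# Hypothetical sketch: evenly-spaced (ES) vs quantile-spaced (QS) bin breaks
# for the control-side scores (cutoff at zero), as in the Meyersson data.
J  = 40                                      # number of bins below the cutoff
Xc = X[X < 0]                                # control-side scores

# ES bins: equal-length intervals over the support of the control scores
es.breaks = seq(min(Xc), 0, length.out = J + 1)

# QS bins: intervals containing roughly the same number of observations
qs.breaks = quantile(Xc, probs = seq(0, 1, length.out = J + 1))

# ES bin lengths are constant; QS bins are short where the data are dense
# (near the cutoff) and long where the data are sparse (far from the cutoff)
summary(diff(es.breaks))
summary(diff(unname(qs.breaks)))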


Figure 3.12: Mimicking Variance RD Plots—Meyersson Data. (a) Evenly-Spaced Bins; (b) Quantile-Spaced Bins

3.3   Recommendations for Practice

Since the IMSE-optimal bins are better suited to trace out the regression function than to illustrate the local variability, and given that the overall shape of the regression function is already represented by the global fit, we recommend choosing MV bins for depicting the overall RD design in the first instance. The IMSE-optimal number of bins will be useful to identify potential discontinuities in the underlying regression functions, especially when contrasted with the global polynomial fits, as we discuss in more detail in Section 6. Both QS and ES bins are useful in their own right for describing the RD design, and hence it is useful to report both RD plots side-by-side in applications. The QS RD plot gives a more accurate representation of the concentration of the observations along the support of the score, while the ES RD plot provides similar information without being influenced by the underlying distribution of the score. We recommend using a 4th-order or 5th-order polynomial for the global fit over extended supports of the score, and lower-order polynomials for restricted supports (we discuss issues of global versus local polynomial fitting explicitly in the upcoming section). Having said this, we caution against interpreting the specific jump at the cutoff as a valid treatment effect estimator because, as we will discuss in the upcoming section, global polynomial fits tend to perform very poorly at boundary points. Finally, RD plots are useful not only to present the overall RD design and motivate its empirical falsification, but also to depict the specific RD analysis local to the cutoff, as we will illustrate in the following sections.

3.4   Further Readings

A detailed discussion of RD plots and formal methods for automatic data-driven bin selection are given by Calonico et al. (2015a). That paper formalized the commonly used RD plots with evenly-spaced binning, introduced RD plots with quantile-spaced binning, and developed optimal choices for the number of bins in terms of both integrated mean squared error and mimicking-variance targets. RD plots are special cases of nonparametric partitioning estimators—see, e.g., Cattaneo and Farrell (2013) and references therein.

4   The Continuity-Based Approach to RD Analysis

This section discusses empirical methods for estimation and inference in RD designs based on continuity assumptions and extrapolation, which rely on large-sample approximations with random potential outcomes under repeated sampling. These methods offer tools useful not only for estimation of and inference on the main treatment effects, but also for falsification and validation of the design—discussed in Section 6. The approach discussed here is based on formal statistical methods and hence leads to disciplined and objective empirical analysis, which typically has two related but distinct goals: point estimation of the RD treatment effect (i.e., providing a scalar estimate of the vertical distance between the regression functions at the cutoff) and statistical inference about the RD treatment effect (i.e., constructing valid hypothesis tests and confidence intervals to establish the values of the RD parameter that are most supported by the data).

The methods discussed in this section are based on the continuity conditions underlying Equation (2.1), and generalizations thereof. This framework for RD analysis, which we call the continuity-based RD framework, uses methodological tools that directly rely on continuity (and differentiability) assumptions and defines τSRD as the parameter of interest. In this framework, estimation typically proceeds by using (local to the cutoff) polynomial methods to model or approximate the regression function E[Yi|Xi = x] on each side of the cutoff separately. In practical terms, this involves using least-squares methods to fit a polynomial of the observed outcome on the score. When all the observations are used for estimation, these polynomial fits are global or parametric in nature, like those used in the default RD plots discussed in the previous section. In contrast, when estimation employs only observations with scores near the cutoff, the polynomial fits are local or "nonparametric". Local polynomial methods are by now the standard framework for RD empirical analysis, as they offer a good compromise between flexibility and simplicity.

In Section 5, we discuss an alternative framework that relies on assumptions of local random assignment of the treatment near the cutoff and hence employs tools and ideas from the statistical literature on the analysis of experiments. That alternative approach offers a nice complement to, and a robustness check for, the local polynomial methods based on continuity assumptions discussed in this section.

4.1   Local Polynomial Approach: Overview

A fundamental feature of the RD design is that, in general, there are no observations whose score Xi equals the cutoff value x̄: because the running variable is assumed continuous, there are no (or at most very few) observations whose score is x̄ or very nearly so. Thus, extrapolation in RD designs is unavoidable in general. In other words, in order to form estimates of the average response of control units at the cutoff, E[Yi(0)|Xi = x̄], and of the average response of treatment units at the cutoff, E[Yi(1)|Xi = x̄], we must rely on observations further away from the cutoff to approximate the unknown regression functions. In the Sharp RD design, for example, the treatment effect τSRD is


the vertical distance between E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] at x = x̄, as shown in Figure 2.2, and thus estimation and inference proceed by first approximating the unknown regression functions and then computing the estimated treatment effect and/or the statistical inference procedure of interest. In this context, the key practical issue in RD analysis is how the approximation of the regression functions is done, as this will have major effects on the robustness and credibility of the empirical findings.

The problem of approximating an unknown function is well understood: any sufficiently smooth function can be well approximated by a polynomial function, locally or globally. Applied to the RD point estimation problem, this result suggests that the unknown regression functions E[Yi(t)|Xi = x], t = 0, 1, can in principle be well approximated by a polynomial function of the score, up to random sampling error. Early empirical work employed this idea globally, that is, it tried to approximate the unknown regression functions using flexible higher-order polynomials, usually of 4th or 5th order, over the entire support of the data. This global approach is still used in RD plots, as illustrated in the previous section, because there the goal is to approximate the entire unknown regression functions. However, it is now widely recognized that this global polynomial approach does not deliver point estimators and inference procedures with good properties for the main object of interest: the RD treatment effect. The reason is that global polynomial approximations tend to do a good job overall but a very bad job at boundary points—a problem known as Runge's phenomenon in approximation theory. Put differently, global polynomial approximations tend to have very erratic behavior near boundary points and induce counter-intuitive weighting schemes on the observations when the goal is to estimate the unknown function at a boundary point. Because RD treatment effects are defined at a boundary point of the supports of the control and treatment groups separately, the global polynomial approach is suspect in practice. For example, distant points and/or outliers may severely affect the global polynomial RD point estimator. In sum, global polynomials can lead to invalid estimates, and thus the conclusions from a global parametric RD analysis can be highly misleading. For these reasons, we recommend against using global polynomial methods for RD analysis.

Modern and principled RD empirical work employs local polynomial methods, which focus on approximating the regression functions only near the cutoff. Because this approach localizes the polynomial fit to the cutoff (discarding all other observations sufficiently far away) and employs a low-order polynomial approximation (usually linear or quadratic), it is substantially more robust and less sensitive to boundary-related problems. Furthermore, this approach can be viewed formally as a nonparametric local polynomial approximation, which has also aided the development of a large toolkit of statistical and econometric results. In contrast to global higher-order polynomials, local lower-order polynomial approximations can be viewed as intuitive approximations with a potential misspecification of the functional form of the regression function, which can be modeled and understood formally, while at the same time being less sensitive to outliers or other extreme features of the data generating process away from the cutoff.
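The sensitivity of global fits at the boundary is easy to illustrate with simulated data. The following R sketch (a hypothetical example, not the Meyersson application) contaminates a few observations far from the cutoff and compares the implied intercepts at the cutoff from a global 4th-order polynomial and from a local linear fit that uses only observations near the cutoff.

# Hypothetical simulated example: boundary sensitivity of a global 4th-order
# polynomial versus a local linear fit (control side only, cutoff at 0).
set.seed(123)
n = 500
x = runif(n, -1, 0)                          # control-side scores
y = 1 + 0.5 * x + rnorm(n, sd = 0.25)        # true value at the cutoff is 1

# Contaminate a few observations far from the cutoff
y[x < -0.9] = y[x < -0.9] + 5

# Global 4th-order polynomial fit, evaluated at the cutoff
mu0.global = predict(lm(y ~ poly(x, 4, raw = TRUE)),
                     newdata = data.frame(x = 0))

# Local linear fit within bandwidth h = 0.2: by construction it does not
# use the contaminated observations far from the cutoff
h = 0.2
mu0.local = coef(lm(y ~ x, subset = (x >= -h)))[1]

c(true = 1, global = unname(mu0.global), local = unname(mu0.local))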


Modern empirical work in RD designs employs local polynomial methods using only observations close to the cutoff, treating the resulting fits as local approximations, not necessarily as correctly specified models. Not surprisingly, the statistical properties of local polynomial estimation and inference crucially depend on the accuracy of the approximation near the cutoff, which is controlled by the size of the neighborhood or bandwidth around the cutoff where the local polynomial is fit. In the upcoming subsections, we discuss the modern local polynomial methods for RD analysis, and explain all the steps involved in their implementation for both estimation and inference. We also discuss several extensions, including the inclusion of covariates and the use of cluster-robust standard errors.

4.2   Local Polynomial Point Estimation

Local polynomial methods estimate the desired polynomials using only observations near the cutoff point, separately for control and treatment units. This approach uses only observations that are between x̄ − h and x̄ + h, where h > 0 is some chosen bandwidth. Moreover, within this bandwidth, observations closer to x̄ often receive more weight than observations further away, where the weights are determined by a kernel function K(·). This local polynomial approach can be understood and analyzed formally as nonparametric, in which case the fit is taken as an approximation of the unknown underlying regression functions within the region determined by the bandwidth used. Local polynomial estimation consists of the following steps:

1. Choose a polynomial order p and a kernel function K(·).

2. Choose a bandwidth h.

3. For observations above the cutoff (i.e., observations with score Xi ≥ x̄), fit a weighted least squares regression of the outcome Yi on a constant and (Xi − x̄), (Xi − x̄)², . . . , (Xi − x̄)^p, where p is the chosen polynomial order, with weight K((Xi − x̄)/h) for each observation. The estimated intercept from this local weighted regression, μ̂+, is an estimate of the point μ+ = E[Yi(1)|Xi = x̄]. In other words, μ̂+ is the first element (intercept) of the weighted least squares problem:

\[ \hat{\beta}_+ = \operatorname*{arg\,min}_{\beta_{+,0},\ldots,\beta_{+,p}} \sum_{i=1}^{n} \mathbb{1}(X_i \geq \bar{x}) \left( Y_i - \beta_{+,0} - \beta_{+,1}(X_i - \bar{x}) - \cdots - \beta_{+,p}(X_i - \bar{x})^p \right)^2 K\!\left(\frac{X_i - \bar{x}}{h}\right). \]

4. For observations below the cutoff (i.e., observations with score Xi < x̄), fit a weighted least squares regression of the outcome Yi on a constant and (Xi − x̄), (Xi − x̄)², . . . , (Xi − x̄)^p, where p is the chosen polynomial order, with weight K((Xi − x̄)/h) for each observation. The estimated intercept from this local weighted regression, μ̂−, is an estimate of the point μ− = E[Yi(0)|Xi = x̄]. In other words, μ̂− is the first element (intercept) of the weighted least squares problem:

\[ \hat{\beta}_- = \operatorname*{arg\,min}_{\beta_{-,0},\ldots,\beta_{-,p}} \sum_{i=1}^{n} \mathbb{1}(X_i < \bar{x}) \left( Y_i - \beta_{-,0} - \beta_{-,1}(X_i - \bar{x}) - \cdots - \beta_{-,p}(X_i - \bar{x})^p \right)^2 K\!\left(\frac{X_i - \bar{x}}{h}\right). \]

5. Calculate the Sharp RD point estimate: τ̂SRD = μ̂+ − μ̂−.

A graphical representation of local polynomial RD point estimation is given in Figure 4.1, where a polynomial of order one (p = 1) is fit within bandwidth h1—observations outside this bandwidth are not used in the estimation. The RD effect is τSRD = μ+ − μ− and the local polynomial estimator of this effect is μ̂+ − μ̂−. The dots and squares represent observed data points, because this method employs the actual raw data, not the binned means typically reported in RD plots.

Figure 4.1: RD Estimation with Local Polynomial

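To summarize the five steps above, the following R sketch collects them into a single function. This is a hypothetical helper, not part of any package; it simply reproduces by hand, for a given order p, a triangular kernel, and a bandwidth h, what dedicated RD software does for point estimation.

# Hypothetical helper implementing steps 1-5 by hand: local polynomial
# RD point estimation with a triangular kernel and bandwidth h.
rd.point.estimate = function(y, x, cutoff = 0, p = 1, h) {
  xc = x - cutoff                            # center the score at the cutoff
  w  = pmax(1 - abs(xc) / h, 0)              # triangular weights, zero outside [c - h, c + h]

  # Steps 3 and 4: weighted p-th order fits on each side of the cutoff
  fit.right = lm(y ~ poly(xc, p, raw = TRUE), weights = w, subset = (xc >= 0 & w > 0))
  fit.left  = lm(y ~ poly(xc, p, raw = TRUE), weights = w, subset = (xc <  0 & w > 0))

  # Step 5: difference of the two estimated intercepts at the cutoff
  unname(coef(fit.right)[1] - coef(fit.left)[1])
}

# Usage with the Meyersson data, e.g. rd.point.estimate(Y, X, p = 1, h = 20);
# compare with the weighted least-squares fits in Section 4.2.4.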

The implementation of the local polynomial approach thus requires the choice of three main ingredients: the kernel function K(·), the order of the polynomial p, and the bandwidth h. We now turn to a detailed discussion of each of these choices.

4.2.1   Choice of Kernel Function and Polynomial Order

The kernel function K(x) assigns non-negative weights to each observation based on the distance of its score Xi relative to the cutoff x̄. The recommended choice is the triangular kernel function, K(x) = (1 − |x|)1(|x| ≤ 1), because, when coupled with a bandwidth that is optimal in a mean squared error (MSE) sense, it leads to a point estimator with optimal MSE properties. As illustrated in Figure 4.2, the triangular kernel function assigns zero weight to all observations with score outside the interval [x̄ − h, x̄ + h], and positive weights to all observations within this interval. The weight is maximized at Xi = x̄, and declines symmetrically and linearly as the value of the score gets farther from the cutoff.

Despite the desirable asymptotic optimality property (from a point estimation perspective) of the triangular kernel, researchers sometimes prefer to use the simpler uniform kernel K(x) = 1(|x| ≤ 1), which also gives zero weight to all observations with score outside [x̄ − h, x̄ + h] but equal weight to all the observations whose scores are within this interval; see Figure 4.2. Employing a local linear estimation with bandwidth h and the uniform kernel is therefore equivalent to estimating a simple linear regression without weights using only observations whose distance from the cutoff is at most h, i.e., observations with Xi ∈ [x̄ − h, x̄ + h]. This second choice of kernel minimizes the asymptotic variance of the local polynomial estimator. A third weighting scheme sometimes encountered in practice is the Epanechnikov kernel, also depicted in Figure 4.2, which gives quadratically decaying weights to observations within Xi ∈ [x̄ − h, x̄ + h] and zero weight to the rest. In practice, estimation and inference results are typically not very sensitive to the particular choice of kernel used.

Figure 4.2: Different Kernel Weights for RD Estimation

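For concreteness, the three weighting schemes can be written as simple functions of the standardized distance u = (Xi − x̄)/h. The following R lines are just a sketch of the formulas discussed above (the Epanechnikov constant 0.75 is the usual normalization and is irrelevant for estimation, since only relative weights matter); they are not taken from any package.

# Kernel weights as functions of the standardized distance u = (Xi - xbar)/h.
# All three give zero weight when |u| > 1, i.e., outside [xbar - h, xbar + h].
triangular   = function(u) (1 - abs(u)) * (abs(u) <= 1)
uniform      = function(u) 1 * (abs(u) <= 1)
epanechnikov = function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Example: weights at 0%, 50%, and 120% of the bandwidth away from the cutoff
u = c(0, 0.5, 1.2)
rbind(triangular = triangular(u), uniform = uniform(u), epanechnikov = epanechnikov(u))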

A more important issue is the choice of the order of the local polynomial used. As we will discuss next, given this choice, the accuracy of the approximation will be essentially controlled by the bandwidth: in other words, the bandwidth will be selected given the polynomial order, taking into account misspecification errors. As already mentioned, higher-order polynomials tend to over-fit the data and hence produce unreliable results near boundary points. At the same time, local constant fits (p = 0) exhibit some undesirable theoretical features and usually under-fit the data. In practice, the recommended choices are usually p = 1 or p = 2, though theory and practice consider other polynomial orders as well.

In sum, several issues need to be considered when choosing the specific order of the local polynomial. First, a polynomial of order zero—a constant fit—has undesirable theoretical properties at boundary points, which is precisely where RD estimation must occur. Second, for a given bandwidth, increasing the order of the polynomial generally improves the accuracy of the approximation but also increases the variability of the treatment effect estimator. In particular, it can be shown that the asymptotic variances of the local constant (p = 0) and local linear (p = 1) polynomial fits are equal, while the latter fit has smaller asymptotic bias. This fact has led researchers to prefer the local linear RD estimator, which by now is the default point estimator in most applications. In finite samples, of course, the ranking between different local polynomial estimators may be different, but in general the local linear estimator seems to deliver a good trade-off between simplicity, precision, and stability in sharp RD settings. Although it may seem at first that a linear polynomial is not flexible enough, an appropriately chosen bandwidth will adjust to the selected polynomial order so that the linear approximation to the unknown regression functions is reliable; a small sensitivity check is sketched below. We discuss this in more detail below, when we discuss optimal bandwidth selection.
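As a quick sensitivity check, the hypothetical rd.point.estimate helper sketched after the estimation steps above can be used to compare local linear and local quadratic estimates at a fixed bandwidth; the spread between the two gives a rough sense of how much the choice of p matters before the bandwidth is re-optimized for each order.

# Sensitivity of the point estimate to the polynomial order at a fixed,
# ad-hoc bandwidth of 20 (Meyersson data; uses the hypothetical helper above).
sapply(c(p1 = 1, p2 = 2), function(p) rd.point.estimate(Y, X, p = p, h = 20))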

4.2.2   Bandwidth Selection and Implementation

The choice of bandwidth h is fundamental for the analysis and interpretation of RD designs, and empirical findings are often sensitive to this key tuning parameter. This choice controls the width of the neighborhood around the cutoff that is used to fit the local polynomial that approximates the unknown regression functions. Figure 4.3 illustrates how the error in the approximation is directly related to the bandwidth choice. The unknown regression functions in the figure, E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x], have considerable curvature. At first, it would seem inappropriate to approximate these functions with a linear polynomial. Indeed, inside the interval [x̄ − h2, x̄ + h2], a linear approximation yields an estimated RD effect equal to μ̂+(h2) − μ̂−(h2), which is considerably different from the true effect μ+ − μ−. Thus, a linear regression approximation within bandwidth h2 results in a large misspecification error. However, reducing the bandwidth from h2 to h1 improves the linear approximation considerably, as now the estimated RD effect μ̂+(h1) − μ̂−(h1) is much closer to the population treatment effect τSRD. The reason is that the regression functions are nearly linear in the interval [x̄ − h1, x̄ + h1], and therefore the linear approximation results in a smaller misspecification error. This illustrates the general principle that, given a polynomial order, the accuracy of the approximation can always


be improved by reducing the bandwidth.

Figure 4.3: Bias in Local Approximations


Choosing a smaller h will reduce the error or bias of the local polynomial approximation, but will simultaneously tend to increase the variance of the estimated coefficients because fewer observations will be available for estimation. On the other hand, a larger h will result in more misspecification error (bias) if the unknown function differs considerably from the polynomial model used for approximation, but will reduce the variance because the number of observations in the interval [x̄ − h, x̄ + h] will be larger. For this reason, the choice of bandwidth is said to involve a "bias-variance trade-off". Since empirical RD results are often sensitive to the choice of bandwidth, it is important to select h in a data-driven, automatic way to avoid specification searching and ad-hoc decisions, as such an approach provides (at least) a good benchmark for empirical work. Most (if not all) bandwidth


selection methods try to balance some form of bias-variance trade-off (sometimes involving other features of the estimator as well). The most popular approach in practice seeks to minimize the MSE of the local polynomial RD point estimator, τ̂SRD, given a choice of polynomial order and kernel function. Since the MSE of an estimator is the sum of its squared bias and its variance, this approach effectively chooses h to optimize a bias-variance trade-off. Because the exact MSE of the estimator is difficult to characterize in general, the precise procedure uses data-driven methods to choose the bandwidth that minimizes an approximation to the asymptotic MSE of the RD point estimator: this requires deriving an asymptotic MSE approximation, estimating the unknown quantities in the resulting formula, and optimizing it with respect to h.

To describe the MSE-optimal bandwidth selection in more detail, let p denote the order of the polynomial used to form the RD estimator τ̂SRD, with kernel K(·). Then, the general form of the asymptotic MSE approximation is

\[ \mathrm{MSE}(\hat{\tau}_{\mathrm{SRD}}) \approx \mathrm{Bias}^2(\hat{\tau}_{\mathrm{SRD}}) + \mathrm{Variance}(\hat{\tau}_{\mathrm{SRD}}) \approx h^{2(p+1)} \mathcal{B}^2 + \frac{1}{nh}\mathcal{V}, \]

where the constants B and V represent, respectively, the (leading) asymptotic bias and variance of the RD point estimator τ̂SRD. Although we omit the technical details, we present the general form of B and V to clarify the most important trade-offs involved in the choice of an MSE-optimal bandwidth for the local polynomial RD estimator. The general form of the asymptotic bias B is

\[ \mathcal{B} = \mathcal{B}_+ - \mathcal{B}_-, \qquad \mathcal{B}_- = \mu^{(p+1)}_- B_-, \qquad \mathcal{B}_+ = \mu^{(p+1)}_+ B_+, \]

where

\[ \mu^{(p+1)}_+ = \lim_{x \downarrow \bar{x}} \frac{d^{p+1}}{dx^{p+1}} E[Y_i(1)|X_i = x] \qquad \text{and} \qquad \mu^{(p+1)}_- = \lim_{x \uparrow \bar{x}} \frac{d^{p+1}}{dx^{p+1}} E[Y_i(0)|X_i = x] \]

are related to the "curvature" of the unknown regression functions for treatment and control units, respectively, and the known constants B+ and B− are related to the kernel function and the order p of the polynomial used. The bias term B associated with the local polynomial RD point estimator of order p, τ̂SRD, thus depends on the (p + 1)-th derivatives of the regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] with respect to the running variable. This is a more formal characterization of the phenomenon we illustrated in Figure 4.3: when we approximate E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] with a local polynomial of order p, that approximation has an error (unless E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] happen to be polynomials of at most order p). The leading term of the approximation error involves the derivative of order p + 1—that is, the order following the polynomial order used to estimate τSRD. For example, as illustrated in Figure 4.3, if we use a local linear (p = 1) polynomial to estimate τSRD, our approximation by construction ignores the second-order term (which depends on the second derivative of the function) and all higher-order terms (which depend on the higher-order derivatives). Thus, the leading bias associated with a local linear estimator depends on the second derivatives of the regression functions, which are the leading terms in the approximation error incurred when we set p = 1. Similarly, if we use a local quadratic polynomial to estimate τSRD, the leading bias will depend on the third derivatives of the regression functions.

The variance term V can also be characterized in some more detail:

\[ \mathcal{V} = \mathcal{V}_- + \mathcal{V}_+, \qquad \mathcal{V}_- = \frac{\sigma^2_-}{f}\, V_-, \qquad \mathcal{V}_+ = \frac{\sigma^2_+}{f}\, V_+, \]

where

\[ \sigma^2_+ = \lim_{x \downarrow \bar{x}} \mathbb{V}[Y_i(1)|X_i = x] \qquad \text{and} \qquad \sigma^2_- = \lim_{x \uparrow \bar{x}} \mathbb{V}[Y_i(0)|X_i = x] \]

capture the conditional variability of the outcome given the score at the cutoff for treatment and control units, respectively, f denotes the density of the score variable at the cutoff, and the known constants V− and V+ are related to the kernel function and the order p of the polynomial used. Thus, for example, as the number of observations near the cutoff decreases (i.e., as the density f decreases), the contribution of the variance term to the MSE increases. Similarly, as the number of observations near the cutoff increases, the contribution of the variance term to the MSE decreases accordingly. This captures the intuition that the variability of the RD point estimator partly depends on the density of observations near the cutoff. Similarly, an increase (decrease) in the conditional variability of the outcome given the score will increase (decrease) the MSE of the RD point estimator.

In order to obtain an MSE-optimal point estimator τ̂SRD, we choose the bandwidth that minimizes the MSE approximation,

\[ \min_{h > 0} \left\{ h^{2(p+1)} \mathcal{B}^2 + \frac{1}{nh} \mathcal{V} \right\}, \]

which leads to the MSE-optimal bandwidth choice

\[ h_{\mathrm{MSE}} = \left( \frac{\mathcal{V}}{2(p+1)\mathcal{B}^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}. \]

This formula formally incorporates the bias-variance trade-off mentioned above. It follows that hMSE is proportional to n^{-1/(2p+3)}, and that this MSE-optimal bandwidth increases with V and decreases with B. In other words, a larger asymptotic variance will lead to a larger MSE-optimal bandwidth; this is intuitive, as a larger bandwidth will include more observations in the estimation and thus reduce the variance of the resulting point estimator. In contrast, a larger asymptotic bias will lead to a smaller bandwidth, as a smaller bandwidth will reduce the approximation error and hence the bias of the resulting point estimator. Another way to see this trade-off is to note that if we chose a bandwidth h > hMSE, decreasing h would lead to a reduction in the approximation error and an increase in the variability of the point


estimator, but the MSE reduction caused by the decrease in bias would be larger than the MSE increase caused by the variance increase, leading to a smaller MSE overall. In other words, when h > hMSE, it is possible to reduce the misspecification error without increasing the MSE. In contrast, when we set h = hMSE, both increasing and decreasing the bandwidth necessarily lead to a higher MSE. Given the quantities V and B, increasing the sample size n leads to a smaller optimal hMSE. This is also intuitive: as more data become available, both bias and variance are reduced. Thus, the larger the sample size, the better the asymptotic MSE of the RD estimator, because it is possible to reduce the error in the approximation by reducing the bandwidth without paying a penalty in increased variability, thanks to the larger sample.

In some applications, it may be useful to choose different bandwidths for each group, that is, on either side of the cutoff. Since the RD treatment effect τSRD = μ+ − μ− is simply the difference of two (one-sided) estimates, allowing for two distinct bandwidth choices can be accomplished by considering an MSE approximation for each estimate separately. In other words, two different bandwidths can be selected for μ̂+ and μ̂−, and then used to form the RD treatment effect estimator. Practically, this is equivalent to choosing an asymmetric neighborhood around the cutoff of the form [x̄ − h−, x̄ + h+], where h− and h+ denote the control (left) and treatment (right) bandwidths, respectively. These MSE-optimal choices are given by

\[ h_{\mathrm{MSE},-} = \left( \frac{\mathcal{V}_-}{2(p+1)\mathcal{B}_-^2} \right)^{1/(2p+3)} n^{-1/(2p+3)} \qquad \text{and} \qquad h_{\mathrm{MSE},+} = \left( \frac{\mathcal{V}_+}{2(p+1)\mathcal{B}_+^2} \right)^{1/(2p+3)} n^{-1/(2p+3)}. \]

Thus, these bandwidth choices will be most practically relevant when the bias and/or variance of the control and treatment groups differ substantially, for example because of different curvature of the unknown regression functions or different conditional variance of the outcome given the score near the cutoff.

In practice, the optimal bandwidth selectors described above (and all other variants thereof) are implemented by constructing preliminary plug-in estimates of the unknown quantities entering their formulas. For example, the misspecification biases B+ and B− are constructed by forming preliminary "curvature" estimates μ̂−^(p+1) and μ̂+^(p+1), which are obtained using a local polynomial of order q ≥ p + 1 with bias bandwidth b, not necessarily equal to h. Since the constants B− and B+ are known, feasible bias estimates can be formed as

\[ \hat{\mathcal{B}}_+ = \hat{\mu}^{(p+1)}_+ B_+ \qquad \text{and} \qquad \hat{\mathcal{B}}_- = \hat{\mu}^{(p+1)}_- B_-. \]

Similarly, the terms V− and V+, capturing the asymptotic variance of the estimates on the left and on the right of the cutoff, respectively, can be estimated by replacing the unknown conditional variance and density functions at the cutoff by preliminary estimates thereof. Given these ingredients, data-driven MSE-optimal bandwidth selectors are easily constructed for the RD treatment effect (i.e., one common bandwidth on both sides of the cutoff) and for each of the two regression function estimators at the cutoff (i.e., two distinct bandwidths).

The approach described so far for bandwidth selection in RD designs is arguably the default in most modern empirical work. A potential drawback of this approach is that in some applications


the estimated biases may be close to zero, leading to poor behavior of the resulting bandwidth selectors. To handle this computational issue, it is common to include a "regularization" term R to avoid small denominators in small samples. For example, in the case of the MSE-optimal bandwidth choice for the RD treatment effect estimator, the alternative formula is

\[ h_{\mathrm{MSE}} = \left( \frac{\mathcal{V}}{2(p+1)\mathcal{B}^2 + R} \right)^{1/(2p+3)} n^{-1/(2p+3)}, \]

where the extra term R can be justified theoretically but requires additional preliminary estimators when implemented. Empirically, since R enters the denominator, including a regularization term will always lead to a smaller hMSE. This idea is also used in the case of hMSE,− and hMSE,+, and in other related bandwidth selection procedures. We discuss how to include and exclude a regularization term in practice when we illustrate local polynomial methods in subsection 4.2.4.
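To fix ideas, the following R sketch evaluates the hMSE formula for the local linear case using purely hypothetical values of the bias and variance constants and of the regularization term. It only illustrates the arithmetic of the formulas above—including the fact that adding R can only shrink the bandwidth—not how the estimates of B, V, or R are actually constructed from the data.

# Arithmetic of the MSE-optimal bandwidth formula (hypothetical B, V, R values,
# not estimates from the Meyersson data).
h.mse = function(B, V, n, p = 1, R = 0) {
  (V / (2 * (p + 1) * B^2 + R))^(1 / (2 * p + 3)) * n^(-1 / (2 * p + 3))
}

n = 2629                         # sample size in the Meyersson data
B = 0.05; V = 400; R = 0.01      # purely illustrative plug-in values

h.mse(B, V, n)                   # without regularization
h.mse(B, V, n, R = R)            # with regularization: always weakly smaller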

4.2.3   Optimal Point Estimation

Given the choice of polynomial order p and kernel function K(·), the local polynomial RD point estimator τ̂SRD is implemented for a choice of neighborhood around the cutoff x̄ determined by the bandwidth h. As discussed previously, the smaller h, the smaller the misspecification bias and the larger the variability of the RD treatment effect estimator, while for larger bandwidths the bias-variance effects are reversed. Selecting a common MSE-optimal bandwidth for τ̂SRD, or two distinct MSE-optimal bandwidths for its ingredients μ̂− and μ̂+, leads to an MSE-optimal RD point estimator. To be more specific, the resulting estimator is consistent and achieves the fastest rate of decay of the MSE. Furthermore, it can be argued in a precise technical sense that the triangular kernel is the MSE-optimal choice for point estimation. Because of these optimality properties, and the fact that the procedures are data-driven and objective, modern RD empirical work routinely employs some form of automatic MSE-optimal bandwidth selector, and reports the resulting MSE-optimal point estimator when estimating RD treatment effects.

4.2.4   Point Estimation in Practice

We now return to the Meyersson application to illustrate point estimation of RD effects using local polynomials. First, we use standard least-squares commands to emphasize that local polynomial point estimation is nothing more than a weighted least-squares fit when it comes to point estimation (this is not true when the goal is inference, as we discuss below). We start by choosing a fixed or ad-hoc bandwidth equal to h = 20, and thus postpone the illustration of optimal bandwidth selection until further below. Within this arbitrary bandwidth choice, we can construct the local linear (p = 1) RD point estimator with a uniform kernel using standard least-squares routines. As mentioned above, a uniform kernel simply means that all observations outside [x̄ − h, x̄ + h] are excluded, and all observations inside this interval are weighted


equally.

> out = lm(Y[X < 0 & X >= -20] ~ X[X < 0 & X >= -20])
> left_intercept = out$coefficients[1]
> print(left_intercept)
(Intercept)
   12.62254
> out = lm(Y[X >= 0 & X < 20] ~ X[X >= 0 & X < 20])
> right_intercept = out$coefficients[1]
> print(right_intercept)
(Intercept)
   15.54961
> difference = right_intercept - left_intercept
> print(paste("The RD estimator is ", difference, sep = ""))
[1] "The RD estimator is 2.92707507543107"

Analogous Stata command
. reg Y X if X<0 & X>=-20
. local intercept_left=coef_left[1,2]
. reg Y X if X>=0 & X<=20
. local difference=`intercept_right'-`intercept_left'
The RD estimator is `difference'
The RD estimator is 2.92707507543108

The results indicate that, within this ad-hoc bandwidth of 20 percentage points, the share of women ages 15 to 20 who completed high school increases by 2.927 percentage points: 15.549 percent of women in this age group had completed high school by 2000 in municipalities where the Islamic party barely won the 1994 mayoral elections, while the analogous share in municipalities where the Islamic party was barely defeated is 12.622 percent. The same point estimator can be obtained by fitting a single linear regression that includes an interaction between the treatment indicator and the score—both approaches are algebraically equivalent.

> Z_X = X * Z
> out = lm(Y[X >= -20 & X <= 20] ~ X[X >= -20 & X <= 20] +
+   Z[X >= -20 & X <= 20] + Z_X[X >= -20 & X <= 20])
> print(out)

Call:
lm(formula = Y[X >= -20 & X <= 20] ~ X[X >= -20 & X <= 20] +
    Z[X >= -20 & X <= 20] + Z_X[X >= -20 & X <= 20])

Coefficients:
            (Intercept)    X[X >= -20 & X <= 20]    Z[X >= -20 & X <= 20]    Z_X[X >= -20 & X <= 20]
                12.6225                  -0.2481                   2.9271                     0.1261

Analogous Stata command
. gen Z_X=X*Z
. reg Y X Z Z_X if X>=-20 & X<=20

To produce the same point estimator with a triangular kernel instead of a uniform kernel, we simply use a least-squares routine with weights. First, we create the weights according to the triangular kernel formula.

> w = NA
> w[X < 0 & X >= -20] = 1 - abs(X[X < 0 & X >= -20] / 20)
> w[X >= 0 & X <= 20] = 1 - abs(X[X >= 0 & X <= 20] / 20)

Analogous Stata command
. gen weights=.
. replace weights=(1-abs(X/20)) if X<0 & X>=-20
. replace weights=(1-abs(X/20)) if X>=0 & X<=20

Then, we use the weights in the least-squares fit.

> out = lm(Y[X < 0] ~ X[X < 0], weights = w[X < 0])
> left_intercept = out$coefficients[1]
> out = lm(Y[X >= 0] ~ X[X >= 0], weights = w[X >= 0])
> right_intercept = out$coefficients[1]
> difference = right_intercept - left_intercept
> print(difference)
(Intercept)
   2.937319

Analogous Stata command
. reg Y X [aw=weights] if X<0 & X>=-20
. matrix coef_left=e(b)
. local intercept_left=coef_left[1,2]
. reg Y X [aw=weights] if X>=0 & X<=20
. matrix coef_right=e(b)
. local intercept_right=coef_right[1,2]
. local difference=`intercept_right'-`intercept_left'
The RD estimator is 2.9373186846586

Note that, with h and p fixed, changing the kernel from uniform to triangular alters the point estimator only slightly, from 2.9271 to 2.9373. This is typical; point estimates tend to be relatively stable with respect to the choice of kernel. We showed how to use least-squares estimation only for pedagogical purposes, that is, to clarify the algebraic mechanics behind local polynomial point estimation. However, employing weighted least-squares routines in practice could be misleading and will be incompatible with MSE-optimal bandwidth selection for inference, as we discuss in the upcoming sections. From this point on, we


employ software packages that are specifically tailored to RD point estimation and inference. In particular, we focus on the rdrobust software package, which includes several functions to conduct local polynomial bandwidth selection, RD point estimation, and inference in a fully internally coherent methodological way.

To replicate the previous point estimators using the command rdrobust, we use the options p to set the order of the polynomial, kernel to set the type of kernel used to weigh the observations, and h to choose the bandwidth manually. By default, rdrobust sets the cutoff value to zero, but this can be changed with the option c. We first use rdrobust to obtain a local linear RD point estimator with h = 20 and a uniform kernel.

> rdrobust(Y, X, kernel = "uniform", p = 1, h = 20)

Call: rdrobust(y = Y, x = X, p = 1, h = 20, kernel = "uniform")

Summary:
Number of Obs    2629
BW Type          Manual
Kernel Type      Uniform
VCE Type         NN

                       Left       Right
Number of Obs          2314       315
Eff. Number of Obs     608        280
Order Loc Poly (p)     1          1
Order Bias (q)         2          2
BW Loc Poly (h)        20.0000    20.0000
BW Bias (b)            20.0000    20.0000
rho (h/b)              1.0000     1.0000

Estimates:
               Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional   2.9271    1.2345      2.3710    0.0177    0.5074      5.3467
Robust         -         -           -         0.1018    -0.5822     6.4710

Analogous Stata command
. rdrobust Y X, kernel(uniform) p(1) h(20)

The output includes many details. The four rows in the uppermost panel indicate that the total number of observations in the dataset is 2,629, the bandwidth was chosen manually as opposed to using an optimal data-driven algorithm, and the observations were weighted with a uniform kernel. The final line indicates that the variance-covariance estimator (VCE) was constructed using nearest-neighbor (NN) estimators instead of sums of squared residuals (this default behavior can be changed with the option vce). We discuss details of variance estimation further below in the context of RD inference. The middle panel resembles the output of rdplot in that it is divided in two columns that give information separately for the observations above (Right) and below (Left) the cutoff. The


first row shows that the 2,629 observations are split into 2,314 (control) observations below the cutoff and 315 (treated) observations above the cutoff. The second row shows the effective number of observations, which refers to the number of observations with scores within distance h of the cutoff, and therefore effectively used in the estimation of the RD effect. In other words, these are the observations with Xi ∈ [x̄ − h, x̄ + h]. The output indicates that there are 608 observations with Xi ∈ [x̄ − h, x̄), and 280 observations with Xi ∈ [x̄, x̄ + h]. The third line shows the order of the local polynomial used to estimate the main RD effect, τSRD, which in this case is equal to p = 1. The bandwidth used to estimate τSRD is shown on the fifth line, BW Loc Poly (h), where we see that the same bandwidth h = 20 was used to the left and right of the cutoff (below we illustrate how to allow for different bandwidths on either side of the cutoff). We defer discussion of Order Bias (q), BW Bias (b), and rho (h/b) until we discuss methods for inference.

Finally, the last panel shows the estimation results. The point estimator is reported in the Coef column on the first row. The estimated RD treatment effect is τ̂SRD = 2.9271, indicating that in municipalities where the Islamic party barely wins, the female high school attainment share is about 3 percentage points higher than in municipalities where the party barely lost. As expected, this number is identical to the number we obtained with the least-squares command lm.

The rdrobust routine also allows us to easily estimate the RD effect using triangular instead of uniform kernel weights.

> rdrobust(Y, X, kernel = "triangular", p = 1, h = 20)

Call: rdrobust(y = Y, x = X, p = 1, h = 20, kernel = "triangular")

Summary:
Number of Obs    2629
BW Type          Manual
Kernel Type      Triangular
VCE Type         NN

                       Left       Right
Number of Obs          2314       315
Eff. Number of Obs     608        280
Order Loc Poly (p)     1          1
Order Bias (q)         2          2
BW Loc Poly (h)        20.0000    20.0000
BW Bias (b)            20.0000    20.0000
rho (h/b)              1.0000     1.0000

Estimates:
               Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional   2.9373    1.3429      2.1872    0.0287    0.3052      5.5694
Robust         -         -           -         0.1680    -1.1166     6.4140

Analogous Stata command
. rdrobust Y X, kernel(triangular) p(1) h(20)


Once again, this produces the same coefficient of 2.9373 that we found when we used the weighted least-squares command with triangular weights. We postpone the discussion of standard errors, confidence intervals, and the distinction between the Conventional and Robust results until we discuss methods for inference.

Finally, if we wanted to reduce the approximation error in the estimation of the RD effect, we could increase the order of the polynomial and use a local quadratic fit instead of a local linear one. This can be implemented in rdrobust by setting p=2.

> rdrobust(Y, X, kernel = "triangular", p = 2, h = 20)

Call: rdrobust(y = Y, x = X, p = 2, h = 20, kernel = "triangular")

Summary:
Number of Obs    2629
BW Type          Manual
Kernel Type      Triangular
VCE Type         NN

                       Left       Right
Number of Obs          2314       315
Eff. Number of Obs     608        280
Order Loc Poly (p)     2          2
Order Bias (q)         3          3
BW Loc Poly (h)        20.0000    20.0000
BW Bias (b)            20.0000    20.0000
rho (h/b)              1.0000     1.0000

Estimates:
               Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional   2.6487    1.9211      1.3787    0.1680    -1.1166     6.4140
Robust         -         -           -         0.6743    -3.9688     6.1350

Analogous Stata command
. rdrobust Y X, kernel(triangular) p(2) h(20)

Note that the estimated effect changes from 2.9373 with p = 1 to 2.6487 with p = 2. It is not unusual to observe a change in the point estimate as one changes the polynomial order used in the estimation. Unless the higher order terms in the approximation are exactly zero, incorporating those terms in the estimation will reduce the approximation error and thus lead to changes in the estimated effect. The relevant practical question is whether such changes in the point estimator change the conclusions of the study. For that, we need to consider inference as well as estimation procedures, a topic we discuss in the upcoming sections. Choosing an ad-hoc bandwidth as shown in the previous commands is not advisable. It is unclear what the value h = 20 means in terms of bias and variance properties, or whether this is the best approach for estimation and inference. The command rdbwselect, which is part of the rdrobust


package, implements optimal, data-driven bandwidth selection methods. We illustrate the use of rdbwselect by selecting an MSE-optimal bandwidth for the local linear estimator of τSRD.

> rdbwselect(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")

Call: rdbwselect(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

BW Selector      mserd
Number of Obs    2629
NN Matches       3
Kernel Type      Triangular

                       Left    Right
Number of Obs          2314    315
Order Loc Poly (p)     1       1
Order Bias (q)         2       2

          h (left)    h (right)    b (left)    b (right)
mserd     17.23947    17.23947     28.57543    28.57543

Analogous Stata command
. rdbwselect Y X, kernel(triangular) p(1) bwselect(mserd)

The MSE-optimal bandwidth choice depends on the choice of polynomial order and kernel function, which is why both have to be specified in the call to rdbwselect. The first output line indicates the type of bandwidth selector; in this case, it is MSE-optimal (mserd). The type of kernel used is also reported, as is the total number of observations. The middle panel reports the number of observations on each side of the cutoff, and the order of the polynomial chosen for estimation of the RD effect—the Order Loc Poly (p) row. We postpone discussion of the Order Bias (q) results until we discuss inference. In the bottom panel, we see the estimated optimal bandwidth choices. The bandwidth h refers to the bandwidth used to estimate the RD effect τSRD; we sometimes refer to it as the main bandwidth. The bandwidth b is an additional bandwidth used to estimate a bias term that is needed for robust inference; we omit discussion of b until the following sections. As shown, the estimated MSE-optimal bandwidth for the local linear RD point estimator with triangular kernel weights is 17.23947.

The option bwselect = "mserd" imposes the same bandwidth h on each side of the cutoff, that is, it uses the neighborhood [x̄ − h, x̄ + h]. This is why the columns h (left) and h (right) have the same value, 17.23947. If instead we wish to allow the bandwidth to be different on each side of the cutoff and estimate the RD effect in the neighborhood [x̄ − h_left, x̄ + h_right], we can choose two MSE-optimal bandwidths by using the bwselect = "msetwo" option.

> rdbwselect(Y, X, kernel = "triangular", p = 1, bwselect = "msetwo")

Call: rdbwselect(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "msetwo")

BW Selector      msetwo
Number of Obs    2629
NN Matches       3
Kernel Type      Triangular

                       Left    Right
Number of Obs          2314    315
Order Loc Poly (p)     1       1
Order Bias (q)         2       2

           h (left)    h (right)    b (left)    b (right)
msetwo     19.96678    17.35913     32.27761    29.72832

Analogous Stata command
. rdbwselect Y X, kernel(triangular) p(1) bwselect(msetwo)

This leads to a bandwidth of 19.96678 on the control side, and a bandwidth of 17.35913 on the treated side. Once we select the MSE-optimal bandwidth(s), we could pass them to the function rdrobust using the option h. But it is much easier to use the option bwselect in rdrobust. When we use this option, rdrobust calls rdbwselect internally, selects the bandwidth as requested, and then uses the optimally chosen bandwidth to estimate the RD effect. In order to perform bandwidth selection and point estimation in one step, using p = 1 and triangular kernel weights, we use the rdrobust command.

> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")

Call: rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

Summary:
Number of Obs    2629
BW Type          mserd
Kernel Type      Triangular
VCE Type         NN

                       Left       Right
Number of Obs          2314       315
Eff. Number of Obs     529        266
Order Loc Poly (p)     1          1
Order Bias (q)         2          2
BW Loc Poly (h)        17.2395    17.2395
BW Bias (b)            28.5754    28.5754
rho (h/b)              0.6033     0.6033

Estimates:
               Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional   3.0195    1.4271      2.1159    0.0344    0.2225      5.8165
Robust         -         -           -         0.0758    -0.3093     6.2758

Analogous Stata command
. rdrobust Y X, kernel(triangular) p(1) bwselect(mserd)

When the same MSE-optimal bandwidth is used on both sides of the cutoff, the effect of a bare Islamic victory on the female educational attainment share is 3.0195, slightly larger than the 2.9373 effect that we found above when we used the ad-hoc bandwidth of 20. We can also explore the rdrobust output to obtain the estimates of the average outcome at the cutoff separately for treated and control observations.

> rdout = rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")
> print(names(rdout))
 [1] "tabl1.str" "tabl2.str" "tabl3.str" "N" "N_l" "N_r" "N_h_l" "N_b_l"
     "N_b_r" "c" "p" "q" "h_l" "h_r" "b_l" "b_r" "tau_cl" "tau_bc"
     "se_tau_cl" "se_tau_rb" "bias_l" "bias_r" "beta_p_l" "beta_p_r"
[25] "V_cl_l" "V_cl_r" "V_rb_l" "V_rb_r" "coef" "bws" "se" "z" "pv" "ci" "call"
> print(rdout$beta_p_r)
           [,1]
[1,] 15.6649438
[2,] -0.1460846
> print(rdout$beta_p_l)
           [,1]
[1,] 12.6454218
[2,] -0.2477231

Analogous Stata command
. rdrobust Y X
. ereturn list

We see that the RD effect of 3.0195 percentage points in the female high school attainment share is the difference between a share of 15.6649438 percent in municipalities where the Islamic party barely wins and a share of 12.6454218 percent in municipalities where the Islamic party barely loses—that is, 15.6649438 − 12.6454218 ≈ 3.0195. By accessing the control mean at the cutoff in this way, we learn that the RD effect represents an increase of (3.0195/12.6454218) × 100 ≈ 23.87 percent relative to the control mean.

This effect, together with the means at either side of the cutoff, can be easily illustrated with rdplot, using the options h, p, and kernel to impose exactly the same specification used in rdrobust and produce an exact illustration of the RD effect.

> bandwidth = rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")$h_l
> out = rdplot(Y[abs(X) <= bandwidth], X[abs(X) <= bandwidth],
+   p = 1, kernel = "triangular")
> print(out)

Call: rdplot(y = Y[abs(X) <= bandwidth], x = X[abs(X) <= bandwidth], p = 1, kernel = "triangular")


Method:
                             Left      Right
Number of Obs.               529       266
Polynomial Order             1         1
Scale                        4         6
Selected Bins                19        17
Average Bin Length           0.9066    1.0028
Median Bin Length            0.9066    1.0028
IMSE-optimal bins            5         3
Mimicking Variance bins      19        17

Relative to IMSE-optimal:
Implied scale                3.8000    5.6667
WIMSE variance weight        0.0179    0.0055
WIMSE bias weight            0.9821    0.9945

Analogous Stata command
. rdrobust Y X, p(1) kernel(triangular) bwselect(mserd)
. local bandwidth=e(h_l)
. rdplot Y X if abs(X)<=`bandwidth', p(1) h(`bandwidth') kernel(triangular)


Figure 4.4: Local Polynomial RD Effect Illustrated with rdplot—Meyersson Data


Finally, we note that by default, all MSE-optimal bandwidth selectors in rdrobust include in


the denominator the regularization term that we discussed in subsection 4.2.2. We can exclude the regularization term with the option scaleregul=0 in the rdrobust (or rdbwselect) call.

> rdrobust(Y, X, kernel = "triangular", scaleregul = 0, p = 1,
+   bwselect = "mserd")

Call: rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd", scaleregul = 0)

Summary:
Number of Obs    2629
BW Type          mserd
Kernel Type      Triangular
VCE Type         NN

                       Left       Right
Number of Obs          2314       315
Eff. Number of Obs     1152       305
Order Loc Poly (p)     1          1
Order Bias (q)         2          2
BW Loc Poly (h)        34.9830    34.9830
BW Bias (b)            46.2326    46.2326
rho (h/b)              0.7567     0.7567

Estimates:
               Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional   2.8432    1.1098      2.5619    0.0104    0.6681      5.0184
Robust         -         -           -         0.0171    0.5964      6.1039

Analogous Stata command
. rdrobust Y X, kernel(triangular) p(1) bwselect(mserd) scaleregul(0)

In this application, excluding the regularization term has a very large impact on the estimated hMSE. With regularization, ĥMSE is 17.2395, while excluding regularization increases it to 34.98296, an increase of roughly 100%. Nevertheless, the point estimate remains relatively stable, moving from 3.0195 with regularization to 2.8432 without regularization.

4.3   Local Polynomial Inference

In addition to providing a local polynomial point estimator of the RD treatment effect, we are interested in testing hypotheses and constructing confidence intervals. At first glance, it seems that ordinary least squares (OLS) inference methods could be used since, as we have seen, local polynomial estimation involves simply fitting two weighted least-squares regressions within a region near x̄ controlled by the bandwidth. However, relying on OLS methods for inference would treat the local polynomial regression model as correctly specified (i.e., parametric), and de facto disregard its fundamental approximation (i.e., nonparametric) nature. To put things differently, it is intellectually and methodologically incoherent to simultaneously select a bandwidth according to a bias-variance


trade-off but then proceed as if the bias were zero, that is, as if the local polynomial fit were correctly specified and no misspecification error existed. In fact, if the model were indeed correctly specified, then the full support of the data should be used (i.e., a very large bandwidth), a procedure that in most applications will lead to severely biased RD point estimators. These considerations imply that valid inference should take into account the degree of misspecification or, alternatively, whether the bandwidth is too large for hypothesis testing and confidence interval estimation. For example, the MSE-optimal bandwidths discussed previously (hMSE, hMSE,−, hMSE,+), and many other variants thereof, result in an RD point estimator that is both consistent and optimal in an MSE sense. However, inferences based on hMSE present a challenge because this bandwidth choice is by construction not "small" enough to remove the leading bias term in the standard distributional approximations used to conduct statistical inference. Heuristically, because these bandwidth choices are developed for point estimation purposes, they pay no attention to their effects on the distributional properties of typical t-tests or related statistics. Thus, constructing confidence intervals using standard OLS large-sample results with the data with Xi ∈ [x̄ − hMSE, x̄ + hMSE] will result in invalid inferences.

There are several approaches that one could take to address this difficulty. One approach is to use hMSE only for point estimation, and then choose a different bandwidth for inference purposes. This method requires selecting a smaller bandwidth than hMSE and then recomputing the point estimator and standard error to conduct inference. This approach is called undersmoothing; it requires using more observations for point estimation than for inference, which may sometimes be regarded as problematic and will certainly lead to a loss of statistical power. Another approach is to retain the same bandwidth hMSE for both estimation and inference, but with a modified test statistic that accounts for the effects of the misspecification due to the large bandwidth being used, as well as the additional sampling error introduced by such a modification. This approach is called robust bias correction and has the advantage that the same observations can be used for both estimation and inference, thereby leading to more powerful statistical methods. We discuss both approaches in detail below, and also briefly elaborate on further refinements and extensions to the latter approach, leading to more robust and powerful inference procedures.

4.3.1 Using the MSE-Optimal Bandwidth for Inference

The MSE-optimal bandwidth hMSE , or any of the other similar choices, is the bandwidth that minimizes the asymptotic MSE of the point estimator τˆSRD , and is by now the most popular benchmark choice in practice. But the optimal properties of such an MSE-optimal choice for point estimation purposes do not guarantee valid statistical inference based on large sample distributional approximations. We now discuss how to make valid inferences when the bandwidth choice is hMSE , or some data-driven implementation thereof.


The local polynomial RD point estimator τ̂SRD has the approximate large-sample distribution
\[
\frac{\hat{\tau}_{\mathtt{SRD}} - \tau_{\mathtt{SRD}} - \mathsf{B}}{\sqrt{\mathsf{V}}} \;\overset{a}{\sim}\; N(0,1),
\]
where B and V are, respectively, the asymptotic bias and variance of the RD local polynomial estimator of order p, discussed previously in the context of MSE expansions and bandwidth selection. This distributional result is similar to those encountered in, for example, standard linear regression problems, with the important distinction that now the bias term B features explicitly. In fact, the variance term V can be calculated as in (weighted) least-squares problems, for instance accounting for heteroskedasticity and/or clustered data. We do not provide the exact formulas for variance estimation to save space and notation, but these formulas can be found in the references given at the end of this section and are all implemented in the RD software available and discussed in the empirical illustration further below. Given the distributional approximation for the RD local polynomial estimator, an asymptotic 95-percent confidence interval for τSRD is approximately given by
\[
\mathtt{CI} = \left[\,(\hat{\tau}_{\mathtt{SRD}} - \mathsf{B}) \;\pm\; 1.96\cdot\sqrt{\mathsf{V}}\,\right].
\]
Such confidence intervals depend on the unknown bias or misspecification error B, and any practical procedure that ignores it will lead to incorrect inferences unless this term is negligible (i.e., unless the local linear regression model is correctly specified). In other words, the bias term arises because the local polynomial approach is a nonparametric approximation: instead of assuming that the underlying regression functions are p-th order polynomials (as would occur in OLS estimation), this approach uses the polynomial to approximate the unknown regression functions. The degree of misspecification error is controlled by the choice of bandwidth, with larger biases the larger the bandwidth(s) used. Thus, the large-sample distributional approximation naturally includes the term B to highlight the fact that there is a trade-off between bandwidth choice and misspecification bias locally to the cutoff. As already anticipated, different strategies have been proposed and employed to make inferences based on asymptotic distributional approximations for τ̂SRD in the presence of nonparametric misspecification biases. Some strategies are invalid in general, some are theoretically sound but lead to ad hoc procedures with poor properties and performance in applications, and others are both theoretically valid and perform very well in practice. We now discuss some of these approaches in detail and explain their relative merits. An approach sometimes found in RD empirical work is to ignore the misspecification error even when an MSE-optimal bandwidth is used. This empirical approach is not only invalid but also methodologically incoherent: an MSE-optimal bandwidth cannot be selected in the absence of misspecification error (non-zero bias), and statistical inference based on standard OLS methods (ignoring the bias) cannot be valid when an MSE-optimal bandwidth is employed. Thus, the researcher


must make a choice: either an ad hoc bandwidth is used, assuming that it is small enough so that the misspecification error can be ignored, or the misspecification error must be accounted for explicitly when conducting inference. Notice that the former approach leads to valid inference (because a smaller-than-MSE-optimal bandwidth is used), but the resulting RD treatment effect point estimator is no longer MSE-optimal, and thus a sub-optimal treatment effect estimator is reported. To be more specific, the naïve approach to statistical inference that ignores the effects of MSE-optimal bandwidth selection and misspecification bias treats the local polynomial approach as parametric within the neighborhood around the cutoff and de facto ignores the bias term, a procedure that leads to invalid inferences in all cases except when the approximation error is so small that it can be ignored. When the bias term is zero, the approximate distribution of the RD estimator is
\[
\frac{\hat{\tau}_{\mathtt{SRD}} - \tau_{\mathtt{SRD}}}{\sqrt{\mathsf{V}}} \;\overset{a}{\sim}\; N(0,1),
\]
and the confidence interval is
\[
\mathtt{CI}_{\mathtt{us}} = \left[\,\hat{\tau}_{\mathtt{SRD}} \;\pm\; 1.96\cdot\sqrt{\mathsf{V}}\,\right].
\]

Since this is the same confidence interval that follows from parametric least-squares estimation, we refer to it as conventional. Using the conventional confidence interval CIus is equivalent to assuming that the chosen polynomial gives an exact approximation to the true functions E[Yi(1)|Xi] and E[Yi(0)|Xi]. Since these functions are unknown, this assumption is not verifiable and will rarely be credible. If researchers use CIus when in fact the approximation error is non-negligible, all inferences will be incorrect, leading to under-coverage of the true treatment effect or, equivalently, over-rejection of the null hypothesis of zero treatment effect. For this reason, we strongly discourage researchers from using conventional inference with local polynomial methods, unless the misspecification bias can credibly be assumed small (ruling out, in particular, the use of MSE-optimal bandwidth choices). A common mistake in practice is to employ CIus even when first-order misspecification errors are present in the distributional approximation due to the (too large) bandwidth choice used. A theoretically sound but ad hoc alternative procedure is to use these conventional confidence intervals with an "undersmoothed" bandwidth relative to the one used for point estimation (i.e., for constructing the point estimator τ̂SRD in the first place). Practically, the procedure involves selecting a bandwidth smaller than the MSE-optimal choice and then constructing the conventional confidence intervals CIus with this smaller bandwidth (both a new point estimator and a new standard error are estimated). The theoretical justification is that, for bandwidths smaller than the MSE-optimal choice, the bias term becomes negligible in the large-sample distributional approximation. The main drawback of this procedure is that there are no clear and transparent criteria for shrinking the bandwidth below the MSE-optimal value: some researchers might estimate the MSE-optimal choice and divide it by two, others may choose to divide it by three, and yet others may decide to subtract a small number ε from it. Although these procedures can be justified in a strictly theoretical sense, they are all ad hoc and can result in a lack of transparency and in specification searching. Moreover, this general strategy leads to a loss of statistical power because a smaller bandwidth


results in fewer observations used for estimation and inference. Finally, from a substantive perspective, some researchers would not like to use different observations for estimation and inference, which is required by any undersmoothing approach. As an alternative to undersmoothing the bandwidth, inference could be based on the MSE-optimal bandwidth so long as the induced misspecification error is explicitly estimated and removed from the distributional approximation. This approach, known as bias correction, first estimates the bias term B with the estimator B̂, which in fact was already constructed for data-driven MSE-optimal bandwidth selection, and then builds confidence intervals that are centered at the bias-corrected point estimate:
\[
\mathtt{CI}_{\mathtt{bc}} = \left[\,(\hat{\tau}_{\mathtt{SRD}} - \hat{\mathsf{B}}) \;\pm\; 1.96\cdot\sqrt{\mathsf{V}}\,\right].
\]

As explained above, the bias term depends on the "curvature" of the unknown regression functions, captured via their derivative of order p + 1 at the cutoff. This unknown feature of the underlying data generating process can be estimated with a local polynomial of order q = p + 1 or higher, and another choice of bandwidth, denoted b. Therefore, the main point estimate employs the bandwidth h, for example chosen in an MSE-optimal way for RD point estimation, while the bias correction estimate employs the additional bandwidth b, which can be chosen in different ways. The ratio ρ = h/b relates to the variability of the bias correction estimate relative to the point estimator itself, and standard bias correction methods require ρ = h/b → 0, that is, a small ρ. Note that ρ = h/b = 1 (h = b) is not allowed by this method. The bias-corrected confidence intervals allow for a wider range of bandwidths h and, in particular, result in valid inferences when the MSE-optimal bandwidth is used. However, these confidence intervals typically have poor performance in applications. The reason is that the variability introduced in the bias estimation step is not incorporated in the variance term used: the same standard errors as in CIus are employed despite the fact that the additional estimated term B̂ now features in the construction of the confidence intervals CIbc, which results in a poor distributional approximation and hence important coverage distortions in practice. A superior strategy that is both theoretically sound and leads to excellent coverage in finite samples is to use robust bias correction for constructing confidence intervals. This approach leads to demonstrably superior inference procedures: for example, the coverage error and average length of these confidence intervals are improved relative to those associated with either CIus or CIbc. Furthermore, the robust bias correction approach delivers valid inferences even when the MSE-optimal bandwidth for point estimation is used—no undersmoothing is necessary—and remains valid even when ρ = h/b = 1 (h = b), which implies that exactly the same data can be used both for point estimation and for statistical inference. Robust bias-corrected confidence intervals are based on the bias correction procedure described above, by which the estimated bias term B̂ is removed from the RD point estimator. But, in contrast to CIbc, the derivation of the robust bias-corrected confidence intervals allows the estimated bias


term to converge in distribution to a random variable and thus contribute to the distributional approximation of the RD point estimator. This results in a new asymptotic variance Vbc that, unlike the variance V used in CIus and CIbc, incorporates the contribution of the bias correction step to the variability of the bias-corrected point estimator. Because the new variance Vbc incorporates the extra variability introduced in the bias estimation step, it is larger than the conventional OLS variance V when the same bandwidth is used. This approach leads to the robust bias-corrected confidence intervals
\[
\mathtt{CI}_{\mathtt{rbc}} = \left[\,(\hat{\tau}_{\mathtt{SRD}} - \hat{\mathsf{B}}) \;\pm\; 1.96\cdot\sqrt{\mathsf{V}_{\mathtt{bc}}}\,\right],
\]

which are constructed by subtracting the bias estimate from the local polynomial estimator and using the new variance formula for Studentization. Note that CIrbc is centered around the bias-corrected point estimate, τ̂SRD − B̂, not around the uncorrected estimate τ̂SRD. These robust confidence intervals result in valid inferences when the MSE-optimal bandwidth is used, because they have smaller coverage errors and are therefore less sensitive to tuning parameter choices. In practice, the confidence intervals can be implemented by setting ρ = h/b = 1 (h = b) and choosing h = hMSE, or by selecting both h and b to be MSE-optimal for the corresponding estimators, in which case ρ is set to hMSE/bMSE or to its data-driven implementation. We summarize the differences between the three types of confidence intervals discussed in Table 4.1. The conventional OLS confidence intervals CIus ignore the bias term and are thus centered at the local polynomial point estimator τ̂SRD, and use the conventional standard error √V̂. The bias-corrected confidence intervals CIbc remove the bias estimate from the conventional point estimator, and are therefore centered at τ̂SRD − B̂; these bias-corrected confidence intervals, however, ignore the variability introduced in the bias correction step and thus continue to use the standard error √V̂, which is the same standard error used by the conventional confidence intervals CIus. The robust bias-corrected confidence intervals are also centered at the bias-corrected point estimator τ̂SRD − B̂ but, in contrast to CIbc, they employ a different standard error, √V̂bc, which is larger than the conventional standard error √V̂ when the same bandwidth h is used. However, as discussed above, if h = hMSE then CIus are invalid confidence intervals!

Table 4.1: Local Polynomial Confidence Intervals

                                   Centered at      Standard Error
  Conventional: CIus               τ̂SRD             √V̂
  Bias Corrected: CIbc             τ̂SRD − B̂         √V̂
  Robust Bias Corrected: CIrbc     τ̂SRD − B̂         √V̂bc

Relative to the conventional confidence intervals, the robust bias-corrected confidence intervals are both re-centered and re-scaled. This implies that CIrbc are not centered at the conventional point estimator τˆSRD and, in fact, the RD point estimator does not need to be within the interval


CIrbc. This illustrates some of the fundamental conceptual differences between point estimation and confidence interval estimation. Nevertheless, in practice, the RD point estimator will often be covered by the robust bias-corrected confidence interval, and when it is not, this can be taken as evidence of fundamental misspecification of the underlying local polynomial estimators. From a practical perspective, the most important feature of the robust bias-corrected confidence intervals CIrbc is that they can be used along with the MSE-optimal point estimator τ̂SRD when constructed using the MSE-optimal bandwidth choice hMSE. In other words, the same observations with score Xi ∈ [x̄ − hMSE, x̄ + hMSE] are used both for optimal point estimation and for valid statistical inference.
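For concreteness, the following minimal sketch (using the Meyersson data and the rdrobust options already introduced) shows how the choices discussed in this subsection could be implemented by hand. The halving of the bandwidth is purely illustrative of the ad hoc undersmoothing rules mentioned above, and the value 17.2395 is the MSE-optimal bandwidth reported earlier; neither choice is a recommendation.

# Ad hoc undersmoothing: conventional inference with half the
# MSE-optimal bandwidth (an arbitrary choice, shown only for illustration)
rdrobust(Y, X, kernel = "triangular", p = 1, h = 17.2395 / 2)

# Robust bias correction with manually chosen bandwidths, forcing
# rho = h/b = 1 by setting the bias bandwidth b equal to h
rdrobust(Y, X, kernel = "triangular", p = 1, h = 17.2395, b = 17.2395)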

4.3.2 Using Different Bandwidths for Point Estimation and Inference

Conceptually, the invalidity of the conventional confidence intervals CIus based on the MSE-optimal bandwidth hMSE stems from using for inference a bandwidth that was optimally chosen for point estimation purposes. Using hMSE for estimation of the RD effect τSRD results in a point estimator τ̂SRD that is not only consistent but also has minimal asymptotic MSE. Thus, from a point estimation perspective, hMSE leads to highly desirable properties of the RD treatment effect estimator. In contrast, serious methodological challenges arise when researchers attempt to use hMSE for building confidence intervals and making inferences in the standard parametric way, because the MSE-optimal bandwidth choice is not designed with the goal of ensuring good (or even valid) distributional approximations. Robust bias correction restores a valid standard normal distributional approximation when hMSE is used by recentering and rescaling the usual t-statistic to, respectively, remove the underlying misspecification bias and account for the additional variability introduced by the bias correction estimate. Thus, by using the robust bias-corrected confidence intervals CIrbc, researchers can use the same bandwidth hMSE for both point estimation and inference. While employing the MSE-optimal bandwidth for both optimal point estimation and valid statistical inference is certainly useful and practically relevant, it may be important to also consider statistical inference that is optimal. A natural optimality criterion associated with robustness properties of confidence intervals is the minimization of their coverage error. For confidence intervals, this is the analogue of minimizing the MSE of a point estimator. Thus, an alternative is to decouple the goal of point estimation from the goal of inference, and to use a different bandwidth for each case. In particular, this strategy involves estimating the RD effect with hMSE and constructing confidence intervals using a different bandwidth, where the latter is specifically derived to provide optimal inference properties. In fact, h can be chosen to minimize an approximation to the coverage error of the confidence intervals CIrbc, that is, the discrepancy between the empirical coverage of the confidence interval and its nominal level. For example, if a 95% confidence interval contains the true parameter 80% of the time, the coverage error is 15 percentage points. Therefore, while hMSE minimizes the asymptotic MSE of the point estimator τ̂SRD, the


CER-optimal bandwidth hCER minimizes the asymptotic coverage error rate of the robust bias-corrected confidence interval for τSRD. This bandwidth cannot be obtained in closed form, but it can be shown that it has a faster rate of decay than hMSE, which implies that for all practically relevant sample sizes hCER < hMSE. By design, choosing h = hCER and then using that bandwidth choice to construct CIrbc leads to confidence intervals that are not only valid but also have the fastest rate of coverage error decay. Furthermore, it follows that using hCER for point estimation will result in a point estimator that has too much variability relative to its bias and is therefore not MSE-optimal (but nonetheless valid). It is best practice to continue to use hMSE for MSE-optimal point estimation of τSRD; researchers can then either use the same bandwidth or hCER to build the confidence intervals CIrbc, where the resulting confidence intervals will be valid (former case) or CER-optimal (latter case).
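As a rough guide to how much smaller hCER tends to be (a heuristic sketch omitting constants; see the references in Section 4.6 for precise statements), in the local linear case p = 1 the two bandwidths shrink with the sample size at approximately the rates
\[
h_{\mathtt{MSE}} \;\asymp\; n^{-1/5} \qquad \text{versus} \qquad h_{\mathtt{CER}} \;\asymp\; n^{-1/4},
\]
which is consistent with the data-driven CER-optimal bandwidth being smaller than the MSE-optimal one in the empirical illustration that follows.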

4.3.3 Statistical Inference in Practice

We can now discuss the full output of our previous call to rdrobust with p = 1 and triangular kernel, which we reproduce below.

> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd")

Call: rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

Summary:
  Number of Obs            2629
  BW Type                 mserd
  Kernel Type        Triangular
  VCE Type                   NN

                         Left    Right
  Number of Obs          2314      315
  Eff. Number of Obs      529      266
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2
  BW Loc Poly (h)     17.2395  17.2395
  BW Bias (b)         28.5754  28.5754
  rho (h/b)            0.6033   0.6033

Estimates:
                  Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
  Conventional  3.0195      1.4271   2.1159   0.0344     0.2225     5.8165
  Robust                                      0.0758    -0.3093     6.2758

Analogous Stata command:
. rdrobust Y X, kernel(triangular) p(1) bwselect(mserd)

As reported before, the local linear RD effect estimate is 3.0195, which is estimated within the MSE-optimal bandwidth of 17.2395. The last output panel provides all the information necessary to make inferences. The row labeled Conventional reports, in addition to the point estimator τ̂SRD, the conventional standard error √V̂, the standardized test statistic (τ̂SRD − τSRD)/√V̂, the corresponding p-value, and the 95% conventional confidence interval CIus. This confidence interval ranges from 0.2225 to 5.8165 percentage points, suggesting a positive effect of an Islamic victory on the female education share. Note that CIus is centered around the conventional point estimator τ̂SRD:

3.0195 + 1.4271 · 1.96 = 5.816616 ≈ 5.8165
3.0195 − 1.4271 · 1.96 = 0.222384 ≈ 0.2225

where the differences are due to rounding error.

The row labeled Robust reports the robust bias-corrected confidence interval CIrbc. In contrast to CIus, CIrbc is centered around the point estimator τ̂SRD − B̂ (which is by default not reported), and scaled by the robust standard error √V̂bc (not reported either). CIrbc ranges from -0.3093 to 6.2758 and, in contrast to the conventional confidence interval, it does include zero. As expected, CIrbc is not centered at τ̂SRD. Also, its length is longer than the length of CIus:

Length of CIus = 5.8165 − 0.2225 = 5.594
Length of CIrbc = 6.2758 − (−0.3093) = 6.5851.

For a fixed common bandwidth, the length of CIrbc is always greater than the length of CIus because √V̂bc > √V̂. However, this will not necessarily be true if different bandwidths are used to construct each confidence interval. The omission of the bias-corrected point estimator that is at the center of CIrbc from the rdrobust output is intentional: the bias-corrected estimator is suboptimal relative to τ̂SRD, in terms of point estimation properties, when the MSE-optimal bandwidth for τ̂SRD is used. (The bias-corrected estimator is nevertheless always consistent and valid whenever τ̂SRD is.) Practically, it is usually desirable to report an MSE-optimal point estimator and then form valid confidence intervals either with the same MSE-optimal bandwidth or with some other optimal choice specifically tailored for inference.

In order to see all the ingredients that go into building the robust confidence intervals, we can use the all option in rdrobust.

> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd",
+   all = TRUE)

Call: rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "mserd",
    all = TRUE)

Summary:
  Number of Obs            2629
  BW Type                 mserd
  Kernel Type        Triangular
  VCE Type                   NN

                         Left    Right
  Number of Obs          2314      315
  Eff. Number of Obs      529      266
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2
  BW Loc Poly (h)     17.2395  17.2395
  BW Bias (b)         28.5754  28.5754
  rho (h/b)            0.6033   0.6033

Estimates:
                    Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
  Conventional    3.0195      1.4271   2.1159   0.0344     0.2225     5.8165
  Bias-Corrected  2.9832      1.4271   2.0905   0.0366     0.1862     5.7802
  Robust          2.9832      1.6799   1.7758   0.0758    -0.3093     6.2758

Analogous Stata command:
. rdrobust Y X, kernel(triangular) p(1) bwselect(mserd) all

The three rows in the bottom output panel are analogous to the rows in Table 4.1: the Conventional row reports CIus, the Bias-Corrected row reports CIbc, and the Robust row reports CIrbc. We can see that the standard error used by CIus and CIbc is the same (√V̂ = 1.4271), while CIrbc uses a different standard error (√V̂bc = 1.6799). We also see that the conventional confidence interval is centered at the conventional, non-bias-corrected point estimator 3.0195, while both CIbc and CIrbc are centered at the bias-corrected point estimator 2.9832. Since we know that τ̂SRD = 3.0195 and τ̂SRD − B̂ = 2.9832, we can deduce that the bias estimate is B̂ = 3.0195 − 2.9832 = 0.0363.

Finally, we investigate the properties of robust bias-corrected inference when employing a CER-optimal bandwidth choice. This is obtained via rdrobust with the option bwselect = "cerrd".

> rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "cerrd")

Call: rdrobust(y = Y, x = X, p = 1, kernel = "triangular", bwselect = "cerrd")

Summary:
  Number of Obs            2629
  BW Type                 cerrd
  Kernel Type        Triangular
  VCE Type                   NN

                         Left    Right
  Number of Obs          2314      315
  Eff. Number of Obs      360      216
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2
  BW Loc Poly (h)     11.6288  11.6288
  BW Bias (b)         28.5754  28.5754
  rho (h/b)            0.4070   0.4070

Estimates:
                  Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
  Conventional  2.4298      1.6824   1.4443   0.1487    -0.8676     5.7272
  Robust                                      0.1856    -1.1583     5.9795

Analogous Stata command:
. rdrobust Y X, kernel(triangular) p(1) bwselect(cerrd)

The common bandwidth used for both control and treatment units is hCER = 11.6288, which is smaller than the MSE-optimal bandwidth previously employed, hMSE = 17.2395. The results are qualitatively similar, but now with a larger p-value, as the nominal 95% robust bias-corrected confidence interval changes from [−0.3093, 6.2758] when using the MSE-optimal bandwidth to [−1.1583, 5.9795] when using the CER-optimal bandwidth. The RD point estimator changes from the MSE-optimal value 2.9832 to the undersmoothed value 2.4298, where the latter RD estimate can be interpreted as having less bias but more variability than the former. Since the change in bandwidth choice from MSE-optimal to CER-optimal is practically important, as is the choice between a common bandwidth and two different bandwidths, we conclude this section with a report of all the bandwidth choices available in the RD software employed. This is obtained using the all option in the rdbwselect command.

> rdbwselect(Y, X, kernel = "triangular", p = 1, all = TRUE)

Call: rdbwselect(y = Y, x = X, p = 1, kernel = "triangular", all = TRUE)

  BW Selector               All
  Number of Obs            2629
  NN Matches                  3
  Kernel Type        Triangular

                         Left    Right
  Number of Obs          2314      315
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2

              h (left)   h (right)   b (left)   b (right)
  mserd       17.23947    17.23947   28.57543    28.57543
  msetwo      19.96678    17.35913   32.27761    29.72832
  msesum      17.77206    17.77206   30.15343    30.15343
  msecomb1    17.23947    17.23947   28.57543    28.57543
  msecomb2    17.77206    17.35913   30.15343    29.72832
  cerrd       11.62878    11.62878   28.57543    28.57543
  certwo      13.46848    11.70950   32.27761    29.72832
  cersum      11.98804    11.98804   30.15343    30.15343
  cercomb1    11.62878    11.62878   28.57543    28.57543
  cercomb2    11.98804    11.70950   30.15343    29.72832

Analogous Stata command:
. rdbwselect Y X, kernel(triangular) p(1) all
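Rather than reading numbers off the printed summaries, the point estimates, standard errors, and confidence intervals can also be extracted from the object returned by rdrobust. The following minimal sketch illustrates this; the element names coef, se, and ci, and the ordering of the rows, are assumptions about the structure of the returned list and should be confirmed with str() in the installed version of the package.

est <- rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "mserd", all = TRUE)
str(est, max.level = 1)        # inspect the components of the returned object

# Assuming est$ci stores the intervals of Table 4.1 row by row
# (Conventional, Bias-Corrected, Robust), the robust interval length is:
est$ci[3, 2] - est$ci[3, 1]    # expected to be close to the 6.5851 computed above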


4.4 Extensions: Covariates and Clustering

In our discussion of local polynomial methods above, we assumed that the local polynomial fit includes only the running variable as a regressor—that is, we considered a local polynomial fit of the outcome on the score alone. However, in some cases, researchers may want to augment their local polynomial analysis by including, in addition to the RD score, a set of predetermined covariates in their model specification. We now briefly discuss the most important issues related to adding covariates, including when it is justified to include covariates, how exactly they should be included, and how to modify the estimation and inference methods outlined above. We let Zi (1) and Zi (0) denote two vectors of potential covariates—where Zi (1) represents the value taken by the covariates above the cutoff (i.e., under treatment), and Zi (0) the value taken below the cutoff (i.e., under control). For adjustment, researchers use the observed covariates, Zi , defined as

\[
Z_i \;=\; \begin{cases} Z_i(0) & \text{if } X_i < \bar{x} \\ Z_i(1) & \text{if } X_i \ge \bar{x}. \end{cases}
\]

If the covariates are truly predetermined, their values are determined before the treatment is ever assigned, and it must be the case that the potential covariate values under control are identical to the potential covariate values under treatment. Formally, predetermined covariates satisfy Zi(1) = Zi(0) for all i. Covariates can be included in multiple different ways to augment the basic RD estimation and inference methods. The two most natural approaches are conditioning, which makes the most sense when only a few discrete covariates are used, and partialling out via local polynomial methods. The first approach amounts to employing all the methods presented and discussed so far, after subsetting the data along the different subclasses generated by the interacted values of the covariates being used. No modifications are needed, and all the methods can be applied directly. The second approach to incorporating covariates, based on augmenting the local polynomial model, allows for many covariates, which can be discrete or continuous. In this case, the idea is to directly include as many pre-intervention covariates as possible without affecting the validity of the point estimator, while at the same time improving its efficiency. To describe the covariate-adjustment RD method, we retain all the previous notation and ingredients underlying the RD local polynomial estimator, and then define the following joint estimation problem:
\[
\big(\hat{\psi}_{-},\, \hat{\psi}_{+},\, \hat{\gamma}\big) \;=\; \arg\min_{\psi_{-},\,\psi_{+},\,\gamma} \;\sum_{i=1}^{n} \Big( Y_i - \psi_{-,0} - \psi_{-,1}(X_i-\bar{x}) - \cdots - \psi_{-,p}(X_i-\bar{x})^{p} - \psi_{+,0}T_i - \psi_{+,1}T_i(X_i-\bar{x}) - \cdots - \psi_{+,p}T_i(X_i-\bar{x})^{p} - Z_i'\gamma \Big)^{2} K\!\Big(\frac{X_i-\bar{x}}{h}\Big),
\]

where ψ− = (ψ−,0, ψ−,1, · · · , ψ−,p)′ and ψ+ = (ψ+,0, ψ+,1, · · · , ψ+,p)′. The covariate-adjusted RD


estimator is τ̃SRD = ψ̂+,0, as this estimate captures the jump at the cutoff in a fully interacted local polynomial regression fit after partialling out the effect of the covariates Zi. In words, the approach is to fit a weighted least-squares regression of the outcome Yi on (i) a constant, (ii) the treatment indicator Ti, (iii) a p-th order polynomial in the running variable, (Xi − x̄), (Xi − x̄)², . . . , (Xi − x̄)^p, (iv) a p-th order polynomial in the running variable interacted with the treatment, (Xi − x̄)·Ti, (Xi − x̄)²·Ti, . . . , (Xi − x̄)^p·Ti, and (v) the covariates Zi, using the weights K((Xi − x̄)/h) for all observations with x̄ − h ≤ Xi ≤ x̄ + h. The approach, of course, reduces to the standard RD estimation when no covariates are included, that is, τ̃SRD = τ̂SRD when γ = 0 is set before estimation. As should be apparent, including covariates in a linear-in-parameters way requires the same type of choices as in the standard RD treatment effect estimation case: the researcher needs to choose a polynomial order p, a kernel function K(·), and a bandwidth h. A crucial question is whether the covariate-adjusted estimator τ̃SRD estimates the same parameter as the unadjusted estimator τ̂SRD. It can be shown that if the covariates Zi are truly predetermined, then τ̃SRD is a consistent estimator of the sharp RD treatment effect τSRD—that is, both τ̃SRD and τ̂SRD estimate the same parameter. If the covariates are not predetermined, in the sense that E[Zi(0)|Xi = x] ≠ E[Zi(1)|Xi = x], then the covariate-adjusted estimator will not generally recover the RD treatment effect τSRD. As before, the implementation of the covariate-adjusted local polynomial RD estimator requires choosing the bandwidth h. The recommended strategy is to use optimal data-driven methods to choose this bandwidth in practice. Since the covariate-adjusted point estimator τ̃SRD is a function of the covariates, the optimal bandwidth choices will also depend on the covariates and will in general differ from the previously discussed hMSE choice. Thus, a principled implementation of local polynomial methods using covariate adjustment would employ an MSE-optimal bandwidth choice that accounts for the inclusion of the covariates in the bandwidth selection step. We omit the details here to conserve space, but note that both the MSE-optimal and the CER-optimal bandwidth choices accounting for covariate adjustment are available in theory and are implemented in general purpose RD software. We illustrate the inclusion of covariates using the Meyersson application. We use the predetermined covariates introduced in Section 1.2 above: variables from the 1994 election (vshr_islam1994, partycount, lpop1994), and the geographic indicators (merkezi, merkezp, subbuyuk, buyuk). In order to keep the same number of observations as in the analysis without covariates, we exclude the indicator for electing an Islamic party in the 1989 election (i89) because this variable has missing values. We start by using rdbwselect to choose an MSE-optimal bandwidth using the default options: a polynomial of order one, a triangular kernel, and the same bandwidth on each side of the cutoff


(mserd option).

> Z = cbind(data$vshr_islam1994, data$partycount, data$lpop1994,
+   data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("vshr_islam1994", "partycount", "lpop1994", "merkezi",
+   "merkezp", "subbuyuk", "buyuk")
> rdbwselect(Y, X, covs = Z, kernel = "triangular", scaleregul = 1,
+   p = 1, bwselect = "mserd")

Call: rdbwselect(y = Y, x = X, p = 1, covs = Z, kernel = "triangular",
    bwselect = "mserd", scaleregul = 1)

  BW Selector             mserd
  Number of Obs            2629
  NN Matches                  3
  Kernel Type        Triangular

                         Left    Right
  Number of Obs          2314      315
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2

             h (left)   h (right)   b (left)   b (right)
  mserd      14.40877    14.40877   23.73098    23.73098

Analogous Stata commands:
. global covariates "vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk"
. rdbwselect Y X, covs($covariates) p(1) kernel(triangular) bwselect(mserd) scaleregul(1)

The MSE-optimal bandwidth including covariates is 14.40877, considerably different from the value of 17.2395 that we found before without covariate adjustment. This illustrates the general principle that covariate adjustment will generally change the values of the optimal bandwidths. To perform local polynomial inference including covariate adjustment, we use the rdrobust command with the covs option.

> Z = cbind(data$vshr_islam1994, data$partycount, data$lpop1994,
+   data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("vshr_islam1994", "partycount", "lpop1994", "merkezi",
+   "merkezp", "subbuyuk", "buyuk")
> rdrobust(Y, X, covs = Z, kernel = "triangular", scaleregul = 1,
+   p = 1, bwselect = "mserd")

Call: rdrobust(y = Y, x = X, p = 1, covs = Z, kernel = "triangular",
    bwselect = "mserd", scaleregul = 1)

Summary:
  Number of Obs            2629
  BW Type                 mserd
  Kernel Type        Triangular
  VCE Type                   NN

                         Left    Right
  Number of Obs          2314      315
  Eff. Number of Obs      448      241
  Order Loc Poly (p)        1        1
  Order Bias (q)            2        2
  BW Loc Poly (h)     14.4088  14.4088
  BW Bias (b)         23.7310  23.7310
  rho (h/b)            0.6072   0.6072

Estimates:
                  Coef   Std. Err.        z    P>|z|   CI Lower   CI Upper
  Conventional  3.1080      1.2839   2.4207   0.0155     0.5915     5.6244
  Robust                                      0.0368     0.1937     6.1317

Analogous Stata commands:
. global covariates "vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk"
. rdrobust Y X, covs($covariates) p(1) kernel(triangular) bwselect(mserd) scaleregul(1)

The estimated RD effect is now 3.1080, similar to the unadjusted estimate of 3.0195 that we found before. As we explained, this similarity is reassuring because if the included covariates are truly predetermined, the unadjusted estimator and the covariate-adjusted estimator are estimating the same parameter and thus should result in roughly the same estimate. With the inclusion of covariates, the 95% robust confidence interval is now [0.1937, 6.1317]. The unadjusted robust confidence interval we estimated in the previous section is [−0.3093, 6.2758]. Thus, including covariates reduced the length of the confidence interval from 6.2758 − (−0.3093) = 6.5851 to 6.1317 − 0.1937 = 5.938, a length reduction of (|5.938 − 6.5851|/6.5851) × 100 = 9.82 percent. The shorter confidence interval obtained with covariate adjustment (and the slight increase in the point estimate) results in the robust p-value decreasing from 0.0758 to 0.0368. This exercise illustrates the main benefit of covariate adjustment in local polynomial RD estimation: when successful, the inclusion of covariates in the analysis decreases the length of the confidence interval while simultaneously leaving the point estimate (roughly) unchanged. Finally, before closing this section, we briefly note that all the variance estimators discussed throughout can be extended to account for clustered data. This extension is straightforward, but implies once again that the optimal bandwidth choices and the corresponding optimal point estimators need to be modified accordingly. Fortunately, all these modifications and extensions are readily available in general purpose software, and we will employ them briefly in Section 7.
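As a final illustration of this point, the sketch below shows what a covariate-adjusted, cluster-robust call could look like. The cluster identifier clustvar is hypothetical (it is not one of the variables used in this section), and the cluster option of rdrobust is assumed to compute cluster-robust standard errors and to adjust the data-driven bandwidth selection accordingly.

# Hypothetical cluster variable (e.g., a province identifier);
# not part of the dataset as used in this section
C <- data$clustvar

# Covariate-adjusted local polynomial RD analysis with clustered data
rdrobust(Y, X, covs = Z, cluster = C, kernel = "triangular",
         p = 1, bwselect = "mserd")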

4.5 Recommendations for Practice

We offer several general recommendations for the implementation of local polynomial RD estimation and inference. First, we recommend using a local linear polynomial with a triangular kernel as the initial specification, and selecting the bandwidth in a data-driven, automatic fashion. A natural initial choice is the MSE-optimal bandwidth for the RD point estimator. This gives an initial benchmark


for further analysis. For point estimation purposes, it is natural to report the point estimator prior to bias correction and without covariate adjustment, constructed using the MSE-optimal bandwidth. For inference purposes, the robust bias-corrected confidence intervals can be reported using the same MSE-optimal bandwidth used for point estimation and, in addition, the analogous confidence intervals constructed using the CER-optimal bandwidth can be reported. Inclusion of covariates, accounting for clustering, etc., can also be done as appropriate, usually as a further robustness check. If the covariates are pre-intervention and the estimand of interest is the sharp RD treatment effect, then the RD point estimator with and without covariates should not change much. Different bandwidths on each side of the cutoff can also be used, either MSE-optimal or CER-optimal, as an additional improvement in some cases.
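As an example of the last recommendation, the bandwidth selectors listed in the rdbwselect output of Section 4.3.3 can be passed directly to rdrobust. A minimal sketch selecting a different MSE-optimal (or CER-optimal) bandwidth on each side of the cutoff is:

# Different MSE-optimal bandwidths below and above the cutoff
rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "msetwo")

# Different CER-optimal bandwidths below and above the cutoff
rdrobust(Y, X, kernel = "triangular", p = 1, bwselect = "certwo")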

4.6 Further Readings

A textbook discussion of nonparametric local polynomial methods can be found in Fan and Gijbels (1996). The specific application of local polynomial methods to RD estimation and inference was first discussed by Hahn et al. (2001) and Porter (2003). Gelman and Imbens (2014) discuss the problems associated with using global polynomial estimation for RD analysis. MSE-optimal bandwidth selection for the local polynomial RD point estimator was first developed by Imbens and Kalyanaraman (2012), and then generalized by Calonico et al. (2014b), Arai and Ichimura (2016, 2017), Bartalotti and Brummet (2017), and Calonico et al. (2017c) to different RD designs and settings. Robust bias-corrected confidence intervals were proposed by Calonico et al. (2014b), and their higher-order properties as well as CER-optimal bandwidth selection for local polynomial confidence intervals were developed by Calonico et al. (2017b,a). An overview of bandwidth selection methods for RD analysis is provided by Cattaneo and Vazquez-Bare (2016). Bootstrap methods based on robust bias-corrected distributional approximations and inference are developed in Bartalotti et al. (2017) and Chiang et al. (2017). Identification, estimation, and inference when the local polynomial RD analysis is performed with the addition of predetermined covariates are discussed in Calonico et al. (2017c). Other extensions of estimation and inference using local polynomial methods and robust bias correction inference are discussed in Xu (2017) and Dong (2017). An interesting empirical example assessing the performance of robust bias correction inference methods is discussed in Tukiainen et al. (2017). Estimation and inference with multiple/many RD cutoffs are discussed in Cattaneo et al. (2016a) and Bertanha (2017). Further related results and references are given in the contemporaneous edited volume by Cattaneo and Escanciano (2017).


5 The Local Randomization Approach to RD Analysis

The continuity-based approach to RD analysis discussed in the previous section is the most commonly used in practice. That approach is based on assumptions of continuity (and further smoothness) of the regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x]. In contrast, the approach we describe in this section is based on a formalization of the idea that the RD design can be viewed as a randomized experiment near the cutoff x̄. When the RD design was first introduced by Thistlethwaite and Campbell (1960), the justification for this novel research design was not based on approximation and extrapolation of smooth regression functions, but instead on the idea that the abrupt change in treatment status that occurs at the cutoff leads to a treatment assignment mechanism that, near the cutoff, resembles the assignment that a randomized experiment would have. Indeed, the authors described a hypothetical experiment where the treatment is randomly assigned near the cutoff as an "experiment for which the regression-discontinuity analysis may be regarded as a substitute" (Thistlethwaite and Campbell, 1960, p. 310). The idea that the treatment assignment is "as good as" randomly assigned in a neighborhood of the cutoff is often invoked in the continuity-based framework to describe the required identification assumptions in an intuitive way, and it has been used to develop formal results. However, within the continuity-based framework, the formal derivation of identification and estimation results always relies on continuity and differentiability of regression functions, and the idea of local randomization is used as a heuristic device only. In contrast, what we call the local randomization approach to RD analysis formalizes the idea that the RD design behaves like a randomized experiment near the cutoff by imposing explicit randomization-type assumptions that are stronger than the standard continuity-type conditions. In a nutshell, this approach imposes conditions so that units whose score values lie in a small window around the cutoff can be analyzed as if they had been randomly assigned to treatment or control. The local randomization approach adopts the local randomization assumption explicitly, not as a heuristic interpretation, and builds a set of statistical tools exploiting that specific feature. We now introduce the local randomization approach in detail, discussing how adopting an explicit randomization assumption near the cutoff allows for the use of new methods of estimation and inference for RD analysis. We also discuss the differences between the standard continuity-based approach and the local randomization approach. When the running variable is continuous, the local randomization approach typically requires stronger assumptions than the continuity-based approach; in these cases, it is natural to use the continuity-based approach for the main RD analysis, and to use the local randomization approach as a robustness check. But in settings where the running variable is discrete (with few mass points) or other departures from the canonical RD framework occur, the local randomization approach can be not only very useful but also possibly the only valid method for estimation and inference in practice.


5.1 Local Randomization Approach: Overview

When the RD analysis is based on a local randomization assumption, instead of assuming that the unknown regression functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are continuous at the cutoff, the researcher assumes that there is a small window around the cutoff, W0 = [x̄ − w0, x̄ + w0], such that for all units whose scores fall in that window, placement above or below the cutoff is assigned as in a randomized experiment—sometimes called as-if random assignment. Formalizing the assumption that the treatment is (locally) assigned as it would have been assigned in an experiment requires careful consideration of the conditions that are guaranteed to hold in an actual experimental assignment. There are important differences between the RD design and an actual randomized experiment. To discuss such differences, we start by noting that any simple experiment can be recast as an RD design where the score is a randomly generated number, and the cutoff is chosen to ensure a certain treatment probability. For example, consider an experiment in a student population that randomly assigns a scholarship with probability 1/2. This experiment can be seen as an RD design where each student is assigned a random number with uniform distribution between 0 and 100, say, and the scholarship is given to students whose score is above 50. We illustrate this scenario in Figure 5.1(a). The crucial feature of a randomized experiment recast as an RD design is that the running variable, by virtue of being a randomly generated number, is unrelated to the average potential outcomes. This is the reason why, in Figure 5.1(a), the average potential outcomes E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] take the same constant value for all values of x. Since the regression functions are flat, the vertical distance between them can be recovered as the difference between the average observed outcomes among all units in the treatment and control groups, i.e., E[Yi|Xi ≥ 50] − E[Yi|Xi < 50] = E[Yi(1)|Xi ≥ 50] − E[Yi(0)|Xi < 50] = E[Yi(1)] − E[Yi(0)], where the last equality follows from the assumption that Xi is a randomly generated number and thus is unrelated to Yi(1) and Yi(0). In contrast, in the standard continuity-based RD design there is no requirement, and in most applications it will not be the case, that the potential outcomes be unrelated to the running variable over its support. Figure 5.1(b) illustrates a standard continuity-based RD design where the average treatment effect at the cutoff is the same as in the experimental setting in Figure 5.1(a), α1 − α0, but where the average potential outcomes are non-constant functions of the score. This relationship between running variable and potential outcomes is characteristic of many RD designs: since the score is often related to the units' ability, resources, or performance (poverty index, vote shares, test scores), units with higher score values are often systematically different from units whose scores are lower. For example, as shown in the graphical analysis of the Meyersson application in Section 3, in municipalities where the Islamist party wins the Turkish mayoral elections of 1994, the Islamist margin of victory is negatively associated with female high school attainment—i.e., the plot has negative slope above the cutoff. As illustrated in Figure 5.1(a), a nonzero slope in the plot of E[Yi|Xi = x] against x does not occur in an actual experiment, because in an experiment x

is an arbitrary random number unrelated to the potential outcomes.

[Figure 5.1: Experiment versus RD Design. Panel (a): Randomized Experiment; Panel (b): RD Design. Both panels plot E[Y(1)|X] and E[Y(0)|X] against the score X; in panel (a) the regression functions are flat and their vertical distance is the average treatment effect, while in panel (b) they vary with the score and their vertical distance at the cutoff is τSRD.]

The crucial difference between the scenarios in Figures 5.1(a) and 5.1(b) is our knowledge about the functional form of the regression functions. As discussed in Section 4, the RD treatment effect in Figure 5.1(b) can be estimated by calculating the limits of the average observed outcomes as the score approaches the cutoff from the treatment and control sides, limx↓x̄ E[Yi|Xi = x] − limx↑x̄ E[Yi|Xi = x]. The estimation of these limits requires that the researcher approximate the regression functions, and this approximation will typically contain an error that may directly affect estimation and inference. This is in stark contrast to the experiment depicted in Figure 5.1(a), where the random assignment of the score implies that the average potential outcomes are unrelated to the score and estimation does not require functional form assumptions—by construction, the regression functions are constant in the entire region where the score was randomly assigned. A point often overlooked is that the known functional form of the regression functions in a true experiment does not follow from the random assignment of the score per se, but rather from the score being an arbitrary computer-generated number that is unrelated to the potential outcomes. If the value of the score were randomly assigned but had a direct effect on the average outcomes, the regression functions in Figure 5.1(a) would not necessarily be flat. Thus, a local randomization approach to RD analysis must be based not only on the assumption that placement above or below the cutoff is randomly assigned within a window of the cutoff, but also on the assumption that the value of the score within this window is unrelated to the potential outcomes—a condition that is guaranteed neither by the random assignment of placement above or below the cutoff nor by the random assignment of Xi. Formally, letting W0 = [x̄ − w, x̄ + w], the local randomization assumption can be stated as the


two following conditions:

(LR1) The distribution of the running variable in the window W0, FXi|Xi∈W0(x), is known, is the same for all units, and does not depend on the potential outcomes: FXi|Xi∈W0(x) = F(x).

(LR2) Inside W0, the potential outcomes depend on the running variable solely through the treatment indicator Ti = 1(Xi ≥ x̄), but not directly: Yi(Xi, Ti) = Yi(Ti) for all i such that Xi ∈ W0.

Under these conditions, inside the window W0, placement above or below the cutoff is unrelated to the potential outcomes, and the potential outcomes are unrelated to the running variable; therefore, the regression functions are flat inside W0. This is illustrated in Figure 5.2, where E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are constant for all values of x inside W0, but have non-zero slopes outside of it.

[Figure 5.2: Local Randomization RD. The plot shows E[Y(1)|X] and E[Y(0)|X] against the score X: both regression functions are flat inside the window [x̄ − w0, x̄ + w0] around the cutoff, where their vertical distance is τLR, and have non-zero slopes outside of it.]

The contrast between Figures 5.1(a), 5.1(b), and 5.2 illustrates the differences between an actual experiment where the score is a random number, a continuity-based RD design, and a local randomization RD design. In the randomly assigned score experiment, the potential outcomes are unrelated to the score for all possible score values—i.e., in the entire support of the score. In this case, there is no uncertainty about the functional forms of E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x]. In the continuity-based RD design, the potential outcomes can be related to the score everywhere; the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are completely unknown, and estimation and inference are based on approximating them at the cutoff. Finally, in the local randomization RD design, the potential outcomes can be related to the running variable far from the cutoff, but there is a window around the cutoff where this relationship ceases. In this case, the functions E[Yi(1)|Xi = x] and E[Yi(0)|Xi = x] are unknown over the entire support of the running variable, but inside the window W0 they are assumed to be constant functions of x. In some applications, assuming that the score has no effect on the (average) potential outcomes somewhere very near the cutoff may be regarded as unrealistic or too restrictive. However, such an assumption can be taken as an approximation, at least for the very few units with scores extremely close to the RD cutoff. As we will discuss below, a key advantage of the local randomization approach is that it leads to valid and powerful finite sample inference methods, which remain valid and can be used even when only a handful of observations very close to the cutoff are considered for (estimation and) inference. Furthermore, the restriction that the score cannot directly affect the (average) potential outcomes near the cutoff can be relaxed if the researcher is willing to impose additional parametric assumptions (locally to the cutoff). As described and formalized so far, the local randomization assumption requires that, inside the window where the treatment is assumed to have been randomly assigned, the potential outcomes are entirely unrelated to the running variable. This assumption, also known as the exclusion restriction, leads to the flat regression functions in Figure 5.2. It is possible to consider a slightly weaker version of this assumption, where condition (LR2) is relaxed. In this version,


the potential outcomes are allowed to depend on the running variable, but researchers assume that there exists a transformation that, once applied to the potential outcomes of the units inside the window where the treatment is assumed to be randomly assigned, leads to transformed potential outcomes that are unrelated to the running variable. Using the random potential outcomes notation, the exclusion restriction in (LR2) requires that, for units with Xi ∈ W0, the potential outcomes satisfy Yi(Xi, Ti) = Yi(Ti)—that is, the potential outcomes depend on the running variable only via the treatment assignment indicator and not via the particular value taken by Xi. In contrast, the weaker alternative assumption requires that, for units with Xi ∈ W0, there exists a transformation φ(·) such that φ(Yi(Xi, Ti), Xi, Ti) = Ỹi(Ti).


This condition says that, although the potential outcomes are allowed to depend on the running variable Xi directly, the transformed potential outcomes Ỹi(Ti) depend only on the treatment assignment indicator and thus satisfy the original exclusion restriction in (LR2). For implementation, a transformation φ(·) must be assumed; for example, one can use a polynomial of order p in the unit's score, with slopes that are constant for all individuals on the same side of the cutoff. This transformation has the advantage of linking the local randomization approach to RD analysis to the continuity-based approach discussed in the previous section. Finally, we note that these conditions can be defined analogously for fixed (i.e., non-random) potential outcome functions yi(·).
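To make the contrast with Figure 5.1(a) concrete, the following minimal base-R simulation (entirely hypothetical data, unrelated to the Meyersson application) generates a score that is a pure random number and potential outcomes that do not depend on it, so that the analogues of conditions (LR1) and (LR2) hold over the full support and the simple difference in means recovers the treatment effect.

set.seed(123)
n  <- 1000
x  <- runif(n, 0, 100)           # score: an arbitrary random number
y0 <- 0.5 + rnorm(n, sd = 0.1)   # potential outcome under control (flat in x)
y1 <- 1.5 + rnorm(n, sd = 0.1)   # potential outcome under treatment (flat in x)
t  <- as.numeric(x >= 50)        # treatment assigned by the cutoff at 50
y  <- ifelse(t == 1, y1, y0)     # observed outcome

# Difference in means recovers E[Y(1)] - E[Y(0)] = 1, up to sampling error
mean(y[t == 1]) - mean(y[t == 0])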

5.2 Local Randomization Estimation and Inference

Adopting a local randomization approach to RD analysis implies assuming that the assignment of units above or below the cutoff was random inside the window W0 (condition LR1), and that in this window the potential outcomes are unrelated to the score (condition LR2)—or can be somehow transformed to be unrelated to the score. Therefore, given knowledge of W0, under a local randomization RD approach we can analyze the data as we would analyze an experiment. If the number of observations inside W0 is large, researchers can use the full menu of standard experimental methods, all of which are based on large-sample approximations—that is, on the assumption that the number of units inside W0 is large enough to be well approximated by large-sample limiting distributions. These methods may or may not involve the assumption of random sampling, and may or may not require LR2 per se (though removing LR2 will change the interpretation of the RD parameter in general). In contrast, if the number of observations inside W0 is very small, as is usually the case when local randomization methods are invoked in RD designs, estimation and inference based on large-sample approximations may be invalid; in this case, under appropriate assumptions, researchers can still employ randomization-based inference methods that are exact in finite samples and do not require large-sample approximations for their validity. These methods rely on the random assignment of the treatment to construct confidence intervals and hypothesis tests. We review both types of approaches below. The implementation of experimental methods to analyze RD designs requires knowledge or estimation of two important ingredients: (i) the window W0 where the local randomization assumption is invoked; and (ii) the randomization mechanism that will be used to approximate the assignment of units within W0 to treatment and control (i.e., to above or below the cutoff). In real applications, W0 is fundamentally unknown and must be selected by the researcher (ideally in an objective and data-driven way). Once W0 has been estimated, the choice of the randomization mechanism can be guided by the structure of the data, and sometimes it may be irrelevant if large-sample approximations are invoked. In most applications, the most natural assumption for the randomization mechanism is either complete randomization or a Bernoulli assignment, where all units in W0 are assumed to have the same probability of being placed above or below the cutoff. We first assume that W0 is known and choose a particular random assignment mechanism inside W0. In Section


5.3, we discuss a principled method to choose the window W0 in a data-driven way.

5.2.1 Finite Sample Methods

In many RD applications, a local randomization assumption will only be plausible in a very small window around the cutoff, and by implication this small window will often contain very few observations. In this case, it is natural to employ the Fisherian inference approach, which is valid in any finite sample and thus leads to correct inferences even when the number of observations inside W0 is very small. The Fisherian approach sees the potential outcomes as non-stochastic; this is in contrast to the inference approaches used in the continuity-based RD approach, where the potential outcomes are random variables as a consequence of random sampling. More precisely, in Fisherian inference, the total number of units in the study, n, is seen as fixed—i.e., there is no random sampling assumption; moreover, inferences do not rely on assuming that this number is large. This setup is then combined with the so-called sharp null hypothesis that the treatment has no effect for any unit:

HF0 : Yi(0) = Yi(1) for all i.

The combination of fixed units and the sharp null hypothesis leads to inferences that are (type-I error) correct for any sample size because, under HF0, both potential outcomes (i.e., Yi(1) and Yi(0)) can be imputed for every unit and there is no missing data. In other words, under the sharp null hypothesis, the observed outcome of each unit is equal to the unit's two potential outcomes. Thus, when the treatment assignment is known, the fact that all potential outcomes are observed under the null hypothesis allows us to derive the null distribution of any test statistic from the randomization distribution of the treatment assignment alone. Since the latter distribution is finite-sample exact, the Fisherian framework allows researchers to make inferences without relying on large-sample approximations.

A hypothetical example. To illustrate how Fisherian inference leads to the exact distribution of test statistics, we use a hypothetical example. We imagine that we have five units inside W0, and we randomly assign nW0,+ = 3 units to treatment and nW0,− = nW0 − nW0,+ = 5 − 3 = 2 units to control, where nW0 is the total number of units inside W0. We choose the difference-in-means as the test statistic. The treatment indicator continues to be Ti, and we collect in the set TW0 all possible nW0-dimensional treatment assignment vectors t within the window. For implementation, we must choose a particular treatment assignment mechanism. In other words, after assuming that placement above and below the cutoff was done as it would have been done in an experiment, we must choose a particular randomization distribution for the assignment.


Of course, a crucial difference between an actual experiment and the RD design is that, in the RD design, the true mechanism by which units are assigned a score value below or above the cutoff $\bar{x}$ inside W0 is fundamentally unknown. Thus, the choice of a particular randomization mechanism is best understood as an approximation. A common choice is to assume that, within W0, $n_{W_0,+}$ units are assigned to treatment and $n_{W_0} - n_{W_0,+}$ units are assigned to control, with each possible assignment of units to the treatment (i.e., above-cutoff) group having probability $\binom{n_{W_0}}{n_{W_0,+}}^{-1}$. This is commonly known as a complete randomization mechanism or a fixed margins randomization—under this mechanism, the number of treated and control units is fixed, as all treatment assignment vectors result in exactly $n_{W_0,+}$ treated units and $n_{W_0} - n_{W_0,+}$ control units. In our example, under complete randomization, the number of elements in $\mathcal{T}_{W_0}$ is $\binom{5}{3} = 10$—

that is, there are ten different ways to assign five units to two groups of sizes three and two. We assume that $Y_i(1) = 5$ and $Y_i(0) = 2$ for all units, so that the treatment effect is constant and equal to 3 units. The top panel of Table 5.1 shows the ten possible treatment assignment vectors, $t_1, \ldots, t_{10}$, together with the two potential outcomes.

Suppose that the observed treatment assignment inside W0 is $t_6$, so that units 1, 4 and 5 are assigned to treatment, and units 2 and 3 are assigned to control. Given this assignment, the vector of observed outcomes is $Y = (5, 2, 2, 5, 5)$, and the observed value of the difference-in-means statistic is
$$S^{\mathrm{obs}} = \bar{Y}_+ - \bar{Y}_- = \frac{5+5+5}{3} - \frac{2+2}{2} = 5 - 2 = 3.$$
The bottom panel of Table 5.1 shows the distribution of the test statistic under the null—that is, the ten different possible values that the difference-in-means can take when $H_0^F$ is assumed to hold. The observed difference-in-means $S^{\mathrm{obs}}$ is the largest of the ten, and the exact p-value is therefore $p^F = 1/10 = 0.10$. Thus, we can reject $H_0^F$ with a test of level $\alpha = 0.10$. Note that, since the number of possible treatment assignments is ten, the smallest value that $p^F$ can take is 1/10. This p-value is finite-sample exact, because the null distribution in Table 5.1 was derived directly from the randomization distribution of the treatment assignment, and does not rely on any statistical model or large-sample approximations.

This example illustrates that, in order to implement a local randomization RD analysis, we need to specify, in addition to the choice of W0, the particular way in which the treatment was randomized—that is, knowledge of the distribution of the treatment assignment. In practice, the latter will not be known, but in many applications it can be approximated by assuming a complete randomization within W0. Moreover, we need to choose a particular test statistic; the difference-in-means is a simple choice, but below we discuss other options.

The General Fisherian Inference Framework

We can generalize the above example to provide a general formula for the exact p-value associated with a test of $H_0^F$. As before, we let $T_{W_0}$ be the treatment assignment vector for the $n_{W_0}$ units in W0, and collect in the set $\mathcal{T}_{W_0}$ all the possible treatment assignments that can occur given the assumed randomization mechanism. In a complete or fixed margins randomization, $\mathcal{T}_{W_0}$ includes all vectors of length $n_{W_0}$ such that each vector has $n_{W_0,+}$ ones and $n_{W_0,-} = n_{W_0} - n_{W_0,+}$ zeros.


Table 5.1: Hypothetical Randomization Distribution with Five Units

All Possible Treatment Assignments

  Unit   Yi(1)   Yi(0)   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
    1      5       2      1    1    1    1    1    1    0    0    0    0
    2      5       2      1    1    1    0    0    0    1    1    1    0
    3      5       2      1    0    0    1    1    0    1    1    0    1
    4      5       2      0    1    0    1    0    1    1    0    1    1
    5      5       2      0    0    1    0    1    1    0    1    1    1

Distribution of Difference-in-Means When T = t6 and Y = (5, 2, 2, 5, 5)

              t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
  Y+           3     4     4     4     4     5     3     3     4     4
  Y-           5    3.5   3.5   3.5   3.5    2     5     5    3.5   3.5
  Y+ - Y-     -2    0.5   0.5   0.5   0.5    3    -2    -2    0.5   0.5

Similarly, $Y_{W_0}$ collects the $n_{W_0}$ observed outcomes for all units with $X_i \in W_0$. We also need to choose a test statistic, which we denote $S(T_{W_0}, Y_{W_0})$, a function of the treatment assignment $T_{W_0}$ and the vector $Y_{W_0}$ of observed outcomes for the $n_{W_0}$ units in the experiment that is assumed to occur inside W0.

Of all the possible values of the treatment vector $T_{W_0}$ that can occur, only one will have occurred in W0; we call this value the observed treatment assignment, $t^{\mathrm{obs}}_{W_0}$, and we denote by $S^{\mathrm{obs}} = S(t^{\mathrm{obs}}_{W_0}, Y_{W_0})$ the observed value of the test statistic associated with $t^{\mathrm{obs}}_{W_0}$. (In the hypothetical example discussed above, we had $t^{\mathrm{obs}}_{W_0} = t_6$.) Then, the one-sided finite-sample exact p-value associated with a test of the sharp null hypothesis $H_0^F$ is the probability that the test statistic equals or exceeds its observed value:
$$p^F = \mathbb{P}\big(S(T_{W_0}, Y_{W_0}) \geq S^{\mathrm{obs}}\big) = \sum_{t_{W_0} \in \mathcal{T}_{W_0}} \mathbb{1}\big(S(t_{W_0}, Y_{W_0}) \geq S^{\mathrm{obs}}\big) \cdot \mathbb{P}(T_{W_0} = t_{W_0}).$$
When each of the treatment assignments in $\mathcal{T}_{W_0}$ is equally likely, $\mathbb{P}(T_{W_0} = t_{W_0}) = 1/\#\{\mathcal{T}_{W_0}\}$, with $\#\{\mathcal{T}_{W_0}\}$ the number of elements in $\mathcal{T}_{W_0}$, and this expression simplifies to the number of times the test statistic equals or exceeds the observed value divided by the total number of test statistic values that can possibly occur,
$$p^F = \mathbb{P}\big(S(T_{W_0}, Y_{W_0}) \geq S^{\mathrm{obs}}\big) = \frac{\#\big\{S(t_{W_0}, Y_{W_0}) \geq S^{\mathrm{obs}}\big\}}{\#\{\mathcal{T}_{W_0}\}}.$$
As in the hypothetical example above, under the sharp null hypothesis all potential outcomes are known and can be imputed. To see this, note that under $H_0^F$ we have $Y_{W_0} = Y_{W_0}(1) = Y_{W_0}(0)$, so that $S(T_{W_0}, Y_{W_0}) = S(T_{W_0}, Y_{W_0}(0))$.

Thus, under $H_0^F$, the only randomness in $S(T_{W_0}, Y_{W_0})$ comes through the random assignment of the treatment, which is assumed to be known.

In practice, it often occurs that the total number of different treatment vectors $t_{W_0}$ that can occur inside the window W0 is too large, and enumerating them exhaustively is unfeasible. For example, assuming a fixed margins randomization inside W0 with 15 observations on each side of the cutoff, there are $\binom{n_{W_0}}{n_{W_0,+}} = \binom{30}{15} = 155{,}117{,}520$ possible treatment assignment vectors, and calculating $p^F$ by complete enumeration is not possible or very time consuming. When exhaustive enumeration is infeasible, we can approximate $p^F$ using simulations, as follows:

1. Calculate the observed test statistic, $S^{\mathrm{obs}} = S(t^{\mathrm{obs}}_{W_0}, Y_{W_0})$.

2. Draw a value $t^j_{W_0}$ from the assumed treatment assignment distribution of $T_{W_0}$.

3. Calculate the test statistic for the $j$th draw, $S(t^j_{W_0}, Y_{W_0})$.

4. Repeat steps 2 and 3 $B$ times.

5. Calculate the simulation approximation to $p^F$ as
$$\tilde{p}^F = \frac{1}{B} \sum_{j=1}^{B} \mathbb{1}\big(S(t^j_{W_0}, Y_{W_0}) \geq S^{\mathrm{obs}}\big).$$
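To fix ideas, the following minimal base-R sketch reproduces the exact p-value of the five-unit example in Table 5.1 and illustrates the simulation approximation. The helper diff_means and all variable names are ours and are not part of the rdlocrand package.

# Hypothetical data from Table 5.1: observed assignment t6 and observed outcomes
y     = c(5, 2, 2, 5, 5)        # observed outcomes inside W0
t_obs = c(1, 0, 0, 1, 1)        # observed assignment (units 1, 4, 5 treated)
diff_means = function(t, y) mean(y[t == 1]) - mean(y[t == 0])
s_obs = diff_means(t_obs, y)    # = 3

# Exact p-value: enumerate all fixed-margins assignments (5 choose 3 = 10)
treated_sets = combn(length(y), sum(t_obs))
s_all = apply(treated_sets, 2, function(idx) {
  t = integer(length(y)); t[idx] = 1
  diff_means(t, y)
})
p_exact = mean(s_all >= s_obs)  # = 1/10 = 0.10

# Simulation approximation, useful when enumeration is infeasible:
B = 1000
s_sim = replicate(B, diff_means(sample(t_obs), y))   # sample() permutes the assignment
p_sim = mean(s_sim >= s_obs)

With only ten possible assignments the exact enumeration is trivial; in larger windows only the simulation-based approximation p_sim is computed, and it approaches the exact p-value as B grows.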

Fisherian confidence intervals can be obtained by specifying sharp null hypotheses about treatment effects, and then inverting the corresponding tests. In order to apply the Fisherian framework, the null hypotheses to be inverted must be sharp—that is, under these null hypotheses, the full profile of potential outcomes must be known. This requires specifying a treatment effect model, and testing hypotheses about the specified parameters. A simple and common choice is a constant treatment effect model, $Y_i(1) = Y_i(0) + \tau$, which leads to the null hypothesis $H^F_{\tau_0}: \tau = \tau_0$—note that $H_0^F$ is a special case of $H^F_{\tau_0}$ when $\tau_0 = 0$. Under this model, a $1-\alpha$ confidence interval for $\tau$ is obtained by collecting all the values $\tau_0$ that fail to be rejected when we test $H^F_{\tau_0}: \tau = \tau_0$ with an $\alpha$-level test.

To test $H^F_{\tau_0}$, we build test statistics based on an adjustment to the outcomes that renders them constant under this null hypothesis. Under $H^F_{\tau_0}$, the observed outcome is
$$Y_i = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0) = T_i \cdot (Y_i(0) + \tau_0) + (1 - T_i) \cdot Y_i(0) = T_i \cdot \tau_0 + Y_i(0).$$
Thus, the adjusted outcome $\ddot{Y}_i \equiv Y_i - T_i \tau_0 = Y_i(0)$ is constant under the null hypothesis $H^F_{\tau_0}$. A randomization-based test of $H^F_{\tau_0}$ proceeds by first calculating the adjusted outcomes $\ddot{Y}_i$ for all the units in the window, $i = 1, \ldots, n_{W_0}$, and then computing the test statistic using the adjusted outcomes instead of the raw outcomes, i.e., computing $S(T_{W_0}, \ddot{Y}_{W_0})$. Once the adjusted outcomes are used to calculate the test statistic, we have $S(T_{W_0}, \ddot{Y}_{W_0}) = S(T_{W_0}, Y_{W_0}(0))$ as before, and a test of $H^F_{\tau_0}: \tau = \tau_0$ can be implemented as a test of the sharp null hypothesis $H_0^F$, using $S(T_{W_0}, \ddot{Y}_{W_0})$ instead of $S(T_{W_0}, Y_{W_0})$. We use $p^F_{\tau_0}$ to refer to the p-value associated with a randomization-based test of $H^F_{\tau_0}$.

In practice, assuming that $\tau$ takes values in $[\tau_{\min}, \tau_{\max}]$, computing these confidence intervals requires building a grid $\mathcal{G}_{\tau_0} = \{\tau_0^1, \tau_0^2, \ldots, \tau_0^G\}$, with $\tau_0^1 \geq \tau_{\min}$ and $\tau_0^G \leq \tau_{\max}$, and collecting all $\tau_0 \in \mathcal{G}_{\tau_0}$ that fail to be rejected with an $\alpha$-level test of $H^F_{\tau_0}$. Thus, the Fisherian $(1-\alpha) \times 100\%$ confidence interval is
$$CI^{F}_{LR} = \big\{\tau_0 \in \mathcal{G}_{\tau_0} : p^F_{\tau_0} > \alpha \big\}.$$
The general principle of Fisherian inference is to use the randomization-based distribution of the test statistic under the sharp null hypothesis to derive p-values and confidence intervals. In our hypothetical example, we illustrated the procedure using the difference-in-means test statistic and the fixed margins randomization mechanism, but the Fisherian approach is general and works for any appropriate choice of test statistic and randomization mechanism. Other test statistics that could be used include the Kolmogorov-Smirnov (KS) statistic and the Wilcoxon rank sum statistic. The KS statistic is defined as $S_{KS} = \sup_y |\hat{F}_1(y) - \hat{F}_0(y)|$, and measures the maximum absolute difference between the empirical cumulative distribution functions (CDFs) of the treated and control outcomes—denoted, respectively, by $\hat{F}_1(\cdot)$ and $\hat{F}_0(\cdot)$. Because $S_{KS}$ compares the entire treated and control outcome CDFs, it is well suited to detect departures from the null hypothesis that involve not only differences in means but also differences in other moments and in quantiles. Another commonly used test statistic is the Wilcoxon rank sum statistic, which is based on the ranks of the outcomes, denoted $R^y_i$. This statistic is $S_{WR} = \sum_{i: T_i = 1} R^y_i$, that is, the sum of the ranks of the treated observations. Because $S_{WR}$ is based on ranks, it is not affected by the particular values of the outcome, only by their ordering; thus, unlike the difference-in-means, $S_{WR}$ is insensitive to outliers.

In addition to different choices of statistics, the Fisherian approach also allows for different randomization mechanisms. An alternative to the complete randomization mechanism discussed above is a Bernoulli assignment, where each unit is assigned to treatment independently with some fixed common probability. For implementation, researchers can set this probability equal to 1/2 or, alternatively, equal to the proportion of treated units in W0. The disadvantage of a Bernoulli assignment is that it can result in a treated or a control group with few or no observations—a phenomenon that can never occur under complete randomization. In practice, nevertheless, complete randomization and Bernoulli randomization often lead to very similar conclusions.
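As an illustration of the test-inversion logic only—not of the rdlocrand implementation—the following base-R sketch reuses the toy vectors y and t_obs and the diff_means helper from the previous sketch, under the assumptions of complete randomization and a two-sided difference-in-means test.

p_value_tau0 = function(tau0, t, y, B = 1000) {
  y_adj = y - t * tau0                      # adjusted outcomes; constant under H_tau0
  s_obs = diff_means(t, y_adj)
  s_sim = replicate(B, diff_means(sample(t), y_adj))
  mean(abs(s_sim) >= abs(s_obs))            # two-sided randomization p-value
}
grid  = seq(-10, 10, by = 0.25)
pvals = sapply(grid, p_value_tau0, t = t_obs, y = y)
ci_95 = range(grid[pvals > 0.05])           # grid values not rejected at the 5% level
# With only five units the smallest attainable p-value is 1/10, so no tau0 can be
# rejected at the 5% level and the interval spans the whole grid; the sketch only
# illustrates the mechanics that rdrandinf's ci option automates on real data.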

5.2.2  Large Sample Methods

Despite the conceptual elegance of finite-sample Fisherian methods, the most frequently chosen methods in the analysis of experiments are based on large-sample approximations. These methods are appropriate for analyzing RD designs under a local randomization assumption when the number of observations inside W0 is large enough to ensure that the moment and/or distributional approximations are sufficiently close to the finite-sample distributions of the statistics of interest.

A classic framework for experimental analysis is known as the Neyman approach. This approach relies on large-sample approximations to the randomization distribution of the treatment assignment, but still assumes that the potential outcomes are fixed or non-stochastic. In other words, the Neyman approach is based on approximations to the randomization distribution but does not assume that the data are a (random) sample from a larger super-population. Since in the Neyman framework the potential outcomes are non-stochastic, the parameter of interest is the finite-sample average treatment effect inside the window if point estimation is the goal, or otherwise a given hypothesis such as equality of the (sample) mean potential outcomes. To be more specific, consider the local randomization sharp RD effect, defined as
$$\tau^{LR}_{SRD} = \bar{Y}(1) - \bar{Y}(0), \qquad \bar{Y}(1) = \frac{1}{n_{W_0}} \sum_{i: X_i \in W_0} Y_i(1), \qquad \bar{Y}(0) = \frac{1}{n_{W_0}} \sum_{i: X_i \in W_0} Y_i(0),$$
where $\bar{Y}(1)$ and $\bar{Y}(0)$ are the average potential outcomes inside the window. In this definition, we have assumed that the potential outcomes are non-stochastic.

Note that the parameter $\tau^{LR}_{SRD}$ is different from the more conventional continuity-based RD parameter $\tau_{SRD}$ defined in Section 4: while $\tau^{LR}_{SRD}$ is an average effect inside an interval (the window W0), $\tau_{SRD}$ is an average at a single point (the cutoff $\bar{x}$) where, by construction, the number of observations is zero. Thus, the decision to adopt a continuity-based approach versus a local randomization approach directly affects the definition of the parameter of interest. Naturally, if the window W0 is extremely small, $\tau^{LR}_{SRD}$ and $\tau_{SRD}$ become more conceptually similar.

Under the assumption of complete randomization inside W0, the observed difference-in-means is an unbiased estimator of $\tau^{LR}_{SRD}$. Thus, a natural estimator of the RD effect $\tau^{LR}_{SRD}$ is the difference between the average observed outcomes in the treatment and control groups,
$$\hat{\tau}^{LR}_{SRD} = \bar{Y}_+ - \bar{Y}_-, \qquad \bar{Y}_+ = \frac{1}{n_{W_0,+}} \sum_{i: X_i \in W_0} Y_i \, \mathbb{1}(X_i \geq \bar{x}), \qquad \bar{Y}_- = \frac{1}{n_{W_0,-}} \sum_{i: X_i \in W_0} Y_i \, \mathbb{1}(X_i < \bar{x}),$$
where $\bar{Y}_+$ and $\bar{Y}_-$ are the average treated and control observed outcomes inside W0, and $n_{W_0,+}$ and $n_{W_0,-}$ are the numbers of treatment and control units inside W0, respectively. In this case, a conservative estimator of the variance of $\hat{\tau}^{LR}_{SRD}$ is given by the sum of the sample variances in each group, $\hat{V} = \hat{\sigma}^2_+ / n_{W_0,+} + \hat{\sigma}^2_- / n_{W_0,-}$, where $\hat{\sigma}^2_+$ and $\hat{\sigma}^2_-$ denote the sample variance of the outcome for the treatment and control units within W0, respectively. A $100(1-\alpha)\%$ confidence interval can be constructed in the usual way by relying on a Normal large-sample approximation to the randomization distribution of the treatment assignment. For example, an approximate 95% confidence interval is
$$CI^{LR}_{N} = \Big[\, \hat{\tau}^{LR}_{SRD} \pm 1.96 \cdot \sqrt{\hat{V}} \,\Big].$$

Testing the null hypothesis that the average treatment effect is zero can also be based on Normal approximations. The Neyman null hypothesis is $H_0^N: \bar{Y}(1) - \bar{Y}(0) = 0$. In contrast to Fisher's sharp null hypothesis $H_0^F$, $H_0^N$ does not allow us to calculate the full profile of potential outcomes for every possible realization of the treatment assignment vector $t$. Thus, unlike the Fisherian approach, the Neyman approach to hypothesis testing must rely on approximations and is therefore not exact. In the Neyman approach, we can construct the usual t-statistic using the point and variance estimators just introduced, $S = (\bar{Y}_+ - \bar{Y}_-)/\sqrt{\hat{V}}$, and then use a Normal approximation to its distribution. For example, for a one-sided test, the p-value associated with a test of $H_0^N$ is $p^N = 1 - \Phi(S)$, where $\Phi(\cdot)$ is the Normal CDF.

Finally, it is possible to consider (random) sampling from a super-population, in addition to large-sample approximations to the randomization mechanism. To be more specific, in the Neyman framework just introduced there is no random sampling; instead, the potential outcomes are considered fixed, and inferences are based on an approximation to the randomization distribution of the treatment assignment in large samples. Other experimental methods based on large-sample approximations assume instead that the data $\{Y_i, X_i\}_{i=1}^n$ are a random sample from a larger population—the same assumption made by the continuity-based methods discussed in Section 4. When random sampling is assumed, the potential outcomes $Y_i(1)$ and $Y_i(0)$ are considered random variables, and the units inside W0 are seen as a random sample from a (large) super-population. Because the potential outcomes within W0 become stochastic by virtue of the random sampling, the parameter of interest is the super-population average treatment effect, $\mathbb{E}[Y_i(1) - Y_i(0) \mid X_i \in W_0]$. Adopting this super-population perspective, however, does not change the estimation or inference procedures discussed above, though it slightly affects the interpretation of the results.
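The Neyman point estimator, variance estimator, confidence interval, and z-test described above are simple enough to compute directly. The following base-R helper is our own sketch (it is not part of rdlocrand) and takes vectors of outcomes y and treatment indicators t for the units inside W0.

neyman_rd = function(y, t, alpha = 0.05) {
  tau_hat = mean(y[t == 1]) - mean(y[t == 0])                            # difference-in-means
  v_hat   = var(y[t == 1]) / sum(t == 1) + var(y[t == 0]) / sum(t == 0)  # conservative variance
  z       = qnorm(1 - alpha / 2)
  list(tau_hat = tau_hat,
       ci      = c(tau_hat - z * sqrt(v_hat), tau_hat + z * sqrt(v_hat)),
       p_value = 2 * (1 - pnorm(abs(tau_hat) / sqrt(v_hat))))            # two-sided Normal p-value
}

The rdrandinf function reports the analogous large-sample quantities in its "Large sample" columns, so in practice this computation does not need to be done by hand.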

5.2.3  Estimation and Inference in Practice

We illustrate the finite-sample and large-sample inference procedures described above using the Meyersson application. For this, we use the function rdrandinf, which is part of the rdlocrand package. The main arguments of rdrandinf include the outcome variable Y, the running variable X, and the lower and upper limits of the window where inferences will be performed (wl and wr). We choose the ad-hoc window [-2.5, 2.5], and postpone the discussion of automatic data-driven window selection until the next section. To make inferences in W = [-2.5, 2.5], we set wl = -2.5 and wr = 2.5. Since Fisherian methods are simulation-based, we also choose the number of simulations via the argument reps, in this case choosing 1,000 simulations. Finally, in order to be able to replicate the Fisherian simulation-based results at a later time, we set the random seed using the argument seed.

> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50)

Selected window = [-2.5;2.5]
Running randomization-based test...
Randomization-based test complete.

Number of obs       =           2629
Order of poly       =              0
Kernel type         =        uniform
Reps                =           1000
Window              =    set by user
H0:            tau  =              0
Randomization       =  fixed margins

Cutoff c = 0          Left of c   Right of c
Number of obs              2314          315
Eff. number of obs           68           62
Mean of outcome          13.972       15.044
S.d. of outcome           8.541        9.519
Window                     -2.5          2.5

                            Finite sample   Large sample
Statistic             T         P>|T|          P>|T|      Power vs d = 4.27
Diff. in means      1.072       0.488          0.501            0.765

Analogous Stata command
. rdrandinf Y X, wl(-2.5) wr(2.5) seed(50)

The output is divided into three panels. The top panel first presents the total number of observations in the entire dataset (that is, in the entire support of the running variable), the order of the polynomial, and the kernel function used to weight the observations. By default, rdrandinf uses a polynomial of order zero, which means the outcomes are not transformed; in order to transform the outcomes via a polynomial as explained above, users can use the option p in the call to rdrandinf. The default is also to use a uniform kernel, that is, to compute the test statistic using unweighted observations; this default behavior can be changed with the option kernel. The rest of the top panel reports the number of simulations used for Fisherian inference, the method used to choose the window, and the null hypothesis that is tested (the default is $\tau_0 = 0$, i.e., a test of $H_0^F$ and $H_0^N$). Finally, the last row of the top panel reports the chosen randomization mechanism, which by default is fixed margins (i.e., complete) randomization.


The middle panel reports the number of observations to the left and right of the cutoff, both in the entire support of the running variable and in the chosen window. As shown in the output, although there is a total of 2314 control observations and 315 treated observations in the entire dataset, the number of observations in the window [-2.5, 2.5] is much smaller, with only 68 municipalities below the cutoff and 62 municipalities above the cutoff. The middle panel also reports the mean and standard deviation of the outcome inside the chosen window.

The last panel reports the results. The first column reports the type of test statistic employed for testing the Fisherian sharp null hypothesis (the default is the difference-in-means), and the column labeled T reports its value. In this case, the difference-in-means is 1.072; given the information in the Mean of outcome row in the middle panel, we see that this is the difference between a female education share of 15.044 percentage points in municipalities where the Islamic party barely won, and a female education share of 13.972 percentage points in municipalities where the Islamic party barely lost. The Finite sample column reports the p-value associated with a randomization-based test of the Fisherian sharp null hypothesis $H_0^F$ (or of the alternative sharp null hypothesis $H^F_{\tau_0}$ based on a constant treatment effect model if the user sets $\tau_0 \neq 0$ via the option nulltau). This p-value is 0.488, which means we fail to reject the sharp null hypothesis.

Finally, the Large sample columns in the bottom panel report Neyman inferences based on the large-sample approximation to the distribution of the statistic. The p-value reported in the large-sample columns is thus $p^N$, the p-value associated with a test of the Neyman null hypothesis $H_0^N$ that the average treatment effect is zero. The last column in the bottom panel reports the power of the Neyman test to reject the null when the true average treatment effect is equal to d, where by default d is set to one half of the standard deviation of the outcome variable for the control group, which in this case is 4.27 percentage points. The value of d can be modified with the options d or dscale. Like $p^N$, the calculation of the power against the alternative hypothesis d is based on the Normal approximation discussed in the previous section. The large-sample p-value is 0.501, indicating that the Neyman null hypothesis also fails to be rejected at conventional levels. The power calculation indicates that the probability of rejecting the null hypothesis when the true effect is equal to half a (control) standard deviation is relatively high, at 0.765. Thus, it seems that the failure to reject the null hypothesis stems from the small size of the average treatment effect estimated in this window, which is just 1.072/(4.27 × 2) = 1.072/8.54 = 0.126 standard deviations of the control outcome—a very small effect.

It is also important to note the different interpretation of the difference-in-means test statistic in the Fisherian and Neyman frameworks. In Fisherian inference, the difference-in-means is simply one of various test statistics that can be chosen to test the sharp null hypothesis, and should not be interpreted as an estimated effect—remember that in the Fisherian framework the focus is on testing null hypotheses that are sharp. In contrast, in the Neyman framework the focus is on the sample average treatment effect; since the difference-in-means is an unbiased estimator of this parameter, it can be appropriately interpreted as an estimated effect.


To illustrate how robust Fisherian inferences can be to the choice of randomization mechanism and test statistic, we modify our call to rdrandinf to use a Bernoulli randomization mechanism, where every unit in the ad-hoc window [-2.5, 2.5] has a 1/2 probability of being assigned to treatment. For this, we must first create an auxiliary variable that contains the treatment assignment probability of every unit in the window; this auxiliary variable is then passed as an argument to rdrandinf.

> bern_prob = numeric(length(X))
> bern_prob[abs(X) > 2.5] = NA
> bern_prob[abs(X) <= 2.5] = 1/2
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50, bernoulli = bern_prob)

Selected window = [-2.5;2.5]
Running randomization-based test...
Randomization-based test complete.

Number of obs       =           130
Order of poly       =             0
Kernel type         =       uniform
Reps                =          1000
Window              =   set by user
H0:            tau  =             0
Randomization       =     Bernoulli

Cutoff c = 0          Left of c   Right of c
Number of obs                68           62
Eff. number of obs           68           62
Mean of outcome          13.972       15.044
S.d. of outcome           8.541        9.519
Window                     -2.5          2.5

                            Finite sample   Large sample
Statistic             T         P>|T|          P>|T|      Power vs d = 4.27
Diff. in means      1.072       0.469          0.501            0.765

Analogous Stata command
. gen bern_prob=1/2 if abs(X)<=2.5
. rdrandinf Y X, wl(-2.5) wr(2.5) seed(50) bernoulli(bern_prob)

The last row of the top panel now says Randomization = Bernoulli, indicating that the Fisherian randomization-based test of the sharp null hypothesis assumes a Bernoulli treatment assignment mechanism, where each unit has probability q of being placed above the cutoff—in this case, given our construction of the bern_prob variable, q = 1/2 for all units. The Fisherian finite-sample p-value is now 0.469, very similar to the 0.488 p-value obtained above under the assumption of a fixed margins randomization. The conclusion of failure to reject $H_0^F$ is therefore unchanged. This robustness of the Fisherian p-value to the choice of fixed margins versus Bernoulli randomization is typical of most applications. Note also that the large-sample results are exactly the same as before—this is expected, since the choice of randomization mechanism does not affect the large-sample Neyman inferences.

We can also change the test statistic used to test the Fisherian sharp null hypothesis. For example, to use the Kolmogorov-Smirnov (KS) test statistic instead of the difference-in-means, we set the option statistic = "ksmirnov".

> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50, statistic = "ksmirnov")

Selected window = [-2.5;2.5]
Running randomization-based test...
Randomization-based test complete.

Number of obs       =           2629
Order of poly       =              0
Kernel type         =        uniform
Reps                =           1000
Window              =    set by user
H0:            tau  =              0
Randomization       =  fixed margins

Cutoff c = 0          Left of c   Right of c
Number of obs              2314          315
Eff. number of obs           68           62
Mean of outcome          13.972       15.044
S.d. of outcome           8.541        9.519
Window                     -2.5          2.5

                               Finite sample   Large sample
Statistic                T         P>|T|          P>|T|      Power vs d = 4.27
Kolmogorov-Smirnov     0.101        0.846          0.898              NA

Analogous Stata command
. rdrandinf Y X, wl(-2.5) wr(2.5) seed(50) statistic(ksmirnov)

The bottom panel now reports the value of the KS statistic in the chosen window, which is 0.101. The randomization-based test of the Fisherian sharp null hypothesis $H_0^F$ based on this statistic has p-value 0.846, considerably larger than the 0.488 p-value found in the same window (and with the same fixed margins randomization) when the difference-in-means was chosen instead. Note that the large-sample results now report a large-sample approximation to the KS test p-value, and not a test of the Neyman null hypothesis $H_0^N$. Moreover, the KS statistic has no interpretation as a treatment effect in either case.

Finally, we illustrate how to obtain confidence intervals in our call to rdrandinf. Remember that, in the Fisherian framework, confidence intervals are obtained by inverting tests of sharp null hypotheses. To implement this inversion, we must specify a grid of τ values; rdrandinf will then test the null hypotheses $H^F_{\tau_0}: Y_i(1) - Y_i(0) = \tau_0$ for all values of $\tau_0$ in the grid, and collect in the confidence interval all the values that fail to be rejected in a randomization-based test of the desired level (the default is level α = 0.05). To calculate these confidence intervals, we create the grid, and then call rdrandinf with the ci option. For this example, we choose a grid of values for $\tau_0$ between -10 and 10, with 0.25 increments. Thus, we test $H^F_{\tau_0}$ for all $\tau_0 \in \mathcal{G}_{\tau_0} = \{-10, -9.75, -9.50, \ldots, 9.50, 9.75, 10\}$.

> ci_vec = c(0.05, seq(from = -10, to = 10, by = 0.25))
> out = rdrandinf(Y, X, wl = -2.5, wr = 2.5, seed = 50, reps = 1000,
+                 ci = ci_vec)

Selected window = [-2.5;2.5]
Running randomization-based test...
Randomization-based test complete.
Running sensitivity analysis...
Sensitivity analysis complete.

Number of obs       =           2629
Order of poly       =              0
Kernel type         =        uniform
Reps                =           1000
Window              =    set by user
H0:            tau  =              0
Randomization       =  fixed margins

Cutoff c = 0          Left of c   Right of c
Number of obs              2314          315
Eff. number of obs           68           62
Mean of outcome          13.972       15.044
S.d. of outcome           8.541        9.519
Window                     -2.5          2.5

                            Finite sample   Large sample
Statistic             T         P>|T|          P>|T|      Power vs d = 4.27
Diff. in means      1.072       0.488          0.501            0.765

95% confidence interval: [-2, 4]

Analogous Stata command
. rdrandinf Y X, wl(-2.5) wr(2.5) seed(50) ci(0.05 -10(0.25)10)

The Fisherian 95% confidence interval is [-2, 4]. As explained, this confidence interval assumes a constant treatment effect model. The interpretation is therefore that, given the assumed randomization mechanism, all values of τ between -2 and 4 in the constant treatment effect model $Y_i(1) = Y_i(0) + \tau$ fail to be rejected with a randomization-based 5%-level test. In other words, in this window, and given a constant treatment effect model, the empirical evidence based on a local randomization RD framework is consistent with both negative and positive true effects of Islamic victory on the female education share.

5.3  How to Choose the Window

In the previous sections, we assumed that W0 was known. However, in practice, even when a researcher is willing to assume that there exists a window around the cutoff where the treatment is as-if randomly assigned, the location of this window will typically be unknown. This is another fundamental difference between local randomization RD designs and actual randomized controlled experiments, since in the latter there is no ambiguity about the population of units that were subject to the random assignment of the treatment. Thus, the most important step in the implementation of the local randomization RD approach is to select the window around the cutoff where the treatment can be plausibly assumed to have been as-if randomly assigned.

One option is to choose the randomization window in an ad-hoc way, selecting a small neighborhood around the cutoff where the researcher is comfortable assuming local randomization. For example, a scholar may believe that elections decided by 0.5 percentage points or less are essentially decided as if by the flip of a coin, and choose the window [x̄ − 0.5, x̄ + 0.5]. The obvious disadvantage of selecting the window arbitrarily is that the resulting choice is based neither on empirical evidence nor on a systematic procedure, and thus lacks objectivity and replicability.

A preferred alternative is to choose the window using the information provided by relevant predetermined covariates—variables that reflect important characteristics of the units, and whose values are determined before the treatment is assigned and received. This approach requires assuming that there exists at least one important predetermined covariate of interest, Z, that is related to the running variable everywhere except inside the window W0. Figure 5.3 shows a hypothetical illustration, where the conditional expectation of Z given the score, E(Z|X), is plotted against X. Outside of W0, E(Z|X) and X are related: the relationship is mildly U-shaped to the left of x̄, and monotonically increasing to the right—possibly due to correlation between the score and another characteristic that also affects Z. However, inside the window W0 where local randomization holds, this relationship disappears by virtue of applying conditions LR1 and LR2 to Z, taking Z as an "outcome" variable. Moreover, because Z is a predetermined covariate, the effect of the treatment on Z is zero by construction. In combination, these assumptions imply that there is no association between E(Z|X) and X inside W0, but these two variables are associated outside of W0.

This suggests a data-driven method to choose W0. We define a null hypothesis H0 stating that the treatment is unrelated to Z (or that Z is "balanced" between the groups). In theory, this hypothesis could be the Fisherian hypothesis $H_0^F$ or the Neyman hypothesis $H_0^N$. However, since the procedure we discuss will be based on some small windows with very few observations, we recommend using randomization-based tests of the Fisherian hypothesis, which takes the form $H_0^F: Z_i(1) = Z_i(0)$. Naturally, the effect of the treatment on Z is zero for all units inside W0 because the covariate is predetermined. However, the window selection procedure is based on the assumption that, outside W0, the treatment and control groups differ systematically in Z—not because the treatment has a causal effect on Z, but rather because the running variable is correlated with Z outside W0. This assumption is important; without it, the window selector will not recover the true W0.

The procedure starts with the smallest possible window—W1 in Figure 5.3—and tests the null hypothesis of no effect, H0. Since there is no relationship between Z and X inside W1, H0 will fail to be rejected. Once H0 fails to be rejected, a larger window W2 is selected, and the null hypothesis is tested again inside W2. The procedure keeps increasing the length of the window and re-testing H0 in each larger window, until a window is reached where H0 is rejected at the chosen significance level α* ∈ (0, 1). In the figure, assuming the test has perfect power, the null hypothesis will not be rejected in W0, nor will it be rejected in W2 or W1. The chosen window is the largest window such that H0 fails to be rejected inside that window and in all windows contained in it; in Figure 5.3, the chosen window is W0. In spirit, this nested procedure is analogous to the use of covariate balance tests in randomized controlled experiments: in essence, it chooses the largest window such that covariate balance holds in that window and in all smaller windows inside it. As we show in our empirical illustration, this data-driven window selection method can be implemented with several covariates, for example, rejecting a particular window choice when H0 is rejected for at least one covariate.

As mentioned, we recommend choosing $H_0^F$ as the null hypothesis. In addition, the practical implementation of the procedure requires several other choices:

• Choose the relevant covariates. Researchers must decide which covariates to use in the window selection procedure; these covariates should be related to both the outcome and the treatment assignment. If multiple covariates are chosen, the procedure can be applied using either the p-value of an omnibus test statistic, or by testing H0 for each covariate separately and then making the decision to reject H0 based on the minimum p-value across all covariates—i.e., rejecting a particular window choice when H0 is rejected for at least one covariate.

• Choose the test statistic. Researchers must choose the statistic on which the randomization-based test of the Fisherian null hypothesis will be based. This can be the difference-in-means, one of the alternative statistics discussed above, or other possibilities.

• Choose the randomization mechanism. Researchers must select the randomization mechanism that will be assumed inside the window to test the sharp null hypothesis $H_0^F$ using Fisherian methods. In many applications, an appropriate choice is a complete randomization mechanism, in which every possible assignment of $n_{W_0,+}$ units in the window to treatment has probability $1/\binom{n_{W_0}}{n_{W_0,+}}$, as discussed above.


Figure 5.3: Window Selector Based on Covariate Balance in Locally Random RD
[The figure plots the conditional expectation of the predetermined covariate given the score against the score X, around the cutoff x̄. H0 is true inside the window W0 and false outside of it; the nested symmetric windows W1 ⊂ W2 ⊂ W0 ⊂ W3 ⊂ · · · ⊂ W6 are marked on the horizontal axis (Score X) by their endpoints ±w1, ±w0, ±w3, …, ±w6.]

• Choose a minimum number of observations in the smallest window. In actual applications, if the smallest window around x̄ where the null hypothesis is tested is too small, it will contain too few observations and the test will not have enough power to reject the null hypothesis even if it is false. Thus, researchers must ensure that the smallest window considered contains a minimum number of observations to ensure acceptable power; we recommend that the minimum window have at least roughly ten observations on either side of the cutoff.

• Choose α*. The threshold significance level determines when a null hypothesis is considered rejected. Since the main concern is failing to reject a null hypothesis when it is false—in contrast to the usual concern about rejecting a true null hypothesis—the level of the test should be higher than conventional levels. When we test H0 at a higher level, we tolerate a higher probability of Type I error in exchange for a lower probability of concluding that the covariate is unrelated to the treatment assignment when it is in fact related. We recommend setting α* to 0.15 if possible, and ideally no smaller than 0.10.

Once researchers have selected the relevant covariates, the test statistic, the randomization mechanism in the window, and the threshold significance level α*, the window selection procedure can be implemented with the following algorithm (a schematic base-R sketch of this loop is shown below):

1. Start with a symmetric window Wj = [x̄ − wj, x̄ + wj], beginning with the smallest window under consideration.

2. For every covariate Zk, k = 1, 2, ..., K, use the chosen test statistic and the chosen randomization mechanism to test $H_0^F$ in a Fisherian framework, using only observations whose score values are inside Wj. Compute the associated p-value, pk. (If using an omnibus test, compute the single omnibus p-value.)

3. Compute the minimum p-value, pmin = min(p1, p2, ..., pK). (If using an omnibus test, replace pmin by the single omnibus p-value.)

   (a) If pmin > α*, do not reject the null hypothesis and increase the length of the window by 2·wstep, setting Wj+1 = [x̄ − wj − wstep, x̄ + wj + wstep]. Set Wj = Wj+1 and go back to step 2.

   (b) If pmin ≤ α*, reject the null hypothesis and conclude that the largest window where the local randomization assumption is plausible is the previous window, Wj−1.

To see how the procedure works in practice, we use it to select a window in the Meyersson application using the set of predetermined covariates described in Section 1.2: vshr_islam1994, partycount, lpop1994, i89, merkezi, merkezp, subbuyuk, buyuk. We use the function rdwinselect, which is one of the functions in rdlocrand. The main arguments are the score variable X, the matrix of predetermined covariates, and the sequence of nested windows; for simplicity, only symmetric windows are considered. We also choose 1,000 simulations for the calculation of Fisherian p-values in each window.

There are two ways to increment the length of the windows in rdwinselect. One is to increment the length of the window in fixed steps, which can be implemented with the option wstep: for example, if the first window selected is [-0.1, 0.1] and wstep = 0.1, the sequence is W1 = [-0.1, 0.1], W2 = [-0.2, 0.2], W3 = [-0.3, 0.3], and so on. The other is to increase the length of the window so that the number of observations increases by a minimum fixed amount in every step, which can be done via the option wobs: for example, by setting wobs = 2, every window in the sequence is the smallest symmetric window such that the number of added observations on each side of the cutoff, relative to the prior window, is at least 2. By default, rdwinselect starts with the smallest window that has at least 10 observations on either side, but this default behavior can be changed with the options wmin or obsmin. Finally, rdwinselect uses the chosen level α* to recommend the window; the default is α* = 0.15, but this can be modified with the level option.

We start by considering a sequence of symmetric windows where we increase the length in every step by the minimum amount needed to add at least 2 observations on each side of the cutoff.
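Before turning to rdwinselect, the following minimal base-R sketch illustrates the nested balance-testing loop just described, under the simplifying assumptions of complete randomization, the difference-in-means balance statistic, and symmetric windows that grow in fixed steps. The function and argument names are ours, and rdwinselect implements a more refined version of this logic; here X is the score centered at the cutoff and Z is a matrix of predetermined covariates.

select_window = function(X, Z, wmin, wstep, wmax, alpha_star = 0.15, B = 1000) {
  balance_p = function(z, t) {                 # randomization p-value for one covariate
    s_obs = mean(z[t == 1]) - mean(z[t == 0])
    s_sim = replicate(B, { tp = sample(t); mean(z[tp == 1]) - mean(z[tp == 0]) })
    mean(abs(s_sim) >= abs(s_obs))
  }
  w_selected = NA
  for (w in seq(wmin, wmax, by = wstep)) {
    inside = abs(X) <= w
    t = as.numeric(X[inside] >= 0)             # above/below the cutoff
    p_min = min(apply(Z[inside, , drop = FALSE], 2, balance_p, t = t))
    if (p_min <= alpha_star) break             # imbalance detected: stop enlarging
    w_selected = w                             # largest window so far with balance
  }
  w_selected
}
# Hypothetical call: select_window(X, Z, wmin = 0.5, wstep = 0.25, wmax = 10)

The sketch returns the largest window examined before the first covariate imbalance is detected, which is the same logic that underlies the recommendation reported by rdwinselect.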


In rdwinselect, we achieve this by setting wobs = 2.

> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdwinselect(X, Z, seed = 50, reps = 1000, wobs = 2)

Window selection for RD under local randomization

Number of obs    =       2629
Order of poly    =          0
Kernel type      =    uniform
Reps             =       1000
Testing method   =  rdrandinf
Balance test     =  diffmeans

Cutoff c = 0        Left of c   Right of c
Number of obs            2314          315
1st percentile             24            3
5th percentile            115           15
10th percentile           231           31
20th percentile           463           62

Window length/2   p-value   Var. name   Bin. test   Obs<c   Obs>=c
0.446              0.451    i89           1            9       10
0.486              0.209    i89           1           11       12
0.536              0.253    i89           1           12       13
0.699              0.238    i89           0.585       13       17
0.856              0.27     i89           0.377       13       19
0.944              0.241    i89           0.627       17       21
1.116              0.048    i89           0.28        17       25
1.274              0.059    i89           0.371       19       26
1.343              0.352    merkezi       0.312       20       28
1.42               0.365    i89           0.272       22       31

Recommended window is [-0.944;0.944] with 38 observations (17 below, 21 above).

Analogous Stata command
. global covs "i89 vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk"
. rdwinselect X $covs, seed(50) wobs(2)

The top and middle panels in the rdwinselect output are very similar to the corresponding panels in the rdrandinf output. One difference is the Testing method, which indicates whether randomization-based methods are used to test $H_0^F$ or Normal approximation methods are used to test $H_0^N$; the default is randomization-based methods, but this can be changed with the approximate option. The other difference is the Balance test, which indicates the type of test statistic used for testing the null hypothesis—the default is diffmeans, the difference-in-means. The option statistic allows the user to select a different test statistic; the available options are


the Kolmogorov-Smirnov statistic (ksmirnov), the Wilcoxon-Mann-Whitney studentized statistic (ranksum), and Hotelling's T-squared statistic (hotelling), all of which we defined above.

The bottom panel shows the results of the tests of the null hypothesis for each window considered. By default, rdwinselect starts with the smallest symmetric window that has at least 10 observations on either side of the cutoff. Since we set wobs = 2, we continue to consider the smallest possible (symmetric) windows such that at least 2 observations are added on each side of the cutoff in every step. For every window, the column p-value reports either pmin—the minimum of the K p-values, p1, p2, ..., pK, associated with a test of the null hypothesis of no effect for each of the covariates Z1, Z2, ..., ZK—or the unique p-value if an omnibus test is used to test H0 jointly for all covariates. The column Var. name reports the covariate associated with the minimum p-value—that is, the covariate Zk such that pk = pmin. Finally, the column Bin. test uses a Binomial test to calculate the probability of observing $n_{W,+}$ successes out of $n_W$ trials, where $n_{W,+}$ is the number of observations within the window that are above the cutoff (reported in the column Obs>=c) and $n_W$ is the total number of observations within the window (which can be calculated by adding the numbers reported in the columns Obs<c and Obs>=c).

To examine a longer sequence of windows and to visualize the minimum p-values, we call rdwinselect again, increasing the number of windows with the option nwindows and requesting the plot shown in Figure 5.4 with the option plot.

Z = cbind ( data $ i89 , data $ vshr _ islam1994 , data $ partycount , data $ lpop1994 , data $ merkezi , data $ merkezp , data $ subbuyuk , data $ buyuk ) colnames ( Z ) = c ( " i89 " , " vshr _ islam1994 " , " partycount " , " lpop1994 " , " merkezi " , " merkezp " , " subbuyuk " , " buyuk " ) out = rdwinselect (X , Z , seed = 50 , reps = 1000 , wobs = 2 , nwindows = 50 , plot = TRUE )

Window selection for RD under local randomization Number of obs Order of poly Kernel type Reps Testing method Balance test

= = = = = =

2629 0 uniform 1000 rdrandinf diffmeans



Cutoff c = 0 Number of obs 1 st percentile 5 th percentile 10 th percentile 20 th percentile Window length / 2 0.446 0.486 0.536 0.699 0.856 0.944 1.116 1.274 1.343 1.42 1.49 1.556 1.641 1.858 1.914 2.019 2.097 2.158 2.319 2.433 2.583 2.643 2.746 3.009 3.051 3.094 3.178 3.462 3.595 3.704 3.821 3.963 4.181 4.287 4.417 4.488 4.585 4.719 4.899 5.03 5.2 5.346 5.421 5.482 5.593 5.676 5.779 5.878 6.039 6.266

Left of c 2314 24 115 231 463 p - value 0.451 0.209 0.253 0.238 0.27 0.241 0.048 0.059 0.352 0.365 0.171 0.31 0.188 0.206 0.244 0.26 0.211 0.096 0.048 0.016 0.008 0.006 0.002 0.002 0.005 0.001 0 0 0 0 0 0 0 0 0 0.001 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Right of c 315 3 15 31 62 Var . name i89 i89 i89 i89 i89 i89 i89 i89 merkezi i89 merkezi merkezi merkezi merkezi merkezi merkezi vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994


Bin . test 1 1 1 0.585 0.377 0.627 0.28 0.371 0.312 0.272 0.229 0.245 0.26 0.328 0.403 0.416 0.653 0.657 1 0.839 0.842 0.922 1 1 1 0.928 0.929 0.491 0.444 0.356 0.251 0.258 0.153 0.138 0.192 0.168 0.113 0.1 0.067 0.095 0.098 0.089 0.078 0.092 0.081 0.073 0.075 0.119 0.053 0.034

Obs < c 9 11 12 13 13 17 17 19 20 22 23 25 27 29 31 33 37 38 44 50 52 53 54 56 58 62 64 72 74 77 82 84 89 92 94 95 99 101 106 107 109 112 114 114 116 119 120 121 128 131

Obs >= c 10 12 13 17 19 21 25 26 28 31 33 35 37 38 39 41 42 43 45 47 49 51 54 56 58 60 62 63 64 65 67 69 70 72 76 76 77 78 80 83 85 87 88 89 90 92 93 97 98 98



Recommended window is [ -0.944;0.944] with 38 observations (17 below , 21 above ) . Analogous Stata command . global covs "i89 vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk" . rdwinselect X $covs, seed(50) wobs(2) nwindows(50) plot

Figure 5.4: Window selection: minimum P-value against window length—Meyersson data

0.4



● ●

0.3



● ● ●

0.2

Pvals











● ●

0.1





● ●



0.0



1

2

●● ●

●●●●

● ●● ● ●

3

4

● ● ●●● ● ● ● ● ●●● ●● ●● ●

5



6

window.list

From the plot, we see that, in the sequence of windows considered, the minimum p-value is above 0.20 for all windows smaller than [-1, 1], and decreases below 0.1 as the windows get larger than [-1, 1]. Although the p-values increase again for windows between [-1, 1] and [-2, 2], they decrease sharply once windows larger than [-2, 2] are considered.

If we want to choose the window using a sequence of symmetric windows of fixed length rather than controlling the minimum number of observations, we simply use the wstep option. Calling rdwinselect with wstep = 0.1 performs the covariate balance tests in a sequence of windows that starts at the minimum window and increases the length by 0.1 on each side of the cutoff.

> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdwinselect(X, Z, seed = 50, reps = 1000, wstep = 0.1, nwindows = 25)




Window selection for RD under local randomization Number of obs Order of poly Kernel type Reps Testing method Balance test

= = = = = =

Cutoff c = 0 Number of obs 1 st percentile 5 th percentile 10 th percentile 20 th percentile Window length / 2 0.446 0.546 0.646 0.746 0.846 0.946 1.046 1.146 1.246 1.346 1.446 1.546 1.646 1.746 1.846 1.946 2.046 2.146 2.246 2.346 2.446 2.546 2.646 2.746 2.846

2629 0 uniform 1000 rdrandinf diffmeans Left of c 2314 24 115 231 463 p - value 0.451 0.241 0.22 0.241 0.243 0.234 0.061 0.069 0.057 0.366 0.254 0.24 0.187 0.198 0.224 0.242 0.259 0.095 0.066 0.056 0.012 0.011 0.006 0 0.003

Right of c 315 3 15 31 62 Var . name i89 i89 i89 i89 i89 i89 i89 i89 i89 merkezi merkezi i89 merkezi merkezi i89 merkezi merkezi vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994 vshr _ islam1994

Bin . test 1 1 0.851 0.473 0.377 0.627 0.349 0.28 0.451 0.312 0.22 0.298 0.26 0.215 0.268 0.477 0.567 0.657 0.914 1 0.839 0.841 0.922 1 1

Obs < c 9 12 13 13 13 17 17 17 19 20 22 25 27 27 28 32 35 38 42 45 50 51 53 53 55

Obs >= c 10 13 15 18 19 21 24 25 25 28 32 34 37 38 38 39 41 43 44 45 47 48 51 54 55

Recommended window is [ -0.946;0.946] with 38 observations (17 below , 21 above ) . Analogous Stata command . global covs "i89 vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk" . rdwinselect X $covs, seed(50) wstep(0.1) nwindows(25)

The suggested window is [-0.946, 0.946], very similar to the [-0.944, 0.944] window chosen above with the wobs = 2 option.


We can now use rdrandinf to perform a local randomization analysis in the chosen window. For this, we use the options wl and wr to input, respectively, the lower and upper limits of the chosen window W* = [-0.944, 0.944]. We also use the option d = 3.0195 to calculate the power of a Neyman test to reject the null hypothesis of a zero average treatment effect when the true average difference is 3.0195; this value is the linear polynomial point estimate obtained in Section 4 with the continuity-based approach.

The difference-in-means in the chosen window W* = [-0.944, 0.944] is 2.638, quite similar to the continuity-based local linear point estimate of 3.0195. However, using a Neyman approach, we cannot distinguish this average difference from zero, with a p-value of 0.333. Similarly, we fail to reject the Fisherian sharp null hypothesis that an electoral victory of the Islamic party has no effect on the female education share for any municipality (p-value 0.386). As shown in the last column, the large-sample power to detect a difference of around 3 percentage points is only 19.4%. Naturally, the small number of observations in the chosen window (22 and 27 below and above the cutoff, respectively) limits statistical power. In addition, the effect of 2.638, although much larger than the 1.072 estimated in the ad-hoc [-2.5, 2.5] window, is still a very small effect: it is less than a third of one standard deviation of the female education share in the control group, since 2.638/8.615 = 0.306. Consistent with these results, the Fisherian 95% confidence interval under a constant treatment effect model is [-2.8, 8.5], consistent with both positive and negative effects.

> ci_vec = c(0.05, seq(from = -10, to = 10, by = 0.1))
> out = rdrandinf(Y, X, wl = -0.944, wr = 0.944, seed = 50, reps = 1000,
+                 ci = ci_vec)

Selected window = [-0.944;0.944]
Running randomization-based test...
Randomization-based test complete.
Running sensitivity analysis...
Sensitivity analysis complete.

Number of obs       =           2629
Order of poly       =              0
Kernel type         =        uniform
Reps                =           1000
Window              =    set by user
H0:            tau  =              0
Randomization       =  fixed margins

Cutoff c = 0          Left of c   Right of c
Number of obs              2314          315
Eff. number of obs           23           27
Mean of outcome          13.579       16.217
S.d. of outcome           8.615       10.641
Window                   -0.944        0.944

                            Finite sample   Large sample
Statistic             T         P>|T|          P>|T|      Power vs d = 4.307
Diff. in means      2.638       0.386          0.333             0.353

95% confidence interval: [-2.8, 8.5]

Analogous Stata command
. rdrandinf Y X, wl(-0.944) wr(0.944) seed(50) ci(0.05 -10(0.1)10)

Finally, we mention that, instead of calling rdwinselect first and rdrandinf second, we can choose the window and perform inference in one step by using the covariates option in rdrandinf.

> Z = cbind(data$i89, data$vshr_islam1994, data$partycount, data$lpop1994,
+           data$merkezi, data$merkezp, data$subbuyuk, data$buyuk)
> colnames(Z) = c("i89", "vshr_islam1994", "partycount", "lpop1994",
+                 "merkezi", "merkezp", "subbuyuk", "buyuk")
> out = rdrandinf(Y, X, covariates = Z, seed = 50, d = 3.019522)

Running rdwinselect...
rdwinselect complete.
Selected window = [-0.917068123817444;0.917068123817444]
Running randomization-based test...
Randomization-based test complete.

Number of obs       =           2629
Order of poly       =              0
Kernel type         =        uniform
Reps                =           1000
Window              =    rdwinselect
H0:            tau  =              0
Randomization       =  fixed margins

Cutoff c = 0          Left of c   Right of c
Number of obs              2314          315
Eff. number of obs           22           27
Mean of outcome          13.266       16.217
S.d. of outcome           8.682       10.641
Window                   -0.917        0.917

                            Finite sample   Large sample
Statistic             T         P>|T|          P>|T|      Power vs d = 3.02
Diff. in means      2.952       0.301          0.285             0.194

Analogous Stata command
. global covs "i89 vshr_islam1994 partycount lpop1994 merkezi merkezp subbuyuk buyuk"
. rdrandinf Y X, covariates($covs) seed(50) d(3.019522)

However, it is usually better to first choose the window using rdwinselect and then use rdrandinf. The reason is that rdwinselect never shows any outcome results, so choosing the window with rdwinselect first reduces the possibility of choosing the window where the outcome results are in the "expected" direction—in other words, settling on a window before looking at the outcome results minimizes pre-testing and specification-searching concerns.

5.4  When To Use The Local Randomization Approach

Unlike an experiment, the treatment assignment mechanism in the RD design does not logically imply that the treatment is randomly assigned within some window. Like the continuity assumption, the local randomization assumption must be made in addition to the RD assignment mechanism, and it is not directly testable. But the local randomization assumption is strictly stronger than the continuity assumption, in the sense that if there is a window around x̄ in which the regression functions are flat, then these regression functions will also be continuous at x̄—but the converse is not true. Why, then, would researchers want to impose stronger assumptions to make their inferences?

In order to see in what type of situations the stronger assumption of local randomization is appropriate, it is useful to remember that the local polynomial approach, although based on the weaker condition of continuity, necessarily relies on extrapolation because there are no observations exactly at the cutoff. The continuity assumption does not impose a specific functional form on the regression functions near the cutoff, as these functions are approximated using nonparametric methods; however, this approximation relies on extrapolation and introduces an approximation error that is negligible only if the sample size is large enough. This makes the continuity-based approach more appealing when there are enough observations near the cutoff to approximate the shape of the regression functions with reasonable accuracy—but possibly inadequate when the number of observations is small. In applications with few observations, the local randomization approach has the advantage of requiring minimal extrapolation and avoiding the use of smoothing methods.

Another situation in which a local randomization approach may be preferable to a continuity-based approach is when the running variable is discrete—i.e., when the set of values that the score can take is countable. Examples of discrete scores include age measured in years or days, population counts, or vote totals. By definition, a discrete running variable will have mass points: multiple units will have a score of the same value. In contrast, when the score is continuous, the probability that two units have the same score value is zero; examples of continuous scores include vote shares, income, and poverty rates. The continuity-based approach described above requires a continuous running variable and is not applicable when the running variable is discrete without further assumptions; in this case, the local randomization approach is a natural alternative. Because RD designs with discrete running variables are ubiquitous in the social sciences, we discuss them in detail in Section 7.

5.5  Recommendations for Practice

The decision to use a large-sample or a finite-sample approach in the implementation of a local randomization RD analysis must be made on a case-by-case basis. Methods based on large-sample approximations will be most appropriate when the sample size inside the window W0 is sufficiently large. In many applications, however, the number of observations within W0 will tend to be very small. The reason is that, in most RD applications, there is a tension between the plausibility of the local randomization assumption and the length of the window around the cutoff where this assumption is invoked: the smaller the window, the more similar the values of the score for units inside the window, and the more credible the local randomization assumption tends to be. Since a small window will tend to contain a small number of observations, in many applications large-sample methods will tend to be unreliable.

Therefore, Fisherian finite-sample inference methods are natural and most useful when the number of observations inside W0 is small (for example, less than 30 on each side), while either large-sample or finite-sample methods can be used when the number of observations in W0 is sufficiently large to ensure that large-sample approximations are reliable. When using Fisherian methods, employing a fixed margins randomization and the difference-in-means test statistic is quite natural, especially if there are no concerns about outliers; an advantage of the difference-in-means test statistic is that it can also be given a point estimator interpretation in the Neyman framework. For window selection purposes, setting α* to 0.15 and starting with a minimum window that has at least 10 observations on either side, whenever possible, is reasonable and has proven to work well in several applications.

5.6 Further Readings

Textbook reviews of Fisherian and Neyman estimation and inference methods in the context of the analysis of experiments are given by Rosenbaum (2002, 2010) and Imbens and Rubin (2015). The latter book also discusses super-population approaches and their connections to finite population inference methods. Ernst (2004) gives a nice discussion of the connections and distinctions between randomization inference and permutation inference methods. Cattaneo et al. (2015) were the first to propose Fisherian randomization-based inference to analyze RD designs based on a local randomization assumption; these authors also proposed the window selection procedure based on balance tests on predetermined covariates. Cattaneo et al. (2017d) relaxed the local randomization assumption to allow for a weaker exclusion restriction, and also compared RD analysis in the continuity-based and randomization-based approaches. The interpretation of the RD design as a local experiment and its connection to the continuity-based framework is also discussed by Sekhon and Titiunik (2016, 2017).

6 Validation and Falsification of the RD Design

A main advantage of the RD design is that the mechanism by which treatment is assigned is known and based on observable features, giving researchers an objective basis to distinguish pre-treatment from post-treatment variables, and to identify qualitative information regarding the treatment assignment process that can be helpful to justify assumptions. However, the known rule that assigns treatment based on whether a score exceeds a cutoff is not by itself enough to guarantee that the assumptions needed to recover the causal effect of interest are met. For example, a scholarship may be assigned based on whether the grade students receive on a test is above a cutoff, but if the cutoff is known to the students' parents and there are mechanisms to appeal the grade, then the RD design may be invalid whenever systematic differences among students arise through the appeal process. More formally, a systematically successful appeal process could invalidate the assumption that the average potential outcomes are continuous at the cutoff. To give a concrete example, if some parents decide to appeal the grade when their child is barely below the cutoff and, crucially, these parents are systematically successful and different from other parents who choose not to appeal, then the RD design based on the final grade assigned to each student would be invalid (while the RD design based on the original grade assigned would not). The reasoning is as follows: parents who choose to appeal on behalf of students whose score is barely below the cutoff and (systematically) manage to change their child's score so that it reaches the cutoff may also be more involved in other aspects of their children's education, which could also have a direct impact on the outcome variable of interest. For instance, if parents' involvement affects students' future academic achievement, then, on average, the potential outcomes of students above the cutoff may be discontinuously different from the potential outcomes of students below the cutoff, making the RD design invalid for causal inference at the cutoff. In other words, students barely below and barely above the cutoff would be systematically different in ways that also affect the outcome of interest, making it impossible to disentangle the effect of the treatment from the effect of their other systematic underlying (usually unobserved) differences.

In general, if the cutoff that determines treatment is known to the units that will be the beneficiaries of the treatment, researchers must worry about the possibility that units actively change or manipulate the value of their score when they barely miss the treatment. The first type of information that can be provided is whether an institutionalized mechanism to appeal the score exists and, if so, how often it is used to successfully change the score and which units use it. Qualitative data about the administrative process by which scores are assigned, cutoffs determined and publicized, and treatment decisions appealed is extremely useful to validate the design. To give another empirical example, social programs are commonly assigned based on some poverty index: if the program officers bump units with an index barely below the cutoff to the treatment group in a systematic way (e.g., all households with small children), then the RD design would be invalid whenever the systematic differences between units near the cutoff have a direct effect on the outcome of interest.
This type of behavior can typically be identified using qualitative information from the program administration officers. In many cases, however, qualitative information will be limited and the possibility of units manipulating their score cannot be ruled out. Crucially, the fact that there are no institutionalized mechanisms to appeal and change scores does not mean that there are no informal mechanisms by which this may happen. Thus, an essential step in evaluating the plausibility of the RD assumptions is to provide empirical evidence supporting the validity of the design. Naturally, the continuity and local randomization assumptions that guarantee the validity of the RD design are about unobservable features and as such are inherently untestable. At the same time, the RD design is perhaps the one non-experimental research design that offers an array of empirical methods geared toward providing plausible evidence in favor of its validity. More precisely, there are several important empirical implications of the unobservable assumptions underlying RD designs that can be expected to hold in most cases and can provide indirect evidence about the design's validity. We consider three such empirical tests, based on: (i) continuity of the score density around the cutoff, (ii) null treatment effects on pre-treatment covariates or placebo outcomes, and (iii) treatment effects at artificial cutoff values, exclusion of nearby observations, and bandwidth choices. As we discuss below, the implementation of each of these tests differs according to whether a continuity or a local randomization assumption is invoked.

6.1 Density of Running Variable

The first type of falsification test examines whether, in a local neighborhood near the cutoff, the number of observations below the cutoff is "surprisingly" different from the number of observations above it. The underlying assumption is that if individuals do not have the ability to precisely manipulate the value of the score that they receive, the number of treated observations just above the cutoff should be approximately similar to the number of control observations just below it. In other words, even if units actively attempt to manipulate their score, in the absence of precise manipulation, random chance would place roughly the same number of units on either side of the cutoff, leading to a continuous probability density function when the score is continuously distributed. Although this condition is neither necessary nor sufficient for the validity of an RD design, RD applications where there is an unexplained abrupt change in the number of observations right at the cutoff will tend to be less credible. This kind of test is often called a density test. Figure 6.1 shows a histogram of the running variable in two hypothetical RD examples. In the scenario illustrated in Figure 6.1(a), the number of observations above and below the cutoff is very similar. In contrast, Figure 6.1(b) illustrates a case in which the density of the score right below the cutoff is considerably lower than just above it—a finding that is compatible with units systematically increasing the value of their original score so that they are assigned to the treatment group instead of the control group.

Figure 6.1: Histogram of Score. Panels: (a) No sorting; (b) Sorting.

In addition to a graphical illustration of the density of the running variable, researchers should test the assumption more formally. The implementation of the formal test depends on whether one adopts a continuity-based or a local randomization approach to RD analysis. In the former approach, the null hypothesis is that the density of the running variable is continuous at the cutoff, and its implementation requires the estimation of the density of observations near the cutoff, separately for observations above and below the cutoff. We employ here an implementation based on a local polynomial density estimator that does not require pre-binning of the data and leads to size and power improvements relative to other approaches. The null hypothesis is that there is no "manipulation" of the density at the cutoff, formally stated as continuity of the density functions of the control and treatment units at the cutoff. Therefore, failing to reject implies that there is no statistical evidence of manipulation at the cutoff, and thus offers support in favor of the RD design. In order to perform this density test using the Meyersson data, we use the rddensity command, which only needs to receive the running variable as an argument.

> out = rddensity(X)
> summary(out)

RD Manipulation Test using local polynomial density estimation.

Number of obs =       2629
Model =               unrestricted
Kernel =              triangular
BW method =           comb
VCE method =          jackknife

Cutoff c = 0              Left of c    Right of c
Number of obs             2314         315
Eff. Number of obs        965          301
Order est. (p)            2            2
Order bias (q)            3            3
BW est. (h)               30.54        28.285

Method       T          P > |T|
Robust       -1.394     0.1633

Analogous Stata command
. rddensity X

Figure 6.2: Histogram and Estimated Density of the Score. Panels: (a) Histogram; (b) Estimated Density.

The value of the test statistic is −1.394 and the associated p-value is 0.1633. This means that, under the continuity-based approach, we fail to reject the null hypothesis of no difference in the density of treated and control observations at the cutoff. Figure 6.2 provides a graphical representation of the continuity-in-density test, exhibiting both a histogram of the data (panel (a)) and the actual density estimate (panel (b)).

The implementation of the density test is different under the local randomization approach. In this case, the null hypothesis is that, within the window W0 where the treatment is assumed to be randomly assigned, the number of observations in the control and treatment groups is consistent with whatever assignment mechanism is assumed to have generated the treatment assignment. For example, assuming a simple "coin flip" or Bernoulli trial with success probability q, we would expect the control sample size nW0,− and the treatment sample size nW0,+ within W0 to be compatible with what a treatment assignment based on Bernoulli trials with a pre-specified treatment probability q would generate. The assumption is therefore that the number of treated and control units in W0 follows a binomial distribution, and the null hypothesis of the test is that the probability of success in the nW0 Bernoulli experiments is q. As we have discussed, the true probability of treatment is unknown, but in practice q = 1/2 is the most natural choice in the absence of additional information (and this choice can be justified from a large-sample perspective when the score is continuous).

The binomial test is implemented in all common statistical software; it is also part of the rdlocrand package, where it is implemented via the command rdwinselect. Using the Meyersson data, we can implement this falsification test after selecting a window W0 where the local randomization assumption is assumed to hold. Since we need to implement the density test in only this window, we use the option nwindows(1). We employ W0 = [−0.944, 0.944], as selected in the previous section.

> out = rdwinselect(X, wmin = 0.944, nwindows = 1)

Window selection for RD under local randomization

Number of obs  =    2629
Order of poly  =    0
Kernel type    =    uniform
Reps           =    1000
Testing method =    rdrandinf
Balance test   =    diffmeans

Cutoff c = 0         Left of c    Right of c
Number of obs        2314         315
1st percentile       24           3
5th percentile       115          15
10th percentile      231          31
20th percentile      463          62

Window length / 2   p-value   Var. name   Bin. test   Obs < c   Obs >= c
0.944               NA        NA          0.672       23        27

Analogous Stata command
. rdwinselect X, wmin(0.944) nwindows(1)

The important results in this case are the numbers of observations on each side of the cutoff (23 and 27). These observation counts and the probability q = 1/2 are the ingredients for the binomial test, which has a p-value of 0.672, as shown in the Bin. test column of the rdwinselect output. Based on this test, we find no evidence of "sorting" around the cutoff in the window W0 = [−0.944, 0.944]—in other words, the difference in the number of treated and control observations in this window is entirely consistent with what would be expected if municipalities were assigned to an Islamic win or loss by the flip of an unbiased coin. The binomial test is also implemented in the base distributions of R and Stata. We can perform the same test by simply executing the canned binomial test command.

> binom.test(27, 50, 1/2)

        Exact binomial test

data:  27 and 50
number of successes = 27, number of trials = 50, p-value = 0.6718
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3932420 0.6818508
sample estimates:
probability of success
                  0.54

Analogous Stata command
. bitesti 50 27 1/2

As expected, the two-sided p-value is 0.6718, which is equal (after rounding) to the p-value obtained using rdwinselect. Thus, we arrive at the same conclusion and fail to reject the null hypothesis.

6.2 Treatment Effect on Predetermined Covariates and Placebo Outcomes

Another important RD falsification test involves examining whether, near the cutoff, treated units are similar to control units in terms of observable characteristics. The idea behind this approach is simply that, if units lack the ability to precisely manipulate the score value they receive, there should be no systematic differences between units with similar values of the score. Thus, except for their treatment status, units just above and just below the cutoff should be similar in all those characteristics that could not have been affected by the treatment. These variables can be divided into two groups: variables that are determined before the treatment is assigned, which we call predetermined covariates, and variables that are determined after the treatment is assigned but, according to substantive knowledge about the treatment's causal mechanism, could not possibly have been affected by the treatment, which we call placebo outcomes. Importantly, predetermined covariates are always unambiguously defined, but placebo outcomes are application-specific. Any characteristic that is determined before the treatment assignment is a predetermined covariate. In contrast, whether a variable is a placebo outcome depends on the particular treatment under consideration. For example, if the treatment is access to clean water and the outcome of interest is child mortality, a treatment effect is expected on mortality due to waterborne illnesses but not on mortality due to other causes such as car accidents. Thus, mortality from road accidents would be a reasonable placebo outcome in this example. However, mortality from road accidents would not be a placebo outcome if access to clean water occurred simultaneously with a safety program that educated parents in the proper installation of car seats.

Once again, the particular implementation of this type of falsification test depends on whether researchers adopt a continuity-based or a local-randomization-based approach. But despite differences in implementation, a fundamental principle applies to both: all predetermined covariates and placebo outcomes should be analyzed in the same way as the outcome of interest. In the continuity-based approach, this principle means that for each predetermined covariate (and placebo outcome), researchers should first choose an optimal bandwidth, then use local polynomial techniques within that bandwidth to estimate the "treatment effect," and then employ valid inference procedures such as the robust bias-corrected methods discussed previously. In the local randomization approach, this principle means that covariates and placebo outcomes should be analyzed within the window in which the local randomization assumption is assumed to hold. In both approaches, since the predetermined covariate (or placebo outcome) could not have been affected by the treatment, the expectation is that the null hypothesis of no treatment effect will not be rejected.

6.2.1 Continuity-based Approach

When using the continuity-based approach to RD analysis, this falsification test employs the local polynomial techniques discussed in Section 4 to test whether the predetermined covariates and placebo outcomes are continuous at the cutoff—in other words, to test whether the treatment has an effect on them. We illustrate with the Meyersson application, using the set of predetermined covariates introduced in Section 5.3 for window selection purposes. We start by presenting a graphical analysis, creating an RD plot for every covariate using rdplot with the default options (evenly-spaced bins selected to mimic the variability of the data). The plots are presented in Figure 6.3—the specific commands are omitted to conserve space.
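For completeness, a call of the following form would reproduce, for instance, panel (a); this is only a sketch of the omitted commands, using the rdplot defaults and the lpop1994 covariate analyzed formally below.

# RD plot for the Log Population in 1994 covariate, with the rdplot defaults
rdplot(data$lpop1994, X, x.label = "Running Variable", y.label = "", title = "")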

Figure 6.3: RD Plots for Predetermined Covariates—Meyersson Application. Panels: (a) Log Population in 1994; (b) Number of Parties Receiving Votes in 1994; (c) Islamic Vote Share in 1994; (d) Islamic Mayor in 1989; (e) Province center indicator; (f) District center indicator.

The graphical analysis does not reveal obvious discontinuities at the cutoff, but of course a statistical analysis is required before we can reach a more objective and formal conclusion. In order to implement the analysis, an optimal bandwidth must be chosen for each test, and this bandwidth is not necessarily the bandwidth used to analyze the original outcome of interest. As shown very clearly in the RD plots, each covariate Zj exhibits a quite different estimated regression function, with different curvature and overall functional form. As a result, the optimal bandwidth for local polynomial estimation and inference will be different for every variable, and must be re-estimated accordingly in each case. This implies that the statistical analysis is conducted separately for each covariate, choosing the appropriate optimal bandwidth for each covariate considered. To implement this formal falsification test, we simply run rdrobust using each covariate of interest as the outcome variable. As an example, we analyze the RD treatment effect on the covariate lpop1994, the logarithm of the municipality population in 1994. Since this covariate is measured in 1994, it could not have been affected by the treatment—that is, by the party that wins the 1994 election. We estimate a local linear RD effect with triangular kernel weights and common MSE-optimal bandwidth using rdrobust (remember that the default bandwidth selection option is bwselect="mserd").

> rdrobust(data$lpop1994, X)

Call: rdrobust(y = data$lpop1994, x = X)

Summary:
Number of Obs         2629
BW Type               mserd
Kernel Type           Triangular
VCE Type              NN

                         Left       Right
Number of Obs            2314       315
Eff. Number of Obs       400        233
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          13.3186    13.3186
BW Bias (b)              21.3661    21.3661
rho (h/b)                0.6234     0.6234

Estimates:
                Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional    0.0124    0.2777      0.0447    0.9643    -0.5319     0.5567
Robust          -         -           -         0.9992    -0.6442     0.6448

Analogous Stata command
. rdrobust lpop1994 X

The point estimate is very close to zero and the robust p-value is 0.9992, so we find no evidence that this covariate differs systematically between treated and control units at the cutoff. In other words, we find no evidence that the size of the municipalities is discontinuous at the cutoff. In order to provide a complete falsification test, the same estimation and inference procedure should be repeated for all important covariates—that is, for all potential confounders. In a convincing RD design, these tests would show that there are no discontinuities in any variable. Table 6.1 contains the local polynomial estimation and inference results for nine different predetermined covariates available in the Meyersson dataset. All results were obtained employing rdrobust with the default specifications, as shown for lpop1994 above.

Table 6.1: Formal Continuity-Based Analysis for Covariates

Variable                                          MSE-Optimal   RD          Robust p-value   Robust Conf. Int.   Number of
                                                  Bandwidth     Estimator                                        Observations
Share Men aged 15-20 with High School Education   12.055         0.016      0.358            [-0.018, 0.049]     590
Islamic Mayor in 1989                             11.782         0.053      0.333            [-0.077, 0.228]     418
Islamic vote share 1994                           13.940         0.006      0.711            [-0.028, 0.041]     668
Number of parties receiving votes 1994            12.166        -0.168      0.668            [-1.357, 0.869]     596
Log Population in 1994                            13.319         0.012      0.999            [-0.644, 0.645]     633
District center                                   13.033        -0.067      0.462            [-0.285, 0.130]     624
Province center                                   11.556         0.029      0.609            [-0.064, 0.109]     574
Sub-metro center                                  10.360        -0.016      0.572            [-0.114, 0.063]     513
Metro center                                      13.621         0.008      0.723            [-0.047, 0.068]     642
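The entries of Table 6.1 can be obtained by simply looping rdrobust over the covariates; the short sketch below illustrates the idea. The vector of names is a placeholder: only lpop1994 and vshr_islam1994 are referenced explicitly in this section, and the remaining names would be those of the corresponding variables in the Meyersson dataset.

# Re-estimate the MSE-optimal bandwidth and the robust inference results
# separately for each predetermined covariate, as required by the principle above
covariates = c("lpop1994", "vshr_islam1994")   # extend with the remaining covariates
for (z in covariates) {
  summary(rdrobust(data[[z]], X))
}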

All point estimates are small and most are remarkably close to zero, all 95% confidence intervals contain zero, and the robust p-values range from 0.333 to 0.999. In other words, there is no empirical evidence that these predetermined covariates are discontinuous at the cutoff. Note that the number of observations used in the analysis varies between covariates because the MSE-optimal bandwidth is different for every covariate. As explained above, this is to be expected, since each optimal bandwidth depends on the particular shape of the conditional expectation of each covariate. Finally, in this illustration we employ the default options for simplicity, but for falsification purposes the CER-optimal bandwidth choice is naturally more appropriate, because only testing the null hypothesis is of interest in this case. Nevertheless, switching to bwselect="cerrd" does not change any of the empirical conclusions.

We complement these results with a graphical illustration of the RD effects for every covariate, to provide further evidence that these covariates do not jump abruptly at the cutoff. For this, we plot each covariate within its respective MSE-optimal bandwidth, using a polynomial of order one and a triangular kernel function to weight the observations. We create these plots with the rdplot command. Below we illustrate the specific command for the lpop1994 covariate analyzed above.

> bandwidth = rdrobust(data$lpop1994, X)$h_l
> xlim = ceiling(bandwidth)
> rdplot(data$lpop1994[abs(X) <= bandwidth], X[abs(X) <= bandwidth], p = 1,
+        kernel = "triangular", x.lim = c(-xlim, xlim),
+        x.label = "Running Variable", y.label = "", title = "")

Analogous Stata command
. rdrobust lpop1994 X
. local bandwidth = e(h_l)
. rdplot lpop1994 X if abs(X) <= `bandwidth', h(`bandwidth') p(1) kernel(triangular)

We follow this same procedure for each of the four covariates of interest. The resulting plots are presented in Figure 6.4. Consistent with the formal statistical results, the graphical analysis within the optimal bandwidth indicates that all covariates appear to be continuous at the cutoff.

Figure 6.4: Graphical Illustration of Local Linear RD Effects for Predetermined Covariates—Meyersson data. Panels: (a) Log Population in 1994; (b) Number of Parties Receiving Votes in 1994; (c) Islamic Vote Share in 1994; (d) Islamic Mayor in 1989.

The continuity of the estimated conditional expectations at the cutoff for each of these covariates does stand in contrast to the analogous plot for the outcome of interest, female education share, given in Figure 4.4.

6.2.2 Local Randomization Approach

In the local randomization RD approach, we can similarly employ pre-intervention covariates and placebo outcomes to evaluate whether there is evidence of potential manipulation. As before, the idea is that the null hypothesis of no treatment effect should be tested within W0 for all predetermined covariates and placebo outcomes, using the same inference procedures, the same assumptions about the treatment assignment mechanism, and the same test statistic used for the analysis of the outcome of interest. Since in this approach W0 is the window where the treatment is assumed to have been randomly assigned, all covariates and placebo outcomes should be analyzed within this window. This illustrates a fundamental difference between the continuity-based and the randomization-based approaches: in the former, in order to estimate and test hypotheses about treatment effects we need to approximate the unknown functional form of outcomes, predetermined covariates, and placebo outcomes, which requires estimating a separate bandwidth for each variable analyzed; in the latter, since the treatment is assumed to be as-if randomly assigned in W0, all the analyses occur within the same window, W0.

As discussed in Section 5, the chosen window for the local randomization approach is W0 = [−0.944, 0.944]. Therefore, in order to test whether covariates are balanced within this window, we must compare their behavior on each side of the cutoff using randomization inference techniques. We should see that the observed difference in means is not statistically significant, since otherwise we would have evidence against the assumption that treatment is as-if randomly assigned in W0. In order to test this formally, we can use rdrandinf, with the covariate of interest playing the role of the outcome variable. For example, in order to study the covariate vshr_islam1994 we should run:

> out = rdrandinf(data$vshr_islam1994, X, wl = -0.944, wr = 0.944)

Selected window = [-0.944; 0.944]
Running randomization-based test...
Randomization-based test complete.

Number of obs   =   2629
Order of poly   =   0
Kernel type     =   uniform
Reps            =   1000
Window          =   set by user
H0: tau         =   0
Randomization   =   fixed margins

Cutoff c = 0            Left of c    Right of c
Number of obs           2314         315
Eff. number of obs      23           27
Mean of outcome         0.328        0.319
S.d. of outcome         0.094        0.081
Window                  -0.944       0.944

                                Finite sample    Large sample
Statistic           T           P > |T|          P > |T|      Power vs d = 0.047
Diff. in means      -0.009      0.717            0.707        0.466

Analogous Stata command
. rdrandinf vshr_islam1994 X, wl(-.944) wr(.944)

This shows that the difference-in-means statistic is very small (0.319 − 0.328 = −0.009) and the finite-sample p-value is large (0.717). That is, we find no evidence of imbalance in this covariate inside W0. Table 6.2 contains a summary of this analysis for all covariates using randomization inference.

Table 6.2: Formal Local-Randomization Analysis for Covariates

Variable                                          Mean of    Mean of   Diff-in-Means   Fisherian   Number of
                                                  Controls   Treated   Statistic       p-value     Observations
Share Men aged 15-20 with High School Education   0.192      0.217      0.025          0.275       50
Islamic Mayor in 1989                             0.000      0.150      0.150          0.222       37
Islamic vote share 1994                           0.328      0.319     -0.009          0.712       50
Number of parties receiving votes 1994            6.348      6.074     -0.274          0.794       50
Log Population in 1994                            8.566      8.594      0.028          0.954       50
District center                                   0.609      0.519     -0.090          0.595       50
Province center                                   0.087      0.037     -0.050          0.580       50
Sub-metro center                                  0.087      0.074     -0.013          1.000       50
Metro center                                      0.043      0.037     -0.006          1.000       50
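As with Table 6.1, the rows of Table 6.2 can be produced by looping over the covariates; a minimal sketch is below (again, the vector of covariate names is a placeholder to be completed with the corresponding variables in the Meyersson dataset). The key difference from the continuity-based loop is that the window is held fixed at W0 for every covariate.

# Randomization inference for each covariate inside the same window W0
covariates = c("lpop1994", "vshr_islam1994")   # extend with the remaining covariates
for (z in covariates) {
  rdrandinf(data[[z]], X, wl = -0.944, wr = 0.944)
}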

We cannot conclude that the control and treatment means differ for any covariate, since the p-values are above 0.15 in all cases; there is no statistical evidence of imbalance (in terms of means) inside this window. We should also note that the number of observations is essentially the same in all cases. This happens because the window in which we analyze these covariates does not change (it is always W0), whereas in the continuity-based approach the MSE-optimal bandwidth depends on the particular outcome being used, and therefore the number of observations is different for every covariate.

This analysis can also be carried out visually, using rdplot restricted to W0 with p = 0 and a uniform kernel. If the window was chosen appropriately, then all covariates should be continuous at the cutoff with the specifications given above. This is equivalent to asserting that the control and treatment means are statistically equal, since the local polynomial approach with p = 0 and a uniform kernel inside W0 in effect tests whether the difference between the control and treatment means is zero. For example, we can construct an RD plot with these characteristics for the covariate vshr_islam1994.

> rdplot(data$vshr_islam1994[abs(X) <= 0.944], X[abs(X) <= 0.944],
+        p = 0, kernel = "uniform", x.label = "Running Variable",
+        y.label = "", title = "")

Analogous Stata command
. rdplot vshr_islam1994 X if abs(X)<=.944, h(.944) p(0) kernel(uniform)

Figure 6.5 contains analogous RD plots for the six predetermined covariates analyzed above. In most cases, the visual inspection shows that the means of these covariates seem to be similar on each side of the cutoff, consistent with the results from the window selection procedure discussed in Section 5 and the formal analysis in Table 6.2.

Figure 6.5: Illustration of Local Randomization RD Effects on Covariates—Meyersson Data. Panels: (a) Log Population in 1994; (b) Number of Parties Receiving Votes in 1994; (c) Islamic Vote Share in 1994; (d) Islamic Mayor in 1989; (e) District center indicator; (f) Province center indicator.

6.3 Other Design-Based Methods

We close this section with a brief discussion of three other design-specific falsification approaches: (i) null treatment effects at placebo cutoffs, (ii) treatment effect sensitivity to units near the cutoff, and (iii) treatment effect sensitivity to bandwidth choice. All these empirical tests can be conducted using either the continuity-based or the local randomization approach. To conserve space, here we discuss only the implementation and empirical results using local polynomial methods within the continuity-based approach, but the accompanying replication files include analogous implementations using randomization inference methods.

The first falsification approach, based on placebo cutoff points, was hinted at already when discussing RD plots in Section 3. The motivation behind this method starts by recalling that the key identifying assumption underlying RD designs is continuity (or lack of abrupt changes) of the regression functions for treatment and control units at the cutoff in the absence of the treatment. While such a condition is fundamentally untestable at the cutoff, researchers can investigate empirically whether the estimable regression functions for control and treatment units are continuous over the support of the score variable, that is, at points away from the cutoff. Evidence of continuity away from the cutoff is, of course, neither necessary nor sufficient for continuity at the cutoff, but the presence of discontinuities away from the cutoff can be interpreted as potentially casting doubt on the RD design, at the very least in cases where such discontinuities cannot be explained by substantive knowledge of the specific application. Practically, the method replaces the true cutoff value by another value at which the treatment status does not really change, and then performs estimation and inference using this "fake" cutoff point. The motivating idea is that a significant treatment effect should occur only at the true cutoff value and not at other values of the score where the treatment status is constant.

A graphical implementation of this falsification approach follows directly from the RD plots described previously, by simply assessing whether there are jumps in the observed regression functions at points other than the true cutoff. A more formal implementation conducts statistical estimation and inference for RD treatment effects at placebo or artificial cutoff points, using control and treatment units separately. Once again, the implementation depends on the approach adopted: in the continuity-based approach, we would use local polynomial methods within an optimally chosen bandwidth around the fake cutoff to estimate treatment effects on the outcome, as explained in Section 4. In the local randomization approach, we would choose a window around the fake cutoff where randomization is plausible, and make inferences for the true outcome within that window, as explained in Section 5.

In order to analyze the alternative cutoffs using the continuity-based approach, we employ rdrobust after restricting the sample to the appropriate group and specifying the artificial cutoff. For example, we consider only the treatment group with a placebo cutoff point x̄ = 1. We do not expect to find a statistically significant effect in this case, since treatment did not change discontinuously at any value other than 0. Here is the empirical result of this exercise using the Meyersson data:

> rdrobust(Y[X >= 0], X[X >= 0], c = 1)

Call: rdrobust(y = Y[X >= 0], x = X[X >= 0], c = 1)

Summary:
Number of Obs         315
BW Type               mserd
Kernel Type           Triangular
VCE Type              NN

                         Left      Right
Number of Obs            30        285
Eff. Number of Obs       30        49
Order Loc Poly (p)       1         1
Order Bias (q)           2         2
BW Loc Poly (h)          2.3016    2.3016
BW Bias (b)              3.2845    3.2845
rho (h/b)                0.7007    0.7007

Estimates:
                Coef       Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional    -0.9935    4.2782      -0.2322   0.8164    -9.3786     7.3917
Robust          -          -           -         0.7599    -9.8276     13.4594

Analogous Stata command
. rdrobust Y X if X>=0, c(1)

In order to estimate at the alternative cutoff we must use the option c = 1 in rdrobust, and we must restrict the sample to the treatment group, since otherwise the estimation on the left of the artificial cutoff would be contaminated by the actual non-zero treatment effect at 0. This forces the program to compare the educational outcomes of municipalities where the Islamic party won by a margin of more than 1% with municipalities where it won by less than 1%. That is, on both sides of this artificial cutoff we have municipalities with an Islamic mayor, and therefore we should not find any discontinuity in the outcome at 1%. This is in fact what happens, since the p-value is much larger than 0.1. That is, we can conclude that the outcome of interest did not jump at the specific cutoff value of 1%. Table 6.3 summarizes this analysis for alternative cutoffs ranging from −5% to 5% in increments of 1%, and Figure 6.6 depicts the main results from this falsification test graphically.

Table 6.3: Continuity-Based Analysis for Alternative Cutoffs

Alternative   MSE-Optimal   RD          Robust p-value   Robust Conf. Int.    N. of Obs.
Cutoff        Bandwidth     Estimator                                         Left    Right
-5            4.576          2.535      0.495            [-4.472, 9.250]      134     138
-4            4.300         -0.995      0.374            [-9.611, 3.611]      125     117
-3            3.934          1.688      0.421            [-3.509, 8.397]      135     74
-2            4.642         -2.300      0.991            [-9.414, 9.518]      152     47
-1            4.510         -3.003      0.992            [-11.295, 11.409]    139     24
0             17.239         3.020      0.076            [-0.309, 6.276]      529     266
1             2.362         -1.131      0.787            [-9.967, 13.147]     30      49
2             2.697         -1.973      0.488            [-15.333, 7.313]     53      50
3             2.850          3.766      0.668            [-8.700, 13.569]     68      56
4             2.584         -9.854      0.056            [-24.963, 0.329]     49      52
5             2.287          3.790      0.462            [-7.645, 16.815]     41      45
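A loop along the following lines could generate results like those reported in Table 6.3; this is only a sketch, with the sample restrictions following the logic described above: treated observations only for positive placebo cutoffs, control observations only for negative ones, and the full sample at the true cutoff.

# Placebo-cutoff analysis: one rdrobust call per artificial cutoff
for (cut in -5:5) {
  if (cut > 0) {
    out = rdrobust(Y[X >= 0], X[X >= 0], c = cut)   # treated observations only
  } else if (cut < 0) {
    out = rdrobust(Y[X < 0], X[X < 0], c = cut)     # control observations only
  } else {
    out = rdrobust(Y, X)                            # true cutoff, full sample
  }
  summary(out)
}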

Figure 6.6: RD Estimation for Placebo Cutoffs.

The cutoff equal to 0 is included as a benchmark: zero is the true cutoff, and the results at this cutoff were discussed at length in Section 4. All other cutoffs are "fake" or placebo cutoffs, in the sense that treatment did not actually change at those points. At most of the artificial cutoff points the RD estimate is smaller in magnitude than the true RD estimate (3.020), and all robust p-values at the artificial cutoffs are above 0.05. Therefore, we find no evidence that the outcome of interest jumps discontinuously at any cutoff other than 0.

The second falsification approach, based on sensitivity to observations near the cutoff, seeks to investigate how sensitive the results are to the response of units very close to the cutoff point. If manipulation was present, it is natural to assume that the units closest to the cutoff are those most likely to have systematically manipulated their score value. Thus, the idea behind this approach is to exclude such units and then repeat the estimation and inference analysis using the remaining sample. This idea is sometimes referred to as a "donut hole" approach. Once again, it can be implemented using either continuity-based or local randomization methods, though it is most natural in the former context because the latter setting tends to employ very few observations to begin with and does not rely on extrapolation as much. Indeed, this approach is also useful to assess the sensitivity of the extrapolation to the few observations closest to the cutoff, as they are likely to be the most influential when fitting the local polynomials.

To implement the falsification method based on excluding the units closest to the cutoff, we employ rdrobust after subsetting the data accordingly. For example, we first consider the case where units with score |Xi| < 0.25 are excluded from the analysis. As discussed before, this implies that a new optimal bandwidth will be selected and, in this case, de facto more extrapolation than before will take place. The result using the Meyersson data is as follows:

> rdrobust(Y[abs(X) >= 0.25], X[abs(X) >= 0.25])

Call: rdrobust(y = Y[abs(X) >= 0.25], x = X[abs(X) >= 0.25])

Summary:
Number of Obs         2617
BW Type               mserd
Kernel Type           Triangular
VCE Type              NN

                         Left       Right
Number of Obs            2308       309
Eff. Number of Obs       483        248
Order Loc Poly (p)       1          1
Order Bias (q)           2          2
BW Loc Poly (h)          16.0276    16.0276
BW Bias (b)              27.4569    27.4569
rho (h/b)                0.5837     0.5837

Estimates:
                Coef      Std. Err.   z         P>|z|     CI Lower    CI Upper
Conventional    3.4118    1.5126      2.2556    0.0241    0.4472      6.3764
Robust          -         -           -         0.0540    -0.0594     6.9499

Analogous Stata command
. rdrobust Y X if abs(X)>=0.25

In practice, it is natural to repeat this exercise a few times to assess the sensitivity of the results to different numbers of excluded observations. Table 6.4 illustrates this approach, and Figure 6.7 depicts the results graphically.

Table 6.4: Continuity-Based Analysis for the Donut-Hole Approach

Donut-Hole   MSE-Optimal   RD          Robust p-value   Robust Conf. Int.   Number of      Excluded Obs.
Radius       Bandwidth     Estimator                                        Observations   Left    Right
0.00         17.239        3.020       0.076            [-0.309, 6.276]     795            0       0
0.25         16.028        3.412       0.054            [-0.059, 6.950]     731            6       6
0.50         15.422        3.745       0.028            [0.408, 7.292]      697            13      14
0.75         14.921        3.482       0.078            [-0.395, 7.439]     658            17      25
1.00         15.682        3.030       0.145            [-1.016, 6.929]     681            24      30
1.25         14.243        1.885       0.460            [-2.639, 5.834]     617            30      33
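The rows of Table 6.4 can be generated with a short loop over the donut-hole radii; the sketch below illustrates the idea under the same default rdrobust specifications used above.

# Donut-hole exercise: drop observations within radius r of the cutoff and
# re-run the standard continuity-based estimation on the remaining sample
for (r in c(0, 0.25, 0.5, 0.75, 1, 1.25)) {
  summary(rdrobust(Y[abs(X) >= r], X[abs(X) >= r]))
}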

Figure 6.7: RD Estimation for the Donut-Hole Approach.

Finally, the last falsification method commonly encountered in empirical work is related to sensitivity to the bandwidth choice or window length. This method complements the donut-hole approach just discussed, which investigated the sensitivity of the empirical findings as units from the center of the neighborhood around the cutoff are removed. Now, instead, we investigate the sensitivity as units are added or removed at the end points of the neighborhood. The implementation of this method is also straightforward, as it requires employing the same commands with different bandwidth or window length choices. However, the results must be interpreted with care: as we have discussed in detail throughout this monograph, choosing the bandwidth is perhaps the most important problem in RD analysis, because results can be strongly affected by this choice. In fact, it is well understood how the bandwidth affects the results: as the bandwidth increases, the bias of the estimator increases and its variance decreases. Thus, it is natural to expect that the larger the bandwidth, the shorter the confidence intervals will be, but the more they will be displaced (because of the bias). These considerations suggest that investigating sensitivity to the bandwidth is only useful over small ranges around the MSE-optimal bandwidth, since otherwise the results will be mechanically determined by the statistical properties of the estimation method. To illustrate this approach in practice, we present Figure 6.8, where we report the empirical analysis of the Meyersson data for five bandwidth choices.

Figure 6.8: Sensitivity to Bandwidth in the Continuity-Based Approach.

Figure 6.8 reports local polynomial RD point estimators and robust confidence intervals using as bandwidth: (i) the local randomization choice hLR = 0.944, (ii) the CER-optimal choice hCER = 11.629, (iii) the MSE-optimal choice hMSE = 17.239, (iv) 2 · hCER = 23.258, and (v) 2 · hMSE = 34.478.
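The estimates and robust confidence intervals shown in Figure 6.8 can be produced by fixing the bandwidth manually in rdrobust rather than letting it be selected optimally; a minimal sketch:

# Sensitivity to bandwidth: re-estimate the RD effect at each of the five
# bandwidth values displayed in Figure 6.8
for (h in c(0.944, 11.629, 17.239, 23.258, 34.478)) {
  summary(rdrobust(Y, X, h = h))
}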

6.4 Further Readings

The density test was first proposed by McCrary (2008). Cattaneo et al. (2017a) developed a local polynomial density estimator that does not require pre-binning of the data and leads to size and power improvements relative to other implementation approaches, and Frandsen (2017) developed a related manipulation test for cases where the score is discrete. The importance of falsification tests and the use of placebo outcomes is generally discussed in the literature on the analysis of experiments (e.g., Rosenbaum, 2002, 2010; Imbens and Rubin, 2015). Lee (2008) applied and extended these ideas to the context of RD designs, and Canay and Kamat (2017) developed a permutation inference approach in the same context. Ganong and Jäger (2017) developed a permutation inference approach based on the idea of placebo RD cutoffs for Kink RD (Regression Kink) designs and related settings. Finally, falsification testing based on donut-hole specifications is discussed in Bajari et al. (2011) and Barreca et al. (2016).

7 Empirical Example with Discrete Running Variable

The canonical continuity-based RD design assumes that the score that determines treatment is a continuous random variable. A random variable is continuous when the set of values that it can take contains an uncountable number of elements. For example, a share such as a party's proportion of the vote is continuous, because it can take any value in the [0, 1] interval. In practical terms, when a variable is continuous, all the observations in the dataset have distinct values—i.e., there are no ties. In contrast, a discrete random variable such as date of birth can only take a finite number of values; as a result, a random sample of a discrete variable will contain "mass points"—that is, values that are shared by many observations. When the continuous score assumption does not hold, some of the local polynomial methods we described in Section 4 are not directly applicable. This is a practically relevant issue, because many real RD applications have a discrete score that can only take a finite number of values.

We now consider an empirical RD example where the running variable has mass points, in order to illustrate some of the strategies that can be used to analyze RD designs with a discrete running variable. We employ this empirical application to illustrate how identification, estimation, and inference can be modified when the dataset contains multiple observations per value of the running variable. As we illustrate and discuss below, the key issue when deciding how to analyze an RD design with a discrete running variable is the number of distinct mass points in the running variable. Local polynomial methods will behave essentially as if each mass point were a single observation. Thus, if the score is discrete but the number of mass points is sufficiently large, using local polynomial methods may still be appropriate. In contrast, if the number of mass points is very small, local polynomial methods will not be directly applicable. In this case, analyzing the RD design using the local randomization approach is a natural alternative. When the score is discrete, the local randomization approach has the advantage that the window selection procedure is no longer needed, as the smallest window is well defined. Regardless of the estimation and inference method employed, issues of interpretability and extrapolation will naturally arise. In the upcoming sections, we discuss and illustrate all these issues using an example from the education literature.

7.1 The Effect of Academic Probation on Future Academic Achievement

The example we re-analyze is the study by Lindo, Sanders and Oreopoulos (2010, LSO hereafter), who use an RD design to investigate the impact of placing students on academic probation on their future academic performance. Our choice of an education example is intentional. The origins of the RD design can be traced to the education literature, and RD methods continue to be used extensively in education because interventions such as scholarships or probation programs are often assigned on the basis of a test score and a fixed approval threshold. Moreover, despite being continuous in principle, test scores and grades are commonly discrete in practice.


LSO analyze a policy at a large Canadian university that places students on academic probation when their grade point average (GPA) falls below a certain threshold. As explained by LSO, the treatment of placing a student on academic probation involves setting a standard for the student's future academic performance: a student who is placed on probation in a given term must improve her GPA in the next term according to campus-specific standards, or face suspension. Thus, in this RD design, the unit of analysis is the student, the score is the student's GPA, the treatment of interest is placing the student on probation, and the cutoff is the GPA value that triggers probation placement. Students come from three different campuses. In campuses 1 and 2, the cutoff is 1.5; in campus 3, the cutoff is 1.6. In their original analysis, the authors adopt the normalizing-and-pooling strategy we discussed in Section 2.4, centering each student's GPA at the appropriate cutoff and pooling the observations from the three campuses in a single dataset. Thus, the original running variable is the difference between the student's GPA and the cutoff; this variable ranges from -1.6 to 2.8, with negative values indicating that the student was placed on probation, and positive values indicating that the student was not placed on probation.

Table 7.1 contains basic descriptive statistics for the score, treatment, outcome, and predetermined covariates that we use in our re-analysis. There are 44,362 student-level observations from the 1996-2005 period (40,582 of them with a non-missing next-term GPA). LSO focus on several outcomes that can be influenced by academic probation. In order to simplify our illustration, we focus on two of them: the student's decision to permanently leave the university, and the GPA obtained by the student in the term immediately after being placed on probation. Naturally, the second outcome is only observed for students who decide to continue at the university, and thus the effects of probation on this outcome must be interpreted with caution, as the decision to leave the university may itself be affected by the treatment. We also investigate some of the predetermined covariates included in the LSO dataset: an indicator for whether the student is male (male), the student's age at entry (age), the total number of credits for which the student enrolled in the first year (totcredits_year1), an indicator for whether the student's first language is English (english), an indicator for whether the student was born in North America (bpl_north_america), the percentile of the student's average GPA in standard classes taken in high school (hsgrade_pct), and indicators for whether the student is enrolled in each of the three campuses (loc_campus1, loc_campus2, and loc_campus3). Like LSO, we employ these covariates to study the validity of the RD design.

Table 7.1: Descriptive Statistics for Lindo et al. (2010)

Variable                               Mean     Median   Std. Deviation   Min.     Max.      Obs.
Next Term GPA (normalized)             1.047    1.170    0.917            -1.600   2.800     40582
Left University After 1st Evaluation   0.049    0.000    0.216            0.000    1.000     44362
Next Term GPA (not normalized)         2.571    2.700    0.910            -2.384   4.300     40582
Distance from cutoff                   -0.913   -0.980   0.899            -2.800   1.600     44362
Treatment Assignment                   0.161    0.000    0.368            0.000    1.000     44362
High school grade percentile           50.173   50.000   28.859           1.000    100.000   44362
Credits attempted in first year        4.573    5.000    0.511            3.000    6.500     44362
Age at entry                           18.670   19.000   0.743            17.000   21.000    44362
Male                                   0.383    0.000    0.486            0.000    1.000     44362
Born in North America                  0.871    1.000    0.335            0.000    1.000     44362
English is first language              0.714    1.000    0.452            0.000    1.000     44362
At Campus 1                            0.584    1.000    0.493            0.000    1.000     44362
At Campus 2                            0.173    0.000    0.379            0.000    1.000     44362
At Campus 3                            0.242    0.000    0.429            0.000    1.000     44362

7.2 Counting the Number of Mass Points in the RD Score

The crucial issue in the analysis of RD designs with discrete scores is the number of mass points that actually occur in the dataset. When this number is large, it will be possible to apply the tools from the continuity-based approach to RD analysis, after possibly changing the interpretation of the treatment effect of interest. When this number is either moderately small or very small, a local randomization approach will be most appropriate. In the latter situation, (local or global) polynomial fitting will be useful only as an exploratory device, unless the researcher is willing to impose strong parametric assumptions. Therefore, the first step in the analysis of an RD design with a discrete running variable is to analyze the empirical distribution of the score and determine the total number of observations, the total number of mass points, and the number of observations per mass point.

We continue to illustrate this step of the analysis using the LSO application. Since only those students who have a GPA below a certain level are placed on probation, the treatment—the assignment to probation—is administered to students whose GPA is to the left of the cutoff. As we discussed in Section 2, the convention is to define the RD treatment indicator as equal to one for units whose score is above (i.e., to the right of) the cutoff. To conform to this convention, we invert the original running variable in the LSO data. We multiply the original running variable—the distance between the GPA and the campus cutoff—by -1, so that, according to the transformed running variable, students placed on probation (i.e., those with GPA below the cutoff) are now above the cutoff, and students not placed on probation (i.e., those with GPA above the cutoff) are now below the cutoff. For example, a student who has Xi = −0.2 in the original score is placed on probation because her GPA is 0.2 units below the threshold. The value of the transformed running variable for this treated student is X̃i = 0.2. Moreover, since we define the treatment as 1(X̃i ≥ 0), this student will now be placed above the cutoff. The only caveat is that we must shift slightly those control students who were exactly at the cutoff in the original score, since multiplying the running variable by -1 does not alter their score (the cutoff is zero). In the scale of the transformed variable, we need these students to be below zero to continue to assign them to the control condition. Therefore, we manually change the score of students who are exactly at zero to -0.000005. A histogram of the transformed running variable is shown in Figure 7.1.
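A minimal sketch of this transformation in R is shown below; the name of the original normalized score variable in the dataset (dist_from_cut) is a hypothetical placeholder.

# Invert the score so that students on probation end up above the cutoff
X = -data$dist_from_cut           # hypothetical name for the original normalized score
# Students exactly at the original cutoff are not placed on probation; shift them
# slightly below zero so that they remain in the control group
X[X == 0] = -0.000005
# RD treatment indicator: equal to 1 for students placed on academic probation
T = as.numeric(X >= 0)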

Figure 7.1: Histogram of Transformed Running Variable—LSO Data.

We first check how many total observations we have in the dataset, that is, how many observations have a non-missing value of the score.

> length(X)
[1] 44362

Analogous Stata command
. count if X!=.

The total sample size in this application is large, with 44,362 observations. However, because the running variable is discrete, the crucial step is to calculate how many mass points we have.

> length(unique(X))
[1] 430

Analogous Stata command
. codebook X

The 44,362 total observations in the dataset take only 430 distinct values. This means that, on average, there are roughly 100 observations per value. To have a better idea of the density of observations near the cutoff, Table 7.2 shows the number of observations for the five mass points closest to the cutoff; this table also illustrates how the score is transformed. The original score ranges between -1.6 and 2.8, and our transformed score ranges from -2.799 to 1.6. Both the original and the transformed running variables are discrete, because the GPA increases in increments of 0.01 units and there are many students with the same GPA value. For example, there are 76 students who are 0.02 GPA units below the cutoff. Of these 76 students, 44 + 5 = 49 have a GPA of 1.48 (because the cutoff in Campuses 1 and 2 is 1.5), and 27 students have a GPA of 1.58 (because the cutoff in Campus 3 is 1.6). The same phenomenon of multiple observations with the same value of the score occurs at all other values of the score; for example, there are 228 students who have a value of zero in the original score (and -0.000005 in our transformed score).
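Before turning to Table 7.2, the observation counts at each mass point can be tabulated directly. The following lines are a minimal sketch (not part of the original analysis) that produces this kind of count for the mass points within 0.02 of the cutoff.

> # Sketch: number of observations at each mass point of the transformed score
> obs.per.point = aggregate(X, by = list(score = X), FUN = length)
> colnames(obs.per.point) = c("score", "n.obs")
> obs.per.point[abs(obs.per.point$score) <= 0.02, ]   # mass points closest to the cutoff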

Table 7.2: Observations at Closest Mass Points

Original Score | Transformed Score | Treatment Status | Obs. All Campuses | Obs. Campus 1 | Obs. Campus 2 | Obs. Campus 3
-0.02 | 0.02 | Treated | 76 | 44 | 5 | 27
-0.01 | 0.01 | Treated | 70 | 25 | 16 | 29
0 | -0.000005 | Control | 228 | 106 | 55 | 67
0.01 | -0.01 | Control | 77 | 30 | 11 | 36
0.02 | -0.02 | Control | 137 | 58 | 34 | 45

7.3 Using the Continuity-Based Approach when the Number of Mass Points is Large

When the number of mass points in the discrete running variable is sufficiently large, it is possible to use the tools from the continuity-based approach to RD analysis. The LSO application illustrates a case in which a continuity-based analysis might be possible, since the total number of mass points is 430, a moderately large value. Because there are mass points, extrapolation between them is unavoidable, but this is empirically no different from analyzing a (finite sample) dataset with a sample size of n = 430. We start with a falsification analysis: a continuity-based density test and a continuity-based analysis of the effect of the treatment on predetermined covariates. First, we use rddensity to test whether the density of the score is continuous at the cutoff.

> out = rddensity(X)
> summary(out)

RD Manipulation Test using local polynomial density estimation.

Number of obs  =        44362
Model          = unrestricted
Kernel         =   triangular
BW method      =         comb
VCE method     =    jackknife

Cutoff c = 0           Left of c   Right of c
Number of obs              37211         7151
Eff. Number of obs         10083         4137
Order est. (p)                 2            2
Order bias (q)                 3            3
BW est. (h)                0.706        0.556

Method    T         P > |T|
Robust    -0.4544   0.6496

Analogous Stata command
. rddensity X

The p-value is 0.6496, so we fail to reject the null hypothesis that the density of the score is continuous at the cutoff point. Next, we use rdrobust to employ local polynomial methods to estimate the RD effect of being placed on probation on several predetermined covariates. We use the default specifications in rdrobust: an MSE-optimal bandwidth that is equal on both sides of the cutoff, a triangular kernel, a polynomial of order one, and a regularization term. For example, we can estimate the RD effect of probation on hsgrade_pct, the measure of high school performance.

> rdrobust(data$hsgrade_pct, X)

Call: rdrobust(y = data$hsgrade_pct, x = X)

Summary:
Number of Obs          44362
BW Type                mserd
Kernel Type       Triangular
VCE Type                  NN

                        Left     Right
Number of Obs          37211      7151
Eff. Number of Obs      6115      3665
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)       0.4647    0.4647
BW Bias (b)           0.7590    0.7590
rho (h/b)             0.6123    0.6123

Estimates:
               Coef     Std. Err.   z        P>|z|    CI Lower   CI Upper
Conventional   1.3282   1.0104      1.3145   0.1887   -0.6522    3.3087
Robust         -        -           -        0.1943   -0.7880    3.8780

Analogous Stata command
. rdrobust hsgrade_pct X

And we can also explore the RD effect graphically using rdplot.


> rdplot(data$hsgrade_pct, X, x.label = "Running Variable", y.label = "", title = "")

Analogous Stata command
. rdplot hsgrade_pct X

Figure 7.2: RD plot for hsgrade_pct—LSO data
[Figure omitted in this text version: RD plot of high school grade percentile against the transformed score; horizontal axis: Running Variable.]

Both the formal analysis and the graphical analysis indicate that, according to this continuity-based local polynomial analysis, the students just above and below the cutoff are similar in terms of their high school performance. We repeat this analysis for the nine predetermined covariates mentioned above. Table 7.3 presents a summary of the results, and Figure 7.3 shows the associated RD plots for six of the nine covariates.

Table 7.3: RD Effects on Predetermined Covariates—LSO data, Continuity-Based Approach

Variable | MSE-Optimal Bandwidth | RD Estimator | Robust p-value | Robust Conf. Int. | Number of Observations
High school grade percentile | 0.465 | 1.328 | 0.194 | [-0.788, 3.878] | 9780
Credits attempted in first year | 0.301 | 0.081 | 0.005 | [0.027, 0.157] | 6443
Age at entry | 0.463 | 0.017 | 0.637 | [-0.060, 0.099] | 9780
Male | 0.482 | -0.012 | 0.506 | [-0.067, 0.033] | 10121
Born in North America | 0.505 | 0.014 | 0.374 | [-0.018, 0.049] | 10757
English is first language | 0.531 | -0.035 | 0.085 | [-0.083, 0.005] | 11239
At Campus 1 | 0.289 | -0.020 | 0.356 | [-0.093, 0.034] | 5985
At Campus 2 | 0.476 | -0.018 | 0.333 | [-0.063, 0.021] | 9910
At Campus 3 | 0.271 | 0.036 | 0.139 | [-0.015, 0.109] | 5786
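The entries of Table 7.3 can be assembled by looping rdrobust over the covariates. The sketch below is not the authors' original code; it assumes the covariate names used elsewhere in this section and the coef, pv, ci, and bws elements returned by rdrobust (these element names are printed later in this section with names(rdout)).

> # Sketch: RD effects on predetermined covariates with default rdrobust settings
> library(rdrobust)
> covs = c("hsgrade_pct", "totcredits_year1", "age_at_entry", "male",
+          "bpl_north_america", "english", "loc_campus1", "loc_campus2", "loc_campus3")
> results = data.frame()
> for (v in covs) {
+   out = rdrobust(data[, v], X)
+   results = rbind(results,
+                   data.frame(variable  = v,
+                              bandwidth = out$bws[1, 1],   # MSE-optimal bandwidth h
+                              estimate  = out$coef[1],     # conventional RD estimate
+                              p.robust  = out$pv[3],       # robust p-value
+                              ci.lower  = out$ci[3, 1],    # robust CI, lower bound
+                              ci.upper  = out$ci[3, 2]))   # robust CI, upper bound
+ }
> print(results)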


Figure 7.3: RD Plots for Predetermined Covariates—LSO Application
[Figures omitted in this text version: RD plots against the Running Variable for six covariates: (a) High School Grade Percentile, (b) Total Credits in First Year, (c) Age at Entry, (d) English First Language Indicator, (e) Male Indicator, (f) Born in North America Indicator.]


Overall, the results indicate that the probation treatment has no effect on the predetermined covariates, with two exceptions. First, the effect on totcredits_year1 has an associated p-value of 0.004, rejecting the hypothesis of no effect at standard levels. However, the point estimate is small: treated students take an additional 0.081 credits in the first year, while the average value of totcredits_year1 in the overall sample is 4.43, with a standard deviation of roughly 0.5. Second, students who are placed on probation are 3.5 percentage points less likely to speak English as a first language, an effect that is significant at the 10% level. This difference is potentially more worrisome, and is also somewhat noticeable in the RD plot in Figure 7.3(d). Next, we analyze the effect of being placed on probation on the outcome of interest, nextGPA, the GPA in the following academic term. We first use rdplot to visualize the effect.

> out = rdplot(nextGPA_nonorm, X, binselect = "esmv")
> print(out)

Call: rdplot(y = nextGPA_nonorm, x = X, binselect = "esmv")

Method:
                              Left      Right
Number of Obs.               34854       5728
Polynomial Order                 4          4
Scale                           16         28
Selected Bins                  690        362
Average Bin Length          0.0041     0.0044
Median Bin Length           0.0041     0.0044
IMSE-optimal bins               44         13
Mimicking Variance bins        690        362

Relative to IMSE-optimal:
Implied scale              15.6818    27.8462
WIMSE variance weight       0.0003     0.0000
WIMSE bias weight           0.9997     1.0000

Analogous Stata command
. rdplot nextGPA_nonorm X, binselect(esmv) ///
>        graph_options(graphregion(color(white)) ///
>        xtitle(Running Variable) ytitle(Outcome))




Figure 7.4: RD Plot for nextGPA—LSO Data
[Figure omitted in this text version: RD plot of next-term GPA against the transformed score; horizontal axis: Running Variable, vertical axis: Outcome.]

Overall, the plot shows a very clear negative relationship between the running variable and the outcome: students who have a low GPA in the current term (and thus a higher value of the transformed running variable) tend to also have a low GPA in the following term. The plot also shows that students with scores just above the cutoff (who are just placed on probation) tend to have a higher GPA in the following term relative to students who are just below the cutoff and just avoided probation. These results are confirmed when we use a local linear polynomial and robust inference to provide a formal statistical analysis of the RD effect.

> rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd")

Call: rdrobust(y = nextGPA_nonorm, x = X, p = 1, kernel = "triangular", bwselect = "mserd")

Summary:
Number of Obs          40582
BW Type                mserd
Kernel Type       Triangular
VCE Type                  NN

                        Left     Right
Number of Obs          34854      5728
Eff. Number of Obs      5249      3038
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)       0.4375    0.4375
BW Bias (b)           0.7171    0.7171
rho (h/b)             0.6101    0.6101

Estimates:
               Coef     Std. Err.   z        P>|z|    CI Lower   CI Upper
Conventional   0.2221   0.0396      5.6146   0.0000   0.1446     0.2997
Robust         -        -           -        0.0000   0.1217     0.3044

Analogous Stata command
. rdrobust nextGPA_nonorm X, kernel(triangular) p(1) bwselect(mserd)

As shown, students who are just placed on probation improve their GPA in the following term by 0.2221 additional points, relative to students who just missed probation. The robust p-value is less than 0.00005, and the robust 95% confidence interval ranges from 0.1217 to 0.3044. Thus, the evidence indicates that, conditional on not leaving the university, being placed on academic probation translates into an increase in future GPA. The point estimate of 0.2221, obtained with rdrobust within an MSE-optimal bandwidth of 0.4375, is very similar to the effect of 0.23 grade points found by LSO within an ad-hoc bandwidth of 0.6. To better understand this effect, we may be interested in knowing the point estimates for control and treated students separately. To see this information, we explore the objects returned by rdrobust.

> rdout = rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd")
> print(names(rdout))
 [1] "tabl1.str" "tabl2.str" "tabl3.str" "N"         "N_l"       "N_r"
     "N_h_l"     "N_b_l"     "N_b_r"     "c"         "p"         "q"
     "h_l"       "h_r"       "b_l"       "b_r"       "tau_cl"    "tau_bc"
     "se_tau_cl" "se_tau_rb" "bias_l"    "bias_r"    "beta_p_l"  "beta_p_r"
[25] "V_cl_l"    "V_cl_r"    "V_rb_l"    "V_rb_r"    "coef"      "bws"
     "se"        "z"         "pv"        "ci"        "call"
> print(rdout$beta_p_r)
           [,1]
[1,]  2.0671526
[2,] -0.6713546
> print(rdout$beta_p_l)
           [,1]
[1,]  1.8450372
[2,] -0.6804159

Analogous Stata command
. rdrobust nextGPA_nonorm X
. ereturn list

This output shows the estimated intercept and slope from the two local linear regressions estimated separately to the right (beta_p_r) and to the left (beta_p_l) of the cutoff. At the cutoff, the average GPA in the following term for control students who just avoid probation is 1.8450372, while the average future GPA for treated students who are just placed on probation is 2.0671526. The difference is the estimated RD effect reported above, 2.0671526 − 1.8450372 = 0.2221154. This represents approximately a 12% GPA increase relative to the control group, a considerable effect.
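The 12% figure can be recovered directly from the objects stored in rdout; the following lines are a small sketch of that arithmetic, not part of the original analysis.

> # Sketch: relative size of the RD effect at the cutoff
> control.mean = rdout$beta_p_l[1]          # control-side intercept, approximately 1.845
> treated.mean = rdout$beta_p_r[1]          # treated-side intercept, approximately 2.067
> rd.effect = treated.mean - control.mean   # 0.2221, the conventional RD estimate
> rd.effect / control.mean                  # roughly 0.12, a 12% increase over the control mean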


An alternative to the simplest use of rdrobust illustrated above is to cluster the standard errors by each value of the score; this is the approach recommended by Lee and Card (2008), as we discuss in the Further Readings section below. We implement this using the cluster option in rdrobust.

> clustervar = X
> rdrobust(nextGPA_nonorm, X, kernel = "triangular", p = 1, bwselect = "mserd",
+          vce = "hc0", cluster = clustervar)

Call: rdrobust(y = nextGPA_nonorm, x = X, p = 1, kernel = "triangular",
       bwselect = "mserd", vce = "hc0", cluster = clustervar)

Summary:
Number of Obs          40582
BW Type                mserd
Kernel Type       Triangular
VCE Type             Cluster

                        Left     Right
Number of Obs          34854      5728
Eff. Number of Obs      4357      2709
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)       0.3774    0.3774
BW Bias (b)           0.6270    0.6270
rho (h/b)             0.6019    0.6019

Estimates:
               Coef     Std. Err.   z        P>|z|    CI Lower   CI Upper
Conventional   0.2149   0.0332      6.4625   0.0000   0.1497     0.2800
Robust         -        -           -        0.0000   0.1316     0.2818

Analogous Stata command
. gen clustervar = X
. rdrobust nextGPA_nonorm X, kernel(triangular) p(1) bwselect(mserd) vce(cluster clustervar)

The conclusions remain essentially unaltered, as the 95% robust confidence interval changes only slightly from [0.1217 , 0.3044] to [0.1316 , 0.2818]. Note that the point estimate moves slightly from 0.2221 to 0.2149 because the MSE-optimal bandwidth with clustering shrinks to 0.3774 from 0.4375, and the bias bandwidth also decreases.

7.4 Interpreting Continuity-Based RD Analysis with Mass Points

Provided that the number of mass points in the score is reasonably large, it is possible to analyze an RD design with a discrete score using the tools from the continuity-based approach. However, it is important to understand how to correctly interpret the results from such an analysis. We now analyze the LSO application further, with the goal of clarifying these issues.


When there are mass points in the running variable, local polynomial methods for RD analysis behave essentially as if we had as many observations as mass points, and therefore the method implies extrapolation from the closest mass point on either side to the cutoff. In other words, when applied to an RD design with a discrete score, the effective number of observations used by continuity-based methods is the number of mass points or distinct values, not the total number of observations. Thus, in practical terms, fitting a local polynomial to the raw data with mass points is roughly equivalent to fitting a local polynomial to a "collapsed" version of the data, where we aggregate the original observations by the discrete score values, calculating the average outcome for all observations that share the same score value. As a result, the total number of observations in the collapsed dataset is equal to the number of mass points in the running variable. To illustrate this procedure with the LSO data, we calculate the average outcome for each of the 430 mass points in the score. The resulting dataset has 430 observations, where each observation consists of a score-outcome pair: every score value is paired with the average outcome across all students in the original dataset whose score is equal to that value. We then use rdrobust to estimate the RD effect with a local polynomial.

> data2 = data.frame(nextGPA_nonorm, X)
> dim(data2)
[1] 44362     2
> collapsed = aggregate(nextGPA_nonorm ~ X, data = data2, mean)
> dim(collapsed)
[1] 429   2
> rdrobust(collapsed$nextGPA_nonorm, collapsed$X)

Call: rdrobust(y = collapsed$nextGPA_nonorm, x = collapsed$X)

Summary:
Number of Obs            429
BW Type                mserd
Kernel Type       Triangular
VCE Type                  NN

                        Left     Right
Number of Obs            274       155
Eff. Number of Obs        51        50
Order Loc Poly (p)         1         1
Order Bias (q)             2         2
BW Loc Poly (h)       0.5057    0.5057
BW Bias (b)           0.8053    0.8053
rho (h/b)             0.6280    0.6280

Estimates:
               Coef     Std. Err.   z        P>|z|    CI Lower   CI Upper
Conventional   0.2456   0.0321      7.6400   0.0000   0.1826     0.3085
Robust         -        -           -        0.0000   0.1659     0.3165


Analogous Stata command
. collapse (mean) nextGPA_nonorm, by(X)
. rdrobust nextGPA_nonorm X

The estimated effect is 0.2456, with a robust p-value below 0.00005. This is similar to the 0.2221 point estimate obtained with the raw dataset. The similarity between the two point estimates is remarkable, but not unusual, considering that the former was calculated using 430 observations, while the latter was calculated using 40,582 observations, more than a ninety-fold increase. Indeed, the inference conclusions from both analyses are extremely consistent: the robust 95% confidence interval using the raw data is [0.1217, 0.3044], while the robust confidence interval for the collapsed data is [0.1659, 0.3165], both indicating that the plausible values of the effect lie in roughly the same positive range. This analysis shows that the seemingly large number of observations in the raw dataset is effectively much smaller, and that the behavior of the continuity-based results is governed by the average behavior of the data at every mass point. Thus, a natural point of departure for researchers who wish to study a discrete RD design with many mass points is to collapse the data and estimate the effects on the aggregated observations. As a second step, these aggregate results can be compared to the results using the raw data; in most cases, both sets of results should lead to the same conclusions. While the mechanics of local polynomial fitting with a discrete running variable are now clear, the relevance and interpretation of the treatment effect may change. As we discuss below, researchers may want to change the parameter of interest altogether when the score is discrete. Alternatively, if the canonical parameter is retained, (parametric) extrapolation is unavoidable for point identification. To be more precise, because the score is discrete, it is not possible to nonparametrically point identify the vertical distance τSRD = E[Yi(1)|Xi = x̄] − E[Yi(0)|Xi = x̄], even asymptotically, because conceptually the lack of denseness in Xi makes it impossible to appeal to large-sample approximations. Put differently, if researchers insist on retaining the same parameter of interest as in the canonical RD design, then extrapolation from the closest mass point to the cutoff will be needed, no matter how large the sample size is. Of course, there is no reason why the same RD treatment effect must remain the parameter of interest when the running variable is discrete and, if it does, there is no reason to privilege one extrapolation method over another. In this sense, the continuity-based approach, which amounts to simple local linear extrapolation towards the cutoff, is natural and intuitive. When only a few mass points are present, bandwidth selection makes little sense, and the researcher may simply conduct linear (parametric) extrapolation globally, as this is essentially the only possibility if the goal is to retain the same canonical treatment effect parameter.
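To make the comparison between the aggregated and raw-data analyses described above concrete, a small sketch like the following (not part of the original analysis) places the two sets of estimates side by side, using the objects X, nextGPA_nonorm, and collapsed constructed earlier.

> # Sketch: raw versus collapsed continuity-based estimates and robust CIs
> raw = rdrobust(nextGPA_nonorm, X)
> agg = rdrobust(collapsed$nextGPA_nonorm, collapsed$X)
> data.frame(analysis        = c("raw data", "collapsed data"),
+            estimate        = c(raw$coef[1], agg$coef[1]),
+            robust.ci.lower = c(raw$ci[3, 1], agg$ci[3, 1]),
+            robust.ci.upper = c(raw$ci[3, 2], agg$ci[3, 2]))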


7.5 Local Randomization RD Analysis with Discrete Score

A natural alternative for analyzing an RD design with a discrete running variable is to use the local randomization approach, which effectively changes the parameter of interest from the RD treatment effect at the cutoff to the RD treatment effect in the neighborhood around the cutoff where local randomization is assumed to hold. A key advantage of this alternative conceptual framework is that, unlike the continuity-based approach illustrated above, it can be used even when there are very few mass points in the running variable: indeed, it can be used with as few as two mass points. To illustrate the change in the RD parameter of interest, consider the extreme case where the score takes the five values −2, −1, 0, 1, 2 and the RD cutoff is x̄ = 0. Then, the continuity-based parameter of interest is τSRD = E[Yi(1)|Xi = 0] − E[Yi(0)|Xi = 0], which is not nonparametrically identifiable, but the local randomization parameter will be τLR = E[Yi(1)|Xi = 0] − E[Yi(0)|Xi = −1] when W0 = [−1, 0], say, which is nonparametrically identifiable under the conditions discussed in Section 5. Going from τLR to τSRD requires extrapolating from E[Yi(0)|Xi = −1] to E[Yi(0)|Xi = 0], which is impossible without additional assumptions even in large samples because of the intrinsic discreteness of the running variable. In some specific applications, additional features may allow researchers to extrapolate (e.g., rounding), but in general extrapolation will require additional restrictions on the data generating process. Furthermore, from a conceptual point of view, it can be argued that the parameter τLR is more interesting and policy relevant than the parameter τSRD when the running variable is discrete. When the score is discrete, using the local randomization approach for inference does not require choosing a window in most applications. In other words, with a discrete running variable the researcher knows the exact location of the minimum window around the cutoff: this window is the interval of the running variable that contains the two mass points, one on each side of the cutoff, that are immediately consecutive to the cutoff value. Crucially, if local randomization holds, then it must hold for the smallest window in the absence of design failures such as manipulation of the running variable. To illustrate, as shown in Table 7.2, in the LSO application the original score has a mass point at zero where all observations are control (because they reach the minimum GPA required to avoid probation), and the mass point immediately below it occurs at -0.01, where all students are placed on probation because they fall short of the threshold to avoid probation. Thus, the smallest window around the cutoff in the scale of the original score is W0 = [−0.01, 0.00]. Analogously, in the scale of the transformed score, the minimum window is W0 = [−0.000005, 0.01]; a short sketch of how to locate these two mass points directly in the data is given below. Regardless of the scale used, the important point is that the minimum window around the cutoff in a local randomization analysis of an RD design with a discrete score is precisely the interval between the two consecutive mass points where the treatment status changes from zero to one. Note that the particular values taken by the score are irrelevant, as the analysis will proceed to assume that the treated and control groups were assigned to treatment as-if randomly, and will typically make the exclusion restriction assumption that the particular value of the score has no direct impact on the outcome of interest.
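As a minimal sketch (not part of the original analysis), the two mass points that define this smallest window can be located directly from the transformed score.

> # Sketch: mass points defining the minimum window around the cutoff
> closest.control = max(X[X < 0])    # largest control score, -0.000005 in the LSO data
> closest.treated = min(X[X >= 0])   # smallest treated score, 0.01 in the LSO data
> c(closest.control, closest.treated)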
Moreover, the location of the cutoff is no longer meaningful, as any value of the cutoff between the minimum value of the score on the treated side and the maximum value of the score on the control side will produce identical treatment and control groups. Once the researcher finds the treated and control observations located at the two mass points around the cutoff, the local randomization analysis can proceed as explained in Section 5. We first conduct a falsification analysis, to determine whether the assumption of local randomization in the window [−0.000005, 0.01] seems consistent with the empirical evidence. We conduct a density test with the rdwinselect function, setting the option nwindows = 1 to see only the results for this window, to test whether the number of observations in this window is consistent with what would have been observed in a series of unbiased coin flips.

> out = rdwinselect(X, wmin = 0.01, nwindows = 1, cutoff = 5e-06)

Window selection for RD under local randomization

Number of obs    =      44362
Order of poly    =          0
Kernel type      =    uniform
Reps             =       1000
Testing method   =  rdrandinf
Balance test     =  diffmeans

Cutoff c = 0        Left of c   Right of c
Number of obs          37211         7151
1st percentile           298            0
5th percentile          1817          269
10th percentile         3829          663
20th percentile         7588         1344

Window length / 2   p-value   Var. name   Bin. test   Obs < c   Obs >= c
0.01                NA        NA          0           228       77

Analogous Stata command
. rdwinselect X, wmin(0.01) nwindows(1) cutoff(0.000005)

As shown in the rdwinselect output, and also shown previously in Table 7.2, there are 228 control observations immediately below the cutoff and 77 treated observations above the cutoff. In other words, there are 228 students who get exactly the minimum GPA needed to avoid probation, and 77 students who get the maximum possible GPA that still places them on probation. The number of control observations is roughly three times larger than the number of treated observations, a ratio that is inconsistent with the assumption that the probability of treatment assignment in this window was 1/2; the p-value of the Binomial test reported in the column Bin. test is indistinguishable from zero. We can also obtain this result by using the Binomial test commands directly.


> binom.test(77, 305, 1/2)

        Exact binomial test

data:  77 and 305
number of successes = 77, number of trials = 305, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.2046807 0.3051175
sample estimates:
probability of success
              0.252459

Analogous Stata command
. bitesti 305 77 1/2

Although these results alone do not imply that the local randomization RD assumptions are violated, the fact that there are many more control than treated students is consistent with what one would expect if students were actively avoiding an undesirable outcome. The results raise some concern that students may have been aware of the probation cutoff, and may have tried to appeal their final GPA in order to avoid being placed on probation. Strictly speaking, an imbalanced number of observations would not pose any problems if the types of students in the treated and control groups were on average similar. To establish whether treated and control students at the cutoff are similar in terms of observable characteristics, we use rdrandinf to estimate the RD effect of probation on the predetermined covariates introduced above. We report the full results for the covariate hsgrade_pct.

> out = rdrandinf(data$hsgrade_pct, X, wl = -0.005, wr = 0.01, seed = 50)

Selected window = [-0.005;0.01]
Running randomization-based test...
Randomization-based test complete.

Number of obs    =         44362
Order of poly    =             0
Kernel type      =       uniform
Reps             =          1000
Window           =   set by user
H0: tau          =             0
Randomization    = fixed margins

Cutoff c = 0            Left of c   Right of c
Number of obs               37211         7151
Eff. number of obs            228           77
Mean of outcome            29.118       32.675
S.d. of outcome            21.737       21.625
Window                     -0.005         0.01

                             Finite sample   Large sample
Statistic        T           P>|T|           P>|T|    Power vs d = 10.868
Diff. in means   3.557       0.219           0.213    0.968

Analogous Stata command
. rdrandinf hsgrade_pct X, seed(50) wl(-.005) wr(.01)

We repeated this analysis for all predetermined covariates, but do not present the individual runs to conserve space. A summary of the results is reported in Table 7.4. As shown, treated and control students seem indistinguishable in terms of prior high school performance, total number of credits, age, sex, and place of birth. On the other hand, the Fisherian sharp null hypothesis that the treatment has no effect on the English-as-first-language indicator is rejected with a p-value of 0.009. The average differences in this variable are very large: 75% of control students speak English as a first language, but only 62.3% of treated students do. This difference is consistent with the local polynomial results we reported for this variable in Table 7.3, although the difference is much larger (an average difference of -3.5 percentage points in the continuity-based analysis, versus an average difference of -15 percentage points in the local randomization analysis). A similar phenomenon occurs for the Campus 2 and Campus 3 indicators, which appear imbalanced in the local randomization analysis (with Fisherian p-values of 0.075 and 0.009) but appear balanced in the continuity-based analysis. These differences illustrate how a continuity-based analysis of a discrete RD design can mask imbalances that occur at the mass points closest to the cutoff. In general, when analyzing an RD design with a discrete running variable, it is advisable to perform falsification tests with the two mass points closest to the cutoff in order to detect sorting or selection that may go unnoticed when a continuity-based approach is used.


Table 7.4: RD Effects on Predetermined Covariates—LSO data, Local Randomization Approach

Variable | Mean of Controls | Mean of Treated | Diff-in-Means Statistic | Fisherian p-value | Number of Observations
High school grade percentile | 29.118 | 32.675 | 3.557 | 0.201 | 305
Credits attempted in first year | 4.228 | 4.318 | 0.090 | 0.157 | 305
Age at entry | 18.772 | 18.688 | -0.084 | 0.421 | 305
Male | 0.377 | 0.442 | 0.064 | 0.336 | 305
Born in North America | 0.890 | 0.844 | -0.046 | 0.322 | 305
English is first language | 0.772 | 0.623 | -0.149 | 0.009 | 305
At Campus 1 | 0.465 | 0.390 | -0.075 | 0.259 | 305
At Campus 2 | 0.241 | 0.143 | -0.098 | 0.075 | 305
At Campus 3 | 0.294 | 0.468 | 0.174 | 0.009 | 305
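The results in Table 7.4 can be produced by looping rdrandinf over the covariates within the smallest window. The sketch below is not the authors' original code, and the element names obs.stat and p.value are assumptions based on the rdlocrand package output; they should be checked against names(out) in the installed version.

> # Sketch: local randomization balance tests for all predetermined covariates
> library(rdlocrand)
> covs = c("hsgrade_pct", "totcredits_year1", "age_at_entry", "male",
+          "bpl_north_america", "english", "loc_campus1", "loc_campus2", "loc_campus3")
> results = data.frame()
> for (v in covs) {
+   out = rdrandinf(data[, v], X, wl = -0.005, wr = 0.01, seed = 50)
+   results = rbind(results,
+                   data.frame(variable      = v,
+                              diff.in.means = out$obs.stat,   # assumed element name
+                              fisher.p      = out$p.value))   # assumed element name
+ }
> print(results)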

Finally, we also investigate the extent to which the particular window around the cutoff, which includes only two mass points, is driving the empirical results, by repeating the analysis using different nested windows. This exercise is easily implemented using the command rdwinselect:

> Z = cbind(data$hsgrade_pct, data$totcredits_year1, data$age_at_entry,
+           data$male, data$bpl_north_america, data$english,
+           data$loc_campus1, data$loc_campus2, data$loc_campus3)
> colnames(Z) = c("hsgrade_pct", "totcredits_year1", "age_at_entry", "male",
+                 "bpl_north_america", "english", "loc_campus1",
+                 "loc_campus2", "loc_campus3")
> out = rdwinselect(X, Z, p = 1, seed = 50, wmin = 0.01, wstep = 0.01, cutoff = 5e-06)

Window selection for RD under local randomization

Number of obs    =      44362
Order of poly    =          1
Kernel type      =    uniform
Reps             =       1000
Testing method   =  rdrandinf
Balance test     =  diffmeans

Cutoff c = 0        Left of c   Right of c
Number of obs          37211         7151
1st percentile           298            0
5th percentile          1817          269
10th percentile         3829          663
20th percentile         7588         1344

Window length / 2   p-value   Var. name         Bin. test   Obs < c   Obs >= c
0.01                0.008     loc_campus3       0              228        77
0.02                0         totcredits_year   0              298       214
0.03                0         hsgrade_pct       0              374       269
0.04                0         loc_campus1       0              494       375
0.05                0         totcredits_year   0              636       418
0.06                0         hsgrade_pct       0              714       497
0.07                0         hsgrade_pct       0              807       663
0.08                0         totcredits_year   0              877       727
0.09                0         totcredits_year   0             1049       815
0.10                0         totcredits_year   0.001         1131       973

Analogous Stata command
. rdwinselect X $covariates, cutoff(0.000005) wmin(0.01) wstep(0.01)

The empirical results continue to provide evidence of imbalance in at least one pre-intervention covariate for each window considered, using randomization inference methods for the difference-in-means test statistic. To complete the empirical illustration, we investigate the local randomization RD treatment effect on the main outcome of interest using rdrandinf.

> out = rdrandinf(nextGPA_nonorm, X, wl = -0.005, wr = 0.01, seed = 50)

Selected window = [-0.005;0.01]
Running randomization-based test...
Randomization-based test complete.

Number of obs    =         40582
Order of poly    =             0
Kernel type      =       uniform
Reps             =          1000
Window           =   set by user
H0: tau          =             0
Randomization    = fixed margins

Cutoff c = 0            Left of c   Right of c
Number of obs               34854         5728
Eff. number of obs            208           67
Mean of outcome              1.83        2.063
S.d. of outcome             0.868        0.846
Window                     -0.005         0.01

                             Finite sample   Large sample
Statistic        T           P>|T|           P>|T|    Power vs d = 0.434
Diff. in means   0.234       0.063           0.051    0.952

Analogous Stata command
. rdrandinf nextGPA_nonorm X, seed(50) wl(-0.005) wr(0.01)

Remarkably, the difference in means between the 208 control students and the 67 treated students in the smallest window around the cutoff is 0.234 grade points, extremely similar to the continuity-based local polynomial effects of 0.2221 and 0.2456 that we found using the raw and aggregated data, respectively. (The discrepancy between the treated and control sample sizes of 77 and 228 reported in Table 7.2 and the sample sizes of 67 and 208 reported in the rdrandinf output occurs because there are missing values in the nextGPA outcome, as students who leave the university do not have a future GPA.) Moreover, we can reject the null hypothesis of no effect at the 10% level using both the Fisherian and the Neyman inference approaches. This shows that the results for next-term GPA are remarkably robust: we found very similar results using the 208 + 67 = 275 observations closest to the cutoff in a local randomization analysis, the total 40,582 observations in a continuity-based analysis, and the 429 aggregated observations in a continuity-based analysis.
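The missing-outcome accounting described in the parenthetical remark above can be checked with a short cross-tabulation; this is only a sketch, not part of the original analysis.

> # Sketch: observations in the smallest window with and without a nextGPA value
> in.window = X >= -0.005 & X <= 0.01
> table(treated = (X[in.window] >= 0), observed = !is.na(nextGPA_nonorm[in.window]))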

7.6 Further Readings

Lee and Card (2008) proposed alternative assumptions under which the local polynomial methods in the continuity-based RD framework can be applied when the running variable is discrete. Their method requires assuming a random specification error that is orthogonal to the score, and modifying inferences by using standard errors that are clustered at each of the different values taken by the score. Similarly, Dong (2015) discusses the issue of rounding in the running variable. Both approaches have in common that the score is assumed to be inherently continuous, but imperfectly measured (perhaps because of rounding errors) in such a way that the dataset available to the researcher contains mass points. Cattaneo et al. (2015, Section 6.2) discuss explicitly the connections between discrete scores and the local randomization approach; see also Cattaneo et al. (2017d).

8 Final Remarks

Bibliography

Angrist, J. D., and Rokkanen, M. (2015), "Wanna get away? Regression discontinuity estimation of exam school effects away from the cutoff," Journal of the American Statistical Association, 110, 1331–1344.

Arai, Y., and Ichimura, H. (2016), "Optimal bandwidth selection for the fuzzy regression discontinuity estimator," Economic Letters, 141, 103–106.

(2017), "Simultaneous Selection of Optimal Bandwidths for the Sharp Regression Discontinuity Estimator," Quantitative Economics, forthcoming.

Bajari, P., Hong, H., Park, M., and Town, R. (2011), "Regression Discontinuity Designs with an Endogenous Forcing Variable and an Application to Contracting in Health Care," NBER Working Paper No. 17643.

Barreca, A. I., Lindo, J. M., and Waddell, G. R. (2016), "Heaping-Induced Bias in Regression-Discontinuity Designs," Economic Inquiry, 54, 268–293.

Bartalotti, O., and Brummet, Q. (2017), "Regression Discontinuity Designs with Clustered Data," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 383–420.

Bartalotti, O., Calhoun, G., and He, Y. (2017), "Bootstrap Confidence Intervals for Sharp Regression Discontinuity Designs," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 421–453.

Bertanha, M. (2017), "Regression Discontinuity Design with Many Thresholds," Working paper, University of Notre Dame.

Bertanha, M., and Imbens, G. W. (2017), "External Validity in Fuzzy Regression Discontinuity Designs," National Bureau of Economic Research, working paper 20773.

Calonico, S., Cattaneo, M. D., and Farrell, M. H. (2017a), "Coverage Error Optimal Confidence Intervals for Regression Discontinuity Designs," working paper, University of Michigan.

(2017b), "On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference," Journal of the American Statistical Association, forthcoming.

Calonico, S., Cattaneo, M. D., Farrell, M. H., and Titiunik, R. (2017c), "Regression Discontinuity Designs Using Covariates," working paper, University of Michigan.

(2017d), "rdrobust: Software for Regression Discontinuity Designs," Stata Journal, forthcoming.

Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014a), "Robust Data-Driven Inference in the Regression-Discontinuity Design," Stata Journal, 14, 909–946.

(2014b), "Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs," Econometrica, 82, 2295–2326.

(2015a), "Optimal Data-Driven Regression Discontinuity Plots," Journal of the American Statistical Association, 110, 1753–1769.

(2015b), "rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs," R Journal, 7, 38–51.

Canay, I. A., and Kamat, V. (2017), "Approximate Permutation Tests and Induced Order Statistics in the Regression Discontinuity Design," Working paper, Northwestern University.

Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015), "Inference on Causal Effects in a Generalized Regression Kink Design," Econometrica, 83, 2453–2483.

Card, D., Lee, D. S., Pei, Z., and Weber, A. (2017), "Regression Kink Design: Theory and Practice," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 341–382.

Cattaneo, M. D., and Escanciano, J. C. (2017), Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), Emerald Group Publishing.

Cattaneo, M. D., and Farrell, M. H. (2013), "Optimal convergence rates, Bahadur representation, and asymptotic normality of partitioning estimators," Journal of Econometrics, 174, 127–143.

Cattaneo, M. D., Frandsen, B., and Titiunik, R. (2015), "Randomization Inference in the Regression Discontinuity Design: An Application to Party Advantages in the U.S. Senate," Journal of Causal Inference, 3, 1–24.

Cattaneo, M. D., Jansson, M., and Ma, X. (2017a), "Simple Local Regression Distribution Estimators with an Application to Manipulation Testing," working paper, University of Michigan.

(2017b), "rddensity: Manipulation Testing based on Density Discontinuity," working paper, University of Michigan.

Cattaneo, M. D., Keele, L., Titiunik, R., and Vazquez-Bare, G. (2016a), "Interpreting Regression Discontinuity Designs with Multiple Cutoffs," Journal of Politics, 78, 1229–1248.

(2017c), "Extrapolating Treatment Effects in Multi-Cutoff Regression Discontinuity Designs," Working paper, University of Michigan.

Cattaneo, M. D., and Titiunik, R. (2017), "Regression Discontinuity Designs: A Review of Recent Methodological Developments," manuscript in preparation, University of Michigan.

Cattaneo, M. D., Titiunik, R., and Vazquez-Bare, G. (2016b), "Inference in Regression Discontinuity Designs under Local Randomization," Stata Journal, 16, 331–367.

(2017d), "Comparing Inference Approaches for RD Designs: A Reexamination of the Effect of Head Start on Child Mortality," Journal of Policy Analysis and Management, forthcoming.

Cattaneo, M. D., and Vazquez-Bare, G. (2016), "The Choice of Neighborhood in Regression Discontinuity Designs," Observational Studies, 2, 134–146.

Cerulli, G., Dong, Y., Lewbel, A., and Poulsen, A. (2017), "Testing Stability of Regression Discontinuity Models," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 317–339.

Chiang, H. D., Hsu, Y.-C., and Sasaki, Y. (2017), "A Unified Robust Bootstrap Method for Sharp/Fuzzy Mean/Quantile Regression Discontinuity/Kink Designs," Working paper, Johns Hopkins University.

Chiang, H. D., and Sasaki, Y. (2017), "Causal Inference by Quantile Regression Kink Designs," Working paper, Johns Hopkins University.

Cook, T. D. (2008), ""Waiting for Life to Arrive": A history of the regression-discontinuity design in Psychology, Statistics and Economics," Journal of Econometrics, 142, 636–654.

Dong, Y. (2015), "Regression Discontinuity Applications with Rounding Errors in the Running Variable," Journal of Applied Econometrics, 30, 422–446.

(2017), "Regression Discontinuity Designs with Sample Selection," Journal of Business & Economic Statistics, forthcoming.

Dong, Y., and Lewbel, A. (2015), "Identifying the Effect of Changing the Policy Threshold in Regression Discontinuity Models," Review of Economics and Statistics, 97, 1081–1092.

Ernst, M. D. (2004), "Permutation Methods: A Basis for Exact Inference," Statistical Science, 19, 676–685.

Fan, J., and Gijbels, I. (1996), Local polynomial modelling and its applications: monographs on statistics and applied probability 66, Vol. 66, CRC Press.

Frandsen, B. (2017), "Party Bias in Union Representation Elections: Testing for Manipulation in the Regression Discontinuity Design When the Running Variable is Discrete," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 281–315.

Ganong, P., and Jäger, S. (2017), "A Permutation Test for the Regression Kink Design," Journal of the American Statistical Association, forthcoming.

Gelman, A., and Imbens, G. W. (2014), "Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs," NBER Working Paper 20405, New York: National Bureau of Economic Research.

Hahn, J., Todd, P., and van der Klaauw, W. (2001), "Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design," Econometrica, 69, 201–209.

Imbens, G., and Lemieux, T. (2008), "Regression Discontinuity Designs: A Guide to Practice," Journal of Econometrics, 142, 615–635.

Imbens, G., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, Cambridge University Press.

Imbens, G. W., and Kalyanaraman, K. (2012), "Optimal Bandwidth Choice for the Regression Discontinuity Estimator," Review of Economic Studies, 79, 933–959.

Jales, H., and Yu, Z. (2017), "Identification and Estimation using a Density Discontinuity Approach," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 29–72.

Keele, L. J., and Titiunik, R. (2015), "Geographic Boundaries as Regression Discontinuities," Political Analysis, 23, 127–155.

Kleven, H. J. (2016), "Bunching," Annual Review of Economics, 8, 435–464.

Lee, D. S. (2008), "Randomized Experiments from Non-random Selection in U.S. House Elections," Journal of Econometrics, 142, 675–697.

Lee, D. S., and Card, D. (2008), "Regression discontinuity inference with specification error," Journal of Econometrics, 142, 655–674.

Lee, D. S., and Lemieux, T. (2010), "Regression Discontinuity Designs in Economics," Journal of Economic Literature, 48, 281–355.

Lindo, J. M., Sanders, N. J., and Oreopoulos, P. (2010), "Ability, Gender, and Performance Standards: Evidence from Academic Probation," American Economic Journal: Applied Economics, 2, 95–117.

Ludwig, J., and Miller, D. L. (2007), "Does Head Start Improve Children's Life Chances? Evidence from a Regression Discontinuity Design," Quarterly Journal of Economics, 122, 159–208.

McCrary, J. (2008), "Manipulation of the running variable in the regression discontinuity design: A density test," Journal of Econometrics, 142, 698–714.

Meyersson, E. (2014), "Islamic Rule and the Empowerment of the Poor and Pious," Econometrica, 82, 229–269.

Papay, J. P., Willett, J. B., and Murnane, R. J. (2011), "Extending the regression-discontinuity approach to multiple assignment variables," Journal of Econometrics, 161, 203–207.

Porter, J. (2003), "Estimation in the Regression Discontinuity Model," working paper, University of Wisconsin.

Reardon, S. F., and Robinson, J. P. (2012), "Regression discontinuity designs with multiple rating-score variables," Journal of Research on Educational Effectiveness, 5, 83–104.

Rosenbaum, P. R. (2002), Observational Studies, New York: Springer.

(2010), Design of Observational Studies, New York: Springer.

Sekhon, J. S., and Titiunik, R. (2016), "Understanding Regression Discontinuity Designs as Observational Studies," Observational Studies, 2, 174–182.

(2017), "On Interpreting the Regression Discontinuity Design as a Local Experiment," in Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. M. D. Cattaneo and J. C. Escanciano, Emerald Group Publishing, pp. 1–28.

Shen, S., and Zhang, X. (2016), "Distributional Regression Discontinuity: Theory and Applications," Review of Economics and Statistics, 98, 685–700.

Thistlethwaite, D. L., and Campbell, D. T. (1960), "Regression-discontinuity Analysis: An Alternative to the Ex-Post Facto Experiment," Journal of Educational Psychology, 51, 309–317.

Tukiainen, J., Saarimaa, T., Hyytinen, A., Meriläinen, J., and Toivanen, O. (2017), "When Does Regression Discontinuity Design Work? Evidence from Random Election Outcomes," VATT Working Papers 59.

Wing, C., and Cook, T. D. (2013), "Strengthening The Regression Discontinuity Design Using Additional Design Elements: A Within-Study Comparison," Journal of Policy Analysis and Management, 32, 853–877.

Wong, V. C., Steiner, P. M., and Cook, T. D. (2013), "Analyzing Regression-Discontinuity Designs With Multiple Assignment Variables: A Comparative Study of Four Estimation Methods," Journal of Educational and Behavioral Statistics, 38, 107–141.

Xu, K.-L. (2017), "Regression Discontinuity with Categorical Outcomes," Working Paper, Indiana University.
