Abstract

This paper develops a nonparametric model that represents how sequences of outcomes and treatment choices are influenced by each other in a dynamic manner. In this setting, we are interested in identifying the average outcome of individuals in each period had a particular treatment sequence been assigned. The identification of this quantity allows us to identify the average treatment effects (ATE’s) as well as the optimal treatment regime, namely, the regime that maximizes the (weighted) sum of the average potential outcomes, possibly less the cost of treatments. The main contribution of this paper is to relax the sequential randomization assumption widely used in the biostatistics literature by introducing a flexible choice-theoretic framework for a sequence of endogenous treatments. We show that the parameters of interest are identified under two-way exclusion restrictions in each period, i.e., with instruments excluded from the outcome-determining process and other exogenous variables excluded from the treatment selection process. We also consider partial identification in the case where the latter variables are not available. The results of this paper are extended to a setting where treatments do not appear in every time period.

JEL Numbers: C14, C35, C57

Keywords: Dynamic treatment effects, optimal treatment regimes, dynamic models, endogenous treatments, binary endogenous variables, average treatment effects, instrumental variables.

1 Introduction

This paper develops a nonparametric model that represents how sequences of outcomes and treatment choices influence each other dynamically. Treatments are often chosen repeatedly over a horizon, affecting a series of outcomes.

The author is very grateful to Dan Ackerberg, Xiaohong Chen and Ed Vytlacil for thoughtful comments and discussions.

Examples are medical interventions affecting health outcomes, educational interventions affecting academic achievements, job training programs affecting employment status, or online advertisements affecting consumers’ preferences or purchase decisions. The relationship of interest is dynamic in the sense that the current outcome is determined as a function of the past outcomes as well as the current and past treatments, and the current treatment as a function of the past outcomes as well as the past treatments. Such dynamic relationships are clearly present in the examples mentioned. A static model misrepresents the nature of the problem (e.g., state dependence, learning) and fails to capture important policy questions (e.g., optimal timing of interventions). In this setting, we are interested in identifying the causal effect of a sequence of endogenous treatments on a sequence of outcomes, or on a terminal outcome that may or may not be of the same kind as transition outcomes. Specifically, we are interested in learning about the average of an outcome in each period had a particular treatment sequence been assigned up to that period, which defines the potential outcome in the dynamic setting. We are also interested in the average treatment effects (ATE’s) defined based on the average potential outcome. For example, one may be interested in whether the success rate of a particular outcome is larger with a sequence of treatments assigned in relatively later periods rather than earlier, or with a sequence of alternating treatments rather than consistent treatments. Lastly, we are interested in an optimal treatment regime, namely, a sequence of treatments that maximizes the (weighted) sum of the average potential outcomes, possibly less the cost of treatments. 
For example, a firm may be interested in the optimal timing of advertisements that maximizes the aggregate sales probabilities over time, or a sequence of educational programs may aim to maximize the college attendance rate. We show that the optimal regime is a natural extension of a static object commonly sought in the literature, namely the sign of the ATE. Analogous to the static environment, knowledge of the optimal treatment regime may have important policy implications. For example, a social planner can at least hope to rule out specific sequences of treatments that are harmful on average.

Dynamic treatment effects have been extensively studied in the biostatistics literature for decades (Robins (1986, 1987), Murphy et al. (2001), Murphy (2003), among others), based on a counterfactual framework with a sequence of treatments. In this literature, the crucial condition for identification of the average potential outcome is a dynamic version of a random assignment assumption, called sequential randomization. It assumes that the treatment is randomized in every period among individuals who have the same history of outcomes and treatments.1 This assumption is suitable in experimental studies (e.g., clinical trials, field experiments) with perfect compliance of subjects, but hard to justify in studies with partial compliance or in many observational contexts as in the examples above. The main contribution of this paper is to relax the assumption of sequential randomization widely used in the literature by introducing a flexible choice-theoretic framework for a sequence of endogenous treatments. Toward this end, we consider a simple nonparametric structural model for a dynamic endogenous selection process and dynamic outcome formation.
In this model, individuals are allowed not to fully comply with each period’s assignment in experimental settings, or are allowed to make an endogenous choice in each period in observational settings. The heterogeneity in potential outcomes is given by a switching-regression type of model with a sequential version of rank similarity. The joint distribution of the full history of unobservable variables in the outcome and treatment equations is still flexible, allowing for arbitrary forms of treatment endogeneity as well as serial correlation, e.g., via time-invariant individual effects. Relative to the counterfactual framework, the dynamic mechanism is clearly formulated in this structural model, which in turn facilitates our identification analysis.

We show that the average potential outcome, or equivalently the average recursive structural function (ARSF) given the structural model we introduce, is identified under two-way exclusion restrictions (Vytlacil and Yildiz (2007), Shaikh and Vytlacil (2011), Han (2018)), i.e., with instruments excluded from the outcome-determining process and other exogenous variables excluded from the treatment selection process. Examples of the former would be randomized treatments or a sequence of policy shocks; examples of the latter would be factors that agents cannot anticipate when (or before) making treatment decisions, but that determine the outcome. The identification of each period’s ARSF allows us to identify the ATE’s as well as the optimal treatment regime. In this paper, we also consider the cases where the two-way exclusion restrictions are violated in the sense that only a standard exclusion restriction holds, or where the variation of the exogenous variables is limited. In these cases, we can calculate bounds on the ARSF and the ATE. As an extension of the results of this paper, we consider another empirically relevant situation where treatments do not appear in every time period while outcomes are constantly observed. We show that the parameters of interest and the identification analysis can be easily modified to incorporate this situation.

1 In the econometrics literature, Vikström et al. (2017) consider treatment effects on a process of transition to a destination state, and carefully analyze what the sequential randomization assumption can identify in the presence of dynamic selection.
Despite its importance, studies on the effects of dynamic endogenous treatments are still limited in the literature. Abbring and Van den Berg (2003) analyze the effect of a continuous endogenous treatment (i.e., a treatment duration) on an outcome duration and take a parametric approach by considering a mixed proportional hazard model. Cunha et al. (2007) and Heckman and Navarro (2007) consider a semiparametric discrete-time duration model for the choice of the treatment timing and associated outcomes. Building on these papers, Heckman et al. (2016) consider not only ordered choice models but also unordered choice models for up-or-out treatment choices.2 An interesting feature of the results in their paper is that dynamic treatment effects are decomposed into direct effects and continuation values. The present paper considers nonparametric dynamic models for a sequence of treatments and outcomes with a general form of evolution where the processes can freely change states. It includes irreversible processes of duration models as a special case. The paper complements the previous papers in that it considers a different form of dynamics for the treatment choices and outcomes and different identifying assumptions, and it focuses on the identification of the ATE’s and related parameters.

This paper’s approach is structural only relative to the counterfactual framework of Robins. A fully structural model of dynamic programming is considered in seminal work by Rust (1987) and more recently by, e.g., Blevins (2014) and Buchholz et al. (2016). This literature typically considers a single rational agent’s optimal decision, whereas this paper considers multiple heterogeneous agents without assumptions on agents’ rationality or strong parametric assumptions. Most importantly, this paper’s focus is on the identification of the effects of treatments formed as agents’ decisions on outcomes. The robust approach of this paper is similar in spirit to Heckman and Navarro (2007) and Heckman et al. (2016). But unlike these papers, we do not necessarily invoke infinite supports of exogenous variables while remaining flexible for economic and non-economic components of the model. Lastly, Torgovitsky (2016) extends the literature on dynamic discrete choice models (with no treatment) by considering a counterfactual framework without imposing parametric assumptions. In his framework, $Y_{t-1}$ takes the role of a treatment for $Y_t$ and the “treatment effect” captures the state dependence. In the present paper, we consider the effects of the treatment $D_t$ on $Y_t$, and introduce a selection equation for $D_t$ as an important component of the model.

2 As related work, the settings of Angrist and Imbens (1995) and Lee and Salanié (2017) for multiple treatment effects can be applied to a dynamic setting. Also see Abbring and Heckman (2007) for a survey on dynamic treatment effects.

In terms of notation, let $W^t \equiv (W_1, \ldots, W_t)$ denote a row vector that collects r.v.’s $W_t$ across time up to $t$, and let $w^t$ be its realization. We sometimes write $W \equiv W^T$ for convenience. For a vector $W$ without the $t$-th element, we write $W_{-t} \equiv (W_1, \ldots, W_{t-1}, W_{t+1}, \ldots, W_T)$ with realization $w_{-t}$. More generally, let $W_-$ with realization $w_-$ denote some subvector of $W$. Lastly, for r.v.’s $Y$ and $W$, we sometimes abbreviate $\Pr[Y = y|W = w]$ to $\Pr[Y = y|w]$ or $P[y|w]$.

2 Robins’s Framework

We first introduce Robins’s counterfactual framework and state the assumption of sequential randomization commonly used in the biostatistics literature (Robins (1986, 1987), Murphy et al. (2001), Murphy (2003)). For a finite horizon $t = 1, \ldots, T$ with fixed $T$, $Y_t$ is the outcome at $t$ with realization $y_t$ and $D_t$ is a binary treatment at $t$ with realization $d_t$. The underlying data structure is panel data (cross-sectional index $i$ suppressed throughout, unless necessary). We call $Y_T$ a terminal outcome and $Y_t$ for $t \le T - 1$ a transition outcome. Let $\mathcal{Y}$ and $\mathcal{D} \subseteq \{0, 1\}^T$ be the supports of $Y \equiv (Y_1, \ldots, Y_T)$ and $D \equiv (D_1, \ldots, D_T)$, respectively. Consider a treatment regime $d \equiv (d_1, \ldots, d_T) \in \mathcal{D}$, which is defined as a predetermined hypothetical sequence of interventions over time, i.e., a sequence of each period’s assignment decisions of whether to treat or not, or whether to assign treatment A or treatment B.3 Then a potential outcome at $t$ can be written as $Y_t(d)$. It can be understood as an outcome of an individual had a particular treatment sequence been assigned. Although the genesis of $Y_t(d)$ can be very general under this counterfactual framework, the mechanism under which the sequence of treatments interacts with the sequence of outcomes is opaque. The definition of $Y_t(d)$ becomes more transparent later with the structural model introduced in this paper. Given these definitions, we state the assumption of sequential randomization by Robins: For each $d \in \mathcal{D}$,

$$(Y_1(d), \ldots, Y_T(d)) \perp D_t \mid Y^{t-1}, D^{t-1} \tag{2.1}$$

for $t = 1, \ldots, T$.4 This assumption asserts that, holding the history of outcomes and treatments fixed, the current treatment is fully randomized. In the next section, we relax this assumption and specify dynamic selection equations for a sequence of treatments that are allowed to be endogenous. Apart from this assumption, we maintain the same preliminaries introduced in this section.

3 This is called a static regime in the biostatistics literature. A dynamic regime is a sequence of treatment assignments, each of which is a predetermined function of past outcomes; see, e.g., Murphy et al. (2001). A static regime can be seen as its special case where this function is constant.

4 This assumption is also called sequential conditional independence or sequential ignorability in the literature.

Remark 2.1 (Irreversibility). As an important special case of our setting, the process of $Y_t$ and/or $D_t$ may be irreversible in that the process only moves from an initial state to a destination state, i.e., the destination state is an absorbing state. The up-or-out treatment decision in Heckman et al. (2016) can be seen as the case where $D_t = 1$ once $D_{t-1} = 1$ is reached. The survival of patients ($Y_t = 0$) can be an example where the transition of the outcome satisfies $Y_t = 1$ once $Y_{t-1} = 1$. In this case, it may be that $D_t$ is missing when $Y_{t-1} = 1$, which can be dealt with conventionally by assuming $D_t = 0$ if $Y_{t-1} = 1$. When processes are irreversible, the supports $\mathcal{Y}$ and $\mathcal{D}$ are strict subsets of $\{0, 1\}^T$.

Remark 2.2 (Terminal outcome of a different kind). As in Murphy et al. (2001) and Murphy (2003), we may be interested in a terminal outcome that is of a different kind than transition outcomes. For example, a terminal outcome can be college attendance while transition outcomes are high school performances. In this case, we replace $Y_T$ with a random variable $R_T$ to denote a terminal outcome, while maintaining $Y_t$ with $t \le T - 1$ to denote transition outcomes, and $R_T(d)$ denotes the potential terminal outcome. The analysis of this paper then carries over with this change of notation.5
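To make the role of assumption (2.1) concrete, the following is a minimal simulation sketch for $T = 2$. It relies on the standard g-computation identity from the literature cited above, $E[Y_2(d)] = \sum_{y_1} E[Y_2 \mid D = d, Y_1 = y_1] \Pr[Y_1 = y_1 \mid D_1 = d_1]$, which holds when treatments are randomized given the observed history. The data-generating process, the coefficients, and the helper names (`outcome`, `g_formula`, `true_mean`) are assumptions made purely for illustration and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Assumed data-generating process (illustrative only).
alpha = rng.normal(size=n)                       # individual effect -> serial correlation
U1 = 0.6 * alpha + 0.8 * rng.normal(size=n)
U2 = 0.6 * alpha + 0.8 * rng.normal(size=n)

def outcome(y_prev, d, u):
    # Hypothetical outcome rule: Y_t = 1{-0.3 + 0.5*Y_{t-1} + 0.7*D_t >= U_t}
    return (-0.3 + 0.5 * y_prev + 0.7 * d >= u).astype(int)

# Treatments are randomized given the observed history, so (2.1) holds by construction.
D1 = (rng.uniform(size=n) < 0.5).astype(int)
Y1 = outcome(0, D1, U1)
D2 = (rng.uniform(size=n) < 0.3 + 0.4 * Y1).astype(int)   # assignment depends on Y1 only
Y2 = outcome(Y1, D2, U2)

def g_formula(d1, d2):
    """sum_{y1} E[Y2 | D1=d1, D2=d2, Y1=y1] * Pr[Y1=y1 | D1=d1]."""
    total = 0.0
    for y1 in (0, 1):
        p_y1 = np.mean(Y1[D1 == d1] == y1)
        cell = (D1 == d1) & (D2 == d2) & (Y1 == y1)
        total += Y2[cell].mean() * p_y1
    return total

def true_mean(d1, d2):
    """E[Y_2(d)] obtained by forcing the regime (infeasible with real data)."""
    y1 = outcome(0, d1, U1)
    return outcome(y1, d2, U2).mean()

def naive_mean(d1, d2):
    """E[Y2 | D = d], which ignores that D2 was assigned based on Y1."""
    return Y2[(D1 == d1) & (D2 == d2)].mean()

for d in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(d, round(g_formula(*d), 3), round(true_mean(*d), 3), round(naive_mean(*d), 3))
```

In this sketch the g-formula column should track the forced-regime column closely, while the naive conditional mean need not, because the second-period assignment probability depends on $Y_1$. Once $D_t$ is allowed to depend on unobservables, as in the next section, the g-formula is no longer justified.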

3 A Structural Model and Objects of Interest

We introduce the main framework of this paper. We consider a structural model and identifying assumptions that are plausible in observational studies and are economically interpretable, relative to the counterfactual framework in the previous section. Consider a dynamic structural function for the outcome $Y_t$ that has the form of a switching regression model: For $t = 1, \ldots, T$,

$$Y_t = \mu_t(Y_{t-1}, D_t, X_t, U_t(D_t)),$$

where $\mu_t(\cdot)$ is an unknown scalar-valued function, $X_t$ is a set of exogenous variables, whose details we discuss later, and $Y_0 = 0$ for convenience. The unobservable variable satisfies $U_t(D_t) = D_t U_t(1) + (1 - D_t) U_t(0)$, where $U_t(d_t)$ is the “rank variable” that captures unobserved characteristics or rank, specific to treatment state $d_t$ (Chernozhukov and Hansen (2005)). We allow $U_{it}(d_t)$ to contain a permanent component (i.e., individual effects) and a transitory component.6

5 Extending this framework to incorporate irreversibility discussed in Remark 2.1 is not straightforward. We leave this for future research.

6 It may make sense that the permanent component does not depend on each $d_t$ but the transitory component does.

Given this structural equation, we can express the potential outcome $Y_t(d)$ using a recursive structure:

$$\begin{aligned}
Y_t(d) &= Y_t(d^t) = \mu_t(Y_{t-1}(d^{t-1}), d_t, X_t, U_t(d_t)), \\
&\;\;\vdots \\
Y_2(d) &= Y_2(d^2) = \mu_2(Y_1(d_1), d_2, X_2, U_2(d_2)), \\
Y_1(d) &= Y_1(d_1) = \mu_1(Y_0, d_1, X_1, U_1(d_1)),
\end{aligned}$$

where each potential outcome at time $t$ is only a function of $d^t$ (not the full $d$). This is related to the “no-anticipation” condition (Abbring and Van den Berg (2003), Abbring and Heckman (2007)), which is implied by the structure of the model in our setting. This recursive structure provides us with a useful interpretation of the potential outcome $Y_t(d)$ in a dynamic setting, and thus facilitates our identification analysis. Note that, conditional on $X^t \equiv (X_1, \ldots, X_t)$, the remaining randomness in $Y_t(d)$ comes from $U^t(d^t) \equiv (U_1(d_1), \ldots, U_t(d_t))$. By an iterative argument, one can show that the potential outcome equals the observed outcome when the observed treatments are consistent with the assigned regime: $Y_t(d) = Y_t$ when $D = d$, or equivalently,

$$Y_t = \sum_{d \in \mathcal{D}} 1\{D = d\} Y_t(d).$$
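The recursion and the consistency identity above can be traced for a single individual with a short sketch. The functional form of `mu`, the horizon `T = 3`, and all numerical values are hypothetical choices for illustration; the array `u[t, d]` plays the role of $U_t(d)$.

```python
import itertools
import numpy as np

def mu(t, y_prev, d_t, x_t, u_t):
    # Hypothetical structural function mu_t (an assumption for illustration).
    return int(0.5 * y_prev + 0.8 * d_t + x_t >= u_t)

def potential_outcome(d, x, u):
    """Y_t(d^t) for t = len(d), built recursively starting from Y_0 = 0."""
    y = 0
    for t, d_t in enumerate(d):
        y = mu(t, y, d_t, x[t], u[t, d_t])
    return y

rng = np.random.default_rng(0)
T = 3
x = rng.normal(size=T)
u = rng.normal(size=(T, 2))                              # u[t, d] stands in for U_t(d)
D = tuple(int(v) for v in rng.integers(0, 2, size=T))    # observed treatment sequence

regimes = list(itertools.product((0, 1), repeat=T))

# Observed outcome: the recursion evaluated at the observed treatments.
y_obs = potential_outcome(D, x, u)

# Consistency: Y_T = sum_d 1{D = d} Y_T(d); exactly one indicator is nonzero.
y_sum = sum(int(D == d) * potential_outcome(d, x, u) for d in regimes)

# No anticipation: Y_2(d) is computed from d^2 = (d_1, d_2) only.
y2 = potential_outcome((1, 0), x, u)
```

Passing a shorter tuple such as `(1, 0)` returns $Y_2(d^2)$, which reflects the fact that the period-$t$ potential outcome does not depend on treatments assigned after $t$.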

In this paper, we consider the average potential terminal outcome, conditional on $X \equiv (X_1, \ldots, X_T)$, as the fundamental parameter of interest:

$$E[Y_T(d)|X = x]. \tag{3.1}$$

This parameter is also called the average recursive structural function (ARSF) in the terminal period, named after the recursive structure in $Y_T(d)$. In general, we can consider the average potential outcome of any time period, i.e., $E[Y_t(d)|X^t = x^t]$ for any $t$, but we focus on $E[Y_T(d)|X = x]$ for concreteness. Knowledge of the ARSF is useful for recovering other related parameters. First, we are interested in the conditional ATE:

$$E[Y_T(d) - Y_T(\tilde{d})|X = x] \tag{3.2}$$

for two different regimes $d$ and $\tilde{d}$. For example, one may be interested in comparing more versus less consistent treatment sequences, or earlier versus later treatments. As an additional parameter of interest, we consider an optimal treatment regime:

$$d^*(x) = \arg\max_{d \in \mathcal{D}} E[Y_T(d)|X = x] \tag{3.3}$$

with $|\mathcal{D}| \le 2^T$. That is, we are interested in a treatment regime that delivers the maximum expected outcome conditional on $X = x$.7 Note that in a static model, the identification of $d^*$ is equivalent to the identification of the sign of the static ATE, which is information typically sought from a policy point of view. One can view $d^*$ as a natural extension of this information to a dynamic setting, which is identified by establishing the signs of all the possible ATE’s defined as (3.2). The optimal regime may serve as a guideline in developing future policies. Moreover, it may be too costly to find a customized treatment scheme for every individual, and it may be a realistic goal for a social planner to find a scheme that maximizes the average benefit. Yet, the optimal regime is customized up to observed characteristics, as it is a function of the covariate values $x$.

7 A dynamic version of an optimal treatment regime is considered in, e.g., Murphy (2003).

More ambitious than the identification of $d^*(x)$ may be recovering an optimal regime based on an objective function that delivers a cost-benefit analysis, granting that each $d_t$ can be costly:

$$d^\dagger(x) = \arg\max_{d \in \mathcal{D}} \Pi(d; x),$$

where

$$\Pi(d; x) \equiv w_1 E[Y_T(d)|X = x] - w_0 \sum_{t=1}^{T} d_t$$

or

$$\Pi(d; x) \equiv \sum_{t=1}^{T} w_t E[Y_t(d)|X = x] - w_0 \sum_{t=1}^{T} d_t$$

with $(w_0, w_1)$ and $(w_0, w)$ being predetermined weights. The latter objective function concerns the weighted sum of the average potential outcomes throughout the entire period less the cost of treatments. Note that establishing the signs of the ATE’s will not identify $d^\dagger$, and a stronger identification result, i.e., the point identification of $E[Y_T(d)|X = x]$ for all $d$ (or $E[Y_t(d)|X = x]$ for all $t$ and $d$), is required.

In order to facilitate identification of the parameters of interest without assuming sequential randomization, we introduce a sequence of selection equations for the binary endogenous treatments $D_t$: For $t = 1, \ldots, T$,

$$D_t = 1\{\pi_t(Y_{t-1}, D_{t-1}, Z_t) \ge V_t\},$$

where $\pi_t(\cdot)$ is an unknown scalar-valued function, $Z_t$ denotes the period-specific instruments, $V_t$ is the unobservable variable that may contain permanent and transitory components, and $Y_0 = 0$ and $D_0 = 0$ for convenience. This dynamic selection process represents the agent’s endogenous choices over time, e.g., due to learning or other optimal behaviors. The nonparametric threshold-crossing structure, however, posits a minimal notion of optimality for the agent. We take an agnostic and robust approach by avoiding strong assumptions of the standard dynamic economic models pioneered by Rust (1987), e.g., forward-looking behavior and the ability to compute the present discounted value of a flow of utilities. If we were to maintain the assumption of rational agents, the selection model could be seen as a reduced-form approximation of a solution to a dynamic programming problem.

For the analysis of this paper, we consider binary $Y_t$ and impose weak separability in the outcome equation as in the treatment equation.8 Then the full model can be summarized as

$$Y_t = 1\{\mu_t(Y_{t-1}, D_t, X_t) \ge U_t(D_t)\}, \tag{3.4}$$

$$D_t = 1\{\pi_t(Y_{t-1}, D_{t-1}, Z_t) \ge V_t\}. \tag{3.5}$$

In this model, the observable variables are $(Y, D, X, Z)$. All other covariates common to (3.4) and (3.5) are suppressed for simplicity of exposition. Importantly, in this model the joint distribution of the unobservable variables $(U(d), V)$ for given $d$ is not specified, in that $U_t(d_t)$ and $V_{t'}$ for any $t, t'$ are allowed to be arbitrarily correlated with each other (allowing endogeneity) as well as within themselves across time (allowing serial correlation, e.g., via individual effects).9 We avoid strong assumptions in the model specification, such as conditional independence assumptions, which are commonly introduced in standard dynamic economic models, and parametric functional forms.

Remark 3.1 (Irreversibility, continued). A process that satisfies $D_t = 1$ if $D_{t-1} = 1$ is consistent with having a structural function that satisfies $\pi_t(y_{t-1}, d_{t-1}, z_t) = +\infty$ if $d_{t-1} = 1$. Likewise, processes that satisfy $Y_t = 1$ and $D_t = 0$ if $Y_{t-1} = 1$ are consistent with $\mu_t(y_{t-1}, d_t, x_t) = +\infty$ and $\pi_t(y_{t-1}, d_{t-1}, z_t) = -\infty$ if $y_{t-1} = 1$. The ARSF $E[Y_T(d)|X]$ can be interpreted as (one minus) a potential survival rate.

Remark 3.2 (Terminal outcome of a different kind, continued). When we replace $Y_T$ with $R_T$ to denote a terminal outcome of a different kind, the model (3.4) holds only for $t \le T - 1$ and we additionally introduce $R_T = 1\{\mu_T(Y_{T-1}, D_T, X_T) \ge U_T(D_T)\}$ as the terminal structural function. The potential terminal outcome $R_T(d)$ can accordingly be expressed with the structural functions. The ARSF is written as $E[R_T(d)|X]$ and the other parameters can be defined accordingly.10
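As a rough illustration of the model (3.4)-(3.5) and of the objects $d^*(x)$ and $d^\dagger(x)$ defined in this section, the sketch below simulates an assumed data-generating process with $T = 2$. Everything specific in it is hypothetical: the linear indices `mu` and `pi`, the individual effect `alpha`, the use of rank invariance ($U_t(1) = U_t(0)$, a special case of what the model allows), the evaluation point `x = (0, 0)`, and the weights `w1`, `w0`. The ARSF is computed by forcing each regime on the simulated unobservables, which is only possible because the simulation knows them; it is not an estimator.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, T = 100_000, 2

# Unobservables: a shared individual effect makes U_t and V_t correlated with
# each other (endogeneity) and across periods (serial correlation).
alpha = rng.normal(size=n)
U = 0.6 * alpha[:, None] + 0.8 * rng.normal(size=(n, T))   # assumes U_t(1) = U_t(0)
V = 0.6 * alpha[:, None] + 0.8 * rng.normal(size=(n, T))
X = rng.normal(size=(n, T))            # excluded from the selection equation
Z = rng.integers(0, 2, size=(n, T))    # instruments, excluded from the outcome equation

def mu(y_prev, d, x):                  # hypothetical outcome index mu_t
    return -0.3 + 0.5 * y_prev + 0.7 * d + 0.5 * x

def pi(y_prev, d_prev, z):             # hypothetical selection index pi_t
    return -0.5 + 0.4 * y_prev + 0.3 * d_prev + 1.0 * z

def simulate_observed():
    """Generate (Y, D) from (3.4)-(3.5) with Y_0 = D_0 = 0."""
    Y = np.zeros((n, T), dtype=int)
    D = np.zeros((n, T), dtype=int)
    y_prev = np.zeros(n, dtype=int)
    d_prev = np.zeros(n, dtype=int)
    for t in range(T):
        D[:, t] = pi(y_prev, d_prev, Z[:, t]) >= V[:, t]
        Y[:, t] = mu(y_prev, D[:, t], X[:, t]) >= U[:, t]
        y_prev, d_prev = Y[:, t], D[:, t]
    return Y, D

def arsf(d, x):
    """Monte Carlo value of E[Y_T(d) | X = x], forcing regime d at covariates x."""
    y = np.zeros(n, dtype=int)
    for t in range(T):
        y = (mu(y, d[t], x[t]) >= U[:, t]).astype(int)
    return y.mean()

Y, D = simulate_observed()
x = (0.0, 0.0)
regimes = list(itertools.product((0, 1), repeat=T))
values = {d: arsf(d, x) for d in regimes}

d_star = max(values, key=values.get)                               # as in (3.3)
w1, w0 = 1.0, 0.5                                                  # assumed weights
d_dagger = max(regimes, key=lambda d: w1 * values[d] - w0 * sum(d))
print(values, d_star, d_dagger)
```

With these assumed coefficients the treatment raises the outcome index in every period, so always treating maximizes the ARSF, whereas a sufficiently high per-period cost `w0` makes never treating optimal under the first form of $\Pi(d; x)$. This is one way to see why the signs of the ATE’s alone do not pin down $d^\dagger$.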

4 Identification Analysis

We first identify the ARSF’s, i.e., $E[Y_t(d)|X^t]$ for every $d$ and $t$, which will then identify the ATE’s and the optimal regimes $d^*$ and $d^\dagger$. We maintain the following assumptions on $(U(d), V)$ for every $d$ as well as on $(Z, X)$.

Assumption C. The distribution of $(U(d), V)$ has strictly positive density with respect to Lebesgue measure on $\mathbb{R}^{2T}$.

Assumption SX. $(Z, X)$ and $(U(d), V)$ are independent.

Assumption C is a regularity condition to ensure smoothness of relevant conditional probabilities. Assumption SX imposes strict exogeneity. The variables $Z_t$ are standard excluded instruments. Examples are sequentially randomized treatments or a sequence of policy shocks. In addition to $Z_t$, we introduce exogenous variables $X_t$ in the outcome equation (3.4) that are excluded from the selection equation (3.5) of the same period. We make a behavioral/information assumption that there are outcome-determining factors that the agent cannot anticipate when making a treatment decision at period $t$. In fact, no anticipation is implicitly assumed to hold for periods prior to $t$ as well, since $X_t$ is also excluded from the outcome and treatment equations of all the previous periods.

8 Binary $Y_t$ and weak separability may not be necessary (Vytlacil and Yildiz (2007), Han (2018)) for the results of this paper, but they simplify the exposition.

9 Note that because of this, $Y_t$ and $D_t$ are not Markov processes unless conditional on both the observables and unobservables.

10 Extending this framework to incorporate irreversibility is not straightforward. We leave this for future research.

Next, we introduce a sequential version of the rank similarity assumption (Chernozhukov and Hansen (2005)):

Assumption RS. For every $t$ and $d_{-t}$, $\{U(d_t, d_{-t})\}_{d_t}$ are identically distributed, conditional on $(U^{t-1}(d^{t-1}), V^t)$.

Rank invariance (i.e., $\{U(d)\}_d$ being equal to each other) is particularly restrictive in the multi-period context, since it requires that the same rank is realized across different treatment states for all $t$. Assumption RS, which we call sequential rank similarity, states that $(U_t(1), U_{t+1}(d_{t+1}), \ldots, U_T(d_T))$ and $(U_t(0), U_{t+1}(d_{t+1}), \ldots, U_T(d_T))$ are identically distributed conditional on $(U^{t-1}(d^{t-1}), V^t)$. That is, holding the history of ranks (and treatment unobservables) fixed, it only requires that the joint distributions of the ranks at $t$ and beyond are identical between states that differ by $d_t = 1$ and $0$. Therefore it allows an individual to have different realized ranks across different $d$’s.

Now we are ready to derive a period-specific result. Define the following period-specific quantity directly identified from the data, i.e., from the distribution of $(Y, D, X, Z)$:

$$\begin{aligned}
h_t(z_t, \tilde{z}_t, x_t, \tilde{x}_t; z^{t-1}, x^{t-1}, d^{t-1}, y_{t-1}) \equiv{} & \Pr[Y_t = 1, D_t = 1|z^t, x^t, d^{t-1}, y_{t-1}] \\
& + \Pr[Y_t = 1, D_t = 0|z^t, \tilde{x}_t, x^{t-1}, d^{t-1}, y_{t-1}] \\
& - \Pr[Y_t = 1, D_t = 1|\tilde{z}_t, z^{t-1}, x^t, d^{t-1}, y_{t-1}] \\
& - \Pr[Y_t = 1, D_t = 0|\tilde{z}_t, z^{t-1}, \tilde{x}_t, x^{t-1}, d^{t-1}, y_{t-1}].
\end{aligned}$$

Lemma 4.1. Suppose Assumptions C, SX and RS hold.
For each $t$ and $(x^{t-1}, z^{t-1}, d^{t-1}, y_{t-1})$, suppose $z_t$ and $\tilde{z}_t$ satisfy

$$\Pr[D_t = 1|x^{t-1}, z^t, d^{t-1}, y_{t-1}] - \Pr[D_t = 1|x^{t-1}, \tilde{z}_t, z^{t-1}, d^{t-1}, y_{t-1}] > 0. \tag{4.1}$$

Then for given $(x_t, \tilde{x}_t)$, the sign of $h_t(z_t, \tilde{z}_t, x_t, \tilde{x}_t; z^{t-1}, x^{t-1}, d^{t-1}, y_{t-1})$ equals the sign of $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t)$.

The sign of $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t)$ itself is already useful for calculating bounds on the ARSF’s and thus on the ATE’s, without relying on further assumptions; we discuss partial identification in Section 5.

Proof of Lemma 4.1: For the analysis of this paper, which deals with a dynamic model, it is convenient to define the $U$-set and $V$-set, namely the sets of the histories of the unobservable variables that determine the current outcome and current treatment. To focus our attention on this dependence of the potential outcomes and potential treatments on the unobservables, let

$$Y_t(d^t, x^t) \equiv 1\{\mu_t(Y_{t-1}(d^{t-1}, x^{t-1}), d_t, x_t) \ge U_t(d_t)\}$$

be the potential outcome given $(d, x)$ with $Y_1(d_1, x_1) = 1\{\mu_1(y_0, d_1, x_1) \ge U_1(d_1)\}$. Also, let

$$D_t(y^{t-1}, z^t) \equiv 1\{\pi_t(y_{t-1}, D_{t-1}(y^{t-2}, z^{t-1}), z_t) \ge V_t\}$$

be the potential treatment given $(y, z)$ with $D_1(y_0, z_1) = 1\{\pi_1(y_0, d_0, z_1) \ge V_1\}$. Now, for $t = 1, \ldots, T$, define a set of $U^t(d^t)$ as

$$\mathcal{U}^t(d^t, y_t) \equiv \mathcal{U}^t(d^t, y_t; x^t) \equiv \{U^t(d^t) : y_t = Y_t(d^t, x^t)\}$$

and a set of $V^t$ as

$$\mathcal{V}^t(d^t) \equiv \mathcal{V}^t(d^t; z^t) \equiv \{V^t : d_s = D_s(y^{s-1}, z^s) \text{ for all } s \le t \text{ and some } y^{t-1}\}.$$

Using these definitions, we have

$$\begin{aligned}
\Pr[D_t = 1|x^{t-1}, z^t, d^{t-1}, y_{t-1}] &= \Pr[V_t \le \pi_t(y_{t-1}, d_{t-1}, z_t)|x^{t-1}, z^t, \mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
&= \Pr[V_t \le \pi_t(y_{t-1}, d_{t-1}, z_t)|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})],
\end{aligned}$$

where the last equality is by Assumption SX. Note that the sets $\mathcal{V}^{t-1}(d^{t-1})$ and $\mathcal{U}^{t-1}(d^{t-1}, y_{t-1})$ do not change with a change in $z_t$. Let $\pi_t \equiv \pi_t(y_{t-1}, d_{t-1}, z_t)$ and $\tilde{\pi}_t \equiv \pi_t(y_{t-1}, d_{t-1}, \tilde{z}_t)$ for abbreviation. Then, under Assumptions C and SX,

$$\begin{aligned}
0 &< \Pr[D_t = 1|x^{t-1}, z^t, d^{t-1}, y_{t-1}] - \Pr[D_t = 1|x^{t-1}, \tilde{z}_t, z^{t-1}, d^{t-1}, y_{t-1}] \\
&= \Pr[V_t \le \pi_t|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] - \Pr[V_t \le \tilde{\pi}_t|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})],
\end{aligned}$$

which implies $\pi_t > \tilde{\pi}_t$. Next, we have

$$\Pr[Y_t = 1, D_t = 1|z^t, x^t, d^{t-1}, y_{t-1}] = \Pr[U_t(1) \le \mu_t(y_{t-1}, 1, x_t), V_t \le \pi_t|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})]$$

by Assumption SX. Again, note that $\mathcal{V}^{t-1}(d^{t-1})$ and $\mathcal{U}^{t-1}(d^{t-1}, y_{t-1})$ do not change with a change in $(z_t, x_t)$, which is key. Therefore, we have

$$\begin{aligned}
h_t(z_t, \tilde{z}_t, x_t, \tilde{x}_t; z^{t-1}, x^{t-1}, d^{t-1}, y_{t-1}) ={} & \Pr[U_t(1) \le \mu_t(y_{t-1}, 1, x_t), \tilde{\pi}_t \le V_t \le \pi_t|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
& - \Pr[U_t(0) \le \mu_t(y_{t-1}, 0, \tilde{x}_t), \tilde{\pi}_t \le V_t \le \pi_t|\mathcal{V}^{t-1}(d^{t-1}), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})],
\end{aligned}$$

the sign of which identifies the sign of $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t)$ by Assumption RS. For example, when this quantity is zero, then $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t) = 0$.

For point identification of the ARSF’s, a final assumption we introduce concerns the variation of the exogenous variables $(Z, X)$. Define the following sets:

$$\begin{aligned}
\mathcal{S}_t(y_{t-1}, d_t) &\equiv \{(x_t, \tilde{x}_t) : \mu_t(y_{t-1}, d_t, x_t) = \mu_t(y_{t-1}, \tilde{d}_t, \tilde{x}_t) \text{ for } \tilde{d}_t \ne d_t\}, && (4.2) \\
\mathcal{T}_t(y_{t-1}) &\equiv \{(x_t, \tilde{x}_t) : \exists (z_t, \tilde{z}_t) \text{ with } (x_t, z_t), (\tilde{x}_t, z_t), (x_t, \tilde{z}_t), (\tilde{x}_t, \tilde{z}_t) \in \mathrm{Supp}(X_t, Z_t|Y_{t-1} = y_{t-1})\}, && (4.3) \\
\mathcal{X}_t(y_{t-1}, d_t) &\equiv \{x_t : \exists \tilde{x}_t \text{ with } (x_t, \tilde{x}_t) \in \mathcal{S}_t(y_{t-1}, d_t) \cap \mathcal{T}_t(y_{t-1})\}, && (4.4) \\
\mathcal{X}_t(d_t) &\equiv \mathcal{X}_t(0, d_t) \cap \mathcal{X}_t(1, d_t), && (4.5)
\end{aligned}$$

where (4.2) is related to sufficient variation of $X_t$ and (4.3) is related to rectangular support for $(X_t, Z_t)$.

Assumption SP. For each $t$ and $d_t$, $\Pr[X_t \in \mathcal{X}_t(d_t)] > 0$.

This assumption requires that $X_t$ varies enough to achieve $\mu_t(y_{t-1}, d_t, x_t) = \mu_t(y_{t-1}, \tilde{d}_t, \tilde{x}_t)$ while holding $Z_t$ to be $z_t$ and $\tilde{z}_t$, respectively.11 It is a dynamic version of the support assumption found in Vytlacil and Yildiz (2007).12 Note that even though this assumption seems to be written in terms of the unknown object $\mu_t(\cdot)$, it is testable because the sets above have an empirical analog as shown in Lemma 4.1. Now we are ready to state the main identification result of this paper.

Theorem 4.1. Under Assumptions C, SX, RS and SP, $E[Y_T(d)|x]$ is identified for $d \in \mathcal{D}$ and $x_t \in \mathcal{X}_t(d_t)$ for all $t$.

Based on Theorem 4.1, we can identify the ATE’s. Since the identification of all $E[Y_t(d)|x^t]$’s can be shown analogously to Theorem 4.1, we can identify the optimal treatment regimes $d^*(x)$ and $d^\dagger(x)$ as well.

Corollary 4.1. Under Assumptions C, SX, RS and SP, $E[Y_T(d) - Y_T(\tilde{d})|x]$ is identified for $d, \tilde{d} \in \mathcal{D}$ and $x_t \in \mathcal{X}_t(d_t) \cap \mathcal{X}_t(\tilde{d}_t)$ for all $t$, and $d^*(x)$ and $d^\dagger(x)$ are identified for $x_t \in \mathcal{X}_t(0) \cap \mathcal{X}_t(1)$ for all $t$.

Proof of Theorem 4.1: To begin with, note that

$$E[Y_T(d)|x] = \Pr[U(d) \in \mathcal{U}^T(d, 1; x)|x] = \Pr[U(d) \in \mathcal{U}^T(d, 1; x)|x, z] = E[Y_T(d)|x, z]$$

by Assumption SX. As the first step of identifying $E[Y_T(d)|x, z]$ for given $d = (d_1, \ldots, d_T)$, $x = (x_1, \ldots, x_T)$ and $z = (z_1, \ldots, z_T)$, we apply the result of Lemma 4.1. Fix $t = 2, \ldots, T$ and $y_{t-1} \in \{0, 1\}$. Suppose $x'_t$ is such that $\mu_t(y_{t-1}, d_t, x_t) = \mu_t(y_{t-1}, d'_t, x'_t)$ with $d'_t \ne d_t$ by applying Lemma 4.1. The existence of $x'_t$ is guaranteed by Assumption SP as $x_t \in \mathcal{X}_t(d_t)$.
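The matching step above rests on the sign result of Lemma 4.1. As a rough numerical check of that result, the following sketch simulates a single period with the lagged outcome held fixed and compares the sign of a sample analog of $h_t$ with the sign of $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t)$. The data-generating process is an assumption for illustration: binary $Z_t$ and $X_t$, hypothetical indices `pi` and `mu`, and rank invariance ($U_t(1) = U_t(0)$) as a simple special case of Assumption RS.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Assumed one-period DGP with y_{t-1} held fixed.
V = rng.normal(size=n)
U = 0.5 * V + np.sqrt(0.75) * rng.normal(size=n)   # U(1) = U(0), correlated with V
Z = rng.integers(0, 2, size=n)                     # instrument, independent of (U, V)
X = rng.integers(0, 2, size=n)                     # outcome shifter, independent of (U, V)

def pi(z):                                         # hypothetical selection index
    return -0.5 + 1.0 * z

def mu(d, x):                                      # hypothetical outcome index
    return -0.2 + 0.5 * d - 1.0 * x

D = (pi(Z) >= V).astype(int)
Y = (mu(D, X) >= U).astype(int)

def joint_prob(y, d, z, x):
    """Sample analog of Pr[Y = y, D = d | Z = z, X = x]."""
    cell = (Z == z) & (X == x)
    return np.mean((Y[cell] == y) & (D[cell] == d))

def h(z, z_tilde, x, x_tilde):
    """Sample analog of h_t for one period."""
    return (joint_prob(1, 1, z, x) + joint_prob(1, 0, z, x_tilde)
            - joint_prob(1, 1, z_tilde, x) - joint_prob(1, 0, z_tilde, x_tilde))

z, z_tilde = 1, 0
# Condition (4.1): the propensity to be treated is higher at z than at z_tilde.
propensity_gap = D[Z == z].mean() - D[Z == z_tilde].mean()

# Compare the sign of h with the sign of mu(1, x) - mu(0, x_tilde).
signs_agree = {
    (x, xt): np.sign(h(z, z_tilde, x, xt)) == np.sign(mu(1, x) - mu(0, xt))
    for x in (0, 1) for xt in (0, 1)
}
print(propensity_gap, signs_agree)
```

With a continuous $X_t$, one would instead look for a value $x'_t$ at which the sample analog of $h_t$ is (approximately) zero, which corresponds to the equality of indices used in the proof. The binary design here only illustrates the sign comparison.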
The implication of $\mu_t(y_{t-1}, d_t, x_t) = \mu_t(y_{t-1}, d'_t, x'_t)$ for the relevant $U$-sets is as follows: by the definition of the $U$-set, $U \in \mathcal{U}_{d}(x; y_T)$ is equivalent to $U \in \mathcal{U}_{d'_t, d_{-t}}(x'_t, x_{-t}; y_T)$ conditional on $Y_{t-1}(d^{t-1}, x^{t-1}) = y_{t-1}$ for all $x_{-t}$ and $d_{-t}$, as long as $U(d)$ and $U(d'_t, d_{-t})$ have the same support. The latter is guaranteed by Assumption RS.13

11 Although Assumption SP requires sufficient rectangular variation in $(X_t, Z_t)$, it clearly differs from the large variation assumptions in Heckman and Navarro (2007) and Heckman et al. (2016), which are employed for identification-at-infinity arguments.

12 It is possible that $\mathcal{X}_t(d_t)$ is nonempty even when $Z_t$ is discrete, as long as $X_t$ contains continuous elements with sufficient support (Vytlacil and Yildiz (2007)).

13 The following analysis is significantly simplified when $\mu_t(y_{t-1}, d_t, x_t) = \mu_t(y_{t-1}, d'_t, x'_t)$ holds for all $y_{t-1}$. This situation, however, is unlikely to occur.

Based on this result, we can equate the unobserved quantity $E[Y_T(d)|x, z, d^{t-1}, d'_t, y_{t-1}]$ with a quantity that matches the assigned treatment and the observed treatment:

$$\begin{aligned}
E[Y_T(d)|x, z, d^{t-1}, d'_t, y_{t-1}] &= \Pr[U(d) \in \mathcal{U}^T(d, 1; x)|x, z, \mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x)|\mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
&= \Pr[U(d'_t, d_{-t}) \in \mathcal{U}^T(d'_t, d_{-t}, 1; x'_t, x_{-t})|\mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
&= \Pr[U(d'_t, d_{-t}) \in \mathcal{U}^T(d'_t, d_{-t}, 1; x'_t, x_{-t})|x'_t, x_{-t}, z, \mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, y_{t-1})] \\
&= E[Y_T(d'_t, d_{-t})|x'_t, x_{-t}, z, d^{t-1}, d'_t, y_{t-1}],
\end{aligned} \tag{4.6}$$

where the second and fourth equalities are by Assumption SX, and the third equality is by Assumption RS and the discussion above. The last quantity in (4.6) is still unobserved since $d_s$ for $s \geq t + 1$ are not realized treatments, but the quantity will be useful in the iterative argument below. Note that the key to the derivation of (4.6) is to consider the average potential outcome for a group defined by the treatments at time $t$ or earlier and the lagged outcome, for which $x_t$ is excluded as mentioned before.

We use the result (4.6) in the next step. First, note that $E[Y_T(d)|x, z, d^T] = E[Y_T|x, z, d^T]$ is trivially identified for any generic values $(d, x, z)$. We proceed by mathematical induction. For given $t = 2, ..., T-1$, suppose $E[Y_T(d)|x, z, d^t]$ is identified for any generic values $(d, x, z)$, and consider identification of
$$
E[Y_T(d)|x, z, d^{t-1}] = \Pr[D_t = d_t|x, z, d^{t-1}]\, E[Y_T(d)|x, z, d^{t-1}, d_t] + \Pr[D_t = d'_t|x, z, d^{t-1}]\, E[Y_T(d)|x, z, d^{t-1}, d'_t].
$$
The only unobserved term on the right-hand side can be shown to satisfy
$$
\begin{aligned}
E[Y_T(d)|x, z, d^{t-1}, d'_t] ={}& \Pr[Y_{t-1} = 1|x, z, d^{t-1}, d'_t]\, E[Y_T(d)|x, z, d^{t-1}, d'_t, Y_{t-1} = 1] \\
&+ \Pr[Y_{t-1} = 0|x, z, d^{t-1}, d'_t]\, E[Y_T(d)|x, z, d^{t-1}, d'_t, Y_{t-1} = 0].
\end{aligned}
\tag{4.7}
$$
But note that
$$
E[Y_T(d)|x, z, d^{t-1}, d'_t, y_{t-1}] = E[Y_T(d'_t, d_{-t})|x'_t, x_{-t}, z, d^{t-1}, d'_t, y_{t-1}]
\tag{4.8}
$$

by (4.6), and the right-hand side is assumed to be identified from the previous step. Therefore $E[Y_T(d)|x, z, d^{t-1}]$ is identified. Lastly, when $t = 1$,
$$
E[Y_T(d)|x, z] = \Pr[D_1 = d_1|x, z]\, E[Y_T(d)|x, z, d_1] + \Pr[D_1 = d'_1|x, z]\, E[Y_T(d)|x, z, d'_1].
$$
Noting that $y_0 = 0$, suppose $x'_1$ is such that $\mu_1(0, d_1, x_1) = \mu_1(0, d'_1, x'_1)$ with $d'_1 \neq d_1$ by


applying Lemma 4.1. Then
$$
\begin{aligned}
E[Y_T(d)|x, z, d'_1]
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, x, z, \mathcal{V}^1(d'_1)] \\
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, \mathcal{V}^1(d'_1)] \\
&= \Pr[U(d'_1, d_{-1}) \in \mathcal{U}^T(d'_1, d_{-1}, 1; x'_1, x_{-1}) \,|\, \mathcal{V}^1(d'_1)] \\
&= \Pr[U(d'_1, d_{-1}) \in \mathcal{U}^T(d'_1, d_{-1}, 1; x'_1, x_{-1}) \,|\, x'_1, x_{-1}, z, \mathcal{V}^1(d'_1)] \\
&= E[Y_T(d'_1, d_{-1}) \,|\, x'_1, x_{-1}, z, d'_1],
\end{aligned}
$$
which is identified from the previous step for $t = 2$. Therefore $E[Y_T(d)|x, z]$ is identified, which completes the proof of Theorem 4.1. □

Note that this proof provides a closed-form expression for $E[Y_T(d)|x]$ in an iterative manner, which can be used directly for estimation. For concreteness, we provide an expression for $E[Y_T(d)|x]$ when $T = 2$:
$$
\begin{aligned}
E[Y_2(d)|x] ={}& P[d|x, z]\, E[Y_2|x, z, d] + P[d_1, d'_2|x, z]\, \mu_{2, d_1, d'_2} \\
&+ P[d'_1, d_2|x, z]\, E[Y_2|x'_1, x_2, z, d'_1, d_2] + P[d'_1, d'_2|x, z]\, \mu_{2, d'_1, d'_2},
\end{aligned}
\tag{4.9}
$$

where
$$
\begin{aligned}
\mu_{2, d_1, d'_2} &\equiv P[y_1|x, z, d_1, d'_2]\, E[Y_2|x_1, x'_2, z, d_1, d'_2, y_1] + P[y'_1|x, z, d_1, d'_2]\, E[Y_2|x_1, x''_2, z, d_1, d'_2, y'_1], \\
\mu_{2, d'_1, d'_2} &\equiv P[y_1|x'_1, x_2, z, d'_1, d'_2]\, E[Y_2|x'_1, x'_2, z, d'_1, d'_2, y_1] + P[y'_1|x'_1, x_2, z, d'_1, d'_2]\, E[Y_2|x'_1, x''_2, z, d'_1, d'_2, y'_1]
\end{aligned}
$$
for $(x'_1, x'_2, x''_2)$ such that $\mu_1(0, d_1, x_1) = \mu_1(0, d'_1, x'_1)$, $\mu_2(y_1, d_2, x_2) = \mu_2(y_1, d'_2, x'_2)$, and $\mu_2(y'_1, d_2, x_2) = \mu_2(y'_1, d'_2, x''_2)$.

Earlier, we showed that Lemma 4.1 recovers period-specific information about the structural function for $Y_t$. Before closing this section, we go further in this direction and show identification of the transition-specific ATE (conditional on $X_t$), $E[Y_t(1) - Y_t(0)|Y_{t-1} = y_{t-1}, X_t = x_t]$, where $Y_t(d_t) = \mu_t(Y_{t-1}, d_t, X_t, U_t(d_t))$ is the period-specific potential outcome at time $t$. In fact, when rank invariance is assumed, Lemma 4.1 immediately identifies the sign of the transition-specific ATE. Under Assumption RS instead, together with Assumption SP, we show identification of this treatment parameter.

Theorem 4.2. Under Assumptions C, SX, RS and SP, $E[Y_t(1)|y_{t-1}] - E[Y_t(0)|y_{t-1}]$ is identified for each $t$ and $y_{t-1}$.

With $Y_t$ being binary, the transition-specific ATE becomes the treatment effect on the transition probability: $\Pr[Y_t(1) = 1|Y_{t-1} = 0] - \Pr[Y_t(0) = 1|Y_{t-1} = 0]$. This quantity can be especially relevant for an irreversible process, as discussed in Remarks 2.1 and 3.1. Treatment effects on transitions have been studied in, e.g., Abbring and Van den Berg (2003), Heckman and Navarro (2007) and Vikström et al. (2017).[14]

[14] The definition of the treatment effect on the transition probability in this paper is more restrictive than that in Vikström et al. (2017), since they allow treatments to be assigned earlier than the transition of interest.
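To make the closed form (4.9) concrete, the following is a minimal plug-in sketch in Python that transcribes the expression as stated, assuming that estimates of each conditional probability and conditional mean are already available. The function names, argument names and numerical values are hypothetical and serve only as an illustration; in particular, the sketch takes the matched points $(x'_1, x'_2, x''_2)$ as already found and uses $P[y'_1|\cdot] = 1 - P[y_1|\cdot]$, which holds because $Y_1$ is binary.

```python
def mix_over_lagged_outcome(p_y1, e_given_y1, e_given_y1_alt):
    """Hypothetical helper for the terms mu_{2,.,.} below (4.9).

    p_y1           : P[y_1 | ...] for the binary lagged outcome
    e_given_y1     : matched conditional mean of Y_2 given y_1 (at x'_2)
    e_given_y1_alt : matched conditional mean of Y_2 given y'_1 (at x''_2)
    """
    return p_y1 * e_given_y1 + (1.0 - p_y1) * e_given_y1_alt


def arsf_two_periods(weights, e_obs, mu_d1_d2alt, e_matched_x1, mu_d1alt_d2alt):
    """Plug-in transcription of (4.9) for E[Y_2(d) | x].

    weights : probabilities of the treatment cells (d1, d2), (d1, d2'),
              (d1', d2), (d1', d2') given (x, z), in that order
    """
    w_dd, w_da, w_ad, w_aa = weights
    if abs(w_dd + w_da + w_ad + w_aa - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to one")
    return (w_dd * e_obs
            + w_da * mu_d1_d2alt
            + w_ad * e_matched_x1
            + w_aa * mu_d1alt_d2alt)


# Illustrative (made-up) inputs, not estimates from any data set.
mu_a = mix_over_lagged_outcome(0.5, 0.6, 0.8)    # mu_{2, d1, d2'}
mu_b = mix_over_lagged_outcome(0.25, 0.4, 0.8)   # mu_{2, d1', d2'}
arsf = arsf_two_periods((0.4, 0.1, 0.3, 0.2), 0.5, mu_a, 0.9, mu_b)
```

In practice each input would be replaced by a nonparametric estimate; the sketch only shows how the four treatment cells are combined.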


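Proof of Theorem 4.2: Consider
$$
\begin{aligned}
E[Y_t(d_t)|y_{t-1}, x^t, z^t] ={}& \sum_{\tilde{d}^{t-1} \in \mathcal{D}^{t-1}} \Pr[D^t = (d_t, \tilde{d}^{t-1})|y_{t-1}, x^t, z^t]\, E[Y_t(d_t)|y_{t-1}, x^t, z^t, d_t, \tilde{d}^{t-1}] \\
&+ \sum_{\tilde{d}_t \neq d_t} \sum_{\tilde{d}^{t-1} \in \mathcal{D}^{t-1}} \Pr[D^t = \tilde{d}^t|y_{t-1}, x^t, z^t]\, E[Y_t(d_t)|y_{t-1}, x^t, z^t, \tilde{d}^t],
\end{aligned}
$$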

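where the terms $E[Y_t(d_t)|y_{t-1}, x^t, z^t, \tilde{d}^t]$ (for $\tilde{d}_t \neq d_t$) are the only unobserved terms. By Assumptions SX, RS and SP, we identify those terms for $x_t \in \mathcal{X}_t(y_{t-1}, d_t)$ according to
$$
\begin{aligned}
E[Y_t(d_t)|y_{t-1}, x^t, z^t, \tilde{d}^t]
&= \Pr[U_t(d_t) \leq \mu_t(y_{t-1}, d_t, x_t) \,|\, \mathcal{V}^t(\tilde{d}^t), \mathcal{U}^{t-1}(\tilde{d}^{t-1}, y_{t-1})] \\
&= \Pr[U_t(\tilde{d}_t) \leq \mu_t(y_{t-1}, \tilde{d}_t, \tilde{x}_t) \,|\, \tilde{x}_t, x^{t-1}, z^t, \mathcal{V}^t(\tilde{d}^t), \mathcal{U}^{t-1}(\tilde{d}^{t-1}, y_{t-1})] \\
&= E[Y_t|y_{t-1}, \tilde{x}_t, x^{t-1}, z^t, \tilde{d}^t].
\end{aligned}
\tag{4.10}
$$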

Finally, we have
$$
E[Y_t(d_t)|y_{t-1}] = \int \int_{X_t \in \mathcal{X}_t(Y_{t-1}, d_t) | X^{t-1}, Z^t} E[Y_t(d_t)|y_{t-1}, x^t, z^t] \, dF_{X_t | X^{t-1}, Z^t, Y_{t-1}} \, dF_{X^{t-1}, Z^t | Y_{t-1}},
$$
which completes the proof.

Remark 4.1. In estimating the parameters identified in this section, one can improve efficiency by aggregating the conditional expectations (4.8) (or (4.10)) with respect to $x'_t$ (or $\tilde{x}_t$) over the following set:
$$
\lambda_t(x_t; z^{t-1}, x^{t-1}, d^{t-1}, y_{t-1}) \equiv \{\tilde{x}_t : h_t(z_t, \tilde{z}_t, x_t, \tilde{x}_t; z^{t-1}, x^{t-1}, d^{t-1}, y_{t-1}) = 0 \text{ for some } (z_t, \tilde{z}_t)\}.
$$
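As an illustration of the decomposition in the proof of Theorem 4.2 combined with the aggregation idea of Remark 4.1, the following Python sketch computes $E[Y_t(d_t)|y_{t-1}, x^t, z^t]$ from user-supplied cell probabilities and conditional means. All names and numbers are hypothetical. The equal-weight average over matched points is an assumption made here for simplicity: at the population level every matched $\tilde{x}_t$ delivers the same quantity, so any weights summing to one would serve, and the efficient choice is not specified in this section.

```python
from statistics import fmean


def transition_mean(d_t, cells):
    """Sketch of E[Y_t(d_t) | y_{t-1}, x^t, z^t] as a weighted sum over
    treatment histories (hypothetical interface).

    cells maps a history tuple (d_1, ..., d_t) to (prob, means), where
      prob  : Pr[D^t = history | y_{t-1}, x^t, z^t]
      means : for histories ending in d_t, a one-element list holding the
              observed conditional mean of Y_t; for other histories, the
              observed means at matched points x~_t as in (4.10), which are
              averaged with equal weights (an assumption)
    """
    total, prob_sum = 0.0, 0.0
    for history, (prob, means) in cells.items():
        if history[-1] == d_t and len(means) != 1:
            raise ValueError("observed cells carry exactly one mean")
        total += prob * fmean(means)
        prob_sum += prob
    if abs(prob_sum - 1.0) > 1e-9:
        raise ValueError("history probabilities must sum to one")
    return total


# Made-up inputs for t = 2 with binary treatments.
cells_d1 = {(0, 1): (0.30, [0.6]), (1, 1): (0.20, [0.8]),
            (0, 0): (0.25, [0.5, 0.7]), (1, 0): (0.25, [0.4])}
cells_d0 = {(0, 0): (0.25, [0.3]), (1, 0): (0.25, [0.5]),
            (0, 1): (0.30, [0.2]), (1, 1): (0.20, [0.4, 0.6])}

mean_treated = transition_mean(1, cells_d1)
mean_untreated = transition_mean(0, cells_d0)
transition_ate = mean_treated - mean_untreated
```

Integrating these conditional quantities over the distribution of $(X^t, Z^t)$ given $Y_{t-1}$, as in the last display of the proof, would then give the transition-specific ATE of Theorem 4.2.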

5 Partial Identification

Suppose Assumption SP does not hold, in that $X_t$ does not exhibit rectangular variation or there is no $X_t$ that is excluded from the selection equation at time $t$. In this case, we partially identify the ARSF’s and ATE’s. The partial identification of $d^*(x)$ (or $d^\dagger(x)$) may not yield informative bounds unless there is a sufficient number of ATE’s whose bounds are informative about their signs. Even when the bounds are not informative about the optimal regime, they may still be useful from a planner’s perspective if they help her rule out one or two harmful regimes.

We briefly illustrate the calculation of bounds on the ARSF when sufficient rectangular variation is not guaranteed.[15] For each unknown term $E[Y_T(d)|x, z, d^{t-1}, d'_t]$ in the proof of Theorem 4.1, we can calculate its upper and lower bounds depending on the sign of $\mu_t(y_{t-1}, 1, x_t) - \mu_t(y_{t-1}, 0, \tilde{x}_t)$, which is identified in Lemma 4.1. In the context of this section, $\tilde{x}_t$ does not necessarily differ from $x_t$.

[15] The case where $X_t$ does not exist at all can be dealt with in a similar way; we omit the details.

To begin with a simple example, for given $t = 2, ..., T$, suppose $\mu_t(y_{t-1}, d_t, x_t) - \mu_t(y_{t-1}, d'_t, x'_t) \geq 0$ for all $y_{t-1}$, where $x'_t$ is allowed to equal $x_t$. Then, by the definition of the $\mathcal{U}$-set and under Assumption RS, we have $\mathcal{U}^T(d, y_T; x) \supseteq \mathcal{U}^T(d'_t, d_{-t}, y_T; x'_t, x_{-t})$ regardless of the value of $Y_{t-1}(d^{t-1}, x^{t-1})$. Therefore, we obtain a lower bound:
$$
\begin{aligned}
E[Y_T(d)|x, z, d^{t-1}, d'_t]
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, x, z, \mathcal{V}^t(d^{t-1}, d'_t)] \\
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, \mathcal{V}^t(d^{t-1}, d'_t)] \\
&\geq \Pr[U(d'_t, d_{-t}) \in \mathcal{U}^T(d'_t, d_{-t}, 1; x'_t, x_{-t}) \,|\, \mathcal{V}^t(d^{t-1}, d'_t)] \\
&= \Pr[U(d'_t, d_{-t}) \in \mathcal{U}^T(d'_t, d_{-t}, 1; x'_t, x_{-t}) \,|\, x'_t, x_{-t}, z, \mathcal{V}^t(d^{t-1}, d'_t)] \\
&= E[Y_T(d'_t, d_{-t}) \,|\, x'_t, x_{-t}, z, d^{t-1}, d'_t].
\end{aligned}
$$
As a more realistic example, when the sign of $\mu_t(y_{t-1}, d_t, x_t) - \mu_t(y_{t-1}, d'_t, x'_t)$ is identified for each $y_{t-1}$, the bounds can be calculated using (4.7). For instance, if $\mu_t(1, d_t, x_t) \geq \mu_t(1, d'_t, x'_t)$, where $x'_t$ is allowed to equal $x_t$, then
$$
E[Y_T(d)|x, z, d^{t-1}, d'_t, Y_{t-1} = 1] \geq E[Y_T(d'_t, d_{-t})|x'_t, x_{-t}, z, d^{t-1}, d'_t, Y_{t-1} = 1],
$$
since
$$
\begin{aligned}
E[Y_T(d)|x, z, d^{t-1}, d'_t, Y_{t-1} = 1]
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, x, z, \mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, 1)] \\
&= \Pr[U(d) \in \mathcal{U}^T(d, 1; x) \,|\, \mathcal{V}^t(d^{t-1}, d'_t), \mathcal{U}^{t-1}(d^{t-1}, 1)]
\end{aligned}
$$
and $\mathcal{U}^T(d, 1; x) \supseteq \mathcal{U}^T(d'_t, d_{-t}, 1; x'_t, x_{-t})$ given $Y_{t-1}(d^{t-1}, x^{t-1}) = 1$. Once the bounds on $E[Y_T(d)|x, z, d^{t-1}, d'_t]$ are established, the bounds on $E[Y_T(d)|x] = E[Y_T(d)|x, z]$ can be calculated using the iterative scheme introduced in the proof of Theorem 4.1. Lastly, depending on the signs of the ATE’s, we can construct bounds on $d^*(x)$, which will be expressed as strict subsets of $\mathcal{D}$.
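The bound calculation via (4.7) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an estimator: the function name and inputs are hypothetical, the comparison quantity for each lagged outcome is taken as given, the upper-bound case is treated as the mirror image of the lower-bound argument above, and the trivial bounds $[0, 1]$ rely on $Y_T$ being binary.

```python
def bounds_unobserved_term(p_y1, comparisons):
    """Bounds on E[Y_T(d) | x, z, d^{t-1}, d'_t] built from (4.7).

    p_y1        : Pr[Y_{t-1} = 1 | x, z, d^{t-1}, d'_t]
    comparisons : dict mapping y in {0, 1} to (sign, m), where m is the
                  comparison conditional mean given Y_{t-1} = y and sign is
                    +1   if mu_t(y, d_t, x_t) >= mu_t(y, d'_t, x'_t)  (m is a lower bound)
                    -1   if mu_t(y, d_t, x_t) <= mu_t(y, d'_t, x'_t)  (m is an upper bound)
                     0   if the two are equal                          (m is exact)
                    None if the sign is unknown                        (bounds are [0, 1])
    """
    lower, upper = 0.0, 0.0
    for y, prob in ((1, p_y1), (0, 1.0 - p_y1)):
        sign, m = comparisons[y]
        lo, hi = 0.0, 1.0           # trivial bounds for a binary outcome
        if sign in (1, 0):
            lo = m
        if sign in (-1, 0):
            hi = m
        lower += prob * lo
        upper += prob * hi
    return lower, upper


# Made-up inputs: the sign is known in opposite directions for y = 1 and y = 0.
mixed = bounds_unobserved_term(0.4, {1: (1, 0.7), 0: (-1, 0.5)})
# Equality for both values of y recovers a single point.
point = bounds_unobserved_term(0.4, {1: (0, 0.7), 0: (0, 0.5)})
# No sign information gives the trivial interval.
trivial = bounds_unobserved_term(0.4, {1: (None, 0.0), 0: (None, 0.0)})
```

The resulting interval would then be passed through the iterative scheme of Theorem 4.1 in place of the point-identified term.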

6 Subsequences of Treatments

An important extension of the model introduced in this paper is one in which treatments do not appear in every time period while outcomes are observed in every period. For example, institutionally, there may be only a one-shot treatment at the beginning of time or a few treatments early in the horizon, or there may be evenly spaced treatment decisions with a lower frequency than outcomes. A potential outcome that corresponds to this situation can be defined as a function of a certain subsequence $d_-$ of $d$. Let $d_- = (d_{t_1}, ..., d_{t_K}) \in \mathcal{D}_- \subseteq \{0, 1\}^K$ be a $1 \times K$ subvector of $d$, where $t_1 < t_2 < \cdots < t_K \leq T$ and $K < T$. Then the potential outcomes $Y_t(d_-)$ and associated structural functions are defined as follows. A potential outcome in a period when a treatment exists is expressed using a switching regression model as
$$
Y_{t_k}(d_-) = Y_{t_k}(d^{t_k}) = \mu_{t_k}(Y_{t_k - 1}(d^{t_{k-1}}), d_{t_k}, X_{t_k}, U_{t_k}(d_{t_k}))
$$
for $k = 1, ..., K$ with $Y_{t_1 - 1}(d^{t_0}) = Y_{t_1 - 1}$, and a potential outcome when there is no treatment is expressed as
$$
Y_t(d_-) = Y_t(d^{t_k}) = \mu_t(Y_{t-1}(d^{t_k}), X_t, U_t)
$$
for $t$ such that $t_k < t < t_{k+1}$ ($k = 1, ..., K - 1$), with
$$
Y_t(d_-) = Y_t = \mu_t(Y_{t-1}, X_t, U_t)
$$
for $t < t_1$ and
$$
Y_t(d_-) = Y_t(d^{t_K}) = \mu_t(Y_{t-1}(d^{t_K}), X_t, U_t)
$$
for $t > t_K$. Each structural model associated with the latter type of potential outcomes is a plain dynamic model with a lagged dependent variable. All the parameters introduced in Section 3 can be readily modified by replacing $d$ with $d_-$ for some $d_-$; we omit the definitions for the sake of brevity. Moreover, the identification analysis of Section 4 can be easily modified in accordance with the extended setting. Assumption RS can be modified as follows. Let $U_-(d_-) \equiv (U_{t_1}(d_{t_1}), ..., U_{t_K}(d_{t_K}))$.

Assumption RS′. For every $t_k$ and $d_{t_k}$, $\{U_-(d_{t_k}, (d_-)_{-t_k})\}_{d_{t_k}}$ are identically distributed conditional on $(U_-^{t_{(k-1)}}(d_-^{t_{(k-1)}}), V^{t_k})$.

Restricting the definition of $\mathcal{X}_t(d_t)$ in (4.2)–(4.5) to hold only for $t = t_k$, we can also modify Assumption SP:

Assumption SP′. For each $t_k$ for $k = 1, ..., K$, $\Pr[X_{t_k} \in \mathcal{X}_{t_k}(d_{t_k})] > 0$.

Theorem 6.1. Under Assumptions C, SX, RS′ and SP′, $E[Y_T(d_-)|x]$ is identified for $d_- \in \mathcal{D}_-$, $x_{t_k} \in \mathcal{X}_{t_k}(d_{t_k})$ for all $t_k$, and $x_t \in \mathrm{supp}(X_t)$ for $t \neq t_k$, for $k = 1, ..., K$.

Corollary 6.1. Under Assumptions C, SX, RS′ and SP′, $E[Y_T(d_-) - Y_T(\tilde{d}_-)|x]$ is identified for $d_-, \tilde{d}_- \in \mathcal{D}_-$ and $x_{t_k} \in \mathcal{X}_{t_k}(d_{t_k}) \cap \mathcal{X}_{t_k}(\tilde{d}_{t_k})$ for all $t_k$, and $d^*_-(x)$ and $d^\dagger_-(x)$ are identified for $x_{t_k} \in \mathcal{X}_{t_k}(0) \cap \mathcal{X}_{t_k}(1)$ for all $t_k$, all of the above for $x_t \in \mathrm{supp}(X_t)$ for $t \neq t_k$, for $k = 1, ..., K$.
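To illustrate how the potential outcome path $Y_t(d_-)$ is built when treatments occur only in periods $t_1 < \cdots < t_K$, the following Python sketch applies the switching-regression recursion in treated periods and the plain lagged-outcome recursion elsewhere. The threshold-crossing forms of `mu_treated` and `mu_plain`, their coefficients, and all inputs are hypothetical choices made purely for illustration; the section itself leaves $\mu_t$ nonparametric.

```python
def mu_treated(y_prev, d, x, u):
    # Toy structural function for a period with a treatment (assumed form).
    return int(0.5 * y_prev + 1.0 * d + x >= u)


def mu_plain(y_prev, x, u):
    # Toy structural function for a period without a treatment (assumed form).
    return int(0.5 * y_prev + x >= u)


def potential_path(d_minus, treat_periods, x, u, y0=0):
    """Potential outcomes (Y_1(d_-), ..., Y_T(d_-)) for a treatment subsequence.

    d_minus       : treatments (d_{t_1}, ..., d_{t_K})
    treat_periods : the periods (t_1, ..., t_K), 1-indexed and increasing
    x             : covariates, x[t-1] for period t
    u             : unobservables; u[t-1] is a pair (U_t(0), U_t(1)) in a
                    treated period and a scalar U_t otherwise
    """
    d_at = dict(zip(treat_periods, d_minus))
    y_prev, path = y0, []
    for t in range(1, len(x) + 1):
        if t in d_at:
            d = d_at[t]
            y = mu_treated(y_prev, d, x[t - 1], u[t - 1][d])
        else:
            y = mu_plain(y_prev, x[t - 1], u[t - 1])
        path.append(y)
        y_prev = y
    return path


# T = 4 with treatments in periods 1 and 3 only (made-up inputs).
x = [0.0, 0.0, 0.0, 0.0]
u = [(0.8, 0.8), 0.3, (0.9, 0.9), 0.4]
path_early = potential_path((1, 0), (1, 3), x, u)
path_late = potential_path((0, 1), (1, 3), x, u)
```

Comparing `path_early` with `path_late` shows how the timing of a treatment propagates through the untreated periods via the lagged outcome.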

References

Abbring, J. H., Heckman, J. J., 2007. Econometric evaluation of social programs, Part III: Distributional treatment effects, dynamic treatment effects, dynamic discrete choice, and general equilibrium policy evaluation. Handbook of Econometrics 6, 5145–5303.
Abbring, J. H., Van den Berg, G. J., 2003. The nonparametric identification of treatment effects in duration models. Econometrica 71 (5), 1491–1517.
Angrist, J. D., Imbens, G. W., 1995. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association 90 (430), 431–442.
Blevins, J. R., 2014. Nonparametric identification of dynamic decision processes with discrete and continuous choices. Quantitative Economics 5, 531–554.
Buchholz, N., Shum, M., Xu, H., 2016. Semiparametric estimation of dynamic discrete choice models.
Chernozhukov, V., Hansen, C., 2005. An IV model of quantile treatment effects. Econometrica 73 (1), 245–261.
Cunha, F., Heckman, J. J., Navarro, S., 2007. The identification and economic content of ordered choice models with stochastic thresholds. International Economic Review 48 (4), 1273–1309.
Han, S., 2018. Multiple treatments with strategic interaction.
Heckman, J. J., Humphries, J. E., Veramendi, G., 2016. Dynamic treatment effects. Journal of Econometrics 191 (2), 276–292.
Heckman, J. J., Navarro, S., 2007. Dynamic discrete choice and dynamic treatment effects. Journal of Econometrics 136 (2), 341–396.
Lee, S., Salanié, B., 2017. Identifying effects of multivalued treatments. Columbia University.
Murphy, S. A., 2003. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (2), 331–355.
Murphy, S. A., van der Laan, M. J., Robins, J. M., Conduct Problems Prevention Research Group, 2001. Marginal mean models for dynamic regimes. Journal of the American Statistical Association 96 (456), 1410–1423.
Robins, J., 1986. A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect. Mathematical Modelling 7 (9-12), 1393–1512.
Robins, J., 1987. A graphical approach to the identification and estimation of causal parameters in mortality studies with sustained exposure periods. Journal of Chronic Diseases 40, 139S–161S.
Rust, J., 1987. Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica: Journal of the Econometric Society, 999–1033.
Shaikh, A. M., Vytlacil, E. J., 2011. Partial identification in triangular systems of equations with binary dependent variables. Econometrica 79 (3), 949–955.
Torgovitsky, A., 2016. Nonparametric inference on state dependence with applications to employment dynamics. Tech. rep., University of Chicago.
Vikström, J., Ridder, G., Weidner, M., 2017. Bounds on treatment effects on transitions.
Vytlacil, E., Yildiz, N., 2007. Dummy endogenous variables in weakly separable models. Econometrica 75 (3), 757–779.
