Strategic Experimentation with Congestion∗
Caroline D. Thomas†
December 2, 2014

Abstract

We consider a game of strategic experimentation in the presence of competition between two agents. Each agent faces a two-armed bandit problem in which she continually chooses between her private, risky arm and a common, safe arm. An agent has exclusive access to her private arm. However, only one of the agents can activate the common arm at any point in time, imposing a negative payoff externality on her opponent. The quality of the risky arms is independent across agents. We show that an agent attaches a strategic option value to being able to return to her private arm after having used the common arm. As a result, the common arm is more attractive than in the absence of competition. By occupying it an agent increases the likelihood of ending the payoff externality imposed by her opponent. We analyse the interesting and counter-intuitive behaviour that ensues in the unique Markov perfect equilibrium of this game. The first agent to occupy the common arm is the one who is more optimistic about her private arm. She does so in a state where even a myopic single decision-maker would prefer her private arm. She eventually returns to her private arm, even if this means forgoing her access to the common arm forever. We show that the common arm also carries a strategic option value when it is risky.

JEL Classification: C72, C73, D83
Keywords: Strategic Experimentation, Multi-Armed Bandit, Bayesian Learning, Poisson Process, Congestion, Payoff Externalities

∗ I thank my Ph.D. supervisors Martin Cripps and Guy Laroque for helpful comments. This paper has greatly benefited from discussions with V. Bhaskar, Peter Eso, Antonio Guarino, Philippe Jehiel, Godfrey Keller, Meg Meyer, Lars Nesheim, Balázs Szentes, Andreas Uthemann, and various seminar audiences.
† Department of Economics, University of Texas at Austin, [email protected]


1 Introduction

This paper considers a model of strategic experimentation when there is competition between two agents. Each agent faces a two-armed bandit problem in which she continually chooses between her private arm and a common arm. An agent has exclusive access to her private arm. However, only one of the two agents can activate the common arm at any point in time. A player who is currently using the common arm gains priority over its use; her opponent can only use the common arm if the first agent leaves it and returns to her private arm. This "congestion" effect creates negative payoff externalities between the two agents. Our main finding is that congestion gives rise to new strategic considerations: players perceive a strategic option value from activating the common arm, making it more attractive than in the absence of congestion.

To fix ideas, consider the following example. Firm 1 is a copper producer and Firm 2 is an oil producer. A firm must acquire a land claim before it can drill or mine a particular location. A claim holder has the exclusive rights to prospect for and extract the minerals in the claim area. A claim may be renewed subject to the holder expending a given required amount on exploration or extraction operations on the claim lands. Prospecting and extraction are costly, and a firm cannot simultaneously be active at two locations. Copper ore potentially occurs at locations A1 and C. Petroleum potentially occurs at locations C and A2. Assuming that the occurrences of petroleum and copper ore are uncorrelated, the two firms cannot learn about their own extraction prospects at any location from observing the results of the other firm's experimentation. Thus there are no informational externalities. Firm 1's prospects are better if firm 2 strikes oil at its private location A2: in this event firm 2 has no more interest in the common location, leaving firm 1 with no competition. Thus firm 1 has an incentive to take steps that make it more likely that firm 2 strikes oil at its private location. One way it can do this is by mining location C, leaving firm 2 no choice but to explore location A2. In other words, firm 1 has an incentive to mine the common location C over and above its learning motives.

Our assumptions differ from those in the existing literature. Economic models adapting the standard multi-armed bandit decision problem1 to multi-agent interaction have predominantly focused on the question of learning in a common value environment. They share the assumption that all agents are learning about the same underlying bandit, each possessing her own identical copy of that bandit. In these models, an agent can learn about her own prospective payoff by observing the experimentation of other agents on their bandits: an agent's experimentation generates informational externalities.

1 See Gittins and Jones (1974), Whittle (1988), Gittins, Glazebrook, and Weber (2011).


This paper adapts the standard multi-armed bandit decision problem to multi-agent interaction in a different way, since we assume that the quality of each arm is independent across agents. In our setting, there are no informational externalities.2 Moreover, we assume that the common arm can only be activated by one of the agents at any point, and refer to this phenomenon as "congestion". Thus, an agent can get in the way of another's experimentation, negatively affecting her payoff: an agent's experimentation generates direct payoff externalities.

The central insight of this paper is that there exists a strategic option value associated with activating a contested arm and being able to return to one's private arm. This option value only exists in the presence of other agents and is absent in the single decision-maker problem, or when a switch to the common arm cannot be revoked. This has a number of interesting implications. First, it gives rise to behaviour excluded in the standard bandit model, in particular the temporary interruption of experimentation. This is because congestion, through its strategic implications, makes contested options more attractive than they would otherwise be. Second, it implies that preemption need not be irreversible, and causes the agent seemingly facing less urgency to preempt more aggressively. Strikingly, an agent may find it optimal to return to her private arm, knowing she must thereafter forgo the common arm forever because her opponent will monopolise it. In short, in the presence of congestion, an agent's behaviour is sometimes primarily aimed at redirecting her opponent's interest away from her own arm.

We first assume that the common arm is safe. The unique Markov perfect equilibrium features a priori counterintuitive dynamics. Both agents have an incentive to preempt their opponent's switch to the common arm. One of the agents switches first. Remarkably, this is the agent who at that point is more optimistic about her risky arm. Moreover, she only remains on the common arm for a finite amount of time. Meanwhile, her less optimistic opponent is effectively forced to experiment with her own risky arm, since she is left with no other action. Two things might happen. On the up-side, the opponent might experience a success and learn that her private arm is good. In this case the competition for the safe arm ends and the first agent can optimise her experimentation process assuming that she has exclusive access to both her private and the common arm. On the down-side, if the opponent does not experience a success, her belief eventually drops so low that she would immediately switch to the safe arm and activate it forever, if it were available. In that case, strikingly, the first agent also returns to her own risky arm,

2 This assumption is not essential: allowing for information externalities strengthens our results.


even though she must then forgo the safe arm forever! This seemingly counter-intuitive behaviour is optimal because the prospect of ending the competition has become very unlikely whereas the agent has remained optimistic about her private arm. We call the expected value of this gamble a "strategic" option value to highlight that it does not result from informational free-riding. In particular, it is not the case that occupying the safe arm allows an agent to learn about her own risky arm by observing her opponent's experimentation, while simultaneously collecting the safe flow payoff. This is excluded, as the types of the private arms are independent. Instead a player only learns about – but can also affect – the likelihood that the payoff externality imposed by her opponent will end. We show that this strategic option value persists when the common arm is assumed to be risky.

To our knowledge this paper is the first to consider a game3 of strategic experimentation with direct payoff externalities in the context of individual decision-making. Strulovici (2010) considers a game in which payoff externalities arise as a consequence of group decision-making. In his paper, a group jointly decides whether to all activate the risky or the safe arm of an exponential two-armed bandit. Over time players learn that they have different preferences over the two choices.

In contrast, the effects of informational externalities in games of strategic experimentation have been widely studied in economics4. Several models consider a setup in which N players operate identical versions of a two-armed bandit. The resulting informational externalities cause players to free-ride off one another's experimentation. The first to make this observation were Bolton and Harris (1999, 2000). In their model, each arm yields a payoff with independent Brownian noise. While the common underlying payoff of the safe arms is known, that of the risky arms is not. Besides providing free-riding motives, the informational externalities can also "encourage" agents to experiment at a belief where a single decision-maker would choose the safe arm. This second effect is absent5 in the "exponential bandit" framework of Keller, Rady, and Cripps (2005), where the safe arms yield a constant flow payoff while each risky arm independently yields a payoff at a common Poisson rate that is either positive or equal to zero.

3 Dayanik et al. (2008) examine the performance of a generalised Gittins index when a single decision-maker must decide at each point in time which of N arms to activate, knowing that arms may exogenously break down, and thereby disappear from the choice set, temporarily or permanently. In contrast, Strulovici (2010) and this paper present the disappearance of an option from a player's choice set as the endogenous consequence of strategic interaction.
4 See Bergemann and Välimäki (2006) for a broader survey of the use of multi-armed bandits in economics.
5 The encouragement effect is restored in the framework of "Poisson bandits", where the Poisson process associated with the risky arm has a positive arrival rate, though it is unknown whether it is high or low. See Keller and Rady (2010).


In that framework the authors study the less inefficient asymmetric equilibria of that game. Klein and Rady (2011) assume that the realised type of the risky arm is perfectly negatively correlated across two players. Murto and Välimäki (2011) assume that the qualities of different arms are correlated but their payoff realisations are private information6 to the players, who only observe each other's decision to continue experimenting or stop.

We borrow the exponential two-armed bandit from Keller, Rady, and Cripps (2005). However, the rest of our model and particularly the economic question it addresses are quite different. First, we assume that the quality of risky arms is independent across agents. Thus, an agent cannot learn about her risky arm from the experimentation of others and there is no informational free-riding. Second, and this is our major departure from the literature, we assume that an agent can only experiment with an arm if no other agent currently uses it. An agent's actions therefore determine which arms her opponent may choose, and thus directly impact her opponent's payoff.

Our notion of congestion bears some resemblance to the mechanism of exploding offers.7 Here, instead of an offer expiring exogenously, it expires because someone else has taken it. The resulting preemption incentives cause inefficiencies. This is in line with the literature on preemption games; see Fudenberg and Tirole (1985). Interestingly, allowing the preemption to be revoked adds an allocative inefficiency, as the most optimistic agent occupies the safe arm first while her less optimistic opponent is forced to experiment.

The fact that one agent is able to experiment at the exclusion of others may come from legal constraints, as is the case with mineral exploration rights or patents. The constraints may also be purely physical. Running the experiment may require the use of a scarce specialised piece of equipment, for instance an fMRI scanner, a deep space telescope, or a Large Hadron Collider. In practice, the hoarding of scarce resources, such as "landbanking" or the long-term capacity-booking of gas pipelines, is mostly understood as a barrier to entry and expansion8. Our results suggest that firms might have additional reasons to monopolise scarce resources.

The paper is organised as follows. In section 2 we present the formal model when the common arm is safe. We begin by considering, in section 3, a game in which switching to the safe arm is irrevocable, that is, players are constrained to use stopping strategies. We show that this constraint is binding: there exist continuation games in which the player activating the common arm would strictly benefit from returning to her private risky arm.

6 On private monitoring of payoffs, see also Rosenberg, Solan, and Vieille (2007), Heidhues, Rady, and Strack (2012) or Thomas (2013).
7 See Armstrong and Zhou (2011).
8 See for instance Freeman et al. (2008) or Cardoso et al. (2010).


In section 4 we characterise the strategic option value that a player attaches to being able to return to her private arm and discuss the striking equilibrium dynamics that ensue. In section 5 we illustrate that the existence of a strategic option value does not rely on the assumption that the common arm is safe, but persists, albeit in a more complex form, if we assume that the common arm is also risky. The inefficiency of our equilibria is established in section 6, which discusses the relevant planner solutions. Section 7 discusses extensions and concludes.

2 Model

Time, t ∈ [0, ∞), is continuous. There are two players, 1 and 2, whom we will index by i ∈ {1, 2} and j := 3 − i. Each player faces a two-armed bandit problem à la Keller et al. (2005), where she continually has to decide whether to activate a private risky arm or a common safe arm so as to maximise her expected discounted payoff over the infinite time horizon. We let ρ > 0 denote the common discount rate.

Risky arms: Each risky arm is either "good" or "bad". A good risky arm yields the player activating it a lump-sum payoff of 1 at each jump ("success") of a Poisson process with arrival rate λ > 0. Activating a "bad" risky arm never produces a success. The type of each arm is independently realised once and for all at the beginning of the game. At date t = 0 player 1's risky arm is good with prior probability p^1_0 and player 2's with prior probability p^2_0, with (p^1_0, p^2_0) ∈ [0, 1]^2.

Safe arm: The safe arm yields a flow payoff of a ∈ (0, λ) with certainty to the player activating it. Thus when her risky arm is known to be good, a player strictly prefers it to the safe arm, and vice versa if she were certain her risky arm is bad.

Precedence rule: Each player has exclusive and unconstrained access to her private risky arm. Both players share access to the safe arm, but it can only be activated by one player at a time. A player who occupies the safe arm gains absolute priority over its use: her opponent can then only use the safe arm if the incumbent leaves it and returns to her private risky arm. We assume that if player j occupies the safe arm at dates t − s with s → 0+ and chooses the control k^j(t) = 0, then at date t she is allocated the safe arm, while player i is allocated her own risky arm regardless of whether she uses the control k^i(t) = 0 or k^i(t) = 1. If both players simultaneously switch from their risky arm to the safe arm, a tie-break rule allocates the safe arm to player 1 with probability ι ∈ (0, 1).


Beliefs: The players' actions and successes are publicly observed. Therefore, given the players' common prior (p^1_0, p^2_0) ∈ [0, 1]^2, their action profile to date and the realised number of successes, they share a common posterior at each date t > 0, denoted (p^1_t, p^2_t) ∈ [0, 1]^2. If over the time interval [t, t+s), for any s > 0, player i continuously activates her risky arm without producing a success, the belief about her risky arm at t + s is, by Bayes' rule,

    p^i_{t+s} = p^i_t e^{−λs} / (p^i_t e^{−λs} + 1 − p^i_t).

This is decreasing in s: the longer player i experiments without success, the more pessimistic both players become about her risky arm being good. Differentiating the expression above, we obtain that when k^i(t) = 1 the differential equation describing the evolution of p^i_t is given by:

(1)    dp^i_t = −p^i_t (1 − p^i_t) λ dt.

Once a risky arm produces a success, the commonly held belief about that arm jumps to 1 and remains there forever. Finally, if player i's risky arm is never activated over the time interval [t, t + s), for any s > 0, we have p^i_{t+s} = p^i_t. At any date t ≥ 0 the expected Poisson arrival rate on player i's risky arm, when activated, is p^i_t λ.

We introduce two further pieces of notation. For all 0 < q ≤ p < 1, let σ(p, q) denote the length of time for which a risky arm must be activated for the corresponding belief to fall from p to q. It satisfies

    q = p e^{−λσ(p,q)} / (p e^{−λσ(p,q)} + 1 − p).

Let π(p, q) denote the associated probability that no success occurs over the time interval [t, t + σ(p, q)) given the prior p^i_t = p:

    π(p, q) = 1 − p + p e^{−λσ(p,q)}.

Strategies: For each player i, a strategy is a mapping k^i from public histories of both players' previous actions and successes into {0, 1}, where k^i(t) = 1 indicates that player i chooses to activate her risky arm over the small time interval [t, t + dt). In line with the literature, we restrict attention to Markov strategies, which are measurable with respect to the vector of posterior beliefs (p^1_t, p^2_t). This will turn out to be without loss of generality: the use of backward induction in our proofs guarantees that there are no sequential equilibria other than the unique Markov perfect equilibrium we derive.
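The belief dynamics and the functions σ(p, q) and π(p, q) introduced above are straightforward to compute. The following minimal Python sketch (ours, not part of the paper; all function names are our own) implements them and checks that they are mutually consistent:

```python
import math

lam = 2.0  # arrival rate of a good risky arm (the value used in the paper's figures)

def posterior(p, s):
    """Belief that the arm is good after activating it for time s without a success."""
    return p * math.exp(-lam * s) / (p * math.exp(-lam * s) + 1 - p)

def sigma(p, q):
    """Length of unsuccessful experimentation needed for the belief to fall from p to q (0 < q <= p < 1)."""
    return math.log(p * (1 - q) / (q * (1 - p))) / lam

def pi_no_success(p, q):
    """Probability that no success occurs while the belief falls from p to q."""
    return 1 - p + p * math.exp(-lam * sigma(p, q))

p0, s = 0.8, 0.5
q = posterior(p0, s)
print(q)                      # belief after half a unit of time without success
print(sigma(p0, q))           # recovers s = 0.5
print(pi_no_success(p0, q))   # chance of observing no success along the way
```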


The benchmark problem of a single decision-maker (DM) who faces no congestion is well-known.9 The single DM's optimal policy prescribes that she activate her risky arm if and only if her belief p is strictly above the single DM's optimal threshold p^V := aρ / (λ(ρ + λ − a)). We call p^M := a/λ the myopic threshold: the belief at which a myopic DM is indifferent between activating her risky arm and the safe arm. This threshold can also be understood as the belief at which an agent, myopic or not, is indifferent between her risky arm and the safe arm if she has to make her choice once and for all. A non-myopic DM finds it optimal to continue activating the risky arm on the interval (p^V, p^M), provided she is able to switch to the safe arm in case she does not produce a success and her belief reaches the threshold p^V. The availability of the safe arm generates a positive option value, making experimentation beyond the myopic threshold worthwhile.

Let V(p) denote the value function in the single-agent decision problem. We have

    V(p) = p λ/ρ + π(p, p^V) e^{−ρσ(p,p^V)} [a/ρ − p^V λ/ρ]   if p > p^V,
    V(p) = a/ρ                                                if p ≤ p^V.

When p > p^V, the first term of V(p) is the DM's payoff if she activated her risky arm forever. The second term is the option value she derives from being able to switch to the safe arm in case the risky arm does not produce a success before her belief reaches the threshold p^V. At that point the DM obtains the continuation payoff a/ρ instead of the p^V λ/ρ she would receive if she were unable to switch to the safe arm. The term e^{−ρσ(p,p^V)} discounts this increase in payoff, and π(p, p^V) is the probability that no success occurs before the DM's belief reaches p^V. This option value is decreasing in p: the DM values the availability of the safe arm more highly the more pessimistic she becomes.
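As a numerical illustration (our own sketch, using the figure parameters λ = 2, a = 1, ρ = 1 from footnote 10), the two thresholds and the single-DM value function can be evaluated directly from these formulas:

```python
import math

lam, a, rho = 2.0, 1.0, 1.0               # parameter values used in the paper's figures

p_V = a * rho / (lam * (rho + lam - a))   # single-DM threshold p^V  (= 0.25 here)
p_M = a / lam                             # myopic threshold p^M     (= 0.5 here)

def sigma(p, q):
    return math.log(p * (1 - q) / (q * (1 - p))) / lam

def pi_no_success(p, q):
    return 1 - p + p * math.exp(-lam * sigma(p, q))

def V(p):
    """Single decision-maker value: risky payoff plus the option value of switching at p^V."""
    if p <= p_V:
        return a / rho
    return (p * lam / rho
            + pi_no_success(p, p_V) * math.exp(-rho * sigma(p, p_V)) * (a / rho - p_V * lam / rho))

print(p_V, p_M)
for p in (0.2, 0.3, 0.5, 0.8):
    print(p, V(p))   # V equals a/rho below p^V, is continuous at p^V, and rises with p above it
```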

3 Game with irrevocable switching

In the benchmark single DM problem, when activating her risky arm, the DM perceives an option value from being able to switch to the safe arm at any time. She attaches no value to being able to switch back to her risky arm. In contrast, we show that in a two-player game, a player values the ability to switch back and forth between her two arms.

To illustrate this, we begin by considering the game in which switching to the safe arm is irrevocable: once a player occupies the safe arm, she may not switch back to her risky arm. Her opponent therefore loses access to the safe arm forever, and the strategic interaction is effectively over. We thus impose on players the use of stopping strategies: at each date, player i chooses to either activate her risky arm over the time interval [t, t + dt)

9 Keller et al. (2005) provide the analysis.


(k^i(p^1_t, p^2_t) = 1) or to irrevocably switch to the safe arm (k^i(p^1_t, p^2_t) = 0), so as to maximise her expected discounted payoff. We derive the unique Markov perfect equilibrium of this game and illustrate typical equilibrium dynamics. To finish, we show that the constraint that the switch to the safe arm is irrevocable is binding. Observe that with irrevocable switching, the assumption that the common arm is safe is not essential. Our analysis would exactly go through if the common arm were risky and had an expected arrival rate of a for both players.

A Markov perfect equilibrium (MPE) is a pair of strategies (k^1(.), k^2(.)) such that in each state (p^1, p^2) ∈ [0, 1]^2 player i's action maximises her expected discounted continuation payoff given player j's strategy. Let W^i(.) denote player i's value function, assuming that neither player has switched to the safe arm as yet. It solves the following dynamic program:

    W^i(p^1, p^2; k^j(.)) = max_{k^i(p^1,p^2) ∈ {0,1}} { k^i(p^1, p^2) L^R W^i(p^1, p^2; k^j(.)) + (1 − k^i(p^1, p^2)) L^S W^i(p^1, p^2; k^j(.)) },

where L^R W^i and L^S W^i denote player i's continuation payoffs from choosing k^i(p^1, p^2) = 1 and k^i(p^1, p^2) = 0 respectively. Her payoff from switching to the safe arm in state (p^1, p^2) is

    L^S W^i(p^1, p^2; k^j(.)) = k^j(p^1, p^2) a/ρ + (1 − k^j(p^1, p^2)) [ι a/ρ + (1 − ι) p^i λ/ρ].

The first term is player i's payoff if player j does not switch to the safe arm in state (p^1, p^2). Player i then obtains a/ρ from activating the safe arm forever. The second term is her payoff if player j also switches and the arm is allocated in a tie-break. Player i's payoff from activating her risky arm in state (p^1, p^2) satisfies the recursion:

(2)    L^R W^i(p^1, p^2; k^j(.)) = p^i λ dt [1 + e^{−ρdt} λ/ρ] + (1 − p^i λ dt) { (1 − k^j(p^1, p^2)) e^{−ρdt} p^{i0} λ/ρ + k^j(p^1, p^2) e^{−ρdt} [p^j λ dt V(p^{i0}) + (1 − p^j λ dt) W^i(p^{10}, p^{20}; k^j(.))] },

with p^{i0} = p^i − p^i λ (1 − p^i) dt. The first term is player i's payoff if her risky arm produces a success and she henceforth activates it forever. If it does not, she updates her belief to p^{i0} and player j's action in state (p^1, p^2) matters. If j switches to the safe arm, player i is constrained to activate her risky arm forever. If j activates her risky arm and instantly produces a success, the competition ends and player i effectively faces the single-agent decision problem. Otherwise j's belief is also revised downwards.

Theorem 1 describes the unique Markov perfect equilibrium in the game in which the decision to switch to the safe arm is irrevocable.
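For intuition about how recursion (2) is used, note that in the region where both players experiment (k^i = k^j = 1, so that W^i = L^R W^i), letting dt → 0 and taking W^i to be differentiable gives, to first order, the differential equation below. This limiting step is a sketch of our own and is not spelled out in the text:

    (ρ + p^1 λ + p^2 λ) W^i(p^1, p^2) = p^i λ (1 + λ/ρ) + p^j λ V(p^i) − λ [ p^1(1 − p^1) ∂W^i/∂p^1 + p^2(1 − p^2) ∂W^i/∂p^2 ].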


Theorem 1. The strategy profile (k^{1∗}, k^{2∗}) constitutes the unique MPE of the game with irrevocable switching, where

    k^{i∗}(p^1, p^2) = 0   if (p^j < p^M and p^i < p^M), or (p^j = p^M and p^i ≤ p^M), or (p^j > p^M and p^i ≤ p^V);
    k^{i∗}(p^1, p^2) = 1   otherwise.

Proof: See appendix A. □
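The decision rule in Theorem 1 is simple enough to state as code. The sketch below (our own illustration, with the figure parameters) implements k^{i∗} and traces a no-success path from asymmetric priors; on a time grid the knife-edge case p^j = p^M is only hit approximately, so both players' switching conditions fire together at the instant the more optimistic player's belief reaches p^M:

```python
lam, a, rho = 2.0, 1.0, 1.0
p_M = a / lam
p_V = a * rho / (lam * (rho + lam - a))

def k_star(p_own, p_other):
    """Theorem 1 action of a player with belief p_own about her own arm, facing opponent belief p_other.
    Returns 1 (activate own risky arm) or 0 (switch irrevocably to the safe arm)."""
    if p_other < p_M and p_own < p_M:
        return 0
    if p_other == p_M and p_own <= p_M:
        return 0
    if p_other > p_M and p_own <= p_V:
        return 0
    return 1

# No-success path: both beliefs decay according to equation (1) while both players experiment.
p1, p2, dt, t = 0.7, 0.6, 1e-4, 0.0
while k_star(p1, p2) == 1 and k_star(p2, p1) == 1:
    p1 -= p1 * (1 - p1) * lam * dt
    p2 -= p2 * (1 - p2) * lam * dt
    t += dt
print(t, p1, p2)   # stops when p1 first reaches p_M: the moment player 2 takes the safe arm (Case 2 below)
```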

Figure 1: Equilibrium strategies of player 1 (left) and player 2 (right) when the decision to switch to the safe arm is irrevocable. For states (p^1, p^2) in the green (dark) area, the player switches to the safe arm. For states in the orange (light) area, the player plays her risky arm.

The MPE strategy profile (k^{1∗}, k^{2∗}) is illustrated in Figure 1.10 We now provide an intuition for Theorem 1. The myopic threshold p^M plays a central role. Observe that in states in which player j switches to the safe arm, player i is effectively trading off the payoff a/ρ from irrevocably switching to the safe arm with the payoff p^i λ/ρ from being forced to activate her risky arm forever. In states (p^1, p^2) such that p^i = p^M, these payoffs are equalised and player i is indifferent between the two outcomes. She strictly prefers being held to her risky arm when p^i > p^M, and to the safe arm when p^i < p^M.

In our proof we first note that once player i's belief falls below her single DM threshold, switching to the safe arm is a dominant strategy for her. We then proceed by backward induction and show that the players' incentives to preempt one another in switching to the safe arm cause the switching decisions to unravel in all states (p^1, p^2) such that p^1 < p^M and p^2 < p^M. The preemption motives only disappear if one of the players is indifferent between preempting her opponent and activating her risky arm. In equilibrium this is the player with the highest expected Poisson arrival rate. That player's equilibrium strategy prescribes

10 In all our figures we choose λ = 2, a = 1 and ρ = 1.


that she activate her risky arm when indifferent and let her opponent, who would otherwise have strict incentives to preempt her, take the safe arm. If both players are simultaneously indifferent, in equilibrium both switch to the safe arm and it is allocated in a tie-break. In Figure 2 we illustrate typical equilibrium dynamics, conditional on neither player having a success. Notice that in moving from case 1 to case 3, i.e. as the discrepancy in priors increases, the belief at which the first player switches to the safe arm in equilibrium gets closer to the single DM threshold.

Figure 2: In Case 1, the prior is p^1_0 = p^2_0 > p^M. In Case 2, the priors p^1_0 > p^2_0 are such that at date t > 0 satisfying p^1_t = p^M we have p^2_t > p^V, while in Case 3 we have p^2_t ≤ p^V.

Case 1: If the prior is p^1_0 = p^2_0 > p^M, then in equilibrium both players switch to the safe arm when beliefs reach p^M, and the safe arm is allocated in a tie-break (illustrated in figure 4 for player 2 winning the tie-break). When p^1_t = p^2_t = p^M, both players are indifferent in equilibrium between activating their risky arm and switching to the safe arm.

Case 2: Player 1's strategy prescribes that she continue activating her risky arm for all p^1_t ≥ p^M. Player 2's strategy prescribes that she switch to the safe arm once p^1_t = p^M, i.e. when player 1 is indifferent between preempting and letting player 2 take the safe arm. Observe that it is the player who is least likely to have a good risky arm who gets the safe arm in this equilibrium.

Case 3: Here player 1 is initially so much more likely than player 2 to have a good risky arm that even when player 2's posterior belief reaches the single DM threshold p^V, player 1's posterior belief is still above the myopic threshold belief p^M, and player 1 strictly prefers activating her risky arm regardless of player 2's action. Thus, given these priors, player 2 effectively faces the single DM problem.

Although the safe option always goes to the player who at that point has the lowest expected arrival rate, the equilibrium behaviour is inefficient, whether it involves preemption (cases 1 and 2) or not (case 3). This is because players fail to internalise the negative payoff


externality they impose on their opponent. We analyse the planner problem in section 6.1 and show that it requires experimentation beyond the single DM threshold.

To finish, we emphasise that when there is sufficient competition for the safe arm, the irrevocability constraint binds. Suppose that in equilibrium, player 2 switches to the safe arm when her posterior belief is still above the single DM threshold (as illustrated in cases 1 and 2 above). Suppose then that player 1, who is still experimenting, produces a success. Now player 1 will never want to switch to the safe arm. Ideally, player 2 would like to resume experimenting on her risky arm, as her access to the safe arm at a later date is now guaranteed. However, since we have assumed that a switch to the safe arm cannot be revoked, she is not able to do so. Thus, the irrevocability constraint is binding.

This observation underlines that, contrary to the single DM problem, in a two-player game a player values the ability to return to her risky arm after having activated the safe arm. We therefore conclude that when switching to the safe arm is revocable players will not use stopping strategies in equilibrium. In the next section, we derive the MPE with revocable switching, and show that the common arm carries a strategic option value: if she is able to ultimately return to her own experimentation, a player may benefit from occupying the common arm temporarily, in order to force her opponent to experiment.

4 Game with revocable switching

We now consider the game without imposing the use of stopping strategies. Instead we assume that the decision to switch to the safe arm is revocable: a player may freely switch back and forth between her risky arm and the safe arm, subject to her opponent's actions and the precedence rule given in section 2. This section contains the main result of this paper: arms subject to congestion carry a strategic option value. We derive the unique MPE of this game, and show that in equilibrium, a player has an incentive to temporarily interrupt her own experimentation and take the safe arm.

We begin by describing the players' problem in this game. At each date, player i chooses to either activate her risky arm (k̄^i(p^i, p^j) = 1) or the safe arm (k̄^i(p^i, p^j) = 0) over the time interval [t, t + dt) so as to maximise her expected discounted payoff. A Markov perfect equilibrium is a pair of strategies (k̄^1(.), k̄^2(.)) such that in each state (p^1, p^2) ∈ [0, 1]^2 player i's action maximises her expected discounted continuation payoff given player j's strategy.

At histories at which player i already occupies the safe arm and wishes to retain it, the precedence rules imply that player j is allocated her private risky arm regardless of which control she chooses. Given any equilibrium, player j's strategy at these histories can therefore be modified without affecting the outcome of the game, or the players' payoffs.

When we claim that our equilibrium is unique, we mean up to similar irrelevant changes in the players' strategies.

Let U^i(.) denote player i's value function in states in which neither player is currently occupying the safe arm. It solves the following dynamic program:

    U^i(p^1, p^2; k̄^j(.)) = max_{k̄^i(p^1,p^2) ∈ {0,1}} { k̄^i(p^1, p^2) L^R U^i(p^1, p^2; k̄^j(.)) + (1 − k̄^i(p^1, p^2)) L^S U^i(p^1, p^2; k̄^j(.)) },

where L^R U^i and L^S U^i denote player i's continuation values from choosing the controls k̄^i(p^1, p^2) = 1 or k̄^i(p^1, p^2) = 0 respectively. We let S^i(p^i, p^j) denote player i's value from activating the safe arm and thus forcing player j to experiment, and R^i(p^i, p^j) player i's payoff if she has no other choice but to activate her risky arm because player j is activating the common arm. We derive expressions for these functions in the next sub-section.

Player i's payoff from choosing to switch to the safe arm in state (p^1, p^2) satisfies the recursion:

    L^S U^i(p^1, p^2; k̄^j(.)) = (1 − k̄^j(p^1, p^2))(1 − ι) { p^i λ dt [1 + e^{−ρdt} λ/ρ] + (1 − p^i λ dt) e^{−ρdt} R^i(p^{i0}, p^j) }
                               + [1 − (1 − k̄^j(p^1, p^2))(1 − ι)] { a dt + e^{−ρdt} [p^j λ dt V(p^i) + (1 − p^j λ dt) S^i(p^i, p^{j0})] },

with p^{i0} = p^i − p^i λ (1 − p^i) dt for i ∈ {1, 2}. In the expression above the first summand is player i's payoff if player j also chooses to switch to the safe arm in state (p^1, p^2) and player i loses the tie-break. In that case it is player j who obtains the safe arm and player i is forced to activate her risky arm. If player i instantly produces a success, she keeps activating her risky arm forever. If not, her belief is updated to p^{i0} and play enters a phase in which, as long as player j keeps activating the safe arm, player i has no other choice but to experiment on her private risky arm. She therefore receives the continuation payoff R^i(p^{i0}, p^j).

The second summand is player i's payoff if in state (p^1, p^2) player j chooses not to switch to the safe arm or loses the tie-break. Player i then activates the safe arm and collects the deterministic flow payoff a while observing player j's experimentation. In the event that player j produces a success it becomes optimal for her to keep activating her private arm forever and the competition for the safe arm effectively ends. Thus, player i faces the single DM problem and since she is able to return to her private arm, she can implement the single DM optimal policy and obtain the payoff V(p^i). This is the best possible outcome for player i. If instead j is unsuccessful, the belief concerning her risky arm is updated

to p^{j0} and play enters the phase where player i forces player j to experiment. Player i's continuation payoff is therefore S^i(p^i, p^{j0}).

Player i's payoff from choosing to continue activating her risky arm in state (p^1, p^2) satisfies the recursion:

    L^R U^i(p^1, p^2; k̄^j(.)) = p^i λ dt [1 + e^{−ρdt} λ/ρ] + (1 − p^i λ dt) e^{−ρdt} { (1 − k̄^j(p^1, p^2)) R^i(p^{i0}, p^j) + k̄^j(p^1, p^2) [p^j λ dt V(p^{i0}) + (1 − p^j λ dt) U^i(p^{10}, p^{20}; k̄^j(.))] },

with p^{i0} = p^i − p^i λ (1 − p^i) dt for i ∈ {1, 2}. The first summand is player i's payoff if her risky arm instantly produces a success. If it does not, player j's strategy in state (p^1, p^2) matters. If player j takes the safe arm, player i receives the continuation payoff R^i(p^{i0}, p^j) from being forced to experiment. If instead player j also activates her risky arm, then player i receives the single DM payoff if j instantly produces a success; otherwise the players face the same dynamic problem anew in state (p^{10}, p^{20}).

We derive the unique MPE of this game (Theorem 2). The remainder of this section is organised as follows. First we derive expressions for the continuation payoffs S^i(p^i, p^j) and R^i(p^i, p^j) (section 4.1). Depending on a player's motive for occupying the common arm, we distinguish two classes of subgames in which one player occupies the common arm. We say that player i strategically forces her opponent to experiment if player i occupies the safe arm even though p^i > p^V. That is, absent her opponent, player i would prefer activating her risky arm. We wish to distinguish this from player i occupying the safe arm not for strategic purposes, but because at the current beliefs it is her dominant action. This is the case if and only if p^i ≤ p^V. We show that in the two-player game, occupying the common arm produces a strategic option value. The functions S^i(p^i, p^j) and R^i(p^i, p^j) serve to describe the equilibrium strategies (section 4.2).

We show that in equilibrium, in a subgame where player j is strategically forced to experiment, she is forced to experiment until her belief falls below the single DM threshold, unless she first produces a success. This implies that in equilibrium, it is never the case that players successively force one another to experiment for strategic purposes. We show this constructively by considering the last subgame in which a player strategically forces her opponent to experiment, and showing by backward induction that it is only preceded by both players activating their risky arms.

Finally, in section 4.3, we state Theorem 2, our main result, and illustrate the equilibrium dynamics. We discuss the effect on the equilibrium behaviour of the strategic option value associated with a player's ability to occupy the common arm temporarily. First, it makes the safe arm even more attractive than in the case where switching is irrevocable.

The first player to occupy the safe arm does so when her belief about her risky arm is still above the myopic threshold p^M. Second, a player's strategic option value when occupying the safe option decreases as the likelihood of her opponent's success decreases. Consequently, a player returns to her risky arm even though her belief about that arm is the same as it was when she switched to the safe arm. Even more strikingly, she returns to her risky arm even though she thereby forgoes the future use of the safe arm, as her opponent will occupy it forever.

4.1 Strategically forced experimentation

Let S^i_0 denote player i's value from strategically forcing j to experiment in the last subgame with strategically forced experimentation (indicated by the subscript 0). We define it as a subgame in which player i either keeps the safe arm forever, or leaves it in a state with p^j ≤ p^V in which it is a dominant strategy for player j to take it forever. Let R^j_0 denote the payoff received by player j in that subgame. If in the unique MPE there is only one subgame in which a player strategically forces her opponent to experiment, then S^i = S^i_0 and R^j = R^j_0.

We first derive an expression for S^i_0. It satisfies the dynamic program:

(3)    S^i_0(p^i, p^j) = max_{k̄^i(p^1,p^2) ∈ {0,1}} { k̄^i(p^1, p^2) L^R S^i_0(p^i, p^j) + (1 − k̄^i(p^1, p^2)) L^S S^i_0(p^i, p^j) }.

If player i switches back to her risky arm, we have assumed that player j occupies the safe arm forever. Therefore player i's continuation value is L^R S^i_0(p^i, p^j) = p^i λ/ρ. Player i's continuation value from remaining on the safe arm satisfies the recursion:

    L^S S^i_0(p^i, p^j) = a dt + e^{−ρdt} [p^j λ dt V(p^i) + (1 − p^j λ dt) S^i_0(p^i, p^{j0})],

with p^{j0} = p^j − p^j λ (1 − p^j) dt. Over the interval [t, t + dt), player i collects the certain flow payoff a from the safe arm and observes j's experimentation. Observe that the belief p^i about her private risky arm remains fixed. If j produces a success, player i adopts the single DM optimal policy, receiving value V(p^i). Otherwise she faces the same dynamic problem in the new state (p^i, p^{j0}).

Proposition 1 describes the optimal behaviour of player i when she is strategically forcing player j to experiment, in the last subgame with strategically forced experimentation: she forces player j to experiment as long as p^j is above the following threshold, derived in appendix B:

(4)    p^j_{S^i_0}(p^i) := (ρ/λ) · (p^i λ/ρ − a/ρ) / (V(p^i) − p^i λ/ρ).

Observe that p^j_{S^i_0}(p^i) > 0 if and only if p^i ≥ p^M, and that this threshold can therefore never be reached for p^i < p^M. The closed-form expression for the value function S^i_0(p^i, p^j) given below includes a strategic option value, which we discuss after stating the proposition.

Proposition 1. Consider the last subgame in which player i strategically forces player j to experiment. In states (p^1, p^2) such that p^j > p^j_{S^i_0}(p^i), remaining on the safe arm is optimal for player i and her payoff is

    S^i_0(p^i, p^j) = a/ρ + G^i(p^i, p^j) + H^i(p^i, p^j)   if p^i > p^M,
    S^i_0(p^i, p^j) = a/ρ + G^i(p^i, p^j)                   if p^i ≤ p^M,

where

    G^i(p^i, p^j) := p^j · λ/(λ+ρ) · [V(p^i) − a/ρ],
    H^i(p^i, p^j) := π(p^j, p^j_{S^i_0}(p^i)) e^{−ρσ(p^j, p^j_{S^i_0}(p^i))} [p^i λ/ρ − a/ρ − G^i(p^i, p^j_{S^i_0}(p^i))].

Conversely, in states (p^1, p^2) such that p^j ≤ p^j_{S^i_0}(p^i), it is optimal for player i to return to her risky arm and let player j take the safe arm forever. In these states, S^i_0(p^i, p^j) = p^i λ/ρ.

Proof: See appendix B. □

In section 3 we had noted that an irrevocability constraint on strategies may be binding and that a player occupying the safe arm would attach a positive value to being able to resume her own experimentation if her opponent had a success. Proposition 1 shows that the ability to return to one's risky arm may be of value even if one's opponent is not successful! We call the difference S^i_0(p^i, p^j) − a/ρ player i's strategic option value from being able to return to her risky arm when she is strategically forcing her opponent to experiment. It consists of two distinct option values, G^i(p^i, p^j) and H^i(p^i, p^j), each accruing in a different continuation game. According to Proposition 1, if player i activates the common arm it is optimal for her to return to her risky arm at the first of the following two events: player j produces a success, or the belief p^j reaches the threshold p^j_{S^i_0}(p^i).

The term G^i(p^i, p^j) measures player i's option value from being able to return to her risky arm and implement the single DM policy following a success by player j. In that case player i obtains the single DM value V(p^i) rather than a/ρ. A success on player j's risky arm occurs at expected discounted rate p^j λ/(λ + ρ). As long as player j does not produce a success, p^j, and with it G^i(p^i, p^j), decreases.

If p^i ≤ p^M, then p^j_{S^i_0}(p^i) ≤ 0 and the second event never occurs. Accordingly, if player j does not have a success, player i occupies the safe arm forever. However, for p^i > p^M, player i perceives an additional option value, H^i(p^i, p^j). If p^j, and hence G^i(p^i, p^j), are sufficiently low, player i is optimistic enough to benefit from returning to her risky arm, even though that entails losing access to the safe arm forever. H^i(p^i, p^j) measures the resulting expected discounted increase in player i's continuation payoff. Once p^j reaches p^j_{S^i_0}(p^i), player i returns to her risky arm and obtains p^i λ/ρ rather than the a/ρ + G^i(p^i, p^j_{S^i_0}(p^i)) she would obtain if she kept waiting for player j to produce a success. The term e^{−ρσ(p^j, p^j_{S^i_0}(p^i))} discounts this increase in payoff, and π(p^j, p^j_{S^i_0}(p^i)) is the probability that no success occurs

before p^j reaches p^j_{S^i_0}(p^i). The option value H^i(p^i, p^j) tends to zero as p^j_{S^i_0}(p^i) tends to zero. Therefore S^i_0(p^i, p^j) is continuous and has a kink at p^i = p^M.

To summarise: by forcing her opponent to experiment, player i achieves the single DM value in the case that j's experimentation proves successful. If it does not, player i benefits from being able to eventually return to her own experimentation if she is sufficiently optimistic about her own risky arm, even if this implies losing her access to the safe arm forever. This last effect is surprising and somewhat counter-intuitive.

We can now derive an expression for R^j_0(p^j, p^i), player j's payoff from being strategically forced to experiment and then taking the safe arm forever if given the opportunity:

(5)    R^j_0(p^j, p^i) = p^j λ/ρ                                                                                  if p^i ≤ p^M,
       R^j_0(p^j, p^i) = p^j λ/ρ + π(p^j, p^j_{S^i_0}(p^i)) e^{−ρσ(p^j, p^j_{S^i_0}(p^i))} [a/ρ − p^j_{S^i_0}(p^i) λ/ρ]   if p^i > p^M.

As previously noted, for states with p^i ≤ p^M we have p^j_{S^i_0}(p^i) ≤ 0 and player j is forced to activate her risky arm forever. In that case her payoff is p^j λ/ρ. On the other hand, for p^i > p^M, p^j_{S^i_0}(p^i) > 0 and player i will stop forcing player j to experiment if p^j reaches the threshold p^j_{S^i_0}(p^i) defined in (4). That is, starting from any initial state (p^i, p^j) with p^j > p^j_{S^i_0}(p^i), if she does not produce a success player j will only be forced to experiment temporarily, for the finite duration σ(p^j, p^j_{S^i_0}(p^i)). The second term in (5) reflects player j's expected discounted payoff from being able to switch to the safe arm at belief p^j_{S^i_0}(p^i). The function R^j_0(p^j, p^i) is continuous and has a kink at p^i = p^M.
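To get a feel for the magnitudes, the sketch below (ours; it hard-codes the closed forms in (4), Proposition 1 and (5) as reconstructed above, with the figure parameters λ = 2, a = 1, ρ = 1) evaluates the threshold, the value of strategically forcing the opponent to experiment, and the payoff of the forced player, at a state just above the myopic threshold:

```python
import math

lam, a, rho = 2.0, 1.0, 1.0
p_V = a * rho / (lam * (rho + lam - a))
p_M = a / lam

def sigma(p, q):
    return math.log(p * (1 - q) / (q * (1 - p))) / lam

def pi_ns(p, q):
    return 1 - p + p * math.exp(-lam * sigma(p, q))

def V(p):                       # single-DM value from section 2
    if p <= p_V:
        return a / rho
    return p * lam / rho + pi_ns(p, p_V) * math.exp(-rho * sigma(p, p_V)) * (a / rho - p_V * lam / rho)

def G(pi_, pj):                 # option value of the opponent's success while sitting on the safe arm
    return pj * lam / (lam + rho) * (V(pi_) - a / rho)

def threshold(pi_):             # equation (4): belief p^j at which player i stops forcing j
    return (pi_ * lam - a) / (lam * (V(pi_) - pi_ * lam / rho))

def S0(pi_, pj):                # Proposition 1: value of strategically forcing the opponent
    if pi_ <= p_M:
        return a / rho + G(pi_, pj)      # never leave the safe arm unless j succeeds
    q = threshold(pi_)
    if q >= pj:
        return pi_ * lam / rho           # forcing is not worthwhile: leave at once
    H = pi_ns(pj, q) * math.exp(-rho * sigma(pj, q)) * (pi_ * lam / rho - a / rho - G(pi_, q))
    return a / rho + G(pi_, pj) + H

def R0(pj, pi_):                # equation (5): payoff of the forced player (valid for pj above the threshold)
    if pi_ <= p_M:
        return pj * lam / rho            # forced to experiment forever
    q = threshold(pi_)
    return pj * lam / rho + pi_ns(pj, q) * math.exp(-rho * sigma(pj, q)) * (a / rho - q * lam / rho)

pi_, pj = 0.52, 0.45
print(threshold(pi_))                  # below p_V: the forced player eventually takes the safe arm forever
print(S0(pi_, pj), pi_ * lam / rho)    # forcing beats playing the risky arm outright: a strategic option value
print(R0(pj, pi_))
```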

4.2 Boundaries

We now define the boundary B^i(.) in [0, 1]^2 that determines player i's best response to player j switching to the safe arm in the last subgame with strategically forced experimentation. Fix a state (p^i, p^j) ∈ [0, 1]^2 and assume that player j switches to the safe arm. Player i faces a

choice between being forced to experiment by player j, and also switching to the safe arm and facing her opponent in a tie-break. In other words, player i is comparing the payoffs S^i_0(p^i, p^j) and R^i_0(p^i, p^j).11 We define B^i(p^j) to be the boundary satisfying:

(6)    R^i_0(B^i(p^j), p^j) = S^i_0(B^i(p^j), p^j).

It is a switching line which partitions [0, 1]^2 into two regions, according to player i's best response. Not switching and letting her opponent occupy the safe option is a (weak) best response if and only if R^i_0(p^i, p^j) ≥ S^i_0(p^i, p^j). This holds if and only if p^i ≥ B^i(p^j). The boundary B^1(p^2) is illustrated in Figure 3 below.

Figure 3: Qualitative illustration of the boundary B^1(p^2).

For p^j < p^M, if player j captures the safe arm, she forces player i to experiment until she produces a success, and R^i_0(p^i, p^j) = p^i λ/ρ. The boundary B^i(p^j) indicates that if player j switches to the safe arm in states with p^i > B^i(p^j), player i prefers activating her risky arm to entering a tie-break with j for access to the safe arm. For p^j ≥ p^M, if player j does capture the safe arm, she only forces player i to experiment for a finite duration of time. Hence R^i_0(p^i, p^j) > p^i λ/ρ, so that activating her risky arm is more attractive for player i than if she were forced to experiment forever: this is reflected in the kink in B^i(p^j) at p^j = p^M. Furthermore, for p^j ≥ p^M, B^i(p^j) has another kink, at p^M, resulting from the change in the composition of the strategic option value in S^i_0(p^i, p^j) when p^i = p^M. Continuity of B^i(p^j) at both kinks follows from the continuity of R^i_0(p^i, p^j) and S^i_0(p^i, p^j) at p^i = p^M and p^j = p^M. Finally, for p^i < p^V, switching to the safe arm is a dominant strategy for player i.

11 We can also interpret this as a choice between just preempting player j's switch to the safe arm or not.


The role played by the boundary B^i(p^j) in the proof of Theorem 2 is analogous to the one played by the myopic threshold in the game with irrevocable switching. If p^i < B^i(p^j) and p^j < B^j(p^i), the players' decisions to preempt one another unravel. In equilibrium, a player switches to the safe arm in a state where her opponent is indifferent between also switching and pursuing her own experimentation. This is explained in the next section.

4.3 Equilibrium and Dynamics

Theorem 2 describes the unique Markov perfect equilibrium of the game with revocable switching. It involves at most one round of strategically forced experimentation. That is, a player is forced to experiment until her belief falls below the single DM threshold p^V. Therefore S^i = S^i_0 and R^i = R^i_0. This equilibrium is inefficient, and in section 6.2 we derive the relevant planner solution. All proofs are in appendix C and we concentrate in this section on describing the equilibrium dynamics and highlighting the peculiar behaviour along the equilibrium path.

Theorem 2. The strategy profile (k̄^{1∗}, k̄^{2∗}) constitutes the unique MPE of the game with revocable switching, where

    k̄^{i∗}(p^1, p^2) = 0   if p^i ≤ p^V; or p^i > p^V, p^i < B^i(p^j) and p^j ≤ B^j(p^i); or p^i = p^j = p^U;
    k̄^{i∗}(p^1, p^2) = 1   otherwise,

where p^U satisfies p^U = B^i(p^U).

Proof: See appendix C. □
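Relative to Theorem 1, the only computational addition is the boundary B^i(.), obtained by solving the indifference condition (6) numerically (for instance by bisection on p^i, using the closed forms for S^i_0 and R^i_0 sketched after section 4.1); the fixed point p^U = B^i(p^U) can then be found by iterating the same routine on the diagonal. We do not spell this out in code here, since the relevant formulas are exactly those already illustrated above.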

Figure 4: Equilibrium strategies of player 1 (left) and player 2 (right). For states (p^1, p^2) in the green (dark) area, the player chooses the safe arm; for states in the orange (light) area, the player chooses her risky arm.


The MPE strategy profile (k̄^{1∗}, k̄^{2∗}) is illustrated in Figure 4. Observe that the set of states in which activating the safe arm is a best response is larger than with irrevocable switching. This is because the strategic option value causes the payoff from occupying the safe arm to exceed a/ρ, the payoff from irrevocably switching to the safe arm. We now illustrate the resulting equilibrium dynamics, conditional on no success. Here too, competition increases as the priors get closer.

Figure 5: In Case 1, the prior is p^1_0 = p^2_0 > p^M. In Case 2, the prior is such that at date s satisfying p^2_s = B^2(p^1_s) we have p^1_s ≤ B^1(p^2_s). In Case 3, the prior is such that at date s satisfying p^2_s = p^V we have p^1_s > B^1(p^2_s).

Case 1: p^1_0 = p^2_0 > p^M.
In equilibrium both players switch to the safe arm when the state (p^U, p^U), satisfying p^U = B^1(p^U) = B^2(p^U), is reached. Observe that p^U > p^M. At that point, each player is indifferent between being forced to experiment for the finite duration σ_{S^i_0}(p^U, p^U), or switching to the safe arm and forcing her opponent to experiment for the same finite duration. A tie-break determines which player is forced to experiment (here illustrated to be player 2). As the likelihood of her success, p^2, decreases, so does the option value G^1(p^1, p^2), and player 1 becomes increasingly pessimistic about the prospect of player 2 ceasing to compete for the safe arm. At the same time player 1 remains very optimistic about her private arm. In fact, at her current belief p^U, she would prefer her risky arm if she had to make a choice once and for all (since p^U > p^M). She therefore returns to her risky arm once p^2 reaches p^2_{S^1_0}(p^U), even though this means forgoing the safe arm forever, since p^2_{S^1_0}(p^U) < p^V.
Observe that with symmetric priors the player who is not allocated the safe arm in the tie-break will be forced to experiment unsuccessfully for longer than in any equilibrium with asymmetric priors.

Case 2: p^1_0 > p^2_0 are such that at date s > 0 satisfying p^2_s = B^2(p^1_s) we have p^1_s ≤ B^1(p^2_s).
Here in equilibrium, player 1 switches to the safe arm when p^2 = B^2(p^1). This is the last date at which player 1 can do so without player 2 wanting to preempt. At that date player 2 is just indifferent between letting player 1 have the safe arm and facing her in a

tie-break, and in equilibrium she chooses the former. If player 2 does not have a success, she is forced to experiment until her belief reaches the threshold p^2_{S^1_0}(p^1_s), and the game proceeds as in case 1. Observe that it is the player with the highest expected Poisson arrival rate who is the first to occupy the safe arm, and the player with the lowest expected arrival rate who is forced to experiment. This is clearly inefficient, and different from the equilibrium when switching is irrevocable. It occurs with revocable switching because in all states (p^1, p^2) such that p^1 ≤ B^1(p^2) and p^2 > B^2(p^1), player 1 has a stronger incentive to preempt her opponent's switch to the safe arm than player 2. If unsuccessful, player 2 would only be forced to experiment temporarily, for the duration σ(p^2, p^2_{S^1_0}(p^1)). In contrast, player 1 would be forced to experiment forever if p^2 ≤ p^M; or, if p^2 > p^M, for the finite duration σ_{S^2_0}(p^2, p^1) > σ_{S^1_0}(p^1, p^2). In both cases, player 1 faces a longer spell of strategically forced experimentation than player 2.

Case 3: p^1_0 > p^2_0 are such that at date s > 0 satisfying p^2_s = p^V we have p^1_s > B^1(p^2_s).
Here player 1's expected arrival rate is so high relative to player 2's that when player 2's posterior belief reaches the single DM threshold p^V, player 1's belief is above B^1(p^V) and she strictly prefers activating her risky arm. Player 2 thus effectively faces the single DM problem. In this equilibrium, there is no strategically forced experimentation: player 1's expected arrival rate is so high that she considers it too costly to interrupt her experimentation, even temporarily, and despite the strategic option value.

4.4 Discussion

There is something seemingly contradictory about the equilibrium dynamics. First, in a state in which even a myopic single DM would prefer the risky arm, player i takes the safe arm out of concern that player j may take it, possibly only temporarily. She later leaves the safe arm at a point when she is certain that player j, if given the opportunity, will occupy the safe arm forever. The driver of these equilibrium dynamics is the strategic option value. By occupying the safe arm, player i increases the likelihood that player j has a success, in which case player i gets the single DM payoff. If that gamble becomes too unlikely to pay off, player i resumes her own experimentation.

The allocation of the safe arm in equilibrium is also surprising. It is the player most optimistic about her risky arm who occupies the safe arm first. This is the opposite of the case with irrevocable switching.

We emphasise that whenever she is forcing her opponent to experiment, a player entirely stops learning about her private arm. She does not learn anything from activating the safe arm, nor does she learn anything from her opponent's experimentation, since the risky arms

are independent. In short, a player has no learning motives for occupying the common safe arm at beliefs above p^V, but instead purely strategic ones.

Allowing the risky arms to be correlated would encourage informational free-riding and provide additional incentives to occupy the safe arm: a player then obtains information about her own risky arm by observing her opponent's experimentation while simultaneously collecting the certain flow payoff a. If the arms are positively correlated and player j does not produce a success, then player i forces her to experiment for longer than if the arms were independent. In fact, if the arms are perfectly correlated, player i has no incentive at all to switch back to her private arm if j does not have a success, and the players effectively play a preemption game similar to that of section 3. If the arms are negatively correlated, then if her opponent has no success player i returns to her risky arm earlier than with independent private arms.

5 Risky common arm

So far we have assumed that the arm subject to congestion is safe. In this section we show that our previous insights extend: a strategic option value also exists in a model where the common arm is risky.

Consider the following variation of our initial model. Each player i faces a two-armed bandit problem in which she continually chooses between activating her private arm, denoted A_i, or the common arm, denoted C. At the beginning of the game, nature chooses θ^{iX} from {0, λ} for i ∈ {1, 2} and X ∈ {A_i, C}. These choices are independent. When activated by player i, arm X produces a lump-sum payoff of 1 at the jumping times of a Poisson process with intensity θ^{iX}. Observe that since the player-specific types of the common arm are independent, player i cannot learn about θ^{iC} from observing player j's experimentation. In other words, the common arm has an unknown private value for each player. Finally, the common risky arm, C, is subject to the same precedence rules as before: the player currently occupying it has priority over its use.

Let p^{iA}_t denote the common belief at date t that player i's private arm is good (θ^{iA_i} = λ). Let p^{iC}_t denote the common belief at date t that the common arm is good when activated by player i (θ^{iC} = λ). The vector of common posterior beliefs at date t > 0 is denoted p_t := (p^{iA}_t, p^{iC}_t, p^{jA}_t, p^{jC}_t) ∈ [0, 1]^4. It is derived from the prior p_0 ∈ (0, 1)^4 using Bayes' rule and depends on the history of actions and successes on the interval [0, t). A Markov strategy for player i is a mapping k̃^i : p_t → {0, 1}, where k̃^i(p_t) = 1 indicates that player i chooses to activate A_i over the time interval [t, t + dt).

Since all good arms have the same arrival rate λ and pay the same lump-sum of 1 at


each Poisson event, once a player has a success on one arm, it is optimal for her to keep activating that arm forever. The strategic interaction effectively ends at that point: if the success happened on her private arm, a player will never compete for the common arm and her opponent faces a single DM problem. If the success happened on the common arm, the player will keep activating the common arm forever, leaving her opponent no other choice but to activate her private arm.

5.1 Single DM problem

First consider the single DM problem in this model when player j is absent.12 A Markov strategy for player i in the single DM problem is a mapping k̃^i_{DM} : (p^{iA}_t, p^{iC}_t) → {0, 1}, where k̃^i_{DM}(p^{iA}_t, p^{iC}_t) = 1 indicates that player i chooses to activate A_i over the time interval [t, t + dt). The DM's optimal policy requires activating the arm with the highest expected arrival rate at each date t ≥ 0. It is not well-defined when p^{iA}_t = p^{iC}_t.13

Given ∆ > 0, a ∆-strategy k̃^{i,∆}_{DM} for player i maps her belief (p^{iA}_t, p^{iC}_t) at dates t = n∆, n ∈ {0, 1, 2, ...}, into {0, 1}, where k̃^{i,∆}_{DM}(p^{iA}_t, p^{iC}_t) = 1 indicates that player i chooses to activate A_i over the time interval [t, t + ∆). The DM's optimal ∆-policy is well-defined. If p^{iA}_t ≠ p^{iC}_t, player i activates the arm with the highest expected arrival rate. If p^{iA}_t = p^{iC}_t, then over the time interval [t, t + 2∆), player i activates arm A_i over the interval [t, t + ∆). If there is no success, she activates arm C over the interval [t + ∆, t + 2∆). Player i's payoff from such a policy is decreasing in ∆.

We refer to the limit as ∆ → 0 of the optimal ∆-policy as the single DM optimal policy14 and denote it by k̃^{i∗}_{DM}. We define the value V(p^{iA}, p^{iC}) in the single DM problem to be the supremum over ∆ of payoffs from optimal ∆-policies. We derive the following expression for it in appendix D.1:

(7)    V(p^{iA}, p^{iC}) = p^{iA} λ/ρ + π(p^{iA}, p^{iC}) e^{−ρσ(p^{iA},p^{iC})} [V̄(p^{iC}) − p^{iC} λ/ρ]   if p^{iA} > p^{iC},
       V(p^{iA}, p^{iC}) = p^{iC} λ/ρ + π(p^{iC}, p^{iA}) e^{−ρσ(p^{iC},p^{iA})} [V̄(p^{iA}) − p^{iA} λ/ρ]   if p^{iA} < p^{iC},
       V(p^{iA}, p^{iC}) = V̄(p^{iA})                                                                        if p^{iA} = p^{iC},

Notice that this model is different from Keller, Rady, and Cripps (2005) and Keller and Rady (2010), since there is no safe arm. 13 i Contrast our model with an indivisible unit resource with a model with divisibility in which k˜DM maps player i’s beliefs into the interval [0, 1] and a good arm generates a success at date t with probability i iC iA iC ˜i λk˜DM (piA t , pt ) for arm Ai and λ(1 − kDM (pt , pt )) for C. In such a model, the single DM’s optimal i ∗ i ∗ policy when piA = piC is well defined and satisfies λk˜DM (p, p) = λ(1 − k˜DM (p, p)) for all p ∈ (0, 1). (See Presman and Sonin (1990).) 14 See Bellman (1957), Chapter 8.

23

where λ λ − 2p(1 − p) . V¯(p) := 1 − (1 − p)2 ρ λ + 2ρ

(8)

The first term in (8) reflects that player i is guaranteed a payoff of λρ unless both arms Ai and C are bad. The second term is the loss associated with spending time experimenting on the bad arm prior to the first success if only one of them is in fact good.

5.2

Game

Now consider the two-player game at date t ≥ 0, and assume that all arms are equally likely to be good, that is, pt = p¯ := (¯ p, p¯, p¯, p¯) for some p¯ ∈ (0, 1). It is feasible for the players to simultaneously use the single DM policy: over each time interval [t, t + 2∆), player i activates arm Ai and player j activates arm C on the interval [t, t + ∆) and viceversa on the interval [t + ∆, t + 2∆), with ∆ → 0. Conditional on no success, the players keep alternating on the common option. If over the interval [t, t+∆) player i activates arm C and it produces a success, the single DM policy prescribes that she keep activating the corresponding arm forever. The precedence rule implies that player i can keep activating arm C at date t + ∆ regardless of player j’s action. Thus, if the first success is produced by player i on arm C, player j is henceforth constrained to only activate Aj . It follows that, even though the players are able to behave according to the single DM policy, they do not obtain the single DM payoff, V (.). We let U i (pt ), denote player i’s payoff as ∆ → 0 if both her and player j implement the single DM policy. We derive the following expression in appendix D.2 for U i (pt ) evaluated ¯ when the players alternate activating the common arm: at pt = p,   ¯ = V¯(¯ (9) U i (p) p) (1 − pjC ) − pjC pjA L1 (¯ p) − pjC (1 − pjA )L2 (¯ p), where

L1 (¯ p) = L2 (¯ p) =

p¯(1−¯ p)λ3 , ρ(λ+2ρ)(3λ+2ρ) p¯(1−¯ p)λ3 . ρ(λ+2ρ)(2λ+2ρ)

If the common arm is bad for player j, player i is guaranteed the single DM payoff. But if C is good for player j, player i incurs a loss, since it is now possible that player j gets the first success on C, in which case, player i only receives a payoff if her private arm, Ai , is good. Player i suffers the greatest loss, L2 (¯ p), if player j’s private arm is bad, as player j would never lose interest in the common arm. If player j’s private arm is good, there is a chance that it produces a success in which case player j ceases to compete for the common arm. This mitigates the loss in expected payoff to player i, and it is easy to see that L1 (¯ p), the loss incurred in that case, is less than L2 (¯ p). 24

5.3

Deviation

Suppose that both players implement the single DM policy. We show that this is not a subgame perfect equilibrium by constructing a profitable deviation. Consider the symmetric ¯ and the following deviation for player i, parameterised by p ∈ [0, p¯]: initial state pt = p, ( 0 if piC < 1 and pjA ∈ (p, 1), i k˜dev (pt ; p) = i ∗ k˜DM (piA , piC ) if piC = 1 or pjA ∈ {1} ∪ [0, p], According to this deviation, starting in state pt = p¯ player i continuously activates the common arm C and only resumes the single DM optimal policy at the first of the following two events, conditional on her not producing a success on C in the meantime: either player j has a success on Aj , or date t + σ(¯ p, p) is reached at which pjA t+σ(¯ p,p) = p. In the terminology of section 4, whenever player i uses this deviation, she strategically15 forces player j to experiment. The opponent’s response is given by the single DM policy and is therefore well-defined. i Let Udev (pt ; p) denote player i’s payoff from such a deviation. In appendix D.3 we derive the following expression for it: i Udev (pt ; p) = F i (pt ) + G i (pt ) + H i (pt ; p),

where   jA λ jA λ λ+ρ p F i (pt ) := piC + (1 − p ) , t 2λ+ρ t t λ+ρ ρ h  λ λ iC λ iC G i (pt ) := pjA p + (1 − p ) piA t t 2λ+ρ t t ρ λ+ρ  i iC λ −ρσ(piA iC ¯(piC ) − piC λ , t ,pt ) V ) e , p + 2λ+2ρ π(piA t ρ t t t iC

jA −ρσ(pt ,p) H i (pt ; p) := hπ(piC t , p) π(pt , p) e  i jC jC jC i iA i iA U i (piA , p, p , p) − F (p , p, p , p) + G (p , p, p , p) , t t t t t t

and where (10)

iA ,p)

jC jC −ρσ(pt iA λ iA U i (piA t , p, pt , p) = pt ρ + π(pt , p) π(pt , p) e

h i U i (p) − p λρ

denotes player i’s payoff when both her and player j non-strategically implement the single jC DM policy in state (piA t , p, pt , p). We derive expression (10) in appendix D.2.2. Observe i ¯ p) evaluated at p = p¯ is player i’s payoff from not deviating in state pt = p¯ that Udev (p; ¯ and equals U i (p). ∗ i ¯ p). Proposition 2 states that for any symmetric initial Define p (¯ p) := arg maxp Udev (p; state there exists a belief threshold such that the deviation parameterised by this threshold 15

Player i activates only arm C in a state in which a single DM would alternate between Ai and C.

25

is profitable. Furthermore, the optimal deviation for player i has her forcing player j to experiment at most for a finite duration of time. ¯ both players implement the single Proposition 2. Suppose that, starting in state pt = p, p) ∈ (0, p¯), such that the deviation described DM policy. For all p¯ ∈ (0, 1), there exists p(¯ ∗ p) exists and satisfies 0 < p∗ (¯ p) < p¯. above is profitable. Moreover, p (¯ i ¯ p) attains a local minimum at p = p¯, (p; Proof : We show in appendix D.3 that Udev establishing that there exists a profitable deviation. Furthermore the payoff-maximising i ¯ p) is a continuous function of p on the compact support (p; threshold p∗ (¯ p) exists, since Udev ∗ [0, p¯]. It is clear that p (¯ p) > 0. 

5.4

Strategic option value

Player i’s payoff from strategically forcing the opponent to experiment is F i (pt ) if she i has a success on C before player j has a success on Aj . All remaining terms in Udev (pt ; p) reflect player i’s strategic option value from being able to return to Ai whenever she wishes. As in the case where the common option is safe (section 4.1) the strategic option value can be divided into two terms, G i (pt ) and H i (pt ), according to which of the following two events occurs first: player j has success on Aj before player i has a success on C, or neither player has a success and the belief pjA reaches the threshold p at which player i’s deviation ends and she resumes the single DM policy. The term G i (pt ) captures player i’s option value from being able to return to her private arm if player j has a success on Aj . In that case it becomes optimal for player j to keep activating her private arm forever. Player i is therefore guaranteed access to the common arms, and obtains the single DM payoff. At this stage, piA > piB and according to the single DM optimal policy, player i first experiments only on Ai before alternating between Ai and C once her beliefs about these two arms are equalised. Observe that the likelihood that the first Poisson event occurs on Aj and the resulting single DM value for player i depend on whether arm C is good for player i, and hence on piC . The term H i (pt ; p) captures player i’s option-value from being able to return to her private arm if neither her nor player j have produced a success over the time interval [t, t + σ(pjA , p)). If her opponent is not successful on Aj , player i benefits from eventually resuming her own experimentation. Indeed, since p∗ (¯ p) > 0, player i will not find it optimal to occupy C forever, even if both players are unsuccessful. To summarise: when the common arm is risky, player i has an incentive to continuously occupy the common arm when a single DM would not do so, and to strategically force her opponent to experiment for a finite duration of time. If her opponent has a success, player 26

i achieves the single DM payoff. If not, player i eventually returns to her own private arm. We have thus shown that when the common option is subject to congestion, the presence of another player who can generate a negative payoff externality makes the common arm more attractive than in the single DM problem, regardless of whether the common arm is safe or risky. We discuss the difference between the two cases in section 5.6.

5.5

Conjectured equilibrium

Observe that if we had assumed that player j responds to player i’s deviation by occupying the safe arm forever as soon as given the opportunity, Player i’s continuation payoff at the end of her deviation (see equation (26) in appendix D.3) should have been: λ i Udev (piA , p, pjC , p; p) = piA . ρ It can be shown that even in that case p∗ (¯ p) > 0, and it is optimal for player i to strategically force player j to experiment only for a finite duration of time. We therefore conjecture the existence of a MPE, characterised by a sequence {¯ pn }∞ n=0 , in which the players alternately force each other to experiment. Starting in state p0 = (¯ p0 , p¯0 , p¯0 , p¯0 ), with p¯0 ∈ (0, 1), both players attempt to activate the common arm, which is therefore allocated in a tie-break (say to player i). Player i then strategically forces player j to experiment for a finite duration, until either player i has a success on C, or player j on Aj , or until pjA reaches the threshold p¯1 ∈ (0, p¯0 ). At that point, player i returns to Ai and player j is able to activate C. In line with the single DM policy, she keeps activating C until pjC reaches p¯0 , conditional on no success. Our argument can then be repeated for player j in state (¯ p1 , p¯1 , p¯1 , p¯1 ): she has an incentive to remain on C and strategically force player i to experiment until either player produces a success or until pjC reaches the threshold p¯2 ∈ (0, p¯1 ). And so on. The values {¯ pn } ∞ n=0 are determined in equilibrium.

5.6

Discussion

We conclude this section with a discussion of the similarities and differences between the models with a safe or a risky common arm. We have seen in section 4.1 that when the common arm is safe, player i’s option value from strategically forcing her opponent to experiment decreases as the opponent’s likelihood of success decreases. This effect is also present when the common arm is risky. The riskiness of the common arm generates two additional effects. First, as long as player i activates the common arm C without a success, her own expected payoff from 27

activating it decreases. This has a shortening effect on the duration for which she is willing to strategically force player j to experiment relative to the case with a safe common arm, where the payoff from activating the common arm remains constant. Second, if player i has a success on C, player j will be forced to experiment forever. Even p) > 0 and strategically forced experimentation ends in finite time, player j is though p∗ (¯ not guaranteed future access to the common arm. This kind of uncertainty is absent when the common arm is safe. This last effect reduces player j’s payoff from being strategically forced to experiment and strengthens her preemption motives. To finish, observe that the case where the common arm is risky and the private arms are safe is straightforward: for sufficiently high priors, both players immediately switch the the common arm which is then allocated in a tie-break. A player occupying the common arm implements the single DM policy. If the first player has a success, she remains on the common arm forever and her opponents never gets a chance to experiment. If she does not have a success, she leaves when her belief reaches the single DM threshold, and it is her opponent’s turn to implement the single DM policy. This is inefficient, as the planner would optimally alternate the players on the common risky option.

6

Efficient Benchmarks

In this section we present the planner solutions for the games with irrevocable and revocable switching when the common arm is safe.

6.1

Planner solution - Irrevocable switching

We begin by assuming that a switch to the safe arm is irrevocable. The social planner maximises the sum of both players’ payoffs. Given a prior (p10 , p20 ), the planner therefore faces the following stopping problem: At each date t ≥ 0, given his past policy and past realised successes, the planner must choose between letting both players activate their risky arms over the short time interval [t, t + dt) (we call this regime RR), or retiring one players to the safe arm irrevocably so that the other player must continue to activate her risky arm forever (regime RS). For any prior (p10 , p20 ), the planner’s strategy can be expressed in terms of a belief threshold (p1W , p2W ) at which the planner irrevocably switches to regime RS, or equivalently, a stopping date sW conditional on at least one arm not producing a success on the interval [0, sW ). At the switching date, the planner maximises the joint continuation payoff by allocating the player with the lowest expected Poisson arrival rate to the safe arm. The

28

optimal stopping date maximises   Z s W  a λ 1 1 2 2 −ρt 1 2 −ρsW + max(psW , psW ) | (p0 , p0 ) , E e (pt + pt )λdt + e ρ ρ 0 where (p10 , p20 ) is the vector of prior beliefs, and the expectation is taken with respect to the process {p1t , p2t }t≥0 given sW . Let W(p1 , p2 ) denote the value function associated with this problem. It solves the dynamic program 27 in appendix E. We derive the planner solution when switching is irrevocable. For the statement of Lemma 1, without loss of generality we may assume that pi ≥ pj . Lemma 1. In states (pi , pj ) ∈ [0, 1]2 , the regime RR is socially optimal if and only if pj ≥ BW (pi ) where (11)

BW (pi ) :=

aρ . λ (λ + ρ − a + ρ V (pi ) − pi λ)

Otherwise the regime RS is socially optimal. Proof: See appendix E.  The equation pj = BW (pi ) determines a switching line in the belief space [0, 1]2 for pi ≥ pj . It defines the set of optimal belief thresholds (piW , pjW ): Given a prior belief (pi0 , pj0 ) at which RR is optimal, the optimal regime change to RS occurs if and when the posterior beliefs (pit , pjt ) reach the switching line, as illustrated in Figure 6 below. First, notice that for pj ≤ pi < 1, we have BW (pi ) < pV : Under the planner solution, the player who is allocated the safe arm must first experiment beyond the single DM threshold. This is because the planner internalises that player j switching to the safe arm cancels the option value it affords player i. That option value, V (pi ) − a/ρ, is higher for lower values of pi . It appears at the denominator of (11), and the discrepancy between BW (pi ) and player j’s single DM threshold decreases with pi . The discrepancy is maximised for pi0 = pj0 . The threshold in the planner’s solution is then p∗W defined in (37). In contrast, when pi = 1, player i knows that her risky arm is good and derives no option value from being able to switch to the safe arm. In that case the players are not competing for the safe arm. We therefore have BW (1) = pV , and the planner solution amounts to letting player j implement the single DM optimal policy.

29

Figure 6: States in which the regime RS is optimal (shaded region) and threshold beliefs in the planner solution when switching is irrevocable.

In Figure 6, we illustrate a typical trajectory of the state under the planner solution. Since p10 > p20 > BW (p10 ), the planner begins by implementing RR. If no success occurs, the beliefs decrease according to (1), eventually reaching the switching line p2 = BW (p1 ). The planner then allocates player 2 to the safe arm, forcing player 1 to activate her risky arm forever. Even though all equilibria described in section 3 allocate the safe arm to the player with the lowest expected arrival rate, they are generically inefficient in the sense that there is less experimentation than under the planner solution, conditional on no success. (Following a success the equilibrium is always efficient.) The inefficiency results from the players’ preemption motives. These are stronger for closer priors. The most extreme inefficiencies arise for p10 = p20 , when competition for the safe arm is most intense. In the planner solution, both players switch to the safe arm at pW < pV whereas in our equilibrium they switch at the myopic threshold pM .

6.2

Planner solution - Revocable switching

Finally, we consider the planner’s problem when the decision to allocate one player to the safe arm is revocable. The planner effectively faces the following three-armed bandit problem: At each date he must activate, over a short time interval16 of length ∆, two arms of a three-armed exponential bandit with two independent risky arms and one safe arm. 16

We present the solution to the planner problem as ∆ → 0. Following Bellman (1957) Chapter 8, this is a valid approximation to the continuous time problem. In appendix F we discuss how our solution coincides with the solution to a planner problem with divisible resources.

30

For any beliefs (p1t , p2t ), over the short time interval [t, t + ∆), the planner can either let each player activate her risky arm (policy RR), or let one of the players activate the safe arm (policy RS). In states in which policy RS is optimal, the planner maximises the likelihood of a success, and therefore the joint payoff, by letting the player with the highest expected Poisson arrival rate activate her risky arm, and the other player activate the safe arm. At each date t ≥ 0, the planner’s strategy κ ¯ maps the beliefs (p1t , p2t ) ∈ [0, 1]2 into {0, 1}, where κ ¯ t := κ ¯ (p1t , p2t ) takes the value 1 if and only if policy RR is chosen in state κt }t≥0 so as to maximise the expected (p1t , p2t ). The planner’s objective is to choose a path {¯ discounted joint payoff: Z E



e

−ρt

κ ¯ t (p1t

+

p2t )

λ + (1 −

κ ¯ t )[max(p1t , p2t )

  1 2 λ + a] dt | (p0 , p0 ) ,

0

where (p10 , p20 ) ∈ [0, 1]2 is the vector of prior beliefs, and the expectation is taken with κt }t≥0 . respect to the processes {p1t , p2t }t≥0 and {¯ 1 2 Let U(p , p ) denote the joint value function associated with this problem. It solves the following dynamic program: for all (p1 , p2 ) ∈ [0, 1]2 , (12)

U(p1 , p2 ) = max{LRR U(p1 , p2 ), LRS U(p1 , p2 )},

where LRR U and LRS U denote the joint value under policy RR and RS respectively. We first derive an expression for the joint payoff from implementing the policy RS. Without loss of generality, consider states pi ≥ pj such that player i’s risky arm has a higher expected Poisson arrival rate than player j’s risky arm. For all remaining states such that pi ≥ pj , the policy RS requires layer i to activate her risky arm while player j activates the safe arm. If over the time interval [t, t + dt) player i’s risky arm produces a success, the state jumps to (1, pj ). If it does not produce a success, the state evolves continuously to (pi0 , pj ) according to the law of motion (1). The joint value generated by the policy RS thus satisfies the recursion:   λ RS i j i j (13) L U(p , p ) = adt + p λdt 1 + + V (p ) + (1 − pi λdt)(1 − ρdt) LRS U(pi0 , pj ), ρ where pi0 = pi − pi (1 − pi )λdt. Finally, if pi = pj , policy RS prescribes alternating the players on the safe arm, generating the payoff A(p) derived in appendix F.1. Therefore, for pi ≥ pj , the value generated by the policy RS is:  i h λ LRS U(pi , pj ) = aρ + pi λρ + pi λ+ρ V (pj ) − aρ  h  i (14) i j λ +π(pi , pj ) e−ρσ(p ,p ) A(pj ) − aρ + pj λρ + pj λ+ρ V (pj ) − aρ . 31

This expression is derived in appendix F.2. The first term gives the payoff from letting player i activate her risky arm, and player j the safe arm. If they did this forever the joint payoff would be aρ +pi λρ . Instead, if a success occurs, the planner lets player j implement the single DM optimal policy instead of activating the safe arm. This happens at discounted λ if player i’s risky arm is good and generates the bonus: V (pj )− aρ . Once pi = pj the rate λ+ρ regime RS requires alternating the players on the safe arm, generating the payoff A(pj ). This occurs if player i’s risky arm does not generate a success before pi reaches pj , which i j happens with probability π(pi , pj ), and the resulting payoff gain is discounted by e−ρσ(p ,p ) . We now turn our attention to the joint payoff under policy RR. It satisfies the recursion: + (1 − p1 λdt)(1 − p2 λdt)(1 − ρdt)LRR U(p10 , p20 ) LRR U(p1 , p2 ) = p1 λdt p2 λdt 2 λ+ρ ρ   20 (15) + (1 − ρdt)V (p ) +p1 λdt (1 − p2 λdt) λ+ρ  ρ  +p2 λdt (1 − p1 λdt) λ+ρ + (1 − ρdt)V (p10 ) ρ where pi0 = pi −pi (1−pi )λdt. This functional equation is the same as that for LRR W(p1 , p2 ), the joint value under regime RR in the planner problem when switching is irrevocable (equation (28)). We derived the solution (35) for it in appendix E. The planner solution can be expressed in terms of the switching line pj = BU (pi ) defined by equation (40) in appendix F and depicted in Figure 7 below. Lemma 2. Assume without loss of generality that pi ≥ pj The regime RR is optimal if and only if pj ≥ BU (pi ). Otherwise regime RS is optimal. Proof : See appendix F.3. 

Figure 7: Set of states in which policy RS is optimal (shaded area) and threshold beliefs in the planner problem when switching to the safe arm is revocable.

32

Given a prior belief (pi0 , pj0 ) ∈ [0, 1]2 at which RR is optimal, the optimal regime change to RS occurs if and when the posterior belief (pit , pjt ) reaches the switching line. Let sU denote the date at which, conditional on no success, the state reaches this threshold. Then for any t ∈ [0, sU ), the joint payoff under the planner solutions is given by: U(p1t , p2t ) = p1t λρ + p2t λρ (16)

i i h h + 1 − π(p1t , p1sU ) V (p2t ) − p2t λρ + 1 − π(p2t , p2sU ) V (p1t ) − p1t λρ i h  1 1 +π(p1t , p1sU ) π(p2t , p2sU ) e−ρσ(pt ,psU ) LRS U(p1sU , p2sU ) − p1sU λρ + p2sU λρ .

The first line gives the joint payoff if the planner were to let both players activate their risky arms forever. The planner switches to regime RS at date sU if neither player has produced a success in the time interval [t, sU ). This occurs with probability π(p1t , p1sU )π(p2t , p2sU ). The expected discounted increase in payoff resulting from this regime change is given by the last line of (16). If only one risky arm, say player 1’s, produces a success before date sU (this occurs with probability 1−π(p1t , p1sU )), the planner lets player 1 activate her risky arm forever and player 2 implement the single DM’s optimal policy. The joint payoff therefore increases by the single DM option value V (p2t ) − p2t λρ which accrues to player 2 from now being able to switch to the safe arm in case she does not produce a success by the time her belief reaches the single DM’s threshold pV . We now describe a typical trajectory of the state under the planner solution. Consider the initial state (p10 , p20 ) in Figure 7. Since p10 > p20 > BU (p10 ), the planner initially lets both players activate their risky arms. If no success occurs, both beliefs decrease according to the law of motion (1) and the state eventually reaches the switching line p2 = BU (p1 ). The planner then allocates player 2 to the safe arm while keeping player 1 on her risky arm. The belief p2 remains constant while p1 continues to decrease. Once p1 reaches p2 , the planner alternates the two players on the safe arm as long as no success occurs, and the state evolves along the line p1 = p2 . Notice that BU (1) = pV . If player 1’s arm is known to be good it is socially optimal for player 2 to adopt the single DM policy. For p1 < 1 however, there exist states such that the planner policy prescribes regime RS even though both players’ beliefs are above the single DM threshold. In these states, letting both players implement the single-DM policy (i.e. implementing regime RR) instead would only be profitable if there were two safe arms. With only one safe arm, the increase in current payoff from activating the safe arm while experimenting under RS outweighs the possible gains from more intensive experimentation under RR. Conditional on no success, the equilibrium derived in section 4 is clearly inefficient. First, for priors satisfying Case 2, the safe arm is misallocated and the players with the 33

lowest expected arrival rate continues experimenting while the other player activates the safe arm. Second, players alternate on the safe arm at most once, whereas the planner solution requires alternating infinitely often.

7

Concluding Remarks

We have analysed a game of strategic experimentation in which two players compete over the use of a common option. In an attempt to reduce the resulting payoff externality, a player has an incentive to interrupt her own experimentation and strategically force her opponent to experiment. She benefits if the opponent’s experimentation is successful. If the opponent does not succeed, the first player eventually resumes her own experimentation and leaves the common option for her opponent to take. If the common option is risky, this process may be repeated many times. The main insight of this paper is that competition may make options more attractive to players than they would be in its absence. Thus, with congestion, a player’s behaviour is not only motivated by the exploration/exploitation tradeoff of the standard multiarmed bandit decision problem, it is sometimes primarily aimed at deflecting her opponent’s interests away from her own. The model we propose is simple enough to envisage various extensions. We have considered an extreme form of congestion whereby an agent can monopolise an option and entirely exclude her competitor. We could instead envisage milder forms of competition, for instance by adding a second common option delivering a certain flow payoff b ∈ [0, a]. The safe option could then be thought of as a “safe” market, and the first firm to become active in it is able to capture a larger market share. Finally, in this paper, we assume that players observe one another’s actions and payoffs. The players therefore have common beliefs about the likelihood of their respective successes. In a related paper (Thomas (2013)) we investigate how players behave if only actions are publicly observed, while payoffs are private information. The analysis is more complicated and the unique equilibrium in mixed strategies. However, our main insights generalise to this more complicated setting.

34

8

Appendix

A

Proof of Theorem 1

We derive the MPE of the game with irrevocable switching. The proof proceeds as follows: we first show that player i’s payoff from activating her risky arm until some date τ at which player j switches to the safe arm is increasing in her continuation payoff at date τ . We then show that in order to maximise their continuation payoff both players have incentives to preempt their opponent’s switch. We then proceed by backward induction and fully characterise the players’ equilibrium best-response correspondences. From now on we use the approximation e−ρdt ' (1 − ρdt). In what follows, fix an arbitrary initial state (p10 , p20 ) ∈ (pM , 1)2 with pi0 ≥ pj0 (w.l.o.g.). Suppose that player j’s strategy prescribes that over the interval [t, τ ) she continuously activate her risky arm, and that she switches to the safe arm at date τ conditional on no success on [t, τ ). We now determine player i’s best response. If piτ < pV , player i de facto faces no competition from player j, and implements the singleplayer optimal policy, switching to the safe arm if and when her posterior belief reaches the threshold pV . If piτ ≥ pV , let w(p1t , p2t ) denote player i’s expected discounted payoff from also activating her risky arm from date t ≥ 0 until date τ . It satisfies the recursion (2) with k i (p1s , p2s ) = k j (p1s , p2s ) = 1 for all dates s belonging to [t, τ ). Simplifying this recursion using a Taylor expansion about pi for V (pi0 ) and about (p1 , p2 ) to simplify w(p10 , p20 ), then eliminating terms in o(dt2 ), we find that w(p1 , p2 ) satisfies the following PDE: (17)

∂ w(p1 , p2 ) ∂p1 λ+ρ j i λ + p λ V (p ).

(p1 λ + p2 λ + ρ) w(p1 , p2 ) + p1 λ(1 − p1 ) = pi λ

+ p2 λ(1 − p2 )

∂ w(p1 , p2 ) ∂p2

p1 e−λs

t Let w(s) ˜ := w(p1s , p2s ), where p1s = p1 e−λs is the posterior belief at date s conditional on +1−p1t t player 1’s risky arm having been continuously activated over [t, s) without a success (similarly for 1 dp2 ∂ ∂ p2s ). Notice that ddsw˜ = dp ˜ as a function ds ∂p1 w + ds ∂p2 w, we obtain the following ODE for w(s) of s: λ+ρ w ˜ 0 (s) − (p1s λ + p2s λ + ρ) w(s) ˜ = −pis λ − pjs λ V (pis ), λ R 1 2 Multiplying both sides by the integration factor e −(ps λ+ps λ+ρ)ds  1 λ/ρ 1 − p10 1 1 − p20 1 ps 1 − p10 (18) = , 1 − p1s p10 p10 1 − p1s p20 1 − p2s

(19)

=

1 e−(2λ+ρ)s , p1s p2s

we obtain i R R d h 1 2 1 2 w(s) ˜ e −(ps λ+ps λ+ρ)ds = e −(ps λ+ps λ+ρ)ds ds

35

  λ+ρ j i i −ps λ − ps λ V (ps ) . λ

Integrating both sides from date t to date τ and rearranging, we obtain the following expression for w(t): ˜ R  Z τ R −(p1 λ+p2 λ+ρ)ds  1 2 s s λ+ρ e e −(pτ λ+pτ λ+ρ)dτ i j i R R − −ps λ (20) w(t) ˜ = w(τ ˜ ) − ps λ V (ps ) ds. 1 2 −(p1t λ+p2t λ+ρ)dt λ e −(pt λ+pt λ+ρ)dt t e Using dpis = −pis λ(1 − pis )ds, we perform a change of variable in the integral in (20) and rewrite it as :   Z piτ R λ+ρ 1 1 1 λ+p2 λ+ρ)ds i −(p j i s s R ps λ e (21) + ps λ V (ps ) dpi , 1 λ+p2 λ+ρ)dt i −(p λ ps λ(1 − pis ) s t t e pit ρ   1−pis pV 1−pis λ a λ over where pjs := pjs (pis ). Since piτ ≥ pV , we have V (pis ) = pis λρ + 1−p − p V i ρ ρ 1−pV ps V the entire domain of integration. We split expression (21) into three parts:  R piτ R −(p1 λ+p2 λ+ρ)ds λ+ρ 1 1 i s s R e 1 2 λ 1−pis dps pit e −(pt λ+pt λ+ρ)dt R pi R 1 2 1 i + piτ e −(ps λ+ps λ+ρ)ds pjs λρ 1−p i dps s t   ρ R piτ R −(p1 λ+p2 λ+ρ)ds pjs 1  a i λ 1−p p λ i V s s s dps . + pi e ρ − pV ρ 1−pV pi pi 1−pV s

t

s

Using expression (18) for the integration factor in each integral and rewriting the terms in pjs using e−λs = to

pis

1−pj0 pjs pj0 1−pjs

=

1−pi0 pis pi0 1−pis

and

1−pj0 1 pj0 1−pjs

=

1−pi0 pis pi0 1−pis

+

1−pj0 , pj0

we integrate with respect

and obtain: 1 R

e

2 −(p1 t λ+pt λ+ρ)dt



λ λ+ρ ρ 2λ+ρ

 e−(2λ+ρ)τ − e−(2λ+ρ)t +  e−(2λ+ρ)τ − e−(2λ+ρ)t  j  i

λ + λρ 2λ+ρ  + aρ − pV

λ ρ

j

λ 1−p0 ρ pj



0

pV 1−pi0 1−pV pi0

1 1−p0 1−p0 1−pV pi0 pj0

e−(λ+ρ)τ − e−(λ+ρ)t

ρ  λ

pjτ 1−pjτ



pjt 1−pjt



.

Using expression (19) for the integration factor to divide the first three terms, and expression (18) to divide the last term, we simplify further and obtain:  i λ 1−pi 1−pjt pτ 1−pit ρ −pit λρ + piτ λρ 1−pit j 1−piτ pit τ 1−pτ     ρ j  1−pt 1−pit pV 1−pit λ − aρ − pV λρ 1− . j i 1−pV 1−pV p 1−pτ

t

We can now plug this expression back into (20). Noticing that R

1

2

e R −(pτ λ+pτ λ+ρ)dτ 1 2 e −(pt λ+pt λ+ρ)dt

=

1−pit 1−pjt 1−piτ 1−pjτ



piτ 1−pit 1−piτ pit

λ ρ

,

we obtain the following expression for w(t): ˜

(22)

     ρ 1−pjt 1−pit pV 1−pit λ w(p1t , p2t ) = w(t) ˜ = pit λρ + aρ − pV λρ 1− 1−pV 1−pV pit 1−pjτ   λ j  i i 1−p 1−pt pτ 1−pit ρ + w(p1τ , p2τ ) − piτ λρ 1−pit . j 1−pi pi τ

36

1−pτ

τ

t

The first term of the expression above is the expected value player i would receive from activating her risky arm forever. If player j produces a success before date τ , player i gains the option value of being able to switch to the safe arm if she does not produce a success before her belief reaches pV . If neither player has a success before date τ , player j switches to the safe arm and player i’s continuation value, w(p1τ , p2τ ), depends on her strategy at τ . It is easy to see that w(p1t , p2t ) is an increasing function of w(p1τ , p2τ ). Let’s now determine player i’s best-response if player j switches to the safe arm at date τ and piτ ≥ pV . For some arbitrarily small ∆ > 0, • if player i activates her risky arm at τ , then w(p1τ , p2τ ) = piτ λρ , • if player i also switches to the safe arm at τ , then w(p1τ , p2τ ) = ι

a ρ

+ (1 − ι) piτ λρ ,

• if player i switches to the safe arm at τ − ∆, then w(p1τ −∆ , p2τ −∆ ) = ∆ → 0, w(p1τ −∆ , p2τ −∆ ) → aρ (preemption).

a ρ

and in the limit, as

For ι ∈ (0, 1), the relative magnitude of these terms depends solely on the position of piτ relative to the myopic threshold belief, pM . Player i is only indifferent between these three options when piτ = pM . When piτ > pM , player i strictly prefers letting player j occupy the safe arm and being forced to activate her risky arm forever, to occupying the safe arm herself. When piτ < pM , player i is strictly prefers preempting player j’s switch to the safe arm. We now show that the decision to preempt the opponent’s switch to the safe arm unravels in all states (p1 , p2 ) such that p1 < pM and p2 < pM . Once her posterior belief reaches the single-player threshold, switching to the safe arm is a dominant strategy for a player. So in states (p1 , p2 ) such that p1 < pV and p2 < pV , both players strictly prefer switching to the safe arm. As just discussed, player i has a strict incentive to preempt her opponent’s switch as long as pi < pM . Therefore, the decision to switch to the safe arm unravels and for either player, switching to the safe arm is a best-response for all (p1 , p2 ) such that p1 < pM and p2 < pM . For the remainder of the proof, assume without loss of generality that pi ≥ pj . Consider states such that pi = pM and pj < pM . Whereas player j has strict preferences for preempting player i’s switch to the safe arm, player i is now indifferent between preempting player j and letting her take the safe arm. In equilibrium, player i’s best-response is to activate her risky arm at piτ = pM . Suppose this were not the case and player i did instead attempt to take the safe arm at pM . Then player j would best-respond by switching to the safe arm at some τ 0 < τ and would be guaranteed access to the safe arm since player i never finds it optimal to preempt player j at belief piτ 0 > pM . From the single-player decision problem however, we know that, as long as pjτ 0 > pV , conditional on preempting player i, player j finds it optimal to switch to the safe arm as late as possible. In our continuous-time setting, the problem max{τ 0 ∈ R+ |τ 0 < τ } has no solution. Hence, there cannot be an equilibrium in which player i switches to the safe arm when she is indifferent and player j has a strict incentive to preempt.

37

For the same reason, if p10 = p20 , in equilibrium both players switch to the safe arm when p1τ = p2τ = pM and the safe arm is allocated in a tie-break. In states (p1 , p2 ) such that pi > pM and pj ≤ pM , whereas player j would have strict incentives to preempt player i’s switch to the safe arm, player i strictly prefers letting player j have the safe arm. So player j effectively faces the single-player decision problem and switches to the safe arm if pj ≤ pV . For all remaining states, both players strictly prefer activating their risky arm. We have thus established the form of the best-response correspondence in Theorem 1. 

B

Value of forcing the opponent’s experimentation

We derive an expression for Si0 (pi , pj ) the value function solving the dynamic program (3) faced by player i in a subgame where she is occupying the safe arm and strategically forcing player j to experiment. As long as it is optimal for player i to force player j to experiment, Si0 (pi , pj ) = LS Si0 (pi , pj ). Using a Taylor expansion about (pi , pj ) for Si0 (pi , pj0 ) and eliminating terms in o(dt2 ), we find that Si0 solves the following ODE: (23)

ρ + pj λ a + pj λV (pi ) ∂ i i j i i j S (p , p ) + S (p , p ) = . 0 0 ∂pj pj λ(1 − pj ) pj λ(1 − pj )

Multiplying both sides of this expression by the integration factor  j  λρ R ρ+pj λ 1 p dpj j j = , e p λ(1−p ) 1 − pj 1 − pj we obtain " #  j  λρ  j  λρ 1 1 ∂ p p a + pj λV (pi ) i i j = (24) S (p , p ) . 0 ∂pj 1 − pj 1 − pj 1 − pj 1 − pj pj λ(1 − pj ) The indefinite integral with respect to pj of the right-hand-side of (24) equals 1 1 − pj



pj 1 − pj

 λρ 

  a a j λ i +p V (p ) − . ρ λ+ρ ρ

Integrating both sides of (24) from some threshold pj < pj to pj we obtain the following family    pj  λρ  ρ of solutions to (23): R pj λ a+xλV (pi ) 1 1 x i i j S0 (p , p ) + pj 1−x 1−x dx xλ(1−x) 1−pj 1−pj i i j j S0 (p , p ; p ) = .  ρ 1 1−pj

pj 1−pj

λ

For any pi , Si0 (pi , pj ; pj ) is continuous and differentiable at pj . Switching back to her risky arm and letting player j occupy the safe arm forever is optimal for player i at pj if and only if Si0 (pi , pj ; pj )|pj =pj = LR Si0 (pi , pj ) = pi λ/ρ (value-matching) and ∂p∂ j Si0 (pi , pj )|pj =pj = 0 (smoothpasting). Using these constraints in ODE (23) we obtain the optimal threshold: p

j∗

a iλ 1 p ρ−ρ = =: pjSi (pi ) 0 λ V (pi ) − pi λρ

38

For pi ≥ pM , pjSi (pi ) ≥ 0 and Si0 (pi , pj ; pjSi (pi )) is convex in pj and simplifies to the expression in 0 0 Proposition 1 . We conclude that player i’s optimal policy is to occupy the safe arm as long as pj > pjSi (pi ) and to return to her risky arm and let player j occupy the safe arm forever when 0

pj ≤ pjSi (pi ). For pi < pM , pjSi (pi ) < 0 and it is never optimal for player i to let j return to the 0

0

safe arm. We therefore have Si0 (pi , pj ) =

C

a ρ

+ Gi (pi , pj ). 

Proof of Theorem 2

We derive the MPE of the game with revocable switching. We begin by considering the last subgame with strategically forced experimentation. Let us determine player i’s best response to player j switching to the safe arm at date τ , and strategically forcing player i to experiment: i) (for pjτ ≤ pM ) until she produces a success; or ii) (for pjτ > pM ) until she produces a success, or until pi reaches pi j (pjτ ), whichever occurs S0 first. For some arbitrarily small ∆ > 0, • if player i continues activating her risky arm at date τ , her continuation payoff is Ri0 (piτ , pjτ ), • if player i also switches to the safe arm at date τ , a tie-break allocates the safe arm and player i’s continuation payoff is ι Si0 (piτ , pjτ ) + (1 − ι)Ri0 (piτ , pjτ ), • if player i switches to the safe arm at date τ − ∆, then in the limit, as ∆ → 0, her continuation payoff tends to Si0 (piτ , pjτ ) (preemption). Hence, player i is strictly better-off preempting player j’s switch to the safe arm whenever Ri0 (piτ , pjτ ). By construction, the states (piτ , pjτ ) for which this is the case satisfy where Bi (.) is defined in expression (6).

Si0 (piτ , pjτ ) > piτ < Bi (pjτ ),

We now proceed by backward induction. For all states (p1 , p2 ) ∈ [0, pV )2 , switching to the safe arm is a strictly dominant action for both players. For all (p1 , p2 ) such that Si0 (pi , pj ) > Ri0 (pi , pj ) for both i ∈ {1, 2}, both players have a strict incentive to preempt their opponent, and the decision to switch to the safe arm unravels. Thus switching to the safe arm is a best-response for all (p1 , p2 ) such that p1 < B1 (p2 ) and p2 < B2 (p1 ). That set of states is depicted in blue (shaded) below.

39

For the remainder of the proof, assume w.l.o.g. that pi ≥ pj . Consider the states such that pj = Bj (pi ) and pi < Bi (pj ). While player i has a strict preference for preempting player j’s switch to the safe am, player j is indifferent between preempting player i’s switch and letting her take the safe am. In equilibrium, player j’s strategy is to activate her risky arm when indifferent, letting player i take the safe am. Assume this were not the case and that instead player j takes the safe arm in that state. Then player i best-responds by preempt player j’s switch by some ∆. From the single-player problem we know that, since pi > pV , conditional on preempting j, player i wants to switch to the safe arm as late as possible. In continuous time, this problem would have no solution, and therefore it cannot be an equilibrium for player j to activate her risky arm when indifferent. A similar argument applies to the state (pU , pU ) satisfying pU = Bi (pU ), pU = Bj (pU ). Here, both players are indifferent between activating their risky arm and switching to the safe am. In equilibrium, both players must switch to the safe arm with certainty. In states (pi , pj ) such that pi ≥ Bi (pj ) and pj ≤ pV , switching to the safe arm is the dominant action for player j, and player i strictly prefers activating her risky arm and letting player j have the safe am. In states (pi , pj ) such that pi ≤ Bi (pj ) and pj > Bj (pi ), only player i has an incentive to preempt player j’s switch. Player j strictly prefers letting player i take the safe am, i.e. player j’s best-response is to activate her risky arm regardless of whether player i switches to the safe arm or not. From the single-player decision problem, player i’s best-response is therefore to also activate her risky arm. For all remaining states, both players strictly prefer activating their risky arm. Player i’s expected discounted payoff when both players activate their risky arms over the interval [t, s) is LR Ui (pit , pjt , 1). It satisfies the same partial differential equation (17) as LR Wi (pit , pjt , 1) , for which we have given the solution (22) in appendix A. Thus, LR Ui (pit , pjt , 1) is a strictly increasing function of player i’s continuation payoff at date s. Letting s be the first date at which j is just indifferent about preempting player i’s switch completes the argument. We have thus established two things. First, in the unique equilibrium of this game the players use the strategies given in Theorem 2. Second, we have established by backward induction that for any given initial state there is at most one subgame with strategically forced experimentation. If the player strategically forcing her opponent to experiment returns to her risky am, it is in a state in which her opponent’s strategy prescribes that she henceforth occupy the safe am, not for strategic motives but because it is her dominant action. Therefore, Si (pi , pj ) = Si0 (pi , pj ) and Ri (pi , pj ) = Ri0 (pi , pj ). 

40

D

Risky contested arm

D.1

Value in single DM problem.

iC For piA t = pt =: pt , when the DM alternates between the two arms over the time interval dt, her payoff V¯(pt ) satisfies the recursion (we drop the index t for exposition):

λ+ρ V¯(p) = pλdt + (1 − pλdt − ρdt)V¯(p0 ). ρ Using a Taylor series expansion about p for V¯(p0 ) and simplifying, we obtain the ODE in p: λ+ρ λ . V¯(p)(pλ + ρ) + p(1 − p) V¯0 (p) = pλ 2 ρ Solving gives expression (8). The remainder of expression (7) for V (piA , piC ) is obtained as in the single DM problem with a safe common arm. 

D.2

Value from single DM policy in 2-player game.

D.2.1

pt = p¯

jC jA iC Starting at date t with piA t = pt = pt = pt , assume that player j implements the single DM policy. We derive an expression for U i (pt ), player i’s payoff from also using the single DM policy. Over the interval [t, t + dt) each players activates her own risky arm and the common arm, each for a duration dt/2. If player i produces a success on Ai , the single DM policy prescribes that she henceforth only activate Ai , thus receiving continuation payoff λ/ρ. Similarly if player i has a success on C. If player j has a success on C, she henceforth only activates C and player i loses λ access to it forever. Player i’s continuation payoff in that case is therefore piA t ρ . If player j has a success on Aj , she henceforth only activates Aj and player i’s continuation payoff is the single DM payoff V¯(piA t ). i Thus, U (pt ) satisfies the recursion (dropping the index t for exposition):

λ+ρ λ+ρ iC λ jC λ dt piA λ + pjA λ dt V¯i (piA ) ρ + p 2 dt ρ + p 2 2  ρ (piA + piC + pjC + pjA ) λ2 dt − ρdt U i (piA0 , piC0 , pjC0 , pjA0 )

U i (piA , piC , pjC , pjA ) = piA λ2 dt + 1−

with piX0 = piX (1 − piX )λdt/2 for i ∈ {1, 2} and X ∈ {A, C}. Using a Taylor series expansion and simplifying, we obtain the following PDE:  (piA + piC + pjC + pjA ) λ2 + ρ U i (piA , piC , pjC , pjA ) = iC λ λ+ρ + pjC λ piA λ + pjA λ V¯i (piA ) piA λ2 λ+ρ ρ +p 2 ρ 2 ρ 2 ∂ U i (piA , piC , pjC , pjA ) ∂piA −piC (1 − piC ) λ2 dt ∂p∂iC U i (piA , piC , pjC , pjA ) −pjC (1 − pjC ) λ2 dt ∂p∂jC U i (piA , piC , pjC , pjA ) −pjA (1 − pjA ) λ2 dt ∂p∂jA U i (piA , piC , pjC , pjA ).

−piA (1 − piA ) λ2 dt

41

Solving, we obtain: U i (piA , piC , pjC , pjA ) =  iAi  iC )+piC (1−piAi ))λ iAi piC λ λ+ρ (1 − pjC ) (p (1−p λ+2ρ + 2 p 2λ+2ρ ρ   iAi  i n h  iAi p (1−piC )λ (p (1−piC )+piC (1−piAi ))λ piAi piC λ λ+ρ piAi piC λ λ jC jAj +p + 2 + + (1 − p ) 2λ+2ρ 3λ+2ρ  ρ 3λ+2ρ  ρ  iAi2λ+2ρiC h  iAi p (1−p )λ (p (1−piC )+piC (1−piAi ))λ piAi piC λ λ+ρ piAi piC λ λ jAj + 2 4λ+2ρ +p ρ + 3λ+2ρ  + i 4λ+2ρ ρ o  3λ+2ρ (piAi (1−piC )+piC (1−piAi ))λ λ piAi piC λ λ+ρ λ + 3λ+2ρ λ+2ρ + 2 4λ+2ρ 2λ+2ρ ρ The first line reflects player i’s expected payoff if C is bad for player j. Player i is then assured the single DM payoff. This is the highest possible payoff for player i. The term in curly brackets is player i’s payoff if C is good for player j. Player i’s payoff is lowest if C is good for player j but Aj is bad. The sum in the first set of square brackets is player i’s payoff in that case: The first summand her payoff if she has at least one good arm and gets the first success — it is the first of either two (if her other arm is bad) or three (if she has two good arms) possible Poisson events. The second summand is her payoff if the first success is generated by player j on arm C — the first of either two (if C is bad for player i) or three (if it is good) possible Poisson events. Player i then only gets a positive payoff if Ai is good. The sum in the second set of square brackets describes player i’s payoff if both arms C and Aj are good for player j. The first summand is player i’s payoff is she has the first success. Conditional on not getting the first success, player i is better off if player j has her first success on Aj (third summand). Player i then obtains the single DM payoff. If instead player j has her first success on C, then player i only gets a positive payoff if Ai is good (second summand). ¯ we obtain the expression in (9). Simplifying the expression above evaluated at pt = p,

D.2.2

pt = (¯ p, p, p¯, p), p¯ > p

jC jA Starting at date t with piA ¯ > piC t = pt =: p t = pt =: p, assume that player j implements the single DM policy. We derive an expression for U i (pt ), player i’s payoff from also using the single DM policy. Over the interval [t, t + dt), player i activates Ai and player j activates C. If player i has a success on Ai her payoff is 1 + λ/ρ. If player j has a success on C, she henceforth only activates C. As a result, player i loses access to the common arm and her continuation payoff is i piA t λ/ρ. The function U (pt ) therefore satisfies:

(25)

jC λ+ρ iA λ iC jC jA iA U i (piA t , pt , pt , pt ) = pt λdt ρ + pt λdt pt ρ jC i iA0 iC jC0 jA +(1 − piA t λdt − pt λdt − ρdt)U (pt , pt , pt , pt ),

jC0 jC iA where piA0 = piA = pjC t t (1 − pt )λdt and pt t (1 − pt )λdt. At date s at which conditional on no success we have ps = p := (p, p, p, p), the single DM policy requires of both players that they resume alternating on the common option. Player i’s continuation payoff U i (p) at date s is therefore given by (9). Solving (25) for U i (pt ) evaluated jC jA iC at piA t = pt and pt = pt =: p we obtain the expression in (10).

42

D.3

Value from deviating from single DM policy

i (p ; p), player i’s payoff from using the deviation k ˜i when We derive an expression for Udev t dev i ∗ . As long as player i activates C, player j is player j used the single DM optimal policy k˜dev forced to activate Aj . After a success on C, player i keeps activating C forever. After a success on Aj , player j stays on Aj forever and player i is free to implement the single DM policy, thus obtaining the payoff V (piA , piC ) described in (D.1) and evaluated at piA ≥ piC . Therefore, i (p ; p) solves the recursion: Udev t   i (piA , piC , pjC , pjA ; p) = piC λdt 1 + λ + pjA λdt V (piA , piC ) Udev t t t t t t t t ρ jA i iA iC0 jC jA0 +(1 − piC t λdt − pt λdt − ρdt) Udev (pt , pt , pt , pt ; p), jA0 jA jA iC iC where piC0 = piC = pjA t t − pt (1 − pt )λdt and pt t − pt (1 − pt )λdt. i (p ; p) in which piC and pjA are variables, Simplifying we obtain the following PDE for Udev t t t iA jC whereas p and p are constant parameters — we drop their dependence on t to emphasise this: i (piA , piC , pjC , pjA ; p) (piC λ + pjA λ + ρ) = piC λ λ+ρ + pjA λV (piA , piC ) Udev t t t t t t t ρ iC −piC t (1 − pt )λ jA −pjA t (1 − pt )λ

∂ i (piA , piC , pjC , pjA ; p) Udev t t ∂piC t ∂ iA iC jC i Udev (p , pt , p , pjA t ; p). ∂pjA t

Solving, we obtain the following solution:   λ+ρ i (piA , piC , pjC , pjA ; p) = piC pjA λ + (1 − pjA ) λ Udev t t t t λ+ρ  ρ t 2λ+ρ λ λ iC iA λ +pjA piC t 2λ+ρ + (1 − pt ) λ+ρ p ρ t ρ   λ 1−piA 1−piA piC λ iC λ λ t +pjA (1 − piC t )pt ρ 2λ+ρ 2λ+2ρ t 1−piC piA 1−piC t t  ρ p λ 1−piC 1−pjA 1−piC t + 1−pt 1−pt 1−p piC t h i (piA , p, pjC , p; p) Udev   λ+ρ λ λ −p p 2λ+ρ + (1 − p) λ+ρ   ρ λ λ piA λρ −p p 2λ+ρ + (1 − p) λ+ρ ρ  iA p  i iA λ 1−p λ λ λ (1 − p)p −p 1−p iA 1−p 1−p ρ 2λ+ρ 2λ+2ρ p If over the interval [t, t + σ), where σ solves pjA t+σ = p, neither player produces a success, then in iA jC state (p , p, p , p) player i stops forcing player j to experiment and both resume with the single DM policy. Thus, player i’s continuation payoff satisfies: (26)

i Udev (piA , p, pjC , p; p) = U i (piA , p, pjC , p; p)

i (p; ¯ p) admits From now on we use piA = pjC = p¯ to help simplify expressions. We show that Udev a local minimum at p = p¯. We have that:    ρ λ (1−¯ p)3 1−¯ p p ∂ λ λ i (p; ¯ U p) = 1 − p ¯ − p(¯ p − p) 4 ∂p dev ρ λ+2ρ (1−p) p¯ 1−p   ρ p pλ(1−¯ p)(λ+2ρ)(1−p)+ρ(λ+2ρ)(1−p)(p−¯ p) λ − pλ(1−¯p)(λ+2ρ)(1−¯p)+p2 λ2 (1−¯p)(p−¯p) 1−p

43

From the expression above it is easy to see that ∂ i ¯ p) Udev (p; = 0. ∂p p=¯ p Moreover we have that λ p¯λ + λ + 2ρ ∂2 i ¯ p) Udev (p; = >0 2 ∂p ρ (1 − p¯)(λ + 2ρ) p p=¯ i (p; i (p; ¯ p0 ) > Udev ¯ p¯). Hence, there We conclude that there exists a threshold p0 < p¯ such that Udev exists a profitable deviation.

E

Social planner solution, irrevocable switching

When switching is irrevocable, the value function in the planner’s problem solves the following dynamic program: for all (p1 , p2 ) ∈ [0, 1]2 , (27)

 W(p1 , p2 ) = max LRR W(p1 , p2 ), LRS W(p1 , p2 )

where LRR W denotes the value under policy RR and LRS W the value under policy RS at the switching date. We have λ a LRS W(p1 , p2 ) = max(p1 , p2 ) + . ρ ρ The payoff to the policy RR satisfies the recursion:

(28)

1 2 RR W(p10 , p20 ) LRR W(p1 , p2 ) = p1 λdt p2 λdt 2 λ+ρ ρ + (1 − p λdt)(1 − p λdt)(1 − ρdt) L   λ+ρ +p1 λdt (1 − p2 λdt) ρ + (1 − ρdt)V (p20 )   10 +p2 λdt (1 − p1 λdt) λ+ρ ρ + (1 − ρdt)V (p )

where V (.) denotes the value function in the single DM problem and pi0 = pi − pi λ(1 − pi )dt. We begin by deriving a closed form expression for LRR W(p1 , p2 ). Consider the states (p1 , p2 ) ∈ [0, 1]2 in which having both players activate their risky arm is optimal, so that W(p1 , p2 ) = LRR W(p1 , p2 ) ≥ LRS W(p1 , p2 ). In these states, the value function W solves the following partial differential equation: (29)

(p1 λ + p2 λ + ρ) W(p1 , p2 ) + p1 λ(1 − p1 ) ∂p∂ 1 W(p1 , p2 ) + p2 λ(1 − p2 ) h i h i 2 ) + p2 λ λ+ρ + V (p1 ) . = p1 λ λ+ρ + V (p ρ ρ pi e−λt

∂ W(p1 , p2 ) ∂p2

˜ Let W(t) := W(p1t , p2t ), where pit = 1−pi0+pi e−λt denotes the posterior belief at date t condi0 0 tional on player i’s risky arm having been continuously activated over the interval [0, t) without a 1 ˜ dp2 ∂ W ∂ success. Noticing that ddt = dp dt ∂p1 W + dt ∂p2 W, we obtain the following ordinary differential ˜ equations for W(t) as a function of t: h i h i 2 ) − p2 λ λ+ρ + V (p1 ) . ˜ 0 (t) − (p1t λ + p2t λ + ρ) W(t) ˜ (30) W = −p1t λ λ+ρ + V (p t t t ρ ρ

44

−(p1t λ+p2t λ+ρ)dt ,

R

Multiplying both sides by the integration factor e

1 − p10 1 1 − p20 1 p10 1 − p1t p20 1 − p2t

(31)

which can be rewritten as ρ  2 pt 1 − p20 λ , 1 − p2t p20

or as (32)

=

1 e−(2λ+ρ)t , 2 pt

p1t

we obtain      i R R λ+ρ λ+ρ d h˜ 2 2 1 −(p1t λ+p2t λ+ρ)dt 1 −(p1t λ+p2t λ+ρ)dt W(t) e + V (pt ) − pt λ + V (pt ) . =e −pt λ dt ρ ρ Integrating both sides from date α to date β > α and rearranging, we obtain R

˜ W(α) =

(33)

eR e



2 −(p1 β λ+pβ λ+ρ)dβ 1 2 −(pα λ+pα λ+ρ)dα R 2 1 β e −(pt λ+pt λ+ρ)dt R 1 λ+p2 λ+ρ)dα α e −(pα α

˜ W(β) 

R

−p1t λ

h

λ+ρ ρ

i h i 1) + V (p2t ) − p2t λ λ+ρ + V (p dt t ρ

Consider the following term in (33):    Z β R 1 λ+ρ 2 1 λ+p2 λ+ρ)dt −(p 2 1 t t R e + V (pt ) dt =: H p (α, β), −pt λ (34) 1 λ+p2 λ+ρ)dα −(p α α ρ e α 1

and define H p (α, β) analogously. We now simplify this expression. Using dp1t = −p1t λ(1 − p1t )dt, we perform a change of variable in the integral in (34) and rewrite it as :   Z p1 R β 1 1 λ+ρ 2 −(p1t λ+p2t λ+ρ)dt R + V (pt ) dp1t , e 1 λ+p2 λ+ρ)dα 1 −(p α α ρ 1 1 − p e pα t where p2t := p2t (p1t ). Notice that when integrating terms containing V (.), the single-player thresholds pV will matter. ρ   1−p2 pV 1−p2 λ λ a λ 2 2 2 For pβ > pV , we have V (p ) = p ρ + 1−pV ρ − pV ρ for all p2 ∈ [p2β , p2α ]. 1−pV p2 Using expression (31) for the integration factor, we split our integral into three parts, so that the expression above becomes:  R p1β λ+ρ 1−p10  1 2 1−p20 1  p2s 1−p20  λρ 1 1 R dps 1 2 1−p1s 1−p2s p20 p1α ρ p10 p20 1−p2s e −(pα λ+pα λ+ρ)dα ρ R p1 1−p1  1 2 1−p20 p2s  p2s 1−p20  λ 1 + p1β λρ p1 0 1−p dps 1 1−p2s p20 p2 1−p2 α 0   s 1  0 2s  ρ R p1β  a 2 1−p0 1 1−p0 pV 1−p20 λ 1 1 . + p1 ρ − pV λρ dp s 1−pV p2 1−p1 p1 p2 1−pV α

0

s

0

0

We let p1s appear explicitly by rewriting the terms in p2s using e−λs = 1−p20 p20

1 1−p2s

=

1−p10 p10

p1s

1−p1s

+

1 R

e

2 −(p1 α λ+pα λ+ρ)dα

1−p20 p20



1−p20 p2s p20 1−p2s

=

1−p10 p1s p10 1−p1s

. We then integrate with respect to p1s and obtain: λ λ+ρ ρ 2λ+ρ

 e−(2λ+ρ)β − e−(2λ+ρ)α +  e−(2λ+ρ)β − e−(2λ+ρ)α  

λ + λρ 2λ+ρ  + aρ − pV

λ ρ

1 2 1 1−p0 1−p0 1−pV p10 p20

45

1 1−p1β

2 λ 1−p0 ρ p20



1 1−p1α

e−(λ+ρ)β − e−(λ+ρ)α 

pV 1−p10 1−pV p10

ρ  λ

.



and

Using expression (32) for the integration factor to divide the first three terms in the brackets and expression (31) to divide the last term in the brackets, we simplify further and obtain: 1

1

−p1α λρ + p1β λρ π(p1α , p1β ) π(p2α , p2β ) e−ρσ(pα ,pβ )     2 − aρ − pV λρ 1 − π(p1α , p1β ) π(p2α , pV ) e−ρσ(pα ,pV ) 2

p =: H> (α, β). a ρ

For p2α < pV , we have V (p2 ) =

Z

1 R

e

for all p2 ∈ [p2β , p2α ] so that (34) equals

−(p1α λ+p2α λ+ρ)dα

p1β

e

R

−(p1s λ+p2s λ+ρ)dt

p1α

1 λ+ρ+a 1 dps , 1 − p1s ρ

where p2s := p2s (p1s ). Using expression (31) for the integration factor in the integral above and 1−p2 1 1−p10 p1s 1−p2 p2s 1−p20 p1s 1−p10 replacing p2 0 1−p + p2 0 and 1−p with 1−p 2 with 2 p2 1 p1 , we obtain: p1 1−p1 s

0

R p1β

1 e

R

s

0

2 −(p1 α λ+pα λ+ρ)dα

p1α

s

0

λ+ρ+a 1−p10 ρ p10



1 1−p1s

2 

Integrating with respect to p1s and simplifying: 

λ+ρ+a λ ρ 2λ+ρ

1

R

e

2 −(p1 α λ+pα λ+ρ)dα

+

s

0

1−p10 p1s p10 1−p1s

0

1−p20 p20

+



p1s 1−p10 1−p1s p10

e−(2λ+ρ)β − e−(2λ+ρ)α

λ+ρ+a λ 1−p20 ρ λ+ρ p20

e−(λ+ρ)β





λ

dp1s



e−(λ+ρ)α



 .

Using expression (32) for the integration factor to divide the expression in brackets, we further simplify and obtain: h i h i −ρσ(p1α ,p1β ) 2 2 2 λ 1 λ λ+ρ+a 1 − p2 λ 1 1 −p1α λρ λ+ρ+a 1 − p + p α 2λ+ρ β ρ λ+ρ β 2λ+ρ π(pα , pβ ) π(pα , pβ ) e λ+ρ 2

p =: H< (α, β). 2

We are now able to give a closed form expression for H p (α, β), assuming that at date α we have p1α > pV and p2α > pV : ( p2 H> (α, β) if p2β ≥ pV 2 p H (α, β) = 2 2 p p H> (α, t2 ) + H< (t2 , β) if p2β < pV where t2 is the date at which p2t2 = pV , i.e. the date that satisfies e−λt2 = p1

p1

p2

p2

H> (α, β) and H< (α, β) be the analogues of H> (α, β) and H< (α, β). Noticing that ρ  1 R 2 −(p1 pβ 1−p1α λ β λ+pβ λ+ρ)dβ 1−p1α 1−p2α eR = 1−p1 1−p2 1−p1 p1 , −(p1 λ+p2 λ+ρ)dα e

α

α

β

46

β

β

α

pV 1−p20 1−pV p20 .

Let

˜ we finally obtain, from (33), a closed-form expression for W(α), the value in the planner problem evaluated at date α under the assumption that regime RR is optimal over the interval [α, β): ˜ W(α) := W(p1α , p2α ) = 1

1

W(p1β , p2β ) π(p1α , p1β )π(p2α , p2β ) e−ρσ(pα ,pβ )

(35)

 p2 p1  H (α, β) + H (α, β)  > >    H p2 (α, t ) + H p2 (t , β) + H p1 (α, β) 2 > < 2 > − p2 p1 p1  H> (α, β) + H> (α, t1 ) + H< (t1 , β)     p2 p2 p1 p1 H> (α, t2 ) + H< (t2 , β) + H> (α, t1 ) + H< (t1 , β)

if p2β ≥ pV and p1β ≥ pV , if p2β < pV and p1β ≥ pV , if p2β ≥ pV and p1β < pV , if p2β < pV and p1β < pV .

We can now determine the optimal regime change for the planner. Let L^{RS}W̃(s) := L^{RS}W(p^1_s, p^2_s) = max(p^1_s, p^2_s) λ/ρ + a/ρ. Consider priors p^i_0 ≥ p^j_0 both close to 1. The payoff from letting both players activate their risky arm is close to 2λ/ρ > (a+λ)/ρ, and allocating both players to their risky arm, i.e. regime RR, is optimal. Conversely, for p^i_s ≥ p^j_s both close to zero, the expected payoff from letting both players activate their risky arm is close to 0 < a/ρ, and allocating one player to the safe arm, i.e. regime RS, is optimal.

A regime change at date s ≥ 0 generates the continuation payoff W̃(s) = L^{RS}W̃(s) = max(p^1_s, p^2_s) λ/ρ + a/ρ (value matching). Let s_W̃ ≥ 0 be the date at which such a regime change is optimal. By the envelope theorem, it satisfies W̃′(s_W̃) = L^{RS}W̃′(s_W̃) (smooth pasting). Hence at s_W̃, L^{RS}W̃(s_W̃) constitutes a particular solution to the ODE (30). The optimal switching date s_W̃ solves, for p^i(s_W̃) ≥ p^j(s_W̃):

(36)  p^j(s_W̃) = [ a ρ − p^i(s_W̃) λ ( ρ V(p^j(s_W̃)) − a ) ] / [ λ ( λ + ρ + ρ V(p^i(s_W̃)) − p^i(s_W̃) λ − a ) ].

For all p^i(s_W̃) ∈ [0, 1] this equation admits solutions p^j(s_W̃) ∈ [p*_W, p_V], where p*_W is the unique p ∈ [0, 1] satisfying:

(37)  p*_W = [ a ρ − p*_W λ ( ρ V(p*_W) − a ) ] / [ λ ( λ + ρ + ρ V(p*_W) − p*_W λ − a ) ].

Therefore V(p*_W) = a/ρ in (36), and we obtain the expression for the switching line p^j = B_W(p^i) for p^i ≥ p^j given in Lemma 1. The switching line is illustrated in section 6.1 in the belief space [0, 1]².
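Since ρ V(p*_W) = a at the solution, (37) reduces to the quadratic λ² p² − λ(λ+ρ) p + aρ = 0. A minimal numerical sketch of this step follows (in Python, with illustrative parameter values that are not taken from the paper); it solves for the root in [0, 1) and checks it against the fixed-point form of (37) with ρ V(p) = a.

    # Sketch: p*_W from (37) under the stated simplification rho*V(p*_W) = a.
    import math

    def p_star_W(lam, rho, a):
        disc = (lam + rho) ** 2 - 4.0 * a * rho
        if disc < 0:
            raise ValueError("no real solution for these parameter values")
        return ((lam + rho) - math.sqrt(disc)) / (2.0 * lam)  # root in [0, 1)

    lam, rho, a = 1.0, 0.1, 0.5   # illustrative values only
    p = p_star_W(lam, rho, a)
    # Fixed-point check against (37) with rho*V(p) = a:
    assert abs(p - a * rho / (lam * (lam + rho - p * lam))) < 1e-12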

F   Social planner solution, revocable switching

F.1   Joint payoff from implementing policy RS when p^i = p^j := p

RS denotes the policy whereby the planner always allocates the player with the lowest posterior belief to the safe arm, while the player with the highest posterior belief activates her risky arm.


We first consider the states in which the two beliefs are equal: p^i = p^j := p, p ∈ [0, 1]. The payoff A(p) to policy RS satisfies the following recursion[17]:

(38)  A(p) = a dt + pλ dt [ (λ+ρ)/ρ + (1 − ρ dt) V(p) ]
           + (1 − pλ dt)(1 − ρ dt) { a dt + pλ dt [ (λ+ρ)/ρ + (1 − ρ dt) V(p′) ] + (1 − pλ dt)(1 − ρ dt) A(p′) },

where V(·) is the value function in the single-player decision problem and p′ = p − p(1 − p)λ dt. Using Taylor expansions about p for V(p′) and A(p′), and eliminating terms of order dt² and smaller, we obtain the following ordinary differential equation for A(p):

(39)  pλ(1 − p) A′(p) + 2(pλ + ρ) A(p) = 2a + 2pλ [ (λ+ρ)/ρ + V(p) ].

Notice that when integrating the right-hand side, because it includes the function V(p), the single-player threshold p_V will matter. From the above equation we can see that if neither risky arm ever produces a success and the policy RS is played forever, so that p → 0, we have A(0) = a/ρ. Conversely, if both Poisson processes were known to be good and the policy RS were nevertheless implemented, it would generate the value

   A(1) = 2λ/ρ − (1 − λ/(λ+ρ)) (λ/ρ − a/ρ),

where the second term reflects the loss incurred from waiting for one success before letting both players activate their risky arms forever. Dividing equation (39) through by pλ(1 − p), multiplying by the integration factor

   e^{∫ 2(pλ+ρ)/(pλ(1−p)) dp} = [1/(1 − p)²] [ p/(1 − p) ]^{2ρ/λ},

and integrating both sides from 0 to p, we obtain the solution:

   A(p) = e^{−∫ f(p) dp} ∫_0^p e^{∫ f(x) dx} g(x) dx,

where f(p) := 2(pλ+ρ)/(pλ(1−p)) and g(p) := 2a/(pλ(1−p)) + [2/(1−p)] [ (λ+ρ)/ρ + V(p) ]. Notice that the integration factor equals zero when evaluated at p = 0. Solving, we obtain the following expression for A:

   For p ≤ p_V:
   A(p) = a/ρ + (pλ/ρ) [ 1 + (1 − p)λ/(λ + 2ρ) ].

   For p ≥ p_V:
   A(p) = a/ρ + (pλ/ρ) [ 1 + (1 − p)λ/(λ + 2ρ) ] − 2pλ a/(ρ(λ + 2ρ)) + p²λ²(λ + 2ρ + a)/(ρ(λ + ρ)(λ + 2ρ))
          + 2 (a/ρ − p_V λ/ρ) (pλ/(λ + ρ)) π(p, p_V) e^{−ρ σ(p, p_V)}
          + (p_V λ/(λ + ρ)) [ (p_V λ − a)/ρ + (a/ρ)(1 − p_V)λ/(λ + 2ρ) ] ( π(p, p_V) e^{−ρ σ(p, p_V)} )².

[17] Formally, the recursion (38) defines A_dt(p), the value in a constrained planner problem in which resources are indivisible and an allocation can only be changed at dates t, t + dt, t + 2dt, etc. In that context, policy RS requires alternating the players on the safe arm every short time interval dt while letting the other player activate her risky arm. In our continuous-time indivisible-resource model, A(p) is formally defined as the limit as dt → 0 of A_dt(p). See Bellman (1957), Chapter 8. In a continuous-time model with divisible resources, in which k^i and k^j ∈ [0, 1], A(p) is defined as follows: when p^i = p^j the policy RS prescribes each player allocating half of their divisible unit of resource to the safe arm and the other half to their respective risky arm. The resulting recursion for A(p) is then

   A(p) = 2 (1/2) a dt + 2 p (1/2) λ dt (1 − p (1/2) λ dt) [ (λ+ρ)/ρ + (1 − ρ dt) V(p′) ] + (1 − p (1/2) λ dt)² (1 − ρ dt) A(p′),

with dp = −p(1 − p)(1/2)λ dt and p′ = p + dp. The recursion above also simplifies to ODE (39). Therefore, the joint value generated by policy RS in states such that p^i = p^j is A(·) under both the assumptions of indivisible and divisible resources.
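As a consistency check on the branch of A(·) below the threshold, the following sketch (a verification aid, not part of the paper; it assumes the sympy library is available) confirms symbolically that the expression for p ≤ p_V satisfies ODE (39) when V(p) = a/ρ.

    # Sketch: symbolic check that the p <= p_V branch above solves ODE (39) with V(p) = a/rho.
    import sympy as sp

    p, lam, rho, a = sp.symbols('p lambda rho a', positive=True)
    A = a/rho + (p*lam/rho) * (1 + (1 - p)*lam/(lam + 2*rho))
    V = a/rho  # single-player value below the threshold p_V
    lhs = p*lam*(1 - p)*sp.diff(A, p) + 2*(p*lam + rho)*A
    rhs = 2*a + 2*p*lam*((lam + rho)/rho + V)
    assert sp.simplify(lhs - rhs) == 0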

F.2   Joint payoff from implementing policy RS when p^i ≠ p^j

As long as the beliefs are different, the regime RS prescribes that the player with the highest expected arrival rate continue activating her risky arm while the other player activates the safe arm. Once the beliefs, and therefore the expected arrival rates, are equalised, policy RS prescribes alternating the players on the safe arm, thus generating the joint payoff A(p) described in the previous section.

Assume w.l.o.g. that p^i ≥ p^j. The joint payoff L^{RS}U(p^i, p^j) from letting player i activate her risky arm while player j activates the safe arm satisfies the recursion (13). Using a Taylor expansion about p^i for L^{RS}U(p^{i′}, p^j) and eliminating terms of order dt² and smaller, we obtain the following ODE, where p^j is a constant:

   ∂/∂p^i L^{RS}U(p^i, p^j) + [ (p^i λ + ρ)/(p^i λ (1 − p^i)) ] L^{RS}U(p^i, p^j) = a/(p^i λ (1 − p^i)) + [ 1/(1 − p^i) ] [ (λ+ρ)/ρ + V(p^j) ].

Multiplying both sides by the integration factor

   e^{∫ (p^i λ + ρ)/(p^i λ (1 − p^i)) dp^i} = [ 1/(1 − p^i) ] [ p^i/(1 − p^i) ]^{ρ/λ}

and integrating from p^j to p^i, setting L^{RS}U(p^j, p^j) = A(p^j), we obtain the expression (14) in section 6.2.
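The closed form of this integration factor can be checked symbolically; the sketch below (a verification aid, not part of the paper; it assumes sympy is available) confirms that its logarithmic derivative equals the coefficient (p λ + ρ)/(p λ (1 − p)).

    # Sketch: check that d/dp log[(1/(1-p)) (p/(1-p))^(rho/lam)] = (p*lam + rho)/(p*lam*(1-p)).
    import sympy as sp

    p, lam, rho = sp.symbols('p lambda rho', positive=True)
    factor = (1/(1 - p)) * (p/(1 - p))**(rho/lam)
    integrand = (p*lam + rho) / (p*lam*(1 - p))
    assert sp.simplify(sp.diff(sp.log(factor), p) - integrand) == 0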

F.3   Social planner solution, revocable switching

We can now determine the optimal regime change from RR to RS in the planner problem when switching to the safe arm is revocable. The joint value function solves the Bellman equation (12), where L^{RS}U(p^i, p^j) is given by (14) and L^{RR}U(p^i, p^j) satisfies the recursion (15), which simplifies to the PDE (29) for which we derived the general solution (35) in appendix E.

We now determine the optimal policy for the planner. Since lim_{p→0} A(p) > 0 = L^{RR}U(0, 0), for sufficiently low beliefs the regime RS is optimal. Since L^{RR}U(1, 1) = 2λ/ρ > L^{RS}U(1, 1), for beliefs close to 1 the regime RR is optimal. A regime change at date β from RR to RS induces U(p^i_β, p^j_β) = L^{RS}U(p^i_β, p^j_β) (value matching) in (35), and this regime change is optimal if smooth pasting holds. The set of threshold beliefs at which the regime change is optimal is depicted in Figure 7 in section 6.2. They are obtained by using L^{RS}U(p^i, p^j) as a particular solution to the PDE (29):

(40)  (p^i λ + p^j λ + ρ) L^{RS}U(p^i, p^j) + p^i λ (1 − p^i) ∂/∂p^i L^{RS}U(p^i, p^j) + p^j λ (1 − p^j) ∂/∂p^j L^{RS}U(p^i, p^j)
      = p^i λ [ (λ+ρ)/ρ + V(p^j) ] + p^j λ [ (λ+ρ)/ρ + V(p^i) ].

Without loss of generality, we restrict attention to states p^i ≥ p^j. The equation above describes a curve, p^j = B_U(p^i), which divides the set of states p^i ≥ p^j into two parts. It is illustrated in Figure 7 in section 6.2. Since the optimal regime change occurs before either player's belief has reached the single-player threshold, we use the first line in the bracket of (35) and give the following closed-form expression for U(p^i_α, p^j_α) when the regime RR is optimal from date α up to the date β > α at which the planner switches to regime RS:

(41)  U(p^i_α, p^j_α) = p^i_α λ/ρ + p^j_α λ/ρ
      + (1 − π(p^i_α, p^i_β)) [ V(p^j_α) − p^j_α λ/ρ ] + (1 − π(p^j_α, p^j_β)) [ V(p^i_α) − p^i_α λ/ρ ]
      + π(p^i_α, p^i_β) π(p^j_α, p^j_β) e^{−ρ σ(p^i_α, p^i_β)} [ L^{RS}U(p^i_β, p^j_β) − ( p^i_β λ/ρ + p^j_β λ/ρ ) ].

We obtain expression (16) by letting p^i_α = p^i_t and p^i_β = p^i_{s_U}. □
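As an illustration of how the display (41), as reconstructed above, can be evaluated, the sketch below composes it from user-supplied primitives. Everything here is a placeholder: V(·) is the single-player value, LRS_U(·,·) the post-switch joint value, and the no-success probability and elapsed time are the same objects assumed in the earlier sketch; none of these names come from the paper.

    # Sketch: assemble the right-hand side of the reconstructed display (41).
    import math

    def U_RR_until_switch(pi_a, pj_a, pi_b, pj_b, lam, rho, V, LRS_U):
        """pi_a >= pi_b and pj_a >= pj_b are beliefs at dates alpha and beta; V and LRS_U are callables."""
        pi_ns = lambda pa, pb: (1 - pa) / (1 - pb)                             # assumed no-success probability
        sig = lambda pa, pb: math.log(pa * (1 - pb) / (pb * (1 - pa))) / lam    # elapsed time between beliefs
        D = pi_ns(pi_a, pi_b) * pi_ns(pj_a, pj_b) * math.exp(-rho * sig(pi_a, pi_b))
        return (pi_a * lam / rho + pj_a * lam / rho
                + (1 - pi_ns(pi_a, pi_b)) * (V(pj_a) - pj_a * lam / rho)
                + (1 - pi_ns(pj_a, pj_b)) * (V(pi_a) - pi_a * lam / rho)
                + D * (LRS_U(pi_b, pj_b) - (pi_b + pj_b) * lam / rho))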


References

Armstrong, M. and J. Zhou (2011). Exploding offers and buy-now discounts.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.

Bergemann, D. and J. Välimäki (2006). Bandit problems. In Steven Durlauf and Larry Blume (eds), The New Palgrave Dictionary of Economics.

Bolton, P. and C. Harris (1999). Strategic experimentation. Econometrica, 349–374.

Bolton, P. and C. Harris (2000). Strategic experimentation: The undiscounted case. Incentives, Organizations and Public Economics – Papers in Honour of Sir James Mirrlees, 53–68.

Cardoso, R. et al. (2010). The Commission's GDF and E.ON Gas decisions concerning long-term capacity bookings: use of own infrastructure as possible abuse under Article 102 TFEU. Competition Policy Newsletter, European Commission (3).

Dayanik, S., W. Powell, and K. Yamazaki (2008). Index policies for discounted bandit problems with availability constraints. Advances in Applied Probability 40(2), 377–400.

Freeman, P. et al. (2008). The supply of groceries in the UK – market investigation. Competition Commission, UK.

Fudenberg, D. and J. Tirole (1985). Preemption and rent equalization in the adoption of new technology. Review of Economic Studies 52(3), 383–401.

Gittins, J., K. Glazebrook, and R. Weber (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons.

Gittins, J. and D. Jones (1974). A dynamic allocation index for the sequential design of experiments. Progress in Statistics, 241–266.

Heidhues, P., S. Rady, and P. Strack (2012). Strategic experimentation with private payoffs. Available at SSRN 2152117.

Keller, G. and S. Rady (2010). Strategic experimentation with Poisson bandits. Theoretical Economics 5(2), 275–311.

Keller, G., S. Rady, and M. Cripps (2005). Strategic experimentation with exponential bandits. Econometrica, 39–68.

Klein, N. and S. Rady (2011). Negatively correlated bandits. The Review of Economic Studies 78(2), 693–732.

Murto, P. and J. Välimäki (2011). Learning and information aggregation in an exit game. The Review of Economic Studies 78(4), 1426–1461.

Presman, E. L. and I. N. Sonin (1990). Sequential Control with Incomplete Information: The Bayesian Approach to Multi-Armed Bandit Problems. Academic Press.

Rosenberg, D., E. Solan, and N. Vieille (2007). Social learning in one-arm bandit problems. Econometrica 75(6), 1591–1611.

Strulovici, B. (2010). Learning while voting: Determinants of collective experimentation. Econometrica 78(3), 933–971.

Thomas, C. D. (2013). Strategic experimentation with congestion and private monitoring. Working Paper, University of Texas at Austin.

Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 287–298.

