Predicting Risk in a Multiple Stimulus-Reward Environment

Mathieu d’Acremont¹*, Manfred Gilli², Peter Bossaerts¹

¹ Laboratory for Decision-Making under Uncertainty, Ecole Polytechnique Fédérale de Lausanne, Odyssea, Station 5, CH-1015 Lausanne, Switzerland
² Department of Econometrics, University of Geneva, Uni Mail, CH-1205 Geneva, Switzerland
* Corresponding author. E-mail: [email protected]

Abstract

There is no doubt that humans are sensitive to risk when making decisions. Recently, neurobiological evidence has emerged of risk prediction along the lines of simple reinforcement learning. This evidence, however, is limited to one-step-ahead risk, namely, the uncertainty involved in the next forecast or reward. This is puzzling, because it would appear that multi-step prediction risk is more relevant. Multiple stimuli may all predict the same future reward or a sequence of future rewards, and only the risk of the total reward is relevant. It is known (and its neurobiological basis is well understood) that subjects are indeed interested in predicting the total reward (the sum of the discounted future rewards), and that learning of the expected total reward accords with the Temporal Difference (TD) algorithm. Here, we posit that subjects should analogously be interested in predicting the risk of the total reward, not just the one-step-ahead risk. Using a simple example, we illustrate what this means, and how the risk of the total reward is related to one-step-ahead risks. We propose an algorithm with which the brain may nevertheless learn total risk on the basis of learned one-step-ahead forecasting risks. Simulations illustrate how our proposed algorithm leads to successful learning of total risk. Our analysis explains why activation correlating with one-step-ahead risk prediction errors emerged recently in a task with multiple stimuli and a single reward. We also discuss how temporal discounting may induce "ramping" of total risk, suggesting an explanation for a recently documented phenomenon in the risk-related firing of dopaminergic neurons.

Author’s version of: d’Acremont, M., Gilli, M., Bossaerts, P. (2009). Predicting Risk in a Multiple Stimulus-Reward Environment. In: Dreher, J.-C., Tremblay, L. (Eds.), Handbook of Reward and Decision Making. Academic Press.


Key points

1. Reinforcement learning in an environment where multiple stimuli and rewards are experienced through time.
2. A learning algorithm to estimate the total risk carried by future rewards.
3. A mathematical formula to calculate total risk based on one-step-ahead risk.
4. Simulations to test the learning algorithm's performance.
5. Temporal discounting of risk.


Introduction

The simplest strategy for decision making under uncertainty is to select the option with the highest expected reward. However, a multitude of behavioral data documents that humans and animals are sensitive to risk as well [1, 2]. Usually, subjects forego expected return when risk is deemed too high; at other times, subjects show a tendency to seek risk. Recent neurobiological evidence suggests that certain brain regions are specialized in tracking risk (see also Chapters 2 and 6). For instance, single neuron recordings during a visual gambling task with two male rhesus macaques [3] showed that neurons in the posterior cingulate cortex had higher activity when the monkey selected the risky versus the sure option. In humans, the parietal cortex as well as other regions, such as the anterior cingulate cortex, the insula, the inferior frontal gyrus, the orbitofrontal cortex, and the striatum, have been related to risk encoding [4–10]. For a review, see [11].

Risk needs to be distinguished from ambiguity. In economics and cognitive neuroscience, ambiguity refers to conditions in which outcome probabilities are incomplete or unknown. Under ambiguity, decision-making has been related to neural responses in the amygdala, the orbitofrontal cortex, the inferior frontal gyrus, and the insula [12, 13]. From a Bayesian perspective, the distinction between ambiguity and risk is artificial; among other reasons, ambiguity involves risk, because outcomes are uncertain even when one conditions on an estimate of the probabilities [14]. This may explain the partial overlap of brain activation for decision-making under ambiguity and under risk.

In many circumstances, expected reward and risk are unknown and hence need to be learned. Reinforcement learning has long been associated with learning of expected rewards [15]. Complex extensions of simple reinforcement learning, such as temporal difference (TD) learning, have been proposed for situations where there are multiple stimuli and rewards; in those cases, the object to be learned is the expected total reward, namely, the expected value of the sum of discounted future rewards (see Chapter 10). The TD learning algorithm has been shown to be consistent with neural activity of the dopaminergic system [16].

The literature on incorporating risk assessment into the learning of discounted future rewards is far less extensive. There are two branches in this literature. First, there are models that directly adjust for risk in the learning of total reward by applying a nonlinear (strictly concave) transformation to individual rewards [17] or to the reward prediction error [18]. Note that in these models there is no role for risk learning.¹ In contrast, other authors propose a model of reward risk learning based on simple reinforcement learning, on the grounds that risk is separately encoded in the brain (as discussed above), which presupposes that it be learned separately from expected reward as well [20].

In fact, transforming rewards through some strictly concave function is inspired by expected utility theory [21]. In this theory of choice under uncertainty, the curvature of the utility function explains risk attitudes, with strict concavity corresponding to risk aversion.

¹ Closely related is an approach whereby one adjusts rewards with a risk penalty obtained from some model of the evolution of risk, such as GARCH (Generalized Autoregressive Conditional Heteroskedasticity) [19]; again, this approach is silent about risk learning.


Expected utility theory is the dominant approach to modeling choice under uncertainty in the field of economics. In finance, however, the mean-variance model is favored [22], whereby the expected (total) reward is represented separately from the reward risk, and one is traded off against the other. This approach presents a number of advantages over expected utility (for a review, see [23]), but in particular situations it may lead to dominated choices, and it relies on a representation of risk in terms of variance (although higher moments are sometimes considered as well). The separate encoding of risk indicates that the human and nonhuman primate brain may be implementing the mean-variance approach to evaluate choice under uncertainty. The activations are consistent with reward variance as a metric of risk [5, 8, 10]. Higher moments may also be encoded, but this possibility has not yet been probed systematically. The relatively modest amount of risk in typical experiments may preclude detecting activation that correlates with higher moments.

In simple situations, it is known how reinforcement learning (the Rescorla-Wagner rule) can be used to learn risk [20]. Figure 1 (Left) illustrates how the Rescorla-Wagner rule operates when there is a single stimulus that predicts a single subsequent stochastic reward. Two types of prediction errors are computed: a reward and a risk prediction error. The reward prediction error is the difference between the experienced and the predicted reward. The predicted reward is updated after each trial, based on the reward prediction error and a learning rate. After a sufficient number of trials, and provided the learning rate is reduced appropriately, the predicted reward converges to the expected value of the reward. The risk prediction error is defined as the difference between the realized and the predicted risk. The realized risk is the squared reward prediction error; the predicted risk equals its expectation, and hence, the outcome variance. The predicted risk is updated after each trial based on the risk prediction error and a learning rate. After a sufficient number of trials and with an appropriately declining learning rate, the predicted risk converges to the expected value of the realized risk, i.e., the true outcome variance.

In a recent fMRI study [24], this novel algorithm was applied to uncover the neural signatures of reward and risk prediction errors while subjects played several versions of the Iowa Gambling Task [25] (for a description of the ABCD version, see Chapter 13). In the task, subjects chose repeatedly from four decks of cards, and each selection was immediately followed by a stochastic payoff. Expected reward and risk differed among the decks, and subjects were free to choose among the four decks. Payoff probabilities were unknown, so expected value and variance had to be learned through experience. Results showed that the reward prediction error correlated with the Blood Oxygen Level Dependent (BOLD) response in the striatum, while the risk prediction error correlated with the BOLD response in the inferior frontal gyrus and insula.

Correlation between activation in the insula and risk prediction error has been documented elsewhere [26]. In that study, a card was drawn from a deck of 10 (numbered from 1 to 10), without replacement, followed several seconds later by a second card. Subjects bet one dollar on whether the second card would be lower than the first card. In contrast with [24], reward probability was unambiguous, implicitly revealed through the numbers on the cards. So, risk and reward prediction errors emerged without any need for reinforcement.
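For concreteness, the single stimulus-reward updating scheme described above (Figure 1, Left) can be sketched in a few lines of Python. This is a minimal illustration consistent with the description, not the implementation used in [20] or [24]; the ±1 reward distribution, the constant learning rates, and the number of trials are arbitrary choices for the sketch.

import random

# Rescorla-Wagner-style learning of expected reward and risk for a single
# stimulus followed by a single stochastic reward (illustrative sketch).
def learn_reward_and_risk(n_trials=5000, alpha=0.01, beta=0.01, seed=0):
    rng = random.Random(seed)
    v_hat = 0.0   # predicted reward
    h_hat = 0.0   # predicted risk (outcome variance)
    for _ in range(n_trials):
        r = rng.choice([1.0, -1.0])      # stochastic reward (assumed +/-1 here)
        delta = r - v_hat                # reward prediction error
        xi = delta ** 2 - h_hat          # risk prediction error (realized minus predicted risk)
        v_hat += alpha * delta           # update the predicted reward
        h_hat += beta * xi               # update the predicted risk
    return v_hat, h_hat

v_hat, h_hat = learn_reward_and_risk()
print(v_hat, h_hat)   # with constant learning rates, close to the true mean (0) and variance (1)

With appropriately declining learning rates, v_hat and h_hat converge to the expected reward and the outcome variance, as stated above.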


Figure 1: Trial organization of the single stimulus-reward (Left) and multiple-stimuli, multiple-rewards (Right) environments.

Decision making was also imperative, because subjects had no control over the card selected. For the purpose of the present study, the paradigm in [26] differed in another important respect: it involved two consecutive stimuli that together predicted a single reward. As such, it was a specific case of a multiple-stimuli, multiple-rewards environment like the one depicted in Figure 1 (Right). Evidently, the risk prediction error encoded in the human brain referred only to forecasting one step ahead. Specifically:

1. Upon display of the first stimulus, it measured the difference between the risk expected at the beginning of the trial and the risk realized after the stimulus presentation;

2. Upon display of the final outcome (reward), a second risk prediction error emerged, measuring the difference between the risk expected before the outcome and the risk realized after the outcome.

As such, it seems that this evidence points to representation in the human brain only of what time series statisticians call one-step-ahead risk prediction errors.

The findings are puzzling in one important respect. In the context of multiple stimuli and multiple rewards, the decision maker is not really interested in the one-step-ahead risk, but in the risk of predicting the total reward (the sum of the discounted future rewards). Why, then, would the brain care to encode one-step-ahead risk prediction errors?

There exist more realistic examples of multiple-stimuli, multiple-rewards environments, such as blackjack (or "twenty-one"). In this game between a player and a dealer, the player initially receives two cards and is asked whether to take additional cards. That is, the player has the choice to "hit" (take another card) or "stand" (not take another card). For the player to win, the sum of her cards must be above that of the dealer, but not greater than 21. In blackjack, each time the player draws a card, she has to update the one-step-ahead as well as the total risk.

This paper provides a mathematical rationale for the encoding of one-step-ahead prediction risks. It does so by exploring how TD learning could be implemented to learn risk in a multiple-stimuli, multiple-rewards setting. To fix ideas, the paper will illustrate total reward risk learning in a simple but generic example, which we shall call the Multistep Risk Example.


Figure 2: The Multistep Risk Example.

The remainder of the paper is organized as follows. We first introduce the Multistep Risk Example. Subsequently, we introduce formulae for expected total reward and total reward risk for this example. Particular attention is paid to how total reward risk relates to one-step-ahead prediction risks. The relationship turns out to be extremely simple. The paper then proposes how TD learning could be used to learn total reward risk. It is shown that TD learning is an effective way to learn one-step-ahead prediction risks, and that, from estimates of the latter, total reward prediction risk can be learned. Simulations illustrate how the proposed learning algorithm works. We demonstrate how it provides one potential explanation for the "ramping" in neural encoding of risk [1]. The paper finishes with a discussion of some tangential issues.

Models

The Multistep Risk Example

To facilitate the study of reinforcement learning of risk in a multiple-stimuli, multiple-rewards environment, we use a generic example, the Multistep Risk Example (see Figure 2). In the example, only risk is manipulated; expected reward and expected total reward remain zero at all times. The example is of the imperative type: the subject will be presented consecutively with pairs of two lotteries (gambles), but has no choice; the lotteries that will determine the final total reward are picked at random, without the subject having any control over it.

The trial begins at time t = 1 (one can imagine that the subject is presented with a fixation cross). At t = 2, a first pair of lotteries is presented. The pair consists of a high risk and a low risk lottery. One


of these two lotteries is selected at random. At t = 3, a second pair of lotteries is presented, again with a high risk and a low risk lottery. One of the two lotteries is picked. At t = 4, the first selected lottery is played and the reward r_4 is revealed. At t = 5, the second lottery is played and the reward r_5 is revealed. The trial ends at t = 6.

Our subject is obviously interested in the total reward, i.e., the sum of r_4 and r_5. In fact, the subject should be interested in the discounted sum, since rewards occurring later in time are usually valued less. Until all the rewards are revealed, the total reward remains stochastic. When using mean-variance analysis, the subject is supposed to evaluate the desirability of all the outcomes by means of the expected value of the total reward and its risk; we measure the latter in terms of the variance (of the total reward). The central feature of our example is that risk changes over time whenever a lottery is selected and whenever a lottery is played. For instance, the variance of the total reward increases if the more risky lottery is selected at t = 2, but it decreases after the first lottery is played.

Theoretical Values of Expected Reward and Risk

Let I_t refer to the information available at time t. Define the discounted total reward at time t:

R_t = Σ_{τ=t+1}^{T} γ^{τ−t−1} r_τ,   (1)

where T is the number of time steps in the trial, γ is the temporal discounting factor, and r_τ denotes the reward at time τ. The expected value of the total reward at time t, given I_t, is:

V_t = E(R_t | I_t).   (2)

In our example, this expected value V_t is insensitive to information before t = 4; it always equals 0. Define the risk of the total reward at time t, conditional on I_t, as:

H_t = Var(R_t | I_t).   (3)

As an illustration, consider the total reward risk at t = 2, knowing that the high risk lottery was selected. Temporarily ignore discounting, i.e., set γ = 1. The possible values of the total reward from t = 2 onwards are R_2 ∈ {+3, +1, −1, −3} if the low risk lottery is selected at t = 3, or R_2 ∈ {+4, 0, 0, −4} if the high risk lottery is selected instead. All outcomes are equally likely. The variance of R_2 is, therefore, H_2 = 6.5. This computation can be made at each time step and for each lottery. Figure 3 displays the results, in red.

By comparing the sum of the actually obtained reward and the future discounted expected value of the total reward with the past expectation of the total reward, one generates the Temporal-Difference (TD) reward error.


Figure 3: Theoretical values of one-step-ahead and total reward risk.


Specifically, at t + 1, the TD error equals:

δ_{t+1} = r_{t+1} + γ V_{t+1} − V_t.   (4)

Going back to our numerical example, TD errors are always zero before any reward is delivered (i.e., for t < 4). At t = 4 or t = 5, the TD error is generically different from 0. For instance, if the reward r_4 at t = 4 equals −1, then δ_4 = −1. The one-step-ahead prediction risk at time t, conditional on I_t, is defined as follows:

h_t = Var[δ_{t+1} | I_t] = E[δ_{t+1}² | I_t].   (5)

As such, the one-step-ahead prediction risk is the variance of the TD error, conditional on I_t. One can write this differently, as follows:

h_t = Var[r_{t+1} + γ V_{t+1} | I_t],   (6)

since V_t is known at time t and therefore does not contribute to the conditional variance.

To illustrate, imagine that the high risk lottery was selected at t = 2 and the low risk lottery was selected at t = 3. Rewards at t = 4 can be r_4 ∈ {+2, −2}. Hence, looking forward from time 3 on, the possible time-4 TD errors are δ_4 ∈ {+2, −2}. These values occur with equal likelihood. Hence, the variance of the TD error, or the one-step-ahead prediction risk, equals h_3 = 4. Contrast this with the risk (variance) of the total reward, which in this case would be H_3 = 5. In Figure 3, the one-step-ahead risks are indicated in blue. The important message is that the one-step-ahead and total reward risks do not coincide.

One wonders, however, whether one-step-ahead and total reward risks are not somehow related. Indeed, they are. In Box 1, we prove the following relationship:

H_t = Σ_{τ=t}^{T−1} γ^{2(τ−t)} E[h_τ | I_t].   (7)

This result states that the risk of the total (discounted) reward is the discounted sum of the expected one-step-ahead prediction risks. Referring again to our illustration, let us calculate the total reward risk as of t = 2, in case the high risk lottery was selected at t = 2. From the one-step-ahead prediction risks for t ≥ 2 displayed in Figure 3, and using Equation 7, we conclude: H_2 = 0/1 + (4+4)/2 + (4+1)/2 = 6.5. This is the same result one obtains

when directly calculating the variance of the total reward R_2 (see Figure 3, red entry in top circle at t = 2). It is possible to write the risk of the total reward H_t as a recursive function involving the immediate one-step-ahead prediction risk (for proof, see Box 1):

H_t = h_t + γ² E(H_{t+1} | I_t).   (8)


We exploited this recursive form in the simulations that we will report on later. To see how this works in our illustrative example, note that H_2 = 0 + (8+5)/2 = 6.5, which again is the value originally computed with Formula 7.
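As a check on Equations 7 and 8, the arithmetic above can be reproduced by brute force. The following sketch is not part of the original study; it simply enumerates the equally likely outcomes of the Multistep Risk Example, with payoffs of ±1 for a low risk and ±2 for a high risk lottery (as in the illustration) and γ = 1, and recovers H_2 = 6.5 both directly and through the recursion.

from statistics import mean, pvariance

# Payoffs of the two lottery types in the Multistep Risk Example (equiprobable).
PAYOFFS = {"low": [1.0, -1.0], "high": [2.0, -2.0]}

def direct_H2(first="high"):
    """Variance of the total reward R_2, given the lottery selected at t = 2."""
    totals = [r4 + r5
              for second in ("low", "high")    # selection at t = 3 (equally likely)
              for r4 in PAYOFFS[first]         # first lottery played at t = 4
              for r5 in PAYOFFS[second]]       # second lottery played at t = 5
    return pvariance(totals)

def recursive_H2(first="high"):
    """Equation 8 with gamma = 1: H_t = h_t + E(H_{t+1} | I_t)."""
    h2 = 0.0                                   # no reward can arrive at t = 3
    H3 = []
    for second in ("low", "high"):
        h3 = pvariance(PAYOFFS[first])         # one-step-ahead risk before t = 4
        h4 = pvariance(PAYOFFS[second])        # one-step-ahead risk before t = 5
        H3.append(h3 + h4)                     # Equation 8 applied at t = 3 and t = 4
    return h2 + mean(H3)                       # Equation 8 applied at t = 2

print(direct_H2(), recursive_H2())             # both print 6.5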

Box 1: Mathematical proof

We demonstrate here that total risk is a function of one-step-ahead risk. Let the TD error be:

δ_{t+1} = r_{t+1} + γ V_{t+1} − V_t.   (9)

Putting r on the left-hand side, we get:

r_{t+1} = δ_{t+1} − γ V_{t+1} + V_t,
r_{t+2} = δ_{t+2} − γ V_{t+2} + V_{t+1},
r_{t+3} = δ_{t+3} − γ V_{t+3} + V_{t+2},
· · ·
r_T = δ_T − γ V_T + V_{T−1}.

The discounted total reward at time t is:

R_t = Σ_{τ=1}^{T−t} γ^{τ−1} r_{t+τ} = γ^0 r_{t+1} + γ^1 r_{t+2} + γ^2 r_{t+3} + · · · + γ^{T−t−1} r_T.   (10)

Now consider:

γ^0 r_{t+1} = γ^0 δ_{t+1} − γ^1 V_{t+1} + γ^0 V_t,
γ^1 r_{t+2} = γ^1 δ_{t+2} − γ^2 V_{t+2} + γ^1 V_{t+1},
γ^2 r_{t+3} = γ^2 δ_{t+3} − γ^3 V_{t+3} + γ^2 V_{t+2},
· · ·
γ^{T−t−1} r_T = γ^{T−t−1} δ_T − γ^{T−t} V_T + γ^{T−t−1} V_{T−1}.

When these lines are summed, the intermediate value terms cancel out telescopically. Because no reward can be expected after the last time step, V_T = 0. So, the discounted sum of rewards is:

R_t = γ^0 δ_{t+1} + γ^1 δ_{t+2} + γ^2 δ_{t+3} + · · · + γ^{T−t−1} δ_T + γ^0 V_t.   (11)

The total risk at time t knowing I_t equals:

H_t = Var(R_t | I_t) = Var(γ^0 δ_{t+1} + γ^1 δ_{t+2} + γ^2 δ_{t+3} + · · · + γ^{T−t−1} δ_T + γ^0 V_t | I_t).   (12)


At time t, V_t is known and becomes a fixed quantity; thus it can be ignored in the calculation of the variance. We write H_t as a sum of variances and covariances:

H_t = Σ_{τ=1}^{T−t} Var(γ^{τ−1} δ_{t+τ} | I_t) + 2 Σ_{i=1}^{T−t−1} Σ_{j=i+1}^{T−t} Cov(γ^{i−1} δ_{t+i}, γ^{j−1} δ_{t+j} | I_t).   (13)

The covariance can be simplified to:

Cov(γ^{i−1} δ_{t+i}, γ^{j−1} δ_{t+j} | I_t) = γ^{i+j−2} E(δ_{t+i} δ_{t+j} | I_t),   i < j,   (14)

because E(δ_{t+k}) = 0 for any k. Now apply the law of iterated expectations:

E(δ_{t+i} δ_{t+j} | I_t) = E(E(δ_{t+i} δ_{t+j} | I_{t+i}) | I_t) = E(δ_{t+i} E(δ_{t+j} | I_{t+i}) | I_t) = 0,   i < j,   (15)

because at time t + i, δ_{t+i} becomes a fixed quantity and can be moved out of the inner expected value. Thus there is no covariance between the TD errors δ. As a consequence, the total reward variance simplifies to:

H_t = Σ_{τ=1}^{T−t} Var(γ^{τ−1} δ_{t+τ} | I_t) = Σ_{τ=1}^{T−t} γ^{2(τ−1)} E(δ_{t+τ}² | I_t).   (16)

The one-step-ahead risk h_t is defined as:

h_t = E[δ_{t+1}² | I_t].   (17)

By using the law of iterated expectations, one can write H_t as a function of h_t:

H_t = Σ_{τ=1}^{T−t} γ^{2(τ−1)} E(E(δ_{t+τ}² | I_{t+τ−1}) | I_t) = Σ_{τ=1}^{T−t} γ^{2(τ−1)} E(h_{t+τ−1} | I_t) = Σ_{τ=0}^{T−t−1} γ^{2τ} E(h_{t+τ} | I_t).   (18)


H_t can be written in a recursive way. To do so we apply the law of iterated and total expectations:

H_t = h_t + Σ_{τ=1}^{T−t−1} γ^{2τ} E(h_{t+τ} | I_t)
    = h_t + Σ_{τ=1}^{T−t−1} γ^{2τ} E(E(h_{t+τ} | I_{t+1}) | I_t)
    = h_t + E(Σ_{τ=1}^{T−t−1} γ^{2τ} E(h_{t+τ} | I_{t+1}) | I_t)
    = h_t + E(Σ_{τ=0}^{T−t−2} γ^{2(τ+1)} E(h_{t+τ+1} | I_{t+1}) | I_t)
    = h_t + γ² E(Σ_{τ=0}^{T−t−2} γ^{2τ} E(h_{t+τ+1} | I_{t+1}) | I_t)
    = h_t + γ² E(H_{t+1} | I_t).   (19)

Learning

We here propose a learning algorithm with which to learn the risk of the total reward. It consists of two steps: (i) a simple reinforcement-learning based algorithm to update one-step-ahead prediction risks; (ii) application of our Formula 7 to update the total reward risk.

We represent I_t with a state vector s_t that summarizes what happened in the past (which gamble was picked, etc.). Information in s_t is represented with a tapped delay line, a vector composed of 0s and 1s (see Box 2 for details). Let ĥ_t denote the estimate of the one-step-ahead risk h_t. In our learning algorithm, ĥ_t is a function of the state of the world s_t and a weight vector w_risk:

ĥ_t = w_risk′ s_t.   (20)

Consider the prediction error for the one-step-ahead risk h_t:

ξ_{t+1} = δ_{t+1}² − ĥ_t.   (21)

The prediction error serves to compute an updating vector Θ_t:

Θ_t = β ξ_{t+1} s_t,   (22)

where β is the learning rate for risk. Intuitively, only stimuli presented at or before time t receive the credit (debit) of a positive (negative) prediction error. At the end of the trial, the weight vector w_risk is updated by summing all the updating vectors:

w_risk = w_risk + Σ_{t=1}^{T−1} Θ_t.   (23)

After a sufficient number of trials, by the law of large numbers, ĥ_t will exhibit the usual convergence properties of reinforcement learning algorithms. Thus the first step (i), learning the one-step-ahead prediction risks, is solved. For the second step (ii), we rely on a crucial feature of the algorithm, namely that the weight vector w_risk can be used to compute


one-step-ahead risks at any time and for all contingencies within the same trial (see Formula 20). For instance, at time t = 2 in the Multistep Risk Example, it is possible to compute the one-step-ahead risk when the low risk lottery is selected at time t = 3. Likewise, it is possible to compute the one-step-ahead risk when the high risk lottery is selected at time t = 3. As a consequence, it is feasible to estimate the expected value of all possible one-step-ahead risks at t = 3, and also at t = 4, and so forth. It then becomes straightforward to use Formula 7 (or its recursive version, Formula 8) to compute the total reward risk H_t. (In our simulations, we used the recursive formula, because it is more efficient from a programming point of view.)

The prediction error ξ_{t+1} is key to reinforcement learning of one-step-ahead risk. But notice that it depends on the TD error δ_{t+1} used to update the expected value of the total reward. This effectively means that the same error that allows one to update the expected total reward can also generate an update of the total reward risk. In other words, our proposed algorithm uses information parsimoniously.

Box 2: Tapped delay line representation

A tapped delay line is a usual way to represent information in TD learning. Stimuli are numbered s = 1, 2, · · · , S, where S is the total number of stimuli. For instance, we can set s = 1 for the fixation cross, s = 2, 3 for the low and high risk lotteries of the first pair, and s = 4, 5 for the low and high risk lotteries of the second pair. Hence S = 5. Each stimulus s is represented by a time dependent Boolean vector s_{s,t} of size T. All elements in s_{s,t} equal 0, unless the stimulus was presented at time t or before, in which case element number e of s_{s,t} is set to 1, with e = t − τ + 1, where τ indicates the time when the stimulus was presented.

For instance, consider the sequence (1) fixation cross and (2) high risk lottery. The fixation cross (s = 1) was presented at τ = 1, so at t = 2 we have: s_{1,2} = [0, 1, 0, 0, 0]′. The low risk lottery of the first pair was not presented, so we have: s_{2,2} = [0, 0, 0, 0, 0]′. The high risk lottery of the first pair was presented at τ = 2, so we have: s_{3,2} = [1, 0, 0, 0, 0]′. The low risk lottery of the second pair was not yet presented, so we have: s_{4,2} = [0, 0, 0, 0, 0]′, and so on until the last stimulus is reached (s = S).


The tapped delay line representation of all stimuli at time t is:

s_t = [s_{1,t}, s_{2,t}, · · · , s_{s,t}, · · · , s_{S,t}]′.
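As an illustration (not part of the original chapter), the construction in Box 2 can be sketched in Python; NumPy and the variable names are our assumptions. Since the learning loop (Algorithm 1 below) only evaluates states up to t = T − 1, the sketch uses sub-vectors of length T − 1 = 5, which matches the length-5 vectors printed above.

import numpy as np

S, T = 5, 6      # number of stimuli and of time steps in the Multistep Risk Example
L = T - 1        # lag positions per stimulus

def tapped_delay_line(presentations, t):
    """presentations maps a stimulus index s (1..S) to its presentation time tau."""
    vec = np.zeros(S * L)
    for s, tau in presentations.items():
        if tau <= t:
            e = t - tau + 1                    # element number within the sub-vector (1-based)
            vec[(s - 1) * L + (e - 1)] = 1.0   # sub-vectors are stacked into one state vector
    return vec

# Example from Box 2: fixation cross (s = 1) at tau = 1, high risk lottery of the
# first pair (s = 3) at tau = 2; state at t = 2.
s_2 = tapped_delay_line({1: 1, 3: 2}, t=2)
print(s_2.reshape(S, L))
# the row for s = 1 is [0, 1, 0, 0, 0], the row for s = 3 is [1, 0, 0, 0, 0]; all other rows are zero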

One-step-ahead risk algorithm

Separate algorithms for the one-step-ahead and total risks were written in Matlab. For the one-step-ahead algorithm, the value V_t is computed by multiplying a weight vector for reward with the tapped delay line at time t, V_t = w_rew′ s_t. In the same way, the one-step-ahead risk h_t is computed by multiplying a weight vector for risk with the tapped delay line, h_t = w_risk′ s_t. The weight vectors for reward and risk are updated across trials with Algorithm 1.

Algorithm 1: One-step-ahead risk learning

  Initialize the reward weights, w_rew = 0
  Initialize the risk weights, w_risk = 0
  for trial = 1 : nTrial do                          {go through trials}
      for t = 1 : (T − 1) do                         {go through time steps}
          Get present state s_t
          Calculate present value, V_t = w_rew′ s_t
          Get next state s_{t+1}
          Calculate next value, V_{t+1} = w_rew′ s_{t+1}
          Compute TD error, δ_{t+1} = r_{t+1} + γ V_{t+1} − V_t
          Update for reward, Δ_t = α δ_{t+1} s_t
          Get one-step-ahead risk, h_t = w_risk′ s_t
          Compute risk prediction error, ξ_{t+1} = δ_{t+1}² − h_t
          Update for risk, Θ_t = β ξ_{t+1} s_t
      end for
      {last time step in trial, t = T}
      Update reward weights, w_rew = w_rew + Σ_{t=1}^{T−1} Δ_t
      Update risk weights, w_risk = w_risk + Σ_{t=1}^{T−1} Θ_t
  end for
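For readers who prefer runnable code, the following Python sketch implements Algorithm 1 on the Multistep Risk Example. It is an assumed re-implementation, not the authors' Matlab code: the random lottery simulator, the explicit V_T = 0 at the last step, and the variable names are our choices, while the payoffs (±1 low risk, ±2 high risk), γ = 1, α = β = 0.01 and 1000 trials follow the example and the simulations reported below.

import numpy as np

rng = np.random.default_rng(0)
S, T, gamma = 5, 6, 1.0
alpha = beta = 0.01
L = T - 1                      # lag positions per stimulus (see Box 2)
w_rew = np.zeros(S * L)        # reward weights
w_risk = np.zeros(S * L)       # risk weights

def state(presentations, t):
    """Tapped delay line: stimulus s shown at tau activates lag t - tau of its block."""
    vec = np.zeros(S * L)
    for s, tau in presentations.items():
        if tau <= t:
            vec[(s - 1) * L + (t - tau)] = 1.0
    return vec

for trial in range(1000):
    first, second = rng.integers(2), rng.integers(2)      # 0 = low risk, 1 = high risk
    presentations = {1: 1, 2 + first: 2, 4 + second: 3}   # stimulus indices as in Box 2
    rewards = np.zeros(T + 1)
    rewards[4] = rng.choice([-1.0, 1.0]) * (2.0 if first else 1.0)    # first lottery played
    rewards[5] = rng.choice([-1.0, 1.0]) * (2.0 if second else 1.0)   # second lottery played

    upd_rew = np.zeros(S * L)
    upd_risk = np.zeros(S * L)
    for t in range(1, T):                                 # t = 1, ..., T - 1
        s_t = state(presentations, t)
        V_t = w_rew @ s_t
        V_next = w_rew @ state(presentations, t + 1) if t + 1 < T else 0.0  # V_T = 0
        delta = rewards[t + 1] + gamma * V_next - V_t     # TD error
        h_t = w_risk @ s_t                                # one-step-ahead risk estimate
        xi = delta ** 2 - h_t                             # risk prediction error
        upd_rew += alpha * delta * s_t                    # Delta_t
        upd_risk += beta * xi * s_t                       # Theta_t
    w_rew += upd_rew                                      # weight updates at the end of the trial
    w_risk += upd_risk

# Predicted one-step-ahead risk at t = 3 when the high risk first lottery was shown:
print(w_risk @ state({1: 1, 3: 2, 4: 3}, 3))              # approximately 4 (cf. Figure 3)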

Total risk algorithm

The number of different states that can be experienced at time k is denoted n_k, with k = 1, · · · , T; T is the number of time steps in a trial. In the Multistep Risk Example, n_1 = 1 (fixation cross), n_2 = n_3 = 2 (low / high risk lottery), and n_4 = n_5 = n_6 = 1 (no stimulus). Stimuli within each time step can be identified with an index i ∈ {1, 2, · · · , n_k}. In the paradigm, the index of the low risk lotteries is arbitrarily set to i = 1 and the index of the high risk lotteries to i = 2. The fixation cross and the absence of a stimulus have


index i = 1. The vector of information I_t contains the indices of the stimuli experienced up to time t. For example, if the sequence of events was (1) fixation cross, (2) high risk lottery, (3) low risk lottery, and (4) no stimulus, then the information available at time 4 is I_4 = {1, 2, 1, 1}. It is easy to build a tapped delay line based on I_t.

To compute the total risk, we use a recursive function ComputeH (Algorithm 2). The function takes as argument the information available at time t and returns the total risk, H_t = ComputeH(I_t). The function looks at the number of possible stimuli in the next time step. For instance, after the fixation cross (the parent), there are two possible stimuli (the children): the low and high risk lotteries. The function is then applied to each child (which becomes a parent in the next time step). When the total reward risk H_{t+1} of all the parent's children has been computed, the mean H̄_{t+1} of these values is calculated. The one-step-ahead risk h_t of the parent is computed using the risk weights. Adding h_t to the discounted H̄_{t+1} gives the total reward risk H_t of the parent. The recursive call of the function ends when the last time step T is reached (parents with no children).

Algorithm 2: Total risk function, H_t = ComputeH(I_t)

  Get present time, t = length(I_t)
  if t < T then                                      {last time step not reached}
      Get the number of stimuli in the next time step, n = n_{t+1}
      for i = 1 : n do                               {go through stimuli in the next time step}
          Add the new piece of information, I_{t+1} = [I_t, i]
          Compute the risk for stimulus i, H_{t+1}^{(i)} = ComputeH(I_{t+1})
      end for
      Compute the mean, H̄_{t+1} = (1/n) Σ_{i=1}^{n} H_{t+1}^{(i)}
  else                                               {last time step reached}
      Set the mean to 0, H̄_{t+1} = 0
  end if
  Build s_t based on I_t
  Compute the one-step-ahead risk, h_t = w_risk′ s_t
  Compute the total risk, H_t = h_t + γ² H̄_{t+1}
  return H_t
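A Python rendering of Algorithm 2 could look as follows (again a sketch, not the authors' Matlab code). To keep it self-contained, the one-step-ahead risks are supplied through a callback; here we plug in the theoretical values of Figure 3 rather than the learned risk weights, so that the recursion can be checked against the value H_2 = 6.5 derived earlier.

T = 6
gamma = 1.0
n_states = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 6: 1}   # number of possible stimuli per time step

def compute_H(info, one_step_risk):
    """info lists the stimulus indices experienced up to the current time (cf. I_t)."""
    t = len(info)
    if t < T:                                      # last time step not reached
        children = [compute_H(info + [i], one_step_risk)
                    for i in range(1, n_states[t + 1] + 1)]
        H_next = sum(children) / len(children)     # mean total risk over the possible children
    else:
        H_next = 0.0                               # no risk is left after the last time step
    return one_step_risk(info) + gamma ** 2 * H_next

def theoretical_h(info):
    """Theoretical one-step-ahead risk h_t of the example (index 1 = low, 2 = high risk)."""
    t = len(info)
    if t == 3:                                     # the first lottery is played at t = 4
        return 4.0 if info[1] == 2 else 1.0
    if t == 4:                                     # the second lottery is played at t = 5
        return 4.0 if info[2] == 2 else 1.0
    return 0.0                                     # no reward can arrive at the other steps

# High risk lottery selected at t = 2: I_2 = {1, 2}
print(compute_H([1, 2], theoretical_h))            # 6.5, as computed with Formulas 7 and 8

In the simulations, the callback would instead evaluate h_t = w_risk′ s_t from the learned risk weights, exactly as in Algorithm 2.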

Results

One-Step-Ahead Prediction Risk and Total Reward Risk

We performed simulations to illustrate that our reinforcement learning algorithm can make accurate predictions in the Multistep Risk Example. Estimation of h_t and H_t was done with 1000 iterations. The reward and risk learning rates α and β were both set to 0.01. The parameter γ was set to 1, so there was no discounting.


Figure 4: One-step-ahead and total reward risk as estimated by TD learning.

Results of the simulations are reported in Figure 4 and are close to the theoretical values displayed in Figure 3. Thus, it can be concluded that TD learning successfully predicts one-step-ahead as well as total risk. Notice that before t = 3, the one-step-ahead risk predictions are not exactly equal to zero. This is due to sampling variability induced by the non-zero and constant reward learning rate α.

Temporal Discounting

In the simulations described in the previous section, there was no discounting, i.e., γ = 1. It is interesting to study the impact of discounting (γ < 1) on total reward risk. The formula tying total reward risk to one-step-ahead risks (Equation 7) provides insights. Imagine a situation where there is a single stimulus at t = 1 and the reward is realized at t = T. In that case, there is one-step-ahead prediction risk only at t = T − 1. As a result, according to Equation 7, total reward risk will grow over time until t = T − 1 and then drop to zero.

The fact that total reward risk increases in this case may provide an explanation for a recently described phenomenon in the firing of midbrain dopaminergic neurons in monkeys [1]. The experimental setup


was as described above: an initial stimulus predicted a stochastic reward at some future time. Sustained firing of some dopaminergic neurons during the anticipation interval appeared to be correlated with total reward risk (which changed with the displayed stimulus). Curiously, the firing was not constant over the interval between display of the stimulus and realization of the reward. The firing could best be described as "ramping" up (see Fig. 2.4B, Chapter 2).

The specifics of the experiment were as follows. Trials started with the presentation of one of five different stimuli. Each stimulus was associated with a particular reward probability: 0, .25, .50, .75, or 1. Stimuli were presented for 2 s. Rewards were delivered when the stimuli disappeared (i.e., no trace conditioning). The inter-stimulus interval varied randomly. The two monkeys each completed 18,750 trials on average. After extensive training, the authors showed how the activity of a sub-population of dopaminergic neurons increased progressively as the delivery of an uncertain reward approached, a phenomenon they referred to as the "ramping effect." The ramping effect was maximal when the probability was .50, that is, when the reward variance was maximal.

We can easily illustrate ramping with a simulation. The setup is similar to the classical conditioning task where ramping was first observed [1]. At the beginning of a trial, one of 5 different stimuli can occur. Each stimulus is associated with a different reward probability. The reward is either 0 or 1 and is delivered 10 time steps after the stimulus (T = 10). Risk and reward learning rates are set to 0.01. The number of iterations is set to 18,750, the number of trials processed by each monkey in the original study [1]. We set the discounting factor equal to .90, a number usually proposed in the literature.

The simulation generates estimates of the one-step-ahead risks at t = 9 (one step before the reward is delivered, and hence, before uncertainty is resolved) equal to 0, 0.17, 0.25, 0.19 and 0, for probabilities equal to 0, .25, .50, .75 and 1, respectively. These estimates are close to the true one-step-ahead risks, which are 0, 0.19, 0.25, 0.19 and 0, respectively. The one-step-ahead prediction risks converge to zero for t < 9, as expected.

As for the total reward risks for t = 1, ..., 9, without discounting they would all equal the one-step-ahead prediction risk at t = 9; consequently, the total reward risk would be constant between t = 1 and t = 9. With discounting, however, the total reward risk increases with time. This is illustrated in Figure 5, which is based on the values of the one-step-ahead prediction risks we recorded by the end of our simulations. The curve in the figure reproduces the ramping effect observed in actual firing of some dopaminergic neurons. If our interpretation of the ramping phenomenon is right, one can also conclude that dopaminergic neurons encode total risk, not one-step-ahead risk. Indeed, ramping is a property of total risk (see Figure 5).
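To make the role of discounting in this ramping explicit, consider again the single-stimulus case with the reward delivered at t = T. Since the only non-zero one-step-ahead risk is h_{T−1}, Equation 7 collapses to

H_t = γ^{2(T−1−t)} h_{T−1},

which, for γ < 1, grows geometrically as t approaches T − 1. With γ = .90, T = 10 and h_9 = .25 (the p = .50 stimulus), this gives H_1 ≈ .05, H_5 ≈ .11 and H_9 = .25, roughly the ramp plotted in Figure 5 for that stimulus.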

Discussion

Our mathematical analysis of reward prediction risk was prompted by recent discoveries that activation in specific cortical regions correlates with errors in the estimation of one-step-ahead prediction risks [24, 26].


Figure 5: Total reward risk prediction (H) as a function of time t, for different reward probabilities (p = 0, .25, .50, .75, 1), as estimated by TD learning. Stimuli are presented at t = 1 and rewards are delivered at t = 10.

To find a teaching signal for the prediction of risk one step into the future seems paradoxical, because one would expect the brain to be involved in learning total reward risk, in analogy with its encoding of errors in the expected value of the total reward. The paradox is resolved once one realizes that one-step-ahead prediction risks can be viewed as the building blocks of total reward risk (the link is provided by the formula in Equation 7), so that learning of one-step-ahead prediction risks automatically leads to learning of total reward risk.

As a by-product of our analysis, we showed that temporal discounting offers a plausible explanation of risk-related "ramping" in the firing of dopaminergic neurons in the monkey midbrain [1]. The ramping effect is consistent with the increase of the mathematically defined total risk over the course of the trial. This suggests that dopaminergic neurons may be encoding total reward risk, in contrast with activation in the anterior insula, which correlates with the error in estimating one-step-ahead risks [26]. However, it is not known how the evidence of "ramping" in the firing of dopaminergic neurons applies to the human brain. While fMRI analysis of striatal dopamine projection areas of the human brain has uncovered sustained activation correlating with risk [4, 10], the time resolution of fMRI does not allow us to discern whether this activation increases progressively over time, as is the case with the firing of dopaminergic neurons. It should be added that an alternative explanation has been advanced for the ramping phenomenon, based on experimental design, asymmetric firing of dopamine neurons, and backpropagation of learning signals in the TD learning algorithm [27]. The explanation has been challenged,


however, for various reasons [28]. In addition to the human striatum, risk and uncertainty in general activate a number of other brain regions, such as the anterior cingulate cortex [29–31] and the orbitofrontal cortex [8]. The sustained nature of these activations suggests that they reflect total reward risk. It has yet to be determined whether the activations related to total reward risk are consistent with those related to one-step-ahead prediction risk. That is, do the different types of activation satisfy the restriction implicit in Equation 7? Further research is needed.


References

[1] Fiorillo C, Tobler P, Schultz W (2003) Discrete Coding of Reward Probability and Uncertainty by Dopamine Neurons. Science 299: 1898–1902.
[2] Cox JC, Harrison GW (2008) Risk Aversion in Experiments. Research in Experimental Economics, volume 12. Bingley, UK: Emerald.
[3] McCoy A, Platt M (2005) Risk-sensitive neurons in macaque posterior cingulate cortex. Nature Neuroscience 8: 1220–1227.
[4] Dreher J, Kohn P, Berman K (2006) Neural Coding of Distinct Statistical Properties of Reward Information in Humans. Cerebral Cortex 16: 561–573.
[5] Huettel S, Song A, McCarthy G (2005) Decisions under Uncertainty: Probabilistic Context Influences Activation of Prefrontal and Parietal Cortices. Journal of Neuroscience 25: 3304–3311.
[6] Kuhnen C, Knutson B (2005) The Neural Basis of Financial Risk Taking. Neuron 47: 763–770.
[7] Rolls E, McCabe C, Redoute J (2008) Expected Value, Reward Outcome, and Temporal Difference Error Representations in a Probabilistic Decision Task. Cerebral Cortex 18: 652–663.
[8] Tobler P, O'Doherty J, Dolan R, Schultz W (2007) Reward Value Coding Distinct From Risk Attitude-Related Uncertainty Coding in Human Reward Systems. Journal of Neurophysiology 97: 1621–1632.
[9] Paulus M, Rogalsky C, Simmons A, Feinstein J, Stein M (2003) Increased activation in the right insula during risk-taking decision making is related to harm avoidance and neuroticism. Neuroimage 19: 1439–1448.
[10] Preuschoff K, Bossaerts P, Quartz S (2006) Neural Differentiation of Expected Reward and Risk in Human Subcortical Structures. Neuron 51: 381–390.
[11] Knutson B, Bossaerts P (2007) Neural Antecedents of Financial Decisions. Journal of Neuroscience 27: 8174–8177.
[12] Huettel S, Stowe C, Gordon E, Warner B, Platt M (2006) Neural Signatures of Economic Preferences for Risk and Ambiguity. Neuron 49: 765–775.
[13] Hsu M, Bhatt M, Adolphs R, Tranel D, Camerer C (2005) Neural Systems Responding to Degrees of Uncertainty in Human Decision-Making. Science 310: 1680–1683.
[14] Rode C, Cosmides L, Hell W, Tooby J (1999) When and why do people avoid unknown probabilities in decisions under uncertainty? Testing some predictions from optimal foraging theory. Cognition 72: 269–304.


[15] Sutton R, Barto A (1998) Reinforcement Learning: An Introduction. Cambridge, USA: MIT Press.
[16] Schultz W, Dayan P, Montague P (1997) A Neural Substrate of Prediction and Reward. Science 275: 1593–1599.
[17] Howard R, Matheson J (1972) Risk-Sensitive Markov Decision Processes. Management Science 18: 356–369.
[18] Mihatsch O, Neuneier R (2002) Risk-Sensitive Reinforcement Learning. Machine Learning 49: 267–290.
[19] Li J, Laiwan C (2006) Reward adjusted reinforcement learning for risk-averse asset allocation. Proceedings of the International Joint Conference on Neural Networks.
[20] Preuschoff K, Bossaerts P (2007) Adding prediction risk to the theory of reward learning. Annals of the New York Academy of Sciences 1104: 135–146.
[21] Von Neumann J, Morgenstern O (1947) Theory of Games and Economic Behavior. Princeton University Press.
[22] Markowitz H (1952) Portfolio Selection. Journal of Finance 7: 77–91.
[23] d'Acremont M, Bossaerts P (2008) Neurobiological Studies of Risk Assessment: A Comparison of Expected Utility and Mean-Variance Approaches. Cognitive, Affective, & Behavioral Neuroscience 8: 363–374.
[24] d'Acremont M, Lu ZL, Li X, Van der Linden M, Bechara A (2008) Neural correlates of risk prediction error during reinforcement learning in humans. Manuscript under revision.
[25] Bechara A, Damasio A, Damasio H, Anderson S (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50: 7–15.
[26] Preuschoff K, Bossaerts P (2008) Human Insula Activation Reflects Risk Prediction Errors As Well As Risk. The Journal of Neuroscience 28: 2745–2752.
[27] Niv Y, Duff M, Dayan P (2005) Dopamine, uncertainty and TD learning. Behavioral and Brain Functions 1: 1–9.
[28] Tobler P, Fiorillo C, Schultz W (2005) Adaptive Coding of Reward Value by Dopamine Neurons. Science 307: 1642–1645.
[29] Brown J, Braver T (2005) Learned Predictions of Error Likelihood in the Anterior Cingulate Cortex. Science 307: 1118–1121.
[30] Brown J, Braver T (2007) Risk prediction and aversion by anterior cingulate cortex. Cognitive, Affective, & Behavioral Neuroscience 7: 266–277.


[31] Behrens T, Woolrich M, Walton M, Rushworth M (2007) Learning the value of information in an uncertain world. Nature Neuroscience 10: 1214–1221.
