Adaptive decision models for Virtual Rehabilitation environments

Shender M. Ávila-Sansores [email protected]
Felipe Orihuela-Espina [email protected]
Luis Enrique Sucar [email protected]
National Institute for Astrophysics, Optics and Electronics (INAOE), Luis Enrique Erro # 1, Santa María Tonantzintla, San Andrés Cholula, 72840 Puebla, Mexico

Paloma Álvarez Cárdenas [email protected]
Universidad Veracruzana, Calle Museo 133, Unidad Magisterial, Jalapa, 91010 Veracruz, Mexico

Abstract

Virtual therapies for motor rehabilitation employ virtual environments tailored to the patient's specific needs to deliver the therapeutic exercises. These environments have to adapt to the changing needs of the therapy as it unfolds. A dynamic adaptation policy is proposed that reconciles the immediate therapy requirements imposed by patient progress with clinical decisions about therapy planning. The model uses a Markov decision process to produce an initial policy based on prior experience, giving an initial edge to a reinforcement learning algorithm that yields a dynamic policy optimal at any time during the therapy. The learning algorithm converges for different user behaviours and, furthermore, it does so within a time frame suitable for a real therapy. Adaptation facilitates deployment of the virtual therapy system to the patient's home, saving costs to both the patient and the health system, and alleviates the need for continuous expert support.

1. Introduction

Motor impairments subsequent to stroke leave survivors functionally disabled and dependent on others. Rehabilitation therapies aim at alleviating motor sequelae and returning to the patient some of his or her former independence. Limited availability of clinical experts and fragile financial support, often accompanied by the patient's low motivation, is a combination that can kill a therapy. Virtual rehabilitation (VR) is among the available range of therapies for the physically impaired. While sharing with other therapies the intensive, repetitive, task-oriented training, it diverges in the way exercises are administered. In VR, the exercises and tasks occur naturally as the patient interacts with a virtual environment, often in the form of serious games, conceivably optimized to fit his or her specific needs. A number of platforms providing support for VR are available (Krebs et al., 1998; Burgar et al., 2000; Burdea et al., 2000; Reinkensmeyer et al., 2002; Holden & Dyar, 2002; Ellsworth & Winters, 2002; Loureiro et al., 2003; Sanchez et al., 2006; Adamovich et al., 2009a). A number of benefits are often claimed for VR (Sucar et al., 2012), but in this work we highlight adaptability. Adaptation is critical for relocating therapy sessions from rehabilitation centers to the patient's home, as well as for decreasing the need for on-site professional assistance. Adaptability in VR means that the virtual environment intelligently senses the patient's observable progress, and perhaps even his or her non-observable cognitive state, and adjusts different aspects of the environment to ensure therapy is delivered under optimal conditions for the patient, whilst still adhering closely to what an expert physician would have planned for the patient. Adaptation by human intervention requires an experienced physician to tune hardware or software parameters of the virtual rehabilitation platform. Despite the obvious advantages of human intervention, its practicality is hindered by limited human and financial resources. Efficient adaptation can be automated by equipping the VR therapy with intelligent decision-making models.


Developing automatic intelligent adaptation is complex. On top of the scientific knowledge limitations about the neurorehabilitation process (Krakauer et al., 2012), three major elements should be considered: the patient, who has an evolving physical and cognitive state; the clinician, who has a long-term therapeutic plan but also has to take immediate decisions as the therapy develops; and the virtual environment itself, which realises the therapy and has technical and technological limitations imposing boundaries on what can be achieved. Regardless of how these elements are sensed and interacted with by the platform, it is possible to recognise two levels of adaptation. Within-task adaptation refers to the adjustment of the level of difficulty to maintain task challenge, whilst inter-task adaptation refers to therapy scheduling and is concerned with making the most out of the therapy time. Both can be modeled with decision algorithms and complemented with knowledge-transfer approaches (Avilés et al., 2011; Kan et al., 2008; Loureiro et al., 2003; Adamovich et al., 2009b).

This work presents a new model of intelligent decision taking for adapting within-task challenge in virtual rehabilitation therapies. In addition to allowing adaptation of the task challenge, a critical innovation over other existing automated solutions is that the decision policy is not static, i.e. fixed over a priori knowledge. Instead, the decision policy is itself dynamic and concomitantly adapts to the therapy's changing needs, maximizing compliance with the patient's present requirements and the clinician's planning. The model involves a double reward shaping learning strategy capable of accommodating patient and clinician feedback. This is initialized with an initial policy obtained by solving a Markov decision process (MDP), thus ensuring early and efficient convergence to an up-to-date policy. This work extends our preliminary results in (Ávila-Sánsores et al., 2012) with a more extensive synthetic simulation batch and an initial laboratory-controlled feasibility study. The results demonstrate that the system converges to a solution and, moreover, that it does so within a time frame appropriate for a real therapy.

2. Materials and methods

2.1. Gesture Therapy

The adaptation algorithm presented is independent of the VR platform, but for testing purposes we have incorporated it into the Gesture Therapy (GT) platform (Sucar et al., 2010; 2012). GT, illustrated in Figure 1, relies heavily on artificial intelligence to deliver the VR therapy under the most appropriate conditions. GT offers motor dexterity recovery equivalent to occupational therapy, but provides an edge on motivation. GT already has a previous adaptation module with a static policy resolved from a partially observable Markov decision process (POMDP) (Avilés et al., 2011), which we hope to supersede with the dynamic model proposed here.

Figure 1. A user interacting with Gesture Therapy. The characteristic hand gripper facilitates tracking of arm movements as well as incorporating a sensor for measuring gripping force.

2.2. Solution overview

Adaptive therapy is achieved using a decision-taking policy which outputs a command for adjusting one or more parameters when certain given conditions are met. This policy can be as naive as a few conditional statements, or as sophisticated as a POMDP (Kan et al., 2008). However, automatic adaptation policies currently applied in virtual rehabilitation, regardless of how they were generated, are static in nature. That is, the decision policy regulating the adaptation is designed a priori based on the knowledge available before therapy onset but, once deployed, it remains fixed throughout the therapy. This is unappealing, and the initial model may become inappropriate given the changing scenario of a therapy. An adaptation decision policy which is itself adaptive to the therapy needs would seem a rather more attractive solution. Consequently, the solution that we propose here is composed of two stages. First, an initial policy is developed upon a priori existing knowledge using a Markov decision process (MDP). Thus far, this would be a common solution affording a more or less successful static adaptation policy. Second, our model takes this seed policy and applies a learning algorithm to it. A reinforcement learning algorithm receives feedback from patient and clinician in the form of rewards in response to its suggested adaptation commands, allowing the policy to evolve in parallel to the virtual task adaptation. After just a few rounds of feedback, the new adaptive policy closes the gap between the initial conditions and the ongoing conditions of the therapy. This policy retraining can be carried out as many times as necessary to keep an up-to-date adaptation policy. Formalization of these concepts follows.

2.3. Markov decision processes (MDP)

An MDP is a decision model with foundations in Markovian theory (Poupart, 2011). The policy generated by solving an MDP will choose at any given time the best action for the system's current state considering the knowledge base available at training time. Formally, an MDP is a tuple <S, A, T, R, h, γ> where S is a set of states describing the system conditions, A is a set of actions or possible decisions, T is a function describing transitions among states subsequent to action execution, R is a reward function permitting maximization of the expected utility, h is the horizon at which the utility has to be maximized, and γ is a discount factor governing the value assigned to future rewards over short-term rewards. In our case, the set of states S is given by a discretized bivariate performance space relying on the subject's speed and control. Speed is measured from task onset until task completion and expressed relative to empirically found values normal for a healthy subject, with three possible intervals: low, medium and high. Control is calculated as the deviation in pixels from a straight path on the screen going from the user avatar location at task onset to the location of the task target. Control is also expressed relative to empirically found values normal for a healthy subject, with three possible intervals: poor, fair and good. An MDP is used because the performance metrics monitored in this case are observable, but it is easy to extend the present solution to a POMDP for a more educated model. The actions considered are either to increase, maintain or decrease the challenge level of the task. The practical realization of this in GT enforces lower and upper boundary limits to the challenge levels. The transition and reward functions implemented in this case have been designed to favour a match between speed and control, encouraging a balanced progression in both performance metrics (Ávila-Sánsores, 2013). The implications that the many other transition and reward functions that could be designed will have for the rehabilitation lie beyond the intention of this paper. But, as the reader will realise, switching from this to other strategies is straightforward.

Finally, the MDP is solved with a value iteration algorithm (Poupart, 2011) to yield an initial policy mapping from states and stages, or observable histories, to actions, π : S × T → A. This policy is the result of optimizing the expected rewards accumulated over time.
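For concreteness, the construction of the seed policy can be sketched as follows. The 3×3 performance state space and the three actions are as described above; the transition probabilities, reward values and discount factor used below are hypothetical placeholders, since the paper only characterizes these functions qualitatively (they favour balanced speed/control progress).

import itertools

SPEED = ["low", "medium", "high"]
CONTROL = ["poor", "fair", "good"]
STATES = list(itertools.product(SPEED, CONTROL))     # 9 discretized states
ACTIONS = ["decrease", "maintain", "increase"]       # challenge adjustments
GAMMA = 0.95                                         # discount factor (assumed)

def reward(state, action):
    # Hypothetical stand-in for R: favour balanced speed and control.
    speed, control = state
    balanced = SPEED.index(speed) == CONTROL.index(control)
    return 1.0 if balanced and action == "maintain" else 0.0

def transition(state, action):
    # Hypothetical stand-in for T: returns {next_state: probability}.
    probs = {s: 0.2 / (len(STATES) - 1) for s in STATES if s != state}
    probs[state] = 0.8
    return probs

def value_iteration(n_iter=100):
    # Solve the MDP and return the greedy policy pi: state -> action.
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iter):
        V = {s: max(reward(s, a)
                    + GAMMA * sum(p * V[s2] for s2, p in transition(s, a).items())
                    for a in ACTIONS)
             for s in STATES}
    def q(s, a):
        return reward(s, a) + GAMMA * sum(p * V[s2] for s2, p in transition(s, a).items())
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in STATES}

initial_policy = value_iteration()   # seed policy handed to the learning stage

The resulting initial_policy is what the reinforcement learning stage described next takes as its starting point.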

2.4. Reinforcement learning by reward shaping

Reinforcement learning (Kaelbling et al., 1996) roots its inspiration in human learning: a child learns as s/he is rewarded, whether positively or negatively, for her/his actions. The computational analogy follows: a process or agent evolves to maximize expected rewards as it receives feedback for its outputs (Sutton & Barto, 1998). Reinforcement learning can be realized by different existing algorithms. Here we further develop from two of these. Q-Learning belongs to a family of algorithms known as off-policy because the policy is learned greedily, independently of the actions that the agent performs; thus the optimal policy is learnt even when a non-optimal policy is being followed. The value of the quality function Q(s_t, a_t) is updated considering the action that maximizes the expected utility, max_{a_{t+1}} Q(s_{t+1}, a_{t+1}). Despite its advantages for processes involving long learning, off-policy learning algorithms often lead to instability problems. The second one, Sarsa, is an on-policy algorithm where the optimal policy is learnt only through the systematic departures from the true optimal, i.e. the quality function Q(s_t, a_t) is updated simply considering the action actually performed, Q(s_{t+1}, a_{t+1}). Often, the initial policy quality table Q_{S×A} is initialized to 0, and the value of Q is updated as the learning progresses.
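For reference, the standard single-source updates of the two algorithms take the following textbook form (Sutton & Barto, 1998); these are the updates that the reward-shaped versions introduced below extend:

Q-Learning (off-policy):
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]

Sarsa (on-policy):
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

where, in Sarsa, a_{t+1} is the action actually selected in s_{t+1} by the current (e.g., ε-greedy) policy.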

In both cases a parameter α ∈ [0, 1] regulates the prevalence of new decisions over previous ones. Setting α = 0 prevents the algorithm from learning, yielding in consequence a fixed policy, whereas setting α = 1 puts the emphasis entirely on the newly acquired information. A second parameter γ controls the relevance given to future rewards: γ = 0 considers only the most immediate reward, whereas as γ approaches 1, progressively more weight is given to long-term rewards. The original versions of the algorithms only accept a reward from a single source, which is insufficient for our purposes. In addition, the complexity of the domain at hand, i.e. virtual rehabilitation, anticipates that the learning required to thoroughly explore the policy space before a confident action can be issued will be prohibitively long for a real therapy. To address these two issues, we capitalize on reward shaping.


Reward shaping (Ng et al., 1999) can accelerate learning by providing localized advice. In other words, learning is steered towards the most relevant areas of the domain. In order to incorporate reward shaping ideas into the Q-Learning and Sarsa algorithms, these have to be modified to accept an additional local reward f which complements the regular reward r received by the original algorithms. To distinguish these reward shaping versions from the original algorithms we refer to them as Q+ and S+ respectively. Q+ is summarised in Algorithm 1; S+ is derived analogously from Sarsa but is not shown due to space limitations. The reader can find a detailed description of both modified algorithms in (Ávila-Sánsores, 2013). The reward shaping alteration is critical to (i) be able to feed on more than one reward source, and (ii) achieve dynamic policy adjustment within timings realistic for a real therapy timeframe.

Algorithm 1 Q+
  Input: <S, A, R>
  Output: Table of Q values
  Initialize Q(s_t, a_t)
  for all feedback episodes do
    Initialize s_t
    repeat
      Choose action a_t for s_t using the policy quality value Q (e.g., ε-greedy)
      Execute a_t
      Obtain r_{t+1} and observe s_{t+1}
      Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + f_{t+1} + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
      s_t ← s_{t+1}
    until s_t is terminal
  end for
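A minimal Python sketch of the Q+ loop is given below, continuing from the MDP sketch in Section 2.3 (STATES, ACTIONS, reward and initial_policy). The parameter values are those later used in the feasibility experiment; the environment interface, the performance-based r reward and the clinician-derived f reward are hypothetical stand-ins, not the Gesture Therapy implementation.

import random

ALPHA, GAMMA, EPSILON = 0.5, 0.95, 0.2     # values used in Experiment 2

# Seed the Q table from the MDP policy (one simple seeding choice; the exact
# mechanism is not specified in the text): the recommended action gets a head start.
Q = {(s, a): (1.0 if initial_policy[s] == a else 0.0)
     for s in STATES for a in ACTIONS}

def performance_reward(state, action):
    # Hypothetical r_{t+1}: reuse the MDP reward as a stand-in.
    return reward(state, action)

def clinician_feedback(state, action):
    # Hypothetical f_{t+1}: +1 if a simulated expert agrees, -1 otherwise.
    return 1.0 if action == initial_policy[state] else -1.0

def choose_action(state):
    # epsilon-greedy selection over the current Q table.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_plus_episode(env, n_steps=15):
    # One feedback episode: ~15 double-reward updates (one per 3-minute block).
    s = env.get_state()                                  # hypothetical platform call
    for _ in range(n_steps):
        a = choose_action(s)
        s_next = env.execute(a)                          # adjust challenge, play block
        r = performance_reward(s, a)                     # hidden from the patient
        f = clinician_feedback(s, a)                     # from dis/agreement statement
        target = r + f + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])        # Q+ update of Algorithm 1
        s = s_next

The S+ variant replaces the max over a2 with the Q value of the action actually chosen in s_next, mirroring the on-policy Sarsa update.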

2.5. Integration of the dynamic adaptation model into Gesture Therapy

The system state, as described above, is determined by the patient's observable performance in terms of exhibited control and speed. The user proceeds with the task and, upon accomplishing it, the system evaluates the exhibited performance. Often, physicians allow the patient to continue with a certain task for 3 to 5 minutes before switching, to avoid boredom. Thus, every 3 minutes (chosen for practical purposes, but this can be set to any other timing) the system checks its state and, based on its current policy, a decision action is issued before a new task starts. The reward function assesses the goodness of the action in terms of the current state and yields an r_{t+1} reward. This reward is therefore a consequence of the patient's interaction with the system, yet it is hidden from the patient, who remains unaware of the adaptation algorithm running in the background. In addition, before the new task starts, a screen informs the clinician of the patient's speed and control, and of the suggested action about to be committed. The clinician is asked to agree or disagree with the action recommendation, and his/her statement is mapped to a reward f_{t+1}. This reward feeds from the clinician's experience and his/her in-situ assessment of the patient's status. Again, the possible reward functions that can be designed for the f_{t+1} rewards are beyond the scope of this paper. The action output by the policy in the first place will still be committed, but the reward associated with the clinician's statement will shape future decisions. This process is repeated for several iterations until the decision policy is adjusted to the current status of the therapy. Afterwards, it can be switched off, and the newly adjusted policy will continue to rule the challenge adaptation decisions until a new adjustment is required. The required presence of the clinician is therefore reduced to the policy re-training times only, which we will show can be as low as 2-3 therapeutic sessions.

The 3-minute feedback loop helps us to re-express the time needed for convergence from feedback iterations to real time. Assuming a regular rehabilitation therapy of around 20 sessions of 45 minutes each, feedback every 3 minutes will collect 15 double rewards in a session, and about 300 during the whole therapy should it be allowed to learn continuously. But particularly interesting to us is a scenario where the policy can be adjusted slightly every now and then, and held fixed in between while the therapy is taken to the patient's home without exhaustive therapist supervision. Following talks with clinicians, readjustment should not last beyond 2-3 therapeutic sessions. At this feedback pace, we should aim to have our retrained policy within 30 to 45 feedback iterations.
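The timing arithmetic of the previous paragraph can be made explicit (all figures as stated in the text):

session_minutes = 45                        # length of one therapeutic session
feedback_every = 3                          # minutes between feedback episodes
sessions = 20                               # typical therapy length
per_session = session_minutes // feedback_every      # 15 double rewards per session
whole_therapy = per_session * sessions                # about 300 if always learning
retrain_target = (2 * per_session, 3 * per_session)   # 30 to 45 feedback iterations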


2.6. State simulator for synthetically exploring the system response

Exhaustive evaluation of the proposed dynamic adaptation model would demand a very large cohort. To reduce the burden of such an experimental set-up, as well as to afford exploration of boundary conditions, a simulation architecture has been prepared. Its core component is a state generator which samples the performance space following an assumed patient behaviour and converts the sampled performance vector into a state. In the description of the experiments, we give further details of the different patient behaviours explored in this work. The states generated are passed to the dynamic adaptation system which, unaware of how this state has come to be, proceeds normally with its policy re-training and challenge fitting. The simulator abstracts the dynamic adaptation system from data collection, evaluation and retrieval. Figure 2 schematically depicts the differences between a synthetic simulation and a real experimental session. The critical difference arises from the many sources of noise affecting the state evaluation in the real system. All this noise can confuse the dynamic adaptation system, which will not have reliable access to the real system state. This might slow down the training period in the real set-up, yet, as will be shown, the convergence of the system is independent of the noise affecting the state estimation.
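A minimal sketch of such a state generator follows. The three behaviour models match the descriptions given later in the experiments (the 80% compliance of the conservative user is from the text); how a followed or ignored recommendation shifts the discretized performance levels is a hypothetical choice made here purely for illustration.

import random

SPEED = ["low", "medium", "high"]
CONTROL = ["poor", "fair", "good"]

def _shift(levels, current, delta):
    # Move one step up/down a discretized performance scale, clamped to its ends.
    i = max(0, min(len(levels) - 1, levels.index(current) + delta))
    return levels[i]

def next_state(state, advised_action, behaviour):
    # Sample the next (speed, control) state for a simulated user.
    speed, control = state
    advised = {"increase": 1, "maintain": 0, "decrease": -1}[advised_action]
    if behaviour == "deterministic":
        delta = 1                                 # fixed pattern regardless of advice (assumed)
    elif behaviour == "conservative":
        delta = advised if random.random() < 0.8 else random.choice([-1, 0, 1])
    else:                                         # "stochastic"
        delta = random.choice([-1, 0, 1])
    return (_shift(SPEED, speed, delta), _shift(CONTROL, control, delta))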

Figure 2. Synthetic evaluation versus real experimentation. A. The simulation architecture state generator creates new states. B. The full real chain; the human interacts with the GT system by holding the gripper, which is tracked by a vision system (Sucar et al., 2012). The manoeuvre traces (time and screen path) are collected and evaluated in terms of speed and control to yield a new system state. In this chain, errors accumulate at each stage to yield a contaminated state. In both cases, the states generated are passed to the dynamic adaptation system, which proceeds normally with policy re-training and challenge fitting.

3. Experiments and Results

Two experiments have been carried out to test different aspects of the adaptation model. The first experiment uses the simulation architecture described in Sect. 2.6. This experiment, built over synthetic data, aims at establishing the convergence of the model empirically from a number of scenarios. In the process, it questions (i) possible differences between the underlying reward shaping algorithms, i.e. Q+ and S+, (ii) the effect of the learning factor parameter α, (iii) how different user behaviours affect convergence, and (iv) whether the MDP seeding can provide an initial edge to the system. Preliminary results of this experiment have been presented in (Ávila-Sánsores et al., 2012), but they are expanded here with a more extensive simulation batch as well as more varied questions to the system. The second experiment is a small feasibility study under laboratory-controlled conditions. This second experiment involved human intervention at the two reward levels, with some subjects playing the role of patients interacting with the system and two experts assessing the policy decisions in situ. This experiment examines the learning performance under more realistic conditions affected by noise in data acquisition. Moreover, it allows us to elaborate on the congruency between the decisions taken by the adaptation model and the corresponding dis/agreement statements made by the expert. This is critical if the system is to be taken home with limited clinical supervision for certain therapy periods. For this experiment, healthy subjects were preferred over real patients since interaction with the virtual platform and the underlying decision-taking mechanisms are independent of the user's level of impairment, yet healthy users facilitate reproducible and controlled conditions. The tests were carried out using GT as our virtual rehabilitation platform.

3.1. Experiment 1: Model convergence

3.1.1. Experimental set-up

Using the synthetic state simulator, 54 scenarios (3 values of α × 2 reward shaping algorithms × 3 different user behaviours × 3 different policy initializations) were simulated 200 times each and allowed to evolve for 1500 feedback episodes. The parameterization of the scenarios is summarised in Table 1. For each run, the accumulated reward was retrieved and averaged across the runs. Three user behavioural patterns were explored: deterministic, representing an individual that systematically acts in a certain way regardless of the therapist's advice; conservative, representing an individual who most times (80%) follows the therapist's advice; and stochastic, representing a chaotic individual who may or may not follow the therapist's advice.

α      Learning    User             Policy Initialization
0.2    Q+          Conservative     Random
0.5    S+          Stochastic       Zero
0.9                Deterministic    MDP-seeded

Table 1. Variation in the parameters for experimentation. For each scenario a particular combination of one value per column was chosen.
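The 54 scenarios are simply the Cartesian product of the four columns of Table 1, for example:

from itertools import product

alphas = [0.2, 0.5, 0.9]
learners = ["Q+", "S+"]
users = ["conservative", "stochastic", "deterministic"]
inits = ["random", "zero", "MDP-seeded"]

scenarios = list(product(alphas, learners, users, inits))
assert len(scenarios) == 54      # each simulated 200 times for 1500 episodes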

                            Speed
Control        low          medium        high
poor           decrease     decrease      decrease
fair           decrease     maintain      increase
good           decrease     increase      maintain

Table 2. The synthetic expert's predefined decisions.
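For illustration, Table 2 and a dis/agreement-derived shaping reward could be encoded as below; the ±1 values are hypothetical, since the exact mapping from the expert statement to f_{t+1} is left open in the text.

SYNTHETIC_EXPERT = {
    ("poor", "low"): "decrease", ("poor", "medium"): "decrease", ("poor", "high"): "decrease",
    ("fair", "low"): "decrease", ("fair", "medium"): "maintain", ("fair", "high"): "increase",
    ("good", "low"): "decrease", ("good", "medium"): "increase", ("good", "high"): "maintain",
}

def synthetic_f(control, speed, proposed_action):
    # +1 if the simulated expert of Table 2 agrees with the proposed action, -1 otherwise.
    return 1.0 if SYNTHETIC_EXPERT[(control, speed)] == proposed_action else -1.0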

As previously described, the r_{t+1} rewards are generated in the background to favour balanced progress in speed and control. The f_{t+1} rewards, in a real set-up, are derived from the on-line feedback of the therapist. Therefore, for the synthetic simulations, a strategy has been designed to simulate these rewards. In particular, Table 2 shows the actions recommended by a "synthetic" expert based upon the patient's performance.

Statistical analysis was carried out in SPSS. In all cases, before attempting non-parametric testing, departure from normality was first assessed visually using boxplots and further confirmed by the Kolmogorov-Smirnov test.

3.2. Effect of the reward shaping algorithm

The first 45, 100 and 300 iterations were isolated from all scenarios and simulations. We then asked whether using different algorithms with the same reward shaping makes a significant difference in the final accumulated reward. The Wilcoxon test at 95% confidence did not reject the null hypothesis (p = 0.086), indicating that the choice of one or the other algorithm does not affect the accumulated rewards.

3.3. Effect of parameter α

A Kruskal-Wallis model was set up to determine whether different values of α would yield significantly different accumulated rewards. The model responded affirmatively (p < 0.05), and subsequent pairwise comparisons with Dunn's test correction, together with the calculated slopes, suggested that α = 0.5, favouring faster learning, would be preferable.

3.4. Effect of user behaviour

Figure 3 illustrates the traces of the accumulated rewards for the explored user behaviours (deterministic, conservative and stochastic) for the Q+ algorithm. Despite the late ripples, possibly caused by the instability of off-policy algorithms, in all cases the system reaches an asymptote. The output is unbounded but oscillates around a fixed value. However, these oscillations occur late in the simulation timeline, well beyond our desired limit of 45 feedback episodes. We quantified these oscillations using entropy, and found that the stochastic user behaviour can further enlarge them (p < 0.05) (Ávila-Sánsores, 2013).

3.5. Effect of policy initialization

One of the premises of the proposed model is that a smart initialization, achieved with an MDP, can give the policy an edge over naive initializations, zero or random. Figure 3 already hints that this is the case, with higher accumulated rewards obtained in all cases. The value of the quality function of the final policy obtained with each initialization was calculated, and a Kruskal-Wallis test demonstrated that the results following MDP initialization are significantly better than with the other two initializations (p < 0.05). Moreover, from Figure 3 it may be appreciated that the asymptote for the MDP initialization is reached earlier than with the other initializations and that its cumulative rewards are already around the asymptote by the 45th feedback episode, that is, within the 2-3 real therapeutic sessions, except for the stochastic patient.

3.6. Experiment 2: Feasibility

3.6.1. Experimental set-up

The experiment was carried out at the National Institute for Astrophysics, Optics and Electronics (INAOE). Four healthy subjects were recruited from the student population (mean age 28; range 23-30). All the participants had previous experience in using computers, but only one had experience with Gesture Therapy. An introductory explanation about the experiment mechanics and aim was given, but no training or familiarization was allowed. One physician and a senior researcher played the roles of the therapists to provide the dis/agreement statements from which the f rewards were derived. The experiment was carried out in a single concurrent session, where all participants were in the lab at the same time. Each expert was in charge of monitoring two of the participants. Each participant played 25 blocks of 2 environments (tasks) (50 feedback episodes > 3 therapeutic sessions considering 15 episodes per session) from a set of 5 environments available. The different environments encourage different exercises and rehabilitatory movements. The first sequence of 5 tasks (2.5 blocks) was fixed, ensuring that each task was carried out at least once. Afterwards, the expert freely recommended the next task to the user, as would happen in an actual therapy. The feedback pace was accelerated from 3 minutes to just 1 minute, thus each block lasted for 2 minutes. At each feedback episode, the dynamic adaptation module issued an action recommendation and collected the corresponding double rewards. After each block, all data were saved to a log file. A 15-minute break was allowed after block 13, during which some stretching exercises were carried out. In all, the experiment lasted approximately 4 hours. The model was initialized with the MDP-resolved policy and Q+ was used as the reward shaping dynamic learning algorithm. Further model parameterization was as follows: α = 0.5, ε = 0.2 and γ = 0.95. Some data loss occurred due to mistakes while backing up information in between blocks.


Figure 3. Accumulated rewards for the different user behaviours, grand-averaged across runs and α values for Q+. Statistically analogous traces can be appreciated with S+ (not shown). Left) Deterministic; Middle) Conservative; Right) Stochastic. The abscissa represents simulation time in terms of feedback episodes or iterations and is shown in logarithmic scale for better appreciation of the initial stage. A vertical arrow indicates the 45-iteration limit equating to 3 therapeutic sessions. The traces for each initialization are shown separately. Oscillations about a fixed asymptotic value can be appreciated in each case. In particular, we have highlighted the case with the MDP initialization.

3.6.2. Congruence of the model with the experts

If the virtual therapy is to be deployed to the patient's home without continuous expert supervision, it is critical that the adaptation model makes the system behave intelligently so as to replicate what the expert would recommend at any given time. Thus, a high level of congruence between the model's recommended actions and the expert's agreement with these actions has to be sought. Table 3 summarises the congruence between the model and the expert in terms of agreement statements expressed as a percentage over total decisions. High levels of congruence were achieved for 3 of the subjects. For subject 1, congruence was lower. As we show next, this was a consequence of some initial disagreements at the beginning of the session, which strongly affected subsequent decisions.

Subject        1      2      3      4
Congruence     56%    92%    96%    100%

Table 3. Expert agreement with the actions suggested by the adaptation model. The percentage is expressed over the total feedback episodes that occurred during the experiment.

3.6.3. Temporal evolution of the congruence

To gain further insight into how congruence with the therapist is achieved, the temporal evolution of the congruence was analysed. The sum of the most recent dis/agreement statements was computed over a sliding window 5 episodes wide. Thus, a value of 0 represents total disagreement between the model's recommended actions and the expert's feedback, whereas a value of 5 represents full agreement of the expert with the actions recommended by the system. Figure 4 shows the progress of the congruence over the experiment. The positive slopes of all of the regressions suggest that the agreement between the model's decisions and the expert increases over time. It is clear from Figure 4 that the bad decisions suggested by the model at the beginning for subject 1, whilst not enough to prevent successful learning, cost the system dearly, clearly hindering its progress. This highlights an important effect: the early decisions have an important weight on the learning curve, which further stresses the importance of a good initialization.
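The windowed congruence and its trend can be computed in a few lines; this is a sketch of the analysis described above (least-squares slope), not the authors' script.

from statistics import mean

def windowed_congruence(agreements, width=5):
    # agreements: one 0/1 dis/agreement statement per feedback episode.
    return [sum(agreements[i:i + width]) for i in range(len(agreements) - width + 1)]

def trend_slope(series):
    # Least-squares slope; a positive value indicates growing congruence.
    xs = range(len(series))
    x_bar, y_bar = mean(xs), mean(series)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

Applied to each subject's series of statements, the sign and magnitude of trend_slope would correspond to the regression slopes reported in Figure 4.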

Figure 4. Temporal evolution of the congruence between the expert and the actions recommended by the dynamic adaptation model (black solid line) over the experiment. The overall learning progress can be characterized by the slope of the linear regression (red solid line). Positive slopes indicate a growing congruence between model and expert as time progresses. Negative slopes (none occurred) would indicate a departure in congruence.

4. Discussion

The evolution of the learning exhibited in the two experiments presents expectable differences. The synthetic experiment is far more controlled in its input, whereas the feasibility study receives a noisy input which can confuse the system. A number of factors are candidates for explaining these differences. First, variability in the human responses, both from the user and from the expert, should be expected to be higher than that of the state generator. Second, the expert has actual environmental information not available to the system, e.g. the body language of the patient. Finally, synthetic simulations are not affected by noise in the traces of the speed and control performance variables and their inherent instability (see Figure 2). A deeper analysis of this can be found in (Ávila-Sánsores, 2013).

This work suffers from a few limitations. Although we have already expressed our reasons for not using patients at this stage, it is unclear whether the convergence explored in Experiment 1 can be extrapolated to patients with brain infarct, since these will exhibit behaviours that can be expected to depart largely from the three user behaviours explored here. Notwithstanding, Experiment 2 hints that our model is robust and can learn despite challenging conditions. Besides the obvious α, ε and γ, the model includes many other parameters: the discretization of the performance space as well as the two hidden tables encoding the mapping from performance to r rewards and from expert dis/agreement to f rewards (the full parameterization can be found in (Ávila-Sánsores, 2013)). The large search space given by the many parameters involved in the model allows only for limited exploration. Even though they are intuitive in nature, fine tuning is beyond the scope of this work. Finally, this work has not compared the proposed approach with other existing solutions, nor has it implemented the solution in a range of virtual rehabilitation platforms. While we appreciate the value in doing so, we think that it is unnecessary for the proof of concept aimed at in this paper: an adaptation system that not only permits adjustment of the challenge level, but also permits the decision policy to be updated so that it evolves in concordance with the changing needs of the therapy.

5. Conclusions

A new dynamic model for decision-taking in virtual rehabilitation has been presented. The model capitalizes on a double reward system to afford double adaptation: the within-task adaptation of the challenge level and the concurrent adaptation of the decision-taking policy. We have demonstrated that it is possible to achieve this double adaptation and, moreover, that it can be achieved within a realistic timeframe. The implication is clear: the virtual therapy can be taken home by the patient and only requires expert supervision during the sporadic updates. As far as we are aware, this is the only work that achieves this double adaptation, and specifically we are the first to afford a dynamic decision policy. We have also demonstrated the importance of a good initialization. Not only does MDP initialization achieve better results than zero or random initialization, but good decisions at the beginning of the learning process can also boost the learning, drastically reducing the time needed to train and retrain the decision policy. Future work involves incorporating cognitive and emotional aspects into the system state, as well as further enlarging the performance space with other new metrics of performance. The critical instability observed for long runs should be better investigated and corrected. Experimental piloting with patients should occur in the near future.

References

Adamovich, Sergey V., Fluet, Gerard G., Mathai, Abraham, Qiu, Qinyin, Lewis, Jeffrey, and Merians, Alma S. Design of a complex virtual reality simulation to train finger motion for persons with hemiparesis: a proof of concept study. Journal of NeuroEngineering and Rehabilitation, 6:28 (10 pp.), 2009a. doi: 10.1186/1743-0003-6-28.

Adamovich, Sergey V., Fluet, Gerard G., Tunik, Eugene, and Merians, Alma S. Sensorimotor training in virtual reality: A review. NeuroRehabilitation, 25(1):29, 2009b.

Ávila-Sánsores, Shender, Orihuela-Espina, Felipe, and Sucar, Luis Enrique. Patient tailored virtual rehabilitation. In International Conference on NeuroRehabilitation (ICNR'2012), pp. 879-883, Toledo, Spain, 14-17 NOV 2012.

Ávila-Sánsores, Shender María. Adaptación en línea de una política de decisión utilizando aprendizaje por refuerzo y su aplicación en rehabilitación virtual. PhD thesis, Dept. Computational Sciences, National Institute for Astrophysics, Optics and Electronics (INAOE), 2013.


Avilés, Héctor, Luis, Roger, Oropeza, Juan, Orihuela-Espina, Felipe, Leder, Ronald, Hernández-Franco, Jorge, and Sucar, Enrique. Gesture therapy 2.0: Adapting the rehabilitation therapy to the patient progress. In Hommersom, Arjen and Lucas, Peter (eds.), Workshop on Probabilistic Problem Solving in Biomedicine at the 13th Conference on Artificial Intelligence in Medicine (AIME'11), pp. 3-14, Bled, Slovenia, JUL 2011.

Burdea, Grigore, Popescu, Viorel, Hentz, Vincent, and Colbert, Kerri. Virtual reality-based orthopedic telerehabilitation. IEEE Transactions on Rehabilitation Engineering, 8(3):430-432, SEP 2000.

Burgar, Charles G., Lum, Peter S., Shor, Peggy C., and Loos, Machiel van der. Development of robots for rehabilitation therapy: the Palo Alto VA/Stanford experience. Journal of Rehabilitation Research and Development, 37(6):663-673, NOV/DEC 2000.

Ellsworth, Christopher and Winters, Jack. An innovative system to enhance upper-extremity stroke rehabilitation. In Clark, John W., McIntire, Larry V., Ktonas, Periklis Y., Mikos, Anthony G., and Ghorbel, Fathi H. (eds.), 2nd Joint Engineering in Medicine and Biology Society and Biomedical Engineering Society (EMBS/BMES) Conference, volume 3, pp. 2367-2368, Houston, Texas, USA, 23-26 OCT 2002. IEEE.

Holden, Maureen K. and Dyar, Thomas. Virtual environment training: a new tool for neurorehabilitation. Neurology Report, 26(2):62-71, 2002.

Kaelbling, Leslie Pack, Littman, Michael L., and Moore, Andrew W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Kan, Patricia, Hoey, Jesse, and Mihailidis, Alex. Automated upper extremity rehabilitation for stroke patients using a partially observable Markov decision process. In Association for the Advancement of Artificial Intelligence (AAAI) 2008 Fall Symposium on AI in Eldercare, 2008.

Krakauer, John W., Carmichael, S. Thomas, Corbett, Dale, and Wittenberg, George F. Getting neurorehabilitation right: what can be learned from animal models? Neurorehabilitation and Neural Repair, Epub ahead of print: 9 pp., 2012.

Krebs, Hermano Igo, Hogan, Neville, Aisen, Mindy L., and Volpe, Bruce T. Robot-aided neurorehabilitation. IEEE Transactions on Rehabilitation Engineering, 6(1):75-87, MAR 1998.

Loureiro, Rui, Amirabdollahian, Farshid, Topping, Michael, Driessen, Bart, and Harwin, William. Upper limb robot mediated stroke therapy - GENTLE/s approach. Autonomous Robots, 15:35-51, 2003.

Ng, Andrew Y., Harada, Daishi, and Russell, Stuart J. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pp. 278-287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2.

Poupart, Pascal. Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, chapter An Introduction to Fully and Partially Observable Markov Decision Processes (Ch. 3), pp. 33-62. IGI Global, 2011.

Reinkensmeyer, David J., Pang, Clifton T., Nessler, Jeff A., and Painter, Christopher C. Web-based telerehabilitation for the upper extremity after stroke. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 10(2):102-108, JUN 2002.

Sanchez, Robert J., Liu, Jiayin, Rao, Sandhya, Shah, Punit, Smith, Robert, Rahman, Tariq, Cramer, Steven C., Bobrow, James E., and Reinkensmeyer, David J. Automating arm movement training following severe stroke: functional exercises with quantitative feedback in a gravity-reduced environment. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(3):378-389, SEP 2006.

Sucar, L. Enrique, Luis, Roger, Leder, Ron, Hernández, Jorge, and Sánchez, Israel. Gesture therapy: A vision-based system for upper extremity stroke rehabilitation. In 32nd Annual International Conference of the IEEE EMBS, pp. 3690-3693, Buenos Aires, Argentina, 31 AUG - 4 SEP 2010. IEEE.

Sucar, Luis Enrique, Orihuela-Espina, Felipe, Velázquez, Roger Luis, Reinkensmeyer, David J., Leder, Ronald, and Hernández-Franco, Jorge. Gesture therapy: An upper limb virtual reality-based motor rehabilitation platform. Submitted to IEEE Transactions on Neural Systems and Rehabilitation Engineering, 10 pp., 2012.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1st edition, 1998.
