Learning Referring Expression Generation Policies for Spoken Dialogue Systems using Reinforcement Learning
Second Year Report
Srinivasan Janarthanam
Supervisor: Dr. Oliver Lemon

www.classic-project.org

Introduction
• Dialogue systems adapting to unknown users based on their domain expertise.
• Choose appropriate referring expressions:
  – Jargon or descriptive expressions
  – Proper names or descriptive common names

• REG policy – which RE to choose in a given state?
• Learning REG policies that adapt dynamically.
• Use Reinforcement Learning for NLG (Lemon 2008).


Why adaptive policies?
• Humans do it; it helps in grounding.
  – Audience design (Isaacs & Clark, 1987)
• Improves usability (Molich & Nielsen, 1990).
• "Analyse your audience" – a principle of technical writing.


Dialogue system
Dialogue policy π: Ss -> As

User Dialogue Act -> Dialogue Manager -> System Dialogue Act -> NLG module -> System Utterance
(The Dialogue Manager consults and updates the dialogue state Ss.)


Adaptive Dialogue System
Dialogue policy π: Ss -> As
NLG policy π: UMs,u -> RECs

User Dialogue Act -> Dialogue Manager -> System Dialogue Act -> NLG module -> System Utterance
(The Dialogue Manager consults the dialogue state Ss; the NLG module additionally consults the user model UMs,u to choose the referring expressions RECs.)
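To make the two mappings concrete, here is a minimal Python sketch of the interfaces only. The class and field names (DialogueState, UserModel, knows_jargon, and the toy decision rules) are illustrative assumptions, not the system's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:
    # UM_{s,u}: the system's running estimate of the user's domain knowledge,
    # e.g. {"modem": True, "broadband_filter": False}
    knows_jargon: dict = field(default_factory=dict)

@dataclass
class DialogueState:
    # S_s: the dialogue state, which in the adaptive system embeds the user model
    last_user_act: str = ""
    user_model: UserModel = field(default_factory=UserModel)

def dialogue_policy(state: DialogueState) -> str:
    """Dialogue policy pi: S_s -> A_s (picks the next system dialogue act)."""
    return "request_observe" if state.last_user_act == "acknowledge" else "greet"

def nlg_policy(user_model: UserModel, referents: list) -> dict:
    """NLG policy pi: UM_{s,u} -> REC_s (picks an RE type per referent)."""
    return {r: ("jargon" if user_model.knows_jargon.get(r, False) else "descriptive")
            for r in referents}
```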


NLG module: decision problem
1. Retrieve the utterance template.
2. Choose REs based on the policy.
3. Replace RE handlers with the chosen REs.

E.g. template: "Do you see a $broadband_filter$ connected to the $modem$?"
User = novice: $broadband_filter$ -> "small white box", $modem$ -> "big black box with flashing lights"
Result: "Do you see a small white box connected to the big black box with flashing lights?"
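The three steps translate into a short realisation routine. This is a minimal sketch assuming a $handler$ convention for RE slots and an illustrative two-entry lexicon; only the template string and the two expressions come from the slide.

```python
import re

# Illustrative RE lexicon: one jargon and one descriptive variant per referent.
LEXICON = {
    "broadband_filter": {"jargon": "broadband filter",
                         "descriptive": "small white box"},
    "modem": {"jargon": "modem",
              "descriptive": "big black box with flashing lights"},
}

def realise(template: str, rec: dict) -> str:
    """Replace each $handler$ in the template with the RE chosen by the policy."""
    def substitute(match):
        referent = match.group(1)
        return LEXICON[referent][rec[referent]]
    return re.sub(r"\$(\w+)\$", substitute, template)

template = "Do you see a $broadband_filter$ connected to the $modem$?"
rec = {"broadband_filter": "descriptive", "modem": "descriptive"}  # novice user
print(realise(template, rec))
# Do you see a small white box connected to the big black box with flashing lights?
```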

Can we learn an optimal adaptive NLG/REG policy using Reinforcement Learning?


Dialogue task
• To troubleshoot an Internet connection at the user's house.
• Stages: Problem reporting -> Diagnosis -> Repair instructions -> Verify & Close


NLG policy learning (Janarthanam & Lemon 2009a)
Reinforcement Learning (Sutton & Barto 1998)

Dialogue system -> (As, RECs) -> User Simulation
Dialogue system <- Au <- User Simulation
Reward -> Dialogue system

Inside the simulated user environment, a hand-coded dialogue script drives the user's responses; the user observes/manipulates domain objects, and the dialogue state is updated after each turn.
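The interaction loop can be sketched as below. The component interfaces (the system and user-simulation objects and their method names) are assumptions made for illustration; only the message types (As, RECs, Au) and the shorter-is-better reward come from the slides.

```python
def run_episode(system, user_sim, max_turns=50):
    """One simulated training dialogue between the system and the user simulation."""
    state = system.reset()
    user_sim.reset()  # samples a fresh domain-knowledge profile
    turns = 0
    for turns in range(1, max_turns + 1):
        a_s, rec_s = system.choose_action(state)   # system dialogue act + RE choices
        a_u = user_sim.respond(a_s, rec_s)         # scripted, RE-sensitive user reply
        state = system.update_state(a_u)           # also updates the user model
        if a_u == "close_dialogue":
            break
    system.learn(reward=-turns)  # shorter dialogues earn more reward
```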


Dialogue System state
• The user model is part of the dialogue state.
• It records the user's domain knowledge during the conversation.
• The system decides which REs to use based on this dynamic user model.


Dialogue System Action set


User simulation
• Different from previous user simulation models.
• Sensitive to referring expressions.
• Simulates different domain knowledge profiles.
• Takes as input:
  – the system dialogue act
  – the system's choice of referring expressions.
• Outputs:
  – the user dialogue act
  – the user environment act.


User action selection – PoC model
Input: As, RECs
1. Does the user know the RECs? If no -> Au = Request clarification.
2. Does the user know the location of the domain objects? If no -> Au = Request location.
3. Does the user know how to manipulate them? If no -> Au = Request procedure.
4. Otherwise, observe/manipulate them; when done -> Au = Provide info / Acknowledge.
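The cascade reads directly as code. In this sketch the knowledge tests (knows_re and friends) are assumed names standing in for lookups into the simulated user's domain-knowledge profile.

```python
def choose_user_action(user, a_s, rec_s):
    """Proof-of-concept user action selection: walk the cascade in order."""
    if not all(user.knows_re(re) for re in rec_s):
        return "request_clarification"
    if not user.knows_location(a_s.target):
        return "request_location"
    if not user.knows_procedure(a_s.target):
        return "request_procedure"
    user.observe_or_manipulate(a_s.target)   # act on the simulated environment
    return "provide_info_or_acknowledge"
```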


PoC – Training the NLG module
• 50000 cycles (1500 dialogues) using the SARSA RL algorithm (sketched below).
• Shorter dialogues get more reward.
• Learned policies (RL1 & RL2) adapt very well to the given population.
• They produce tailored, short dialogues for their respective user groups (oracle performance is 13 moves).
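For reference, the core of a tabular SARSA learner (Sutton & Barto 1998). The state/action encoding, step sizes, and the epsilon-greedy helper are illustrative defaults, not the values used in the experiments.

```python
import random
from collections import defaultdict

Q = defaultdict(float)               # Q(s, a), zero-initialised
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def epsilon_greedy(state, actions):
    """Pick a random action with prob. EPSILON, else the greedy one."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy TD update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
```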


PoC – Testing in simulation
• Do the learned policies (RL1 & RL2) perform well with other user groups as well?
• Tested using a different user simulation that simulates more groups.
• Learned policies were compared to baseline policies.
• 250 dialogues per policy were produced.


Baseline policies (hand-coded)
• Random – choose REs randomly.
• Descriptive only – use only descriptive expressions.
• Jargon only – use only technical terms.
• Adaptive 1 – start with descriptive, change to technical terms if the user requests verification.
• Adaptive 2 – start with technical terms, change to descriptive if the user requests clarification (sketched below).
• Adaptive 3 – switch between technical and descriptive expressions based on previous user requests.
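Adaptive 2, for instance, amounts to a one-line rule. This hedged sketch assumes a per-referent record of past clarification requests; the other baselines differ only in the switching condition.

```python
def adaptive2_choose_re(referent, clarification_requested):
    """Adaptive 2: start with jargon; switch to descriptive once the user
    has requested clarification for this referent."""
    return "descriptive" if clarification_requested[referent] else "jargon"
```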


Evaluation
• RL1 & RL2 are significantly better than all the baseline policies.
• RL2 is significantly better than RL1.
• The learned policies adapted well to unseen profiles (thanks to linear function approximation).


Why are the learned policies better?
• They did not use ambiguous expressions like "black box".
• They used descriptive terms only for complete novices and jargon only for experts.
• They chose appropriately between descriptive and jargon terms for intermediate users.
• For example:
  – If the user knew "modem", the system used "dsl light"; otherwise it used "second light".
  – The system used "Network Connections" only when the user knew "Modem" and "Network Icon".


Data? - Wizard of Oz!


(Janarthanam & Lemon 2009b)


Wizard interpretation tool


Data collection
• Fill in background information
• Take pre-test (recognition of domain objects)
• Do the dialogue task
• Take post-test
• Review system performance (questionnaire)


Corpus
• 17 participants
• Logs of interaction
• Participants' background
• Pre-test recognition scores
• Post-test recognition scores
• Final environment state
• Participants' feedback (Likert scale)
• Audio of the conversations


User simulation models
Advanced n-gram simulation (Georgila et al. 2005):
P(Au,t | As,t, RECs,t, H, DKu)
P(EAu,t | As,t, RECs,t, H, DKu)
where:
Au,t – user's dialogue action
EAu,t – user's environment action
As,t – system's dialogue action
RECs,t – system's choice of referring expressions
H – history of clarification requests
DKu – user's domain knowledge

– Models real users very closely.
– Breaks down in contexts not seen in the corpus (data sparsity).


User simulation models
Two-tier model:
Tier 1: P(CRu,t | As,t, REs,t, HRE, DKu,RE)
Tier 2: P(Au,t | As,t, CRu,t), P(EAu,t | As,t, CRu,t)
– Trained on dialogue corpora.
– RE recognition and environment interaction are divided into two steps instead of one.
– Works well in unseen contexts (see the sketch below).
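A sketch of how the two-tier model produces a response, assuming the two conditional distributions have already been estimated from the corpus and are stored as context-indexed dictionaries. The variable names follow the slide; the data layout is an assumption.

```python
import random

def sample(dist):
    """Draw one outcome from a {outcome: probability} distribution."""
    r, cumulative = random.random(), 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return outcome
    return outcome  # guard against floating-point rounding

def two_tier_respond(tier1, tier2_au, tier2_eau, a_s, re_s, h_re, dk_re):
    # Tier 1: does the user issue a clarification request (CR) for this RE?
    cr = sample(tier1[(a_s, re_s, h_re, dk_re)])
    # Tier 2: dialogue act and environment act, conditioned on the CR outcome.
    a_u = sample(tier2_au[(a_s, cr)])
    ea_u = sample(tier2_eau[(a_s, cr)])
    return cr, a_u, ea_u
```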


User simulation models
Bigram model – trained on corpora:
P(Au,t | As,t)
P(EAu,t | As,t)
Trigram model – trained on corpora:
P(Au,t | As,t, As,t-1)
P(EAu,t | As,t, As,t-1)
Equal Probability model – same as the bigram model, but assigns equal probability to all possible responses.


Evaluation
• Which model is closest to the ideal simulation?
• Dialogue similarity measure (Cuayahuitl et al. 2005, Cuayahuitl 2009) based on Kullback-Leibler divergence:

DS(P, Q) = (1/N) Σi ½ [DKL(Pi || Qi) + DKL(Qi || Pi)], with DKL(Pi || Qi) = Σj pij log(pij / qij)

P, Q – probability distributions
N – total number of contexts
M – number of responses per context (the index j runs over these M responses)
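Under this reading, the measure is straightforward to compute. The sketch below assumes the distributions have already been smoothed so that every probability is nonzero.

```python
from math import log

def kl(p, q):
    """D_KL(p || q) over the M responses of one context."""
    return sum(p_j * log(p_j / q_j) for p_j, q_j in zip(p, q) if p_j > 0)

def dialogue_similarity(P, Q):
    """Mean symmetrised KL divergence over N contexts; P and Q are lists of
    per-context response distributions (each a list of M probabilities)."""
    return sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in zip(P, Q)) / len(P)
```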


Evaluation
• All models were compared to the ideal simulation in observed contexts (N = 175).
• All models were smoothed using a modified version of Witten-Bell discounting.

Model         Au,t     EAu,t
Two-tier      0.078    0.018
Bigram        0.150    0.139
Trigram       0.145    0.158
Equal Prob.   0.445    0.047

The two-tier model simulates real user data most faithfully.


Milestones
• Learn REG policies with hand-coded simulation (Janarthanam & Lemon 09a) – DONE
• Build WoZ setup (Janarthanam & Lemon 09b) – DONE
• Build dialogue corpora from human users – 50% DONE
  – Need more data for reward modelling
• Build user simulation from data – DONE


Schedule for the third year
Month/Year           Task
July 2009            More data collection
Aug 2009             Learning REG policies using simulation models; evaluating learned policies with simulated users
July/Aug 2009        Release shared task – alternative models to build adaptive REG systems, for comparison with our RL framework
Sep 2009             DDD
Oct–Nov 2009         Evaluation with real users
Dec 2009 – Feb 2010  Final writing up


Thesis plan
Chapter    Title
1          Introduction ☺
2          Review of related work ☺
3          RL framework to learn adaptive NLG policies ☺
4          Corpus collection ☺
5          Building user simulation model from data ☺
6          Training/testing the NLG module using the user simulation model
7          Testing with real users
8          Conclusion and future work
Appendix   Sample dialogues; References


Relevant publications
• SEMDIAL 08
  – Srinivasan Janarthanam and Oliver Lemon. 2008. User Simulation for Knowledge-Alignment and Online Adaptation in Troubleshooting Dialogue Systems. In Proc. SEMDIAL 2008 (LONDIAL), London. (Chapter 3)
• ENLG 09
  – Srinivasan Janarthanam and Oliver Lemon. 2009a. Learning Lexical Alignment Policies for Generating Referring Expressions for Spoken Dialogue Systems. In Proc. ENLG 2009, Athens. (Chapter 3)
  – Srinivasan Janarthanam and Oliver Lemon. 2009b. A Wizard-of-Oz Environment to Study Referring Expression Generation in a Situated Spoken Dialogue Task. In Proc. ENLG 2009, Athens. (Chapter 4)
• Forthcoming
  – Book chapter in "State-of-the-art in NLG" (to be edited by E. Krahmer and M. Theune)


☺ Thanks ☺

www.classic-project.org


References
Srinivasan Janarthanam and Oliver Lemon. 2009a. Learning Lexical Alignment Policies for Generating Referring Expressions for Spoken Dialogue Systems. In Proc. ENLG 2009, Athens.
Srinivasan Janarthanam and Oliver Lemon. 2009b. A Wizard-of-Oz Environment to Study Referring Expression Generation in a Situated Spoken Dialogue Task. In Proc. ENLG 2009, Athens.
Srinivasan Janarthanam and Oliver Lemon. 2008. User Simulation for Knowledge-Alignment and Online Adaptation in Troubleshooting Dialogue Systems. In Proc. SEMDIAL 2008 (LONDIAL), London.
Oliver Lemon. 2008. Adaptive Natural Language Generation in Dialogue using Reinforcement Learning. In Proc. SEMDIAL 2008.
R. Sutton and A. Barto. 1998. Reinforcement Learning. MIT Press.
R. Molich and J. Nielsen. 1990. Improving a Human-Computer Dialogue. Communications of the ACM, 33(3):338–348.
E. A. Isaacs and H. H. Clark. 1987. References in Conversations between Experts and Novices. Journal of Experimental Psychology: General, 116:26–37.


References (continued)
K. Georgila, J. Henderson, and O. Lemon. 2005. Learning User Simulations for Information State Update Dialogue Systems. In Proc. Eurospeech/Interspeech 2005.
H. Cuayahuitl, S. Renals, O. Lemon, and H. Shimodaira. 2005. Human-Computer Dialogue Simulation Using Hidden Markov Models. In Proc. ASRU 2005.
H. Cuayahuitl. 2009. Hierarchical Reinforcement Learning for Spoken Dialogue Systems. Ph.D. thesis, University of Edinburgh, UK.


Extra slides


Witten-Bell discounting
N – total number of events
V – total number of distinct possible event types
T – number of observed event types
C(e) – frequency of event e

P(e) = C(e) / (N + T)            if C(e) > 0
P(e) = T / ((N + T) · (V − T))   if C(e) = 0

E.g. counts (probabilities): Provide_info (3, 0.75), other (1, 0.25), request_clarification (0, 0)
Smoothed: Provide_info (0.5), other (0.167), request_clarification (0.33)
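A direct implementation of the rule above; run on the slide's counts (N = 4, T = 2, V = 3) it reproduces the smoothed values shown.

```python
def witten_bell(counts, V):
    """Witten-Bell discounting. counts: {event: frequency}; V: number of
    distinct possible event types (observed or not)."""
    N = sum(counts.values())                       # total events
    T = sum(1 for c in counts.values() if c > 0)   # observed event types
    def p(e):
        c = counts.get(e, 0)
        if c > 0:
            return c / (N + T)                     # discounted observed mass
        return T / ((N + T) * (V - T))             # share of extracted mass
    return {e: p(e) for e in counts}

print(witten_bell({"provide_info": 3, "other": 1, "request_clarification": 0}, V=3))
# ≈ {'provide_info': 0.5, 'other': 0.167, 'request_clarification': 0.333}
```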


Modified Witten-Bell discounting
Divide the extracted mass amongst all the event types (V) instead of just the unobserved events (V − T).

Smoothed with modified Witten-Bell discounting: Provide_info (0.44), other (0.28), request_clarification (0.11)
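A literal reading of that description gives the sketch below, where the extracted mass T/(N+T) is split uniformly over all V types. This is only an inferred formulation: the slide's own example values (0.44/0.28/0.11) suggest the authors' exact variant differs in detail.

```python
def modified_witten_bell(counts, V):
    """Inferred modified Witten-Bell: redistribute the extracted mass
    T/(N+T) over ALL V event types instead of only the unseen V-T."""
    N = sum(counts.values())
    T = sum(1 for c in counts.values() if c > 0)
    share = T / ((N + T) * V)        # extracted mass split over all V types
    return {e: c / (N + T) + share for e, c in counts.items()}
```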

