Dealing with Out of Domain Questions in Virtual Characters

Ronakkumar Patel, Anton Leuski and David Traum
Institute for Creative Technologies, University of Southern California
Marina Del Rey, CA 90292, USA

[email protected], [email protected], [email protected]

Abstract

We consider the problem of designing virtual characters that support speech-based interaction in a limited domain. Previously we showed that text classification can be both an effective and a robust tool for selecting the appropriate in-domain responses. In this paper, we consider the problem of dealing with out-of-domain user questions. We introduce a taxonomy of out-of-domain responses and consider three classification architectures for selecting the most appropriate out-of-domain response. We evaluate these architectures and show that they significantly improve the quality of response selection, making the user's interaction with the virtual character more natural and engaging.

Keywords: Classifier Architecture, Virtual Human, Design, Experimentation, Human Factors, HCI

1. Introduction

In the recent Hollywood movie "I, Robot", set in 2035, the main character, played by Will Smith, is running an investigation into the death of an old friend. The detective finds a small device that projects a holographic image of the deceased. The device delivers a recorded message and responds to questions by playing back prerecorded answers. The goal of the device is to help the detective in his investigation by providing useful information. When the detective asks a question irrelevant to the task, the character responds with the sentence: "I am sorry. My responses are limited. You must ask the right question", indicating that the detective has ventured out of the domain. Our aim is to develop virtual characters with similar capabilities, which can deal with off-topic questions gracefully. We are targeting applications like training, education, and entertainment for this kind of virtual character. For use in education, such a character should be able to deliver a message to the student on a specific

topic. It should also be able to support a basic spoken dialog on the subject of the message, e.g., answer questions about the message topic and give additional explanations. In this paper we describe one of these characters, called "SGT Blackwell", that serves as the interface for an exhibition kiosk. It supports natural language dialog, allowing the user to explore the system design and get an overview of the exhibition. We describe the system setup and the different architectures used to analyze and select the appropriate character response. The three architectures we designed are as follows: the first has a single classifier that can deliver either an on-topic or an off-topic answer; the second has two classifiers, one that deals with on-topic questions and one that deals with off-topic questions; the last architecture has three classifiers. We compare these architectures with our baseline system, which has a single classifier with one class for all off-topic answers. In the next section we outline the SGT Blackwell system setup. In Section 3 we discuss the different classifier architectures, focusing on how they relate to each other. In Section 4 we present the results of off-line experiments showing that the architecture with three classifiers yields a significant improvement in answer quality. Finally, in Section 5 we summarize our results and outline some directions for future work.

2. SGT Blackwell

A user talks to SGT Blackwell using a head-mounted microphone. The speech is converted into text by an automatic speech recognition (ASR) system. The classifier analyzes the text output from the ASR and selects the appropriate answer. The classifier is based on statistical language modeling techniques used in cross-lingual information retrieval. It represents a text string with a language model -- a probability distribution over

the words in the string. The classifier views questions and answers as samples from two different "languages": the language of questions and the language of answers. Given an input question from the user, the classifier calculates the language model of the most likely answer for the question -- it uses the training data as a dictionary to "translate" the question into an answer -- then it compares that model to the language models of the individual answers and selects the best matching response. Leuski and his colleagues [2] showed that this technique outperforms traditional text classification approaches such as support vector machines for tasks that have a large number of response classes.
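The core idea -- rank answers by comparing language models rather than by direct keyword match -- can be illustrated with a toy sketch. This is not the authors' actual model (which is a translation-based cross-lingual retrieval model trained on question-answer pairs); the sketch below simply builds additively smoothed unigram models and ranks candidate answers by KL divergence from the question's model. All function names and the smoothing scheme are our own simplifications.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; punctuation is discarded."""
    return re.findall(r"[a-z]+", text.lower())

def unigram_lm(text, vocab, smoothing=0.5):
    """Additively smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens(text))
    total = sum(counts.values())
    return {w: (counts[w] + smoothing) / (total + smoothing * len(vocab))
            for w in vocab}

def kl_divergence(p, q):
    """KL(p || q); both are distributions over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def rank_answers(question, answers):
    """Rank candidate answers so that the one whose language model is
    closest to the question's model comes first -- a crude stand-in for
    the paper's question-to-answer 'translation' step."""
    vocab = set(tokens(question)).union(*(set(tokens(a)) for a in answers))
    q_lm = unigram_lm(question, vocab)
    return sorted(answers,
                  key=lambda a: kl_divergence(q_lm, unigram_lm(a, vocab)))

answers = ["I am a virtual soldier built at ICT.",
           "The exhibition opens at nine."]
print(rank_answers("Who built you, soldier?", answers)[0])
```

Because the question shares the words "built" and "soldier" with the first answer, that answer's model diverges least from the question's model and is ranked first.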

2.1 Data Set

We have a collection of questions for SGT Blackwell, each linked to a set of appropriate answers. The initial one or two questions were given to us by our script writer; we further expanded the question set by paraphrasing the initial questions and by collecting questions from users while simulating the final system in a Wizard of Oz study. Once the raw data was collected, we used human coders to link the questions to appropriate answers using a 1-6 scale [3] (see Appendix B). There are 1572 questions in the collection, linked to 82 answers.

SGT Blackwell's responses include 82 spoken lines, ranging from one word to a couple of paragraphs. There are four kinds of responses it can deliver: content, prompts, off-topic, and comments [1]. In this study we focus only on the content and off-topic responses. The 59 content answers cover the identity of the character, its origin, its purpose, its animation technology, its design, and some miscellaneous topics, such as "what time is it?" and "where can I get my coffee?" When SGT Blackwell detects a question that cannot be answered with one of the content (on-topic) answers, it picks a random answer from the pool of 13 off-topic answers, indicating that the user has stepped out of the domain.

Figure 1: A photograph of the SGT Blackwell system setup.

3. Classifier Architectures

Figure 2 shows the design of the original system, which serves as the baseline for our study. It has one classifier, which delivers an on-topic or off-topic answer based on the input question. This design assumes that all off-topic answers are equally appropriate to any off-topic question, so we put them into one class. Thus the baseline has 60 response classes: 59 for on-topic answers and 1 for all off-topic answers.

Figure 2: Baseline classifier architecture.

The problem with this approach is that even among off-topic answers some answers can be more appropriate than others. Let us clarify this with an example.

Question: "What color are my eyes?"
  Answer given by baseline: "I am not at liberty to discuss that."
  Improvement: "You might want to put that one to a real human."
Question: "How is the weather?"
  Answer given by baseline: "I can tell you but I would have to kill you."
  Improvement: "I would like to know that too."

Table 1: Improving the responses to off-topic questions.

In column 2 of Table 1 SGT Blackwell answers the off-topic questions randomly. It had better answers in its pool, but because of the random selection it could not respond with them, and we end up with less appropriate answers. We asked our human coders to score the appropriateness of the question-answer pairs in Table 1 on the same 1-6 scale [3]. The answers given by the baseline classifier scored 4, while the answers in the Improvement column scored 6, which shows that there is room for improvement over the baseline.

3.1 Disjoint Classes for Off-topic Answers

In order to achieve a better conversation with a virtual character, one should design it in such a way that the classifier can categorize the off-topic questions and deliver an answer based on that category. So for this study we came up with 8 disjoint classes for the off-topic answers, instead of only one. Each off-topic answer is assigned to exactly one of the 8 classes, so each class has its own specific set of answers (see Appendix A). The 8 disjoint classes are as follows:

1. Don't understand - questions that do not make any sense
2. Unknown - understood, but the answer is not known
3. Out of domain - never heard before, not in the knowledge base
4. Restriction - known, but cannot be told
5. Pass - go ask someone else
6. Leave to human - a human perspective is required
7. Negative
8. Positive

The two classes of questions that are most important to distinguish are Unknown and Out of domain. Questions that expect a specific answer, as well as questions about things mentioned in the character's previous answers for which it has no explanation, fall in the Unknown class, e.g., "what does ASSALT stand for?" As one of the on-topic answers contains the word ASSALT, a human subject might want to know about it after hearing that answer. Questions that are not in the domain and are open-ended fall in the Out of domain category, e.g., "what do you do in your spare time?"

For the training set we used three coders to do the linking between questions and answers. In the end our collection comprised 1000 on-topic questions and 300 off-topic questions. The two question sets were disjoint, while the answer set had 59 on-topic answers and 8 off-topic classes, for 67 response classes in total. Using this training set we created two small classifiers, one for on-topic questions and one for off-topic questions. The on-topic classifier takes an on-topic question as input and outputs an on-topic answer; its training set maps the 1000 on-topic questions to the 59 on-topic answers. The off-topic classifier takes an off-topic question as input and outputs an off-topic class; its training set maps the 300 off-topic questions to the 8 off-topic classes. These two small classifiers lead us to define three different architectures, which we describe one by one.

Figure 3: First architecture, with one classifier.

In a real situation the system is likely to encounter both on-topic and off-topic questions. Our first architecture has one classifier with the capabilities of both the on-topic and the off-topic classifiers (see Figure 3). Its training set contains 1300 questions and 67 answer classes: the 1000 on-topic questions are mapped to the 59 on-topic answers, and the 300 off-topic questions to the 8 off-topic classes. This mapping is used to train the classifier.

In the second architecture we use two classifiers. The first is the on-topic classifier with one more answer, "off-topic", in its answer set (see Figure 4). Its training set contains 1300 questions and 60 answer classes: the 1000 on-topic questions are mapped to the 59 on-topic answers, and all off-topic questions to the single answer "off-topic". The second is the off-topic classifier. A question is given as input to the first classifier; if the answer is "off-topic", the question is sent through the second classifier, which produces the class for the off-topic question. The final answer is selected randomly from that class.
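This two-stage routing can be sketched in a few lines. The keyword "classifiers" below are toy stand-ins for the paper's trained language-model classifiers, and the class pools are abbreviated from Appendix A; all keyword lists are our own illustrative assumptions, not the paper's training data.

```python
import random

def classify(question, model):
    """Hypothetical stand-in for a trained text classifier: return the
    label whose keyword list best overlaps the question (substring match).
    Ties are broken arbitrarily, which is fine for a sketch."""
    q = question.lower()
    scored = [(sum(kw in q for kw in keywords), label)
              for label, keywords in model.items()]
    return max(scored)[1]

# First-stage model: on-topic classes plus one umbrella "off-topic" label
# (the real system has 59 on-topic classes; two shown here).
stage1 = {
    "origin": ["built", "made", "created"],
    "purpose": ["purpose", "why", "exhibition"],
    "off-topic": ["weather", "eyes", "spare"],
}

# Second-stage model: disjoint off-topic classes (two of the paper's 8),
# each with its own pool of prerecorded answers.
stage2 = {
    "leave-to-human": ["eyes", "color"],
    "unknown": ["weather"],
}
pools = {
    "leave-to-human": ["You might want to put that one to a real human."],
    "unknown": ["I would like to know that too"],
}

def answer(question):
    label = classify(question, stage1)
    if label != "off-topic":
        return label  # in the real system: deliver an on-topic answer
    off_class = classify(question, stage2)
    return random.choice(pools[off_class])

print(answer("How is the weather?"))
print(answer("What color are my eyes?"))
```

Only questions routed to the "off-topic" umbrella ever reach the second classifier, which is why the second stage can be trained on the 300 off-topic questions alone.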

Figure 4: Second architecture, with two classifiers.

The third architecture is an extension of the second one. Here we use three classifiers (see Figure 5). The first is a binary classifier that identifies whether a question is on-topic or off-topic; to train it, we linked all 1300 questions to one of the two answers "on-topic" or "off-topic". The other two classifiers are the on-topic and off-topic classifiers. We pass the question through the first classifier. If the question is classified as "on-topic" we pass it through the on-topic classifier, and if the answer is "off-topic" it goes through the off-topic classifier.

Figure 5: Third architecture, with three classifiers.

4. Results

The purpose of this study is to tackle two questions: 1) Is it really helpful to divide the off-topic answers into disjoint classes? 2) Which of the three architectures is best?

Our test set had 150 questions, none of which was included in the training set. Of the 150 questions, 100 were on-topic and 50 were off-topic. This ratio of one third off-topic questions was derived from our previous data [2]. We ran this test set through all three architectures and the baseline, obtaining 150 question-answer pairs for each.

                                Baseline   1 classifier   2 classifiers   3 classifiers
Avg. score                        3.92         3.89           4.16            4.63
Avg. score (on-topic)             4.58         4.44           4.65            5.14
Avg. score (off-topic)            2.59         2.77           3.17            3.62
Error on on-topic questions       0.05         0.03           0.05            0.03
Error on off-topic questions      0.07         0.09           0.07            0.13

Table 2: Average appropriateness scores and errors.

In order to measure the performance of the system, we then used three human raters to judge the appropriateness of all four question-answer sets. Using a scale of 1-6, each rater judged the appropriateness of SGT Blackwell's answers to the questions in the test set. We evaluated the agreement between raters by computing Cronbach's alpha score, which measures consistency in the data. The alpha score is 0.885 for the baseline and 0.849, 0.781, and 0.835 for the first, second, and third architectures respectively, which indicates high consistency among the raters. The average appropriateness scores for all four architectures are displayed in Table 2. Row 1 shows that the score increases by 18% in the architecture with three classifiers compared to the baseline. We can also see a gradual increase in the average appropriateness score for off-topic questions, so we can say that dividing the off-topic answers into disjoint classes is helpful. The differences in the scores are statistically significant according to a pairwise t-test with the cutoff set to 5%, except for the difference between the baseline and the first architecture with one classifier. The last two rows of Table 2 show the misclassification rates on the test set questions: row 4 displays the rate of producing an off-topic answer to an on-topic question, and row 5 the rate of producing an on-topic answer to an off-topic question. The error in row 5 for the architecture with three classifiers is higher than for the architecture with one classifier, yet it still has a higher average score for off-topic questions. This error is due to the binary classifier, which misidentifies 13% of the off-topic questions.
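For reference, Cronbach's alpha for k raters who each score the same n question-answer pairs is alpha = k/(k-1) * (1 - sum of per-rater variances / variance of the per-item totals). A minimal self-contained sketch, using sample variances and treating the raters as the "items" of the scale:

```python
def cronbach_alpha(ratings):
    """ratings: one list of scores per rater, all over the same items.
    Returns Cronbach's alpha, a measure of rater consistency."""
    k = len(ratings)        # number of raters
    n = len(ratings[0])     # number of rated question-answer pairs

    def variance(xs):
        """Sample variance (n - 1 denominator)."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    rater_vars = sum(variance(r) for r in ratings)
    totals = [sum(r[i] for r in ratings) for i in range(n)]
    return k / (k - 1) * (1 - rater_vars / variance(totals))

# Two raters in perfect agreement give alpha = 1.0;
# partial agreement gives a lower value.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))   # 1.0
print(cronbach_alpha([[1, 2, 3], [1, 3, 2]]))   # ~0.667
```

Values near 1, like the 0.78-0.89 scores reported above, indicate that the raters applied the 1-6 scale consistently.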

5. Conclusion and Future Work

In this paper we first presented 8 disjoint classes for off-topic answers, which helped us deal with off-topic questions more effectively. Second, we showed that the architecture with three classifiers outperforms the others in terms of appropriateness score, making the user's interaction with the virtual character more natural and engaging. There is still room for improvement: first, we need to collect more data to train the classifiers; second, we plan to improve the binary classifier in the architecture with three classifiers. Finally, we may use methods such as support vector machines, which are better suited for binary classification.

References

1. Anton Leuski, Jarrell Pair, David Traum, Peter J. McNerney, Panayiotis Georgiou, and Ronakkumar Patel. 2006. How to talk to a hologram. In Proceedings of the 11th International Conference on Intelligent User Interfaces (IUI '06).
2. Anton Leuski, Ronakkumar Patel, David Traum, and Brandon Kennedy. 2006. Building effective question answering characters. In Proceedings of SIGDIAL '06.
3. Sudeep Gandhe, Andrew S. Gordon, and David Traum. 2006. Improving question-answering with linking dialogues. In Proceedings of the 11th International Conference on Intelligent User Interfaces (IUI '06), pages 369-371, New York, NY, USA. ACM Press.

A. Off-topic classes and their pools of answers

Don't Understand: "Sorry, I can't hear you"; "I can't understand you. Stop mumbling. Just kidding. I didn't get that."
Leave to Human: "You might want to put that one to a real human."
Negative: "No"; "Negative, sir"
Out of Domain: "I don't have that information. Sorry."
Pass: "That's outside my AO. You'll have to talk to the PAO on that one."
Positive: "Yes"; "Roger"
Restriction: "That's classified."; "I am not authorized to comment on that"; "No comment"; "I'm not at liberty to discuss."; "I can tell you... but I would have to kill you (smirks)"
Unknown: "I would like to know that too"

B. Appropriateness grading

Gandhe and his colleagues [3] suggested the following grading scheme, which we used in our evaluation:

Grade 1: Response is not related in any way to the question.
Grade 2: Response contains some discussion of people or objects mentioned in the question, but does not really address the question itself.
Grade 3: Response partially addresses the question, but with little or no coherence between the question and response.
Grade 4: Response mostly addresses the question, but with major problems in the coherence between question and response; it seems like the response is really addressing a different question than the one asked.
Grade 5: Response does address the question, but the transition is somewhat awkward.
Grade 6: Response answers the question in a perfectly fluent manner.
