Extended Hidden Vector State Parser

Jan Švec 1 and Filip Jurčíček 2

1 Center of Applied Cybernetics, Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, 306 14, Czech Republic, [email protected]
2 Cambridge University Engineering Department, Cambridge CB2 1PZ, United Kingdom, [email protected]

Abstract. The key component of a spoken dialogue system is the spoken language understanding module. There are many approaches to the design of the understanding module, and one of the most promising is statistical semantic parsing. This paper presents a combination of three modifications of the hidden vector state (HVS) parser, a very popular method for statistical semantic parsing, and shows that these modifications are almost independent. The proposed changes to the HVS parser form the extended hidden vector state parser (EHVS). The performance of the parser increases from 47.9% to 63.1% under the exact match between the reference and the hypothesis semantic trees, evaluated on the Human-Human Train Timetable corpus. In spite of the increased performance, the complexity of the EHVS parser increases only linearly; therefore, the EHVS parser preserves the simplicity and robustness of the baseline HVS parser.

1 Introduction

The goal of this paper is to briefly describe a set of modifications of the hidden vector state (HVS) parser and to show that these modifications are almost independent. Each modification used alone significantly improves the parsing performance. The idea is to incorporate all of them into a single statistical model; we expect the combined model to yield even better results.

The HVS parser consists of two statistical models, the semantic model and the lexical model (see below). In the following sections we describe three techniques that improve the performance of the parser by modifying these models. First, we use a data-driven initialization of the lexical model of the HVS parser based on negative examples which are collected automatically from the semantic corpus. Second, we deal with the inability of the HVS parser to process left-branching language structures. The baseline HVS parser uses an implicit push of a concept during a state transition, which limits the class of generated semantic trees to right-branching trees only. To overcome this constraint, we introduce an explicit push operation into the semantic model and extend the class of parseable trees to left-branching trees, right-branching trees, and their combinations.

Finally, we extend the lexical model to process a sequence of feature vectors instead of a sequence of words only.

2 Hidden Vector State Parser

The HVS parser is a statistical parser which implements a search over sequences of vector states $S = c_1, c_2, \ldots, c_T$ for the one that maximizes the posterior probability $P(S|W)$ given the word sequence $W = w_1, w_2, \ldots, w_T$. The search can be described as

$$S^* = \arg\max_S P(S|W) = \arg\max_S P(W|S)\,P(S) \qquad (1)$$

where $P(S)$ is called the semantic model and $P(W|S)$ is called the lexical model.

The HVS parser is an approximation of a pushdown automaton. The vector state in the HVS parser represents the stack of a pushdown automaton; it keeps the semantic concepts assigned to several words during parsing. The transitions between vector states are modeled by three stack operations: popping from zero to four concepts off the stack, pushing a new concept onto the stack, and generating a word. The first two operations are modeled by the semantic model, which is given by

$$P(S) = \prod_{t=1}^{T} P(pop_t \mid c_{t-1}[1,\ldots,4])\, P(c_t[1] \mid c_t[2,\ldots,4]) \qquad (2)$$

where $pop_t$ is the vector stack shift operation and takes values in the range $0, 1, \ldots, 4$. The variable $c_t$ represents the vector state consisting of four variables, the stored concepts, i.e. $c_t = [c_t[1], c_t[2], c_t[3], c_t[4]]$ (shortly $c_t[1,\ldots,4]$), where $c_t[1]$ is the preterminal concept dominating the word $w_t$ and $c_t[4]$ is the root concept. The lexical model performs the last operation, the generation of a word, and is given by

$$P(W|S) = \prod_{t=1}^{T} P(w_t \mid c_t[1,\ldots,4]) \qquad (3)$$

where $P(w_t \mid c_t[1,\ldots,4])$ is the conditional probability of observing the word $w_t$ given the state $c_t[1,\ldots,4]$. For more details about the HVS parser see [1].
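The following sketch (Python, dictionary-based probability tables) illustrates how a single candidate parse is scored under the factorization of Eqs. (1)-(3). It is not the paper's GMTK-based implementation; the table layout and the initial stack symbol are illustrative assumptions.

```python
# A minimal sketch of scoring one candidate vector-state sequence under the
# HVS factorization of Eqs. (1)-(3).  Not the paper's GMTK implementation.
import math

def score_hvs(words, states, pops, p_pop, p_concept, p_word):
    """Return log P(W|S) + log P(S) for one candidate parse.

    words     : list of words w_1 .. w_T
    states    : list of vector states, each a 4-tuple (c[1], c[2], c[3], c[4])
    pops      : list of pop counts (0..4), one per time step
    p_pop     : dict (pop_count, previous_state) -> probability
    p_concept : dict (new_concept, conditioning_concepts) -> probability
    p_word    : dict (word, state) -> probability
    """
    log_p = 0.0
    prev_state = ("sent_start",) * 4          # assumed initial stack contents
    for w, c, pop in zip(words, states, pops):
        # Semantic model, Eq. (2): pop operation, then push of the new c[1].
        log_p += math.log(p_pop.get((pop, prev_state), 1e-12))
        log_p += math.log(p_concept.get((c[0], c[1:]), 1e-12))
        # Lexical model, Eq. (3): generate the word from the full vector state.
        log_p += math.log(p_word.get((w, c), 1e-12))
        prev_state = c
    return log_p
```

The search for $S^*$ in Eq. (1) then amounts to evaluating this score over all admissible vector-state sequences (in practice by dynamic programming) and keeping the best one.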

3 Negative Examples

In this section we briefly describe the first modification of the HVS parser. It is based on so-called negative examples, which are automatically collected from the semantically annotated corpus. The negative examples are then used to initialize the lexical model of the HVS parser. First, we define a positive example. Then we negate the meaning of the positive example to obtain the definition of a negative example. Finally, we describe the use of negative examples during the initialization of the HVS parser.

In this paper, a positive example is a pair (w, c) of a word and a semantic concept, and it says: the word w can be observed with a vector state containing the concept c. For the utterance jede nějaký spěšný vlak do Prahy kolem čtvrté odpoledne (Lit.: does any express train go to Prague around four p.m.) with the semantic annotation DEPARTURE(TO(STATION), TIME), one of many positive examples is the pair (Prahy, STATION). Another positive example for the word Prahy (Lit.: Prague, the capital city of the Czech Republic) is the pair (Prahy, TIME).

A negative example, similarly to a positive example, is a pair of a word w and a semantic concept c. However, the negative example says: the word w is not observed together with a vector state containing the concept c. In other words, a negative example is a pair of a word and a concept that do not appear together in any utterance in the training corpus. Take for example the utterance jede nějaký spěšný vlak do Prahy kolem čtvrté odpoledne (Lit.: does any express train go at four p.m.) with the semantic annotation DEPARTURE(TIME). We can see that the word jede (Lit.: does go) is not generated by a vector state containing the concept STATION. Therefore, the pair (jede, STATION) is a negative example.

We analyzed the concepts defined in our semantic corpus (see Section 6) and found five concepts suitable for the extraction of negative examples: STATION, TRAIN TYPE, AMOUNT, LENGTH, and NUMBER. These concepts were selected because they are strongly related to their word realizations; in other words, the set of all possible words with the meaning STATION is finite and well defined.

Not all utterances are suitable for the extraction of negative examples. For instance, if we used the utterance dnes je příjemné počasí v Praze (Lit.: the weather is pleasant in Prague today) with the semantic annotation OTHER INFO, we would have to conclude that the word Praze is not generated by a vector state [STATION, ...] because the semantics does not contain the concept STATION. However, the word Praze (Lit.: in Prague) is related to the concept STATION.(1) To select suitable utterances, we use only the utterances containing the following top-level concepts: ACCEPT, ARRIVAL, DELAY, DEPARTURE, DISTANCE, DURATION, PLATFORM, PRICE, and REJECT, because only these concepts can be parents of the suitable leaf concepts. More details on the extraction of negative examples can be found in [2].

The negative examples give us much less information than the positive ones; we have to collect several negative examples to gain the information equal to one positive example. However, using such information brings a significant performance improvement. To utilize the negative examples, we modify the initialization phase of the lexical model. We still initialize the lexical model uniformly; however, at the same time, we penalize the probability of observing the word w given the vector state c according to the collected negative examples:

$$p(w, c[1,\ldots,4]) = \begin{cases} \epsilon & \text{if } (w, c[1]) \text{ is a negative example,} \\ 1/|V| & \text{otherwise} \end{cases}$$

$$P(w \mid c[1,\ldots,4]) = \frac{p(w, c[1,\ldots,4])}{\sum_{w' \in V} p(w', c[1,\ldots,4])} \qquad \forall w \in V \qquad (4)$$

where $\epsilon$ is a reasonably small positive value and $V$ is the word lexicon. We found that it is better to use a non-zero value for $\epsilon$ because the extraction of negative examples is not errorless and the parser training algorithm (a kind of EM) can cope with such errors.

(1) In this example, the concept STATION does not have to appear in the semantics because the annotation OTHER INFO is very general and covers many meanings, e.g. STATION as well.
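As a concrete illustration of Eq. (4), the following sketch initializes the lexical model uniformly and penalizes the pairs collected as negative examples before renormalizing. The function and variable names are hypothetical and the value of epsilon is only an example.

```python
# A sketch of the lexical-model initialization of Eq. (4): uniform 1/|V|
# for every word, a small epsilon for word/concept pairs collected as
# negative examples, followed by renormalization over the vocabulary.

def init_lexical_model(vocabulary, states, negative_examples, epsilon=1e-4):
    """Return P(w | c[1..4]) as {state: {word: probability}}.

    negative_examples is a set of (word, preterminal_concept) pairs.
    """
    model = {}
    for state in states:                       # state = (c[1], c[2], c[3], c[4])
        preterminal = state[0]                 # c[1], the concept dominating w_t
        unnormalized = {
            w: epsilon if (w, preterminal) in negative_examples
               else 1.0 / len(vocabulary)
            for w in vocabulary
        }
        norm = sum(unnormalized.values())      # denominator of Eq. (4)
        model[state] = {w: p / norm for w, p in unnormalized.items()}
    return model
```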

4 Left Branching

In this section, we describe the semantic-model modification which enables the HVS parser to generate not only right-branching parse trees but also left-branching parse trees and their combinations. The resulting model is called the left-right-branching HVS (LRB-HVS) parser [3].

Left-branching parse trees are generated by pushing more than one concept onto the stack at the same time (the baseline HVS parser pushes only one concept at a time). We analyzed the errors of the baseline HVS model and did not find any error caused by the inability to push more than two concepts. Therefore, we limited the number of concepts inserted onto the stack at the same time to two; in general, it is straightforward to allow more than two pushed concepts. To control the number of concepts pushed onto the stack at the same time, we introduced a new hidden variable $push_t$ into the HVS parser:

$$P(S) = \prod_{t=1}^{T} P(pop_t \mid c_{t-1}[1,\ldots,4])\, P(push_t \mid c_{t-1}[1,\ldots,4]) \cdot \begin{cases} 1 & \text{if } push_t = 0 \\ P(c_t[1] \mid c_t[2,\ldots,4]) & \text{if } push_t = 1 \\ P(c_t[1] \mid c_t[2,\ldots,4])\, P(c_t[2] \mid c_t[3,4]) & \text{if } push_t = 2 \end{cases} \qquad (5)$$

In the case of inserting two concepts onto the stack ($push_t = 2$), we approximate the probability $P(c_t[1,2] \mid c_t[3,4])$ by $P(c_t[1] \mid c_t[2,\ldots,4])\, P(c_t[2] \mid c_t[3,4])$ in order to obtain a more robust semantic model $P(S)$.

To illustrate the difference between right-branching and left-right-branching, we can use the utterance dneska večer to jede v šestnáct třicet (Lit.: today, in the evening, it goes at four thirty p.m.). The incorrect parse tree (Figure 1, left) is represented by the semantic annotation TIME, DEPARTURE, TIME. Such a parse tree would be the output of the baseline HVS parser, which generates right-branching parse trees only; the parser is not able to push more than one concept onto the stack. The correct parse tree (Figure 1, right) is represented by the semantic annotation DEPARTURE(TIME, ..., TIME). The correct parse tree would be the output of the LRB-HVS parser because it is able to push the two concepts DEPARTURE and TIME onto the stack at the same time, so that the first word dneska (Lit.: today) can be labeled with the hidden vector state [TIME, DEPARTURE].

Fig. 1. Incorrect (left) and correct (right) parse trees of a left-branching language structure. Input: dneska večer to jede v šestnáct třicet (Lit.: today, in the evening, it goes at four thirty p.m.).
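A minimal sketch of the per-step semantic-model factor of Eq. (5) is given below; the probability tables p_pop, p_push and p_concept are assumed to be plain dictionaries, which is an illustrative simplification, not the trained model itself.

```python
# A sketch of one factor of Eq. (5): the pop and push probabilities and the
# push-dependent concept term of the LRB-HVS semantic model.

def semantic_step(prev_state, state, pop, push, p_pop, p_push, p_concept):
    """P(pop_t | c_{t-1}) * P(push_t | c_{t-1}) * push-dependent factor.

    p_concept maps (new_concept, conditioning_concepts) -> probability, so
    it covers both P(c[1] | c[2..4]) and P(c[2] | c[3..4]).
    """
    prob = p_pop[(pop, prev_state)] * p_push[(push, prev_state)]
    if push == 0:
        return prob                                      # nothing is pushed
    if push == 1:
        return prob * p_concept[(state[0], state[1:])]   # P(c[1] | c[2],c[3],c[4])
    # push == 2: the joint push probability is approximated by a product
    return (prob
            * p_concept[(state[0], state[1:])]           # P(c[1] | c[2],c[3],c[4])
            * p_concept[(state[1], state[2:])])          # P(c[2] | c[3],c[4])
```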

5 Input Feature Vector

The input parameterization extends the HVS parser into a more general HVS parser with an input feature vector (HVS-IFV parser) [4]. This parser uses a sequence of feature vectors $F = (f_1, \ldots, f_T)$ instead of a sequence of words $W$. The feature vector is defined as $f_t = (f_t[1], f_t[2], \ldots, f_t[N])$; every word $w_t$ is assigned a fixed set of $N$ features. If we use the feature vector $f_t$ instead of the word $w_t$ in Eq. 3, the lexical model changes as follows:

$$P(F|S) = \prod_{t=1}^{T} P(f_t \mid c_t) = \prod_{t=1}^{T} P(f_t[1], f_t[2], \ldots, f_t[N] \mid c_t) \qquad (6)$$

To avoid the data sparsity problem, we use the assumption of conditional independence of the features $f_t[i]$ and $f_t[j]$, $i \neq j$, given the vector state $c_t$. The same kind of assumption is used, for example, in the naive Bayes classifier. The lexical model of the HVS-IFV parser is then given by:

$$P(F|S) = \prod_{t=1}^{T} \prod_{i=1}^{N} P(f_t[i] \mid c_t) \qquad (7)$$

Because the conditional independence assumption can hardly be expected to always hold, we modified the search process defined in Eq. 1. Assume that we have a sequence of feature vectors $F = (f_t[1], f_t[2])_{t=1}^{T}$ where $f_t[1] = f_t[2]$ for every time step $t$. Then the lexical model is given by $P(F|S) = \prod_{t=1}^{T} [P(f_t[1] \mid c_t)]^2$. As we can see, the probability $P(F|S)$ is exponentially scaled with the factor 2, which causes an imbalance between the lexical and the semantic model. Therefore, we use a scaling factor $\lambda$ to compensate for the error caused by the assumption of conditional independence:

$$S^* = \arg\max_S P(F|S)\, P^{\lambda}(S) \qquad (8)$$

The HVS-IFV parser is then defined by equations 5, 7, and 8. The optimal value of $\lambda$ was found by maximizing the concept accuracy measure defined in Section 6.1 on development data. In our experiments, the feature set consists of two linguistic features: a lemma and a morphological tag assigned to the original word.
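The sketch below shows how the HVS-IFV lexical model of Eq. (7) and the scaled criterion of Eq. (8) could be evaluated. Feature extraction (lemma, morphological tag) and the probability table p_feature are assumed to exist elsewhere; the names are illustrative only.

```python
# A sketch of the feature-vector lexical model of Eq. (7) and the scaled
# search criterion of Eq. (8), both in the log domain.
import math

def log_lexical_ifv(feature_vectors, states, p_feature):
    """log P(F|S) under the conditional independence assumption of Eq. (7).

    feature_vectors : list of tuples f_t = (f_t[1], ..., f_t[N]), e.g. (lemma, tag)
    states          : list of vector states c_t
    p_feature       : dict (feature_index, feature_value, state) -> probability
    """
    log_p = 0.0
    for f, c in zip(feature_vectors, states):
        for i, value in enumerate(f):
            log_p += math.log(p_feature.get((i, value, c), 1e-12))
    return log_p

def scaled_score(log_lexical, log_semantic, lam):
    """Eq. (8) in the log domain: log P(F|S) + lambda * log P(S)."""
    return log_lexical + lam * log_semantic
```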

Fig. 2. Graphical models of a transition between two consecutive vector states for the original HVS model (left) and the HVS-IFV model (right)

6 Experiments

The semantic parsers described in this paper were trained and tested on the Czech human-human train timetable (HHTT) dialog corpus [5]. The HHTT corpus consists of 1,109 dialogs completely annotated with semantic annotations; both operators and users have been annotated. It contains 17,900 utterances in total, the vocabulary size is 2,872 words, and there are 35 semantic concepts. The dialogs were divided into training data (798 dialogs, 12,972 segments, 72%), development data (88 dialogs, 1,418 segments, 8%), and test data (223 dialogs, 3,510 segments, 20%). Each segment is assigned exactly one abstract semantic annotation.

The training of the semantic and the lexical models of the HVS parser is divided into three parts: (1) initialization, (2) estimation, and (3) smoothing. All probabilities are initialized uniformly; however, the negative examples are used to alter the probabilities in the lexical model. To estimate the parameters of the models, it is necessary to use the expectation-maximization (EM) algorithm because the abstract semantic annotations do not provide full parse trees. We use a simple back-off model to smooth the probabilities. To build the semantic parser, we use the Graphical Models Toolkit (GMTK, see Figure 2) [6].

We evaluate our experiments using two measures: semantic accuracy and concept accuracy. These measures compare the reference tree (human annotated) with the hypothesis tree (parser output).

6.1 Performance Measures

When computing the semantic accuracy, the reference and the hypothesis annotations are considered equal only if they exactly match each other. The semantic accuracy of a model output is defined as $SAcc = \frac{E}{N} \cdot 100\%$, where $N$ is the number of evaluated semantics and $E$ is the number of hypothesis semantics which exactly match the reference.

The exact match is a very tough standard; it does not measure fine differences between similar semantics. Therefore, we introduced the concept accuracy. Similarity scores between the reference and the hypothesis semantics can be computed by a tree edit distance algorithm [7]. The tree edit distance algorithm uses dynamic programming to find the minimum number of substitutions ($S$), deletions ($D$), and insertions ($I$) required to transform one semantic tree into another. The operations act on nodes and modify the tree by changing parent/child relationships of the given trees. The concept accuracy of a model output is defined as $CAcc = \frac{N - S - D - I}{N} \cdot 100\%$, where $N$ is the total number of concepts in the corresponding reference semantics.

6.2 Results

Table 1 and Table 2 show the results achieved by the different modifications of the baseline HVS parser on the development and test data. To measure the statistical significance we use the paired t-test; a p-value < 0.01 indicates a significant difference. The baseline HVS parser corresponds to the implementation of He and Young [1]. The "negative examples" method for the HVS parser initialization uses the negative examples extracted for the concepts AMOUNT, LENGTH, NUMBER, STATION, and TRAIN TYPE. The LRB-HVS parser combines the extension which allows the generation of left-right-branching parse trees with the "negative examples" method. The HVS-IFV parser then adds the ability to parse a sequence of feature vectors instead of a single sequence of words; according to [4], the feature vector consists of two features, the original word and its corresponding lemma. Finally, the EHVS parser corresponds to the developed extended hidden vector state parser, which combines all of the modifications and yields the best results.

Table 1. Performance of parsers evaluated on the development data

Parser type               SAcc [%]   CAcc [%]   p-value
HVS (baseline)              50.7       64.3        -
HVS with neg. examples      52.8       67.0     < 0.01
LRB-HVS                     60.1       70.6     < 0.01
HVS-IFV                     58.2       73.1     < 0.01
EHVS                        65.4       75.7     < 0.01

Table 2. Performance of parsers evaluated on the test data

Parser type               SAcc [%]   CAcc [%]   p-value
HVS (baseline)              47.9       63.2        -
HVS with neg. examples      50.4       64.9     < 0.01
LRB-HVS                     58.3       69.3     < 0.01
HVS-IFV                     57.0       69.4     < 0.01
EHVS                        63.1       73.8     < 0.01
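To make the two measures of Section 6.1 concrete, the following sketch computes SAcc from exact matches and CAcc from the substitution, deletion and insertion counts produced by a tree edit distance algorithm such as [7]; the edit-distance computation itself is assumed to be provided externally.

```python
# A sketch of the evaluation measures from Section 6.1.  The S, D, I counts
# are assumed to come from an external tree edit distance implementation.

def semantic_accuracy(references, hypotheses):
    """SAcc = E / N * 100%: exact match of whole semantic annotations."""
    exact = sum(1 for ref, hyp in zip(references, hypotheses) if ref == hyp)
    return 100.0 * exact / len(references)

def concept_accuracy(num_ref_concepts, substitutions, deletions, insertions):
    """CAcc = (N - S - D - I) / N * 100%, from tree-edit-distance counts."""
    return 100.0 * (num_ref_concepts - substitutions - deletions - insertions) \
           / num_ref_concepts
```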

7 Conclusions and Future Work

In this paper we presented a combination of three previously published methods for improving the performance of the HVS parser. We used a data-driven initialization of the lexical model based on negative examples. We also modified the semantic model to support a wider class of semantic trees; the modified parser can generate left-branching, right-branching, and left-right-branching parse trees. The last modification extends the lexical model so that it parses a sequence of feature vectors instead of a sequence of single words; we used the lemma and the morphological tag of the original word as features. Each of these modifications alone significantly improves the performance of the baseline HVS parser.

The key result of this paper is that these modifications (negative examples, left-right branching, and the input feature vector) are almost independent: the performance gain of the resulting extended hidden vector state parser (EHVS) is composed of the performance gains of the individual modifications. All in all, we improved the performance of the parser from 47.9% to 63.1% in SAcc and from 63.2% to 73.8% in CAcc measured on the test data. The absolute improvement achieved by the suggested modifications of the original HVS parser is about 15% in SAcc and more than 10% in CAcc.

Acknowledgment

This work was supported by the Ministry of Education of the Czech Republic under project No. 1M0567 (CAK).

References

1. He, Y., Young, S.: Semantic processing using the hidden vector state model. Computer Speech and Language 19(1), 85-106 (2005)
2. Jurčíček, F.: Statistical approach to the semantic analysis of spoken dialogues. Ph.D. thesis, University of West Bohemia (2007)
3. Jurčíček, F., Švec, J., Müller, L.: Extension of HVS Semantic Parser by Allowing Left-Right Branching. In: Proc. IEEE ICASSP (2008)
4. Švec, J., Jurčíček, F., Müller, L.: Input Parameterization of the HVS Semantic Parser. In: Proceedings of TSD, Pilsen, Czech Republic (2007)
5. Jurčíček, F., Zahradil, J., Jelínek, L.: A Human-Human Train Timetable Dialogue Corpus. In: Proceedings of Interspeech, Lisboa, Portugal (2005)
6. Bilmes, J., Zweig, G.: The graphical models toolkit: An open source software system for speech and time-series processing. In: Proc. IEEE ICASSP (2002)
7. Klein, P.: Computing the edit-distance between unrooted ordered trees. In: Proceedings of the 6th Annual European Symposium, Venice, Italy. Springer, Berlin (1998)
