Development of Spoken Language Model for Automatic Railway Reservation Form Filling

Bibhu Prasad Mishra, under the guidance of Dr. Samudravijaya K
Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005

1 Introduction

In this age, technology has become such an integral part of our lives that it is hard to imagine life without it. It has permeated almost every sphere of human activity and has made life simpler, easier and faster; tasks that earlier took a long time are now accomplished within seconds. One of the most important achievements of mankind is the development of computers, which can perform millions of calculations in a second. Despite this prowess, computers fail badly when it comes to pattern recognition and reasoning. They are nowhere near humans at, say, detecting another human by face or recognizing a person by his voice or his gait. Still, in recent years research has shown promise in making computers understand and perceive patterns the way humans do. One such effort is the spoken language understanding system. Spoken language understanding consists of two parts: a speech model for converting the speech audio signal into the most probable sentences, and a component that extracts the relevant information from those sentences, with respect to the context, using the language model. Such systems have applications in many situations. Suppose, for example, that you want a certain application on your computer to start and perform a certain job. Imagine how much easier it would be to simply command the computer to do so, without having to touch the keyboard or the mouse. The computer would understand what the user wants and do his bidding without the user having to do anything; and because the process is automated, the user need not worry about clicking or typing something wrong and causing some other application to start. This is one of the simplest uses of a spoken language understanding system.

In this report, a system is presented for automatic railway reservation form filling without the help of a human operator. The language used for training and modelling is Hindi. The system interacts with the user, fills the form and issues the ticket. This report deals specifically with how the system extracts information from its dialogues with the user, without going into how it synthesizes speech or how it manages the dialogues. It describes the method by which the data was collected and how the acoustic and language models were built [1].

2 Methods used for data collection

The data collected for building the acoustic model and the language model is to be used for training a system for railway reservation form filling, where the user may not speak grammatically correct sentences and people with various accents will come to book tickets. It is therefore essential that the models be trained to handle such situations, and so the way the data is collected is very important to the building of the system. In our case, the data has been collected using the Wizard of Oz method [2]. Data obtained from a person speaking spontaneously differs in many ways from speech data collected by requesting a person to read a text or to repeat a sentence relevant to the situation. When a person is asked to read, he is less prone to speaking grammatically incorrect sentences and speech disfluencies are much rarer; in such cases the noise is generally lower as well. Spontaneous speech is quite a different case: the person may produce grammatically incorrect or incomplete sentences, the chances of speech disfluencies are much higher, and the background noise is also higher. So it is clear that the data to be used must be spontaneous speech.

In the Wizard of Oz method, while collecting data the user is told that he is interacting with a system, but in reality a human volunteer acts as the wizard and emulates the responses of the system. The user is asked to interact with the system to ask about reservation information. The wizard sits in front of a computer with a GUI and fills in the information as told by the user. After the information is entered, if the data is incomplete the system itself generates a sentence asking the user to supply the remaining information, and also repeats what the user has already said, for verification. The user therefore speaks as he would when interacting with a real system, and the speech data obtained is spontaneous. In this method errors are also sometimes feigned knowingly, to make the user feel more as if he were interacting with a system: the wizard may, for example, enter wrong information in the GUI, so that the data obtained in such cases helps in training the system. A system trained this way is expected to perform better in a real-time environment when ordinary people use it. The GUI used for the Wizard of Oz method is shown in Figure 1.

Figure 1: GUI used for the Wizard of Oz method for data collection

3 Spoken dialogue system

People communicate with each other through speech and gestures. How good it would be for a person to interact the same way with a machine and be able to communicate his intentions or give some information. Spoken dialogue systems are systems able to interact with humans the way normal people do, thereby helping them get their job done without any human intervention. Such a system consists of a few important blocks. First, the speech model obtains sentences from the audio input of a user and gives the sentence with the highest score. The speech model itself consists of two blocks: the acoustic model, for detecting phonemes (the basic sound units), and the language model, for generating the sentence with the highest score. The second block is the text processing block, which extracts the relevant information from the sentence generated by the speech model and gives it to the dialogue manager, which in turn checks what information is still needed and generates sentences accordingly. These are fed to the fourth and final block, which generates synthesized speech for the user [3], [4]. This is very helpful in the sense that it needs no human intervention, and the job is done efficiently, since a system does not tire of continuously doing the same job. In this report the acoustic model used is briefly described, and the way the language model has been derived is also explained. A block diagram of a spoken dialogue system is shown in Figure 2.

Figure 2: A basic block diagram of a Spoken Dialogue System
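To make this block structure concrete, here is a minimal Python sketch of one dialogue turn; every name in it is hypothetical, and recognize() and synthesize() are stubs standing in for the real speech model and speech synthesis blocks.

```python
# Minimal sketch of one turn of the spoken dialogue loop described above.
# All names are illustrative stand-ins, not a real API.

REQUIRED = {"train", "station", "month", "number", "class"}

def recognize(audio):
    # Stub for the speech model (acoustic + language model):
    # returns the highest-scoring sentence for the audio input.
    return "mujhe hOv.RAh se mumbaI jAnA hE"

def extract_info(sentence, slots):
    # Stub for the text processing block: fill in any slots
    # whose values can be found in the recognized sentence.
    if "hOv.RAh" in sentence:
        slots["station"] = "hOv.RAh"
    return slots

def synthesize(text):
    # Stub for the speech synthesis block.
    print("SYSTEM:", text)

def dialogue_turn(audio, slots):
    # Dialogue manager: update the slots, then ask for what is missing.
    slots = extract_info(recognize(audio), slots)
    missing = REQUIRED - set(slots)
    if missing:
        synthesize("Please tell me the " + sorted(missing)[0] + ".")
    else:
        synthesize("Your reservation form is complete.")
    return slots

dialogue_turn(b"", {})  # example call: empty audio, no slots filled yet
```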

3.1 Acoustic Model

The acoustic model is the unit which takes the audio input and gives the most probable sequence of phonemes, the basic sound units of speech. The HMM is the most prevalent technique for building such acoustic models: the sequence of phonemes corresponds to the hidden states of the HMM, while the observation sequence is the audio input from the user. For acoustic modelling, the two most frequently used toolkits are HTK, developed at Cambridge University, and SPHINX, developed independently at CMU. Although different, their basic underlying structure for extracting a sequence of phonemes from audio input and their training method are essentially the same; both train the system using HMMs. The following paragraphs describe how the training is done and how the HMM is used for this purpose.

Certain problems must be overcome before creating a model for recognizing speech. First, the mapping between feature vectors and the underlying state sequence is not one to one: various state sequences can give rise to the same speech sound. Second, the units of sound are not discrete, in the sense that speech is continuous, so the word boundaries are not known when speech recognition is done [5]. Both problems are solved simultaneously by using an HMM. The hidden state sequence is associated with the phonemes actually being spoken by the person, and the observation vectors are the MFCC vectors obtained from the audio signal. The HMM used in this case is a left-right or Bakis model [6], [7], which works better than the standard ergodic model, since with the passage of time the state index can be considered to increase; hence it is preferred over the standard HMM. The allowed state changes are shown in Figure 3. The $a_{ij}$ represent the probability of transition from state $i$ to state $j$, and $b_j(o_t)$ represents the probability of the vector $o_t$ being observed in state $j$. In the case of speech, $b_j(o_t)$ can be made a discrete probability by applying VQ, i.e. by mapping each speech vector to a fixed vector representing the centre of a vector-quantized region, or it can be made continuous by assigning a continuous probability density modelled with a GMM.

Figure 3: Flow of the left-right (Bakis) model used for speech recognition
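To make the left-right constraint concrete, a minimal three-state Bakis model that allows only self-loops and single forward steps (the general model may also allow skips) has a transition matrix of the form

$$A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix}, \qquad a_{ij} = 0 \ \text{for } j < i, \qquad \sum_{j} a_{ij} = 1,$$

so the state index can only stay the same or increase with time. The output probabilities $b_j(o_t)$ attached to these states are the quantities modelled next.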

It has been found that using a continuous distribution for $b_j(o_t)$ is much better than making it discrete, which is to be expected, since discretizing loses information and the modelling is incomplete. The basic problems in using HMMs for speech, which must be solved for training and testing, are as follows: first, given an observation sequence, how does one find the best possible state sequence; and second, how does one adjust the parameters of an HMM so as to maximize $P(O|\lambda)$, where $\lambda = (A, B, \pi)$ [6]. The first problem corresponds to decoding speech, while the second corresponds to training a speech model. Training a model therefore essentially boils down to re-estimating the parameters of each state output distribution, shown in equation (1):

$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n\, |\Sigma_j|}}\; e^{-\frac{1}{2}(o_t - u_j)'\, \Sigma_j^{-1}\, (o_t - u_j)} \qquad (1)$$

First a basic HMM is trained for each unit, by initializing each state's parameters with the values shown in equations (2) and (3):

$$\hat{u}_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)} \qquad (2)$$

$$\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t)\, (o_t - u_j)(o_t - u_j)'}{\sum_{t=1}^{T} L_j(t)} \qquad (3)$$

where $L_j(t)$ is the probability of occupying state $j$ at time $t$.
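As a concrete sketch, the Gaussian output density of equation (1) and the occupancy-weighted estimates of equations (2) and (3) can be written directly in NumPy. The array L below stands for the state-occupation probabilities $L_j(t)$ of a single state, which in practice come from the forward-backward computation.

```python
import numpy as np

def log_gaussian(o, mu, sigma):
    """log b_j(o_t) of equation (1) for one observation vector o."""
    n = len(mu)
    d = o - mu
    _, logdet = np.linalg.slogdet(sigma)  # log |Sigma_j|, numerically stable
    return -0.5 * (n * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(sigma, d))

def estimate_state_params(O, L):
    """Equations (2) and (3): O is the (T, n) matrix of observations,
    L the (T,) occupation probabilities L_j(t) of one state j."""
    w = L / L.sum()
    mu = w @ O                      # equation (2): weighted mean
    D = O - mu
    sigma = (w[:, None] * D).T @ D  # equation (3): weighted covariance
    return mu, sigma
```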

Then the Baum-Welch algorithm is used for training, and for testing the best Viterbi path is found to give the state sequence [6]. In the case of Sphinx, the training proceeds briefly as follows. First the audio input is converted to MFCC feature vectors. Then the dictionary is created from the file containing all the transcriptions, and a list of phonemes is also created. Next a model is made over the phonemes to generate questions, i.e. to find the ways in which two phonemes can be grouped together. Then the context-independent models are trained; context independent means that models are trained for phonemes without considering which phonemes precede or follow them in the input. From these, the context-dependent models are trained if the amount of training data is sufficient; this is done to obtain still higher accuracy. It may still happen that the amount of training data is less than what is actually necessary for training a context-dependent model. Say there are some fifty phonemes in a language; then the number of context-dependent models to be trained is $50^3 = 125000$, and the amount of data must be sufficient for all of them. If the data is so sparse that there are only two or three vectors to train a GMM, the modelling will be poor, which defeats the purpose of making context-dependent models. Hence in this case models are tied together using the questions generated initially. Tying means combining two context-dependent models into one model. For example, in Hindi (in the Devanagari transliteration used here) '.da' and 'da' have almost similar pronunciations, so if there are two context-dependent models for a phoneme with '.da' and 'da' respectively following it, these two models can be made into one. The data for modelling this single model is then larger, the training is obviously better, and the recognition accuracy is therefore higher.
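The Viterbi search mentioned above can be sketched in a few lines of log-domain Python; here logA is the (N, N) matrix of log transition probabilities, log_pi the log initial-state probabilities, and logB the (T, N) matrix of log output probabilities $\log b_j(o_t)$, computed for instance from the Gaussian density of equation (1).

```python
import numpy as np

def viterbi(logA, logB, log_pi):
    """Return the best state sequence for an HMM, in the log domain."""
    T, N = logB.shape
    delta = log_pi + logB[0]           # best score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA  # scores[i, j]: best path i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):      # backtrack along the pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```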

3.2 Language Model

This is the model which helps in getting the best possible sentence from the sequence of phonemes. There are two ways of representing the grammar for a specific task domain: one is the grammar model used in HTK, and the other is the SPHINX finite state grammar. A simple HTK task grammar is described in [7]. Say you want an automatic system for phone dialling; typical inputs might be:

    Dial nine one five seven eight
    Phone Odell
    Call Joop Jansen

This is defined in HTK task grammar as follows:

    $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON |
            [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
    ( SENT-START ( DIAL <$digit> | (PHONE | CALL) $name ) SENT-END )

Here the vertical bars denote alternatives, the angle brackets denote that the item may be repeated, and the square brackets denote that the item is optional. A simple representation of this grammar is shown in Figure 4.

In the SPHINX Finite State Grammar the format is as shown below [8]:

    FSG_BEGIN [<fsgname>]
    NUM_STATES <#numstates>
    START_STATE <start_state>
    FINAL_STATE <final_state>
    TRANSITION <from_state> <to_state> <prob> [<word>]
    TRANSITION <from_state> <to_state> <prob> [<word>]
    ....                (any number of state transitions)
    FSG_END
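As an illustration, a toy grammar covering only the pattern 'mujhe <station> se <station> jAnA hE' with two station names could be written in this format as follows; the state numbering, probabilities and word spellings here are illustrative only.

```
FSG_BEGIN reservation
NUM_STATES 7
START_STATE 0
FINAL_STATE 6
TRANSITION 0 1 1.0 mujhe
TRANSITION 1 2 0.5 allAhbAda
TRANSITION 1 2 0.5 hOv.RAh
TRANSITION 2 3 1.0 se
TRANSITION 3 4 0.5 allAhbAda
TRANSITION 3 4 0.5 hOv.RAh
TRANSITION 4 5 1.0 jAnA
TRANSITION 5 6 1.0 hE
FSG_END
```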

A trigram grammar can be obtained from this finite state grammar.

Figure 4: The structure of the simple grammar used for the phone dialling application

4 Network Grammar usage and its advantages

Generally, for the creation of a finite state grammar or an HTK grammar, the database must be large in order to find the probabilities of each transition. The application looked at here is automatic form filling for railway reservation, and in such a case there would have to be a database with the names of all trains, all possible months and all possible stations in order to train the model properly. Instead, what has been done in our case is that the train names, station names, etc. belonging to one category are replaced by the category name. For example, 'mumbaI eksapresa' (as expressed in the Devanagari transliteration convention) is a train name; it is replaced by the category name '<train>'. Similarly, five categories have been identified: 'train', 'station', 'month', 'number' and 'class'. The category 'number' represents either the date in a month or the class of reservation like 'first class' or 'second class'. The category 'class' represents words like 'sleeper', 'ac', 'chair car', etc. Thus in any sentence or transcription only the category names remain, so the grammar model can be trained efficiently from a smaller number of sentences. This replacement of a category's entries by the category name itself is what creates the network grammar; the sentence obtained is called the network grammar.

Transcription: 'mujhe ya”savaMtapura jaMka”sana se ma/galura ko maI a.thArA ko jAnA hI'
Network grammar equivalent: 'mujhe <station> se <station> ko <month> <number> ko jAnA hI'

So we need not train the model separately for different train names or for various station names; replacing them by their category automatically takes care of that in the network grammar. Moreover, when the number of transcriptions (i.e. the sentences spoken by the users) is huge, checking them by hand for training would be a tough job. Many sentences also become the same after the entry names are replaced by their category names; for example, 'mujhe allAhbAda jaMka”sana se hOv.RAh jaMka”sana jAnA hE' and 'mujhe ahamdAbAda jaMk”sana se kathago.dAma jAnA hE' both become 'mujhe <station> se <station> jAnA hE' after the network grammar is created from them. We can also see that these two sentences had a similar structure. For proper training the sentences must be grammatically well balanced, and when their number increases it becomes impossible for a person to check all of them and handpick sentences with different grammatical structures. It would also be tiring for a person to find by hand the categories of words which occur at certain fixed places in the sentences. It would be much better if the system did some processing of the sentences and proposed probable groups of words which are similar or may be placed in one category; this is similar to the words 'PHONE' and 'DIAL' in the HTK grammar described above, which occupy the same position after SENT-START. It would then be easier for the person performing the training to make some minor changes to the categories and train the model. This report describes a semi-automatic method for doing the above; being semi-automatic, it relieves the trainer of much of his work, and it is also data driven. The following section explains how the network grammar has been generated and how it was used to generate the final finite state grammar output; a sketch of the replacement step is given below.
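A minimal sketch of this replacement step, assuming the category word lists have already been loaded from their files (the tiny lists below are illustrative stand-ins), could look as follows; note that trains are checked before stations, for the reason explained in the next section.

```python
# Sketch of network grammar creation: every entry of a category is
# replaced by its category tag. A real implementation would match whole
# words only; plain substring replacement is used here for brevity.
CATEGORIES = [
    ("train",   ["mumbaI rAjadhAnI eksapresa", "mumbaI eksapresa"]),
    ("station", ["ma/galura", "mumbaI", "hOv.RAh jaMka”sana"]),
    ("month",   ["maI", "jUna"]),
    ("number",  ["a.thArA", "bIsa"]),
    ("class",   ["slIpara", "esI"]),
]

def to_network_grammar(sentence):
    for tag, entries in CATEGORIES:
        # Longest entries first, so "mumbaI eksapresa" wins over "mumbaI".
        for entry in sorted(entries, key=len, reverse=True):
            sentence = sentence.replace(entry, "<" + tag + ">")
    return sentence

print(to_network_grammar(
    "mujhe mumbaI se ma/galura ko maI a.thArA ko jAnA hE"))
# -> mujhe <station> se <station> ko <month> <number> ko jAnA hE
```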

5 Generation of the network grammar and Finite State Grammar

This section explains in detail the flow of the implementation: how exactly the network grammar has been extracted, and how it has been used to find similar words and strings and to build the finite state grammar. The complete implementation of the design is depicted in Figure 5.

First all the transcriptions are combined into one single file, creating a master transcription file which contains all the transcriptions. Then, before generating the network grammar, files have to be created containing the list of trains, the list of stations, etc.; a separate file is created for each category. A file containing the list of noises is also created. First of all, the file containing the list of noises is used to remove noises and speech disfluencies from the master transcription file, giving a file containing the clean transcriptions. The clean transcriptions are then used for creating the network grammar: the system searches for entries of any category in the clean transcription file and replaces them by the category names. One thing to be noted here is that the system first searches for train names in the clean transcriptions, then station names, and so on. This is done because train names sometimes contain station names within them; if a station name were replaced by its category name first, a wrong network grammar would be created. Take for example 'Mumbai Rajdhani Express': if the station name were detected first, it would become '<station> Rajdhani Express', whereas it should have been changed to '<train>'. To prevent this, the system checks for entries from the different categories in a specific order.

Figure 5: The creation of the finite state grammar, beginning from the extraction of the network grammar

After extracting the network grammar, the system asks the user whether the transition probabilities in the FSG output should be uniform, data driven, or derived from the list of unique network grammars. According to the user's choice it creates another list of network grammars, which may contain only unique network grammars or multiple occurrences of them. It then takes the list of network grammars created in the previous step and categorizes them into groups. In our case, 440 transcriptions gave only 52 categories of network grammar after categorization. The categorization is done by looking at the sequence of category names present in a network grammar: if two network grammars have the same sequence of category names, they are grouped together. After the network grammars have been categorized, the system looks for words/strings to be made into one category. The criterion for putting words/strings into a single category is that they must lie between the same two category names; for simplicity, a start-of-sentence category is placed at the beginning of each sentence and an end-of-sentence category is added at the end. So, looking at the bounding category names, two words/strings are grouped together. After categorizing words/strings, 32 such categories were formed. Using the network grammars, each category of network grammar was also combined into one sentence in which the alternatives at each position are listed together; for example, the network grammars of one group are combined into a form such as ( … | … | mujhe ) ( … | … ) ( … | … | se jAnA hE ). Thus the network grammars and the words are categorized. The finite state grammar file is then created, taking the user's choice into consideration while calculating the transition probabilities. In the FSG, each word/string in each word category is considered a state, and each entry in each category list (such as the list of train names) is also taken as a state. Null states are also added at the start and end of each category of words/strings and of each category list, for simplicity. After this step the number of states obtained was 650. Thus the final finite state grammar was obtained from the master transcription file; a sketch of the grouping step is given below.
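A sketch of the grouping step, with illustrative names: network grammars are keyed by their sequence of category tags, and the word strings found at the same position between tags across one group become the alternatives.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<[a-z]+>")

def signature(ng):
    # "mujhe <station> se <station> jAnA hE" -> ("<station>", "<station>")
    return tuple(TAG.findall(ng))

def group_by_signature(ngs):
    groups = defaultdict(list)
    for ng in ngs:
        groups[signature(ng)].append(ng)
    return groups

def alternatives(group):
    # Collect, per position between tags, the alternative word strings;
    # the sentence start and end act as implicit boundary tags.
    slots = None
    for ng in group:
        parts = [p.strip() for p in TAG.split(ng)]
        if slots is None:
            slots = [set() for _ in parts]
        for i, part in enumerate(parts):
            if part:
                slots[i].add(part)
    return slots

ngs = ["mujhe <station> se <station> jAnA hE",
       "<station> se <station> ko jAnA hE"]
for sig, group in group_by_signature(ngs).items():
    print(sig, alternatives(group))
```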

6 Conclusion

In this report the present-day methods for extracting a phone sequence from an audio input were described, and a new method for extracting an FSG grammar from a list of transcriptions was presented for a specific task, railway reservation. The same can be extended to any other application by following a similar method.

7 Acknowledgements

I would like to thank my guide Dr. Samudravijaya K for providing me the opportunity to do this work. I will always remain indebted to him for his valuable guidance and support. I would also like to thank Vijay, Lokesh and Akshat for the valuable suggestions they gave during our discussions.

8 References and Links

1. Douglas O'Shaughnessy, “Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis”, Invited Paper.
2. Samudravijaya K, “Modelling Natural Language for Automatic Speech Recognition”.
3. P. Fodor, J. M. Huerta, “Planning and Logic Programming for Dialog Management”, IEEE Spoken Language Technology Workshop, Dec. 2006, pp. 214–217.
4. Renato De Mori, et al., “Spoken Language Understanding”, IEEE Signal Processing Magazine, May 2008.
5. Samudravijaya K, “Speech and Speaker Recognition: A Tutorial”.
6. Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”.
7. Steve Young, et al., “The HTK Book (for HTK Version 3.2.1)”.
8. http://cmusphinx.sourceforge.net
