International Journal of Computer Science Research and Application 2013, Vol. 03, Issue 02, pp. 30-38 ISSN 2012-9564 (Print) ISSN 2012-9572 (Online) © Author Names. Authors retain all rights. IJCSRA has been granted the right to publish and share, Creative Commons 3.0


Isolated Tamil Word Speech Recognition System Using HTK

A. Akila1, E. Chandra2
1 Research Scholar, D.J Academy for Managerial Excellence, Coimbatore, INDIA, [email protected]
2 Director, Department of Computer Science, Dr.SNS Rajalakshmi College of Arts & Science, Coimbatore-32, INDIA. [email protected]
Author Correspondence: [email protected]

Abstract

Speech is one of the most powerful tools for communication. Researchers have long wanted machines to understand human speech, either to act on it or to produce a text transcript of it. In this paper, an overview of Tamil-based Automatic Speech Recognition systems is presented. An isolated Tamil word speech recognition system is implemented using the Hidden Markov Model Tool Kit (HTK), and the performance of the system is evaluated in terms of measures such as Word Error Rate, Accuracy Percentage and Correctness Percentage.

Keywords: Accuracy Percentage, Correctness Percentage, Hidden Markov Model (HMM), HMM Tool Kit (HTK), Word Error Rate (WER)

1. Introduction

Speech is one of the oldest and most natural means of information exchange among human beings. Communication between humans takes place through speech production and perception. For many years, researchers have tried to develop machines that understand and produce speech as humans do naturally. The speech recognition problem may be interpreted as a “speech to text conversion problem”. Automatic Speech Recognition (ASR) has been an active research topic for more than four decades. Speech recognition systems for regional languages spoken in countries with rural backgrounds and low literacy rates are still evolving.

Tamil is a Dravidian language spoken by the Tamil people of South India, Sri Lanka and Singapore (Govt. of Tamilnadu, 2013). It has official status in Tamilnadu and Pondicherry. Tamil is also a national language of Sri Lanka and an official language in Singapore. It is also spoken as a secondary language in the states of Kerala, Karnataka and Andhra Pradesh and in the Andaman and Nicobar Islands (Govt. of Tamilnadu, 2013). It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004. Tamil is one of the longest surviving classical languages in the world, and Tamil literature has existed for over 2000 years.


International Journal of Computer Science Research and Application, 3(2):30-38

2. System Presentation

2.1 Hidden Markov Model Toolkit (HTK)

HTK is a toolkit for building Hidden Markov Models. It is mainly used for speech recognition research, and it has also been used in numerous other applications such as speech synthesis research, character recognition and DNA sequencing. HTK consists of a set of library modules and tools available in C source form (Steve Young et al, 2006). The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. Figure 1 represents the architecture of HTK. HTK consists of many tools, each with its own function. Using these tools, a complex HMM system can be built easily. The toolkit supports HMMs using both single-mixture and multiple-mixture Gaussian distributions. HTK involves two major processing stages: the training phase and the recognition phase. In the training phase, some of the HTK tools are used to estimate the parameters of a set of HMMs from training utterances and their associated transcriptions. In the recognition phase, unknown utterances are transcribed using the HTK recognition tools.

Figure 1: System Architecture of HTK



2.2 Hidden Markov Model (HMM)

An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters (Chandra and Akila, 2012). A Markov model is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (Rabiner and Juang, 1993). This characteristic is called the Markov property. An HMM is a collection of states that fulfills the Markov property, with an output distribution for each state defined in terms of a mixture of Gaussian densities. HMMs are applied in many areas, such as signature verification, speech and speaker recognition, bioinformatics and bar code reading (Anusuya and Katti, 2011).
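The two ingredients named above, the Markov property and the Gaussian-mixture output distribution, can be written compactly as follows. This is a sketch in commonly used notation rather than notation taken from the paper: q_t for the state at time t, o_t for the observation vector, M for the number of mixture components, and c_{jm}, \mu_{jm}, \Sigma_{jm} for the weight, mean and covariance of component m in state j are all symbols introduced here for illustration.

```latex
% Markov property: the next state depends only on the current state
P(q_t \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t \mid q_{t-1})

% Output distribution of state j as a mixture of M Gaussian densities
b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t;\ \mu_{jm}, \Sigma_{jm}),
\qquad \sum_{m=1}^{M} c_{jm} = 1
```

With M = 1 the second expression reduces to the single-Gaussian case.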

3. Speech Recognition for Tamil Languages

3.1 Tamil Alphabet

The Tamil language has 12 vowels and 18 consonants. The vowels and consonants combine to form 216 compound characters. The compound characters are formed by placing vowel markers on one side or both sides of the consonant. There is one special character, ayutha ezhuthu (ஃ), used in classical Tamil. In total, there are 247 letters in the standard Tamil alphabet.
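The letter count can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the function and constant names are chosen here for illustration.

```python
# Letter counts as stated in the text.
VOWELS = 12
CONSONANTS = 18
SPECIAL = 1  # the special character ayutha ezhuthu


def tamil_letter_count(vowels=VOWELS, consonants=CONSONANTS, special=SPECIAL):
    """Return (compound, total) letter counts for the standard Tamil alphabet."""
    compound = vowels * consonants  # every consonant combines with every vowel
    total = vowels + consonants + compound + special
    return compound, total


compound, total = tamil_letter_count()
print(compound, total)  # 216 247
```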

3.2 Research in Tamil ASR

Speech recognition for the Tamil language is still at a preliminary level compared with the extensive development of computing for English (Thangarajan et al, 2008). Several commercial speech applications are available for Tamil, such as Speech Mate, Nuance Dragon NaturallySpeaking 2011.34, Dictation 2005 and DSpeech. Many researchers have worked on Speech Recognition Systems (SRS) for Tamil. These systems are implemented using algorithms such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW) and Artificial Neural Networks (ANN).

DTW-based speech recognition systems have good accuracy for isolated Tamil digits (Dharun and Karnan, 2012) (Karpagavalli, 2012). DTW works well for a small number of templates and is speaker dependent (Anusuya and Katti, 2011). In (Murugan et al., 2012), a medium-vocabulary SRS was built; the system was trained with 21 speakers and the recognition accuracy was 84%.

The syllable-based continuous speech recognizer of Lakshmi A. and Hema A. Murthy (Lakshmi and Murthy, 2006) uses group-delay-based two-level segmentation to extract syllable segments from the speech data. In Syllable Based Continuous Speech Recognition for Tamil (Thangarajan and Natarajan, 2008), R. Thangarajan and A.M. Natarajan give an algorithm that leverages the linguistic rules of Tamil to identify prosodic syllables in a Tamil word. A syllable in Tamil is built from combinations of short vowels (Kuril), long vowels (Nedil) and consonants (Mei eluthu). The algorithm segments the given utterance into syllables using the syllabic rules of Tamil, which consist of eight patterns. The algorithm achieved 80% accuracy.

Saraswathi Selvarajan and T.V. Geetha (Selvarajan and Geetha, 2007) designed a morpheme-based language model for Tamil speech recognition. It reduces the vocabulary size by decomposing words formed from a base word into a stem and an ending, and the subword morphemes are stored for training the language model. A small-vocabulary word-based model and a medium-vocabulary triphone-based continuous speech recognizer for Tamil were built by R. Thangarajan and his team. They built a context-independent acoustic model for 371 unique words and a triphone-based context-dependent acoustic model for 1700 unique words (Thangarajan et al, 2008). A trigram language model and a pronunciation dictionary with 44 phones were also built to improve the performance of the recognizer. In the work of S. Saraswathi and T.V. Geetha (Saraswathi and Geetha, 2007), the performance of Tamil phoneme recognition was improved by using language models in the recognition phase. The speech signal was segmented into phonemes, and errors in the detected segmentation points were corrected using a language model.



4. Isolated Tamil Word Speech Recognition using HTK – A proposed model

4.1 Speech corpus

Table 1: Sample grains considered for the recognition [column: Grain name in Tamil; surviving entries: Pigeon Pea, Sugar Cane]

The words considered for training are 10 grain names, as shown in Table 1, uttered by 2 female speakers in a normal room with minimal external noise. Each grain name is uttered 4 times by each speaker. The data are recorded using Audacity at a project rate of 48000 Hz with a single-channel microphone. The end points of each data signal were found using the Voice Activity Detection (VAD) algorithm developed by Qiang He (Qiang He, 2001).

4.2 Processing

All speech recognition engines are made up of the following components.

1. Language Model or Grammar - A language model contains a very large list of words and their probabilities of occurrence in a given training sequence. A grammar is a much smaller file containing a set of predefined combinations of words. Each word in the language model or grammar has an associated list of words or subword units.
2. Acoustic Model - The acoustic model contains a statistical representation of the distinct words, or of the sounds that make up each word, in the language model or grammar.
3. Decoder - A software program that takes the words or subword units spoken by the user and searches the acoustic model for the equivalent word or unit.

The following steps are carried out to build a simple isolated word recognizer with whole-word models using HTK.

1. Constructing the grammar
2. Constructing a dictionary for the models
3. Creating transcription files for the training data
4. Extracting the features
5. Training the acoustic model
6. Evaluating the recognizer on the test data
7. Reporting the recognition result

4.2.1 Grammar Construction

A recognition grammar defines constraints on what the Speech Recognition Engine (SRE) can expect as input. It is a list of words that the SRE listens for. The list of words used in the grammar is limited to the words that have been trained in the acoustic model.



Figure 2: Network of the complete Grammar

The grammar consists of a set of variable definitions followed by a regular expression describing the words to recognize. The grammar is provided for user convenience. The speech recognizer actually requires a word network defined in a low-level notation called Standard Lattice Format (SLF) (Young et al, 2006), in which each word instance and each word-to-word transition is listed explicitly. Figure 2 shows the network for the complete grammar of the implemented isolated Tamil word ASR.
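As an illustration of "variable definitions followed by a regular expression", the sketch below generates a grammar in the notation read by HTK's HParse tool, which compiles such a file into an SLF word network. The paper does not list its grammar file or say which tool was used, so the word list (apart from nel and ulunthu, which the paper names), the SENT-START and SENT-END boundary symbols, and the helper name build_grammar are assumptions made for illustration.

```python
def build_grammar(words, start="SENT-START", end="SENT-END"):
    """Return an HParse-style grammar: one variable definition listing the
    alternatives, then one expression accepting exactly one word between
    the two boundary symbols."""
    if not words:
        raise ValueError("the word list must not be empty")
    alternatives = " | ".join(sorted(words))
    return f"$word = {alternatives};\n( {start} $word {end} )\n"


# Hypothetical word list: only NEL and ULUNTHU are named in the paper.
GRAINS = ["NEL", "ULUNTHU", "GRAIN3"]
print(build_grammar(GRAINS), end="")
```

Because the expression contains a single `$word` with no repetition operator, the resulting network accepts one word per utterance, which matches the isolated-word setting.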

4.2.2 Dictionary Construction

The way in which each word in the training data is expanded is determined from a dictionary. The dictionary consists of the list of words in the training set and the corresponding symbol to output when each word is recognized. The first step in building a dictionary is creating a sorted list of the required words. For small ASR systems like the one implemented in this paper, it is easy to create the list by hand. For a larger system, the word list can be prepared from the training text corpus.
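A minimal sketch of this step is given below. It assumes the usual HTK dictionary line layout of a word, an optional output symbol in square brackets, and the model sequence; for whole-word models each word is assumed to expand to a single model named after the word. The helper name build_dictionary and the naming convention are illustrative and not taken from the paper.

```python
def build_dictionary(words):
    """Return sorted dictionary lines of the form 'WORD [OUTPUT] model'.
    Duplicates are removed; each word maps to one whole-word model."""
    unique = sorted(set(w.upper() for w in words))
    return [f"{w} [{w}] {w.lower()}" for w in unique]


# Word list taken from the training transcripts (two of the ten grain names).
for line in build_dictionary(["ulunthu", "nel", "NEL"]):
    print(line)
```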

4.2.3 Creating Transcription files for training data

For training, the recognizer needs to know which files correspond to which word of the training data set. This process of labeling the training data is called transcription. The transcriptions are provided in the form of a Master Label File (MLF) (Young et al, 2006). The transcription file contains the labels for the entire training set. The word-level transcription is done first. When subword units are used, a further level of transcription is needed for those units. A silence model has to be included at the start and end of every training utterance in the transcription file. HTK makes this insertion of silence straightforward.
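The layout of a word-level MLF with silence at both ends of each utterance can be sketched as follows. The header line, the quoted file pattern and the terminating period follow the MLF format described in the HTK Book; the file names, the silence label sil and the helper name build_mlf are assumptions made for illustration.

```python
def build_mlf(utterances, silence="sil"):
    """Build Master Label File text.

    utterances: list of (label_file_name, word) pairs, one per recording.
    Each entry is the quoted file pattern, the labels one per line
    (silence, word, silence), and a terminating period."""
    lines = ["#!MLF!#"]
    for lab_name, word in utterances:
        lines.append(f'"*/{lab_name}"')
        lines.extend([silence, word, silence])
        lines.append(".")
    return "\n".join(lines) + "\n"


# Hypothetical file names for two training recordings.
print(build_mlf([("nel_s1_01.lab", "nel"), ("ulunthu_s1_01.lab", "ulunthu")]), end="")
```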

4.2.4 Extracting the features

In this step, the linear amplitude signal is converted into a spectral-like representation (Chandra and Akila, 2012). This reduces the data rate of the raw audio input, thereby decreasing the computational load of the subsequent phases. One type of signal processing commonly used in speech recognition systems is Mel Frequency Cepstral Coefficients (MFCC). Initially, the audio signal was converted into a vector and then filtered to sharpen the signal. The signal is then segmented into frames of 10 ms with an optional overlap of 1/3 to 1/2 of the frame size (Furui, 2001). Each frame was multiplied by a Hamming window in order to keep the continuity of the first and last points in the frame. The magnitude frequency response of each frame is obtained by performing a Fast Fourier Transform. A Mel-scale filter bank is used to smooth the magnitude frequency spectrum. As a result, the speech recognition system behaves more or less the same when input utterances have different tones or pitch. The 13 MFCCs are obtained after the Discrete Cosine Transform; appending their first-order (delta) and second-order (acceleration) derivatives gives the 39 coefficients (13 MFCC + 13 Delta + 13 Acceleration).
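The framing, windowing and delta steps described above can be sketched in plain Python. This is an illustration under stated assumptions and not the paper's front end: the FFT, Mel filter bank and DCT stages are omitted, the static vectors are placeholders standing in for real MFCCs, the regression window of two frames is an assumed setting, and all function names are chosen here.

```python
import math


def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate=48000, frame_ms=10, overlap=0.5):
    """Cut samples into Hamming-windowed frames; a trailing partial frame is dropped."""
    size = int(sample_rate * frame_ms / 1000)
    hop = max(1, int(size * (1 - overlap)))
    window = hamming(size)
    return [[s * w for s, w in zip(samples[start:start + size], window)]
            for start in range(0, len(samples) - size + 1, hop)]


def deltas(vectors, theta=2):
    """Regression-based time derivative of a feature sequence, with the
    first and last vectors repeated at the edges."""
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    last = len(vectors) - 1
    return [[sum(k * (vectors[min(t + k, last)][d] - vectors[max(t - k, 0)][d])
                 for k in range(1, theta + 1)) / denom
             for d in range(len(vectors[0]))]
            for t in range(len(vectors))]


def stack_features(static):
    """Append delta and acceleration coefficients to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# 0.1 s of a constant dummy signal at the 48000 Hz recording rate.
frames = frame_signal([1.0] * 4800)
print(len(frames), len(frames[0]))      # 19 480

# Placeholder 13-dimensional static vectors standing in for real MFCCs.
static = [[float(t)] * 13 for t in range(5)]
print(len(stack_features(static)[0]))   # 39
```

At 48000 Hz a 10 ms frame holds 480 samples, so a half-frame overlap advances 240 samples per frame.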

4.2.5 Training the Acoustic Model

As specified earlier in this paper, the acoustic model contains a statistical representation of the distinct words, or of the sounds that make up each word, in the language model or grammar. For each word in the word list created in section 4.2.2, an acoustic model was created with 5 states and 39-dimensional feature vectors. Initially, each model was given zero mean and unit variance with a single Gaussian mixture. The parameters of the HMMs were then re-estimated for 12 to 15 iterations with single, two and four Gaussian mixtures.
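A prototype model of this kind can be written in HTK's HMM definition language. The sketch below generates such a definition under several assumptions that the paper does not state: the parameter kind MFCC_0_D_A, the 0.6/0.4 initial transition probabilities, and the reading of "5 states" in the HTK convention where the first and last states are non-emitting, leaving three emitting states. The helper name build_prototype is illustrative.

```python
def build_prototype(name="proto", vec_size=39, num_states=5, kind="MFCC_0_D_A"):
    """Return HTK-style text for a left-to-right prototype HMM with zero
    means, unit variances and one Gaussian per emitting state."""
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = [f"~o <VecSize> {vec_size} <{kind}>", f'~h "{name}"',
             "<BeginHMM>", f"<NumStates> {num_states}"]
    for state in range(2, num_states):       # states 1 and N are non-emitting
        lines += [f"<State> {state}", f"<Mean> {vec_size}", zeros,
                  f"<Variance> {vec_size}", ones]
    lines.append(f"<TransP> {num_states}")
    for i in range(num_states):
        row = [0.0] * num_states
        if i == 0:
            row[1] = 1.0                      # entry state moves to the first emitting state
        elif i < num_states - 1:
            row[i], row[i + 1] = 0.6, 0.4     # stay in the state or advance (assumed values)
        lines.append(" ".join(f"{p:.1f}" for p in row))
    lines.append("<EndHMM>")
    return "\n".join(lines) + "\n"


print(build_prototype(), end="")
```

One copy of this prototype per word would then serve as the starting point for re-estimation.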

4.2.6 Evaluating the recognizer from the test data

The transcription file for the test data was created in the same way as for the training data, as described in section 4.2.3. The 39 MFCC coefficients were then extracted in the same way as for the training data. The MFCC coefficients of the test data and the final-iteration HMMs of the trained words were given as parameters to the Viterbi decoder, which performs the recognition. For each test utterance to be recognized, the acoustic model was searched for the corresponding word model. If a match is found, the word is recognized.
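The decoding idea can be illustrated with a toy isolated-word recognizer: each word has its own left-to-right HMM, the Viterbi algorithm scores the utterance against every model, and the best-scoring word is reported. This is a simplified sketch and not HTK's decoder: it uses one-dimensional observations in place of 39-dimensional vectors, a single Gaussian per state, an assumed self-loop probability, and made-up model parameters.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(obs, means, variances, self_loop=0.6):
    """Best-path log likelihood through a left-to-right HMM that must
    start in its first state and end in its last state."""
    n = len(means)
    stay, move = math.log(self_loop), math.log(1 - self_loop)
    score = [float("-inf")] * n
    score[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = []
        for j in range(n):
            best = score[j] + stay                        # remain in state j
            if j > 0:
                best = max(best, score[j - 1] + move)     # advance from state j-1
            new.append(best + log_gauss(x, means[j], variances[j]))
        score = new
    return score[-1]


def recognize(obs, models):
    """Return the word whose model gives the highest Viterbi score."""
    return max(models, key=lambda w: viterbi_score(obs, *models[w]))


# Hypothetical word models: (state means, state variances).
MODELS = {
    "nel": ([0.0, 1.0, 2.0], [1.0, 1.0, 1.0]),
    "ulunthu": ([5.0, 6.0, 7.0], [1.0, 1.0, 1.0]),
}
print(recognize([0.1, 0.9, 1.1, 2.2], MODELS))  # nel
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied together.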

4.2.7 Reporting Recognition Result

The number of test utterances correctly recognized by the decoder is evaluated using performance measures such as the word recognition rate and the correctness rate.

5. Experiment Results

5.1 Performance Measure

The performance of an ASR system is computed using measures such as Word Error Rate, Word Recognition Rate (Accuracy) and Correctness rate. The Correctness percentage (Young et al, 2006) of the test data is calculated using Eq. 1, where N is the total number of test utterances, H is the number of word hits (utterances recognized correctly), and I, S and D are the numbers of insertion, substitution and deletion errors.

5.1.1 Insertion Error (I)

An insertion error (I) occurs when a word appears in the recognizer's transcription (Morris A.C et al., 2004) but is not present in the reference. An example of an insertion error is shown in Figure 3.

Figure 3: Example for Insertion Error



5.1.2 Substitution Error (S)

A substitution error (S) occurs when a word in the reference is replaced by a different word in the transcription (Morris A.C et al., 2004), as shown in the example in Figure 4.

Figure 4: Example for Substitution Error

5.1.3 Deletion Error (D)

A deletion error (D) occurs when a word is present in the reference but missing from the transcription (Morris et al., 2004). An example of a deletion error is shown in Figure 5.

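Figure 5: Example for Deletion Error

The correctness percentage is calculated using only the numbers of deletion and substitution errors:

$$\text{Correctness} = \frac{H}{N} \times 100\% = \frac{N - D - S}{N} \times 100\% \tag{1}$$

The accuracy percentage (Steve Young et al, 2006) is calculated using Eq. 2, in which all three error types are used:

$$\text{Accuracy} = \frac{N - D - S - I}{N} \times 100\% \tag{2}$$

The word error rate is calculated using Eq. 3:

$$\text{WER} = \frac{S + D + I}{N} \times 100\% \tag{3}$$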




Figure 6: Graphical Representation of the Performance of ASR for the given test data

On the sample grain data, the isolated Tamil word speech recognizer built using HTK achieved a correctness of 90%, an accuracy of 90% and a Word Error Rate of 10%. The word nel (Paddy) was substituted by ulunthu (Black Gram). Figure 6 shows the graphical representation of the performance of the ASR for the given test data.
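The reported figures can be reproduced from the error counts. The sketch below uses the definitions of N, H, S, D and I from section 5.1 with the HTK conventions H = N - D - S for hits and (H - I)/N for accuracy. The paper reports a test set of 10 utterances and one substitution; the values D = 0 and I = 0 are assumptions inferred from the reported percentages, and the function name is illustrative.

```python
def recognition_metrics(n, substitutions, deletions, insertions):
    """Return (correctness %, accuracy %, word error rate %)."""
    hits = n - deletions - substitutions
    correct = 100.0 * hits / n
    accuracy = 100.0 * (hits - insertions) / n
    wer = 100.0 * (substitutions + deletions + insertions) / n
    return correct, accuracy, wer


# 10 test utterances, one substitution (nel recognized as ulunthu),
# no deletions or insertions (assumed).
print(recognition_metrics(10, 1, 0, 0))  # (90.0, 90.0, 10.0)
```

Correctness and accuracy coincide here because there are no insertion errors; they diverge as soon as I is greater than zero.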

6. Conclusion

An isolated Tamil word recognizer was built using the HTK toolkit. The recognizer was trained with grain names in Tamil spoken by 2 female speakers. The recognizer achieved a good accuracy rate for isolated Tamil words, as shown in Figure 6. The training set consisted of about 80 utterances and the test set of 10 utterances. The accuracy and correctness percentages were high with this small vocabulary. When the vocabulary grows to thousands of words, a word-based model will not be suitable, since it needs a large amount of storage for the acoustic models of all the training words and the complexity of the system increases. When a word outside the training data is spoken, the isolated word ASR may not recognize it correctly. To avoid these drawbacks, a subword model can be developed. Phonemes, triphones, senones, morphemes and syllables are some of the subword units suitable for medium and large vocabulary ASR. Future work will be to construct a subword-based model to improve the accuracy and to reduce the complexity of vocabulary building.

References

M.A. Anusuya, S.K. Katti, 2011, Classification Techniques used in Speech Recognition Applications: A Review, Int. J. Comp. Tech. Appl., Vol 2, 4, pp 910-954.
Dr. E. Chandra and A. Akila, 2012, An Overview of Speech Recognition and Speech Synthesis Algorithms, Int. J. Computer Technology & Applications, Vol 3, 4, pp 1426-1430.
V.S. Dharun, M. Karnan, 2012, Voice and Speech Recognition for Tamil Words and Numerals, IJMER, Vol. 2, 5, pp 3406-3414.
Durai Murugan L. et al., 2012, Automated Transcription System for Tamil Language, IJCE, Vol. 1, 3, pp 7-14.
Steve Young et al., 2006, The HTK Book, Version 3.4.
Govt. of Tamilnadu, 2013, []
S. Karpagavalli et al., 2012, Isolated Tamil Digits Speech Recognition using Vector Quantization, IJERT, Vol. 1, 4, pp 1-9.
Lakshmi A. and Hema A. Murthy, 2006, A Syllable Based Continuous Speech Recognition for Tamil, INTERSPEECH-ICSLP, Pennsylvania, pp 1878-1881.
Morris, A.C. et al., 2004, From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition, Proc. ICSLP
Qiang He, 2001, Voice Activity Detection, code/math/ detail81458_en.html
L. Rabiner and B.H. Juang, 1993, Fundamentals of Speech Recognition, Pearson Education.
Sadaoki Furui, 2001, Digital Speech Processing, Synthesis, and Recognition, Second Edition, Marcel Dekker, Inc., pp 213-237.
Saraswathi Selvarajan and T.V. Geetha, 2007, Morpheme Based Language Model for Tamil Speech Recognition System, The International Arab Journal of Information Technology, Vol 4, No. 3, pp 214-219.
S. Saraswathi and T.V. Geetha, 2007, Improvement in Performance of Tamil Phoneme Recognition using Variable Length and Hybrid Language Models, IEEE-ICSCN 2007, pp 11-15.
R. Thangarajan and A.M. Natarajan, 2008, Syllable Based Continuous Speech Recognition for Tamil, South Asian Language Review, Vol XVIII, No. 1.
R. Thangarajan et al., 2008, Word and Triphone Based Approaches in Continuous Speech Recognition for Tamil Language, WSEAS Transactions on Signal Processing, Vol 4, 3, pp 76-85.

Author Biography

Mrs. A. Akila completed her B.Sc at Bharathiar University, Coimbatore. She received her MCA degree from Madurai Kamaraj University. She pursued her M.Phil in the area of Neural Networks at Bharathiar University. She is a Research Scholar at D.J Academy for Managerial Excellence, Coimbatore. She has presented papers in international journals and attended many seminars and workshops conducted by various educational institutions. Her research interests lie in the areas of Speech Recognition Systems, Data Structures and Neural Networks.

Dr. E. Chandra received her B.Sc from Bharathiar University, Coimbatore in 1992 and her M.Sc from Avinashilingam University, Coimbatore in 1994. She obtained her M.Phil in the area of Neural Networks from Bharathiar University in 1999. She obtained her PhD degree in the area of speech recognition from Alagappa University, Karaikudi in 2007. At present she is working as a Director at the Department of Computer Science in SNS Rajalakshmi College of Arts and Science, Coimbatore. She has published more than 20 research papers in national and international journals and conferences. She has guided more than 30 M.Phil research scholars. At present, she is guiding 8 PhD research scholars. Her research interests lie in the areas of Data Mining, Artificial Intelligence, Neural Networks, Speech Recognition Systems and Fuzzy Logic. She is an active member of CSI, currently a management committee member of CSI, and a life member of the Society of Statistics and Computer Applications.

Copyright for articles published in this journal is retained by the authors, with first publication rights granted to the journal. By the appearance in this open access journal, articles are free to use with the required attribution. Users must contact the corresponding authors for any potential use of the article or its content that affects the authors’ copyright.
