Phonotactic Language Recognition Based on Time-Gap-Weighted Lattice Kernels

Wei-Wei Liu 1,2, Wei-Qiang Zhang 1, Jia Liu 1
1 Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
2 General Communication Station, General Logistics Department, China
[email protected]

Abstract

The phonotactic approach to spoken language recognition (SLR) deals with the permissible phone patterns and their frequencies of occurrence in a specific language. The phone recognizer followed by vector space model (PR-VSM) system is a state-of-the-art phonotactic language identification system, in which any utterance can be mapped into a supervector filled with the likelihood scores of n-gram tokens (bag-of-n-gram). However, the bag-of-n-gram language model is not good at capturing long-context co-occurrence relations, because of its restriction to exact matching of contiguous n-gram phonemes, and it is vulnerable to the insertion and deletion errors induced by the frontend phone recognizer. In this paper we propose a novel approach to fill these gaps, based on a time-gap-weighted lattice kernel (TGWLK). The kernel is an inner product in the feature space generated by all contiguous and non-contiguous subsequences of varying length in the lattice, weighted by an exponentially decaying factor determined by the length of their time gaps. Experiments on the NIST 2009 LRE corpus demonstrate that the proposed TGWLK achieves a lower equal error rate (EER) than the baseline system.

Index Terms: language identification, time-gap-weighted lattice kernel (TGWLK), bag-of-n-gram, vector space model (VSM)

1. Introduction

Nowadays, spoken language recognition (SLR) has become an increasingly important technique for many applications, such as language translation systems and search engines [1]. Without loss of generality, we can consider language recognition as a classification problem [2]. Given a set of training data and associated labels, the first step is to learn the characteristics of the languages from the training data; a speech utterance is then classified to the most probable language based on the language model. The problem is how to define the characteristics of the languages. In phonotactic language recognition, a classical solution is to extract features of each language from training utterances, such as bag-of-n-grams. The term frequency log-likelihood ratio (TFLLR) kernel [3] provides an efficient way of comparing the similarity of two utterances via the inner product of two bag-of-n-gram feature vectors. One competitive advantage of n-gram language models is their language-independent property. However, language recognition based on bag-of-n-grams is restricted to contiguous n-gram matching, so it cannot tolerate insertion, deletion and substitution decoding errors. A method is therefore needed to count the n-grams dynamically and efficiently.
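To make the bag-of-n-gram representation concrete, here is a minimal Python sketch (hypothetical phone labels; a 1-best phone string rather than a lattice, where the counts would instead be posterior-weighted expected counts) that maps a decoded phone sequence to relative n-gram frequencies:

```python
from collections import Counter

def bag_of_ngrams(phones, n=2):
    """Map a decoded phone sequence to relative n-gram frequencies.

    A 1-best simplification of the lattice-based system, where counts
    would be expected counts weighted by posterior probabilities.
    """
    grams = [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.items()}

feats = bag_of_ngrams(["ph9", "ph1", "ph7", "ph9", "ph1"], n=2)
# ("ph9", "ph1") occurs twice among the four bigrams -> 0.5
```

Such frequency vectors are what the TFLLR kernel of Section 2 compares between utterances.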

As early as 1997, Navratil et al. proposed the skip-gram method [4] in language identification to address this problem; in this method a pair of phones with one phone skipped is modeled. A dynamic gap n-gram matching approach for high-order n-grams was then proposed [5] to generalize the skip-gram matching method. After that, the gap-weighted subsequence kernel (GWSK) [6] was introduced into language recognition. The basic idea of the GWSK is similar to that of the skip-gram; however, the GWSK is theoretically better formulated. The GWSK counts the occurrences of a subsequence with a penalty related to the number of gaps interspersed within it. In this way, it has the merit of not only being robust to deletion and insertion errors but also being capable of revealing long-context co-occurrence. In this paper, we investigate the application of this dynamic matching approach using a time-gap-weighted lattice kernel. The kernel is based on features corresponding to occurrences of time-contiguous or non-contiguous, bounded-length subsequences in the lattices, penalized by the time length of the gaps interspersed within them. The basic idea of the TGWLK is similar to that of the GWSK, except that subsequences are penalized by the time length of the gaps; the TGWLK is therefore also capable of revealing long-context co-occurrence and is robust to deletion and insertion errors. Experimental results show that TGWLK outperforms the PR-VSM baseline language recognition system and achieves performance comparable to the GWSK system. Moreover, the TGWLK and GWSK systems are mutually complementary.

The rest of the paper is organized as follows. In Section 2 we review the phone recognizer followed by support vector machine (PR-SVM) baseline language recognition system and the TFLLR kernel. The time-gap-weighted lattice kernel, its implementation and variants of the algorithm are introduced in Section 3. The experimental setup is described in Section 4. The results and discussion of the experimental comparison of TGWLK against the PR-VSM approach are given in Section 5, followed by conclusions in Section 6.
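To make the gap-weighted idea concrete, the following brute-force sketch illustrates GWSK-style counting: each occurrence of a subsequence u contributes λ raised to its number of gaps g(i) = l(i) − |u|. This is illustrative only (practical implementations use dynamic programming [9], and the strings here are hypothetical):

```python
from itertools import combinations

def gap_weighted_count(s, u, lam=0.5):
    """Sum lam**g(i) over all index tuples i with s[i] == u,
    where g(i) = span - len(u) is the number of gaps (brute force)."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            span = idx[-1] - idx[0] + 1
            total += lam ** (span - len(u))
    return total

k = gap_weighted_count(list("abcb"), list("ab"), lam=0.5)
# -> 1.25: the contiguous "ab" contributes 0.5**0 = 1, and the
#    occurrence "a__b" spanning two gaps contributes 0.5**2 = 0.25
```

A contiguous match is not penalized, while widely separated matches decay exponentially, which is exactly the trade-off the TGWLK transfers from symbol gaps to time gaps.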

2. PR-VSM baseline system

In this work, a phone recognizer followed by support vector machine (PR-SVM) [7] language recognition system is employed as the baseline. The phonotactic language recognition system maps the input data x to a high-dimensional feature supervector:

Φ : x → φ(x).   (1)

The supervector φ(x) is then sent to the SVM classifier, and a decision is made based on the output belief score of the classifier [8]. In this paper

φ(x) = [p(d_1|ℓ_x), p(d_2|ℓ_x), ..., p(d_F|ℓ_x)],   (2)

where d_i = s_i ... s_{i+N−1} is an N-gram phoneme string, F = f^N (f is the number of phonemes of the frontend phone recognizer), ℓ_x denotes the lattice generated from data x by the frontend phone recognizer, and p(d_i|ℓ_x) is the observed probability of the N-gram d_i in the lattice. In the PR-SVM language recognition system an SVM is employed as the classifier, and the output score is computed as:

f(φ(x)) = Σ_l α_l K_TFLLR(φ(x), φ(x_l)) + d,   (3)

where the φ(x_l) are support vectors and K_TFLLR is the TFLLR kernel, computed as [3]:

K_TFLLR(φ(x_i), φ(x_j)) = Σ_{q=1}^{F} [p(d_q|ℓ_{x_i}) / √p(d_q|ℓ_all)] · [p(d_q|ℓ_{x_j}) / √p(d_q|ℓ_all)],   (4)

where p(d_q|ℓ_all) is the probability of d_q across all lattices. The training stage is carried out with a one-versus-rest strategy.
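A minimal sketch of a TFLLR-style kernel in the spirit of Eq. (4), with the n-gram probabilities stored in dictionaries (the data layout is an assumption for illustration): each probability is scaled by 1/√p(d_q|ℓ_all) before the inner product, so that rare n-grams are emphasized.

```python
import math

def tfllr_kernel(p_x, p_y, p_all):
    """TFLLR kernel: inner product of n-gram probabilities, each
    normalized by the square root of the across-all-lattices probability."""
    k = 0.0
    for gram, p_bg in p_all.items():
        if p_bg > 0.0:
            k += (p_x.get(gram, 0.0) / math.sqrt(p_bg)) * \
                 (p_y.get(gram, 0.0) / math.sqrt(p_bg))
    return k

p_all = {("ph1", "ph2"): 0.5, ("ph2", "ph3"): 0.5}
k = tfllr_kernel({("ph1", "ph2"): 1.0}, {("ph1", "ph2"): 0.5}, p_all)
# (1.0 / sqrt(0.5)) * (0.5 / sqrt(0.5)) = 0.5 / 0.5 = 1.0
```

The 1/√p(d_q|ℓ_all) scaling is what distinguishes TFLLR from a plain inner product of frequency vectors.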

3. Time-gap-weighted lattice kernel (TGWLK)

3.1. Subsequence and time-gap

Let Σ be a finite set of phonemes and Σ^n the set of all phoneme strings of length n. For a string d, we denote its length by |d|. A string u is a subsequence of d if there exist indices i = (i_1, ..., i_{|u|}), with 1 ≤ i_1 < ... < i_{|u|} ≤ |d|, such that u_j = d_{i_j} for j = 1, ..., |u|, written u = d[i] for short. The context length (or span) of the subsequence u in d is l(i) = i_{|u|} − i_1 + 1, and the number of gaps is g(i) = l(i) − |u| [9]. For example, consider the phoneme string d = ph9 ph1 ph7 ph13 ph8 ph6 ph3, where ph denotes a phoneme and ph_i denotes the i-th phoneme in the inventory of the phone recognizer. Then |d| = 7, and if u = ph9 ph7 ph6, then |u| = 3, l(i) = 6 and g(i) = 6 − 3 = 3.

Here we define a new gap based on time. Let t_{i_1}, t_{i_2}, ..., t_{i_n} denote the start times of the i_1-th, i_2-th, ..., i_n-th phonemes in the string d. Then

l_t(i) = t_{i_{|u|}+1} − t_{i_1},   U_t(|u|) = Σ_{i_n ∈ u} (t_{i_n+1} − t_{i_n}),   g_t(i) = l_t(i) − U_t(|u|).

For the previous example, l_t(i) = t_{i_7} − t_{i_1}, U_t(|u|) = (t_{i_2} − t_{i_1}) + (t_{i_4} − t_{i_3}) + (t_{i_7} − t_{i_6}), and g_t(i)|_{|u|} = l_t(i) − U_t(|u|) = t_{i_6} − t_{i_4} + t_{i_3} − t_{i_2}. In other words, g_t(i)|_{|u|} is the sum of the time gaps of the subsequence u in the string d.

3.2. TGWLK

The architecture of the TGWLK language system is shown in Fig. 1.

[Figure 1: Architecture of the TGWLK language system: speech waveform → pre-processing and feature extraction → phone recognizers → lattice → SVM classifier (time-gap-weighted lattice kernel) → score calibration and fusion.]

With the definition of the time gap, the time-gap-weighted lattice kernel between utterances x_j and x_k can be defined as:

K_TGWLK_N(φ(x_j), φ(x_k)) = Σ_{q=1}^{F} Σ_{r=1}^{F} Σ_{u∈Σ^N} Σ_{(i_1,i_2): u=d_q[i_1]=d_r[i_2]} [φ(u|ℓ_{x_j}) / √φ(u|ℓ_all)] · [φ(u|ℓ_{x_k}) / √φ(u|ℓ_all)],   (5)

where d_q = s_q ... s_{q+n−1}, d_r = s_r ... s_{r+n′−1}, u = s_u ... s_{u+N−1} (N ≤ n, N ≤ n′), and

φ(u|ℓ_{x_j}) = Σ_{q=1}^{F} Σ_{i: u=d_q[i]} λ^{p(t_i)|_N} p(d_q|ℓ_{x_j}),   (6)

where λ is the gap penalty parameter, p(t_i)|_N is the time-gap weight parameter, and λ^{p(t_i)|_N} is the time-gap-weighted penalty:

p(t_i)|_N = ⌈ g_t(i)|_N / (U_t(|u|)/N) ⌉ = ⌈ [(t_{i_N+1} − t_{i_1}) − Σ_{i_m∈u} (t_{i_m+1} − t_{i_m})] / [Σ_{i_m∈u} (t_{i_m+1} − t_{i_m}) / N] ⌉,   (7)

where U_t(|u|)/N is the average duration of a phoneme in the subsequence u, and g_t(i)|_N is the gap time in the string d_q. We define the time-gap weight parameter using the round-up (ceiling) function to emphasize the time-gap penalty. Then the output score of the SVM classifier is computed as:

f(φ(x)) = Σ_l α_l K_TGWLK(φ(x), φ(x_l)) + d,   (8)

where the φ(x_l) are support vectors.

3.3. Implementation of TGWLK

In this paper, we decode the input utterances using a sausage-like method to generate phone lattices, as shown in Fig. 2. First, the Viterbi method is used to obtain the best phone-level segmentation of each utterance. Then the posterior probabilities of these segments with respect to all phones in the phoneme set are calculated. Finally, the top-N best phones for each segment are retained. Since we only need to calculate the posterior probabilities, the additional time spent on generating phone lattices is less than 10 percent of that used for the 1-best phone sequence; as a result, the sausage method is much faster than the traditional forward-backward algorithm for generating phone lattices [10]. In sausages, the expected counts of N-grams can be regarded as sums of acoustically weighted N-gram counts. It has been shown that a proper normalization of the supervectors is needed to achieve good language recognition performance [3].

[Figure 2: An illustration of a lattice similar to a sausage; each slot contains alternative phones with their posterior probabilities, e.g. ph22 0.31, ph31 0.20, ph3 0.48, ph16 0.44, ...]

As illustrated in [10], such lattices consist of a series of slots; each slot contains one or several alternative edges, and each edge is labeled with a token and the corresponding posterior probability. Suppose there are I slots in the lattice L, and J_i edges in the i-th slot. Let L[i, j] denote the token on the j-th edge of the i-th slot, and p[i, j] the corresponding observed posterior probability. The pseudocode is given in Algorithm 1, which uses N-fold nested loops over the slots (and, within them, over the edges) to obtain the N-gram tokens u and the expected counts p(d_q|ℓ_x).

Algorithm 1 Implementation of lattice-based TGWLK feature mapping.
 1: φ(u|ℓ_x) = 0
 2: for i_1 = 1, ..., min(I − N + 1, n − N + 1) do
 3:   for i_2 = i_1 + 1, ..., min(I − N + 2, n − N + 2) do
 4:     ...
 5:     for i_N = i_{N−1} + 1, ..., min(I, n) do
 6:       for j_1 = 1, ..., J_{i_1} do
 7:         for j_2 = 1, ..., J_{i_2} do
 8:           ...
 9:           for j_N = 1, ..., J_{i_N} do
10:             u = L[i_1, j_1] L[i_2, j_2] ··· L[i_N, j_N]
11:             p(d_q|ℓ_x) = Π_{k=1}^{N} p[i_k, j_k]
12:             φ(u|ℓ_x) = φ(u|ℓ_x) + Σ_{i: u=d_q[i]} p(d_q|ℓ_x) · λ^{p(t_i)|_N}
13:           end for
14:           ...
15:         end for
16:       end for
17:     end for
18:     ...
19:   end for
20: end for

When n = I, Algorithm 1 gives the untruncated TGWLK; when n < I, it gives the truncated TGWLK.
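The loop structure of Algorithm 1 can be sketched in Python as follows. The sausage lattice is assumed to be a list of slots, each a list of (phone, posterior) edges, together with slot boundary times from which the durations needed by Eq. (7) are derived; the function name and data layout are illustrative assumptions rather than the authors' implementation, and itertools replaces the explicit nested loops.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def tgwlk_features(slots, times, N=3, n=9, lam=0.2):
    """Time-gap-weighted expected counts phi(u | lattice).

    slots: sausage lattice; slot i is a list of (phone, posterior) edges.
    times: slot boundary times (len(slots) + 1 increasing values); the
           phone in slot i occupies [times[i], times[i+1]).
    n:     truncated context length (n >= len(slots) is untruncated).
    Each subsequence occurrence is weighted, as in Eq. (7), by
    lam ** ceil(time_gap / average_phone_duration_in_u).
    """
    phi = defaultdict(float)
    for idx in combinations(range(len(slots)), N):
        if idx[-1] - idx[0] + 1 > n:                      # truncation on span
            continue
        u_t = sum(times[i + 1] - times[i] for i in idx)   # occupied time
        l_t = times[idx[-1] + 1] - times[idx[0]]          # span time
        g_t = l_t - u_t                                   # time gap
        w = lam ** math.ceil(g_t / (u_t / N)) if g_t > 0 else 1.0
        for edges in product(*(slots[i] for i in idx)):
            u = tuple(phone for phone, _ in edges)
            p = 1.0
            for _, post in edges:                         # expected count of u
                p *= post
            phi[u] += w * p
    return phi

# Three slots; boundary times chosen to be exactly representable in binary.
slots = [[("ph1", 1.0)], [("ph2", 0.6), ("ph3", 0.4)], [("ph4", 1.0)]]
phi = tgwlk_features(slots, times=[0.0, 0.25, 0.5, 0.75], N=2)
# the contiguous bigram (ph1, ph2) keeps its expected count 0.6, while
# (ph1, ph4) skips one slot (one average duration) and is weighted by 0.2**1
```

For real lattices the enumeration is bounded by the truncation window n, which is what keeps the truncated variant close to PR-VSM in cost.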

4. Experimental setup

4.1. Baseline language recognition system

In this paper a parallel PR-SVM language recognition system is used as the baseline. The first step is to tokenize speech by running phone recognizers that provide the posterior probabilities of the phone occurrences. The Hungarian (HU), Czech (CZ) and Russian (RU) Temporal Patterns Neural Network (TRAPs/NN) phone recognizers developed by the Brno University of Technology (BUT) [11] are employed. Then the popular LIBLINEAR classifier [12] is used for classification. Finally, the LDA-MMI algorithm [13] is used for score calibration and fusion.

4.2. Test, training and development datasets

The results are reported for the test trials of the National Institute of Standards and Technology Language Recognition Evaluation (NIST-LRE) 2009. The test data consists of 41793 test segments of 23 languages at 30-s, 10-s and 3-s nominal durations. 30000 conversations selected from the CallHome, CallFriend, OGI, OHSU and VOA corpora are used for training. 22701 conversations selected from the databases provided by NIST for the 2003, 2005 and 2007 LREs and from VOA are used as the development database.

4.3. Evaluation measures

The performance of the language recognition systems is reported in terms of the equal error rate (EER) and the average cost performance Cavg defined by NIST LRE 2009 [14].

5. Experimental results and discussion

5.1. Performance of the HU front-end TGWLK system

We investigate the performance of TGWLK in this subsection. Besides the decaying factor λ, we also vary the truncated context length n. The results for N = 3 are listed in Table 1; n = ∞ corresponds to the untruncated TGWLK. Table 1 shows that for each fixed n, the trend with respect to λ is similar to that of the untruncated TGWLK: the EERs/Cavgs first decrease and then increase as the decaying factor λ grows, with the minimum at λ = 0.2, as predicted in [6]. On the other hand, for each fixed λ, the truncated TGWLK approaches the standard TGWLK as n increases. The EERs/Cavgs increase monotonically for large λ; for a proper λ, they first decrease and then increase, with the minimum at n = 9, as we expected. We can also observe that n has little effect on the performance; based on this result, small values of n can be selected to reduce the computational cost while retaining similar performance. In this subsection, we have seen that TGWLK outperforms the PR-VSM method and achieves performance comparable to GWSK. Moreover, the fusion of the TGWLK and GWSK systems gives better results, which means the two systems are mutually complementary.

5.2. Parallel phone recognizer experiments

The previous subsection has shown that TGWLK outperforms the PR-VSM method. In this section, we further validate TGWLK using the HU, RU and CZ phone recognizers as frontends. We focus on the most challenging case, i.e., PR-VSM versus TGWLK with N = 3. For TGWLK, we set the parameters λ = 0.2 and n = 9 and use the truncated version.

The results of the parallel-frontend system are obtained using the LDA + MMI score fusion backend [13]. Fig. 3 shows the detection error trade-off (DET) curves, and the EERs and Cavgs are listed in Table 2. Table 2 shows a consistent performance improvement when changing from PR-VSM to TGWLK for both single and parallel frontends. For the HU+RU+CZ parallel frontends, the Cavg decreases from 1.54%, 4.11% and 15.80% to 1.49%, 3.98% and 15.71% for the 30-s, 10-s and 3-s tests, respectively.
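As an aside on the evaluation measures, the EER reported throughout is the operating point at which the miss and false-alarm rates are equal. A minimal sketch with a simple threshold sweep (illustrative scores, not the paper's scoring pipeline):

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER by sweeping the decision threshold over all scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_diff = 1.0
    eer = 1.0
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(p_miss - p_fa) < best_diff:      # closest crossing point
            best_diff = abs(p_miss - p_fa)
            eer = (p_miss + p_fa) / 2
    return eer

scores_target = [0.9, 0.8, 0.7, 0.2]      # scores for true-language trials
scores_nontarget = [0.6, 0.3, 0.1, 0.05]  # scores for other-language trials
eer = equal_error_rate(scores_target, scores_nontarget)
# -> 0.25 (one miss and one false alarm out of four trials each)
```

Cavg, in contrast, averages the NIST-defined detection costs over all target/non-target language pairs, so the two measures need not move together exactly.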

Table 1: Performance of TGWLK (N = 3), HU frontend, LRE09, 30-s test (EER/Cavg in %).

System                  λ=0.1       λ=0.2       λ=0.3       λ=0.4       λ=0.5
TGWLK  n=5              2.39/2.29   2.30/2.23   2.30/2.25   2.32/2.25   2.43/2.31
       n=7              2.38/2.29   2.29/2.22   2.30/2.24   2.38/2.29   2.47/2.39
       n=9              2.37/2.29   2.28/2.22   2.35/2.25   2.46/2.36   2.69/2.60
       n=11             2.37/2.29   2.29/2.24   2.35/2.25   2.51/2.40   2.71/2.63
       n=13             2.38/2.29   2.29/2.24   2.37/2.28   2.59/2.52   2.86/2.84
       n=∞              2.43/2.37   2.39/2.28   2.44/2.39   2.72/2.66   3.21/3.14
PR-VSM n=3              2.44/2.37   2.44/2.37   2.44/2.37   2.44/2.37   2.44/2.37
GWSK   n=7              2.29/2.24   2.28/2.23   2.36/2.37   2.43/2.34   2.58/2.40
TGWLK(n=9)+GWSK(n=7)    2.29/2.23   2.26/2.17   2.27/2.14   2.28/2.16   2.42/2.22

Table 2: Performance of baseline and TGWLK systems (n = 9), NIST LRE 2009 (EER/Cavg in %).

System              30s          10s           3s
Baseline  HU        2.44/2.37    7.38/7.24     23.00/22.61
          RU        2.21/2.00    6.23/6.07     20.53/20.38
          CZ        3.33/3.30    10.03/10.07   25.20/25.14
          fusion    1.54/1.61    4.11/4.00     15.99/15.76
TGWLK     HU        2.28/2.22    7.08/7.16     22.08/22.12
          RU        2.15/1.92    5.84/5.70     19.17/19.23
          CZ        3.08/3.07    9.86/9.82     25.05/25.06
          fusion    1.49/1.47    3.98/4.03     15.71/15.57

[Figure 3: DET curves of PR-VSM and TGWLK, LRE09, HU+RU+CZ frontend.]

5.3. Computational complexity

Table 3 shows the real-time (RT) factors of each part of the language recognition system. In the training stage, the supervector product is the dominant computational part; compared to PR-VSM, the computational cost increases about 1.5 times for the untruncated TGWLK, and only 8.1% for the truncated TGWLK (n = 9, N = 3). In the test stage, decoding and supervector generation are the dominant computational parts, and the computational cost increases about 50% for the untruncated TGWLK, with almost no increase for the truncated TGWLK (n = 9).

Table 3: Comparison of real time factors for PR-VSM and TGWLK, HU frontend, LRE09, 30-s test, N = 3. CPU: Xeon [email protected], RAM: 8GB, single thread. SV gen.: super vector generation; SV prod.: super vector product.

method    n     decoding   SV gen.       SV prod.
PR-VSM    -     0.11       1.1 × 10^-4   3.7 × 10^-6
TGWLK     9     0.11       3.7 × 10^-4   4.3 × 10^-6
TGWLK     ∞     0.11       6.7 × 10^-2   9.1 × 10^-6

6. Conclusions

In this paper, a time-gap-weighted lattice kernel (TGWLK) approach has been presented for language recognition. TGWLK is based on time-gap-weighted matching, which is less vulnerable to the deletion and insertion errors of the frontend phone recognizer than traditional contiguous n-gram matching. The experimental results on the NIST 2009 LRE task show that the proposed TGWLK yields relative improvements of 3.24%, 3.16% and 1.75% for the 30-s, 10-s and 3-s tests, respectively, over the traditional bag-of-n-gram approach.

7. Acknowledgements

This project is supported by the National Natural Science Foundation of China (No. 61005019, No. 61273268 and No. 61370034).

8. References

[1] H. Li, B. Ma, and K.-A. Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136–1159, 2013.
[2] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 33–44, 1996.
[3] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek, "Phonetic speaker recognition with support vector machines," Advances in Neural Information Processing Systems, vol. 16, 2003.
[4] J. Navrátil and W. Zühlke, "Double bigram-decoding in phonotactic language identification," in Proc. ICASSP, vol. 2, 1997, pp. 1115–1118.
[5] W. Liu, W.-Q. Zhang, and J. Liu, "A dynamic gap dimension reduction approach for high order n-gram phonotactic language recognition," in Proc. ICALIP, 2012, pp. 971–975.
[6] W.-Q. Zhang, W.-W. Liu, Z.-Y. Li, Y.-Z. Shi, and J. Liu, "Spoken language recognition based on gap-weighted subsequence kernels," Speech Communication, vol. 60, pp. 1–12, 2014.
[7] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Computer Speech and Language, vol. 20, no. 2-3, pp. 210–229, Jan 2006.
[8] J.-L. Gauvain, A. Messaoudi, and H. Schwenk, "Language recognition using phone lattices," in Proc. ICSLP, Jeju Island, Oct 2004, pp. 1283–1286.
[9] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," Journal of Machine Learning Research, vol. 2, pp. 419–444, 2002.
[10] J.-L. Gauvain, A. Messaoudi, and H. Schwenk, "Language recognition using phone lattices," in Proc. INTERSPEECH, 2004.
[11] P. Schwarz, "Phoneme recognition based on long temporal context," 2009. [Online]. Available: http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[13] W.-Q. Zhang, T. Hou, and J. Liu, "Discriminative score fusion for language identification," Chinese Journal of Electronics, vol. 19, pp. 124–128, Jan 2010.
[14] "The 2009 NIST language recognition evaluation plan," Apr 2009. [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/lang/2009/
