Spoken Language Recognition Based on Gap-Weighted Subsequence Kernels
Wei-Qiang Zhang*, Wei-Wei Liu, Zhi-Yi Li, Yong-Zhe Shi, Jia Liu
Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Abstract

Phone recognizers followed by vector space models (PR-VSM) is a state-of-the-art phonotactic method for spoken language recognition. This method resorts to a bag-of-n-grams, with each dimension of the super vector based on the counts of n-gram tokens. The n-gram cannot capture long-context co-occurrence relations due to the restriction of gram order. Moreover, it is vulnerable to the errors induced by the frontend phone recognizer. In this paper, we introduce a gap-weighted subsequence kernel (GWSK) method to overcome the drawbacks of the n-gram. GWSK counts the co-occurrence of tokens in a non-contiguous way and is therefore not only error-tolerant but also capable of revealing long-context relations. Beyond this, we further propose a truncated GWSK with constraints on context length in order to remove the interference from remote tokens and lower the computational cost, and extend the idea to lattices to take advantage of multiple hypotheses from the phone recognizer. In addition, we investigate the optimal parameter setting and computational complexity of the proposed methods. Experiments on the NIST 2009 LRE evaluation corpus with several configurations show that the proposed GWSK is consistently more effective than the PR-VSM approach.

Keywords: spoken language recognition, gap-weighted subsequence kernel (GWSK), n-gram, phone recognizer (PR), vector space model (VSM)



*Corresponding author. Tel.: +86-10-62781847. Email address: [email protected] (Wei-Qiang Zhang). URL: http://sites.google.com/site/weiqzhang/ (Wei-Qiang Zhang)

Preprint submitted to Speech Communication, February 14, 2014

1. Introduction

Spoken language recognition (SLR, usually shortened to language recognition) is a developing branch of speech signal processing. The goal is to recognize the language of a spoken utterance, with applications in multilingual speech recognition, speech translation, information security and forensics (Muthusamy et al. (1994); Zissman and Berkling (2001)).

Language recognition can be classified into two broad categories: acoustic model methods and phonotactic methods. Acoustic model methods directly model the acoustic spectral (or cepstral) feature vectors, and are also referred to as spectrum methods. The classic acoustic model methods include Gaussian mixture models (GMM) (Torres-Carrasquillo (2002)), support vector machines (SVM) (Zhang et al. (2006)), SVM with GMM super vectors (GSV) (Torres-Carrasquillo et al. (2008)) and, most recently, the i-vector method (Dehak et al. (2011)). Phonotactic methods first decode the utterance into a token string or lattice, and then model the token string or lattice using an n-gram lexicon model (Hazen and Zue (1993); Zissman and Singer (1994)), a binary-decision tree (BT) (Navratil (2001)) or a vector space model (VSM) (Li et al. (2007); Campbell et al. (2007)). These methods utilize the internal results of phone recognizers (or tokenizers), so they are also referred to as token methods.

Among phonotactic methods, the most classic approach may be the phone recognizer followed by language models (PRLM) (Zissman and Singer (1994)), which uses a phone recognizer as the frontend to obtain the token string and employs an n-gram language model as the backend to model the co-occurrence of the tokens. There are also several significant improvements on PRLM focusing on different aspects of the algorithm. The first aspect is the architecture of the frontend. The single phone recognizer is enhanced to parallel phone recognizers (PPR) (Zissman and Singer (1994)) or a universal phone recognizer (UPR) (Li et al. (2007)), which makes the frontend capable of covering more phones or acoustic units. The second aspect involves the representation of phone recognizer results. The one-best token string is extended to a multi-candidate lattice (Gauvain et al. (2004); Campbell et al. (2006b)), which leads to more accurate estimation of n-gram frequencies. The third aspect is related to the phone recognizer itself. The hidden Markov model (HMM) based phone recognizer is replaced with a neural network (NN) based decoder (Matejka et al. (2005)), which makes use of long temporal context information and gives robust token results. The fourth aspect concerns language modeling.

The n-gram models are changed to binary-decision trees (Navratil (2001)), which take advantage of binary-decision tree structures, or to vector space models (Li et al. (2007)), which make use of the powerful SVM classifier.

All these methods (except BT), however, explicitly or implicitly model the co-occurrence of the tokens as contiguous n-grams. This suffers from two main problems. One is order restriction. The model size is exponentially related to the model order n, causing severe data sparsity for large n. For this reason, it is not easy to capture long-context relations between tokens. The other problem is error sensitivity.[1] It is known that the phoneme error rate of the phone recognizers widely used for language recognition is about 40-60% (Matejka et al. (2005)), so utterances with the same content may be decoded as different token strings. For example, if an utterance is "cat" and its decoding result is "cant", the trigram (n = 3) probabilities are totally different. Changing the decoding result from string to lattice can make up for this shortcoming to some extent in the frontend, but if there are still some errors, maybe we can do something in the backend. So our motivation is two-fold: modeling the long-context dependence beyond short n-grams and providing an error-tolerant method through rough matching of token strings.

In the text processing and bioinformatics fields, string kernels have been successfully used for text classification (Lodhi et al. (2002)), text language identification (Kruengkrai et al. (2005)), and DNA sequence analysis (Kim et al. (2010)). The string kernel has many variants (Shawe-Taylor and Cristianini (2004)) depending on how the subsequences are defined, e.g. contiguous versus non-contiguous, mismatches penalized versus non-penalized. One example is the gap-weighted subsequence kernel (GWSK). The GWSK counts the presence of a subsequence with a penalty related to the number of gaps interspersed within it. In this way, it has the merit of not only being capable of revealing long-context co-occurrence but also being robust to deletion and insertion errors. In text classification there is no decoding error in the text string, so the GWSK has no significant advantage over the traditional n-gram model (Lodhi et al. (2002)). For spoken language recognition, however, the token string is generated by a phone recognizer and therefore contains errors, so using GWSK is beneficial. It is worth mentioning that as early as 1997, Navratil et al. proposed the use of a skip-gram for language identification (Navratil and Zuhlke (1997)).

[1] The errors mentioned in this paper are random errors instead of systematic errors.


This method models a pair of phones with one phone skipped. The underlying idea of the skip-gram is similar to that of GWSK; however, the GWSK is theoretically better formulated. The skip-gram and GWSK both try to capture long-context co-occurrence with a lower-order n-gram. Besides that, GWSK also has the error-tolerance ability. In this paper, we fully investigate the application of GWSK to language recognition.

The rest of the paper is organized as follows. Section 2 summarizes the relevant existing n-gram based approaches and Section 3 introduces the GWSK. In Section 4, we develop GWSK for language recognition, including the truncated version and the lattice-based version, the detailed implementation method, and some theoretical analysis of the optimal parameter and the computational complexity. Section 5 demonstrates the effectiveness of the proposed methods through detailed experiments. Finally, conclusions are given in Section 6.

2. Review of N-Gram Modeling

2.1. N-Gram Model

An n-gram is a contiguous substring of n tokens from a given token sequence (string).[2] The n-gram model assumes that the current token x_i depends only on its previous n − 1 tokens x_{i−(n−1)}, ..., x_{i−1} and models this dependence by the conditional probability

P(x_i | x_{i-(n-1)}, \ldots, x_{i-1}).    (1)

This Markov assumption simplifies the learning of the language model. To cope with the sparseness problem, back-off strategies (Zissman and Singer (1994)) are usually applied.

2.2. Vector Space Model (VSM)

A vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. The vector space model for spoken language recognition was proposed independently by Li (Li et al. (2007)) and Campbell (Campbell et al. (2006a, 2007)).[3]

[2] According to Lerma (2008), sequence and string are both ordered lists of elements, but a string is finite and a sequence is usually infinite. We do not strictly distinguish them in this paper.


Inspired by the idea of bag-of-words, the co-occurrence of tokens is modeled as a bag-of-n-grams instead of by conditional probabilities. In this way, a token string or lattice is converted into a super vector whose dimensions represent the joint probabilities of the unique n-gram tokens. In addition, the super vector is weighted by schemes such as term frequency-inverse document frequency (TF-IDF), latent semantic indexing (LSI) (Li et al. (2007)) or term frequency log likelihood ratio (TFLLR) (Campbell et al. (2004)) to improve the discriminative capability. This approach has seen further improvements, such as the methods proposed in (Penagarikano et al. (2011); Ma et al. (2007); Tong et al. (2009)).

3. Gap-Weighted Subsequence Kernel

3.1. Kernel method

The kernel method can be viewed as a generalization of the well-known SVM. It maps the data into an inner product space (Hofmann et al. (2008)) and allows us to construct algorithms in that space. In this way, it is easier to find the relations between data.

Definition 1 (Kernel function). For data x_1 and x_2, the kernel function satisfying the Mercer theorem (Shawe-Taylor and Cristianini (2004)) is defined as

K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle,    (2)

where \phi is a feature mapping, which maps the data into the feature space.

Given \phi, we could compute K(x_1, x_2) by finding \phi(x_1) and \phi(x_2) in the feature space and taking their inner product. But often the problem is that \phi(x) itself is very expensive to calculate because it is an extremely high or even infinite dimensional vector. With the kernel function, we can operate in the feature space without explicitly computing \phi(x), which often makes the algorithm computationally efficient.

[3] In Li et al. (2007) it is called PR-VSM, while in Campbell et al. (2007) it is named PR-SVM.


3.2. GWSK

Definition 2 (Subsequence (Lodhi et al. (2002))). Let Σ be a finite alphabet and Σ^n be the set of all strings of length n. For a string x, we denote its length by |x|. We say that u is a subsequence of x if there exist indices i = (i_1, ..., i_{|u|}), with 1 ≤ i_1 < ··· < i_{|u|} ≤ |x|, such that u_j = x_{i_j} for j = 1, ..., |u|, or u = x[i] for short. The context length (or span) of the subsequence u in x is l(i) = i_{|u|} − i_1 + 1 and the number of gaps is g(i) = l(i) − |u|.

From the definition, we can see that a subsequence can be non-contiguous or contiguous. It is an extension of a substring, which is just a contiguous subsequence.

Definition 3 (GWSK (Lodhi et al. (2002))). The GWSK of two strings x_1 and x_2 gives a sum over all common subsequences weighted according to their frequency of occurrence and gaps:

K_n(x_1, x_2) = \sum_{u \in \Sigma^n} \phi_u(x_1) \phi_u(x_2),    (3)

where the dimension of the feature mapping \phi(x) corresponding to u is defined as[4]

\phi_u(x) = \sum_{i : u = x[i]} \lambda^{g(i)},    (4)

where λ ∈ [0, 1] is the decaying factor. Note that for λ = 0 we define 0^0 = 1, thus

\lambda^{g(i)} = \begin{cases} 1, & \text{if } g(i) = 0, \\ 0, & \text{otherwise.} \end{cases}    (5)

This means that \phi_u(x) becomes the count of the contiguous n-gram substring u, so the GWSK can be viewed as a natural generalization of the traditional n-gram.

[4] In Shawe-Taylor and Cristianini (2004), the context length l(i) is used as the exponent of the weight. We use this gap-number weighting variant because we wish to penalize only the subsequences with gaps, and this definition is more consistent with the traditional n-gram.


Example 1. For the strings "cat" and "cant", if we consider n = 3, i.e., u ∈ Σ^3, each dimension of the feature mapping of GWSK will be:

u       ϕ(cat)   ϕ(cant)
c-a-n   0        1
c-a-t   1        λ
c-n-t   0        λ
a-n-t   0        1

where the dimension corresponding to "c-a-t" of the feature mapping is obtained as illustrated in Figure 1. The subsequence "c-a-t" occurs once in each of the strings "cat" and "cant". In "cat" the count is 1 because there is no gap, while in "cant" the count is λ because there is one gap between "a" and "t".

Figure 1: The subsequence “c-a-t” in the strings “cat” and “cant”.

According to the feature mappings, the kernel between "cat" and "cant" will be K_3("cat", "cant") = λ. We can see that, as measured by GWSK, "cat" and "cant" have similarity λ. In contrast, the kernel would be zero if measured by a traditional trigram. This is the most important advantage of GWSK over n-grams. In addition, if λ ≠ 0, GWSK has the ability to capture long-context relations without increasing the order of the n-gram, which is another advantage of GWSK.

Example 2. Revisit Example 1 using the idea of the skip-gram (Navratil and Zuhlke (1997)). The skip-gram was proposed in the framework of the traditional n-gram model; however, its idea can easily be extended to the framework of the VSM. If we consider the first-order skip-gram (with one phone skipped), each dimension of the feature mapping will be:

u     ϕ(cat)   ϕ(cant)
c-t   1        0
c-n   0        1
a-t   0        1


Figure 2: The skip-grams "c-t", "c-n" and "a-t" in the strings "cat" and "cant".

where the dimensions of the feature mapping are obtained as illustrated in Figure 2. We find that the kernel between "cat" and "cant" is zero if measured by the skip-gram. This shows that the skip-gram is vulnerable to deletion and insertion errors, just like the n-gram.

4. GWSK for Language Recognition

4.1. Truncated GWSK

Although GWSK can depict co-occurrence relations even up to infinite length, this may greatly increase the computational burden. Furthermore, for spoken language recognition, if the token span exceeds some context length, the relation between remote tokens becomes unimportant and may even add noise. So we can consider GWSK within some context length, which suggests the truncated GWSK.

Definition 4 (Truncated GWSK). The definition of the truncated GWSK is similar to that of the standard GWSK, except that the sum runs over the common subsequences with context length no more than T. Thus the dimension of the feature mapping corresponding to u is defined as

\phi_u(x) = \sum_{i : u = x[i],\; l(i) \le T} \lambda^{g(i)}.    (6)

Note that if we set T = ∞, the truncated GWSK becomes the standard untruncated GWSK.

Example 3. For the string "language", consider only the dimension of the feature mapping ϕ corresponding to "l-a-g". As illustrated in Figure 3, the subsequence "l-a-g" occurs three times. If we consider T = 3, none of the occurrences satisfies context length ≤ T, so ϕ_{l-a-g} = 0. If we consider T = 5, there is only one occurrence with context length ≤ T (the dashed line, with one gap), so ϕ_{l-a-g} = λ. If we consider T = ∞, all the occurrences are counted (one dashed line with one gap and two solid lines with four gaps each), so ϕ_{l-a-g} = λ + 2λ^4.


Figure 3: The subsequence “l-a-g” in the string “language”.
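To make Definition 4 and Example 3 concrete, the following Python sketch (ours, not part of the original implementation; the function name is illustrative) enumerates subsequence occurrences by brute force and applies the span constraint. With λ = 0.3 it reproduces ϕ_{l-a-g} = 0, λ, and λ + 2λ^4 for T = 3, 5, and ∞.

from itertools import combinations

def phi_u(x, u, lam, T=float("inf")):
    """Truncated GWSK dimension phi_u(x): sum of lam**g(i) over all
    occurrences of the subsequence u in x whose span l(i) is at most T."""
    value = 0.0
    for idx in combinations(range(len(x)), len(u)):   # increasing index tuples
        if [x[i] for i in idx] != list(u):
            continue
        span = idx[-1] - idx[0] + 1                   # context length l(i)
        if span <= T:
            value += lam ** (span - len(u))           # gaps g(i) = l(i) - |u|
    return value

lam = 0.3
print(phi_u("language", "lag", lam, T=3))   # 0.0
print(phi_u("language", "lag", lam, T=5))   # lam (one gap)
print(phi_u("language", "lag", lam))        # lam + 2 * lam**4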

4.2. TFLLR-Like Normalization and Weighting

The GWSK is usually normalized to remove the effect of string length. For example, in (Lodhi et al. (2002)) the feature mapping is normalized as

\hat{\phi}(x) = \frac{\phi(x)}{\|\phi(x)\|_2}.    (7)

This can easily be converted into operations between kernel functions. In language recognition, we usually use another scheme, namely term frequency log likelihood ratio (TFLLR) (Campbell et al. (2007)). First, the feature mapping is normalized as

\hat{\phi}(x) = \frac{\phi(x)}{\|\phi(x)\|_1}.    (8)

This corresponds to converting the term counts into term frequencies. After that, the feature mapping is weighted by

\tilde{\phi}(x) = \psi \circ \hat{\phi}(x),    (9)

where \circ is element-wise multiplication and \psi is a weighting vector whose dimension corresponding to u is

\psi_u = \begin{cases} \sqrt{\dfrac{\|\sum_{\forall x} \phi(x)\|_1}{\sum_{\forall x} \phi_u(x)}}, & \text{if } \sum_{\forall x} \phi_u(x) > 0, \\ 0, & \text{otherwise.} \end{cases}    (10)

This corresponds to log likelihood ratio weighting (Campbell et al. (2007)). In fact, \psi can be approximated as

\psi_u = \begin{cases} \sqrt{\dfrac{1}{E_x\{\hat{\phi}_u(x)\}}}, & \text{if } E_x\{\hat{\phi}_u(x)\} > 0, \\ 0, & \text{otherwise,} \end{cases}    (11)

where E_x\{\cdot\} denotes the expectation operator over x.
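As a concrete illustration of (8), (9) and the approximation (11), the sketch below (ours, with sparse feature maps stored as Python dicts; the helper names are illustrative, not the paper's actual code) performs the L1 normalization and the TFLLR-like weighting.

from collections import defaultdict

def l1_normalize(phi):
    """Eq. (8): convert raw GWSK counts into term frequencies."""
    total = sum(phi.values())
    return {u: v / total for u, v in phi.items()} if total > 0 else dict(phi)

def tfllr_weights(normalized_maps):
    """Approximate eq. (11): psi_u = sqrt(1 / E_x{phi_hat_u(x)}),
    with the expectation estimated over the training feature maps."""
    mean = defaultdict(float)
    for phi in normalized_maps:
        for u, v in phi.items():
            mean[u] += v / len(normalized_maps)
    return {u: (1.0 / m) ** 0.5 for u, m in mean.items() if m > 0}

def apply_weights(phi_hat, psi):
    """Eq. (9): element-wise multiplication of the normalized map by psi."""
    return {u: v * psi[u] for u, v in phi_hat.items() if u in psi}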

4.3. Lattice-Based GWSK

As mentioned previously, a token lattice provides multiple internal result candidates and thus can improve performance. In fact, lattices try to deal with the decoding uncertainty in terms of hypotheses, whereas GWSK tries to cope with it in the temporal direction. So if we combine lattices and GWSK, further gains can be expected.

Definition 5 (Lattice-Based GWSK). The GWSK of two lattices L_1 and L_2 is defined as

K_n(L_1, L_2) = \sum_{u \in \Sigma^n} \phi_u(L_1) \phi_u(L_2),    (12)

where the dimension of the feature mapping \phi(L) corresponding to u is defined as

\phi_u(L) = \sum_{x \in L} \sum_{i : u = x[i]} p(x|L) \lambda^{g(i)},    (13)

where x is a possible path of L and p(x|L) is its probability. Similarly, it is easy to define a truncated lattice-based GWSK following Definition 4.

4.4. Implementation

4.4.1. GWSK

According to the definition, the GWSK can be rewritten as

K_n(x_1, x_2) = \sum_{u \in \Sigma^n} \sum_{(i_1, i_2) : u = x_1[i_1] = x_2[i_2]} \lambda^{g(i_1) + g(i_2)}.    (14)

This expression enables us to compute the kernel function without explicitly obtaining the feature mapping. There are many methods for efficient implementation of GWSK, such as dynamic programming, recursive computation, and suffix trees (Lodhi et al. (2002); Rousu and Shawe-Taylor (2005); Yin et al. (2008)). If we normalized the feature mapping as in (7), the operations could be carried out at the kernel-function level and we could make use of these mature implementation methods (Lodhi et al. (2002); Rousu and Shawe-Taylor (2005); Yin et al. (2008)). In our application, however, the TFLLR-like normalization and weighting is not easy to convert into a kernel. Fortunately, unlike in text processing, the length of the token string

for a typical 30-s utterance is no more than several hundred tokens, which makes straightforward computation possible. The implementation is based on an n-fold loop whose computational complexity is O(|x|^n). The pseudocode is listed in Algorithm 1.

Algorithm 1 Implementation of string-based GWSK feature mapping.
1:  ϕ = 0
2:  for i_1 = 1, ..., |x| − n + 1 do
3:    for i_2 = i_1 + 1, ..., |x| − n + 2 do
4:      ...
5:        for i_n = i_{n−1} + 1, ..., |x| do
6:          i = (i_1, ..., i_n)
7:          u = x[i]
8:          g = i_n − i_1 + 1 − n
9:          ϕ_u = ϕ_u + λ^g
10:       end for
11:     ...
12:   end for
13: end for
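For reference, here is a direct Python transcription of the n-fold loop above (a sketch of ours under the same assumptions; the optional span limit T simply filters occurrences rather than pruning the loop bounds as the truncated variant in the next subsection does). Applied to Example 1 it returns ϕ_{c-a-t}("cant") = λ.

from collections import defaultdict
from itertools import combinations

def gwsk_feature_map(x, n, lam, T=float("inf")):
    """String-based GWSK feature mapping (cf. Algorithm 1): a sparse dict
    mapping each length-n subsequence u to the sum of lam**g(i) over its
    occurrences, keeping only occurrences whose span is at most T."""
    phi = defaultdict(float)
    for idx in combinations(range(len(x)), n):   # i_1 < i_2 < ... < i_n
        span = idx[-1] - idx[0] + 1              # l(i)
        if span > T:
            continue
        u = tuple(x[i] for i in idx)
        phi[u] += lam ** (span - n)              # g(i) = l(i) - n
    return dict(phi)

lam = 0.3
print(gwsk_feature_map("cant", 3, lam)[("c", "a", "t")])  # prints lam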

4.4.2. Truncated GWSK

For the truncated GWSK, we can check whether l(i) = i_n − i_1 + 1 ≤ T in the innermost loop, and only accumulate the statistics satisfying this condition. To speed up this process, the constraint can be moved as far out in the loops as possible. Based on this idea, the algorithm for the truncated GWSK is given in Algorithm 2, with a computational complexity of O(|x| T^{n−1}). (Note that usually T << |x|.)

4.4.3. Confusion Network Approach for Lattice-Based GWSK

A confusion network (CN), also referred to as a sausage, is a compact representation of a lattice or n-best list. As illustrated in Figure 4, a confusion network consists of a series of slots, and each slot contains one or more alternative edges. Each edge is labeled with a token and a posterior probability. We can convert the lattice to a confusion network using a tool such as the lattice-tool of the SRILM toolkit (Stolcke (2002)), which makes it easier to implement the lattice-based GWSK. Suppose there are I slots in the confusion network C, and J_i edges in the i-th slot.

Algorithm 2 Implementation of truncated string-based GWSK feature mapping.
1:  ϕ = 0
2:  for i_1 = 1, ..., |x| − n + 1 do
3:    for i_2 = i_1 + 1, ..., min(|x| − n + 2, i_1 + T − n + 1) do
4:      ...
5:        for i_n = i_{n−1} + 1, ..., min(|x|, i_1 + T − 1) do
6:          i = (i_1, ..., i_n)
7:          u = x[i]
8:          g = i_n − i_1 + 1 − n
9:          ϕ_u = ϕ_u + λ^g
10:       end for
11:     ...
12:   end for
13: end for


Figure 4: An illustration of a confusion network.

Let C[i, j] denote the token on the j-th edge of the i-th slot, and p[i, j] the corresponding posterior probability. We can generalize Algorithm 1 to give the implementation, with pseudocode listed in Algorithm 3. The only difference is that inside the n-fold loops there are deeper loops for obtaining the n-gram tokens u and the expected counts c. For the truncated lattice-based GWSK, the corresponding confusion-network-based implementation can be obtained in the same way.

4.4.4. Lattice-Tool Approach for Lattice-Based GWSK

For the truncated GWSK of a lattice, if the truncated context length T is not large, we can convert the lattice to T-gram counts using the lattice-tool (Stolcke (2002)). Through this tool, we can compute posterior expected n-gram counts (for n up to T) in the lattices and output them to a count file. After that, the feature mapping can be obtained easily, with pseudocode listed in Algorithm 4.

Algorithm 3 Implementation of lattice-based GWSK feature mapping.
1:  ϕ = 0
2:  for i_1 = 1, ..., I − n + 1 do
3:    for i_2 = i_1 + 1, ..., I − n + 2 do
4:      ...
5:        for i_n = i_{n−1} + 1, ..., I do
6:          for j_1 = 1, ..., J_{i_1} do
7:            for j_2 = 1, ..., J_{i_2} do
8:              ...
9:                for j_n = 1, ..., J_{i_n} do
10:                 u = C[i_1, j_1] C[i_2, j_2] ··· C[i_n, j_n]
11:                 c = \prod_{k=1}^{n} p[i_k, j_k]
12:                 g = i_n − i_1 + 1 − n
13:                 ϕ_u = ϕ_u + c · λ^g
14:               end for
15:             ...
16:           end for
17:         end for
18:       end for
19:     ...
20:   end for
21: end for
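The confusion-network version can be sketched the same way (again ours, with the CN represented as a list of slots, each a list of (token, posterior) pairs — an assumed data layout, not a fixed interface); it mirrors the deeper loops of Algorithm 3.

from collections import defaultdict
from itertools import combinations, product

def cn_gwsk_feature_map(cn, n, lam, T=float("inf")):
    """Lattice-based GWSK feature mapping over a confusion network
    (cf. Algorithm 3). cn[i] is a list of (token, posterior) pairs for
    slot i; each selected n-gram is weighted by the product of its edge
    posteriors and by lam**g(i)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(cn)), n):       # slot indices i_1 < ... < i_n
        span = idx[-1] - idx[0] + 1
        if span > T:
            continue
        gap_weight = lam ** (span - n)
        for edges in product(*(cn[i] for i in idx)):  # one edge per chosen slot
            u = tuple(token for token, _ in edges)
            c = 1.0
            for _, posterior in edges:
                c *= posterior
            phi[u] += c * gap_weight
    return dict(phi)

# Toy two-slot CN (posteriors are made up for illustration).
toy_cn = [[("a", 0.6), ("b", 0.4)], [("c", 0.9), ("d", 0.1)]]
print(cn_gwsk_feature_map(toy_cn, 2, 0.3))  # e.g. ('a', 'c') -> 0.54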


Algorithm 4 Implementation of truncated lattice-based GWSK feature mapping.
1:  ϕ = 0
2:  for each line with tokens s and expected count c in the count file do
3:    if |s| ≥ n then
4:      i_1 = 1; i_n = |s|
5:      for i_2 = 2, ..., |s| − n + 2 do
6:        ...
7:        for i_{n−1} = i_{n−2} + 1, ..., |s| − 1 do
8:          i = (i_1, ..., i_n)
9:          u = s[i]
10:         g = |s| − n
11:         ϕ_u = ϕ_u + c · λ^g
12:       end for
13:       ...
14:     end for
15:   end if
16: end for
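A sketch of this count-file route (ours; it assumes each line of the file holds the m-gram tokens followed by their posterior expected count, whitespace separated — the exact lattice-tool invocation and output format are not reproduced here).

from collections import defaultdict
from itertools import combinations

def gwsk_from_count_file(path, n, lam):
    """Truncated lattice-based GWSK feature map from expected m-gram counts
    (m up to T), following the structure of Algorithm 4: the first and last
    tokens of each counted m-gram are fixed, the inner n-2 positions range
    over the remaining tokens, and the gap number is |s| - n throughout."""
    phi = defaultdict(float)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < n + 1:        # need at least n tokens plus a count
                continue
            s, c = parts[:-1], float(parts[-1])
            g = len(s) - n
            for mid in combinations(range(1, len(s) - 1), n - 2):
                u = (s[0],) + tuple(s[i] for i in mid) + (s[-1],)
                phi[u] += c * lam ** g
    return dict(phi)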

4.4.5. SVM Classifier

The support vector machine (SVM) proposed by Vapnik (Vapnik (1995)) is a popular supervised classifier. Its basic idea is to find the optimal separating hyperplane that distinguishes the positively and negatively labeled samples. Let D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be the training set and y_i = ±1 the target values. By maximizing the margin between the positive and negative classes with proper regularization, we can obtain the parameters {α_i} and b, which satisfy \sum_{i=1}^{N} \alpha_i y_i = 0 and α_i > 0. For test data x, the SVM predicts its target value by

f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b.    (15)

Since we have obtained the feature mapping ϕ, the remaining task is straightforward. We can use any SVM tool with a linear kernel to compute the kernel functions and train the separating hyperplane. In this paper we use LIBLINEAR (Fan et al. (2008)), which is well suited to solving large-scale regularized linear problems quickly.
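For orientation, a minimal backend sketch (ours) trains one-versus-the-rest linear SVMs on the weighted super vectors; it uses scikit-learn's LinearSVC, which wraps LIBLINEAR, as a stand-in for the paper's exact setup, and the helper names are illustrative.

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_language_svms(feature_maps, labels, C=1.0):
    """feature_maps: list of sparse dicts (n-gram key -> weighted value);
    labels: parallel list of language identifiers."""
    vec = DictVectorizer()
    X = vec.fit_transform([{" ".join(u): v for u, v in fm.items()}
                           for fm in feature_maps])
    models = {}
    for lang in sorted(set(labels)):                    # one-vs-the-rest
        y = np.array([1 if l == lang else -1 for l in labels])
        models[lang] = LinearSVC(C=C).fit(X, y)
    return vec, models

def language_scores(vec, models, feature_map):
    x = vec.transform([{" ".join(u): v for u, v in feature_map.items()}])
    return {lang: float(m.decision_function(x)[0]) for lang, m in models.items()}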

4.5. Optimal Decaying Factor

The performance of GWSK depends greatly on the value of the decaying factor λ. We analyze its optimal value in this subsection. Of course, the relation between λ and the recognition rate is too complex to be derived precisely; however, we can simply assume that the best performance is achieved when the counting loss due to decoding errors and the counting gain due to GWSK are balanced. The basic idea is borrowed from dynamic equilibrium, which seeks a balance between the loss and the gain:

\text{Loss} \rightleftharpoons \text{Gain}.    (16)

In our case, for a subsequence u in a string x, let N_i(x, u) denote the number of occurrences of u with i gaps. For example, if u = "a-b", then N_0(x, u) is the number of occurrences of "a-b", N_1(x, u) is the number of occurrences of "a-?-b" (where the question mark "?" denotes a single token), N_2(x, u) is the number of occurrences of "a-?-?-b", and so on.


Figure 5: The balance between the loss and the gain in the number of occurrences of a subsequence.

As illustrated in Figure 5, if we enumerate the occurrences of u, the loss due to decoding errors is N_0(x, u) e, where e is the error rate of u, while the gain due to GWSK is \sum_{i=1}^{\infty} N_i(x, u) \lambda^i. Assuming the loss and gain are balanced, i.e.,

E_x\left\{\sum_u \sum_{i=1}^{\infty} N_i(x, u) \lambda^i\right\} = E_x\left\{\sum_u N_0(x, u) e\right\}.    (17)

Interchanging the order of expectation and summation, and defining E_x\{\sum_u N_i(x, u)\} = N_i, we get

\sum_{i=1}^{\infty} N_i \lambda^i = N_0 e.    (18)

The solution to (18) can be found in the Appendix. For the lattice-based GWSK, because each possible path is in fact a string, the analysis is similar and the conclusion is identical. Note that although we use the same notation e in our derivations, its value may differ in different circumstances. For strings, the error rate for an n-gram is approximately e ≈ 1 − (1 − e_1)^n, where e_1 is the phoneme error rate. For lattices, this value becomes smaller because the phone recognizer provides more candidates.

Through numerical methods, we can solve for the optimal decaying factor. The results are shown in Table 1. As the values are very close, we intentionally list more significant digits. In our evaluation, we set e_1 = 0.4196, which is the phoneme error rate of the Hungarian phone recognizer (Matejka et al. (2005)), and E_x{|x|} = 330.83, which is the average number of phones in the 30-s segments of the 2009 NIST Language Recognition Evaluation (LRE09) corpus (NIST (2009)). From the results we can conclude that the optimal λ is about 0.4 for n = 2 and about 0.25 for n = 3.

Table 1: Theoretical optimal decaying factor λ for string-based GWSK.

n-gram   T = 5    T = 7    T = 9    T = ∞
2        0.3999   0.3999   0.3999   0.3999
3        0.2834   0.2583   0.2565   0.2564
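The T = ∞ column of Table 1 can be reproduced numerically; the sketch below (ours) plugs the untruncated N_i from (A.2) into (18) and solves for λ by bisection, using e_1 = 0.4196 and E_x{|x|} = 330.83 as above. The truncated columns additionally need the span-limited M and are not handled here.

from math import comb

def optimal_lambda(avg_len, n, e1, tol=1e-6):
    """Solve sum_{i>=1} N_i * lam**i = N_0 * e (eq. 18), untruncated case."""
    e = 1 - (1 - e1) ** n                 # approximate n-gram error rate
    N0 = avg_len - n + 1
    M = int(N0)                           # untruncated: M = N_0
    def balance(lam):
        gain = sum(comb(i + n - 2, i) * (N0 - i) * lam ** i for i in range(1, M))
        return gain - N0 * e              # increasing in lam
    lo, hi = 0.0, 0.99
    while hi - lo > tol:                  # simple bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if balance(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(round(optimal_lambda(330.83, 2, 0.4196), 4))  # close to 0.40
print(round(optimal_lambda(330.83, 3, 0.4196), 4))  # close to 0.26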

It should be noted that we only provide an approximate analysis; there are still many factors we have not considered. For example, the phoneme error rate refers only to utterances in the recognizer's own language and may differ for other languages. Thus, the selection of the value of λ remains open.

4.6. Computational Complexity

The whole GWSK procedure consists of three parts: decoding, feature mapping (super vector) generation, and SVM training/testing. In this subsection we analyze the computational complexity of each part and compare it with the counterpart in PR-VSM. The decoding parts of PR-VSM and GWSK are identical, and the super vector generation part of GWSK has been analyzed in Section 4.4. For SVM

training/testing, the basic operation is the inner product of super vectors, whose computational cost depends greatly on the number of nonzero elements (NNZ) of the super vectors (i.e., the nonzero dimensions of the feature mapping). So we mainly focus on this aspect.

There are (|x| − n + 1) substrings of length n in the string x. So for the PR-VSM super vector the NNZ is at most (|x| − n + 1), while for the GWSK super vector the NNZ is at most min{\prod_{i=0}^{n−1}(|x| − i), |Σ|^n}, where |Σ|^n is the nominal dimension of the super vector. We can calculate the ratio between the NNZs to indicate the increase in computational cost; this ratio is approximately min{|x|^{n−1}, |Σ|^n / |x|}. For the truncated GWSK, the maximum NNZ is min{|x| \prod_{i=1}^{n−1}(T − i), |Σ|^n}, so the ratio is approximately min{T^{n−1}, |Σ|^n / |x|}. For the lattice-based version, if there are J edges per slot (see Figure 4), the maximum NNZs are multiplied by J^n and the ratio remains unchanged. Note that the above ratios are calculated between maximum NNZs, so they are not strict upper bounds. However, since the numerator and denominator are both obtained in the limit case, the ratio can be regarded as a coarse upper bound.

To give concrete examples, we compute some real values on the NIST LRE09 corpus using the Hungarian phone recognizer. The size of the phoneme set is |Σ| = 58 and the gram order is n = 3. Our results are obtained by averaging over all the nominal 30-s segments. First we calculate the string-based, lattice-based and derived (coarse bound) NNZ ratios, plotted in Figure 6. For the untruncated case, the coarse bound reaches 590, whereas the string-based ratio is 30 and the lattice-based ratio is only 2.5. This shows that the increase in the computational cost of GWSK is modest, especially for the lattice-based version. Next, we measure the real-time (RT) factors of each part and list the results in Table 2. For the training stage, the super vector product is the dominant part, so compared to lattice-based PR-VSM, the computational cost increases by about 1.5 times for the untruncated GWSK and by only 8.1% for the truncated GWSK (T = 7). For the test stage, decoding and super vector generation are the dominant parts; the computational cost increases by about 50% for the untruncated GWSK, with almost no increase for the truncated GWSK (T = 7).
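Plugging the paper's numbers into the coarse bound min{T^{n−1}, |Σ|^n/|x|} gives the value quoted above; a quick arithmetic check (ours):

sigma_size, n, avg_len, T = 58, 3, 331, 7
print(min(avg_len ** (n - 1), sigma_size ** n / avg_len))  # untruncated bound, roughly 590
print(min(T ** (n - 1), sigma_size ** n / avg_len))        # truncated (T = 7) bound: 49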



Figure 6: Ratio between the numbers of nonzero elements of GWSK and PR-VSM super vectors (n = 3), HU frontend, LRE09, 30-s test.

5. Experiments

5.1. Experimental Setup

We perform our experiments on the NIST LRE09 corpus (NIST (2009)) under the closed-set test condition. In this evaluation, there are 23 target languages: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, and Vietnamese. There are a total of 41793 test segments, of which 14166, 13847, and 13780 segments are for the 30-s, 10-s, and 3-s nominal duration tests, respectively. The evaluation involves two channel conditions: conversational telephone speech (CTS) and Voice of America (VOA) radio broadcasts. For CTS, our training data come from the CallFriend, CallHome, OGI, OHSU and LRE07 Train corpora. For VOA, our training data come from the VOA3 training data provided by NIST. The development data come from LRE03, LRE05, LRE07 and VOA data with NIST annotations, plus some VOA3 data not used for training. Long speech files are cut into short segments of about 30 s of speech. In total there are about 186000 training segments.

In the initial experiments, we use the Hungarian phone recognizer (HU frontend) and test 30-s segments for parameter tuning.

Table 2: Comparison of real-time factors for PR-VSM and GWSK, HU frontend, LRE09, 30-s test. CPU: Xeon [email protected], RAM: 8 GB, single thread. SV gen.: super vector generation, SV prod.: super vector product.

method           str/lat   T    decoding   SV gen.       SV prod.
PR-VSM (n = 3)   string    -    0.10       4.0 × 10^-5   7.2 × 10^-8
                 lattice   -    0.11       1.1 × 10^-4   3.7 × 10^-6
GWSK (n = 3)     string    7    0.10       8.3 × 10^-5   7.2 × 10^-7
                 string    ∞    0.10       4.2 × 10^-3   2.1 × 10^-6
                 lattice   7    0.11       3.1 × 10^-4   4.0 × 10^-6
                 lattice   ∞    0.11       5.4 × 10^-2   9.1 × 10^-6

Finally, we use parallel Hungarian, Russian and Czech phone recognizers (HU+RU+CZ frontend) (Matejka et al. (2005)) and test segments of all durations for further validation. In SVM modeling, we use the one-versus-the-rest method. After that, we use linear discriminant analysis (LDA) and a GMM trained with the maximum mutual information (MMI) criterion (Zhang et al. (2010)) as the backend to further process the scores. In the evaluation, the detection task is performed for each language, and the closed-set pooled equal error rate (EER) and minimum average cost (Cavg) (NIST (2009)) are used as performance measures.

5.2. Baseline

First we evaluate the traditional phonotactic methods, including PRLM, skip-gram, PR-BT and PR-VSM, as our baselines. For detailed comparison, we list the results for n = 2 and n = 3 in Table 3. From the results, we can see that the lattice-based methods are significantly better than their string-based counterparts, and the trigrams (n = 3) are consistently better than the bigrams (n = 2). These results are consistent with the literature. Among all the methods, PR-VSM achieves the best performance, so we only compare GWSK with PR-VSM in the following experiments.


Table 3: Performance of string- and lattice-based n-gram methods, HU frontend, LRE09, 30-s test.

method                str/lat   n-gram   EER (%)   Cavg (%)
PRLM                  string    2        5.91      5.87
                      string    3        4.79      4.68
                      lattice   2        5.24      5.03
                      lattice   3        3.66      3.54
bigram + skip-gram    string    2        5.47      5.27
                      lattice   2        5.29      5.09
PR-BT                 string    -        5.02      4.83
                      lattice   -        2.53      2.39
PR-VSM (or PR-SVM)    string    2        5.26      5.13
                      string    3        4.83      4.71
                      lattice   2        2.81      2.66
                      lattice   3        2.10      1.95

5.3. String-Based GWSK

5.3.1. GWSK

To compare GWSK with string-based PR-VSM, we test GWSK for n = 2 and n = 3. The decaying factor is varied from 0 to 0.9 with a step size of 0.1, and the results are plotted in Figure 7 and Figure 8. Note that λ = 0 corresponds to the traditional PR-VSM. We can see that for n = 2 and n = 3 the GWSK achieves its best performance at λ = 0.4 and λ = 0.3 respectively, and the results are better than the PR-VSM counterparts. In Section 4.5, we derived the optimal λ ≈ 0.4 for n = 2 and λ ≈ 0.3 for n = 3, so the experimental results match the theoretical analysis well. In addition, from the trend of the bars in Figure 7 and Figure 8, we can observe that for large λ (such as λ > 0.7) GWSK performs even worse than PR-VSM. GWSK is based on rough matching, which introduces ambiguity while tolerating errors. If we penalize the gaps too little (i.e., large λ), the ambiguity becomes the dominant factor and the degradation of the results is understandable.


Figure 7: Performance of string-based GWSK (n = 2), HU frontend, LRE09, 30-s test.

5.3.2. Truncated GWSK

We investigate the performance of the truncated GWSK in this subsection. Besides the decaying factor λ, we also vary the truncated context length T. The results are listed in Table 4 and Table 5 for n = 2 and n = 3, respectively. Note that T = ∞ corresponds to the standard (untruncated) GWSK. From the results we can see that for each fixed T, the trend with respect to λ is similar to that of the standard GWSK: the EERs/Cavgs first decrease and then increase with increasing λ, and the minimum occurs at λ = 0.4. On the other hand, for each fixed λ, the truncated GWSK approaches the standard GWSK as T increases. For large λ, the EERs/Cavgs increase monotonically; for proper λ, as expected, the EERs/Cavgs first decrease and then increase, and the minimum occurs at T = 7. We can also observe that T has little effect on the performance, as predicted in Section 4.5. Based on this result, we can select small values of T to lower the computational burden while retaining similar performance.

5.4. Lattice-Based GWSK

After the string-based GWSK, we evaluate the performance of the lattice-based GWSK. From the previous analysis and results we know that T has little influence on the performance, so we set T = 7 directly and again vary λ. The results are shown in Figure 9 and Figure 10.



Figure 8: Performance of string-based GWSK (n = 3), HU frontend, LRE09, 30-s test.

Similarly to the string-based GWSK, there exists an optimal value of λ. This time, however, the optimal value is smaller than that of the string-based GWSK: it changes from 0.4 to 0.3 for n = 2, and from 0.3 to 0.2 for n = 3. This is because the average decoding error rate of a lattice is lower than that of a string, so the optimal λ becomes smaller according to the analysis in Section 4.5. In addition, as discussed previously, lattices directly compensate for the uncertainty in the hypothesis direction. At the same time, they also implicitly compensate for the uncertainty in the temporal direction, which dilutes the effect of GWSK. So we need to penalize more to obtain an optimal tradeoff in the lattice-based GWSK.

5.5. Parallel Phone Recognizer Experiments

In the previous subsections, we have seen that GWSK outperforms the PR-VSM method. In this section, we further validate GWSK using the HU, RU, and CZ frontends. We focus only on the most challenging case, i.e., lattice-based PR-VSM versus GWSK with n = 3. For GWSK, we use the truncated version and set the parameters λ = 0.2 and T = 7. The results of the parallel frontends are obtained through the LDA + MMI score fusion backend (Zhang et al. (2010)). The detection error trade-off (DET) curves are shown in Figure 11, and the EERs and Cavgs are listed in Table 6. From the results we can see a consistent performance improvement from changing PR-VSM to GWSK for both single and parallel frontends. For the HU+RU+CZ frontend and the 30-s test, the Cavg decreases from 1.32% to 1.20%.

Table 4: Performance of truncated string-based GWSK (n = 2), HU frontend, LRE09, 30-s test (EER and Cavg in %).

       T = 5          T = 7          T = 9          T = ∞
λ      EER    Cavg    EER    Cavg    EER    Cavg    EER    Cavg
0.1    4.67   4.64    4.67   4.65    4.67   4.64    4.67   4.64
0.2    4.53   4.51    4.53   4.50    4.53   4.49    4.53   4.50
0.3    4.51   4.44    4.50   4.45    4.47   4.45    4.47   4.45
0.4    4.38   4.33    4.37   4.33    4.37   4.33    4.38   4.33
0.5    4.58   4.51    4.55   4.48    4.53   4.44    4.51   4.44
0.6    4.71   4.61    4.78   4.65    4.78   4.68    4.81   4.70
0.7    5.09   4.93    5.07   4.99    5.14   5.07    5.10   5.01
0.8    5.53   5.31    5.48   5.40    5.66   5.59    5.74   5.68
0.9    6.03   5.74    6.11   6.00    6.54   6.27    7.04   6.84

6. Conclusion

In this paper, we have proposed a gap-weighted subsequence kernel (GWSK) method for spoken language recognition. GWSK is based on gap-weighted rather than contiguous matching, and so is much less vulnerable to deletion and insertion errors of the frontend phone recognizer. In addition, the standard GWSK is extended to a truncated version, which constrains the context length to eliminate the interference from remote tokens and reduce the computational burden, and to a lattice-based version, which counts the presence of a subsequence over multi-candidate paths to take advantage of the multiple hypotheses of the phone recognizer. Furthermore, we derive the optimal decaying factor and analyze the computational complexity of the proposed methods. Experiments on the NIST 2009 LRE corpus show that GWSK is more effective than the state-of-the-art PR-VSM approach.


Table 5: Performance of truncated string-based GWSK (n = 3), HU frontend, LRE09, 30-s test (EER and Cavg in %).

       T = 5          T = 7          T = 9          T = ∞
λ      EER    Cavg    EER    Cavg    EER    Cavg    EER    Cavg
0.1    4.57   4.41    4.58   4.42    4.66   4.66    4.67   4.46
0.2    4.33   4.21    4.28   4.20    4.32   4.21    4.32   4.22
0.3    4.27   4.13    4.24   4.17    4.25   4.17    4.25   4.16
0.4    4.38   4.16    4.38   4.25    4.38   4.22    4.32   4.21
0.5    4.43   4.29    4.43   4.32    4.45   4.31    4.43   4.32
0.6    4.56   4.42    4.77   4.57    4.65   4.54    4.72   4.57
0.7    4.71   4.58    4.93   4.80    5.03   4.91    5.33   5.19
0.8    4.86   4.70    5.21   5.06    5.45   5.36    5.96   5.88
0.9    5.20   4.91    5.72   5.57    6.26   6.16    8.15   7.99

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant No. 61370034, No. 61273268 and No. 61005019.

Appendix A. Solution to (18)

Suppose the length of the string x is |x| and |u| = n, and define \sum_u N_i(x, u) = N_i(x). We obtain

N_i(x) = \begin{cases} |x| - n + 1, & \text{if } i = 0, \\ \binom{i+n-2}{i} (N_0(x) - i), & \text{if } 1 \le i < M(x), \\ 0, & \text{if } i \ge M(x), \end{cases}    (A.1)

where \binom{i+n-2}{i} is the number of combinations "(i + n − 2) choose i", M(x) = N_0(x) for the untruncated GWSK, and M(x) = min{T − n, N_0(x)} for the truncated GWSK with context length T. Taking the expectation with respect to x and denoting E_x{M(x)} = M, we get

N_i = \begin{cases} E_x\{|x|\} - n + 1, & \text{if } i = 0, \\ \binom{i+n-2}{i} (N_0 - i), & \text{if } 1 \le i < M, \\ 0, & \text{if } i \ge M. \end{cases}    (A.2)


Figure 9: Performance of truncated lattice-based GWSK (n = 2, T = 7), HU frontend, LRE09, 30-s test.

Appendix A.1. Case n = 2

If n = 2, then \binom{i+n-2}{i} = 1. Substituting (A.2) into (18), we obtain

\sum_{i=1}^{M} (N_0 - i) \lambda^i = N_0 e.    (A.3)

Through some additional mathematical manipulation, we get

\lambda^{M+1} - N_0 (1 + e) \lambda^2 + (N_0 - 1 + 2 N_0 e) \lambda - N_0 e = 0.    (A.4)

Appendix A.2. Case n = 3

If n = 3, then \binom{i+n-2}{i} = i + 1, thus giving

\sum_{i=1}^{M} (i + 1)(N_0 - i) \lambda^i = N_0 e.    (A.5)

This leads to

-C_1 \lambda^{M+3} + (2 C_1 + C_2) \lambda^{M+2} - (C_1 + C_2 - 2) \lambda^{M+1} + N_0 (1 + e) \lambda^3 - 3 N_0 (1 + e) \lambda^2 + (2 N_0 - 2 + 3 N_0 e) \lambda - N_0 e = 0,    (A.6)

where C_1 = (M + 1)(N_0 - M) and C_2 = N_0 - 2M.


Figure 10: Performance of truncated lattice-based GWSK (n = 3, T = 7), HU frontend, LRE09, 30-s test.

References

Campbell, W., Gleason, T., Navratil, J., Reynolds, D., Shen, W., Singer, E., Torres-Carrasquillo, P., June 2006a. Advanced language recognition using cepstra and phonotactics: MITLL system performance on the NIST 2005 language recognition evaluation. In: Proc. Odyssey. San Juan.

Campbell, W., Gleason, T., Reynolds, D., Singer, E., June 2006b. Experiments with lattice-based PPRLM language identification. In: Proc. Odyssey. San Juan.

Campbell, W. M., Campbell, J. P., Reynolds, D. A., Jones, D. A., Leek, T. R., May 2004. High-level speaker verification with support vector machines. In: Proc. ICASSP. Montreal, pp. I-73–I-76.

Campbell, W. M., Richardson, F., Reynolds, D. A., Apr. 2007. Language recognition with word lattices and support vector machines. In: Proc. ICASSP. Vol. 4. Honolulu, pp. 989–992.

Dehak, N., Kenny, P., Dehak, R., May 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, Lang. Process. 19 (4), 788–798.



Figure 11: DET curves of lattice-based PR-VSM and truncated GWSK, LRE09, HU+RU+CZ frontend, full training.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J., Sept. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9 (9), 1871–1874. URL http://www.csie.ntu.edu.tw/~cjlin/liblinear

Gauvain, J.-L., Messaoudi, A., Schwenk, H., Oct. 2004. Language recognition using phone lattices. In: Proc. Interspeech. Jeju Island, pp. 25–28.

Hazen, T. J., Zue, V. W., Sept. 1993. Automatic language identification using a segment-based approach. In: Proc. Eurospeech. Vol. 2. Berlin, pp. 1303–1306.

Hofmann, T., Scholkopf, B., Smola, A. J., Mar. 2008. Kernel methods in machine learning. The Annals of Statistics 36 (3), 1171–1220.

Kim, S., Yoon, J., Yang, J., Park, S., Feb. 2010. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics 11 (2).

Kruengkrai, C., Srichaivattana, P., Sornlertlamvanich, V., Isahara, H., Oct. 2005. Language identification based on string kernels. In: Proc. ISCIT. Beijing, pp. 896–899.

Table 6: Comparison of lattice-based PR-VSM and truncated GWSK (λ = 0.2, T = 7), LRE09.

                        30-s            10-s            3-s
method     frontend     EER    Cavg     EER    Cavg     EER     Cavg
PR-VSM     HU           2.10   1.95     7.35   7.29     22.52   22.42
(n = 3)    RU           1.88   1.70     5.53   5.41     19.38   19.57
           CZ           2.99   2.85     9.52   9.51     26.77   26.81
           HU+RU+CZ     1.32   1.26     3.59   3.44     15.98   15.51
GWSK       HU           1.92   1.77     6.60   6.53     21.73   21.74
(n = 3)    RU           1.68   1.52     5.20   5.11     19.43   18.91
           CZ           2.57   2.51     9.21   9.16     25.56   25.49
           HU+RU+CZ     1.20   1.14     3.40   3.35     14.35   14.20

Lerma, M. A., Jun. 2008. Sequences and strings. dm-sequences.pdf. URL http://www.math.northwestern.edu/~mlerma/courses/cs310-04w/notes/

Li, H., Ma, B., Lee, C.-H., Jan. 2007. A vector space modeling approach to spoken language identification. IEEE Trans. Audio, Speech, Lang. Process. 15 (1), 271–284.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C., Feb. 2002. Text classification using string kernels. J. Mach. Learn. Res. 2 (2), 419–444.

Ma, B., Tong, R., Li, H., Sept. 2007. Spoken language recognition using ensemble classifiers. IEEE Trans. Audio, Speech, Lang. Process. 15 (7), 2053–2062.

Matejka, P., Schwarz, P., Cernocky, J., Chytil, P., Sept. 2005. Phonotactic language identification using high quality phoneme recognition. In: Proc. Eurospeech. Lisbon, pp. 2237–2240. URL http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context


Muthusamy, Y. K., Barnard, E., Cole, R. A., Oct. 1994. Reviewing automatic language identification. IEEE Signal Process. Mag. 11 (4), 33–41.

Navratil, J., Sept. 2001. Spoken language recognition - A step toward multilinguality in speech processing. IEEE Trans. Speech Audio Process. 9 (6), 678–685.

Navratil, J., Zuhlke, W., Apr. 1997. Double bigram-decoding in phonotactic language identification. In: Proc. ICASSP. Vol. 2. Honolulu, pp. 1115–1118.

NIST, Apr. 2009. The 2009 NIST language recognition evaluation plan. LRE09 EvalPlan v6.pdf. URL http://www.itl.nist.gov/iad/mig/tests/lang/2009/

Penagarikano, M., Varona, A., Rodriguez-Fuentes, L. J., Bordel, G., Nov. 2011. Improved modeling of cross-decoder phone co-occurrences in SVM-based phonotactic language recognition. IEEE Trans. Audio, Speech, Lang. Process. 19 (8), 2348–2363.

Rousu, J., Shawe-Taylor, J., Sept. 2005. Efficient computation of gapped substring kernels on large alphabets. J. Mach. Learn. Res. 6 (9), 1323–1344.

Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Stolcke, A., Sept. 2002. SRILM - An extensible language modeling toolkit. In: Proc. ICSLP. Vol. 2. Denver, pp. 901–904. URL http://www.speech.sri.com/projects/srilm/

Tong, R., Ma, B., Li, H., Chng, E. S., Sept. 2009. A target-oriented phonotactic front-end for spoken language recognition. IEEE Trans. Audio, Speech, Lang. Process. 17 (7), 1335–1347.

Torres-Carrasquillo, P., Singer, E., Campbell, W. M., et al., Sept. 2008. The MITLL NIST LRE 2007 language recognition system. In: Proc. Interspeech. Brisbane, pp. 719–722.

Torres-Carrasquillo, P. A., 2002. Language identification using Gaussian mixture models. Ph.D. thesis, Michigan State University.

Vapnik, V. N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Yin, C., Tian, S., Mu, S., Shao, C., Jan. 2008. Efficient computations of gapped string kernels based on suffix kernel. Neurocomputing 71 (4-6), 944–962.

Zhang, W., Li, B., Qu, D., Wang, B., Nov. 2006. Automatic language identification using support vector machines. In: Proc. ICSP. Vol. 1. Guilin.

Zhang, W.-Q., Hou, T., Liu, J., Jan. 2010. Discriminative score fusion for language identification. Chin. J. Electron. 19 (1), 124–128.

Zissman, M. A., Berkling, K. M., Aug. 2001. Automatic language identification. Speech Commun. 35 (1-2), 115–124.

Zissman, M. A., Singer, E., Apr. 1994. Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling. In: Proc. ICASSP. Vol. 1. Adelaide, pp. 305–308.

