
Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR Tara N. Sainath, Member, IEEE, Bhuvana Ramabhadran, Senior Member, IEEE, Michael Picheny, Fellow, IEEE, David Nahamoo, Fellow, IEEE, and Dimitri Kanevsky, Senior Member, IEEE

Abstract—The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNN) and sparse representations (SRs), in speech recognition has thus far been limited. Exemplar-based techniques utilize information about individual training examples and are computationally expensive, making it particularly difficult to investigate these methods on large vocabulary continuous speech recognition (LVCSR) tasks. While research in LVCSR provides a good testbed to tackle real-world speech recognition problems, research in this area suffers from two main drawbacks. First, the overall complexity of an LVCSR system makes error analysis quite difficult. Second, exploring new research ideas on LVCSR tasks involves training and testing state-of-the-art LVCSR systems, which can entail a long turnaround time. This makes a small vocabulary task such as TIMIT more appealing. TIMIT provides a phonetically rich and hand-labeled corpus that allows easy insight into new algorithms. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we combine the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them to TIMIT to establish a new baseline. We then utilize these existing LVCSR techniques in creating a novel set of exemplar-based sparse representation (SR) features. Using these existing LVCSR techniques, we achieve a phonetic error rate (PER) of 19.4% on the TIMIT task. The additional use of SR features reduces the PER to 18.6%. We then explore applying the SR features to a large vocabulary Broadcast News task, where we achieve a 0.3% absolute reduction in word error rate (WER).

Index Terms—Speech Recognition, Sparse Representations, Exemplar-based techniques

I. INTRODUCTION

Gaussian Mixture Models (GMMs) continue to be extremely popular for recognition-type problems in speech. While GMMs allow for fast model training and scoring, training samples are pooled together for parameter estimation, resulting in a loss of the information that exists within the individual training samples. Alternatively, exemplar-based techniques, including k-nearest neighbors (kNNs), support vector machines (SVMs) and sparse representation (SR) methods (i.e., [29], [45]), utilize information about actual training examples. Since the number of training examples in speech tasks can be very large, exemplar-based techniques typically utilize a smaller number of these training examples to characterize a test vector. In our SR formulation, we are given a test vector y and a set of exemplars hi from a training set, which we put into a

dictionary H = [h1, h2, . . . , hn]. Test vector y is represented as a linear combination of training examples by solving y = Hβ subject to a sparsity constraint on β. The feature Hβ represents a reconstruction of the input y closer toward the training features H. Exemplar-based methods have been shown to offer improvements in accuracy over GMMs for a variety of classification tasks. For example, on a phone classification task, it was found that SVMs and SRs provided a 1% absolute improvement over a GMM [29]. In addition, in [33], SVMs were shown to offer over a 2% absolute improvement in accuracy over a GMM for an audio classification task. However, applying exemplar-based techniques to recognition applications has achieved limited success (i.e., [4], [7], [44]). One reason for this limited success is that characterizing a test sample by searching over a large amount of training data is more computationally expensive than evaluating a set of Gaussian mixtures. Second, recognition-type problems require estimating class probabilities that can be compared across frames, which GMMs can easily do. However, many exemplar-based methods utilize the decision scores from the exemplar-based classifiers themselves to generate probabilities, and thus the probabilities are generated in an ad-hoc manner and cannot easily be compared across frames [4].1
1 It is important to note that both Neural Nets (NNs) and Conditional Random Fields (CRFs) are also model-based approaches which often perform better than GMMs/HMMs for classification and recognition tasks. However, these approaches address other problems with GMMs rather than the non-exemplar-based issue. For example, CRFs lend themselves to better optimization techniques and can allow for multiple feature streams [49]. Secondly, NNs address the frame independence assumption of HMMs [2].
Investigating exemplar-based techniques for LVCSR systems poses an even larger challenge. While research in LVCSR has sparked the development of many state of the art research ideas, research in this area suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, gaining insight into further improvements based on error analysis is very difficult. Second, model training typically requires many hours compared with small vocabulary tasks, providing challenges for testing new ideas. In this paper, we introduce a methodology that allows us to explore exemplar-based techniques for LVCSR systems, while addressing the problems of training time and error analysis in LVCSR systems. Specifically, we explore applying exemplar-based SR features on top of a state of the art LVCSR system on a small vocabulary corpus, as a precursor to applying this technique to large corpora. Typically, a corpus is categorized as "small vocabulary" if it has a small dictionary size. This is


often correlated with the corpus also having a small database size, as the vocabulary and number of training exemplars are limited. In this work, we will refer to a "small vocabulary" corpus as one that has a small database size, taking note that there is still a strong correlation to small dictionary size. Our small vocabulary experiments are conducted on TIMIT [20], a continuous speech recognition corpus recorded and transcribed by Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), respectively. Our motivation for conducting experiments on TIMIT is threefold. First, it provides a fair benchmark for comparing the performance of our LVCSR recipe to other state of the art results on this phonetic recognition task. Second, having time-aligned phonetic transcriptions allows for a detailed error analysis when exploring new research ideas such as exemplar-based methods. This is an added benefit of TIMIT over other small-vocabulary tasks such as Resource Management [27] and Switchboard [14], which do not have hand-aligned transcriptions available for the full corpora. Third, because TIMIT is a small-vocabulary task, it allows new research ideas to be quickly tested. One of the drawbacks of using a small-vocabulary task is that many ideas which have shown good gains on small vocabulary tasks, for example in duration modeling, do not necessarily translate to improvements in LVCSR tasks [25]. Often, this is the result of comparisons with "easy to beat" baselines. To address this, we investigate using SR features on top of an algorithmically complete, state of the art LVCSR system. Most high performing LVCSR systems (i.e., [9], [22], [23], [42]), including our IBM Recognizer [38], utilize a complex "recipe" during acoustic model training. First, a set of speaker independent (SI) models is built. Next, a set of speaker adapted (SA) models is built for each speaker or speaker cluster. Multiple discriminative training steps are employed to produce a set of discriminative features and models for further error rate reduction. Lastly, additional stages of speaker adaptation may be added. This recipe has shown considerable gains on a variety of large vocabulary tasks such as conversational speech [37] and broadcast news [39]. We hypothesize that if we use this typical LVCSR recipe and are able to achieve improvements on a small vocabulary task with SR features, then similar improvements are likely to be achieved for large vocabulary tasks. To test this hypothesis, we investigate creating a testbed by first applying the LVCSR recipe to TIMIT. We then utilize these existing LVCSR techniques to create a set of SR features. On the TIMIT corpus, we show that applying the SR features on top of our best speaker-adapted, discriminatively trained system (i.e., PER of 19.4%) allows for a 0.8% absolute reduction in PER. Furthermore, on a large vocabulary 50-hour Broadcast News task [18], we achieve a reduction in word error rate (WER) of 0.3% absolute using the SR technique, demonstrating the benefit of using TIMIT to quickly test new research ideas. The rest of this paper is organized as follows. Section II provides a discussion of related work in exemplar-based methods and sparse representation techniques in speech recognition. Section III discusses the creation of SR features. In Section IV,

an overview of the IBM LVCSR system used for experiments in this paper is provided. Section V outlines the experiments performed for small and large vocabulary. Section VI analyzes the results using the LVCSR recipe on TIMIT and Broadcast News, while the results using SRs on both corpora are presented in Section VII. Finally, Section VIII summarizes the main contributions of the paper and discusses future work.

II. RELATED WORK

A. Exemplar-Based Techniques

While many exemplar-based techniques (i.e., SRs, SVMs, kNNs) have shown improvements in accuracy over GMMs for phoneme classification tasks [29], typically the gains in classification observed using these techniques are not as prevalent in speech recognition tasks ([4], [7], [44]). In phoneme classification, the segments associated with each class are known ahead of time, and thus decision scores can directly be calculated for each segment using exemplar-based techniques. In speech recognition, class boundaries are not known beforehand, and thus must be determined via a dynamic programming approach (e.g., HMMs). This requires estimating class probabilities that can be compared across frames. Most exemplar-based methods try to utilize the decision scores from the exemplar-based classifiers themselves to generate probabilities. Thus the probabilities are generated in an ad-hoc manner and cannot easily be compared across frames. For example, [4] explores the use of exemplar-based nearest neighbors for speech recognition. Specifically, given an input feature y and a specific HMM state s, the k closest exemplars from training which belong to state s are found. The Euclidean distance from y to these neighbors is used as the new output probability for this HMM state, rather than defining the output probability with a GMM as is typically done. On both small and large vocabulary tasks, the authors show that the nearest neighbor method is only able to offer improvement over the GMM-based approach when less than 3 hours of training data is used. Another exemplar-based technique involving speech templates was investigated in [44]. In this paper, a set of templates, representing acoustic knowledge of the training set, is created. Then given a test signal, a dynamic time warping (DTW) method is employed to match the test signal to the best set of templates from the training set. While the DTW system is not able to outperform a typical HMM, the DTW system produces complementary errors to the HMM system. Thus, improvements on the Resource Management task are achieved when combining the DTW and HMM systems. Finally, SVM methods have been used to estimate class probabilities by mapping SVM decision scores into probabilities through a sigmoid function [7]. This approach suffers from over-fitting the probabilities to the sigmoid function, again making it difficult to compare probabilities across frames. Thus, to date SVMs have not successfully been used to create new HMM state output probabilities. Instead, SVMs have shown benefits when used in tandem with an HMM system, on both small and large vocabulary tasks.


B. Sparse Representation Methods

In addition to the use of sparse representations (SRs) as an exemplar-based methodology, as presented in this work, there has been a variety of other applications of SRs for speech recognition. [36] looks at representing a spectro-temporal pattern of speech as a linear combination of an overcomplete set of bases such that the weights β of the linear combination are sparse. Instead of seeding the dictionary (i.e. basis) H using exemplars, one fixed dictionary is learned, similar to traditional work in compressive sensing. The authors pass the set of weights β as a new set of features into a multi-layer perceptron (MLP), which is used to estimate posterior probabilities. Phoneme recognition experiments on TIMIT show that the proposed features outperform conventional Perceptual Linear Prediction (PLP) features in both clean and noisy conditions. In addition, [11] uses SRs to model noisy speech as a linear combination of speech and noise exemplars, again such that the weights β in the linear combination are sparse. The weights on the clean speech exemplars are used to derive a set of state-based posterior features, which are used as HMM output probabilities. The authors show benefits of this approach on the noisy Aurora-2 digits task. Finally, [10] explores SRs as a missing data technique (MDT). MDTs are used to estimate clean speech features from noisy environments by finding reliable information in the noisy speech signal. Decoding is then performed using both reliable and unreliable information, where unreliable parts of the signal are reconstructed via SRs. Using SRs as an MDT, the authors show improvements in WER on a noisy LVCSR task.

III. SPARSE REPRESENTATION FEATURES

One benefit of our SR method over the other exemplar-based techniques discussed in Section II-A is that it allows new features to be created which take advantage of the classification accuracy of exemplar-based methods, while still utilizing the ability of HMMs to efficiently compare scores across frames. In this section, we review the use of exemplar-based SRs for classification and then discuss the creation of exemplar-based SR features for recognition.

A. Classification Using Sparse Representations

1) Classification Overview: The goal of classification is to use training data from k different classes to determine the best class to assign to a test vector y. To formulate the use of SRs for classification, first consider taking all ni training examples from class i and concatenating them into a matrix Hi as columns, in other words Hi = [xi,1, xi,2, . . . , xi,ni] ∈ ℜ^(m×ni), where x ∈ ℜ^m represents a feature vector from the training set of class i with dimension m. Given sufficient training examples from class i such that Hi is an overcomplete dictionary where m < ni, [45] shows that a test sample y ∈ ℜ^m from the same class can be represented as a linear combination of the entries in Hi weighted by β, as given by Equation 1:

y = βi,1 xi,1 + βi,2 xi,2 + . . . + βi,ni xi,ni    (1)

However, since the class membership of y is unknown, we define a matrix H to include training examples from all k classes in the training set. In other words, the columns of H are defined as H = [H1, H2, . . . , Hk] = [x1,1, x1,2, . . . , xk,nk] ∈ ℜ^(m×N). Here m is the dimension of each feature vector x and N is the total number of training examples from all classes. H can be thought of as an overcomplete dictionary where the dimension of the feature vectors in H (i.e. m) is much less than the number of training examples N. We can then write test vector y as a linear combination of all training examples, in other words y = Hβ. Since N > m, and assuming H is full rank, y can always be represented as a linear combination of elements in H. However, if sparsity is enforced on β such that only a few elements in H are chosen, then ideally the optimal β should be non-zero only for the elements in H which belong to the same class as y [45]. This motivates us to solve for β using an SR technique (i.e. [1], [41], [47]), which solves y = Hβ subject to a sparsity constraint on β. Test vector y will then be represented as a linear combination of the training examples in H, with the training examples which belong to the same class as y having non-zero β coefficients.

2) Classification Rule: After solving y = Hβ, we must assign y to a specific class. There are a variety of different classification rules that could be used [30], which we highlight in more detail in Appendix IX-A. For example, we can assign y to the class which has the largest support in β, as given by Equation 3 in Appendix IX-A. Alternatively, we could compute the l2 norm of all β entries within a specific class, and choose the class with the largest l2 norm support, as shown in Equation 4 in Appendix IX-A. In this paper, since we are using SRs to create a new "spectral feature", we define the classification decision rule in the "spectral" domain. More specifically, we explore how well y assigns itself to different classes in H by looking at the residual error between y and the Hβ entries corresponding to a specific class [45]. Ideally, all non-zero entries of β should correspond to the entries in H with the same class as y, and the residual error will be smallest within this class. More specifically, let us define a selector δi(β) ∈ ℜ^N as a vector whose entries are zero except for the entries of β corresponding to class i. We then compute the residual error for class i as ∥ y − Hδi(β) ∥2. The best class for y will be the class with the smallest residual error. Mathematically, the best class i∗ is defined as

i∗ = arg min_i ∥ y − Hδi(β) ∥2    (2)
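For concreteness, the minimum-residual decision rule above can be sketched in a few lines of code. The sketch below is only an illustration under our own assumptions: scikit-learn's Lasso stands in for the ABCS solver used in this paper, the regularization value is arbitrary, and the function and variable names are ours.

import numpy as np
from sklearn.linear_model import Lasso

def sr_classify(y, H, class_ids, alpha=0.01):
    """Classify test vector y by the minimum-residual rule of Eq. (2).

    y         : (m,) test feature vector
    H         : (m, N) dictionary whose columns are training exemplars
    class_ids : (N,) class label of each column of H
    """
    # Solve y = H beta with an l1 sparsity penalty (stand-in for ABCS).
    beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(H, y).coef_

    best_class, best_residual = None, np.inf
    for c in np.unique(class_ids):
        # delta_c(beta): keep only the beta entries belonging to class c.
        beta_c = np.where(class_ids == c, beta, 0.0)
        residual = np.linalg.norm(y - H @ beta_c)
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class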

3) The Importance of Sparsity: The classification problem y = Hβ can also be solved enforcing no sparsity on β, a technique which is known as Ridge Regression (RR) [43]. Both the RR and SR methods use multiple training samples in each class to linearly represent the test sample. However, SR uses a small number of training samples in H compared to RR to avoid over-fitting. Thus, as [45] shows, the SR classifier can better adapt to the actual distributions of the training samples of each class (i.e., if these distributions are non-linear or multimodal) and is also more robust to outliers. We will motivate the difference between the RR and SR methods further with the following example. Let us consider a


2×7 matrix H = [h1, h2, h3, h4, h5, h6, h7], as shown in Table I. In this table, the first three columns h1, h2, h3 are "training" frames that belong to a class C1, and the last four columns are "training" frames that belong to C2. Also, assume that vector y = [0.29; 0.29] is "test" data belonging to class C1.

TABLE I
ENTRIES OF H MATRIX AND CORRESPONDING CLASS LABELS

            h1     h2     h3     h4     h5     h6     h7
H       =   0.2    0.1    0.4    0.3   -0.6    0.6   -0.6
            0.2    0.3    0.35   0.3    0.1    0.3    0.4
class   =   C1     C1     C1     C2     C2     C2     C2

Figure 1 shows a plot of the entries in H from Table I belonging to the two classes, and the test vector y.

Fig. 1. Plot of entries in H and y

Figure 2 shows the β coefficients for the RR and SR techniques. The RR method uses all of the training examples in H to characterize y, and will include the outlier points of C2. Thus, using the classification decision rule given in Equation 2, the RR method chooses class C2 as the best class. However, using an SR method produces a β vector with the support located at the third entry in H. In this case, C1 is identified as the correct class. It is important to point out that when using the β vector to make a classification decision (outlined in Appendix IX-A), either by using the maximum support rule given by Equation 3 or the maximum l2 support rule given by Equation 4, the SR method still identifies C1 and the RR method chooses C2. Thus, by using a subset of examples in H, the classification decisions for SR and RR can be vastly different, particularly in the case of outliers.

Fig. 2. Plot of β coefficients for Ridge Regression and Sparse Regression Methods

Furthermore, we have observed the benefit of SRs vs. RRs in a practical application, namely the TIMIT phone classification task. As described in [29], in this classification experiment, given a test phone y and a set of training exemplars in H, the SR and RR methods are used to solve y = Hβ. A classification decision is then made using the l2 norm of β, as described in Equation 4. This experiment was conducted for roughly 7,000 test phones in the TIMIT testcore set. Figure 3 shows the accuracy for the RR and SR techniques as the number of non-zero β coefficients (i.e. sparsity level) used to solve y = Hβ is varied. In this experiment, since the dimension of feature vector y was 40, the maximum number of sparse coefficients to obtain a unique solution to y = Hβ was 40. Note again that at the optimum sparsity level2, the SR method offers improvements in accuracy over the RR method. This practical example strengthens our motivation for enforcing sparsity for classification.
2 Note that the optimum sparsity level will differ for various problems and needs to be tuned accordingly.

Fig. 3. Phonetic Accuracy on TIMIT for Ridge Regression and Sparse Regression Methods with Varied Number of Sparse Coefficients
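For readers who wish to reproduce the toy comparison, the sketch below runs both solvers on the Table I data and applies the residual rule of Equation 2. Ridge and Lasso from scikit-learn stand in for the RR and SR (ABCS) solvers used in the paper, and the regularization strengths are our own illustrative choices, so the exact β values may differ from those plotted in Figure 2.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Table I dictionary: columns h1..h7 (rows are the two feature dimensions).
H = np.array([[0.2, 0.1, 0.4,  0.3, -0.6, 0.6, -0.6],
              [0.2, 0.3, 0.35, 0.3,  0.1, 0.3,  0.4]])
classes = np.array([1, 1, 1, 2, 2, 2, 2])   # C1 for h1-h3, C2 for h4-h7
y = np.array([0.29, 0.29])                  # "test" point from class C1

def residual_rule(beta):
    """Pick the class whose exemplars best reconstruct y (Eq. 2)."""
    res = {c: np.linalg.norm(y - H @ np.where(classes == c, beta, 0.0))
           for c in (1, 2)}
    return min(res, key=res.get), res

# Ridge regression: all seven exemplars receive non-zero weight.
beta_rr = Ridge(alpha=0.1, fit_intercept=False).fit(H, y).coef_
# Sparse solve (Lasso as a stand-in for ABCS): only a few exemplars survive.
beta_sr = Lasso(alpha=0.01, fit_intercept=False).fit(H, y).coef_

print("RR beta:", np.round(beta_rr, 3), "->", residual_rule(beta_rr)[0])
print("SR beta:", np.round(beta_sr, 3), "->", residual_rule(beta_sr)[0])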

4) Sparse Representation Methods: Various SR methods can be used to solve for β. In this work, we assume that the dictionary H is seeded with training examples. Since the frames in H may be correlated, this suggests that methods which use a combination of an l1 and l2 constraint, such as Elastic Net (EN) [47] and Approximate Bayesian Compressive


Sensing (ABCS) [29], are preferred over methods which just enforce an l1 constraint, such as Lasso [41]. As [47] discusses, if there is high correlation among the elements in H, l1 techniques typically select only one variable from the group and do not care which one is selected. This is in contrast to EN and ABCS, which encourage multiple examples from the same group to be chosen. Furthermore, a phone classification task on TIMIT in [16], where H was seeded using training examples nearest to y, also illustrated that EN and ABCS offered promising results compared to Lasso. In this paper, we solve for β using the ABCS method, which we first explored in [29] for phonetic classification.
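As an aside, a minimal sketch of such a combined l1/l2 solve is shown below, using scikit-learn's ElasticNet as a stand-in for ABCS (which is not publicly packaged); the regularization values are illustrative assumptions, not the settings used in this paper.

import numpy as np
from sklearn.linear_model import ElasticNet

def solve_beta(H, y, alpha=0.01, l1_ratio=0.5):
    """Solve y = H beta with a mixed l1/l2 penalty.

    H : (m, n) dictionary of exemplars (columns); y : (m,) test frame.
    The l2 part of the penalty lets groups of correlated exemplars receive
    weight together, unlike a pure-l1 (Lasso) solve.
    """
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=5000)
    return model.fit(H, y).coef_   # sparse beta with grouped support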

B. Sparse Representation Features For Recognition

1) Motivation: As discussed in Section III-A, SRs are used to solve y = Hβ, where Hβ represents a reconstruction of the input y. In this section, we motivate the use of Hβ as a feature for recognition. When finding a solution to y = Hβ, since H is an overcomplete dictionary, there are many possible solutions for β. However, as explained in Section III-A, a sparsity constraint is enforced on β such that only a few examples in H are used to explain y, and the sparsity level is chosen so that classification accuracy is maximized, not necessarily so that the reconstruction error between y and Hβ is minimized. For example, Figure 4 shows the residual mean-squared error (MSE) between y and Hβ as a function of sparsity. Naturally, as the sparsity increases, the MSE between y and Hβ decreases, as shown in the figure. However, looking at Figure 3, the highest classification accuracy occurs at a sparsity of 30, when the in-class residual error given by Equation 2 is minimized, not when the total residual error is minimized. The Hβ features explored in this work are created to maximize per-frame classification accuracy. Thus, the Hβ features represent a reconstruction of the input y closer toward the training features, and also toward the correct class.

Fig. 4. Sparsity vs. Residual Error between y and Hβ

2) Creation of Features: The creation of Hβ features for recognition is done as follows. First, a speech signal is defined by a series of feature vectors through time, Y = {y^1, y^2, . . . , y^P}, for example Mel-Scale Frequency Cepstral Coefficients (MFCCs). For every test sample y^n ∈ Y, we solve y^n = H^n β^n to compute a β^n.3 Then given this β^n, a corresponding H^n β^n vector is formed. Thus a series of Hβ vectors is created, one at each frame, as {H^1 β^1, H^2 β^2, . . . , H^P β^P}. The sparse representation features are created for both training and test. An HMM is then trained on this new set of features and recognition is performed in this new feature space.
3 Note that for every test frame n, the dictionary H^n is chosen specific to that frame. This will be discussed in more detail in Section III-C.
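A minimal sketch of this feature-creation loop is given below. The dictionary-selection step is left abstract (a hypothetical select_dictionary callback standing in for the strategies described in Section III-C), and a Lasso solve again stands in for ABCS.

import numpy as np
from sklearn.linear_model import Lasso

def hbeta_features(Y, select_dictionary, alpha=0.01):
    """Turn a sequence of frames Y (P x m) into H*beta features (P x m).

    select_dictionary(y) must return an (m, n) exemplar dictionary H^n
    chosen for that frame, as described in Section III-C.
    """
    features = []
    for y in Y:
        H = select_dictionary(y)                      # frame-specific dictionary
        beta = Lasso(alpha=alpha, fit_intercept=False,
                     max_iter=5000).fit(H, y).coef_   # sparse solve (ABCS stand-in)
        features.append(H @ beta)                     # reconstruct y toward training data
    return np.vstack(features)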

C. Selection of H

Success of the sparse representation features depends heavily on a good choice of H. Pooling together all training data from all classes into H will make the number of columns in H large (roughly 2 million frames in TIMIT, 18 million frames on a 50-hour Broadcast News task [18]), and will make solving for β intractable. Therefore, in this section we discuss various methodologies to select H from a large sample set.
1) Seeding H from Nearest Neighbors: For each y, we find a neighborhood of closest points to y from all examples in the training set using a kd-tree, similar to the method used in [29]. These k neighbors become the entries of H. k is chosen to maximize classification accuracy on a held-out development set. In [29], it was found that an overcomplete dictionary H with 200 nearest neighbors offered optimal classification performance. A set of Hβ features is created for both training and test, but H is always seeded using data from the training set. To avoid overtraining of the Hβ features on the training set, when creating Hβ features on training data we require that the samples selected from training come from a different speaker than the speaker corresponding to frame y. While this approach works well on small-vocabulary tasks, using a kd-tree based kNN approach with large amounts of training data can be computationally expensive. To address this, we discuss other choices for seeding H below, tailored to large vocabulary applications.
2) Using a Trigram Language Model: In speech recognition, when an utterance is scored against a set of HMM states (which have output distributions given by GMMs), typically evaluating only a small subset of these states at a given frame allows for a large improvement in speed without a reduction in accuracy [34]. Using this fact, we explore using training data belonging to a small subset of HMM states to seed H. To determine these states at each frame, we decode the data using a trigram language model (LM), and find the best aligned state at each frame. For each state and corresponding GMM, we compute the 4 other closest GMMs to this GMM. Here closeness is defined by finding GMM pairs which have the smallest Euclidean distance between their centroids.4 After we find the top 5 GMMs (and corresponding states) at a specific frame, we seed H with the training data aligning to these top 5 states.
4 We have also explored other methods to define closeness between GMMs which incorporate both their mean and variance [31], but have observed no difference in performance between this and using just the centroid of the means.


Since this still typically amounts to thousands of training samples in H, we must sample this further. Our method for sampling is discussed in Section III-D. We also explored seeding H with the top 10 states rather than the top 5, which we will discuss further in Section VII-D1.
3) Using a Unigram Language Model: One problem with using a trigram LM is that this decode is essentially the baseline system we are trying to improve upon. Therefore, seeding H with frames related to the top aligned HMM state is essentially projecting y back down to the same state which initially identified it. Unigram LMs have often been used in speech recognition to increase variability of recognition hypotheses (and thus aligned Gaussians) relative to a trigram LM, most notably in lattice generation for discriminative training [26]. Taking a similar approach, to increase variability between the states used to seed H and the best aligned state from the trigram LM decode, we explore using a unigram LM to find the best aligned state at each frame. Again, given the best aligned state and corresponding GMM, the 4 closest GMMs to the best GMM are found, and sampled data from these 5 states is used to seed H, as will be discussed in Section III-D.
4) Using no Language Model Information: To further weaken the effect of the LM, we explore seeding H using only acoustic information. Namely, at each frame we find the top 5 scoring HMM states. H is seeded with training data aligning to these states.
5) Enforcing Unique Phonemes: Another problem with seeding H by finding the 5 closest states relative to the best aligned state is that all of these states could come from the same phoneme (i.e. phoneme "AA"). Therefore, we explore finding the 5 closest states relative to the best aligned state such that the phoneme identities of these states are unique (i.e. "AA", "AE", "AW", etc.). H is then seeded from frames aligning to these 5 states.
6) Using Gaussian Means: The above approaches to seeding H use actual examples from the training set, which is computationally expensive. To address this, we investigate seeding H from Gaussian means. Namely, at each frame we use a trigram LM to find the best aligned state. Then we find the 24 closest states to this top state, and use the means of the individual GMMs corresponding to these states to seed H.

D. Choice of Sampling

If we seed H using all training data belonging to a specific HMM state, this amounts to thousands of training examples in H. In this section, we explore two different approaches to sampling a subset of this data to seed H. Both sampling methodologies can be applied to any of the H selection techniques described in Section III-C.
1) Random Sampling: For each HMM state we want to select training data from, we explore randomly sampling N training examples from the total set of training frames that aligned to this state. This process is repeated for each of the closest 5 states. We reduce the size of N as the "closeness"

decreases. For example, for the closest 5 states, the number of data points N chosen from each of the 5 states is 200, 100, 100, 50 and 50 data points respectively. It should be noted that in [31], different methods of seeding H from random sampling were explored, with all methods yielding similar accuracy. We use this fact to conclude that different trials of randomly sampling H will not change recognition performance.
2) Sampling Based on Cosine Similarity: While random sampling offers a relatively quick approach to select a subset of training examples, it does not guarantee that we select "good examples" from this state which actually are close to frame y. Alternatively, we explore splitting the training points aligning to the HMM state as being 1σ, 2σ, etc. away from the mean of the individual Gaussians which comprise the GMM of the state. Assume we want to sample N points from a specific state. Then we take wi × N points from each Gaussian i of the GMM, where wi is the weight of the Gaussian. Here wi × N is chosen to be the total number of training points aligned to this Gaussian, divided by the number of samples N we want to sample from this Gaussian. Then within each σ set, we find the training point which has the closest cosine similarity to the test point y. This is repeated for all 1σ, 2σ, etc. values. Again, the number of samples taken from each state reduces as "closeness" decreases.

E. Computational Complexity

In this section, we analyze in more detail the computational complexity of the SR method, both mathematically and empirically.
1) Mathematical Complexity: The creation of Hβ features first involves finding H for every frame, and then solving y = Hβ. Below, we break down the computational complexity of both steps. We assume that H ∈ ℜ^(m×n), where m is the dimension of the feature vector and n is the number of training exemplars in H. As an example, we provide the computational complexity when H is seeded using knowledge of the top aligned Gaussian at each frame. The steps in the selection of H and the corresponding computational complexity can be summarized as follows:
1) Given the top HMM state at a specific frame, finding the top N closest states to it. This can be computed off-line, and is therefore just a table lookup of O(1).
2) Finding n random samples from training to seed H. This involves selecting n samples without replacement from a set of P total examples, which is roughly O(P log P). To get a sense for the size of P, assume that a 50-hour Broadcast News task has 3,000 states and 18 million training frames which are equally distributed between the 3,000 states. If we select the top N states, this gives roughly P = 6,000N samples to select from. Assuming that we select data from N = 5 states, this is roughly O(P log P) ≈ O(309,269).
3) Seeding H with these actual samples. Since H is m×n, this is computation O(mn). Typically in LVCSR we choose m = 40 and n = 700, which is roughly O(28,000).


Therefore, the total computational complexity of seeding H is dominated by finding random samples from training to seed H, which is O(P log P). Second, after H is seeded, we must solve y = Hβ, for which we use the ABCS algorithm. As detailed in Appendix IX-B, ABCS has computational complexity O(dn^3), where d is the number of iterations of the algorithm. Typically ABCS is run for between 5-10 iterations. If we choose n = 750, the total computational complexity of our SR method is dominated by ABCS, which is O(dn^3) per frame.
2) Empirical Complexity: To get a sense for the empirical complexity of the SR method, we compute the average time per frame to seed H and to compute β via ABCS on the BN task [18]; the breakdown is given in Table II. The overall time is also provided. Notice that the overall time is dominated by solving for β via ABCS. We are currently exploring methods to speed up the creation of SR features [31], to make them tractable to create for large vocabulary tasks involving thousands of hours of data.

TABLE II
BREAKDOWN OF COMPUTATION TIMES

SR Step                  Time (seconds)
Seeding H                0.23
Solving for β, ABCS      0.63
Overall Per Frame        0.86
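Before moving on to the recognizer architecture, the sketch below ties together the dictionary selection of Section III-C2 and the random sampling of Section III-D1 for a single frame. The container names (closest_states, state_to_frames), the function signature, and the default counts are our own illustrative assumptions; only the overall procedure and the 200/100/100/50/50 split follow the description above.

import numpy as np

def seed_dictionary(best_state, closest_states, state_to_frames,
                    counts=(200, 100, 100, 50, 50), rng=None):
    """Build H for one frame (Sections III-C2 and III-D1).

    best_state      : top aligned HMM state from the LM decode
    closest_states  : dict mapping a state to its 4 nearest states
                      (precomputed from GMM centroid distances)
    state_to_frames : dict mapping a state to an (n_i, m) array of
                      training frames aligned to that state
    counts          : frames drawn from each of the 5 states, decreasing
                      with "closeness"
    """
    rng = rng or np.random.default_rng()
    states = [best_state] + list(closest_states[best_state])   # 5 states total
    columns = []
    for state, n_samples in zip(states, counts):
        frames = state_to_frames[state]
        idx = rng.choice(len(frames), size=min(n_samples, len(frames)),
                         replace=False)                         # random sampling
        columns.append(frames[idx])
    return np.concatenate(columns).T      # H is (m, n): exemplars as columns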

IV. BASELINE RECOGNIZER ARCHITECTURE

Now that we have described our SR features, in this section we outline the LVCSR recipe used to create a state of the art baseline system, on top of which our SR features are created. A typical LVCSR system (i.e., [9], [22], [23], [42]), including our recognizer at IBM [38], operates in a series of steps, as indicated in Figure 5. First, feature vectors are extracted from the speech signal. Next, a set of speaker independent (SI) subword unit models is trained. Then, using the set of SI models, a set of speaker adapted (SA) features and models is learned. Finally, feature and model space discriminative training is applied on top of the SA system. Below, each component of the process is described in more detail. In certain components, a distinction is made depending on whether the processing is done for TIMIT or for LVCSR.

A. Front-end Processing

1) Standard LVCSR Processing: A speech utterance is first chunked into 20 ms frames, with a frame shift of 10 ms. Each frame is represented by MFCCs (typically between 12-14 dimensional) or 19-dimensional PLP features. Features are then mean and variance normalized on a per-utterance basis. Then, at each frame, a context of 4 frames to the left and right of the current frame is joined together and a Linear Discriminant Analysis (LDA) transform is applied to project the feature vector down to 40 dimensions. Note that in the IBM system, because a context of frames is used, delta and delta-delta features are not used. It was observed in [8] that there was no difference in performance when using delta-delta vs. LDA features in the IBM recognizer.
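As a concrete illustration of the splicing step just described, the sketch below joins a ±4-frame context around each frame and applies an LDA projection down to 40 dimensions. The projection matrix is assumed to have been estimated on training data elsewhere (its estimation is not shown), and the edge padding is our own choice.

import numpy as np

def splice_and_project(frames, lda_matrix, context=4):
    """Splice +/-`context` frames around each frame and project with LDA.

    frames     : (T, d) per-frame features (e.g. 13-dim MFCCs)
    lda_matrix : (40, d*(2*context+1)) projection estimated on training data
    """
    T, d = frames.shape
    # Pad the edges so every frame has a full context window.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    spliced = np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
    return spliced @ lda_matrix.T        # (T, 40) projected features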

Fig. 5. Block Diagram of Various Stages in the IBM LVCSR System

2) TIMIT Processing: In a phonetic recognition task, finer window granularity is needed when obtaining MFCC/PLP coefficients so as to better distinguish between phonemes. Therefore, a 5-ms frame shift is typically used. With this smaller frame shift, in order to capture the same contextual information as an LVCSR system, we found benefits in using an LDA context of 8 frames rather than 4.

B. Speaker Independent Acoustic Modeling

1) Standard LVCSR Processing: In SI modeling, each subword unit is a phoneme, represented by a 3-state left-to-right HMM with no skip states. The output distribution of each state is modeled by a GMM. First, a set of context-independent (CI) models is trained using information from the transcription. Maximum likelihood (ML) estimation is used to train the parameters of the HMMs. The training of CI models produces a set of state-level alignments of the speech against the corresponding transcripts. The CI models are then used for bootstrapping the training of a set of more complex triphone context-dependent (CD) models, which can capture more acoustic variability. These CD models are also modeled by a 3-state left-to-right HMM with no skip states. As with most speech recognizers, due to data availability issues in modeling all possible CD triphones, a clustering procedure is employed to share data across various CD models. First, a set of untied CD HMMs is estimated for each possible triphone combination. Next, a top-down decision tree is grown for each phone, and states belonging to the same phone are tied together. After the clustering learns a set of CD states, a set of GMMs is trained for each state.
2) TIMIT Processing: In TIMIT, the questions used to generate a top-down decision tree are the same set as those used in [46] for TIMIT. In addition, due to data limitations in a small vocabulary task, we explore using a global decision tree to tie phones together. Specifically, a data-driven set of 13 broad phonetic classes (BPCs) is specified, and phones in the same BPC are grouped together. On TIMIT, we found that tying the middle states of phones within each BPC was more effective than tying the middle states of all phones together.


C. Speaker Adapted Acoustic Modeling

After a set of SI models is designed, they are used to bootstrap the training of a set of SA models. In SA modeling, first vocal tract length normalization (VTLN) is applied, followed by a feature/model space adaptation. Both steps are discussed below in more detail.
1) Vocal Tract Length Normalization: The length of a speaker's vocal tract is often a large factor in speaker variability. VTLN is a popular technique used to reduce this variability. In this procedure, a scaling factor is learned for each speaker that warps the speech from this speaker into a canonical speaker with an average vocal tract length. The warp is applied to the given set of acoustic features before they are LDA transformed. After the warp, features are again spliced together at each frame and an LDA transform is applied to produce a set of 40-dimensional "warped" features.
2) Feature/Model Space Adaptation: After VTLN, the "warped" features are adapted for each speaker using feature space Maximum Likelihood Linear Regression (fMLLR) [6]. Next, using the adapted fMLLR features, the set of CD models is adapted to each speaker using Maximum Likelihood Linear Regression (MLLR) [6]. For MLLR, an eight-level binary regression tree is used, which is built by successively splitting the nodes in a top-down manner using a soft K-means algorithm.

D. Discriminative Training For Generative Models

Discriminatively trained generative acoustic models have been shown to significantly improve error rates compared to ML trained models, as these discriminative models have more power to differentiate between confusable sounds, such as "ma" and "na". In this work, we use a large margin discriminative training approach using the Boosted Maximum Mutual Information (BMMI) criterion [26]. We also explored the MPE criterion, though the objective function appeared sensitive to the phone accuracy counts for phonetic recognition and therefore little gain was found. Similar results were also observed in [5]. We apply discriminative training first to the feature space and then the model space, as past research has indicated that this approach allows for significant improvements in word error rate [26]. First, a set of ML-trained models is used to bootstrap the training of a set of discriminatively trained features. Specifically, these models are used to compute a set of lattices for each utterance, representing possible hypotheses of the utterance. Discriminative training is used to estimate a transform on the feature space which separates the correct hypotheses from competing hypotheses. Then using these new feature space Boosted Maximum Mutual Information (fBMMI) features, a second discriminative step using the BMMI criterion is applied to produce a set of discriminatively trained generative acoustic models. Finally, MLLR transforms are applied to the discriminatively trained models. Note that the discriminative training step can occur after a set of CD ML models are trained (i.e., Section IV-B). However, discriminative training is usually done after feature-space speaker adaptation (SA) (i.e., Section IV-C) for two

reasons. First, the goal of feature-space SA is to map features from different speakers into a canonical speaker space, and if it is performed after discriminative training, then intuitively some discrimination is lost when the SA is performed. Second, experimentally we have observed that doing SA and then discriminative training provides larger gains than doing discriminative training and then feature-space SA [39].

V. EXPERIMENTAL SETUP

A. TIMIT

The small vocabulary experiments in this paper are conducted on TIMIT [20]. It contains over 6,300 phonetically rich, hand-labeled utterances read by 630 speakers. The sentences from the corpus are divided into three sets. The standard NIST training set consists of 3,696 sentences, used to train the various models used by the recognizer. The development set is composed of 400 utterances and is used to tune the various parameters of the LVCSR system. Finally, all phonetic error rates (PERs) are reported on the 192 utterances belonging to the core test set. In accordance with standard experimentation on TIMIT [21], the 61 phonetic labels are collapsed into a set of 48 for acoustic model training. Similar to [32], a set of CI HMMs is trained using information from the phonetic transcription. The output distribution of each CI state is a 32-component diagonal-covariance GMM. The CI models are then used for bootstrapping the training of a set of triphone CD HMMs. Due to the small vocabulary nature of the task, the global-tree clustering algorithm described in Section IV-B is used, which allows for both states and phones to be tied together. In total, the CD system has 2,400 states and 15,000 Gaussian components, which was chosen to optimize performance on the development set. The number of training iterations for each stage of the LVCSR process was also chosen to optimize performance on the development set to avoid overtraining. A trigram language model is used for all experiments. For testing purposes, the standard practice is to collapse the 48 trained labels into a smaller set of 39 labels, which was found in [21] to improve recognition performance. To ignore the glottal stop [q], an approach similar to [2] and [15] is followed. In this approach, models for [q] are built during training. During test, the [q] phone is ignored by mapping it to the silence phone in both the reference and hypothesized transcripts.

B. Broadcast News

The large vocabulary experiments are conducted on an English broadcast news transcription task [18]. The acoustic model is trained on 50 hours of data from the 1996 and 1997 English Broadcast News Speech Corpora. Results are reported on 3 hours of the EARS Dev-04f [18] set for each stage of the LVCSR process. The initial acoustic features are 19-dimensional PLP features. The standard LVCSR recipe outlined in Figure 5 is used to create a final set of fBMMI features and models. Phones are modeled as three-state, left-to-right HMMs with no skip states. States are quinphone context-dependent, except for


silence states, which are context-independent. The final system has roughly 2,200 states and 50,000 Gaussian components. The language model used for decoding is a 54M n-gram, interpolated backoff model trained on a collection of 335M words. [18] describes the sources used for language model training in more detail.

VI. BASELINE RESULTS USING LVCSR RECIPE

In this section, we present our baseline results on TIMIT and Broadcast News for various stages in the LVCSR framework. For TIMIT, we compare results at each stage to other reported results in the literature to further analyze how well-established techniques used in LVCSR systems impact a task such as TIMIT.

A. TIMIT: Speaker Independent System

1) Context Independent System: Since many CD systems are designed by bootstrapping from CI models, we first explore the behavior of our CI models. Table III compares the results of various CI systems reported in the literature. The IBM system has the lowest PER of all systems. We believe one major explanation for the improved performance over other techniques is the use of robust features which are mean and variance normalized and then LDA transformed. Establishing a good baseline at the CI level allows for better baselines at subsequent LVCSR stages as well.

TABLE III
COMPARISON OF CI SYSTEMS ON TIMIT CORE TEST SET

System                              PER (%)
3-state CI HMM [46]                 38.3
CI Segment-Based System [13]        35.9
7-state CI HMM [21]                 35.9
IBM 3-state CI HMM (this paper)     25.2

2) Context Dependent System - Maximum Likelihood Trained: Next, Table IV compares the results of various CD systems reported in the literature to our IBM CD system. For fair comparison, note that none of these systems are discriminatively trained. The IBM CD system offers a PER of 24.5%, which is better than the HMM systems listed in [19], [24] and [46]. In addition, the IBM CD system also offers improvements over the Hidden Trajectory Model (HTM) [3], which computes the total acoustic model score by combining scores from separate HMM and HTM systems. Further improvements in PER were achieved in [15], though a combination of feature sets was used, rather than one feature as is done in the HMM systems.

TABLE IV
COMPARISON OF CD ML TRAINED SYSTEMS ON TIMIT CORE TEST SET

System                                                       PER (%)
Triphone Discrete HMMs [21]                                  33.9
CD Segment-Based Model [13]                                  30.5
Triphone Continuous HMMs [19]                                26.6
Generalized Triphone HMMs [46]                               26.3
Recurrent Neural Nets [28]                                   26.1
Bayesian Triphone [24]                                       25.6
Monophone HTMs [3]                                           24.8
IBM CD HMMs (this paper)                                     24.5
CD Segment-Based Model, Heterogeneous Measurements [15]      24.4

3) Context Dependent System - Discriminatively Trained: Table V compares the results of various discriminatively trained systems on the TIMIT core test set. The feature and model space discriminative training are indicated as fBMMI and BMMI respectively. On TIMIT, we found that fBMMI captured most of the discrimination and model space discriminative training (i.e. BMMI) offered no added benefits. Since it is somewhat difficult to compare error rates for different discriminative training methods, as the baseline ML error rates are different, we have also provided the relative improvement provided by discriminative training over ML for each method. Note that our discriminative training methods provide a large relative improvement in PER over ML trained models. This is the best result of all discriminatively trained HMM systems in absolute terms, bringing us closer to the performance of the Restricted Boltzmann Machine (RBM). One hypothesis for the superior performance of the RBM is its ability to address two representational efficiency issues of the IBM HMM system, namely diagonal covariance modeling and the frame independence assumption [2].

TABLE V
COMPARISON OF CD DISCRIMINATIVELY TRAINED SYSTEMS ON TIMIT CORE TEST SET

System                                 PER (%)    Relative PER Red. from ML
MMI Training, HMM [17]                 28.2       4.2
Large-Margin Training, HMM [35]        28.2       13.8
P-MCE, HMM [5]                         27.0       6.5
IBM fBMMI, HMM (this paper)            21.7       16.6
IBM fBMMI+BMMI, HMM (this paper)       21.7       16.6
Deep Belief Networks, RBM [2]          20.5       not available

B. TIMIT: Full LVCSR System

As discussed in Section IV, the full LVCSR recipe involves building a set of SI CD models, and then performing feature and model space speaker adaptation and discriminative training. Table VI shows the error rates for each stage of this process.5 To analyze the statistical significance of the results at each stage of the LVCSR process, a Matched Pairs Sentence Segmentation Word Error (MPSSWE) [12] significance test is performed comparing the PER at a specific stage to the PER at the previous stage. ≠ indicates that the two results are statistically significant at a 95% confidence level, while = indicates that the behavior of the two systems is similar. The VTLN stage provides a 1.2% absolute decrease in PER which is statistically significant relative to the SI system, while fMLLR allows for another 1.6% reduction in PER, also statistically significant. Discriminative training allows for a further 2.3% reduction in PER, again statistically significant relative to the previous SI+VTL+fMLLR system. Finally, applying MLLR on top of this provides a PER of 19.4%, which is not statistically significant.
5 Results for each stage of the LVCSR recipe using a bigram LM are provided in Appendix IX-C.


TABLE VI
PER FOR VARIOUS LVCSR STAGES ON TIMIT CORE TEST SET

System        PER (%)    MPSSWE
SI System     24.5
+VTL          23.3       ≠
+fMLLR        21.7       ≠
+fBMMI        19.5       ≠
+BMMI         19.5       =
+MLLR         19.4       =

C. Broadcast News: Full LVCSR System

To analyze how each stage of the LVCSR recipe behaves for a large vocabulary task, Table VII shows the error rates for each step. Similar to TIMIT, each stage in the LVCSR recipe improves the WER from the previous stage, and an MPSSWE [12] test indicates that the improvement in WER at each stage is statistically significant with 95% confidence. Notice that for LVCSR, the absolute reduction in WER from one stage to the next is greater than in TIMIT.

TABLE VII
WER FOR VARIOUS LVCSR STAGES ON DEV-04F TASK

System        WER (%)    MPSSWE
SI System     33.4
+VTL          28.0       ≠
+fMLLR        25.8       ≠
+fBMMI        21.1       ≠
+BMMI         20.2       ≠
+MLLR         19.4       ≠

VII. SPARSE REPRESENTATION RESULTS

Now that we have described our baseline state-of-the-art LVCSR system, in this section we present results on top of this baseline system using sparse representation features. Results are discussed for both small and large vocabulary tasks.

A. Creation of Hβ Features

We create a set of Hβ features from a set of fBMMI features. We choose this level as these features offer the highest frame accuracy relative to LDA, VTLN, or fMLLR features, allowing us to further improve on the accuracy of the fBMMI system. A set of Hβ features is created at each frame from the fBMMI features for both the training and test sets. A new ML HMM is then trained on these new features and used for both training and test. Parameters of the SR algorithm, including the size of H and the sparsity level, were optimized on the development sets of both TIMIT and BN to maximize frame classification accuracy. For TIMIT, the optimal size of the dictionary was approximately 200 training examples, while for BN 700 examples were sufficient. Similar parameters for ABCS [1] were used for both tasks, including the number of iterations to estimate β and the sparsity parameter associated with the semi-Gaussian constraint. Since the Hβ features are a linear combination of the discriminatively trained fBMMI features, we argue that some discrimination can be lost. Since the objective of the BMMI criterion is to reduce frame error rate, we can quantify this loss of discrimination by comparing the frame accuracy of the fBMMI and Hβ systems. This frame accuracy is calculated by performing a forced path phonetic alignment using the output decode of the fBMMI or Hβ systems. On the Broadcast News task, we find that even though both fBMMI and Hβ features offer the same WER, the frame accuracy of the fBMMI system is roughly 76.7% while that of the Hβ system is 75.2%, showing the loss of discrimination. Therefore, we apply another fBMMI transformation to the Hβ features before applying model space discriminative training and MLLR. To fairly compare the performance of the Hβ features to the baseline fBMMI features, we also apply another fBMMI transformation to these original features. Notice that when the second fBMMI transformation is applied, the BMMI objective function is computed using lattices which are created using the first-step fBMMI features, not using lattices created using the SA features as is done when estimating the first fBMMI features. This allows for the possibility of capturing extra discrimination that the original fBMMI features may not have. However, for every fBMMI transformation that is applied, the gains are typically smaller than at the previous discrimination stage, since less additional discriminatory power can be captured from an already discriminative feature.

B. Sparsity Analysis

As discussed in Section III, the Hβ features are first created by solving y = Hβ subject to a sparsity constraint on β. We first analyze the behavior of the β coefficients on both TIMIT and Broadcast News. For two randomly selected frames y, Figure 6 shows the sorted β coefficients corresponding to 200 entries in H for TIMIT and 500 entries for Broadcast News. Notice that for both datasets, the β entries are quite sparse (i.e., most of the entries are close to zero), illustrating that only a few samples in H are used to characterize y.

Fig. 6. β Coefficients on TIMIT and Broadcast News (top: sorted β values for the 200 entries in H on TIMIT; bottom: for the 500 entries in H on Broadcast News)

Sparsity has two advantages. First, as [45] discusses, sparsity can be thought of as a form of discrimination, as certain examples are selected as “good” in H while jointly assigning


zero weights “bad” examples in H. We have seen advantages of the sparse representation approach for classification, even on top of discriminatively trained y features, compared to a GMM [29]. We will also re-confirm this behavior in Section VII-C. Secondly, [16] illustrates how the sparsity constraint on β acts as a regularization term to prevent over-fitting and reduce sensitivity to outliers, which often allows for better classification performance than without sparsity.6 The benefits of SRs discussed above, coupled with an exemplar-based nature of SRs, motivates us to further explore its behavior for recognition tasks. C. TIMIT Results 1) Frame Accuracy: The success of Hβ first relies on the fact that the β vectors give large support to correct classes and small support to incorrect classes (as demonstrated by Figure 6) when computing y = Hβ at each frame. Thus, the classification accuracy per frame, computed using Equation 2 should ideally be high. Table VIII shows the frame accuracy for the GMM and SR methods. TABLE VIII F RAME ACCURACY ON TIMIT T ESTCORE S ET Frame Accuracy 70.4 71.7

Baseline System fBMMI +BMMI+MLLR

PER 19.5 19.4

Hβ System Hβ +BMMI+MLLR

PER 18.6 18.6

measured at approximately 15.0%. We believe that our error rate of 18.6% illustrates the benefits of using the best available baseline system on TIMIT. 3) Error Analysis: In this section, we explore the benefit of using TIMIT for a detailed error analysis, something that is difficult to do for large vocabulary tasks due to the large number of parameters and poorly labeled transcriptions. First, we analyze the substitution errors, which constitutes the majority of the error rate, for the fBMMI+BMMI+MLLR system, which offered a PER of 19.4%. Figure 7 shows a confusion matrix of substitution errors for each phoneme, with phonemes within the same manner class also indicated. We find that approximately 80% of confusions occur within the same manner class, as was similarly observed in [15]. A high number of confusions exists because linguistic knowledge when recognizing a sequence of phonemes as belonging to a word was not used in our system, but was available in the experiment in [48]. Hypothesis

6 For example, in [30], it was observed on a text classification experiment that enforcing sparsity offered improvements in accuracy over no sparsity.
7 Notice from Table IX that the gains after applying BMMI+MLLR are small. One intuition is that on a small-vocabulary, clean-speech task such as TIMIT, there is little data variability left to model once speaker adaptation, feature-space discriminative training and SRs are applied. Thus, model-space discriminative training and speaker adaptation add little modeling capability, resulting in little improvement in performance.

Fig. 7. Confusion matrix of substitution errors (hypothesis vs. reference), with radii linearly proportional to the error; phonemes are grouped into vowels/semi-vowels, nasals/flaps, strong fricatives, weak fricatives, stops and closures. This matrix was derived from the SA+Discriminative Training system, which offered a PER of 19.4%. The largest bubble represents 5.4% of the total error.

Given the high confusability among phonemes within different broad phonetic classes (BPCs), we hypothesize that using information about actual training examples can help to reduce confusions within the same (as well as across different) classes. To quantify the benefits of the SR technique on top of our fBMMI+BMMI system, Figure 8 shows the breakdown of error rates for each stage of the LVCSR process and for the SR stage within six BPCs, namely vowels/semivowels, nasals, strong fricatives, weak fricatives, stops and closures/silence. Here the


error rate was calculated by counting the number of insertions, deletions and substitutions that occur for all phonemes within a particular BPC. Notice that, by utilizing information about actual training examples, the PER within each BPC except the vowels drops after the SR stage is incorporated. The trends in Figure 8 illustrate the benefit of using TIMIT for detailed error analysis and show that one advantage of SRs is the reduction of errors within BPCs. The biggest decrease in PER is seen in the nasal, stop and closure classes, where the PER drops approximately 0.2% absolute from the BMMI to the SR stage. The PER for the strong and weak fricatives decreases more modestly from the BMMI to the SR stage. One hypothesis for this behavior is that vowel phonemes occur more frequently than other phonemes in the TIMIT corpus (over 40% of occurrences). With a large amount of training data available for vowels, model-based methods such as GMMs are able to capture much of the variability within the data, and thus perform as well as exemplar-based methods. With less data available for the other classes, model-based methods cannot capture as much of the variability, and exemplar-based techniques are preferred, a fact that was also confirmed in [4].
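As a minimal sketch of this bookkeeping (under assumed data structures, not the paper's scoring tools), the per-BPC tallies could be accumulated from a phone-level alignment as follows; the alignment tuple format, the normalization, and the partial BPC map are illustrative assumptions.

    # Illustrative tally of insertions, deletions and substitutions per broad
    # phonetic class (BPC). The alignment format and the partial BPC map are
    # assumptions; the paper's own scoring pipeline is not shown here.
    from collections import Counter

    BPC = {"iy": "vowels/semivowels", "m": "nasals/flaps", "s": "strong fricatives",
           "f": "weak fricatives", "t": "stops", "cl": "closures/silence"}  # partial map

    def per_by_bpc(alignments, n_ref_phones_by_bpc):
        """alignments: iterable of (ref_phone, hyp_phone, op) with op in
        {'cor', 'sub', 'ins', 'del'}; ref_phone is None for insertions.
        n_ref_phones_by_bpc: nonzero reference-phone counts used as denominators."""
        errors = Counter()
        for ref, hyp, op in alignments:
            if op == "cor":
                continue
            # charge insertions to the BPC of the inserted (hypothesis) phone
            phone = ref if ref is not None else hyp
            errors[BPC.get(phone, "other")] += 1
        return {bpc: errors[bpc] / n_ref_phones_by_bpc[bpc]
                for bpc in n_ref_phones_by_bpc}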





Fig. 8. Error rates within six BPCs (vowels/semivowels, nasals, strong fricatives, weak fricatives, stops, closures) for the various LVCSR stages and the SR stage.

D. Broadcast News Results

1) Selection of H: Table X shows the WER of the Hβ features for the different H selection choices discussed in Section III-C. Note that the baseline fBMMI system has a WER of 21.1%. The following can be observed:
• There is little difference in WER when sampling is done randomly or using cosine similarity. For speed, we therefore use random sampling for the H selection methods.
• There is little difference between using the top 5 or top 10 HMM states.
• Seeding H using nearest neighbors is worse than using the trigram LM. On Broadcast News, we find that a kNN has lower frame accuracy than a GMM, a result similarly observed in the literature for large vocabulary corpora [4]. This lower frame accuracy translates into a higher WER when H is seeded with nearest neighbors.
• Seeding H from unique phonemes introduces too much variability across phoneme classes into the Hβ feature, also leading to a higher WER.
• Using a unigram LM to weaken the link between the states used to seed H and the best aligned state from the trigram LM decode offers a slight improvement in WER over the trigram LM.
• Utilizing no LM information results in a relatively high WER.
• Using Gaussian means to seed H reduces the computation required to create Hβ without a large increase in WER.

TABLE X: WER of Hβ features for different H selection methods
H Selection Method | WER
Trigram LM, Random Sampling, Top 5 HMM states | 21.2
Trigram LM, Cosine Similarity Sampling, Top 5 HMM states | 21.3
Trigram LM, Top 10 HMM states | 21.3
Nearest Neighbor, 500 | 21.4
Trigram LM, 5 Unique Phonemes | 21.6
Unigram LM, Top 5 HMM states | 21.1
No LM Information, Top 5 HMM states | 22.7
Gaussian Means, Top 25 HMM states | 21.4

For further recognition experiments on Broadcast News, we use Hβ features created from a unigram LM, as this produced the best results.
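Before moving to the recognition results, the following is a minimal sketch of the per-frame seeding procedure described above (take the top-N scoring HMM states at a frame and randomly sample training frames aligned to those states). It is not the IBM toolkit code, and `state_posteriors` and `frames_by_state` are hypothetical inputs.

    # Illustrative sketch of seeding the dictionary H for one test frame.
    import numpy as np

    def seed_H(state_posteriors, frames_by_state, top_n_states=5,
               samples_per_state=40, rng=None):
        """state_posteriors : dict state_id -> score for the current test frame
        frames_by_state  : dict state_id -> (n_i, m) array of training frames
        Returns H as an (m, n) matrix whose columns are sampled exemplars."""
        rng = rng or np.random.default_rng(0)
        # top-N context-dependent HMM states from the (unigram or trigram) LM decode
        top_states = sorted(state_posteriors, key=state_posteriors.get,
                            reverse=True)[:top_n_states]
        columns = []
        for s in top_states:
            pool = frames_by_state[s]
            idx = rng.choice(len(pool), size=min(samples_per_state, len(pool)),
                             replace=False)   # random sampling of exemplars
            columns.append(pool[idx])
        return np.concatenate(columns, axis=0).T   # shape (m, n)

Swapping random sampling for cosine-similarity sampling, or replacing the sampled frames with Gaussian means, corresponds to the other rows of Table X.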

2) WER for Hβ Features: Table XI shows the performance of the Hβ features on the Broadcast News task. Creating a set of Hβ features in the fBMMI space gives a WER of 21.1%, which is comparable to the baseline system. However, after applying an fBMMI transform to the Hβ features we achieve a WER of 20.2%, a 0.2% absolute improvement over applying another fBMMI transform to the original fBMMI features. Finally, after applying BMMI and MLLR to both the baseline and the Hβ system, the Hβ features offer a WER of 18.7%, a 0.3% absolute improvement over the baseline system, though this gain is not statistically significant. Nevertheless, it again demonstrates that using information about actual training examples to produce features that are mapped closer to the training data, and that have a higher frame accuracy than GMMs, improves accuracy for large vocabulary tasks as well.

TABLE XI: WER on Broadcast News
Baseline System | WER | Hβ System | WER
fBMMI | 21.1 | Hβ | 21.1
+fBMMI | 20.4 | +fBMMI | 20.2
+BMMI+MLLR | 19.0 | +BMMI+MLLR | 18.7

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a methodology for quickly testing new research ideas for LVCSR systems on a small-scale task, and observed how the gains from such an idea carry over to an LVCSR task. Utilizing our LVCSR “recipe” on the TIMIT corpus, we showed that, at the speaker-independent level, our


results of 24.5% offered improvements over the best previously published SI HMM results. Incorporating discriminative training allowed for a PER of 21.7%, the best result reported for discriminatively trained HMM systems to date. Finally, utilizing the full LVCSR recipe with speaker adaptation and discriminative training provided an error rate of 19.4%. On the TIMIT corpus, applying a novel set of SR features on top of our best feature-space discriminatively trained system allows for a 0.8% absolute reduction in PER. The PER of 18.6% provided by the SR method is the best result on the TIMIT task to date, moving us closer to human phonetic recognition performance. Applying these SR features to an LVCSR task allows for a reduction in word error rate (WER) of 0.3% absolute, demonstrating the benefit of using TIMIT as a methodology to explore ideas for LVCSR systems. In the future, we would like to expand this work in a number of areas. First, many improvements can be made to the current exemplar-based SR method presented in this paper. One issue is that our processing of Hβ features estimates a β for each frame without incorporating word-level information from the entire utterance, also known as sequence information. The neural network approach presented in [18] learns features at the frame level while incorporating this sequence information, and we would like to incorporate a similar technique when estimating β. Another issue, outlined in Section III-E, is that SR features are computationally very expensive: they require storing all training samples at run time and randomly sampling from these samples at each frame to seed H, and an SR computation to solve for β is also required at each frame. We have taken initial steps toward faster methods for creating SR features to make them more usable for large vocabulary tasks [31], and would like to expand on this in the future. Second, the SR work presented in this paper can be used as a feature enhancement technique. Similar to [11], we can model H as a combination of clean speech and noise exemplars, i.e., H = [S; N], where S and N represent the clean speech and noise examples respectively. Given a noisy test feature y, we can use SRs to solve y = Hβ and find the clean speech and noise examples in H which best explain y. The noise exemplars in H can then be discarded and a new “clean” speech feature constructed as Sβs, where βs denotes the entries of β corresponding to speech exemplars. Recognition can then be performed with this new “clean” feature. Finally, our analysis of within-class confusions suggests that vowel performance in LVCSR could be improved by merging highly confusable vowels (e.g., [ih] and [ah]) into one class, which can aid pronunciation generation by reducing the number of pronunciations. In addition, given the high error rates for short and long vowels, voicing and duration modeling in LVCSR might also improve error rates.

ACKNOWLEDGEMENTS

The authors would like to thank Hagen Soltau, George Saon, Brian Kingsbury and Stanley Chen for their contributions to the IBM toolkit and recognizer utilized in this

paper. Also, thanks to Abhinav Sethy and Parikshit Shah for many useful discussions and guidance related to the sparse representation features described in this paper.

IX. APPENDIX

A. Sparse Representation Classification Decision Rules

In this section, we discuss how to assign y to a specific class, and highlight the various classification rules we explore.

1) Maximum Support: Ideally, all nonzero entries of β should correspond to entries of H with the same class as y. In this ideal case, y assigns itself to training examples from a single class in H, and we can assign y to the class which has the largest support in β:

i∗ = arg max_i β_i    (3)

2) Maximum l2 Support: In practice, due to noise and modeling error, β entries belonging to other classes can also be non-zero. Therefore, we compute the l2 norm of the β entries within each class and choose the class with the largest l2 norm. More specifically, let us define a selector δi(β) ∈ ℜ^N as a vector whose entries are zero except for the entries of β corresponding to class i. We then compute the l2 norm of β for class i as ∥δi(β)∥2. The best class for y is the class in β with the largest l2 norm. Mathematically, the best class i∗ is defined as

i∗ = arg max_i ∥δi(β)∥2    (4)

3) Residual Error: As [45] discusses, a classification decision can also be made by measuring how well y assigns itself to the different classes in H. This can be thought of as the residual error between y and the Hβ entries corresponding to a specific class [45]. Using the selector δi(β) defined above, we compute the residual error for class i as ∥y − Hδi(β)∥2. The best class for y is the class with the smallest residual error. Mathematically, the best class i∗ is defined as

i∗ = arg min_i ∥y − Hδi(β)∥2    (5)
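The following is a minimal sketch of the three rules, assuming `labels[j]` gives the class identity of the j-th column of H; the array layout and label handling are assumptions for illustration.

    # Illustrative implementation of the decision rules in Equations 3-5.
    import numpy as np

    def classify(y, H, beta, labels):
        beta = np.asarray(beta)
        labels = np.asarray(labels)
        classes = np.unique(labels)

        # Eq. (3) maximum support: class of the single largest beta entry
        max_support = labels[np.argmax(beta)]

        # Eq. (4) maximum l2 support: class whose beta entries have the largest l2 norm
        l2 = {c: np.linalg.norm(beta[labels == c]) for c in classes}
        max_l2 = max(l2, key=l2.get)

        # Eq. (5) minimum residual: class whose columns best reconstruct y
        resid = {c: np.linalg.norm(y - H[:, labels == c] @ beta[labels == c])
                 for c in classes}
        min_resid = min(resid, key=resid.get)

        return max_support, max_l2, min_resid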

B. ABCS Computational Complexity

In this section, we provide the computational complexity of the ABCS algorithm [1]. We would like to use ABCS to solve the following problem for β:

y = Hβ   s.t.   ∥β∥₁² < ϵ    (6)

Here ∥β∥₁² < ϵ denotes a sparsity-promoting semi-Gaussian constraint, which we describe in more detail below. In addition, y is a frame of data from the test set such that y ∈ ℜ^m, where m is the dimension of the feature vector y. H is a matrix of training examples with H ∈ ℜ^{m×n}, where m ≪ n. We assume that y satisfies the linear model y = Hβ + ζ, where ζ ∼ N(0, R). This allows us to represent p(y|β) as a Gaussian:

p(y|β) ∝ exp(−(1/2)(y − Hβ)ᵀR⁻¹(y − Hβ))    (7)


Assuming β is a random variable with prior p(β), we can obtain the maximum a posteriori (MAP) estimate of β as β∗ = arg maxβ p(β|y) = arg maxβ p(y|β)p(β). In the ABCS formulation, we assume that p(β) is actually the product of two prior constraints, namely a Gaussian constraint pG(β) and a semi-Gaussian constraint pSG(β) which enforces sparsity. Below, we present a two-step solution to the following problem in the ABCS framework:

β∗ = arg maxβ p(y|β) pG(β) pSG(β)    (8)

1) Step 1: In step 1, we solve for the β which maximizes the following expression. Equation 9 is equivalent to solving the equation y = Hβ without enforcing a sparsity constraint on β [1]. β ∗ = arg max p(y|β)pG (β) (9) β

We assume that pG(β) is a Gaussian, i.e., pG(β) = N(β|β0, P0). Here β0 and P0 are initialized statistical moments utilized in the algorithm. In [1], we show that Equation 9 has a closed-form solution given by Equations 10a and 10b:

β∗ = β1 = (I − P0 Hᵀ(H P0 Hᵀ + R)⁻¹ H) β0 + P0 Hᵀ(H P0 Hᵀ + R)⁻¹ y    (10a)

Similarly, we can express the variance of β1 as P1 = E[(β − β1)(β − β1)ᵀ], given more explicitly by Equation 10b:

P1 = (I − P0 Hᵀ(H P0 Hᵀ + R)⁻¹ H) P0    (10b)

2) Step 2: Step 1 essentially solved for a pseudo-inverse solution of y = Hβ, of which there are many. In this step, we impose the additional constraint that β has a sparsity-promoting semi-Gaussian prior, as given by Equation 11. Here σ² is a constant parameter which controls the degree of sparsity of β, and c is a constant that ensures that Equation 11 integrates to one.

pSG(β) = c × exp(−∥β∥₁² / (2σ²))    (11)

Given the solution to Step 1 in Equations 10a and 10b, we can rewrite Equation 9 as another Gaussian, p′(β|y) = p(y|β)pG(β) = N(β|β1, P1). We would now like to solve for the MAP estimate of β given the additional constraint that it is semi-Gaussian, in other words:

β∗ = arg maxβ p′(β|y) pSG(β)    (12)

In order to represent pSG(β) as a Gaussian in the same way that p(y|β) was represented in Equation 7, let us define βⁱ to be the ith entry of the vector β. We introduce a vector Ĥ whose entries are set as Ĥⁱ(βⁱ) = sign(βⁱ) for i = 1, . . . , n. Here Ĥⁱ(βⁱ) = +1 for βⁱ > 0, Ĥⁱ(βⁱ) = −1

for βⁱ < 0, and Ĥⁱ(βⁱ) = 0 for βⁱ = 0. This vector Ĥ is motivated by the fact that

∥β∥₁² = (Σᵢ |βⁱ|)² = (Σᵢ Ĥⁱ(βⁱ) βⁱ)² = (Ĥβ)²    (13)

Substituting the expression for ∥β∥₁² given in Equation 13 and assuming that y = 0, we can rewrite Equation 11 as Equation 14. Notice that Equation 14 has the same form as Equation 7, with H and R now replaced by Ĥ and σ² respectively.

pSG(β) = p(y = 0|β) = c × exp(−(0 − Ĥβ)² / (2σ²))    (14)

The only problem with using Equation 12 to solve for β is the dependency of Ĥ on β in Equation 10a. Therefore, we make an approximation, computing Ĥ at iteration k from the sign of the previously estimated βk−1. With this approximation we can use Equations 10a and 10b to solve Equation 14. However, because of this semi-Gaussian approximation, we must estimate β and P iteratively. As [1] shows, this iteration also requires that we set σ² to σ² × d, where d is the total number of iterations of Step 2. Equation 15 gives the recursive formula which solves Equation 12 at iteration k, for k = 1 to d. Note that p′(β|y) = N(β|βk−1, Pk−1).

βk = βk−1 − [Pk−1 Ĥᵀ / (Ĥ Pk−1 Ĥᵀ + d × σ²)] Ĥ βk−1    (15a)

Pk = [I − Pk−1 Ĥᵀ Ĥ / (Ĥ Pk−1 Ĥᵀ + d × σ²)] Pk−1    (15b)
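To make the two-step recursion concrete, below is a minimal numpy sketch of Equations 10 and 15 as reconstructed above; the initialization of β0, P0 and R is left to the caller, and the code favors readability over the efficiency considerations in the complexity analysis that follows.

    # Minimal sketch of the two ABCS steps (Eq. 10 and Eq. 15); initialization
    # choices are assumptions left to the caller.
    import numpy as np

    def abcs(y, H, R, beta0, P0, sigma2, d=10):
        """ABCS for y = H beta with a semi-Gaussian sparsity prior.
        y: (m,), H: (m, n), R: (m, m), beta0: (n,), P0: (n, n), sigma2: scalar."""
        # Step 1 (Eq. 10a, 10b): Gaussian prior only, closed-form update
        G = P0 @ H.T @ np.linalg.inv(H @ P0 @ H.T + R)       # (n, m) gain
        beta = beta0 + G @ (y - H @ beta0)                   # algebraically equal to Eq. 10a
        P = (np.eye(len(beta0)) - G @ H) @ P0                # Eq. 10b

        # Step 2 (Eq. 15a, 15b): iterate with the semi-Gaussian constraint,
        # linearizing via H_hat = sign(beta) from the previous iterate (Eq. 13, 14)
        for _ in range(d):
            H_hat = np.sign(beta)[None, :]                   # (1, n) row vector
            denom = float(H_hat @ P @ H_hat.T) + d * sigma2  # scalar
            beta = beta - (P @ H_hat.T @ (H_hat @ beta[:, None])).ravel() / denom
            P = (np.eye(len(beta)) - (P @ H_hat.T @ H_hat) / denom) @ P
        return beta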

3) Computational Complexity: Step 2 requires the most computation, so we will focus on this when calculating complexity. Using the fact that H ∈ ℜm×n , P ∈ ℜn×n and β ∈ ℜm×1 . The computation of β, as given by Equation 15a, involves multiplying an m × n matrix with a matrix of size n × n. This has a computational complexity of roughly O(mn2 ) at a single iteration [40]. In addition, the computation of P , as given by Equation 15b, involves inverting an n × n matrix and is roughly O(n3 ) at a single iteration via GaussJordon elimination [40]. Therefore, for d iterations, the total computational complexity of the original ABCS method is O(dmn2 + dn3 ) ≈ O(dn3 ) when m < n. C. TIMIT Results Using Bigram and Trigram Language Model In this section, we report results on TIMIT for each stage of the LVCSR process using a bigram and trigram LM.


TABLE XII: PER for various LVCSR stages on the TIMIT core test set, using bigram and trigram language models
System | PER (%) (bigram) | PER (%) (trigram)
SI CI System | 26.3 | 25.2
SI CD System | 25.4 | 24.5
+VTLN | 24.1 | 23.3
+fMLLR | 22.2 | 21.7
+fBMMI | 20.0 | 19.5
+BMMI | 20.0 | 19.5
+Hβ | 19.8 | 18.6
+MLLR | 19.8 | 18.6

REFERENCES

[1] A. Carmi, P. Gurfil, D. Kanevsky, and B. Ramabhadran, “ABCS: Approximate Bayesian Compressive Sensing,” Human Language Technologies, IBM, Tech. Rep., 2009.
[2] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine,” in Proc. NIPS, 2010.

[3] L. Deng and D. Yu, “Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition,” in Proc. ICASSP, 2007.
[4] T. Deselaers, G. Heigold, and H. Ney, “Speech Recognition With State-based Nearest Neighbour Classifiers,” in Proc. Interspeech, 2007.
[5] Q. Fu, X. He, and L. Deng, “Phone-Discriminating Minimum Classification Error (P-MCE) Training for Phonetic Recognition,” in Proc. Interspeech, 2007.
[6] M. J. F. Gales, “Maximum Likelihood Linear Transformations for HMM-based Speech Recognition,” Computer Speech & Language, vol. 12, pp. 75–98, 1998.
[7] A. Ganapathiraju, J. Hamaker, and J. Picone, “Applications of Support Vector Machines to Speech Recognition,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2348–2355, 2004.
[8] Y. Gao, Y. Li, V. Goel, and M. Picheny, “Recent Advances in Speech Recognition System for IBM DARPA Communicator,” in Proc. Eurospeech, 2001.
[9] J. Gauvain, L. Lamel, and G. Adda, “The LIMSI Broadcast News Transcription System,” Speech Communication, 2002.
[10] J. Gemmeke, U. Remes, and K. J. Palomäki, “Observation Uncertainty Measures for Sparse Imputation,” in Proc. Interspeech, 2010.
[11] J. F. Gemmeke and T. Virtanen, “Noise Robust Exemplar-Based Connected Digit Recognition,” in Proc. ICASSP, 2010.
[12] L. Gillick and S. Cox, “Some Statistical Issues in the Comparison of Speech Recognition Algorithms,” in Proc. ICASSP, 1989.
[13] J. Glass, J. Chang, and M. McCandless, “A Probabilistic Framework for Feature-Based Speech Recognition,” in Proc. ICSLP, 1996.
[14] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone Speech Corpus for Research and Development,” in Proc. ICASSP, 1992.
[15] A. Halberstadt and J. Glass, “Heterogeneous Measurements and Multiple Classifiers for Speech Recognition,” in Proc. ICSLP, 1998.
[16] D. Kanevsky, T. N. Sainath, B. Ramabhadran, and D. Nahamoo, “An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification,” in Proc. Interspeech, 2010.
[17] S. Kapadia, V. Valtchev, and S. J. Young, “MMI Training for Continuous Phoneme Recognition on the TIMIT Database,” in Proc. ICASSP, 1993.
[18] B. Kingsbury, “Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling,” in Proc. ICASSP, 2009.
[19] L. Lamel and J. Gauvain, “High Performance Speaker-Independent Phone Recognition using CDHMM,” in Proc. Eurospeech, 1993.
[20] L. Lamel, R. Kassel, and S. Seneff, “Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus,” in Proc. of the DARPA Speech Recognition Workshop, 1986.
[21] K. F. Lee and H. W. Hon, “Speaker-independent Phone Recognition Using Hidden Markov Models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 1641–1648, 1989.
[22] X. Lei, W. Wu, W. Wang, A. Mandal, and A. Stolcke, “Development of the 2008 SRI Mandarin Speech-to-Text System for Broadcast News and Conversations,” in Proc. Interspeech, 2009.
[23] S. Matsoukas, T. Colthurst, O. Kimball, A. Solomonoff, F. Richardson, C. Quillen, H. Gish, and P. Dognin, “The 2001 BYBLOS English Large Vocabulary Conversational Speech Recognition System,” in Proc. ICASSP, 2002.
[24] J. Ming and F. J. Smith, “Improved Phone Recognition using Bayesian Triphone Models,” in Proc. ICASSP, 1998.
[25] D. Povey, “Phone Duration Modeling for LVCSR,” in Proc. ICASSP, 2004.

[26] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for Model and Feature Space Discriminative Training,” in Proc. ICASSP, 2008.
[27] P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett, “The DARPA 1000-word Resource Management Database for Continuous Speech Recognition,” in Proc. ICASSP, 1988.
[28] A. Robinson, “An Application of Recurrent Nets to Phone Probability Estimation,” IEEE Transactions on Neural Networks, vol. 5, pp. 298–305, 1994.
[29] T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian Compressive Sensing for Phonetic Classification,” in Proc. ICASSP, 2010.
[30] T. N. Sainath, S. Maskey, D. Kanevsky, B. Ramabhadran, D. Nahamoo, and J. Hirschberg, “Sparse Representations for Text Categorization,” in Proc. Interspeech, 2010.
[31] T. N. Sainath, B. Ramabhadran, D. Nahamoo, and D. Kanevsky, “Reducing Computational Complexities of Exemplar-Based Sparse Representations With Applications to Large Vocabulary Speech Recognition,” submitted to Proc. Interspeech, 2011.
[32] T. N. Sainath, B. Ramabhadran, and M. Picheny, “An Exploration of Large Vocabulary Tools for Small Vocabulary Phonetic Recognition,” in Proc. ICASSP, 2009.
[33] T. N. Sainath, V. Zue, and D. Kanevsky, “Audio Classification using Extended Baum-Welch Transformations,” in Proc. Interspeech, 2007.
[34] G. Saon, G. Zweig, B. Kingsbury, L. Mangu, and U. Chaudhari, “An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech,” in Proc. Eurospeech, 2003.
[35] F. Sha, “Comparison of Large Margin Training to Other Discriminative Training Methods for Phonetic Recognition by Hidden Markov Models,” in Proc. ICASSP, 2007.
[36] G. Sivaram, S. Ganapathy, and H. Hermansky, “Sparse Auto-associative Neural Networks: Theory and Application to Speech Recognition,” in Proc. Interspeech, 2009.
[37] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig, “The IBM 2004 Conversational Telephony System for Rich Transcription,” in Proc. ICASSP, 2005.
[38] H. Soltau, G. Saon, and B. Kingsbury, “The IBM Attila Speech Recognition Toolkit,” in Proc. IEEE Workshop on Spoken Language Technology, 2010.
[39] H. Soltau, G. Saon, B. Kingsbury, J. Kuo, L. Mangu, D. Povey, and G. Zweig, “The IBM 2006 GALE Arabic ASR System,” in Proc. ICASSP, 2007.
[40] G. Strang, Introduction to Linear Algebra. Wellesley-Cambridge Press, 2003.
[41] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[42] M. Tomalin, F. Diehl, M. Gales, J. Park, and P. C. Woodland, “Recent Improvements to the Cambridge Arabic Speech-to-Text Systems,” in Proc. ICASSP, 2010.
[43] A. Tychonoff and V. Arseny, Solution of Ill-Posed Problems. Washington: Winston and Sons, 1977.
[44] M. D. Wachter, M. Matton, K. Demuynck, P. Wambacq, R. Cools, and D. V. Compernolle, “Template Based Continuous Speech Recognition,” IEEE Transactions on Audio, Speech and Language Processing, 2007.
[45] J. Wright, A. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust Face Recognition via Sparse Representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, 2009.
[46] S. J. Young, “The General Use of Tying in Phoneme-Based HMM Speech Recognizers,” in Proc. ICASSP, 1992.
[47] H. Zou and T. Hastie, “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society, 2005.
[48] V. Zue and R. Cole, “Experiments on Spectrogram Reading,” in Proc. ICASSP, 1979.
[49] G. Zweig and P. Nguyen, “A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition,” in Proc. ASRU, 2007.


Tara Sainath received her PhD in Electrical Engineering and Computer Science from MIT in 2009. The main focus of her PhD work was in acoustic modeling for noise robust speech recognition. She joined the Speech and Language Algorithms group at IBM T.J. Watson Research Center upon completion of her PhD. She organized a Special Session on Sparse Representations at Interspeech 2010 in Japan. In addition, she has served as a staff reporter for the IEEE Speech and Language Processing Technical Committee (SLTC) Newsletter. She currently holds 8 US patents. Her research interests include acoustic modeling, sparse representations, adaptation methods and noise robust speech recognition.

Bhuvana Ramabhadran is the Manager of the Speech Transcription and Synthesis Research Group at the IBM T. J. Watson Research Center, Yorktown Heights, NY. Since joining IBM in 1995, she has made significant contributions to the ViaVoice line of products, focusing on acoustic modeling, including acoustics-based baseform determination, factor analysis applied to covariance modeling, and regression models for Gaussian likelihood computation. She has served as the Principal Investigator of two major international projects: the NSF-sponsored MALACH Project, developing algorithms for transcription of elderly, accented speech from Holocaust survivors, and the EU-sponsored TC-STAR Project, developing algorithms for recognition of EU parliamentary speeches. She was the publications chair of the 2000 ICME Conference, and organized the HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, a Special Session on Speech Transcription and Machine Translation at the 2007 ICASSP in Honolulu, HI, and a Special Session on Sparse Representations at Interspeech 2010. She is currently a member of the Speech Technical Committee of the IEEE Signal Processing Society, and serves as its industry liaison. She served as an Adjunct Professor in the Electrical Engineering Department of Columbia University in the fall of 2009 and co-taught a course in speech recognition. Her research interests include speech recognition algorithms, statistical signal processing, pattern recognition, and biomedical engineering.

Michael Picheny is the Senior Manager of the Speech and Language Algorithms Group at the IBM T.J. Watson Research Center. Michael has worked in the Speech Recognition area since 1981, joining IBM after finishing his doctorate at MIT. He has been heavily involved in the development of almost all of IBM’s recognition systems, ranging from the world’s first real-time large vocabulary discrete system through IBM’s current product lines for telephony and embedded systems. He has published numerous papers in both journals and conferences on almost all aspects of speech recognition. He has received several awards from IBM for his work, including three outstanding Technical Achievement Awards and two Research Division Awards. He is the co-holder of over 20 patents and was named a Master Inventor by IBM in 1995 and again in 2000. Michael served as an Associate Editor of the IEEE Transactions on Acoustics, Speech, and Signal Processing from 1986-1989, was the chairman of the Speech Technical Committee of the IEEE Signal Processing Society from 2002-2004, and is a Fellow of the IEEE. He served as an Adjunct Professor in the Electrical Engineering Department of Columbia University in the fall of 2009 and co-taught a course in speech recognition. He is currently a member of the board of ISCA (International Speech Communication Association) and is on the editorial board of the IEEE Signal Processing Magazine.

David Nahamoo is the Speech CTO and the Speech Business Strategist for IBM Research. He is responsible for IBM Research's technical and business directions in conversational and multimodal technologies. He joined IBM Research in 1982 as a Research Staff Member. Since then he has held a number of positions in the organization, including Manager, Speech Recognition Modeling, and Interim General Manager, Speech Business Unit. In 2008, he was granted the title of IBM Fellow, IBM's most prestigious technical honor. He holds 25 patents and has published more than 55 technical papers in scientific journals. Dr. Nahamoo is a Member of the IBM Academy of Technology and a Fellow of the IEEE. He has also been a Member of the Spoken Language Coordinating Committee, DARPA, and of the Speech Technical Committee of the ASSP Society, as well as an Associate Editor of the Transactions on Acoustics, Speech and Signal Processing. In 2001, he received the IEEE Signal Processing Best Paper Award. His current research interests include conversational and multimodal technologies and tools, speech solutions and services, and speech user interfaces. David Nahamoo received a B.S. degree from Tehran University, Iran, an M.S. from Imperial College London, England, and his Ph.D. in 1982 from Purdue University in Indiana; all of his studies were in electrical engineering.

Dimitri Kanevsky is a research staff member in the Speech and Language Algorithms department at the IBM T. J. Watson Research Center. Prior to joining IBM, he worked at a number of prestigious centers for higher mathematics, including the Max Planck Institute in Germany and the Institute for Advanced Study in Princeton. At IBM he has been responsible for developing the first Russian automatic speech recognition system, as well as key projects for embedding speech recognition in automobiles and broadcast transcription systems. He currently holds 132 US patents and was granted the title of IBM Master Inventor in 2002, 2005 and 2010. His conversational biometrics based security patent was recognized by MIT's Technology Review as one of the five most influential patents of 2003. His work on the Extended Baum-Welch algorithm for speech, on embedding speech recognition in automobiles, and on conversational biometrics was recognized as a science accomplishment by the Director of Research at IBM in 2002, 2004 and 2008, respectively. In 2005 he received an honorary degree (Doctor of Laws, honoris causa) from the University College of Cape Breton. He was elected a member of the World Technology Network in 2004 and chaired the IT Software session at the 2005 World Technology Network Summit in San Francisco.
