Alternative Regularized Neural Network Architectures for Speech and Speaker Recognition

by

Sri Venkata Surya Garimella

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland July, 2012

© Sri Venkata Surya Garimella 2012. All rights reserved.

Abstract

Artificial Neural Networks (ANNs) have been widely used in a variety of speech processing applications. They can be used either in a classification or a regression mode. Proper regularization techniques are necessary when training these networks, especially in scenarios where the amount of training data is limited or the number of layers in a network is large. In this thesis, we explore alternative regularized feed-forward neural network architectures and propose learning algorithms for speech processing applications such as phoneme recognition and speaker verification.

In a conventional hybrid phoneme recognition system, a multilayer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to provide estimates of the posterior probabilities of phonemes. These estimates are used for decoding the underlying phoneme sequence. In this thesis, we introduce a sparse multilayer perceptron (SMLP) which jointly learns an internal sparse feature representation and nonlinear classifier boundaries to discriminate multiple phoneme classes. This is achieved by adding a sparse regularization term to the original cross-entropy cost function. Instead of an MLP, the SMLP is used in a hybrid phoneme recognition system. Experiments are conducted to test various feature representations, including the proposed data-driven discriminative spectro-temporal features. Significant improvements are obtained using these techniques.

Another application where neural networks are used is speaker verification. An Auto-Associative Neural Network (AANN) is a fully connected feed-forward neural network trained to reconstruct its input at its output through a hidden compression layer. AANNs are used to model speakers in speaker verification, where a speaker-specific AANN model is obtained by adapting (or retraining) the Universal Background Model (UBM) AANN, an AANN trained on multiple held-out speakers, using the corresponding speaker data. When the amount of speaker data is limited, this procedure may lead to overfitting since all the parameters of the UBM-AANN are being adapted. To alleviate this problem, we regularize the parameters of the AANN by developing subspace methods, namely weighted least squares (WLS) and factor analysis (FA). Experimental results show the effectiveness of the subspace methods over directly adapting a UBM-AANN for speaker verification.

Thesis Committee: Hynek Hermansky, Sanjeev Khudanpur, Trac Tran, Daniel Povey, Nelson Morgan.

Acknowledgments

I owe my deepest gratitude to my supervisor, Prof. Hynek Hermansky, whose insight, encouragement and guidance made this work possible. It has been an honor to work with him. I am indebted to Andrea Ridolfi, Sanjeev Khudanpur, Trac Tran, Donniell Fishkind, Carey Priebe, Rene Vidal and Daniel Povey for offering graduate level courses that enabled me to learn and appreciate mathematical rigour. I would like to thank my internship hosts, Pedro Moreno and Olivier Siohan at Google, for providing me with the opportunity to gain valuable research experience. I am grateful to Joel Pinto, Nima Mesgarani, Sridhar Krishna Nemala, Sriram Ganapathy, Balakrishnan Varadarajan and Samuel Thomas for the scientific collaboration, and also to several colleagues and friends for their support during the PhD. Finally, I would like to thank my mother and sisters for their infinite love and support.

Dedication

This thesis is dedicated to my mother.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
   1.1 Multilayer Perceptron
   1.2 Scope of the Work
   1.3 Contributions
   1.4 Organization of the Thesis

2 Improved Hybrid Phoneme Recognition System
   2.1 Chapter Outline
   2.2 Background
   2.3 Discriminative Spectro-Temporal Features
        2.3.1 Estimation of Spectro-Temporal Filters
        2.3.2 Feature Extraction
   2.4 Sparse Multilayer Perceptron (SMLP)
        2.4.1 Theory of SMLP
              Cost Function
              Error Back-Propagation Training
              Gradient of L̃ w.r.t. y_j^l
              Gradient of L̃ w.r.t. w_ij^{l-1}
              Update Equations
        2.4.2 SMLP as a Posterior Probability Estimator
   2.5 System Description
        2.5.1 Database
        2.5.2 Feature Streams
              PLP Cepstral Coefficients
              FDLP Temporal Features
              MLDA Spectro-Temporal Features
        2.5.3 Hierarchical SMLP
              Estimation of 3-state Phoneme Posteriors
              Hierarchical Estimation of Posterior Probabilities
        2.5.4 Dempster-Shafer Combination
        2.5.5 Hybrid HMM Decoding
   2.6 Experimental Results
   2.7 Analysis

3 Weighted Least Squares based Auto-Associative Neural Networks for Speaker Verification
   3.1 Chapter Outline
   3.2 Auto-Associative Neural Networks
   3.3 Speaker Verification using AANNs
        3.3.1 Feature Extraction
        3.3.2 UBM-AANN
        3.3.3 Speaker-Specific AANN
        3.3.4 Score Computation
   3.4 Weighted Least Squares of AANNs
   3.5 WLS based AANN Speaker Verification System
        3.5.1 Closed-form Expression for Adapting UBM-AANN
        3.5.2 T-matrix Training
        3.5.3 i-vectors
        3.5.4 PLDA training
        3.5.5 Hypothesis Testing
   3.6 Experimental Results

4 Factor Analysis of Auto-Associative Neural Networks for Speaker Verification
   4.1 Chapter Outline
   4.2 Factor analysis of AANNs
   4.3 FA based AANN Speaker Verification System
        4.3.1 Statistics
        4.3.2 T-matrix Training
        4.3.3 i-vectors
   4.4 Experimental Results
        4.4.1 Comparison with GMM based i-vector/PLDA System

5 Conclusions
   5.1 Conclusions
   5.2 Future Work

A Standard Error Back-Propagation

Bibliography

List of Tables

2.1 PER (in %) on TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column indicates the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.
2.2 Average measure of sparsity of the first hidden layer outputs of SMLP and four layer MLP for various feature streams.
2.3 PER (in %) on TIMIT test set for various acoustic feature streams when the 3-state phoneme posteriors are obtained using a single layer perceptron trained on the first hidden layer outputs of the SMLP or MLP classifier.
2.4 PER (in %) on TIMIT test set using a single multilayer perceptron (without hierarchy).
3.1 Description of various telephone conditions of NIST-08.
3.2 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.1 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.2 Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

List of Figures

1.1 Feed-forward multilayer perceptron.
1.2 Block diagram depicting the thesis organization.
2.1 Block diagram of a hybrid phoneme recognition system.
2.2 Steps involved in extracting the spectro-temporal representation.
2.3 Block diagram of the phoneme recognition system.
2.4 Block diagram of hierarchical SMLP. Though both networks are fully connected, only a portion of the connections is shown for clarity.
3.1 Block schematic of the AANN based speaker verification system.
3.2 Auto-Associative Neural Network.
3.3 WLS based AANN speaker verification system.
4.1 Block schematic of the proposed FA based AANN speaker verification system.

Chapter 1

Introduction

Artificial Neural Networks (ANNs) are used in many speech processing applications such as speech activity detection (SAD) [1], keyword spotting (KS) [2, 3], automatic speech recognition (ASR) [4–8], speaker verification (SV) [9–11] and language identification (LID) [12]. Most of these applications use a feed-forward multilayer perceptron (MLP), which is described in section 1.1.

1.1 Multilayer Perceptron

An MLP is a fully connected feed-forward neural network with multiple hidden layers, depicted in Fig. 1.1. Each node in any layer (except the output layer) is connected to every node in the subsequent layer through a set of weights. The output of a node is obtained by applying a specified transformation (linear or non-linear) to the weighted sum of the previous layer's outputs plus a node-specific bias value. Thus an MLP transforms inputs to outputs using a set of weights and biases (its parameters).

Figure 1.1: Feed-forward multilayer perceptron.

Let {W_1, W_2, ..., W_{m-1}} and {b_1, b_2, ..., b_{m-1}} be the set of weight matrices and bias vectors, respectively, representing the parameters of an MLP, where W_i indicates the weights connecting the ith layer and the (i+1)th layer, and b_i indicates the bias vector of the (i+1)th layer. Let the (element-wise) non-linearity applied at the ith layer be Φ_i. Typically, a sigmoid nonlinearity is used at the hidden layers, and a linear or softmax nonlinearity is applied at the output layer. For an input vector f, the MLP output vector o(f) can be expressed as

\[ o(f) = \Phi_m\Big( W_{m-1} \big( \ldots \Phi_3\big( W_2\, \Phi_2( W_1 f + b_1 ) + b_2 \big) \ldots \big) + b_{m-1} \Big). \]

This output should be close to the desired vector d(f), which depends on whether the task is classification or regression. In either case, the MLP is trained to minimize the discrepancy between its output vector o(f) and the desired vector d(f) over the training data. Due to the highly non-linear dependency between input and output, the MLP is trained using stochastic gradient descent. The gradient can be computed efficiently using the error back-propagation algorithm. However, proper regularization of the parameters of an MLP is desirable when the amount of training data is limited or the number of layers in a network is large. Such alternative regularized neural network architectures require learning algorithms that differ from existing methods. The focus of this thesis is to develop such algorithms and apply them to applications such as speech and speaker recognition.
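To make the forward computation concrete, the following is a minimal NumPy sketch of the output o(f) defined above, assuming sigmoid hidden layers and a softmax output layer; the function and variable names are illustrative rather than taken from the thesis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mlp_forward(f, weights, biases):
    """Forward pass o(f) of an m-layer MLP.

    weights[i], biases[i] are W_{i+1}, b_{i+1} connecting layer i+1 to layer i+2.
    Hidden layers use a sigmoid; the output layer uses a softmax.
    """
    y = f
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = W @ y + b                      # weighted sum plus node-specific bias
        y = softmax(x) if i == len(weights) - 1 else sigmoid(x)
    return y
```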

1.2 Scope of the Work

The learning algorithms for training regularized neural networks are generic and applicable to other applications as well. For instance, an MLP can be trained to extract data-driven discriminative features. These features are used in ASR as described in the TANDEM framework [5]. The subspace methods developed in this thesis may form the basis for adapting MLPs using a limited amount of adaptation data. The underlying principles appear attractive for speaker-specific adaptation of neural network based acoustic models for ASR. Additionally, neural network speaker verification systems based on subspace methods yield performance comparable to state-of-the-art systems. This may facilitate further research on the application of neural networks in speaker verification.

1.3 Contributions

The contributions of this thesis are:

1. Proposed MLDA features for recognizing phonemes.
2. Introduced the SMLP classifier, which encourages its hidden layer outputs to be sparse. Further, the SMLP is used to build a phoneme recognition system.
3. Developed a regularized WLS based AANN speaker verification system.
4. Developed FA for adapting the parameters of an AANN in a low-dimensional subspace, and applied it in speaker verification.

1.4 Organization of the Thesis

The block diagram depicting the organization of this thesis is shown in Fig. 1.2. The core of this thesis is to develop learning algorithms for training various regularized neural networks (MLPs). A brief description summarizing the focus of the various chapters is provided below.

In chapter 2, an MLP with multiple hidden layers is used as a classifier to classify phonemes. Specifically, a sparse multilayer perceptron (SMLP) is developed which encourages the outputs of a particular hidden layer to be sparse. This is achieved by adding a sparse regularization term to the original cross-entropy cost function. This chapter also proposes to use multiple linear discriminant analysis (MLDA) based features.

Figure 1.2: Block diagram depicting the thesis organization.

A phoneme recognition system built using the SMLP and various feature representations is shown to perform significantly better than the MLP counterpart.

Chapters 3 and 4 use an MLP (specifically, an Auto-Associative Neural Network (AANN)) as a regression module. The problem of adapting AANNs using a limited amount of data is addressed in these chapters. In chapter 3, a closed-form expression is derived for the adaptation parameters of an AANN with regularization. These parameters are further regularized by projecting them onto a subspace in a weighted least squares (WLS) sense. These techniques are shown to improve the performance of a speaker verification system.

In chapter 4, all adaptation parameters of AANNs are regularized by restricting them to a common low-dimensional subspace using the factor analysis (FA) technique. A speaker verification system based on FA yields better results than the WLS based AANN speaker verification system developed in chapter 3. Our experiments also show that this technique yields an order of magnitude better performance than the existing neural network based speaker verification system.

Finally, the conclusions of this thesis, along with future work, are provided in chapter 5.


Chapter 2

Improved Hybrid Phoneme Recognition System

2.1 Chapter Outline

In a conventional hybrid phoneme recognition system, shown in Fig. 2.1, a multilayer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to estimate the posterior probabilities of phonemes [4]. These estimates are used for decoding the underlying phoneme sequence. First, we propose to derive a spectro-temporal feature representation by applying the multiple linear discriminant analysis (MLDA) technique. Second, we introduce a sparse multilayer perceptron (SMLP) which jointly learns an internal sparse feature representation and nonlinear classifier boundaries to estimate the phoneme posterior probabilities. Experimental results show that the proposed techniques improve the phoneme error rate (PER).

Figure 2.1: Block diagram of a hybrid phoneme recognition system.

2.2 Background

The task of phoneme recognition is to convert a speech waveform into a sequence of underlying sound units known as phonemes. The block diagram of a hybrid phoneme recognition system, consisting of feature extraction, an MLP and hybrid hidden Markov model (HMM) decoding, is shown in Fig. 2.1.

The purpose of feature extraction is to discard information irrelevant for performing the task. Features are usually extracted from the two-dimensional representation of speech such as spectrogram. Depending on the manner in which features are derived, they can be broadly classified into spectral [13,14], temporal [15–17] or spectro-temporal [18–20] features.

These features are used as input to train an MLP. It has been shown [21] that an MLP, trained to minimize the cross-entropy cost between its outputs and hard targets (a hard target vector consists of all zeros except a one at the index corresponding to the phoneme to which the current input feature vector belongs) using a sufficient amount of data, estimates the posterior probabilities of the output classes conditioned on the input feature vector in a discriminative manner. This has led to an extensive use of


MLP in state-of-the-art automatic speech recognition systems [4–6, 22–26].

In order to decode the underlying phoneme sequence, each phoneme is modeled using a left-to-right HMM. The emission likelihood of each HMM state is computed from the MLP posteriors. Viterbi algorithm is applied to decode the phoneme sequence.

2.3 Discriminative Spectro-Temporal Features

It is well known [27] that the information about speech sounds, such as phonemes, is encoded in the spectro-temporal dynamics of speech. Recently, there has been an increased research effort in deriving features that explicitly capture these dynamics [18–20, 28]. Such an approach is primarily motivated by the spectro-temporal receptive field (STRF) model for predicting the response of a cortical neuron to input speech, where the STRF describes the two-dimensional spectro-temporal pattern to which the neuron is most responsive [29].

Most of the works so far have used parametric two-dimensional Gabor filters for extracting features. The parameters of the Gabor functions are optimized to improve the recognition accuracy [18], [28] or grouped into low and high frequency modulations to form various streams of information [19]. Even though multiple spectro-temporal feature streams were formed and combined using MLPs in [19], it is difficult to interpret what each feature stream is trying to achieve. We propose to extract features using a set of two-dimensional filters designed to discriminate each phoneme from the rest of the phonemes, as described below [20, 30].

Figure 2.2: Steps involved in extracting the spectro-temporal representation.

2.3.1 Estimation of Spectro-Temporal Filters

Speech is represented in the spectro-temporal domain (log critical band energies) both for learning the two-dimensional filter shapes and for extracting the features. Fig. 2.2 shows the steps involved in extracting such a representation. It is obtained by first performing a Short Time Fourier Transform (STFT) on the speech signal with an analysis window of length 25 ms and a frame shift of 10 ms. The squared magnitude values of the STFT output in each window are then projected onto a set of overlapping positive weight vectors whose centers are equally spaced on the Bark frequency scale, yielding the spectral energies in various critical bands. Finally, the spectro-temporal representation is obtained by applying a logarithm to the critical band energies.
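As an illustration of this pipeline (25 ms STFT window, 10 ms shift, Bark-spaced positive weights, logarithm), the following Python sketch computes a log critical band energy representation. The Bark formula and the triangular weight shapes are assumptions made for illustration; the thesis does not specify them exactly.

```python
import numpy as np
from scipy.signal import stft

def hz_to_bark(f):
    # One common Bark approximation (illustrative; not specified in the text).
    return 6.0 * np.arcsinh(f / 600.0)

def log_bark_spectrogram(signal, fs=16000, n_bands=19):
    # 25 ms analysis window, 10 ms frame shift, as described in the text.
    nperseg, hop = int(0.025 * fs), int(0.010 * fs)
    freqs, _, Z = stft(signal, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    power = np.abs(Z) ** 2                       # squared-magnitude STFT values

    # Overlapping positive weight vectors with centers equally spaced on the Bark axis.
    bark = hz_to_bark(freqs)
    centers = np.linspace(bark.min(), bark.max(), n_bands + 2)[1:-1]
    width = centers[1] - centers[0]
    weights = np.maximum(0.0, 1.0 - np.abs(bark[None, :] - centers[:, None]) / width)

    crit_band_energy = weights @ power           # (n_bands, n_frames) critical band energies
    return np.log(crit_band_energy + 1e-10)      # log critical band energies
```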

We use the TIMIT database [31] to obtain spectro-temporal patterns corresponding to each phoneme (the 61 hand-labeled phonemes are mapped to a standard set of 39 phonemes [32]). The patterns corresponding to a particular phoneme are derived from the spectro-temporal representation of the training utterances by taking a context of 2N+1 frames centered on every frame that is labeled as this phoneme. In our experiments, each spectro-temporal pattern has 19 critical bands (K) and 21 frames (2N + 1).

In order to derive 2-D filter shapes, we ask the question: what are the directions

(patterns) along which the spectro-temporal patterns of a phoneme are well separated from that of the rest of the phonemes? Fisher Linear Discriminant Analysis (FLDA) gives only one optimal discriminating pattern for each phoneme (two-class problem) due to the rank limitation of its between-class scatter matrix. The resultant low-dimensional feature space (projections of a spectro-temporal pattern on these discriminating patterns) may hinder classification performance.

Modified Linear Discriminant Analysis (MLDA) is a generalization of FLDA that overcomes this limitation by modifying the between-class scatter matrix [33]. It defines the between-class scatter matrix as the weighted sum of the average sample-to-sample scatter. This modification yields multiple solutions (or discriminating patterns) to the generalized eigenvector problem which arises in conventional FLDA.

We apply MLDA to learn multiple discriminating patterns to discriminate the spectro-temporal patterns of a phoneme from those of the rest of the phonemes. MLDA features are obtained by projecting a spectro-temporal patch onto these discriminating patterns, as described in the section below. For instance, 13 discriminating patterns per phoneme would yield a feature vector of length 13 × 39 = 507. Discriminating patterns can be interpreted as spectro-temporal filters by flipping them in time around the center.

2.3.2 Feature Extraction

If S(n, k) denotes the spectro-temporal representation of the speech and h(n, k) characterizes a discriminating pattern, then the corresponding feature f(n) at a particular time n is extracted using (2.1):

\[ f(n) = \sum_{i=-N}^{N} \sum_{k=1}^{K} h(i, k)\, S(n + i, k), \qquad (2.1) \]

where i and k denote the discrete time index (due to the 10 ms shift) and the critical band index, respectively. K represents the total number of critical bands, while the temporal extent (context) of the 2-D filter is given by 2N + 1.
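The following sketch shows how (2.1) could be applied to extract MLDA features from a spectro-temporal representation, assuming the discriminating patterns have already been learned with MLDA; array shapes and names are illustrative.

```python
import numpy as np

def mlda_features(S, filters, N=10):
    """Project each (2N+1)-frame spectro-temporal patch of S (K x T) onto the
    discriminating patterns, following (2.1). `filters` is a list of learned
    patterns, each of shape K x (2N+1)."""
    K, T = S.shape
    feats = np.zeros((T, len(filters)))
    for n in range(N, T - N):
        patch = S[:, n - N:n + N + 1]            # spectro-temporal patch around frame n
        for j, h in enumerate(filters):
            feats[n, j] = np.sum(h * patch)      # f(n) = sum_i sum_k h(i,k) S(n+i,k)
    return feats
```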

2.4 Sparse Multilayer Perceptron (SMLP)

Sparse representations were first observed in the visual area of the mammalian cortex [34, 35], and recently, many pattern classification applications have made use of sparse signal representations [36–42]. Most of these methods treat the sparse representation as features and train an additional classifier for making decisions. However, in only a few instances have sparse representations been optimized in conjunction with the classifier for discriminative classification. Some previous works have attempted to address this issue. For example, a two-class classification problem with a linear or bilinear classifier was considered in [39]. In a different work, Fisher's linear discrimination criterion with sparsity was used [36].

We propose to jointly learn both sparse features and nonlinear classifier boundaries that best discriminate multiple output classes. Specifically, we propose to learn sparse features at the output of a hidden layer of an MLP trained to discriminate multiple output classes. This is achieved by adding a sparse regularization term to the conventional cross-entropy cost between the target values and their predicted values at the output layer. The


parameters of the MLP are learned to minimize the joint cost using the standard backpropagation algorithm which takes the additional sparse regularization term into consideration. The resultant model is referred to as the sparse multilayer perceptron (SMLP). Further, under certain conditions, described in section 2.4.2, the SMLP estimates the Bayesian a posteriori probabilities of the output classes conditioned on the sparse representation.

2.4.1 Theory of SMLP

The notation used is as follows:

m - number of layers (including input and output layers)
N_l - number of neurons (or nodes) in the lth layer
φ_l - output nonlinearity at the lth layer
x_j^l - input to the jth neuron in the lth layer
y_j^l = φ_l(x_j^l) - output of the jth neuron in the lth layer
w_ij^{l-1} - weight connecting the ith neuron in the (l-1)th layer and the jth neuron in the lth layer
d_j - target of the jth neuron in the output layer
e_j = d_j - y_j^m - error of the jth neuron in the output layer

The goal of an SMLP classifier is to jointly learn sparse features at the output of its pth layer and estimate posterior probabilities of multiple classes at its output layer. In the case of MLP, estimates of the posterior probabilities are typically obtained by minimizing the cross-entropy cost between the output layer values (after the softmax) and the hard targets. We modify this cost function for SMLP as follows.

Cost Function

The two objectives of the SMLP are to
• minimize the cross-entropy cost between the output layer values and the hard targets, and
• force the outputs of the pth layer to be sparse for a particular p ∈ {2, 3, ..., m − 1}.

The instantaneous cross-entropy cost (by instantaneous we mean corresponding to a single input pattern) is

\[ L = - \sum_{j=1}^{N_m} d_j \log\left( y_j^m \right). \qquad (2.2) \]

To obtain the SMLP instantaneous cost function, we add an additional sparse regularization term to the cross-entropy cost (2.2), yielding

\[ \tilde{L} = L + \frac{\lambda}{2} \sum_{j=1}^{N_p} \log\left( 1 + (y_j^p)^2 \right), \qquad (2.3) \]

where λ is a positive scalar controlling the trade-off between the sparsity and the cross-entropy cost. The function \(\sum_{j=1}^{N_p} \log\left(1 + (y_j^p)^2\right)\), which is continuous and differentiable everywhere, was successfully used in previous works to obtain a sparse representation [34, 37, 42]. The weights of the SMLP are adjusted to minimize (2.3), as discussed below.
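A small Python sketch of the instantaneous SMLP cost (2.3), combining the cross-entropy term (2.2) with the sparsity penalty on the pth layer outputs; the function and argument names are illustrative.

```python
import numpy as np

def smlp_cost(d, y_out, y_p, lam):
    """Instantaneous SMLP cost (2.3): cross-entropy between hard targets d and
    softmax outputs y_out, plus the sparsity penalty on the p-th layer outputs y_p."""
    cross_entropy = -np.sum(d * np.log(y_out + 1e-12))          # eq. (2.2)
    sparsity = 0.5 * lam * np.sum(np.log(1.0 + y_p ** 2))       # regularization term
    return cross_entropy + sparsity
```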

Error Back-Propagation Training

Stochastic gradient descent is applied to update the SMLP weights. The conventional error back-propagation training algorithm is the result of applying the chain rule of calculus to compute the gradient of the cross-entropy cost (2.2) with respect to the weights.

For training SMLP, the error back-propagation must be modified in order to accommodate the additional sparse regularization term.

We derive update equations for training the SMLP by minimizing the cost function (2.3) with respect to the weights (the bias values at any layer can be interpreted as weights connecting an imaginary node in the previous layer, whose output is unity, to all the nodes in the current layer) over the training data. Since the learning is based on stochastic gradient descent, the key is to determine the gradient of the cost function (2.3) with respect to the weights.

Gradient of L̃ w.r.t. y_j^l

From (2.2) and (2.3), ∀l ∈ {p+1, p+2, ..., m}, ∀j ∈ {1, 2, ..., N_l},

\[ \frac{\partial \tilde{L}}{\partial y_j^l} = \frac{\partial L}{\partial y_j^l}. \qquad (2.4) \]

Using (2.3), for layer p, ∀j ∈ {1, 2, ..., N_p},

\[ \frac{\partial \tilde{L}}{\partial y_j^p} = \frac{\partial L}{\partial y_j^p} + \lambda \left( \frac{y_j^p}{1 + (y_j^p)^2} \right). \qquad (2.5) \]

Using (2.3) and the chain rule of calculus, ∀(l−1) ∈ {2, 3, ..., p−1}, ∀i ∈ {1, 2, ..., N_{l−1}},

\[ \frac{\partial \tilde{L}}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial y_i^{l-1}} \right) = \sum_{j=1}^{N_l} \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) w_{ij}^{l-1}. \qquad (2.6) \]

The above equations (2.4), (2.5) and (2.6) indicate that the gradients of L̃ w.r.t. y_j^l can be computed from the gradients of L w.r.t. y_j^l. Specifically, we need the gradients of L w.r.t. y_j^l, ∀l ∈ {p, p+1, ..., m}, ∀j ∈ {1, 2, ..., N_l}, in order to compute the gradients of L̃ w.r.t. y_j^l, ∀l ∈ {2, 3, ..., m}, ∀j ∈ {1, 2, ..., N_l}. The computation of these gradients is described in Appendix A.

Gradient of L̃ w.r.t. w_ij^{l-1}

By definition, w_ij^{l-1} denotes the weight connecting the ith neuron in the (l−1)th layer and the jth neuron in the lth layer. Thus, by using the chain rule,

\[ \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial w_{ij}^{l-1}} \right) = \left( \frac{\partial \tilde{L}}{\partial y_j^l} \right) \phi_l'\left(x_j^l\right) y_i^{l-1}. \qquad (2.7) \]

Update Equations

SMLP weights are updated using stochastic gradient descent. The gradient (2.7) of the cost function with respect to a particular weight is accumulated over several input patterns (known as the bunch size), and then the weight is updated using

\[ w_{ij}^{l-1} \leftarrow w_{ij}^{l-1} - \eta \left\langle \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} \right\rangle, \qquad (2.8) \]

where η is a small positive learning rate and \(\left\langle \partial \tilde{L} / \partial w_{ij}^{l-1} \right\rangle\) is the accumulated value of the gradient.
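The following sketch illustrates how the sparse term of (2.5) enters an otherwise standard back-propagation pass, assuming a softmax output trained with cross-entropy; the indexing conventions and names are illustrative and do not reproduce the thesis' implementation (which modifies Quicknet).

```python
import numpy as np

def smlp_backprop_deltas(ys, xs, ws, d, p, lam, phi_prime):
    """Backpropagate dL~/dy for each layer of an SMLP, adding lam*y/(1+y^2) at layer p.

    ys[l], xs[l]: outputs/inputs of layer l (ys[0] is the input vector);
    ws[l]: weight matrix from layer l to layer l+1; d: hard target vector."""
    m = len(ys)
    grad_y = [None] * m
    grad_y[m - 1] = ys[m - 1] - d                   # softmax + cross-entropy seed (gradient w.r.t. output pre-activation)
    for l in range(m - 2, 0, -1):                   # propagate towards the input
        upstream = grad_y[l + 1]
        if l + 1 < m - 1:                           # hidden layers: multiply by phi'(x), as in (2.6)
            upstream = upstream * phi_prime(xs[l + 1])
        grad_y[l] = ws[l].T @ upstream              # chain rule through the weights
        if l == p:                                  # sparse regularization term of (2.5)
            grad_y[l] = grad_y[l] + lam * ys[l] / (1.0 + ys[l] ** 2)
    return grad_y
```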

2.4.2 SMLP as a Posterior Probability Estimator

The number of input and output nodes of the SMLP is set equal to the dimensionality of its input acoustic feature vector and the number of output phoneme classes, respectively. A softmax nonlinearity is used at its output layer, and the weights are adjusted to minimize (2.3) when hard targets are used. Note from equations (2.4), (2.5), (2.6) and (2.7) that the sparse regularization term affects the update of only those weights w_ij^l, ∀l ∈ {1, 2, ..., p−1}.

Figure 2.3: Block diagram of the phoneme recognition system.

This implies that the weights w_ij^l, ∀l ∈ {p, p+1, ..., m−1}, can

be adjusted to minimize the cross-entropy term of (2.3) without affecting the sparse regularization term. If p < m − 1 and one of the hidden layers between pth and mth layers is sigmoidal (nonlinear) then the pth layer outputs can be nonlinearly transformed to the SMLP outputs. Therefore, in such a case, SMLP estimates the posterior probabilities of output classes conditioned on the pth layer outputs (sparse representation). This follows from the fact that an MLP with a single nonlinear hidden layer estimates the posterior probabilities of output classes conditioned on the input features [21], and SMLP outputs are completely determined by the outputs of pth layer (hidden).

2.5 System Description

The block diagram of the phoneme recognition system used in our experiments is shown in Fig. 2.3. The various components of the system are described below.

2.5.1 Database

Phoneme recognition experiments are conducted on the TIMIT database [31]. It consists of 630 speakers with 10 utterances per speaker sampled at 16 kHz. The two SA dialect sentences per speaker are excluded from the setup as they are identical across all the speakers. The original TIMIT train and test sets consist of 462 and 168 speakers respectively [31]. We further divide the original train set into training and validation sets having 425 and 37 speakers, and keep the original test set unchanged. Thus in all our experiments, the training, validation and test sets consist of 3400, 296 and 1344 utterances from 425, 37 and 168 speakers, respectively.

2.5.2 Feature Streams

To test the proposed SMLP classifier, we developed systems using three different feature streams, namely PLP cepstral coefficients [13], FDLP temporal features [17] and MLDA spectro-temporal features [20]. These features are extracted for every 10 ms of speech, and they are normalized for speaker-specific mean and variance. A detailed description of each feature stream is provided below.

PLP Cepstral Coefficients

A Short Time Fourier Transform (STFT) is applied to the speech signal with an analysis window of length 25 ms and a frame shift of 10 ms. The squared magnitude values of the STFT output are then projected onto a set of frequency weights which are equally spaced on the Bark frequency scale to obtain the spectral energies in various critical bands. Transformations such as equal loudness and cubic root compression are applied to reduce the dynamic range.

The resultant spectral envelopes are smoothed by twelfth order linear prediction analysis [13]. The top 13 cepstral coefficients are concatenated with the corresponding delta and delta-delta features to obtain a 39-dimensional feature vector. A nine-frame context of these vectors is used as the input PLP feature stream.

FDLP Temporal Features

Speech is transformed to the frequency domain by applying the discrete cosine transform (DCT) to the full utterance. The full-band DCT signal is divided into multiple critical band DCT signals by multiplying with a set of Gaussian windows centered on the critical bands. Linear prediction analysis is performed on each critical band DCT signal to obtain smooth sub-band temporal envelopes. These temporal envelopes are passed through nonlinearities such as logarithmic and adaptive compression loops. The resultant compressed sub-band envelopes are divided into 200 ms segments with a shift of 10 ms. A DCT is applied on each segment to derive the features. The first 14 DCT coefficients are concatenated from each sub-band to form the FDLP temporal feature stream [17].

MLDA Spectro-Temporal Features

The description of the MLDA features can be found in section 2.3. A set of spectro-temporal discriminative patterns is designed using the modified linear discriminant analysis (MLDA) to discriminate each phoneme from the rest of the phonemes. The number of discriminating patterns per phoneme is chosen to be 13 to maximize the phoneme recognition accuracy on the validation data. Projections of a given spectro-temporal patch onto these discriminative patterns are concatenated to form the MLDA feature stream [20, 30].

2.5.3 Hierarchical SMLP

Hierarchies of multilayer perceptron (MLP) classifiers have been shown to be useful for acoustic modeling in speech recognition [43–45], model adaptation [46] and language identification [47]. A hierarchical MLP consists of two MLPs in series which are trained sequentially. The first MLP uses standard acoustic feature vectors to estimate the posterior probabilities of various output classes such as phonemes. The second MLP is then trained on the same targets using long temporal spans of the posterior probabilities estimated by the first MLP as inputs.

Block diagram of the hierarchical SMLP is shown in Fig. 2.4. Initially, a four layer SMLP is trained to estimate the 3-state phoneme posterior probabilities. Subsequently, another three layer MLP is trained on a long temporal span of these posteriors to estimate the single state phoneme posterior probabilities. Both these networks are initialized randomly using uniform noise and trained using back-propagation. We have modified the Quicknet package [48] (software for MLP training) to perform SMLP training.

Estimation of 3-state Phoneme Posteriors

The 61 hand-labeled TIMIT phone symbols are mapped to 49 phoneme classes by treating each of the following sets of phonemes as a single class: {/tcl/, /pcl/, /kcl/}, {/gcl/, /dcl/, /bcl/}, {/h#/, /pau/}, {/eng/, /ng/}, {/axr/, /er/}, {/axh/, /ah/}, {/ux/, /uw/}, {/nx/, /n/}, {/hv/, /hh/}, and {/em/, /m/}.

Figure 2.4: Block diagram of hierarchical SMLP. Though both networks are fully connected, only a portion of the connections is shown for clarity.

As shown in Fig. 2.4, the SMLP used for estimating the 3-state phone posterior probabilities consists of four layers (m = 4): an input layer to receive a given feature stream, two hidden layers with a sigmoid nonlinearity, and an output layer with a softmax nonlinearity. The number of nodes in the input and output layers is set equal to the dimensionality of the input feature vector and the number of phoneme states (i.e., 49 × 3 = 147), respectively. The outputs of the first hidden layer (p = 2) are forced to be sparse, with the number of nodes in it being the same as that of the input layer. The number of nodes in the second hidden layer is chosen to be 1000. For each feature stream, the value of λ in the SMLP cost function (2.3) is chosen to minimize the phoneme error rate (PER) on the validation data.

In the first pass of SMLP classifier training, 3-state hard phoneme targets are obtained by segmenting each phoneme in the training data equally into three states, i.e., start, middle and end. This classifier is retrained in a second pass using the hard targets corresponding to the best state alignment obtained by applying the Viterbi algorithm on the

3-state posterior probability estimates of the first pass. Frame classification accuracy on the validation set is used to control the learning rate and to terminate training.

In order to gauge the effect of the sparse regularization term λ, an identically configured four layer MLP with λ = 0 is also trained to estimate 3-state phoneme posteriors. For an additional comparison, we also estimate the 3-state phoneme posteriors using a conventional three layer MLP which has a sigmoid nonlinearity at the hidden layer and a softmax nonlinearity at the output layer. The number of hidden layer nodes in this system are chosen such that the total number of parameters match approximately that of the SMLP.

Hierarchical Estimation of Posterior Probabilities

As shown in Fig. 2.4, the 3-state phoneme posterior probability estimates are mapped to single state phoneme posterior probability estimates by training an MLP which operates on a context of 230 ms, or 23 posterior probability vectors. Its hidden layer consists of 3500 nodes with a sigmoid nonlinearity, and its output layer consists of 49 nodes with a softmax nonlinearity.

2.5.4 Dempster-Shafer Combination

Hierarchically estimated posterior probabilities corresponding to various feature streams are combined as described in [49] using the Dempster-Shafer (DS) theory of evidence [50]. First we combine posteriors of PLP and FDLP feature streams, and then the resultant posteriors are combined with posteriors of MLDA feature stream.

2.5.5 Hybrid HMM Decoding

The 49 phoneme classes are mapped to 39 phoneme classes for decoding [32] (the appropriate subsets of the 49 phoneme posterior probability estimates are summed to obtain the 39 phoneme probability estimates). The posterior probabilities of the phoneme classes are converted to scaled likelihoods by dividing them by the corresponding prior probabilities of the phonemes obtained from the training data. A 3-state HMM (connected from left to right) with equal self-transition and state transition probabilities is used to model each phoneme. The emission likelihood of each state is set to be the scaled likelihood. A bigram phonotactic language model is used in all the experiments. Finally, the Viterbi algorithm is applied for decoding the phoneme sequence. The PER is obtained by comparing the decoded phoneme sequence against the reference sequence. While evaluating the performance on the test set, the language model scaling factor is chosen to minimize the PER of the validation data.
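A minimal sketch of the posterior-to-scaled-likelihood conversion used before Viterbi decoding; the array shapes and names are illustrative.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-12):
    """Convert per-frame phoneme posteriors (T x 39) into scaled log-likelihoods
    log p(q|x) - log p(q), used as HMM state emission scores before Viterbi decoding."""
    return np.log(posteriors + floor) - np.log(priors + floor)
```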

2.6 Experimental Results

Table 2.1 shows the PER of the proposed SMLP based hierarchical hybrid system and the baseline MLP based hierarchical hybrid systems for various feature streams on the TIMIT test set. As described earlier in section 2.5.3, the proposed and baseline systems differ only in the way the 3-state phoneme posteriors are estimated. Results indicate that the four layer MLP based system performs better than the conventional three layer MLP based system for each feature stream. Moreover, the SMLP based system outperforms the baseline four layer MLP based system for each feature stream. This improved performance can be attributed to the sparse regularization term.

Table 2.1: PER (in %) on TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column indicates the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.

                    PLP    FDLP   MLDA   PLP+FDLP+MLDA
  MLP (3 layers)    22.9   23.2   22.8   20.5
  MLP (4 layers)    22.6   22.8   22.4   20.1
  SMLP (4 layers)   21.9   22.1   21.9   19.6

To exploit the complementary nature of these acoustic feature streams, they are combined at the hierarchical posterior probability level as described in section 2.5.4. The system combination results are shown in the last column of Table 2.1. It can be observed that the combination of SMLP based systems yields a PER of 19.6%, a relative improvement of 2.5% over the combination of the four layer MLP (i.e., λ = 0) based systems. These results are statistically significant with a p-value of less than 0.0004. Further, the information transferred (computed for the confusion matrix of the test set) for the SMLP and MLP (4 layers) based systems is 3.46 and 3.42 bits, respectively. On the TIMIT core test set consisting of 192 utterances (a subset of the test set provided by LDC [31]), we obtain a PER of 20.7% using the combination of SMLP based systems. This performance compares well with existing state-of-the-art systems. (Note that some TIMIT phoneme recognition systems use part of the original test set as a validation set [40, 51, 52]; however, as mentioned earlier, we kept the original test set unchanged, as in [53].)

2.7 Analysis

First, we quantify the sparsity of the pth hidden layer outputs using the following measure κ [54]:

\[ \kappa = \frac{\sqrt{N_2} \;-\; \left( \sum_{i=1}^{N_2} |y_i^p| \right) \Big/ \sqrt{\sum_{i=1}^{N_2} (y_i^p)^2}}{\sqrt{N_2} - 1}, \qquad (2.9) \]

where y_i^p represents the output of node i in layer p of the SMLP. Furthermore,

0 ≤ κ ≤ 1, and the value of κ is one for maximally sparse and close to zero for minimally sparse representations. Table 2.2 lists the average κ value of the first hidden layer outputs over the validation data for various phoneme recognition systems. As expected, the SMLP based systems have significantly higher κ values than the four layer MLP systems, which indicates the effectiveness of the sparse regularization term.

Table 2.2: Average measure of sparsity of the first hidden layer outputs of SMLP and four layer MLP for various feature streams.

                    PLP     FDLP    MLDA
  MLP (4 layers)    0.275   0.278   0.282
  SMLP (4 layers)   0.496   0.552   0.540
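For reference, the sparsity measure (2.9) can be computed as in the following sketch; the small epsilon is added only for numerical safety and is not part of the original definition.

```python
import numpy as np

def sparsity_kappa(y_p):
    """Sparsity measure (2.9) of a hidden-layer output vector y_p of length N2:
    close to 1 for a maximally sparse vector, near 0 for a minimally sparse one."""
    n = len(y_p)
    l1 = np.sum(np.abs(y_p))
    l2 = np.sqrt(np.sum(y_p ** 2))
    return (np.sqrt(n) - l1 / (l2 + 1e-12)) / (np.sqrt(n) - 1.0)
```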

Second, we experimentally verify whether sparse features tend to be more linearly separable than their non-sparse counterparts. After training the hierarchical system for each feature stream, the first hidden layer outputs of the SMLP or four layer MLP classifier are used as input features for training a single layer perceptron (a linear classifier) to estimate the 3-state phoneme posterior probabilities. The second MLP in the hierarchy remains unchanged.

Table 2.3 shows the PER of the resulting hierarchical system for various acoustic feature streams. The linear classifier is able to model the SMLP features better than it models the MLP features.

Table 2.3: PER (in %) on TIMIT test set for various acoustic feature streams when the 3-state phoneme posteriors are obtained using a single layer perceptron trained on the first hidden layer outputs of the SMLP or MLP classifier.

                                               PLP    FDLP   MLDA
  Hidden layer features from MLP (4 layers)    26.9   27.0   26.5
  Hidden layer features from SMLP (4 layers)   25.0   25.5   24.6

Finally, results using only a single multilayer perceptron (without hierarchy) are analyzed. A single multilayer perceptron (SMLP or MLP) is trained directly to estimate the single state phoneme posterior probabilities, which are decoded as described in section 2.5.5. The value of λ in the SMLP cost function is optimized on the validation data. Table 2.4 summarizes the PER for the various feature streams. It can be observed from this table that the SMLP system consistently outperforms the corresponding baseline MLP systems.

Table 2.4: PER (in %) on TIMIT test set using a single multilayer perceptron (without hierarchy).

                    PLP    FDLP   MLDA
  MLP (3 layers)    27.2   26.3   27.1
  MLP (4 layers)    27.3   27.5   27.0
  SMLP (4 layers)   26.6   25.8   26.0

Chapter 3

Weighted Least Squares based Auto-Associative Neural Networks for Speaker Verification

3.1 Chapter Outline

An Auto-Associative Neural Network (AANN) [55] is a fully connected feed-forward neural network trained to reconstruct the input at its output through a hidden bottleneck layer. Existing AANN based speaker verification systems [9, 10] use the reconstruction error difference, computed using the universal background model (UBM) AANN and the speaker-specific AANN models, as a score for making decisions, as shown in Fig. 3.1. The UBM-AANN is obtained by training an AANN on multiple held-out speakers, whereas the speaker-specific AANN is obtained by adapting (or retraining) the UBM-AANN using the corresponding speaker data.

Figure 3.1: Block schematic of the AANN based speaker verification system.

In this chapter, we propose to project the speaker-specific AANN parameters onto a low-dimensional subspace, and build a probabilistic linear discriminant analysis (PLDA) model on resultant vectors in the subspace to perform hypothesis testing. The low-dimensional subspace is learned using large amounts of development data to preserve most of the variability of speaker-specific AANN parameters in a weighted least squares (WLS) sense. Experimental results show that the proposed WLS based AANN speaker verification system outperforms the existing AANN speaker verification system on NIST-08 speaker recognition evaluation.

The remainder of this chapter is organized as follows. AANNs are introduced in the next section. Section 3.3 describes the previously proposed speaker verification system.

Figure 3.2: Auto-Associative Neural Network.

The proposed WLS formulation of AANNs is presented in section 3.4. The speaker verification system based on WLS of AANNs is described in section 3.5. Finally, experimental results are shown in section 3.6.

3.2 Auto-Associative Neural Networks

An AANN is a fully connected feed-forward neural network with a hidden compression layer, shown in Fig. 3.2, trained for the auto-encoding task [55]. Five layer AANNs are used in all our experiments. This architecture consists of three non-linear hidden layers between the linear input and output layers. The second hidden layer contains fewer nodes than the input layer and is known as the compression layer. The AANN is used as an alternative to the GMM

for modeling the distribution of data [9]; some of its advantages are that it relaxes the assumption that feature vectors are locally normal, and that it can capture higher order moments. For an input vector f, the network produces an output f̂(f, Θ) which depends both on the input f and the parameters Θ of the network (the set of weights and biases). For simplicity, we denote the network output as f̂. While training the network, the parameters Θ are adjusted to minimize, typically, the average squared error cost between the input f and the output f̂ over the training data, as in (3.1). The network is trained using stochastic gradient descent, where the gradient is computed using the error back-propagation algorithm:

\[ \min_{\Theta} \; E\left[ \, \| f - \hat{f} \|^2 \, \right]. \qquad (3.1) \]

Once this network is well trained, the average reconstruction error of input vectors that are drawn from the distribution of the training data will be small compared to that of vectors drawn from different distributions [9].
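The following PyTorch sketch illustrates the 39-20-6-39-39 architecture of Fig. 3.2 and the squared-error objective (3.1). It is only an illustration of the model and cost; the thesis trains AANNs with a modified Quicknet package, not PyTorch, and the training schedule shown here is an assumption.

```python
import torch
import torch.nn as nn

# Five-layer AANN (39-20-6-39-39): linear input/output, tanh hidden layers.
aann = nn.Sequential(
    nn.Linear(39, 20), nn.Tanh(),   # first hidden layer
    nn.Linear(20, 6),  nn.Tanh(),   # compression layer
    nn.Linear(6, 39),  nn.Tanh(),   # third hidden layer
    nn.Linear(39, 39),              # linear output layer
)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(aann.parameters(), lr=0.01)

def train_step(features):           # features: (batch, 39) acoustic frames
    optimizer.zero_grad()
    loss = criterion(aann(features), features)  # reconstruction error, eq. (3.1)
    loss.backward()                  # gradients via error back-propagation
    optimizer.step()
    return loss.item()
```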

3.3 Speaker Verification using AANNs

The goal of speaker verification is to verify whether a given utterance belongs to a claimed speaker, based on a sample utterance from the claimed speaker. In other words, the task is to verify whether the two utterances of a speaker verification trial belong to the same speaker or not. The block diagram of the previously proposed AANN based speaker verification system [9, 10] is shown in Fig. 3.1. The various components

of this system are described below.

3.3.1 Feature Extraction

The acoustic features used in our experiments are 39 dimensional frequency domain linear prediction (FDLP) features [56–59]. In this technique, sub-band temporal envelopes of speech are first estimated in narrow sub-bands (96 linear bands). These sub-band envelopes are then gain normalized to remove reverberation and channel artifacts. After normalization, the frequency axis is warped to 37 Mel bands in the frequency range of 125-3800 Hz to derive a gain normalized mel scale energy representation of speech. These mel band energies are converted to cepstral coefficients by applying a log and Discrete Cosine Transform (DCT). The top 13 cepstral coefficients along with derivative and acceleration components are used as features, yielding 39 dimensional feature vectors. Finally, a subset of these feature vectors corresponding to speech are selected based on the voice activity detection information provided by NIST.

3.3.2 UBM-AANN

The concept of a UBM was introduced in [60], where a GMM trained on data from multiple speakers is used as the UBM. In our work, UBMs are obtained by training AANNs on development data consisting of multiple speakers (described below) [9]. Gender-specific AANN based UBMs are trained on a telephone development data set consisting of audio from the NIST 2004 speaker recognition database, the Switchboard-2 Phase III corpora and the NIST 2005 speaker recognition database. We use only 400 male and 400 female utterances, each corresponding to about 17 hours of speech.

3.3.3

Speaker-Specific AANN

A speaker-specific AANN model is obtained by retraining the entire UBM-AANN using the corresponding speaker data [9]. However, we have observed that the performance can be improved by retraining only the weight matrix connecting third hidden layer and output layer. This may be due to limited amount of speaker-specific data. Thus, an improved baseline is used in our experiments, where a speaker-specific AANN is obtained by adapting only UBM-AANN weights that impinge on the output layer.

3.3.4

Score Computation

During the test phase, the average reconstruction error of the test data is computed under both the UBM-AANN and the claimed speaker AANN models. The final score of a trial is computed as the difference between these average reconstruction errors. In the ideal case, if the claim is verified, the average reconstruction error of the test data is large under the UBM-AANN model than the claimed speaker AANN model, and vice versa.

32

CHAPTER 3. WEIGHTED LEAST SQUARES BASED AUTO-ASSOCIATIVE NEURAL NETWORKS FOR SPEAKER VERIFICATION

3.4

Weighted Least Squares of AANNs Conventional speaker verification systems use likelihood ratio between GMM based

UBM and its maximum a posteriori (MAP) adapted speaker-specific model for making decision [60]. More recently proposed GMM factor analysis techniques use a low-dimensional subspace(s) in part of the GMM parameter space to model the speaker and channel variabilities [61–65], and extract coordinates, known as i-vector [63], corresponding to a given utterance of a speaker in the low-dimensional subspace. The i-vectors are treated as features while training the probabilistic linear discriminant analysis (PLDA) for hypothesis testing [66, 67].

In this section, we propose to learn a low-dimensional subspace which preserves most of the variability of the adapted weight matrices of speaker-specific AANNs in a weighted least squares (WLS) sense. The resultant low-dimensional representation of an utterance (also known as i-vector) is obtained by projecting the corresponding adapted weight matrix onto the subspace.

Subspace Modeling of AANNs The subspace modeling of adapted weight matrices is formulated as follows. The development data consists of m speakers with l(s) sessions for sth speaker. The weight matrix connecting third hidden layer and output layer of UBM-AANN is adapted for each session of a speaker to obtain speaker and session specific AANN model. We denote the adapted vectorized weights (a closed-form expression for adaptation is derived in section 3.5.1) corresponding to lth session and sth speaker as wl,s , and number of frames with nl,s . 33

CHAPTER 3. WEIGHTED LEAST SQUARES BASED AUTO-ASSOCIATIVE NEURAL NETWORKS FOR SPEAKER VERIFICATION Let us define mean w and covariance1 Σ of wl,s as  −1 l(s) l(s) m X m X X X   w = nl,s nl,s wl,s , s=1 l=1

s=1 l=1

s=1 l=1

s=1 l=1

 −1 l(s) l(s) m X m X X X   Σ = nl,s nl,s (wl,s − w) (wl,s − w)T . We model wl,s using a low dimensional affine subspace parameterized by a matrix T i.e., wl,s ≈ w + Tql,s , where ql,s represents the unknown i-vector associated with the lth session of sth speaker. To find T, the following weighted least squares cost function is minimized with respect to its arguments: ¡ ¢ L T, q1,1 , . . . , ql(m),m =

l(s) m X X

k [wl,s − (w + Tql,s )] k2Σ−1 nl,s

s=1 l=1 l(s) m X X ¢ ¡ ¡ ¢ λ tr TT Σ−1 nl,s T + qTl,s ql,s + s=1 l=1

|

{z

(3.2)

}

regularization term

where $\|\cdot\|_A$ denotes the norm given by $\|x\|_A^2 = x^T A x$, and $\lambda$ is a small positive constant. Differentiating (3.2) with respect to $q_{l,s}$ and setting the result equal to zero yields
\[
\frac{\partial L}{\partial q_{l,s}} = 0 \;\Rightarrow\; -T^T \Sigma^{-1} n_{l,s} \left[ w_{l,s} - (\bar{w} + T q_{l,s}) \right] + q_{l,s} = 0
\;\Rightarrow\; q_{l,s} = \left( I + T^T \Sigma^{-1} n_{l,s} T \right)^{-1} T^T \Sigma^{-1} n_{l,s}\, (w_{l,s} - \bar{w}). \tag{3.3}
\]

Differentiating (3.2) with respect to $T$ and setting the result equal to zero yields
\[
\frac{\partial L}{\partial T} = 0 \;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left( \Sigma^{-1} n_{l,s}\, T q_{l,s} q_{l,s}^T - \Sigma^{-1} n_{l,s} \left[ w_{l,s} - \bar{w} \right] q_{l,s}^T + \lambda \Sigma^{-1} n_{l,s}\, T \right) = 0
\;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \Sigma^{-1} n_{l,s}\, T \left( \lambda I + q_{l,s} q_{l,s}^T \right) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \Sigma^{-1} n_{l,s} \left[ w_{l,s} - \bar{w} \right] q_{l,s}^T. \tag{3.4}
\]

The solution for T is obtained by coordinate descent, i.e., (3.3) and (3.4) are solved alternately. For a given T, we first find the i-vectors {q_{1,1}, ..., q_{l(m),m}} using (3.3). In the next step, we solve for T in (3.4) using the i-vectors {q_{1,1}, ..., q_{l(m),m}} found in the previous step. This procedure is repeated until convergence.
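A NumPy sketch of this alternating procedure is shown below, assuming a diagonal Sigma (only the diagonal of Sigma is used, as noted above). Because the same Sigma^{-1} factor appears on the left of every term in (3.4), it cancels when solving for T; that rearrangement is an algebraic convenience of this sketch rather than a description of the thesis implementation, and all names are illustrative.

    import numpy as np

    def train_wls_subspace(W, n, Sigma_diag, R=240, lam=1e-3, n_iter=10, seed=0):
        # W          : (N, D) adapted, vectorized weight matrices, one row per utterance
        # n          : (N,)   frame counts n_{l,s}
        # Sigma_diag : (D,)   diagonal of the covariance Sigma
        # R          : subspace dimensionality (i-vector size)
        rng = np.random.default_rng(seed)
        N, D = W.shape
        w_bar = (n[:, None] * W).sum(0) / n.sum()          # frame-weighted mean
        Wc = W - w_bar                                     # centred weights
        inv_sig = 1.0 / Sigma_diag
        T = 0.001 * rng.standard_normal((D, R))

        for _ in range(n_iter):
            # Step 1: i-vectors via (3.3)
            Q = np.empty((N, R))
            for u in range(N):
                A = np.eye(R) + n[u] * (T.T * inv_sig) @ T
                bvec = n[u] * (T.T * inv_sig) @ Wc[u]
                Q[u] = np.linalg.solve(A, bvec)
            # Step 2: T via (3.4); the common Sigma^{-1} cancels on both sides
            lhs = (lam * n.sum()) * np.eye(R) + (Q.T * n) @ Q   # sum_u n_u (lam I + q q^T)
            rhs = (Wc.T * n) @ Q                                # sum_u n_u (w - w_bar) q^T
            T = rhs @ np.linalg.inv(lhs)
        return T, w_bar

    def extract_ivector(T, w_bar, Sigma_diag, w, n_frames):
        # i-vector for a single utterance via (3.3).
        inv_sig = 1.0 / Sigma_diag
        A = np.eye(T.shape[1]) + n_frames * (T.T * inv_sig) @ T
        return np.linalg.solve(A, n_frames * (T.T * inv_sig) @ (w - w_bar))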

The above update equations can be compared with the total variability space training of GMMs [63]. Note that (3.3) and (3.4) resemble the maximum likelihood (ML) update equations in [63], except for the λI term in (3.4).

3.5  WLS based AANN Speaker Verification System

The block diagram of the WLS based AANN speaker verification system is shown in Fig. 3.3. This system uses the FDLP features described in section 3.3.1. The description of the UBM-AANN can be found in section 3.3.2. The remaining components of the system are described below.


[Figure 3.3 shows the block diagram of the system: FDLP features are used to adapt the UBM-AANN; the adapted weights are converted to an i-vector, which is passed to hypothesis testing to produce the trial scores. UBM-AANN training, T-matrix training, and PLDA training are the development-time stages.]

Figure 3.3: WLS based AANN speaker verification system.

3.5.1  Closed-form Expression for Adapting UBM-AANN

The weight matrix connecting the third hidden layer and the output layer of the UBM-AANN is adapted for each utterance to obtain a speaker-specific model. It is possible to derive a closed-form solution for the speaker-specific weight matrix W_{l,s} connecting the third hidden and output layers. The output bias vector b of the UBM-AANN is not adapted.

Let $f_{i,l,s}$ be the ith feature vector (frame) of the utterance corresponding to the lth session of the sth speaker, and $n(l,s)$ be the number of such frames in that utterance. The third hidden layer output vector of the UBM-AANN for this input is denoted by $h_{i,l,s}$. The loss function (3.5) is minimized to obtain the speaker-specific weight matrix $W_{l,s}$ corresponding to the lth session of the sth speaker, where $\beta$ is non-negative and controls the amount of regularization:
\[
L(W_{l,s}) = \sum_{i=1}^{n(l,s)} \left[ \left\| f_{i,l,s} - b - W_{l,s} h_{i,l,s} \right\|_2^2 + \beta\, \mathrm{tr}\!\left( W_{l,s} W_{l,s}^T \right) \right]. \tag{3.5}
\]
Differentiating the expression above with respect to $W_{l,s}$ and setting it to zero yields
\[
\frac{\partial L(W_{l,s})}{\partial W_{l,s}} = 0 \;\Rightarrow\; \sum_{i=1}^{n(l,s)} \left[ \left( h_{i,l,s} h_{i,l,s}^T + \beta I \right) W_{l,s}^T - h_{i,l,s} (f_{i,l,s} - b)^T \right] = 0,
\]
\[
W_{l,s} = \left[ \sum_{i=1}^{n(l,s)} (f_{i,l,s} - b)\, h_{i,l,s}^T \right] \left[ \sum_{i=1}^{n(l,s)} \left( h_{i,l,s} h_{i,l,s}^T + \beta I \right) \right]^{-1}. \tag{3.6}
\]
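A direct NumPy transcription of (3.6) might look as follows; the array names are illustrative, and the per-frame sums are expressed as matrix products.

    import numpy as np

    def adapt_output_weights(H, F, b, beta=0.005):
        # H : (n_frames, d') third hidden layer outputs h_{i,l,s} of the UBM-AANN
        # F : (n_frames, d)  input feature vectors f_{i,l,s}
        # b : (d,)           output bias of the UBM-AANN (kept fixed)
        n = H.shape[0]
        num = (F - b).T @ H                                  # sum_i (f_i - b) h_i^T
        den = H.T @ H + n * beta * np.eye(H.shape[1])        # sum_i (h_i h_i^T + beta I)
        return num @ np.linalg.inv(den)                      # W_{l,s}, shape d x d'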


3.5.2

T-matrix Training

Gender-dependent low-dimensional subspaces (T matrices) are trained to capture most of the variability of the adapted weights in a WLS sense, as described in section 3.4. The development data for training the subspaces consists of Switchboard-2, Phases II and III; Switchboard Cellular, Parts 1 and 2; and the NIST 2004-2005 SRE [62]. The total number of male and female utterances is 12266 and 14936, respectively.

3.5.3

i-vectors

Each utterance is converted to an i-vector using (3.3) with the appropriate gender-specific T matrix. All i-vectors are normalized to have unit length to reduce the mismatch between training and testing [68].
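Length normalization itself is a one-line operation; a possible NumPy form (names illustrative):

    import numpy as np

    def length_normalize(Q, eps=1e-12):
        # Scale each i-vector (row of Q) to unit Euclidean length [68].
        return Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)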

3.5.4

PLDA training

PLDA is a generative model for observations [66, 67], in our case i-vectors. The i-vectors are assumed to be generated as
\[
q_{l,s} = \mu + \Phi \beta_s + \epsilon_{l,s}, \tag{3.7}
\]
where $\mu$ is an offset; $\Phi$ is a matrix with fewer columns than rows, representing a low-dimensional subspace in the i-vector space; $\beta_s$ is a latent identity variable having a normal distribution with zero mean and identity covariance matrix; and $\epsilon_{l,s}$ is a residual noise term assumed to be Gaussian with zero mean and full covariance matrix $\Sigma_\epsilon$. Additionally, these variables are assumed to be independent.

Gender-specific PLDA models are trained using the same development data that is used for training the T matrices (see Section 3.5.2). The maximum likelihood estimates of the model parameters $\{\mu, \Phi, \Sigma_\epsilon\}$ are obtained using an Expectation-Maximization (EM) algorithm [66].

3.5.5

Hypothesis Testing

Given two i-vectors $q_1$, $q_2$ of a speaker verification trial, we need to test whether they belong to the same speaker ($H_s$) or to different speakers ($H_d$). For the Gaussian PLDA above, the log-likelihood ratio can be computed in closed form as
\[
\mathrm{score} = \log \frac{p(q_1, q_2 \mid H_s)}{p(q_1 \mid H_d)\, p(q_2 \mid H_d)}
= \log \frac{\mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};
\begin{bmatrix} \mu \\ \mu \end{bmatrix},
\begin{bmatrix} \Phi\Phi^T + \Sigma_\epsilon & \Phi\Phi^T \\ \Phi\Phi^T & \Phi\Phi^T + \Sigma_\epsilon \end{bmatrix} \right)}
{\mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};
\begin{bmatrix} \mu \\ \mu \end{bmatrix},
\begin{bmatrix} \Phi\Phi^T + \Sigma_\epsilon & 0 \\ 0 & \Phi\Phi^T + \Sigma_\epsilon \end{bmatrix} \right)}, \tag{3.8}
\]
where $\mathcal{N}(\cdot; \eta, \Lambda)$ is a multivariate Gaussian density with mean $\eta$ and covariance $\Lambda$. The above score can be computed efficiently as described in [64, 68].
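The following sketch evaluates (3.8) directly by forming the two joint Gaussians; it is intended only to make the scoring rule explicit, whereas the efficient evaluations of [64, 68] avoid building the stacked covariance matrices. Names are illustrative, and SciPy's multivariate normal is used for the log-densities.

    import numpy as np
    from scipy.stats import multivariate_normal

    def plda_llr(q1, q2, mu, Phi, Sigma_eps):
        # Log-likelihood ratio of same-speaker vs. different-speaker hypotheses.
        d = len(mu)
        B = Phi @ Phi.T                    # between-speaker covariance Phi Phi^T
        W = B + Sigma_eps                  # total covariance of a single i-vector
        x = np.concatenate([q1, q2])
        m = np.concatenate([mu, mu])
        C_same = np.block([[W, B], [B, W]])                 # shared identity variable
        C_diff = np.block([[W, np.zeros((d, d))],
                           [np.zeros((d, d)), W]])          # independent i-vectors
        return (multivariate_normal.logpdf(x, m, C_same)
                - multivariate_normal.logpdf(x, m, C_diff))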

3.6  Experimental Results

Speaker verification systems are tested on the telephone conditions of the NIST-2008 speaker recognition evaluation (SRE). Table 3.1 describes the various telephone conditions. Table 3.2 lists the EER and minimum detection cost function (minDCF) on NIST-2008 for the baseline AANN (see section 3.3) and WLS based AANN (see section 3.5) speaker verification systems. These neural network based systems use the same UBM-AANN of size (39, 20, 6, 39, 39), where each number indicates the number of nodes in the corresponding layer.


Table 3.1: Description of various telephone conditions of NIST-08.

    C6    Telephone speech in training and test
    C7    English language telephone speech in training and test
    C8    English language telephone speech spoken by a native U.S. English speaker in training and test

Table 3.2: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

    System                                         C6            C7            C8
    Baseline AANN                                  19.5 (87.3)   15.8 (73.0)   15.2 (74.8)
    PCA of AANNs, β = 0, 240 dim. i-vector         13.4 (67.6)    8.7 (41.4)    7.9 (39.4)
    WLS of AANNs, β = 0, 240 dim. i-vector         12.2 (66.2)    7.2 (38.3)    6.4 (35.6)
    WLS of AANNs, β = 0.005, 240 dim. i-vector     10.7 (59.6)    5.5 (28.3)    4.4 (24.1)

The error rates of the baseline AANN system are shown in the first row of Table 3.2. In this case, the difference between the average reconstruction errors under the UBM and a given speaker-specific model is used as the score for making a decision [11]. The second row of the table lists the error rates when PCA is applied instead of WLS to reduce the dimensionality of the speaker-specific weights; the resultant PCA based i-vectors are modeled using PLDA. The next two rows of Table 3.2 show the error rates of the WLS based AANN speaker verification system, which uses gender-dependent 150 dimensional (number of columns of Φ) subspace PLDA models in a 240 dimensional i-vector space. The speaker-specific weights W_{l,s} are derived using (3.6) for different values of β.

These results suggest that the proposed AANN based i-vector/PLDA framework outperforms the baseline AANN speaker verification system. Additionally, the proposed WLS formulation for obtaining i-vectors yields better results than simple PCA. Moreover, regularizing the speaker-specific weights (β = 0.005) yields further improvements. Overall, a relative improvement of 59.2% in EER and 52.3% in minDCF is obtained using the WLS based AANN system over the baseline AANN system.


Chapter 4

Factor Analysis of Auto-Associative Neural Networks for Speaker Verification

4.1  Chapter Outline

The main disadvantage of the WLS of AANNs is that it requires speaker-specific adaptation of the weight matrix connecting the third hidden and output layers, even though a much lower-dimensional representation (the i-vector) is used for the subsequent modeling (see section 3.4). In this chapter, we introduce and develop the factor analysis (FA) theory of AANNs to alleviate this problem. This is achieved by regularizing each speaker-specific weight matrix, restricting it to a common low-dimensional subspace during adaptation. The subspace is learned using large amounts of development data and is held fixed during adaptation. Thus, only the coordinates in the subspace, also known as the i-vector, need to be estimated from the speaker-specific data. Unlike the WLS of AANNs approach, we adapt the weight matrix directly in the common low-dimensional subspace. Update equations are derived for learning both the common low-dimensional subspace and the i-vectors of speakers in that subspace. The resultant i-vector representation is used as features for the subsequent PLDA model. The proposed system shows promising results on the NIST-08 SRE and yields a 12% relative improvement in EER over the WLS based AANN speaker verification system described in section 3.5.

The remainder of the chapter is organized as follows. In section 4.2, the factor analysis (FA) of AANNs is developed. The FA based AANN speaker verification system is described in section 4.3. Experimental results are provided in section 4.4.

4.2  Factor Analysis of AANNs

The idea of FA is to constrain the weight matrix connecting the third hidden layer and the output layer of each speaker-specific AANN model to lie in a common low-dimensional subspace such that the overall loss function over the entire development data is minimized. In this process, the rest of the parameters of each AANN are held fixed at the values learned during speaker independent training, i.e., the UBM (see section 3.3.2). The proposed loss function is different from the one used in section 3.5. The notation used in this section is summarized below.

    m          - number of speakers
    l(s)       - number of sessions of the sth speaker
    n(l, s)    - number of frames in the lth session of the sth speaker
    f_{i,l,s}  - ith acoustic feature vector of the lth session of the sth speaker
    d          - dimensionality of f_{i,l,s}
    h_{i,l,s}  - fourth layer (third hidden layer) output of the UBM-AANN for input f_{i,l,s}
    d'         - dimensionality of h_{i,l,s}
    W_{l,s}    - weight matrix connecting the third hidden layer and the output layer of the AANN specific to the lth session of the sth speaker
    b          - output bias vector of the UBM-AANN
    e_{i,l,s}  - error vector of the UBM-AANN for input f_{i,l,s}

In the AANN loss function below, we first vectorize W_{l,s} so that a subspace structure can be imposed on it. The loss function with speaker and session specific weights is given by
\[
L = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left\| f_{i,l,s} - b - W_{l,s} h_{i,l,s} \right\|_2^2
  = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left\| f_{i,l,s} - b - H_{i,l,s} w_{l,s} \right\|_2^2, \tag{4.1}
\]

where
\[
w_{l,s} = \mathrm{RowOrdered}(W_{l,s}), \qquad
H_{i,l,s} = I_d \otimes h_{i,l,s}^T =
\begin{bmatrix}
h_{i,l,s}^T &        & 0 \\
            & \ddots &   \\
0           &        & h_{i,l,s}^T
\end{bmatrix}_{d \times d d'}.
\]
The dimensionality of $W_{l,s}$ is $d \times d'$ and that of $w_{l,s}$ is $d d' \times 1$. The vector $w_{l,s}$ is obtained by arranging the rows of $W_{l,s}$ as columns one after the other. The matrix $H_{i,l,s}$ is the Kronecker product of $I_d$ (a $d \times d$ identity matrix) and $h_{i,l,s}^T$.
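A small numerical check of this vectorization identity ($W h = H w$ with $w$ the row-ordered vectorization of $W$) can be written as follows; the sizes are arbitrary and purely illustrative.

    import numpy as np

    d, d_prime = 4, 3
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, d_prime))
    h = rng.standard_normal(d_prime)

    w = W.reshape(-1)                         # row-ordered vectorization of W
    H = np.kron(np.eye(d), h[None, :])        # I_d kron h^T, shape d x (d * d')
    assert np.allclose(W @ h, H @ w)          # W h == H w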

We can rewrite (4.1) as
\[
L(w_{1,1}, \ldots, w_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ f_{i,l,s} - b - H_{i,l,s} w_{l,s} \right]^T \left[ f_{i,l,s} - b - H_{i,l,s} w_{l,s} \right]. \tag{4.2}
\]
The factor analysis model (or subspace constraint) for the vectorized weights $w_{l,s}$ is
\[
w_{l,s} \equiv w_{ubm} + T q_{l,s},
\]
where $w_{ubm}$ represents the speaker independent (UBM) vector of weights connecting the third hidden layer and the output layer, $T$ is a matrix with fewer columns than rows representing the common low-dimensional subspace, and $q_{l,s}$ is the vector of coordinates in the subspace, i.e., the i-vector associated with the lth session of the sth speaker. Substituting this factor analysis model in (4.2) (note that the loss function depends only on $(T, \{q_{l,s}\})$ since $w_{ubm}$ is known),
\[
L\left(T, q_{1,1}, \ldots, q_{l(m),m}\right) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ f_{i,l,s} - b - H_{i,l,s} (w_{ubm} + T q_{l,s}) \right]^T \left[ f_{i,l,s} - b - H_{i,l,s} (w_{ubm} + T q_{l,s}) \right]. \tag{4.3}
\]

The reconstruction error vector of the UBM is defined as
\[
e_{i,l,s} \doteq f_{i,l,s} - b - H_{i,l,s} w_{ubm}. \tag{4.4}
\]
By substituting the expression above in (4.3),
\[
L\left(T, q_{1,1}, \ldots, q_{l(m),m}\right) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \right]^T \left[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \right]. \tag{4.5}
\]

Let us define the statistics of the lth session of the sth speaker as
\[
F_1(l, s) \doteq \sum_{i=1}^{n(l,s)} H_{i,l,s}^T e_{i,l,s}, \tag{4.6}
\]
\[
F_2(l, s) \doteq \sum_{i=1}^{n(l,s)} H_{i,l,s}^T H_{i,l,s}. \tag{4.7}
\]
We can rewrite (4.5) using (4.6) and (4.7) as
\[
L\left(T, q_{1,1}, \ldots, q_{l(m),m}\right) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ \left( \sum_{i=1}^{n(l,s)} e_{i,l,s}^T e_{i,l,s} \right) - 2\, q_{l,s}^T T^T F_1(l, s) + q_{l,s}^T T^T F_2(l, s)\, T q_{l,s} \right]. \tag{4.8}
\]

The low-dimensional subspace T can be learned by minimizing the loss function in (4.8) using coordinate descent. In the first step, (4.8) is minimized with respect to {q_{l,s}} keeping T fixed. In the second step, (4.8) is minimized with respect to T keeping {q_{l,s}} fixed at the values found in the first step. Note that the loss function is convex in each step, and therefore the optimum is found by setting the gradient of the loss function with respect to the corresponding variable to zero. These steps are repeated until convergence.

[Figure 4.1 shows the block diagram of the system: FDLP features are used to extract per-utterance statistics from the UBM-AANN; i-vectors are extracted using the trained T matrix and passed to hypothesis testing to produce the trial scores. UBM-AANN training, T-matrix training, and PLDA training are the development-time stages.]

Figure 4.1: Block schematic of the proposed FA based AANN speaker verification system.

Differentiating (4.8) with respect to $q_{l,s}$ and setting it to zero yields
\[
\frac{\partial L}{\partial q_{l,s}} = 0 \;\Rightarrow\; -2 T^T F_1(l, s) + 2 T^T F_2(l, s)\, T q_{l,s} = 0
\;\Rightarrow\; q_{l,s} = \left[ T^T F_2(l, s)\, T \right]^{-1} T^T F_1(l, s). \tag{4.9}
\]
Differentiating (4.8) with respect to $T$ and setting it equal to zero yields
\[
\frac{\partial L}{\partial T} = 0 \;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ -2 F_1(l, s)\, q_{l,s}^T + 2 F_2(l, s)\, T q_{l,s} q_{l,s}^T \right] = 0, \tag{4.10}
\]

from which T is obtained by solving a set of linear equations in the entries of the matrix T.
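As a sketch, (4.9) is a small linear solve per utterance, while (4.10) can be written as one large linear system in vec(T) using the identity vec(A X B) = (B^T kron A) vec(X); the latter is only practical at toy sizes and is shown here purely to make the structure of the equations explicit, since a real system would exploit the block structure of F2(l, s). All names are illustrative.

    import numpy as np

    def fa_ivector(T, F1, F2):
        # i-vector for one utterance via (4.9): q = (T^T F2 T)^{-1} T^T F1.
        return np.linalg.solve(T.T @ F2 @ T, T.T @ F1)

    def fa_update_T(F1_list, F2_list, Q, D, R):
        # Solve (4.10): sum_u F2_u T q_u q_u^T = sum_u F1_u q_u^T via the vec
        # trick (column-major vec). Only feasible for small D and R.
        A = np.zeros((D * R, D * R))
        B = np.zeros((D, R))
        for F1, F2, q in zip(F1_list, F2_list, Q):
            A += np.kron(np.outer(q, q), F2)
            B += np.outer(F1, q)
        vecT = np.linalg.solve(A, B.reshape(-1, order="F"))
        return vecT.reshape(D, R, order="F")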

4.3

FA based AANN Speaker Verification System

The block diagram of the proposed FA based AANN speaker verification system is shown in Fig. 4.1. This system resembles the WLS based AANN speaker verification system described in section 3.5; the two differ in the way the UBM-AANN is adapted. In the FA approach, the matrix connecting the third hidden and output layers of the UBM-AANN is adapted in a common low-dimensional subspace to obtain an i-vector, whereas in the WLS based approach, the entire matrix is adapted first using (3.6) and then projected onto a low-dimensional subspace to obtain an i-vector. The remainder of this section describes the stages that differ from the WLS based AANN speaker verification system shown in Fig. 3.3.

4.3.1

Statistics

The statistics in (4.6) and (4.7) are precomputed for each utterance, corresponding to a particular speaker and session. The appropriate gender-specific UBM is used for computing the statistics. Note that only a few entries of F_2(l, s) need to be computed, since its entries are redundant (it is block diagonal with a single repeated block). These statistics are sufficient for training the T matrix and extracting the i-vectors.
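This redundancy comes from the Kronecker structure of H_{i,l,s}: F2(l, s) = I_d kron (sum_i h_i h_i^T), so only one d' x d' block needs to be accumulated, and F1(l, s) is the row-ordered vectorization of sum_i e_i h_i^T. A NumPy sketch under these assumptions (names illustrative):

    import numpy as np

    def fa_statistics(H_frames, E_frames):
        # H_frames : (n, d') third hidden layer outputs h_i of the UBM-AANN
        # E_frames : (n, d)  UBM reconstruction error vectors e_i from (4.4)
        d = E_frames.shape[1]
        S = H_frames.T @ H_frames                   # sum_i h_i h_i^T  (d' x d')
        F2 = np.kron(np.eye(d), S)                  # block diagonal, block S repeated d times
        F1 = (E_frames.T @ H_frames).reshape(-1)    # row-ordered vec(sum_i e_i h_i^T)
        return F1, F2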

4.3.2

T-matrix Training

Gender-dependent low-dimensional subspaces (T matrices) are trained as described in Section 4.2. The development data for training the subspaces consists of Switchboard-2, Phases II and III; Switchboard Cellular, Parts 1 and 2; and the NIST 2004-2005 SRE [62]. The total number of male and female utterances is 12266 and 14936, respectively. We initialize the T matrix with Gaussian noise and learn the subspace as described in Section 4.2 using coordinate descent.


Table 4.1: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

    System                                        C6            C7            C8
    Baseline AANN                                 19.5 (87.3)   15.8 (73.0)   15.2 (74.8)
    WLS of AANNs, 240 dim. i-vector
      (closed-form, β = 0.005, from Table 3.2)    10.7 (59.6)    5.5 (28.3)    4.4 (24.1)
    FA of AANNs, 240 dim. i-vector                 9.6 (55.1)    4.7 (25.3)    3.8 (20.1)

4.3.3

i-vectors

Each utterance is converted to an i-vector using (4.9) with the appropriate gender-specific T matrix. All i-vectors are normalized to have unit length to reduce the mismatch between training and testing [68].

4.4  Experimental Results

Speaker verification systems are tested on the telephone conditions of the NIST-2008 speaker recognition evaluation (SRE). The details of the baseline AANN and WLS based AANN speaker verification systems can be found in section 3.6. All neural network based systems use the same UBM-AANN of size (39, 20, 6, 39, 39), where each number indicates the number of nodes in the corresponding layer; the number of nodes in each hidden layer was optimized one at a time on NIST-08 to obtain the best performance of the FA based AANN system, whereas the number of nodes in the input and output layers is fixed by the dimensionality of the feature vectors. Table 3.1 describes the various telephone conditions.

Table 4.1 lists the EER and minimum detection cost function (minDCF) on NIST-2008 for all the systems. The last row of the table shows the error rates of the FA based AANN system, which uses gender-dependent 150 dimensional (number of columns of Φ) subspace PLDA models in a 240 dimensional i-vector space. The results indicate that the FA based AANN speaker verification system outperforms the baseline AANN and WLS based AANN speaker verification systems in all conditions. It can also be observed that the proposed FA approach yields a 12.1% relative improvement in EER and a 10.2% relative improvement in minDCF over the best WLS based AANN system.

Table 4.2: Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

    System                            C6           C7           C8
    FA of AANNs, 240 dim. i-vector    9.6 (55.1)   4.7 (25.3)   3.8 (20.1)
    GMM, 400 dim. i-vector            7.0 (41.3)   2.8 (14.8)   2.1 (10.8)

4.4.1  Comparison with GMM based i-vector/PLDA System

The block diagram in Fig. 4.1 is also applicable to a GMM based i-vector/PLDA system, except that the UBM is based on a GMM. Each GMM based UBM consists of 1024 mixture components with diagonal covariances. The male and female UBMs are trained using FDLP features extracted from 4324 and 5461 utterances of development data, respectively. A gender-specific 400 dimensional total variability space (T matrix) is trained as described in [63]. The i-vectors of this space are length normalized and subsequently used for training a gender-dependent PLDA system with a 250 dimensional subspace. Note that the development data used for training the T matrices and the PLDA models is the same as that of the FA based AANN speaker verification system (see section 4.3).

This gender-specific GMM based i-vector/PLDA system is trained for comparison, and the results are shown in Table 4.2. Although significant improvements are achieved using the proposed FA of AANNs, the GMM based i-vector/PLDA system continues to perform better. However, further work on neural network based systems might close the existing performance gap and bring out the possible advantages of this alternative nonlinear neural network based modeling in speaker verification.


Chapter 5

Conclusions

5.1  Conclusions

Discriminative MLDA features were proposed for phoneme recognition in section 2.3. The SMLP classifier was proposed in section 2.4.1. It has been shown that one of its hidden layer outputs can be forced to be sparse by adding a sparse regularization term to the cross-entropy cost function. Update equations were derived for training the SMLP. Finally, a multi-stream phoneme recognition system based on the SMLP has been shown to outperform its MLP counterpart.

The closed-form expression for adapting, with regularization, the AANN weight matrix connecting the third hidden and output layers was derived in chapter 3. This matrix was further regularized by projecting it onto a low-dimensional subspace, which was learned to preserve most of the variability of the adapted weight matrices in a WLS sense. Each speaker was modeled using a projection in the subspace (an i-vector). The resultant speaker verification system based on i-vectors achieved substantially better performance than the existing AANN based speaker verification system.

The theory of FA of AANNs was introduced to adapt the AANN weight matrix connecting the third hidden and output layers directly in a low-dimensional subspace that captures the variability of the weight matrices (see section 4.2). This particular way of regularizing the adaptation parameters of AANNs has been shown to yield better performance than the WLS approach described in section 3.4.

5.2  Future Work

Future work includes applying the SMLP to extract data-driven features for ASR in the TANDEM framework, and replacing the MLP with the SMLP in other pattern classification applications.

An interesting extension of the FA of AANNs would be to add terms to the cost function (4.5) that encourage within-class i-vectors to be close to each other and between-class i-vectors to be far apart. Another direction is to develop an FA formulation for MLP based acoustic models to perform speaker adaptation in ASR.


Appendix A

Standard Error Back-Propagation

Gradients of $L$ with respect to $y_j^m$ can be computed from equation (2.2) as follows, for all $j \in \{1, 2, \ldots, N_m\}$:
\[
\frac{\partial L}{\partial y_j^m} = -\frac{d_j}{y_j^m}. \tag{A.1}
\]
For the softmax non-linearity, for all $i \in \{1, 2, \ldots, N_m\}$,
\[
y_j^m = \frac{e^{x_j^m}}{\sum_{k=1}^{N_m} e^{x_k^m}} \;\Rightarrow\; \frac{\partial y_j^m}{\partial x_i^m} = y_j^m \left( I_{ij} - y_i^m \right), \tag{A.2}
\]
where
\[
I_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}
\]
Thus, using (A.1) and (A.2), the gradients of $L$ with respect to $x_i^m$ for all $i \in \{1, 2, \ldots, N_m\}$ are given by
\[
\frac{\partial L}{\partial x_i^m} = \sum_{j=1}^{N_m} \left( \frac{\partial L}{\partial y_j^m} \right) \left( \frac{\partial y_j^m}{\partial x_i^m} \right) = y_i^m - d_i. \tag{A.3}
\]
The standard error back-propagation algorithm expresses the gradients of $L$ with respect to $y_j^{m-1}$ in terms of the previously computed gradients in (A.3):
\[
\frac{\partial L}{\partial y_j^{m-1}} = \sum_{i=1}^{N_m} \left( \frac{\partial L}{\partial x_i^m} \right) \left( \frac{\partial x_i^m}{\partial y_j^{m-1}} \right) = \sum_{i=1}^{N_m} \left( y_i^m - d_i \right) w_{ji}^{m-1}. \tag{A.4}
\]
In general, for all $l \le m - 1$, given the gradients of $L$ with respect to $y_j^l$ for all $j \in \{1, 2, \ldots, N_l\}$, the gradient of $L$ with respect to $y_i^{l-1}$ for any $i \in \{1, 2, \ldots, N_{l-1}\}$ is given by (similar to (2.6))
\[
\frac{\partial L}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) \left( \frac{\partial y_j^l}{\partial x_j^l} \right) \left( \frac{\partial x_j^l}{\partial y_i^{l-1}} \right)
= \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) \phi_l'\!\left( x_j^l \right) w_{ij}^{l-1}
= \sum_{j=1}^{N_l} \left( \frac{\partial L}{\partial y_j^l} \right) y_j^l \left( 1 - y_j^l \right) w_{ij}^{l-1},
\]
where
\[
y_j^l = \phi_l\!\left( x_j^l \right) = \frac{1}{1 + \exp(-x_j^l)}.
\]
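A compact NumPy rendering of these gradients (for a single training example with one-hot targets d) may be useful as a reference; the function names are illustrative, and the weight matrices are assumed to be stored with shape (fan-in, fan-out), so that entry [j, i] is w_{ji}.

    import numpy as np

    def output_layer_grads(x_m, d):
        # (A.1)-(A.3): softmax outputs with cross-entropy targets d give
        # dL/dx_i^m = y_i^m - d_i.
        y = np.exp(x_m - x_m.max())
        y /= y.sum()
        return y - d, y

    def backprop_output_to_hidden(dL_dx_m, W_prev):
        # (A.4): dL/dy_j^{m-1} = sum_i (y_i^m - d_i) w_{ji}^{m-1}.
        return W_prev @ dL_dx_m

    def backprop_sigmoid_layer(dL_dy_l, y_l, W_prev):
        # General step: dL/dy_i^{l-1} = sum_j dL/dy_j^l * y_j^l (1 - y_j^l) * w_{ij}^{l-1}.
        return W_prev @ (dL_dy_l * y_l * (1.0 - y_l))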

!

Bibliography [1] J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting recordings for automatic speech recognition,” in INTERSPEECH, 2006. [2] M. Lehtonen, P. Fousek, and H. Hermansky, “Hierarchical approach for spotting keywords,” IDIAP Research Report, no. 05–41, 2005. [3] J. Pinto, I. Szoke, S. Prasanna, and H. Hermansky, “Fast approximate spoken term detection from sequence of phonemes,” in Proceedings of the ACM SIGIR Workshop on Searching Spontaneous Conversational Speech, 2008, pp. 08–45. [4] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach. Kluwer Academic Pub, 1994. [5] H. Hermansky, D. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional hmm systems,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2000. [6] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in lvcsr using neural networks,” in INTERSPEECH, 2004.

55

BIBLIOGRAPHY

[7] F. Gr´ezl, M. Karafi´at, S. Kont´ ar, and J. Cernocky, “Probabilistic and bottle-neck features for lvcsr of meetings,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007. [8] N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, no. 99, 2011. [9] B. Yegnanarayana and S. Kishore, “Aann: an alternative to gmm for pattern recognition,” Neural Networks, vol. 15, no. 3, pp. 459–469, 2002. [10] K. Murty and B. Yegnanarayana, “Combining evidence from residual phase and mfcc features for speaker recognition,” IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52–55, 2006. [11] G. Sivaram, S. Thomas, and H. Hermansky, “Mixture of auto-associative neural networks for speaker verification,” in INTERSPEECH, 2011. [12] P. Matejka, P. Schwarz, J. Cernock` y, and P. Chytil, “Phonotactic language identification using high quality phoneme recognition,” in Ninth European Conference on Speech Communication and Technology, 2005. [13] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, pp. 1738–1752, 1990. [14] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

56

BIBLIOGRAPHY

[15] B. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998. [16] H. Hermansky and P. Fousek, “Multi-resolution RASTA filtering for TANDEM-based ASR,” in INTERSPEECH, 2005. [17] S. Ganapathy, S. Thomas, and H. Hermansky, “Modulation frequency features for phoneme recognition in noisy speech,” The Journal of the Acoustical Society of America - Express Letters, vol. 125, no. 1, pp. 8–12, 2009. [18] M. Kleinschmidt and D. Gelbart, “Improving word accuracy with Gabor feature extraction,” in Proc. of ICSLP.

USA, 2002.

[19] S. Zhao and N. Morgan, “Multi-stream spectro-temporal features for robust speech recognition,” in INTERSPEECH. Brisbane, Australia, 2008. [20] N. Mesgarani, G. Sivaram, S. K. Nemala, M. Elhilali, and H. Hermansky, “Discriminant Spectrotemporal Features for Phoneme Recognition,” in INTERSPEECH. Brighton, 2009. [21] M. Richard and R. Lippmann, “Neural network classifiers estimate Bayesian a posteriori probabilities,” Neural computation, vol. 3, no. 4, pp. 461–483, 1991. [22] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis et al., “Pushing the envelope-aside [speech recognition],” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 81–88, 2005. [23] V. Balakrishnan, G. Sivaram, and S. Khudanpur, “Dirichlet mixture models of neural 57

BIBLIOGRAPHY

net posteriors for hmm-based speech recognition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011. [24] F. Gr´ezl, M. Karafi´at, S. Kont´ ar, and J. Cernocky, “Probabilistic and bottle-neck features for lvcsr of meetings,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007. [25] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, “On using mlp features in lvcsr,” in INTERSPEECH, 2004. [26] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, “Training and adapting mlp features for arabic speech recognition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009. [27] T. Elliott and F. Theunissen, “The modulation transfer function for speech intelligibility,” PLoS computational biology, vol. 5, no. 3, 2009. [28] B. Meyer and B. Kollmeier, “Optimization and evaluation of Gabor feature sets for ASR,” in INTERSPEECH.

Brisbane, Australia, 2008.

[29] D. Depireux, J. Simon, D. Klein, and S. Shamma, “Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex,” Journal of Neurophysiology, vol. 85, no. 3, pp. 1220–1234, 2001. [30] G. Sivaram, S. Nemala, N. Mesgarani, and H. Hermansky, “Data-driven and feedback based spectro-temporal features for speech recognition,” IEEE Signal Processing Letters, vol. 17, no. 11, pp. 957–960, 2010.

58

BIBLIOGRAPHY

[31] “TIMIT database,” Available: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=L [32] K. Lee and H. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989. [33] S. Chen and D. Li, “Modified linear discriminant analysis,” Pattern Recognition, vol. 38, no. 3, pp. 441–443, 2005. [34] B. Olshausen and D. Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?” Vision research, vol. 37, no. 23, pp. 3311–3325, 1997. [35] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for visual area v2,” Advances in neural information processing systems, vol. 20, pp. 873–880, 2008. [36] K. Huang and S. Aviyente, “Sparse representation for signal classification,” Advances in neural information processing systems, vol. 19, pp. 609–616, 2007. [37] M. Ranzato, Y. Boureau, and Y. LeCun, “Sparse feature learning for deep belief networks,” Advances in neural information processing systems, vol. 20, pp. 1185–1192, 2007. [38] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009. [39] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary

59

BIBLIOGRAPHY

learning,” Advances in neural information processing systems, vol. 21, pp. 1033–1040, 2008. [40] T. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, “Bayesian compressive sensing for phonetic classification,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010. [41] J. Gemmeke and T. Virtanen, “Noise robust exemplar-based connected digit recognition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010. [42] G. Sivaram, S. Nemala, M. Elhilali, T. Tran, and H. Hermansky, “Sparse coding for speech recognition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010. [43] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai.-Doss, “Exploiting contextual information for improved phoneme recognition,” Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008. [44] J. Pinto, G. Sivaram, M. Magimai.-Doss, H. Hermansky, and H. Bourlard, “Analyzing MLP Based Hierarchical Phoneme Posterior Probability Estimator,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225–241, 2011. [45] H. Ketabdar and H. Bourlard, “Enhanced phone posteriors for improving speech recognition systems,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1094–1106, 2010.

60

BIBLIOGRAPHY

[46] J. Pinto, M. Magimai-Doss, and H. Bourlard, “Mlp based hierarchical system for task adaptation in asr,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2009, pp. 365–370. [47] D. Imseng, M. Doss, and H. Bourlard, “Hierarchical multilayer perceptron based language identification,” in INTERSPEECH, 2010. [48] “The

ICSI

Quicknet

Software

Package,”

Available:

http://www.icsi.berkeley.edu/Speech/qn.html. [49] F. Valente, “Multi-stream speech recognition based on Dempster-Shafer combination rule,” Speech Communication, vol. 52, no. 3, pp. 213–222, 2010. [50] G. Shafer, A mathematical theory of evidence.

Princeton university press Princeton,

1976, vol. 1. [51] G. Dahl, M. Ranzato, A. Mohamed, and G. Hinton, “Phone recognition with the meancovariance restricted boltzmann machine,” Advances in Neural Information Processing Systems, vol. 23, pp. 469–477, 2010. [52] A. Mohamed and G. Hinton, “Phone recognition using restricted boltzmann machines,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010. [53] P. Schwarz, P. Matejka, and J. Cernocky, “Hierarchical structures of neural networks for phoneme recognition,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2006.

61

BIBLIOGRAPHY

[54] P. Hoyer, “Non-negative matrix factorization with sparseness constraints,” The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004. [55] M. Kramer, “Nonlinear principal component analysis using autoassociative neural networks,” AIChE journal, vol. 37, no. 2, pp. 233–243, 1991. [56] R. Kumerasan and A. Rao, “Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications,” The Journal of the Acoustical Society of America, vol. 105, pp. 1912–1924, 1999. [57] M. Athineos, H. Hermansky, and D. Ellis, “Plp2 autoregressive modeling of auditorylike 2-d spectrotemporal patterns,” in INTERSPEECH, 2004. [58] M. Athineos and D. Ellis, “Autoregressive modeling of temporal envelopes,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5237–5245, 2007. [59] S. Ganapathy, J. Pelecanos, and M. Omar, “Feature normalization for speaker verification in room reverberation,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011. [60] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted gaussian mixture models,” Digital signal processing, vol. 10, no. 1-3, pp. 19–41, 2000. [61] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007. [62] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny, “Comparison of scoring 62

BIBLIOGRAPHY

methods used in speaker recognition with joint factor analysis,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009. [63] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011. [64] N. Br¨ ummer and E. de Villiers, “The speaker partitioning problem,” in Proceedings of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010. [65] D. Garcia-Romero and C. Espy-Wilson, “Joint factor analysis for speaker recognition reinterpreted as signal coding using overcomplete dictionaries,” in Proc. Odyssey Speaker and Language Recognition Workshop, 2010. [66] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE 11th International Conference on Computer Vision, 2007., 2007, pp. 1–8. [67] P. Kenny, “Bayesian speaker verication with heavy-tailed priors,” in Proceedings of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010. [68] D. Garcia-Romero and C. Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in INTERSPEECH, 2011.

63
