IET BIOMETRICS


Total Variability Modelling for Face Verification

Roy Wallace, Mitchell McLaren

Abstract—This paper presents the first detailed study of total variability modelling (TVM) for face verification. TVM was originally proposed for speaker verification, where it has been accepted as state-of-the-art technology. Also referred to as front-end factor analysis, TVM uses a probabilistic model to represent a speech recording as a low-dimensional vector known as an i-vector. This representation has been successfully applied to a wide variety of speech-related pattern recognition applications, and remains a hot topic in biometrics. In this work, we extend the application of i-vectors beyond the domain of speech to a novel representation of facial images for the purpose of face verification. Extensive experimentation on several challenging and publicly-available face recognition databases demonstrates that TVM generalises well to this modality, providing between 17% and 39% relative reduction in verification error rate compared to a baseline Gaussian mixture model (GMM) system. We evaluate several i-vector session compensation and scoring techniques including source-normalised linear discriminant analysis (SN-LDA), probabilistic LDA (PLDA) and within-class covariance normalisation (WCCN). Finally, this paper provides a detailed comparison of the complexity of TVM, highlighting some important computational advantages with respect to related state-of-the-art techniques.

Index Terms—biometrics, face verification, Gaussian mixture modelling, total variability modelling, i-vectors

Roy Wallace ([email protected]) is with Idiap Research Institute, PO Box 592, CH-1920 Martigny, Switzerland. Mitchell McLaren ([email protected]) is with the STAR Laboratory of SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA. This work was carried out while Mitchell McLaren was at the Centre for Language and Speech Technology, Radboud University Nijmegen, PO Box 9102, 6500HC, The Netherlands.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number 238803 (BBfor2). Credit is hereby given to the University of Zagreb, Faculty of Electrical Engineering and Computing for providing the SCface database of facial images. September 29, 2012

DRAFT


I. INTRODUCTION

We propose a new approach to face verification using total variability modelling (TVM) to improve accuracy in uncontrolled conditions. In particular, we build upon the state-of-the-art parts-based framework for face verification (also known as face authentication) [1]–[4]. Within this framework, the distribution of local features extracted from images of a subject's face is described by a Gaussian mixture model (GMM). The problem is that a subject's GMM is difficult to estimate reliably when there is limited enrolment data and, furthermore, verification scores are unstable when there is a mismatch between the conditions of the enrolment and probe (test) images.

Considerable focus has been given to finding solutions to these problems in the field of speaker verification, resulting in the development of advanced session compensation techniques. Session compensation refers to the process of modelling and removing the undesirable effects of within-class (session) variation, thus providing more representative subject models and improved robustness to mismatched conditions. For face recognition, session variability refers to differences between images of the same subject that occur due to illumination, pose, expression, acquisition device and range, age, facial hair, etc. We recently began to address the issue of session variation in the context of face verification in [5], where GMM-based session compensation techniques were proposed to improve verification accuracy in challenging conditions. In particular, we applied factor analysis techniques originally developed in the speaker verification field to face verification and found that inter-session variability modelling (ISV) and joint factor analysis (JFA) provided relative error rate reductions of between 11% and 44% across several challenging databases.
These factor analysis techniques use training data to learn the primary directions of within- and between-class variability in a high-dimensional GMM mean supervector space¹, allowing information relating to a subject's identity to be separated from the effects of the conditions in which the sample was taken. A related but different approach, total variability modelling (TVM), has recently superseded ISV and JFA as the state-of-the-art approach to speaker verification [6], [7]. Also referred to as i-vector modelling or front-end factor analysis, TVM is similar to ISV and JFA in its use of GMM modelling; however, session compensation in the new approach is carried out in a low-dimensional total variability factor space rather than in the high-dimensional GMM mean supervector space. In this context, an i-vector refers to a low-dimensional vector extracted using a generative probabilistic model to represent the observations of

¹A GMM mean supervector is a CD-dimensional vector formed by concatenating the D-dimensional means from each of the C components of a GMM.


a biometric sample. The key innovation of TVM is performing factor analysis and session compensation separately, rather than simultaneously as is done in ISV and JFA. In recent years, several enhancements have been proposed to further increase the accuracy of the state-of-the-art TVM framework for speaker verification [6]–[12]. However, to the best of the authors' knowledge, there has been no study of whether this new modelling approach may also be applicable to the face verification task.

This paper presents the first study of total variability modelling for face verification. We demonstrate the effectiveness of TVM through experimentation on several challenging and publicly-available face recognition databases. The new approach is compared to alternative session variability modelling approaches, in terms of verification accuracy and computational complexity, to demonstrate the benefits associated with TVM. The results and analysis presented in this work highlight the suitability as well as the shortcomings of TVM, to provide a foundation for future work on further tailoring the new approach to face recognition.

In Section II we introduce the baseline parts-based framework for face verification, then in Section III we outline recent factor analysis approaches including TVM, with an overview of TVM session compensation and normalisation techniques presented in Sections IV and V. In Sections VI to VIII we present experimental results and analysis of system complexity, before concluding in Section IX.

II. PARTS-BASED FACE VERIFICATION

The parts-based framework for face verification is well-established and state-of-the-art [1]–[5]. In this paper we focus our efforts on improving face verification accuracy within the scope of this framework. This section provides a brief overview of the baseline approach; readers are directed to recent publications such as [4] for further details. The approach uses local features and generative statistical modelling.
First, facial images are converted to grayscale, cropped, registered, and preprocessed to reduce illumination variation. Next, 2D discrete cosine transform (2D DCT) coefficients are extracted from small mean- and variance-normalised B × B blocks of pixels, which are exhaustively sampled from the image. From each vector of coefficients, the subset of D coefficients that correspond to the low frequency range is retained and mean- and variance-normalised in each dimension with respect to the other feature vectors of the image. Each image is thus represented by a set of K feature vectors, O = {o_1, o_2, ..., o_K}. The distribution of the resulting features for a subject is approximated using a Gaussian mixture model (GMM). This GMM is estimated during enrolment through mean-only maximum a posteriori (MAP) adaptation from a subject-independent prior GMM with diagonal covariance matrices, known


as a universal background model (UBM) [13]. The UBM is estimated beforehand to maximise the likelihood of observations from a large subject-independent training data set. To perform verification, the observations from a probe image O_t = {o_t^1, o_t^2, ..., o_t^K} are compared to the model of the claimed subject, s_i, and a log likelihood ratio (LLR) score is calculated with respect to the UBM, m:

h(O_t, s_i) = log ∏_{k=1}^{K} [ p(o_t^k | s_i) / p(o_t^k | m) ]    (1)
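As an illustrative sketch of the LLR in (1) for diagonal-covariance GMMs, the following is not the authors' implementation; the names and array shapes are assumptions (each model is a (weights, means, variances) tuple of shapes (C,), (C, D), (C, D); O_t is a (K, D) array of feature vectors):

```python
# Sketch of the LLR score in Eq. (1) for diagonal-covariance GMMs.
import numpy as np

def gmm_loglik(O, weights, means, variances):
    """Per-frame log-likelihoods log p(o_k | model) under a diagonal GMM."""
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)      # (C,)
    diff = O[:, None, :] - means[None, :, :]                           # (K, C, D)
    log_exp = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)   # (K, C)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    m = log_comp.max(axis=1, keepdims=True)                            # log-sum-exp
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def llr_score(O_t, subject_gmm, ubm):
    """h(O_t, s_i): sum over frames of log p(o|s_i) - log p(o|m), as in (1)."""
    return gmm_loglik(O_t, *subject_gmm).sum() - gmm_loglik(O_t, *ubm).sum()
```

A score above the decision threshold accepts the identity claim; the linear scoring approximation mentioned below replaces the exact per-frame evaluation for speed.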

An efficient approximation to the LLR known as linear scoring is used to improve verification speed without loss of accuracy [4], [14]. The major strengths of this GMM-based local feature modelling approach include natural robustness to occlusion, local transformation and face mislocalisation, while offering the best trade-off in terms of complexity, robustness and discrimination [3]. However, it remains a difficult problem to learn the distribution of a subject's observations in all possible conditions when only a limited amount of enrolment data is available per subject. On the other hand, large amounts of training data are typically available that contain data from a disjoint set of subjects. Factor analysis techniques aim to leverage the information in this training data to improve enrolment by incorporating a model of subject and session variability, as described in the next section.

III. FACTOR ANALYSIS FOR GMMs

As outlined in the previous section, the baseline approach to GMM-based face (and speaker) verification is to model the distribution of a subject's features using a GMM found through mean-only MAP adaptation of a UBM. A subject's model can thus be represented by a mean offset from the UBM in a high-dimensional GMM mean supervector space. In the context of limited enrolment data, this offset is difficult to estimate reliably due to its inherent sensitivity to the specific conditions in which the enrolment data was captured. Factor analysis was developed to accommodate this scenario by constraining mean offsets to lie within linear, low-dimensional subspaces of the supervector space, where the corresponding linear coefficients are referred to as factors. In this way, small amounts of enrolment data can be utilised more reliably by appropriately constraining the directions of adaptation within supervector space. Subspaces are used to constrain different sources of variability, resulting in a variety of factor analysis recipes.
In this section we briefly outline state-of-the-art factor analysis approaches to speaker and face verification. The first two approaches are inter-session variability modelling (ISV) and joint factor analysis (JFA). The third and most recent approach is total variability modelling (TVM). ISV and JFA both perform session compensation through factor analysis in the high-dimensional GMM supervector space.


In contrast, TVM utilises factor analysis as a front-end processing step to extract low-dimensional factors referred to as i-vectors, and subsequently performs session compensation in i-vector space. In [5], we successfully applied ISV and JFA to face verification. To the best of our knowledge, this is the first work to apply TVM to face verification.

A. Classical MAP adaptation

Subject enrolment can be expressed in terms of GMM mean supervectors as

s_i = m + d_i,    (2)

where m is the UBM mean supervector, d_i is the subject-dependent offset and s_i is the resulting model for subject i. In the baseline approach [13], the subject model is estimated using MAP adaptation given all of the subject's enrolment data, where

d_i = D z_i    (3)
z_i ~ N(0, I)    (4)
D = sqrt(Σ / τ).    (5)
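As a minimal sketch, mean-only relevance MAP adaptation [13] can be written in its equivalent statistics form rather than via the latent factors z_i. The array shapes here are assumptions: ubm_means is (C, D), N holds per-component zeroth-order counts (C,), and F holds first-order sums (C, D) accumulated over the subject's enrolment features.

```python
# Sketch of mean-only relevance MAP adaptation with relevance factor tau.
import numpy as np

def map_adapt_means(ubm_means, N, F, tau=4.0):
    """Adapted subject means: alpha_c * (F_c / N_c) + (1 - alpha_c) * m_c."""
    alpha = N / (N + tau)                          # data-driven adaptation weight
    post = F / np.maximum(N, 1e-10)[:, None]       # per-component posterior mean
    return alpha[:, None] * post + (1.0 - alpha)[:, None] * ubm_means
```

With no enrolment data for a component the UBM mean is retained; with abundant data the adapted mean approaches the data mean, mirroring the shrinkage implied by the prior in (3)–(5).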

Here, D is a diagonal matrix, τ is the adaptation relevance factor [13] and Σ is the CD × CD diagonal matrix constructed from the C diagonal covariance matrices of the components of the UBM. Enrolment thus involves MAP estimation of the latent factors in z_i. Note that (5) involves an element-wise square root, rather than a full matrix square root. Unfortunately, using the baseline approach, the model s_i will likely be suboptimal when enrolment data is limited, because it is sensitive to variability within the subject's enrolment images due to e.g. illumination, expression or pose.

B. Inter-session variability modelling (ISV)

ISV explicitly models the variation between different images (sessions) of the same subject and compensates for this variation during both the enrolment and probe phases. Importantly, the conditions of each particular session are assumed to result in a different additive offset to the subject's model within a low-dimensional subspace of GMM mean supervector space [15]. The ISV model is thus given by

µ_{i,j} = m + u_{i,j} + d_i    (6)
u_{i,j} = U x_{i,j}    (7)


where u_{i,j} is the session-dependent offset for the j'th image of subject i, and µ_{i,j} is the resulting mean supervector of the GMM that best represents the image O_{i,j}. The subject-dependent offset d_i is defined as before (3). The low-dimensional (n_U) session variability subspace U is learned from a large training set using an expectation-maximisation (EM) algorithm, where the session-dependent factors are assigned standard normal priors, x_{i,j} ~ N(0, I) [15]. To enrol subject i using J images, the subject model s_i = m + d_i is found using (6) through MAP estimation of the factors z_i and x_{i,j} ∀j = 1...J. In contrast to (1), the score for a probe image O_t is then

h(O_t, s_i) = log ∏_{k=1}^{K} [ p(o_t^k | s_i + u_t) / p(o_t^k | m + u_t) ],    (8)

where u_t is the MAP estimate of the session-dependent offset for image O_t with respect to the UBM.

C. Joint factor analysis (JFA)

In a similar manner to ISV, JFA aims to explicitly model the effects of within-class variation in a low-dimensional subspace U. However, JFA extends on ISV by also capturing the majority of between-class variation in a second, low-dimensional (n_V) subspace, V, with the motivation that this should result in more reliable subject models in the context of limited enrolment data [16]. JFA thus differs from ISV only in the definition of the subject offset d_i in (6), replacing (3) with

d_i = V y_i + D̂ z_i,    (9)

where y_i ~ N(0, I) is a vector of subject-dependent factors and the diagonal matrix D̂ is learned via EM on the training data. The subject space V is learned prior to the session subspace U, and finally D̂ is trained to capture any residual variation, following [17]. Enrolment involves MAP estimation of the factors z_i and x_{i,j} ∀j = 1...J and, additionally, the subject-dependent factors in y_i. As in ISV, the result of enrolment is the subject model s_i = m + d_i, but now d_i is given by (9) instead of (3). Finally, scoring is performed in the same way as in ISV (8).

D. Total variability modelling (TVM)

Recent work has shown that factor analysis can fail to separate between-class variation from within-class variation in the high-dimensional GMM mean supervector space. In particular, [6] showed that session-dependent factors u_{i,j} estimated using JFA, which are supposed to model only within-class variability, also contained discriminatory information about subjects. This leads to suboptimal performance because the proportion of useful between-class variation that lies within the session subspace U is discarded by the


model. To address this issue for speaker verification, total variability modelling (TVM) was proposed [6], [7]. TVM utilises factor analysis simply as a front-end processing step to extract a low-dimensional i-vector, w_{i,j}. In contrast to (6), the supervector representation for TVM is

µ_{i,j} = m + T w_{i,j}    (10)

where the low-dimensional (n_T) total variability subspace T is learned via factor analysis to maximise the likelihood of a large training data set, in a similar manner to the subspaces of ISV and JFA, and w_{i,j} ~ N(0, I). A matrix Σ_T is also estimated, which is a diagonal covariance matrix of dimension CD × CD that models the residual variability in the observed feature vectors that is not captured by the total variability matrix T [7]. These parameters are estimated with the same procedure used to estimate V for JFA in [18], with the difference that each image in the training data set is treated as if it is of a distinct subject. A subject is enrolled by finding the MAP estimate of a single i-vector w_i using (10) given the subject's enrolment data. Specifically, i-vector w_i is extracted using the centralised zeroth-order and first-order Baum-Welch statistics of the enrolment data (N_i and F_i respectively) [18], using

w_i = (I + T' Σ_T^{-1} N_i T)^{-1} T' Σ_T^{-1} F_i.    (11)
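An illustrative sketch of the i-vector extraction in (11) follows; the shapes are assumptions (T is (C·D, n_T), the diagonal of Σ_T is stored as a (C·D,) vector, N_c holds the C per-component occupation counts, and F is the (C·D,) centralised first-order statistics supervector):

```python
# Sketch of i-vector extraction, Eq. (11), exploiting the diagonal structure
# of Sigma_T and N_i to avoid forming large C*D x C*D matrices.
import numpy as np

def extract_ivector(T, sigma_diag, N_c, F, D):
    """w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 F."""
    n_T = T.shape[1]
    N_diag = np.repeat(N_c, D)              # expand counts along each dimension
    TtSinv = T.T / sigma_diag               # T' Sigma^-1, shape (n_T, C*D)
    precision = np.eye(n_T) + (TtSinv * N_diag) @ T
    return np.linalg.solve(precision, TtSinv @ F)
```

Only the small n_T × n_T system is solved at probe time, which underlies the computational advantage of TVM discussed later in the paper.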

Similarly, the first step of verification of a probe image O_t is to extract the i-vector w_t using the corresponding statistics of the probe image. The difference between the ISV, JFA and TVM paradigms lies in the kinds of variation captured by each subspace. The subspace U in ISV and JFA aims to capture directions of within-class variation, while for JFA V captures between-class variation. In contrast, the total variability space T simply aims to capture the principal directions of between-image variation. The TVM approach does not, therefore, incorporate session compensation in the factor analysis process. Rather, factor analysis is used simply to extract a low-dimensional representation of each image in the form of an i-vector. I-vectors in their raw form thus capture both the important subject-specific information needed for discrimination, as well as detrimental session variability. For this reason, session compensation and scoring are performed as separate processes after i-vector extraction, as described in the following section.

IV. SESSION COMPENSATION AND SCORING IN THE I-VECTOR SPACE

As this is the first work to apply i-vectors to face verification, this section gives a brief overview of the session compensation and scoring techniques that are typically applied to i-vectors in state-of-the-art speaker verification. The usage of these techniques has been summarised in Figure 1. As shown in


the figure, whitening and I-norm may be employed as i-vector preprocessing techniques. Then, scoring proceeds via one of two alternative processing chains, that is, either using probabilistic linear discriminant analysis (PLDA) scoring or using cosine scoring. Prior to cosine scoring, linear discriminant analysis (LDA) and/or within-class covariance normalisation (WCCN) may be employed to improve between-class separation of i-vectors. These processing steps are explained in more detail below.

A. Linear discriminant analysis (LDA)

LDA aims to find a transformation that minimises within-class variation while simultaneously maximising between-class variation. From a training data set of images of numerous subjects with multiple images per subject, LDA operates on i-vectors w_{i,j} extracted from those images, by first estimating within-class and between-class scatter matrices,

S_W = Σ_i Σ_j (w_{i,j} − w̄_i)(w_{i,j} − w̄_i)'    (12)
S_B = Σ_i N_i (w̄_i − w̄)(w̄_i − w̄)',    (13)

where N_i is the number of i-vectors from subject i, w̄_i is the mean of these i-vectors, and w̄ is the mean of all i-vectors in the training data set. The LDA transform, A, is given by the n_LDA eigenvectors with the greatest eigenvalues from the eigen decomposition of S_B v = λ S_W v. An i-vector, w, is projected into the LDA space using

w ← A' w.    (14)
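The scatter estimation in (12)–(13) and the projection in (14) can be sketched as follows; the input format is an assumption (`classes` is a list with one (N_i, d) array of i-vectors per training subject):

```python
# Sketch of LDA on i-vectors, Eqs. (12)-(14).
import numpy as np

def lda_transform(classes, n_lda):
    all_w = np.vstack(classes)
    w_bar = all_w.mean(axis=0)
    d = all_w.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for W in classes:
        mu = W.mean(axis=0)
        Xc = W - mu
        S_W += Xc.T @ Xc                                   # Eq. (12)
        S_B += len(W) * np.outer(mu - w_bar, mu - w_bar)   # Eq. (13)
    # generalised eigenproblem S_B v = lambda S_W v
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_lda]].real                     # columns of A

# projection as in Eq. (14): w_lda = A.T @ w
```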

B. Source-normalised linear discriminant analysis (SN-LDA)

An extension of LDA that has recently been proposed in the field of TVM for speaker verification is source-normalised LDA (SN-LDA) [8]. It was designed to improve scatter matrix estimation in the context of a training data set that provides an incomplete representation of within-class variation. A source is defined as any aspect of the training data set, typically labelled, that contributes to differences between images of the same subject. In [8] the chosen source labels were telephone speech and microphone speech. SN-LDA was shown to provide significant improvements over LDA when the amount of data from each of the sources was not balanced and when the subjects in the training data set did not provide samples from all sources. Specifically, SN-LDA proposes to replace the between-class scatter matrix of


(13) with a source-normalised version,

S_B = Σ_src S_B^src    (15)
S_B^src = Σ_i N_i^src (w̄_i^src − w̄^src)(w̄_i^src − w̄^src)',    (16)

where w̄_i^src is the average of the N_i^src i-vectors from subject i in source src, while w̄^src is the average of the i-vectors from source src across all subjects. The within-class variation is then estimated by subtracting the between-class scatter from the total variance, that is,

S_W = S_T − S_B    (17)
S_T = Σ_i Σ_j (w_{i,j} − w̄)(w_{i,j} − w̄)'.    (18)

Finally, the transform is estimated from S_B and S_W in the same way as LDA, then applied to i-vectors using (14). We propose to apply SN-LDA to face verification by defining the source as the database from which an image originated. Our hypothesis is that, for a database with a limited amount of well-matched training data, SN-LDA will allow for the exploitation of disparate data sets to better estimate the directions of within- and between-class variation and thus improve verification accuracy. In Section VII-C, we test this hypothesis by using SN-LDA with an extended training set of images from multiple databases.
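The source-normalised scatter estimation of (15)–(18) can be sketched as below; the data layout is an assumption (`data[source][subject]` is an (N, d) array of that subject's i-vectors from that source, with source here being the originating database):

```python
# Sketch of SN-LDA scatter estimation, Eqs. (15)-(18).
import numpy as np

def snlda_scatters(data):
    all_w = np.vstack([W for subjects in data.values() for W in subjects.values()])
    w_bar = all_w.mean(axis=0)
    d = all_w.shape[1]
    S_B = np.zeros((d, d))
    for subjects in data.values():
        src_mean = np.vstack(list(subjects.values())).mean(axis=0)   # source mean
        for W in subjects.values():
            mu = W.mean(axis=0)
            S_B += len(W) * np.outer(mu - src_mean, mu - src_mean)   # Eqs. (15)-(16)
    Xc = all_w - w_bar
    S_T = Xc.T @ Xc                                                  # Eq. (18)
    S_W = S_T - S_B                                                  # Eq. (17)
    return S_W, S_B
```

With a single source this reduces to the standard between-class scatter of (13), with the within-class scatter estimated indirectly via (17).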

C. Within-class covariance normalisation (WCCN)

WCCN [19] aims to find a transform that normalises the within-class covariance matrix of a training set of i-vectors to the identity matrix. WCCN was originally developed in the context of SVM-based speaker verification and has since proven effective when applied to i-vectors in the TVM framework [9]. The WCCN transform B is found using the Cholesky decomposition of (1/N · S_W)^{-1} = B B', where N is the number of subjects in the training data set and the scatter matrix S_W is defined by (12). Finally, WCCN is applied to an i-vector using

w ← B' w.    (19)
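A minimal sketch of the WCCN transform follows, with the input layout an assumption (`classes` is a list of per-subject i-vector arrays):

```python
# Sketch of WCCN, Eq. (19): B is obtained from the Cholesky factorisation of
# the inverted, averaged within-class scatter.
import numpy as np

def wccn_transform(classes):
    d = classes[0].shape[1]
    S_W = np.zeros((d, d))
    for W in classes:
        Xc = W - W.mean(axis=0)
        S_W += Xc.T @ Xc                           # Eq. (12)
    cov = S_W / len(classes)                       # (1/N) S_W
    B = np.linalg.cholesky(np.linalg.inv(cov))     # (1/N S_W)^-1 = B B'
    return B

# applied as in Eq. (19): w <- B.T @ w; afterwards B' (S_W / N) B = I
```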

D. Use of both LDA and WCCN

The use of LDA followed by WCCN was shown in [7] to be effective for speaker verification. In this case, the LDA transform, A, is first estimated and applied as described in Section IV-A. Alternatively, the SN-LDA transform (Section IV-B) may be used in its place. The within-class scatter matrix (12) is


then calculated on the (SN-)LDA-transformed i-vectors (A' w) and used to derive the WCCN transform B as described in Section IV-C. Together, these two steps apply the transformation

w ← B' A' w.    (20)

E. Cosine distance scoring

The aforementioned approaches to session compensation are employed prior to verification using cosine distance scoring [7]. This straightforward scoring technique essentially finds the angle between two i-vectors using the cosine kernel. For verification, the i-vector representing the probe image, w_t, is compared to the subject's model (i-vector) w_i as follows:

s_cosine(w_t, w_i) = (w_t · w_i) / (‖w_t‖ ‖w_i‖).    (21)

The smaller the angle between w_t and w_i, the greater the score will be. For this reason, s_cosine is used as a measure of similarity between the i-vectors.

1) Cosine kernel normalisation (C-norm): Recently, [10] proposed an alternative scoring method using a normalised cosine kernel. In particular, C-norm aims to centre the i-vector space with respect to an impostor i-vector distribution to improve verification accuracy. Just prior to scoring, the i-vectors w_t and w_i are length-normalised using ŵ = w/‖w‖. A normalised score is then produced via

s_cosine(w_t, w_i) = [ (ŵ_t − w̄_imp)' (ŵ_i − w̄_imp) ] / [ ‖C_imp^{1/2} ŵ_t‖ ‖C_imp^{1/2} ŵ_i‖ ],    (22)

where w̄_imp is the mean of a training data set of length-normalised i-vectors and C_imp is a diagonal matrix that contains the square root of the diagonal covariance matrix of those i-vectors. C-norm alleviates the need for the widely adopted ZT-norm used to normalise the likelihood ratio scores output by previous techniques including ISV and JFA [4], [5], [20]. C-norm differs from ZT-norm in three major aspects. Firstly, it is performed in the cosine kernel space instead of being separately applied to a set of scores. Secondly, it is symmetric with respect to the model w_i and probe w_t. Finally, statistics for C-norm may be calculated before a probe image is seen because, unlike ZT-norm, the required statistics are independent of the probe image (represented by w_t), thus reducing the computational burden of the normalisation at probe time.

F. Probabilistic linear discriminant analysis (PLDA)

PLDA is a probabilistic model used to calculate the likelihood ratio that two images are of the same subject [21]. In contrast to the previously described cosine distance scoring and session compensation


techniques, PLDA incorporates both session compensation as well as a probabilistic scoring framework. Consequently, LDA, SN-LDA, WCCN and C-norm are not applied in conjunction with PLDA scoring, as shown in Figure 1. PLDA assumes that the j'th i-vector from subject i can be modelled by

w_{i,j} = F h_i + G k_{i,j} + ε_{i,j},    (23)

where subspaces F and G (with dimensionalities of n_F and n_G) contain the major directions of subject and session variability, respectively, while ε_{i,j} represents the residual variation (noise) and has diagonal covariance Σ_ε. The factors h_i and k_{i,j} describe the subject and session information, respectively, from the i-vector in the corresponding subspaces and have a standard normal prior distribution. Similar to JFA, model parameters θ = {F, G, Σ_ε} are learned via the EM algorithm over a training data set. PLDA compares probe and model i-vectors, w_t and w_i, by computing a likelihood ratio,

s_PLDA(w_t, w_i) = P(w_t, w_i | H_tar) / [ P(w_t | H_imp) P(w_i | H_imp) ]    (24)

where P(w_t, w_i | H_tar) is the likelihood given the target hypothesis H_tar, that is, that the two i-vectors represent images of the same subject, while P(w | H_imp) is the likelihood given the impostor hypothesis, that is, that the two i-vectors are from different subjects. Readers are referred to Section 3 of [12] and Section 2.2 of [21] for details on calculating these likelihoods from the PLDA model.

V. PREPROCESSING OF I-VECTORS

Recent progress in speaker verification has seen the development of i-vector preprocessing techniques, including whitening and i-vector length normalisation (I-norm), which aim to map i-vectors to a space that better fits the Gaussian assumptions of the PLDA model. They are briefly described below before being applied to face verification in Section VII.

A. Whitening

Whitening of the i-vector space has been found to improve PLDA modelling of i-vectors [12]. By transforming the covariance of the i-vectors into an identity matrix, whitening normalises the directions that would otherwise dominate the i-vector space. Given the mean of a set of training i-vectors, w̄, whitening is carried out using

w ← (w − w̄) W,    (25)

where the whitening transform, W, is found by solving the Cholesky decomposition of Σ_tr^{-1} = W W' and Σ_tr is the estimated covariance from a training set of i-vectors.
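The whitening step of (25) can be sketched as below, here written for i-vectors stored as the rows of a training matrix; the function names are illustrative, not from the paper:

```python
# Sketch of i-vector whitening, Eq. (25): centred i-vectors are mapped so
# that their sample covariance becomes the identity, with Sigma_tr^-1 = W W'.
import numpy as np

def fit_whitener(train):
    w_bar = train.mean(axis=0)
    Sigma = np.cov(train, rowvar=False)             # Sigma_tr
    W = np.linalg.cholesky(np.linalg.inv(Sigma))    # Sigma_tr^-1 = W W'
    return w_bar, W

def whiten(w, w_bar, W):
    return W.T @ (w - w_bar)                        # column-vector form of (25)
```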


B. I-vector length normalisation (I-norm)

A recent study [11] found that the distribution of i-vector length can differ between system training and evaluation data sets. It was proposed, therefore, to counteract this mismatch by normalising the length of i-vectors via

w ← w / ‖w‖.    (26)

This process can be viewed as a mapping of i-vectors to a unit hypersphere, which is closer to the distribution assumed by the PLDA model.

VI. DATABASES AND EXPERIMENTAL PROTOCOLS

For evaluation of the proposed techniques, we chose to use only publicly-available databases with well-defined protocols for training, development and test sets that contain images from disjoint sets of subjects. Unfortunately, some popular databases such as FRGC [22] and LFW [23] were thus not applicable, as they do not include separate development and test sets². The databases described in the remainder of this section were selected due to their challenging conditions and relevance to forensics and security applications. The protocols and annotations used have been made available online for reproducibility³.

A. SCface

The Surveillance Cameras Face Database (SCface) [24] contains images of 130 subjects acquired with commercially available surveillance equipment. This makes it especially interesting for forensics, law enforcement and surveillance use case scenarios. As in [4], [5], experiments follow the Face Authentication Protocol based on the DayTime Tests. According to this protocol, facial images taken by five surveillance cameras at three specified distances (close, medium, far) are compared to a model trained using a single high-resolution mugshot image. As shown in Figure 2, there is dramatic session variability amongst the mugshot and surveillance images. All training data (688 images of 43 subjects) was used for C-norm and to train the subspaces. As in [4], ZT-norm cohort sets were formed from one third of the training data while the rest was used for UBM training. Low resolution images were upsampled where necessary for registration.

²In the FRGC database, 153 subjects occur in both the training set as well as the test set, and there is no publicly-available development set. In the LFW database, 758 image pairs in the training/development set (View 1) are exactly repeated in the test set (View 2).

³http://www.idiap.ch/resource/biometric


B. MOBIO

The MOBIO database [25] contains videos of 150 subjects acquired in real-world conditions using mobile phone cameras over a year and a half period and, therefore, contains considerable session variation as shown in Figure 2. It is highly relevant to the goal of securing mobile services by biometric authentication means. As in [4], [5], experiments use the MOBIO still-image database and protocol, which consists of one image extracted from each video with manually annotated eye locations. The development and test sets are partitioned in a gender-dependent manner, where the partitions are referred to as MOBIO.mal and MOBIO.fem for males and females respectively. We use the training data in a gender-independent manner to be consistent with prior work and the other databases. The whole training data set (9,579 images of 50 subjects) was used for C-norm and subspace training. For UBM training and ZT-norm, the training data set was split into two thirds and one third, respectively, as in [4].

C. Multi-PIE

Multi-PIE is a new, large database that contains images of 337 subjects captured in up to four different sessions with various combinations of illumination, pose and expression [26]. In this work we follow the Multi-PIE Face Verification Protocol – Unmatched Illumination. In this protocol, subject models are each enrolled using a single image captured with no camera flash, while probe images are drawn from images taken over a six month period with flashes originating from various angles. This results in substantial session variability due to illumination change, as depicted in Figure 2. In each of the development and test sets, there are around 5,000 target trials and 300,000 impostor trials. The entire training set (9,785 images of 208 subjects) was used to train subspaces and C-norm statistics, while a subset of each of those subjects' images (totalling 1,326 images) was used for UBM training.
For ZT-norm, development set scores were used to normalise the test set scores and vice-versa. For consistency with the other databases, manually-annotated eye positions were used for registration.

D. Extended training data set

In Section VII-B we explore the use of extended training data from multiple databases. For this purpose we defined a training data set referred to as BMFM, composed of images from the training sets of the BANCA [27], MOBIO, FRGC [28] and Multi-PIE databases. The result is a training set of 32,440 images of 510 subjects that contains a wide range of subject and session variability. When using BMFM, a UBM was trained from a randomly-selected subset of 1,275 images of 425 subjects (3 images each), while the full BMFM training set was used for subspace training and C-norm.

DRAFT

VII. RESULTS

In this section we evaluate the accuracy of the proposed total variability modelling approach and compare it to that of prior related approaches. A thorough comparison of i-vector session compensation and score normalisation techniques is first presented in Section VII-A. The effects of using extended training data are presented in Section VII-B. Source-normalisation and i-vector preprocessing results are presented in Sections VII-C and VII-D, respectively. Finally, and most importantly, Section VII-E compares the accuracy of the proposed TVM system to state-of-the-art GMM-based approaches to face verification.

To measure accuracy, we first find the decision threshold that yields the equal error rate (EER) on the development set. This threshold is then applied to scores from the unseen test set to obtain a half total error rate (HTER), which is the average of the false acceptance and false rejection rates at that threshold. To evaluate the statistical significance of improvements in HTER, we use the methodology proposed in equation (15) and Figure 2 of [29], with a one-tailed test.

In the following experiments, all images were first registered and cropped to 64 × 80 resolution as in [4]. Manually annotated eye locations were used for registration to allow for a thorough comparison of modelling methods without bias from mislocalisation error. For the MOBIO and Multi-PIE databases, images were preprocessed using Tan & Triggs normalisation [30] before extracting D = 45 DCT coefficients from 12 × 12 blocks of pixels during feature extraction. For SCface, D = 66 coefficients were extracted from 20 × 20 blocks. For GMM modelling we use C = 512 components and a relevance factor of τ = 4. Tuning of all other system parameters, including the subspace dimensionalities nU, nV, nT, nLDA, nF and nG, was consistently based on minimising the EER on the development set only.
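The evaluation protocol described above (tune the decision threshold at the development-set EER, then report the HTER on the unseen test set) can be sketched as follows. This is our own illustrative implementation; the function names and the synthetic scores are not taken from the paper.

```python
import numpy as np

def eer_threshold(scores, labels):
    """Find the threshold at which the false acceptance rate (FAR) and the
    false rejection rate (FRR) are closest on a development set.
    labels: 1 for target (genuine) trials, 0 for impostor trials."""
    best_t, best_gap = scores.min(), np.inf
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # impostors accepted
        frr = np.mean(scores[labels == 1] < t)    # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def hter(scores, labels, threshold):
    """Half total error rate: average of FAR and FRR at a fixed threshold."""
    far = np.mean(scores[labels == 0] >= threshold)
    frr = np.mean(scores[labels == 1] < threshold)
    return 0.5 * (far + frr)

# Toy example: the threshold is tuned on dev scores only, then applied
# unchanged to the test scores.
rng = np.random.default_rng(0)
dev_scores = np.concatenate([rng.normal(2, 1, 500), rng.normal(0, 1, 500)])
dev_labels = np.concatenate([np.ones(500), np.zeros(500)]).astype(int)
t = eer_threshold(dev_scores, dev_labels)
test_scores = np.concatenate([rng.normal(2, 1, 500), rng.normal(0, 1, 500)])
print(hter(test_scores, dev_labels, t))
```

Because the threshold is frozen on the development set, the test-set HTER exceeds the test-set EER whenever the score distributions shift between the two sets, which is exactly the dev-to-test generalisation gap discussed in Section VII-E.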
The dimensionality of the total variability space nT was tuned up to a maximum of 400, as preliminary experiments suggested that higher values led to only marginal gains.

A. Session compensation and normalisation

Table I reports the verification error rates achieved on the SCface, Multi-PIE and MOBIO databases. The first row of the table indicates error rates obtained using the raw i-vectors with cosine distance scoring. Comparatively, and focusing initially on the SCface and Multi-PIE results, the session compensation techniques of PLDA, LDA and WCCN were all very effective, reducing the test set HTER from 23.5% down to 15.7% on SCface, and from 8.3% to 2.8% on Multi-PIE. Furthermore, LDA and WCCN were complementary on the Multi-PIE database. For these databases, the best technique was the combination of LDA and WCCN plus C-norm for normalised cosine distance scoring. Figure 3 shows the effect of tuning the subspace dimensionalities nT and nLDA for this system on the

SCface database, where nT = 300 and nLDA = 40 were found to minimise the EER on the development set (13.2%). It was clearly beneficial to set nLDA to higher values, approaching the maximum of 42 (one less than the number of subjects in the training set).

On the MOBIO database, the results of session compensation followed a somewhat different trend. In particular, WCCN was effective as before; however, the use of LDA or PLDA was detrimental. This trend was consistent for both male and female trials, and suggests that when LDA or PLDA was used to learn the variability in the MOBIO training set, the result did not generalise well to the disjoint sets of subjects in the development and test sets. As the MOBIO database contains images captured by handheld mobile devices, this could conceivably have been caused by idiosyncratic behaviour of subjects while using the mobile device, either in the positioning of the device or in the environment in which it was used. It is also worth noting that WCCN, which was helpful on MOBIO, deals only with normalising the within-class scatter, while the LDA and PLDA techniques additionally estimate the between-class scatter. The results in Table I suggest that, for the MOBIO database, it is difficult to deal with both simultaneously. In Section VII-B we explore whether this problem can be addressed by using extended training data to learn a broader range of variability when training the subspaces.

Table II compares the use of C-norm to ZT-norm. As described in Section IV, both techniques perform normalisation using statistics from background training data. When using i-vectors in their raw form (i.e. without session compensation), ZT-norm provided some improvement whereas C-norm had little effect. In contrast, when combined with LDA and WCCN session compensation, C-norm alone was effective, making ZT-norm redundant.
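As a concrete illustration of the WCCN technique discussed above, the sketch below (our own minimal implementation following the Cholesky construction of Hatch et al. [19]; variable names are ours) trains a WCCN projection from labelled training i-vectors and applies cosine scoring to the projected vectors.

```python
import numpy as np

def within_class_cov(ivecs, labels):
    """Average within-class covariance of the training i-vectors."""
    dim = ivecs.shape[1]
    W = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        X = ivecs[labels == c]
        Xc = X - X.mean(axis=0)
        W += Xc.T @ Xc / len(X)
    return W / len(classes)

def wccn_transform(ivecs, labels):
    """WCCN projection B from the Cholesky decomposition B B' = W^-1.
    Projecting i-vectors with B' makes the within-class covariance identity."""
    W = within_class_cov(ivecs, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(w_model, w_probe):
    """Cosine distance score between two (projected) i-vectors."""
    return float(w_model @ w_probe /
                 (np.linalg.norm(w_model) * np.linalg.norm(w_probe)))
```

At verification time, both the subject model i-vector and the probe i-vector are projected as `B.T @ w` (after any LDA projection) before `cosine_score` is computed.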
This finding is consistent with previous work in the speaker verification field, where C-norm was found to outperform ZT-norm in challenging trial conditions [10].

B. Use of extended training data

In the previous section, we showed that the use of LDA or PLDA reduced performance on the MOBIO database. We hypothesised that this was due to a lack of data in the MOBIO training data set, which led to poor generalisation to disjoint sets of subjects. To test this hypothesis, we increased the quantity of training data using three other databases to create the BMFM training data set, as described in Section VI-D. The other databases were selected for their similarities to MOBIO; specifically, they contain close-to-frontal facial images affected by pose and illumination variation. It should be noted, however, that they are by no means a perfect match and that, therefore, the experiments in this section serve as a test of how useful it is to leverage training data from disjoint databases for face verification

using TVM. In the following experiments, the BMFM data was used to retrain all system parameters, including the UBM, the total variability subspace, the PLDA, LDA and WCCN transforms, and the C-norm statistics.

Results when evaluating MOBIO and Multi-PIE using the BMFM extended training data set are presented in Table III. Interestingly, the results in the first row of the table indicate that the BMFM extended training data was not helpful when the i-vectors were used directly with cosine distance scoring. This suggests that the extended training data did not improve the estimation of the total variability subspace T for these databases. For example, the HTER for MOBIO females increased from 15.7% to 16.3%. In contrast, the BMFM extended training data substantially improved the performance of LDA and PLDA on the MOBIO trials, with the combination of LDA + WCCN + C-norm now providing the lowest error rates (last row of Table III). For MOBIO females, the extended training data reduced the HTER for this configuration from 20.6% to 13.1%.

These trends suggest that an extended training data set can be useful when the database-specific training data is insufficient. To probe this further, we repeated the experiments using the BMFM data on the Multi-PIE database. As shown in Table III, the extended training data was not helpful when applied to Multi-PIE, resulting in a minimum of 3.4% HTER compared to 2.3% using just the Multi-PIE training data. This suggests that the Multi-PIE training data set was sufficiently large to train the system parameters without the need for an extended training data set (it contains over 200 subjects compared to MOBIO's 50), and/or that the additional data in BANCA and FRGC are a better match to the MOBIO test data than to the Multi-PIE test data.

C. Source-normalised LDA

As described earlier, source-normalised LDA (SN-LDA) can provide an improved estimate of within- and between-class variation when training on disparate data sets. Here we use SN-LDA to train an LDA transform on the BMFM extended training set, defining the source as the database from which an image originated. Table IV compares SN-LDA to traditional LDA on the MOBIO and Multi-PIE databases. When C-norm was not used, SN-LDA consistently reduced the error rates. When used in conjunction with C-norm, the benefit of SN-LDA over LDA was less evident. Nonetheless, SN-LDA produced the best performance so far for an i-vector system on the MOBIO test set, for both male and female trials.
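To make the source-normalisation idea concrete, the sketch below reflects our reading of [8]: the between-class scatter is accumulated around each source's mean rather than the global mean, and the within-class scatter is estimated as the residual of the total scatter. Names and implementation details are ours, not a reference implementation.

```python
import numpy as np

def snlda_projection(ivecs, subjects, sources, n_dims):
    """Sketch of SN-LDA. Class means are centred on their source mean when
    accumulating the between-class scatter Sb, and the within-class scatter
    is estimated as the residual St - Sb (following [8])."""
    mu = ivecs.mean(axis=0)
    Xc = ivecs - mu
    St = Xc.T @ Xc                      # total scatter
    Sb = np.zeros_like(St)
    for src in np.unique(sources):
        in_src = sources == src
        mu_src = ivecs[in_src].mean(axis=0)
        for s in np.unique(subjects[in_src]):
            X = ivecs[in_src & (subjects == s)]
            d = (X.mean(axis=0) - mu_src)[:, None]
            Sb += len(X) * (d @ d.T)
    Sw = St - Sb                        # source-normalised within-class scatter
    # Leading generalised eigenvectors of (Sb, Sw) form the projection
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_dims]]
```

Because each class mean is compared against its own source mean, a systematic offset between databases (e.g. MOBIO versus FRGC imaging conditions) no longer inflates the between-class scatter, which is the failure mode of plain LDA on pooled training sets.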

D. Preprocessing of i-vectors

Table V shows the effect of applying the i-vector preprocessing techniques discussed in Section V, namely I-norm and whitening. In the absence of session compensation, whitening reduced the HTER across all databases, e.g. a 15% relative reduction on the SCface database. I-norm results are not presented for the raw TVM system since I-norm is performed intrinsically when using raw i-vectors with cosine distance scoring.

When session compensation techniques were applied after preprocessing, the benefit of preprocessing was less consistent. I-norm provided some improvement in the majority of PLDA experiments, while its benefits with other session compensation techniques appear limited. These trends are likely explained by the fact that I-norm was developed to assist PLDA-based modelling in speaker verification research [11], and by the intrinsic length normalisation applied in cosine distance scoring (21). Similarly to I-norm, whitening offered a marginal improvement when used in conjunction with PLDA but provided no gain in most cases when used with cosine distance scoring. When applied in conjunction with LDA + WCCN + C-norm (i.e. the combination of session compensation techniques found to be optimal in the previous sections), there was no clear added benefit from either of the i-vector preprocessing techniques.

E. Comparison of TVM to prior approaches

Table VI compares the accuracy of the proposed TVM system to state-of-the-art GMM-based approaches to face verification. In particular, we compare to a standard baseline GMM system using linear scoring (see Section II), with and without ZT-norm score normalisation. Furthermore, we compare our TVM results to the latest approaches presented in [5], that is, inter-session variability modelling (ISV) and joint factor analysis (JFA), as described in Section III.
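The two preprocessing steps evaluated in Section VII-D can be sketched as follows: whitening with the inverse square root of the training i-vector covariance, followed by length normalisation (I-norm) onto the unit hypersphere. This is a minimal version under the assumption of a full-rank training covariance; the names are ours.

```python
import numpy as np

def train_whitener(train_ivecs):
    """Whitening parameters from training i-vectors: mean and cov^(-1/2)."""
    mu = train_ivecs.mean(axis=0)
    cov = np.cov(train_ivecs, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    P = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse sqrt
    return mu, P

def preprocess(w, mu, P):
    """Whiten an i-vector, then length-normalise it (I-norm)."""
    w = P @ (w - mu)
    return w / np.linalg.norm(w)
```

After this transform the training population has identity covariance and every i-vector has unit length, which is the operating point assumed by the PLDA experiments that benefit from these steps.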
The results in Table VI show that the TVM system provided test-set accuracy that is competitive with previous approaches across all databases. In particular, all of the TVM results were significantly better than the baseline GMM system (even with ZT-norm) at a confidence level in excess of 99.7%. In fact, TVM reduced the HTER by 17%, 30%, 39% and 26% relative for the SCface, Multi-PIE, MOBIO.mal and MOBIO.fem databases, respectively. Furthermore, TVM provided the best result by a substantial margin on the test set of MOBIO.mal, even beyond that of ISV and JFA, an improvement that is statistically significant at a level greater than 99.9%.

Table VI also shows that the TVM system reduced the gap between development set EER and test set HTER, compared to ISV/JFA. In particular, the relative increase in error rate from EER to HTER when using the ISV/JFA system with the best EER was 13%, 37%, 102% and 65% for SCface, Multi-PIE,

MOBIO.mal and MOBIO.fem, respectively. The corresponding degradation when using the TVM system was substantially smaller across the same databases: 3%, 11%, 26% and 51%. These results suggest that the decision threshold and other parameters tuned to minimise development set EER generalise comparatively well to the unseen test data when using the TVM system.

Also noteworthy from Table VI is the tendency of the baseline, ISV and JFA systems to benefit from ZT-norm, with the exception of MOBIO.fem. Statistics for ZT-norm score normalisation must be calculated at probe time, which is not the case for the C-norm technique used in conjunction with the TVM system. This has implications in terms of computational complexity, which should be taken into account according to the application. According to Table VI, excluding ZT-norm in order to reduce probe-time computation would leave TVM with the best accuracy on the SCface database, 13.6% HTER (an improvement with respect to ISV that is statistically significant at a level of 89%), and very close to the best accuracy, 2.3% HTER, on the Multi-PIE test set. The following section further explores the computational benefits of TVM through an analysis of its complexity compared to the alternative modelling approaches.

VIII. COMPLEXITY

In the previous section, system performance was reported in terms of verification error rate. This section focuses on another important aspect: system complexity. Table VIIa details the number of model parameters for each modelling approach with respect to subspaces, transforms and subject models. One significant benefit of the TVM system is that it uses far fewer parameters to describe each subject model, i.e. a single nT-dimensional i-vector, compared to a CD-dimensional vector z_i plus an additional nV-dimensional speaker factor vector y_i for JFA. This compact representation reduces the number of parameters to estimate per enrolled subject (e.g. from more than 33,000 to just 300 for SCface). While this represents only one aspect of system complexity, this difference may be important in applications with limited enrolment data, helping to avoid the curse of dimensionality.

Table VIIb highlights the differences between systems in terms of the subspaces and transforms to train, the factors to estimate during enrolment and the computations performed at probe time. One of the most significant advantages of the TVM system is its much simpler scoring through the use of the cosine distance metric. Specifically, the final scoring calculation is performed with a single low, nLDA-dimensional matrix multiplication (after straightforward projection using the LDA and WCCN transforms and scaling using C-norm), while, in contrast, linear scoring [14] is performed in the supervector space of CD × CD

dimensions. To empirically quantify the complexity of these systems, we measured the time required to score 4,940 probe images against 65 subject models (321,100 scores in total) on the test set of the Multi-PIE database. Here, 4,864 images were used for Z-norm and 64 models for T-norm. Subspaces were trained using 10 EM iterations on the training set of 9,785 images from 208 subjects. We ran very similar, straightforward MATLAB implementations of each system, sharing code where possible, on an Intel(R) Xeon(R) E7-2830 at 2.13 GHz. Timing results are reported in CPU seconds in Table VIII.

It can be seen that the time taken to perform linear scoring was orders of magnitude greater than that for cosine distance scoring. The same observation can be made for ZT-norm compared to C-norm. Unlike ZT-norm, as used by the prior approaches, C-norm does not require repeated linear scoring calculations to compare the probe image to a cohort of other models. By using entirely precomputed statistics, determined once from a single training data set, the use of C-norm with TVM thus introduces far less computational burden at probe time than the use of ZT-norm with the previous approaches.

The aforementioned benefits of cosine distance scoring over linear scoring would be most pronounced in a face identification scenario (as opposed to face verification), in which a probe image must be compared to a large gallery of subjects, such as a national mugshot database. Specifically, after extraction of the probe i-vector and projection by the LDA and WCCN transforms, scoring against the entire gallery using TVM (with cosine distance scoring) would take a fraction of the time needed to perform the same comparisons using the other approaches (with linear scoring).

From Table VIII it can be seen that, in the online phase, the only aspect in which TVM is slower than JFA is the estimation of the latent factors.
That is, extracting an i-vector wt is much slower than estimating the session-dependent factors xt . This is because the dimensionality nT = 400 is much greater than nU = 20. If a smaller total variability subspace is used, e.g. nT = 50, the time taken for the entire verification and normalisation process for the TVM system is comparable to that of the JFA system. This, of course, would come at the cost of increased HTER. However, there have been recent developments in speech-related research that drastically reduce the time required for i-vector extraction with minimal loss of accuracy [31]. Future work should therefore evaluate the TVM optimisation techniques in [31] when applied to the face verification task.
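The latent-factor estimation that dominates TVM's probe-time cost is the standard i-vector posterior mean from [7]: w = (I + TᵀΣ⁻¹N T)⁻¹ TᵀΣ⁻¹F, where N and F are the zeroth- and first-order Baum-Welch statistics (centred on the UBM means) expanded to supervector dimension. The minimal sketch below (our notation; Σ is assumed diagonal) makes explicit why the cost grows quickly with nT: an nT × nT precision matrix must be built and solved for every image.

```python
import numpy as np

def extract_ivector(T, sigma_inv, N, F):
    """i-vector posterior mean: w = (I + T' Sigma^-1 diag(N) T)^-1 T' Sigma^-1 F.
    T: (CD, nT) total variability matrix; sigma_inv: (CD,) inverse of the
    diagonal UBM covariance supervector; N: (CD,) zeroth-order statistics
    (each component count repeated D times); F: (CD,) centred first-order
    statistics. Building and solving the nT x nT precision matrix is what
    makes extraction expensive for large nT."""
    nT = T.shape[1]
    TS = T.T * sigma_inv                  # T' Sigma^-1, shape (nT, CD)
    precision = np.eye(nT) + (TS * N) @ T
    return np.linalg.solve(precision, TS @ F)
```

The per-image cost is dominated by the O(CD · nT²) accumulation of the precision matrix and the O(nT³) solve, which is consistent with the timing gap between nT = 400 (TVM) and nU = 20 (JFA session factors) reported in Table VIII.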

IX. CONCLUSIONS

This paper presented the first work on TVM for face verification. The aim was to highlight the strengths as well as the shortcomings of this approach inspired by the speaker verification field, to provide a foundation for future work on further tailoring TVM to face recognition. The accuracy and complexity of TVM were systematically compared to leading related approaches, demonstrating that TVM generalises well from the speech domain, providing competitive accuracy as well as some important computational advantages. Through comprehensive experimentation, we evaluated numerous state-of-the-art techniques for i-vector preprocessing, session compensation, normalisation and scoring, thus contributing benchmark face verification results on several challenging databases relevant to forensics and security applications.

The results presented here suggest several directions for future work. The first is to explore alternative local feature distribution modelling, for example using pseudo-2D hidden Markov models (P2D HMMs) [3] in place of the GMMs used in this work. Secondly, while this work concentrated on recent i-vector session compensation techniques from the speaker verification field, future work should aim to develop new techniques explicitly tailored to the characteristics of session variation exhibited in facial image i-vectors. Finally, as mentioned previously, future work should evaluate recently proposed methods to speed up i-vector extraction [31] when applied to the face verification task.

With this work we have cross-pollinated the TVM approach across biometric modalities, presenting the first thorough study of the approach for face verification. TVM remains a hot topic, particularly in speaker verification, with improvements being published in rapid succession.
Therefore, there is a strong argument for continued close collaboration across the biometrics field, especially with regard to the application of TVM, into the near future.

REFERENCES

[1] C. Sanderson and K. K. Paliwal, "Fast features for face authentication under illumination direction changes," Pattern Recognition Letters, vol. 24, no. 14, pp. 2409–2419, 2003.
[2] S. Lucey and T. Chen, "A GMM parts based face representation for improved verification through relevance adaptation," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 855–861.
[3] F. Cardinaux, C. Sanderson, and S. Bengio, "User authentication via adapted statistical models of face images," IEEE Transactions on Signal Processing, vol. 54, pp. 361–373, 2006.
[4] R. Wallace, M. McLaren, C. McCool, and S. Marcel, "Cross-pollination of normalisation techniques from speaker to face authentication using Gaussian mixture models," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 553–562, 2012.

[5] R. Wallace, M. McLaren, C. McCool, and S. Marcel, "Inter-session variability modelling and joint factor analysis for face authentication," in International Joint Conference on Biometrics, 2011.
[6] N. Dehak, "Discriminative and generative approaches for long- and short-term speaker characteristics modeling: application to speaker verification," Ph.D. dissertation, École de technologie supérieure, 2009.
[7] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[8] M. McLaren and D. A. van Leeuwen, "Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors," in Proc. IEEE ICASSP, 2011, pp. 5456–5459.
[9] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proc. Interspeech, 2009, pp. 1559–1562.
[10] N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, "Cosine similarity scoring without score normalization techniques," in Proc. Odyssey Speaker and Language Recognition Workshop, 2010, pp. 71–75.
[11] D. Garcia-Romero and C. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech, 2011, pp. 249–252.
[12] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matějka, and N. Brümmer, "Discriminatively trained probabilistic linear discriminant analysis for speaker verification," in Proc. IEEE ICASSP, 2011, pp. 4832–4835.
[13] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19–41, 2000.
[14] O. Glembek, L. Burget, N. Dehak, N. Brümmer, and P. Kenny, "Comparison of scoring methods used in speaker recognition with joint factor analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 4057–4060.
[15] R. Vogt and S. Sridharan, "Explicit modelling of session variability for speaker verification," Computer Speech & Language, vol. 22, no. 1, pp. 17–38, 2008.
[16] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, May 2007.
[17] L. Burget, M. Fapšo, V. Hubeika, O. Glembek, M. Karafiát, M. Kockmann, P. Matějka, P. Schwarz, and J. Černocký, "BUT system description: NIST SRE 2008," in Proc. 2008 NIST Speaker Recognition Evaluation Workshop, 2008.
[18] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of interspeaker variability in speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 980–988, 2008.
[19] A. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Proc. Ninth Int. Conf. on Spoken Language Processing, 2006, pp. 1471–1474.
[20] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Processing, vol. 10, pp. 42–54, 2000.
[21] S. Prince and J. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[22] P. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the face recognition grand challenge," in IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 947–954.
[23] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.

[24] M. Grgic, K. Delac, and S. Grgic, "SCface – surveillance cameras face database," Multimedia Tools and Applications, vol. 51, no. 3, pp. 863–879, 2011.
[25] C. McCool and S. Marcel, "MOBIO database for the ICPR 2010 face and speech competition," Idiap Research Institute, Tech. Rep. Idiap-Com-02-2009, November 2009.
[26] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," in IEEE International Conference on Automatic Face and Gesture Recognition, 2008, pp. 1–8.
[27] Bailly-Baillière et al., "The BANCA database and evaluation protocol," in International Conference on Audio- and Video-Based Biometric Person Authentication, 2003, pp. 625–638.
[28] P. Phillips et al., "Overview of the face recognition grand challenge," in IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 947–954.
[29] S. Bengio and J. Mariéthoz, "A statistical significance test for person authentication," in Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
[30] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1635–1650, 2010.
[31] O. Glembek, L. Burget, P. Matějka, M. Karafiát, and P. Kenny, "Simplification and optimization of i-vector extraction," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4516–4519.

Roy Wallace received his BEng(Hon) in 2006 and PhD in Engineering in 2010 with the Speech, Audio, Image and Video Technologies (SAIVT) group, Queensland University of Technology (QUT), Australia. He has been involved in patent applications at QUT as well as Microsoft Research Asia, Beijing, and has performed commercial evaluations of his research. He is now a postdoctoral researcher in biometrics and machine learning at Idiap Research Institute, Switzerland, with a particular interest in the use of biometrics for forensics.

Mitchell McLaren received his BCompSysEng in 2006 and his PhD in 2010 with the Speech, Audio, Image and Video Technologies (SAIVT) group at the Queensland University of Technology (QUT), Australia. From 2010 to 2012 he carried out postdoctoral research on speaker and face recognition at the Centre for Language and Speech Technology (CLST), Radboud University Nijmegen, The Netherlands. In April 2012, he joined the Speech Technology and Research (STAR) Laboratory at SRI International, Menlo Park, USA.

TABLES

TABLE I
Comparison of session compensation techniques. (% EER on dev set, % HTER on test set).

                              SCface      Multi-PIE   MOBIO.mal   MOBIO.fem
System                        dev  test   dev  test   dev  test   dev  test
TVM                           23.6 23.5   7.9  8.3    10.0 12.3   8.8  15.7
TVM + PLDA                    14.9 16.4   4.1  3.1    13.1 13.0   14.7 24.6
TVM + LDA                     13.5 17.1   3.7  4.9    10.1 13.1   15.2 21.5
TVM + WCCN                    12.7 15.7   2.6  3.0    5.9  10.0   8.5  14.2
TVM + WCCN + C-norm           13.2 14.4   2.3  2.7    5.0  8.7    8.6  15.2
TVM + LDA + WCCN              14.6 17.4   2.4  2.8    9.1  12.9   14.4 22.3
TVM + LDA + WCCN + C-norm     13.2 13.6   2.1  2.3    8.7  12.3   14.7 20.6

TABLE II
Comparison of score normalisation techniques. (% EER on dev set, % HTER on test set).

                                            SCface      Multi-PIE
System             Score normalisation      dev  test   dev  test
TVM                —                        23.6 23.5   7.9  8.3
                   ZT-norm                  18.6 18.3   6.7  7.6
                   ZT-norm + C-norm         18.6 18.2   6.7  7.6
                   C-norm                   23.6 23.3   7.8  8.3
TVM + LDA + WCCN   —                        14.6 17.4   2.4  2.8
                   ZT-norm                  16.0 17.7   2.3  2.6
                   ZT-norm + C-norm         15.3 16.1   2.3  2.6
                   C-norm                   13.2 13.6   2.1  2.3

TABLE III
Comparison of using either the standard training data set from the same database (Std. training) or the BMFM extended training data set (Ext. training). (% EER on dev set, % HTER on test set).

                              MOBIO.mal                MOBIO.fem                Multi-PIE
                              Std.        Ext.         Std.        Ext.         Std.        Ext.
System                        dev  test   dev  test    dev  test   dev  test    dev  test   dev  test
TVM                           10.0 12.3   11.4 13.1    8.8  15.7   10.3 16.3    7.9  8.3    7.8  8.3
TVM + PLDA                    13.1 13.0   8.1  9.0     14.7 24.6   11.2 13.9    4.1  3.1    3.3  3.6
TVM + LDA                     10.1 13.1   8.7  11.1    15.2 21.5   10.0 16.8    3.7  4.9    4.3  5.5
TVM + WCCN                    5.9  10.0   7.7  10.5    8.5  14.2   10.2 15.1    2.6  3.0    3.5  4.4
TVM + WCCN + C-norm           5.0  8.7    6.6  9.2     8.6  15.2   9.9  14.2    2.3  2.7    3.2  3.9
TVM + LDA + WCCN              9.1  12.9   7.6  9.8     14.4 22.3   9.9  14.9    2.4  2.8    3.1  3.9
TVM + LDA + WCCN + C-norm     8.7  12.3   5.5  7.1     14.7 20.6   8.2  13.1    2.1  2.3    2.4  3.4

TABLE IV
Comparison of source-normalised LDA (SN-LDA) to LDA when using the BMFM extended training data set. (% EER on dev set, % HTER on test set).

                                   MOBIO.mal   MOBIO.fem   Multi-PIE
System                             dev  test   dev  test   dev  test
TVM                                11.4 13.1   10.3 16.3   7.8  8.3
TVM + LDA + WCCN                   7.6  9.8    9.9  14.9   3.1  3.9
TVM + SN-LDA + WCCN                5.6  7.4    8.6  13.7   2.7  3.3
TVM + LDA + WCCN + C-norm          5.5  7.1    8.2  13.1   2.4  3.4
TVM + SN-LDA + WCCN + C-norm       5.6  7.0    8.4  12.7   2.2  3.0

TABLE V
Comparison of i-vector preprocessing steps when optionally applied in conjunction with session compensation. The BMFM extended training data set and SN-LDA are used for MOBIO only. (% EER on dev set, % HTER on test set).

                                              SCface      Multi-PIE   MOBIO.mal   MOBIO.fem
System                   Preprocessing        dev  test   dev  test   dev  test   dev  test
TVM                      —                    23.6 23.5   7.9  8.3    11.4 13.1   10.3 16.3
                         Whitening            20.6 19.9   5.9  6.1    9.9  11.6   9.7  16.0
TVM + PLDA               —                    14.9 16.4   4.1  3.1    8.1  9.0    11.2 13.9
                         I-norm               13.9 16.0   3.4  3.5    7.7  8.6    10.5 13.0
                         Whitening            12.7 15.0   4.1  3.0    7.6  9.2    10.6 13.9
                         Whitening + I-norm   12.4 14.4   3.3  3.5    7.0  8.7    9.9  12.8
TVM + (SN)LDA + WCCN     —                    14.6 17.4   2.4  2.8    5.6  7.4    8.6  13.7
                         I-norm               13.6 15.9   2.4  2.9    5.8  7.7    8.5  13.7
                         Whitening            13.5 15.2   2.5  2.9    5.7  7.6    8.7  13.7
                         Whitening + I-norm   13.5 15.1   2.4  2.9    5.8  7.6    8.7  13.9
TVM + (SN)LDA + WCCN     —                    13.2 13.6   2.1  2.3    5.6  7.0    8.4  12.7
  + C-norm               I-norm               13.0 14.9   2.0  2.4    5.7  7.0    8.3  12.8
                         Whitening            13.1 14.6   2.1  2.4    5.5  6.8    8.6  12.7
                         Whitening + I-norm   13.0 14.8   2.0  2.4    5.6  6.8    8.5  12.7

TABLE VI
Comparison of the proposed TVM approach to the baseline, ISV and JFA systems. MOBIO results use BMFM training data and SN-LDA. (% EER on dev set, % HTER on test set).

                                       SCface      Multi-PIE   MOBIO.mal   MOBIO.fem
System                 Score norm.     dev  test   dev  test   dev  test   dev  test
GMM baseline           —               23.9 25.1   6.3  7.2    10.7 12.8   12.5 17.9
                       ZT-norm         16.7 16.4   2.7  3.3    9.1  11.4   10.9 17.2
ISV                    —               15.6 14.8   1.7  2.2    4.3  9.2    7.5  12.3
                       ZT-norm         12.8 13.0   1.4  2.0    4.3  8.6    7.9  12.6
JFA                    —               15.3 15.8   3.2  4.0    4.4  9.8    7.9  12.3
                       ZT-norm         12.0 13.5   2.0  3.1    4.3  9.3    8.2  13.2
TVM + (SN)LDA + WCCN   C-norm          13.2 13.6   2.1  2.3    5.6  7.0    8.4  12.7

TABLE VII
Comparison of the complexity of modelling approaches, in terms of number of parameters and processing.

(a) Comparison of the number of model parameters. Additionally, all systems depend on the UBM means, variances and weights (2CD + C). For SCface, e.g. C = 512, D = 66, nU = 80, nV = 10, nT = 300, nLDA = 40.

System   Subspaces & transforms                 Subject models
GMM      —                                      z_i (CD)
ISV      U (nU CD)                              z_i (CD)
JFA      V (nV CD), U (nU CD), D (CD)           y_i (nV), z_i (CD)
TVM      T (nT CD), LDA × WCCN (nLDA nT)        w_i (nT)

(b) Comparison in terms of the subspaces/transforms to train, the factors to estimate during enrolment and the probe computations. Additionally, all systems first need to train a UBM using the EM algorithm.

System   Train             Enrol                   Probe
GMM      —                 z_i                     Linear scoring; ZT-norm
ISV      U                 x_{i,j}'s, z_i          Estimate x_t; linear scoring; ZT-norm
JFA      V, U, D           x_{i,j}'s, y_i, z_i     Estimate x_t; linear scoring; ZT-norm
TVM      T, LDA, WCCN      w_i                     Estimate w_t; LDA & WCCN; cosine scoring with C-norm

TABLE VIII
Time (CPU seconds) to complete Multi-PIE training, enrolment (offline) and testing (online). For ISV: nU = 160; for JFA: nV = 40, nU = 20; for TVM: nT = 400, nLDA = 200.

Phase     Process                           ISV      JFA     TVM
Offline   Subspace training                 33,275   9,633   140,897
          Subject enrolment                 58       29      56
Online    Factor estimation                 1,093    22      4,162
          LDA and WCCN projection           –        –       0.97
          Cosine (TVM)/linear scoring       24       20      ≈0
          C-norm (TVM) or ZT-norm           23       17      0.19


Fig. 1. Processing of i-vectors wt and wi , representing the probe image and claimed subject model respectively. In theory, each processing stage prior to scoring is optional.
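The scoring pipeline of Fig. 1 can be sketched as follows. The LDA and WCCN matrices below are random placeholders (in a real system, A comes from (SN-)LDA training and B from WCCN, the Cholesky factor of the inverse within-class covariance), and c_norm is a simplified cohort-based normalisation standing in for the C-norm used in the paper; only the order of operations mirrors the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_T, n_LDA = 400, 200                  # Multi-PIE configuration from Table VIII

# Placeholder transforms; both are trained on background data in practice.
A = rng.standard_normal((n_LDA, n_T))  # (SN-)LDA projection
B = rng.standard_normal((n_LDA, n_LDA))  # WCCN transform

def preprocess(w):
    """LDA projection followed by WCCN, the two stages of Fig. 1 before scoring."""
    return B @ (A @ w)

def cosine_score(w_model, w_probe):
    """Cosine similarity between the preprocessed model and probe i-vectors."""
    a, b = preprocess(w_model), preprocess(w_probe)
    return float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def c_norm(raw_score, cohort_scores):
    """Simplified normalisation against a cohort of scores (stand-in for C-norm)."""
    mu, sigma = float(np.mean(cohort_scores)), float(np.std(cohort_scores))
    return (raw_score - mu) / sigma

# Usage: score a probe i-vector w_t against an enrolled subject's i-vector w_i,
# normalising against scores of a small cohort of background models.
w_model = rng.standard_normal(n_T)
w_probe = rng.standard_normal(n_T)
cohort = [cosine_score(c, w_probe) for c in rng.standard_normal((10, n_T))]
score = c_norm(cosine_score(w_model, w_probe), cohort)
```

Because every stage after i-vector extraction is a fixed linear map or a dot product, the online cost per probe is tiny, which is consistent with the near-zero scoring times for TVM in Table VIII.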



Fig. 2. Example images, showing a wide range of within-class variation, from the (top) SCface, (middle) MOBIO and (bottom) Multi-PIE databases.



Fig. 3. Subspace dimensionality tuning results on the SCface database for the TVM + LDA + WCCN + C-norm system (% EER on dev set), in terms of (left) the effect of nT , the total variability subspace dimensionality (with nLDA = 40) and (right) the effect of nLDA , the number of LDA dimensions retained (with nT = 300).

