IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. X, XXXXXXX 2012

Face Recognition Using Sparse Approximated Nearest Points between Image Sets

Yiqun Hu, Ajmal S. Mian, and Robyn Owens

Abstract—We propose an efficient and robust solution for image set classification. A joint representation of an image set is proposed which includes the image samples of the set and their affine hull model. The model accounts for unseen appearances in the form of affine combinations of sample images. To calculate the between-set distance, we introduce the Sparse Approximated Nearest Point (SANP). SANPs are the nearest points of two image sets such that each point can be sparsely approximated by the image samples of its respective set. This novel sparse formulation enforces sparsity on the sample coefficients and jointly optimizes the nearest points as well as their sparse approximations. Unlike standard sparse coding, the data to be sparsely approximated are not fixed. A convex formulation is proposed to find the optimal SANPs between two sets and the accelerated proximal gradient method is adapted to efficiently solve this optimization. We also derive the kernel extension of the SANP and propose an algorithm for dynamically tuning the RBF kernel parameter while matching each pair of image sets. Comprehensive experiments on the UCSD/Honda, CMU MoBo, and YouTube Celebrities face datasets show that our method consistently outperforms the state of the art.

Index Terms—Image set classification, face recognition, sparse modeling, convex optimization.

1 INTRODUCTION

Traditional image classification [17], [23] studies the problem of classification based on a single image. Learning is performed either from a single image or multiple images per class. Multiple images are used during learning to model the within-class variations or to draw robust boundaries that separate classes. Classification, however, is generally performed on the basis of individual query images. In image set classification, each class is represented by one or more image sets and a query image set is assigned the label of the gallery set that is the nearest to it using some distance criterion. For the specific case of human faces, each set comprises a different number of facial images under arbitrary poses, illumination conditions, and expressions. Image set classification is a generalization of video-based classification [13], [25], [33], which focuses on exploiting the temporal relationship between the images with the a priori condition that individual images are consecutive video frames. However, in image set-based classification, the images of a set may manifest large viewpoint and illumination changes and nonrigid deformations without any temporal relationship. In practice, the problem of image set classification naturally arises in a wide range of contexts including video-based recognition, surveillance, multiview images from camera networks, manually created collections of relevant images, personal albums, and classification based

The authors are with the School of Computer Science & Software Engineering (M002), The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia. E-mail: {yiqun, ajmal}@csse.uwa.edu.au, [email protected].
Manuscript received 11 Apr. 2011; revised 21 Nov. 2011; accepted 15 Dec. 2011; published online 28 Dec. 2011.
Recommended for acceptance by M. Tistarelli.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2011-04-0222.
Digital Object Identifier no. 10.1109/TPAMI.2011.283.
0162-8828/12/$31.00 © 2012 IEEE

on long term observations. Within an image set, individual images either share a common semantic relationship or complement the appearance variations of the subject. In the case of human faces, each set comprises a different number of facial images under arbitrary poses, illumination conditions, and expressions.

Image set data offer new opportunities but at the same time pose new challenges to the visual classification task. On one hand, image sets hold more promise for accurate classification because they contain more information than a single image. On the other hand, they introduce the challenging problem of image set modeling in order to exploit the semantic relationships between individual images. Traditional classification models, e.g., SVM or the single-sample k-Nearest Neighbor (NN) classifier, cannot address this issue. Classification based on image sets has recently attracted growing interest in the computer vision community [7], [14], [19], [29], [30], [38], [41], [44], [45].

In this paper, we propose a novel algorithm for image set classification and apply it to the face recognition problem. We represent every image set jointly by its sample images and their affine hull model to cover all possible affine combinations of the sample images. Such a loose representation of the affine hull is capable of accounting for unseen appearances (which do not appear in the set) in the form of affine combinations of image samples. However, the affine hull model also introduces a challenge for image set classification: The image sets of different classes are more likely to intersect due to the overlarge space of their affine hulls. To address this issue, we introduce the Sparse Approximated Nearest Points (SANP) for computing the between-set distance. SANPs of two image sets are defined as the nearest points of the sets such that each point can be sparsely approximated by the sample images of its respective set.
The search for SANP of two image sets is formulated as a partial L1-norm regularized convex optimization problem.


Fig. 1. Sparse Approximated Nearest Points of two image sets. Given the affine hull models $(\mu_i, U_i)$ and $(\mu_j, U_j)$ of two image sets, the points on each set can be represented as a linear combination of bases plus the mean image. They can also be represented as a linear combination of sample images. The SANPs are dynamically chosen by the joint optimization, which simultaneously searches for sparsely approximated points (maximize sparsity of sample coefficients) that are the nearest (minimize distance) between the two sets. The optimal SANPs of the two image sets are shown in the center, each of which is sparsely approximated by the sample images marked with red boxes.

Fig. 1 illustrates the formulation of SANP optimization for two image sets. This novel formulation is different from the standard sparse modeling of single images in two aspects. First, the nearest points to be sparsely approximated are the unknowns, which means that we need to jointly optimize the nearest points and their sparse approximations. Second, sparsity is enforced on the sample coefficients instead of the model coefficients in our formulation. We show how recent advances in first-order optimization techniques can be adapted to solve this optimization, leading to a fast, scalable algorithm. Once the SANPs are found, the between-set distance is then calculated using these points and the Nearest Neighbor classifier is deployed to assign the query set to the class of its nearest neighbor. Our approach was initially proposed in [48]. In this paper, we also propose the kernel extension of the SANP (KSANP) and an automatic algorithm that adaptively tunes the RBF kernel parameter while matching each pair of image sets. The KSANP can handle nonlinear structures of the image sets by implicitly mapping the data to a high dimensional feature space. KSANP achieves better results compared to the original SANP with minimal computational overhead. Comprehensive experiments were performed on three benchmark datasets. Comparison with existing state-of-the-art techniques [7], [29], [30], [41] shows that our methods consistently achieve better results and are computationally efficient.
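The Nearest Neighbor step described above is simple to sketch. The following is a minimal Python illustration of set-level NN classification, where `set_distance` is any between-set distance (in the paper, the SANP distance); the `mean_dist` stand-in and all names here are our own illustrative choices, not the authors' code.

```python
import numpy as np

def classify_set(query_set, gallery_sets, labels, set_distance):
    """Assign the query image set the label of its nearest gallery set.

    query_set / gallery_sets: (d, N) arrays whose columns are image
    feature vectors; set_distance is any between-set distance.
    """
    dists = [set_distance(query_set, g) for g in gallery_sets]
    return labels[int(np.argmin(dists))]

# Toy stand-in distance: distance between the sets' sample means
# (not the SANP distance of this paper).
mean_dist = lambda A, B: np.linalg.norm(A.mean(axis=1) - B.mean(axis=1))

rng = np.random.default_rng(0)
gallery = [rng.normal(0, 1, (5, 8)), rng.normal(4, 1, (5, 8))]
query = rng.normal(4, 1, (5, 6))               # resembles the second class
print(classify_set(query, gallery, ["A", "B"], mean_dist))  # B
```

Any between-set distance, including the SANP distance developed in Sections 4 and 5, can be plugged in for `set_distance` without changing this classification step.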

2 RELATED WORK

Existing techniques can be categorized according to the two main challenges of the image set classification problem. The first challenge is how to extract and represent the information from an image set. The second challenge is how to calculate the distance/similarity between two image sets. According to


the first category, existing techniques include parametric and nonparametric representations. Parametric model-based representations [6], [13], [22], [45] use some parametric distribution to represent an image set, with the parameters estimated from the set data itself. A limitation of these techniques is that if the set data does not have strong statistical correlations for parameter estimation, the estimated model cannot characterize the image set well [30], [41]. Nonparametric model-free methods attempt to represent an image set as a linear subspace [14], [20], [25], [41], a mixture of subspaces [19], [38], [44], or nonlinear manifolds [2], [29], [30], [42]. Without any assumption on the data distribution, it has been shown in [29], [30] that these model-free representations inherit many favorable properties. Another advantage of nonparametric low-dimensional subspace/manifold image set representations is that they can model the illumination of faces very well [10], [21]. Existing techniques can also be differentiated based on the second criterion of between-set distance. However, the between-set distance is usually defined specifically for certain image set representations. For example, for parametric representations, the between-set distance is calculated by measuring the similarity between the corresponding parametric distributions. Kullback-Leibler divergence [22] is an example of this category. For nonparametric representations, two types of distances have been proposed. The first one defines the between-set distance using some of the set samples. For example, a simple method for calculating the between-set distance is to measure the distance between the sample means of the two sets [30]. Another example is to use geometric distances (distances of closest point approach) [7] to compare different image sets. Unlike the mean difference, this method adaptively selects different samples to calculate the between-set distance for different image sets.
Thus, it is able to better handle intraclass variations. Given two image sets, the closest points are obtained by minimizing the distance between them through least squares optimization. The between-set distance is then defined as the distance between these two points. The second type of distance for nonparametric representations compares different image sets by analyzing their model structures instead of the sample data. Canonical Correlation Analysis (CCA) [8] is one of the most popular techniques for calculating subspace similarity. It finds $d$ principal angles $0 \le \theta_1 \le \cdots \le \theta_d \le \pi/2$ between the linear subspaces of two sets, which are the smallest angles between any vector in the first set and any vector in the second set. The between-set similarity is then defined as the sum of canonical correlations, which are the cosines of the principal angles. Various subspace methods represent image sets as different low-dimensional subspaces and compute the canonical correlations between these subspaces. For example, the Mutual Subspace Method (MSM) [25] calculates the smallest angle between the linear subspaces of two sets computed with Principal Component Analysis (PCA). The Orthogonal Subspace Method (OSM) [5] extracts class-specific subspaces which are orthogonal to those of all other classes. It also provides a systematic way to decide the optimal dimension of the subspaces
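As a concrete illustration of the CCA-based similarity discussed above: the canonical correlations (cosines of the principal angles) between the linear subspaces of two sets can be computed from the singular values of the product of their orthonormal bases. A hedged NumPy sketch of this standard computation (function and variable names are ours, not from any of the cited methods):

```python
import numpy as np

def canonical_correlations(A, B, d=3):
    """Cosines of the principal angles between the column spans of A and B.

    A, B: (features, samples) data matrices. Orthonormal bases come from
    the reduced QR factorization; the singular values of Ua^T Ub are the
    canonical correlations cos(theta_1) >= ... >= cos(theta_d).
    """
    Ua, _ = np.linalg.qr(A)
    Ub, _ = np.linalg.qr(B)
    s = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return np.clip(s[:d], 0.0, 1.0)

# Two matrices spanning the same subspace give correlations of 1
# (all principal angles are 0):
X = np.random.default_rng(1).normal(size=(10, 3))
print(np.allclose(canonical_correlations(X, X @ np.diag([2.0, 3.0, 4.0])), 1.0))  # True
```

The sum of the returned correlations is the between-set similarity used by MSM-style methods.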


based on the eigenvalues. The Constrained Mutual Subspace Method (CMSM) [20] finds a constrained subspace by minimizing the variance of the entire class distribution. Compared to OSM, the subspace dimension needs to be set empirically in CMSM. In addition to addressing the two main challenges, other techniques have been investigated to further improve the performance of image set classification. For example, kernel machines have been used to extend CCA [16], CMSM [15], and OSM [14] to model the nonlinear structures of the image sets in the implicit feature space. To increase the discriminant power of set data, Kim et al. [41] proposed an iterative method to find the optimal transformation which preserves the discriminant information for CCA. Instead of modeling image sets globally, existing work has also used local information to handle the nonlinear structure of set data. For example, Kim et al. [38] proposed the Locally Orthogonal Subspace Method (LOSM), where the class subspace is only required to be orthogonal to its local neighbors. The local principal angles are combined with the global ones in [46] to further improve the robustness of set classification. Several methods have been proposed for incremental learning in set-based classification, whose computational costs are far lower than batch computation. In [37], [38], the orthogonal subspaces are learned incrementally by updating the principal components of the class correlation and the total correlation matrices separately. Weng et al. [11] incrementally derived discriminating features in the input space and stored the information about discriminating subspaces in a hierarchical decision tree. Chin et al. [36] approximated the kernel SVD in an incremental way to estimate the nonlinear subspaces for face recognition. Ensemble learning techniques, e.g., boosting, have also been applied to image set classification.
Boosting has been successfully used to improve the performance of the manifold principal angles [39], [40] as well as CMSM [45]. Generative models focus on estimating the posterior distribution of the class label given the image set. For video-based recognition, which is a special case of image set classification, Zhou and Chellappa [33] modeled a video sequence using a time series state-space model parameterized by both tracking state and identity variables. By marginalizing over the tracking states, the posterior distribution of an identity was estimated for recognition. Liu and Cheng [47] used an unsupervised learning technique to adapt the Hidden Markov Model (HMM) to the test video sequence. The temporal dynamics were learned from training data and used for analyzing test data for recognition. A common aspect of existing techniques is that they measure either the distance between certain samples of the two sets or the similarity between their structures. In contrast, the technique proposed in this paper endeavors to utilize both the structure information and the image samples.

3 IMAGE SET REPRESENTATION

We propose a joint representation for image sets, each comprising a different number of image samples. Denote $X_c = [x_{c,1}, x_{c,2}, \ldots, x_{c,N_c}]$ as the data matrix of the $c$th image set, where $x_{c,i}$ is a feature vector of the $i$th image. The feature of an individual image can simply be the high-dimensional array of pixel values or any other feature, e.g., the Local Binary Pattern (LBP) [35] of the image. The joint representation, besides using the sample data $X_c$, constructs a linear model to approximate the structure of the image set in the high-dimensional feature space. We model an image set as an affine hull of the set data [7]:

$$AH_c = \bigg\{ x = \sum_{i=1}^{N_c} \alpha_{c,i}\, x_{c,i} \;\bigg|\; \sum_{i=1}^{N_c} \alpha_{c,i} = 1 \bigg\}, \qquad (1)$$

where $\alpha_{c,i} \in \mathbb{R}$ for $i = 1, 2, \ldots, N_c$. This affine hull can also be represented in another parametric form using the sample mean $\mu_c = \frac{1}{N_c}\sum_{i=1}^{N_c} x_{c,i}$ as a reference point:

$$AH_c = \{\, x = \mu_c + U_c v_c \mid v_c \in \mathbb{R}^l \,\}, \qquad (2)$$

where the $l$ columns of $U_c$ are the orthonormal bases obtained from the Singular Value Decomposition (SVD) of the centered data matrix $[x_{c,1} - \mu_c, x_{c,2} - \mu_c, \ldots, x_{c,N_c} - \mu_c]$ and $v_c$ are the coefficients of the linear model w.r.t. the bases $U_c$. The proof of the equivalence of these two representations is given in Appendix A, which can be found in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2011.283. The difference between the two representations of an affine hull is that (1) is a constrained representation with the constraint $\sum_{i=1}^{N_c} \alpha_{c,i} = 1$, while (2) is an unconstrained representation.

Differently from tight representations, e.g., the convex hull, any affine combination of sample images in the set is accommodated in this representation even when the combination does not appear among the samples of the set. Such a loose representation is particularly appealing in the context of small set sizes because the unseen data of an image set can be better modeled. However, this loose representation also brings challenges for calculating the distance between two image sets. The affine hulls of image sets are likely to be overlarge, which results in the intersection of multiple affine hulls.

In this paper, we represent an image set as a triplet $(\mu_c, U_c, X_c)$ by including both structure information and sample images. As we will show in the next section, the information of the sample images can be utilized to eliminate the ambiguity of the overlarge space of the affine hulls. This joint representation of the image set is useful for improving the robustness of matching image sets.
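The joint representation $(\mu_c, U_c, X_c)$ can be sketched in a few lines of NumPy. The energy threshold used below to pick the dimension $l$ is an assumption made for this sketch (the text does not fix how $l$ is chosen here):

```python
import numpy as np

def affine_hull(X, energy=0.98):
    """Joint set representation (mu, U, X) of Section 3.

    X: (d, N) matrix with one image feature per column. mu is the sample
    mean used as the reference point of (2); the columns of U are the
    orthonormal SVD bases of the centered data. The energy threshold for
    choosing l is our own illustrative assumption.
    """
    mu = X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(X - mu, full_matrices=False)
    l = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), energy)) + 1
    return mu, U[:, :l], X

# Any point mu + U v lies in the affine hull of the samples: solving for
# sample coefficients under the sum-to-one constraint succeeds exactly.
X = np.random.default_rng(2).normal(size=(6, 4))
mu, U, _ = affine_hull(X)
p = (mu + U @ np.ones((U.shape[1], 1))).ravel()
alpha, *_ = np.linalg.lstsq(np.vstack([X, np.ones((1, 4))]),
                            np.append(p, 1.0), rcond=None)
print(round(float(alpha.sum()), 6))  # 1.0
```

The check at the end mirrors the equivalence of representations (1) and (2): a point generated from the unconstrained form (2) admits sample coefficients that sum to one, as required by (1).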

4 SPARSE APPROXIMATED NEAREST POINTS (SANP)

Existing methods [7], [30] directly search the nearest points in the complete space of two image sets without any additional constraints. These points could be very noisy and vulnerable to outliers. This issue is especially serious in our case because we use loose affine hulls to model image sets. Even for two image sets of different classes, it is possible to find two nearest points with a very small distance. This can degrade the classification performance. To overcome this problem, we propose Sparse Approximated Nearest Points (SANP) to measure the dissimilarity between two image sets. SANPs are the two points, one on each individual set, which satisfy the following constraints:


- The Euclidean distance between these two points should be small.
- Each of the two points should be able to be approximated by a sparse combination of sample images in the corresponding image set.

Fig. 2. Illustration of matching two image sets of different subjects. Nearest points of the sets (a) with dense approximation and (b) with sparse approximation. The first row shows the nearest point images on the sets and the second row shows the sample coefficients used for their approximation.

Note that the second constraint improves the discriminative power of the SANPs. Given two image sets of different classes (subjects), the nearest points between these two sets using a dense combination of all sample images could be very close. For example, the nearest images in Fig. 2a (top row) are very close but they deviate significantly from the sample images of the respective set, i.e., they neither look like the query (first row Fig. 6a) nor the gallery (second row Fig. 6a). Alternatively, using only a sparse combination of a few samples, the minimum distance between points of the same two sets (correctly) becomes large (e.g., the images in Fig. 2b are approximated by the linear combination of five samples). From a geometric point of view, the affine hull of an image set is formed from sample images which lie on the facets of the hull. The constraint of sparse approximation enforces the SANPs to be close to some facet(s) of the affine hull and consequently close to some sample image(s) on those facet(s). This pushes the SANPs apart, resulting in greater computed distances. However, the main advantage of our sparsity constraint is that it pushes the SANPs of different identities further apart compared to SANPs of the same identity. Thus, the spurious nearest points of image sets of different classes can be avoided, resulting in greater identification accuracy.

4.1 Convex Formulation

To find the SANPs of two image sets which are optimal in terms of the above two criteria, we propose a convex formulation. Given the data matrices $X_i$ and $X_j$ of two image sets, their corresponding affine hull representations are $(\mu_i, U_i)$ and $(\mu_j, U_j)$. We first define several functions as follows:

$$F_{v_i,v_j} = \|(\mu_i + U_i v_i) - (\mu_j + U_j v_j)\|_2^2,$$
$$G_{v_i,\alpha} = \|(\mu_i + U_i v_i) - X_i \alpha\|_2^2, \qquad (3)$$
$$Q_{v_j,\beta} = \|(\mu_j + U_j v_j) - X_j \beta\|_2^2.$$

The optimal model coefficients $\{v_i^*, v_j^*\}$ and sample coefficients $\{\alpha^*, \beta^*\}$ of the SANPs are obtained by optimizing the following unconstrained problem:

$$\min_{v_i, v_j, \alpha, \beta} \big( F_{v_i,v_j} + \lambda_1 (G_{v_i,\alpha} + Q_{v_j,\beta}) + \lambda_2 \|\alpha\|_1 + \lambda_3 \|\beta\|_1 \big), \qquad (4)$$

where the first term keeps the distance between the SANPs $x_i = \mu_i + U_i v_i$ and $x_j = \mu_j + U_j v_j$ small. The second term preserves the individual fidelities between these two points and their sample approximations. The last two terms enforce the approximations to be sparse. It can be proven that this objective function is jointly convex with respect to all variables ($v_i$, $v_j$, $\alpha$, and $\beta$) (see Appendix B, available in the online supplemental material).

Here, we do not use representation (1) alone to formulate the problem because this would require an additional constraint, i.e., that the sum of the coefficients be equal to 1. This would make the corresponding optimization a constrained optimization problem which is more complex to solve. Therefore, we used (2) to represent the affine hull in the first term to eliminate the additional constraint. The second and third terms are required because we enforce sparsity constraints on the set samples (representation (1)) instead of the set bases (representation (2)). Moreover, minimizing these terms minimizes the difference between the two representations, thus implicitly ensuring that the corresponding combination of the optimal sample coefficients is an affine combination. Such a formulation allows us to solve an unconstrained optimization problem more easily. Note that if the set sizes are very large compared to the feature dimension of the set samples, the two representations can be exact, reducing the second and third terms in (4) to zero. In this case, we will still get the optimal SANPs.

The parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the tradeoff weights that control the relative importance of the different terms. The value of $\lambda_1$ should be relatively small to ensure that the first term is minimized. We fixed the value of $\lambda_1$ to 0.01 for all the experiments conducted in this paper. For $\lambda_2$ and $\lambda_3$, we design an automatic mechanism to control the relative sparsity of $\alpha$ and $\beta$. Notice that we can find the value $\tilde{\lambda}_2 = \max(|2\lambda_1 (X_i^T \mu_i)|)$ for which the zero vector is optimal for $\alpha$. It is derived from the sufficient and necessary condition that the zero vector should belong to the subdifferential of the objective function at zero. In such a situation, $\lambda_2 \|\alpha\|_1$ will increase more than the decrease in $F + \lambda_1(G + Q)$ when changing any component of $\alpha$ to nonzero. Hence, the zero vector is optimal for $\alpha$. This condition is satisfied when $\lambda_2$ is larger than the absolute gradient with respect to $\alpha$ at zero, which can be derived as $2\lambda_1 (X_i^T \mu_i)$. Therefore, $\tilde{\lambda}_2$ is set to $\max(|2\lambda_1 (X_i^T \mu_i)|)$. If $\lambda_2 \geq \tilde{\lambda}_2$, the zero vector is optimal for $\alpha$. Similarly, if $\lambda_3 \geq \tilde{\lambda}_3 = \max(|2\lambda_1 (X_j^T \mu_j)|)$, the zero vector is optimal for $\beta$. We adaptively set $\lambda_2 = 0.1\,\tilde{\lambda}_2$ and $\lambda_3 = 0.1\,\tilde{\lambda}_3$ for all experiments. These two parameters are thus adaptive to different image sets.

To the best of our knowledge, this is the first time that sparse modeling has been formulated to match two image sets. Note that we do not enforce sparsity on the model coefficients $v_i$ and $v_j$ because the bases $U_i$/$U_j$ obtained from SVD do not align with the sample data points. Instead, we enforce the sparsity property on the sample coefficients $\alpha$ and $\beta$, which implies that each nearest point is sparsely approximated by a combination of a few sample images. Differently from sparse modeling of single image based classification [12], our formulation jointly optimizes the
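For concreteness, the terms of (3), the composite objective (4), and the adaptive choice of $\lambda_2$ and $\lambda_3$ can be sketched as follows. This is an illustrative NumPy evaluation with assumed shapes, not the authors' implementation:

```python
import numpy as np

def sanp_objective(vi, vj, a, b, Xi, Xj, mui, muj, Ui, Uj, l1=0.01):
    """Evaluate (3) and the composite objective (4) for given coefficients.

    Assumed shapes (ours): Xi is (d, Ni), mui is (d,), Ui is (d, l),
    vi is (l,), a is (Ni,); likewise for set j.
    """
    xi = mui + Ui @ vi                   # candidate nearest point on set i
    xj = muj + Uj @ vj                   # candidate nearest point on set j
    F = np.sum((xi - xj) ** 2)           # between-point distance term
    G = np.sum((xi - Xi @ a) ** 2)       # fidelity to set-i samples
    Q = np.sum((xj - Xj @ b) ** 2)       # fidelity to set-j samples
    # Adaptive sparsity weights: 0.1 times the smallest values that would
    # make the all-zero coefficient vectors optimal (Section 4.1).
    l2 = 0.1 * np.max(np.abs(2 * l1 * (Xi.T @ mui)))
    l3 = 0.1 * np.max(np.abs(2 * l1 * (Xj.T @ muj)))
    return F + l1 * (G + Q) + l2 * np.abs(a).sum() + l3 * np.abs(b).sum()
```

At the all-zero starting point the objective reduces to $\|\mu_i - \mu_j\|^2 + \lambda_1(\|\mu_i\|^2 + \|\mu_j\|^2)$: the mean-difference distance plus the fidelity residuals of the means themselves.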


nearest points between two image sets and their sparse approximations from samples.

5 EFFICIENT OPTIMIZATION

The convex formulation of SANP can be solved by many algorithms, e.g., gradient descent, the feature-sign algorithm [9], etc. In this section, we provide an efficient solution by adapting the Accelerated Proximal Gradient (APG) method to optimize (4), which theoretically achieves the best optimization efficiency among all first-order methods for solving this problem. The whole procedure is summarized in Algorithm 1.

The objective function in (4) is a composite model consisting of a smooth function and a nonsmooth function. The smooth part corresponds to $f(v_i, v_j, \alpha, \beta) = F_{v_i,v_j} + \lambda_1 (G_{v_i,\alpha} + Q_{v_j,\beta})$ and the nonsmooth part is $g(\alpha, \beta) = \lambda_2 \|\alpha\|_1 + \lambda_3 \|\beta\|_1$. Obviously, $g(\alpha, \beta)$ is a convex function with respect to $\alpha$ and $\beta$. It can also be proven that the smooth function $f(v_i, v_j, \alpha, \beta)$ is jointly convex with respect to all its variables. Hence, the objective function in (4) is convex and the global minimum solution can be obtained. In the rest of this section, we adapt the APG methods [49], [1] to solve this optimization problem, which can achieve the optimal convergence rate of first-order methods.

Algorithm 1. Optimization of SANPs
Require: $(X_i, \mu_i, U_i)$, $(X_j, \mu_j, U_j)$
1: Set $v_i^1 = v_i^0 = 0$, $v_j^1 = v_j^0 = 0$, $\alpha^1 = \alpha^0 = 0$, $\beta^1 = \beta^0 = 0$, $t^0 = 0$, $t^1 = 1$, $k = 1$, $L = L_0 = 100$, $\eta = 1.1$, $\lambda_1 = 0.01$, $\lambda_2 = 0.1 \cdot \max(|2\lambda_1 (X_i^T \mu_i)|)$, and $\lambda_3 = 0.1 \cdot \max(|2\lambda_1 (X_j^T \mu_j)|)$.
2: while not converged do
3: compute the proximal points:
$y^k_{v_i} = v_i^k + \frac{t^{k-1} - 1}{t^k}(v_i^k - v_i^{k-1})$, $\quad y^k_{v_j} = v_j^k + \frac{t^{k-1} - 1}{t^k}(v_j^k - v_j^{k-1})$,
$y^k_{\alpha} = \alpha^k + \frac{t^{k-1} - 1}{t^k}(\alpha^k - \alpha^{k-1})$, $\quad y^k_{\beta} = \beta^k + \frac{t^{k-1} - 1}{t^k}(\beta^k - \beta^{k-1})$;
4: calculate the gradients:
$\nabla f_{v_i} = (2 + 2\lambda_1) U_i^T U_i y^k_{v_i} - 2U_i^T U_j v_j^{k-1} - 2U_i^T \mu_j + (2 + 2\lambda_1) U_i^T \mu_i - 2\lambda_1 U_i^T X_i \alpha^{k-1}$,
$\nabla f_{v_j} = (2 + 2\lambda_1) U_j^T U_j y^k_{v_j} - 2U_j^T U_i v_i^{k-1} - 2U_j^T \mu_i + (2 + 2\lambda_1) U_j^T \mu_j - 2\lambda_1 U_j^T X_j \beta^{k-1}$,
$\nabla f_{\alpha} = 2\lambda_1 X_i^T X_i y^k_{\alpha} - 2\lambda_1 X_i^T \mu_i - 2\lambda_1 X_i^T U_i v_i^{k-1}$,
$\nabla f_{\beta} = 2\lambda_1 X_j^T X_j y^k_{\beta} - 2\lambda_1 X_j^T \mu_j - 2\lambda_1 X_j^T U_j v_j^{k-1}$;
5: optimize the proximal regularization:
$v_i^{k+1} = y^k_{v_i} - \frac{1}{L}\nabla f_{v_i}$, $\quad v_j^{k+1} = y^k_{v_j} - \frac{1}{L}\nabla f_{v_j}$,
$\alpha^{k+1} = S_{\lambda_2/L}\big(y^k_{\alpha} - \frac{1}{L}\nabla f_{\alpha}\big)$, $\quad \beta^{k+1} = S_{\lambda_3/L}\big(y^k_{\beta} - \frac{1}{L}\nabla f_{\beta}\big)$,
where $S_\tau$ is the soft-thresholding operator of (7);
6: if $F_{v_i^{k+1}, v_j^{k+1}} + \lambda_1 (G_{v_i^{k+1}, \alpha^{k+1}} + Q_{v_j^{k+1}, \beta^{k+1}}) > P_L$, update $L = \eta L$ and go to Step 5;
7: stepsize update: $t^{k+1} = \frac{1 + \sqrt{4(t^k)^2 + 1}}{2}$;
8: end while
9: Output: optimal solution $(v_i^*, v_j^*, \alpha^*, \beta^*)$ to (4)
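A compact NumPy sketch of Algorithm 1 is given below. For brevity it makes two simplifications of the listing above: it uses a fixed stepsize $1/L$ instead of the backtracking rule of Step 6, and it evaluates the full gradient at the proximal points for all blocks rather than using the previous iterates for the cross terms. It is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def sanp_apg(Xi, Xj, mui, muj, Ui, Uj, l1=0.01, L=100.0, iters=300):
    """Accelerated proximal gradient for the SANP objective (4)."""
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    l2 = 0.1 * np.max(np.abs(2 * l1 * (Xi.T @ mui)))   # adaptive weights
    l3 = 0.1 * np.max(np.abs(2 * l1 * (Xj.T @ muj)))
    z = [np.zeros(Ui.shape[1]), np.zeros(Uj.shape[1]),
         np.zeros(Xi.shape[1]), np.zeros(Xj.shape[1])]  # vi, vj, alpha, beta
    zp, t, tp = [v.copy() for v in z], 1.0, 0.0
    for _ in range(iters):
        # Step 3: proximal (momentum) points
        y = [v + (tp - 1.0) / t * (v - vp) for v, vp in zip(z, zp)]
        vi, vj, a, b = y
        xi, xj = mui + Ui @ vi, muj + Uj @ vj
        # Step 4: gradients of the smooth part f = F + l1 * (G + Q)
        g = [Ui.T @ (2 * (xi - xj) + 2 * l1 * (xi - Xi @ a)),
             Uj.T @ (2 * (xj - xi) + 2 * l1 * (xj - Xj @ b)),
             -2 * l1 * Xi.T @ (xi - Xi @ a),
             -2 * l1 * Xj.T @ (xj - Xj @ b)]
        zp = z
        # Step 5: plain gradient step on vi, vj; soft-thresholded step on a, b
        z = [y[0] - g[0] / L, y[1] - g[1] / L,
             soft(y[2] - g[2] / L, l2 / L), soft(y[3] - g[3] / L, l3 / L)]
        # Step 7: momentum parameter update
        tp, t = t, (1.0 + np.sqrt(4.0 * t * t + 1.0)) / 2.0
    return z  # [vi, vj, alpha, beta]
```

The returned coefficients define the two SANPs $x_i = \mu_i + U_i v_i$ and $x_j = \mu_j + U_j v_j$, from which the between-set distance is then computed.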

The gradient method [1], [49] was used to minimize the composite function $f(w) + g(w)$ by extending the equivalence relationship between the gradient step and the proximal regularization of the linearized function $f$ at $w^{k-1}$ to the composite function $f(w) + g(w)$. The corresponding iterative scheme is as follows: At every iteration $k$, the new solution $w^k$ is obtained by solving the following proximal regularization problem from the solution $w^{k-1}$ at the previous iteration:

$$w^k = \arg\min_w \{ P_L(w, w^{k-1}) + g(w) \}, \qquad (5)$$

where

$$P_L(w, w^{k-1}) = f(w^{k-1}) + \langle \nabla f(w^{k-1}), w - w^{k-1} \rangle + \frac{L}{2}\|w - w^{k-1}\|^2, \qquad (6)$$

and $L$ is the stepsize related to the Lipschitz constant of $\nabla f$. When $g(w) = \lambda\|w\|_1$, the optimal $w^k$ of (5) can be efficiently obtained by the soft-thresholding operator at every iteration as follows:

$$[S_{\tau}(x)]_i = (|x_i| - \tau)_+ \,\mathrm{sgn}(x_i), \qquad (7)$$

where $(x)_+ = \max(0, x)$ and $\mathrm{sgn}(x)$ returns the sign of $x$. APG methods [49], [1] improve the convergence rate of the gradient method from $O(1/k)$ to $O(1/k^2)$ by carefully selecting a sequence of points $Y^k$ for proximal regularization instead of directly using the point from the previous iteration (Step 3 in Algorithm 1).

The composite objective function (4) of our SANP optimization differs from the standard one in the nonsmooth part, where the L1 norm only relates to some of the optimization variables ($\alpha$ and $\beta$). Because the objective function is separable, the proximal regularization of the SANP optimization at every iteration can still be solved efficiently: $v_i$ and $v_j$ are directly updated from the proximal points in the negative gradient direction since they are independent of the nonsmooth part; $\alpha$ and $\beta$ are updated using the soft-thresholding operator (7) with thresholding values $\lambda_2/L$ and $\lambda_3/L$, respectively (Step 5 in Algorithm 1). The stepsize $L$ is unfortunately unknown. We adaptively select the stepsize using the backtracking rule [1]. Given the initial $L = L_0$ and some $\eta > 1$, we keep updating $L = \eta L$ until $P_L$ between the solutions of iterations $k+1$ and $k$ is larger than $F_{v_i,v_j} + \lambda_1 (G_{v_i,\alpha} + Q_{v_j,\beta})$ at iteration $k+1$ (Step 6 in Algorithm 1).

By solving the optimization problem (4), we obtain both the optimal model coefficients ($v_i^*$, $v_j^*$) and the optimal sample coefficients ($\alpha^*$, $\beta^*$). For each of the two image sets, ($v_i^*$ and $\alpha^*$)/($v_j^*$ and $\beta^*$) provide two approximations of the SANP: One is the linear combination of PCA bases plus the sample mean and the other is the affine combination of sample images. Both representations approximate a single SANP on each of the two image sets. They are close to each other, but may not be exactly the same. However, both representations are used to calculate the between-set SANP distance, as per (17).
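The soft-thresholding operator of (7) is one line of NumPy; entries whose magnitude falls below the threshold are zeroed out, which is exactly what produces sparse sample coefficients:

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise soft-thresholding of (7): [S_tau(x)]_i = (|x_i| - tau)_+ sgn(x_i)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Entries within tau of zero vanish; larger entries shrink toward zero by tau.
print(soft_threshold(np.array([1.2, -0.3, 0.0, -2.0]), 0.5))
# -> [ 0.7 -0.   0.  -1.5]
```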

5.1 Convergence Rate Analysis

Following the more general results in [1], it can be proven that the sequence $p^k = (v_i^k, v_j^k, \alpha^k, \beta^k)$ generated by Algorithm 1 converges to the global solution $p^* = (v_i^*, v_j^*, \alpha^*, \beta^*)$ of the function (4) with a nonasymptotic convergence rate of $O(1/k^2)$,


KSANP, the sample mean needs to be calculated in the feature space as  c ¼

Nc 1 X

ðxc;i Þ: Nc i¼1

ð9Þ

Fig. 3. Illustration of fast convergence of the SANP optimization. The function converges in about 40 iterations.

where k is the iteration number. Compared to the general gradient method, whose convergence rate is O(1/k), this convergence rate is optimal for first-order optimization methods. In fact, it can be shown that

    F(p_k) - F(p^*) \le \frac{2\gamma L(f) \|p_0 - p^*\|_2^2}{(k+1)^2},    (8)

where \gamma > 1 is the constant for backtracking the stepsize update and L(f) is the Lipschitz constant of \nabla f. To achieve an \epsilon-optimal solution (i.e., a \tilde{p} such that F(\tilde{p}) - F(p^*) \le \epsilon), the number of required iterations is at most \lceil C/\sqrt{\epsilon} - 1 \rceil, where C = \sqrt{2\gamma L(f)} \, \|p_0 - p^*\|_2.

Fig. 3 plots the values of the objective function (4) over the iterations when computing the SANPs of two image sets. The algorithm converges after about 40 iterations, taking 0.8 second with a Matlab implementation on a 2.3 GHz machine.

6 KERNEL SANP

We propose the kernel extension of our SANP (KSANP) to model the complex nonlinear structures in the original data. Kernel tricks [34], [43] provide a powerful tool for learning such nonlinear structures and have been successfully applied to many problems. In general, kernel techniques implicitly map the input data to a high-dimensional feature space and then apply a linear method in that space for nonlinear analysis. Given a feature mapping function \phi: R^d \to R^k, where k > d, the sample data X_c = [x_{c,1}, x_{c,2}, ..., x_{c,N_c}] of an image set c are mapped to the feature space as \phi(X_c) = [\phi(x_{c,1}), \phi(x_{c,2}), ..., \phi(x_{c,N_c})]. In the case where the data X_c of different classes are not linearly separable in the original space, the aim is to find a nonlinear mapping \phi(X_c) to a high-dimensional feature space which makes the data linearly separable. Ideally, 100 percent separability is desired, but in practice a mapping function is acceptable as long as it increases the separability of the data. Therefore, the linear affine hull model, applied in the feature space, can be used to model the nonlinear structures of image sets. For example, the data mean in the feature space is

    \mu_c^\phi = \frac{1}{N_c} \sum_{i=1}^{N_c} \phi(x_{c,i}).    (10)

Similarly, the orthogonal bases U_c^\phi = [v_{c,1}^\phi, v_{c,2}^\phi, ..., v_{c,l}^\phi] are calculated by obtaining the eigenvectors of the covariance matrix in the feature space:

    COV_c = \frac{1}{N_c} \sum_{i=1}^{N_c} (\phi(x_{c,i}) - \mu_c^\phi)(\phi(x_{c,i}) - \mu_c^\phi)^T.

However, neither \mu_c^\phi nor COV_c can be directly computed because the mapping function \phi is not explicitly specified. Using Kernel PCA [3], it can be shown that every orthogonal basis vector can be represented as a linear expansion of the mapped, centered data:

    v_c^\phi = \sum_{i=1}^{N_c} w(i) \, (\phi(x_{c,i}) - \mu_c^\phi).    (11)

Since the feature mapping \phi is implicit (unknown), we cannot calculate the data mean \mu_c^\phi in the feature space, and therefore cannot directly calculate the centered Gram matrix \tilde{K}_{cc} = (\phi(X_c) - \mu_c^\phi)^T (\phi(X_c) - \mu_c^\phi). However, \tilde{K}_{cc} can be expanded with simple algebraic derivation (see [3] for details) to obtain (12), which depends only on the noncentered Gram matrix K_{cc} = \phi(X_c)^T \phi(X_c) and does not require us to explicitly evaluate the unknown \mu_c^\phi:

    \tilde{K}_{cc} = K_{cc} - \frac{1}{N_c} I K_{cc} - \frac{1}{N_c} K_{cc} I + \frac{1}{N_c^2} I K_{cc} I,    (12)

where I denotes an N_c \times N_c matrix whose elements are all 1. The expansion coefficients W_c = [w_{c,1}, w_{c,2}, ..., w_{c,l}] can then be computed from the eigendecomposition

    \lambda N_c W_c = \tilde{K}_{cc} W_c.    (13)

Representing the functions of (3) in the feature space gives

    F_{v_i, v_j} = \| (U_i^\phi v_i + \mu_i^\phi) - (U_j^\phi v_j + \mu_j^\phi) \|_2^2,
    G_{v_i, \alpha} = \| U_i^\phi v_i + \mu_i^\phi - \phi(X_i)\alpha \|_2^2,    (14)
    Q_{v_j, \beta} = \| U_j^\phi v_j + \mu_j^\phi - \phi(X_j)\beta \|_2^2.

The convex formulation of the SANP optimization in (4) can then be extended to its kernel version:

    \min_{v_i, v_j, \alpha, \beta} \; F_{v_i,v_j} + \lambda_1 (G_{v_i,\alpha} + Q_{v_j,\beta}) + \lambda_2 \|\alpha\|_1 + \lambda_3 \|\beta\|_1.    (15)

Notice that although the individual mapped data (\phi(X_i), \phi(X_j)) and the kernel principal components (U_i^\phi, U_j^\phi) cannot be computed directly, we can compute (F_{v_i,v_j}, G_{v_i,\alpha}, Q_{v_j,\beta}) as well as their derivatives by computing only the inner products \phi(\cdot)^T \phi(\cdot) in the feature space, from the following related terms:

    (U_i^\phi)^T U_i^\phi = W_i^T \tilde{K}_{ii} W_i,
    (U_j^\phi)^T U_j^\phi = W_j^T \tilde{K}_{jj} W_j,
    (U_i^\phi)^T U_j^\phi = W_i^T \tilde{K}_{ij} W_j,
    (U_i^\phi)^T \mu_j^\phi = C_i^T \Big( \frac{1}{N_j} K_{ij} 1 - \frac{1}{N_i N_j} 1 \, 1^T K_{ij} 1 \Big),
    (U_i^\phi)^T \mu_i^\phi = C_i^T \Big( \frac{1}{N_i} K_{ii} 1 - \frac{1}{N_i^2} 1 \, 1^T K_{ii} 1 \Big),
    (U_j^\phi)^T \mu_j^\phi = C_j^T \Big( \frac{1}{N_j} K_{jj} 1 - \frac{1}{N_j^2} 1 \, 1^T K_{jj} 1 \Big),    (16)
    (U_j^\phi)^T \mu_i^\phi = C_j^T \Big( \frac{1}{N_i} K_{ji} 1 - \frac{1}{N_j N_i} 1 \, 1^T K_{ji} 1 \Big),
    (\mu_i^\phi)^T \mu_i^\phi = \frac{1}{N_i^2} 1^T K_{ii} 1,
    (\mu_j^\phi)^T \mu_j^\phi = \frac{1}{N_j^2} 1^T K_{jj} 1,
    (\mu_i^\phi)^T \mu_j^\phi = \frac{1}{N_i N_j} 1^T K_{ij} 1,

where 1 denotes a vector whose elements are all 1. Once all inner product related terms are computed, the KSANP search uses the same optimization method described in Algorithm 1, with every inner product related term in the original space substituted by its counterpart in the feature space. The computational complexity of the KSANP optimization does not increase because all the inner product related terms do not change during the iterations and can therefore be precalculated offline. This calculation is independent of the feature space dimensionality and can be performed efficiently.
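The double-centering step of (12) and the eigendecomposition of (13) are easy to check numerically. The sketch below, which assumes the samples are stored as the columns of X and uses an explicit linear feature map \phi(x) = x purely for verification (the function name `centered_gram` is illustrative), is a minimal illustration rather than the paper's implementation:

```python
import numpy as np

def centered_gram(K):
    """Center a Gram matrix without the explicit feature map, as in (12):
    K_tilde = K - (1/N) I K - (1/N) K I + (1/N^2) I K I,
    where I is the all-ones N x N matrix."""
    N = K.shape[0]
    J = np.ones((N, N)) / N                 # (1/N) * all-ones matrix
    return K - J @ K - K @ J + J @ K @ J

# Sanity check with the explicit linear feature map phi(x) = x, where the
# centered Gram matrix can also be formed directly from centered data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))             # 8 samples in R^5 (columns)
K = X.T @ X                                 # noncentered Gram matrix
Xc = X - X.mean(axis=1, keepdims=True)      # explicitly centered data
assert np.allclose(centered_gram(K), Xc.T @ Xc)

# The expansion coefficients of (13) follow from an eigendecomposition of
# the centered Gram matrix: lambda * N_c * w = K_tilde w.
evals, W = np.linalg.eigh(centered_gram(K))
```

With a genuine kernel, only K would change (computed from the kernel function instead of explicit features); the centering and eigendecomposition are identical.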

6.1 Kernel Parameter Regularization
The selection of a suitable kernel function is an important factor for all kernel methods, and there is no general solution to this problem. We simply use the most common kernel function, the Radial Basis Function (RBF), which computes the inner product of two data points in the feature space as K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)). Notice that the RBF kernel is used merely to demonstrate the performance of KSANP and can be replaced by any other kernel function. An advantage of the RBF kernel is that it has only one free parameter, \sigma. We propose an algorithm that adaptively tunes the value of \sigma while matching each pair of image sets at run time. The parameter \sigma controls the distribution of the inner products in the feature space. If \sigma is large, the inner products are smooth; in the extreme case, all inner products are equal to 1 when \sigma is very large. The inner products become much more varied when \sigma is small. Although some existing work addresses the selection of optimal parameter values for kernel machines, most techniques, e.g., cross validation and Chapelle et al. [24], choose the parameter value that optimizes some estimated generalization error over the training data. In contrast, our regularization method dynamically changes the value of \sigma when comparing a probe set with different gallery sets. Specifically, we use the a priori knowledge that the value of \sigma should not be too large. When \sigma is too large, all inner products become very similar, regardless of the intrinsic properties of the


data. Consequently, the entropy of the elements in the corresponding Gram matrix becomes too small. Given the data matrix X of a set, we initialize \sigma_0 with a preselected fixed value (chosen, e.g., by cross validation) to compute the kernel functions. If the entropy of the elements of the corresponding Gram matrix is lower than a fixed threshold H_min (H_min is fixed to 0.1 for all experiments in this paper), we use a binary search to find the optimal \sigma for achieving H_min. The detailed procedure is described in Algorithm 2. When comparing a probe set with a gallery set, we dynamically select the value of \sigma to ensure that the entropies of both Gram matrices of the two sets are larger than H_min. Thus, a different value of \sigma may be chosen at run time for each pair of image sets.

Algorithm 2. Binary Search for \sigma Regularization
Require: X, \sigma_0, and H_min
1: Compute the Gram matrix G of X using the RBF function with \sigma_0 and the entropy of G as H;
2: if H < H_min then
3:   Set \sigma = 0.5 \sigma_0;
4:   Compute the Gram matrix G of X using the RBF function with \sigma and the entropy of G as H_new;
5:   If H_new >= H, set step = 0.5, else set step = 2;
6:   while H < H_min do
7:     Set \sigma = \sigma \times step;
8:     Compute the Gram matrix G of X using the RBF function with \sigma and the entropy of G as H;
9:   end while
10:  Set \sigma_start = min(\sigma, \sigma_0) and \sigma_end = max(\sigma, \sigma_0);
11:  while |\sigma_end - \sigma_start| > 0.001 do
12:    Set \sigma = (\sigma_start + \sigma_end)/2;
13:    Compute the Gram matrix G of X using the RBF function with \sigma and the entropy of G as H;
14:    if H >= H_min then
15:      If step < 1, set \sigma_start = \sigma, else set \sigma_end = \sigma;
16:    else
17:      If step < 1, set \sigma_end = \sigma, else set \sigma_start = \sigma;
18:    end if
19:  end while
20: end if

The additional complexity of this regularization process is marginal for several reasons. First, regularization is performed only when the entropy of one of the Gram matrices obtained with \sigma_0 is smaller than H_min; in many cases, regularization is unnecessary and \sigma_0 is used directly.
Second, the binary search procedure for finding the \sigma value that makes the entropy of the corresponding Gram matrix greater than H_min is efficient and requires only O(log n) iterations, where the search range n is generally small. Because the individual image sets are relatively small, the calculation of the Gram matrix (the only expensive operation inside the iterations) does not become a computational bottleneck. Our results back up this claim (see Table 4).
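The search of Algorithm 2 can be sketched as follows. The histogram-based scheme used here to estimate the entropy of the Gram-matrix elements is an assumption (the paper does not specify how the entropy is estimated), and the helper names `gram_entropy` and `regularize_sigma` are illustrative:

```python
import numpy as np

def gram_entropy(X, sigma, bins=20):
    """Entropy of the RBF Gram-matrix elements of the columns of X.
    The 20-bin histogram estimator is an assumption, not the paper's."""
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise sq. dists
    G = np.exp(-sq / (2.0 * sigma ** 2))
    hist, _ = np.histogram(G.ravel(), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def regularize_sigma(X, sigma0, h_min=0.1, tol=1e-3):
    """Sketch of Algorithm 2: coarse multiplicative search for a sigma whose
    Gram entropy reaches h_min, then bisection toward the boundary."""
    if gram_entropy(X, sigma0) >= h_min:
        return sigma0                       # sigma0 already acceptable
    sigma = 0.5 * sigma0
    # Decide the search direction from whether halving sigma helped.
    step = 0.5 if gram_entropy(X, sigma) >= gram_entropy(X, sigma0) else 2.0
    while gram_entropy(X, sigma) < h_min:
        sigma *= step                       # march until entropy is high enough
    lo, hi = min(sigma, sigma0), max(sigma, sigma0)
    while hi - lo > tol:                    # bisect toward the entropy boundary
        mid = 0.5 * (lo + hi)
        if gram_entropy(X, mid) >= h_min:
            lo, hi = (mid, hi) if step < 1 else (lo, mid)
        else:
            lo, hi = (lo, mid) if step < 1 else (mid, hi)
    return lo if step < 1 else hi           # the acceptable endpoint

X = np.random.default_rng(1).standard_normal((5, 30))  # toy set, 30 samples
sigma = regularize_sigma(X, sigma0=1000.0)             # sigma0 far too large
```

Each bisection step only recomputes one small Gram matrix, matching the O(log n) cost stated above.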

7 EXPERIMENTAL EVALUATION

We evaluate the proposed method on the task of face recognition based on image sets. Once the SANPs/KSANPs are found, the nearest neighbor classifier is used for


TABLE 1 Summary of Face Recognition Using SANP/KSANP Distance between Image Sets

recognition. For every query set, the most similar image set in the gallery is found by minimizing the between-set distance based on the SANPs/KSANPs of the two sets. We define the between-set distances as

    D(c_i, c_j) = (d_i + d_j) \times [F_{min} + \lambda_1 (G_{min} + Q_{min})],
    D^\phi(c_i, c_j) = (d_i^\phi + d_j^\phi) \times [F_{min}^\phi + \lambda_1 (G_{min}^\phi + Q_{min}^\phi)],    (17)

where F_{min}, G_{min}, Q_{min} are the optimal values achieved by the solution of (4) and F_{min}^\phi, G_{min}^\phi, Q_{min}^\phi are the function values achieved by the optimal solution of (15). d_i and d_j are the dimensions of the affine hulls of c_i and c_j in the original space, and d_i^\phi and d_j^\phi are the dimensions of the affine hulls of the two image sets in the implicit feature space. The value of d_i^\phi equals the number of model coefficients v_i^\phi, which is the number of columns of U_i^\phi. Although U_i^\phi cannot be computed directly, (U_i^\phi)^T U_i^\phi = W_i^T \tilde{K}_{ii} W_i is computed via Kernel PCA in the KSANP optimization, where \tilde{K}_{ii} and W_i are calculated using (12) and (13). The number of columns of U_i^\phi, and hence the value of d_i^\phi, is obtained from the number of columns of (U_i^\phi)^T U_i^\phi. The value of d_j^\phi is calculated in a similar way.

Multiplication with the total dimension of the two image sets is performed to eliminate the bias toward larger image sets. The bias occurs because, when calculating the distance to larger image sets, the error of the least squares function F_{v_i,v_j}, which is the projection of \mu_j - \mu_i onto the null space of [U_i, U_j], will be smaller since the dimension of the null space is reduced. In the extreme case, if d_i + d_j is larger than the feature dimension, a zero minimum distance can be obtained even when the two image sets are very different. Because the least squares error is linearly biased toward larger image sets, multiplication with (d_i + d_j) ensures that a small between-set distance is only obtained when the distance between SANPs (KSANPs) and the dimensions of the sets are both small. Table 1 summarizes our complete algorithm.
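Assuming the optimal objective values and hull dimensions from (4) are available, the decision rule of (17) reduces to a few lines; the helper names and the toy numbers below are illustrative only:

```python
import numpy as np

def between_set_distance(F_min, G_min, Q_min, d_i, d_j, lam1=1.0):
    """Between-set distance of (17): the optimal SANP objective scaled by
    the total affine-hull dimension to remove the bias toward larger sets.
    lam1 is the weight lambda_1 from (4)."""
    return (d_i + d_j) * (F_min + lam1 * (G_min + Q_min))

def classify(query_vals, gallery_labels):
    """Nearest-neighbor rule over the gallery: query_vals[g] holds the
    (hypothetical) precomputed (F_min, G_min, Q_min, d_i, d_j) from
    matching the query set against gallery set g."""
    dists = [between_set_distance(*v) for v in query_vals]
    return gallery_labels[int(np.argmin(dists))]

# Toy usage: two gallery sets; the first yields a smaller optimal objective.
vals = [(0.1, 0.02, 0.03, 5, 6), (0.4, 0.05, 0.05, 5, 7)]
label = classify(vals, ["subject_A", "subject_B"])  # -> "subject_A"
```

Note how the (d_i + d_j) factor penalizes matches that are "easy" only because the combined hulls span most of the feature space.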

7.1 Experimental Setup

7.1.1 Data Set Configuration
To evaluate the performance of the proposed SANP/KSANP and compare it with existing techniques, three face video datasets are used: Honda/UCSD [13], CMU Mobo [28], and YouTube Celebrities [18]. Each video sequence in these datasets corresponds to an image set. For the Honda/UCSD and CMU Mobo datasets, the face in every frame of the video sequences is automatically detected by applying [27]. Because face detection using [27] often fails on the YouTube Celebrities dataset, we instead used [4] to track the faces across every video sequence, given the cropped face in

the first frame [18]. Once the face images are detected/tracked, they are cropped and converted to gray scale. Histogram equalization is the only preprocessing step used, to minimize illumination variations. We evaluate and compare the proposed methods on different image sizes as well as different features. For the Honda/UCSD dataset, the gray-scale face images are resized to 20 x 20 as in [30], and the raw pixel values of the resized images are vectorized to form the columns of the data matrix X. For the CMU Mobo dataset, the gray-scale face images are resized to 40 x 40 as in [7], and Local Binary Patterns (LBP) [35] are extracted as the features of the individual images. For the YouTube Celebrities dataset, the tracked gray-scale faces are resized to 40 x 40 as in [29], and the raw pixel values of the resized images are vectorized to form the columns of the data matrix X, as for the Honda/UCSD dataset.
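The preprocessing pipeline can be sketched as below. The nearest-neighbor resizing and the helper name `preprocess` are assumptions (the paper does not specify the resizing method); the histogram equalization and the column-wise vectorization follow the description above:

```python
import numpy as np

def preprocess(face, size=(20, 20)):
    """Resize a gray-scale face (nearest-neighbor, for simplicity),
    histogram-equalize it, and vectorize it into one data-matrix column."""
    h, w = face.shape
    rows = np.arange(size[0]) * h // size[0]       # nearest source rows
    cols = np.arange(size[1]) * w // size[1]       # nearest source cols
    small = face[np.ix_(rows, cols)].astype(np.uint8)
    # Histogram equalization: map each gray level through the scaled CDF.
    hist = np.bincount(small.ravel(), minlength=256)
    cdf = hist.cumsum()
    lut = np.round(255.0 * cdf / cdf[-1]).astype(np.uint8)
    return lut[small].reshape(-1, 1).astype(np.float64)  # e.g., 400 x 1

# Stack the preprocessed frames as the columns of the set's data matrix X.
frames = [np.random.default_rng(i).integers(0, 256, (48, 48)) for i in range(3)]
X = np.hstack([preprocess(f) for f in frames])     # shape (400, 3)
```

In practice the input faces come from the detector/tracker of [27]/[4]; random arrays stand in for them here.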

7.1.2 Comparison with Existing Methods
We compare the proposed method with several image set classification methods recently proposed in the literature: Discriminant Canonical Correlation Analysis (DCC) [41], Manifold-to-Manifold Distance (MMD) [30], Manifold Discriminant Analysis (MDA) [29], the linear version of the Affine Hull-based Image Set Distance (AHISD) [7], and the Convex Hull-based Image Set Distance (CHISD) [7]. Here, AHISD can be regarded as a baseline method that finds the nearest neighbors without the sparsity constraint. Note that [41], [30], [29], [7] have conducted extensive comparisons with exemplar-based methods, e.g., Linear Discriminant Analysis (LDA) [26], Kernel Fisher Discriminant (KFD) [31], and Marginal Fisher Analysis (MFA) [32], and have shown that set-based methods generally outperform exemplar-based methods. For this reason, we do not compare with exemplar-based methods in this paper. The standard implementations of all methods from the original authors are used, except for MDA, which we carefully implemented ourselves since it is not publicly available. The important parameters of the different methods are carefully optimized as follows. For DCC, the dimension of the embedding space is set to 100; the subspace dimensions are set to 10, which preserves 90 percent of the energy, and the corresponding 10 maximum canonical correlations are used to define set similarity. For MMD and MDA, the parameters are configured according to [29], [30]. Specifically, the ratio between euclidean distance and geodesic distance is optimized for each dataset (2.0 for Honda, 5.0 for Mobo, and 2.0 for the YouTube dataset; the optimal parameter for the Mobo dataset differs because LBP histograms are used in that case). The


Fig. 4. The Honda/UCSD dataset. (a) Image samples. Each row represents a different image set. (b) SANPs between a query set and gallery sets sorted according to their respective distance (shown below the SANPs). The SANPs between image sets of the same subject are highlighted with a red bounding box. Note that the distance of the second nearest set is over 16 times larger than that of the correct set.

maximum canonical correlation is used in defining MMD. For MDA, the number of between-class NN local models and the dimension of the MDA embedding space are tuned for each dataset as specified in [29]. The number of connected nearest neighbors for computing the geodesic distance in both MMD and MDA is fixed to its default value, 12. There is no parameter to set for AHISD. For CHISD, we set the error penalty parameter as in [7] (C = 100 for gray-scale features and C = 50 for LBP in the linear SVM). Both methods apply PCA to preserve 90 percent of the energy, as before.

7.2 Results and Analysis

7.2.1 Honda/UCSD Data Set
The Honda/UCSD dataset [13] contains 59 video sequences of 20 different subjects. The set lengths vary from 12 to 645 frames. Fig. 4a shows sample images from some of the video sequences; each row corresponds to an image set of one subject. In this dataset, different poses and expressions appear across the different sequences of each subject. In our experiments, we use the standard training/testing configuration provided in [13]: 20 sequences are used for training and the remaining 39 sequences for testing. We report results using all frames as well as a limited number of frames. Specifically, we conduct experiments setting an upper bound M on the maximum set length to 100 and 50; when a set contains fewer than M images, all its images are used for classification. Such situations often occur in real-world applications: for example, the tracking of a face may fail partway through a long sequence, so that only the first part of the sequence is available for classification. Moreover, classification based on smaller sets can also be more efficient. Although this dataset is relatively easy when complete video sequences are used, the performance of the various techniques degrades differently when the sizes of the image sets are reduced. Fig. 4b shows the optimized SANPs of a given query image set to all gallery sets. The SANPs of the query and the correct gallery set are the most similar and have the minimum corresponding distance. Table 2 summarizes the identification rates of all methods. Notice that, overall, the proposed SANP and KSANP both outperform the other methods in all configurations. When the whole sequences are used, only SANP and KSANP achieve perfect classification. Both methods also remain much more stable when

only the first 50 or 100 frames of the training/test videos are used. KSANP has the highest performance in all columns of Table 2, whereas the performance of AHISD [7] fluctuates and that of MDA [29] drops significantly when 50 training/testing frames are used. It is interesting to note that the performance of the discriminant learning methods (DCC and MDA) degrades more heavily with the reduction of training data. The geometric models (AHISD and CHISD) perform more consistently across different set lengths, but with lower accuracy. The robustness of SANP and KSANP to fewer training samples is attributed to our use of the loose affine hull model, which is particularly appealing with few training samples since the unseen appearances of a set are modeled as affine combinations of its sample images. Note that the accuracies of AHISD and CHISD are lower than those reported in [7] because the images are resized to 20 x 20 instead of 40 x 40; the results are obtained with the implementation provided by the authors of [7]. Although we used video sequences in our experiments, we did not utilize the temporal relationship between consecutive frames: if the order of images within each set is randomly shuffled, the SANPs and the recognition results remain exactly the same. For further verification, we randomly selected 50 and 100 frames (not temporally consecutive) from every video of the UCSD/Honda data to form image sets and obtained the same ranking of all methods.

7.2.2 CMU Mobo Data Set
The Mobo (Motion of Body) dataset [28] was originally created for human pose identification. It contains 96 sequences of 24 subjects walking on a treadmill. Multiple cameras were used to capture videos of four walking patterns: slow, fast, inclined, and carrying a ball. For each subject, four video

TABLE 2 Identification Rates on the Honda/UCSD Data Set


TABLE 3 Average Identification Rates and the Standard Deviations of Different Methods on the CMU Mobo Data Set

Fig. 5. Image examples of the CMU Mobo dataset [29], [7]. Each row represents a set and two sets per subject are shown.

sequences were collected, each corresponding to one walking pattern. Fig. 5 shows cropped images from six sequences of three subjects. On this dataset, uniform LBP histograms using circular (8, 1) neighborhoods are extracted from the 8 x 8 squares of the gray-scale images as the image features. One sequence per subject is randomly selected for training and the remaining sequences are used for testing. We conduct 10 experiments by repeating the random selection of training/testing data and report the average identification rates and standard deviations of the different methods. The results summarized in Table 3 show that the proposed methods consistently achieve the best performance (highest classification rate and smallest standard deviation). Even though the proposed SANP already achieves very high performance, its kernel extension KSANP further improves both accuracy and stability: relative to SANP, KSANP reduces the error rate by 28.2 percent with an even smaller standard deviation, which is significant. It is worth mentioning that our method is generic and performs well across different types of features, e.g., pixel values or LBP features. Tables 2 and 3 show that our method consistently achieves good results using pixel values (Honda and YouTube) and LBP features (Mobo). Other methods, in contrast, may achieve good results with one feature and degraded performance with another. For example, MDA achieves the second best overall performance on the Honda

dataset using pixel values, while CHISD achieves the second best performance on the Mobo dataset using LBP histograms.
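For reference, a simplified version of the LBP features can be sketched as follows. This sketch uses the basic 3 x 3 (8-neighbor) LBP without the uniform-pattern mapping or circular interpolation of [35], so it is an approximation of the features actually used, and the block layout is an assumption:

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP: threshold the 8 neighbors of each interior pixel at
    the center value and pack the comparison bits into a code in [0, 255]."""
    c = img[1:-1, 1:-1]
    # 8 neighbors in a fixed clockwise order starting at the top-left.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy : img.shape[0] - 1 + dy,
                 1 + dx : img.shape[1] - 1 + dx]
        code |= ((nb >= c) << bit).astype(np.uint8)
    return code

def lbp_feature(img, block=8):
    """Concatenate per-block 256-bin histograms of the LBP codes
    (one histogram per 8 x 8-pixel square) into a feature vector."""
    codes = lbp_codes(img)
    feats = []
    for by in range(codes.shape[0] // block):
        for bx in range(codes.shape[1] // block):
            patch = codes[by * block : (by + 1) * block,
                          bx * block : (bx + 1) * block]
            feats.append(np.bincount(patch.ravel(), minlength=256))
    return np.concatenate(feats).astype(np.float64)

img = np.random.default_rng(0).integers(0, 256, (40, 40))  # toy 40x40 face
f = lbp_feature(img)   # 16 blocks x 256 bins for a 40x40 input
```

These per-image feature vectors would then form the columns of the set's data matrix X, exactly as the raw pixel vectors do for the other datasets.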

7.2.3 YouTube Celebrities Data Set
The YouTube Celebrities dataset [18] is the largest video dataset collected for face tracking and recognition. It contains 1,910 video sequences of 47 celebrities (actors, actresses, and politicians) collected from YouTube. The clips contain different numbers of frames (from 8 to 400) and are mostly of low resolution and highly compressed. Fig. 6a shows some examples of the cropped faces in this dataset. This dataset presents more challenging situations for image set classification for two reasons. First, the video sequences exhibit larger variations in pose, illumination, and expression. Second, the low quality of the frames, due to the high compression rate, introduces tracking errors and noise into the cropped faces. Without enforcing facial constraints as in [18], the cropped faces used in this paper contain larger tracking errors than the face images from [18], which makes our experimental setting even more challenging. We conduct fivefold cross-validation experiments: the whole dataset is divided equally (with minimal overlap) into five folds, each containing nine video sequences per subject. In each fold, three image sets per subject are randomly selected for training and the remaining six for testing. The average identification rates and the associated standard deviations of the different methods are summarized in Table 4. Because the videos are real-world captures of low quality and the dataset covers broad appearance variations, all methods achieve lower recognition rates than on the other two datasets. Notice that the results

Fig. 6. The YouTube Celebrities dataset. (a) Image samples. Each row represents a different image set. (b) The top 20 SANPs between a query set and gallery sets sorted according to their respective distance (shown below the SANPs). The SANPs between image sets of the same subject are highlighted with a red bounding box. Note that the distance of the second nearest set is over 13 times larger than that of the correct set.


TABLE 4 Computational Times and Average Performances on the YouTube Dataset

Fig. 7. Three failure examples from the YouTube Celebrities dataset. Four image samples are shown for each set. The first row contains three query sets that could not be correctly matched to their gallery sets (same column) in the next three rows. Note that these cases are extremely challenging and contain large pose/expression variations and motion blur.

of some methods are lower than those reported in [29] because our experimental setting is more challenging: the automatically cropped faces contain larger tracking errors, and the training/testing data distribution in fivefold cross validation is broader than in [29]. Our proposed SANP and KSANP methods again achieve the best performance on this dataset, with the same set of parameters used in the previous experiments. Fig. 6b shows the optimized SANPs of a given query image set to the 20 gallery sets most similar to the query. The SANPs of the query and gallery sets belonging to the same identity are clearly the most similar and have the minimum corresponding distance. On this dataset, KSANP does not significantly improve over SANP because most of the failure cases are due to very large intraclass variations, which cannot be corrected even after the data are mapped to the nonlinear feature space by the RBF function.

7.3 Timing
Table 4 also compares the computational complexity of our methods with that of the five image set classification methods on the YouTube dataset. We include the Sparse Representation-based Classification (SRC) proposed by Wright et al. [12] for single image classification in order to show that a straightforward extension of single image classification techniques to image set classification can be computationally infeasible. We extend SRC [12] to image set classification as follows: Given a query set, all of its sample images are sparsely represented as linear combinations of the images of all gallery sets, and the query set is assigned to the class with the minimum reconstruction error over all its samples (similar to [12]). We use an efficient algorithm [9] to solve for the sparse coefficients of the linear combination. Table 4 compares the accuracy and speed of SANP/KSANP with the five image set classification methods and our modification of SRC. Since only DCC, MMD, and MDA require a training phase, their training times are also given. For comparison, we list the testing times for the same query set and the average performance over all query sets. DCC and MMD are the fastest because they use a simple euclidean distance; however, their identification rates are the lowest. The computational complexity of MDA is relatively higher due to its hierarchical analysis of manifolds, yet it offers little

improvement in accuracy over DCC and MMD. Compared to our approach, AHISD and CHISD are faster due to their efficient SVM-like optimization, but their identification rates are significantly (5 percent) lower than SANP/KSANP. SRC has the highest computational cost and ranks third in accuracy, after SANP and KSANP. The accuracy of our approach comes from the fact that it dynamically finds the nearest points (SANPs), which correspond to images that may not have appeared among the samples of either set, whereas SRC relies completely upon the sparse representations of the original samples. Our method is also more efficient than SRC because it optimizes SANPs/KSANPs over the smaller individual gallery sets (a small dictionary), whereas SRC approximates the query image from the complete gallery of all image sets, i.e., a much larger dictionary. Moreover, a straightforward extension of SRC to the image set classification problem requires sparse approximations of all samples in the query set, whereas our method requires the sparse approximations of the SANPs/KSANPs only. We can also see that the complexity of the proposed KSANP is only slightly higher than that of SANP, thanks to the precomputation of all the inner product related matrices.
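The SRC extension described above can be sketched as follows, with a plain ISTA (iterative soft-thresholding) solver standing in for the algorithm of [9]; all function names, parameter values, and the toy data are illustrative:

```python
import numpy as np

def ista(D, y, lam=0.1, iters=200):
    """Minimize 0.5*||D a - y||^2 + lam*||a||_1 by proximal gradient
    (iterative soft thresholding) -- a simple stand-in for [9]."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1/L, L = Lipschitz const.
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        g = a - step * (D.T @ (D @ a - y))        # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink
    return a

def src_set_classify(query, gallery_sets):
    """SRC extended to sets: code every query sample over the dictionary of
    all gallery images; pick the class with the smallest total
    class-restricted reconstruction error."""
    D = np.hstack(gallery_sets)                   # full gallery dictionary
    labels = np.concatenate([[c] * g.shape[1]
                             for c, g in enumerate(gallery_sets)])
    errs = np.zeros(len(gallery_sets))
    for y in query.T:                             # every sample in the set
        a = ista(D, y)
        for c in range(len(gallery_sets)):
            mask = labels == c
            errs[c] += np.linalg.norm(y - D[:, mask] @ a[mask]) ** 2
    return int(np.argmin(errs))

# Toy data: two gallery sets drawn around distinct directions u0, u1.
rng = np.random.default_rng(0)
u0, u1 = rng.standard_normal(30), rng.standard_normal(30)
g0 = u0[:, None] + 0.01 * rng.standard_normal((30, 8))
g1 = u1[:, None] + 0.01 * rng.standard_normal((30, 8))
q = u0[:, None] + 0.01 * rng.standard_normal((30, 3))   # query near class 0
pred = src_set_classify(q, [g0, g1])
```

The per-sample loop over the full gallery dictionary is exactly what makes this extension expensive relative to SANP/KSANP, which sparsely approximate only the nearest points against one gallery set at a time.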

7.4 Limitations
The major limitation of the proposed approach is that it relies on the unseen appearances of a set being modeled by affine combinations of its samples, their features, or their nonlinear kernel-mapped versions. While this holds for some variations in illumination, facial expression, and pose, it does not hold for extreme variations, especially of the latter two factors. For example, neither SANP nor KSANP achieves results on the YouTube dataset as good as those on the UCSD/Honda and CMU Mobo datasets. This is because the YouTube training and test sets contain very large intraclass variations, and most of the failure cases are due to these variations, which cannot be modeled even after the data are mapped to the nonlinear feature space by the RBF function. Fig. 7 shows some failure cases from the YouTube dataset that could not be correctly classified. The lighting conditions, facial expressions, and even the poses change dramatically between the query set and the three gallery sets. Such extreme intraclass variations cannot be modeled from raw pixel values, which explains why all the other methods (DCC, MMD, MDA, AHISD, CHISD, SRC) also failed to


classify these cases. Other, more complex local features would need to be utilized to further improve performance in such cases.

8 CONCLUSION AND DISCUSSION

In this paper, a novel sparse formulation for image set classification is proposed. A joint representation of an image set is defined which includes both the sample images and their affine hull model. The Sparse Approximated Nearest Points (SANPs), the nearest points between two sets that can each be sparsely approximated from the samples of its respective set, are introduced. Sparsity is enforced on the sample coefficients rather than the model coefficients to preserve the fidelity of the SANPs to the data samples. Unlike the sparse coding of single images, the SANP optimization jointly minimizes the distance between, and maximizes the sparsity of, the nearest points. A scalable accelerated proximal gradient method is used to solve this optimization. The proposed SANP is further extended to its kernel version, KSANP, which can better model the nonlinear structure of set data, and a regularization process is proposed to dynamically select the RBF kernel parameter when matching the probe set with different gallery sets. A thorough experimental evaluation is conducted on three benchmark datasets for face recognition based on image sets and the results are compared to the existing state of the art. Using the same fixed set of parameters, our methods consistently achieve the best performance across all experiments and features, while the performance of the other methods fluctuates across datasets/features even with tuned parameters.

ACKNOWLEDGMENTS This research was supported by ARC grants DP1096801, DP0881813, and DP110102399. The authors thank T. Kim for sharing the source code of DCC and R. Wang for sharing the source code of MMD and the cropped faces of the Honda/UCSD dataset. They also thank H. Cevikalp for sharing the source code of AHISD/CHISD and providing the LBP features for the Mobo dataset.

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

REFERENCES [1] [2] [3] [4] [5] [6] [7] [8]

A. Beck and M. Teboulle, “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009. A.W Fitzgibbon and A. Zisserman, “Joint Manifold Distance: A New Approach to Appearance Based Clustering,” Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 26-33, 2003. B. Scho¨lkopf, A. Smola, and K.R. Mu¨ller, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem,” Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998. D.A. Ross, J. Lim, R.S. Lin, and M.H. Yang, “Incremental Learning for Robust Visual Tracking,” Int’l J. Computer Vision, vol. 77, nos. 1-3, pp. 125-141, 2008. E. Oja, Subspace Methods of Pattern Recognition. Research Studies Press, 1983. G. Shakhnnarvovich, J.W Fisher, and T. Darrell, “Face Recognition from Long-Term Observations,” Proc. European Conf. Computer Vision, pp. 851-865, 2002. H. Cevikalp and B. Triggs, “Face Recognition Based on Image Sets,” Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 2567-2573, 2010. H. Hotelling, “Relations between Two Sets of Variates,” Biometrika, vol. 28, nos. 3/4, pp. 321-377, 1936.

[24]

[25]

[26]

[27] [28] [29]

[30]

VOL. 34,

NO. X,

XXXXXXX 2012

[9] H. Lee, A. Battle, R. Raina, and A.Y. Ng, "Efficient Sparse Coding Algorithms," Proc. Ann. Conf. Neural Information Processing Systems, pp. 801-808, 2006.
[10] J.R. Beveridge, B.A. Draper, J.M. Chang, M. Kirby, H. Kley, and C. Peterson, "Principal Angles Separate Subject Illumination Spaces in YDB and CMU-PIE," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 351-363, Feb. 2009.
[11] J. Weng, C.H. Evans, and W.S. Hwang, "An Incremental Learning Method for Face Recognition under Continuous Video Stream," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 251-256, 2000.
[12] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, "Robust Face Recognition via Sparse Representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb. 2009.
[13] K.C. Lee, J. Ho, M.H. Yang, and D. Kriegman, "Video-Based Face Recognition Using Probabilistic Appearance Manifolds," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 313-320, 2003.
[14] K. Fukui and O. Yamaguchi, "The Kernel Orthogonal Mutual Subspace Method and Its Application to 3D Object Recognition," Proc. Asian Conf. Computer Vision, pp. 467-476, 2007.
[15] K. Fukui, B. Stenger, and O. Yamaguchi, "A Framework for 3D Object Recognition Using the Kernel Constrained Mutual Subspace Method," Proc. Asian Conf. Computer Vision, pp. 315-324, 2006.
[16] L. Wolf and A. Shashua, "Learning over Sets Using Kernel Principal Angles," J. Machine Learning Research, vol. 4, no. 10, pp. 913-931, 2003.
[17] M.J. Lyons, J. Budynek, and S. Akamatsu, "Automatic Classification of Single Facial Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1357-1362, Dec. 1999.
[18] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, "Face Tracking and Recognition with Visual Constraints in Real-World Videos," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[19] M. Nishiyama, M. Yuasa, T. Shibata, T. Wakasugi, T. Kawahara, and O. Yamaguchi, "Recognizing Faces of Moving People by Hierarchical Image-Set Matching," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[20] M. Nishiyama, O. Yamaguchi, and K. Fukui, "Face Recognition with the Multiple Constrained Mutual Subspace Method," Proc. Int'l Conf. Audio- and Video-Based Biometric Person Authentication, pp. 71-80, 2005.
[21] O. Arandjelovic and R. Cipolla, "A Pose-Wise Linear Illumination Manifold Model for Face Recognition Using Video," Computer Vision and Image Understanding, vol. 113, no. 1, pp. 113-125, 2009.
[22] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell, "Face Recognition with Image Sets Using Manifold Density Divergence," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 581-588, 2005.
[23] O. Boiman, E. Shechtman, and M. Irani, "In Defense of Nearest-Neighbor Based Image Classification," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.
[24] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing Multiple Parameters for Support Vector Machines," Machine Learning, vol. 46, pp. 131-159, 2002.
[25] O. Yamaguchi, K. Fukui, and K.-i. Maeda, "Face Recognition Using Temporal Image Sequence," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 318-323, 1998.
[26] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.
[27] P. Viola and M.J. Jones, "Robust Real-Time Face Detection," Int'l J. Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[28] R. Gross and J. Shi, "The CMU Motion of Body (MoBo) Database," Technical Report CMU-RI-TR-01-18, Robotics Inst., 2001.
[29] R. Wang and X. Chen, "Manifold Discriminant Analysis," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 429-436, 2009.
[30] R. Wang, S. Shan, X. Chen, and W. Gao, "Manifold-Manifold Distance with Application to Face Recognition Based on Image Set," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.


[31] S. Mika, G. Rätsch, and K.R. Müller, "A Mathematical Programming Approach to the Kernel Fisher Algorithm," Proc. Ann. Conf. Neural Information Processing Systems, pp. 801-808, 2000.
[32] S. Yan, D. Xu, B. Zhang, H.J. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, Jan. 2007.
[33] S. Zhou and R. Chellappa, "Probabilistic Human Recognition from Video," Proc. European Conf. Computer Vision, pp. 681-697, 2002.
[34] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[35] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face Description with Local Binary Patterns: Application to Face Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, Dec. 2006.
[36] T.J. Chin, K. Schindler, and D. Suter, "Incremental Kernel SVD for Face Recognition with Image Sets," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 461-466, 2006.
[37] T.K. Kim and R. Cipolla, "On-Line Learning for Maximizing Orthogonality between Subspaces and Its Application to Image Set-Based Face Recognition," IEEE Trans. Image Processing, vol. 19, no. 4, pp. 1067-1074, Apr. 2009.
[38] T.K. Kim, J. Kittler, and R. Cipolla, "Incremental Learning of Locally Orthogonal Subspaces for Set-Based Object Recognition," Proc. British Machine Vision Conf., pp. 559-568, 2006.
[39] T.K. Kim, O. Arandjelovic, and R. Cipolla, "Learning over Sets Using Boosted Manifold Principle Angles (BoMPA)," Proc. British Machine Vision Conf., pp. 779-788, 2005.
[40] T.K. Kim, O. Arandjelovic, and R. Cipolla, "Boosted Manifold Principal Angles for Image Set-Based Recognition," Pattern Recognition, vol. 40, no. 9, pp. 2475-2484, 2007.
[41] T.K. Kim, J. Kittler, and R. Cipolla, "Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005-1018, June 2007.
[42] T. Wang and P. Shi, "Kernel Grassmannian Distances and Discriminant Analysis for Face Recognition from Image Sets," Pattern Recognition Letters, vol. 30, no. 13, pp. 1161-1165, 2009.
[43] V.N. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.
[44] W. Fan and D.Y. Yeung, "Locally Linear Models on Face Appearance Manifolds with Application to Dual-Subspace Based Classification," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1384-1390, 2006.
[45] X. Li, K. Fukui, and N. Zheng, "Boosting Constrained Mutual Subspace Method for Robust Image-Set Based Object Recognition," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 1132-1137, 2009.
[46] X. Li, K. Fukui, and N. Zheng, "Image-Set Based Face Recognition Using Boosted Global and Local Principal Angles," Proc. Asian Conf. Computer Vision, pp. 323-332, 2009.
[47] X. Liu and T. Cheng, "Video-Based Face Recognition Using Adaptive Hidden Markov Models," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 340-348, 2003.
[48] Y. Hu, A.S. Mian, and R. Owens, "Sparse Approximated Nearest Points for Image Set Classification," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2011.
[49] Y. Nesterov, "Gradient Methods for Minimizing Composite Objective Function," Technical Report 2007076, Université Catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.


Yiqun Hu received the BE degree from Xiamen University, China, in 2002 and the PhD degree from Nanyang Technological University, Singapore, in 2008. He was a research fellow in the School of Computer Engineering, Nanyang Technological University, leading a research group for US National Science Foundation (NSF) funded research from 2008 to 2010, and a research assistant professor in the School of Computer Science & Software Engineering, The University of Western Australia, from 2010 to 2011. His research interests include visual saliency modeling, multimedia retargeting, image set classification, and image near-duplicate retrieval. He has published more than 30 premier journal and conference papers in those areas. He is currently with the PayPal innovation team, Singapore.

Ajmal S. Mian received the BE degree in avionics from Nadirshaw Edulji Dinshaw (NED) University, Pakistan, in 1993, the MS degree in information security from the National University of Sciences and Technology, Pakistan, in 2003, and the PhD degree in computer science with distinction from The University of Western Australia in 2006. He received the Australasian Distinguished Doctoral Dissertation Award from the Computing Research and Education Association of Australia (CORE) in 2007. He received the prestigious Australian Postdoctoral Fellowship in 2008 and the Australian Research Fellowship in 2011. He has secured four national competitive research grants and is currently a research associate professor at The University of Western Australia. His research interests include computer vision, pattern recognition, multimodal biometrics, and multispectral image analysis.

Robyn Owens received the BSc (Hons) degree in mathematics from The University of Western Australia (UWA), and the MSc (1976) and DPhil (1980) degrees, also in mathematics, from Oxford University. She spent three years in Paris at l'Université de Paris-Sud, Orsay, continuing research in mathematical analysis before returning to UWA to work as a research mathematician. She has lectured in mathematics and computer science. Her research has focused on computer vision, including feature detection in images, 3D shape measurement, image understanding, and representation. She is a fellow of the Australian Computer Society and a recipient of the UK Rank Prize. She is currently deputy vice-chancellor (research) at UWA.

