
Multilinear Graph Embedding: Representation and Regularization for Images

Yi-Lei Chen, Graduate Student Member, IEEE, and Chiou-Ting Hsu, Senior Member, IEEE

Table 1. Notations used in this paper.

Abstract—Given a set of images, finding a compact and discriminative representation is still a big challenge, especially when multiple latent factors are hidden in the way of data generation. Although multilinear models are widely used to parameterize multi-factor images, most methods are based on high-order singular value decomposition (HOSVD), which preserves global statistics but interprets local variations inadequately. To this end, we propose a novel method, called multilinear graph embedding (MGE), together with its kernelization MKGE, to leverage manifold learning techniques in multilinear models. Our method theoretically links linear, nonlinear, and multilinear dimensionality reduction. We also show that the supervised MGE encodes informative image priors for image regularization, provided that an image is represented as a high-order tensor. In our experiments on face and gait recognition, the superior performance demonstrates that MGE represents multi-factor images better than classic methods, including HOSVD and its variants. In addition, the significant improvement in image (or tensor) completion validates the potential of MGE for image regularization.

Index Terms—Multi-factor data, graph embedding, manifold learning, image regularization.

1. INTRODUCTION

Representation of digital images is the first step toward subsequent processing and analysis for many vision-related tasks. However, the high dimensionality of images usually hinders efficient computational analysis. For example, a small color image of size 100x100 has 10^4 pixels; i.e., more than 10^4 variables are required to represent it. Efficient algorithms working on the lossless representation are probably infeasible; and when

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected] This work was supported by the National Science Council of R.O.C. under contract NSC98-2221-E-007-084-MY3. Yi-Lei Chen is with the Multimedia Processing Laboratory, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, R.O.C. (e-mail: [email protected]). Chiou-Ting Hsu is with the Multimedia Processing Laboratory, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, R.O.C. (e-mail: [email protected]).

a — scalar
a (boldface lowercase) — vector
A — matrix
X, Y — tensor (a multidimensional array)
X_{i_1,...,i_N} — the (i_1,...,i_N)-th element of X
(x) — the Kronecker product
tr(.) — the trace of a square matrix
k(.,.) — the pre-defined kernel function
A† — the Moore-Penrose pseudoinverse of A

the images are involved in model learning, the curse of dimensionality [1] might limit the efficacy of data modeling. Provided that a set of images mostly shares an intrinsic structure (e.g., human face and gait) and stems from a small set of parameters (e.g., identity and pose), dimensionality reduction methods can be readily used to generate compact representations, via either linear [2-6] or nonlinear [7-11] techniques. These methods first vectorize images and then characterize the commonality between image data in terms of global statistics or local manifolds. Manifold-based methods generally perform better for recognition tasks because the neighborhood structure of data samples is explicitly preserved, and this structure-preserving property is significant in explaining numerous real-world observations. Despite these developments in dimensionality reduction, there is still room for improvement in modeling images whose explicit structure or latent factors can be well formulated. In this paper, we refer to the two data models which characterize images' explicit structure and latent factors as multi-structure data and multi-factor data, respectively. Both models exploit the tensor representation, a high-order generalization of 1-D vectors and 2-D matrices, but parameterize the data under different schemes. In multi-structure data modeling, each sample is encoded as a tensor to keep its own structure for further analysis. For example, a gray-level image is represented by a second-order tensor (i.e., a matrix), and a video sequence is represented as a high-order tensor (as shown in Fig. 1(a)). The feasibility of dimensionality reduction on the original data structure has been well investigated [11-15].



Fig. 1. Illustration of two different data models. (a) Multi-structure data (a 3-D video volume); and (b) multi-factor data (face images with different variations).

For example, classic methods, such as principal component analysis (PCA) [2], linear discriminant analysis (LDA) [3], and locally linear embedding (LLE) [7], have been successfully extended to two-dimensional structures in [12], [13], and [14], respectively. Furthermore, Yan et al. [11] unified existing dimensionality reduction methods and also proposed their tensorized versions to address multi-structure data. Their method can model the high-order structure but, unfortunately, fails to characterize the latent factors hidden in the way of data generation. In multi-factor data modeling, the entire dataset is organized as a full tensor. Fig. 1(b) shows an example of a second-order tensor composed of facial images with variations along the pose and lighting factors. Multilinear models (e.g., high-order singular value decomposition (HOSVD) [16]) are then used to factorize this tensor object, so that the entry corresponding to each sample is parameterized by a set of factor-specific variables. In Fig. 1(b), the two latent factors in face appearance are head pose and lighting condition; in this case, we can parameterize a given face image by pose- and lighting-related variables. Although multi-factor data modeling can capture latent factors of real-world data, recognition on multi-factor data is still very challenging because of the wide variation in latent factors. For example, in face recognition, most existing methods only work well on near-frontal faces but fail to identify a subject when the pose variation dominates his/her facial appearance. This decrease in performance can also be observed in human gait recognition when a subject is captured from different camera views or with different clothing styles. In this paper, our goal is to seek a compact and more discriminative data representation for images stemming from multiple latent factors. Existing methods [17-24] generally parameterize multi-factor data by HOSVD.
The Tensorface approach [17] first employed HOSVD to tackle factor variations in face recognition. Subsequently, more technical extensions of HOSVD were investigated for data analysis [21-24] and data synthesis [18-20]. The HOSVD factorization has been proven to preserve global statistics, but may interpret the local manifold inadequately in the multilinear space [25-28]. We will show that this shortcoming can be significantly rectified by leveraging manifold-learning techniques within the graph embedding (GE) formulation. Note that [11] has unified GE with linearization, kernelization, and tensorization, but did not extend it to the multilinear space. To this end, we propose a novel method called Multilinear Graph Embedding (MGE) and its kernelized version MKGE, using either unsupervised or supervised strategies, in order to parameterize multi-factor data under the GE framework. In

addition to recognition tasks, we also show that if an image is represented as a high-order tensor, then the supervised MGE inherently serves as an image regularization by imposing image priors on the model components estimated in the tensor factorization. In other words, MGE enables both compact representation and potential regularization for images. Our major contributions are summarized as follows. Because MGE extends GE into the multilinear subspace and its kernelization is also derived, our method theoretically links linear, nonlinear, and multilinear dimensionality reduction for data representation. We will show in Section 4 that the supervised MGE is a promising regularization for images that closely relates to tensor factorization. The theoretical relation to existing multilinear models will be discussed in Section 5, where our derivation justifies that HOSVD-based methods [17-28] are special cases of MGE. In Section 6, experimental results demonstrate that when the latent factor variations of testing data are unavailable in the training stage, MGE outperforms classic dimensionality reduction methods, including HOSVD, in face and gait recognition. In addition, as a byproduct for image regularization, the supervised MGE significantly improves the accuracy in image (or tensor) completion problems.

2. PRELIMINARIES

The notations used in this paper are defined in Table 1. Below we introduce the tensor concept and two techniques essentially related to our method. Readers may refer to [29] for more theoretical details of tensor basics.

2.1 Tensor Basics

Definition 1 (Inner Product and Norm of a Tensor): The inner product of two N-th-order tensors X, Y in R^{I_1 x ... x I_N} is defined by <X, Y> = sum_{i_1,...,i_N} X_{i_1,...,i_N} Y_{i_1,...,i_N}, and accordingly, the norm of the tensor X is defined by ||X|| = sqrt(<X, X>).

Definition 2 (Mode-k Unfolding and Folding): The mode-k unfolding of a tensor X in R^{I_1 x ... x I_N} is a matrix X_(k) in R^{I_k x (prod_{j != k} I_j)}; that is, the tensor X is unfolded into a matrix by stacking all the mode-k vectors as columns of X_(k). Accordingly, the mode-k folding is the process from the matrix X_(k) back to X.

Definition 3 (Mode-k Product): The operator x_k denotes the mode-k product. The mode-k product of X with a matrix U in R^{J x I_k} is defined by Y = X x_k U, where Y in R^{I_1 x ... x I_{k-1} x J x I_{k+1} x ... x I_N}.
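The mode-k unfolding and mode-k product above can be sketched in a few lines of NumPy; the helper names are our own, and the final check against `np.einsum` is only a sanity test on a toy tensor:

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: stack all mode-k fibers of X as columns (Definition 2)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def fold(M, k, shape):
    """Inverse of unfold: restore the tensor of the given shape."""
    full = [shape[k]] + [s for i, s in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape(full), 0, k)

def mode_k_product(X, U, k):
    """Y = X x_k U, computed via the matrix identity Y_(k) = U X_(k)."""
    shape = list(X.shape)
    shape[k] = U.shape[0]
    return fold(U @ unfold(X, k), k, shape)

# sanity check on a small third-order tensor
X = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.random.default_rng(0).standard_normal((5, 3))
Y = mode_k_product(X, U, 1)
assert np.allclose(Y, np.einsum('jb,abc->ajc', U, X))
```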


Fig. 2. An illustration of the embedded data pairs in the multilinear model. (a) HOSVD, which links subject-dependent and pose-dependent pairs along each mode; and (b) MGE, which links factor-dependent pairs.

Each element in Y is computed by

Y_{i_1,...,i_{k-1}, j, i_{k+1},...,i_N} = sum_{i_k} X_{i_1,...,i_N} U_{j, i_k}.

We can also rewrite Y = X x_k U as Y_(k) = U X_(k) based on the mode-k unfolding operation.

Lemma 1: Given an arbitrary N-th-order tensor X in R^{I_1 x ... x I_N} and N-1 matrices U_k in R^{J_k x I_k}, k = 1, ..., N-1, the tensor Y = X x_1 U_1 ... x_{N-1} U_{N-1} can be rewritten as Y_(N) = X_(N) (U_{N-1} (x) ... (x) U_1)^T. (For the proof, please refer to Appendix A.)

2.2 Graph Embedding

In the graph embedding framework, given high-dimensional samples x_1, ..., x_S, the goal is to determine their low-dimensional representations, e.g., y_1, ..., y_S in the one-dimensional case, by

arg min_y sum_{i != j} (y_i - y_j)^2 w_{ij}  s.t.  sum_{i != j} (y_i - y_j)^2 w^p_{ij} = 1, (1)

where x_1, ..., x_S are considered as vertices of a graph, and w_{ij} (w^p_{ij}) defines the edge weight between x_i and x_j in the intrinsic graph (penalty graph). Equation (1) characterizes the intrinsic structure defined by W (i.e., the intrinsic graph) and also suppresses the similarity defined by W^p (i.e., the penalty graph). Rewriting Equation (1), we obtain

arg min_y y^T L y  s.t.  y^T L^p y = 1, (2)

where L = D - W is the Laplacian matrix relating to W, w_{ij} is the (i,j)-th element of W, and D denotes the diagonal matrix whose (i,i)-th element equals sum_j w_{ij}. The matrix L^p is constructed similarly from W^p. Yan et al. [11] showed that linear/nonlinear dimensionality reduction can be generalized by Equation (2). Linear dimensionality reduction assumes y_i = v^T x_i, and the goal is to estimate the projection direction v by

arg min_v v^T X L X^T v  s.t.  v^T X L^p X^T v = 1. (3)

On the other hand, nonlinear dimensionality reduction first maps x from the input space to a higher-dimensional Hilbert space by a nonlinear function phi and then seeks the linear projection direction v. Because the projection direction relates to the samples by v = sum_i alpha_i phi(x_i), nonlinear dimensionality reduction can exploit the kernel trick as follows:

arg min_alpha alpha^T K L K alpha  s.t.  alpha^T K L^p K alpha = 1, (4)

where K denotes the kernel matrix, and the (i,j)-th element of K is defined by k(x_i, x_j) = phi(x_i)^T phi(x_j).
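The linearized objective of Equation (3) reduces to a generalized eigenproblem; a minimal sketch, assuming a random sparse intrinsic graph and an identity penalty matrix (both hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 40))                 # columns are 40 samples in R^6
W = (rng.random((40, 40)) < 0.1).astype(float)   # hypothetical intrinsic graph
W = np.maximum(W, W.T)
np.fill_diagonal(W, 0)
L = np.diag(W.sum(1)) - W                        # intrinsic Laplacian L = D - W
Lp = np.eye(40)                                  # penalty matrix (assumed identity)

# Eq. (3): min_v v^T X L X^T v  s.t.  v^T X Lp X^T v = 1
A, B = X @ L @ X.T, X @ Lp @ X.T
vals, vecs = eigh(A, B)                          # generalized symmetric eigenproblem
v = vecs[:, 0]                                   # direction with the smallest objective
```

SciPy normalizes each generalized eigenvector so that the constraint v^T B v = 1 holds automatically.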


Although Equations (3)-(4) generalize linear and nonlinear dimensionality reduction algorithms, no existing research has successfully extended graph embedding to address multi-factor data. Here we need to emphasize that the tensorization technique proposed in [11] is feasible only for multi-structure data. Formally, their method parameterizes a sample X under tensor representation by

Z = X x_1 U_1 x_2 U_2 ... x_N U_N, (5)

where U_1, ..., U_N are the projection matrices learned from a training set, and Z is its compact representation. By contrast, for multi-factor data, a sample is parameterized by the multilinear model as follows:

X = Z x_1 u_1^T x_2 u_2^T ... x_{N-1} u_{N-1}^T. (6)

Note that, in Equation (6), the compact representations are now u_1, ..., u_{N-1}, corresponding to the factor-related variables, whereas the core tensor Z is learned from a training set to characterize the interdependency between the latent factors.

2.3 High-Order Singular Value Decomposition and Its Variants

High-order singular value decomposition (HOSVD) [16] decomposes a tensor X in R^{I_1 x ... x I_N} into a core tensor Z in R^{J_1 x ... x J_N} and a set of orthogonal matrices U_k, k = 1, ..., N, based on the well-known Tucker model X = Z x_1 U_1^T ... x_N U_N^T. In multi-factor data analysis, U_k is the low-dimensional representation corresponding to the k-th factor with the dimension reduced from I_k to J_k. The N-th mode of X generally indicates the feature dimension. Suppose the features have been decorrelated in the preprocessing step; then U_N is not required. The problem is formulated as follows:

min ||X - Z x_1 U_1^T ... x_{N-1} U_{N-1}^T||^2  s.t.  U_k U_k^T = I, k = 1, ..., N-1, (7)

where I denotes the identity matrix. Using Lemma 1, Equation (7) is rewritten as

min ||X_(N) - Z_(N) (U_{N-1} (x) ... (x) U_1)||_F^2  s.t.  U_k U_k^T = I, k = 1, ..., N-1. (8)

In [29], it has been shown that the optimal Z is derived by Z = X x_1 U_1 ... x_{N-1} U_{N-1} to achieve the minimum residual ||X - Z x_1 U_1^T ... x_{N-1} U_{N-1}^T||^2. Thus the goal of HOSVD is to estimate U_1, ..., U_{N-1} so as to satisfy Equation (7) (or (8)) with the least approximation error. From [16] and [29], the matrix U_k consists of the J_k leading eigenvectors of a covariance-like matrix C_k whose (i,j)-th entry is computed by

C_k(i,j) = sum_{i_1} ... sum_{i_{k-1}} sum_{i_{k+1}} ... sum_{i_N} X_{i_1,...,i_{k-1}, i, i_{k+1},...,i_N} X_{i_1,...,i_{k-1}, j, i_{k+1},...,i_N}; (9)

that is, C_k = X_(k) X_(k)^T. Equation (9) computes the average similarity defined by the dot product. However, Park et al. [25-28] pointed out that this averaging process only maintains the global statistics but fails to preserve the local manifold. Instead of using the dot product, they exploited different functions to measure the (i,j)-th entry of C_k, e.g., geodesic distance [25] or the dot product defined in a discriminant subspace [27]. These two variants are still based on Equation (9). To overcome the limitation due to the averaging process, Park et al. also proposed to measure C_k directly by other techniques [26, 28]. To sum up, the key step of HOSVD and its variants is to sequentially measure the covariance-like matrices.
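Since Equation (9) is exactly the (i,j)-th entry of C_k = X_(k) X_(k)^T, the HOSVD factors can be sketched as eigen-decompositions of these covariance-like matrices; the toy tensor sizes and ranks below are our own assumptions:

```python
import numpy as np

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def hosvd_factors(X, ranks):
    """U_k = the J_k leading eigenvectors of C_k = X_(k) X_(k)^T (Eq. 9)."""
    Us = []
    for k, r in enumerate(ranks):
        C = unfold(X, k) @ unfold(X, k).T
        vals, vecs = np.linalg.eigh(C)          # ascending eigenvalues
        Us.append(vecs[:, ::-1][:, :r].T)       # rows: top-r eigenvectors, so U_k U_k^T = I
    return Us

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 9, 10))             # last mode treated as features
Us = hosvd_factors(X, [3, 4])                   # reduce only the first two modes
Z = np.einsum('ai,bj,ijc->abc', Us[0], Us[1], X)    # core: Z = X x_1 U_1 x_2 U_2
Xh = np.einsum('ai,bj,abc->ijc', Us[0], Us[1], Z)   # reconstruction of Eq. (7)
```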

Fig. 3. The low-dimensional embedding with respect to head pose using (a) the face images of the 1st subject in the CMU-PIE dataset. (b)-(i): The PCA coefficients calculated from the images of the 1st to 8th rows of (a) (corresponding to the first eight illumination conditions), respectively. (j): The low-dimensional embedding of the pose factor calculated by HOSVD using images with all 21 illumination changes. In each figure, the x-axis and y-axis indicate the first two principal coefficients calculated by either PCA or HOSVD.

If one formulates HOSVD as a least-square-error problem, the optimal U_k derived from C_k can also be obtained by minimizing ||X - X x_k U_k^T U_k||^2. If we solve multiple unknowns at a time, a more accurate solution is estimated as follows:

(U_1, ..., U_{N-1}) = arg min ||X - X x_1 U_1^T U_1 ... x_{N-1} U_{N-1}^T U_{N-1}||^2
s.t. U_k U_k^T = I, k = 1, ..., N-1. (10)

In [29], the authors solved Equation (10) iteratively by the alternating least squares (ALS) algorithm (called ALS-HOSVD in the following sections). Again using Lemma 1, its objective equals ||X_(N) - X_(N) (U_{N-1}^T U_{N-1} (x) ... (x) U_1^T U_1)^T||_F^2. After some derivations (please refer to Appendix B), we obtain

(U_1, ..., U_{N-1}) = arg max tr((U_{N-1} (x) ... (x) U_1) X_(N)^T X_(N) (U_{N-1} (x) ... (x) U_1)^T)
s.t. U_k U_k^T = I, k = 1, ..., N-1. (11)

Finally, we use Fig. 2 to clarify the goal motivated by the above discussion. Because of the sequentially measured covariance-like matrices, most HOSVD-based methods only establish the links between the factor-specific data pairs along each mode (e.g., the image pairs between different subjects or between different poses in Fig. 2(a)). By contrast, ALS-HOSVD, as derived in Equation (11), embeds all the data pairs simultaneously to seek U_1, ..., U_{N-1}, though it still has the same limitation as HOSVD of only maintaining global statistics. In this paper, we try to overcome the above limitations by enabling the graph embedding technique in the multilinear space. We use Fig. 2(b) to illustrate our objective. As shown in Fig. 2(b), the proposed MGE offers a general

framework and characterizes any set of factor-related image pairs, using either statistical or manifold-like properties.

3. MULTILINEAR GRAPH EMBEDDING FOR IMAGE REPRESENTATION

3.1 Multilinear Graph Embedding (MGE)
Park et al. [25-28] pointed out that, if the local variations of one factor under different combinations of the other factors are inconsistent, the global shape obtained by HOSVD for this factor will be oversimplified. We use Fig. 3 to clarify this point. The face images collected from the first subject in the CMU-PIE database are used in this case (please see Fig. 3(a)). Note that, for a certain subject, when we fix the illumination condition, the major variation of CMU-PIE images comes from the pose factor. Figs. 3(b)-(i) illustrate the pose-related coefficients generated by principal component analysis (PCA) under the first eight illumination conditions (i.e., the first eight rows of Fig. 3(a)), respectively. Fig. 3(j) illustrates the coefficients generated by HOSVD under all the illumination conditions. Although the pose-related coefficients behave very differently in Figs. 3(b)-(i), the overall embedding in Fig. 3(j) tends to exhibit an average V-shape and thus may fail to interpret the factor-dependent variation shown in Figs. 3(b)-(i). On the other hand, it has been shown that manifold learning techniques formulated in graph embedding successfully preserve the local structure of a dataset and achieve better performance in many recognition tasks [7-11]. Although Equation (10) seems irrelevant to the graph embedding framework, we observe, from its matrix form, a strong


resemblance between Equations (2) and (11). In Equation (11), the correlation matrix X_(N)^T X_(N) defines the cosine similarity between every sample pair and plays a similar role to W in Equation (2). Therefore, we may consider Equation (11) as a special case of graph embedding for multi-factor data analysis.

Given a full training set with samples stemming from N-1 factors (i.e., S = I_1 x ... x I_{N-1} samples in total), we first organize the data as an N-th-order tensor X in R^{I_1 x ... x I_N}. Then we perform mode-N unfolding and obtain the new data arrangement X_(N) = [x_1, ..., x_S]. To parameterize the multi-factor data, we propose a novel method called Multilinear Graph Embedding (MGE) as follows:

(U_1, ..., U_{N-1}) = arg min tr(B L B^T)
s.t. B L^p B^T = I and B = U_{N-1} (x) ... (x) U_1, (12)

where I is an identity matrix of size prod_{k=1}^{N-1} J_k, and L and L^p denote the intrinsic and penalty Laplacian matrices, respectively. The (i,j)-th element in W (W^p) indicates the pairwise reward (penalty) between x_i and x_j in the multilinear space. In Sections 3.1.1-3.1.2, we will describe how to build an effective algorithm to solve Equation (12) and how to design W and W^p. After we obtain U_1, ..., U_{N-1}, we determine the core tensor Z by

Z = arg min_Z ||X - Z x_1 U_1^T ... x_{N-1} U_{N-1}^T||^2. (13)

Using Lemma 1, Equation (13) is equivalent to the minimization of ||X_(N) - Z_(N) (U_{N-1} (x) ... (x) U_1)||_F^2. Therefore, we obtain the optimal Z by performing the mode-N folding on X_(N) (U_{N-1} (x) ... (x) U_1)†.

3.1.1 Optimization of MGE
Equation (12) is a highly non-convex function with thousands of unknowns. To address this large-scale problem with multiple variables, we adopt a common strategy, the alternating minimization scheme, which optimizes one variable at a time with the other variables fixed. We then iteratively optimize these variables until convergence to a local optimum. The key step of our algorithm is to decompose the multilinear graphs L and L^p into G^T G and (G^p)^T G^p. As in Equation (1), because the objective function and constraint function are both nonnegative, we can factorize the two symmetric, positive-semidefinite matrices L and L^p by Cholesky decomposition [30]:

(U_1, ..., U_{N-1}) = arg min tr(B G^T G B^T)  s.t.  B (G^p)^T G^p B^T = I. (14)

Let H and H^p denote the mode-N folding of G and G^p, respectively. We obtain

tr(B G^T G B^T) = ||H x_1 U_1 ... x_{N-1} U_{N-1}||_F^2 = tr(U_k M_(k) M_(k)^T U_k^T), k = 1, ..., N-1, (15)

where M_(k) denotes the mode-k unfolding of H x_1 U_1 ... x_{k-1} U_{k-1} x_{k+1} U_{k+1} ... x_{N-1} U_{N-1}. In addition, because the constraint in Equation (14) is built on the same Kronecker structure, a similar mode-wise reduction holds:

(1/c) U_k M^p_(k) (M^p_(k))^T U_k^T = I, (16)

where c = prod_{j != k} J_j is a normalization scalar and M^p_(k) denotes the mode-k unfolding of H^p x_1 U_1 ... x_{k-1} U_{k-1} x_{k+1} U_{k+1} ... x_{N-1} U_{N-1} (please refer

Fig. 4. The relative error sum_k ||U_k^(t) - U_k^(t-1)||_F versus the iteration index t obtained by Algorithm 1. The result is reported as the average of ten simulations using random sizes of fourth-order tensors.

to Appendix C). While we optimize U_k and assume the others fixed, we simplify Equation (14) into

U_k = arg min tr(U_k M_(k) M_(k)^T U_k^T)  s.t.  (1/c) U_k M^p_(k) (M^p_(k))^T U_k^T = I. (17)

The scalar c in Equation (17) is trivial and can be ignored. However, because of the constraint B = U_{N-1} (x) ... (x) U_1, the terms U_1, ..., U_{N-1} are not scale-invariant, and thus Equation (12) has infinite solutions. To guarantee that the alternating scheme converges to a local optimum, we modify Equation (17) to derive a two-step optimization:

V_k = arg min tr(V M_(k) M_(k)^T V^T)  s.t.  V M^p_(k) (M^p_(k))^T V^T = I, (18)

and

u_i = v_i / ||v_i||_F, i = 1, ..., I_k, (19)

where u_i and v_i denote the i-th columns of U_k and V_k, respectively. Equation (18) can be solved efficiently by the generalized eigenvalue decomposition [30]. The optimal V_k consists of the eigenvectors corresponding to the J_k least eigenvalues. In addition, Equation (19) guarantees that u_1, ..., u_{I_k} have unit norm and thus are scale-invariant. Algorithm 1 summarizes the optimization steps. To demonstrate the feasibility of our algorithm, we conduct MGE on randomly generated matrices W and W^p and repeat the experiments ten times. The results in Fig. 4 demonstrate that our algorithm quickly converges to local optima within five iterations.

When given a test sample x, we propose to approximate its low-dimensional representations u_1, ..., u_{N-1} by

(u_1, ..., u_{N-1}) = arg min ||x - Z x_1 u_1^T ... x_{N-1} u_{N-1}^T||^2. (20)

Note that most literature first solves the least square problem min_z ||x - Z_(N) z||^2 to obtain the mixing representation z in R^{prod_k J_k}, and then estimates u_1, ..., u_{N-1} by the rank-1 approximation [29] of z. The task-specific u_k is finally used for the recognition task with respect to the k-th factor. However, the recognition usually has limited performance because of the inaccurate rank-1 approximation. To obtain more informative features [23-24], in our experiments we use the leading features of z for dimensionality reduction and subsequent applications.
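The two-step update of Equations (18)-(19) can be sketched in NumPy/SciPy for a third-order toy tensor (two factor modes plus one sample mode); the graph sizes, the identity penalty graph L^p = I, and the random initialization are all our own assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def unfold(X, k):
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def fold(M, k, shape):
    full = [shape[k]] + [s for i, s in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape(full), 0, k)

def mode_product(X, U, k):
    shape = list(X.shape)
    shape[k] = U.shape[0]
    return fold(U @ unfold(X, k), k, shape)

def mge_update(H, Hp, Us, k):
    """Refresh U_k with the other factors fixed: the generalized eigenproblem
    of Eq. (18), then the column normalization of Eq. (19)."""
    M, Mp = H, Hp
    for j, U in enumerate(Us):
        if j != k:                                  # partial projection, skip mode k
            M, Mp = mode_product(M, U, j), mode_product(Mp, U, j)
    A = unfold(M, k) @ unfold(M, k).T               # intrinsic scatter M_(k) M_(k)^T
    B = unfold(Mp, k) @ unfold(Mp, k).T             # penalty scatter
    _, vecs = eigh(A, B)                            # ascending eigenvalues
    V = vecs[:, :Us[k].shape[0]].T                  # J_k least eigenvectors
    return V / np.linalg.norm(V, axis=0, keepdims=True)   # unit-norm columns

# toy setting: 4 x 5 = 20 samples, hypothetical intrinsic/penalty graphs
rng = np.random.default_rng(3)
S = 20
Wg = rng.random((S, S)); Wg = (Wg + Wg.T) / 2; np.fill_diagonal(Wg, 0)
L = np.diag(Wg.sum(1)) - Wg + 1e-6 * np.eye(S)      # jittered so Cholesky succeeds
G = np.linalg.cholesky(L).T                         # L = G^T G
H = fold(G, 2, (4, 5, S))                           # mode-3 folding of G
Hp = fold(np.eye(S), 2, (4, 5, S))                  # penalty graph L^p = I (assumed)
Us = [rng.standard_normal((2, 4)), rng.standard_normal((2, 5))]
Us[0] = mge_update(H, Hp, Us, 0)
```

Looping the update over all factor modes until the change in each U_k is small reproduces steps 3-4 of Algorithm 1.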


Algorithm 1 Multilinear Graph Embedding
Input: a data tensor X in R^{I_1 x ... x I_N}, two positive semidefinite matrices L and L^p, and the factor-specific low ranks J_1, ..., J_{N-1}.
1) Factorize L and L^p by Cholesky decomposition and obtain H and H^p.
2) Initialize U_1, ..., U_{N-1} by solving
   U_k = arg min tr(U_k H_(k) H_(k)^T U_k^T)  s.t.  U_k H^p_(k) (H^p_(k))^T U_k^T = I,
   and then normalize the columns of each U_k into unit norm.
3) For k = 1 to N-1: fix the other factors and solve U_k by Equations (18)-(19). End
4) If all U_k converge, stop; else, go back to step 3.
5) Determine the core tensor Z by Equation (13).
Output: the low-dimensional representations U_1, ..., U_{N-1} and the core tensor Z.

3.1.2 Design of Multilinear Graph
The design of the multilinear graph is the most important step in the proposed MGE. Although our goal is to characterize the manifold structure of multi-factor data, neither unsupervised (e.g., locally linear embedding (LLE) [7] and locality preserving projection (LPP) [9]) nor supervised (e.g., linear discriminant analysis (LDA) [3] and marginal Fisher analysis (MFA) [11]) graphs are suitable for our problem. The reason is that these graphs tend to build edges between vertices with the same specific factor and thus fail to capture the intra-factor variation corresponding to that specific factor. We conduct a simulation on a subset of the CMU-PIE database (containing 15


Fig. 5. Visualization of the covariance-like matrices obtained from (a) the LPP-graph; (b) the MFA-graph; and (c) the MGE-graph. The left, middle, and right columns show C_subj, C_pose, and C_illu, respectively. The change in color from blue to red indicates the normalized magnitude from zero to one.


Fig. 6. The low-dimensional submanifolds obtained from (a) HOSVD; (b) unsupervised MGE; and (c) supervised MGE. The left, middle, and right columns correspond to the submanifolds of subject, pose, and illumination, respectively.

subjects x 11 poses x 10 illuminations) to clarify this point. Figs. 5(a)-(b) show the visualized covariance-like matrices C_subj, C_pose, and C_illu of the LPP- and MFA-graphs, where C_f = X_(f) W X_(f)^T for f in {subj, pose, illu}. Note that unsupervised methods usually find neighbors with the same dominant factor, whereas supervised methods always find neighbors with the same class label. Both the matrices of the LPP-graph and of the MFA-graph tend to be diagonal, and their off-diagonal elements, which correspond to the intra-factor relations, are nearly zero. (For example, under the subject factor, there still exist intra-factor variations from different poses and illuminations.) Therefore, both graphs fail to characterize the intra-factor variations. In order to preserve the multi-factor manifold structure, we use a factor-dependent strategy to design our graph. Given a sample x_i with its corresponding factor labels l_1^i, ..., l_{N-1}^i, we define its neighboring set Omega_k(x_i) in the k-th factor by

Omega_k(x_i) = { x_j | l_k^j != l_k^i and l_m^j = l_m^i for all m != k }, (21)

where l_1^j, ..., l_{N-1}^j are the factor labels of x_j. Then we define the edge weight w_ij by

w_ij = exp(-||x_i - x_j||^2 / sigma^2) (22)

if x_j in knn(Omega_k(x_i)) or x_i in knn(Omega_k(x_j)); otherwise, we set w_ij = 0. In Equation (22), knn(Omega_k(x_i)) denotes the nearest neighbors of x_i within the neighboring set Omega_k(x_i), and sigma denotes the kernel bandwidth. In our experiments, sigma is set to the average of all the pairwise distances. Using Equation (22), we obtain W and define the penalty term by the diagonal degree matrix D to normalize the impacts from different samples [9]. The covariance-like matrices obtained from our multilinear graph are shown in Fig. 5(c), where neither C_subj nor C_pose is nearly diagonal. The results verify that our multilinear graph successfully characterizes the intra-factor variation.
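As a concrete sketch of the factor-dependent graph in Equations (21)-(22), the snippet below builds the weights for one factor on a toy subject x pose grid; the sizes, the random features, and the use of the mean squared pairwise distance as the bandwidth are all our own assumptions:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
subjects, poses = 5, 7
labels = np.array(list(product(range(subjects), range(poses))))   # (l_1, l_2) per sample
X = rng.standard_normal((len(labels), 30))                        # toy features

k_factor = 1                    # build the graph for the pose factor
n_neighbors = 3
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
sigma2 = d2[np.triu_indices_from(d2, 1)].mean()   # bandwidth: mean pairwise (squared) distance
W = np.zeros_like(d2)
for i in range(len(X)):
    # Omega_k(x_i), Eq. (21): differ in the k-th factor, agree on all the others
    mask = labels[:, k_factor] != labels[i, k_factor]
    for m in range(labels.shape[1]):
        if m != k_factor:
            mask &= labels[:, m] == labels[i, m]
    cand = np.where(mask)[0]
    near = cand[np.argsort(d2[i, cand])[:n_neighbors]]   # knn within Omega, Eq. (22)
    W[i, near] = W[near, i] = np.exp(-d2[i, near] / sigma2)
```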

Note that the set of nearest neighbors can be defined by either an unsupervised or a supervised strategy. If the label relation across the factor is unavailable, then the nearest neighbors are determined in terms of feature similarity; otherwise, we can determine the neighbors from the most relevant labels. Fig. 6 shows an example of the low-dimensional representations obtained from ALS-HOSVD, unsupervised MGE, and supervised MGE. The data contain 6930 face images from 30 subjects with 11 poses and 21 illuminations. MGE obtains a discriminative submanifold, where the within-subject variation changes much more smoothly than with ALS-HOSVD, and is expected to considerably improve the recognition performance. In addition, the supervised MGE obtains a more discriminative submanifold than the unsupervised MGE in terms of the illumination factor, because the feature similarity used in the unsupervised strategy fails to approximate the changes of lighting angle across different illuminations.

3.2 Multilinear Kernel Graph Embedding (MKGE)
We extend the proposed MGE to kernel space to better characterize the nonlinearity in some complex factors. To formulate MKGE, we follow the same derivation from Equations (12)-(19) with two modifications. In the design of the multilinear kernel graph, because the feature is projected to a higher-dimensional Hilbert space, w_ij is calculated as follows:

w_ij = exp(-||phi(x_i) - phi(x_j)||^2 / sigma^2) = exp(-(k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)) / sigma^2). (23)

In addition, when given a test sample x, we now approximate the mixing low-dimensional representation by

alpha = arg min_alpha (alpha^T K alpha - 2 alpha^T k_x), (24)

where K is the kernel matrix defined in Sec. 2.2 and k_x denotes [k(x, x_1), ..., k(x, x_S)]^T. Finally, we obtain the optimal alpha = K† k_x.
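The closed form alpha = K† k_x at the end of Equation (24) amounts to a pseudoinverse solve; a sketch with an assumed RBF kernel and toy data:

```python
import numpy as np

def rbf(a, b, s2=1.0):
    """An assumed choice for the pre-defined kernel k(.,.)."""
    return np.exp(-((a - b) ** 2).sum() / s2)

rng = np.random.default_rng(6)
Xtr = rng.standard_normal((20, 5))                # training samples
x = rng.standard_normal(5)                        # test sample
K = np.array([[rbf(a, b) for b in Xtr] for a in Xtr])
kx = np.array([rbf(x, a) for a in Xtr])
alpha = np.linalg.pinv(K) @ kx                    # Eq. (24): alpha = K† k_x
```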

3.3 The Issue of Time-Complexity
Similar to ALS-HOSVD [29], Algorithm 1 is also built on the alternating least squares (ALS) method, iteratively solving U_k via generalized eigen-decomposition until finding local optima. The convergence speeds of MGE and ALS-HOSVD are comparable, and empirical evidence shows that both converge within five iterations. The difference is that MGE needs the additional normalization steps in Equation (19) and first needs to conduct a Cholesky decomposition to obtain H (and H^p), whereas ALS-HOSVD directly utilizes X as H. Nevertheless, the time complexity of Equation (19) is insignificant to the overall run-time. When the data size is less than 5000, the Cholesky decomposition can usually be done in less than one second. Therefore, the MGE algorithm is computationally efficient compared to ALS-HOSVD in most cases. On the other hand, in comparison with HOSVD [16] and its variants [25-28], MGE requires iterative optimization while these methods do not. Because MGE usually converges within five iterations (see Fig. 4), it takes about five times the complexity of [16] and [25-28].

4. MULTILINEAR GRAPH EMBEDDING FOR IMAGE REGULARIZATION

In Section 3, we proposed using MGE to disentangle the latent factors for a compact representation of multi-factor data. In this section, we will show that MGE also encodes common image priors and serves as a valuable regularization in the scenario of image completion.

4.1 Relation between Image Priors and MGE
Recall that in Section 3, MGE characterizes multiple latent factors of tensor objects in a generalized way. If we consider a 2-D image as a second-order tensor I ≈ J x_1 U_1^T x_2 U_2^T = U_1^T J U_2, then its two latent factors U_1 and U_2 encode the variations of the row and column spaces, respectively. Therefore, MGE is able to characterize the nature of images, provided that the semantic information between rows and columns is known a priori. One of the most widely used image priors is spatial smoothness, which assumes that image pixels share local similarity. Using the spatial smoothness prior, we can reasonably infer that adjacent rows/columns are highly correlated and that their low-dimensional representations embed pixel-wise interdependency. Following the factor-dependent strategy in Equations (21)-(22), we define the edge weight w_ij between pixel i and pixel j by


w_pq = exp(-(|r_p - r_q| + |c_p - c_q|)), if |r_p - r_q| <= Δ_r and |c_p - c_q| <= Δ_c; w_pq = 0 otherwise. (25)

Then we derive the image prior that encodes spatial smoothness under the MGE framework as follows:

R(I) = tr(vec(I)^T L vec(I)), (26)

where L is the graph Laplacian induced by the edge weights in Equation (25). Note that Equation (26) is defined by the supervised strategy in Equation (25), because we determine the weights using only the row and column indices, without knowing the exact pixel values. This strategy makes Equation (26) of practical use for image regularization when image pixels are noisy or even incomplete. In Equation (25), we set Δ_r and Δ_c to one so as to approximate the spatial smoothness prior within the eight neighboring pixels.

4.2. Image (Tensor) Completion with MGE Regularization

In the scenario of image completion, the given image is an incomplete one in which only the elements in the set Ω are observable. The goal is to estimate the missing elements in the complement set Ω̄ so that a complete image of visually plausible quality is obtained. Note that a gray-level/color image exhibits a matrix/tensor structure. Therefore, we can recast this problem as a general tensor factorization with missing entries:

{J*, U_1*, U_2*} = arg min_{J, U_1, U_2} ||Ω ∘ (I - J ×_1 U_1 ×_2 U_2)||_F^2, (27)

and

Î = J* ×_1 U_1* ×_2 U_2*, (28)

where I ∈ R^{h×w×c}, J ∈ R^{r_1×r_2×c}, U_1 ∈ R^{h×r_1}, and U_2 ∈ R^{w×r_2}; h and w are the image height and width, r_1 and r_2 are the reduced dimensions of the row and column spaces, and c is either 1 or 3 for gray-level or color images, respectively.

In [31], the authors exploited an EM-like algorithm to solve Equation (27). In the E-step, they update the estimate I' from the given entries and the current factors by

I' = Ω ∘ I + Ω̄ ∘ (J ×_1 U_1 ×_2 U_2), (29)

while in the M-step, they solve the following optimization:

{J, U_1, U_2} = arg min_{J, U_1, U_2} ||I' - J ×_1 U_1 ×_2 U_2||_F^2. (30)

Equation (30) is a special case of HOSVD, as introduced in Sec. 2.3. The optima are derived as follows:

{U_1, U_2} = arg max ||I' ×_1 U_1^T ×_2 U_2^T||_F^2, s.t. U_k^T U_k = I, k = 1, 2, (31)

and

J = I' ×_1 U_1^T ×_2 U_2^T. (32)

The maximization in Equation (31) is then solved iteratively until converging to a local optimum:

U_k = arg max_{U_k} tr(U_k^T M_(k) M_(k)^T U_k), (33)

where M_(k) is the mode-k unfolding of either I' ×_2 U_2^T (k = 1) or I' ×_1 U_1^T (k = 2).

Equations (27)-(33) exploit the low-rank prior to recover missing entries but consider no priors of natural images. To improve the classic algorithm [31] for image completion, we include Equation (26) as a regularization term in Equation (27), and modify the maximization in Equation (33) as follows:

U_k = arg max_{U_k} tr(U_k^T (M_(k) M_(k)^T - λ L_k) U_k), (34)

where L_k is obtained as derived in Sec. 3.1.1, and λ is a weighting parameter for MGE. A larger λ usually results in blurring artifacts under low missing rates (e.g., <50%), whereas a smaller λ makes a trivial contribution to the performance under high missing rates (e.g., 90%). Therefore, we increase λ to enforce the image prior as the missing rate increases, and find that a good recovery is usually achieved for λ in the range [10, 1000]. Algorithm 2 summarizes our MGE-based image (or tensor) completion method. Note that the algorithm is not limited to images but is applicable to any data under tensor representation. With the benefits of the MGE framework, the proposed regularization significantly boosts the performance over the classic method [31], provided that the Laplacian matrix is appropriately defined.

Algorithm 2 Image (tensor) completion with MGE-based regularization
Input: an image tensor I, positive semi-definite matrices L_1 and L_2, and the specified low ranks r_1, r_2.
1) Initialize U_1 (and likewise U_2) by solving arg min_{U_k} tr(U_k^T L_k U_k), and then normalize each column of U_k to unit norm.
2) Initialize J as I ×_1 U_1^T ×_2 U_2^T.
3) Conduct the Cholesky decomposition to obtain H.
4) Do
   Update I' by Equation (29).
   Update U_1, U_2 by Equation (34).
   Update J as I' ×_1 U_1^T ×_2 U_2^T.
   While I' has not converged.
Output: the completed image tensor Î.
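A compact sketch of the loop described in this section for a gray-level image (c = 1), treating the image as a matrix so that Equations (29), (32), and (34) become plain matrix operations. The initialization and fixed iteration count are simplified stand-ins, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def top_eigvecs(S, r):
    """Leading r eigenvectors of a symmetric matrix S."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return V[:, ::-1][:, :r]

def complete(I, mask, r1, r2, L1, L2, lam=100.0, iters=50):
    """EM-like completion of a gray-level image: the E-step refills
    missing pixels from the current low-rank estimate (Eq. 29); the
    M-step updates the factors with the Laplacian penalty (Eq. 34)."""
    m, n = I.shape
    U1 = np.linalg.qr(rng.standard_normal((m, r1)))[0]
    U2 = np.linalg.qr(rng.standard_normal((n, r2)))[0]
    X = np.where(mask, I, I[mask].mean())          # simple initial fill
    for _ in range(iters):
        J = U1.T @ X @ U2                          # core (Eq. 32)
        X = np.where(mask, I, U1 @ J @ U2.T)       # E-step (Eq. 29)
        M1 = X @ U2                                # M-step (Eq. 34)
        U1 = top_eigvecs(M1 @ M1.T - lam * L1, r1)
        M2 = X.T @ U1
        U2 = top_eigvecs(M2 @ M2.T - lam * L2, r2)
    return np.where(mask, I, U1 @ (U1.T @ X @ U2) @ U2.T)

# demo: a synthetic rank-2 "image" with 30% of its pixels missing
A = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 20))
mask = rng.random(A.shape) < 0.7
Z = np.zeros((20, 20))                             # no smoothness penalty here
rec = complete(A, mask, 2, 2, Z, Z, lam=0.0)
```

With λ = 0 and zero Laplacians this reduces to the plain EM-like completion of [31]; supplying the Laplacians from the MGE prior gives the regularized variant.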

5. A GENERALIZATION OF HOSVD AND ITS VARIANTS

We now show that the MGE formulation defined in Equation (12) unifies existing multilinear models, including HOSVD and its variants [25-28], as well as ALS-HOSVD. As introduced in Sec. 2.3, the key step of HOSVD and its variants is to determine a set of covariance-like matrices C_1, ..., C_K. These methods have no joint objective function for U_1, ..., U_K and thus conduct eigen-decomposition on C_1, ..., C_K individually. However, we can still reformulate these methods with a series of general forms based on Equation (12):

U_k* = arg max_{U_k} tr(U_k^T C_k U_k), s.t. U_k^T U_k = I, k = 1, ..., K,

where I is an identity matrix. Because HOSVD and its variants assume orthogonality on each U_k, the constraint U_k^T U_k = I plays the role of the penalty term; that is, the penalty Laplacian matrices of these methods are all identity matrices. On the other hand, ALS-HOSVD defined in Equation (11) also follows our MGE formulation, where the intrinsic and penalty matrices are the mode-wise data covariance and the identity matrix, respectively. From the above derivation, we show that MGE provides a general framework that unifies existing multilinear models. This success mainly comes from 1) the generality of the graph embedding framework, and 2) the proposed algorithm, which makes graph embedding tractable in the multilinear space.

6. EXPERIMENTAL RESULTS

6.1. Recognition on Multi-Factor Images

6.1.1. Databases and Settings

We conduct experiments on the CMU-PIE face database [32] and the CASIA gait database B [33]. For the CMU-PIE dataset, we use 65 subjects with 11 poses and 21 illuminations. All the face images are aligned by their eye coordinates, and then cropped and resized to 32x32. The CASIA gait database B consists of indoor walking sequences from 124 subjects with 11 camera views and 10 clothing styles. We represent each walking sequence by its Gait Energy Image (GEI) [34], resized to 128x88. All the images are vectorized, and each dataset is arranged as a fourth-order tensor: three modes for the latent factors and one for the feature dimension.

We compare MGE (denoted by -U and -S for the unsupervised and supervised strategies, respectively) with ALS-HOSVD and four classic dimensionality reduction methods: PCA [2], LDA [3], LPP [9], and MFA [11]. We also compare MKGE with their kernelized versions, including ALS-KHOSVD (obtained by replacing the inner-product matrix in Equation (11) with the kernel matrix), and KPCA, KLDA, KLPP, and KMFA (derived from Equation (4)). To avoid the singularity problem,

Fig. 7. The recognition rate (y-axis) versus class number (x-axis) in the CMU-PIE database, where the testing set consists of (a) untrained poses; (b) untrained illuminations; and (c) two untrained factors. The compared methods are PCA, LPP, LDA, MFA, ALS-HOSVD, MGE-U, and MGE-S, together with their kernelized counterparts KPCA, KLPP, KLDA, KMFA, ALS-KHOSVD, MKGE-U, and MKGE-S.

LPP MFA MGE‐U

0.9 0.6

0.6

0.55

0.5

0.85

0.8

0.75

PCA LDA ALS‐HOSVD MGE‐S

0.45 0.7 0.4

0.35

recognition accuracy

0.65

LPP MFA MGE‐U

20

30

40

50

60

0.55

0.5

0.45

0.4

0.3

0.25 10

20

number of class

30

40

50

10

60

20

30

40

50

60

number of class

number of class 0.75

0.95

0.85

LPP MFA MGE‐U

0.35

0.65 10

PCA LDA ALS‐HOSVD MGE‐S

0.65

recognition accuracy

0.7

recognition accuracy

0.7

0.95

PCA LDA ALS‐HOSVD MGE‐S

0.75

0.9 0.65

0.75

0.65

KPCA KLDA ALS‐KHOSVD MKGE‐S

0.55

0.45

KLPP KMFA MKGE‐U

recognition accuracy

recognition accuracy

recognition accuracy

0.85

0.8

0.75

0.7

KPCA KLDA ALS‐KHOSVD MKGE‐S

0.65 0.35 0.6

KLPP KMFA MKGE‐U

10

20

30

40

number of class

50

60

0.45

0.35

KPCA KLDA ALS‐KHOSVD MKGE‐S

0.25

0.15

0.55

0.25

0.55

10

20

30

40

number of class

50

60

10

20

KLPP KMFA MKGE‐U 30

40

50

60

number of class

(a) (b) (c) Fig. 8. The recognition rate (y-axis) versus class number (x-axis) in CASIA database, where the testing set consists of (a) untrained views; (b) untrained clothing styles; and (c) two untrained factors.

we conduct PCA and use the first principal coefficients retaining 95% of the energy for all the methods except the kernelized versions. For the kernelized versions, we use the Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / σ^2) and, as suggested in [11], select σ from 2^((n-10)/2.5)·σ_0, n = 5, 6, ..., 14, where σ_0 is the standard deviation of the training set. MGE, LPP, and MFA are manifold-based methods and need to determine the number of nearest neighbors in their graphs. In all the experiments, we set k = 15 for LPP; k = 5 and k = 30 for MFA with respect to the intrinsic and penalty graphs, respectively; and k = 2 for MGE with respect to all three latent factors. In order to investigate the robustness against varying factors, we adjust the number of available samples in our training set.
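The bandwidth grid described above can be written out directly; this is a sketch assuming the grid 2^((n-10)/2.5)·σ_0 with σ_0 taken as the standard deviation of the training features:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

X = np.random.default_rng(0).standard_normal((50, 10))  # toy training set
sigma0 = X.std()                                        # std of the training set
grid = [2 ** ((n - 10) / 2.5) * sigma0 for n in range(5, 15)]
kernels = [gaussian_kernel(X, s) for s in grid]
```

Each candidate σ yields one kernel matrix; the reported accuracy is then the best one over this grid.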

We randomly select N subjects, N = 10, 20, ..., 60, with 5 randomly selected poses/illuminations in the CMU-PIE dataset and 4 randomly selected views/clothing styles in the CASIA dataset, respectively. The remaining samples in each database are used for testing. Our focus here is not on classifier design but on evaluating the representations for multi-factor data. Therefore, we simply use the nearest-neighbor classifier and repeat each setting 30 times to obtain the averaged result. The performance of each setting is reported as the best one over all possible feature dimensions and kernel bandwidths.
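The evaluation protocol above (1-nearest-neighbor classification on the learned features) amounts to the following sketch, shown on synthetic two-class data rather than the actual CMU-PIE/CASIA features:

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y):
    """1-nearest-neighbor classification accuracy, used here to
    compare learned representations rather than classifiers."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    pred = train_y[d2.argmin(axis=1)]       # label of the closest sample
    return (pred == test_y).mean()

rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 0.1, (20, 5)),
                     rng.normal(3, 0.1, (20, 5))])
train_y = np.repeat([0, 1], 20)
test_X = np.vstack([rng.normal(0, 0.1, (10, 5)),
                    rng.normal(3, 0.1, (10, 5))])
test_y = np.repeat([0, 1], 10)
acc = nn_accuracy(train_X, train_y, test_X, test_y)
```

In the experiments, this accuracy is averaged over 30 random splits per setting.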

6.1.2. Results on the CMU-PIE Database

We first compare the overall performance in Tables 2-3. The terms "untrained pose", "untrained illumination", and "both untrained" stand for the subsets of testing data whose corresponding factors (pose and/or illumination) are not available during training, and "all" stands for all of the testing data. From Table 2, ALS-HOSVD achieves a high recognition rate on the samples with untrained illuminations, but obtains poor results (even worse than PCA and LPP) when another complex factor (such as head pose) is unobservable. MGE does not suffer from this drawback. It outperforms all methods, achieving about 1.1% improvement over MFA, and yields the largest improvement on the samples with two untrained factors. These results show that the MGE representation better interprets the cross-factor variation hidden in multi-factor data, even when the factor variation is not given in the training stage. We believe the benefits mainly come from our proposed multilinear graph, which exploits all of the latent factors to embed factor-dependent data pairs in a unified way. By contrast, classic methods conduct embedding simply based on feature similarity or class labels, regardless of the implicit factor relations.

An interesting observation in Table 3 is that the kernelization technique significantly improves multilinear-based methods but fails to improve classic methods. After kernelization, the performance of supervised methods (KLDA and KMFA) is almost the same as that of their original versions (LDA and MFA), but unsupervised methods, especially KLPP, obtain even poorer results. We notice that, with kernelization, classic methods become much more sensitive to algorithm parameters (e.g., the number of nearest neighbors and the kernel bandwidth). When the scale of training/testing data increases, these methods require more neighbors to define a robust graph and also prefer extreme kernel bandwidths to maximize the discriminability. However, carefully tuning these parameters is not easy in practice. By contrast, since multilinear-based methods model the data variation across latent factors, they better characterize the data uncertainty and are less sensitive to kernel settings (at least within our set of kernel bandwidths). As to the number of nearest neighbors, because most latent factors change smoothly (e.g., head pose changing from -90˚ to 90˚), two nearest neighbors are sufficient for MGE to capture the smooth submanifolds (see Fig. 6). Therefore, MKGE still outperforms all of the kernel-based methods by leveraging the nonlinearity in its multilinear model.

Finally, we show the recognition accuracy from small-scale to large-scale training sets in Fig. 7. For classic methods, supervised strategies outperform unsupervised ones in most cases. MFA is better than LDA when N <= 30, but its performance decreases otherwise. This is because the neighborhood size of MFA is sensitive to a large number of data, while large-scale data better satisfy the statistical hypothesis of LDA. We observe a similar performance decay in their kernelized versions, where KPCA/KLDA outperforms KLPP/KMFA as N increases. However, because MGE and MKGE are both insensitive to the neighborhood size, our methods outperform ALS-HOSVD and its kernelization in both small-scale and large-scale training sets.


6.1.3. Results on the CASIA Database

In comparison with the CMU-PIE database, the CASIA database has two major differences: 1) no appearance detail is included in a GEI, so changes due to the shape factor dominate the data population in CASIA; and 2) the changes between adjacent views in CASIA are much smaller than those in CMU-PIE. Because the CASIA database contains less variation, the performance gap between ALS-HOSVD and MGE is narrowed. In Table 4, ALS-HOSVD is comparable to LDA and MFA even on the samples with untrained views. However, MGE still obtains better results than all the other methods. The comparison between kernelized versions in Table 5 is similar to that in Table 3. Only LDA slightly benefits from the kernelization technique, because it does not need to determine the neighborhood size and is also less sensitive to the data scale. However, unlike in the CMU-PIE database, MKGE and ALS-KHOSVD now achieve similar performance; i.e., existing multilinear models can work well to parameterize multi-factor data by exploiting the kernel trick, provided that the variations across latent factors are not too complex.

Fig. 8 shows the detailed recognition rates as the number of classes increases. Recall that a GEI involves no appearance detail and should lie on a low-dimensional manifold because of its low degrees of freedom. Therefore, in Fig. 8, manifold-based methods (LPP, MFA, and MGE) generally outperform statistics-based methods (PCA, LDA, and ALS-HOSVD). These results validate the significant role of the neighborhood graph, which is generalized by GE and MGE, in expressing a compact and low-dimensional representation.

6.1.4. Discussion

From our experiments: 1) the superior recognition rates show that MGE better parameterizes multi-factor data than classic dimensionality reduction and existing multilinear models (ALS-HOSVD), even if the factor variation is not given; 2) when dealing with multi-factor data, the kernelization technique seems not to improve classic methods but can significantly improve multilinear-based methods; and 3) by fully exploiting the latent factors known a priori, MGE is insensitive to the data scale and algorithm parameters. Finally, we explain one unusual result in Tables 2-5, where our unsupervised and supervised MGE achieve almost the same performance on both databases. Recall that only a few varying factors are given in our training set. It is difficult to capture the complete submanifolds shown in Fig. 6, even though the labels of the given factors are fully utilized. Therefore, the difference in neighborhood structure between the two strategies has an insignificant effect on their performance.

6.2. Image Completion Using Tensor Factorization

We use the eight images in Fig. 9 as our benchmark dataset for image completion. Recall that we proposed an MGE-based regularization to capture the pixel-wise interdependency in Equation (26). However, the tremendous number of image pixels usually hinders the construction of the Laplacian matrix; e.g., a 100x100 image yields a 10^4 x 10^4 Laplacian. Therefore, for real-world data, we replace Equation (26) by its single-factor version as follows:

R(I) = tr(I^T L_r I) + tr(I L_c I^T), (35)

Fig. 9. Eight benchmark images. (a) Façade; (b) airplane; (c) baboon; (d) barbara; (e) house; (f) lena; (g) peppers; and (h) sailboat.

Fig. 10. The average (a) PSNR and (b) SSIM on the eight benchmark images, under the Random Missing and Slice Missing strategies with missing rates of 60-90%, comparing λ = 0 and λ = 100.

Fig. 11. A visual example of image completion under either (a) the random missing or (b) the slice missing strategy (missing rate 80%). The three images from left to right are the incomplete image, the completed image using λ = 0, and the completed image using λ = 100.

Fig. 12. An example from the CMU-PIE face database: (a) some of the ground-truth face images; (b) the incomplete data in (a), where the missing images are replaced by the mean face of the given images; and (c) the completion results using the existing tensor completion method [31] incorporated with our MGE-based regularization.

where L_r and L_c characterize the row-wise and column-wise interdependency, respectively, with the pairwise weights defined in a way similar to Equation (25):

w^r_ij = exp(-|i - j|) and w^c_ij = exp(-|i - j|). (36)

In our experiments, two strategies are used to sample m% of the elements (m = {60, 70, 80, 90}) as missing entries. The Random Missing strategy samples each element, i.e., a monochromatic pixel, with equal probability, while the Slice Missing strategy samples each pixel instead (i.e., its three elements in the red, green, and blue channels simultaneously). We compare Algorithm 2 (with λ fixed to 100) with the classic tensor completion method (λ = 0) to show the improvement obtained from supervised MGE. For each case, the reduced ranks (r_1, r_2) are selected by dividing the image height and width by s, s = 3, 4, ..., 10, to achieve the highest PSNR. Figures 10-11 demonstrate the quantitative and qualitative results.

As shown in Fig. 10, supervised MGE significantly improves the tensor completion method in terms of both PSNR and SSIM [35] by introducing the well-known image prior of local similarity into the latent factor space hidden in the tensor factorization model. The benefit is greatest in the Slice Missing cases, because low-rank priors fail to estimate the unknowns when the corresponding "slices" are entirely unavailable. Fig. 11 also shows that Algorithm 2 has the potential to achieve visually pleasant image completion, even though the tensor completion mechanism is not specifically designed for natural images. Note that our focus here is to validate the advantages of MGE-based regularization for images. In fact, it also serves as a powerful regularization for incomplete image databases stemming from multiple factors [36]. For example, if the face images used in Sec. 6.1.2 cannot constitute a full tensor, we can first complete our dataset by exploiting the supervised MGE to characterize the joint-factor variation (i.e., head pose and illumination). We give an example in Fig. 12. A more in-depth extension will be investigated in our future work.
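The row-wise and column-wise Laplacians built from weights of the form in Equation (36), and the single-factor prior of Equation (35), can be sketched as follows; truncating the weights to immediate neighbors (delta = 1) is an assumption carried over from the pixel-wise prior:

```python
import numpy as np

def chain_laplacian(n, delta=1):
    """Laplacian of the row (or column) graph with weights
    w_ij = exp(-|i - j|) for 0 < |i - j| <= delta."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    W = np.where((d > 0) & (d <= delta), np.exp(-d.astype(float)), 0.0)
    return np.diag(W.sum(1)) - W

def smoothness(I, Lr, Lc):
    """Single-factor prior: tr(I^T Lr I) + tr(I Lc I^T)."""
    return np.trace(I.T @ Lr @ I) + np.trace(I @ Lc @ I.T)

Lr, Lc = chain_laplacian(32), chain_laplacian(32)
smooth = np.tile(np.linspace(0, 1, 32), (32, 1))   # smooth ramp image
noisy = np.random.default_rng(0).random((32, 32))  # i.i.d. noise image
```

On these two toy images the prior behaves as intended: the smooth ramp receives a far smaller penalty than the noise image.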


Table 2. The average recognition accuracy (%) on the CMU-PIE face database.

                         PCA    LPP    LDA    MFA    ALS-HOSVD  MGE-U  MGE-S
Untrained pose           34.83  29.97  34.07  34.08  28.95      34.27  34.11
Untrained illumination   64.42  86.44  93.12  93.63  94.05      95.09  95.08
Both untrained           23.14  23.64  29.35  29.04  24.76      30.16  30.02
All                      40.87  48.95  54.80  54.86  52.28      55.98  55.88

Table 3. The average recognition accuracy (%) on the CMU-PIE face database (kernelized methods).

                         KPCA   KLPP   KLDA   KMFA   ALS-KHOSVD  MKGE-U  MKGE-S
Untrained pose           17.11  19.71  34.58  33.74  37.35       38.83   38.87
Untrained illumination   75.35  47.20  89.90  91.48  95.46       95.89   95.88
Both untrained           13.52  13.62  27.71  27.52  29.14       30.13   30.13
All                      38.10  27.50  52.86  53.26  56.09       56.93   56.93

Table 4. The average recognition accuracy (%) on the CASIA gait database.

                          PCA    LPP    LDA    MFA    ALS-HOSVD  MGE-U  MGE-S
Untrained view            50.72  52.88  64.23  65.09  63.72      65.15  65.21
Untrained clothing style  76.76  80.18  85.22  86.14  86.51      86.41  86.38
Both untrained            38.91  40.54  51.64  52.11  50.79      52.11  52.17
All                       52.09  54.34  63.97  64.67  63.76      64.75  64.79

Table 5. The average recognition accuracy (%) on the CASIA gait database (kernelized methods).

                          KPCA   KLPP   KLDA   KMFA   ALS-KHOSVD  MKGE-U  MKGE-S
Untrained view            44.52  39.72  67.65  64.58  70.68       70.57   70.50
Untrained clothing style  77.17  70.67  87.47  86.46  89.60       89.32   89.31
Both untrained            34.49  30.32  55.81  52.87  58.50       58.16   58.12
All                       48.38  43.32  67.42  64.93  70.07       69.81   69.77

7. CONCLUSION

In this paper, we focus on data representation for images, which usually stem from multiple latent factors hidden in the way of data generation. We show that the limitations of existing multilinear-based methods can be significantly rectified by leveraging the manifold-learning techniques formulated in graph embedding. To this end, we propose a novel method, called multilinear graph embedding (MGE), and its kernel extension MKGE, to make the graph embedding technique feasible in multilinear space. From the viewpoint of dimensionality reduction, MGE parameterizes multi-factor data with a compact and discriminative representation. From the viewpoint of image regularization, we show that supervised MGE encodes informative image priors to improve image completion, provided that an image is represented as a high-order tensor. We summarize our contributions with both theoretical and empirical support: 1) because MGE extends graph embedding into the multilinear subspace and its kernelization is also derived, our method theoretically links linear, nonlinear, and multilinear dimensionality reduction; 2) MGE generalizes HOSVD and its variants as special cases; 3) our experiments on multi-factor data (face/gait images) recognition show that MGE outperforms classic methods in many aspects; and 4) provided that appropriate priors on latent factors are given, supervised MGE significantly improves the performance of image (or tensor) completion.

REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1991.
[2] M. Turk and A. Pentland, "Face Recognition Using Eigenfaces," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 586-591, 1991.
[3] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Trans. PAMI, vol. 19, no. 7, pp. 711-720, 1997.
[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[5] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 181-201, 2001.
[6] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher Discriminant Analysis with Kernels," Proc. IEEE Neural Networks for Signal Processing Workshop, pp. 41-48, 1999.

[7] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[8] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, 2001.
[9] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, "Face Recognition Using Laplacianfaces," IEEE Trans. PAMI, vol. 27, no. 3, pp. 328-340, 2005.
[10] H. T. Chen, H. W. Chang, and T. L. Liu, "Local Discriminant Embedding and Its Variants," Proc. CVPR, 2005.
[11] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Trans. PAMI, vol. 29, no. 1, pp. 40-51, 2007.
[12] J. Yang, D. Zhang, A. Frangi, and J. Yang, "Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition," IEEE Trans. PAMI, vol. 26, no. 1, pp. 131-137, 2004.
[13] J. Ye, R. Janardan, and Q. Li, "Two-Dimensional Linear Discriminant Analysis," Advances in Neural Information Processing Systems, pp. 1569-1576, 2004.
[14] X. Li, S. Lin, S. Yan, and D. Xu, "Discriminant Locally Linear Embedding with High-Order Tensor Data," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 2, pp. 342-352, 2008.
[15] D. Xu, S. Yan, S. Lin, T. S. Huang, and S. F. Chang, "Enhancing Bilinear Subspace Learning by Element Rearrangement," IEEE Trans. PAMI, vol. 31, no. 10, pp. 1913-1920, 2009.
[16] L. De Lathauwer, B. De Moor, and J. Vandewalle, "A Multilinear Singular Value Decomposition," SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1254-1278, 2000.
[17] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear Analysis of Image Ensembles: TensorFaces," Proc. European Conf. Computer Vision, pp. 447-460, 2002.
[18] D. Lin, Y. Xu, X. Tang, and S. Yan, "Tensor-Based Factor Decomposition for Relighting," Proc. IEEE Conf. Image Processing, pp. 386-389, 2005.
[19] Y. Li, Y. Du, and X. Lin, "Kernel-Based Multifactor Analysis for Image Synthesis and Recognition," Proc. IEEE Conf. Computer Vision, pp. 114-119, 2005.
[20] M. A. O. Vasilescu and D. Terzopoulos, "TensorTextures: Multilinear Image-Based Rendering," ACM Trans. Graphics, vol. 23, no. 3, pp. 336-342, 2004.
[21] H. S. Lee and D. Kim, "Tensor-Based AAM with Continuous Variation Estimation: Application to Variation-Robust Face Recognition," IEEE Trans. PAMI, vol. 31, no. 6, pp. 1102-1116, 2009.
[22] Y. Xu and A. K. Roy-Chowdhury, "A Physics-Based Analysis of Image Appearance Models," IEEE Trans. PAMI, vol. 33, no. 8, pp. 1681-1688, 2011.
[23] S. Rana, W. Liu, M. Lazarescu, and S. Venkatesh, "Recognizing Faces in Unseen Modes: A Tensor Based Approach," Proc. CVPR, pp. 1-8, 2008.
[24] S. W. Park and M. Savvides, "Individual Kernel Tensor-Subspaces for Robust Face Recognition: A Computationally Efficient Tensor Framework Without Requiring Mode Factorization," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, no. 5, pp. 1156-1166, 2007.
[25] S. W. Park and M. Savvides, "An Extension of Multifactor Analysis for Face Recognition Based on Submanifold Learning," Proc. CVPR, pp. 2645-2652, 2010.
[26] S. W. Park and M. Savvides, "The Multifactor Extension of Grassmann Manifolds for Face Recognition," Proc. IEEE Conf. Automatic Face & Gesture Recognition, pp. 464-469, 2011.
[27] S. W. Park and M. Savvides, "A Multifactor Extension of Linear Discriminant Analysis for Face Recognition Under Varying Pose and Illumination," EURASIP J. Advances in Signal Processing, 2010.
[28] S. W. Park and M. Savvides, "Multifactor Analysis Based on Factor-Dependent Geometry," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2817-2824, 2011.
[29] L. De Lathauwer, B. De Moor, and J. Vandewalle, "On the Best Rank-1 and Rank-(R1, R2, ..., RN) Approximation of Higher-Order Tensors," SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp. 1324-1342, 2000.
[30] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1991.

[31] X. Geng, K. Smith-Miles, Z. H. Zhou, and L. Wang, "Face Image Modeling by Multilinear Subspace Analysis with Missing Values," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 3, pp. 881-892, 2011.
[32] T. Sim, S. Baker, and M. Bsat, "The CMU Pose, Illumination, and Expression Database," IEEE Trans. PAMI, vol. 25, no. 12, pp. 1615-1618, 2003.
[33] S. Yu, D. Tan, and T. Tan, "A Framework for Evaluating the Effect of View Angle, Clothing and Carrying Condition on Gait Recognition," Proc. ICPR, pp. 441-444, 2006.
[34] J. Han and B. Bhanu, "Individual Recognition Using Gait Energy Image," IEEE Trans. PAMI, vol. 28, no. 2, pp. 316-322, 2006.
[35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[36] Y. L. Chen, C. T. Hsu, and H. Y. Mark Liao, "Simultaneous Tensor Decomposition and Completion Using Factor Priors," IEEE Trans. PAMI, preprint, 11 Oct. 2013, doi: 10.1109/TPAMI.2013.164.

Yi-Lei Chen received the B.S. degree in computer science in 2007 and the M.S. degree in computer science in 2009, both from National Tsing Hua University, Hsinchu, Taiwan. He is currently pursuing the Ph.D. degree at the Department of Computer Science, National Tsing Hua University, Hsinchu. His research interests include machine learning, computer vision, image processing, and multilinear representation.

Chiou-Ting Hsu (M'98-SM'13) received the Ph.D. degree in computer science and information engineering from National Taiwan University (NTU), Taipei, Taiwan, in 1997. She was a Post-Doctoral Researcher with the Communication and Multimedia Laboratory, NTU, from 1997 to 1998. From 1998 to 1999, she was with the Philips Innovation Center, Taipei, Philips Research, as a Senior Research Engineer. She joined the Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, as an Assistant Professor in 1999, and is currently a Professor. Her research interests include multimedia signal processing, video analysis, and content-based retrieval. She received the Citation Classic Award from Thomson ISI in 2001 for her paper "Hidden digital watermarks in images." She was an Associate Editor of Advances in Multimedia from 2006 to 2012, and is currently an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY.

APPENDIX A

Below we prove Lemma 1 by mathematical induction. When K = 2, X and Y are both matrices, denoted by X and Y, and Y = X ×_1 U_1^T ×_2 U_2^T can be rewritten as Y = U_1^T X U_2, i.e., Y_(1) = U_1^T X_(1) U_2. If we take the transpose on both sides, we also have Y_(2) = U_2^T X_(2) U_1. Hence, Lemma 1 holds for K = 2.

Next, assume Lemma 1 holds for K = k; that is, Y_(i) = U_i^T X_(i) (U_k ⊗ ... ⊗ U_{i+1} ⊗ U_{i-1} ⊗ ... ⊗ U_1) for each mode i. When K = k + 1, we have Y = X ×_1 U_1^T ... ×_{k+1} U_{k+1}^T. Let X_j denote the j-th sub-tensor of X along mode k + 1; then Y_j = X_j ×_1 U_1^T ... ×_k U_k^T. As shown in [29], given arbitrary matrices A, B, and C such that one can form the product ABC, ABC can be rewritten as vec(ABC) = (C^T ⊗ A)vec(B), where vec(·) indicates the vectorization process. Therefore, letting y_j denote vec(Y_j), the induction hypothesis and the property above give y_j = (U_k ⊗ ... ⊗ U_1)^T vec(X_j). Since the rows of Y_(k+1), before the mode-(k+1) projection, are exactly the vectors y_j^T, we finally obtain

Y_(k+1) = U_{k+1}^T X_(k+1) (U_k ⊗ ... ⊗ U_1).

Thus Lemma 1 also holds for K = k + 1.
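The unfolding identity proved above can be checked numerically for K = 3. This sketch uses a C-order unfolding convention, under which the Kronecker factors appear in ascending mode order; the paper's own convention may order the factors differently, but the identity has the same shape:

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding: mode k becomes the rows (C-order columns)."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_prod(T, M, k):
    """Mode-k product T x_k M: M multiplies the k-th mode of T."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
U1 = rng.standard_normal((2, 4))
U2 = rng.standard_normal((3, 5))
U3 = rng.standard_normal((3, 6))

Y = mode_prod(mode_prod(mode_prod(X, U1, 0), U2, 1), U3, 2)

# unfolding identity in this convention:
# Y_(1) = U1 X_(1) (U2 kron U3)^T
lhs = unfold(Y, 0)
rhs = U1 @ unfold(X, 0) @ np.kron(U2, U3).T
```

The two sides agree to machine precision, which is exactly what the inductive argument guarantees.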

APPENDIX B

Below we first review two important properties of the Kronecker product.

1) If A, B, C, and D are matrices of such sizes that one can form the products AC and BD, then (A ⊗ B)(C ⊗ D) = AC ⊗ BD.
2) (A ⊗ B)^T = A^T ⊗ B^T.

Define P = U_K ⊗ ... ⊗ U_1. Using property 2, P^T is rewritten as U_K^T ⊗ ... ⊗ U_1^T, and, according to property 1, P^T P is further rewritten as (U_K^T U_K) ⊗ ... ⊗ (U_1^T U_1). Therefore, when every U_i satisfies U_i^T U_i = I, we also have P^T P = I. With J = X ×_1 U_1^T ... ×_K U_K^T, it follows that

||X - J ×_1 U_1 ... ×_K U_K||_F^2 = tr(X_(1)^T X_(1)) - ||X ×_1 U_1^T ... ×_K U_K^T||_F^2.

Since tr(X_(1)^T X_(1)) is a constant, the minimization of ||X - J ×_1 U_1 ... ×_K U_K||_F^2 is equal to the maximization of ||X ×_1 U_1^T ... ×_K U_K^T||_F^2.

APPENDIX C

Given X = J ×_1 U_1^T ... ×_K U_K^T, where the columns of each U_i are mutually orthogonal, we aim to prove that X_(k) X_(k)^T is a diagonal matrix.

From the tensor basics, the (p, q) element of X_(k) X_(k)^T is derived by

(X_(k) X_(k)^T)_{p,q} = Σ_{i_1} ... Σ_{i_{k-1}} Σ_{i_{k+1}} ... Σ_{i_K} <v_{i_1,...,i_{k-1},p,i_{k+1},...,i_K}, v_{i_1,...,i_{k-1},q,i_{k+1},...,i_K}>,

where v_{i_1,...,i_{k-1},p,i_{k+1},...,i_K} denotes the vectorization of those elements with index (i_1, ..., i_{k-1}, p, i_{k+1}, ..., i_K) in X. Note that every such v is a column vector. Because the columns of each U_i are mutually orthogonal, the inner product above is nonzero only if p = q; otherwise, it is zero. Therefore, X_(k) X_(k)^T is a diagonal matrix, whose (p, p) element equals the product of the squared norms of the corresponding factor columns, and all of whose off-diagonal elements are zero.