A General Kernelization Framework for Learning Algorithms Based on Kernel PCA

Changshui Zhang, Feiping Nie∗, Shiming Xiang
The State Key Lab of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing, P.R. China, 100084.

Abstract

In this paper, a general kernelization framework for learning algorithms is proposed via a two-stage procedure, i.e., transforming the data by Kernel Principal Component Analysis (KPCA) and then directly performing the learning algorithm with the transformed data. Although a few learning algorithms have been kernelized by this procedure before, why and under what conditions the procedure is feasible has not been further studied. In this paper, we explicitly present this kernelization framework and give a rigorous justification revealing that, under some mild conditions, kernelization under this framework is equivalent to the traditional kernel method. We show that these mild conditions are usually satisfied by most learning algorithms. Therefore, most learning algorithms can be kernelized under this framework without having to be reformulated into inner product form, which is a common yet vital step in traditional kernel methods. Enlightened by this framework, we also propose a novel kernel method based on low-rank KPCA, which can remove the noise in the feature space, speed up the kernel algorithm and improve its numerical stability. Experiments are presented to verify the validity and effectiveness of the proposed methods.

Key words: Kernel method; Learning algorithm; Kernel PCA; Two-stage framework.

∗ Corresponding author. Tel.: +86-10-627-96-872; Fax: +86-10-627-86-911. Email address: [email protected] (Feiping Nie).

Preprint submitted to Elsevier

1 October 2009

1 Introduction

Kernel methods [1–4] have attracted great interest in the past decades, largely because they show better performance in many real-world applications in pattern recognition, computer vision, data mining, and so on. Many linear learning algorithms have been successfully kernelized [5–14], and most of them are implemented by making use of the kernel trick. In order to use the kernel trick, the output of the learning algorithm should be reformulated into inner product form, so that the nonlinear map from the original space to the high- or even infinite-dimensional feature space can be implicitly implemented by the kernel function. Kernel Principal Component Analysis (KPCA) [15], the kernel method for Principal Component Analysis (PCA) [16], is one of the earliest kernel methods using the kernel trick. Later, kernelization extensions of many linear algorithms were achieved along the same outline as KPCA for PCA. However, when the output of a learning algorithm is difficult to reformulate into inner product form, the kernel trick cannot be directly used to kernelize the learning algorithm.

In this paper, KPCA is viewed as a data transformation procedure. We propose a general kernelization framework for learning algorithms via this transformation procedure, i.e., transforming the data by KPCA and then directly performing the learning algorithm with the transformed data. Although a few learning algorithms were kernelized by this procedure before [17–19], why and under what conditions this procedure is feasible has not been further studied. In this paper, we explicitly present this kernelization framework and give a rigorous justification revealing that, under some mild conditions, kernelization under this framework is equivalent to the traditional kernel method. We will see that these mild conditions are usually satisfied by most learning algorithms, such as a large family of subspace learning related algorithms, distance metric learning, clustering algorithms, etc. Therefore, most learning algorithms can be kernelized under this framework.

Note that this framework introduces a KPCA step, so it needs additional computation to perform KPCA. However, KPCA is a widely used algorithm in many applications, and once the KPCA step has been performed, we can directly run many linear learning algorithms to obtain their respective kernel algorithms simultaneously. Therefore, by virtue of this framework, we do not need to develop the kernel algorithm for each of these linear algorithms separately, which is very convenient, especially when we need to test a large number of linear algorithms and their kernel versions at the same time.

This two-stage kernelization framework provides us with a new perspective on

the kernel method for a learning algorithm. Usually, one cannot discern the distribution and behavior of the data in the kernel space due to the implicit map. However, we can instead examine the distribution of the data after the KPCA transformation, since the behavior of the data in the kernel space is equivalent to the behavior of the data after the KPCA transformation. The framework also gives us a mechanism to implement the kernel method for a learning algorithm with more flexibility. Enlightened by this two-stage framework, we propose a new kernel method for learning algorithms based on the low-rank KPCA. In comparison with the full-rank KPCA based kernel method, the low-rank KPCA based kernel method has several advantages: for example, it can remove the noise in the feature space, speed up the kernel algorithm and improve the numerical stability of the kernel algorithm.

The rest of this paper is organized as follows. In Section 2, we revisit KPCA in detail. In Section 3, we give the definitions of the full-rank PCA and the full-rank KPCA, and then propose the general kernel method for learning algorithms based on the full-rank KPCA. In Section 4, we give some remarks on the general kernel method and propose a new kernel method for learning algorithms based on the low-rank KPCA. Some typical learning algorithms which satisfy the mild conditions are given in Section 5. In Section 6, we present experiments to verify the validity of the general kernel method and the effectiveness of the proposed low-rank kernel method. Finally, we conclude this paper in Section 7.

2 Kernel PCA Revisited

Kernel PCA [15] is a nonlinear extension of PCA obtained with the kernel trick. In order to use the kernel trick, the solution of PCA should first be reformulated into inner product form.

Given the training data {x_1, x_2, ..., x_n}, x_i ∈ R^d, we denote the mean of the training data by x̄ = (1/n) Σ_i x_i, the training data matrix by X = [x_1, x_2, ..., x_n] and the centralized training data matrix by X̄ = [x_1 − x̄, x_2 − x̄, ..., x_n − x̄]. Define a centralization matrix by L = I − (1/n)11^T, where I is an n × n identity matrix and 1 ∈ R^n is a column vector in which all the elements are equal to one. It can be easily verified that

X̄ = XL    (1)

Therefore, the covariance matrix of the training data can be written as

C = (1/n) XL(XL)^T    (2)

PCA extracts the principal components by calculating the eigenvectors of the covariance matrix C.

Lemma 2.1 [20] Given two matrices A ∈ R^{n×d} and B ∈ R^{d×n}, then AB and BA have the same nonzero eigenvalues. For each non-zero eigenvalue, if the corresponding eigenvector of AB is v, then the corresponding eigenvector of BA is u = Bv.

According to Lemma 2.1, the eigenvectors of C can be calculated from the eigenvectors of M = (XL)^T XL = LX^T XL. Denote the k-th largest eigenvalue of M by λ_k and the corresponding eigenvector by v_k; then the k-th largest eigenvalue of C is λ_k and the corresponding eigenvector is u_k = XLv_k. Therefore the k-th principal direction calculated by PCA is ũ_k = u_k / ||u_k|| = XLv_k / √(v_k^T M v_k). For any data x ∈ R^d, the k-th principal component is

y_k = ũ_k^T (x − x̄) = v_k^T L X^T (x − x̄) / √(v_k^T M v_k)    (3)

Instead of doing PCA in the input space, kernel PCA performs PCA in a mapped high-dimensional inner product space F. The map φ : R^d → F is nonlinear and is implicitly implemented via the kernel function

K(x, x′) = φ(x)^T φ(x′)    (4)

The kernel function K : R^d × R^d → R may be any positive kernel satisfying Mercer's condition [21,5]. For instance, a frequently used one is the radial basis function (RBF) kernel defined by

K(x, x′) = exp( −||x − x′||^2 / σ^2 )    (5)
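As a concrete illustration, equation (5) can be evaluated on finite samples in a few lines of NumPy. The helper below is our own sketch (its name and interface are not from the paper) and is reused in later examples.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """RBF kernel matrix with entries exp(-||x - x'||^2 / sigma^2).

    X1: (n1, d) array and X2: (n2, d) array, samples as rows.
    """
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    sq_dists = np.maximum(sq_dists, 0.0)   # clamp tiny negatives caused by round-off
    return np.exp(-sq_dists / sigma ** 2)
```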

The mapped space F is also called the feature space. For an algorithm that can be expressed in terms of inner products, the algorithm can also be performed in the feature space using the kernel trick. Fortunately, PCA is such an algorithm. From equation (3) we can see that the output of PCA can be calculated solely by inner products. Therefore, in KPCA, for any data x ∈ R^d, the k-th principal component is

y_k = v_k^T L φ(X)^T (φ(x) − φ̄) / √(v_k^T K v_k)    (6)

where φ̄ = (1/n) Σ_i φ(x_i), φ(X) = [φ(x_1), φ(x_2), ..., φ(x_n)], K = Lφ(X)^T φ(X)L and v_k is the k-th largest eigenvector of K.
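Equation (6) translates directly into matrix form: center the raw kernel matrix with L = I − (1/n)11^T, eigendecompose it, and project query points through their centralized kernel values. The NumPy sketch below uses our own helper names (not code from the paper) and is reused in later examples; applied to the training points themselves, kpca_transform returns the transformed training data.

```python
import numpy as np

def kpca_fit(K_raw, tol=1e-10):
    """K_raw: (n, n) raw training kernel matrix with entries k(x_i, x_j)."""
    n = K_raw.shape[0]
    L = np.eye(n) - np.ones((n, n)) / n            # centralization matrix L = I - (1/n)11^T
    Kc = L @ K_raw @ L                             # centralized kernel matrix K = L phi(X)^T phi(X) L
    lam, V = np.linalg.eigh(Kc)                    # eigenvalues in ascending order
    keep = lam > tol                               # drop (numerically) zero eigenvalues
    lam, V = lam[keep][::-1], V[:, keep][:, ::-1]  # reorder so that lam is descending
    return L, lam, V

def kpca_transform(K_query, K_raw, L, lam, V):
    """K_query: (m, n) kernel values k(x, x_i) between query points and training points."""
    n = K_raw.shape[0]
    # Each row below equals (phi(x) - phi_bar)^T phi(X) L, expressed with kernel values only.
    Kq_c = (K_query - np.ones((K_query.shape[0], n)) @ K_raw / n) @ L
    return Kq_c @ V / np.sqrt(lam)                 # column k is the k-th principal component, eq. (6)
```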

3 Kernel Method for Learning Algorithms Based on Full-Rank KPCA

In this section, we propose a general kernel method for learning algorithms based on full-rank KPCA. First we give the definitions of the full-rank PCA and the full-rank KPCA.

Suppose the training data for a learning algorithm are {x_1, x_2, ..., x_n}, x_i ∈ R^d. We denote the training data matrix by X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, the mean of the training data by x̄ = (1/n) Σ_i x_i, and the centralized inner product matrix by M = LX^T XL. Correspondingly, in the feature space, we denote the training data matrix by φ(X) = [φ(x_1), φ(x_2), ..., φ(x_n)], the mean of the training data by φ̄ = (1/n) Σ_i φ(x_i), and the centralized kernel matrix by K = Lφ(X)^T φ(X)L.

Definition 3.1 (full-rank PCA) For the training data matrix X, suppose the rank of the centralized inner product matrix M is r. If we extract the first r principal components of PCA, we say we have performed the full-rank PCA.

Suppose the eigen-decomposition of matrix M is

M = [α, β] [ Λ 0 ; 0 0 ] [α, β]^T    (7)

where Λ is the diagonal matrix whose diagonal elements are the non-zero eigenvalues of M, the columns of α are the corresponding unit eigenvectors of the non-zero eigenvalues, and the columns of β are the corresponding unit eigenvectors of the zero eigenvalues. We call M = αΛα^T the full-rank eigen-decomposition of matrix M.

It can be derived that the projection matrix of the full-rank PCA is

W = XLαΛ^{−1/2}    (8)

For any data point x ∈ R^d, the output data point of x after performing the full-rank PCA is

y = W^T (x − x̄) = Λ^{−1/2} α^T L X^T (x − x̄)    (9)

Note that y ∈ R^r, so the full-rank PCA can be seen as a data transformation procedure in which the data are transformed from d dimensions to r dimensions as follows:

full-rank PCA : R^d → R^r, x ↦ y    (10)
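Definition 3.1 and equations (8)–(10) can be coded directly: the projection matrix is obtained from the eigen-decomposition of M = LX^T XL rather than of the covariance matrix, following Lemma 2.1. A minimal NumPy sketch with our own helper name, written for data matrices with samples as columns as in the text:

```python
import numpy as np

def full_rank_pca(X, X_query, tol=1e-10):
    """X: (d, n) training matrix, samples as columns; X_query: (d, m) query matrix.

    Returns the r-dimensional outputs of the full-rank PCA, where r = rank(L X^T X L).
    """
    n = X.shape[1]
    L = np.eye(n) - np.ones((n, n)) / n
    M = L @ X.T @ X @ L                            # centralized inner product matrix
    lam, alpha = np.linalg.eigh(M)
    keep = lam > tol
    lam, alpha = lam[keep], alpha[:, keep]         # full-rank eigen-decomposition M = alpha Lam alpha^T
    W = X @ L @ alpha / np.sqrt(lam)               # projection matrix, equation (8)
    x_bar = X.mean(axis=1, keepdims=True)
    return W.T @ (X - x_bar), W.T @ (X_query - x_bar)   # equation (9)
```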

Definition 3.2 (full-rank KPCA) For the training data matrix X, suppose the rank of the centralized kernel matrix K is r. If we extract the first r principal components of KPCA, we say we have performed the full-rank KPCA.

Suppose the full-rank eigen-decomposition of matrix K is K = αΛα^T; then the projection matrix of the full-rank KPCA is

W_φ = φ(X)LαΛ^{−1/2}    (11)

For any data point x ∈ R^d, the output data point of x after performing the full-rank KPCA is

y = W_φ^T (φ(x) − φ̄) = Λ^{−1/2} α^T L φ(X)^T (φ(x) − φ̄)    (12)

Note that y ∈ R^r, so the full-rank KPCA can be seen as a data transformation procedure in which the data are transformed from d dimensions to r dimensions as follows:

full-rank KPCA : R^d → R^r, x ↦ y    (13)

Suppose the transformed data of the training data x_i (i ∈ {1, 2, ..., n}) by the full-rank KPCA are y_i, and for any data point x ∈ R^d (which could be a training or test point), the transformed data by the full-rank KPCA is y. Subsequently, we reveal that under some mild conditions, the kernel method of a learning algorithm can be implemented by directly performing the learning algorithm with the data transformed by the full-rank KPCA.

Theorem 3.1 The kernel method of a learning algorithm can be implemented by performing the learning algorithm with the data transformed by the full-rank KPCA, if the learning algorithm satisfies the following two conditions simultaneously:

1. the output result of the learning algorithm can be calculated solely in terms of x^T x_i (i ∈ {1, 2, ..., n}), where x_i is a training data point and x is a new test data point;

2. translating the input data by an arbitrary constant vector does not change the output result of the learning algorithm.

The proof is given in Appendix A. The conditions in Theorem 3.1 are usually satisfied by most learning algorithms. In fact, most current kernel methods for learning algorithms are implemented by deriving the learning algorithm in terms of inner products and then using the kernel trick to kernelize it. Theorem 3.1 gives us a new and general way to kernelize a learning algorithm, i.e., directly performing the learning algorithm with the data transformed by the full-rank KPCA.

The conditions in Theorem 3.1 still require that the output of the learning algorithm can be reformulated into inner product form. However, the following theorem shows that the two-stage kernel method is feasible even without this requirement.

Theorem 3.2 The kernel method of a learning algorithm can be implemented by performing the learning algorithm with the data transformed by the full-rank KPCA, if performing the learning algorithm with the original data is equivalent to performing the learning algorithm with the data transformed by the full-rank PCA.

Proof. If performing a learning algorithm with the original data is equivalent to performing it with the data transformed by the full-rank PCA, we can conclude that in the feature space, performing the learning algorithm with the original data is also equivalent to performing it with the data transformed by the full-rank PCA. Note that performing a learning algorithm in the feature space is exactly the kernel method of the learning algorithm, and performing the full-rank PCA in the feature space is exactly the full-rank KPCA. Therefore, performing the kernel method of the learning algorithm with the original data is equivalent to performing the learning algorithm with the data transformed by the full-rank KPCA. Thus the kernel method of the learning algorithm can be implemented by performing the algorithm with the data transformed by the full-rank KPCA. An illustration is given in Figure 1. □

According to Theorem 3.2, we can implement the kernel method for a learning algorithm without having to reformulate the learning algorithm into inner product form. The "inner product" condition can be avoided because the inner products have been encoded in the KPCA stage.


Fig. 1. LA denotes a learning algorithm, which is encapsulated as a black box with only its input (training data x_i (i ∈ {1, 2, ..., n}) and test data x) and its output. The figure illustrates that if performing a learning algorithm with the original data is equivalent to performing it with the data transformed by PCA (the first row in the figure), then performing the kernel LA is equivalent to performing a two-stage procedure, i.e., KPCA+LA (the third row in the figure; note that the right diagram in the third row is a rearrangement of the right diagram in the second row, so they are equivalent, and the former is exactly KPCA+LA).

We can see that the conditions in Theorems 3.1 and 3.2 are usually satisfied by most learning algorithms; some typical examples are given in Section 5.
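In code, the framework is simply the composition of the KPCA transformation of Section 2 with an unmodified learning algorithm. The sketch below is our own wrapper (it reuses the kpca_fit and kpca_transform helpers given after equation (6)) and assumes a scikit-learn-style learner exposing fit and predict; any algorithm satisfying the conditions of Theorem 3.1 or 3.2 can be plugged in.

```python
def kernelize(learner, kernel, X_train, y_train, X_test):
    """Two-stage kernelization: (1) full-rank KPCA transform, (2) run the unmodified linear algorithm."""
    K_train = kernel(X_train, X_train)     # (n, n) kernel matrix of the training data
    K_test = kernel(X_test, X_train)       # (m, n) kernel values between test and training data
    L, lam, V = kpca_fit(K_train)          # stage 1: full-rank KPCA
    Z_train = kpca_transform(K_train, K_train, L, lam, V)
    Z_test = kpca_transform(K_test, K_train, L, lam, V)
    learner.fit(Z_train, y_train)          # stage 2: the linear learning algorithm, unchanged
    return learner.predict(Z_test)
```

For instance, kernelize(some_linear_classifier, rbf_kernel, ...) turns the linear classifier into its kernel counterpart without touching the classifier's code; for algorithms with a closed-form solution, such as ridge regression, the predictions coincide exactly with those of the traditional kernel method (this is checked numerically in Section 5.1).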

4 Remarks

From the previous section we know that, under some mild conditions, the kernel method for a learning algorithm can be implemented by a two-stage procedure, i.e., performing full-rank KPCA and then directly performing the learning algorithm with the output data of the full-rank KPCA. This two-stage kernel method can be viewed as a unified and general framework, and it provides us with a new perspective on the kernel method for a learning algorithm. The framework also gives us a more flexible mechanism to implement the kernel method for a learning algorithm, by virtue of introducing the KPCA stage separately. Besides the full-rank KPCA, different KPCA implementations will produce different kernel methods for learning algorithms, for example the

robust implementation of KPCA or the sparse implementation of KPCA. Here we introduce another simple implementation of KPCA.

It is well known that PCA is a dimensionality reduction technique and is often used effectively to eliminate noise in data, especially high-dimensional data. On the other hand, for a kernel method, it is beneficial to remove noise in the feature space [22]. Based on these motivations, we propose a kernel method for a learning algorithm that performs the learning algorithm with the data transformed by the low-rank KPCA instead of the full-rank KPCA. First, we give the definition of the low-rank KPCA.

Definition 4.1 (low-rank KPCA) For the training data matrix X, suppose the rank of the centralized kernel matrix K is r. If we extract the first m principal components of KPCA, with m < r, we say we have performed the low-rank KPCA.

Suppose the low-rank eigen-decomposition of matrix K is K ≈ α̃Λ̃α̃^T, where Λ̃ is the diagonal matrix whose diagonal elements are the m largest non-zero eigenvalues of K (m < r) and the columns of α̃ are the corresponding m unit eigenvectors. Similarly to the full-rank KPCA, the projection matrix of the low-rank KPCA is W_φ = φ(X)Lα̃Λ̃^{−1/2}. For any data x ∈ R^d, the output data point of x after performing the low-rank KPCA is

y = W_φ^T (φ(x) − φ̄) = Λ̃^{−1/2} α̃^T L φ(X)^T (φ(x) − φ̄)    (14)

Note that y ∈ R^m, so the low-rank KPCA can also be seen as a data transformation procedure in which the data are transformed from d dimensions to m dimensions as follows:

low-rank KPCA : R^d → R^m, x ↦ y    (15)

Therefore, the kernel method for a learning algorithm can also be implemented by performing the learning algorithm with the data transformed by the low-rank KPCA. In comparison with the full-rank KPCA based kernel method, the low-rank KPCA based kernel method has at least the following advantages:

1. The noise in the feature space is removed, which might improve the performance of the algorithm.

2. The dimension of the data transformed by the low-rank KPCA is lower, which can speed up the algorithm. In fact, the low-rank KPCA used here intrinsically implements a low-rank representation of the kernel matrix, which has been widely applied in kernel methods for large-scale problems [23,24].

3. Kernel methods implicitly map the data into a very high or even infinite dimensional space, which can lead to ill-posed problems for some algorithms. The traditional way to deal with this is Tikhonov regularization [25]. In the low-rank KPCA based kernel method, since the dimension is reduced by the low-rank KPCA, the ill-posed problem is naturally avoided.
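In code, the only change relative to the full-rank case is a truncation of the eigen-decomposition. A minimal sketch reusing the kpca_fit and kpca_transform helpers from Section 2 (the variable m follows Definition 4.1; the helper name is ours):

```python
def low_rank_truncate(lam, V, m):
    """Keep only the m largest eigenpairs of the centralized kernel matrix (m < r)."""
    return lam[:m], V[:, :m]               # lam and V are already sorted in descending order

# The transformed data then live in R^m instead of R^r, e.g.:
# lam_m, V_m = low_rank_truncate(lam, V, m)
# Z_train = kpca_transform(K_train, K_train, L, lam_m, V_m)
```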

5 Some Examples

In this section, we list some popular learning algorithms which satisfy the conditions in Theorems 3.1-3.2 and can therefore be kernelized with the two-stage kernel method proposed in this paper.

5.1 Linear SVM and Ridge Regression

Linear SVM [5,26] seeks a direction w ∈ R^d such that a well defined margin is maximized when the training data are projected onto this direction. The goal of linear SVM indicates that the condition in Theorem 3.2 is satisfied. We will see that linear SVM also satisfies the conditions in Theorem 3.1.

For the training data {x_1, x_2, ..., x_n}, x_i ∈ R^d, linear SVM solves the following optimization problem:

w^* = arg min J(w, ξ)    (16)

where

J(w, ξ) = (1/2) w^T w + C Σ_{i=1}^n ξ_i    (17)

C is a regularization parameter and ξ_i is the hinge loss, defined by ξ_i = [1 − y_i(w^T x_i + b)]_+, in which [z]_+ = max(z, 0) and y_i ∈ {−1, +1} is the class label of x_i. The dual problem of problem (16) can be written as

α^* = arg max_{Σ_{i=1}^n y_i α_i = 0, 0 ≤ α_i ≤ C} J(α)    (18)

where

J(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j    (19)

So the optimization problem can be reformulated into inner product form. For any data x ∈ R^d, the output of linear SVM is

f(x) = sgn( Σ_{i=1}^n α_i y_i x_i^T x )    (20)

where sgn(·) is the sign function. Therefore, the output result of linear SVM can be calculated solely in terms of inner products between training data points or between a training data point and a test data point. It can be verified that translating the training data and test data by the same constant does not change the result in equation (20). Therefore, linear SVM satisfies the conditions in Theorem 3.1.

Ridge regression [27] is a method from classical statistics. It implements a Tikhonov regularized form [25] of least-squares regression. Given the observed data pairs {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, x_i ∈ R^d, ridge regression solves the following optimization problem:

w^* = arg min J(w, ξ)    (21)

where

J(w, ξ) = (1/2) w^T w + C Σ_{i=1}^n ξ_i^2    (22)

C is a regularization parameter and ξ_i = y_i − (w^T x_i + b). The optimization problem (21) in ridge regression is very similar to the optimization problem (16) in SVM.

Denote X = [x_1, x_2, ..., x_n], x̄ = (1/n) Σ_i x_i, y = [y_1, y_2, ..., y_n]^T and ȳ = (1/n) Σ_i y_i. Setting the derivatives of J(w, ξ) in equation (21) with respect to w and b to zero, we have

b^* = ȳ − x̄^T w^*    (23)

and

w^* = (XLX^T + (1/(2C)) I_d)^{−1} XLy    (24)

where I_d denotes a d × d identity matrix. It can easily be verified that (XLX^T + (1/(2C)) I_d)^{−1} XL = XL(LX^T XL + (1/(2C)) I_n)^{−1}, so equation (24) can be rewritten as

w^* = XL(LX^T XL + (1/(2C)) I_n)^{−1} y    (25)

Substituting equation (25) into equation (23), we have

b^* = ȳ − x̄^T XL(LX^T XL + (1/(2C)) I_n)^{−1} y    (26)

Therefore, the regression value for any data x ∈ R^d is

y = x^T w^* + b^* = (x − x̄)^T XL(LX^T XL + (1/(2C)) I_n)^{−1} y + ȳ    (27)

From the above analysis we can see that the output result of ridge regression can be calculated solely in terms of inner products between training data points or between a training data point and a test data point. It can be verified that translating all the data by the same constant does not change the result in equation (27). Therefore, ridge regression satisfies the conditions in Theorem 3.1.
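The closed form (27) makes ridge regression a convenient numerical check of Theorem 3.1: evaluating (27) with kernel values in place of inner products (the traditional kernel method) and running ordinary ridge regression on the full-rank KPCA outputs must produce identical predictions. The sketch below is ours; it reuses the rbf_kernel, kpca_fit and kpca_transform helpers defined earlier and works with samples as rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Xt = rng.normal(size=(50, 5)), rng.normal(size=(20, 5))
y = rng.normal(size=50)
C, n = 10.0, X.shape[0]
L = np.eye(n) - np.ones((n, n)) / n

# Traditional kernel method: equation (27) with inner products replaced by kernel values.
K, Kt = rbf_kernel(X, X), rbf_kernel(Xt, X)
dual = np.linalg.solve(L @ K @ L + np.eye(n) / (2 * C), y)
pred_tk = (Kt - np.ones((Xt.shape[0], n)) @ K / n) @ L @ dual + y.mean()

# Two-stage method: full-rank KPCA transform, then plain linear ridge regression.
Lk, lam, V = kpca_fit(K)
Z, Zt = kpca_transform(K, K, Lk, lam, V), kpca_transform(Kt, K, Lk, lam, V)
Zc, yc = Z - Z.mean(axis=0), y - y.mean()
w = np.linalg.solve(Zc.T @ Zc + np.eye(Zc.shape[1]) / (2 * C), Zc.T @ yc)
pred_fgk = (Zt - Z.mean(axis=0)) @ w + y.mean()

print(np.allclose(pred_tk, pred_fgk))      # expected: True (up to numerical error)
```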

5.2 LDA and LDA Variants

LDA [28,29] seeks m directions W ∈ R^{d×m} such that the between-class scatter is maximized and the within-class scatter is minimized simultaneously when the training data {x_1, x_2, ..., x_n}, x_i ∈ R^d are projected onto this set of directions. Many variants of LDA have been proposed over the decades, such as Null space LDA [30], Direct LDA [31], Maximum Margin Criterion (MMC) [32,33], etc., as well as many of its 2D or tensor extensions [34–36]. Recently, kernelizations of LDA and of most of its variants have been proposed [6,18,37–40]. The goal of LDA and its variants indicates that the condition in Theorem 3.2 is satisfied. In fact, in order to speed up the algorithm, a preprocessing procedure with PCA should be performed to remove the null space of the

total scatter matrix S_t. We will see that LDA and its variants also satisfy the conditions in Theorem 3.1.

In summary, after defining a between-class scatter matrix S_b and a within-class scatter matrix S_w, there are three kinds of criteria commonly used in LDA and its variants: the determinant ratio (or ratio trace) criterion¹

J_1(W) = |W^T S_b W| / |W^T S_w W|   or   J_1(W) = tr((W^T S_w W)^{−1}(W^T S_b W))    (28)

the trace ratio criterion

J_2(W) = tr(W^T S_b W) / tr(W^T S_w W)    (29)

and the trace difference criterion

J_3(W) = tr(W^T (S_b − S_w) W)    (30)
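To make the criteria concrete, the sketch below builds S_b and S_w from labelled data (the usual class-size-weighted between-class scatter and pooled within-class scatter are assumed, since the text does not spell them out) and solves the trace difference problem (30)/(33), whose solution is simply the m largest eigenvectors of S_b − S_w. Data matrices have samples as columns, as in the text; the helper names are ours.

```python
import numpy as np

def scatter_matrices(X, labels):
    """X: (d, n) data matrix, samples as columns; labels: length-n array of class ids."""
    d = X.shape[0]
    x_bar = X.mean(axis=1, keepdims=True)
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        Sb += Xc.shape[1] * (mc - x_bar) @ (mc - x_bar).T   # between-class scatter
        Sw += (Xc - mc) @ (Xc - mc).T                       # within-class scatter
    return Sb, Sw

def trace_difference_directions(X, labels, m):
    """Criterion (30)/(33): the m largest eigenvectors of Sb - Sw (note W^T W = I)."""
    Sb, Sw = scatter_matrices(X, labels)
    vals, vecs = np.linalg.eigh(Sb - Sw)    # ascending eigenvalues
    return vecs[:, ::-1][:, :m]
```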

The corresponding optimization problems are as follows:

W^* = arg max_{W ∈ R^{d×m}} J_1(W)    (31)

W^* = arg max_{W ∈ R^{d×m}, W^T W = I} J_2(W)    (32)

W^* = arg max_{W ∈ R^{d×m}, W^T W = I} J_3(W)    (33)

The solution W^* to the optimization problem (31) is given by the m largest eigenvectors of S_w^{−1} S_b; the solution W^* to the optimization problem (32) is given by the m largest eigenvectors of S_b − λS_w, where λ = J_2(W^*) is automatically determined by the optimization problem [41–43]; and the solution W^* to the optimization problem (33) is given by the m largest eigenvectors of S_b − S_w.

Denote X = [x_1, x_2, ..., x_n] and x̄ = (1/n) Σ_i x_i. Note that the solution W^* lies in the subspace spanned by the centralized training data {x_1 − x̄, x_2 − x̄, ..., x_n − x̄}, so we have

W = XLα    (34)

¹ When using this criterion, S_w needs to be nonsingular. However, in the kernel feature space, S_w is usually singular. The traditional solution to this problem is Tikhonov regularization, i.e., adding a multiple of the identity matrix λI to S_w.

where α ∈ R^{n×m}. From the graph view [44], we know S_b = XL_b X^T and S_w = XL_w X^T, where L_b and L_w are Laplacian matrices. Then, we have

W^T S_b W = α^T LX^T XL_b X^T XLα    (35)

W^T S_w W = α^T LX^T XL_w X^T XLα    (36)

W^T W = α^T LX^T XLα    (37)

Therefore, the above optimization problems with respect to α can be formulated in inner product form. For any data x ∈ R^d, the output of LDA and its variants is y = W^T x = α^T LX^T x. Therefore, the output of LDA and its variants can be calculated solely in terms of inner products between training data points or between a training data point and a test data point. Note that when the data are translated by an arbitrary constant vector, the matrices S_b and S_w are unchanged, and then the solution W^* to the above optimization problems is unchanged. Therefore, LDA and its variants satisfy the conditions in Theorem 3.1.

5.3 CCA and PLS

Canonical Correlation Analysis (CCA) [45] is a technique as old as LDA, while Partial Least Squares (PLS) [46] originated in econometrics and has attracted a great amount of attention in chemometrics [47]. Both CCA and PLS model the relation between two sets of variables x ∈ R^{d_x} and y ∈ R^{d_y}. CCA seeks w_x and w_y such that the correlation between the projections x = w_x^T x and y = w_y^T y is maximized, while PLS seeks w_x and w_y such that the covariance between these projections is maximized.

Given two datasets {x_1, x_2, ..., x_n}, x_i ∈ R^{d_x} and {y_1, y_2, ..., y_n}, y_i ∈ R^{d_y}, denote X = [x_1, x_2, ..., x_n] ∈ R^{d_x×n} and Y = [y_1, y_2, ..., y_n] ∈ R^{d_y×n}, and denote the centralized inner product matrices M_x = LX^T XL and M_y = LY^T YL. CCA solves the following optimization problem²:

{w_x^*, w_y^*} = arg max ρ(w_x, w_y)    (38)

² When the data dimensionality is larger than the number of data points, which is the typical case in the kernel feature space, CCA becomes ill-posed. The traditional solution to this problem is Tikhonov regularization, i.e., adding a multiple of the identity matrix λI to XLX^T and YLY^T.

where

ρ(w_x, w_y) = w_x^T XLY^T w_y / √(w_x^T XLX^T w_x · w_y^T YLY^T w_y)    (39)

The solution w_x^* to the optimization problem (38) is the largest eigenvector of (XLX^T)^{−1} XLY^T (YLY^T)^{−1} YLX^T, and w_y^* = (YLY^T)^{−1} YLX^T w_x^*. Note that the solution w_x^* lies in the subspace spanned by {x_1 − x̄, x_2 − x̄, ..., x_n − x̄} and the solution w_y^* lies in the subspace spanned by {y_1 − ȳ, y_2 − ȳ, ..., y_n − ȳ}, so we have

w_x = XLα,  w_y = YLβ    (40)

where α, β ∈ R^n are column vectors. Then equation (39) can be rewritten as

ρ(w_x, w_y) = ρ(α, β) = α^T M_x M_y β / √(α^T M_x^2 α · β^T M_y^2 β)    (41)

Therefore, the above optimization problem with respect to α and β can be reformulated in terms of inner products between data pairs in X and between data pairs in Y. Obviously, either or both of x ∈ R^{d_x} and y ∈ R^{d_y} can be kernelized using the kernel trick. For any data pair x ∈ R^{d_x} and y ∈ R^{d_y}, the output of CCA is x = w_x^T x = α^T LX^T x and y = w_y^T y = β^T LY^T y. Therefore, the output of CCA can be calculated solely in terms of inner products between training data points or between a training data point and a test data point. Note that when the data are translated by an arbitrary constant vector, the matrices M_x and M_y are unchanged, and thus the solutions w_x^* and w_y^* are unchanged. Therefore, CCA satisfies the conditions in Theorem 3.1.

Similarly, PLS solves the following optimization problem:

{w_x^*, w_y^*} = arg max ρ(w_x, w_y)    (42)

where

ρ(w_x, w_y) = w_x^T XLY^T w_y / √(w_x^T w_x · w_y^T w_y)    (43)

The solution w_x^* to the optimization problem (42) is the largest eigenvector of XLY^T YLX^T, and w_y^* = YLX^T w_x^*.

According to Lemma 2.1, the eigenvectors of XLY^T YLX^T can be calculated from the eigenvectors of M_y M_x. Suppose the largest eigenvector of M_y M_x is α; then w_x^* = XLα and w_y^* = YLX^T w_x^* = YM_x α. Therefore, either or both of x ∈ R^{d_x} and y ∈ R^{d_y} can be kernelized using the kernel trick. For any data pair x ∈ R^{d_x} and y ∈ R^{d_y}, the output of PLS is x = w_x^T x = α^T LX^T x and y = w_y^T y = α^T M_x Y^T y. Therefore, the output of PLS can be calculated solely in terms of inner products between training data points or between a training data point and a test data point. Similarly to CCA, when the data are translated by an arbitrary constant vector, the matrices M_x and M_y are unchanged, and thus the solutions w_x^* and w_y^* are unchanged. Therefore, PLS satisfies the conditions in Theorem 3.1. One of the most frequently used PLS approaches is PLS regression [47,9]; it can also be easily verified that PLS regression satisfies the conditions in Theorem 3.1.

5.4 Others

The conditions in Theorems 3.1-3.2 are usually satisfied by most learning algorithms. Besides the algorithms listed above, many other learning algorithms also satisfy these conditions, such as distance metric learning [48], clustering algorithms, the K-nearest neighbor classifier (KNN), etc. Note that in KNN with the Euclidean distance metric, the calculation of the distance between a test point x and a training point x_i includes an inner product between the test point and itself (x^T x), which does not satisfy the conditions in Theorem 3.1. However, since arg min_i ||x − x_i||^2 = arg min_i (x^T x + x_i^T x_i − 2x^T x_i) = arg min_i (x_i^T x_i − 2x^T x_i), the inner product between the test point and itself can be eliminated. Therefore, the output result of KNN still satisfies the conditions in Theorem 3.1.
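The elimination of x^T x can be made explicit in code: the nearest neighbour of a test point can be found from kernel values alone, which is what running KNN on the KPCA-transformed data does implicitly. A minimal sketch (the function name is ours):

```python
import numpy as np

def kernel_1nn(K_test, K_train_diag, train_labels):
    """1-nearest-neighbour labels computed from kernel values only.

    K_test: (m, n) values k(x, x_i); K_train_diag: (n,) values k(x_i, x_i).
    Since ||x - x_i||^2 = k(x, x) + k(x_i, x_i) - 2 k(x, x_i) and k(x, x) is
    constant over i, the argmin only needs k(x_i, x_i) - 2 k(x, x_i).
    """
    scores = K_train_diag[None, :] - 2.0 * K_test
    return train_labels[np.argmin(scores, axis=1)]
```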

6 Experiments

In this section, we numerically verify the validity and effectiveness of the proposed kernel method. Two groups of datasets are used in the experiments: the first group is taken from the UCI Machine Learning Repository [49], and the second group is taken from real-world image databases, including two face image databases, AT&T [50] and UMIST [51], one object image database, COIL-20 [52], and one digit image database, USPS³. A brief description of these datasets is given in Table 1.

³ Available at http://www.kernel-machines.org/data

Table 1
A brief description of the UCI and image datasets used in the experiments, including class number, total number, training number, test number and data dimension.

dataset        class   total num.   training num.   test num.   dimension
iris             3        150            90             60           4
balance          3        625            90            535           4
wine             3        178            90             88          13
vehicle          4        846           120            726          18
ionosphere       2        351            60            291          34
heart            2        270            60            210          13
breast           2        699            60            639          10
waveform-21      3       2746            90           2656          21
chess            2       3196            60           3136          36
diabetes         2        768            60            708           8
cars             3        392            90            302           8
german           2       1000            60            940          20
monk1            2        432            60            372           6
pima             2        768            60            708           8
crx              2        690            60            630          15
australian       2        690            60            630          14
AT&T            40        400           200            200         644
UMIST           20        575           160            415         644
COIL-20         20       1440           160           1280        1024
USPS            10       9298           200           9098         256

In each experiment, we randomly select several samples for training and use the remaining samples for testing. The average results and standard deviations are reported over 50 random splits.

Two algorithms are used in the experiments: SVM and Trace-Ratio based Discriminant Analysis (denoted by TRDA; see equations (29) and (32)). Three kinds of kernel methods for these two algorithms are implemented: the traditional kernel method (denoted by TK), the full-rank KPCA based general kernel method (denoted by FGK) and the low-rank KPCA based general kernel method (denoted by LGK).

Table 2
The experimental results on each dataset. The values are the mean accuracy rate (%) over 50 random splits. For TK, the values in parentheses are the standard deviation; for FGK, the first value in parentheses is the standard deviation and the second value is the mean full rank r of the centralized kernel matrix K; for LGK, the first value in parentheses is the standard deviation and the second value is the low rank m (m < r).

dataset        SVM: TK     SVM: FGK         SVM: LGK        TRDA: TK    TRDA: FGK        TRDA: LGK
iris           95.9(2.1)   95.9(2.1,34)     96.2(1.9,6)     91.0(3.6)   91.0(3.6,34)     96.6(1.8,16)
balance        89.1(2.6)   89.1(2.6,34)     90.1(3.1,10)    85.2(2.5)   85.2(2.5,34)     86.1(2.8,23)
wine           96.2(1.6)   96.2(1.6,89)     96.6(1.6,30)    95.1(2.2)   95.1(2.2,89)     96.9(2.1,15)
vehicle        73.3(2.0)   73.3(2.0,119)    73.5(2.0,91)    71.5(2.2)   71.5(2.2,119)    73.9(2.3,81)
ionosphere     87.2(2.2)   87.2(2.2,58.9)   87.5(2.2,43)    84.5(3.4)   84.5(3.4,58.9)   87.9(2.7,28)
heart          81.8(1.8)   81.8(1.8,59)     82.5(1.7,6)     75.2(3.0)   75.2(3.0,59)     77.5(3.0,42)
breast         96.1(0.8)   96.1(0.8,58.7)   96.5(0.6,3)     90.1(2.8)   90.1(2.8,58.7)   95.9(0.8,12)
waveform-21    81.9(1.5)   81.9(1.5,89)     84.1(1.1,3)     77.8(2.0)   77.8(2.0,89)     79.1(2.1,58)
chess          79.9(3.7)   79.9(3.7,59)     80.2(3.7,57)    79.5(2.8)   79.5(2.8,59)     79.5(2.8,59)
diabetes       74.1(1.7)   74.1(1.7,59)     74.5(1.7,34)    65.5(2.9)   65.5(2.9,59)     69.1(2.5,39)
cars           84.2(3.1)   84.2(3.1,89)     84.3(3.1,43)    72.5(5.4)   72.5(5.4,89)     84.6(2.1,53)
german         69.9(7.5)   69.9(7.5,59)     70.1(7.4,28)    62.1(3.7)   62.1(3.7,59)     62.5(3.5,57)
monk1          72.7(3.9)   72.7(3.9,59)     72.9(3.8,35)    69.9(4.1)   69.9(4.1,59)     72.5(3.9,50)
pima           74.3(1.8)   74.3(1.8,59)     74.5(1.8,26)    65.5(3.2)   65.5(3.2,59)     69.2(2.3,37)
crx            84.2(1.8)   84.2(1.8,59)     84.2(1.8,39)    80.9(2.9)   80.9(2.9,59)     81.6(2.5,45)
australian     85.5(1.0)   85.5(1.0,59)     85.9(0.9,8)     81.2(3.1)   81.2(3.1,59)     82.3(2.5,49)
AT&T           95.1(1.6)   95.1(1.6,199)    95.5(1.6,62)    95.4(1.5)   95.4(1.5,199)    96.2(1.5,129)
UMIST          94.0(2.0)   94.0(2.0,158.5)  94.2(1.9,72)    96.1(1.5)   96.1(1.5,158.5)  96.5(1.6,145)
COIL-20        89.2(1.6)   89.2(1.6,159)    90.3(1.5,29)    89.5(1.8)   89.5(1.8,159)    89.5(1.8,159)
USPS           85.1(1.3)   85.1(1.3,199)    85.2(1.3,192)   85.4(1.3)   85.4(1.3,199)    86.6(1.4,109)

We use a polynomial kernel function defined by

K(x, x′) = (x^T x′ + 1)^3    (44)

In SVM, the regularization parameter C (see equation (17)) is searched over C ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. In TRDA, the reduced dimensionality (the value m in Section 5.2) is simply set to 10 for the UCI datasets and 50 for the image datasets, and the 1-nearest neighbor classifier is used to classify the data.
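For readers who want to reproduce the FGK and LGK pipelines, one possible implementation of the SVM experiments uses scikit-learn; this is our assumption, since the paper does not specify an implementation. KernelPCA with kernel='poly', degree=3, gamma=1 and coef0=1 realizes the kernel in equation (44), and n_components switches between the full-rank (all non-zero components) and low-rank variants. Note that LinearSVC optimizes a squared hinge loss by default, so this sketch reproduces the setup only approximately.

```python
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# FGK pipeline: full-rank KPCA with the polynomial kernel of eq. (44), then a linear SVM.
fgk_svm = make_pipeline(
    KernelPCA(n_components=None, kernel="poly", degree=3, gamma=1.0, coef0=1.0),
    LinearSVC(),
)
param_grid = {"linearsvc__C": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000]}
search = GridSearchCV(fgk_svm, param_grid)
# search.fit(X_train, y_train); accuracy = search.score(X_test, y_test)
# For LGK, set n_components to a value m smaller than the rank of the centralized kernel matrix.
```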

The experimental results on these datasets are summarized in Table 2. From the results we can see that the traditional kernel method and the general kernel method based on the full-rank KPCA produce exactly the same results, which confirms the theoretical analysis in the previous sections. We also observe that the general kernel method based on the low-rank KPCA can further improve the performance on most of the datasets, even when the dimension of the data transformed by the low-rank KPCA is significantly reduced. These results verify that removing noise in the kernel space is effective. For the UCI datasets, the results of SVM outperform those of TRDA, while for the image datasets, the results of TRDA outperform those of SVM. This indicates that large margin analysis may be more suitable than discriminant analysis for low-dimensional data, while for high-dimensional data, discriminant analysis is preferred.

7 Conclusions

In this paper, we propose a general kernel method for learning algorithms. We reveal that the kernel method for a learning algorithm can be implemented by directly performing the learning algorithm with the data transformed by the full-rank KPCA under some mild conditions, which are usually satisfied by most learning algorithms. This general kernelization framework provides us with a new perspective on the kernel method and gives us a mechanism to implement a kernel method with more flexibility. Enlightened by this framework, we propose another kernel method for learning algorithms based on the low-rank KPCA, which can remove the noise in the feature space, speed up the kernel algorithm and improve the numerical stability of the kernel algorithm. Experiments are presented to verify the validity and effectiveness of the proposed kernel methods.

8 Acknowledgments

This work is supported by NSFC (Grant No. 60835002) and the 973 Program (2009CB320602).

A Proof of Theorem 3.1

First, we give two lemmas needed in the proof.

Lemma A.1 Suppose matrix A ∈ R^{d×m} and the full-rank eigen-decomposition of A^T A is A^T A = αΛα^T. Then we have Aαα^T = A.

Proof. Let the columns of β be the eigenvectors corresponding to the zero eigenvalues of A^T A. Then

A^T A = [α, β] [ Λ 0 ; 0 0 ] [α, β]^T
⇒ A^T Aβ = [α, β] [ Λ 0 ; 0 0 ] [α, β]^T β = 0
⇒ (Aβ)^T Aβ = 0
⇒ Aβ = 0
⇒ Aββ^T = 0
⇒ A(I − αα^T) = 0
⇒ Aαα^T = A

The second-to-last implication follows from the fact that [α, β][α, β]^T = I, so ββ^T = I − αα^T. □

Lemma A.2 After performing the full-rank KPCA, we have, for all i ∈ {1, 2, ..., n}, y^T y_i = (φ(x) − φ̄)^T (φ(x_i) − φ̄).

Proof. Denote Y = [y_1, y_2, ..., y_n]. According to equation (12), we know that Y = Λ^{−1/2} α^T L φ(X)^T φ(X) L = Λ^{−1/2} α^T K. Then

y^T Y = (φ(x) − φ̄)^T φ(X)LαΛ^{−1/2} Λ^{−1/2} α^T K
      = (φ(x) − φ̄)^T φ(X)LαΛ^{−1} α^T αΛα^T
      = (φ(x) − φ̄)^T φ(X)Lαα^T
      = (φ(x) − φ̄)^T φ(X)L

The last equality follows from Lemma A.1 with A = φ(X)L, since (φ(X)L)^T (φ(X)L) = K = αΛα^T. Because the i-th column of φ(X)L is φ(x_i) − φ̄, we have y^T y_i = (φ(x) − φ̄)^T (φ(x_i) − φ̄) for all i ∈ {1, 2, ..., n}. □

Proof of Theorem 3.1. According to the first condition, the output result of the learning algorithm performed with the data transformed by the full-rank KPCA can be calculated in terms of y^T y_i (i ∈ {1, 2, ..., n}). According to Lemma A.2, y^T y_i = (φ(x) − φ̄)^T (φ(x_i) − φ̄), so the output result can be calculated in terms of (φ(x) − φ̄)^T (φ(x_i) − φ̄). According to the second condition, the output result can therefore be calculated in terms of φ(x)^T φ(x_i). On the other hand, according to the first condition again, the output result of the learning algorithm in the feature space can be calculated in terms of φ(x)^T φ(x_i). Therefore, the output result of the learning algorithm performed with the data transformed by the full-rank KPCA is equal to the output result of the kernel method of the learning algorithm. □
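Lemma A.2 is also easy to verify numerically: the pairwise inner products of the full-rank KPCA outputs must reproduce the centralized kernel matrix. A small sketch of such a check, reusing the rbf_kernel, kpca_fit and kpca_transform helpers introduced in the main text (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                 # rows are samples
K = rbf_kernel(X, X)
L, lam, V = kpca_fit(K)
Y = kpca_transform(K, K, L, lam, V)          # full-rank KPCA outputs y_1, ..., y_n as rows

# y_i^T y_j should equal (phi(x_i) - phi_bar)^T (phi(x_j) - phi_bar) = (L K L)_{ij}.
print(np.allclose(Y @ Y.T, L @ K @ L))       # expected: True
```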

References

[1] B. Schölkopf, A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001.
[2] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, MA, USA, 2001.
[3] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.
[4] D. Xu, S. Yan, J. Luo, Face recognition using spatially constrained earth mover's distance, IEEE Transactions on Image Processing 17 (11) (2008) 2256–2260.
[5] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[6] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (10) (2000) 2385–2404.
[7] G. Saunders, A. Gammerman, V. Vovk, Ridge regression learning algorithm in dual variables, in: Proc. 15th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1998, pp. 515–521.
[8] T. Melzer, M. Reiter, H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (9) (2003) 1961–1971.
[9] R. Rosipal, L. J. Trejo, Kernel partial least squares regression in reproducing kernel Hilbert space, Journal of Machine Learning Research 2 (2001) 97–123.
[10] F. R. Bach, M. I. Jordan, Kernel independent component analysis, Journal of Machine Learning Research 3 (2002) 1–48.
[11] M. Girolami, Mercer kernel-based clustering in feature space, IEEE Transactions on Neural Networks 13 (2002) 780–784.
[12] K. Yu, L. Ji, X. Zhang, Kernel nearest-neighbor algorithm, Neural Processing Letters 15 (2002) 147–156.
[13] J. Wang, J. Lee, C. Zhang, Kernel trick embedded Gaussian mixture model, in: ALT, 2003, pp. 159–174.
[14] A. Gretton, R. Herbrich, A. Smola, The kernel mutual information, in: Proc. ICASSP, 2003.
[15] B. Schölkopf, A. J. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (5) (1998) 1299–1319.
[16] I. T. Jolliffe, Principal Component Analysis, 2nd Edition, Springer-Verlag, New York, 2002.
[17] S. Harmeling, A. Ziehe, M. Kawanabe, K.-R. Müller, Kernel-based nonlinear blind source separation, Neural Computation 15 (5) (2003) 1089–1124.
[18] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition, IEEE Transactions on PAMI 27 (2) (2005) 230–244.
[19] B. Cao, D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Feature selection in a kernel space, in: ICML, 2007, pp. 121–128.
[20] R. A. Horn, C. R. Johnson, Matrix Analysis, Cambridge University Press, 1990.
[21] R. Courant, D. Hilbert, Methods of Mathematical Physics, Interscience Publishers, 1953.
[22] R. Rosipal, M. Girolami, L. J. Trejo, A. Cichocki, Kernel PCA for feature extraction and de-noising in nonlinear regression, Neural Computing and Applications 10 (3) (2001) 231–243.
[23] S. Fine, K. Scheinberg, Efficient SVM training using low-rank kernel representations, Journal of Machine Learning Research 2 (2001) 243–264.
[24] F. R. Bach, M. I. Jordan, Predictive low-rank decomposition for kernel methods, in: ICML, 2005, pp. 33–40.
[25] A. N. Tikhonov, V. Y. Arsenin, Solutions of Ill-Posed Problems, John Wiley, New York, 1977.
[26] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[27] A. E. Hoerl, R. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1) (1970) 55–67.
[28] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals Eugen. 7 (1936) 179–188.
[29] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, Boston, MA, 1990.
[30] L. Chen, H. Liao, M. Ko, J. Lin, G. Yu, A new LDA based face recognition system which can solve the small sample size problem, Pattern Recognition 33 (10) (2000) 1713–1726.
[31] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34 (2001) 2067–2070.
[32] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, in: NIPS, 2003.
[33] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Transactions on Neural Networks 17 (2006) 1045–9227.
[34] D. Xu, S. Yan, S. Lin, T. S. Huang, Convergent 2-D subspace learning with null space analysis, IEEE Trans. Circuits Syst. Video Techn. 18 (12) (2008) 1753–1759.
[35] D. Xu, S. Yan, L. Zhang, S. Lin, H.-J. Zhang, T. S. Huang, Reconstruction and recognition of tensor-based objects with concurrent subspaces analysis, IEEE Trans. Circuits Syst. Video Techn. 18 (1) (2008) 36–47.
[36] D. Xu, S. Yan, Semi-supervised bilinear subspace learning, IEEE Transactions on Image Processing 18 (7) (2009) 1671–1676.
[37] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K. Müller, Fisher discriminant analysis with kernels, in: Proceedings of IEEE Neural Networks for Signal Processing Workshop, 1999, pp. 41–48.
[38] J. Lu, K. Plataniotis, A. Venetsanopoulos, Face recognition using kernel direct discriminant analysis algorithms, IEEE Transactions on Neural Networks 14 (2003) 117–126.
[39] W. Zheng, L. Zhao, C. Zou, Foley-Sammon optimal discriminant vectors using kernel approach, IEEE Transactions on Neural Networks 16 (2005) 1–9.
[40] Q. Liu, X. Tang, H. Lu, S. Ma, Face recognition using kernel scatter-difference-based discriminant analysis, IEEE Transactions on Neural Networks 17 (2006) 1081–1085.
[41] Y.-F. Guo, S.-J. Li, J.-Y. Yang, T.-T. Shu, L.-D. Wu, A generalized Foley-Sammon transform based on generalized Fisher discriminant criterion and its application to face recognition, Pattern Recognition Letters 24 (1-3) (2003) 147–158.
[42] F. Nie, S. Xiang, C. Zhang, Neighborhood minmax projections, in: IJCAI, 2007, pp. 993–998.
[43] H. Wang, S. Yan, D. Xu, X. Tang, T. S. Huang, Trace ratio vs. ratio trace for dimensionality reduction, in: CVPR, 2007.
[44] X. F. He, S. C. Yan, Y. X. Hu, P. Niyogi, H. J. Zhang, Face recognition using Laplacianfaces, IEEE Transactions on PAMI 27 (3) (2005) 328–340.
[45] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936) 312–377.
[46] S. Wold, A. Ruhe, H. Wold, W. J. Dunn, The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses, SIAM Journal of Scientific and Statistical Computations 5 (1984) 735–743.
[47] A. Hoskuldsson, PLS regression methods, Journal of Chemometrics 2 (1998) 211–228.
[48] E. Xing, A. Ng, M. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, in: NIPS, 2003.
[49] A. Asuncion, D. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html
[50] F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for human face identification, in: 2nd IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
[51] D. B. Graham, N. M. Allinson, Characterizing virtual eigensignatures for general purpose face recognition, in: Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences 163 (1998) 446–456.
[52] S. A. Nene, S. K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, Columbia University, 1996.
