
other two intervals. However, the values of MI on the three intervals are exactly or approximately the same. It is also shown in Table XI that, for the different distributions, the percentage of change for the MI-based method is much smaller than that for the SA-based method. Therefore, for the above two cases, the results in Table XI indicate that the SA-based method cannot provide an exactly or approximately fixed value, while the MI-based method can.

2) Dependency Between Inputs: As shown in [13], the SA-based method for neural networks may be inefficient when the inputs of the networks are dependent. The MI-based method, however, can overcome this limitation. To validate this assertion, one function from [13], y(x_1, x_2) = sin(6x_1) + sin(6x_2) with x_1, x_2 ∈ [0, 1], is used to generate the samples. For the case without dependency between the two inputs, 400 samples are constructed on an evenly spaced 20 × 20 grid over [0, 1] × [0, 1]. For the other three cases with dependencies between the two inputs, the values of the first input x_1 are uniformly distributed over [0, 1], while the values of the second input x_2 are determined from x_1 by the functional relationships given in Table XII. The values of MI and sensitivity for the two inputs are summarized in Table XII. For each of the three cases with dependencies between the two inputs, the table shows that the sensitivities of the two inputs are quite different, while their MI values are approximately the same. This investigation thus confirms the findings in [13] regarding the SA-based method. Moreover, from Table XII, one can see that the MI-based method provides a way to overcome this problem.
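As an illustration of this setup (our sketch, not the original experimental code), the grid samples can be generated and the MI between the target and each input estimated with scikit-learn; the choice of estimator is ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# 400 samples on an evenly spaced 20 x 20 grid over [0, 1] x [0, 1]
g = np.linspace(0.0, 1.0, 20)
x1, x2 = np.meshgrid(g, g)
X = np.column_stack([x1.ravel(), x2.ravel()])
y = np.sin(6 * X[:, 0]) + np.sin(6 * X[:, 1])

# MI of the target with each input; by symmetry the two estimates
# should be approximately equal in the independent case.
mi = mutual_info_regression(X, y, random_state=0)
print(dict(zip(["x1", "x2"], np.round(mi, 3))))
```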



V. CONCLUSION

To construct a compact MLP with better generalization ability, a novel two-phase construction method is proposed for pruning both input and hidden units. In the first phase, the features are ranked by the MI between the target outputs and certain feature subsets using a forward selection strategy. The salient features are then identified and taken as the input units of an MLP according to the ranking results and their contributions to the network's performance. In the second phase, redundant hidden units are removed using a novel relevance measure. In our experiments, the proposed method is shown to be effective on both classification and function approximation problems. Compared with representative results of the SA-based pruning strategy, the present method performs better. Moreover, the proposed method achieves performance comparable to that of SVMs and superior generalization ability compared to SVRs. The investigation shows that, for samples on different intervals or samples with different distributions on the same interval, the sensitivity values provided by the SA-based method are not exactly or approximately the same, while the MI values given by the MI-based method are. In addition, the investigation also shows that when there are dependencies between inputs, the SA-based method can be ineffective, whereas the MI-based method successfully avoids this limitation.

REFERENCES
[1] A. P. Engelbrecht, “New pruning heuristics based on variance analysis of sensitivity information,” IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1386–1399, Nov. 2001.
[2] X. Zeng and D. S. Yeung, “Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure,” Neurocomputing, vol. 69, pp. 825–837, 2006.
[3] I.-S. Oh, J.-S. Lee, and B.-R. Moon, “Hybrid genetic algorithms for feature selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.
[4] G. Van Dijck and M. M. Van Hulle, “Speeding up feature subset selection through mutual information relevance filtering,” in Lecture Notes in Artificial Intelligence. Berlin, Germany: Springer-Verlag, 2007, vol. 4702, pp. 277–287.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[6] N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1667–1671, Dec. 2002.
[7] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Phys. Rev. E, Stat. Plasmas Fluids Relat. Interdiscip. Top., vol. 69, 2004, 066138.
[8] H. Stögbauer, A. Kraskov, S. A. Astakhov, and P. Grassberger, “Least dependent component analysis based on mutual information,” Phys. Rev. E, Stat. Plasmas Fluids Relat. Interdiscip. Top., vol. 70, no. 6, 2004, 066123.
[9] C. Blake and C. Merz, UCI Repository of Machine Learning Datasets, 1998 [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[11] J. H. Friedman, “Multivariate adaptive regression splines,” Ann. Statist., vol. 19, no. 1, pp. 1–141, 1991.
[12] L. Breiman and J. Friedman, “Estimating optimal transformations in multiple regression and correlation,” J. Amer. Statist. Assoc., vol. 80, pp. 580–619, 1985.
[13] M. A. Mazurowski and P. M. Szecówka, “Limitations of sensitivity analysis for neural networks in cases with dependent inputs,” in Proc. 4th IEEE Int. Conf. Comput. Cybern., Tallinn, Estonia, 2006, pp. 299–303.

Constructing Sparse Kernel Machines Using Attractors

Daewon Lee, Kyu-Hwan Jung, and Jaewook Lee

Abstract—In this brief, a novel method that constructs a sparse kernel machine is proposed. The proposed method generates attractors as sparse solutions from a built-in kernel machine via a dynamical system framework. By readjusting the corresponding coefficients and bias terms, a sparse kernel machine that approximates a conventional kernel machine is constructed. The simulation results show that the constructed sparse kernel machine improves the efficiency of the testing phase while maintaining a comparable test error.

Index Terms—Attractors, dynamical systems, kernel method, sparse kernel machines, support vector domain description (SVDD), support vector machine (SVM).

Manuscript received October 23, 2007; revised August 03, 2008; accepted January 02, 2009. First published February 27, 2009; current version published April 03, 2009. This work was supported by the Korea Research Foundation under Grant KRF-2007-313-D00918.
D. Lee is with the Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany.
K.-H. Jung and J. Lee are with the Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 790-784, Korea (e-mail: [email protected]).
Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2009.2014059

I. INTRODUCTION

Kernel-based learning methods [1]–[3] have been successfully applied to a variety of pattern recognition problems. Despite their outstanding generalization performance, kernel methods are computationally expensive in the test phase, since their computational complexity is proportional to the number of training points needed to represent the kernel functions.


Also, they are memory intensive, since all the training points must be stored to evaluate the kernel functions, which is an obstacle to large-scale applications [2]. There have been two main approaches to constructing a sparse kernel machine to enhance its efficiency. One is to construct it directly from the training data set; this includes the support vector machine (SVM) [1], the relevance vector machine (RVM) [4], etc. The other is to construct a sparse kernel machine that approximates a built-in kernel machine by locating a reduced set of kernel centers [5]–[11]; this is therefore called the reduced set (RS) method. We take the second approach to construct a sparser model because, even when the first approach already yields a sparse kernel machine such as an SVM, the second approach can make it even sparser. In the literature, a number of algorithms that implement the second approach have successfully reduced the number of kernel centers needed to construct a sparse kernel machine, but with some empirical drawbacks. The reduced set construction method, introduced in [5] and refined in [6]–[8], starts by finding an approximate preimage of the kernel solution using an unconstrained optimization technique and iteratively finds the next optimal approximate preimage of the residual until a prespecified criterion is satisfied. However, these methods suffer from local minima because the objective function is nonconvex, so the whole optimization process must be restarted many times with different initial points to escape from local minima [5], [7], [8], leading to intensive computational complexity. They may also suffer from singularity problems, which make it tricky to compute the inverse of the Gram matrix needed to find the optimal coefficients. Another method, proposed in [10], reduces the number of support vectors (SVs) by iteratively replacing two SVs of the same class by a vector that is an approximate preimage of a convex combination of the two SVs in the feature space. However, it does not provide a general framework applicable to an arbitrary kernel, since it must be designed separately for each kernel. The exact simplified SVM proposed in [9] discards unnecessary SVs that are linearly dependent on the other SVs. Even though the prediction quality of the resulting sparse model is guaranteed to remain unchanged, this approach does not provide a principled way to control the reduced set size when one would like to speed up the test phase by marginally sacrificing accuracy. It also suffers from a low reduction rate of kernel centers (in the worst case, 0%) when the number of linearly dependent SVs is small, especially for high-dimensional problems. To overcome such drawbacks and to improve the reduction ratio while maintaining the prediction accuracy, we propose a novel framework that generates new reduced kernel centers. The proposed framework first builds a dynamical system associated with a built-in kernel machine, and then uses its attractors as reduced kernel centers. Through simulations, it will be shown that the method can generate a number of reduced kernel centers far lower than that of other existing methods, while suffering from neither local minima nor singularity problems and with minimal loss of accuracy.

The organization of this brief is as follows. In Section II, we formulate our problem, and Section III proposes a framework that constructs a sparse kernel machine. In Sections IV and V, we apply the proposed framework to some SV-based kernel machines for multiclassification problems and conduct simulations to show the effectiveness and the efficiency of the proposed method, respectively.

II. PROBLEM FORMULATION

Kernel algorithms [1]–[3], [8] express their solutions as expansions in terms of mapped input points as

\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i)    (1)

where α_i ∈ ℝ and x_i ∈ X. Using a kernel k, the resulting kernel algorithm can then predict new data points through the computation of all k(x_i, x) = ⟨Φ(x_i), Φ(x)⟩ for i = 1, ..., N_x. For example, a typical discriminant function of a kernel machine is represented by

f(x) = \langle \Psi, \Phi(x) \rangle + b = \sum_{i=1}^{N_x} \alpha_i \langle \Phi(x_i), \Phi(x) \rangle + b = \sum_{i=1}^{N_x} \alpha_i k(x_i, x) + b    (2)
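As a concrete illustration of (2) (our sketch, not part of the original brief, with a Gaussian kernel assumed), evaluating a kernel expansion requires one kernel evaluation per center, so the test-phase cost grows linearly with N_x:

```python
import numpy as np

def gaussian_kernel(A, x, q=1.0):
    # k(a, x) = exp(-q * ||a - x||^2), evaluated for every row a of A.
    return np.exp(-q * np.sum((A - x) ** 2, axis=-1))

def predict(x, centers, alphas, bias, q=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x) + b, as in (2).
    return float(alphas @ gaussian_kernel(centers, x, q) + bias)
```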

Therefore, we can reduce the runtime complexity by reducing the number N_x of kernel centers x_i. The best way to accomplish this task is to find a preimage z ∈ X satisfying Φ(z) = Ψ. However, such a preimage does not always exist, since Φ is a nonlinear mapping into the feature space F [2]. Instead, we attempt to obtain an approximation to Ψ with a smaller number of kernel centers. Specifically, we look for a Ψ' = Σ_{i=1}^{N_z} β_i Φ(z_i), with N_z (≪ N_x), β_i ∈ ℝ, and z_i ∈ X, that minimizes

\|\Psi - \Psi'\|^2 = \sum_{i,j=1}^{N_x} \alpha_i \alpha_j k(x_i, x_j) + \sum_{i,j=1}^{N_z} \beta_i \beta_j k(z_i, z_j) - 2 \sum_{i=1}^{N_x} \sum_{j=1}^{N_z} \alpha_i \beta_j k(x_i, z_j)    (3)

The key point here is that we can minimize (3) through the kernel function alone, even though the feature mapping Φ is not explicitly given [2], [5], [7], [8].
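The following sketch (ours, again assuming a Gaussian kernel) evaluates the feature-space error (3) using nothing but kernel matrices, which is exactly the point: the feature map Φ never has to be computed.

```python
import numpy as np

def rbf_matrix(A, B, q=1.0):
    # Pairwise Gaussian kernel matrix K[i, j] = exp(-q * ||A_i - B_j||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def approximation_error(X, alpha, Z, beta, q=1.0):
    # ||Psi - Psi'||^2 expanded as in (3).
    return (alpha @ rbf_matrix(X, X, q) @ alpha
            + beta @ rbf_matrix(Z, Z, q) @ beta
            - 2.0 * alpha @ rbf_matrix(X, Z, q) @ beta)
```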

III. PROPOSED RS CONSTRUCTION FRAMEWORK

In this section, we present a framework to generate reduced kernel centers for a sparse kernel machine. The proposed framework consists of three phases: phase I generates attractors as new kernel centers, phase II computes the coefficients for the new kernel centers, and phase III resets the bias term of the kernel machine for performance improvement. The detailed procedure of the method is described below.

A. Phase I: Generating Attractors as New Kernel Centers

Phase I starts by defining the distance function ρ(x) via the L2-norm between a given trained kernel solution Ψ and Φ(x), the image of x in the feature space:

\rho(x) = \|\Psi - \Phi(x)\|^2 = \Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \Phi(x) \Big\|^2 = \sum_{i=1}^{N_x} \sum_{j=1}^{N_x} \alpha_i \alpha_j k(x_i, x_j) + k(x, x) - 2 \sum_{i=1}^{N_x} \alpha_i k(x_i, x)    (4)

Any other distance function (e.g., a weighted squared distance function) satisfying the properties of a metric can also be used in this phase without changing the general procedure of the proposed method. Now, to construct new reduced kernel centers for a sparser machine that approximates a built-in kernel machine, we build the following dynamical system associated with ρ:

\frac{dx}{dt} = -\nabla \rho(x)    (5)
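For a Gaussian kernel, the vector field driving (5) has a simple closed form, since only the last term of (4) depends on x (the term k(x, x) is constant). The sketch below (our illustration, not the authors' code) writes out ∇ρ and follows the negative gradient flow with a plain Euler scheme:

```python
import numpy as np

def grad_rho(x, centers, alphas, q=1.0):
    # For k(u, v) = exp(-q * ||u - v||^2):
    #   grad rho(x) = -2 * sum_i alpha_i * grad_x k(x_i, x)
    #   grad_x k(x_i, x) = 2 * q * (x_i - x) * k(x_i, x)
    diff = centers - x                              # (N, d)
    k = np.exp(-q * np.sum(diff ** 2, axis=1))      # (N,)
    return -4.0 * q * (alphas * k) @ diff           # (d,)

def flow_to_attractor(x0, centers, alphas, q=1.0, step=0.1, n_steps=500):
    # Explicit Euler integration of dx/dt = -grad rho(x); any stable descent
    # scheme that converges to a stable equilibrium of (5) would do.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x -= step * grad_rho(x, centers, alphas, q)
    return x
```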


The existence of a unique solution (or trajectory) x(·) : ℝ → ℝⁿ of (5) for each initial condition is guaranteed under mild conditions on ρ (see, e.g., [12]). For an attractor (i.e., a stable equilibrium point) s of (5), its basin cell is defined as the closure of its basin of attraction

A(s) := \operatorname{cl}\{\, x(0) \in \mathbb{R}^n : \lim_{t \to \infty} x(t) = s \,\}    (6)

The boundary of the basin cell is called the basin cell boundary and is denoted by ∂A(s). One nice property of system (5) is that, under quite general conditions, the whole data space can be decomposed into several separate basin cells of attractors [13], that is,

\mathbb{R}^n = \bigcup_{i=1}^{M} A(s_i)    (7)

where V = {s_1, ..., s_M} is the set of all the attractors of (5) [see Fig. 1(b) and (c)]. Another nice property of system (5) is that each attractor s is a global minimum of ρ on its basin cell, as stated in the following lemma.

Lemma 1: Let s be an attractor (or stable equilibrium vector) of system (5). Then, we have

\rho(s) \le \rho(z), \quad \forall z \in A(s)    (8)

Proof: This result follows from observing that ρ(x(t)) is a nonincreasing function of t ∈ ℝ for each x(0) = z ∈ A(s), since

\frac{d\rho(x(t))}{dt} = -\nabla\rho(x(t))^{\mathsf T}\,\nabla\rho(x(t)) \le 0    (9)

and x(t) → s as t → ∞.

This lemma means that each attractor is the best approximation, as a preimage of the solution vector Ψ, when the domain is restricted to its basin cell, which motivates us to employ the set of attractors V as the new reduced kernel centers. It is worth noting that the whole set of kernel centers V is obtained deterministically from one specific ρ in (4), whereas other existing methods, such as [2] and [6], need to search for kernel centers by optimizing objective functions of the form (3) that change at every step, which is computationally intensive. Also, phase I automatically determines the number of possible kernel centers, uniquely identified by the chosen kernel function k.

B. Phase II: Calculating the Coefficients for the Reduced Kernel Centers

In phase I, we have generated the attractors V = {s_1, ..., s_M} as new reduced kernel centers. With these kernel centers, in phase II, we look for the optimal coefficients of our approximate solution Ψ' = Σ_{i=1}^{M} β_i Φ(s_i) that make Ψ' closest to the solution vector Ψ. To this end, we solve the following minimization problem:

\|\Psi - \Psi'\|^2 = \Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \sum_{j=1}^{M} \beta_j \Phi(s_j) \Big\|^2 = \sum_{i,j=1}^{N_x} \alpha_i \alpha_j k(x_i, x_j) + \sum_{i,j=1}^{M} \beta_i \beta_j k(s_i, s_j) - 2 \sum_{i=1}^{N_x} \sum_{j=1}^{M} \alpha_i \beta_j k(x_i, s_j)    (10)

By setting the derivative of (10) with respect to β to zero, we can directly calculate the optimal coefficient vector as follows:

\beta = (K^{ss})^{-1} K^{sx} \alpha    (11)

where α = (α_1, ..., α_{N_x})^T, β = (β_1, ..., β_M)^T, K^{ss} is the M × M matrix with K^{ss}_{ij} = k(s_i, s_j), and K^{sx} is the M × N_x matrix with K^{sx}_{ij} = k(s_i, x_j). When calculating (11), a numerical problem may arise: it is observed that the existing methods [6]–[8] often suffer from multicollinearity in the constructed kernel matrix, leading to a singularity problem in computing β. On the contrary, the elements of V are distinct and well separated, since the basin cells are disjoint. The corresponding kernel matrix K^{ss} is therefore nonsingular by Micchelli's theorem [15] and, in addition, often has dominant diagonal elements when Gaussian kernels are used, thereby leading to well-posed inversion problems in most cases.
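A minimal sketch of phase II (ours; a Gaussian kernel and the attractors from phase I are assumed as inputs): the coefficients in (11) follow from one linear solve, and solving the system is preferable to forming the explicit inverse.

```python
import numpy as np

def phase2_coefficients(X, alpha, S, q=1.0):
    # beta = (K^ss)^{-1} K^sx alpha, as in (11).
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-q * d2)
    K_ss = rbf(S, S)            # M x M, nonsingular for distinct attractors
    K_sx = rbf(S, X)            # M x N_x
    return np.linalg.solve(K_ss, K_sx @ alpha)
```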

The next result presents another nice property of the Ψ' obtained from phase II. As mentioned in the previous section, a preimage of the exact solution Ψ generally does not exist. As an alternative, we can look for the input vector whose image is closest to the exact solution Ψ. The next theorem guarantees that Ψ' is closer to the solution vector than the image of any single input pattern.

Theorem 1: Suppose that s_1, ..., s_M ∈ V are the attractors of (5) and β_1, ..., β_M are the optimal coefficients obtained from (11). Then, for all z ∈ X, the following inequality holds:

\Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \Phi(z) \Big\|^2 \ge \Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \sum_{i=1}^{M} \beta_i \Phi(s_i) \Big\|^2    (12)

Proof: For an arbitrarily given z ∈ X, we can find an attractor s_k ∈ V such that z ∈ A(s_k) by (7). Then, by Lemma 1, we have

\Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \Phi(z) \Big\|^2 \ge \Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \Phi(s_k) \Big\|^2    (13)

We can represent Φ(s_k) = Σ_{i=1}^{M} β̃_i Φ(s_i), where (β̃_1, ..., β̃_k, ..., β̃_M) = (0, ..., 1, ..., 0). Hence, by the optimality of β in (11), we have

\Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \Phi(s_k) \Big\|^2 \ge \Big\| \sum_{i=1}^{N_x} \alpha_i \Phi(x_i) - \sum_{i=1}^{M} \beta_i \Phi(s_i) \Big\|^2

and the result follows.

C. Phase III: Recomputing an Optimal Bias Term

Most kernel machines, including SVMs, have a bias term as in (2). As mentioned in [7], in classification we are interested in

\int \Big| \operatorname{sgn}\Big( \sum_{i=1}^{N_x} \alpha_i k(x_i, x) + b \Big) - \operatorname{sgn}\Big( \sum_{j=1}^{N_z} \beta_j k(z_j, x) + \tilde{b} \Big) \Big| \, dP(x)

rather than ‖Ψ − Ψ'‖. Because of the discrepancy between Ψ and Ψ' in the feature space, there is room for improving the classification accuracy by adjusting the bias term. Therefore, after finding the Ψ' that approximates Ψ, we recompute the bias term by optimizing the following model:

\min_{\tilde{b}} \; \frac{1}{2} \sum_{x_i \in D} \big( f(x_i \mid \Psi, b) - f(x_i \mid \Psi', \tilde{b}) \big)^2    (14)

where b̃ is the new bias term for the approximated kernel function. Empirically, it is observed that the error rate is reduced after recomputing the bias in most cases, as shown in Table II. As a result, the final model obtained by the proposed method is f(x | Ψ', b̃), which depends on a much smaller number of vectors than the original built-in kernel machine.
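Because (14) is quadratic in b̃, its minimizer has a closed form: the mean gap, over D, between the original decision values and the bias-free reduced expansion. A sketch (ours; the argument names are illustrative, not from the brief):

```python
import numpy as np

def phase3_bias(f_original, X_data, S, beta, q=1.0):
    # Optimal b~ for (14): setting the derivative to zero gives
    #   b~ = mean_i [ f(x_i | Psi, b) - sum_j beta_j * k(s_j, x_i) ].
    d2 = ((S[:, None, :] - X_data[None, :, :]) ** 2).sum(-1)
    K_sd = np.exp(-q * d2)                  # M x |D| kernel matrix
    reduced_scores = beta @ K_sd            # bias-free reduced expansion
    return float(np.mean(f_original - reduced_scores))
```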

D. Complexity Analysis and Implementation Issue to Find Attractors

The proposed framework described in the previous section might be computationally expensive for large-scale data sets such as handwritten digit data. Most of the computational burden arises from locating the attractors in phase I. To analyze the complexity of this step, let m be the average number of iteration steps needed for each initial point to locate its corresponding attractor (stable equilibrium point) via a steepest descent process, and let N be the number of initial points used. Then, the time complexity of obtaining the set of all the attractors of system (5) is O(Nm). Because the attractors of the negative gradient system (5) are equivalent to the local minima of (4), as stated in Lemma 1, and because we are only concerned with locating an attractor rather than accurately following the trajectory itself, it is better to find the local minima (attractors) starting from the N initial data points by employing a globally convergent optimization solver, such as a trust-region method, rather than by numerically integrating system (5). Empirically, we have observed that only two or three steps are needed for convergence. However, finding the local minima (i.e., attractors) from the N initial points sequentially is still time consuming for large data sets. In this case, we can speed up the process by using a sampling strategy. The optimal strategy would be to select one initial point from each basin cell; however, since the basin cells cannot be determined a priori, in this brief we used the SVs of the built-in kernel machine as the initial points.
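A sketch of this implementation strategy (ours, not the authors' code): start one local minimization of ρ from each SV of the built-in machine and merge converged points that coincide up to a tolerance. A quasi-Newton SciPy solver is used below as a stand-in for the trust-region solvers mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def find_attractors(sv, alphas, q=1.0, tol=1e-3):
    # rho(x) of (4) up to its constant terms; constants do not move the minima.
    def rho(x):
        k = np.exp(-q * np.sum((sv - x) ** 2, axis=1))
        return -2.0 * np.dot(alphas, k)

    attractors = []
    for x0 in sv:                       # SVs of the built-in machine as initial points
        s = minimize(rho, x0, method="L-BFGS-B").x
        # keep only one representative per attractor
        if not any(np.linalg.norm(s - a) < tol for a in attractors):
            attractors.append(s)
    return np.array(attractors)
```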

IV. APPLICATIONS TO KERNEL-BASED CLASSIFIERS

In this section, we address how to construct sparser kernel classifiers by applying the proposed framework to two SV-based kernel machines for multiclassification: the one-against-all SVM [16] and the SVDD-based classifier (multi-SVDD) [17]. We first partition the given training data into K disjoint subsets {D_k}_{k=1}^{K} according to their output classes, where the kth class data set is D_k = {(x_i^k, y_i^k)}_{i=1}^{N_k}, with input x_i^k ∈ ℝ^d and corresponding output y_i^k.

A. Constructing a Sparser SVM (E-SVM)

The one-against-all SVM constructs K binary SVM classifiers, each of which separates one class from all the rest [16]. The kth SVM is trained with all the training examples of the kth class with positive labels and all the others with negative labels, and has the following form:

f_k(x \mid w_k, b_k) = w_k^{\mathsf T} \Phi(x) + b_k = \sum_{i=1}^{N_{x_k}} \alpha_i^k y_i^k \langle \Phi(x_i^k), \Phi(x) \rangle + b_k = \sum_{i=1}^{N_{x_k}} \alpha_i^k y_i^k k(x_i^k, x) + b_k    (15)

where w_k = Σ_{i=1}^{N_{x_k}} α_i^k y_i^k Φ(x_i^k) is the solution vector with N_{x_k} SVs.

To make the kth SVM classifier sparser, we apply the proposed framework to it as follows. In phase I, we compute the set of attractors V_k = {s_1^k, ..., s_{M_k}^k}, with M_k < N_{x_k}, using system (5), where ρ in (4) is computed with Ψ = w_k. Then, in phase II, we compute the coefficients β^k = (β_1^k, ..., β_{M_k}^k) using (11) and obtain the approximated solution vector Ψ' (denoted w'_k). Finally, in phase III, we calculate the optimal bias b'_k by solving (14) and obtain the resulting new kth sparser classifier function

f_k(x \mid w'_k, b'_k) = {w'_k}^{\mathsf T} \Phi(x) + b'_k = \sum_{i=1}^{M_k} \beta_i^k \Phi(s_i^k)^{\mathsf T} \Phi(x) + b'_k = \sum_{i=1}^{M_k} \beta_i^k k(s_i^k, x) + b'_k    (16)

At the classification step, a sample x is classified into the class k* = arg max_{k=1,...,K} f_k(x | w'_k, b'_k). The obtained sparser machine will be denoted by E-SVM.
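A sketch of the resulting one-against-all decision rule (ours; the per-class attractors, coefficients, and recomputed biases produced by the three phases are assumed to be given):

```python
import numpy as np

def e_svm_predict(x, S, beta, b, q=1.0):
    # S, beta, b are lists indexed by class k: attractors (M_k x d),
    # coefficients (M_k,), and the recomputed bias of the k-th sparse classifier.
    scores = []
    for S_k, beta_k, b_k in zip(S, beta, b):
        k_vals = np.exp(-q * np.sum((S_k - x) ** 2, axis=1))
        scores.append(float(beta_k @ k_vals + b_k))   # f_k(x | w'_k, b'_k) as in (16)
    return int(np.argmax(scores))                     # k* = arg max_k f_k
```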

B. Constructing a Sparser SVDD Classifier (E-SVDD)

The multiclass SVDD (multi-SVDD) models a pseudoposterior probability density for each class data set using the support vector domain description (SVDD) [17]. The kth kernel support function is trained only with D_k, not with the whole data set, and has the following form:

f_k(x \mid a_k) = \|\Phi(x) - a_k\|^2 = k(x, x) - 2 \sum_{i=1}^{N_{x_k}} \alpha_i^k k(x_i^k, x) + \sum_{i=1}^{N_{x_k}} \sum_{j=1}^{N_{x_k}} \alpha_i^k \alpha_j^k k(x_i^k, x_j^k)    (17)

where a_k = Σ_{i=1}^{N_{x_k}} α_i^k Φ(x_i^k) is the solution vector with N_{x_k} SVs, and the α_i^k are the coefficients obtained by the SVDD. At the classification step, a sample x is classified into the class

k^* = \arg\max_{k=1,\ldots,K} \gamma_k(x), \qquad \gamma_k(x) := \frac{N_k}{N} \big( r_k - f_k(x \mid a_k) \big)    (18)

where r_k = f_k(x_i^k | a_k) for any SV x_i^k generated by the SVDD, and γ_k represents a pseudoposterior probability distribution for the kth class [17].

To make the kth kernel support function sparser, we apply the proposed framework to it as follows. In phase I, we compute the set of attractors V_k = {s_1^k, ..., s_{M_k}^k}, with M_k < N_{x_k}, using system (5), where ρ in (4) is computed with Ψ = a_k. Then, in phase II, we compute the coefficients β^k = (β_1^k, ..., β_{M_k}^k) using (11) and obtain the approximated solution vector Ψ' (denoted a'_k). Then, in phase III, we optimize the bias term; here we optimize r_k, since it plays the role of a bias term in γ_k(x) in (18). Finally, we obtain the resulting new kth sparser classifier function

\gamma'_k(x) := \frac{N_k}{N} \big( r'_k - f_k(x \mid a'_k) \big)    (19)

where

f_k(x \mid a'_k) = \|\Phi(x) - a'_k\|^2 = k(x, x) - 2 \sum_{i=1}^{M_k} \beta_i^k k(s_i^k, x) + \sum_{i=1}^{M_k} \sum_{j=1}^{M_k} \beta_i^k \beta_j^k k(s_i^k, s_j^k)    (20)

The obtained sparser machine will be denoted by E-SVDD.
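A sketch of the E-SVDD decision rule (ours; the per-class attractors, coefficients, adjusted r'_k, and class priors N_k/N are assumed given; k(x, x) = 1 for the Gaussian kernel):

```python
import numpy as np

def e_svdd_predict(x, S, beta, r, priors, q=1.0):
    # gamma'_k(x) = (N_k / N) * (r'_k - f_k(x | a'_k)), with f_k as in (20).
    scores = []
    for S_k, beta_k, r_k, p_k in zip(S, beta, r, priors):
        k_sx = np.exp(-q * np.sum((S_k - x) ** 2, axis=1))        # k(s_i^k, x)
        d2 = ((S_k[:, None, :] - S_k[None, :, :]) ** 2).sum(-1)
        K_ss = np.exp(-q * d2)                                    # k(s_i^k, s_j^k)
        f_k = 1.0 - 2.0 * beta_k @ k_sx + beta_k @ K_ss @ beta_k
        scores.append(p_k * (r_k - f_k))
    return int(np.argmax(scores))
```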


Fig. 1. Illustration of constructing a sparser kernel classifier E-SVDD from the original kernel classifier multi-SVDD. (a) The solid line represents the decision boundary generated by multi-SVDD with 27 kernel centers. (b) and (c) The light solid lines and inner dotted lines represent the contours of the trained kernel function values on class 1 and class 2 of the multi-SVDD, respectively. The circles represent attractors converged to from the original kernel centers (marked with filled squares) when process (5) is applied; these attractors are used as the reduced kernel centers. The solid lines represent the basin cell boundaries constructed by these attractors. (d) The solid line represents the decision boundary generated by E-SVDD with nine kernel centers.

To illustrate the construction of E-SVDD, consider the toy example of [2] in Fig. 1. Our objective is to approximate the multi-SVDD with 27 kernel centers shown in Fig. 1(a). At first, we partition the given data set into two subsets according to their output classes and perform the SVDD on each class-conditional data set. As a result, we obtain the trained kernel function and its decision boundary for each class-conditional data set, as shown in Fig. 1(b) and (c). Here, we generate four and five attractors for class 1 and class 2, respectively, which are used as the reduced kernel centers. The resulting sparse kernel classifier E-SVDD with nine kernel centers is shown in Fig. 1(d); it generates a decision boundary similar in shape to that of the multi-SVDD in Fig. 1(a).

V. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed sparse kernel classifiers (E-SVM and E-SVDD) empirically, we have conducted simulations on benchmark data sets for binary and multiclass classification and compared their performance with existing methods.

Descriptions of the data sets are given in Table I. toy2D, toy3D [2], twocircles, tae, OXours, triangle, ring, and sunflower [17] are challenging classification problems generated from nonlinearly separable distributions. sonar, heart, shuttle, vowel, and wine are widely used classification data sets from the University of California at Irvine (UCI) repository [18]. The g50c data set is taken from [19]. USPS is a handwritten digit database [20] of 7291 training patterns and 2007 test patterns with 16 × 16 digit images. The performance of E-SVM and E-SVDD is compared with that of various reduced set methods, namely the fixed-point iteration scheme (RSC) [7], [8], the exact simplified SVM (Sim-SVM) [9], the reduced SVM (RSVM) [10], and the dynamic decay adjustment radial basis function (RBF) network (DDA-SVM) [11], as well as with the original full kernel classifiers (SVM and multi-SVDD [17]).


TABLE I BENCHMARK DATA DESCRIPTION AND PARAMETER SETTINGS

TABLE II DATA DESCRIPTIONS AND EXPERIMENTAL RESULTS ON MULTICLASS BENCHMARK DATA SETS. ERROR IS THE MISCLASSIFICATION ERROR RATE (IN PERCENT) ON THE TEST DATA SET. N DENOTES THE NUMBER OF KERNEL CENTERS. NOTE HERE THAT FOR THE MULTICLASSIFICATIONS, N DENOTES THE TOTAL NUMBER OF KERNEL CENTERS

The criteria for evaluating the performance of these methods are the misclassification error rates on the test data sets and the number of kernel centers. Note that if an SV is shared by several binary classifiers, we counted it as one in order to avoid duplicate counting of the shared SVs. In our experiments, we used a Gaussian kernel, k(x, y) = exp(−q‖x − y‖²), for all of the kernel methods, where q is a kernel bandwidth parameter. For simplicity, we used the same value of q for the kernel function k(·,·) in (15) for each binary class data set and in (17) for each class-conditional data set, respectively.


The regularization parameter C controls the complexity of the built-in kernel machines so as to trade off error reduction against regularization. The model parameter pair (C, q) is chosen among several combinations of values on a finite grid by using tenfold cross validation. Specifically, given a parameter pair (C, q), the data set of each class is divided into ten subsets: one of the ten subsets is used as the validation set and the other nine are put together to form a training set. The latter is used for constructing a kernel machine for each class, and the former is used to measure the misclassification rate of the constructed machine.


Fig. 2. Experimental results on benchmark data sets. Each column represents the results of multiclass classification applied to one of three data sets (oxours, ring, and sunflower); the classification boundaries and the number of kernel centers are given. The first and second rows show the results of OaA-SVM and E-SVM, while the third and fourth rows are for multi-SVDD and E-SVDD, respectively. Similar decision boundaries are observed with a far lower number of kernel centers for both sparser kernel machines.

This process is repeated ten times, the average error across all ten trials is computed for each class, and the final total misclassification error rate for a given (C, q) is obtained by averaging the error over the classes. The parameter pair (C, q) with the minimum total misclassification error rate is selected for constructing the final kernel machines. The detailed parameter values are reported in Table I. In the case of multi-SVDD, we allowed no soft margin, so we set C = 0.
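For the SVM-based classifiers, this tenfold grid search over (C, q) can be sketched with scikit-learn (our illustration; the grid values are placeholders and q corresponds to scikit-learn's gamma):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-in data; in the experiments this would be a benchmark training set
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10]}   # placeholder grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
C_best, q_best = search.best_params_["C"], search.best_params_["gamma"]
```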

Experimental results for the benchmark data sets are shown in Table II and Fig. 2. In Table II, we can see that E-SVM and E-SVDD have competitive performance in terms of classification accuracy compared to the original full kernel classifiers SVM and multi-SVDD, while achieving noteworthy reduction rates in the number of kernel centers. They also have an advantage in classification accuracy and reduction rate over RSC, Sim-SVM, RSVM, and DDA-SVM. In particular, RSC, despite its ability to control the reduction rate directly, shows a poor reduction rate when required to achieve comparable classification accuracy. Sim-SVM could not reduce the number of kernels significantly, partly because the mapping to an infinite-dimensional feature space under a Gaussian kernel does not preserve linear dependencies between the mapped images; as expected, it shows 0% reduction rates for 11 of the 15 data sets. For RSVM, additional trials are needed to determine the optimal maximum marginal difference (MMD), which trades off the reduction rate against accuracy loss.


Empirically, its reduction rate becomes very small when a moderate classification accuracy must be kept for some data sets (for example, heart, sunflower, vowel, and wine in Table I). DDA-SVM is shown to be highly dependent not only on the positive and negative thresholds but also on whether the pruning step is applied, since pruning often removes too many nodes and thereby results in severe accuracy loss. Note that the influence of the bias optimization on both the training and the test error is given in Table II, where E-SVM and E-SVM′ denote the results with and without the bias optimization of phase III, respectively. The comparison between E-SVM and E-SVM′ shows that, empirically, the error rates are further reduced after the bias optimization for most data sets. Interestingly, decision boundaries similar to those of SVM and multi-SVDD are observed for E-SVM and E-SVDD, respectively, with a far lower number of kernel centers, as shown in Fig. 2.

The last row in Table II compares the performance of these methods when applied to a large real data set, the USPS handwritten digit data. In this example, we started with ten binary SVM classifiers (similarly for multi-SVDD), which are one-against-all (i.e., digit 0 against the others), and then applied the methods to each classifier. For a fair comparison [8], [10], [11], we used the same parameter values and experimental settings. E-SVM is shown to reduce the set size of kernel centers significantly, with a reduction rate of 75%, while its average test error is comparable to those of the existing methods. E-SVDD, on the other hand, shows poor classification accuracy because the given built-in classifier (multi-SVDD) itself is inaccurate (6.58% error rate), possibly due to its failure to estimate the class-conditional density for the high-dimensional USPS data set.

In short, the proposed method shows a better reduction ratio with respect to the set size of kernel centers, especially for multiclassification. The sharing property of the proposed method, compared with the existing methods, partly explains this significant reduction rate. For illustration, consider a three-class classification problem. One-against-one SVM constructs three binary SVM classifiers, one for each combination of classes (class 1 and class 2, class 1 and class 3, class 2 and class 3), where S12, S13, and S23 are, respectively, the reduced sets of their kernel centers obtained by reduced set methods. Since the existing reduced set construction methods generate new kernel centers that are not contained in the original data set, and since the sparse solutions depend on the combination of binary class data sets, each binary classifier almost surely does not share its kernel centers with those of the remaining binary classifiers. In other words, S12 ∩ S13 = S12 ∩ S23 = ∅, and therefore the total reduced set size is approximately proportional to the number of combinations of binary classes. In the worst case, this size might exceed the number of SVs of the original SVM classifier, implying no practical advantage in memory saving or test-phase speedup (see vowel in Table II). In contrast, the total reduced set size of kernel centers obtained by E-SVDD is only affected by the distribution of the class-conditional data sets [17]. As mentioned in the previous section, multi-SVDD constructs the kth-class pseudoposterior density function using only the kth class data set, not binary class data sets [17].
Let S1, S2, and S3 be the reduced kernel centers for the three pseudoposterior density functions. When constructing a binary classifier on class 1 and class 2, E-SVDD uses S12 = S1 ∪ S2 as its reduced set. Therefore, the binary classifiers for class 1 and class 2 and for class 1 and class 3 share S1 in their reduced sets, because S12 ∩ S13 = S1. This sharing property allows E-SVDD to achieve a high reduction rate of the kernel centers in multiclassification problems.

To analyze approximately the average set size of kernel centers with respect to the number of class labels, let N be the average number of reduced kernel centers of each binary classifier in a one-against-one SVM obtained by RSC, and let K be the number of class labels. Assuming that E-SVDD has the same reduction ratio as RSC at the binary-classifier level, it has at most N/2 attractors for each class-conditional data set in our simulations, so the total set size of kernel centers of E-SVDD is on average K · N/2. On the other hand, the total set size of kernel centers constructed by RSC is on average (K(K − 1)/2) · N. As a consequence, the existing reduced-set construction methods need on average at least (K − 1) times as many kernel centers as E-SVDD.
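The count comparison can be made concrete with a small calculation (ours; the numbers are illustrative):

```python
def total_centers(K, N_avg):
    # Approximate totals from the analysis above.
    pairwise_rs = K * (K - 1) // 2 * N_avg   # one reduced set per class pair
    e_svdd = K * N_avg // 2                  # one shared set per class
    return pairwise_rs, e_svdd

print(total_centers(K=10, N_avg=40))   # (1800, 200): a factor of K - 1 = 9
```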

Finally, we would like to mention that the framework of the proposed method is quite general and can be extended and applied to approximate other kernel-based algorithms [2], [13], [21]–[23].

VI. CONCLUSION

In this brief, a novel method to construct a sparse kernel machine that approximates a given kernel machine is presented. The proposed method consists of three phases. The first phase builds a dynamical system associated with the original kernel machine and uses its attractors as reduced kernel centers, the number of which is typically far lower than that of the original ones. The second phase then recomputes the corresponding coefficients. Finally, the third phase readjusts the bias term of the kernel machine for accuracy improvement. Experimental results on both benchmark and large-scale data sets demonstrate that the proposed method leads to much sparser kernel machines and, correspondingly, faster evaluation on new data with minimal loss of accuracy. The application of the method to more diverse real-world problems remains to be investigated.

REFERENCES
[1] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.
[2] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.
[3] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[4] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 1, pp. 211–244, 2001.
[5] C. J. C. Burges, “Simplified support vector decision rules,” in Proc. Int. Conf. Mach. Learn., 1996, pp. 71–77.
[6] C. J. C. Burges and B. Schölkopf, “Improving the accuracy and speed of support vector machines,” in Advances in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, vol. 9, p. 375.
[7] B. Schölkopf, P. Knirsch, A. Smola, and C. Burges, “Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces,” in Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 1998.
[8] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, “Input space versus feature space in kernel-based methods,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1000–1017, Sep. 1999.
[9] T. Downs, K. E. Gates, and A. Masters, “Exact simplification of support vector solutions,” J. Mach. Learn. Res., vol. 2, pp. 293–297, 2001.
[10] D. Nguyen and T. B. Ho, “An efficient method for simplifying support vector machines,” in Proc. 22nd Int. Conf. Mach. Learn., Bonn, Germany, 2005, pp. 617–624.
[11] R. Perfetti and E. Ricci, “Reduced complexity RBF classifiers with support vector centres and dynamic decay adjustment,” Neurocomputing, vol. 69, pp. 2446–2450, Oct. 2006.
[12] J. Palis and W. de Melo, Geometric Theory of Dynamical Systems: An Introduction. New York: Springer-Verlag, 1982.
[13] J. Lee and D. Lee, “An improved cluster labeling method for support vector clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 461–464, Mar. 2005.
[14] J. Nocedal and S. J. Wright, Numerical Optimization. New York: Springer-Verlag, 1999.
[15] C. A. Micchelli, “Algebraic aspects of interpolation,” in Proc. Symp. Appl. Math., 1986, pp. 81–102.


[16] Y. Liu and Y. F. Zheng, “One-against-all multi-class SVM classification using reliability measures,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2005, pp. 849–854.
[17] D. Lee and J. Lee, “Domain described support vector classifier for multi-classification problems,” Pattern Recognit., vol. 40, pp. 41–51, 2007.
[18] D. Newman, S. Hettich, C. Blake, and C. Merz, UCI Repository of Machine Learning Databases, 1998 [Online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html
[19] O. Chapelle and A. Zien, “Semi-supervised classification by low density separation,” in Proc. 10th Int. Workshop Artif. Intell. Statist., 2005, pp. 57–64.
[20] Y. LeCun, L. D. Jackel, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik, “Learning algorithms for classification: A comparison on handwritten digit recognition,” Neural Netw., pp. 261–276, 1995.
[21] J. Lee and D. Lee, “Dynamic characterization of cluster structures for robust and inductive support vector clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 461–464, Nov. 2006.
[22] D. Lee and J. Lee, “Equilibrium-based support vector machine for semi-supervised classification,” IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 578–583, Feb. 2007.
[23] H.-C. Kim and J. Lee, “Clustering based on Gaussian processes,” Neural Comput., vol. 19, no. 11, pp. 3088–3107, 2007.

Trace Ratio Problem Revisited

Yangqing Jia, Feiping Nie, and Changshui Zhang

Abstract—Dimensionality reduction is an important issue in many machine learning and pattern recognition applications, and the trace ratio (TR) problem is an optimization problem involved in many dimensionality reduction algorithms. Conventionally, the solution is approximated via generalized eigenvalue decomposition due to the difficulty of the original problem. However, prior works have indicated that it is more reasonable to solve it directly than via the conventional way. In this brief, we give a theoretical overview of the global optimum solution to the TR problem via the equivalent trace difference problem. Eigenvalue perturbation theory is introduced to derive an efficient algorithm based on the Newton–Raphson method. Theoretical results on the convergence and efficiency of our algorithm compared with prior literature are presented, and are further supported by extensive empirical results.

Index Terms—Dimensionality reduction, trace ratio (TR), eigenvalue perturbation, Newton–Raphson method.

Manuscript received January 08, 2008; revised April 17, 2008, August 18, 2008, and December 12, 2008; accepted February 07, 2009. First published March 16, 2009; current version published April 03, 2009. This work was supported by the National Natural Science Foundation of China under Grants 60835002 and 60721003.
The authors are with the State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2009.2015760

I. INTRODUCTION

Many machine learning and pattern recognition applications involve processing data in a high-dimensional space. For computational time, storage, de-noising, and other considerations, we often reduce the dimensionality of such data in order to learn a model efficiently.


Also, in many cases, it has been found that the data have low-dimensional structure, such as a manifold structure, so we are often interested in finding a reasonable low-dimensional representation of the data. There are a number of supervised and unsupervised dimensionality reduction algorithms, such as linear discriminant analysis (LDA), kernelized LDA, marginal Fisher analysis, principal component analysis (PCA), ISOMAP, locally linear embedding (LLE), etc. Many of these algorithms can be formulated into a fundamental optimization problem called the trace ratio (TR) problem, which involves searching for a transform matrix W that maximizes the trace ratio Tr[W^T S_p W]/Tr[W^T S_l W], where S_p and S_l are method-related positive-semidefinite matrices, under the constraint that the columns of W are unitary and orthogonal. However, the TR problem does not have a closed-form global optimum solution. Thus, it is conventionally approximated by solving the ratio trace problem max_W Tr[(W^T S_l W)^{-1}(W^T S_p W)], which is essentially different and deviates from the original optimal value. Previous works, such as [1], have shown that the optimum solution to the original TR problem is superior in performance to the diverted one. There have been some attempts to find the global optimum solution: Guo et al. [2] pointed out that the original TR problem can be converted into a trace difference problem and proposed a heuristic bisection method to find the solution. Further, Wang et al. [1] proposed an iterative algorithm called iterative trace ratio (ITR) that was empirically more efficient. However, apart from the convergence of these methods, much of their theoretical nature has not been discussed. In this brief, we aim to give a theoretical view of the TR/trace difference problems by introducing eigenvalue perturbation theory into the analysis. We then propose a new method that is both theoretically and empirically shown to be more efficient than the previous ones.

This brief is organized as follows. Section II introduces the TR/trace difference problems and discusses their relationship. Section III explores the properties of the trace difference function, introduces the perturbation theory, and proposes a new method; the convergence of the method is also provided. In Section IV, we discuss the theoretical explanation of previous algorithms and show the superiority of our method. Experiments on object recognition databases are presented in Section V. Finally, Section VI concludes the brief.

II. THE TRACE RATIO PROBLEM

A. Trace Ratio, Ratio Trace, and Trace Difference

We introduce the TR problem from the notion of linear discriminant analysis (LDA). Given a set of training data points X = {x_i}_{i=1}^{n} ⊂ ℝ^m and the corresponding labels Y = {y_i}_{i=1}^{n}, where y_i ∈ {1, ..., c} is the label of the data point x_i, LDA tries to find a low-dimensional representation in ℝ^d via a linear transformation x' = W^T x (where W is an m × d matrix) that maximizes the between-class scatter and minimizes the within-class scatter at the same time:

W^* = \arg\max_{W} \frac{\sum_{i=1}^{c} (n_i/n)\, \|W^{\mathsf T} m_i - W^{\mathsf T} m\|^2}{\sum_{i=1}^{n} (1/n)\, \|W^{\mathsf T} x_i - W^{\mathsf T} m_{y_i}\|^2}    (1)

where m_i and n_i are, respectively, the mean and the number of the data points belonging to the ith class, m is the mean of all the data points, and n is the total number of data points. By defining the between-class covariance matrix S_b = \sum_{i=1}^{c} (n_i/n)(m_i - m)(m_i - m)^{\mathsf T} and the within-class covariance matrix S_w = (1/n) \sum_{i=1}^{n} (x_i - m_{y_i})(x_i - m_{y_i})^{\mathsf T}, problem (1) is equivalent to the following form:

W^* = \arg\max_{W} \frac{\mathrm{Tr}[W^{\mathsf T} S_b W]}{\mathrm{Tr}[W^{\mathsf T} S_w W]}    (2)
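As an illustration of the quantities involved (ours, not from the brief), the following sketch builds S_b and S_w from labeled data and evaluates the trace ratio objective of (2) for a candidate W:

```python
import numpy as np

def scatter_matrices(X, y):
    # X: n x m data matrix; y: integer class labels.
    n, m_dim = X.shape
    mean_all = X.mean(axis=0)
    S_b = np.zeros((m_dim, m_dim))
    S_w = np.zeros((m_dim, m_dim))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_b += (len(Xc) / n) * np.outer(mc - mean_all, mc - mean_all)
        S_w += (Xc - mc).T @ (Xc - mc) / n
    return S_b, S_w

def trace_ratio(W, S_b, S_w):
    # Objective of (2): Tr[W^T S_b W] / Tr[W^T S_w W].
    return np.trace(W.T @ S_b @ W) / np.trace(W.T @ S_w @ W)
```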
