
Determination of Global Minima of Some Common Validation Functions in Support Vector Machine

Jian-Bo Yang, *Chong-Jin Ong

Abstract—Tuning of the regularization parameter, C, is a well-known process in the implementation of a Support Vector Machine classifier. Such a tuning process uses an appropriate validation function whose value, evaluated over a validation set, has to be optimized for the determination of the optimal C. Unfortunately, most common validation functions are not smooth functions of C. This paper presents a method for obtaining the global optimal solution of these non-smooth validation functions. The method is guaranteed to find the global optimum and relies on the regularization solution path of SVM over a range of C values. When the solution path is available, the computation needed is minimal.

Index Terms—Support vector machine (SVM), model selection, regularization path, tuning of regularization parameter.

I. INTRODUCTION

The procedure for tuning the regularization parameter, C, is a well-known problem in the study of the Support Vector Machine (SVM) classifier. It typically involves a validation set in the form of a separate and unseen data set or of one fold in an n-fold cross-validation process. It also uses an appropriate validation function whose value has to be optimized over the validation set. The corresponding value of C is then chosen as the optimal C. In the standard binary SVM classifier, the common validation functions are the error rate, the weighted or balanced error rate, precision, recall, or variations thereof. As these validation functions are not smooth functions of C, the resulting optimization problem is not suitable for efficient optimization routines. Consequently, approximations of the validation functions have been proposed [1, 2, 3, 4, 5]. Chapelle et al. [3] suggest several measures for such a purpose using the training data. These include various bounds on the generalization error, like the Radius-Margin bound, the span bound and others. Empirical evaluations of several measures have also been reported [6]. Another popular choice is the sigmoidal approximation of the output function of SVM [4]. These approximations are used for the tuning process as they are smooth functions of C (and of other parameters, like the kernel parameter) and facilitate numerical determination of the optimal C via standard gradient-based optimization algorithms. Nevertheless, as approximations, their connection to the true validation function is not direct. They are also known to have multiple local stationary points, making the determination of the global optimum difficult for gradient-based algorithms.

J. B. Yang and *C. J. Ong are with the Department of Mechanical Engineering, National University of Singapore, Singapore 117576 (fax: +65 67791459; Email: [email protected]; [email protected]).

This paper does not propose any new validation function. It assumes that the current choices of validation function (error rate, balanced error rate, precision, recall and others) lead to good generalization performance. It differs from previous works in that it determines the globally optimal C without making approximations of the validation function. The method proposed is suitable for SVM problems with polynomial or linear kernels. It can also be used for problems with a Gaussian kernel when the kernel parameter is fixed. The procedure is quite efficient even for moderately large data sets. This is possible because of the availability of the solution path of SVM over a range of C values [7, 8]. The rest of this paper is arranged as follows. Section II provides a brief review of the needed results on SVM classification and the regularization solution path. It also serves to set up the notation needed for subsequent development. Section III presents the main algorithm for determining the global optimum of the validation function. Section IV provides results of numerical experiments with the proposed algorithm and a comparison with several standard approaches. The paper ends with conclusions in Section V.

II. REVIEW OF PAST WORKS

Suppose a training set D := {(x_i, y_i) : i ∈ I_D} is given, where x_i ∈ R^d is the feature vector and y_i ∈ {−1, +1} is the corresponding label. For notational convenience, |S| refers to the cardinality of the set S. Hence, |D| = |I_D|. The standard two-class SVM classification primal problem (PP) is given by:

  min_{w,b,ζ}  (1/2) w·w + C Σ_{i∈I_D} ζ_i                            (1)
  s.t.  y_i (w·z_i + b) ≥ 1 − ζ_i,  ζ_i ≥ 0,  ∀ i ∈ I_D               (2)

where C > 0 is the regularization parameter, z_i := φ(x_i) is a vector in the high-dimensional Hilbert space, H, mapped into by the function φ : R^d → H, with w and b being the normal vector and the bias of the separating hyperplane H := {z | w·z + b = 0} respectively. The dual problem (DP) is given by

  min_α  (1/2) Σ_{i∈I_D} Σ_{j∈I_D} α_i α_j y_i y_j z_i·z_j − Σ_{i∈I_D} α_i   (3)
  s.t.  0 ≤ α_i ≤ C,  ∀ i ∈ I_D                                       (4)
        Σ_{i∈I_D} α_i y_i = 0                                         (5)

where α_i is the Lagrange multiplier for each inequality in (2). The output classifier function of SVM is

  f(x) = Σ_{i∈I_D} α_i y_i φ(x_i)·φ(x) + b                            (6)

where α_i refers to the optimal solution obtained from solving DP. Since C is a parameter in PP, the solution of DP in the form of {α_i : i ∈ I_D} and b are all functions of C. It is possible to numerically determine these solutions for the entire range of C, resulting in a regularization solution path of SVM. Works in this direction are given by Hastie et al. [7] and Ong et al. [8]. Hastie et al. [7] provide the framework for the approach following techniques from parametric programming, while Ong et al. [8] use a different formulation to improve on the reliability of the algorithm. Among others, Ong et al.'s approach takes into consideration numerical problems that can arise in a data set having nominal features, duplicate points, and/or linearly dependent points in the kernel space. Detailed information on the approach can be found in [7] and [8]. The rest of this section provides a summary of the key results of [8] needed in the sequel. To facilitate the discussion, the notations used in [7] and [8] are adopted:

  λ := C⁻¹,  α_0(λ) := b(λ)                                           (7)
  α̂_i(λ) := λ α_i(λ),  ∀ i ∈ I_D ∪ {0}                                (8)

where the dependence of α̂_i and α_i on λ (equivalently, C) is shown explicitly. The solution of DP at any specific value of λ consists of the optimal α̂_i(λ), i ∈ I_D ∪ {0}. Because of (4) and (8), α̂_i(λ) takes values between 0 and 1 for all i ∈ I_D. Hence, it is convenient to introduce the following mutually exclusive sets

  R(λ) := {i ∈ I_D : α̂_i(λ) = 0},  L(λ) := {i ∈ I_D : α̂_i(λ) = 1}  and  E(λ) := {i ∈ I_D : 0 < α̂_i(λ) < 1}

with the property that R(λ) ∪ L(λ) ∪ E(λ) = I_D at every λ.

The algorithm in [8], known as the Improved SVM Path (ISVMP), starts with a user-defined range of λ, (λ̲, λ̄), over which the SVM solution path is needed. Typically, (λ̲, λ̄) is a large interval that covers the range of interest. The output of ISVMP consists of a set of critical values of λ in

  Λ := {λ^0, · · · , λ^ℓmax}                                          (9)

with λ^0 := λ̄, λ^ℓmax := λ̲, λ^ℓ > λ^{ℓ+1}, and the corresponding

  {α̂_i(λ^ℓ) : i ∈ I_D ∪ {0}}  for every λ^ℓ ∈ Λ.                     (10)

Each critical λ value corresponds to a qualitative change in the SVM solution: elements in R(λ), L(λ) or E(λ) change when λ crosses over λ^ℓ. More exactly, each λ^ℓ ∈ Λ corresponds to the occurrence of one of the following events:
  • an index i ∈ E(λ^{ℓ+}) moves to L(λ^ℓ) or R(λ^ℓ),
  • an index i ∈ L(λ^{ℓ+}) moves to E(λ^ℓ),
  • an index i ∈ R(λ^{ℓ+}) moves to E(λ^ℓ),
where λ^{ℓ+} refers to a value of λ slightly larger than λ^ℓ. The sets given by (9) and (10) fully characterize the solution path of SVM. For λ such that λ^{ℓ+1} < λ ≤ λ^ℓ, α̂_i(λ) for any i ∈ I_D ∪ {0} can be found by interpolation using

  α̂_i(λ) := [(λ^{ℓ+1} − λ)/(λ^{ℓ+1} − λ^ℓ)] α̂_i(λ^ℓ) + [(λ − λ^ℓ)/(λ^{ℓ+1} − λ^ℓ)] α̂_i(λ^{ℓ+1}).   (11)

III. FINDING THE GLOBAL OPTIMAL SOLUTION OF THE VALIDATION FUNCTION

Assume that the SVM solution path has been obtained using the training set D, so that (9) and (10) are available. Consider a given validation set denoted by V := {(x_i, y_i) : i ∈ I_V}. The output function of (6) at a specific value of λ can be expressed as

  f(x, λ) = (1/λ) ( Σ_{i∈I_D} α̂_i(λ) y_i z_i·z + α̂_0(λ) ).           (12)

The tuning process involves finding the optimal λ value of a validation function on V, which requires frequent evaluation of f(x_j, λ) for j ∈ I_V. For convenience, define

  h_j(λ) := λ f(x_j, λ) = Σ_{i∈I_D∪{0}} α̂_i(λ) g_ij                  (13)
          = Σ_{i∈L(λ)} g_ij + Σ_{i∈E(λ)∪{0}} α̂_i(λ) g_ij             (14)

where g_ij := y_i z_i·z_j for any (i, j) ∈ I_D × I_V and g_0j := 1 for all j ∈ I_V. Equation (14) follows from (13) because α̂_i(λ) = 0 and 1 for i ∈ R(λ) and L(λ) respectively. Since λ > 0, h_j(λ) and f(x_j, λ) have the same sign, and the predicted output class of x_j ∈ V is

  ỹ_j(λ) := sign(h_j(λ)) = +1 if h_j(λ) ≥ 0, and −1 if h_j(λ) < 0.

The proposed approach is applicable to the various validation functions mentioned in Section I. However, the steps involved are best illustrated using one choice of validation function. Extensions of the approach to other validation functions and to a cross-validation set are discussed in Remarks 1 and 2. Our choice corresponds to probably the most common validation function, the error rate function, given by

  E(λ) = (1/(2|V|)) Σ_{j∈I_V} |y_j − ỹ_j(λ)|,                         (15)

which measures the percentage of incorrect predictions.

The proposed approach relies on the following facts:
(a) E(λ) is a piecewise-constant function of λ and changes value only when at least one ỹ_j(λ) changes value.
(b) ỹ_j(λ) changes value only when h_j(λ) crosses the zero value, either from positive to negative or vice versa.
(c) h_j(λ) depends affinely on λ for λ^ℓ ≥ λ > λ^{ℓ+1}, following (11) and (14).

From (a) and (b), an important aspect of finding the global optimum of E(λ) is to find the values of λ at which h_j(λ) crosses zero. For this purpose, consider the values of α̂_i(λ) and h_j(λ) between λ^ℓ and λ^{ℓ+1}. Figures 1(a) and 1(b) show the possible plots of α̂_i(λ) and h_j(λ) as functions of λ in this interval. For a change in the value of E(λ), it follows from (b) that at least one h_j(λ), j ∈ I_V, must have a zero-crossover. This also means that h_j(λ) is of Type 3 or 4 in Fig. 1(b). Hence, a point j causes a change in E(λ) if and only if h_j(λ^ℓ) and h_j(λ^{ℓ+1}) have different signs. Let the collection of such points be

  I_S^ℓ = {j : h_j(λ^ℓ) · h_j(λ^{ℓ+1}) < 0, j ∈ I_V}.                 (16)
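As a concrete illustration of facts (a)–(c) and of the set defined in (16), the following Python sketch detects, for a single interval (λ^{ℓ+1}, λ^ℓ], which validation points change predicted label, and locates the corresponding zero-crossovers by exploiting the affinity of h_j(λ) noted in (c). The function names and the toy numbers are illustrative only; the h-values at the two breakpoints are assumed to have been computed from the solution path beforehand.

```python
import numpy as np

def sign_change_set(h_l, h_l1):
    """Indices j with h_j(lam^l) * h_j(lam^{l+1}) < 0, i.e. the set I_S^l of (16)."""
    return np.nonzero(np.asarray(h_l) * np.asarray(h_l1) < 0.0)[0]

def crossover_lambdas(lam_l, lam_l1, h_l, h_l1, idx):
    """Zero-crossing of h_j(lam) on (lam^{l+1}, lam^l], one value per j in idx.

    By fact (c), h_j is affine in lam on the interval, so its root follows
    from linear interpolation between (lam_l, h_l[j]) and (lam_l1, h_l1[j]).
    """
    h_l, h_l1 = np.asarray(h_l, dtype=float), np.asarray(h_l1, dtype=float)
    return (lam_l1 * h_l[idx] - lam_l * h_l1[idx]) / (h_l[idx] - h_l1[idx])

# a toy interval lam^{l+1} < lam <= lam^l with |I_V| = 4 validation points
lam_l, lam_l1 = 2.0, 1.0
h_l = np.array([0.5, -0.25, 1.0, -1.0])    # h_j at lam^l
h_l1 = np.array([-0.5, 0.25, 2.0, -3.0])   # h_j at lam^{l+1}

idx = sign_change_set(h_l, h_l1)           # points 0 and 1 flip predicted label
lams = crossover_lambdas(lam_l, lam_l1, h_l, h_l1, idx)
print(idx.tolist(), lams.tolist())  # [0, 1] [1.5, 1.5]
```

Points 2 and 3 keep the same sign of h_j at both breakpoints and therefore do not affect E(λ) inside the interval, which is exactly why (16) confines the bookkeeping to the sign-changing indices.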


TABLE I
PSEUDO CODE OF THE PROPOSED ALGORITHM

  Input: λ̲, λ̄, ℓmax, Λ, D, V and {α̂_i(λ) : i ∈ I_D ∪ {0}, λ ∈ Λ}
  Output: E* and λ*
  1. Initialization:
     Let g_0j = 1, ∀ j ∈ I_V and λ^0 = λ̄.
     Compute: g_ij = y_i z_i·z_j, ∀ i ∈ I_D, j ∈ I_V;
              h_j^ℓ using (13), ∀ ℓ = 0, 1, · · · , ℓmax and ∀ j ∈ I_V;
              E(λ^0) from (15).
     Let E* = E(λ^0), λ* = λ^0 and ℓ = 0.
  2. Main loop: While ℓ < ℓmax,
     a. Read in λ^{ℓ+1} and {h_j^{ℓ+1} : j ∈ I_V}.
     b. Compute: h_j^{ℓ+1} · h_j^ℓ, ∀ j ∈ I_V, and form I_S^ℓ using (16);
        λ_j^{ℓ*} using (18), ∀ j ∈ I_S^ℓ, and form I_λ^ℓ of (19).
     c. For each i_m ∈ I_λ^ℓ starting from i_1,
        Compute E(λ_{i_m}^{ℓ*}) using (20).
        If E(λ_{i_m}^{ℓ*}) < E*, then let E* = E(λ_{i_m}^{ℓ*}) and λ* = λ_{i_m}^{ℓ*}.
     d. Let ℓ = ℓ + 1.
     end

Fig. 1. (a) Typical values of α̂_i(λ), i ∈ E(λ^ℓ), for λ^{ℓ+1} < λ ≤ λ^ℓ. (b) Typical values of h_j(λ) for λ^{ℓ+1} < λ ≤ λ^ℓ. Points A and B refer to two possible values of h_j(λ^ℓ), positive and negative.

From (c), a convenient representation of h_j(λ) is

  h_j(λ) = [(λ^{ℓ+1} − λ)/(λ^{ℓ+1} − λ^ℓ)] h_j^ℓ + [(λ − λ^ℓ)/(λ^{ℓ+1} − λ^ℓ)] h_j^{ℓ+1},  λ^{ℓ+1} < λ ≤ λ^ℓ   (17)

where h_j^ℓ := h_j(λ^ℓ). Using this expression, the zero-crossover of h_j(λ) for λ^{ℓ+1} < λ ≤ λ^ℓ happens at

  λ_j^{ℓ*} = (λ^{ℓ+1} h_j^ℓ − λ^ℓ h_j^{ℓ+1}) / (h_j^ℓ − h_j^{ℓ+1}),  ∀ j ∈ I_S^ℓ.                         (18)

Let these indices of λ_j^{ℓ*} be collected into an ordered set

  I_λ^ℓ = {i_1, i_2, · · · , i_{|I_S^ℓ|}}                                                                   (19)

such that λ_{i_1}^{ℓ*} ≥ λ_{i_2}^{ℓ*} ≥ · · · ≥ λ_{i_{|I_S^ℓ|}}^{ℓ*}.

With (18), it is possible to update E(λ) when λ crosses λ_j^{ℓ*}. To see this, suppose the value of E(λ^ℓ) is known; it follows from (15) that

  E(λ) = (1/(2|V|)) Σ_{j∈I_S^ℓ} |y_j − ỹ_j(λ)| + constant,  for λ^ℓ ≥ λ > λ^{ℓ+1}.

Let λ_{i_m}^{ℓ*+}, i_m ∈ I_λ^ℓ, be the value of λ slightly larger than λ_{i_m}^{ℓ*}. Then

  E(λ_{i_m}^{ℓ*}) = E(λ_{i_m}^{ℓ*+}) + (1/(2|V|)) { |y_{i_m} − ỹ_{i_m}(λ_{i_m}^{ℓ*})| − |y_{i_m} − ỹ_{i_m}(λ_{i_m}^{ℓ*+})| }.

Since E(λ) is a piecewise-constant function, E(λ_{i_m}^{ℓ*+}) is constant for all λ such that λ_{i_{m−1}}^{ℓ*} ≥ λ ≥ λ_{i_m}^{ℓ*} with i_m, i_{m−1} ∈ I_λ^ℓ. Hence the above can also be written as

  E(λ_{i_m}^{ℓ*}) = E(λ_{i_{m−1}}^{ℓ*}) + 1/|V|,  if y_{i_m} = ỹ_{i_m}(λ_{i_{m−1}}^{ℓ*}),
  E(λ_{i_m}^{ℓ*}) = E(λ_{i_{m−1}}^{ℓ*}) − 1/|V|,  otherwise                                                  (20)

for m = 1, · · · , |I_S^ℓ|, with λ_{i_0} := λ^ℓ. Using (18), (19), (20), (9) and (10), E(λ) can be computed for all λ̲ ≤ λ ≤ λ̄.

It is now possible to state the pseudo code for the overall algorithm, given in Table I. The algorithm assumes that the solution path in the form of (9) and (10) is available for λ̲ to λ̄. The output is the optimal λ, λ*, and the corresponding E* := E(λ*).

Remark 1: The above exposition is for the validation function given by (15). This validation function can also be expressed as E(λ) = (1/(2|V|)) Σ_{j∈I_V} max{0, 1 − y_j ỹ_j(λ)}, which is related to the hinge loss function in SVM. The above development is also applicable, with minor modifications, when E is given by
  • the Weighted Error rate, with E(λ) = [1/(2(n_+ + ηn_−))] [ Σ_{j∈I_V^+} |y_j − ỹ_j(λ)| + η Σ_{j∈I_V^−} |y_j − ỹ_j(λ)| ] for some η > 0, where n_+ (n_−) is the total number of validation samples with y = +1 (y = −1) and I_V^+ (I_V^−) is the subset of indices in I_V with y = +1 (y = −1). The Weighted Error rate becomes the Balanced Error rate when η = n_+/n_−.
  • the Precision (percentage of positive predictions that are correct), with E(λ) = 1 − [1/(2N_+(λ))] Σ_{j∈I_V^−} |y_j − ỹ_j(λ)|, where N_+(λ) is the total number of j with ỹ_j(λ) = 1.
  • the Recall (percentage of positive validation examples that are correctly predicted), with E(λ) = [1/(2n_+)] Σ_{j∈I_V^+} (2 − |y_j − ỹ_j(λ)|).
  • the F measure (harmonic mean of precision and recall), with E(λ) = [1/(n_+ + N_+(λ))] Σ_{j∈I_V^+} (2 − |y_j − ỹ_j(λ)|).
It is quite easy to see that these functions change their values whenever there is a zero-crossover of h_j.

Remark 2: In the event that V is one fold of an n-fold data set used in a cross-validation process, a few changes are needed. More exactly, there is a regularization solution path for each holdout fold, obtained using the (n − 1) remaining folds as D. The procedures to compute I_S^ℓ and I_λ^ℓ for each holdout fold are exactly the same as those given by (16) and (19). The only additional requirement is to evaluate E on a denser grid of λ in order to find its global optimal solution. Let Λ̄^k := {λ_{i_m}^{ℓ*} : i_m ∈ I_λ^ℓ, ℓ = 0, 1, · · · , ℓmax − 1, for the k-th holdout fold} and Λ̄ := ∪_{k=1,··· ,n} Λ̄^k, so that Λ̄ contains the λ values of all zero-crossovers of all holdout folds. To find the global optimum, the cross-validation function, E(λ) = E^1(λ) + · · · + E^n(λ), has to be evaluated for all λ ∈ Λ̄. The evaluation of E^k(λ) for λ ∈ Λ̄^k is given by (20). To evaluate E^k(λ) over Λ̄ is trivial since E^k(λ) is a piecewise-constant function and changes value only at λ ∈ Λ̄^k. Of course, the final SVM model is the one that is obtained using D as the training data and with



λ obtained by the above procedure.

IV. NUMERICAL EXPERIMENTS AND DISCUSSIONS

For easy reference, the proposed method is termed GO, the Global Optimal approach. This section compares GO with two standard tuning processes: the grid search method (GRID) and the gradient-based method (GRAD). The GRID method computes E(λ) over a grid of λ values and chooses the minimum among them. The GRAD method works only on smooth validation functions and requires an expression for the gradient of the smooth validation function with respect to λ. For this reason, the approximation of E(λ) by a smooth function proposed by Keerthi et al. [4] is used. Details of this approximation are given in the Appendix. Following [4], the numerical routine used in GRAD is LBFGS [9]¹. In all experiments, the optimal λ is chosen from the range [2^{−8}, 2^9]. Three levels of resolution are used in GRID, 2^{−1}, 2^{−0.1} and 2^{−0.01}, and the corresponding methods are termed GRID-1, GRID-0.1 and GRID-0.01, respectively. Like most nonlinear programming methods, the LBFGS solution depends on the initial choice of λ. Our experiments use five different initial values, {100, 10, 1, 0.1, 0.01}, for each data set, and their results are indicated by GRAD-m where m is the initial value. In addition, the smooth validation function for GRAD is Ẽ, given by (21) in the Appendix.

For consistency in comparison, the time needed to compute the SVM solutions and the computations of h_j^ℓ, ∀ j ∈ I_V and ∀ ℓ ∈ Λ̄, is removed from all three methods. This means that the complete regularization solution path from λ̄ to λ̲ is run once and its solution, with h_j^ℓ, ∀ j ∈ I_V, ∀ ℓ ∈ Λ̄, is made available to all three methods. Such an approach eliminates the uncertainties associated with the SVM routines. Note that if this is not done, the SVM solution for the GRID method will have to be invoked 18–1800 times, while GRAD requires the SVM solutions depending on the number of intermediate λ used by the LBFGS algorithm. Of course, GO uses the entire regularization path while GRID and GRAD need SVM solutions at some selected values of λ. As an approximate guide, the timing needed for one SVM regularization path is about the same as that needed for several calls (2–8) to SVM solutions [7, 8] for most data sets.

Numerical experiments are done on an Intel Pentium D 3.0 GHz with 1.5 GB memory under the Linux operating system. The regularization solution path is obtained using the ISVMP [8] Matlab code (Matlab 2009) available from http://guppy.mpe.nus.edu.sg/∼mpeongcj/ongcj.html. The data sets and their characteristics are given in Table II and are obtained from [10] and http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/. For each data set, the experiments are conducted over 10 realizations. The ℓmax for the first realization is indicated in Table II. Each of the 10 realizations is created by random (stratified) sampling of the given set into Dtrn and Dtst in the ratio |Dtrn| : |Dtst| = 3 : 1. In each method, Dtrn is used in a 5-fold cross-validation procedure to determine the optimal C while Dtst is a test set for performance evaluation. For each realization, Dtrn is normalized to zero mean and unit standard deviation and its normalization parameters are then used to normalize Dtst. All experiments are done using the Linear kernel.

¹Downloadable from http://www.cs.toronto.edu/∼liam/software.shtml.

Fig. 2. Curves of cross-validation error rates (CVER) as functions of λ for data set svmguide3. Solid line: 5-fold CVER; dashed line: smooth 5-fold CVER; dash-dot line: CVER of fold 1; dotted line: smooth CVER of fold 1. The CVER functions for the other folds are omitted to prevent clutter. The optimal λ is 0.114, or log2(0.114) = −3.1329.

Table III shows the optimal λ* and the 5-fold cross-validation error, E*, obtained by each method on the first realization. Note that while Ẽ is the validation function of the GRAD method, the values shown in the table are those of E evaluated at λ*. Several observations are clear. First, the proposed method obtains the global minimal solution for all 14 data sets. The GRID-i methods do so for 71% to 100% of the data sets, while the GRAD-i methods do so for around 36% to 43% of the data sets. Second, there are data sets where the minimal E* is obtained at multiple values of λ. For these cases, GO always returns the largest value of λ*. This is not so for the GRAD methods. The larger value of λ* (or smaller value of C) is advantageous as it yields better generalization performance [11]. Third, there are many cases for which GRAD-m returns the initial λ value as the optimal. This is not too surprising since E(λ) is a piecewise-constant function with many ranges of λ having gradients that are very close to 0 (the termination condition for LBFGS). This situation is clearly depicted in Figure 2. The figure also shows that the 5-fold cross-validation error (solid line) is quite different from the smooth 5-fold error function (dashed line) obtained from Ẽ = (1/5) Σ_{i=1}^5 Ẽ_i. This discrepancy, we believe, is due to the choice of the parameters used in Ẽ (see (21)), which is less sensitive to variations of E(λ) at small values of λ. While GRID-0.01 can also obtain the global optimum of E for the data sets considered, there is no mechanism in it to ensure this performance for other data sets, unlike GO.

For generalization performance, the SVM classifiers with the λ* obtained by the various methods are evaluated on Dtst. Table IV shows these test error rates, E†, of the first realization.
It shows that GO yields the lowest test error rate among all methods for all data sets. The GRID-i methods do so for 86% to 100% of the data sets while the GRAD-i methods average around 57% to

79% of the data sets. There are some minor variations in the results for the other realizations. Table V shows the mean and standard deviation values of E† of all methods over the 10 realizations. Three methods, GO, GRID-0.1 and GRAD-1, have the lowest mean test error rate in 8 of the 14 data sets, and their performances are better than the others. It is also interesting to note that in the data sets heart, monk 1 and hillvalley, GO yields smaller standard deviations than the method with the lowest mean test error rate.

When the SVM solution path is available, the computation needed to obtain λ* is quite efficient. For each realization of the data sets, the computational time needed by GO to obtain λ* using the 5-fold cross-validation process ranges from 3 milliseconds to 8 seconds. These numbers are generally higher than those of GRID-m and GRAD-m. However, since GO is implemented in Matlab while GRID and GRAD are in C, comparison by CPU timing may not be meaningful. Another useful measure is the estimate of the computational complexity of the algorithm with respect to |I_V| and ℓmax. The main computations needed by the algorithm are those associated with (16), (18) and (20). These are proportional to |I_V|, |I_S^ℓ| and ℓmax. The determination of I_S^ℓ of (16) for ℓmax events is O(|I_V| · ℓmax). The computation of (18) and (20) depends on the size of |I_S^ℓ|. For this purpose, it is useful to know the distribution of |I_S^ℓ| over ℓ. Figure 3 shows the histogram (number of intervals) with increasing values of |I_S^ℓ| for the svmguide3 data set for the 5-fold cross-validation error. For each value of |I_S^ℓ|, the five bars shown correspond to the number of intervals for the 5 folds, with the first fold at the leftmost and the fifth fold at the rightmost. As shown, |I_S^ℓ| = 0 for more than 90% of all intervals. The histogram shown is typical of the other data sets and realizations. Hence, the computation of (18) and (20) is much smaller than that required for (16), which means that the computational complexity is O(|I_V| · ℓmax). The dependence of ℓmax on |D| varies greatly; see [8] for details.

TABLE II
CHARACTERISTICS OF DATA SETS USED IN THE EXPERIMENTS. VARIABLE ℓmax IS THE NUMBER OF EVENTS FOR THE 1ST REALIZATION OF EACH DATA SET USED FOR TEST ERROR RATE.

  Data set    |  |ID|  |  |IV|  |   d   |  ℓmax
  colon       |    47  |    15  |  2000 |     8
  leukemia    |    54  |    18  |  7129 |     1
  sonar       |   138  |    69  |    60 |   286
  heart       |   180  |    90  |    13 |   315
  ionosphere  |   234  |   117  |    33 |   534
  wbcd        |   455  |   228  |     9 |   495
  monk 1      |   370  |   186  |     6 |   596
  monk 2      |   400  |   201  |     6 |     1
  monk 3      |   369  |   185  |     6 |   884
  diabetes    |   512  |   256  |     8 |   407
  hillvalley  |   808  |   404  |   100 |  1021
  german      |   667  |   333  |    24 |   400
  svmguide3   |   856  |   428  |    22 |   861
  splice      |  2382  |   793  |    60 |  3637

Fig. 3. The histogram of intervals having various values of |I_S^ℓ| for the 5 folds of svmguide3 in the first realization. For each value of |I_S^ℓ|, the bars from left to right are those for folds GO-1 to GO-5 respectively. The sets Λ̄^k for k = 1 to 5 have sizes 630, 755, 727, 828 and 754 respectively.
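The merging of per-fold error curves described in Remark 2 reduces to summing piecewise-constant functions over the union of their breakpoint grids. A minimal sketch follows, with hypothetical fold curves that bear no connection to the actual experimental values; fold curves are represented by ascending breakpoints and the constant value held on each segment.

```python
import bisect

def eval_step(knots, vals, lam):
    """Evaluate a piecewise-constant curve E^k(lam).

    knots: ascending lambda values at which the curve may change;
    vals[i] is the value on [knots[i], knots[i+1]), and vals[-1]
    is the value for lam >= knots[-1].
    """
    i = bisect.bisect_right(knots, lam) - 1
    return vals[max(i, 0)]

def cv_error(folds, lam):
    """Cross-validation error E(lam) = E^1(lam) + ... + E^n(lam)."""
    return sum(eval_step(knots, vals, lam) for knots, vals in folds)

# two hypothetical fold curves (breakpoints and values chosen for illustration)
folds = [
    ([0.1, 0.5, 2.0], [0.25, 0.125, 0.25]),
    ([0.1, 1.0, 4.0], [0.25, 0.375, 0.125]),
]

# the union grid of all zero-crossover values, as in Remark 2
union = sorted({k for knots, _ in folds for k in knots})
curve = [cv_error(folds, lam) for lam in union]
print(union)  # [0.1, 0.5, 1.0, 2.0, 4.0]
print(curve)  # [0.5, 0.375, 0.5, 0.625, 0.375]
```

Because each E^k changes value only at its own breakpoints, evaluating the sum over the union grid costs one binary search per fold per grid point, which matches the "trivial" evaluation noted in Remark 2.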

V. CONCLUSIONS

This paper describes an approach to determine the globally optimal C values of common validation functions for the SVM classifier over a validation set or cross-validation set. This is possible because the SVM solution path over a range of C values can be computed. When the SVM solution path is available, the time needed by the approach is generally very short. In the case when there are multiple C values that attain the global optimum of the validation function, the smallest C value is returned by the approach.

APPENDIX

The gradient-based hyperparameter tuning method for SVM proposed by Keerthi et al. [4] requires a continuously differentiable function with respect to λ. Using our notations, the approximation proposed in [4] for the E(λ) function of (15) is

  Ẽ(λ) = 1 − (1/|I_V|) Σ_{j∈I_V} s_j                                 (21)

with s_j = 1/(1 + exp(−σ(λ) y_j h_j(λ))) and σ(λ) := 10/ρ(λ), where ρ²(λ) := (1/|I_V|) Σ_{i∈I_V} (h_i(λ) − h̄(λ))² and h̄(λ) = (1/|I_V|) Σ_{i∈I_V} h_i(λ). The expression of its gradient is

  dẼ(λ)/dλ = Σ_{j∈I_V} (∂Ẽ/∂s_j) [ (∂s_j/∂σ) ( Σ_{i∈I_V} (∂σ/∂h_i)(∂h_i/∂λ) ) + (∂s_j/∂h_j)(∂h_j/∂λ) ]

where ∂Ẽ/∂s_j = −1/|I_V|, ∂s_j/∂σ = s_j(1 − s_j) y_j h_j, ∂s_j/∂h_j = s_j(1 − s_j) σ(λ) y_j, ∂σ/∂h_i = −10(h_i(λ) − h̄(λ)) / (|I_V| ρ³(λ)), and ∂h_j/∂λ = (h_j^{ℓ+1} − h_j^ℓ)/(λ^{ℓ+1} − λ^ℓ). Note that these expressions are based on (17) and the development of this paper. In the case where the regularization solution path is not available, a different set of expressions is needed. In particular, ∂h_j/∂λ requires the inverse of an appropriate matrix obtained using the data points in E(λ) and constraint (5); see [4] for details.
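A direct transcription of (21) into Python might read as follows; the helper name is ours, and the labels and h-values are toy numbers used only to show that the sigmoidal surrogate tracks the true error rate.

```python
import numpy as np

def smooth_error(y, h):
    """Sigmoidal approximation E~(lam) of the error rate, per (21).

    y: labels in {-1, +1}; h: the values h_j(lam) over the validation set.
    """
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    rho = np.sqrt(np.mean((h - h.mean()) ** 2))  # rho(lam), spread of h over I_V
    sigma = 10.0 / rho                           # sigma(lam) := 10 / rho(lam)
    s = 1.0 / (1.0 + np.exp(-sigma * y * h))     # s_j -> 1 when y_j h_j(lam) >> 0
    return 1.0 - s.mean()

# toy values: points with y_j h_j < 0 (here the last two) are misclassified,
# so the true error rate E of (15) is 2/4 = 0.5
y = np.array([+1.0, -1.0, +1.0, -1.0])
h = np.array([2.0, -1.0, -0.5, 0.3])
print(round(smooth_error(y, h), 3))  # close to, but not exactly, 0.5
```

Unlike E(λ), this surrogate varies smoothly with the h_j values, which is what makes gradient-based routines such as LBFGS applicable to it.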

REFERENCES

[1] M.-W. Chang and C.-J. Lin, "Leave-one-out bounds for support vector regression model selection," Neural Computation, vol. 17, no. 5, pp. 1188–1222, 2005.
[2] K.-M. Chung, W.-C. Kao, C.-L. Sun, L.-L. Wang, and C.-J. Lin, "Radius margin bounds for support vector


TABLE III
OPTIMAL λ VALUE AND 5-FOLD CROSS-VALIDATION ERROR RATES FOR GO, GRID-i AND GRAD-i OF THE FIRST REALIZATION. EACH ENTRY GIVES λ* / E*.

Dataset    | GO            | GRID-1        | GRID-0.1      | GRID-0.01     | GRAD-100      | GRAD-10      | GRAD-1      | GRAD-0.1    | GRAD-0.01
colon      | 512.000/0.156 | 512.000/0.156 | 512.000/0.156 | 512.000/0.156 | 100.000/0.196 | 10.000/0.196 | 1.000/0.196 | 0.100/0.196 | 0.010/0.196
leukemia   | 512.000/0.034 | 512.000/0.034 | 512.000/0.034 | 512.000/0.034 | 100.000/0.034 | 10.000/0.034 | 1.000/0.034 | 0.100/0.034 | 0.010/0.034
sonar      | 87.762/0.000  | 64.000/0.000  | 84.449/0.000  | 87.427/0.000  | 26.481/0.000  | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
heart      | 83.464/0.134  | 64.000/0.134  | 78.793/0.134  | 83.286/0.134  | 95.183/0.144  | 9.763/0.144  | 1.184/0.143 | 0.100/0.143 | 0.356/0.143
ionosphere | 104.860/0.000 | 64.000/0.000  | 103.970/0.000 | 104.690/0.000 | 79.913/0.000  | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
wbcd       | 55.357/0.000  | 32.000/0.000  | 51.984/0.000  | 55.330/0.000  | 8.431/0.000   | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
monk 1     | 323.270/0.283 | 1.000/0.288   | 315.170/0.287 | 321.800/0.283 | 100.000/0.331 | 10.000/0.331 | 1.000/0.288 | 0.736/0.292 | 0.607/0.292
monk 2     | 512.000/0.348 | 512.000/0.348 | 512.000/0.348 | 512.000/0.348 | 100.000/0.348 | 10.000/0.348 | 1.000/0.348 | 0.100/0.348 | 0.010/0.348
monk 3     | 90.586/0.183  | 64.000/0.195  | 90.510/0.183  | 90.510/0.183  | 101.850/0.207 | 10.000/0.209 | 1.000/0.209 | 0.004/0.209 | 0.004/0.209
diabetes   | 64.735/0.217  | 64.000/0.217  | 64.000/0.217  | 64.445/0.217  | 48.731/0.219  | 10.143/0.227 | 1.000/0.236 | 0.100/0.234 | 0.010/0.236
hillvalley | 0.004/0.289   | 0.004/0.289   | 0.004/0.289   | 0.004/0.289   | 99.964/0.533  | 1.226/0.404  | 0.910/0.396 | 0.101/0.359 | 0.011/0.308
german     | 43.454/0.231  | 1.000/0.232   | 42.224/0.232  | 43.411/0.231  | 30.248/0.235  | 40.438/0.232 | 2.128/0.233 | 0.100/0.233 | 0.010/0.233
svmguide3  | 0.114/0.178   | 0.031/0.178   | 0.109/0.178   | 0.113/0.178   | 98.481/0.214  | 10.115/0.186 | 0.883/0.183 | 0.102/0.179 | 0.026/0.178
splice     | 195.480/0.153 | 256.000/0.156 | 194.010/0.154 | 195.360/0.153 | 100.000/0.157 | 10.093/0.160 | 1.000/0.159 | 0.100/0.161 | 0.010/0.162

TABLE IV
OPTIMAL λ VALUE AND TEST ERROR RATES FOR GO, GRID-i AND GRAD-i OF THE FIRST REALIZATION. EACH ENTRY GIVES λ* / E†.

Dataset    | GO            | GRID-1        | GRID-0.1      | GRID-0.01     | GRAD-100      | GRAD-10      | GRAD-1      | GRAD-0.1    | GRAD-0.01
colon      | 512.000/0.067 | 512.000/0.067 | 512.000/0.067 | 512.000/0.067 | 100.000/0.067 | 10.000/0.067 | 1.000/0.067 | 0.100/0.067 | 0.010/0.067
leukemia   | 512.000/0.000 | 512.000/0.000 | 512.000/0.000 | 512.000/0.000 | 100.000/0.000 | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
sonar      | 87.762/0.000  | 64.000/0.000  | 84.449/0.000  | 87.427/0.000  | 26.481/0.000  | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
heart      | 83.464/0.164  | 64.000/0.164  | 78.793/0.164  | 83.286/0.164  | 95.183/0.164  | 9.763/0.179  | 1.184/0.179 | 0.100/0.179 | 0.356/0.179
ionosphere | 104.860/0.000 | 64.000/0.000  | 103.970/0.000 | 104.690/0.000 | 79.913/0.000  | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
wbcd       | 55.357/0.000  | 32.000/0.000  | 51.984/0.000  | 55.330/0.000  | 8.431/0.000   | 10.000/0.000 | 1.000/0.000 | 0.100/0.000 | 0.010/0.000
monk 1     | 323.270/0.345 | 1.000/0.345   | 315.170/0.345 | 321.800/0.345 | 100.000/0.345 | 10.000/0.345 | 1.000/0.345 | 0.736/0.345 | 0.607/0.345
monk 2     | 512.000/0.327 | 512.000/0.327 | 512.000/0.327 | 512.000/0.327 | 100.000/0.327 | 10.000/0.327 | 1.000/0.327 | 0.100/0.327 | 0.010/0.327
monk 3     | 90.586/0.167  | 64.000/0.167  | 90.510/0.167  | 90.510/0.167  | 101.850/0.167 | 10.000/0.167 | 1.000/0.167 | 0.004/0.167 | 0.004/0.167
diabetes   | 64.735/0.266  | 64.000/0.266  | 64.000/0.266  | 64.445/0.266  | 48.731/0.266  | 10.143/0.271 | 1.000/0.271 | 0.100/0.276 | 0.010/0.276
hillvalley | 0.004/0.274   | 0.004/0.274   | 0.004/0.274   | 0.004/0.274   | 99.964/0.459  | 1.226/0.409  | 0.910/0.406 | 0.101/0.373 | 0.011/0.310
german     | 43.454/0.240  | 1.000/0.252   | 42.224/0.240  | 43.411/0.240  | 30.248/0.240  | 40.438/0.240 | 2.128/0.252 | 0.100/0.256 | 0.010/0.256
svmguide3  | 0.114/0.143   | 0.031/0.143   | 0.109/0.143   | 0.113/0.143   | 98.481/0.209  | 10.115/0.162 | 0.883/0.146 | 0.102/0.143 | 0.026/0.146
splice     | 195.480/0.174 | 256.000/0.180 | 194.010/0.175 | 195.360/0.174 | 100.000/0.179 | 10.093/0.177 | 1.000/0.174 | 0.100/0.178 | 0.010/0.178

TABLE V
MEAN AND STANDARD DEVIATIONS OF E† OF GO, GRID-i AND GRAD-i OVER THE 10 REALIZATIONS. EACH ENTRY GIVES MEAN / STD.

Dataset    | GO          | GRID-1      | GRID-0.1    | GRID-0.01   | GRAD-100    | GRAD-10     | GRAD-1      | GRAD-0.1    | GRAD-0.01
colon      | 0.133/0.070 | 0.133/0.070 | 0.133/0.070 | 0.133/0.070 | 0.140/0.080 | 0.140/0.080 | 0.140/0.080 | 0.140/0.080 | 0.140/0.080
leukemia   | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027 | 0.017/0.027
sonar      | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000
heart      | 0.158/0.029 | 0.160/0.036 | 0.158/0.032 | 0.157/0.030 | 0.149/0.046 | 0.158/0.035 | 0.155/0.039 | 0.164/0.025 | 0.164/0.023
ionosphere | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000
wbcd       | 0.001/0.002 | 0.001/0.002 | 0.001/0.002 | 0.001/0.002 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000
monk 1     | 0.325/0.032 | 0.323/0.034 | 0.325/0.032 | 0.325/0.032 | 0.331/0.030 | 0.331/0.030 | 0.323/0.034 | 0.323/0.034 | 0.323/0.034
monk 2     | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036 | 0.361/0.036
monk 3     | 0.191/0.047 | 0.194/0.044 | 0.188/0.046 | 0.188/0.046 | 0.199/0.043 | 0.192/0.033 | 0.198/0.146 | 0.194/0.044 | 0.194/0.044
diabetes   | 0.230/0.039 | 0.288/0.038 | 0.228/0.042 | 0.229/0.040 | 0.231/0.044 | 0.229/0.040 | 0.225/0.037 | 0.225/0.038 | 0.225/0.038
hillvalley | 0.311/0.065 | 0.299/0.070 | 0.299/0.072 | 0.311/0.065 | 0.389/0.120 | 0.316/0.081 | 0.361/0.070 | 0.313/0.078 | 0.317/0.074
german     | 0.249/0.032 | 0.254/0.035 | 0.252/0.036 | 0.250/0.035 | 0.250/0.032 | 0.249/0.032 | 0.252/0.036 | 0.252/0.037 | 0.252/0.037
svmguide3  | 0.203/0.096 | 0.204/0.098 | 0.203/0.096 | 0.203/0.096 | 0.209/0.096 | 0.207/0.105 | 0.208/0.089 | 0.205/0.101 | 0.208/0.105
splice     | 0.158/0.013 | 0.159/0.015 | 0.159/0.015 | 0.159/0.015 | 0.158/0.015 | 0.158/0.015 | 0.158/0.013 | 0.159/0.014 | 0.160/0.014

machines with the RBF kernel," Neural Computation, vol. 15, no. 11, pp. 2643–2681, 2003.
[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Machine Learning, vol. 46, pp. 131–159, 2002.
[4] S. S. Keerthi, V. Sindhwani, and O. Chapelle, "An efficient method for gradient-based adaptation of hyperparameters in SVM models," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007, pp. 673–680.
[5] S. Decherchi, S. Ridella, R. Zunino, P. Gastaldo, and D. Anguita, "Using unsupervised analysis to constrain generalization bounds for support vector classifiers," IEEE Transactions on Neural Networks, vol. 21, pp. 424–438, March 2010.
[6] K. Duan, S. S. Keerthi, and A. N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, vol. 51, pp. 41–59, 2003.
[7] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, "The entire regularization path for the support vector machine," Journal of Machine Learning Research, vol. 5, pp. 1391–1415, October 2004.
[8] C.-J. Ong, S.-Y. Shao, and J.-B. Yang, "An improved algorithm for the solution of the regularization path of SVM," IEEE Transactions on Neural Networks, vol. 21, no. 3, pp. 451–462, 2010.
[9] R. H. Byrd, P. Lu, and J. Nocedal, "A limited-memory algorithm for bound-constrained optimization," SIAM Journal on Scientific and Statistical Computing, vol. 16, no. 5, pp. 1190–1208, 1995.
[10] A. Asuncion and D. J. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://www.ics.uci.edu/∼mlearn/MLRepository.html
[11] V. N. Vapnik, Statistical Learning Theory. Wiley-Interscience, September 1998.
