Learning with Box Kernels
Stefano Melacci and Marco Gori
Department of Information Engineering, University of Siena, 53100 Siena, Italy
{mela,marco}@dii.unisi.it
Abstract. Supervised examples and prior knowledge expressed by propositions have been profitably integrated in kernel machines so as to improve the performance of classifiers in different real-world contexts. In this paper, using arguments from variational calculus, a novel representer theorem is proposed which optimally solves a more general form of the associated regularization problem. In particular, it is shown that the solution is based on box kernels, which arise from combining classic kernels with the constraints expressed in terms of propositions. The effectiveness of this new representation is evaluated on real-world problems of medical diagnosis and image categorization.

Key words: Box kernels, Constrained variational calculus, Kernel machines, Propositional rules.
1 Introduction
The classic supervised learning framework is based on a collection of $\ell$ labeled points, $\mathcal{L} = \{(x_i, y_i),\ i = 1, \ldots, \ell\}$, where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. This paper focuses on supervised learning from $\ell_{\mathcal{X}}$ labeled regions of the input space, $\mathcal{L}_{\mathcal{X}} = \{(\mathcal{X}_j, y_j),\ j = 1, \ldots, \ell_{\mathcal{X}}\}$, where $\mathcal{X}_j \in 2^{\mathcal{X}}$ and $y_j \in \{-1, 1\}$. Of course, these regions can degenerate to single points, and it is convenient to think of the available supervision without distinguishing between the supervised entities, so that one deals with $\ell_t := \ell + \ell_{\mathcal{X}}$ labeled pairs. The case of multi-dimensional intervals,

$$\mathcal{X}_j = \{x \in \mathbb{R}^d : x^z \in [a_j^z, b_j^z],\ z = 1, \ldots, d\}, \qquad (1)$$

where $a_j, b_j \in \mathbb{R}^d$ collect the lower and upper bounds, respectively, is the one which is most relevant in practice. The pair $(\mathcal{X}_j, y_j)$ formalizes the knowledge that a supervisor provides in terms of

$$\forall x \in \mathbb{R}^d, \quad \bigwedge_{z=1}^{d} (x^z \geq a_j^z) \wedge (x^z \leq b_j^z) \Rightarrow \mathrm{class}(y_j), \qquad (2)$$

so that we can interchangeably refer to it as a labeled box region or a propositional rule.¹

¹ While this can be thought of as a FOL formula, it is easy to see that the quantifier is absorbed in the involved variables and that we simply play with propositions.
This framework has been introduced in a number of papers and its potential impact in real-world applications has been analyzed in different contexts (see, e.g., [1] and the references therein). Most of the research in this field can be traced back to Fung et al. (2002) [2], who proposed to embed labeled (polyhedral) sets into Support Vector Machines (SVMs); the corresponding model was referred to as Knowledge-based SVM (KSVM) and it has been the subject of a number of significant related studies [3-6]. This paper proposes an in-depth revision of those studies, inspired by the approach to regularization networks of [7]. The problem of learning is properly re-formulated by the natural expression of supervision on sets, which results in the introduction of a loss function that fully involves them. Basically, any set $\mathcal{X}_j$ is associated with the characteristic function $c_{\mathcal{X}_j}(x)$, and its normalized form $\hat{c}_{\mathcal{X}_j}(x) := c_{\mathcal{X}_j}(x) / \int_{\mathcal{X}} c_{\mathcal{X}_j}(x)\,dx$ degenerates to the Dirac distribution $\delta(x - x_j)$ in the case in which $\mathcal{X}_j = \{x_j\}$. Interestingly, it is shown that the solution emerging from the regularized learning problem does not lead to the kernel expansion on the available data points, and the kernel is no longer the Green's function of the associated regularization operator (see [8] and [9], page 94). A new representer theorem is given, which indicates an expansion into two different kernels. The first one corresponds to the Green's function of the stabilizer, while the second one is a box kernel. Basically, the box kernel is the outcome of the chosen regularization operator and of the structure of the box region. When the region degenerates to a single point, the two kernel functions perfectly match. This goes beyond the discretization of the knowledge sets, which would make the problem rapidly intractable as the dimensionality of the input space increases. In addition, we provide an explicit expression of box kernels in the case of the regularization operator associated with the Gaussian kernel, but the proposed framework suggests extensions to other cases. The analysis clearly shows why the explicit expression becomes easy in the case of boxes, whereas for general sets this seems to be hard. However, most interesting applications suggest a scenario in which logic statements in the form of propositions help, which reduces sets to boxes. The experiments indicate that the proposed approach achieves state-of-the-art results, with clear improvements in some cases.
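As a purely illustrative aside (not taken from the original formulation), a rule of the form (2) is fully described by the pair of bound vectors $(a_j, b_j)$ and the label $y_j$; features that a rule does not mention can be bounded by the minimum and maximum values observed in the data, as is done in the experiments of Section 4. A minimal Python sketch, with hypothetical helper names, could look as follows:

import numpy as np

def rule_to_box(constraints, X_data, label):
    # Encode a propositional rule of the form (2) as a labeled box (a_j, b_j, y_j).
    # constraints: {feature index z: (lower bound a_j^z, upper bound b_j^z)};
    # features not mentioned by the rule are bounded by the min/max values
    # observed in the data matrix X_data (as done in the experiments of Section 4).
    a = X_data.min(axis=0).astype(float)
    b = X_data.max(axis=0).astype(float)
    for z, (low, high) in constraints.items():
        a[z], b[z] = low, high
    return a, b, label

# Illustrative use: a positive-class rule constraining features 1 and 5 of an
# 8-feature dataset (the thresholds below are arbitrary).
X_data = np.random.rand(100, 8) * 200.0
a, b, y = rule_to_box({1: (126.0, 200.0), 5: (30.0, 70.0)}, X_data, +1)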
2 Learning from Labeled Sets
We formulate the problem of learning from labeled sets and/or labeled points in a unique framework, simply considering that each point corresponds to a singleton. More formally, given a labeled set $\mathcal{X}_j$, the characteristic function $c_{\mathcal{X}_j}(x)$ associated with it is 1 when $x \in \mathcal{X}_j$, otherwise it is 0. If $vol(\mathcal{X}_j)$ is the measure of the set, $vol(\mathcal{X}_j) = \int_{\mathcal{X}} c_{\mathcal{X}_j}(x)\,dx$, the normalized characteristic function is $\hat{c}_{\mathcal{X}_j}(x) := c_{\mathcal{X}_j}(x)/vol(\mathcal{X}_j)$, and when the set degenerates to a single point $x_j$, then $\hat{c}_{\mathcal{X}_j}(x)$ is the Dirac delta $\delta(x - x_j)$. Following the popular framework for regularized function learning [7], we seek a function $f$ belonging to $\mathcal{F} = W^{k,p}$, the subset of $L^p$ whose functions admit derivatives up to some order $k$. We introduce the term $m_{\mathcal{X}_j}(f) := \int_{\mathcal{X}} f(x)\,\hat{c}_{\mathcal{X}_j}(x)\,dx$, that is, the average value of $f$ over $\mathcal{X}_j$. Of course, when $\mathcal{X}_j = \{x_j\}$ we get $m_{\mathcal{X}_j}(f) = f(x_j)$. The problem of learning from labeled sets can be formulated as the minimization of

$$R_m[f] := \sum_{h \in \mathbb{N}_{\ell_t}} V(y_h, m_{\mathcal{X}_h}(f)) + \lambda \|Pf\|^2, \qquad (3)$$
where $V \in C^1(\{-1,1\} \times \mathbb{R}, \mathbb{R}^+)$ is a convex loss function, $\mathbb{N}_m$ denotes the set of the first $m$ integers, and $\lambda > 0$ weights the effect of the regularization term. $P$ is a pseudo-differential operator which admits the adjoint $P^\star$, so that $\|Pf\|^2 = \langle Pf, Pf \rangle = \langle f, P^\star P f \rangle = \langle f, Lf \rangle$, where $L = P^\star P$. The unconstrained formulation of (3) allows the classifier to handle noisy supervisions, as required in real-world applications. Moreover, a positive scalar value can be associated with each term of the sum to differently weight the contribution of each labeled element.

Theorem 1. Let $\mathrm{Ker}\,L = \{0\}$, and let $g$ be the Green's function of $L$. Then $R_m[\cdot]$ admits the unique minimum

$$f^\star = \sum_{j \in \mathbb{N}_{\ell_t}} \alpha_j\, \beta(\mathcal{X}_j, x), \qquad (4)$$

where $\beta(\mathcal{X}_j, x) := \int_{\mathcal{X}} g(x, \varsigma)\,\hat{c}_{\mathcal{X}_j}(\varsigma)\,d\varsigma$, and the $\alpha_j$ are scalar values.
Proof: Any weak extreme of $R_m[f]$ satisfies the Euler-Lagrange equation

$$Lf(x) = -\frac{1}{\lambda} \sum_{j \in \mathbb{N}_{\ell_t}} V'_f(y_j, m_{\mathcal{X}_j}(f)) \cdot \hat{c}_{\mathcal{X}_j}(x), \qquad (5)$$

where $V'_f = \partial V / \partial f$. This comes straightforwardly from variational calculus ([10], page 16). Since $\mathrm{Ker}\,L = \{0\}$, the functional $\langle f, Lf \rangle$ is strictly convex which, considering that $V(y_h, \cdot)$ is also convex, leads us to conclude that any extreme of $R_m[\cdot]$ collapses to the unique minimum $f^\star$. Now, $\forall x \in \mathcal{X}$, let $g(x, \cdot) : Lg(x, \varsigma) = \delta(x - \varsigma)$ be the Green's function of $L$. Using again the hypothesis $\mathrm{Ker}\,L = \{0\}$, we can invert $L$, from which the thesis follows.

If we separate the contributions coming from points and sets, the above representer theorem can be re-written as $f^\star(x) = \sum_{i \in \mathbb{N}_{\ell}} \alpha_i\, g(x_i, x) + \sum_{j \in \mathbb{N}_{\ell_{\mathcal{X}}}} \alpha_j\, \beta(\mathcal{X}_j, x)$. Now, let us define

$$K(\mathcal{X}_i, \mathcal{X}_j) := \int_{\mathcal{X}} \beta(\mathcal{X}_i, x) \cdot \hat{c}_{\mathcal{X}_j}(x)\,dx. \qquad (6)$$
The following proposition gives insights into the cases in which either $\mathcal{X}_i$ or $\mathcal{X}_j$ degenerates to a point.

Proposition 1.

i. $K(\mathcal{X}_i, \{x_j\}) = \beta(\mathcal{X}_i, x_j)$, \qquad (7)

ii. $K(\{x_i\}, \{x_j\}) = g(x_i, x_j)$. \qquad (8)
Proof: i. If $\mathcal{X}_j = \{x_j\}$ then $\hat{c}_{\mathcal{X}_j}(x) = \delta(x - x_j)$ and the thesis follows from (6). ii. If, in addition to the above hypothesis, we also have $\mathcal{X}_i = \{x_i\}$, then $\hat{c}_{\mathcal{X}_i}(x) = \delta(x - x_i)$, which, again, yields the thesis when invoking (6).

The function $K$ makes it possible to devise an efficient algorithmic scheme, based on collapsing to a finite dimension the infinite-dimensional optimization problem of finding weak minima of $R_m[\cdot]$ (3). We now formally prove this aspect.

Theorem 2. When the hypotheses of Theorem 1 hold true, then $R_m[f^\star] = R_m(\alpha)$, where

$$R_m(\alpha) = \sum_{i \in \mathbb{N}_{\ell_t}} V\Big(y_i, \sum_{j \in \mathbb{N}_{\ell_t}} \alpha_j K(\mathcal{X}_j, \mathcal{X}_i)\Big) + \lambda \sum_{i,j \in \mathbb{N}_{\ell_t}} \alpha_i \alpha_j K(\mathcal{X}_i, \mathcal{X}_j). \qquad (9)$$
Proof: When plugging $f^\star$ expressed by (4) into (3) and using $Lg = \delta$, we get

$$R_m[f^\star] = \sum_{i \in \mathbb{N}_{\ell_t}} V\Big(y_i, \int_{\mathcal{X}} \sum_{j \in \mathbb{N}_{\ell_t}} \alpha_j\, \beta(\mathcal{X}_j, x)\,\hat{c}_{\mathcal{X}_i}(x)\,dx\Big) + \lambda\, \Big\langle \sum_{i \in \mathbb{N}_{\ell_t}} \alpha_i\, \beta(\mathcal{X}_i, x),\ L\Big(\sum_{j \in \mathbb{N}_{\ell_t}} \alpha_j\, \beta(\mathcal{X}_j, x)\Big)\Big\rangle$$
$$= \sum_{i \in \mathbb{N}_{\ell_t}} V\Big(y_i, \sum_{j \in \mathbb{N}_{\ell_t}} \alpha_j \int_{\mathcal{X}} \beta(\mathcal{X}_j, x)\,\hat{c}_{\mathcal{X}_i}(x)\,dx\Big) + \lambda \sum_{i,j \in \mathbb{N}_{\ell_t}} \alpha_i \alpha_j\, \langle \beta(\mathcal{X}_i, x), \hat{c}_{\mathcal{X}_j}(x)\rangle, \qquad (10)$$

and the thesis follows when applying definition (6).
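Theorem 2 is what makes the approach operational: once the Gram matrix of $K$ over the $\ell_t$ labeled entities is available, the coefficients $\alpha$ are obtained exactly as in any other kernel machine. As a minimal sketch (not taken from the paper), for the square loss $V(y, f) = (y - f)^2$ and a nonsingular Gram matrix, minimizing (9) reduces to a linear system; the code below assumes a callable box_kernel(Xi, Xj) implementing (6) is available.

import numpy as np

def fit(entities, y, box_kernel, lam):
    # Gram matrix of K over the labeled entities (boxes and/or singleton points).
    n = len(entities)
    K = np.array([[box_kernel(Xi, Xj) for Xj in entities] for Xi in entities])
    # Square-loss instance of (9): the gradient -2K(y - K a) + 2*lam*K a vanishes
    # when (K + lam I) a = y, provided K is nonsingular (an assumption of this
    # sketch; the paper allows any convex loss V).
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(x, entities, alpha, beta):
    # Representer expansion (4): f*(x) = sum_j alpha_j * beta(X_j, x).
    return sum(a * beta(Xj, x) for a, Xj in zip(alpha, entities))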
3 Box Kernels
The function $K(\cdot,\cdot)$ comes out of the kernel $g(\cdot,\cdot)$ and returns a number which depends on its operands, which can be space regions or points. We now consider regions bounded by multi-dimensional intervals (boxes), so that $K(\cdot,\cdot)$ is referred to as the box kernel coming from $g$. These regions formalize the type of knowledge that we introduced in Section 1, and $vol(\mathcal{X}_j) = \prod_{i=1}^{d} |a_j^i - b_j^i|$. The box kernel can be plugged into every existing kernel-based classifier, allowing it to process labeled box regions without any modification to the learning algorithm. The function $K(\cdot,\cdot)$ inherits a number of properties from the kernel $g(\cdot,\cdot)$.

Proposition 2. Let $\mathbb{K} \in \mathbb{R}^{\ell_t \times \ell_t}$ be the Gram matrix associated with the function $K(\mathcal{X}_i, \mathcal{X}_j)$. If $g$ is a positive definite kernel function then $\mathbb{K} \geq 0$.

Proof: We distinguish three cases:
i. $vol(\mathcal{X}_i), vol(\mathcal{X}_j) > 0$. Since $g > 0$, there exists $\phi$ such that $\forall x, \varsigma \in \mathcal{X}: g(x, \varsigma) = \langle \phi(x), \phi(\varsigma)\rangle$. From the definition (6) we get

$$K(\mathcal{X}_i, \mathcal{X}_j) = \int_{\mathcal{X}}\Big(\int_{\mathcal{X}} g(x, \varsigma)\,\hat{c}_{\mathcal{X}_i}(\varsigma)\,d\varsigma\Big)\hat{c}_{\mathcal{X}_j}(x)\,dx = \int_{\mathcal{X}}\int_{\mathcal{X}} \langle\phi(x), \phi(\varsigma)\rangle\, \hat{c}_{\mathcal{X}_i}(\varsigma)\,\hat{c}_{\mathcal{X}_j}(x)\,d\varsigma\,dx$$
$$= \Big\langle \int_{\mathcal{X}}\phi(x)\,\hat{c}_{\mathcal{X}_j}(x)\,dx,\ \int_{\mathcal{X}}\phi(\varsigma)\,\hat{c}_{\mathcal{X}_i}(\varsigma)\,d\varsigma\Big\rangle = \langle \Phi(\mathcal{X}_i), \Phi(\mathcal{X}_j)\rangle, \qquad (11)$$

where $\Phi(\mathcal{Z}) := \int_{\mathcal{Z}} \phi(x)\,\hat{c}_{\mathcal{Z}}(x)\,dx$, with $\mathcal{Z} \in 2^{\mathcal{X}}$.

ii. $vol(\mathcal{X}_i) > 0$ and $\mathcal{X}_j = \{x_j\}$. Following the same arguments as above,

$$K(\mathcal{X}_i, \{x_j\}) = \int_{\mathcal{X}}\Big(\int_{\mathcal{X}} g(x, \varsigma)\,\hat{c}_{\mathcal{X}_i}(\varsigma)\,d\varsigma\Big)\delta(x - x_j)\,dx = \int_{\mathcal{X}} g(x_j, \varsigma)\,\hat{c}_{\mathcal{X}_i}(\varsigma)\,d\varsigma$$
$$= \Big\langle \phi(x_j),\ \int_{\mathcal{X}}\phi(\varsigma)\,\hat{c}_{\mathcal{X}_i}(\varsigma)\,d\varsigma\Big\rangle = \langle \phi(x_j), \Phi(\mathcal{X}_i)\rangle, \qquad (12)$$

and $\phi(x_z)$ is the degenerate case of $\Phi(\mathcal{Z})$, in which $\mathcal{Z}$ becomes a point $x_z$.

iii. $\mathcal{X}_i = \{x_i\}$ and $\mathcal{X}_j = \{x_j\}$. In this case we immediately get $K(\mathcal{X}_i, \mathcal{X}_j) = g(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle$.

Finally, if we construct the Gram matrix $\mathbb{K}$ using i, ii, and iii, the thesis comes out straightforwardly.

Gaussian kernels. In the rest of the paper we focus our attention on the case in which $g$ is a Gaussian kernel of width $\sigma$, $g(x, z) = \exp(-0.5\,\|x - z\|^2\sigma^{-2})$. However, our framework is generic, and the extension to other cases follows similar analyses.

Proposition 3. If $g$ is a Gaussian kernel then

$$\beta(\mathcal{X}_j, x) = \frac{1}{vol(\mathcal{X}_j)}\prod_{i=1}^{d} \frac{\sqrt{2\pi}\,\sigma}{2}\Big(\mathrm{erfc}\Big(\frac{x^i - b_j^i}{\sqrt{2}\,\sigma}\Big) - \mathrm{erfc}\Big(\frac{x^i - a_j^i}{\sqrt{2}\,\sigma}\Big)\Big). \qquad (13)$$
Proof: We recall that the isotropic Gaussian kernel is given by the product of $d$ Gaussian kernels that independently operate on each dimension. Since $\mathcal{X}_j$ is a box region, we can rewrite the integral over $\mathcal{X}_j$ as a product of $d$ definite integrals. In detail,

$$\beta(\mathcal{X}_j, x)\cdot vol(\mathcal{X}_j) = \int_{\mathcal{X}_j} e^{-\frac{\|x - \zeta\|^2}{2\sigma^2}}\,d\zeta = \prod_{i=1}^{d}\int_{a_j^i}^{b_j^i} e^{-\frac{(x^i - \zeta^i)^2}{2\sigma^2}}\,d\zeta^i = \prod_{i=1}^{d}\Big(\int_{a_j^i}^{+\infty} e^{-\frac{(x^i - \zeta^i)^2}{2\sigma^2}}\,d\zeta^i - \int_{b_j^i}^{+\infty} e^{-\frac{(x^i - \zeta^i)^2}{2\sigma^2}}\,d\zeta^i\Big)$$
$$= \prod_{i=1}^{d}\frac{\sqrt{2\pi}\,\sigma}{2}\Big(\mathrm{erfc}\Big(\frac{x^i - b_j^i}{\sqrt{2}\,\sigma}\Big) - \mathrm{erfc}\Big(\frac{x^i - a_j^i}{\sqrt{2}\,\sigma}\Big)\Big), \qquad (14)$$
where $\mathrm{erfc}(z) = \frac{2}{\sqrt{\pi}}\int_{z}^{+\infty} e^{-t^2}\,dt$ is the complementary error function.
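As a quick sanity check (not part of the original text), the closed form (13) can be compared against a direct numerical quadrature of the defining integral $\beta(\mathcal{X}_j, x) = \int_{\mathcal{X}_j} g(x,\varsigma)\,\hat{c}_{\mathcal{X}_j}(\varsigma)\,d\varsigma$. The sketch below, with illustrative numbers, does this for $d = 1$ with SciPy:

import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

sigma = 1.5
a, b, x = -2.0, 3.0, 0.7          # illustrative 1-d box [a, b] and query point
vol = b - a

# Closed form (13) for d = 1.
beta_closed = (np.sqrt(2.0 * np.pi) * sigma / 2.0) * (
    erfc((x - b) / (np.sqrt(2.0) * sigma)) - erfc((x - a) / (np.sqrt(2.0) * sigma))
) / vol

# Direct numerical quadrature of the defining integral of beta.
beta_quad = quad(lambda z: np.exp(-0.5 * (x - z) ** 2 / sigma ** 2), a, b)[0] / vol

assert np.isclose(beta_closed, beta_quad)   # the two values agree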
Proposition 4. If $g$ is a Gaussian kernel then

$$K(\mathcal{X}_h, \mathcal{X}_k) = q_{h,k} \cdot \prod_{i=1}^{d}\Big(\Psi(b_h^i, b_k^i) - \Psi(a_h^i, b_k^i) - \Psi(b_h^i, a_k^i) + \Psi(a_h^i, a_k^i)\Big), \qquad (15)$$

where

$$\Psi(a, b) := \frac{a - b}{\sqrt{2}\,\sigma}\,\mathrm{erfc}\Big(\frac{a - b}{\sqrt{2}\,\sigma}\Big) - \frac{1}{\sqrt{\pi}}\,e^{-\frac{(a-b)^2}{2\sigma^2}}, \qquad (16)$$
$$q_{h,k} := \frac{(\sqrt{\pi}\,\sigma^2)^d}{vol(\mathcal{X}_k)\,vol(\mathcal{X}_h)}. \qquad (17)$$

Proof: Given $p_{h,k} := \frac{(\sqrt{2\pi}\,\sigma)^d}{2^d\,vol(\mathcal{X}_k)\,vol(\mathcal{X}_h)}$, we have

$$K(\mathcal{X}_h, \mathcal{X}_k) := \int_{\mathcal{X}_k}\frac{\beta(\mathcal{X}_h, x)}{vol(\mathcal{X}_k)}\,dx = p_{h,k}\int_{\mathcal{X}_k}\prod_{i=1}^{d}\Big(\mathrm{erfc}\Big(\frac{x^i - b_h^i}{\sqrt{2}\,\sigma}\Big) - \mathrm{erfc}\Big(\frac{x^i - a_h^i}{\sqrt{2}\,\sigma}\Big)\Big)\,dx$$
$$= p_{h,k}\prod_{i=1}^{d}\Big(\int_{a_k^i}^{b_k^i}\mathrm{erfc}\Big(\frac{x^i - b_h^i}{\sqrt{2}\,\sigma}\Big)\,dx^i - \int_{a_k^i}^{b_k^i}\mathrm{erfc}\Big(\frac{x^i - a_h^i}{\sqrt{2}\,\sigma}\Big)\,dx^i\Big), \qquad (18)$$

which must be paired with $\int \mathrm{erfc}(z)\,dz = z\cdot\mathrm{erfc}(z) - e^{-z^2}(\sqrt{\pi})^{-1}$ to complete the proof.
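The closed forms (13) and (15)-(17) translate directly into code. The following sketch is an illustration under our own naming conventions, not the authors' implementation; it computes $\beta$ and the box kernel $K$ for boxes of positive volume, while degenerate boxes, i.e. single points, would be handled through Proposition 1, falling back to $\beta$ or to the Gaussian $g$ itself.

import numpy as np
from scipy.special import erfc

def beta(box_j, x, sigma):
    # Closed form (13); box_j = (a, b) with a, b numpy arrays and vol(box_j) > 0.
    a, b = box_j
    vol = np.prod(np.abs(b - a))
    s = np.sqrt(2.0) * sigma
    terms = (np.sqrt(2.0 * np.pi) * sigma / 2.0) * (erfc((x - b) / s) - erfc((x - a) / s))
    return np.prod(terms) / vol

def _psi(u, v, sigma):
    # Psi of (16), applied elementwise.
    z = (u - v) / (np.sqrt(2.0) * sigma)
    return z * erfc(z) - np.exp(-z ** 2) / np.sqrt(np.pi)

def box_kernel(box_h, box_k, sigma):
    # Closed form (15)-(17) for two boxes of positive volume.
    (ah, bh), (ak, bk) = box_h, box_k
    q = (np.sqrt(np.pi) * sigma ** 2) ** len(ah) / (
        np.prod(np.abs(bk - ak)) * np.prod(np.abs(bh - ah)))
    factors = (_psi(bh, bk, sigma) - _psi(ah, bk, sigma)
               - _psi(bh, ak, sigma) + _psi(ah, ak, sigma))
    return q * np.prod(factors)

Consistently with Proposition 1, as a box shrinks around a point $x_j$, the value of $\beta$ approaches $g(\cdot, x_j)$, so the degenerate cases can simply reuse the Gaussian kernel.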
In Fig. 1 we report an illustrative example of $K(\mathcal{X}_i, \mathcal{X}_j)$, where $\mathcal{X}_j = \{x_j\}$ and $\mathcal{X}_i$ is progressively reduced until it degenerates to a point, leading to the classical Gaussian kernel. Using a synthetic data set, Fig. 2 (a-c) shows the separation boundary of a box-kernel-based SVM trained with labeled points, labeled box regions, or both of them, respectively. The optimal separation boundary between the two classes becomes nonlinear when introducing the labeled regions, and it is correctly modeled by the box kernel. Fig. 2 (d) considers the effect of increasing the parameter $\lambda$, and it shows how a soft margin estimate is allowed within the available box regions, increasing the robustness to noisy supervisions. In Fig. 2 (e) not all the training points are coherent with the knowledge sets. The averaging effect of the box kernel within each labeled box region, introduced in (3) by the $m_{\mathcal{X}_j}(f)$ term, allows the classifier to handle this situation. As a matter of fact, SVMs exploit a hinge loss for the labeled entities, and the (absolute) maximum value of $f$ is larger inside the region in which we find the incoherency, so that its average still matches the corresponding box label (Fig. 2 (f)). The regularized nature of the learning problem does not allow the value of $f$ to explode to infinity.
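Since the paper states that the box kernel can be plugged into any existing kernel-based classifier, one practical route (a hedged sketch, not the authors' code) is to precompute the Gram matrix of $K$ over the labeled entities and feed it to an off-the-shelf SVM that accepts precomputed kernels, such as scikit-learn's SVC(kernel="precomputed"). Below, k(ei, ej) is assumed to return $K(\mathcal{X}_i, \mathcal{X}_j)$, using (15)-(17) for boxes and the degenerate forms of Proposition 1 for singleton points.

import numpy as np
from sklearn.svm import SVC

def train_box_svm(entities, y, k, C=1.0):
    # entities: labeled boxes and/or points; k(ei, ej) returns K(X_i, X_j),
    # falling back to Proposition 1 when an entity is a single point.
    G = np.array([[k(ei, ej) for ej in entities] for ei in entities])
    return SVC(kernel="precomputed", C=C).fit(G, y)

def predict_box_svm(clf, entities, test_points, k):
    # Rows index test points (degenerate boxes), columns the training entities.
    G_test = np.array([[k((x, x), ej) for ej in entities] for x in test_points])
    return clf.predict(G_test)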
Fig. 1. The $K(\mathcal{X}_i, \mathcal{X}_j)$ function ($g$ is Gaussian), where $\mathcal{X}_j = \{x_j\}$ and $\mathcal{X}_i$ is the box with corners $(-6, -4)$ and $(6, 4)$, progressively reduced until it degenerates to a point (left to right). The last picture corresponds to a Gaussian kernel.
Fig. 2. SVM trained on a 2-class dataset using the box kernel (red crosses/boxes: class +1, blue circles/boxes: class -1). (a) The separation boundary when only labeled points are used; (b) using labeled box regions only; (c) using both labeled points and regions; (d) using a larger λ (it penalizes the data fitting); (e) a labeled point (+) is incoherent with the leftmost blue-dotted box; (f) the level curves of f in the case of (e).
4 Experimental Results
We ran comparative experiments based on real-world scenarios: diagnosing diabetes and recognizing handwritten digits. Before going into further details, we shortly describe the related algorithms. In [2] the authors formalize a constrained linear optimization problem based on the available rules (i.e., labeled regions), which leads to a linear classification function (KSVM, Knowledge-based SVM). The extension of the KSVM framework to the nonlinear case has been studied in [3]. However, the nonlinear "kernelization" is not a transparent procedure that can be easily related to the original knowledge, making the approach less practical. Le et al. [4] proposed a simpler alternative, which we will refer to as SKSVM (Simpler KSVM). An SVM is trained from labeled points only, excluding the ones that fall in the (arbitrarily
shaped) labeled regions, and, at test time, its prediction is post-processed to match the available knowledge. The main drawback of this approach is that it is not able to generalize from knowledge on labeled regions only. A more recent idea was proposed in [5, 6]. A kernel-based classifier is extended to model labeled nonlinear space regions by discretizing the supervised space on a pre-selected subset of points. This criterion was applied to a linear programming SVM [5] (NKC, Nonlinear Knowledge-based Classifier) and to a proximal nonlinear classifier [6] (PKC, Proximal Knowledge-based Classifier). However, it is unclear how to sample the regions on which prior knowledge is given, and a considerable number of points may be needed, especially in high dimensions.

In each experiment, the features that are not involved in the available rules are bounded by their min and max values over the entire data collection. Classifier parameters were chosen by ranging them over a dense grid of values in $[10^{-5}, 10^5]$ and using a cross-validation procedure (described below).

Diabetes. The Pima Indians Diabetes [11] dataset is composed of the results of 8 medical tests for 768 female patients at least 21 years old of Pima Indian heritage. The task is to predict whether the patient shows signs of diabetes. KSVMs have been recently evaluated on these data [12], and we replicated the same experimental setting. Two rules from the National Institutes of Health are defined, involving the second (PLASMA) and sixth (MASS) features:

$(MASS \geq 30) \wedge (PLASMA \geq 126) \Rightarrow \text{positive}$
$(MASS \leq 25) \wedge (PLASMA \leq 100) \Rightarrow \text{negative}$
We note that the rules can be applied to directly classify 269 instances, and only 205 of them will be correctly classified. A collection of 200 random points is used to train the classifiers, 30 points to validate their parameters, whereas the results of Table 1 are computed on the rest of the data, averaged over 20 runs.

Table 1. The average accuracy and standard deviation on the Diabetes data in the setup of [12] (KSVM).

Method              Mean Accuracy   Std
KSVM (rules only)   64.23%          1.19%
BOX (rules only)    70.44%          1.03%
KSVM                76.33%          0.63%
BOX                 76.39%          1.30%

When using rules and labeled points, BOX shows a slightly better accuracy than KSVM, but the two results are essentially equivalent. We noted that the information carried in the labeled data points is enough to fulfill the box constraints. Differently, when only rules (i.e., labeled box regions) are fed to the classifier, a nonlinear estimate proves more appropriate, and BOX shows a significant improvement with respect to KSVM.

Handwritten digit recognition. The USPST dataset is the test collection of 16x16 pictures of 2007 handwritten digits from the US Postal System. We consider
the task of predicting whether an input image, represented as a vector of gray-scale intensities, is a 3 or an 8. Their representations are often very similar, and when the number of labeled training points is small the classification task is challenging. Given the pair of examples of Fig. 3 (a), a volunteer indicated the portions of the image that he considered most useful to distinguish them (Fig. 3 (b)). He also provided the ranges of intensity values that he would tolerate in each region, considering that not all the data will perfectly match the given pair. The resulting rules are reported in Fig. 3 (c). We randomly generated
(Intensity of the blue region in (b) ≥ 220) ⇒ 3
(Intensity of the red region in (b) ≤ 160) ⇒ 8
Fig. 3. (a) Examples of digits 3 and 8 from USPST; (b) the regions in which additional knowledge is provided to distinguish between the two classes (18 blue pixels for class 3 and 24 red pixels for class 8); (c) the rules provided for this task.
training/validation and test splits, repeating the process 20 times. The former group was composed of 10 labeled points only (4 of them were used to validate the classifier parameters). The pair of Fig. 3 was included in all the training sets. We compared all the described algorithms, collecting the results in Table 2. A Gaussian kernel was used for the nonlinear classifiers.

Table 2. The average accuracy and standard deviation of 20 experiments on USPST 3vs8 for different algorithms.

Method                 Mean Accuracy   Std
KSVM (rules only)      79.42%          0.28%
NKC/PKC (rules only)   77.38%          0.35%
BOX (rules only)       80.72%          0.35%
SVM                    89.78%          5.35%
SKSVM                  87.87%          5.03%
KSVM                   89.57%          5.70%
NKC/PKC                90.72%          4.46%
BOX                    92.55%          4.43%

BOX compares favorably with all the other methods, also when only the box rules are provided to the classifier. This result is remarkable, since the rules only apply to 46 out of 338 data points. Differently, SKSVM suffers from the removal of the training examples that fulfill the given rules, whereas in KSVM it is hard to find a good trade-off between rule fulfillment and labeled-point matching. NKC and PKC require a discrete sampling of the labeled region, so we provided those algorithms with 100 additional training points, generated by adding random noise to the
pair of Fig. 3. However, this process is rather heuristic, and BOX resulted in better accuracy without the need for any discrete sampling.
5 Conclusions
Based on the inspiring framework given by [7], in this paper we give a unified variational formulation of the class of problems introduced in [2], which incorporates both supervised points and supervised sets, and we prove a new representer theorem for the optimal solution. It turns out that the solution is based on a novel class of kernels, referred to as box kernels, which are created by joining a classic kernel with the collection of supervised sets (which can degenerate to points). Interestingly, supervised points and sets are treated differently by box kernels, since the kernels adapt their shape to the measure of the sets. The emergence of the box kernel for the problem at hand, which derives from the more general variational formulation, is the most distinguishing feature of the proposed approach. Interestingly, the algorithmic machinery that holds for kernel machines still applies, which makes the actual experimentation of the approach easy. The given set of experiments shows that the proposed solution is equivalent to or compares favorably with the state of the art in this field, overcoming several issues of the related algorithms. Finally, it is worth mentioning that the proposed approach of carving new kernels from the specific problem might open the door to other solutions for different forms of prior knowledge.
References

1. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: a review. Neurocomputing 71(7-9), 1578–1594 (2008)
2. Fung, G., Mangasarian, O., Shavlik, J.: Knowledge-based support vector machine classifiers. Advances in NIPS, 537–544 (2002)
3. Fung, G., Mangasarian, O., Shavlik, J.: Knowledge-based nonlinear kernel classifiers. In: Conference on Learning Theory, 102–112 (2003)
4. Le, Q., Smola, A., Gärtner, T.: Simpler knowledge-based support vector machines. In: Proceedings of ICML, 521–528. ACM (2006)
5. Mangasarian, O., Wild, E.: Nonlinear knowledge-based classification. IEEE Trans. on Neural Networks 19(10), 1826–1832 (2008)
6. Mangasarian, O., Wild, E., Fung, G.: Proximal knowledge-based classification. Statistical Analysis and Data Mining 1(4), 215–222 (2009)
7. Poggio, T., Girosi, F.: A theory of networks for approximation and learning. Technical report, MIT (1989)
8. Schölkopf, B., Smola, A.: From regularization operators to support vector kernels. In: Advances in NIPS. Kaufmann, M., ed. (1998)
9. Schölkopf, B., Smola, A.: Learning with kernels. The MIT Press (2002)
10. Giaquinta, M., Hildebrandt, S.: Calculus of Variations I. Volume 1. Springer (1996)
11. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010)
12. Kunapuli, G., Bennett, K., Shabbeer, A., Maclin, R., Shavlik, J.: Online knowledge-based support vector machines. In: ECML, 145–161 (2010)