Can learning kernels help performance?
Corinna Cortes, Google Research, [email protected]

Outline

• Learning with kernels: SVM.
• Learning kernels.
• Repeat:
    Discuss new idea:
    - convex vs. non-convex optimization,
    - linear vs. non-linear kernel combinations,
    - few vs. many kernels,
    - L1 vs. L2 regularization;
    Experimental check;
  Until conclusion.
• Future directions.

Optimal Hyperplane: Max. Margin (Vapnik and Chervonenkis, 1965)

[Figure: separating hyperplane w · x + b = 0 with marginal hyperplanes w · x + b = ±1, normal vector w, and margin (x₂ − x₁).]

• Canonical hyperplane: for support vectors, w · x + b ∈ {−1, +1}.
• Margin: ρ = 1/‖w‖. For points x₁, x₂ on opposite marginal hyperplanes,

    2ρ = w · (x₂ − x₁)/‖w‖ = 2/‖w‖.

Soft-Margin Hyperplanes (CC & Vapnik, 1995)

[Figure: hyperplane w · x + b = 0 with margin 2/‖w‖ and slack variables ξᵢ, ξⱼ, ξₖ for points violating the margin.]

• Support vectors: points along the margin and outliers.

Optimization Problem

• Constrained optimization problem:

    minimize    (1/2)‖w‖² + C Σᵢ₌₁ᵐ ξᵢ
    subject to  yᵢ[w · xᵢ + b] ≥ 1 − ξᵢ ∧ ξᵢ ≥ 0, i ∈ [1, m].

• Properties:
  • C is a non-negative real-valued constant.
  • Convex optimization.
  • Unique solution.

SVMs Equations

• Lagrangian: for all w, b, αᵢ ≥ 0, βᵢ ≥ 0,

    L(w, b, ξ, α, β) = (1/2)‖w‖² + C Σᵢ₌₁ᵐ ξᵢ − Σᵢ₌₁ᵐ αᵢ[yᵢ(w · xᵢ + b) − 1 + ξᵢ] − Σᵢ₌₁ᵐ βᵢξᵢ.

• KKT conditions:

    ∇_w L = w − Σᵢ₌₁ᵐ αᵢyᵢxᵢ = 0    ⟺   w = Σᵢ₌₁ᵐ αᵢyᵢxᵢ.
    ∇_b L = −Σᵢ₌₁ᵐ αᵢyᵢ = 0         ⟺   Σᵢ₌₁ᵐ αᵢyᵢ = 0.
    ∇_{ξᵢ} L = C − αᵢ − βᵢ = 0       ⟺   αᵢ + βᵢ = C.

    ∀i ∈ [1, m]:  αᵢ[yᵢ(w · xᵢ + b) − 1 + ξᵢ] = 0  and  βᵢξᵢ = 0.

Dual Optimization Problem

• Constrained optimization problem:

    maximize    Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ,ⱼ₌₁ᵐ αᵢαⱼyᵢyⱼ(xᵢ · xⱼ)
    subject to  ∀i ∈ [1, m], 0 ≤ αᵢ ≤ C ∧ Σᵢ₌₁ᵐ αᵢyᵢ = 0.

• Solution:

    h(x) = sgn(Σᵢ₌₁ᵐ αᵢyᵢ(xᵢ · x) + b),
    b = yᵢ − Σⱼ₌₁ᵐ αⱼyⱼ(xⱼ · xᵢ) for any SV xᵢ with αᵢ < C.

SVMs - Kernel Formulation (Boser, Guyon, and Vapnik, 1992)

• Constrained optimization problem:

    max_α  Σᵢ₌₁ᵐ αᵢ − (1/2) Σᵢ,ⱼ₌₁ᵐ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ)
    subject to  0 ≤ αᵢ ≤ C, i = 1, …, m, and Σᵢ₌₁ᵐ αᵢyᵢ = 0.

• Solution:

    h(x) = sgn(Σᵢ₌₁ᵐ αᵢyᵢK(x, xᵢ) + b),
    b = yᵢ − Σⱼ₌₁ᵐ αⱼyⱼK(xᵢ, xⱼ) for any support vector xᵢ with 0 < αᵢ < C.
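As a concrete illustration of the kernel formulation, here is a minimal sketch using scikit-learn's SVC with a precomputed Gram matrix; the Gaussian kernel, the toy data, and all names are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gram matrix K(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma**2)

# Toy data: two Gaussian blobs with labels in {-1, +1}.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [1] * 20)

# Solve the kernelized dual subject to 0 <= alpha_i <= C.
clf = SVC(C=1.0, kernel="precomputed")
clf.fit(gaussian_kernel(X, X), y)

# Predict with h(x) = sgn(sum_i alpha_i y_i K(x, x_i) + b);
# predict expects the cross-kernel matrix K(x_test, x_train).
X_test = rng.randn(5, 2)
print(clf.predict(gaussian_kernel(X_test, X)))
```

With kernel="precomputed", only Gram matrices cross the API boundary, which mirrors how the dual problem depends on the data solely through K(xᵢ, xⱼ).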

Margin Bound (Bartlett and Shawe-Taylor, 1999)

• Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

    R(h) ≤ R̂_ρ(h) + O(√(((R²/ρ²) log² m + log(1/δ)) / m)).

• R̂_ρ(h): fraction of training points with margin less than ρ, |{xᵢ : yᵢh(xᵢ) < ρ}|/m.
• R(h): generalization error.
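The empirical term R̂_ρ(h) in the bound is directly computable; a one-function sketch (names mine), assuming scores holds the values h(xᵢ) and y the labels yᵢ ∈ {−1, +1}:

```python
import numpy as np

def empirical_margin_loss(scores, y, rho):
    """Fraction of training points with margin y_i * h(x_i) < rho."""
    return float(np.mean(y * scores < rho))
```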

Kernel Ridge Regression (Saunders et al., 1998)

• Optimization problem (dual):

    max_α  −λαᵀα − αᵀKα + 2αᵀy.

• Solution:

    h(x) = Σᵢ₌₁ᵐ αᵢK(xᵢ, x)   with   α = (K + λI)⁻¹y.
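The closed form makes KRR a two-liner; a minimal numpy sketch (function names are mine) of the solution α = (K + λI)⁻¹y:

```python
import numpy as np

def krr_fit(K, y, lam):
    """Dual solution alpha = (K + lambda I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def krr_predict(K_test, alpha):
    """h(x) = sum_i alpha_i K(x_i, x); K_test has shape (n_test, m)."""
    return K_test @ alpha
```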

Outline

• Learning with kernels: SVM.
• Learning kernels.
• Repeat:
    Discuss new idea:
    - convex vs. non-convex optimization,
    - linear vs. non-linear kernel combinations,
    - few vs. many kernels,
    - L1 vs. L2 regularization;
    Experimental check;
  Until conclusion.
• Future directions.

Learning the Kernel

• SVM:

    max_α  2αᵀ1 − αᵀYᵀKYα
    subject to  αᵀy = 0 ∧ 0 ≤ α ≤ C.

• Structural Risk Minimization: select the kernel that minimizes an estimate of the generalization error.
• What estimate should we minimize?

Minimize an Independent Bound (Chapelle, Vapnik, Bousquet & Mukherjee, 2000)

• Alternate SVM and gradient-step algorithm, as sketched below:
  1. Maximize the SVM problem over α → α*.
  2. Gradient step on a bound on the generalization error:
     - margin bound: T = R²/ρ²;
     - span bound: T = (1/m) Σᵢ₌₁ᵐ Θ(αᵢ*Sᵢ² − 1).
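A simplified sketch of the alternating scheme for a single Gaussian-width parameter σ. Since K(x, x) = 1 for Gaussian kernels, R² is bounded by 1, so the sketch uses ‖w‖² = 1/ρ² as a crude stand-in for T = R²/ρ², and a finite-difference step instead of the analytic gradients of Chapelle et al.; all names, data, and step sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_gram(X1, X2, sigma):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma**2)

def bound_T(sigma, X, y, C=1.0):
    # Step 1: solve the SVM for the current kernel width.
    K = gaussian_gram(X, X, sigma)
    clf = SVC(C=C, kernel="precomputed").fit(K, y)
    ay = clf.dual_coef_.ravel()          # alpha_i * y_i at support vectors
    sv = clf.support_
    return ay @ K[np.ix_(sv, sv)] @ ay   # ||w||^2 = 1/rho^2

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 1, rng.randn(20, 2) + 1])
y = np.array([-1] * 20 + [1] * 20)

sigma = 1.0
for _ in range(20):
    # Step 2: finite-difference gradient step on the bound proxy.
    eps = 1e-3
    g = (bound_T(sigma + eps, X, y) - bound_T(sigma - eps, X, y)) / (2 * eps)
    sigma = max(sigma - 0.1 * g, 1e-2)
```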

Reality Check (Chapelle, Vapnik, Bousquet & Mukherjee, 2000)

Selecting the width of a Gaussian kernel and the SVM parameter C.

Kernel Learning & Feature Selection

• Rank-1 kernels:

    (xᵢᵏ)' = μₖxᵢᵏ,   μₖ ≥ 0,   Σₖ₌₁ᵈ (μₖ)ᵖ ≤ Λ.

• Alternate between solving the SVM and a gradient step on:
  - the margin bound R²/ρ² (Weston et al., NIPS 2001);
  - the SVM dual 2αᵀ1 − αᵀYᵀK_μYα (Grandvalet & Canu, NIPS 2002).
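Rank-1 kernels amount to per-feature scalings; a minimal sketch (names mine) of the induced linear kernel, where driving μₖ to 0 removes feature k:

```python
import numpy as np

def scaled_linear_kernel(X1, X2, mu):
    """K_mu(x, x') = sum_k mu_k^2 x_k x'_k for scaled features mu_k * x_k."""
    mu = np.maximum(mu, 0.0)   # mu_k >= 0
    return (X1 * mu) @ (X2 * mu).T
```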

Reality Check, Feature Selection (Chapelle, Vapnik, Bousquet & Mukherjee, 2000)

• Comparison with existing methods (Weston et al., NIPS 2001).

Kernel Learning Formulation, II (Lanckriet et al., 2003)

• Structural Risk Minimization problem:

    min_{K∈𝒦} max_α  2αᵀ1 − αᵀYᵀKYα
    subject to  0 ≤ α ≤ C ∧ αᵀy = 0,
                K ⪰ 0 ∧ Tr[K] ≤ Λ,

  where Λ > 0 determines the family of kernels.

SVM - Linear Kernel Expansion (Lanckriet et al., 2003)

• QCQP problem:

    min_μ max_α  F(μ, α) = 2αᵀ1 − αᵀYᵀ(Σₖ₌₁ᵖ μₖKₖ)Yα
    subject to  0 ≤ α ≤ C ∧ αᵀy = 0,
                μ ≥ 0 ∧ Σₖ₌₁ᵖ μₖ Tr(Kₖ) ≤ Λ.

• L1 regularization.
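A small sketch (names mine) of forming the non-negative combination under the trace budget Σₖ μₖTr(Kₖ) ≤ Λ, rescaling μ when the budget is exceeded; the actual QCQP optimizes μ jointly with α rather than projecting like this.

```python
import numpy as np

def combine_kernels(Ks, mu, Lambda=1.0):
    """Return sum_k mu_k K_k with mu >= 0 and sum_k mu_k Tr(K_k) <= Lambda."""
    mu = np.maximum(np.asarray(mu, dtype=float), 0.0)
    traces = np.array([np.trace(K) for K in Ks])
    scale = Lambda / max(mu @ traces, 1e-12)   # shrink only if over budget
    mu = mu * min(1.0, scale)
    return sum(m * K for m, K in zip(mu, Ks)), mu
```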

Computational Complexity

• In general: SDP.
• Non-negative linear combinations: QCQP, SILP (SVM-wrapper solution).
• Rank-1 kernels: QP.

Reality Check (Lanckriet et al., 2003)


Other Redeeming Properties

• Speed.
• Ranking properties.
• Feature selection, model understanding.

Reality Check (Lanckriet, De Bie, Cristianini, Jordan, & Noble, 2004)

• Classification performance on the cytoplasmic ribosomal class.
• Performance measured with respect to a ranking criterion.

Reality Check (Sonnenburg et al., 2004)

• Importance weighting in a DNA sequence around a so-called splice site.


Learning Kernels - Theory (Lanckriet et al., 2003)

• Linear classification, L1 regularization:

    R(h) ≤ R̂_ρ(h) + Õ(√((p/ρ²)/m)).

• Õ hides logarithmic factors.
• R̂_ρ(h): fraction of training points with margin < ρ.

Learning Kernels - Theory (Srebro & Ben-David, 2006)

• Linear classification, L1 regularization:

    R(h) ≤ R̂_ρ(h) + Õ(√((p + 1/ρ²)/m)).

• Õ hides logarithmic factors.
• R̂_ρ(h): fraction of training points with margin < ρ.

Hyperkernels (Ong, Smola & Williamson, 2005)

• Kernels of kernels; infinitely many kernels.
• m² kernel parameters to optimize over:

    K(x, x') = Σᵢ,ⱼ₌₁ᵐ βᵢⱼ K((xᵢ, xⱼ), (x, x')),   ∀x, x' ∈ X,   βᵢⱼ ≥ 0.

• SDP problem.

Reality Check, Hyperkernels (Ong, Smola & Williamson, 2005)

    K((x, x'), (x'', x''')) = Σⱼ₌₁ᵈ (1 − λ) / (1 − λ exp(−σⱼ[(xⱼ − x'ⱼ)² + (x''ⱼ − x'''ⱼ)²])).

Learning Kernels - Theory (CC et al., 2009)

• Regression, KRR, L2 regularization:

    R(h) ≤ R̂(h) + O(√(√p/m) + 1/m).

• Additive term in the number of kernels p.
• Technical condition (orthogonal kernels).
• Suggests using a larger number of kernels p.

KRR L2, Problem Formulation

• Optimization problem:

    min_{μ∈M} max_α  −λαᵀα − Σₖ₌₁ᵖ μₖαᵀKₖα + 2αᵀy

  with M = {μ : μ ≥ 0 ∧ ‖μ − μ₀‖² ≤ Λ²}.

• L2 regularization.

Form of the Solution

    min_{μ∈M} max_α  −λαᵀα − Σₖ₌₁ᵖ μₖαᵀKₖα + 2αᵀy,   where Σₖ μₖαᵀKₖα = μᵀv with vₖ = αᵀKₖα.

• By von Neumann's minimax theorem:

    max_α  −λαᵀα + 2αᵀy + min_{μ∈M} −μᵀv.

• Solving the inner minimization problem:

    max_α  −λαᵀα + 2αᵀy − μ₀ᵀv − Λ‖v‖,

  where the first three terms are standard KRR with the μ₀-kernel K₀ = Σₖ μ₀ₖKₖ.

• Form of the solution:

    α = (Σₖ₌₁ᵖ μₖKₖ + λI)⁻¹y   with   μ = μ₀ + Λ v/‖v‖,   vₖ = αᵀKₖα.

Algorithm

Algorithm 1: Interpolated Iterative Algorithm
  Input: Kₖ, k ∈ [1, p]
  α' ← (K₀ + λI)⁻¹y
  repeat
    α ← α'
    v ← (αᵀK₁α, …, αᵀKₚα)ᵀ
    μ ← μ₀ + Λ v/‖v‖
    α' ← ηα + (1 − η)(K(μ) + λI)⁻¹y,   with K(μ) = Σₖ μₖKₖ
  until ‖α' − α‖ < ε
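A direct numpy transcription of Algorithm 1 under the slide's notation, assuming K₀ is the μ₀-kernel Σₖ μ₀ₖKₖ; the function name and defaults are mine.

```python
import numpy as np

def interpolated_krr(Ks, y, lam, Lam, mu0, eta=0.5, eps=1e-6, max_iter=100):
    """Interpolated iterative algorithm for KRR with L2-regularized mu."""
    m = len(y)
    K0 = sum(m0 * K for m0, K in zip(mu0, Ks))          # mu0-kernel
    alpha_new = np.linalg.solve(K0 + lam * np.eye(m), y)
    for _ in range(max_iter):
        alpha = alpha_new
        v = np.array([alpha @ K @ alpha for K in Ks])    # v_k = a' K_k a
        mu = mu0 + Lam * v / max(np.linalg.norm(v), 1e-12)
        Kmu = sum(mk * K for mk, K in zip(mu, Ks))       # K(mu)
        alpha_new = eta * alpha + (1 - eta) * np.linalg.solve(
            Kmu + lam * np.eye(m), y)
        if np.linalg.norm(alpha_new - alpha) < eps:
            break
    return alpha_new, mu
```

The interpolation parameter η damps the fixed-point iteration between α and μ so the alternation converges rather than oscillating.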

Reality Check, KRR, Rank-1 Kernels (CC et al., 2009)

[Figures: RMSE and RMSE relative to baseline error vs. number of bigrams (1000-6000), comparing L1 and L2 kernel learning against the L1 and L2 baselines, on the Reuters (acq) and Kitchen datasets.]

Hierarchical Kernel Learning (Bach, 2008)

• Example: polynomial kernels.
• Sub-kernel:

    Kᵢ,ⱼ(xᵢ, x'ᵢ) = (q choose j) (1 + xᵢx'ᵢ)ʲ,   i ∈ [1, p],   j ∈ [0, q].

• Full kernel:

    K(x, x') = Σᵢ₌₁ᵖ (1 + xᵢx'ᵢ)^q.

• Convex optimization problem; complexity polynomial in the number of kernels selected; sparsity through L1 regularization and a hierarchical selection criterion.
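A literal sketch (names mine) of the sub-kernels and the full kernel as written on the slide:

```python
import numpy as np
from math import comb

def sub_kernel(X1, X2, i, j, q):
    """K_{i,j}(x_i, x_i') = C(q, j) * (1 + x_i x_i')^j for coordinate i."""
    return comb(q, j) * (1.0 + np.outer(X1[:, i], X2[:, i])) ** j

def full_kernel(X1, X2, q):
    """K(x, x') = sum_{i=1}^p (1 + x_i x_i')^q."""
    return sum((1.0 + np.outer(X1[:, i], X2[:, i])) ** q
               for i in range(X1.shape[1]))
```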

Reality Check, HKL


Summary

• Does not consistently and significantly outperform unweighted combinations.
• L2 regularization may work better than L1.
• A large number of kernels helps performance.
• Much faster.
• Great for feature selection.
• What about using non-linear combinations of kernels?

Non-Linear Combinations - Examples

• DC-Programming algorithm (Argyriou et al., 2005).
• Generalized MKL (Varma & Babu, 2009).
• Other non-linear combination studies.
• Non-convex optimization problems.
• Theoretical guarantees?
• Can they improve performance substantially?

DC-Programming Problem (Argyriou et al., 2005)

• Optimize over a continuously parameterized set of kernels.
• Kernels with bounded norm; Gaussians with the variance restricted to lie in a bounded interval:

    K_σ(x, x') = Πᵢ₌₁ᵈ exp(−(xᵢ − x'ᵢ)²/σᵢ²).

• Alternate steps:
  - estimate a new Gaussian;
  - fit the data.
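A sketch of the continuously parameterized Gaussian family, assuming the per-coordinate product form reconstructed above, with each σᵢ clipped to a bounded interval (the bounds and names are illustrative):

```python
import numpy as np

def gaussian_family(X1, X2, sigma, lo=0.1, hi=10.0):
    """K_sigma(x, x') = prod_i exp(-(x_i - x_i')^2 / sigma_i^2)."""
    sigma = np.clip(sigma, lo, hi)                  # bounded interval
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2  # (n1, n2, d)
    return np.exp(-(diff2 / sigma**2).sum(axis=2))
```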

Reality Check, DC-Programming (Argyriou et al., 2005)

Learning the σ's in a Gaussian kernel, DC formulation.

Generalized MKL (Varma & Babu, 2009)

• Product kernel, GMKL:
  • Gaussian: K_σ(x, x') = Πᵢ₌₁ᵈ exp(−(xᵢ − x'ᵢ)²/σᵢ²).
  • Polynomial: K_d(x, x') = Πᵢ₌₁ᵈ (1 + μᵢxᵢx'ᵢ)ᵖ,   μᵢ ≥ 0.

• Non-convex optimization problem; gradient descent algorithm alternating with solving the SVM problem.
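A sketch (names mine) of the two product kernels as reconstructed above:

```python
import numpy as np

def gmkl_gaussian(X1, X2, sigma):
    """prod_i exp(-(x_i - x_i')^2 / sigma_i^2)."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-(diff2 / sigma**2).sum(axis=2))

def gmkl_polynomial(X1, X2, mu, p):
    """prod_i (1 + mu_i x_i x_i')^p, with mu_i >= 0."""
    mu = np.maximum(mu, 0.0)
    factors = 1.0 + mu * X1[:, None, :] * X2[None, :, :]
    return (factors ** p).prod(axis=2)
```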

Reality Check, GMKL


Future directions

• Get it to work!
• Can theory guide us how?
• Should we change paradigm?
