Statistics and Clustering with Kernels
Christoph Lampert & Matthew Blaschko
Max-Planck-Institute for Biological Cybernetics, Department Schölkopf: Empirical Inference, Tübingen, Germany
Visual Geometry Group, University of Oxford, United Kingdom

June 20, 2009

Overview

- Kernel Ridge Regression
- Kernel PCA
- Spectral Clustering
- Kernel Covariance and Canonical Correlation Analysis
- Kernel Measures of Independence

Kernel Ridge Regression

Regularized least squares regression:
$$\min_w \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|^2$$

Replace $w$ with $\sum_{i=1}^n \alpha_i x_i$:
$$\min_\alpha \sum_{i=1}^n \Big( y_i - \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle \Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle$$

$\alpha^*$ can be found in closed form:
$$\alpha^* = (K + \lambda I)^{-1} y$$
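As a concrete illustration, here is a minimal NumPy sketch of kernel ridge regression; the Gaussian kernel, the bandwidth gamma, and the regularization lam are illustrative assumptions, not values from the slides.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Pairwise Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    # Closed-form dual solution: alpha* = (K + lambda I)^{-1} y
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_test, gamma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x)
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

# Toy usage: regress y = sin(x) from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
print(predict(X, alpha, np.array([[0.0], [1.5]]), gamma=0.5))
```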

PCA

Equivalent formulations:
- Minimize the squared error between the original data and its projection onto a lower-dimensional subspace
- Maximize the variance of the projected data
Solution: eigenvectors of the empirical covariance matrix

PCA (continued)

Empirical covariance matrix (biased):
$$\hat{C} = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T$$
where $\mu$ is the sample mean.
$\hat{C}$ is symmetric and positive (semi-)definite.

PCA:
$$\max_w \frac{w^T \hat{C} w}{\|w\|^2}$$

Data Centering

We use the notation $X$ to denote the design matrix, where every column of $X$ is a data sample.
We can define a centering matrix
$$H = I - \frac{1}{n} e e^T$$
where $e$ is a vector of all ones.
$H$ is idempotent, symmetric, and positive semi-definite (rank $n-1$).
The design matrix of centered data can be written compactly in matrix form as $XH$.
- The $i$th column of $XH$ is equal to $x_i - \mu$, where $\mu = \frac{1}{n} \sum_j x_j$ is the sample mean.
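A quick NumPy check of these properties (a sketch; the random data and sizes are arbitrary):

```python
import numpy as np

n, d = 5, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))          # columns are samples
H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) e e^T

# H is idempotent and symmetric
assert np.allclose(H @ H, H) and np.allclose(H, H.T)

# Columns of XH are the centered samples x_i - mu
mu = X.mean(axis=1, keepdims=True)
assert np.allclose(X @ H, X - mu)
```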

Kernel PCA

PCA:
$$\max_w \frac{w^T \hat{C} w}{\|w\|^2}$$

Kernel PCA:
- Replace $w$ by $\sum_i \alpha_i (x_i - \mu)$; this can be represented compactly in matrix form by $w = XH\alpha$, where $X$ is the design matrix, $H$ is the centering matrix, and $\alpha$ is the coefficient vector.
- Compute $\hat{C}$ in matrix form as $\hat{C} = \frac{1}{n} XHX^T$.
- Denote the matrix of pairwise inner products $K = X^T X$, i.e. $K_{ij} = \langle x_i, x_j \rangle$.
Then
$$\max_w \frac{w^T \hat{C} w}{\|w\|^2} = \max_\alpha \frac{1}{n} \frac{\alpha^T HKHKH \alpha}{\alpha^T HKH \alpha}$$
This is a Rayleigh quotient with known solution
$$HKH \beta_i = \lambda_i \beta_i$$
Set $\beta$ to be the eigenvectors of $HKH$, and $\lambda$ the corresponding eigenvalues; set $\alpha = \beta \lambda^{-\frac{1}{2}}$.

Example, image super-resolution: (fig: Kim et al., PAMI 2005)
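A minimal NumPy sketch of these steps (the linear kernel and toy data are assumptions for illustration):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    # Center the kernel matrix: HKH with H = I - (1/n) e e^T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    HKH = H @ K @ H
    # Eigendecomposition HKH beta_i = lambda_i beta_i (eigh returns ascending order)
    lam, beta = np.linalg.eigh(HKH)
    lam, beta = lam[::-1][:n_components], beta[:, ::-1][:, :n_components]
    # Scale coefficients: alpha = beta * lambda^{-1/2}
    alpha = beta / np.sqrt(lam)
    # Projections of the training samples onto the kernel principal components
    return HKH @ alpha, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 20))   # columns are samples
K = X.T @ X                        # linear kernel for illustration
projections, alpha = kernel_pca(K, n_components=2)
print(projections.shape)           # (20, 2)
```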

Overview

- Kernel Ridge Regression
- Kernel PCA
- Spectral Clustering
- Kernel Covariance and Canonical Correlation Analysis
- Kernel Measures of Independence

Spectral Clustering

Represent similarity of images by weights on a graph.
Normalized cuts optimizes the ratio of the cost of a cut and the volume of each cluster:
$$\mathrm{Ncut}(A_1, \dots, A_k) = \sum_{i=1}^k \frac{\mathrm{cut}(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}$$
Exact optimization is NP-hard, but a relaxed version can be solved by finding the eigenvectors of the normalized graph Laplacian
$$L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$$
where $D$ is the diagonal matrix with entries equal to the row sums of the similarity matrix $A$.

Spectral Clustering (continued)

Compute $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$.
Map data points based on the eigenvectors of $L$.
Cluster in the mapped space using k-means.
Example, handwritten digits (0-9): (fig: Xiaofei He)
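A compact sketch of this pipeline with NumPy and scikit-learn's KMeans (the Gaussian similarity and the use of scikit-learn are assumptions; any k-means implementation would do):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k):
    # Normalized graph Laplacian L = I - D^{-1/2} S D^{-1/2}
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    # Eigenvectors of L for the k smallest eigenvalues define the embedding
    _, vecs = np.linalg.eigh(L)
    embedding = vecs[:, :k]
    # Cluster the embedded points with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

# Toy usage: two blobs, Gaussian similarity
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq)
print(spectral_clustering(S, k=2))
```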

Overview

- Kernel Ridge Regression
- Kernel PCA
- Spectral Clustering
- Kernel Covariance and Canonical Correlation Analysis
- Kernel Measures of Independence

Multimodal Data

A latent aspect relates data that are present in multiple modalities, e.g. images and text.
[diagram: a latent variable z generates both modalities, represented as ϕx(x) and ϕy(y)]
x: (an image)
y: "A view from Idyllwild, California, with pine trees and snow capped Marion Mountain under a blue sky."
Learn kernelized projections that relate both spaces.

Kernel Covariance

KPCA is maximization of auto-covariance. Instead, maximize cross-covariance:
$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\|w_x\| \|w_y\|}$$
Can also be kernelized (replace $w_x$ by $\sum_i \alpha_i (x_i - \mu_x)$, etc.):
$$\max_{\alpha, \beta} \frac{\alpha^T H K_x H K_y H \beta}{\sqrt{\alpha^T H K_x H \alpha \; \beta^T H K_y H \beta}}$$
Solution is given by the (generalized) eigenproblem
$$\begin{pmatrix} 0 & H K_x H K_y H \\ H K_y H K_x H & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} H K_x H & 0 \\ 0 & H K_y H \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Kernel Canonical Correlation Analysis (KCCA)

Alternately, maximize correlation instead of covariance:
$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x \; w_y^T C_{yy} w_y}}$$
Kernelization is straightforward as before:
$$\max_{\alpha, \beta} \frac{\alpha^T H K_x H K_y H \beta}{\sqrt{\alpha^T (H K_x H)^2 \alpha \; \beta^T (H K_y H)^2 \beta}}$$

KCCA (continued)

Problem: if the data in either modality are linearly independent (as many dimensions as data points), there exists a projection of the data that respects any arbitrary ordering, so perfect correlation can always be achieved.
This is even more likely when a kernel is used (e.g. Gaussian).
Solution: regularize
$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{(w_x^T C_{xx} w_x + \varepsilon_x \|w_x\|^2)(w_y^T C_{yy} w_y + \varepsilon_y \|w_y\|^2)}}$$
As $\varepsilon_x \to \infty$, $\varepsilon_y \to \infty$, the solution approaches the maximum covariance solution.

KCCA Algorithm

Compute $K_x$, $K_y$.
Solve for $\alpha$ and $\beta$ as the eigenvectors of
$$\begin{pmatrix} 0 & H K_x H K_y H \\ H K_y H K_x H & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \lambda \begin{pmatrix} (H K_x H)^2 + \varepsilon_x H K_x H & 0 \\ 0 & (H K_y H)^2 + \varepsilon_y H K_y H \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
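A regularized KCCA sketch using SciPy's generalized symmetric eigensolver (the toy data, the linear kernels, the regularization values, and the small ridge added for numerical stability are assumptions):

```python
import numpy as np
from scipy.linalg import eigh

def kcca(Kx, Ky, eps_x=0.1, eps_y=0.1):
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    # Left-hand side: off-diagonal cross-covariance blocks
    A = np.block([[np.zeros((n, n)), Kxc @ Kyc],
                  [Kyc @ Kxc, np.zeros((n, n))]])
    # Right-hand side: regularized auto-covariance blocks
    B = np.block([[Kxc @ Kxc + eps_x * Kxc, np.zeros((n, n))],
                  [np.zeros((n, n)), Kyc @ Kyc + eps_y * Kyc]])
    # Generalized eigenproblem A v = lambda B v (small ridge keeps B positive definite)
    vals, vecs = eigh(A, B + 1e-8 * np.eye(2 * n))
    alpha, beta = vecs[:n, -1], vecs[n:, -1]   # leading canonical pair
    return vals[-1], alpha, beta

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((30, 5)), rng.standard_normal((30, 5))
corr, alpha, beta = kcca(X @ X.T, Y @ Y.T)
print(corr)
```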

Content Based Image Retrieval with KCCA

Hardoon et al., 2004:
- Training data consists of images with text captions
- Learn embeddings of both spaces using KCCA and appropriately chosen image and text kernels
- Retrieval consists of finding images whose embeddings are related to the embedding of the text query
- A kind of multivariate regression

Overview

- Kernel Ridge Regression
- Kernel PCA
- Spectral Clustering
- Kernel Covariance and Canonical Correlation Analysis
- Kernel Measures of Independence

Kernel Measures of Independence

We know how to measure correlation in the kernelized space.
Independence implies zero correlation.
Different kernels encode different statistical properties of the data.
Use an appropriate kernel such that zero correlation in the Hilbert space implies independence.

Example: Polynomial Kernel

A first degree polynomial kernel (i.e. linear) captures correlation only.
A second degree polynomial kernel captures all second order statistics, and so on.
A Gaussian kernel can be written
$$k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} = e^{-\gamma \langle x_i, x_i \rangle} \, e^{2\gamma \langle x_i, x_j \rangle} \, e^{-\gamma \langle x_j, x_j \rangle}$$
and we can use the identity
$$e^z = \sum_{i=0}^{\infty} \frac{1}{i!} z^i$$
We can therefore view the Gaussian kernel as being related to an appropriately scaled infinite dimensional polynomial kernel:
- it captures statistics of all orders.

Hilbert-Schmidt Independence Criterion

$\mathcal{F}$: RKHS on $\mathcal{X}$ with kernel $k_x(x, x')$; $\mathcal{G}$: RKHS on $\mathcal{Y}$ with kernel $k_y(y, y')$.
Covariance operator $C_{xy}: \mathcal{G} \to \mathcal{F}$ such that
$$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathbb{E}_{x,y}[f(x)g(y)] - \mathbb{E}_x[f(x)]\,\mathbb{E}_y[g(y)]$$
HSIC is the Hilbert-Schmidt norm of $C_{xy}$ (Fukumizu et al. 2008):
$$\mathrm{HSIC} := \|C_{xy}\|^2_{HS}$$
(Biased) empirical HSIC:
$$\widehat{\mathrm{HSIC}} := \frac{1}{n^2} \mathrm{Tr}(K_x H K_y H)$$
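The biased empirical estimator is short to implement; a sketch with Gaussian kernels (the bandwidth and the toy data are assumptions):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def hsic_biased(Kx, Ky):
    # (1/n^2) Tr(Kx H Ky H) with H = I - (1/n) e e^T
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kx @ H @ Ky @ H) / n**2

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
y_dep = x**2 + 0.1 * rng.standard_normal((200, 1))   # dependent on x
y_ind = rng.standard_normal((200, 1))                # independent of x
Kx = gaussian_kernel(x)
print(hsic_biased(Kx, gaussian_kernel(y_dep)))   # noticeably larger
print(hsic_biased(Kx, gaussian_kernel(y_ind)))   # close to zero
```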

Hilbert-Schmidt Independence Criterion (continued)

[figure: a ring-shaped density for which the linear correlation of X and Y is approximately zero (correlation: -0.00), together with the maximum singular vectors (functions) of Cxy, the "dependence witnesses" f(X) and g(Y); the witness outputs are strongly correlated (correlation: -0.90, COCO: 0.14), revealing the dependence.]

Hilbert-Schmidt Normalized Independence Criterion

The Hilbert-Schmidt Independence Criterion is analogous to cross-covariance. Can we construct a version analogous to correlation?
Simple modification: decompose the covariance operator (Baker 1973)
$$C_{xy} = C_{xx}^{\frac{1}{2}} V_{xy} C_{yy}^{\frac{1}{2}}$$
where $V_{xy}$ is the normalized cross-covariance operator (maximum singular value is bounded by 1).
Use the norm of $V_{xy}$ instead of the norm of $C_{xy}$.

Hilbert-Schmidt Normalized Independence Criterion (continued)

Define the normalized independence criterion to be the Hilbert-Schmidt norm of $V_{xy}$:
$$\widehat{\mathrm{HSNIC}} := \frac{1}{n^2} \mathrm{Tr}\Big[ H K_x H (H K_x H + \varepsilon_x I)^{-1} \, H K_y H (H K_y H + \varepsilon_y I)^{-1} \Big]$$
where $\varepsilon_x$ and $\varepsilon_y$ are regularization parameters as in KCCA.
If the kernels on $x$ and $y$ are characteristic (e.g. Gaussian kernels, see Fukumizu et al., 2008),
$$\|C_{xy}\|^2_{HS} = \|V_{xy}\|^2_{HS} = 0 \;\text{ iff }\; x \text{ and } y \text{ are independent!}$$
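A direct sketch of this estimator (the regularization values and the Gaussian kernels are illustrative assumptions; written self-contained rather than reusing the HSIC snippet above):

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def hsnic(Kx, Ky, eps_x=1e-3, eps_y=1e-3):
    # (1/n^2) Tr[ HKxH (HKxH + eps_x I)^{-1} HKyH (HKyH + eps_y I)^{-1} ]
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
    Mx = Kxc @ np.linalg.inv(Kxc + eps_x * np.eye(n))
    My = Kyc @ np.linalg.inv(Kyc + eps_y * np.eye(n))
    return np.trace(Mx @ My) / n**2

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 1))
print(hsnic(gaussian_kernel(x), gaussian_kernel(np.sin(3 * x))))                   # dependent
print(hsnic(gaussian_kernel(x), gaussian_kernel(rng.standard_normal((100, 1)))))   # independent
```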

Applications of HS(N)IC

- Independence tests: is there anything to gain from the use of multi-modal data?
- Kernel ICA
- Maximize dependence with respect to some model parameters:
  - Kernel target alignment (Cristianini et al., 2001)
  - Learning spectral clustering (Bach & Jordan, 2003): relates kernel learning and clustering
  - Taxonomy discovery (Blaschko & Gretton, 2008)

Summary

In this section we learned how to:
- Do basic operations in kernel space, like:
  - Regularized least squares regression
  - Data centering
  - PCA
- Learn with multi-modal data:
  - Kernel Covariance
  - KCCA
- Use kernels to construct statistical independence tests:
  - Use appropriate kernels to capture relevant statistics
  - Measure dependence by the norm of the (normalized) covariance operator
  - Closed form solutions requiring only kernel matrices for each modality

Questions?

Structured Output Learning
Christoph Lampert & Matthew Blaschko
Max-Planck-Institute for Biological Cybernetics, Department Schölkopf: Empirical Inference, Tübingen, Germany
Visual Geometry Group, University of Oxford, United Kingdom

June 20, 2009

What is Structured Output Learning?

Regression maps from an input space to an output space:
$$g : \mathcal{X} \to \mathcal{Y}$$
In typical scenarios, $\mathcal{Y} \equiv \mathbb{R}$ (regression) or $\mathcal{Y} \equiv \{-1, 1\}$ (classification).
Structured output learning extends this concept to more complex and interdependent output spaces.

Examples of Structured Output Problems in Computer Vision

- Multi-class classification (Crammer & Singer, 2001)
- Hierarchical classification (Cai & Hofmann, 2004)
- Segmentation of 3d scan data (Anguelov et al., 2005)
- Learning a CRF model for stereo vision (Li & Huttenlocher, 2008)
- Object localization (Blaschko & Lampert, 2008)
- Segmentation with a learned CRF model (Szummer et al., 2008)
- ... more examples at CVPR 2009

Generalization of Regression

Direct discriminative learning of $g : \mathcal{X} \to \mathcal{Y}$:
- Penalize errors for this mapping.
Two basic assumptions are employed:
- Use of a compatibility function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
- $g$ takes the form of a decoding function
$$g(x) = \operatorname{argmax}_y f(x, y)$$
- $f$ is linear w.r.t. a joint kernel: $f(x, y) = \langle w, \varphi(x, y) \rangle$

Multi-Class Joint Feature Map

Simple joint kernel map: define $\varphi_y(y_i)$ to be the vector with 1 in place of the current class, and 0 elsewhere:
$$\varphi_y(y_i) = [0, \dots, \underbrace{1}_{k\text{th position}}, \dots, 0]^T$$
if $y_i$ represents a sample that is a member of class $k$.
$\varphi_x(x_i)$ can result from any kernel over $\mathcal{X}$:
$$k_x(x_i, x_j) = \langle \varphi_x(x_i), \varphi_x(x_j) \rangle$$
Set $\varphi(x_i, y_i) = \varphi_y(y_i) \otimes \varphi_x(x_i)$, where $\otimes$ represents the Kronecker product.
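A sketch of this joint feature map via the Kronecker product (the class count and feature vector are arbitrary examples):

```python
import numpy as np

def joint_feature(phi_x, y, n_classes):
    # phi(x, y) = phi_y(y) ⊗ phi_x(x), with phi_y(y) a one-hot vector for class y
    phi_y = np.zeros(n_classes)
    phi_y[y] = 1.0
    return np.kron(phi_y, phi_x)

phi_x = np.array([0.5, -1.0, 2.0])       # any feature representation of x
print(joint_feature(phi_x, y=1, n_classes=3))
# [ 0.   0.   0.   0.5 -1.   2.   0.   0.   0. ]
```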

Multiclass Perceptron

Reminder: we want
$$\langle w, \varphi(x_i, y_i) \rangle > \langle w, \varphi(x_i, y) \rangle \quad \forall y \neq y_i$$
Example: perceptron training with a multiclass joint feature map.
The (sub)gradient of the loss for example $i$ is
$$\partial_w \ell(x_i, y_i, w) = \begin{cases} 0 & \text{if } \langle w, \varphi(x_i, y_i) \rangle \geq \langle w, \varphi(x_i, y) \rangle \;\; \forall y \neq y_i \\ \varphi(x_i, \hat{y}) - \varphi(x_i, y_i), \;\; \hat{y} = \operatorname{argmax}_{y \neq y_i} \langle w, \varphi(x_i, y) \rangle & \text{otherwise} \end{cases}$$
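A minimal structured-perceptron loop using this joint feature map (the synthetic three-class data and the epoch count are assumptions):

```python
import numpy as np

def joint_feature(phi_x, y, n_classes):
    phi_y = np.zeros(n_classes)
    phi_y[y] = 1.0
    return np.kron(phi_y, phi_x)

def perceptron_train(X, Y, n_classes, epochs=10):
    w = np.zeros(n_classes * X.shape[1])
    for _ in range(epochs):
        for phi_x, y_true in zip(X, Y):
            # Predict the highest-scoring class under the current w
            scores = [w @ joint_feature(phi_x, y, n_classes) for y in range(n_classes)]
            y_hat = int(np.argmax(scores))
            if y_hat != y_true:
                # Step opposite the (sub)gradient: w <- w + phi(x, y_true) - phi(x, y_hat)
                w += joint_feature(phi_x, y_true, n_classes) - joint_feature(phi_x, y_hat, n_classes)
    return w

rng = np.random.default_rng(0)
means = np.array([[0, 0], [3, 0], [0, 3]])
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in means])
Y = np.repeat(np.arange(3), 30)
w = perceptron_train(X, Y, n_classes=3)
preds = [int(np.argmax([w @ joint_feature(x, y, 3) for y in range(3)])) for x in X]
print(np.mean(np.array(preds) == Y))   # training accuracy
```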

Perceptron Training with Multiclass Joint Feature Map

[figure sequence: successive iterations of perceptron training with the multiclass joint feature map, ending with the final result. Credit: Lyndsey Pickup]

Crammer & Singer Multi-Class SVM

Instead of training using a perceptron, we can enforce a large margin and do a batch convex optimization:
$$\min_w \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \geq 1 - \xi_i \quad \forall y \neq y_i$$
Can also be written only in terms of kernels:
$$w = \sum_x \sum_y \alpha_{xy} \varphi(x, y)$$
Can use a joint kernel $k : \mathcal{X} \times \mathcal{Y} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$:
$$k(x_i, y_i, x_j, y_j) = \langle \varphi(x_i, y_i), \varphi(x_j, y_j) \rangle$$

Structured Output Support Vector Machines (SO-SVM)

Frame structured prediction as a multiclass problem:
- predict a single element of $\mathcal{Y}$ and pay a penalty for mistakes.
Not all errors are created equal:
- e.g. in an HMM, making only one mistake in a sequence should be penalized less than making 50 mistakes.
Pay a loss proportional to the difference between the true and predicted output (task dependent): $\Delta(y_i, y)$.

Margin Rescaling

Variant: Margin-Rescaled Joint-Kernel SVM for output space $\mathcal{Y}$ (Tsochantaridis et al., 2005).
Idea: some wrong labels are worse than others: loss $\Delta(y_i, y)$.
Solve
$$\min_w \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i \quad \forall y \in \mathcal{Y} \setminus \{y_i\}$$
Classify new samples using $g : \mathcal{X} \to \mathcal{Y}$:
$$g(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \varphi(x, y) \rangle$$
Another variant is slack rescaling (see Tsochantaridis et al., 2005).

Label Sequence Learning

For, e.g., handwritten character recognition, it may be useful to include a temporal model in addition to learning each character individually.
As a simple example, take an HMM:
- We need to model emission probabilities and transition probabilities.
- Learn these discriminatively.

A Joint Kernel Map for Label Sequence Learning

Emissions (blue):
- $f_e(x_i, y_i) = \langle w_e, \varphi_e(x_i, y_i) \rangle$
- Can simply use the multi-class joint feature map for $\varphi_e$
Transitions (green):
- $f_t(y_i, y_{i+1}) = \langle w_t, \varphi_t(y_i, y_{i+1}) \rangle$
- Can use $\varphi_t(y_i, y_{i+1}) = \varphi_y(y_i) \otimes \varphi_y(y_{i+1})$

A Joint Kernel Map for Label Sequence Learning (continued)

For an HMM,
$$p(x, y) \propto \prod_i e^{f_e(x_i, y_i)} \prod_i e^{f_t(y_i, y_{i+1})}$$
so
$$f(x, y) = \sum_i f_e(x_i, y_i) + \sum_i f_t(y_i, y_{i+1}) = \Big\langle w_e, \sum_i \varphi_e(x_i, y_i) \Big\rangle + \Big\langle w_t, \sum_i \varphi_t(y_i, y_{i+1}) \Big\rangle$$

Constraint Generation

$$\min_w \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \langle w, \varphi(x_i, y_i) \rangle - \langle w, \varphi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i \quad \forall y \in \mathcal{Y} \setminus \{y_i\}$$
Initialize the constraint set to be empty.
Iterate until convergence:
- Solve the optimization using the current constraint set
- Add the maximally violated constraint for the current solution
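A generic sketch of this cutting-plane loop; `solve_qp` and `find_most_violated` are hypothetical placeholders standing in for a QP solver over the current working set and a loss-augmented decoding routine (e.g. Viterbi for sequence models):

```python
def cutting_plane_training(samples, solve_qp, find_most_violated,
                           max_iters=100, tol=1e-3):
    """samples: list of (x_i, y_i) pairs.
    solve_qp(constraints) -> (w, xi): hypothetical QP solver over the working set.
    find_most_violated(w, x_i, y_i) -> (y_hat, violation): loss-augmented decoding."""
    constraints = []            # working set of (i, y_hat) constraints
    w, xi = solve_qp(constraints)
    for _ in range(max_iters):
        added = False
        for i, (x_i, y_i) in enumerate(samples):
            y_hat, violation = find_most_violated(w, x_i, y_i)
            # Add the constraint only if it is violated by more than the tolerance
            if violation > xi[i] + tol:
                constraints.append((i, y_hat))
                added = True
        if not added:
            break               # no constraint violated: converged
        w, xi = solve_qp(constraints)
    return w
```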

Constraint Generation with the Viterbi Algorithm

To find the maximally violated constraint, we need to maximize w.r.t. $y$:
$$\langle w, \varphi(x_i, y) \rangle + \Delta(y_i, y)$$
For arbitrary output spaces, we would need to iterate over all elements in $\mathcal{Y}$.
For HMMs, $\max_y \langle w, \varphi(x_i, y) \rangle$ can be found using the Viterbi algorithm.
It is a simple modification of this procedure to incorporate $\Delta(y_i, y)$ (Tsochantaridis et al., 2004).

Discriminative Training of Object Localization

Structured output learning is not restricted to outputs specified by graphical models.
We can formulate object localization as a regression from an image to a bounding box:
$$g : \mathcal{X} \to \mathcal{Y}$$
where $\mathcal{X}$ is the space of all images and $\mathcal{Y}$ is the space of all bounding boxes.

Joint Kernel between Images and Boxes: Restriction Kernel

Note: $x|_y$ (the image restricted to the box region) is again an image.
Compare two images with boxes by comparing the images within the boxes:
$$k_{\text{joint}}((x, y), (x', y')) = k_{\text{image}}(x|_y, x'|_{y'})$$
Any common image kernel is applicable:
- linear on cluster histograms: $k(h, h') = \sum_i h_i h'_i$
- $\chi^2$-kernel: $k_{\chi^2}(h, h') = \exp\Big(-\frac{1}{\gamma} \sum_i \frac{(h_i - h'_i)^2}{h_i + h'_i}\Big)$
- pyramid matching kernel, ...
The resulting joint kernel is positive definite.

Restriction Kernel: Examples

[figure: example image pairs with boxes. $k_{\text{joint}}$ is large when the box contents are similar, small when they differ, and can also be large for two different images whose boxes contain similar content.]
Note: this behaves differently from the common tensor products:
$$k_{\text{joint}}((x, y), (x', y')) \neq k(x, x') \, k(y, y')$$
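A small sketch of the restriction kernel on bag-of-visual-words histograms with the χ² kernel (the crop-and-histogram pipeline, `n_words`, and `gamma` are schematic assumptions, not the paper's feature extraction):

```python
import numpy as np

def chi2_kernel(h, h2, gamma=1.0):
    # k(h, h') = exp(-(1/gamma) * sum_i (h_i - h'_i)^2 / (h_i + h'_i))
    denom = h + h2
    denom[denom == 0] = 1.0           # avoid division by zero for empty bins
    return np.exp(-np.sum((h - h2) ** 2 / denom) / gamma)

def box_histogram(visual_words, box, n_words=100):
    # visual_words: array of (row, col, word_id); keep only words inside the box
    r0, c0, r1, c1 = box
    inside = (visual_words[:, 0] >= r0) & (visual_words[:, 0] < r1) & \
             (visual_words[:, 1] >= c0) & (visual_words[:, 1] < c1)
    return np.bincount(visual_words[inside, 2], minlength=n_words).astype(float)

def k_joint(words_a, box_a, words_b, box_b):
    # Restriction kernel: compare the images restricted to their boxes
    return chi2_kernel(box_histogram(words_a, box_a), box_histogram(words_b, box_b))

rng = np.random.default_rng(0)
words = np.column_stack([rng.integers(0, 200, (500, 2)), rng.integers(0, 100, 500)])
print(k_joint(words, (0, 0, 100, 100), words, (50, 50, 150, 150)))
```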

Constraint Generation with Branch and Bound

As before, we must solve
$$\max_{y \in \mathcal{Y}} \langle w, \varphi(x_i, y) \rangle + \Delta(y_i, y)$$
where
$$\Delta(y_i, y) = \begin{cases} 1 - \dfrac{\mathrm{Area}(y_i \cap y)}{\mathrm{Area}(y_i \cup y)} & \text{if } y_{i\omega} = y_\omega = 1 \\ 1 - \frac{1}{2}(y_{i\omega} y_\omega + 1) & \text{otherwise} \end{cases}$$
and $y_{i\omega}$ specifies whether there is an instance of the object present in the image at all.
Solution: use branch-and-bound over the space of all rectangles in the image (Blaschko & Lampert, 2008).
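A sketch of this loss for axis-aligned boxes (boxes as (x0, y0, x1, y1) tuples; the presence flags mimic $y_{i\omega}$, $y_\omega$ with values in {-1, +1}):

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def delta_loss(y_true, y_pred, present_true=1, present_pred=1):
    # 1 - IoU when both labels say the object is present, otherwise
    # 1 - (1/2)(y_iw * y_w + 1)
    if present_true == 1 and present_pred == 1:
        ix0, iy0 = max(y_true[0], y_pred[0]), max(y_true[1], y_pred[1])
        ix1, iy1 = min(y_true[2], y_pred[2]), min(y_true[3], y_pred[3])
        inter = box_area((ix0, iy0, ix1, iy1))
        union = box_area(y_true) + box_area(y_pred) - inter
        return 1.0 - inter / union
    return 1.0 - 0.5 * (present_true * present_pred + 1)

print(delta_loss((0, 0, 10, 10), (5, 5, 15, 15)))     # partial overlap -> loss in (0, 1)
print(delta_loss((0, 0, 10, 10), (0, 0, 10, 10)))     # perfect overlap -> 0.0
print(delta_loss((0, 0, 10, 10), None, 1, -1))        # mismatched presence -> 1.0
```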

Discriminative Training of Image Segmentation

Frame discriminative image segmentation as learning the parameters of a random field model.
Like sequence learning, the problem decomposes over cliques in the graph.
Set the loss to the number of incorrect pixels.

Constraint Generation with Graph Cuts

As the graph is loopy, we cannot use Viterbi.
Loopy belief propagation is approximate and can lead to poor learning performance for structured output learning of graphical models (Finley & Joachims, 2008).
Solution: use graph cuts (Szummer et al., 2008); $\Delta(y_i, y)$ can be easily incorporated into the energy function.

Summary of Structured Output Learning

Structured output learning is the prediction of items in complex and interdependent output spaces.
We can train regressors into these spaces using a generalization of the support vector machine.
We have shown examples for:
- Label sequence learning with Viterbi
- Object localization with branch and bound
- Image segmentation with graph cuts

Questions?
