Computational Vision: Computing with Gabors
• Orientation and frequency tuning
• Classification
Visual categorization
(1) Feature computation
(2) Categorization process
[Figure: input features x_i combined with weights w into a decision value, > 0 or < 0]
Classification
[Figure: data points scattered in an N-D feature space with axes x1, x2, x3, x4, ...]
Classification
• Two classes
  - positive and negative class (mostly convention)
  - labels (0,1) or (-1,1)
[Figure: two classes of points in the (x1, x2) plane, falling into the < 0 and > 0 regions]
Classification
• Two classes
  - positive and negative class (mostly convention)
  - labels (0,1) or (-1,1)
• Learning is finding a decision boundary that separates points from the two classes
[Figure: two classes of points in the (x1, x2) plane separated by the boundary w^T x = 0, with < 0 and > 0 regions]
Classification
• Two classes
  - positive and negative class (mostly convention)
  - labels (0,1) or (-1,1)
• Learning is finding a decision boundary that separates points from the two classes
• Learning corresponds to estimating the coefficients of w (see the sketch below)
[Figure: decision boundary w^T x = 0 in the (x1, x2) plane with normal vector w = (w1, w2)]
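A minimal numpy sketch of this decision rule; the data points, weight vector, and bias below are made-up illustrations, not values from the slides:

```python
import numpy as np

# Hypothetical 2-D examples, one per row: x = (x1, x2)
X = np.array([[0.5, 1.2],
              [2.0, 0.3],
              [-1.0, -0.5]])

# Hypothetical learned weight vector w = (w1, w2) and bias w0
w = np.array([1.0, -0.7])
w0 = 0.1

# Signed decision value w^T x + w0; its sign gives the predicted class
scores = X @ w + w0
labels = np.where(scores > 0, 1, -1)   # labels in {-1, +1}
print(scores, labels)
```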
Classification
• Train classifier
  - online vs. batch training
[Figure: training points in the (x1, x2) plane with decision boundary w^T x = 0, < 0 and > 0 regions]
Classification
• Train classifier
  - online vs. batch training (compare the sketch below)
• Then throw away training examples (keep the classifier!)
[Figure: only the decision regions < 0 and > 0 remain in the (x1, x2) plane once training is done]
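A sketch contrasting the two training regimes on toy data: a perceptron-style rule that updates w one example at a time (online) versus a least-squares fit over all examples at once (batch). Everything below (the data, epoch count, and fitting choices) is an arbitrary illustration, not the specific method prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data: class -1 around (-2, -2), class +1 around (+2, +2)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
Xb = np.hstack([X, np.ones((100, 1))])     # append 1 to fold the bias into w

# Online training: perceptron-style update, one example at a time
w_online = np.zeros(3)
for epoch in range(10):
    for xi, yi in zip(Xb, y):
        if yi * (w_online @ xi) <= 0:      # misclassified -> update
            w_online += yi * xi

# Batch training: least-squares fit on all examples at once
w_batch, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# After training, the examples can be discarded; only w is kept for prediction
print(np.mean(np.sign(Xb @ w_online) == y), np.mean(np.sign(Xb @ w_batch) == y))
```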
Classification: y = g(x) = w^T x > 0 or < 0?
• Train classifier
  - online vs. batch training
• Then throw away training examples (keep the classifier!)
• Predict the label for a new (test) example
  - by figuring out on which side of the decision boundary it falls
[Figure: new point x = (x1, x2) classified against the boundary defined by w = (w1, w2)]
Geometric interpretation
• y = g(x) = w^T x > 0 or < 0?
• w^T x gives a signed measure of the perpendicular distance r of the point x from the decision surface
• The distance to the boundary gives you a measure of discriminability for the sample (see the sketch below)
[Figure: point x in the (x1, x2) plane, the weight vector w, and the signed distance to the decision surface]
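A small sketch of the signed distance, assuming a boundary with an explicit bias term w0 (all numbers are made up): r = (w^T x + w0) / ||w||.

```python
import numpy as np

# Hypothetical boundary w^T x + w0 = 0 and a query point x
w = np.array([1.0, -0.7])
w0 = 0.1
x = np.array([0.5, 1.2])

score = w @ x + w0              # signed value used for classification
r = score / np.linalg.norm(w)   # signed perpendicular distance to the boundary
print(score, r)                 # |r| can be read as a confidence / discriminability measure
```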
Classification
• ŷ = g(Σ_i w_i x_i) = g(w^T x): the neuron's estimate of the desired output
• Straightforward biological interpretation: an input vector (x1, ..., xn), a weight vector (w1, ..., wn), and an output y
• Learning with a teacher by minimizing the discrepancy between desired and actual output (see the sketch below)
• How we measure the discrepancy (i.e., the choice of loss function) leads to different algorithms (e.g., perceptron, least squares, etc.)
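As one concrete instance of "learning with a teacher", here is a sketch that minimizes a squared-error discrepancy by gradient descent on toy data, with a linear g; the learning rate and step count are arbitrary choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])      # desired ("teacher") outputs
Xb = np.hstack([X, np.ones((100, 1))])          # bias folded into the weight vector

w = np.zeros(3)
lr = 0.01
for step in range(200):
    y_hat = Xb @ w                              # neuron's estimate g(w^T x), here g = identity
    grad = Xb.T @ (y_hat - y) / len(y)          # gradient of the mean squared discrepancy
    w -= lr * grad                              # move w to reduce the discrepancy

print(np.mean(np.sign(Xb @ w) == y))
```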
Loss functions
• e.g., LSR, SVM (compared in the sketch below)
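Assuming LSR here refers to least squares, a small sketch contrasting the per-example squared loss with the SVM hinge loss; the function names are purely illustrative:

```python
import numpy as np

def squared_loss(y, score):
    # least squares: penalizes any deviation of the score from the label
    return (y - score) ** 2

def hinge_loss(y, score):
    # SVM: zero once the example is on the correct side with margin >= 1
    return np.maximum(0.0, 1.0 - y * score)

scores = np.linspace(-3, 3, 7)
print(squared_loss(1, scores))
print(hinge_loss(1, scores))
```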
Why not just minimize the training error? (e.g., perceptron)
[Figure: a candidate decision boundary w·x = 0 with the region w·x > 0]
Why not just minimize the training error?
• We need some guarantee that the learned decision function will be stable to small perturbations of the data
• More generally, classifying the training data well does not provide any guarantee that we will classify future (unseen) data well (= generalization)
• Always keep separate training and test data
Why not just minimize the training error?
• We need some extra constraints on the decision boundary (e.g., smoothness)
• Generalization vs. overfitting
Why not just minimize the training error?
• Never select a classifier using the test set!
  - e.g., don't report the accuracy of the classifier that does best on your test set
  - Use a validation set (i.e., a subset of the training data); see the sketch below
[Figure: decision function y = g(x) = w^T x > 0 or < 0 in the (x1, x2) plane with weight vector w]
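A minimal sketch of the three-way split suggested above; the 60/20/20 proportions are an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
idx = rng.permutation(n)
# e.g., 60% train, 20% validation, 20% test
train_idx, val_idx, test_idx = idx[:60], idx[60:80], idx[80:]

# Select hyper-parameters / classifiers using validation accuracy only.
# The test set is touched exactly once, at the very end, to report the final accuracy.
print(len(train_idx), len(val_idx), len(test_idx))
```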
Validation sets
• Leave-one-out cross-validation
• Random splits
• k-folds
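A short sketch of the three schemes listed above using scikit-learn's model_selection utilities; the toy array and split counts are arbitrary:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

X = np.arange(20).reshape(10, 2)   # 10 toy examples

for name, cv in [("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("leave-one-out", LeaveOneOut()),
                 ("random splits", ShuffleSplit(n_splits=3, test_size=0.3, random_state=0))]:
    n_folds = sum(1 for _ in cv.split(X))   # each split yields (train indices, held-out indices)
    print(name, n_folds, "splits")
```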
Tikhonov regularization

min_{f ∈ H} [ (1/l) Σ_i V(f(x_i), y_i) + λ ||f||²_K ]

• The first term is the training error, the second is the regularization term
[Figure: training error and cross-validation (CV) error as a function of the hyper-parameters]

For the SVM, the same objective is written as

min_w C Σ_{i=1}^{l} V(y_i, w · Φ(x_i)) + ||w||², with C = 1/λ
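As an illustration only: for the square loss V(f(x), y) = (f(x) − y)² and a linear function f(x) = w · x (rather than a general RKHS function), the Tikhonov problem has a closed-form minimizer. The toy data and the value of λ below are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100))

lam = 0.1                         # regularization parameter lambda (C = 1/lambda in the SVM form)
l, d = X.shape
# Minimizer of (1/l) * sum_i (w.x_i - y_i)^2 + lam * ||w||^2
w = np.linalg.solve(X.T @ X / l + lam * np.eye(d), X.T @ y / l)
print(np.mean(np.sign(X @ w) == y))
```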
Support Vector Machine (SVM)
• The margin M measures the distance between the two closest points
• The closest points are the support vectors (see the sketch below)
[Figure: separating hyperplane w·x = 0 with margin M between the regions w·x < 0 and w·x > 0; the support vectors lie on the margin]
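A hedged sketch of these quantities using scikit-learn's SVC with a linear kernel, on toy data and with C set arbitrarily to 1: for a linear SVM the margin width is 2/||w||, and the fitted model exposes the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)), rng.normal(2, 0.5, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)          # width of the margin for a linear SVM
print("margin:", margin)
print("number of support vectors:", len(clf.support_vectors_))
```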
Kernels
• A kernel corresponds to a feature map Φ: X → F
• min_w C Σ_{i=1}^{l} V(y_i, w · Φ(x_i)) + ||w||², with f(x) = ⟨w, Φ(x)⟩
• Hyperplanes in the feature space correspond to non-linear functions in the original space (see the sketch below)
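As a sketch of that correspondence, assuming a degree-2 polynomial kernel in 2-D: the explicit feature map Φ below reproduces the kernel value as an inner product in feature space, so a hyperplane in that space acts as a quadratic decision function in the original space.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (x.x' + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.0]), np.array([1.5, 0.2])
# The inner product in feature space equals the kernel in the original space
print(phi(x) @ phi(xp), (x @ xp + 1) ** 2)
```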
Examples of positive definite (pd) kernels
Very common examples of symmetric pd kernels are:
• Linear kernel: K(x, x') = x · x'
• Gaussian kernel: K(x, x') = exp(−||x − x'||² / (2σ²)), σ > 0
• Polynomial kernel: K(x, x') = (x · x' + 1)^d, d ∈ N
For specific applications, designing an effective kernel is a challenging problem.
L. Rosasco
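A minimal numpy sketch of the three kernels above; the test vectors and the values of σ and d are arbitrary choices:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(x, xp, d=3):
    return (x @ xp + 1) ** d

x, xp = np.array([0.3, -1.0]), np.array([1.5, 0.2])
print(linear_kernel(x, xp), gaussian_kernel(x, xp), polynomial_kernel(x, xp))
```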
RKHS
Multi-class classification