Neural Networks
Hugo Larochelle (@hugo_larochelle), Google Brain

Neural Networks: Types of learning problems
SUPERVISED LEARNING
Topics: supervised learning
• Training time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y)
• Test time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y)
• Example
‣ classification
‣ regression
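As a concrete (if minimal) illustration of this setting, the sketch below draws labeled pairs from a toy p(x, y), fits a logistic-regression classifier at training time, and evaluates it on fresh pairs from the same distribution at test time. The data-generating process, model, and hyperparameters are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy supervised setting: training and test pairs are drawn i.i.d.
# from the same joint distribution p(x, y).
rng = np.random.default_rng(0)

def draw_pairs(n):
    # x ~ p(x): 2D Gaussian; y = 1 iff x1 + x2 > 0 (a simple p(y|x))
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)
    return x, y

x_train, y_train = draw_pairs(500)
x_test, y_test = draw_pairs(200)

# Logistic regression trained by gradient descent on the training pairs
w, b = np.zeros(2), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
    w -= 0.5 * x_train.T @ (p - y_train) / len(y_train)
    b -= 0.5 * np.mean(p - y_train)

# Test-time evaluation on fresh pairs from the same p(x, y)
pred = (1.0 / (1.0 + np.exp(-(x_test @ w + b))) > 0.5).astype(float)
accuracy = np.mean(pred == y_test)
```

Because train and test pairs come from the same p(x, y), a classifier that fits the training set transfers directly to the test set; that assumption is exactly what the later slides (domain adaptation, one-shot learning) relax.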
UNSUPERVISED LEARNING
Topics: unsupervised learning
• Training time
‣ data: {x(t)}
‣ setting: x(t) ∼ p(x)
• Test time
‣ data: {x(t)}
‣ setting: x(t) ∼ p(x)
• Example
‣ distribution estimation
‣ dimensionality reduction
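A minimal sketch of the dimensionality-reduction case: only inputs x ∼ p(x) are observed, and the goal is a compact code. PCA is used here as the simplest concrete example; the toy data-generating process is an assumption for illustration.

```python
import numpy as np

# Toy unsupervised setting: only inputs x ~ p(x) are observed.
rng = np.random.default_rng(0)

# p(x): points near a 1D line embedded in 3D, plus small noise
t = rng.normal(size=(300, 1))
x = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(300, 3))

# Dimensionality reduction with PCA: project onto the top principal component
x_centered = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(x_centered, full_matrices=False)
z = x_centered @ vt[0]                      # 1D codes for each input
x_recon = np.outer(z, vt[0]) + x.mean(axis=0)  # reconstruction from the codes

# Most of the variance survives the 3D -> 1D compression
explained = s[0] ** 2 / np.sum(s ** 2)
```

No labels are used anywhere: the structure of p(x) alone determines the learned projection.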
SEMI-SUPERVISED LEARNING
Topics: semi-supervised learning
• Training time
‣ data: {x(t), y(t)} and {x(t)}
‣ setting: x(t), y(t) ∼ p(x, y) and x(t) ∼ p(x)
• Test time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y)
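One simple (and deliberately naive) way to exploit the unlabeled set {x(t)} alongside the labeled pairs is self-training with pseudo-labels, sketched below. Self-training is just one approach to this setting; the classifier, confidence threshold, and toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

# Semi-supervised sketch: few labeled pairs ~ p(x, y), many unlabeled x ~ p(x).
rng = np.random.default_rng(0)

def fit_logreg(x, y, steps=200, lr=0.5):
    # plain logistic regression by gradient descent
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

x_lab = rng.normal(size=(20, 2))
y_lab = (x_lab[:, 0] + x_lab[:, 1] > 0).astype(float)
x_unl = rng.normal(size=(500, 2))          # unlabeled inputs from the same p(x)

# 1. fit on the small labeled set
w, b = fit_logreg(x_lab, y_lab)

# 2. pseudo-label the confident unlabeled points and retrain on the union
p_unl = 1 / (1 + np.exp(-(x_unl @ w + b)))
conf = (p_unl > 0.9) | (p_unl < 0.1)
x_all = np.vstack([x_lab, x_unl[conf]])
y_all = np.concatenate([y_lab, (p_unl[conf] > 0.5).astype(float)])
w2, b2 = fit_logreg(x_all, y_all)
```

The unlabeled inputs matter because they come from the same p(x) as the test inputs, so confident predictions on them expand the effective training set.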
MULTITASK LEARNING
Topics: multitask learning
• Training time
‣ data: {x(t), y1(t), …, yM(t)}
‣ setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
• Test time
‣ data: {x(t), y1(t), …, yM(t)}
‣ setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
• Example
‣ object recognition in images with multiple objects
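The standard architecture for this setting, also shown on the next slide, is one network with shared hidden layers and one output head per task. A minimal forward-pass sketch, with sizes and initialization chosen purely for illustration:

```python
import numpy as np

# One input x, M task labels y_1..y_M sharing a hidden representation.
rng = np.random.default_rng(0)

d, h, M = 4, 8, 3                      # input dim, hidden units, number of tasks
W1, b1 = rng.normal(size=(h, d)) * 0.1, np.zeros(h)
heads = [(rng.normal(size=(2, h)) * 0.1, np.zeros(2)) for _ in range(M)]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x):
    # shared trunk: h(x) = g(b1 + W1 x)
    hx = np.tanh(b1 + W1 @ x)
    # one softmax output head per task: p(y_m = c | x)
    return [softmax(b2 + W2 @ hx) for W2, b2 in heads]

probs = forward(rng.normal(size=d))
```

The shared trunk is what couples the tasks: gradients from every head update W1 and b1, so each task's labels shape the representation used by all the others.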
MULTITASK LEARNING
Topics: multitask learning
[Figure: a feedforward network with inputs x1, …, xd, shared hidden layers, and one output layer per task y1, …, yM]
• shared hidden layers: a(k)(x) = b(k) + W(k) h(k−1)(x), h(k)(x) = g(a(k)(x)), with h(0)(x) = x
• task-specific output layers: h(L+1)(x) = o(a(L+1)(x)) = f(x), with o(a) = softmax(a)c = exp(ac) / Σc′ exp(ac′), giving p(y = c|x)
TRANSFER LEARNING
Topics: transfer learning
• Training time
‣ data: {x(t), y1(t), …, yM(t)}
‣ setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
• Test time
‣ data: {x(t), y1(t)}
‣ setting: x(t), y1(t) ∼ p(x, y1)
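A common recipe implied by this setting is to reuse the representation learned on all M training tasks and fit only a new head for the single test-time task y1. The sketch below stands in a random "pretrained" trunk for illustration; the sizes, data, and frozen-trunk choice are assumptions, not from the slides.

```python
import numpy as np

# Transfer sketch: frozen shared layers + a freshly trained task-1 head.
rng = np.random.default_rng(0)

d, k = 5, 8
W_trunk = rng.normal(size=(k, d)) / np.sqrt(d)   # stand-in for pretrained layers

def features(x):
    return np.tanh(W_trunk @ x.T).T              # h(x), reused across tasks

# small labeled set for the target task y1
x = rng.normal(size=(100, d))
y = (x[:, 0] > 0).astype(float)

# train only the task-1 head (logistic regression on frozen features)
h = features(x)
w, b = np.zeros(k), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(h @ w + b)))
    w -= 0.5 * h.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

train_acc = np.mean(((h @ w + b) > 0) == (y == 1))
```

Only k + 1 parameters are fit for the new task, which is why transfer helps most when the target task's labeled set is small.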
STRUCTURED OUTPUT PREDICTION
Topics: structured output prediction
• Training time
‣ data: {x(t), y(t)}, where y(t) is of arbitrary structure (vector, sequence, graph)
‣ setting: x(t), y(t) ∼ p(x, y)
• Test time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y)
• Example
‣ image caption generation
‣ machine translation
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Training time
‣ data: {x(t), y(t)} and {x̄(t′)}
‣ setting: x(t) ∼ p(x), y(t) ∼ p(y|x(t)), x̄(t′) ∼ q(x)
• Test time
‣ data: {x̄(t), y(t)}
‣ setting: x̄(t) ∼ q(x) ≈ p(x), y(t) ∼ p(y|x̄(t))
• Example
‣ classify sentiment in reviews of different products
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Domain-adversarial networks (Ganin et al. 2015): train the hidden layer representation h(x) = g(a(x)) = sigm(b + Wx) to be
1. predictive of the target class, via an output branch f(x) = o(h(x)) with parameters V, c
2. indiscriminate of the domain, via a domain branch d with parameter vector w
• Trained by stochastic gradient descent
‣ for each random pair x(t), x̄(t′):
1. update W, V, b, c in the opposite direction of the gradient
2. update w, d in the direction of the gradient
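The two opposing updates above can be sketched as follows for the domain branch alone. This is a toy numerical illustration of the update directions only: the inputs carry no real domain signal, the shapes and step sizes are assumptions, and the class-prediction branch (V, c) is omitted for brevity.

```python
import numpy as np

# Minimal sketch of the domain-adversarial update (after Ganin et al., 2015):
# the domain classifier (w, d) descends its loss, while the representation
# parameters (W, b) ascend it, making h(x) indiscriminate of the domain.
rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

dim, k = 3, 5
W, b = rng.normal(size=(k, dim)) * 0.1, np.zeros(k)   # representation h(x)
w, d = np.zeros(k), 0.0                                # domain classifier

lr = 0.1
losses = []
for step in range(200):
    x, dom = rng.normal(size=dim), float(step % 2)   # alternate source/target
    h = sigm(b + W @ x)
    p = sigm(d + w @ h)                              # p(domain = source | x)
    loss = -(dom * np.log(p) + (1 - dom) * np.log(1 - p))
    # backprop through the domain branch
    g_out = p - dom
    g_w, g_d = g_out * h, g_out
    g_a = (g_out * w) * h * (1 - h)
    g_W, g_b = np.outer(g_a, x), g_a
    # 1. the representation moves AGAINST the domain gradient ...
    W += lr * g_W; b += lr * g_b
    # 2. ... while the domain classifier moves along it
    w -= lr * g_w; d -= lr * g_d
    losses.append(loss)
```

In the full model the same W, b also receive an ordinary descent update from the class-prediction loss, which is what keeps h(x) predictive of the target class.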
DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• May also be used to promote fair and unbiased models
ONE-SHOT LEARNING
Topics: one-shot learning
• Training time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {1, …, C}
• Test time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {C + 1, …, C + M}
‣ side information: a single labeled example from each of the M new classes
• Example
‣ recognizing a person based on a single picture of him/her
ONE-SHOT LEARNING
Topics: one-shot learning
• Learning a similarity metric D[ya, yb] with a siamese architecture (figure taken from Salakhutdinov and Hinton, 2007): two networks with shared weights W1, …, W4 map inputs Xa and Xb to codes ya and yb
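Once such a similarity metric is learned, one-shot classification reduces to nearest-neighbor search over the M single labeled examples. The sketch below uses a randomly initialized embedding as a stand-in for a trained siamese network; sizes, the distance choice, and the class names are illustrative assumptions.

```python
import numpy as np

# One-shot classification with a learned similarity metric: a shared
# ("siamese") embedding maps inputs to codes, and a test input takes the
# label of the nearest of the M single labeled support examples.
rng = np.random.default_rng(0)

d, k = 16, 8
W = rng.normal(size=(k, d)) / np.sqrt(d)   # shared weights (stand-in, untrained)

def embed(x):
    return np.tanh(W @ x)                   # y = f(x), same net for both inputs

def one_shot_classify(x, support):
    # support: one (example, label) pair per new class
    dists = {label: np.linalg.norm(embed(x) - embed(xc))  # D[ya, yb]
             for xc, label in support}
    return min(dists, key=dists.get)

# one labeled example for each of M = 3 new classes
support = [(rng.normal(size=d), c) for c in ("C+1", "C+2", "C+3")]
# a query near the first support example should map to its class
query = support[0][0] + 0.01 * rng.normal(size=d)
pred = one_shot_classify(query, support)
```

The metric, not the classifier, is what gets trained: at test time no parameters are updated, so a single example per new class suffices.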
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Training time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {1, …, C}
‣ side information: a description vector zc of each of the C classes
• Test time
‣ data: {x(t), y(t)}
‣ setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {C + 1, …, C + M}
‣ side information: a description vector zc of each of the M new classes
• Example
‣ recognizing an object based on a worded description of it
ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Ba, Swersky, Fidler, Salakhutdinov, arXiv 2015 (University of Toronto)
[Figure: a CNN maps the image to a 1×k feature vector f; an MLP maps the TF-IDF vector of each class's Wikipedia article to a C×k matrix g; their dot product gives the 1×C class scores. Example article: "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America. The South American cardinals in the genus…"]
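The scoring rule in the figure is just a dot product between an image embedding and per-class description embeddings. The sketch below uses random vectors as stand-ins for the learned CNN/MLP features; everything except the dot-product structure is an illustrative assumption.

```python
import numpy as np

# Zero-shot scoring sketch (after Ba et al., 2015): class scores are dot
# products between an image embedding f and class-description embeddings g.
rng = np.random.default_rng(0)

C, k = 4, 10
f = rng.normal(size=k)        # 1 x k image embedding (stand-in for CNN output)
g = rng.normal(size=(C, k))   # C x k class embeddings (stand-in for MLP(TF-IDF))

scores = g @ f                # 1 x C class scores, one dot product per class
predicted_class = int(np.argmax(scores))
```

Because classes enter only through their description embeddings, a new class can be scored at test time by embedding its description: no images of that class are ever needed.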
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Tackling a new learning problem often requires designing a neural architecture adapted to it
• Approach 1: use our intuition for how a human would reason about the problem
• Approach 2: take an existing algorithm/procedure and turn it into a neural network
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Many other examples
‣ structured prediction by unrolling probabilistic inference in an MRF
‣ planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
‣ few-shot learning by unrolling gradient descent on a small training set (Ravi and Larochelle, ICLR 2017)
[Figure 1: Computational graph for the forward pass of the meta-learner, with "Neural network" and "Learning algorithm" components. The dashed line divides …]
Neural Networks: Unintuitive properties of neural networks
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Intriguing Properties of Neural Networks (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014)
[Figure (b): correctly classified images, the difference (adversarial perturbation), and the resulting badly classified images]
THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Humans have adversarial examples too
• However, they don't match those of neural networks
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014)
[Figure: avg loss as a function of the parameters θ]
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014)
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Qualitatively Characterizing Neural Network Optimization Problems (Goodfellow, Vinyals, Saxe, ICLR 2015)
THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• If a dataset is created by labeling points using a neural network with N hidden units
‣ training another network with N hidden units is likely to fail
‣ but training a larger neural network is more likely to work!
(saddle points seem to be a blessing)
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• Flat Minima (Hochreiter, Schmidhuber, Neural Computation 1997)
• Hochreiter & Schmidhuber (1997) informally define a flat minimizer x̄ as one for which the function varies slowly in a relatively large neighborhood of x̄; in contrast, a sharp minimizer x̂ is such that the function increases rapidly in a small neighborhood of x̂. A flat minimum can be described with low precision, whereas a sharp minimum requires high precision. The large sensitivity of the training function at a sharp minimizer negatively impacts the ability of the trained model to generalize on new data. This can be explained through the lens of minimum description length (MDL) theory, which states that statistical models that require fewer bits to describe (i.e., are of low complexity) generalize better (Rissanen, 1983): since flat minimizers can be specified with lower precision than sharp minimizers, they tend to have better generalization performance. Alternative explanations are proffered through the Bayesian view of learning (MacKay, 1992) and through the lens of free Gibbs energy; see e.g. Chaudhari et al. (2016).
[Figure: hypothetical training and testing functions f(x) along θ, showing a flat minimum and a sharp minimum]
THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017)
‣ found that using large batch sizes tends to find sharper minima and generalize worse
• This means that we can't talk about generalization without taking the training algorithm into account
THEY CAN EASILY MEMORIZE
Topics: model capacity vs. training algorithm
• Understanding Deep Learning Requires Rethinking Generalization (Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017)
THEY CAN BE COMPRESSED
Topics: knowledge distillation
• Distilling the Knowledge in a Neural Network (Hinton, Vinyals, Dean, arXiv 2015)
[Figure: a feedforward neural network with inputs x1, …, xd, weights w and bias b, computing h(x) = g(a(x)) = g(b + Σi wi xi)]
THEY CAN BE COMPRESSED
Topics: knowledge distillation
• Can successfully distill
‣ a large neural network
‣ an ensemble of neural networks
• Works better than training the small network from scratch!
‣ Do Deep Nets Really Need to be Deep? (Jimmy Ba, Rich Caruana, NIPS 2014)
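The core of distillation is a loss that makes the student match the teacher's temperature-softened output distribution. A minimal sketch, with toy logits standing in for real network outputs; the temperature value is an illustrative choice:

```python
import numpy as np

# Knowledge-distillation loss sketch (after Hinton et al., 2015): the
# student is trained to match the teacher's softened softmax outputs.

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # soft targets: softmax of teacher logits at temperature T
    soft_targets = softmax(teacher_logits / T)
    log_student = np.log(softmax(student_logits / T))
    # cross-entropy between soft targets and the student's softened outputs
    return -np.sum(soft_targets * log_student)

teacher = np.array([4.0, 1.0, -2.0])
student_good = np.array([3.9, 1.1, -2.0])   # close to the teacher
student_bad = np.array([-2.0, 1.0, 4.0])    # far from the teacher

loss_good = distillation_loss(student_good, teacher)
loss_bad = distillation_loss(student_bad, teacher)
```

The temperature T > 1 spreads probability mass over the non-argmax classes, so the student also learns the teacher's relative similarities between classes, which is the extra "dark knowledge" a hard label cannot carry.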
THEY ARE INFLUENCED BY INITIALIZATION
Topics: impact of initialization
• Why Does Unsupervised Pre-Training Help Deep Learning? (Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010)
[Figure: 2D visualizations of the training trajectories of 2-layer networks with and without pre-training, focusing respectively on local and global structure; each point is colored according to the training iteration, to help follow the trajectory movement]
THEY ARE INFLUENCED BY FIRST EXAMPLES
Topics: impact of early examples
• Why Does Unsupervised Pre-Training Help Deep Learning? (Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010)
‣ vary the first million examples (across 10 different random draws, sampling a different set of 1 million examples each time) and keep the other ones fixed
‣ after training the 10 models, measure the variance (across the 10 draws) of the output of the networks on a fixed test set (i.e., the variance in function space)
‣ then vary the next million examples in the same fashion, and so on, to see how much each of the ten parts of the training set influenced the final function
YET THEY FORGET WHAT THEY LEARNED
Topics: lifelong learning, continual learning
• Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al., PNAS 2017)
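The remedy proposed in that paper, elastic weight consolidation (EWC), adds a quadratic penalty that pulls parameters toward their old-task values in proportion to how important they were for the old task. A minimal sketch of that penalty; the Fisher values, parameters, and λ below are illustrative stand-ins:

```python
import numpy as np

# EWC penalty sketch (after Kirkpatrick et al., 2017): when learning task B,
# parameters are pulled toward their task-A values theta_A, weighted by a
# diagonal Fisher-information estimate F of per-parameter importance.

def ewc_penalty(theta, theta_A, fisher, lam=1.0):
    # (lambda / 2) * sum_i F_i * (theta_i - theta_A_i)^2
    return 0.5 * lam * np.sum(fisher * (theta - theta_A) ** 2)

theta_A = np.array([1.0, -0.5, 2.0])   # parameters after learning task A
fisher = np.array([10.0, 0.01, 5.0])   # importance of each parameter for task A

# moving an important parameter (index 0) is penalized far more than
# moving an unimportant one (index 1) by the same amount
p_important = ewc_penalty(theta_A + np.array([0.5, 0.0, 0.0]), theta_A, fisher)
p_unimportant = ewc_penalty(theta_A + np.array([0.0, 0.5, 0.0]), theta_A, fisher)
```

During task-B training this penalty is simply added to the task-B loss, so unimportant parameters stay free to adapt while important ones are protected, mitigating catastrophic forgetting.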
SO THERE IS A LOT MORE TO UNDERSTAND!!
MERCI! (THANK YOU!)