Neural Networks Hugo Larochelle ( @hugo_larochelle ) Google Brain

Neural Networks Types of learning problems

SUPERVISED LEARNING
Topics: supervised learning
• Training
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y)
• Test
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y)
• Example
  ‣ classification
  ‣ regression
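As a concrete (if toy) illustration of this setting, the sketch below samples training and test sets from the same joint distribution p(x, y) and fits a classifier. The data generator, model, and hyperparameters are arbitrary assumptions, not something from the slides.

```python
# Minimal sketch: supervised learning as i.i.d. sampling from p(x, y),
# with train and test data drawn the same way.
import torch
import torch.nn as nn

def sample_p_xy(n, d=2):
    # Hypothetical p(x, y): two Gaussian blobs, one per class.
    y = torch.randint(0, 2, (n,))
    x = torch.randn(n, d) + 2.0 * y.float().unsqueeze(1)
    return x, y

x_train, y_train = sample_p_xy(1000)   # training data {x^(t), y^(t)}
x_test,  y_test  = sample_p_xy(200)    # test data, same p(x, y)

model = nn.Linear(2, 2)                # a linear classifier stands in for f(x)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()

acc = (model(x_test).argmax(1) == y_test).float().mean()
print(f"test accuracy: {acc.item():.2f}")
```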

UNSUPERVISED LEARNING
Topics: unsupervised learning
• Training
  ‣ data: {x^(t)}
  ‣ setting: x^(t) ∼ p(x)
• Test
  ‣ data: {x^(t)}
  ‣ setting: x^(t) ∼ p(x)
• Example
  ‣ distribution estimation
  ‣ dimensionality reduction

SEMI-SUPERVISED LEARNING
Topics: semi-supervised learning
• Training
  ‣ data: {x^(t), y^(t)} and {x^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y) and x^(t) ∼ p(x)
• Test
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y)

MULTITASK LEARNING
Topics: multitask learning
• Training
  ‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
  ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ∼ p(x, y_1, ..., y_M)
• Test
  ‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
  ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ∼ p(x, y_1, ..., y_M)
• Example
  ‣ object recognition in images with multiple objects

MULTITASK LEARNING
Topics: multitask learning
• (Diagram: a feedforward network with input x = (x_1, ..., x_d), shared hidden layers h^(1)(x), h^(2)(x), and a separate output layer per task y_1, ..., y_M, where a^(k)(x) = b^(k) + W^(k) h^(k-1)(x), h^(k)(x) = g(a^(k)(x)), and each output is f(x) = o(a^(L+1)(x)))

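The multitask setup above is often implemented as a shared trunk with one output head per task. The sketch below is an illustrative assumption of that pattern (layer sizes, the number of tasks M, and the losses are made up), not the slide's exact architecture.

```python
# Minimal sketch: a multitask network with a shared hidden representation
# and one output head per task y_1, ..., y_M.
import torch
import torch.nn as nn

class MultitaskNet(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, n_classes=5, M=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # one classification head per task
        self.heads = nn.ModuleList([nn.Linear(d_hidden, n_classes) for _ in range(M)])

    def forward(self, x):
        h = self.shared(x)                        # shared hidden layer h(x)
        return [head(h) for head in self.heads]   # [f_1(x), ..., f_M(x)]

net = MultitaskNet()
x = torch.randn(8, 32)
ys = [torch.randint(0, 5, (8,)) for _ in range(3)]   # one label set per task
loss = sum(nn.functional.cross_entropy(out, y) for out, y in zip(net(x), ys))
loss.backward()   # gradients flow into the shared trunk from every task
```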

TRANSFER LEARNING
Topics: transfer learning
• Training
  ‣ data: {x^(t), y_1^(t), ..., y_M^(t)}
  ‣ setting: x^(t), y_1^(t), ..., y_M^(t) ∼ p(x, y_1, ..., y_M)
• Test
  ‣ data: {x^(t), y_1^(t)}
  ‣ setting: x^(t), y_1^(t) ∼ p(x, y_1)
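One common reading of this setting is to reuse a representation trained on the source tasks and fit only the head for the target task y_1. The sketch below assumes that recipe; the frozen trunk, the layer sizes, and the data are placeholders.

```python
# Minimal sketch: transfer learning as reusing a representation trained on
# source tasks and fitting only a new head for the target task y_1.
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # pretrained on y_1, ..., y_M
head1 = nn.Linear(64, 5)                              # new head for task y_1

for p in trunk.parameters():
    p.requires_grad = False          # keep the transferred features fixed

opt = torch.optim.SGD(head1.parameters(), lr=0.01)
x, y1 = torch.randn(8, 32), torch.randint(0, 5, (8,))
for _ in range(10):
    opt.zero_grad()
    nn.functional.cross_entropy(head1(trunk(x)), y1).backward()
    opt.step()
```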

STRUCTURED OUTPUT PREDICTION
Topics: structured output prediction
• Training
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y)
  ‣ y^(t) of arbitrary structure (vector, sequence, graph)
• Test
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y)
• Example
  ‣ image caption generation
  ‣ machine translation

DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Training
  ‣ data: {x^(t), y^(t)} and {x̄^(t')}
  ‣ setting: x^(t) ∼ p(x), y^(t) ∼ p(y|x^(t)), x̄^(t') ∼ q(x) ≈ p(x)
• Test
  ‣ data: {x̄^(t), y^(t)}
  ‣ setting: x̄^(t) ∼ q(x), y^(t) ∼ p(y|x̄^(t))
• Example
  ‣ classify sentiment in reviews of different products

DOMAIN ADAPTATION
Topics: domain adaptation, covariate shift
• Domain-adversarial networks (Ganin et al. 2015): train the hidden layer representation h(x) = g(a(x)) = sigm(b + Wx) to be
  1. predictive of the target class (class output f(x), with parameters V, c)
  2. indiscriminate of the domain (domain output o(h(x)), with parameters w, d)
• Trained by stochastic gradient descent
  ‣ for each random pair x^(t), x̄^(t'):
    1. update W, V, b, c in the opposite direction of the gradient
    2. update w, d in the direction of the gradient
• May also be used to promote fair and unbiased models …
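One way to make the two updates above concrete is a gradient-reversal trick: a single backward pass descends the class loss while ascending the domain loss for the feature extractor. The sketch below is illustrative, not the authors' code; the sizes, data, and single-step loop are assumptions.

```python
# Minimal sketch of the domain-adversarial idea via gradient reversal:
# the features descend the class loss and ascend the domain-classification loss.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out          # flip the gradient flowing into the features

feat = nn.Sequential(nn.Linear(20, 64), nn.Sigmoid())   # h(x) = sigm(b + Wx)
clf  = nn.Linear(64, 2)      # class predictor (V, c)
dom  = nn.Linear(64, 2)      # domain predictor (w, d)
opt  = torch.optim.SGD([*feat.parameters(), *clf.parameters(), *dom.parameters()], lr=0.01)

x_src, y_src = torch.randn(16, 20), torch.randint(0, 2, (16,))   # labeled source x^(t)
x_tgt = torch.randn(16, 20)                                      # unlabeled target x̄^(t')

h_src, h_tgt = feat(x_src), feat(x_tgt)
class_loss = nn.functional.cross_entropy(clf(h_src), y_src)
h_all = GradReverse.apply(torch.cat([h_src, h_tgt]))
d_lab = torch.cat([torch.zeros(16, dtype=torch.long), torch.ones(16, dtype=torch.long)])
domain_loss = nn.functional.cross_entropy(dom(h_all), d_lab)

opt.zero_grad()
(class_loss + domain_loss).backward()   # reversal makes features domain-indiscriminate
opt.step()
```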

ONE-SHOT LEARNING
Topics: one-shot learning
• Training
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y), subject to y^(t) ∈ {1, ..., C}
• Test
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y), subject to y^(t) ∈ {C + 1, ..., C + M}
  ‣ side information:
    - a single labeled example from each of the M new classes
• Example
  ‣ recognizing a person based on a single picture of him/her

ONE-SHOT LEARNING
Topics: one-shot learning
• Learning a similarity metric D[y_a, y_b] between codes y_a and y_b computed from inputs X_a and X_b by two copies of the same deep network (layers of 500, 500, 2000 units and a 30-dimensional code, with weights W_1, ..., W_4)
  ‣ (Siamese architecture; figure taken from Salakhutdinov and Hinton, 2007, where the metric is learned by maximizing the log probability of pairs from the training set)
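The sketch below illustrates the similarity-metric idea with a shared encoder and a contrastive-style pair loss; the encoder sizes, the margin, and the loss form are assumptions rather than the exact objective of Salakhutdinov and Hinton.

```python
# Minimal sketch of a Siamese similarity metric: one encoder is applied to both
# inputs and D[y_a, y_b] is a distance between the two codes.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 30))

def D(xa, xb):
    ya, yb = encoder(xa), encoder(xb)        # shared weights: same encoder twice
    return (ya - yb).pow(2).sum(dim=1)       # squared Euclidean distance

# contrastive-style loss: pull same-class pairs together, push others apart
def pair_loss(xa, xb, same, margin=1.0):
    d = D(xa, xb)
    return torch.where(same, d, (margin - d.sqrt()).clamp(min=0).pow(2)).mean()

xa, xb = torch.randn(8, 784), torch.randn(8, 784)
same = torch.randint(0, 2, (8,)).bool()
pair_loss(xa, xb, same).backward()
# At test time, a new class needs only its single labeled example:
# classify a query by the smallest D to each stored example.
```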

ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Training
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y), subject to y^(t) ∈ {1, ..., C}
  ‣ side information:
    - description vector z_c of each of the C classes
• Test
  ‣ data: {x^(t), y^(t)}
  ‣ setting: x^(t), y^(t) ∼ p(x, y), subject to y^(t) ∈ {C + 1, ..., C + M}
  ‣ side information:
    - description vector z_c of each of the new M classes
• Example
  ‣ recognizing an object based on a worded description of it

ZERO-SHOT LEARNING
Topics: zero-shot learning, zero-data learning
• Ba, Swersky, Fidler, Salakhutdinov, arXiv 2015
  ‣ (Diagram: an image is encoded by a CNN into a 1×k feature vector f; each class's Wikipedia article, e.g. "The Cardinals or Cardinalidae are a family of passerine birds found in North and South America...", is encoded by TF-IDF features and an MLP into a C×k matrix g; the dot product of f and g gives the 1×C class scores)
DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Tackling a new learning problem often requires designing an adapted neural architecture
  ‣ Approach 1: use our intuition for how a human would reason about the problem
  ‣ Approach 2: take an existing algorithm/procedure and turn it into a neural network

DESIGNING NEW ARCHITECTURES
Topics: designing new architectures
• Many other examples
  ‣ structured prediction by unrolling probabilistic inference in an MRF
  ‣ planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
  ‣ few-shot learning by unrolling gradient descent on a small training set (Ravi and Larochelle, ICLR 2017)
    - (Figure 1 of Ravi and Larochelle: computational graph for the forward pass of the meta-learner, annotated on the slide with "Neural network" and "Learning algorithm")

Neural networks Unintuitive properties of neural networks

THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Intriguing Properties of Neural Networks
  Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014
  ‣ (Figure: correctly classified images, the added difference, and the resulting badly classified images)

THEY CAN MAKE DUMB ERRORS
Topics: adversarial examples
• Humans have adversarial examples too
• However, they don't match those of neural networks
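For intuition, the sketch below perturbs an input in the direction that increases the loss, using the fast gradient sign step from later work by Goodfellow et al. rather than the box-constrained optimization of Szegedy et al.; the model and data are dummies.

```python
# Minimal sketch of crafting an adversarial example: nudge the input so the
# loss for its current predicted label goes up.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.rand(1, 784)                      # a "correctly classified" input
y = model(x).argmax(dim=1)                  # its predicted label

x_adv = x.clone().requires_grad_(True)
loss = nn.functional.cross_entropy(model(x_adv), y)
loss.backward()

eps = 0.1                                   # small, nearly imperceptible step
x_adv = (x_adv + eps * x_adv.grad.sign()).detach().clamp(0, 1)
print(model(x).argmax(1).item(), model(x_adv).argmax(1).item())  # may now differ
```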

THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization
  Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014
  ‣ (Plots of average loss versus θ; near saddle points the optimization algorithm struggles to make progress)

THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• Qualitatively Characterizing Neural Network Optimization Problems
  Goodfellow, Vinyals, Saxe, ICLR 2015

THEY ARE STRANGELY NON-CONVEX
Topics: non-convexity, saddle points
• If a dataset is created by labeling points using an N-hidden-unit neural network
  ‣ training another N-hidden-unit network on it is likely to fail
  ‣ but training a larger neural network is more likely to work!
    (saddle points seem to be a blessing)

THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• Flat Minima
  Hochreiter, Schmidhuber, Neural Computation 1997
  ‣ "(Hochreiter & Schmidhuber, 1997) (informally) define a flat minimizer x̄ as one for which the function varies slowly in a relatively large neighborhood of x̄. In contrast, a sharp minimizer x̂ is such that the function increases rapidly in a small neighborhood of x̂. A flat minimum can be described with low precision, whereas a sharp minimum requires high precision. The large sensitivity of the training function at a sharp minimizer negatively impacts the ability of the trained model to generalize on new data; see Figure 1 for a hypothetical illustration. This can be explained through the lens of the minimum description length (MDL) theory, which states that statistical models that require fewer bits to describe (i.e., are of low complexity) generalize better (Rissanen, 1983). Since flat minimizers can be specified with lower precision than sharp minimizers, they tend to have better generalization performance. Alternative explanations are proffered through the Bayesian view of learning (MacKay, 1992), and through the lens of free Gibbs energy; see e.g. Chaudhari et al. (2016)." (excerpt shown on the slide, from Keskar et al., ICLR 2017)
  ‣ (Conceptual plot: f(x) versus θ showing a training function and a testing function, with a flat minimum and a sharp minimum)

THEY WORK BEST WHEN BADLY TRAINED
Topics: sharp vs. flat minima
• On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
  Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017
  ‣ found that using large batch sizes tends to find sharper minima and generalize worse
• This means that we can't talk about generalization without taking the training algorithm into account

THEY CAN EASILY MEMORIZE
Topics: model capacity vs. training algorithm
• Understanding Deep Learning Requires Rethinking Generalization
  Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017
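A compact way to see this memorization is the randomization test from this line of work: assign random labels and check that the training loss still goes to (near) zero. The sketch below is a toy version; the model size, data, and optimizer settings are arbitrary.

```python
# Minimal sketch of the random-label experiment: an over-parameterized network
# can fit labels that have no relation to the inputs.
import torch
import torch.nn as nn

x = torch.randn(256, 20)
y_random = torch.randint(0, 10, (256,))        # labels with no relation to x

net = nn.Sequential(nn.Linear(20, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(net(x), y_random)
    loss.backward()
    opt.step()
print(f"final training loss on random labels: {loss.item():.4f}")  # should approach 0
```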

THEY CAN BE COMPRESSED
Topics: knowledge distillation
• Distilling the Knowledge in a Neural Network
  Hinton, Vinyals, Dean, arXiv 2015
  ‣ (Diagrams of feedforward neural networks, h(x) = g(a(x)) = g(b + Σ_i w_i x_i), illustrating distilling a large network into a smaller one producing the same outputs y)
THEY CAN BE COMPRESSED
Topics: knowledge distillation
• Can successfully distill
  ‣ a large neural network
  ‣ an ensemble of neural networks
• Works better than training it from scratch!
  ‣ Do Deep Nets Really Need to be Deep?
    Jimmy Ba, Rich Caruana, NIPS 2014
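A minimal sketch of the distillation recipe: the student is trained to match the teacher's temperature-softened output distribution, optionally mixed with the usual hard-label loss. The architectures, the temperature T, and the loss weights are assumptions, not values from the paper.

```python
# Minimal sketch of knowledge distillation: a small student matches the
# softened output distribution of a larger (already trained) teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.1)

x, y = torch.randn(64, 20), torch.randint(0, 10, (64,))
T = 4.0     # temperature softens the teacher's probabilities

with torch.no_grad():
    soft_targets = F.softmax(teacher(x) / T, dim=1)

opt.zero_grad()
logits = student(x)
distill = F.kl_div(F.log_softmax(logits / T, dim=1), soft_targets,
                   reduction="batchmean") * T * T
hard = F.cross_entropy(logits, y)
(0.9 * distill + 0.1 * hard).backward()
opt.step()
```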

THEY ARE INFLUENCED BY INITIALIZATION
Topics: impact of initialization
• Why Does Unsupervised Pre-Training Help Deep Learning?
  Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010
  ‣ (Plot: 2-D visualization of learning trajectories for "2 layers with pre-training" vs. "2 layers without pre-training"; each point is colored according to the training iteration, to help follow the trajectory movement)

THEY ARE INFLUENCED BY FIRST EXAMPLES
Topics: impact of the early examples
• Why Does Unsupervised Pre-Training Help Deep Learning?
  Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010
  ‣ "... the first million examples (across 10 different random draws, sampling a different set of 1 million examples each time) and keep the other ones fixed. After training the (10) models, we measure the variance (across the 10 draws) of the output of the networks on a fixed test set (i.e., we measure the variance in function space). We then vary the next million examples in the same fashion, and so on, to see how much each of the ten parts of the training set influenced the final function."

YET THEY FORGET WHAT THEY LEARNED
Topics: lifelong learning, continual learning
• Overcoming Catastrophic Forgetting in Neural Networks
  Kirkpatrick et al., PNAS 2017
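The paper's elastic weight consolidation (EWC) penalty discourages parameters from drifting away from their values after an earlier task, weighted by a Fisher-information estimate. The sketch below assumes a toy model and a stand-in Fisher; it is an illustration of the penalty, not the full method.

```python
# Minimal sketch of an EWC-style penalty: when training on task B, penalize
# movement of each parameter away from its value after task A.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
theta_A = {n: p.detach().clone() for n, p in model.named_parameters()}   # after task A
fisher  = {n: torch.ones_like(p) for n, p in model.named_parameters()}   # stand-in F_i
lam = 10.0

def ewc_penalty(model):
    return sum((fisher[n] * (p - theta_A[n]) ** 2).sum()
               for n, p in model.named_parameters())

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss_B = nn.functional.cross_entropy(model(x), y) + (lam / 2) * ewc_penalty(model)
loss_B.backward()   # gradients now resist forgetting what was learned on task A
```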

SO THERE IS A LOT MORE TO UNDERSTAND!!

MERCI!
