EECS542 Presentation Recurrent Neural Networks Part I Xinchen Yan, Brian Wang Sept. 18, 2014

Outline
• Temporal Data Processing
• Hidden Markov Model
• Recurrent Neural Networks
• Long Short-Term Memory

Temporal Data Processing
• Speech Recognition
• Object Tracking
• Activity Recognition
• Pose Estimation

Figure credit: Feng

Sequence Labeling
• Sequence Classification: one label for the whole sequence (e.g. "verb")
• Segment Classification: a label per pre-segmented chunk, scored by Hamming distance (e.g. "swim")
• Temporal Classification: an unsegmented label sequence, scored by edit distance (e.g. "swim")

Example: The Dishonest Casino
• A casino has two dice:
  - Fair die: Pr(X = k) = 1/6, for 1 ≤ k ≤ 6
  - Loaded die: Pr(X = k) = 1/10, for 1 ≤ k ≤ 5; Pr(X = 6) = 1/2
• The casino player switches back-and-forth between the fair and loaded die about once every 20 turns
Slide credit: Eric Xing

Example: The Dishonest Casino
• Given: a sequence of rolls by the casino player
  124552656214614613613666166466 …
• Questions:
  - How likely is this sequence, given our model of how the casino works? (Evaluation)
  - What portion of the sequence was generated with the fair die, and what portion with the loaded die? (Decoding)
  - How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back? (Learning)
Slide credit: Eric Xing

[Figure: HMM unrolled over time — a chain of hidden states, each emitting an output.]
• Hidden state: H_t
• Output: O_t
• Transition probability between two states: P(H_t = k | H_{t−1} = j)
• Start probability: P(H_1 = j)
• Emission probability associated with each state: P(O_t = i | H_t = j)

Hidden Markov Model

Hidden Markov Model (Generative Model)


Slide credit: Eric Xing
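The Evaluation question above can be answered with the forward algorithm. Below is a minimal sketch for the dishonest-casino HMM; the start distribution and the 0.05 switch probability (once every ~20 turns) are assumptions, not given in the slides.

```python
# Forward algorithm for the dishonest-casino HMM (Evaluation).
# Assumed: uniform start distribution; switch probability 0.05
# (the player changes dice roughly once every 20 turns).

FAIR, LOADED = 0, 1
start = [0.5, 0.5]                      # P(H_1 = j), assumed uniform
trans = [[0.95, 0.05], [0.05, 0.95]]    # P(H_t = k | H_{t-1} = j)

def emit(state, roll):
    """Emission probability P(O_t = roll | H_t = state)."""
    if state == FAIR:
        return 1 / 6
    return 1 / 2 if roll == 6 else 1 / 10

def likelihood(rolls):
    """P(rolls) by summing over all hidden-state paths (forward pass)."""
    alpha = [start[s] * emit(s, rolls[0]) for s in (FAIR, LOADED)]
    for roll in rolls[1:]:
        alpha = [emit(s, roll) * sum(alpha[p] * trans[p][s]
                                     for p in (FAIR, LOADED))
                 for s in (FAIR, LOADED)]
    return sum(alpha)

print(likelihood([1, 2, 4, 5, 5, 2, 6, 5, 6]))
```

The same trellis, run with max instead of sum, gives the Viterbi decoding for the second question.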

Example: dice rolling vs. speech signal
• Dice rolling: output sequence takes discrete values; limited number of states
• Speech signal: output sequence is continuous data; variability from syntax, semantics, accent, rate, volume, etc.; requires temporal segmentation

Limitations of HMM
• Modeling continuous data
• Long-term dependencies: with N hidden states, an HMM can only remember log(N) bits about what it generated so far

Slide credit: G. Hinton

Feed-Forward NN vs. Recurrent NN
• "Piped" (acyclic) connections vs. cyclic connections
• Function vs. dynamical system

Definition: Recurrent Neural Networks
• Observation/input vector: x_t
• Hidden state vector: h_t
• Output vector: y_t
• Weight matrices:
  - Input-hidden weights: W_I
  - Hidden-hidden weights: W_H
  - Hidden-output weights: W_O
• Update rules:
  h_t = σ(W_I x_t + W_H h_{t−1} + b)
  y_t = W_O h_t

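The update rules can be sketched directly. This is a toy forward pass with illustrative (untrained) weights; the sizes and values are assumptions for demonstration only.

```python
import math

# Minimal RNN forward pass implementing
#   h_t = sigma(W_I x_t + W_H h_{t-1} + b),  y_t = W_O h_t.
# Weights below are illustrative assumptions, not trained values.

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(xs, W_I, W_H, W_O, b):
    h = [0.0] * len(b)            # h_0 initialized to zeros
    ys = []
    for x in xs:                  # same weights reused at every step
        pre = [a + c + d for a, c, d in
               zip(matvec(W_I, x), matvec(W_H, h), b)]
        h = [sigmoid(z) for z in pre]
        ys.append(matvec(W_O, h))
    return ys

# Toy example: 2-d input, 2-d hidden state, 1-d output.
W_I = [[0.5, -0.3], [0.1, 0.8]]
W_H = [[0.2, 0.0], [0.0, 0.2]]
W_O = [[1.0, -1.0]]
b = [0.0, 0.0]
ys = rnn_forward([[1.0, 0.0], [0.0, 1.0]], W_I, W_H, W_O, b)
```

Note that the loop reuses the same three matrices at every step; unrolling this loop in time gives exactly the shared-weight picture on the next slide.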

Unfolded RNN: Shared Weights

Recurrent Neural Networks

[Figure: RNN unfolded through time — at each time step an input feeds the hidden layer, which feeds the output and the next step's hidden layer; the same weights are shared across all steps.]

• Power of RNN:
  - Distributed hidden units
  - Non-linear dynamics: h_t = σ(W_I x_t + W_H h_{t−1} + b)
• Quote from Hinton

Providing input to recurrent networks
• We can specify inputs in several ways:
  - Specify the initial states of all the units.
  - Specify the initial states of a subset of the units.
  - Specify the states of the same subset of the units at every time step. This is the natural way to model most sequential data.
[Figure: the network unrolled over several time steps with shared weights w1–w4.]

Slide credit: G. Hinton

Teaching signals for recurrent networks
• We can specify targets in several ways:
  - Specify desired final activities of all the units.
  - Specify desired activities of all units for the last few steps.
    • Good for learning attractors.
    • It is easy to add in extra error derivatives as we backpropagate.
  - Specify the desired activity of a subset of the units; the other units are input or hidden units.
[Figure: the network unrolled over several time steps with shared weights w1–w4.]

Slide credit: G. Hinton

Next Q: Training Recurrent Neural Networks

Backprop through time (BPTT)

[Figure: the recurrent network unrolled for time = 0, 1, 2, 3, with the same weights w1–w4 replicated at every step.]

Slide credit: G. Hinton
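BPTT can be sketched on the smallest possible case: a scalar RNN with weights collapsed to just an input weight w_x and a recurrent weight w_h (an assumption for illustration; the slides' diagram uses four weights w1–w4), trained on a squared loss at the final step. The gradient is checked numerically.

```python
import math

# Backprop through time (BPTT) for a scalar RNN
#   h_t = tanh(w_x * x_t + w_h * h_{t-1}),  L = (h_T - target)^2.
# Scalar weights w_x, w_h are an illustrative simplification.

def forward(xs, w_x, w_h):
    hs = [0.0]                            # h_0 = 0
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def bptt(xs, target, w_x, w_h):
    """Gradients of L w.r.t. (w_x, w_h), accumulated backwards in time."""
    hs = forward(xs, w_x, w_h)
    g_wx = g_wh = 0.0
    dh = 2.0 * (hs[-1] - target)          # dL/dh_T
    for t in range(len(xs), 0, -1):
        dpre = dh * (1.0 - hs[t] ** 2)    # through tanh'
        g_wx += dpre * xs[t - 1]          # shared weight: sum over steps
        g_wh += dpre * hs[t - 1]
        dh = dpre * w_h                   # pass gradient back in time
    return g_wx, g_wh

# Numerical check of dL/dw_x via central differences.
xs, target, w_x, w_h = [0.5, -1.0, 0.3], 0.2, 0.7, 0.4
eps = 1e-6
loss = lambda a, b: (forward(xs, a, b)[-1] - target) ** 2
num_wx = (loss(w_x + eps, w_h) - loss(w_x - eps, w_h)) / (2 * eps)
g_wx, g_wh = bptt(xs, target, w_x, w_h)
```

The `dh = dpre * w_h` line is where the repeated multiplication by the recurrent weight happens, which is exactly what the following slides analyze.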

Recall: Training a FFNN
• The algorithm:
  - Provide: x, y
  - Learn: W^(1), …, W^(L)
  - Forward pass: z^(i+1) = W^(i) a^(i),  a^(i+1) = f(z^(i+1))
  - Backprop: δ^(i) = ((W^(i))^T δ^(i+1)) ⊙ f′(z^(i))
  - Gradient: ∇_{W^(i)} J(W, b; x, y) = δ^(i+1) (a^(i))^T

About function f
• Example: sigmoid function
  f(z) = 1 / (1 + e^{−z}) ∈ (0, 1)
  f′(z) = f(z)(1 − f(z)) ≤ 0.25
• Backprop analysis (RNN): magnitude of gradients over q steps
  ∂δ^(i)/∂δ^(L) = ∏_{m=1}^{q} W^T f′(z^(i+m−1))
  ‖∂δ^(i)/∂δ^(L)‖ ≤ (‖W‖ · max f′)^q

Exploding/Vanishing gradients
• Better initialization
• Long-range dependencies
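The bound above can be made concrete in the scalar case, where the gradient across q steps is simply (w · f′(z))^q. A sketch, with the pre-activation fixed at z = 0 (a simplification) and illustrative weight values:

```python
import math

# Why gradients vanish/explode: over q steps the backpropagated
# gradient is multiplied by w * f'(z) at each step.  With sigmoid,
# f'(z) <= 0.25, so unless |w| > 4 the product shrinks geometrically.
# Weight values here are illustrative assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_magnitude(w, q, z=0.0):
    """|dh_q / dh_0| for a scalar RNN h_t = sigmoid(w * h_{t-1}),
    with every pre-activation fixed at z (a simplification)."""
    fprime = sigmoid(z) * (1.0 - sigmoid(z))  # = 0.25 at z = 0
    return abs(w * fprime) ** q

print(gradient_magnitude(1.0, 20))   # vanishes: 0.25 ** 20
print(gradient_magnitude(8.0, 20))   # explodes: 2.0 ** 20
```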

Long Short-Term Memory
• Memory block: the basic unit
• Stores and accesses information
[Figure: LSTM memory blocks unrolled over time, with inputs X_t, hidden states H_t, outputs Y_t, and cell states C_t carried between steps.]

Memory Cell and Gates
• Input gate: i_t
• Forget gate: f_t
• Output gate: o_t
• Cell activation vector: c_t

Example: LSTM network
• 4 input units
• 5 output units
• 1 block with 2 LSTM cells

LSTM: Forward Pass
• Input gate: i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)
• Forget gate: f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)
• Output gate: o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)

Preservation of gradient information
• LSTM: 1 input unit, 1 hidden unit, and 1 output unit
• Node: black (activated)
• Gate: "−" (closed), "o" (open)

LSTM: Forward Pass
• Memory cell: c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
• Hidden vector: h_t = o_t ⊙ tanh(c_t)

How does LSTM deal with vanishing/exploding gradients?
• RNN hidden unit: h_t = σ(W_I x_t + W_H h_{t−1} + b_h)
• LSTM memory cell (linear unit): c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)

Summary
• HMM: discrete data
• RNN: continuous domain
• LSTM: long-range dependencies

The End Thank you!
