Generative Model-Based Text-to-Speech Synthesis. Heiga Zen (Google London). February 3rd, 2017 @ MIT

Outline
• Generative TTS
• Generative acoustic models for parametric TTS: hidden Markov models (HMMs), neural networks
• Beyond parametric TTS: learned features, WaveNet, end-to-end
• Conclusion & future topics


Text-to-speech as sequence-to-sequence mapping
• Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen"
• Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen"
• Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech


Speech production process

[Figure: speech as modulation of a carrier wave by speech information. Air flow drives a sound source (voiced: pulse train; unvoiced: noise) with a fundamental frequency, which is shaped by the frequency transfer characteristics (magnitude response) of the vocal tract, turning text (concept) into speech.]

Typical flow of TTS system

TEXT
→ Text analysis (NLP frontend; discrete → discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation
→ Speech synthesis (speech backend; discrete → continuous): prosody prediction, waveform generation
→ SYNTHESIZED SPEECH

Speech synthesis approaches
• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis: p(speech = x | text = "Hello, my name is Heiga Zen.")

Probabilistic formulation of TTS

Random variables:
  X : speech waveforms (data)   [observed]
  W : transcriptions (data)     [observed]
  w : given text                [observed]
  x : synthesized speech        [unobserved]

Synthesis:
• Estimate the posterior predictive distribution p(x | w, X, W) from the posterior distribution
• Sample x̄ from it

Probabilistic formulation

Introduce auxiliary variables (representation) + factorize dependency:

$$p(x \mid w, \mathcal{X}, \mathcal{W}) = \sum_{\forall l} \sum_{\forall L} \iiint p(x \mid o)\, p(o \mid l, \lambda)\, p(l \mid w)\, p(\mathcal{X} \mid O)\, p(O \mid L, \lambda)\, p(\lambda)\, p(L \mid \mathcal{W}) \,/\, p(\mathcal{X}) \;\mathrm{d}o \,\mathrm{d}O \,\mathrm{d}\lambda$$

where O, o: acoustic features; L, l: linguistic features; λ: model.

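One way to read the factorization (my unpacking, not stated on the slide): each factor encodes a conditional-independence assumption.

```latex
\begin{align*}
p(x \mid o, l, w, \lambda, \ldots) &= p(x \mid o)
  && \text{waveform depends on text only through acoustic features}\\
p(o \mid l, w, \lambda, \ldots) &= p(o \mid l, \lambda)
  && \text{acoustic features depend on linguistic features and the model}\\
p(l \mid w, \ldots) &= p(l \mid w)
  && \text{linguistic features are derived from the given text alone}\\
p(\mathcal{X}, O, L \mid \mathcal{W}, \lambda) &= p(\mathcal{X} \mid O)\, p(O \mid L, \lambda)\, p(L \mid \mathcal{W})
  && \text{the same structure applied to the training data}
\end{align*}
```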
Approximation (1)

Approximate {sum & integral} by best point estimates (like MAP) [3]:

$$p(x \mid w, \mathcal{X}, \mathcal{W}) \approx p(x \mid \hat{o})$$

where

$$\{\hat{o}, \hat{l}, \hat{O}, \hat{L}, \hat{\lambda}\} = \arg\max_{o, l, O, L, \lambda} p(x \mid o)\, p(o \mid l, \lambda)\, p(l \mid w)\, p(\mathcal{X} \mid O)\, p(O \mid L, \lambda)\, p(\lambda)\, p(L \mid \mathcal{W})$$

Approximation (2)

Joint → step-by-step maximization [3]:

$$\hat{O} = \arg\max_{O} p(\mathcal{X} \mid O)$$ (extract acoustic features)
$$\hat{L} = \arg\max_{L} p(L \mid \mathcal{W})$$ (extract linguistic features)
$$\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$$ (learn mapping)
$$\hat{l} = \arg\max_{l} p(l \mid w)$$ (predict linguistic features)
$$\hat{o} = \arg\max_{o} p(o \mid \hat{l}, \hat{\lambda})$$ (predict acoustic features)
$$\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$$ (synthesize waveform)

Representations: acoustic, linguistic, mapping

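Made concrete, the six steps form the classic analysis/training/synthesis pipeline. A minimal sketch follows; every function name, shape, and body here is a hypothetical placeholder, not something from the talk:

```python
import numpy as np

# Hypothetical stand-ins for the maximization steps of Approximation (2).
def extract_acoustic_features(waveforms):      # O^ = argmax_O p(X | O)
    return [np.random.randn(100, 40) for _ in waveforms]   # frames x cepstra

def extract_linguistic_features(transcripts):  # L^ = argmax_L p(L | W)
    return [[hash(w) % 1000 for w in t.split()] for t in transcripts]

def train_acoustic_model(O, L):                # lambda^ = argmax p(O^ | L^, lambda) p(lambda)
    return {"placeholder": True}               # stands in for HMM/NN training

def predict_acoustic_features(l, model):       # o^ = argmax_o p(o | l^, lambda^)
    return np.random.randn(100, 40)

def vocode(o):                                 # x~ ~ p(x | o^): vocoder synthesis
    return np.random.randn(16000)

# Training uses steps 1-3 on (X, W); synthesis runs steps 4-6 on new text w.
X, W = [np.random.randn(16000)], ["hello world"]
model = train_acoustic_model(extract_acoustic_features(X),
                             extract_linguistic_features(W))
x_bar = vocode(predict_acoustic_features(
    extract_linguistic_features(["hello"])[0], model))
```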
Representation – Linguistic features

"Hello, world." decomposed hierarchically:
• Sentence: length, ...        ("Hello, world.")
• Phrase: intonation, ...      ("Hello," | "world.")
• Word: POS, grammar, ...      (hello | world)
• Syllable: stress, tone, ...  (h-e2 | l-ou1 | w-er1-l-d)
• Phone: voicing, manner, ...  (h, e, l, ou, w, er, l, d)

→ Based on knowledge about spoken language
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules

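As a data structure, the hierarchy might look like the following sketch; the field names are illustrative, not any standard linguistic-feature format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phone:
    symbol: str            # e.g., "l"
    voiced: bool           # phone-level features: voicing, manner, ...

@dataclass
class Syllable:
    phones: List[Phone]
    stress: int            # syllable-level features: stress, tone, ...

@dataclass
class Word:
    syllables: List[Syllable]
    pos: str               # word-level features: POS, grammar, ...

# "hello" = h-e2 . l-ou1: two syllables, primary stress on the second.
hello = Word(
    syllables=[
        Syllable([Phone("h", False), Phone("e", True)], stress=2),
        Syllable([Phone("l", True), Phone("ou", True)], stress=1),
    ],
    pos="UH",  # illustrative interjection tag
)
```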
Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):
• Vocal source: pulse train (voiced) / white noise (unvoiced); fundamental frequency; aperiodicity, voicing, ...
• Vocal tract filter: cepstrum, LPC, ... (frequency transfer characteristics)
• Excitation e(n) is filtered by h(n): x(n) = h(n) * e(n), with overlap/shift windowing

→ Needs to solve an inverse problem
• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features

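A minimal source-filter sketch of x(n) = h(n) * e(n), with toy values standing in for parameters that a real vocoder would estimate from speech:

```python
import numpy as np

fs, f0, dur = 16000, 120.0, 0.05           # sample rate, fundamental freq, 50 ms

# Voiced source: pulse train at the fundamental frequency
# (swap in white noise, np.random.randn, for unvoiced frames).
n = np.arange(int(fs * dur))
period = int(fs / f0)
e = np.zeros(len(n))
e[::period] = 1.0

# Toy vocal-tract filter: a decaying impulse response standing in for h(n);
# a real system derives h(n) from cepstra / LPC estimated from the signal.
h = np.exp(-np.arange(64) / 8.0)

x = np.convolve(e, h)[: len(n)]             # x(n) = h(n) * e(n)
```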
Representation – Mapping

Rule-based, formant synthesis [1]:

$$\hat{O} = \arg\max_{O} p(\mathcal{X} \mid O)$$ (vocoder analysis)
$$\hat{L} = \arg\max_{L} p(L \mid \mathcal{W})$$ (text analysis)
$$\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$$ (extract rules)
$$\hat{l} = \arg\max_{l} p(l \mid w)$$ (text analysis)
$$\hat{o} = \arg\max_{o} p(o \mid \hat{l}, \hat{\lambda})$$ (apply rules)
$$\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$$ (vocoder synthesis)

→ Hand-crafted rules on knowledge-based features

Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]:

$$\hat{O} = \arg\max_{O} p(\mathcal{X} \mid O)$$ (vocoder analysis)
$$\hat{L} = \arg\max_{L} p(L \mid \mathcal{W})$$ (text analysis)
$$\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)$$ (train HMMs)
$$\hat{l} = \arg\max_{l} p(l \mid w)$$ (text analysis)
$$\hat{o} = \arg\max_{o} p(o \mid \hat{l}, \hat{\lambda})$$ (parameter generation)
$$\bar{x} \sim f_x(\hat{o}) = p(x \mid \hat{o})$$ (vocoder synthesis)

→ Replace rules by an HMM-based generative acoustic model


HMM-based generative acoustic model for TTS

• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

$$p(o \mid l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} p(o_t \mid q_t, \lambda)\, P(q \mid l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_{q_t}, \Sigma_{q_t})\, P(q \mid l, \lambda)$$

where q_t is the hidden state at time t (discrete) and o_t is the acoustic feature vector at time t (continuous); linguistic features l_1 ... l_N align to the state sequence q, which emits o_1 ... o_T.

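The sum over all state sequences q is exactly what the forward algorithm computes. A minimal sketch with diagonal-covariance Gaussian emissions and toy parameters (not from the talk):

```python
import numpy as np

def log_gauss(o, mu, var):
    """log N(o; mu, diag(var)) for one frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def log_likelihood(O, log_pi, log_A, mu, var):
    """log p(o | l, lambda) = log sum_q prod_t N(o_t; mu_qt, Sigma_qt) P(q | l, lambda)."""
    T, S = len(O), len(log_pi)
    alpha = log_pi + np.array([log_gauss(O[0], mu[s], var[s]) for s in range(S)])
    for t in range(1, T):
        emit = np.array([log_gauss(O[t], mu[s], var[s]) for s in range(S)])
        # log-sum-exp over previous states for each current state
        alpha = emit + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

# Toy 3-state left-to-right HMM over 2-dim acoustic features
S, D = 3, 2
log_pi = np.log(np.array([1.0, 1e-10, 1e-10]))
log_A = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.6, 0.4],
                         [0.0, 0.0, 1.0]]) + 1e-10)
mu, var = np.random.randn(S, D), np.ones((S, D))
print(log_likelihood(np.random.randn(10, D), log_pi, log_A, mu, var))
```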
HMM-based generative acoustic model for TTS

• Non-smooth, step-wise statistics → smoothing is essential
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
• Data fragmentation → ineffective, local representation

A lot of research has been done to address these issues.

Alternative acoustic model

• HMM: handles variable length & alignment
• Decision tree: maps linguistic → acoustic

Regression tree: linguistic features l → statistics of acoustic features o

Replace the tree w/ a general-purpose regression model → artificial neural network

FFNN-based acoustic model for TTS [6]

Input: frame-level linguistic feature l_t. Target: frame-level acoustic feature o_t.

$$h_t = g\,(W_{hl}\, l_t + b_h), \qquad \hat{o}_t = W_{oh}\, h_t + b_o$$
$$\hat{\lambda} = \arg\min_{\lambda} \sum_t \lVert o_t - \hat{o}_t \rVert^2, \qquad \lambda = \{W_{hl}, W_{oh}, b_h, b_o\}$$
$$\hat{o}_t \approx \mathbb{E}\,[\,o_t \mid l_t\,]$$

→ Replace decision trees & Gaussian distributions

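A numpy sketch of the forward pass and one output-layer gradient step (toy dimensions; full backpropagation omitted):

```python
import numpy as np

# One-hidden-layer FFNN acoustic model: h_t = g(W_hl l_t + b_h), o^_t = W_oh h_t + b_o.
rng = np.random.default_rng(0)
D_l, D_h, D_o = 300, 256, 40                 # linguistic, hidden, acoustic dims
W_hl, b_h = rng.normal(0, 0.01, (D_h, D_l)), np.zeros(D_h)
W_oh, b_o = rng.normal(0, 0.01, (D_o, D_h)), np.zeros(D_o)

def forward(l_t):
    h_t = np.tanh(W_hl @ l_t + b_h)          # g = tanh
    return W_oh @ h_t + b_o, h_t             # o^_t approximates E[o_t | l_t]

l_t, o_t = rng.normal(size=D_l), rng.normal(size=D_o)
o_hat, h_t = forward(l_t)

# One SGD step on ||o_t - o^_t||^2 w.r.t. the output layer.
g_o = 2 * (o_hat - o_t)
W_oh -= 0.01 * np.outer(g_o, h_t)
b_o -= 0.01 * g_o
```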
RNN-based acoustic model for TTS [7]

Recurrent connections carry the hidden state forward:

$$h_t = g\,(W_{hl}\, l_t + W_{hh}\, h_{t-1} + b_h), \qquad \hat{o}_t = W_{oh}\, h_t + b_o$$
$$\hat{\lambda} = \arg\min_{\lambda} \sum_t \lVert o_t - \hat{o}_t \rVert^2, \qquad \lambda = \{W_{hl}, W_{hh}, W_{oh}, b_h, b_o\}$$

FFNN: $\hat{o}_t \approx \mathbb{E}[o_t \mid l_t]$ vs. RNN: $\hat{o}_t \approx \mathbb{E}[o_t \mid l_1, \ldots, l_t]$

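The only change from the FFNN sketch is the recurrent term W_hh h_{t-1}, which makes each prediction depend on the whole left context:

```python
import numpy as np

rng = np.random.default_rng(0)
D_l, D_h, D_o = 300, 256, 40
W_hl = rng.normal(0, 0.01, (D_h, D_l))
W_hh = rng.normal(0, 0.01, (D_h, D_h))       # the new recurrent weights
W_oh = rng.normal(0, 0.01, (D_o, D_h))
b_h, b_o = np.zeros(D_h), np.zeros(D_o)

h = np.zeros(D_h)
for l_t in rng.normal(size=(20, D_l)):        # 20 frames of linguistic features
    h = np.tanh(W_hl @ l_t + W_hh @ h + b_h)  # h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
    o_hat = W_oh @ h + b_o                    # o^_t approximates E[o_t | l_1..l_t]
```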
NN-based generative acoustic model for TTS

• Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
• Data fragmentation → distributed representation [10]

NN-based approach is now mainstream in research & products
• Models: FFNN [6], MDN [11], RNN [7], highway network [12], GAN [13]
• Products: e.g., Google [14]


NN-based generative model for TTS

[Figure: the speech production process again, with its stages labeled.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

Knowledge-based features → Learned features

Unsupervised feature learning:
• Acoustic: learn features o(t) with an auto-encoder over raw FFT spectra x(t)
• Linguistic: learn features l(n) from 1-hot word representations w(n) (e.g., "hello", "world")

Results so far:
• Speech: auto-encoder at FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] → less positive

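A toy sketch of the acoustic side, a single-bottleneck auto-encoder over spectra whose code replaces hand-designed cepstra (dimensions and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 513, 64                                  # FFT bins, bottleneck size
W_e = rng.normal(0, 0.01, (B, D))               # encoder weights
W_d = rng.normal(0, 0.01, (D, B))               # decoder weights

def encode(x):                                  # learned acoustic feature o(t)
    return np.tanh(W_e @ x)

def decode(o):                                  # reconstruction of the spectrum
    return W_d @ o

x = np.abs(rng.normal(size=D))                  # stand-in for one spectrum frame
loss = np.mean((decode(encode(x)) - x) ** 2)    # training minimizes this
```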
Relax approximation: Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

$$\begin{cases} \hat{O} = \arg\max_{O} p(\mathcal{X} \mid O) \\ \hat{\lambda} = \arg\max_{\lambda} p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda) \end{cases} \quad\Longrightarrow\quad \{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(\mathcal{X} \mid O)\, p(O \mid \hat{L}, \lambda)\, p(\lambda)$$

Joint source-filter & acoustic model optimization:
• HMM [18, 19, 20]
• NN [21, 22]

Relax approximation: Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: a pulse train p_n and the speech signal s_n pass through the analysis filters G(z) and H_u^{-1}(z) (with unit delays z^{-1}) to form an error signal e_n; derivatives w.r.t. the voiced/unvoiced cepstral outputs o_t^{(v)}, o_t^{(u)} are back-propagated into the LSTM-RNN that predicts them from linguistic features l_t, so cepstral analysis and acoustic model are trained jointly.]

Relax approximation: Direct mapping from linguistic to waveform

No explicit acoustic features:

$$\{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(\mathcal{X} \mid O)\, p(O \mid \hat{L}, \lambda)\, p(\lambda) \quad\Longrightarrow\quad \hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \hat{L}, \lambda)\, p(\lambda)$$

Generative models for raw audio:
• LPC [23]
• WaveNet [24]
• SampleRNN [25]

WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals, with x = {x_0, x_1, ..., x_{N-1}} the raw waveform:

$$p(x \mid \lambda) = p(x_0, x_1, \ldots, x_{N-1} \mid \lambda) = \prod_{n=0}^{N-1} p(x_n \mid x_0, \ldots, x_{n-1}, \lambda)$$

WaveNet [24] → p(x_n | x_0, ..., x_{n-1}, λ) is modeled by a convolutional NN.

Key components:
• Causal dilated convolution: capture long-term dependency
• Gated convolution + residual + skip: powerful non-linearity
• Softmax at output: classification rather than regression

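The AR factorization turned into a generation loop; `net` below is a random stand-in for the WaveNet stack, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, R = 256, 1024                                # quantization levels, receptive field

def net(context):
    """Placeholder for the conv net: logits over Q categories given the context."""
    return rng.normal(size=Q)

x = []
for n in range(16000):                          # one second at 16 kHz
    logits = net(x[-R:])                        # condition on up to R past samples
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # softmax -> categorical distribution
    x.append(rng.choice(Q, p=p))                # sample x_n, feed it back as input
```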
WaveNet – Causal dilated convolution

100ms at 16kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM.

Dilated convolution: exponentially increases the receptive field size w.r.t. the number of layers.

[Figure: inputs x_{n-16}, ..., x_{n-1} feeding three hidden layers of causal dilated convolutions up to the output p(x_n | x_{n-1}, ..., x_{n-16}).]

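A sketch of a kernel-size-2 causal dilated convolution, stacked with the dilations from the figure to reach a 16-sample receptive field:

```python
import numpy as np

def causal_dilated_conv(x, w, d):
    """y[n] = w[0]*x[n-d] + w[1]*x[n]; the past is zero-padded so y stays causal."""
    pad = np.concatenate([np.zeros(d), x])
    return w[0] * pad[:-d] + w[1] * x

# Dilations 1, 2, 4, 8 (as in the figure): receptive field doubles per layer.
x = np.random.randn(100)
for d in [1, 2, 4, 8]:
    x = causal_dilated_conv(x, np.array([0.5, 0.5]), d)

rf = 1 + sum([1, 2, 4, 8])   # = 16 samples, matching p(x_n | x_{n-1..n-16})
```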
WaveNet – Non-linearity

[Figure: a stack of 30 residual blocks. In each block the input goes through a 2×1 dilated convolution, a gated activation, and a 1×1 convolution back onto the residual path; a second 1×1 convolution feeds the skip connections. The summed skips pass through ReLU → 1×1 conv → ReLU → 1×1 conv → softmax over 256 categories to give p(x_n | x_0, ..., x_{n-1}); channel widths of 256 and 512 appear in the figure labels.]

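One residual block as a numpy sketch; random weights stand in for learned parameters, with the 2×1 dilated convolution written as a matmul over the stacked [x_{t-1}, x_t] channels:

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 256, 100
x = rng.normal(size=(C, T))                       # block input (channels x time)

# "2x1 dilated" conv (dilation 1 shown): fuse filter + gate into one 512-wide output.
W = rng.normal(0, 0.05, (2 * C, 2 * C))
x_past = np.concatenate([np.zeros((C, 1)), x[:, :-1]], axis=1)
pre = W @ np.concatenate([x_past, x])             # (2C, T) pre-activation

# Gated activation: tanh half elementwise-multiplied by sigmoid half.
z = np.tanh(pre[:C]) * (1.0 / (1.0 + np.exp(-pre[C:])))

W_res = rng.normal(0, 0.05, (C, C))               # 1x1 conv onto the residual path
W_skip = rng.normal(0, 0.05, (C, C))              # 1x1 conv onto the skip path
residual_out = x + W_res @ z                      # input to the next block
skip_out = W_skip @ z                             # summed across blocks -> softmax head
```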
WaveNet – Softmax

[Figure, in three steps: an analog audio signal (amplitude vs. time) is sampled & quantized; each sample's amplitude becomes a category index, so the per-sample output is a categorical distribution, i.e., a histogram over amplitude categories that can be unimodal, multimodal, skewed, ...]

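The slides show samples mapped to category indices; in the WaveNet paper [24] this is done with 8-bit μ-law companding, which spends resolution where small amplitudes are common. A sketch:

```python
import numpy as np

MU = 255  # mu-law parameter: maps amplitudes in [-1, 1] onto 256 categories

def mu_law_encode(x):
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)  # compand to [-1, 1]
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)          # -> {0, ..., 255}

def mu_law_decode(q):
    y = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU      # expand back

x = 0.3 * np.sin(np.linspace(0, 20, 1000))
assert np.max(np.abs(mu_law_decode(mu_law_encode(x)) - x)) < 0.02
```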
WaveNet – Conditional modelling

[Figure: linguistic features l are embedded (h_n at time n) and, via 1×1 convolutions, added as a conditioning input to the gated activation of every residual block, so the stack models p(x_n | x_0, ..., x_{n-1}, l).]

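Continuing the residual-block sketch above, conditioning amounts to projecting the embedding and adding it to both halves of the gated pre-activation (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 256
pre = rng.normal(size=2 * C)                     # dilated-conv output at time n
h_n = rng.normal(size=64)                        # embedded linguistic features at n
V = rng.normal(0, 0.05, (2 * C, 64))             # the 1x1 conditioning projection

pre = pre + V @ h_n                              # inject the condition
z = np.tanh(pre[:C]) * (1.0 / (1.0 + np.exp(-pre[C:])))  # gated activation as before
```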
WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27, 18]:
• Stationary process w/ fixed-length analysis window → estimate the model within a 20–30ms window w/ 5–10ms shift
• Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
• Gaussian process → assumes speech signals are normally distributed

WaveNet:
• Sample-by-sample, non-linear, can take additional inputs
• Arbitrary-shaped signal distribution

SOTA subjective naturalness w/ WaveNet-based TTS [24]
[Figure: subjective naturalness scores, increasing from HMM to LSTM to concatenative to WaveNet.]

Relax approximation: Towards Bayesian end-to-end TTS

Integrated end-to-end:

$$\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \hat{L}, \lambda)\, p(\lambda) \quad\Longrightarrow\quad \hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \mathcal{W}, \lambda)\, p(\lambda)$$

Text analysis is integrated into the model.

Relax approximation: Towards Bayesian end-to-end TTS

Bayesian end-to-end:

$$\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \mathcal{W}, \lambda)\, p(\lambda), \qquad \bar{x} \sim f_x(w, \hat{\lambda}) = p(x \mid w, \hat{\lambda})$$

becomes

$$\bar{x} \sim f_x(w, \mathcal{X}, \mathcal{W}) = p(x \mid w, \mathcal{X}, \mathcal{W}) = \int p(x \mid w, \lambda)\, p(\lambda \mid \mathcal{X}, \mathcal{W})\, \mathrm{d}\lambda \approx \frac{1}{K} \sum_{k=1}^{K} p(x \mid w, \hat{\lambda}_k) \quad \text{(ensemble)}$$

Marginalize model parameters & architecture.

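The ensemble line, made concrete: average the predictive distributions of K independently trained models. A toy sketch with categorical per-sample predictives standing in for full models:

```python
import numpy as np

K, Q = 8, 256   # ensemble size, quantization levels

def model_predictive(k):
    """Stand-in for p(x_n | w, lambda^_k): one categorical over Q sample values."""
    rng = np.random.default_rng(k)               # each member trained differently
    logits = rng.normal(size=Q)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# (1/K) sum_k p(x | w, lambda^_k): the Monte-Carlo ensemble approximation.
p_bar = np.mean([model_predictive(k) for k in range(K)], axis=0)
```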

Generative model-based text-to-speech synthesis

• Bayes formulation + factorization + approximations
• Representation: acoustic features, linguistic features, mapping
  Mapping: rules → HMM → NN
  Features: engineered → unsupervised, learned
• Fewer approximations: joint training, direct waveform modelling; moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative ≤ generative
Flexibility: concatenative ≪ generative (e.g., multiple speakers)

Beyond "text"-to-speech synthesis

TTS on conversational assistants:
• Texts aren't fully self-contained
• Need more context:
  Location, to resolve homographs
  User query, to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models.

Beyond "generative" TTS

Generative model-based TTS:
• The model represents the process behind speech production
  Trained to minimize error against human-produced speech
  Learned model → speaker
• Speech is for communication
  Goal: maximize the amount of information received
  The "listener" is missing → a "listener" in training / in the model itself?

Thanks!


References

[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf. Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. http://research.google.com/pubs/pub44630.html. Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representation. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D(10), 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, pages 3925–3928, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, pages 203–206, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ, 2010. (in Japanese).

[Figure: a row of graphical models tracing the talk's progression: (1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization, e.g., statistical parametric TTS; (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]
