Statistical Parametric Speech Synthesis: From HMM to LSTM-RNN Heiga Zen Google July 9th, 2015

Outline

• Basics of HMM-based speech synthesis
  − Background
  − HMM-based speech synthesis
• Advanced topics in HMM-based speech synthesis
  − Flexibility
  − Improve naturalness
• Neural network-based speech synthesis
  − Feed-forward neural network (DNN & DMDN)
  − Recurrent neural network (RNN & LSTM-RNN)
  − Results

Lecturer

• Heiga Zen

• PhD from Nagoya Institute of Technology, Japan (2006)

• Intern, IBM T.J. Watson Research, New York (2004–2005)

• Research engineer, Toshiba Research Europe, Cambridge (2009–2011)

• Research scientist, Google, London (2011–Present)

Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR): Speech (real-valued time series) → Text (discrete symbol sequence)
• Statistical machine translation (SMT): Text (discrete symbol sequence) → Text (discrete symbol sequence)
• Text-to-speech synthesis (TTS): Text (discrete symbol sequence) → Speech (real-valued time series)

Speech production process

[Figure: speech production as modulation of a carrier wave by speech information. Text (concept) controls the frequency transfer characteristics, the voiced/unvoiced decision, and the fundamental frequency. Air flow drives the sound source (voiced: pulse, unvoiced: noise), whose output is shaped by the frequency transfer characteristics to produce speech.]

Typical flow of TTS system

TEXT
→ Text analysis (NLP frontend, discrete ⇒ discrete)
  − Sentence segmentation
  − Word segmentation
  − Text normalization
  − Part-of-speech tagging
  − Pronunciation
→ Speech synthesis (speech backend, discrete ⇒ continuous)
  − Prosody prediction
  − Waveform generation
→ SYNTHESIZED SPEECH

This presentation mainly talks about the backend.

Concatenative, unit selection speech synthesis

[Figure: lattice of all candidate segments, scored by target cost and concatenation cost]

• Concatenate actual instances of speech from database
• Large data + automatic learning → High-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventory per unit → unit selection synthesis [2]
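The target cost and concatenation cost above suggest a shortest-path search over candidate units. The following is a minimal sketch of that idea, assuming precomputed cost tables; the function name `select_units` and the toy costs are hypothetical and not taken from [2].

```python
import numpy as np

def select_units(target_cost, concat_cost):
    """Pick one candidate per position by dynamic programming.

    target_cost[i][k]: cost of candidate k at position i.
    concat_cost[i][j][k]: cost of joining candidate j at i to candidate k at i+1.
    """
    best = np.asarray(target_cost[0], dtype=float)
    back = []
    for i in range(1, len(target_cost)):
        total = (best[:, None]
                 + np.asarray(concat_cost[i - 1], dtype=float)
                 + np.asarray(target_cost[i], dtype=float)[None, :])
        back.append(np.argmin(total, axis=0))   # best predecessor per candidate
        best = np.min(total, axis=0)
    # Trace back from the cheapest final candidate.
    path = [int(np.argmin(best))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(np.min(best))

# Toy example: 3 positions, 2 candidates each, switching candidates costs 5.
target = [[1, 2], [2, 1], [1, 3]]
concat = [[[0, 5], [5, 0]], [[0, 5], [5, 0]]]
path, cost = select_units(target, concat)
```

In this toy case the search keeps candidate 0 throughout, because the concatenation cost of switching outweighs the small target-cost gain at the middle position.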

Statistical parametric speech synthesis (SPSS) [3]

[Figure: training path: speech → speech analysis → y and text → text analysis → x, both fed to model training, giving $\hat{\lambda}$. Synthesis path: text → text analysis → x → parameter generation → $\hat{y}$ → speech synthesis → speech.]

Training
• Extract linguistic features x & acoustic features y
• Train acoustic model λ given (x, y)

$$\hat{\lambda} = \arg\max_{\lambda} p(y \mid x, \lambda)$$

Synthesis
• Extract x from text to be synthesized
• Generate most probable y from $\hat{\lambda}$, then reconstruct waveform

$$\hat{y} = \arg\max_{y} p(y \mid x, \hat{\lambda})$$

Statistical parametric speech synthesis (SPSS) [3]

• Vocoded speech (buzzy or muffled)
• Small footprint

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]

HMM-based speech synthesis [4]

[Figure: system overview.
Training part: SPEECH DATABASE → speech signal → excitation parameter extraction and spectral parameter extraction → excitation parameters and spectral parameters, together with labels → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters and spectral parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]


Source-filter model

[Figure: source excitation part (pulse train or white noise) → excitation e(n) → vocal tract resonance part (linear time-invariant system h(n)) → speech x(n)]

$$x(n) = h(n) * e(n)$$

↓ Fourier transform

$$X(e^{j\omega}) = H(e^{j\omega})\, E(e^{j\omega})$$

$H(e^{j\omega})$ should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
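The relation between convolution in time and multiplication in frequency can be checked numerically. This is a small sketch under assumed values: the sampling rate, F0, and the single-resonance impulse response are illustrative choices, not parameters from the slides.

```python
import numpy as np

fs = 16000                      # sampling rate in Hz (assumed)
f0 = 100                        # fundamental frequency in Hz (assumed)
n = 1024                        # number of excitation samples

# Voiced source: unit pulses spaced one pitch period apart.
e = np.zeros(n)
e[::fs // f0] = 1.0

# Toy vocal-tract impulse response: one decaying resonance near 500 Hz.
t = np.arange(64)
h = (0.95 ** t) * np.cos(2 * np.pi * 500 * t / fs)

# Time domain: x(n) = h(n) * e(n), a linear convolution.
x = np.convolve(h, e)

# Frequency domain: X = H E, with an FFT long enough to avoid wrap-around.
nfft = len(x)
X = np.fft.rfft(x, nfft)
HE = np.fft.rfft(h, nfft) * np.fft.rfft(e, nfft)
same = np.allclose(X, HE)
```

Replacing the pulse train with `np.random.randn(n)` gives the unvoiced (white noise) case with the same filter.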

Parametric models of speech signal

Autoregressive (AR) model

$$H(z) = \frac{K}{1 - \sum_{m=0}^{M} c(m)\, z^{-m}}$$

Exponential (EX) model

$$H(z) = \exp \sum_{m=0}^{M} c(m)\, z^{-m}$$

Estimate model parameters based on ML

$$\hat{c} = \arg\max_{c} p(x \mid c)$$

• p(x | c): AR model → Linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
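For the exponential model, the log magnitude spectrum is a cosine series in the coefficients c(m), since log|H(e^{jω})| = Σ c(m) cos(ωm). The sketch below evaluates it on a frequency grid; the function name and the coefficient values are made up for illustration, and no frequency warping (mel scale) is applied.

```python
import numpy as np

def ex_log_magnitude_db(c, n_bins=257):
    """Log magnitude (dB) of H(z) = exp(sum_m c(m) z^-m) on the unit circle."""
    c = np.asarray(c, dtype=float)
    omega = np.linspace(0.0, np.pi, n_bins)     # 0 .. Nyquist
    m = np.arange(len(c))
    # log|H(e^{jw})| = Re{sum_m c(m) e^{-jwm}} = sum_m c(m) cos(wm)
    log_mag = np.cos(np.outer(omega, m)) @ c
    return 20.0 / np.log(10.0) * log_mag        # natural log -> dB

c = [1.0, 0.5, -0.2]                            # illustrative cepstral coefficients
spec = ex_log_magnitude_db(c)
```

Because the model is linear in c(m) in the log domain, a low-order c gives a smooth spectral envelope.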

Examples of speech spectra

[Figure: log magnitude (dB) versus frequency (0–5 kHz) for (a) ML-based cepstral analysis and (b) linear prediction]


Structure of state-output (observation) vectors

The observation vector $o_t$ consists of a spectrum part and an excitation part.

Spectrum part
• $c_t$: mel-cepstral coefficients
• $\Delta c_t$: Δ mel-cepstral coefficients
• $\Delta^2 c_t$: ΔΔ mel-cepstral coefficients

Excitation part
• $p_t$: log F0
• $\delta p_t$: Δ log F0
• $\delta^2 p_t$: ΔΔ log F0
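A minimal sketch of how static, Δ, and ΔΔ features could be stacked into one observation vector per frame. It assumes a simple backward difference, with the frame before the first one taken as zero; real systems often use wider regression windows, and the function name is hypothetical.

```python
import numpy as np

def add_dynamic_features(c):
    """Stack static, delta, and delta-delta features frame by frame.

    c: array of shape (T, M). Uses delta[t] = c[t] - c[t-1],
    treating the frame before the first one as zero.
    """
    c = np.asarray(c, dtype=float)
    zero = np.zeros((1, c.shape[1]))
    delta = c - np.vstack([zero, c[:-1]])
    delta2 = delta - np.vstack([zero, delta[:-1]])
    return np.hstack([c, delta, delta2])        # shape (T, 3M)

c = np.array([[1.0], [3.0], [6.0]])             # toy 1-dimensional static track
o = add_dynamic_features(c)
```

The same function applies to the log F0 track; handling of unvoiced frames is left out of this sketch.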

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial state probability $\pi_1$, self-transitions $a_{11}$, $a_{22}$, $a_{33}$, forward transitions $a_{12}$, $a_{23}$, and state-output distributions $b_1(o_t)$, $b_2(o_t)$, $b_3(o_t)$.
Observation sequence O = $o_1, o_2, o_3, o_4, o_5, \ldots, o_T$
State sequence Q = 1, 1, 1, 1, 2, ..., 2, 3, ..., 3]

Multi-stream HMM structure

$$b_j(o_t) = \prod_{s=1}^{S} \left( b_j^s(o_t^s) \right)^{w_s}$$

Spectrum
• Stream 1: $o_t^1 = [c_t, \Delta c_t, \Delta^2 c_t]$, output probability $b_j^1(o_t^1)$

Excitation
• Stream 2: $o_t^2 = p_t$, output probability $b_j^2(o_t^2)$
• Stream 3: $o_t^3 = \delta p_t$, output probability $b_j^3(o_t^3)$
• Stream 4: $o_t^4 = \delta^2 p_t$, output probability $b_j^4(o_t^4)$
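In the log domain the stream product becomes a weighted sum, log b_j(o_t) = Σ_s w_s log b_j^s(o_t^s). The sketch below assumes every stream is a single diagonal-covariance Gaussian, which is a simplification (F0 streams usually need special treatment of unvoiced frames); all names and values are illustrative.

```python
import numpy as np

def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    o, mean, var = (np.asarray(a, dtype=float) for a in (o, mean, var))
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mean) ** 2 / var)

def multistream_log_prob(streams, params, weights):
    """log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s)."""
    return sum(w * log_gauss_diag(o, mean, var)
               for o, (mean, var), w in zip(streams, params, weights))

# Toy state with a 3-dim spectrum stream and a 1-dim excitation stream.
streams = [np.zeros(3), np.zeros(1)]
params = [(np.zeros(3), np.ones(3)), (np.zeros(1), np.ones(1))]
weights = [1.0, 0.5]
log_b = multistream_log_prob(streams, params, weights)
```

Setting a stream weight to zero removes that stream from the state-output probability, which is one way the weights can be used to balance spectrum and excitation.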

Training process

[Figure: training flow starting from data & labels]

monophone (context-independent, CI)
• Compute variance floor (HCompV)
• Initialize CI-HMMs by segmental k-means (HInit)
• Reestimate CI-HMMs by EM algorithm (HRest & HERest)
• Copy CI-HMMs to CD-HMMs (HHEd CL)

fullcontext (context-dependent, CD)
• Reestimate CD-HMMs by EM algorithm (HERest)
• Decision tree-based clustering (HHEd TB)
• Reestimate CD-HMMs by EM algorithm (HERest)
• Untie parameter tying structure (HHEd UT)
• Estimate CD-dur. models from FB stats (HERest)
• Decision tree-based clustering (HHEd TB)

Outputs: estimated HMMs, estimated dur models

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase
• ...

Impossible to have all possible models

Decision tree-based state clustering [7]

[Figure: binary decision tree with yes/no context questions such as L=voice?, R=silence?, L="w"?, and L="gy"?. States of context-dependent models such as k-a+b, t-a+n, w-a+t, w-a+sil, gy-a+sil, w-a+sh, g-a+sil, and gy-a+pau are clustered into leaf nodes; the leaf nodes provide the synthesized states.]
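A small sketch of how a clustered tree maps any context label, seen or unseen, to a leaf. The questions are the ones named in the figure, but the tree shape, the voiced-phone set, and the leaf names are hypothetical, since the figure layout does not survive in this text.

```python
# Illustrative set of voiced phones (an assumption, not from the slides).
VOICED = {"a", "i", "u", "e", "o", "b", "d", "g", "gy", "w", "n", "m"}

# Internal node: (question, yes_subtree, no_subtree). Leaf: a string.
TREE = (
    lambda left, right: right in {"sil", "pau"},                 # R=silence?
    (lambda left, right: left == "gy", "leaf_1", "leaf_2"),      # L="gy"?
    (lambda left, right: left in VOICED,                         # L=voice?
     (lambda left, right: left == "w", "leaf_3", "leaf_4"),      # L="w"?
     "leaf_5"),
)

def find_leaf(label, tree=TREE):
    """Walk the tree for a label of the form left-center+right."""
    left, rest = label.split("-")
    _, right = rest.split("+")
    node = tree
    while not isinstance(node, str):
        question, yes, no = node
        node = yes if question(left, right) else no
    return node

leaves = {lab: find_leaf(lab) for lab in
          ["k-a+b", "t-a+n", "w-a+t", "w-a+sh", "w-a+sil", "g-a+sil", "gy-a+sil", "gy-a+pau"]}
```

A context that never occurred in training, for example `find_leaf("b-a+sil")`, still reaches a leaf, which is why clustering makes it possible to synthesize unseen contexts.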

Stream-dependent tree-based clustering

[Figure: decision trees for mel-cepstrum and decision trees for F0]

Spectrum & excitation can have different context dependency → Build decision trees individually

State duration models [8]

[Figure: time axis t = 1, ..., T = 8, with state i occupied from $t_0$ to $t_1$]

Probability to enter state i at $t_0$ then leave at $t_1 + 1$

$$\chi_{t_0,t_1}(i) \propto \sum_{j \neq i} \alpha_{t_0-1}(j)\, a_{ji}\, a_{ii}^{t_1-t_0} \prod_{t=t_0}^{t_1} b_i(o_t) \sum_{k \neq i} a_{ik}\, b_k(o_{t_1+1})\, \beta_{t_1+1}(k)$$

→ estimate state duration models

Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum and decision trees for F0, plus a state duration model with a decision tree for state dur. models]


Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words

$$\begin{aligned}
\hat{o} &= \arg\max_{o} p(o \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \hat{\lambda}) \\
&\approx \arg\max_{o} \max_{q} p(o, q \mid w, \hat{\lambda}) \\
&= \arg\max_{o} \max_{q} p(o \mid q, \hat{\lambda})\, P(q \mid w, \hat{\lambda})
\end{aligned}$$

Determine the best state sequence and outputs sequentially

$$\hat{q} = \arg\max_{q} P(q \mid w, \hat{\lambda})$$

$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda})$$

Best state sequence

[Figure: 3-state left-to-right HMM with $\pi_1$, $a_{11}$, $a_{12}$, $a_{22}$, $a_{23}$, $a_{33}$ and state-output distributions $b_1(o_t)$, $b_2(o_t)$, $b_3(o_t)$.
Observation sequence O = $o_1, o_2, o_3, o_4, o_5, \ldots, o_T$
State sequence Q = 1, 1, 1, 1, 2, ..., 2, 3, ..., 3
State duration D = 4, 10, 5]

Best state outputs w/o dynamic features

[Figure: state means and variances over time]

$\hat{o}$ becomes step-wise mean vector sequence

Using dynamic features

State output vectors include static & dynamic features

$$o_t = \left[ c_t^\top, \Delta c_t^\top \right]^\top, \qquad \Delta c_t = c_t - c_{t-1}$$

where $c_t$ has M dimensions and $o_t$ has 2M dimensions.

Relationship between static and dynamic features can be arranged as

$$
\underbrace{\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix}}_{o}
=
\underbrace{\begin{bmatrix}
\cdots & \vdots & \vdots & \vdots & \vdots & \cdots \\
\cdots & 0 & I & 0 & 0 & \cdots \\
\cdots & -I & I & 0 & 0 & \cdots \\
\cdots & 0 & 0 & I & 0 & \cdots \\
\cdots & 0 & -I & I & 0 & \cdots \\
\cdots & 0 & 0 & 0 & I & \cdots \\
\cdots & 0 & 0 & -I & I & \cdots \\
\cdots & \vdots & \vdots & \vdots & \vdots & \cdots
\end{bmatrix}}_{W}
\underbrace{\begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}}_{c}
$$

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints

$$\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \hat{\lambda}) \quad \text{subject to} \quad o = W c$$

If state-output distribution is single Gaussian

$$p(o \mid \hat{q}, \hat{\lambda}) = \mathcal{N}(o; \hat{\mu}_{\hat{q}}, \hat{\Sigma}_{\hat{q}})$$

By setting $\partial \log \mathcal{N}(W c; \hat{\mu}_{\hat{q}}, \hat{\Sigma}_{\hat{q}}) / \partial c = 0$

$$W^\top \hat{\Sigma}_{\hat{q}}^{-1} W c = W^\top \hat{\Sigma}_{\hat{q}}^{-1} \hat{\mu}_{\hat{q}}$$
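The linear system above can be solved directly for a short utterance. This is a minimal sketch assuming a 1-dimensional static feature, a single delta window Δc_t = c_t − c_{t−1} (with the frame before the first one taken as zero), and diagonal covariances; practical implementations exploit the band structure instead of calling a dense solver. Function names and the toy means and variances are illustrative.

```python
import numpy as np

def build_w(T):
    """W maps static features c (length T) to o = [c_1, dc_1, ..., c_T, dc_T]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: c_t
        W[2 * t + 1, t] = 1.0           # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def generate(mean, var):
    """Solve W' S^-1 W c = W' S^-1 mu for diagonal S."""
    T = len(mean) // 2
    W = build_w(T)
    prec = 1.0 / np.asarray(var, dtype=float)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * np.asarray(mean, dtype=float))
    return np.linalg.solve(A, b)

# Two states of 3 frames each: static means 0 then 1, delta means 0.
mean = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=float)
var = np.array([1.0, 0.1] * 6)          # static variance 1.0, delta variance 0.1
c = generate(mean, var)
```

Without the delta rows the solution would be the step-wise mean sequence 0, 0, 0, 1, 1, 1; with them, the trajectory rises gradually across the state boundary.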

Speech parameter generation algorithm [9]

[Figure: band structure of the linear system $W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$, where W contains only 0, 1, and −1 entries, $c = [c_1, c_2, \ldots, c_T]$, and $\mu_{\hat{q}} = [\mu_{q_1}, \mu_{q_2}, \ldots, \mu_{q_T}]$]

Generated speech parameter trajectory

[Figure: static and dynamic feature tracks showing state means, variances, and the generated trajectory c]


Waveform reconstruction

[Figure: the generated excitation parameter (log F0 with V/UV) controls the pulse train / white noise source that produces the excitation e(n); the generated spectral parameter (cepstrum, LSP) defines the linear time-invariant system h(n); the output is the synthesized speech x(n) = h(n) ∗ e(n)]

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Any questions?


Advantages

• Flexibility to change voice characteristics

• Small footprint

• More data

Adaptation (mimicking voice) [10]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires small data from target speaker/speaking style → Small cost to create new voices

Adaptation demo

· Speaker adaptation
  - VIP voice: GWB, BHO
  - Child voice
· Style adaptation (in Japanese)
  - Joyful
  - Sad
  - Rough

From http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Interpolation (mixing voice) [11, 12, 13, 14]

[Figure: a new HMM set λ0 placed among HMM sets λ1, λ2, λ3, λ4, with interpolation ratios I(λ0, λ1), I(λ0, λ2), I(λ0, λ3), I(λ0, λ4).
λ: HMM set
I(λ0, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice / CAT / multiple regression → estimate representative HMM sets from data
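As a rough illustration of interpolation, the sketch below mixes the Gaussian parameters of corresponding states from several HMM sets. It assumes one of the simplest possible rules, a linear combination of means and of variances with normalized ratios; the methods in [11, 12, 13, 14] may differ, and the function name is hypothetical.

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Linearly combine per-voice means and variances (an assumed, simple rule)."""
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()              # normalize ratios to sum to 1
    mean = ratios @ np.asarray(means, dtype=float)
    var = ratios @ np.asarray(variances, dtype=float)
    return mean, var

# Two voices, one 2-dimensional state each, mixed 1:3.
means = [[0.0, 0.0], [2.0, 4.0]]
variances = [[1.0, 1.0], [3.0, 1.0]]
mean, var = interpolate_gaussians(means, variances, [1.0, 3.0])
```

Sweeping the ratio from (1, 0) to (0, 1) moves the state parameters gradually from the first voice to the second, which is the effect the demos illustrate.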

Interpolation demo (1)

· Speaker interpolation (in Japanese)
  - Male & Female
· Style interpolation
  - Neutral → Angry
  - Neutral → Happy

From http://www.sp.nitech.ac.jp/ & http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Interpolation demo (2)

Speaker characteristics modification

[Figure: four panels showing weights for the 1st to 5th eigenvectors, each on a scale from −30 to +30]

From http://www.sp.nitech.ac.jp/~demo/synthesis_demo_2001/

Interpolation demo (3)

Style-control: Rough, Sad, Joyful

From http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/demo.html

Drawbacks

• Quality: buzzy, muffled synthetic speech
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Vocoding issues

• Simple pulse / noise excitation
  Difficult to model mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: the excitation switches between a pulse train e(n) for voiced segments and white noise for unvoiced segments]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power (dB) versus frequency (0–8 kHz)]
• Phase
  Important but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

MELP-style mixed excitation [15]

[Figure: log magnitude (dB) versus frequency (0–4 kHz). Pulse excitation and noise excitation are each bandpass filtered, then mixed to give the mixed excitation.]

MELP-style mixed excitation [15]

[Figure: two amplitude-versus-sample waveforms (0 to 10296 samples, amplitude from −12 to 12) with phoneme labels along the time axis]

STRAIGHT [16]

[Figure: STRAIGHT analysis and synthesis.
Analysis: waveform → F0 extraction (fixed-point analysis) → F0; F0-adaptive spectral smoothing in the time-frequency region → smoothed spectrum; aperiodic factors.
Synthesis: mixed excitation with phase manipulation → synthetic waveform.]

STRAIGHT [16]

[Figure: power (dB) versus frequency (0–8 kHz), comparing the FFT power spectrum, FFT + mel-cepstral analysis, and STRAIGHT + mel-cepstral analysis]

Trainable excitation model [17]

[Figure: a sentence HMM generates mel-cepstral coefficients $c_t$, log F0 values $p_t$, and filters $H_v(z)$, $H_u(z)$. A pulse train generator output t(n) passes through $H_v(z)$ to give the voiced excitation v(n); white noise w(n) passes through $H_u(z)$ to give the unvoiced excitation u(n). Their sum is the mixed excitation e(n), which drives H(z) to produce synthesized speech.]

Trainable excitation model [17]

[Figure: waveforms (upper) and excitation (residual) signals (lower) for natural speech, pulse/noise excitation, STRAIGHT, and ML excitation]

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech
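The weak duration modeling point follows from the self-transition probability: staying in a state for d frames has probability a_ii^(d−1) (1 − a_ii), a geometric distribution. A small sketch with an assumed self-transition probability of 0.8:

```python
import numpy as np

def hmm_duration_pmf(a_ii, max_d):
    """Implicit HMM state duration: P(d) = a_ii^(d-1) * (1 - a_ii), d = 1..max_d."""
    d = np.arange(1, max_d + 1)
    return a_ii ** (d - 1) * (1.0 - a_ii)

pmf = hmm_duration_pmf(0.8, 10)     # 0.8 is an illustrative value
```

The most probable duration is always a single frame, whatever the value of a_ii, which is a poor match for phone segments that typically last several frames.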

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM, autoregressive HMM (ARHMM)
  − Polynomial segment model, hidden trajectory model (HTM)
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model, ARHMM, linear dynamical model (LDM)
  − HTM, Gaussian process (GP)
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Trajectory HMM [18]

• Derived from HMM by imposing dynamic feature constraints
• Underlying generative model in HMM-based speech synthesis

$$p(c \mid \lambda) = \sum_{\forall q} p(c \mid q, \lambda)\, P(q \mid \lambda)$$

$$p(c \mid q, \lambda) = \mathcal{N}(c; \bar{c}_q, P_q)$$

where

$$P_q^{-1} = R_q = W^\top \Sigma_q^{-1} W, \qquad r_q = W^\top \Sigma_q^{-1} \mu_q, \qquad \bar{c}_q = P_q r_q$$

Trajectory HMM [18]

[Figure: mean trajectory $\bar{c}_q$ and temporal covariance matrix $P_q$ over time (frames 5 to 55) for the phone sequence sil a i d a sil]

Relation to HMM-based speech synthesis

• Mean vector of trajectory HMM

$$W^\top \Sigma_q^{-1} W \bar{c}_q = W^\top \Sigma_q^{-1} \mu_q$$

• Speech parameter trajectory used in HMM-based speech synthesis

$$W^\top \Sigma_q^{-1} W c = W^\top \Sigma_q^{-1} \mu_q$$

ML estimation of trajectory HMM → Make training & synthesis consistent

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of generated and natural speech]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of better AM relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Global variance [19]

[Figure: 2nd mel-cepstral coefficient versus time (0–3 sec) for natural and generated speech, with the global variance v(m) of each trajectory]

GVs of synthesized speech are typically narrower

Speech parameter generation with GV [19]

• Speech parameter generation

$$\hat{c} = \arg\max_{c} \log \mathcal{N}(W c; \mu_q, \Sigma_q)$$

• Speech parameter generation w/ GV

$$\hat{c} = \arg\max_{c} \log \mathcal{N}(W c; \mu_q, \Sigma_q) + \omega \log \mathcal{N}(v(c); \mu_v, \Sigma_v)$$

2nd term works as a penalty for oversmoothing
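The GV term can be illustrated by computing v(c) for a trajectory and scoring it under a Gaussian over global variances. This sketch assumes v(c) is the per-dimension variance within one utterance and uses a diagonal Gaussian with made-up statistics; it only evaluates the penalty and does not perform the optimization over c.

```python
import numpy as np

def global_variance(c):
    """v(c): per-dimension variance of a (T, M) trajectory within one utterance."""
    c = np.asarray(c, dtype=float)
    return np.mean((c - c.mean(axis=0)) ** 2, axis=0)

def gv_log_likelihood(c, gv_mean, gv_var):
    """log N(v(c); mu_v, Sigma_v) with diagonal Sigma_v."""
    v = global_variance(c)
    return -0.5 * np.sum(np.log(2.0 * np.pi * gv_var) + (v - gv_mean) ** 2 / gv_var)

natural_like = np.array([[1.0], [-1.0], [1.0], [-1.0]])
oversmoothed = 0.5 * natural_like               # same shape, reduced dynamic range
gv_mean, gv_var = np.array([1.0]), np.array([0.1])   # illustrative GV statistics

ll_natural = gv_log_likelihood(natural_like, gv_mean, gv_var)
ll_smooth = gv_log_likelihood(oversmoothed, gv_mean, gv_var)
```

The oversmoothed trajectory has a quarter of the variance and receives a lower GV log-likelihood, so adding this term with weight ω pushes the generated trajectory toward a natural dynamic range.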

Effect of GV

[Figure: spectra (0–8 kHz) of generated (standard), generated (w/ GV), and natural speech]

Any questions?


Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    ◦ Adaptation
    ◦ Interpolation / eigenvoice / CAT / multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

HMM-based acoustic modeling for SPSS [4]

[Figure: a decision tree of yes/no questions partitions the acoustic space]

Decision tree-clustered HMM w/ GMM state-output distributions

NN-based acoustic modeling for SPSS [20]

[Figure: feed-forward network mapping linguistic features x through hidden layers h1, h2, h3 to acoustic features y]

NN output → $E[y_t \mid x_t]$ → replace decision trees & GMMs
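A minimal sketch of the frame-level mapping from linguistic features x to acoustic features y with three hidden layers, as in the figure. It assumes sigmoid hidden units, a linear output layer, and randomly initialized weights; the layer sizes are hypothetical and no training is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, layers):
    """Feed-forward pass: sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    return h @ W + b                    # prediction of acoustic features per frame

rng = np.random.default_rng(0)
sizes = [36, 16, 16, 16, 5]             # input, h1, h2, h3, output (illustrative)
layers = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.random((10, sizes[0]))          # 10 frames of linguistic features
y = dnn_forward(x, layers)              # 10 frames of acoustic features
```

Each frame is mapped independently of its neighbours here, so any temporal smoothness has to come from the input features or from a later parameter generation step.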

Advantages of NN-based acoustic modeling for SPSS

• Integrating feature extraction
  − Efficiently model high-dimensional, highly correlated features
  − Layered architecture w/ non-linear operations
  → Integrated linguistic feature extraction to acoustic modeling
• Distributed representation
  More efficient than localist one if data has componential structure
  → Better modeling / Fewer parameters
• Layered hierarchical structure in speech production
  concept → linguistic → articulatory → vocal tract → waveform

Framework

[Figure: DNN-based synthesis framework. TEXT → text analysis → input feature extraction → input features at frames 1 to T (binary features, numeric features, duration feature from duration prediction, and frame position feature) → input layer → hidden layers → output layer → statistics (mean & var) of speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Framework

Is this new? . . . no
• NN [21]
• RNN [22]

What's the difference?
• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Experimental setup

Database               US English female speaker
Training / test data   33000 & 173 sentences
Sampling rate          16 kHz
Analysis window        25-ms width / 5-ms shift
Linguistic features    11 categorical features
                       25 numeric features
Acoustic features      0–39 mel-cepstrum
                       log F0, 5-band aperiodicity, ∆, ∆2
HMM topology           5-state, left-to-right HSMM [23],
                       MSD F0 [24], MDL [25]
DNN architecture       1–5 layers, 256/512/1024/2048 units/layer
                       sigmoid, continuous F0 [26]
Postprocessing         Postfiltering in cepstrum domain [15]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5th mel-cepstrum coefficient (vertical axis, −1 to 1) against frame index (horizontal axis, 0 to 500) for natural speech, HMM (α=1), and DNN (4×512).]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters
• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (#layers × #units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)           45.7      < 10^−6    −9.9
16.1 (4)     27.2 (4 × 512)           56.8      < 10^−6    −5.1
12.7 (1)     36.6 (4 × 1024)          50.7      < 10^−6    −11.5

Limitations of DNN-based acoustic modeling

[Figure: data samples and NN prediction plotted in the (y1, y2) plane.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [27]
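The unimodality limitation can be checked numerically. The following is a minimal sketch, assuming a toy set of bimodal targets for a single input (the values are hypothetical, not from the experiments): the constant prediction that minimizes squared error is the mean of the targets, which here falls between the two modes where no data sample lies.

```python
def mse(pred, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((pred - y) ** 2 for y in targets) / len(targets)

# Hypothetical one-to-many data: for the same input x, the target
# is realized either near -1 or near +1.
targets = [-1.1, -0.9, -1.0, 1.0, 0.9, 1.1]

# Grid search over constant predictions in [-2, 2] with step 0.01.
grid = [i / 100 for i in range(-200, 201)]
best = min(grid, key=lambda p: mse(p, targets))

# The minimizer coincides with the sample mean of the targets.
mean = sum(targets) / len(targets)
```

Here `best` lands at the mean of the targets, far from every observed sample, which is the behavior a mixture density output layer is meant to avoid.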

Mixture density network [27]

[Figure: 1-dim, 2-mix MDN. For an input x, the output layer emits w_1(x), w_2(x), µ_1(x), µ_2(x), σ_1(x), σ_2(x), which define a two-component mixture density over y.]

Inputs of activation function:
$$ z_j = \sum_{i=1}^{4} h_i w_{ij} $$

Weights → Softmax activation function:
$$ w_1(x) = \frac{\exp(z_1)}{\sum_{m=1}^{2} \exp(z_m)}, \qquad w_2(x) = \frac{\exp(z_2)}{\sum_{m=1}^{2} \exp(z_m)} $$

Means → Linear activation function:
$$ \mu_1(x) = z_3, \qquad \mu_2(x) = z_4 $$

Variances → Exponential activation function:
$$ \sigma_1(x) = \exp(z_5), \qquad \sigma_2(x) = \exp(z_6) $$

NN + mixture model (GMM) → NN outputs GMM weights, means, & variances
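The three activation functions of the mixture density output layer can be sketched in a few lines. This is an illustrative sketch for the 1-dim, 2-mix case only; the function names `mdn_output` and `gmm_density` are hypothetical, and treating σ as a standard deviation in the density is an assumption, since the slide does not say whether σ denotes a variance or a standard deviation.

```python
import math

def mdn_output(z):
    """Map six output-layer activations [z1, ..., z6] to GMM parameters:
    softmax -> weights, linear -> means, exponential -> sigmas."""
    z1, z2, z3, z4, z5, z6 = z
    m = max(z1, z2)  # subtract the max before exp for numerical stability
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    weights = [e1 / (e1 + e2), e2 / (e1 + e2)]
    means = [z3, z4]
    sigmas = [math.exp(z5), math.exp(z6)]  # always positive
    return weights, means, sigmas

def gmm_density(y, weights, means, sigmas):
    """Mixture density p(y | x); sigma is used as a standard deviation (assumption)."""
    total = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
        total += w * norm * math.exp(-0.5 * ((y - mu) / sigma) ** 2)
    return total

# Example: equal weights, means at -1 and +1, unit sigmas.
w, mu, s = mdn_output([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
density_at_zero = gmm_density(0.0, w, mu, s)
```

A network with such an output layer would typically be trained by minimizing the negative log of this density over the training frames, rather than the squared error.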

DMDN-based SPSS [28]

[Figure: DMDN-based SPSS framework. TEXT → text analysis → input feature extraction (with duration prediction) → frame-level inputs x_1, x_2, ..., x_T → deep mixture density network → w_1(x_t), w_2(x_t), µ_1(x_t), µ_2(x_t), σ_1(x_t), σ_2(x_t) for each frame t = 1, ..., T, each defining a mixture density over y → parameter generation → waveform synthesis → SPEECH.]

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture     4–7 hidden layers, 1024 units/hidden layer
                     ReLU (hidden) / Linear (output)
DMDN architecture    4 hidden layers, 1024 units/hidden layer
                     ReLU [29] (hidden) / Mixture density (output)
                     1–16 mix
Optimization         AdaDec [30] (variant of AdaGrad [31]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Model            Configuration   MOS
HMM              1 mix           3.537 ± 0.113
                 2 mix           3.397 ± 0.115
DNN              4×1024          3.635 ± 0.127
                 5×1024          3.681 ± 0.109
                 6×1024          3.652 ± 0.108
                 7×1024          3.637 ± 0.129
DMDN (4×1024)    1 mix           3.654 ± 0.117
                 2 mix           3.796 ± 0.107
                 4 mix           3.766 ± 0.113
                 8 mix           3.805 ± 0.113
                 16 mix          3.791 ± 0.102

Limitations of DNN/MDN-based acoustic modeling

Fixed time span for input features
• Fixed number of preceding / succeeding contexts
• Difficult to incorporate long time span contextual effect

Frame-by-frame mapping
• Each frame is mapped independently
• Smoothing is still essential

Preference score (%)
DNN w/ dyn   DNN w/o dyn   No pref
67.8         12.0          20.0

Recurrent connections → Recurrent NN (RNN) [32]

Simple Recurrent Network (SRN)

[Figure: SRN unrolled in time; inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between the hidden layers of adjacent frames.]

SRN-based acoustic modeling
$$ h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = \phi(W_{yh} h_t + b_y) $$

With squared loss . . .
• DNN output (prediction): $\hat{y}_t \to E[y_t \mid x_t]$
• RNN output (prediction): $\hat{y}_t \to E[y_t \mid x_1, \ldots, x_t]$
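The SRN recursion can be written out directly. The sketch below assumes f = tanh, φ = identity, and a zero initial hidden state; none of these choices is stated on the slide, and the function names are hypothetical. Weight matrices are plain nested lists so that the example needs only the standard library.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def srn_forward(xs, W_hx, W_hh, b_h, W_yh, b_y):
    """Run h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h), y_t = W_yh h_t + b_y
    over a sequence of input vectors xs and return the list of outputs."""
    h = [0.0] * len(b_h)  # h_0 = 0 (assumption)
    ys = []
    for x in xs:
        pre = [p + q + b for p, q, b in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(v) for v in pre]
        ys.append([p + b for p, b in zip(matvec(W_yh, h), b_y)])
    return ys

# One hidden unit, one input, one output; the second input is zero.
ys = srn_forward([[1.0], [0.0]], [[1.0]], [[0.5]], [0.0], [[2.0]], [0.0])
```

The second output is non-zero although the second input is zero: the prediction at frame t depends on x_1, ..., x_t through the hidden state, whereas a feed-forward network would see x_t only.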

Simple Recurrent Network (SRN)

[Figure: SRN unrolled in time; inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between the hidden layers of adjacent frames.]

• Only able to use previous contexts
  → bidirectional RNN [32]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → Quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [34]

Long short-term memory (LSTM) [34]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A memory cell c_t is surrounded by three sigmoid (sigm) gates, each fed by x_t and h_{t−1}: input gate i_t (write, bias b_i), output gate (read, bias b_o), and forget gate (reset, bias b_f). The cell input (bias b_c) and the cell output pass through tanh; the block output is h_t.]
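The write / read / reset roles of the gates can be illustrated with a single-cell LSTM step. This is a sketch of the commonly used formulation without peephole connections or projection layers, which is an assumption: the slide's figure does not specify those details. The parameter names in the dictionary are hypothetical.

```python
import math

def sigm(a):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM cell with scalar input and state."""
    i = sigm(p["w_ix"] * x + p["w_ih"] * h_prev + p["b_i"])       # input gate: write
    f = sigm(p["w_fx"] * x + p["w_fh"] * h_prev + p["b_f"])       # forget gate: reset
    o = sigm(p["w_ox"] * x + p["w_oh"] * h_prev + p["b_o"])       # output gate: read
    g = math.tanh(p["w_cx"] * x + p["w_ch"] * h_prev + p["b_c"])  # candidate cell input
    c = f * c_prev + i * g   # linear memory cell: keep old content, add gated input
    h = o * math.tanh(c)     # block output
    return h, c

# Input gate closed, forget and output gates open: the cell keeps its content
# even though a large input arrives.
keep = dict.fromkeys(
    ["w_ix", "w_ih", "w_fx", "w_fh", "w_ox", "w_oh", "w_ch", "b_c"], 0.0
)
keep.update(w_cx=1.0, b_i=-100.0, b_f=100.0, b_o=100.0)
h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=keep)
```

With the input gate closed the stored value 0.7 survives the step unchanged, which is how the cell protects information from being overwritten by new inputs.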

Advantages of RNN-based acoustic modeling for SPSS

• Model dependency between frames
  − HMM: discontinuous (step-wise) → smoothing
  − DNN: discontinuous (frame-by-frame mapping) [35] → smoothing
  − RNN: smooth [36, 35]
• Low latency
  − Unidirectional structure allows fully frame-level streaming [35]
• More efficient representation
  − RNN offers more efficient representation than DNN for time series

Synthesis pipeline

[Figure: TEXT → text analysis → linguistic feature extraction → duration prediction → acoustic feature prediction → vocoder synthesis → SPEECH.]

Duration & acoustic feature prediction blocks involve NN

Duration modeling

[Figure: linguistic structure of the word "hello" (phonemes: h, e, l, ou; syllables: h e2, l ou1) → feature functions → linguistic features (phoneme) → duration prediction LSTM → durations (targets: 9, 12, 10, 10), taken from alignments of the acoustic features.]

Feature function examples
• phoneme == ’h’ ?
• syllable stress == ’2’ ?
• # of syllables in word?

Acoustic modeling

[Figure: linguistic structure of the word "hello" (phonemes: h, e, l, ou; syllables: h e2, l ou1) → feature functions → linguistic features (phoneme) → append frame-level features → linguistic features (input) → acoustic feature prediction LSTM → acoustic features (targets).]

Append frame-level features
• Relative position of frame in phoneme
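The step from phoneme-level to frame-level linguistic features can be sketched as follows. The three feature functions mirror the examples on the duration modeling slide; their numeric encoding, the assumption that durations count frames, and the convention that the relative position runs from 0 to 1 within a phoneme are all illustrative choices, not details taken from the slides.

```python
def feature_functions(phoneme, stress, n_syllables):
    """Toy feature functions: phoneme == 'h'?, syllable stress == 2?,
    and number of syllables in the word."""
    return [
        1.0 if phoneme == "h" else 0.0,
        1.0 if stress == 2 else 0.0,
        float(n_syllables),
    ]

def expand_to_frames(phoneme_features, durations):
    """Repeat each phoneme-level vector for its duration and append the
    relative position of the frame within the phoneme."""
    frames = []
    for feats, dur in zip(phoneme_features, durations):
        for k in range(dur):
            rel_pos = k / (dur - 1) if dur > 1 else 0.0
            frames.append(list(feats) + [rel_pos])
    return frames

# "hello": phonemes h, e (syllable with stress 2) and l, ou (stress 1).
phonemes = [("h", 2), ("e", 2), ("l", 1), ("ou", 1)]
phoneme_features = [feature_functions(ph, st, 2) for ph, st in phonemes]
frames = expand_to_frames(phoneme_features, [9, 12, 10, 10])
```

The four phoneme-level vectors become 41 frame-level vectors, one per acoustic frame, which is the input sequence the acoustic feature prediction LSTM consumes.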

Streaming synthesis

[Figure (animation): streaming synthesis of the word "hello" (phonemes: h, e, l, ou; syllables: h e2, l ou1). For each phoneme in turn: feature functions → linguistic features (phoneme) → duration prediction LSTM → duration → linguistic features (frame) → acoustic feature prediction LSTM → acoustic features → waveform. The durations shown accumulate as 9, 12, 10, 10 over the four phonemes.]
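The control flow of streaming synthesis can be sketched with a generator. Both predictors below are hypothetical stand-ins (a lookup table and a trivial function), not LSTMs; the point of the sketch is only that frames are emitted as soon as the current phoneme's duration is known, before later phonemes are processed.

```python
def stream_frames(phoneme_features, predict_duration, predict_acoustic):
    """Yield acoustic features frame by frame, one phoneme at a time."""
    for feats in phoneme_features:
        dur = predict_duration(feats)  # duration model runs once per phoneme
        for k in range(dur):
            rel_pos = k / (dur - 1) if dur > 1 else 0.0
            # acoustic model runs once per frame; output is available immediately
            yield predict_acoustic(list(feats) + [rel_pos])

calls = []                                    # records which phonemes were processed
table = {"h": 9, "e": 12, "l": 10, "ou": 10}  # stand-in for the duration LSTM

def toy_duration(feats):
    calls.append(feats[0])
    return table[feats[0]]

def toy_acoustic(frame_feats):
    # stand-in for the acoustic LSTM: echo phoneme label and relative position
    return (frame_feats[0], frame_feats[-1])

stream = stream_frames([["h"], ["e"], ["l"], ["ou"]], toy_duration, toy_acoustic)
first = next(stream)               # first frame is ready here
calls_after_first = list(calls)    # only "h" has been processed so far
rest = list(stream)                # remaining frames
```

In a real system the two predictors would carry recurrent state from frame to frame, and each yielded frame would be passed on to the vocoder without waiting for the rest of the utterance.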

Data & speech analysis

Database           US English female speaker
                   34 632 utterances
Speech analysis    16 kHz sampling
                   25-ms width / 5-ms shift
Synthesis          Vocaine [?]
                   Postfiltering-based enhancement
Input              DNN: 442 linguistic features
                   ULSTM: 291 linguistic features
Target             0–39 mel-cepstrum features
                   continuous log F0 [26]
                   5-band aperiodicity
                   optionally ∆, ∆2

Training

Preprocessing     Acoustic: removed 80% silence
                  Duration: removed first/last silence
Normalization     Input: mean / standard deviations
                  Output: 0.01 – 0.99
Architecture      DNN: 4 × 1024 units, ReLU [29]
                  ULSTM: 1 × 256 cells
Output layer      Acoustic: feed-forward or recurrent
                  Duration: feed-forward
Initialization    DNN: random + layer-wise BP [?]
                  ULSTM: random
Optimization      Common: squared loss, SGD
                  DNN: GPU, AdaDec [?]
                  ULSTM: distributed CPU [?]
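The two normalization schemes can be sketched for a single feature dimension. The slide gives only "mean / standard deviations" for inputs and "0.01 – 0.99" for outputs; reading the latter as min-max scaling over the training data is an assumption, and the function names are hypothetical.

```python
def zscore(values):
    """Input normalization: subtract the mean, divide by the standard deviation.
    Assumes the values are not all identical."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def scale_to_range(values, lo=0.01, hi=0.99):
    """Output normalization: map [min, max] linearly onto [lo, hi]
    (assumed min-max scaling). Assumes the values are not all identical."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

z = zscore([1.0, 2.0, 3.0])
s = scale_to_range([0.0, 5.0, 10.0])
```

At synthesis time the inverse of the output scaling would be applied to the network's predictions before they are passed to the vocoder.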

Subjective tests

Common        100 sentences
              Crowd-sourcing
              Using headphones
MOS           7 evaluations per sample
              Up to 30 stimuli per subject
              5-scale score in naturalness (1: Bad – 5: Excellent)
Preference    5 evaluations per pair
              Up to 30 pairs per subject
              Chose preferred one or “neutral”

# of future contexts

# of future contexts   5-scale MOS
0                      3.571 ± 0.121
1                      3.751 ± 0.119
2                      3.812 ± 0.115
3                      3.779 ± 0.118
4                      3.753 ± 0.115

Preference scores

(w/, w/o: with / without dynamic features; feed-forward / recurrent: type of output layer)

DNN               ULSTM             ULSTM             Neutral
Feed-forward      Feed-forward      Recurrent
w/      w/o       w/      w/o       w/      w/o
67.8    12.0      -       -         -       -         20.0
18.4    -         34.9    -         -       -         47.6
-       -         21.0    12.2      -       -         66.8
-       -         21.8    -         -       21.0      57.2
-       -         -       16.6      -       29.2      54.2

MOS

• DNN w/ dynamic features
• ULSTM w/o dynamic features, w/ recurrent output layer

Model    # params     5-scale MOS
DNN      3,747,979    3.370 ± 0.114
ULSTM    476,435      3.723 ± 0.105

Latency

• Nexus 7 2013
• Use Advanced SIMD (NEON), single thread
• Audio buffer size: 1024
• HMM-based system used the time-recursive version w/ L = 15
• HMM & ULSTM used the same text analysis front-end

Average latency (ms)
         chars   short   long
HMM      26      123     311
ULSTM    25      55      115

Summary

Statistical parametric speech synthesis
• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements
    ◦ Vocoding
    ◦ Acoustic modeling
    ◦ Oversmoothing compensation
• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static network (DNN, DMDN) → dynamic ones (LSTM)

Google academic program

• Award programs
  − Google Faculty Research Awards
    Provides unrestricted gifts to support full-time faculty members
  − Google Focused Research Awards
    Fund specific key research areas
  − Visiting Faculty Program
    Support full-time faculty in research areas of mutual interest
• Student support programs
  − Graduate Fellowships
    Recognize outstanding graduate students
  − Internships
    Work on real-world problems with Google’s data & infrastructure

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53–A:35–42, 1970.
[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.
[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[12] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[13] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.
[14] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[15] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[16] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commun., 27:187–207, 1999.

References III

[17] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda. An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. ISCA SSW6, pages 131–136, 2007.
[18] H. Zen, K. Tokuda, and T. Kitamura. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153–173, 2007.
[19] T. Toda and K. Tokuda. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007.
[20] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[21] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[22] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[23] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[24] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

References IV

[25] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[26] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.
[27] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[28] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[29] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[30] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[31] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[32] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

References V

[33] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. Kremer and J. Kolen, editors, A field guide to dynamical recurrent neural networks. IEEE Press, 2001.
[34] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[35] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474, 2015.
[36] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014. (Submitted) http://research.microsoft.com/en-us/projects/dnntts/.
