Japanese Pronunciation Prediction as Phrasal SMT

Japanese Pronunciation Prediction as Phrasal SMT Jun Hatori (University of Tokyo) Hisami Suzuki (Microsoft Research)

In IJCNLP-2011 Chiang Mai, Thailand 2011/11/9

Task 

Predict the pronunciation of Japanese text.
◦ Input – 今日は読谷村に行った
◦ Output – (pron.) kyoo wa yomitanson ni itta



Applications
◦ Text-to-speech conversion
◦ Transliteration of proper nouns for MT & search
◦ Training data creation for input methods (a.k.a. kana-to-kanji conversion)

Japanese Orthography 

A Japanese text consists of
◦ Kanji: ideographic characters (e.g. 村 “village”)
◦ Kana: phonetic characters (e.g. に /ni/)



Kanji is the source of pronunciation ambiguity
◦ A kanji has 2.5 pronunciations on average.
◦ Frequent kanji characters tend to have many (10–20) pronunciations.

An Example of Japanese Text

国境の長いトンネルを抜けると
雪国であった。夜の底が白くなった。
The train came out of the long tunnel into the snow country. The earth lay white under the night sky.
– Snow Country (Yasunari Kawabata)

An Example of Japanese Text

国境の長いトンネルを抜けると
の: no / い: i / トンネル: ton-neru / を: o / けると: keruto

雪国であった。夜の底が白くなった。
であった: de a tta / の: no / が: ga / くなった: kuna tta

Each kana character has a unique pronunciation.

An Example of Japanese Text

国境の長いトンネルを抜けると
国: kuni, guni, kuna, koku, kok, kou
境: sakai, kyo, kei, zakai
長: naga, osa, take, na, cho, choo
抜: nu, batsu, batt, bachi

雪国であった。夜の底が白くなった。
雪: yuki, setsu, sett, sechi, susu, soso
国: kuni, guni, kuna, koku, kok, kou
夜: yoru, ya, yo
底: soko, zoko, tei, tai
白: shiro, shira, haku, byaku

Kanji characters usually have multiple pronunciations.

Ambiguity and Idiosyncrasy 

Character-level pronunciations are highly ambiguous.
◦ E.g. /kan-naoto/: 3 × 14 × 12 = 504 possibilities!

Idiosyncrasy (non-compositionality)
◦ The pronunciation of a word is commonly a meaning-based mapping of the sound of a Japanese word to a Chinese written form.
  E.g. 明日 /ashita/ “tomorrow” (mei + jitsu)

Word-level pronunciation dictionaries are essential.
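
The combinatorics behind the 504 figure above can be checked with a short sketch. The helper name `count_combinations` is hypothetical, and the assignment of the two candidate lists to 国 and 境 is an assumption based on the order of the kanji in the example sentence.

```python
from itertools import product
from math import prod

def count_combinations(candidates_per_char):
    """Number of character-level reading combinations for one word."""
    return prod(len(readings) for readings in candidates_per_char)

# Candidate counts quoted on the slide for /kan-naoto/: 3, 14 and 12.
print(count_combinations([range(3), range(14), range(12)]))  # 504

# Candidate lists from the example slide (assumed to belong to 国 and 境).
kuni = ["kuni", "guni", "kuna", "koku", "kok", "kou"]
sakai = ["sakai", "kyo", "kei", "zakai"]

# Every concatenation of one reading per character is a candidate.
combos = ["".join(pair) for pair in product(kuni, sakai)]
print(len(combos))  # 24
```

Even a two-character word yields 24 character-level candidates, which is why character-level readings alone are not enough.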

Pointwise Approach [Mori+ 10] 

Two-step approach
◦ Step 1: word segmentation
◦ Step 2: pronunciation disambiguation as SVM-based classification for each word
  E.g. “人気”: /ninki/ (popularity), /jinki/ (people’s atmosphere), /hitoke/ (sign of life)

Requires a separate model for handling OOV (out-of-vocabulary) words
◦ A simple noisy channel model with a character bigram probability is used.

Substring-based Model [Hatori+ 11]

Focuses on pronunciation prediction for OOV words.
SMT-like unsupervised (no-dictionary) approach
◦ Pronunciations are learned from parallel corpora
◦ Monotone alignments; no insertion or deletion
Single-character translation operations
Uses composed operations to capture substring-level information and context.
No mechanism to accommodate dictionary information.

Our Approach 

Dictionary-based phrasal SMT
◦ Dictionary entries as the minimal translation unit
  Entries: word- and character-level pronunciations
  For known words, word-level pronunciations are used.
  For OOV words, the pronunciation is reasonably guessed by using character-level pronunciations.
◦ A unified approach that handles sentence-level pronunciation assignment while integrating OOV pronunciation prediction as part of the whole task.

Dictionary-based Operations   

Dictionary words are the standard units.
As a back-off, character-level pronunciations are used for OOV words.
Dictionary-based alignments are obtained using our dictionary-based phrasal decoder.
◦ Unreachable operations are discarded.
◦ Makes the model more robust to noise.
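
A minimal sketch of the idea of dictionary units with a character-level back-off: it enumerates every monotone segmentation of an input into dictionary words or single characters. This is not the authors' decoder. The function name `segmentations` and the toy dictionaries are hypothetical, and no scoring is applied here.

```python
def segmentations(text, word_dict, char_dict):
    """Enumerate monotone analyses of `text` as lists of (unit, reading).

    Units are dictionary words of any length; single characters can also
    fall back to character-level readings, so OOV strings stay reachable.
    """
    if not text:
        return [[]]
    analyses = []
    for end in range(1, len(text) + 1):
        unit = text[:end]
        readings = list(word_dict.get(unit, []))
        if end == 1:
            # Character-level back-off for single characters.
            readings += [r for r in char_dict.get(unit, []) if r not in readings]
        for reading in readings:
            for rest in segmentations(text[end:], word_dict, char_dict):
                analyses.append([(unit, reading)] + rest)
    return analyses

# Toy entries for illustration only.
word_dict = {"雪国": ["yukiguni"]}
char_dict = {"雪": ["yuki", "setsu"], "国": ["kuni", "guni", "koku"]}

for analysis in segmentations("雪国", word_dict, char_dict):
    print(analysis)
```

With these toy entries the word-level analysis and the six character-level combinations are all produced; a scoring model is then needed to pick among them.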

Composed Operations

During decoding, translation operations are composed so as to maximize the overall probability.
In our current work, composed operations are compositions of dictionary words.
◦ Allows us to consider wider, phrasal context

SMT-based Framework 

Simplified SMT
◦ Monotone alignment, no insertion or deletion

Linear model: $\mathrm{Score}(s, t, \lambda) = \sum_i \lambda_i f_i(s, t)$
◦ Weights are trained with averaged perceptron.
◦ Stack decoder [Zens+ 04]

Features (component models)
◦ Bi-directional translation probability $P(t \mid s)$, $P(s \mid t)$
◦ Character 5-gram probability $P(t)$
◦ Number of phrases/characters
◦ Joint trigram probability $P(s, t)$
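
The linear model and its averaged-perceptron training can be sketched as follows. This is a simplified illustration, not the authors' implementation: candidates are given as explicit feature dictionaries rather than produced by a stack decoder, and the feature names and values are made up.

```python
from collections import defaultdict

def score(weights, feats):
    """Linear model: sum over features of weight * feature value."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

class AveragedPerceptron:
    def __init__(self):
        self.weights = defaultdict(float)
        self.totals = defaultdict(float)  # running sum of weights over steps
        self.steps = 0

    def train_step(self, candidates, gold_index):
        """One update: compare the best-scoring candidate with the gold one."""
        predicted = max(range(len(candidates)),
                        key=lambda i: score(self.weights, candidates[i]))
        if predicted != gold_index:
            for name, value in candidates[gold_index].items():
                self.weights[name] += value
            for name, value in candidates[predicted].items():
                self.weights[name] -= value
        self.steps += 1
        for name, value in self.weights.items():
            self.totals[name] += value

    def averaged(self):
        """Weights averaged over all training steps."""
        if self.steps == 0:
            return {}
        return {name: total / self.steps for name, total in self.totals.items()}

# Hypothetical feature vectors for a wrong and a correct candidate.
wrong = {"log_p_t_given_s": -2.0, "num_phrases": 2.0}
gold = {"log_p_t_given_s": -0.1, "num_phrases": 1.0}

model = AveragedPerceptron()
for _ in range(2):
    model.train_step([wrong, gold], gold_index=1)
weights = model.averaged()
print(score(weights, gold) > score(weights, wrong))  # True
```

Averaging the weights over steps is the standard way to reduce the perceptron's sensitivity to the order of training examples.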

Joint N-gram Language Model 

Joint n-gram: a language model for the sequence of translation operations [Bisani+ 04]
◦ Provides smoothed context for pronunciation disambiguation
◦ Incorporates single-kanji pronunciation dependencies into OOV pronunciation prediction
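
As an illustration of a joint n-gram, the sketch below treats each translation operation as one token, a (source unit, pronunciation) pair, and estimates a trigram model over operation sequences. The slides do not state the smoothing method, so add-k smoothing is used here as a stand-in; the class name `JointTrigram` and the training data are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

class JointTrigram:
    """Trigram model over (source, pronunciation) operations, add-k smoothed."""

    def __init__(self, k=0.1):
        self.k = k
        self.trigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    def train(self, sequences):
        for ops in sequences:
            padded = [BOS, BOS] + list(ops) + [EOS]
            self.vocab.update(padded[2:])
            for i in range(2, len(padded)):
                context = tuple(padded[i - 2:i])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1

    def logprob(self, ops):
        padded = [BOS, BOS] + list(ops) + [EOS]
        vocab_size = len(self.vocab)
        total = 0.0
        for i in range(2, len(padded)):
            context = tuple(padded[i - 2:i])
            numerator = self.trigrams[context + (padded[i],)] + self.k
            denominator = self.contexts[context] + self.k * vocab_size
            total += math.log(numerator / denominator)
        return total

# Toy operation sequences: one reading of 人気 is seen more often.
ninki = [("人気", "ninki"), ("の", "no")]
hitoke = [("人気", "hitoke"), ("の", "no")]

lm = JointTrigram()
lm.train([ninki, ninki, ninki, hitoke])
print(lm.logprob(ninki) > lm.logprob(hitoke))  # True
```

Because each token carries both the written form and its reading, the model conditions a reading on the neighbouring units and their readings at once.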

Summary of Training

(Figure: training flow diagram. Labels: ← Substring-based model, ↓ Proposed model.)

Related Work

Japanese Pronunciation Prediction
◦ SVM-based two-step approach [Mori+ 10]
◦ Substring-based word pronunciation prediction [Hatori+ 11]

Transliteration / letter-to-phoneme conversion
◦ Joint n-gram & discriminative features [Jiampojamarn+ 10]
  2–2 (source-target) substring-based alignment

Our contribution
◦ Integrating word- and character-level pronunciations based on dictionary-based alignment.
◦ Capturing larger context by the composition of word-level pronunciations (e.g. 8–24 alignment)
◦ Scalable: probabilities of the component models are obtained from the frequencies in the training corpus.

Experiment – Baseline Models  

SubStr: Substring-based model [Hatori+ 11]
SubStr+: Extended substring-based model
◦ Additionally uses joint n-gram probability and dictionary features

KyTea [Mori+ 10]
◦ A state-of-the-art Japanese pronunciation prediction system
◦ Performs SVM-based classification of word pronunciations, along with a simple OOV model

Experiment – Training Data

Dictionary (770k token pairs)
◦ UniDic (630k entries)
◦ Iwanami Dictionary (107k entries)
◦ In-house dictionary (226k entries)

Wikipedia-derived pairs (460k instances)
◦ Word-pronunciation pairs extracted by pattern matching on parentheses (noisy)

Newspaper corpus (1.4M sentences)
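
The slides do not give the pattern used for the Wikipedia-derived pairs. As one possible sketch, the regular expression below picks up a run of kanji immediately followed by a kana reading in parentheses; the pattern, the name `extract_pairs`, and the sample text are assumptions for illustration.

```python
import re

# A run of kanji, then a kana reading in full-width or ASCII parentheses.
PAIR = re.compile(
    r"([\u4E00-\u9FFF々]+)"                 # kanji (CJK Unified Ideographs, 々)
    r"[（(]"
    r"([\u3040-\u309F\u30A0-\u30FF]+)"      # hiragana or katakana
    r"[）)]"
)

def extract_pairs(text):
    """Return (word, reading) pairs found in `text`."""
    return PAIR.findall(text)

sample = "川端康成（かわばたやすなり）は小説家。雪国(ゆきぐに)"
print(extract_pairs(sample))
# [('川端康成', 'かわばたやすなり'), ('雪国', 'ゆきぐに')]
```

A surface pattern like this also matches parentheses that hold something other than a reading, which is consistent with the slide's remark that the extracted pairs are noisy.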

Experiment – Evaluation Dataset 

Nikkei/Kyodo Newspaper (News-1/2)
◦ Consisting of complete sentences

Bing Search query log (Query-1/2)
◦ General noun phrases/proper nouns

Difficult-to-pronounce word corpus (Name)
◦ Consisting mostly of person names

Wikipedia instances (Wiki)
◦ Mostly named entities and technical words

Test set     News-1   News-2   Query-1   Query-2   Name    Wiki
Avg. len.    51.8     44.9     3.8       5.7       3.0     4.1
OOV rate     0.3%     0.3%     3.5%      12.7%     23.4%   13.7%

Final Result

(Bar chart, scale 0–100: SubStr, SubStr+, and Proposed on News-1, News-2, Query-1, Query-2, Names, and Wiki.)

Comparison to KyTea (training with BCCWJ and UniDic)

(Bar chart, scale 50–100: KyTea (w/noise), KyTea, and Proposed on News-1, News-2, Query-1, and Query-2.)

Comparison to KyTea (training with BCCWJ, Wiki, and UniDic)

(Bar chart, scale 50–100: KyTea (w/noise), KyTea, and Proposed on News-1, News-2, Query-1, and Query-2.)

Conclusion 

We proposed an SMT-based pronunciation prediction model that makes effective use of dictionary-based operations.
◦ Achieved ∼90% accuracy in various domains.
◦ Robust to OOV words, and works effectively on standard text.

Future work
◦ Investigate the use of contextual features such as character- and pronunciation-type dependencies.