Confusion-based Statistical Language Modeling (or what I did on my summer vacation)
Brian Roark, Center for Spoken Language Understanding, OHSU. Joint work with D. Bikel, C. Callison-Burch, Y. Cao, A. Çelebi, E. Dikici, N. Glenn, K. Hall, E. Hasler, D. Karakos, S. Khudanpur, P. Koehn, M. Lehr, A. Lopez, M. Post, E. Prud’hommeaux, D. Riley, K. Sagae, H. Sak, M. Saraçlar, I. Shafran, P. Xu
NAACL-HLT Workshop on the Future of Language Modeling for HLT, Montreal
Story of a JHU summer workshop
• Birth of an idea
• Pressure to submit idea for summer workshop
• Presenting a project at the planning meeting, building a team
• Preparing/planning for the workshop
• At the workshop (what we did)
  – Some things that panned out
  – Some things that are panning out
  – Some things that didn’t pan out
• After the workshop (what we kept doing and keep doing)
Idea starting point: discriminative language modeling
• Supervised training of language models
  – Training data (x, y), x ∈ X (inputs), y ∈ Y (outputs)
  – e.g., x input speech, y output reference transcript
• Run system on training inputs, update model
  – Commonly a linear model, n-gram features and others
  – Learn parameterizations using perceptron-like or global conditional likelihood methods
  – Use n-best or lattice output (Roark et al., 2004; 2007); or update directly on decoding graph WFST (Kuo et al., 2007)
• Run to some stopping criterion; regularize final model
Perceptron algorithm
• On-line learning approach, i.e.,
  – Consider each example in training set in turn
  – Use the current model to produce output for example
  – Update model based on example, move on to the next one
• For structured learning problems (parsing, tagging, transcription)
  – Given a set of input utterances and reference output sequences
  – Typically trying to learn parameters for features in a linear model
  – Need some kind of regularization (typically averaging)
• Learning a language model
  – Consider each input utterance in training set in turn
  – Use the current model to produce output for example (transcription)
  – Update feature parameters based on example, move on to the next one
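This loop can be sketched in a few lines of Python. This is an illustrative toy (sparse unigram/bigram features over n-best lists), not the workshop codebase, and all function names are invented:

```python
# Toy averaged perceptron for n-best reranking, as described above.
# Illustrative sketch only -- not the workshop codebase.
from collections import defaultdict

def extract_features(candidate):
    """Unigram and bigram indicator counts for a token sequence."""
    feats = defaultdict(float)
    toks = ["<s>"] + candidate.split() + ["</s>"]
    for i, t in enumerate(toks):
        feats[("1g", t)] += 1.0
        if i > 0:
            feats[("2g", toks[i - 1], t)] += 1.0
    return feats

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def perceptron_rerank(training_data, epochs=5):
    """training_data: list of (nbest_list, reference) pairs."""
    w = defaultdict(float)
    w_sum = defaultdict(float)  # running sum for averaging (regularization)
    n_steps = 0
    for _ in range(epochs):
        for nbest, ref in training_data:
            # use the current model to pick its favorite candidate
            z = max(nbest, key=lambda c: score(w, extract_features(c)))
            if z != ref:
                # promote reference features, demote the model's pick
                for f, v in extract_features(ref).items():
                    w[f] += v
                for f, v in extract_features(z).items():
                    w[f] -= v
            n_steps += 1
            for f, v in w.items():
                w_sum[f] += v
    return {f: v / n_steps for f, v in w_sum.items()}
```

Averaging the weight vector over all steps is the regularization mentioned above.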
Thinking about semi-supervised version
• Outlined approach needs labeled training data
  – Relatively small corpora versus monolingual text
  – Would love to train on the same data as regular n-gram models
• From around 2005, discussing this idea with Murat Saraçlar
• Several rounds of NSF proposals
  – First proposal rejected in 2006; given SGER funding (for pilot)
  – Teamed up with Khudanpur for NSF Large in 2008: not funded
  – Same team: NSF Medium funded in 2009
• Others working on this too: Kurata et al. (2009; 2011); Jyothi & Fosler-Lussier (2010)
ASR and MT
• These ideas are applicable to any LM-consuming application
• Our initial ideas well developed for speech; not so much for MT
• Zhifei Li made some progress on this idea for MT on a small task (Li et al., 2010; 2011)
• Wanted to pull more people into this problem, especially for MT
• But in fact, discriminative language modeling has had limited success in MT (cf. Li and Khudanpur, 2009)
• Good opportunity to push some of these methods in other areas
  – Explore features; build tools; learning methods
Putting together a workshop project
• Encouraged by JHU folks to submit a summer project proposal
• Invited to the “circus” (aka planning meeting)
• Hard-fought competition (strangely well behaved group)
• Selected to be held; some recruitment at meeting
• What follows is from my pitch at the meeting
Main Motivation
• Generative language models built from monolingual corpora are task agnostic
  – But tasks differ in the kinds of ambiguities that arise
• Supervised discriminative language modeling needs paired input:output sequences
  – Limited data vs. vast amounts of monolingual text used in generative models
• Semi-supervised discriminative language modeling would have large benefits
  – Optimize models for task-specific objectives
  – Applicable to arbitrary amounts of monolingual text in target language
• How would this work? Here’s one method:
  – Use baseline models to discover confusable sequences for observed target
  – Learn to discriminate between observed sequence and confusables
• Similar to Contrastive Estimation, but with observed output rather than input
For speech recognition
• Given a text string from the NY Times:

    He has not hesitated to use
    his country’s daunting problems as a kind of threat

[Figure: lattice of confusable alternatives for “... country’s problems ... kind of threat”, including countries, problem, time, threats, country, proms, kinds, thread, kinda, threads, countries’, trees, spread, conferees, read, conference, fred, company, copy]
For speech recognition, graphical view
[Figure: panels (A)–(D) showing the text string composed (◦) with confusion transducers, yielding (⇓) a lattice of candidate confusions]
Reward features associated with good candidates; penalize those from bad candidates
For speech recognition with WFST
• Basic Approach:
  – Use baseline STT models to build HMM state or phone confusion models
  – Use baseline STT recognizer HCLG transducers to map from word strings to phone or HMM state strings
  – Use confusion model to simulate likely confusions of state/phone sequences
  – Again use HCLG transducers to map from confusion state/phone sequences to set (lattice) of competitor word sequences
• Very large search space; scalable confusion set generation a key challenge
• Confusion models can be calculated: (1) directly from model or (2) through recognizing unlabeled speech and deriving confusions from output lattices
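The confusion-simulation step can be sketched as follows. This is a toy: the substitution table is hand-specified here, whereas in the real pipeline it would be estimated from the models or from recognition lattices, and WFST composition with HCLG (not shown) would map the confused phone strings back to word strings:

```python
# Expand a phone sequence through a per-phone substitution model and
# rank the resulting confusions by probability. Illustrative sketch.
import itertools

def confusions(phones, confusion_table, max_outputs=10):
    """Return (phone_sequence, probability) pairs, most likely first."""
    # each phone maps to a list of (alternative, prob); identity by default
    per_phone = [confusion_table.get(p, [(p, 1.0)]) for p in phones]
    cands = []
    for combo in itertools.product(*per_phone):
        seq = tuple(alt for alt, _ in combo)
        prob = 1.0
        for _, pr in combo:
            prob *= pr
        cands.append((seq, prob))
    cands.sort(key=lambda c: -c[1])
    return cands[:max_outputs]
```

Full enumeration like this blows up on long sequences, which is exactly the scalability challenge noted above; a real system would prune or sample in the transducer.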
For Machine Translation
• Just as the WFST methods on the previous slide perform word string to word string transductions, want to do the same for MT target strings
• Monolingual translation, from a string S to confusable strings based on models
• Much like paraphrasing, we can leverage models to find confusions
  – e.g., target phrases mapping to the same source phrase in the phrase table
• Li et al. (2010) looked at learning MT models from simulated confusions
  – Used a monolingual synchronous context-free grammar
• In fact, we don’t want the paraphrases, since we want to penalize confusions
  – Need non-substitutable confusions (mentioned in Li et al., not resolved)
  – Teasing apart substitutable and non-substitutable confusions useful for paraphrasing
• As with WFSTs in speech, decoding to produce confusion sets is the bottleneck
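The phrase-table pivoting idea can be shown in a minimal sketch: two target phrases that translate the same source phrase are candidate paraphrases, which (per the discussion above) are exactly what we want to exclude from a confusion set. The phrase table here is a toy stand-in:

```python
# Group target phrases by the source phrase they translate; targets that
# share a source are candidate paraphrases of one another.
from collections import defaultdict

def pivot_paraphrases(phrase_table):
    """phrase_table: iterable of (source_phrase, target_phrase) pairs.
    Returns {target_phrase: set of other targets sharing a source}."""
    by_source = defaultdict(set)
    for src, tgt in phrase_table:
        by_source[src].add(tgt)
    para = defaultdict(set)
    for targets in by_source.values():
        for t in targets:
            para[t] |= targets - {t}
    return dict(para)
```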
For machine translation
Good paraphrase (bad confusable) from a phrase table:

    i have no objection to a [military force] even in europe
    yo no me opongo a una [fuerza militar] también para europa

    puedo corroborar que la [fuerza militar] no ha podido solucionar los problemas
    i can confirm that the [military power] could not resolve the problems
For machine translation
Bad paraphrase (good confusable) due to misalignment:

    commissioner , you want to engage in [a test] of strength with parliament
    herr kommissar , sie laufen auf [eine kraftprobe] mit dem parlament hinaus

    es ist keineswegs meine absicht mich auf [eine kraftprobe] oder einen machtkampf mit dem parlament einzulassen
    it is by no means my intention to have [any clash] or test of strength with parliament

Bad paraphrase (good confusable) due to polysemy:

    un exemple est la voie d’ eau formée par la [rive] gauche du nervión au pays basque
    one example is the waterway formed by the left [bank] of the nervión in the basque country

    il a dû emprunter de l’ argent à la [banque] pour acheter ses matériaux
    to buy his materials he had to borrow money from the [bank]
Issues to pursue
• Modeling confusability
  – Much better handle on acoustic confusability than MT confusability
    ∗ Though requiring string-to-string confusions increases space
  – Leverage baseline models to discover likely confusions, induce models
  – In MT, left with a decoding problem that may include rich grammars
• Scaling confusion set generation
  – Ultimate benefit of approach is applying methods to massive text corpora
  – Very efficient decoding critical, which is why Keith and Kenji were recruited
  – May need to use sampling methods
• Methods for learning from confusions
  – Have mainly thought of existing discriminative LM methods
Rest of “circus” presentation
• Speculative blather about project specifics
  – Specific tasks that would be worked on
  – Who exactly would be on the team
  – What the overall project outcomes would be
  – Even deliverables on a weekly basis
• We now know how it played out, so will spare you those
  – Not too far off from plans, but not much paraphrasing
  – Ended up with large team, working on diverse topics
  – Produced papers, data and code; some topics still going
Pre-workshop preparation
• The 6 week workshop is actually a much longer project
• As workshop lead, responsible for team member productivity
  – Get them the data they need when the workshop starts
  – Get them whatever codebase is needed
  – Assemble the right team with appropriate expertise
  – Figure out who is going to work on what; student mentoring
• Want to remove any barriers to the research sprint
• Team leader = team facilitator and administrator
• Made choice: every paper has every team member as author
Building a team
• Big and diverse team
  – Original recruits: Keith, Kenji, Chris, Sanjeev and Murat
  – Added Philipp Koehn at the “circus” planning meeting
  – Dan Bikel became interested in the project
  – Others at JHU: Damianos Karakos, Adam Lopez, Matt Post
  – Other NSF project members participated: Zak Shafran
  – PhD students from OHSU (2); JHU (2); Edinburgh (1)
  – Three of Murat’s PhD students working from Turkey
  – Two excellent undergrads from Rochester and BYU
• Many sub-projects; both speech and MT; varied issues explored
Workshop project team members
[Photo, Thursday, August 18, 2011: team and affiliates]
Data prep
• Interested in controlled experimentation (comparison with supervised)
  – Large scale systems
  – Produce real system outputs on training data (lattices, n-best lists)
  – Produce “hallucinated” outputs on training data
  – English CTS and Turkish BN LVCSR; Urdu, German, Chinese/English MT
• Method of confusion generation central question being asked
  – For ASR, most effort in producing supervised data
  – For MT, round-trip methods for producing confusions very expensive
  – Major pre-workshop effort was focused on data preparation
  – Many cycles at Edinburgh and OHSU to produce data for workshop
• Opted for n-best lists, to allow for easy feature exploration
Software prep
• Exploring methods for producing DLM training data
  – Need code for learning models from the produced data
• Scalability is a goal, hence interested in distributed processing
• Wanted to produce reranking codebase before workshop
  – Didn’t make enough progress on that before the workshop
  – Made use of existing code for most results during the workshop (C code written by me for various reranking projects)
  – New codebase was written during the workshop and subsequently
  – Open source release is near – details later
• All DLMs learned with perceptron-like algorithms
Sub-project focus: ASR n-best list hallucination
• K. Sagae, M. Lehr, E. Prud’hommeaux, P. Xu, N. Glenn, D. Karakos, S. Khudanpur, B. Roark, M. Saraçlar, I. Shafran, D. Bikel, C. Callison-Burch, Y. Cao, K. Hall, E. Hasler, P. Koehn, A. Lopez, M. Post and D. Riley. 2012. Hallucinated n-best lists for discriminative language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5001-5004.
Focus of paper
• Simulating ASR errors or pseudo-ASR on English CTS task; then training a discriminative LM (DLM) for n-best reranking
• Running controlled experimentation under several conditions:
  – Three different methods of training data “hallucination”
  – Different sized training corpora
• Comparing WER reductions from real vs. hallucinated n-best lists
• Standard methods for training linear model
  – Simple features: unigrams, bigrams and trigrams
  – Using averaged perceptron algorithm
ICASSP poster
• Three methods for hallucinating being compared:
  – FST phone-based confusion model
  – Word-based phrasal cohorts model
  – Machine translation system: reference to ASR 1-best
• Going to be moderately lazy and just highlight parts of the poster
Other ICASSP papers
• Semi-supervised discriminative language modeling for Turkish ASR. A. Çelebi, H. Sak, E. Dikici, M. Saraçlar and the gang. pp. 5025-5028.
  – Experiments on Turkish BN task
  – Focus on phone, syllable, morph, and word based confusions
  – Some different choices in setup and experimental evaluation
  – Also looked at different methods for creating n-best lists
  – Larger sub-word units (syllable, morph) performed best
• Continuous space discriminative language modeling. P. Xu, S. Khudanpur, and the gang. pp. 2129-2132.
  – Use a neural net to parameterize a non-linear function that replaces the standard linear model dot product
  – Learns a much more compact parameterization at the same accuracy
Supervised DLM
• The last paper is focused on supervised DLM
  – We worked on features and learning methods applicable to both supervised and semi-supervised approaches
• New kinds of features and feature representations
• Methods for optimizing mixture with baseline scores
• Generalization to a number of “perceptron-like” algorithms
Perceptron algorithm (following Collins, 2002)

Inputs: training examples (x_i, ŷ_i)
Initialization: ᾱ^0 ← 0
Algorithm: for t = 1 … T, i = 1 … N:
    j ← (t − 1)·N + i
    (y_j, z_j) ← (ŷ_i, argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · ᾱ^{j−1})
    ᾱ^j ← ᾱ^{j−1} + Φ(x_i, y_j) − Φ(x_i, z_j)
Output: regularized parameters:
    avg(ᾱ) ← (1/NT) Σ_{i=1}^{NT} ᾱ^i
Perceptron-like algorithm

Inputs: training examples (x_i, ŷ_i)
Initialization: ᾱ^0 ← 0
Algorithm: for t = 1 … T, i = 1 … N:
    j ← (t − 1)·N + i
    (y_j, z_j) ← PickYZ(x_i, ŷ_i, ᾱ^{j−1})
    ᾱ^j ← ᾱ^{j−1} + update(x_i, y_j, z_j, ᾱ^{j−1})
Output: regularized parameters:
    avg(ᾱ) ← (1/NT) Σ_{i=1}^{NT} ᾱ^i

Choice of PickYZ and update functions varies (regularization, too, of course)
Choices for functions
• Perceptron as presented has the following functions:
    PickYZ(x_i, ŷ_i, ᾱ^{j−1}) = (ŷ_i, argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · ᾱ^{j−1})
    update(x_i, y_j, z_j, ᾱ^{j−1}) = Φ(x_i, y_j) − Φ(x_i, z_j)
• We often use the oracle rather than the straight reference in the update:
    PickYZ(x_i, ŷ_i, ᾱ^{j−1}) = (argmin_{z ∈ GEN(x_i)} L(x_i, ŷ_i, z), argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · ᾱ^{j−1})
  where L(x_i, ŷ_i, z) is the specified loss (e.g., WER)
• Choosing a solution to move the model towards; and one to move away from
• Often use a learning rate, e.g., how big a step to take:
    update(x_i, y_j, z_j, ᾱ^{j−1}) = η_j (Φ(x_i, y_j) − Φ(x_i, z_j))
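A toy rendering of these choices (oracle-based PickYZ plus a learning-rate update) over an n-best list; loss(), score() and features() are placeholder interfaces, not workshop code:

```python
# Oracle PickYZ: move toward the minimum-loss candidate, away from the
# model's current top-scoring candidate.
def pick_yz_oracle(nbest, loss, weights, score):
    y = min(nbest, key=loss)                          # oracle candidate
    z = max(nbest, key=lambda c: score(weights, c))   # model's pick
    return y, z

def update(weights, y, z, features, eta=1.0):
    """w <- w + eta * (Phi(y) - Phi(z)), with learning rate eta."""
    for f, v in features(y).items():
        weights[f] = weights.get(f, 0.0) + eta * v
    for f, v in features(z).items():
        weights[f] = weights.get(f, 0.0) - eta * v
    return weights
```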
Perceptron-like algorithms
• Methods vary in how they choose the candidate to move towards
  – Perceptron: lowest loss candidate across whole set
  – Direct loss minimization: highest score minus loss (D. Chiang, too)
• Methods vary in how they choose the candidate to move away from
  – Perceptron: highest scoring candidate across whole set
  – D. Chiang style MIRA: highest score plus loss (hope and fear)
• Methods vary in how to update the parameter values (step size)
  – Perceptron: update by the difference in the feature vectors
  – MIRA/PA: update so that margin is enforced proportional to loss
  – Loss sensitive perceptrons: update proportional to loss
• General methods with provably good performance in the limit
  – Including random selection of two candidates in the list
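A hedged sketch of the MIRA/PA-style variant above: pick “hope” (high score, low loss) and “fear” (high score, high loss) candidates, then enforce a margin between them proportional to their loss difference. The clipping constant C and the feature interface are assumptions, not workshop code:

```python
def mira_update(weights, nbest, score, loss, features, C=1.0):
    # hope: high score minus loss; fear: high score plus loss
    hope = max(nbest, key=lambda c: score(weights, c) - loss(c))
    fear = max(nbest, key=lambda c: score(weights, c) + loss(c))
    delta_loss = loss(fear) - loss(hope)              # required margin
    gap = score(weights, hope) - score(weights, fear)
    fh, ff = features(hope), features(fear)
    diff = {f: fh.get(f, 0.0) - ff.get(f, 0.0) for f in set(fh) | set(ff)}
    norm2 = sum(v * v for v in diff.values())
    if norm2 == 0.0:
        return weights
    # clipped step size, as in passive-aggressive updates
    tau = min(C, max(0.0, delta_loss - gap) / norm2)
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v
    return weights
```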
Random selection direct loss minimization
• N-best reranking from speech recognition system
• Candidate selection
  – Establish a distribution over n-best list, e.g., uniform or proportional to reciprocal rank
  – Flip a coin twice according to distribution, pick 2 candidates
  – Lowest loss candidate is candidate to move towards
  – Highest score candidate is candidate to move away from
  – Update according to step size method
• Show some experiments on n-best reranking of ASR
  – Compare to baseline perceptron discriminative language model
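The random-pair selection above might look like this sketch, using the reciprocal-rank distribution (one of the two options mentioned); all names are invented for illustration:

```python
import random

def pick_random_pair(nbest, loss, weights, score, rng=random):
    # distribution proportional to reciprocal rank over the n-best list
    probs = [1.0 / (rank + 1) for rank in range(len(nbest))]
    total = sum(probs)
    a, b = rng.choices(nbest, weights=[p / total for p in probs], k=2)
    y = a if loss(a) <= loss(b) else b                      # toward lower loss
    z = a if score(weights, a) >= score(weights, b) else b  # away from higher score
    return y, z
```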
Some supervised DLM results for English CTS
[Plot: dev-set WER (21.8–23) vs. model size (bytes, ×10^7); curves: perceptron 2g set 1, perceptron 2g all; baseline ASR system: English CTS, dev RT04]
• Baseline development set WER: 22.8
• Perceptron trained on 400 hours with bigrams yields 0.5% WER reduction
• Perceptron trained on 1900 hours with bigrams yields 0.8% WER reduction
Some supervised DLM results for English CTS
[Plot: dev-set WER vs. model size (bytes); curves: perceptron 2g set 1, perceptron 2g all, random pairs uniform 2g all; baseline ASR system: English CTS, dev RT04]
• Baseline development set WER: 22.8
• 40 trials of selecting random pairs over 1900 hrs, uniform distribution over n-best, yields 0.5–0.8% WER reduction
Some supervised DLM results for English CTS
[Plot: dev-set WER vs. model size (bytes); curves: perceptron 2g set 1, perceptron 2g all, random pairs uniform 2g all, random pairs ranknorm 2g all; baseline ASR system: English CTS, dev RT04]
• Baseline development set WER: 22.8
• 40 trials of selecting random pairs over 1900 hrs, distribution proportional to reciprocal rank, yields 0.6–0.9% WER reduction
Features and parameters
• Not much from this area has yet led to publishable results
• Using passive-aggressive methods helps in MT; not really in ASR
• Darcey Riley came up with some cool parse-based features
  – Didn’t result in measurable system differences
• Damianos Karakos built semi-supervised TF-IDF style features
  – Paper in submission, nice results
• Haven’t pursued the “direct loss minimization” as far as we might
  – Potential to achieve improvements via randomization, IMO
Software library
• Development of a new reranking library during workshop (Goooooooooggglerzzzzz)
• General algorithms, flexible data/feature I/O
• Meant to allow for research flexibility, but “industrial strength”
• Support for both serial and parallel training algorithms
• Open source version of the library (REFR) nearing release: https://github.com/refr
REFR madness
• Uses Google Protocol Buffers for I/O
  – Language independent, flexible, extensible
• General learning framework, various perceptron-like updates
• Off-line or on-line feature extractors (e.g., n-grams on-the-fly)
• Support for less memory-intensive modes
• Can be run using Hadoop for map/reduce approach
  – Includes distributed perceptron style updates
• Some LM specific utilities, e.g., n-gram feature extractors
  – But applicable to general reranking problems
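The distributed-perceptron parameter mixture that the map/reduce mode supports can be sketched as a feature-wise average of per-shard weight vectors; trainer internals are abstracted here as already-trained weight dicts:

```python
# Each trainer produces a sparse weight vector on its own data shard;
# the reducer averages them feature by feature into one mixed model.
def mix_parameters(shard_weights):
    """Feature-wise average of a list of sparse weight dicts."""
    mixed = {}
    n = len(shard_weights)
    for w in shard_weights:
        for feat, val in w.items():
            mixed[feat] = mixed.get(feat, 0.0) + val / n
    return mixed
```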
Parameter Mixtures
[Diagram: several Trainers each produce a sparse feature-weight vector (e.g., feat1: 0.075, feat57: −0.33, …, feat32: 1.23); Feature Reducers combine the per-trainer weights into a single mixed model (e.g., feat1: 0.333, …, feat32: 1.434)]
Machine Translation
• Several MT related sub-projects
  – Multi-system (Moses, Joshua) hierarchical model sanity check
  – Discriminative language modeling methods for MT
  – Methods for generating “hallucinated” confusion sets
• Important to include DLM score in MERT to get gains
• Results interesting and suggestive, more work required
Baseline SMT systems and training data
• baseline SMT systems for all 6 translation directions with Moses/Joshua toolkits
• 10-fold split of training data:
  – for each translation direction: build 10 models on 9/10 of data, leaving out one fold at a time
  – use each of these models to translate the missing fold of the training data
  – result: n-best confusions of all training data
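A schematic of the 10-fold scheme: every training sentence ends up decoded by a model that never saw it. Here train() and decode_nbest() stand in for the full SMT pipeline and are assumptions for illustration:

```python
def cross_fold_confusions(parallel_data, train, decode_nbest, k=10):
    """parallel_data: list of (source, reference) pairs."""
    folds = [parallel_data[i::k] for i in range(k)]
    nbest_confusions = []
    for i in range(k):
        # train on the other k-1 folds, decode the held-out fold
        rest = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        model = train(rest)
        for src, ref in folds[i]:
            nbest_confusions.append((decode_nbest(model, src), ref))
    return nbest_confusions
```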
Training data conditions
• 3 different methods to generate training sets of n-best translations (confusions, “real” or “hallucinated”) and reference translations
• monolingual roundtrip can be used to produce arbitrary amounts of training data
Task: Reranking MT output with discriminative LM
• trying to improve translation results in terms of BLEU score
• experiments with translation edit rate (TER) as well
Learning method: perceptron training
• train discriminative LM on n-best confusions of MT output
• model learns to differentiate between oracle translations and other candidate translations in n-best confusions
• apply model to rerank n-best list (based on a combination of baseline score + perceptron score)
• perceptron learner: different example selection methods and update types
• MIRA update: change weight vector such that model score difference of oracle and confusion reflects the loss between them
Perceptron optimized towards BLEU (ur-en)
• one-way/roundtrip: 88K, mono. roundtrip: 20K, baseline factor: 1
Perceptron optimized towards BLEU (en-de)
• one-way/roundtrip: 180K (one fold), baseline factor: 1
Retuning step
• have MERT decide on usefulness of the discriminative LM → re-tune baseline system + perceptron model
• MERT: run decoder to produce n-best lists, optimize parameters on n-best list, iterate until convergence
• modification:
  – produce n-best list using baseline features
  – compute discriminative features
  – score n-best list with DLM
  – optimize baseline weights + DLM weights
• 2-step training process: Perceptron (dlm) + MERT (baseline+dlm)
Pivot data
• data set (88K) created from confusion grammars using a pivot language
• extract en-ur and en-de SAMT grammars from which to extract en-en confusion grammars (more details later)
• find rules that appear in both rule tables → potential paraphrases
• remove “paraphrases” from the confusion grammar; only real confusions are useful for discriminative training
Retuning baseline + DLM (ur-en)
• results averaged over 3 MERT runs per experiment

Retuning baseline + DLM (en-de)
• results averaged over 3 MERT runs per experiment
Conclusions
• perceptron training alone yields no gains in BLEU score; retuning step is essential
• perceptron models optimized towards both objectives (TER, BLEU) yield gains with retuning
• additional POS n-gram features improve over word n-gram features in most cases
• results show that one-way and roundtrip training data yield comparable results, in some cases better results with roundtrip data
• promising results with monolingual roundtrip data (similar results with less data); expect larger gains with more data!
Current activity
• Puyang Xu has a submitted paper on extensions to phrasal cohorts
• Damianos submitted his TF-IDF inspired features paper
• Working on a unified extension of two of the hallucination papers
• Nearing submission of the open-source REFR library
• Philipp and Eva have been in LA finding out about D. Chiang’s MIRA approach
• Darcey Riley’s going to JHU to work with Jason Eisner
• I’m spending this summer at home in Portland
Summary
• Big, diverse team; lots of productive sub-projects
• Several publications on the speech modeling side
  – Confusion set generation yields useful DLM training data
  – Methods using larger units (morphs; words) slightly better
• MT results less straightforward; more research required
  – Seems important to include DLM score in MERT
• Open-source software library nearly ready for release
  – Supports distributed versions of perceptron-like algorithms
• LOADS of unanswered questions
Some directions to follow
• Move beyond controlled experiments; train on large text
• Further explore variants of general direct loss minimization
• Combined methods of confusion set generation
• Combined supervised and semi-supervised DLM training
• Application to areas like OCR, different confusion modeling
  – Text normalization
• DLM score in MERT training iterations; use weights in decoder
• Compact feature representations within open-source library
• What happened to paraphrasing?