Richer Syntactic Dependencies for Structured Language Modeling

Ciprian Chelba (STG, Microsoft Research) [email protected]
Peng Xu (CLSP, Johns Hopkins University) [email protected]
Abstract
Two simple methods of enriching the dependencies in the syntactic parse trees used for initializing the structured language model (SLM) achieve improvements in perplexity (PPL) and word error rate (WER, N-best rescoring) over the baseline results reported using the SLM on the UPenn Treebank and Wall Street Journal (WSJ) corpora, respectively.
Structured Language Model

✔ Generalizes trigram modeling (local) by taking advantage of sentence structure (influence by more distant past) [1]
✔ Uses exposed heads h (words w and their corresponding non-terminal tags l) for prediction:
$$P(w_{i+1} \mid W_i, T_i(W_i)) = P(w_{i+1} \mid h_{-2}(T_i(W_i)), h_{-1}(T_i(W_i)))$$
where $T_i$ is the partial hidden structure, with head assignment, assigned to $W_i = w_1 \ldots w_i$
✔ The model assigns a joint probability $P(T_i, W_i)$ to sequences of words and hidden parse structure
✔ The number of parses $T_k$ for a given word prefix $W_k$, $|\{T_k\}|$, grows exponentially in $k$ ($\sim 2^k$), so the unlikely ones need to be pruned
✔ Word-level probability assignment (see the sketch after this list):
$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k, T_k), \qquad \rho(W_k, T_k) = \frac{P(W_k T_k)}{\sum_{T_k \in S_k} P(W_k T_k)}$$
where $S_k$ is the set of all parses present in the stacks at the current stage $k$
✔ Model statistics estimation: an unsupervised algorithm for maximizing $P(W)$ (minimizing perplexity) that belongs to the class of Expectation-Maximization algorithms
✔ Parameters are initialized on parse trees that have been binarized and whose non-terminal (NT) tags have been enriched with headwords
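As a concrete illustration of the word-level probability assignment, here is a minimal sketch; the per-parse log-scores and predictor callables are hypothetical stand-ins for the SLM's stack entries, not its actual data structures:

```python
import math

def next_word_prob(parses, word):
    """P(w_{k+1} | W_k) as a rho-weighted mixture over surviving parses.

    `parses` is a list of (joint_logprob, predictor) pairs: joint_logprob
    stands in for log P(W_k T_k) and predictor(word) for
    P(w_{k+1} | W_k T_k) computed from the two exposed heads.
    """
    # rho(W_k, T_k) = P(W_k T_k) / sum_{T_k in S_k} P(W_k T_k),
    # computed stably in the log domain (log-sum-exp trick)
    m = max(lp for lp, _ in parses)
    weights = [math.exp(lp - m) for lp, _ in parses]
    z = sum(weights)
    return sum(w / z * predictor(word)
               for w, (_, predictor) in zip(weights, parses))
```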
[Figure: partial parse of "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS" under the baseline NT tags, showing the exposed heads (contract_NP, ended_VP', with_PP, loss_NP, of_PP, cents_NP) and the three model components: PREDICTOR (predict word), TAGGER (tag word), PARSER (adjoin_{left,right}, null). Operation sequence: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...]
$$P(T_{n+1}, W_{n+1}) = \prod_{i=1}^{n+1} \underbrace{P(w_i \mid h_{-2}, h_{-1})}_{\text{predictor}} \cdot \underbrace{P(g_i \mid w_i, h_{-1}.\mathrm{tag}, h_{-2}.\mathrm{tag})}_{\text{tagger}} \cdot \underbrace{P(T_i \mid w_i, g_i, T_{i-1})}_{\text{parser}}$$
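Read as code, the factorization simply accumulates the three factors at each word position. A schematic sketch; the tuple layout and the three component probability functions are assumptions of this illustration (in the real SLM the exposed heads are updated by the parser moves themselves):

```python
import math

def joint_logprob(positions, predictor_p, tagger_p, parser_p):
    """Schematic log P(T_{n+1}, W_{n+1}) as a sum of the three factors.

    Each position is a (word, pos_tag, parser_moves, h2, h1) tuple where
    h2/h1 are the exposed heads (word, NT-tag) before the word is predicted.
    """
    total = 0.0
    for word, tag, moves, h2, h1 in positions:
        total += math.log(predictor_p(word, h2, h1))          # predictor
        total += math.log(tagger_p(tag, word, h1[1], h2[1]))  # tagger
        total += math.log(parser_p(moves, word, tag))         # parser
    return total
```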
Richer Syntactic Dependencies

Enrich the non-terminal tag of a node in the binarized parse tree with the NT tag of one of its children, or both:
1. same: use the NT tag of the child node from which the headword is being percolated
2. opposite: use the NT tag of the sibling of that child, i.e. of the child that does not contribute the headword
3. both: both of the above
A given binarized tree is traversed recursively in depth-first order and each constituent is enriched in the above manner, as in the sketch below.
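A minimal sketch of the enrichment traversal, assuming a hypothetical (label, head_child_index, children) tuple for binarized internal nodes and "word_POS" strings for leaves; the exact label syntax for the both scheme is illustrative:

```python
def enrich(node, scheme):
    """Depth-first enrichment of NT tags in a binarized parse tree."""
    if isinstance(node, str):                 # leaf "word_POS": unchanged
        return node
    label, head_idx, children = node

    def tag_of(child):                        # original tag of a child
        return child[0] if isinstance(child, tuple) else child.split("_")[1]

    same = tag_of(children[head_idx])         # child percolating the headword
    opp = tag_of(children[1 - head_idx])      # its sibling
    extra = {"same": same, "opposite": opp, "both": same + "+" + opp}[scheme]
    # enrich this node with the children's original tags, then descend
    return (label + "+" + extra, head_idx,
            [enrich(c, scheme) for c in children])
```

For example, enrich(("NP", 1, ["a_DT", "loss_NN"]), "opposite") relabels the node NP+DT, matching loss_NP+DT in the enriched-tree figure below.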
[Figure: the same partial parse with enriched NT tags (opposite scheme): the exposed heads become contract_NP+DT, ended_VP'+PP, with_PP+NP, loss_NP+PP, loss_NP+DT, of_PP+NP, cents_NP+CD. Operation sequence: ...; null; predict cents; POStag cents; adjoin-right-NP+CD; adjoin-left-PP+NP; ...; adjoin-left-VP'+PP; null; ...]
Perplexity experiments

Evaluate the perplexity on the UPenn Treebank:
Training set: 1Mwds (Sections 00-22)
Test set: 82.4kwds (Sections 23-24)
Vocabulary: 10kwds, open
POS tag vocabulary: 40
NT tag vocabulary: 52 (baseline), 954 (opposite), 712 (same), 3816 (both)
CONSTRUCTOR operation vocabulary: 157 (baseline), 2863 (opposite), 2137 (same), 11449 (both)

The SLM was interpolated with the 3-gram model (see the sketch after the results below): $P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)$

Enriching scheme | Train Iter | λ = 0.0 | λ = 0.6 | λ = 1.0
baseline         | 3          | 158.75  | 148.67  | 166.63
opposite         | 3          | 150.83  | 144.08  | 166.63
same             | 3          | 155.29  | 146.39  | 166.63
both             | 3          | 153.30  | 144.99  | 166.63
✘ The opposite initialization scheme performed best
✘ 5% relative reduction in PPL compared to the SLM baseline (λ = 0.0)
✘ 3% relative PPL improvement over the baseline after interpolation with the 3-gram (λ = 0.6)
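The interpolation itself is a plain per-word linear mixture; a quick sketch, assuming the two per-word probability functions are available as callables:

```python
import math

def interpolated_ppl(n_words, p_3gram, p_slm, lam):
    """Perplexity of P(.) = lam * P_3gram(.) + (1 - lam) * P_SLM(.).

    p_3gram(i) and p_slm(i) return each model's probability for the i-th
    test word given its history; lam = 1.0 recovers the 3-gram alone.
    """
    logprob = sum(math.log(lam * p_3gram(i) + (1.0 - lam) * p_slm(i))
                  for i in range(n_words))
    return math.exp(-logprob / n_words)
```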
WER (N-best rescoring) results

Evaluate the WER performance of the SLM in the WSJ DARPA'93 HUB-1 test setup (rescoring sketched after the results):
Training set: 20Mwds (SLM) / 40Mwds (3-gram), WSJ
Test set: 213 utterances, 3446 words
Vocabulary: 20kwds, open
Baseline: standard (LDC) 3-gram model trained on 40Mwds, used to generate the lattices and N-best lists
SLM: trained on a 20Mwds subset of WSJ parsed automatically (Ratnaparkhi's parser), opposite NT tag scheme

Interpolation weight λ:            | Iter | 0.0   | 0.2   | 0.4   | 0.6   | 0.8   | 1.0
baseline SLM WER (%)               | 0    | 13.1  | 13.1  | 13.1  | 13.0  | 13.4  | 13.7
opposite SLM WER (%)               | 0    | 12.7  | 12.8  | 12.7  | 12.7  | 13.1  | 13.7
MPSS significance test p-value     |      | 0.020 | 0.017 | 0.014 | 0.005 | 0.070 | n/a
✘ 0.3-0.4% absolute reduction in WER over the baseline SLM
✘ 1.0% absolute reduction in WER over the baseline 3-gram
✘ The SLM performance as a second-pass language model is the same even without interpolating it with the 3-gram model
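Second-pass N-best rescoring with the interpolated model then reduces to re-ranking. A sketch under the assumption that each hypothesis carries an acoustic log-score and per-word probabilities from both LMs; the LM scale is an illustrative default, not the HUB-1 setting:

```python
import math

def rescore_nbest(hypotheses, lam, lm_scale=16.0):
    """Return the hypothesis with the best combined score after swapping
    in the interpolated LM.  Each hypothesis is a dict with keys
    "acoustic_logprob", "p_3gram" and "p_slm" (per-word probabilities)."""
    def total_score(hyp):
        lm_logprob = sum(math.log(lam * p3 + (1.0 - lam) * ps)
                         for p3, ps in zip(hyp["p_3gram"], hyp["p_slm"]))
        return hyp["acoustic_logprob"] + lm_scale * lm_logprob
    return max(hypotheses, key=total_score)
```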
Conclusions and Future Directions

☞ Simple but effective method of enriching the syntactic dependencies in the structured language model (SLM) that achieves a 0.3-0.4% absolute reduction in WER over the best previous results reported using the SLM on WSJ.
☞ The implementation could be greatly improved by predicting only the relevant part of the enriched non-terminal tag and then adding the part inherited from the child.
☞ A more comprehensive study of the most productive ways of increasing the probabilistic dependencies in the parse tree would be desirable.
Acknowledgements

The authors would like to thank Brian Roark for making available the N-best lists for the HUB-1 test set.

SLM publicly available: ftp://ftp.clsp.jhu.edu/pub/clsp/chelba/SLM RELEASE
References

[1] Ciprian Chelba and Frederick Jelinek, "Structured language modeling," Computer Speech and Language, vol. 14, no. 4, pp. 283-332, October 2000.