Richer Syntactic Dependencies for Structured Language Modeling

Ciprian Chelba (STG, Microsoft Research) [email protected]
Peng Xu (CLSP, Johns Hopkins University) [email protected]
Abstract
Two simple methods of enriching the dependencies in the syntactic parse trees used for initializing the structured language model (SLM) achieve improvements in perplexity (PPL) and word error rate (WER, N-best rescoring) over the baseline results reported using the SLM on the UPenn Treebank and Wall Street Journal (WSJ) corpora, respectively.
Structured Language Model

✔ Generalizes trigram modeling (local) by taking advantage of sentence structure (influence by more distant past) [1]
✔ Uses exposed heads h (words w and their corresponding non-terminal tags l) for prediction:
$$P(w_{i+1} \mid W_i, T_i(W_i)) = P(w_{i+1} \mid h_{-2}(T_i(W_i)), h_{-1}(T_i(W_i)))$$
where $T_i$ is the partial hidden structure, with head assignment, assigned to $W_i = w_1 \ldots w_i$
✔ The model assigns a joint probability $P(T_i, W_i)$ to sequences of words and hidden parse structure
✔ The number of parses $T_k$ for a given word prefix $W_k$, $|\{T_k\}|$, grows exponentially in $k$ ($\sim 2^k$), so the unlikely ones need to be pruned
✔ Word-level probability assignment (see the sketch after this list):
$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k, T_k), \qquad \rho(W_k, T_k) = \frac{P(W_k T_k)}{\sum_{T_k \in S_k} P(W_k T_k)}$$
where $S_k$ is the set of all parses present in the stacks at the current stage $k$
✔ Model statistics estimation: an unsupervised algorithm for maximizing $P(W)$ (minimizing perplexity) that belongs to the class of Expectation-Maximization algorithms
✔ Parameters are initialized on parse trees that have been binarized and whose non-terminal (NT) tags have been enriched with headwords
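As a concrete illustration of the word-level probability assignment, here is a minimal sketch; the per-parse log-scores and predictor callables are hypothetical stand-ins for the SLM's stack entries, not its actual data structures:

```python
import math

def next_word_prob(parses, word):
    """P(w_{k+1} | W_k) as a rho-weighted mixture over surviving parses.

    `parses` is a list of (joint_logprob, predictor) pairs: joint_logprob
    stands in for log P(W_k T_k) and predictor(word) for
    P(w_{k+1} | W_k T_k) computed from the two exposed heads.
    """
    # rho(W_k, T_k) = P(W_k T_k) / sum_{T_k in S_k} P(W_k T_k),
    # computed stably in the log domain (log-sum-exp trick)
    m = max(lp for lp, _ in parses)
    weights = [math.exp(lp - m) for lp, _ in parses]
    z = sum(weights)
    return sum(w / z * predictor(word)
               for w, (_, predictor) in zip(weights, parses))
```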
[Figure: partial parse of "the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS" under the baseline NT tags, showing the exposed heads (contract_NP, ended_VP', with_PP, loss_NP, of_PP, cents_NP) and the three model components: PREDICTOR (predict word), TAGGER (tag word), PARSER (adjoin_{left,right}, null). Operation sequence: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...]
$$P(T_{n+1}, W_{n+1}) = \prod_{i=1}^{n+1} \underbrace{P(w_i \mid h_{-2}, h_{-1})}_{\text{predictor}} \cdot \underbrace{P(g_i \mid w_i, h_{-1}.\mathrm{tag}, h_{-2}.\mathrm{tag})}_{\text{tagger}} \cdot \underbrace{P(T_i \mid w_i, g_i, T_{i-1})}_{\text{parser}}$$
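Read as code, the factorization simply accumulates the three factors at each word position. A schematic sketch; the tuple layout and the three component probability functions are assumptions of this illustration (in the real SLM the exposed heads are updated by the parser moves themselves):

```python
import math

def joint_logprob(positions, predictor_p, tagger_p, parser_p):
    """Schematic log P(T_{n+1}, W_{n+1}) as a sum of the three factors.

    Each position is a (word, pos_tag, parser_moves, h2, h1) tuple where
    h2/h1 are the exposed heads (word, NT-tag) before the word is predicted.
    """
    total = 0.0
    for word, tag, moves, h2, h1 in positions:
        total += math.log(predictor_p(word, h2, h1))          # predictor
        total += math.log(tagger_p(tag, word, h1[1], h2[1]))  # tagger
        total += math.log(parser_p(moves, word, tag))         # parser
    return total
```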
Richer Syntactic Dependencies

Enrich the non-terminal tag of a node in the binarized parse tree with the NT tag of one of its children, or both:
1. same: use the NT tag of the child node from which the headword is being percolated
2. opposite: use the NT tag of the sibling of that child, i.e. of the child that does not contribute the headword
3. both: both of the above
A given binarized tree is traversed recursively in depth-first order and each constituent is enriched in the above manner, as in the sketch below.
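A minimal sketch of the enrichment traversal, assuming a hypothetical (label, head_child_index, children) tuple for binarized internal nodes and "word_POS" strings for leaves; the exact label syntax for the both scheme is illustrative:

```python
def enrich(node, scheme):
    """Depth-first enrichment of NT tags in a binarized parse tree."""
    if isinstance(node, str):                 # leaf "word_POS": unchanged
        return node
    label, head_idx, children = node

    def tag_of(child):                        # original tag of a child
        return child[0] if isinstance(child, tuple) else child.split("_")[1]

    same = tag_of(children[head_idx])         # child percolating the headword
    opp = tag_of(children[1 - head_idx])      # its sibling
    extra = {"same": same, "opposite": opp, "both": same + "+" + opp}[scheme]
    # enrich this node with the children's original tags, then descend
    return (label + "+" + extra, head_idx,
            [enrich(c, scheme) for c in children])
```

For example, enrich(("NP", 1, ["a_DT", "loss_NN"]), "opposite") relabels the node NP+DT, matching loss_NP+DT in the enriched-tree figure below.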
[Figure: the same partial parse with enriched NT tags (opposite scheme): the exposed heads become contract_NP+DT, ended_VP'+PP, with_PP+NP, loss_NP+PP, loss_NP+DT, of_PP+NP, cents_NP+CD. Operation sequence: ...; null; predict cents; POStag cents; adjoin-right-NP+CD; adjoin-left-PP+NP; ...; adjoin-left-VP'+PP; null; ...]
Perplexity experiments

Evaluate the perplexity on the UPenn Treebank:
Training set: 1Mwds (Sections 00-22)
Test set: 82.4kwds (Sections 23-24)
Vocabulary: 10kwds, open
POS tag vocabulary: 40
NT tag vocabulary: 52 (baseline), 954 (opposite), 712 (same), 3816 (both)
CONSTRUCTOR operation vocabulary: 157 (baseline), 2863 (opposite), 2137 (same), 11449 (both)

The SLM was interpolated with the 3-gram model (see the sketch after the results below): $P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)$

Enriching scheme | Train Iter | λ = 0.0 | λ = 0.6 | λ = 1.0
baseline         | 3          | 158.75  | 148.67  | 166.63
opposite         | 3          | 150.83  | 144.08  | 166.63
same             | 3          | 155.29  | 146.39  | 166.63
both             | 3          | 153.30  | 144.99  | 166.63
✘ The opposite initialization scheme performed best
✘ 5% relative reduction in PPL compared to the SLM baseline (λ = 0.0)
✘ 3% relative PPL improvement over the baseline after interpolation with the 3-gram (λ = 0.6)
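The interpolation itself is a plain per-word linear mixture; a quick sketch, assuming the two per-word probability functions are available as callables:

```python
import math

def interpolated_ppl(n_words, p_3gram, p_slm, lam):
    """Perplexity of P(.) = lam * P_3gram(.) + (1 - lam) * P_SLM(.).

    p_3gram(i) and p_slm(i) return each model's probability for the i-th
    test word given its history; lam = 1.0 recovers the 3-gram alone.
    """
    logprob = sum(math.log(lam * p_3gram(i) + (1.0 - lam) * p_slm(i))
                  for i in range(n_words))
    return math.exp(-logprob / n_words)
```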
WER (N-best rescoring) results

Evaluate the WER performance of the SLM in the WSJ DARPA'93 HUB-1 test setup (rescoring sketched after the results):
Training set: 20Mwds (SLM) / 40Mwds (3-gram), WSJ
Test set: 213 utterances, 3446 words
Vocabulary: 20kwds, open
Baseline: standard (LDC) 3-gram model trained on 40Mwds, used to generate the lattices and N-best lists
SLM: trained on a 20Mwds subset of WSJ parsed automatically (Ratnaparkhi's parser), opposite NT tag scheme

Interpolation weight λ:            | Iter | 0.0   | 0.2   | 0.4   | 0.6   | 0.8   | 1.0
baseline SLM WER (%)               | 0    | 13.1  | 13.1  | 13.1  | 13.0  | 13.4  | 13.7
opposite SLM WER (%)               | 0    | 12.7  | 12.8  | 12.7  | 12.7  | 13.1  | 13.7
MPSS significance test p-value     |      | 0.020 | 0.017 | 0.014 | 0.005 | 0.070 | n/a
✘ 0.3-0.4% absolute reduction in WER over the baseline SLM
✘ 1.0% absolute reduction in WER over the baseline 3-gram
✘ The SLM performance as a second-pass language model is the same even without interpolating it with the 3-gram model
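Second-pass N-best rescoring with the interpolated model then reduces to re-ranking. A sketch under the assumption that each hypothesis carries an acoustic log-score and per-word probabilities from both LMs; the LM scale is an illustrative default, not the HUB-1 setting:

```python
import math

def rescore_nbest(hypotheses, lam, lm_scale=16.0):
    """Return the hypothesis with the best combined score after swapping
    in the interpolated LM.  Each hypothesis is a dict with keys
    "acoustic_logprob", "p_3gram" and "p_slm" (per-word probabilities)."""
    def total_score(hyp):
        lm_logprob = sum(math.log(lam * p3 + (1.0 - lam) * ps)
                         for p3, ps in zip(hyp["p_3gram"], hyp["p_slm"]))
        return hyp["acoustic_logprob"] + lm_scale * lm_logprob
    return max(hypotheses, key=total_score)
```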
Conclusions and Future Directions

☞ Simple but effective method of enriching the syntactic dependencies in the structured language model (SLM) that achieves a 0.3-0.4% absolute reduction in WER over the best previous results reported using the SLM on WSJ.
☞ The implementation could be greatly improved by predicting only the relevant part of the enriched non-terminal tag and then adding the part inherited from the child.
☞ A more comprehensive study of the most productive ways of increasing the probabilistic dependencies in the parse tree would be desirable.
Acknowledgements

The authors would like to thank Brian Roark for making available the N-best lists for the HUB-1 test set.

SLM publicly available: ftp://ftp.clsp.jhu.edu/pub/clsp/chelba/SLM RELEASE
References

[1] Ciprian Chelba and Frederick Jelinek, "Structured language modeling," Computer Speech and Language, vol. 14, no. 4, pp. 283-332, October 2000.