Richer Syntactic Dependencies for Structured Language Modeling

Ciprian Chelba, STG Microsoft Research, [email protected]
Peng Xu, CLSP Johns Hopkins University, [email protected]

Two simple methods of enriching the dependencies in the syntactic parse trees used for initializing the structured language model (SLM) achieve improvements in perplexity (PPL) and word error rate (WER, N-best rescoring) over the baseline results reported using the SLM on the UPenn Treebank and Wall Street Journal (WSJ) corpora, respectively.

Structured Language Model

✔ Generalize trigram modeling (local) by taking advantage of sentence structure (influence by more distant past) [1]
✔ Use exposed heads h (words w and their corresponding nonterminal tags l) for prediction:

$$P(w_{i+1} \mid W_i, T_i(W_i)) = P(w_{i+1} \mid h_{-2}(T_i(W_i)), h_{-1}(T_i(W_i)))$$

$T_i$ is the partial hidden structure, with head assignment, assigned to $W_i = w_1 \ldots w_i$
✔ Model will assign joint probability to sequences of words and hidden parse structure: $P(T_i, W_i)$
✔ Number of parses $T_k$ for a given word prefix $W_k$ is $|\{T_k\}| \sim O(2^k)$; need to prune it by discarding the unlikely ones
✔ Word level probability assignment:

$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k, T_k)$$

$$\rho(W_k, T_k) = \frac{P(W_k T_k)}{\sum_{T_k \in S_k} P(W_k T_k)}$$

– $S_k$ is the set of all parses present in the stacks at the current stage k
✔ Model statistics estimation: unsupervised algorithm for maximizing $P(W)$ (minimizing perplexity) that belongs to the class of Expectation-Maximization algorithms
✔ Parameters are initialized on parse trees that have been binarized and in which the non-terminal (NT) tags at each node have been enriched with headwords
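The word level probability assignment can be illustrated with a minimal Python sketch. It assumes each parse kept in the stacks carries its joint probability P(W_k T_k) and that the conditional P(w_{k+1} | W_k T_k) is available per parse. The function name, the data layout, and the toy numbers are hypothetical and are not taken from the SLM implementation.

```python
def next_word_probability(stack_parses, cond_prob):
    """Weighted sum over the parses in S_k (illustrative sketch only).

    stack_parses: list of (parse_id, joint_prob), joint_prob = P(W_k T_k)
    cond_prob:    dict parse_id -> P(w_{k+1} | W_k T_k)
    """
    total = sum(p for _, p in stack_parses)
    if total <= 0.0:
        raise ValueError("the stacks must carry positive probability mass")
    # rho(W_k, T_k) = P(W_k T_k) / sum over S_k of P(W_k T_k)
    return sum((p / total) * cond_prob[pid] for pid, p in stack_parses)


# Toy numbers (invented): three surviving parses after pruning.
stack = [("T_a", 0.006), ("T_b", 0.003), ("T_c", 0.001)]
p_next = {"T_a": 0.20, "T_b": 0.05, "T_c": 0.10}
prob = next_word_probability(stack, p_next)
```

Because the weights are normalized over the parses that survive pruning, they sum to one, so the result is a proper mixture of the per-parse predictions.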

[Figure: SLM operation. Labels: PREDICTOR (predict word), tag word, PARSER, null. Example word prefix: the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS; exposed head shown: with_PP.]

Sequence of operations: ...; null; predict cents; POStag cents; adjoin-right-NP; adjoin-left-PP; ...; adjoin-left-VP'; null; ...


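$$P(T_{n+1}, W_{n+1}) = \prod_{i=1}^{n+1} \underbrace{P(w_i \mid h_{-2}, h_{-1})}_{\text{predictor}} \cdot \underbrace{P(g_i \mid w_i, h_{-1}.tag, h_{-2}.tag)}_{\text{tagger}} \cdot \underbrace{P(T_i \mid w_i, g_i, T_{i-1})}_{\text{parser}}$$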
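As a small illustration of this factorization, the Python sketch below accumulates the joint probability of a word sequence and its parse in the log domain, one word position at a time. It assumes the three component probabilities (word prediction, tag prediction, parse extension) are already given per position. The function name and the toy values are hypothetical.

```python
import math


def joint_log_prob(steps):
    """log P(T, W) as a sum of per-position component log-probabilities.

    steps: iterable of (p_word, p_tag, p_parse) tuples, one per word position,
           standing for P(w_i | h_-2, h_-1), P(g_i | w_i, h_-1.tag, h_-2.tag)
           and P(T_i | w_i, g_i, T_{i-1}) respectively.
    """
    return sum(math.log(pw) + math.log(pg) + math.log(pt)
               for pw, pg, pt in steps)


# Toy values (invented) for a two-word prefix.
steps = [(0.1, 0.9, 0.5), (0.2, 0.8, 0.25)]
log_p = joint_log_prob(steps)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sentences.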



Richer Syntactic Dependencies

Enrich the non-terminal tag of a node in the binarized parse tree with the NT tag of one of its children, or both:
1. same: we use the non-terminal tag of the node from which the headword is being percolated
2. opposite: we use the non-terminal tag of the sibling of the node from which the headword is being percolated
3. both: both of the above

A given binarized tree is traversed recursively in depth-first order and each constituent is enriched in the above manner.
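The three schemes can be sketched as a recursive traversal in Python. This is only an illustration under stated assumptions: the `Node` class, the `head_from_left` flag, the use of the children's original (unenriched) tags, and the tag order in the `both` scheme are all hypothetical choices. Only the `+` separator follows the enriched tags shown on this poster (for example PP+NP).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    tag: str                         # NT tag (internal node) or POS tag (leaf)
    word: str                        # headword
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    head_from_left: bool = True      # child the headword is percolated from


def enrich(node: Node, scheme: str) -> Node:
    """Return an enriched copy of a binarized, headword-annotated tree."""
    if node.left is None or node.right is None:
        return Node(node.tag, node.word)          # leaves are left unchanged
    if node.head_from_left:
        head, sibling = node.left, node.right
    else:
        head, sibling = node.right, node.left
    if scheme == "same":
        new_tag = f"{node.tag}+{head.tag}"
    elif scheme == "opposite":
        new_tag = f"{node.tag}+{sibling.tag}"
    elif scheme == "both":
        new_tag = f"{node.tag}+{head.tag}+{sibling.tag}"   # order is assumed
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Depth-first recursion; child tags are read before they are enriched.
    return Node(new_tag, node.word, enrich(node.left, scheme),
                enrich(node.right, scheme), node.head_from_left)


# Simplified toy tree for "with 7 cents" (not the full poster example).
np = Node("NP", "cents", Node("CD", "7"), Node("NNS", "cents"),
          head_from_left=False)
pp = Node("PP", "with", Node("IN", "with"), np, head_from_left=True)
opposite_tree = enrich(pp, "opposite")
```

On this toy tree the opposite scheme yields PP+NP at the root and NP+CD below it, matching the enriched tags in the poster's example operations.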

[Figure: SLM operation with enriched NT tags. Labels: PREDICTOR (predict word), tag word, PARSER, null. Example word prefix: the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS; exposed head shown: with_PP+NP.]

Sequence of operations: ...; null; predict cents; POStag cents; adjoin-right-NP+CD; adjoin-left-PP+NP; ...; adjoin-left-VP'+PP; null; ...

Perplexity experiments


Evaluate the perplexity on the UPenn Treebank.
• Training set: 1M words (Sections 00-22)
• Test set: 82.4k words (Sections 23-24)
• Vocabulary: 10k words, open
• POS-tagger vocabulary: 40
• NT tag vocabulary: 52 baseline, 954 opposite, 712 same, 3816 both
• CONSTRUCTOR operation vocabulary: 157 baseline, 2863 opposite, 2137 same, 11449 both
• The SLM was interpolated with the 3-gram model: $P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)$

Enriching scheme   Train Iter   λ = 0.0   λ = 0.6   λ = 1.0
baseline           3            158.75    148.67    166.63
opposite           3            150.83    144.08    166.63
same               3            155.29    146.39    166.63
both               3            153.30    144.99    166.63

✘ opposite initialization scheme performed best
✘ 5% relative reduction in PPL compared to SLM baseline (158.75 to 150.83)
✘ 3% relative improvement after interpolation with 3-gram (148.67 to 144.08)
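The interpolation formula and the relative reductions quoted above can be checked with a few lines of Python. The helper names are hypothetical; the perplexity values are the baseline and opposite entries from the table, and `perplexity` is the standard definition over per-word probabilities, included only for illustration.

```python
import math


def interpolate(p_3gram, p_slm, lam):
    """Linear interpolation: lam * P_3gram + (1 - lam) * P_SLM."""
    return lam * p_3gram + (1.0 - lam) * p_slm


def perplexity(word_probs):
    """exp of the negative average log-probability of the test words."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


def relative_reduction(baseline, new):
    """Fractional reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


# PPL values copied from the table (baseline vs. opposite scheme).
slm_gain = relative_reduction(158.75, 150.83)      # lambda = 0.0
interp_gain = relative_reduction(148.67, 144.08)   # lambda = 0.6
```

With these inputs `slm_gain` is about 0.050 and `interp_gain` about 0.031, consistent with the 5% and 3% figures. Setting `lam` to 1.0 recovers the plain 3-gram model, which is why the λ = 1.0 column is identical for all schemes.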

WER (N-best rescoring) results

Evaluate the WER performance of the SLM in the WSJ DARPA'93 HUB-1 test setup.
• Training set: 20M words (SLM) / 40M words (3-gram), WSJ
• Test set size: 213 utterances, 3446 words
• Vocabulary: 20k words, open
• baseline: standard (LDC) 3-gram model trained on 40M words (source of the lattices and the N-best lists)
• SLM: trained on a 20M-word subset of WSJ, automatically parsed (Ratnaparkhi), opposite NT tag scheme

Interpolation weight λ            Train Iter   0.0     0.2     0.4     0.6     0.8     1.0
baseline SLM WER, %               0            13.1    13.1    13.1    13.0    13.4    13.7
opposite SLM WER, %               0            12.7    12.8    12.7    12.7    13.1    13.7
MPSS significance test p-value                 0.020   0.017   0.014   0.005   0.070   -

✘ 0.3-0.4% absolute reduction in WER over the baseline SLM
✘ 1.0% absolute reduction in WER over the baseline 3-gram
✘ SLM performance as a second pass language model is the same even without interpolating it with the 3-gram model
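The absolute reductions can be reproduced from the WER table with a short Python check. The variable names are hypothetical and the WER values are copied from the table. The sketch assumes the λ = 1.0 column corresponds to the 3-gram model alone, as the interpolation formula implies.

```python
lambdas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
baseline_wer = [13.1, 13.1, 13.1, 13.0, 13.4, 13.7]   # baseline SLM, %
opposite_wer = [12.7, 12.8, 12.7, 12.7, 13.1, 13.7]   # opposite scheme, %

# Absolute WER reduction of the opposite scheme at each interpolation weight.
reductions = [round(b - o, 1) for b, o in zip(baseline_wer, opposite_wer)]

# lambda = 1.0 is the 3-gram alone; compare it with the best opposite result.
trigram_wer = baseline_wer[-1]
gain_over_3gram = round(trigram_wer - min(opposite_wer), 1)

# WER of the opposite SLM without any 3-gram interpolation (lambda = 0.0).
slm_only_wer = opposite_wer[lambdas.index(0.0)]
```

For every weight below 1.0 the reduction is 0.3 or 0.4 points, and the uninterpolated SLM already reaches the best WER in the row.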

Conclusions and Future Directions

☞ Simple but effective method of enriching the syntactic dependencies in the structured language model (SLM) that achieves 0.3-0.4% absolute reduction in WER over the best previous results reported using the SLM on WSJ.
☞ Implementation could be greatly improved by predicting only the relevant part of the enriched non-terminal tag and then adding the part inherited from the child.
☞ A more comprehensive study of the most productive ways of increasing the probabilistic dependencies in the parse tree would be desirable.

Acknowledgements

The authors would like to thank Brian Roark for making available the N-best lists for the HUB-1 test set.

SLM publicly available: RELEASE

References

[1] Ciprian Chelba and Frederick Jelinek, "Structured language modeling," Computer Speech and Language, vol. 14, no. 4, pp. 283-332, October 2000.
