A Structured Language Model

Ciprian Chelba
The Johns Hopkins University, CLSP, Barton Hall 320
3400 N. Charles Street, Baltimore, MD 21218
[email protected]

Copyright © 1997 by The Association for Computational Linguistics

Abstract

The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history, thus enabling the use of long-distance dependencies. The model assigns a probability to every joint sequence of words and binary parse structure with headword annotation. The model, its probabilistic parametrization, and a set of experiments meant to evaluate its predictive power are presented.

1 Introduction

The main goal of the proposed project is to develop a language model (LM) that uses syntactic structure. The principles that guided this proposal were:
• the model will develop syntactic knowledge as a built-in feature; it will assign a probability to every joint sequence of words and binary parse structure;
• the model should operate in a left-to-right manner so that it is possible to decode word lattices provided by an automatic speech recognizer.
The model consists of two modules: a next-word predictor and a parser; the predictor makes use of the syntactic structure developed by the parser, and the operations of the two modules are intertwined.

2 The Basic Idea and Terminology

Consider predicting the word barked in the sentence: the dog I heard yesterday barked again. A 3-gram approach would predict barked from (heard, yesterday), whereas it is clear that the predictor should use the word dog, which is outside the reach of even 4-grams. Our assumption is that what enables us to make a good prediction of barked is the syntactic structure in the past. The correct partial parse of the word history when predicting barked is shown in Figure 1. The word dog is called the headword of the constituent (the (dog (...))), and dog is an exposed headword when predicting barked — the topmost headword in the largest constituent that contains it. The syntactic structure in the past filters out irrelevant words and points to the important ones, thus enabling the use of long-distance information when predicting the next word.

Figure 1: Partial parse

Our model will assign a probability P(W, T) to every sentence W with every possible binary branching parse T and every possible headword annotation for every constituent of T. Let W be a sentence of length l words to which we have prepended <s> and appended </s> so that w_0 = <s> and w_{l+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and W_k T_k the word-parse k-prefix. To stress this point, a word-parse k-prefix contains only those binary trees whose span is completely included in the word k-prefix, excluding w_0 = <s>. Single words can be regarded as root-only trees. Figure 2 shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed headwords. A complete parse — Figure 3 — is any binary parse of the w_1 ... w_l </s> sequence with the restriction that </s> is the only allowed headword.

Figure 2: A word-parse k-prefix
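As a toy illustration of the exposed-headword idea, the sketch below represents a word-parse prefix as a list of head-annotated binary trees and reads off its exposed headwords. The bracketing is one assumed binarization of the Figure 1 parse, and the helper name is ours, not the paper's:

    # A tree is either a bare word (str) or a tuple (headword, left_child, right_child).
    figure1_prefix = [
        "<s>",
        ("dog",                                   # ((the dog)(I heard yesterday)), headed by "dog"
         ("dog", "the", "dog"),                   # (the dog), headed by "dog"
         ("heard", "I", ("heard", "heard", "yesterday"))),
    ]

    def headword(tree):
        """Exposed headword of a tree: the word itself, or its head annotation."""
        return tree if isinstance(tree, str) else tree[0]

    exposed = [headword(t) for t in figure1_prefix]   # ['<s>', 'dog']
    # A 3-gram model would condition the prediction of "barked" on
    # ('heard', 'yesterday'); the structured model sees the exposed headword 'dog'.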



Figure 3: Complete parse

Figure 4: Before an adjoin operation

Note that (w_1 ... w_l) needn't be a constituent, but for the parses where it is, there is no restriction on which of its words is the headword. The model will operate by means of two modules:
• PREDICTOR predicts the next word w_{k+1} given the word-parse k-prefix and then passes control to the PARSER;
• PARSER grows the already existing binary branching structure by repeatedly generating the transitions adjoin-left or adjoin-right until it passes control to the PREDICTOR by taking a null transition.
The operations performed by the PARSER ensure that all possible binary branching parses with all possible headword assignments for the w_1 ... w_k word sequence can be generated. They are illustrated by Figures 4-6. The following algorithm describes how the model generates a word sequence with a complete parse (see Figures 3-6 for notation):

    Transition t;              // a PARSER transition
    generate <s>;
    do {
      predict next_word;       // PREDICTOR
      do {                     // PARSER
        if (T_{-1} != <s>)
          if (h_0 == </s>) t = adjoin-right;
          else             t = {adjoin-{left,right}, null};
        else t = null;
      } while (t != null)
    } while (!(h_0 == </s> && T_{-1} == <s>))
    t = adjoin-right;          // adjoin <s>; DONE

It is easy to see that any given word sequence with a possible parse and headword annotation is generated by a unique sequence of model actions.
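To make the control flow above concrete, here is a minimal Python sketch of the PREDICTOR/PARSER loop, assuming the word-parse prefix is kept as a stack of head-annotated trees. The class name, the two callables passed in, and the code organization are illustrative choices, not the paper's implementation:

    SB, SE = "<s>", "</s>"                      # sentence begin / end markers

    class Constituent:
        def __init__(self, headword, left=None, right=None):
            self.headword = headword            # exposed headword of this tree
            self.left, self.right = left, right

    def adjoin(stack, direction):
        """Combine the two topmost trees T_{-1}, T_0 into a new T_0."""
        t0 = stack.pop()                        # T_0, headword h_0
        t1 = stack.pop()                        # T_{-1}, headword h_{-1}
        head = t1.headword if direction == "adjoin-left" else t0.headword
        stack.append(Constituent(head, left=t1, right=t0))

    def generate(predict_word, choose_transition):
        """predict_word(stack) -> next word (must eventually return </s>);
        choose_transition(stack) -> 'adjoin-left' | 'adjoin-right' | 'null'."""
        stack = [Constituent(SB)]               # <s> is a root-only tree
        words = []
        while True:
            w = predict_word(stack)             # PREDICTOR
            words.append(w)
            stack.append(Constituent(w))        # the new word starts as a root-only tree
            while True:                         # PARSER
                if stack[-2].headword == SB:
                    t = "null"                  # never adjoin <s> before the end
                elif stack[-1].headword == SE:
                    t = "adjoin-right"          # keep </s> as the headword
                else:
                    t = choose_transition(stack)
                if t == "null":
                    break
                adjoin(stack, t)
            if stack[-1].headword == SE and stack[-2].headword == SB:
                adjoin(stack, "adjoin-right")   # final adjoin of <s>; DONE
                return words, stack[-1]

The adjoin helper mirrors the headword updates illustrated in Figures 5 and 6: adjoin-left keeps h_{-1} as the headword of the new constituent, adjoin-right keeps h_0.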

3 Probabilistic Model

The probability P(W, T) can be broken into:

    P(W, T) = ∏_{k=1}^{l+1} [ P(w_k / W_{k-1}T_{k-1}) · ∏_{i=1}^{N_k} P(t^k_i / w_k, W_{k-1}T_{k-1}, t^k_1 ... t^k_{i-1}) ]

where:
• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix;
• w_k is the word predicted by the PREDICTOR;
• N_k - 1 is the number of adjoin operations the PARSER executes before passing control to the PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T;
• t^k_i denotes the i-th PARSER operation carried out at position k in the word string; t^k_i ∈ {adjoin-left, adjoin-right} for i < N_k, and t^k_i = null for i = N_k.

Figure 5: Result of adjoin-left (h'_0 = h_{-1}, h'_{-1} = h_{-2})

Figure 6: Result of adjoin-right (h'_0 = h_0, h'_{-1} = h_{-2})

Our model is based on two probabilities:

    P(w_k / W_{k-1}T_{k-1})                                   (1)
    P(t^k_i / w_k, W_{k-1}T_{k-1}, t^k_1 ... t^k_{i-1})       (2)

As can be seen, (w_k, W_{k-1}T_{k-1}, t^k_1 ... t^k_{i-1}) is one of the N_k word-parse k-prefixes of W_k T_k, i = 1 .. N_k, at position k in the sentence. To ensure a proper probabilistic model we have to make sure that (1) and (2) are well-defined conditional probabilities and that the model halts with probability one. A few provisions need to be taken:
• P(null / W_k T_k) = 1, if T_{-1} == <s>, ensures that <s> is adjoined in the last step of the parsing process;
• P(adjoin-right / W_k T_k) = 1, if h_0 == </s>, ensures that the headword of a complete parse is </s>;
• ∃ ε > 0 s.t. P(w_k = </s> / W_{k-1}T_{k-1}) ≥ ε, ∀ W_{k-1}T_{k-1}, ensures that the model halts with probability one.
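The factorization above can be read directly as code. The sketch below uses assumed interfaces: predictor_prob and parser_prob stand in for the models developed in Sections 3.1 and 3.2, and the prefix is kept opaque.

    import math

    def log_P_W_T(steps, predictor_prob, parser_prob):
        """
        steps: a list of (w_k, [t^k_1, ..., t^k_{N_k}]) pairs for k = 1 .. l+1,
               where the last transition at each position is 'null' (or the
               final forced adjoin-right at the end of the sentence).
        predictor_prob(w_k, prefix)             -> P(w_k / W_{k-1}T_{k-1})
        parser_prob(t, w_k, prefix, earlier_ts) -> P(t^k_i / w_k, W_{k-1}T_{k-1}, t^k_1..t^k_{i-1})
        """
        logp = 0.0
        prefix = []                               # running stand-in for W_{k-1}T_{k-1}
        for w_k, transitions in steps:
            logp += math.log(predictor_prob(w_k, prefix))
            for i, t in enumerate(transitions):
                logp += math.log(parser_prob(t, w_k, prefix, transitions[:i]))
            prefix.append((w_k, transitions))     # now represents W_k T_k
        return logp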

3.1 The first model

The first term (1) can be reduced to an n-gram LM, P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1} ... w_{k-n+1}). A simple alternative to this degenerate approach would be to build a model which predicts the next word based on the preceding p-1 exposed headwords and n-1 words in the history, thus making the following equivalence classification: [W_k T_k] = {h_0 .. h_{-p+2}, w_{k-1} .. w_{k-n+1}}.

The approach is similar to the trigger LM (Lau93), the difference being that in the present work triggers are identified using the syntactic structure.
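A small sketch of this equivalence classification; the function name and the default values of p and n are assumptions made for illustration:

    def first_model_context(exposed_headwords, words, p=2, n=2):
        """[W_k T_k] = {h_0 .. h_{-p+2}, w_{k-1} .. w_{k-n+1}}"""
        heads = tuple(exposed_headwords[-(p - 1):]) if p > 1 else ()
        grams = tuple(words[-(n - 1):]) if n > 1 else ()
        return heads + grams

    # For the Figure 1 prefix "the dog I heard yesterday" with exposed headwords
    # ['<s>', 'dog'], a model with p = 2, n = 1 conditions on ('dog',), whereas
    # a plain bigram (p = 1, n = 2) conditions on ('yesterday',).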

3.2 The second model

Model (2) assigns probability to different binary parses of the word k-prefix by chaining the elementary operations described above. The workings of the PARSER are very similar to those of Spatter (Jelinek94). It can be brought to the full power of Spatter by changing the action of the adjoin operation so that it takes into account the terminal/nonterminal labels of the constituent proposed by adjoin and also predicts the nonterminal label of the newly created constituent; the PREDICTOR will then predict the next word along with its POS tag. The best equivalence classification of the W_k T_k word-parse k-prefix is yet to be determined. The Collins parser (Collins96) shows that dependency-grammar-like bigram constraints may be the most adequate, so the equivalence classification [W_k T_k] should contain at least {h_0, h_{-1}}.
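As a hedged sketch of what such an equivalence classification could look like in practice, a PARSER transition model conditioned on {h_0, h_{-1}} might be estimated by relative frequency with a small floor; the class name and smoothing are assumptions, not the paper's parametrization:

    from collections import Counter, defaultdict

    class TransitionModel:
        TRANSITIONS = ("adjoin-left", "adjoin-right", "null")

        def __init__(self, floor=1e-3):
            self.counts = defaultdict(Counter)   # (h_0, h_{-1}) -> transition counts
            self.floor = floor

        def observe(self, h0, h_minus1, transition):
            self.counts[(h0, h_minus1)][transition] += 1

        def prob(self, transition, h0, h_minus1):
            """P(t / h_0, h_{-1}), floored so every transition keeps nonzero mass."""
            c = self.counts[(h0, h_minus1)]
            total = sum(c.values()) + self.floor * len(self.TRANSITIONS)
            return (c[transition] + self.floor) / total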

4 Preliminary Experiments

Assuming that the correct partial parse is a function of the word prefix, it makes sense to compare the word-level perplexity (PP) of a standard n-gram LM with that of the P(w_k / W_{k-1}T_{k-1}) model. We developed and evaluated four LMs:
• 2 bigram LMs P(w_k / W_{k-1}T_{k-1}) = P(w_k / w_{k-1}), referred to as W and w, respectively; w_{k-1} is the previous (word, POS tag) pair;
• 2 P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0) models, referred to as H and h, respectively; h_0 is the previous exposed (headword, POS/non-term tag) pair; the parses used in this model were those assigned manually in the Penn Treebank (Marcus95) after undergoing headword percolation and binarization.
All four LMs predict a word w_k and were implemented using the Maximum Entropy Modeling Toolkit (Ristad97; ftp://ftp.cs.princeton.edu/pub/packages/memt). The constraint templates in the {W, H} models were:

    4 <= <*>_<*> ; 2 <= _<*> ; 2 <= _ ; 8 <= <*>_ ;

and in the {w, h} models they were:

    4 <= <*>_<*> ; 2 <= _<*> ;

<*> denotes a don't care position, _ a (word, tag) pair; for example, 4 <= _<*> will trigger on all ((word, any tag), predicted-word) pairs that occur more than 3 times in the training data. The sentence boundary is not included in the PP calculation. Table 1 shows the PP results along with the number of parameters for each of the 4 models described.

    LM    PP    param
    W     352   208487
    H     292   206540
    w     419   103732
    h     410   102437

Table 1: Perplexity results
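For reference, the word-level perplexity reported in Table 1 can be computed as below. This is a sketch: the conditional model and the context extractor are assumed interfaces, with context_fn returning the previous word for the W/w models or the previous exposed headword for the H/h models, and sentence boundaries are left out of the count as in the text.

    import math

    def perplexity(sentences, cond_prob, context_fn):
        """
        sentences:  list of token lists (no sentence-boundary tokens counted)
        cond_prob:  cond_prob(word, context) -> probability
        context_fn: maps the history w_1..w_{k-1} (or its parse) to the
                    conditioning context used by the model.
        """
        log_sum, n_words = 0.0, 0
        for sent in sentences:
            for k, w in enumerate(sent):
                ctx = context_fn(sent[:k])
                log_sum += math.log(cond_prob(w, ctx))
                n_words += 1
        return math.exp(-log_sum / n_words)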

5 Acknowledgements

The author thanks all the members of the Dependency Modeling Group (Chelba97): David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Roni Rosenfeld, Andreas Stolcke, Dekai Wu.

References

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 184-191, Santa Cruz, CA.

Frederick Jelinek. 1997. Information extraction from speech and text — course notes. The Johns Hopkins University, Baltimore, MD.

Frederick Jelinek, John Lafferty, David M. Magerman, Robert Mercer, Adwait Ratnaparkhi, Salim Roukos. 1994. Decision tree parsing using a hidden derivational model. In Proceedings of the Human Language Technology Workshop, 272-277. ARPA.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Trigger-based language models: a maximum entropy approach. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, volume 2, 45-48, Minneapolis.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. 1995. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Eric Sven Ristad. 1997. Maximum entropy modeling toolkit. Technical report, Department of Computer Science, Princeton University, Princeton, NJ, January 1997, v. 1.4 Beta.

Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Sven Ristad, Roni Rosenfeld, Andreas Stolcke, Dekai Wu. 1997. Structure and performance of a dependency language model. In Proceedings of Eurospeech'97, Rhodes, Greece. To appear.
