Generative and Discriminative Latent Variable Grammars

Slav Petrov
Google Research
New York, NY 10011
[email protected]

Abstract

Latent variable grammars take an observed (coarse) treebank and induce more fine-grained grammar categories that are better suited for modeling the syntax of natural languages. Estimation can be done in a generative or a discriminative framework, and results in the best published parsing accuracies over a wide range of syntactically divergent languages and domains. In this paper we highlight the commonalities and the differences between the two learning paradigms.

1 Introduction

Latent variable grammars for parsing [1, 2] model an observed treebank of coarse parse trees with a model over more refined, but unobserved, derivation trees. Given sentences as input, the parse trees represent the desired output of the system, while the derivation trees represent the typically much more complex underlying syntactic processes. For example, the single treebank category NP (noun phrase) may be better modeled by several finer categories representing subject NPs, object NPs, and so on. Rather than attempting to manually specify these fine-grained subcategories, we automatically split each category into k subcategories (e.g. NP1-NPk) and induce them from data.

Learning latent variable grammars therefore encompasses two tasks: determining the number of subcategories for each observed grammar category (some categories are more complex than others, and we therefore expect them to have more subcategories), and learning the parameters of the grammar (i.e. the probabilities of the different grammar productions).

In this paper, we review our past work on both generative [2, 3] and discriminative [4, 5] latent variable grammars. While the final parsing accuracies are comparable, we highlight the unique challenges involved in estimating the models under the different paradigms. In our generative approach [2], we constrain model complexity using a split-merge heuristic and use the EM algorithm for parameter estimation. Alternatively, we can accomplish both tasks together by including a prior in our objective function: in [3], we take a Bayesian standpoint and use a sparse Dirichlet prior and variational EM for learning. In our discriminative work [5], we include an L1-regularization term in our objective function, which we maximize using numerical optimization (L-BFGS).

In addition to reviewing the technical differences of learning generative and discriminative latent variable grammars, we also compare the resulting grammars with each other. We show that even though the final accuracies of the two approaches are usually comparable, the underlying models are quite different and make different types of errors.

2 Latent Variable Grammars

In latent variable parsing, we view the training trees as a coarse trace of the true underlying processes. By augmenting the trees with a latent variable at each node (see Figure 1), we can model this more refined process.

[Figure 1: (a) the parse tree T for the sentence w = "He was right."; (b) the corresponding derivations t : T over latent-annotated categories (S-x, NP-x, VP-x, P-x, V-x, ADJP-x, .-x); (c) the grammar G with productions S0 → NP0 VP0, S0 → NP1 VP0, S0 → NP0 VP1, S0 → NP1 VP1, S1 → NP0 VP0, ..., whose parameters θ are unknown.]

Figure 1: The observed parse trees T (a) are split into derivations t (b). Learning latent variable grammars involves determining the set of grammar productions/features and their parameters (c).

Our log-linear grammars are parametrized by a vector θ, which is indexed by productions X → γ. The conditional probability of a derivation t given a sentence w is then:

P_\theta(t \mid w) = \frac{1}{Z(\theta, w)} \prod_{X \to \gamma \in t} e^{\theta_{X \to \gamma}} = \frac{1}{Z(\theta, w)} e^{\theta^\top f(t)},    (1)

where Z(θ, w) is the partition function and f(t) is a vector indicating how many times each production occurs in the derivation t. The inside/outside algorithm gives us an efficient way of summing over this exponentially large set of derivations. We will consider generative grammars, where the parameters θ are set to maximize the joint likelihood of the training sentences and their parse trees, and discriminative grammars, where the parameters θ are set to maximize the likelihood of the correct parse tree (vs. all possible trees) given a sentence. Note that this is merely a comparison of different estimators, as probabilistic and weighted CFGs are equivalent [6].
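To make the summation over derivations concrete, the following is a minimal sketch of the inside pass for a toy weighted grammar in Chomsky normal form; the category names, productions, and weights are invented for illustration and are not taken from the paper. The returned value is the partition function Z(θ, w) of Eq. 1 for this toy grammar (the real implementation additionally handles unary rules, pruning, and the outside pass).

```python
import math
from collections import defaultdict

# Toy weighted CFG in Chomsky normal form (hypothetical weights theta).
BINARY = {            # theta for binary rules X -> Y Z
    ("S", "NP", "VP"): 0.0,
    ("VP", "V", "ADJP"): 0.1,
}
LEXICAL = {           # theta for lexical rules X -> word
    ("NP", "He"): 0.0,
    ("V", "was"): 0.2,
    ("ADJP", "right"): -0.1,
}

def inside(words, goal="S"):
    """Sum exp(theta . f(t)) over all derivations t of `words`, i.e. Z(theta, w)."""
    n = len(words)
    # chart[i][j][X] = total weight of derivations of words[i:j] rooted in X
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill in lexical cells
        for (X, word), theta in LEXICAL.items():
            if word == w:
                chart[i][i + 1][X] += math.exp(theta)
    for span in range(2, n + 1):                       # combine smaller spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (X, Y, Z), theta in BINARY.items():
                    chart[i][j][X] += math.exp(theta) * chart[i][k][Y] * chart[k][j][Z]
    return chart[0][n][goal]

print(inside(["He", "was", "right"]))                  # Z(theta, w) for the toy sentence
```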

2.1 Generative Grammars

Generative latent variable grammars can be seen as tree-structured hidden Markov models. A simple EM algorithm [1] allows us to learn parameters for generative grammars which maximize the log joint likelihood of the training sentences w and parse trees T:

\mathcal{L}_{\mathrm{joint}}(\theta) = \log \prod_i P_\theta(w_i, T_i) = \log \prod_i \sum_{t : T_i} P_\theta(w_i, t),    (2)

where t are derivations (over split categories) corresponding to the observed parse tree (over unsplit categories). In the E-Step we compute inside/outside scores over the set of derivations corresponding to the observed gold tree (in linear time), which allows us to compute expectations over the grammar productions. These expectations are then normalized in the M-Step to update the production probabilities \phi_{X \to \gamma} = e^{\theta_{X \to \gamma}} to their maximum likelihood estimates:

\phi_{X \to \gamma} = \frac{\sum_T E_\theta[f_{X \to \gamma}(t) \mid T]}{\sum_{\gamma'} \sum_T E_\theta[f_{X \to \gamma'}(t) \mid T]}    (3)

Here, E_\theta[f_{X \to \gamma}(t) \mid T] denotes the expected count of the production (or feature) X → γ with respect to P_\theta in the set of derivations t that are consistent with the observed parse tree T. Similarly, we will write E_\theta[f_{X \to \gamma}(t) \mid w] for the expectation over all derivations of the sentence w.

To learn grammar complexity, we use a simple, yet powerful, split-merge approach [2]. We iteratively refine the grammars in a hierarchical way. In each stage, all symbols are split in two (e.g. NP becomes NP1 and NP2), and the model is then fit using the EM algorithm described above. After a splitting stage, half of the splits are rolled back based on (an approximation to) their likelihood gain. Empirically, the gains level off after six split-merge rounds, and learning a generative grammar takes about 20 hours on a single CPU.
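As a small, concrete illustration of the M-Step in Eq. 3, the sketch below renormalizes a table of expected production counts per parent subcategory. The counts and subcategory names are hypothetical; in the actual training procedure these expectations come from the inside/outside pass over the derivations of each gold tree.

```python
from collections import defaultdict

# Hypothetical expected counts E_theta[f_{X->gamma}(t) | T], summed over the treebank.
# Keys are (parent subcategory, right-hand side).
expected_counts = {
    ("NP0", ("DT0", "NN0")): 40.0,
    ("NP0", ("DT0", "NN1")): 10.0,
    ("NP1", ("NP0", "PP0")): 25.0,
    ("NP1", ("DT1", "NN1")): 25.0,
}

def m_step(expected_counts):
    """Eq. 3: normalize expected counts per parent to obtain production probabilities phi."""
    totals = defaultdict(float)
    for (parent, _), count in expected_counts.items():
        totals[parent] += count
    return {rule: count / totals[rule[0]] for rule, count in expected_counts.items()}

phi = m_step(expected_counts)
print(phi[("NP0", ("DT0", "NN0"))])   # 0.8, i.e. 40 / (40 + 10)
```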

2.2 Discriminative Grammars

Discriminative latent variable grammars can be seen as conditional random fields [7] over trees. For discriminative grammars, we maximize the log conditional likelihood:

\mathcal{L}_{\mathrm{cond}}(\theta) = \log \prod_i P_\theta(T_i \mid w_i) = \log \prod_i \sum_{t : T_i} \frac{e^{\theta^\top f(t)}}{Z(\theta, w_i)}    (4)

Parser                                                      ≤ 40 words        all
                                                            F1     EX      F1     EX
ENGLISH
Single-Scale Generative Latent Variable Grammars [1]        86.8   32.8    86.3   30.3
Single-Scale Discriminative Latent Variable Grammars [4]    88.8   35.7    88.3   33.1
Multi-Scale Discriminative Latent Variable Grammars [5]     90.0   40.1    89.4   37.7
Split-Merge Generative Latent Variable Grammars [10]        90.6   39.1    90.1   37.1
Lexicalized Generative Grammars [11]                        90.3   39.6    89.7   37.2
Discriminative Reranking after Lexicalized Grammars [12]    92.3   46.2    91.7   43.5
GERMAN
Split-Merge Generative Latent Variable Grammars [10]        80.8   40.8    80.1   39.1
Multi-Scale Discriminative Latent Variable Grammars [5]     81.5   45.2    80.7   43.9
Lexicalized Generative Grammars [13]                        76.3   -       -      -

Table 1: Test set parsing accuracies for latent variable grammars and other state-of-the-art methods.

We directly optimize this non-convex objective function using a numerical gradient-based method (L-BFGS [8] in our implementation). Fitting the log-linear model involves the following derivatives:

\frac{\partial \mathcal{L}_{\mathrm{cond}}(\theta)}{\partial \theta_{X \to \gamma}} = \sum_i \Big( E_\theta[f_{X \to \gamma}(t) \mid T_i] - E_\theta[f_{X \to \gamma}(t) \mid w_i] \Big),    (5)

where the first term is the expected count of a production in derivations corresponding to the correct parse tree and the second term is the expected count of the production in all parses.

One of the main challenges in estimating discriminative grammars is that computing the derivatives requires repeatedly taking expectations over all parses of all sentences in the training set. To make this computation practical on large data sets, we extend the idea of coarse-to-fine parsing [9] to handle the repeated parsing of the same sentences. We cache computations of similar quantities between training iterations, allowing the efficient approximation of feature count expectations. Even with these approximations, training takes about three days on an 8-core CPU.

In practice, we add an L1-regularization term to the objective function in Eq. 4 in order to control the complexity of the grammars. Multi-scale grammars [5] then take advantage of the sparsity produced by L1-regularization by allowing some productions to reference fine categories while others reference coarse categories. As a result, a category such as NP can be complex in some regions of the grammar while remaining simpler in other regions, giving extremely compact grammars.
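To spell out how Eq. 5 would be used, here is a sketch of the gradient computation that an L-BFGS-style optimizer would call repeatedly. The helpers expected_counts_given_tree and expected_counts_given_sentence are hypothetical stand-ins for the constrained and unconstrained inside/outside computations described above, and the simple sign-based L1 term is only a rough placeholder for how a real implementation would handle the non-differentiable regularizer.

```python
from collections import defaultdict

def conditional_gradient(theta, training_set,
                         expected_counts_given_tree,     # hypothetical: E_theta[f | T_i]
                         expected_counts_given_sentence, # hypothetical: E_theta[f | w_i]
                         l1_strength=0.0):
    """Gradient of the L1-regularized conditional likelihood, following Eq. 5."""
    grad = defaultdict(float)
    for sentence, gold_tree in training_set:
        # expected counts over derivations consistent with the gold tree (first term)
        for rule, count in expected_counts_given_tree(theta, sentence, gold_tree).items():
            grad[rule] += count
        # expected counts over all derivations of the sentence (second term)
        for rule, count in expected_counts_given_sentence(theta, sentence).items():
            grad[rule] -= count
    if l1_strength > 0.0:
        # crude subgradient of -l1_strength * ||theta||_1
        for rule, weight in theta.items():
            grad[rule] -= l1_strength * (1.0 if weight > 0 else -1.0 if weight < 0 else 0.0)
    return dict(grad)
```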

3 Experiments

To compare generative and discriminative latent variable grammars, we ran experiments on a broad range of languages. Due to space constraints, Table 1 contains only an excerpt of our empirical results; we refer the interested reader to [14] for a complete overview. Figure 2 shows how parsing performance improves with the addition of more latent categories. In general, discriminatively trained grammars give better performance than their generative cousins, even when using an order of magnitude fewer parameters. However, the final accuracies of the generative models are slightly higher in terms of F1, because the split-merge approach seems to allocate the complexity in a more meaningful way. Generative grammars are also significantly (20 times) faster to train, but the discriminative grammars allow for a convenient integration of additional (overlapping) features.

It is also interesting to examine how the complexity is allocated. While most categories have similar complexity in the two cases, the complexity of the two most refined phrasal categories is flipped: generative grammars split the NP most heavily, while discriminative grammars split the VP. This distinction seems to arise because the complexity of VPs is more syntactic (e.g. complex verb subcategorization), while that of NPs is more lexical (noun choice is generally higher entropy than verb choice).

We also examined the most likely parse trees produced by the different grammars and observed only a small overlap between the generative and discriminative grammars, despite their comparable F1 scores. Even in their top 50 lists, the overlap was less than 30%, suggesting that the grammars are modeling very different, and potentially complementary, linguistic phenomena.

[Figure 2, top left panel: parsing accuracy (F1) vs. number of grammar productions (10,000 to 1,000,000, log scale) for flat generative grammars, flat discriminative grammars, generative split-merge grammars, and discriminative multi-scale grammars with additional features.]

[Figure 2, top right panel: grammar complexity allocation.]

                NP  VP  PP   S  SBAR  ADJP  ADVP  QP  PRN
Generative      32  24  20  12   12    12     8    7    5   subcategories
Discriminative  19  32  20  14   14     8     7    9    6   production parameters

Note that subcategories are compared to production parameters, indicating that the number of parameters grows cubically in the number of subcategories for generative grammars, while growing linearly for multi-scale discriminative grammars.

[Figure 2, bottom row: F1 score and exact match, overall and for NP, VP, PP, and QP, for grammars G1-G3 and D1.]

Figure 2: Top left: Parsing accuracy vs. number of grammar productions. Top right: Grammar complexity allocation. Bottom row: Breakdown of different accuracy measures for three generative grammars (G1-G3) and one discriminative grammar (D1).

Figure 2 also shows a more detailed error analysis for three generative and one discriminative grammar after three split-merge rounds. Not surprisingly, there is a large, and perhaps systematic, difference in the errors made by the generative and discriminative models. Interestingly, there is also a large difference between the generative models themselves, which differ only in the random seed used for initialization. We leave the investigation of ensemble methods that can combine the strengths of the different grammars for future work.

References

[1] T. Matsuzaki, Y. Miyao, and J. Tsujii. Probabilistic CFG with latent annotations. In ACL '05, 2005.
[2] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In ACL '06, 2006.
[3] P. Liang, S. Petrov, M. I. Jordan, and D. Klein. The infinite PCFG using hierarchical Dirichlet processes. In EMNLP '07, 2007.
[4] S. Petrov and D. Klein. Discriminative log-linear grammars with latent variables. In NIPS '08, 2008.
[5] S. Petrov and D. Klein. Sparse multi-scale grammars for discriminative latent variable parsing. In EMNLP '08, 2008.
[6] N. A. Smith and M. Johnson. Weighted and probabilistic context-free grammars are equally expressive. Computational Linguistics, 2007.
[7] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01, 2001.
[8] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[9] E. Charniak, S. Goldwater, and M. Johnson. Edge-based best-first chart parsing. In 6th Workshop on Very Large Corpora, 1998.
[10] S. Petrov and D. Klein. Improved inference for unlexicalized parsing. In NAACL '07, 2007.
[11] E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL '05, 2005.
[12] L. Huang. Forest reranking: Discriminative parsing with non-local features. In ACL '08, 2008.
[13] A. Dubey. What to do when lexicalization fails: Parsing German with suffix analysis and smoothing. In ACL '05, 2005.
[14] S. Petrov. Coarse-to-Fine Natural Language Processing. PhD thesis, UC Berkeley, 2009.
