Supertagging with Combinatory Categorial Grammar

Leif Arda Nielsen
Department of Computer Science, King’s College London
[email protected]

Abstract. Supertagging is an approach that integrates linguistically motivated lexical descriptions with the robustness of statistical techniques. By associating lexical items with rich lexical descriptions (supertags), the computation of linguistic structure can be localized. Such a method can resolve ambiguities using statistical distribution information before parsing, reducing subsequent parse times considerably. A supertagger was developed using a trigram model with Good-Turing discounting and Katz’s back-off, trained on Wall Street Journal sections 01-21 annotated with Combinatory Categorial Grammar (CCG) categories. The model achieves recall of more than 92%. This work allows an empirical comparison of CCG with the Lexicalized Tree-Adjoining Grammar (LTAG) formalism, which has previously been used for supertagging.

1 Introduction

Supertagging is an approach to reducing ambiguity in language processing prior to parsing, using contextually descriptive lexical categories (supertags) together with robust statistical techniques. Supertags encode complex syntactic constraints directly in the categories, thereby reducing ambiguity to a local context. A lexical item is associated with a distinct supertag for each syntactic context in which it can appear. This results in many more supertags per lexical item than with standard part-of-speech (POS) tags, but also allows finer distinctions between different senses to be made. Lexicalised formalisms like Combinatory Categorial Grammar (CCG) [20, 21] are suitable sources of supertags, as their categories are highly descriptive and localise ambiguities. Supertagging works much like POS tagging: local statistical information, in the form of n-gram models, is used to output a small set of supertags per word, which can then be input efficiently to a parser [8, 13]. Because most words are associated with many supertags, achieving levels of supertag disambiguation comparable to POS tagging is difficult. In this work, the effects of a range of statistical techniques for supertag disambiguation

will be examined, using a corpus of CCG-annotated data. This allows an empirical comparison of CCG with the Lexicalized Tree-Adjoining Grammar (LTAG) formalism, which was used for supertagging by [19].

2 Previous work

Srinivas and Joshi [19] present a version of supertagging using LTAG [17]. Their model uses a trigram language model, Good-Turing smoothing [11], Katz’s back-off model [14] and the leaving-one-out technique [15]. They achieve per-word accuracy of about 92% on the Wall Street Journal (WSJ) corpus. The authors use structural information to reduce the search space prior to supertagging, using the span of supertags to distinguish viable supertags from the rest. Their language model, trained on 1,000,000 words of the WSJ (sections 00 through 24, except 20) and tested on 47,000 words (section 20), has a baseline (unigram) recall of 77.2% and a trigram recall of 92.2%. Contextual information is thus quite important for the supertagger, providing a 15% increase in recall. This effect is far greater in absolute terms than for POS tagging, where a baseline recall of 91% [2] is improved to around 97%; the difference is due to the fact that there are many more supertags per word than POS tags. In terms of error reduction, however, the effects are similar, at 66% for supertagging and 67% for POS tagging.

The drawback of supertagging is that a single mistake can lead to a sequence of supertags that cannot be correctly parsed. This can be alleviated by outputting several supertags per word instead of only the best one. The authors report parsing speed increases by a factor of 350 with use of the supertagger.

3 Applications using supertags

The use of supertags in document retrieval has been investigated [4, 5, 6], resulting in a system that achieves recall and precision of 88% and 97% respectively in filtering out irrelevant documents. Chandrasekar et al. [3] have investigated the use of supertags for text simplification. A proper-noun coreference resolution method relying on supertags was implemented by [18], who report a precision of 79% and a recall of 32%, lower than rival methods that use more linguistic information. In word sense disambiguation [7], a maximum entropy model using supertags has been tested on the NMSU interest corpus (approx. 2,400 sentences) and the DSO corpus (approx. 192,000 sentences), with recalls of 92.6% and 72.1% respectively, an improvement over existing methods.

4 Evaluation

4.1 Baseline algorithm

A basic version of the procedure discussed in [19] was developed as the baseline algorithm. The language model is built using a trigram Hidden Markov Model (HMM) tagger.¹ Good-Turing discounting [11] and Katz’s back-off model [14] are used to assign probability mass to unseen events. To supertag sentences with this model, the Viterbi algorithm is used with a beam width of N in a breadth-first search ([16], p. 41). The algorithm was modified to return the K best paths instead of just one. This approach outputs a variable number of supertags per word, with a maximum of K, since some paths in the top K may, and usually do, share the same supertag at certain positions.

¹ Descriptions of HMMs, discounting, backing-off and the Viterbi algorithm can be found in standard textbooks such as [9] and [1]; [12] provides a good introduction to language modelling, with a review of the current state of the field.
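To make the search procedure concrete, the following is a minimal sketch of a K-best beam Viterbi decoder of the kind described above. All names are hypothetical: tags_for, log_trans and log_emit stand in for the supertag lexicon and the smoothed trigram model, which are not shown here.

```python
def beam_viterbi_kbest(words, tags_for, log_trans, log_emit, N=5, K=5):
    """Trigram beam-search Viterbi: keep the N best partial paths at
    each position, then return the union of supertags per word over the
    K best complete paths (K <= N here; section 4.3 relaxes this)."""
    beam = [(0.0, [])]  # (log-probability, supertag path so far)
    for word in words:
        candidates = []
        for score, path in beam:
            prev2 = path[-2] if len(path) >= 2 else "<s>"
            prev1 = path[-1] if path else "<s>"
            for tag in tags_for(word):
                s = score + log_trans(prev2, prev1, tag) + log_emit(word, tag)
                candidates.append((s, path + [tag]))
        # Prune to the N highest-scoring partial paths (the beam).
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:N]
    # Collect supertags per position from the K best complete paths;
    # identical tags collapse, so the output per word is usually below K.
    supertags = [set() for _ in words]
    for _, path in sorted(beam, key=lambda c: c[0], reverse=True)[:K]:
        for i, tag in enumerate(path):
            supertags[i].add(tag)
    return supertags
```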

4.2 Experimental method and data used

The supertagger is trained on sections 01 through 21 of the Wall Street Journal corpus, which together contain 1,000,000 word-supertag-POS-tag triplets. The data was converted to CCG form by Julia Hockenmaier of Edinburgh University; the POS information was not used in these experiments. It must be noted that the data was not manually annotated, so errors introduced by the extraction algorithm are bound to affect the performance of the supertagger. Section 00 (45,000 words) of the WSJ was used as development data, on which the following results are based. Section 23 (55,000 words) was used as the test set; results on that set are given in section 5. The measures used were precision, recall and the F1 measure, all given as percentages where used:

\[
\mathrm{Precision} = \frac{N(\text{correct supertags returned})}{N(\text{all supertags returned})}, \qquad
\mathrm{Recall} = \frac{N(\text{correct supertags returned})}{N(\text{supertags in test})}
\tag{1.1}
\]

The F1 measure combines these two at a 1/1 ratio:

\[
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\tag{1.2}
\]

The measure of most interest is recall, as the lack of the correct supertag for a word will lead to an unparseable sentence; higher precision will lead to faster parsing of the output.
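To make these measures concrete, here is a short sketch of how they are computed for a tagger that returns a set of supertags per word; it assumes one gold supertag per word, as in the evaluation used here.

```python
def evaluate(gold_tags, output_sets):
    """Precision, recall, F1 (equations 1.1 and 1.2) and supertags per
    word, given one gold supertag and one output set per word."""
    correct = sum(1 for gold, out in zip(gold_tags, output_sets) if gold in out)
    returned = sum(len(out) for out in output_sets)
    precision = correct / returned
    recall = correct / len(gold_tags)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, returned / len(gold_tags)
```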

4.3 Experiments

Baseline performance

Initial experiments with the values of N, the number of paths kept in memory, and K, the number of paths output, give the results seen in table 1.1, where SPW stands for supertags per word.

 N    K    Prec.   Recall  F1      SPW
 3    1    82.37   82.37   82.37   1.000
 3    3    76.40   83.94   79.99   1.099
 5    1    83.87   83.87   83.87   1.000
 5    2    80.93   85.21   83.02   1.053
 5    3    77.86   85.98   81.72   1.104
 5    4    75.76   86.22   80.65   1.138
 5    5    74.04   86.39   79.74   1.167
 7    2    81.49   85.85   83.61   1.053
 7    3    78.52   86.74   82.42   1.105
 7    4    76.42   87.15   81.43   1.140
 7    5    74.44   87.41   80.40   1.174
 7    6    73.00   87.51   79.60   1.199
 7    7    71.75   87.62   78.89   1.221
 10   1    84.65   84.65   84.65   1.000
 10   3    78.86   87.15   82.80   1.105
 10   10   68.86   88.52   77.46   1.285
 15   3    79.30   87.60   83.25   1.105
 15   15   65.03   89.62   75.37   1.378
 20   5    75.38   88.75   81.52   1.177
 20   20   61.90   90.18   73.41   1.457

Table 1.1: Baseline results for various values of N and K

A detailed look at the average number of supertags output per word is given in table 1.1. Looking at the values for N = 5, it can be seen that increasing K yields only small increases in the number of supertags output per word. This is due to the search algorithm: setting K outputs all supertags associated with a word across the top K paths in memory, and as many of these are usually the same, the number of supertags per word stays rather low. While increasing K changes the number of supertags per word only slightly, it produces noticeable differences in recall. Consider, for example, N = 5, where increasing K from 2 to 5 gives a 10.8% increase in supertags per word and a corresponding 1.18% increase in recall, which is acceptable. For N = 7, increasing K from 2 to 7 costs 15.9% in supertags per word but gives a 1.78% increase in recall. Note that increasing K adds practically no computational cost to the supertagger, but will slow down the ensuing parser.

From the results it can be seen that choosing N, K = 5 or N, K = 7 gives good results at relatively good speed² (250,000 ms and 372,000 ms respectively to supertag the test set) with low supertags-per-word output. Increasing N improves performance, but slows down the process accordingly. Increasing K reduces precision but increases recall, which is considered more important for the task at hand. In the limit, achieving high performance

² The experiments were conducted on a Pentium II 200 MHz PC with 256 MB of RAM, running Linux.


with N, K = 15 at 89.62% recall, or 90.18% with N, K = 20, is possible, but rather slow at 1,100,000 ms.

Pruning the supertag search space

Because there are bound to be errors in the training corpus, due both to errors in the original Treebank and to the extraction algorithm employed, spurious supertags need to be removed. For this purpose, only supertags that occur 9 times or more in total in the training data are considered valid; the rest are not used in the search. This cut-off marks tags corresponding to 0.17% of the events seen in the training data as invalid. The method improves the speed of the system and gives an improvement in recall of around 0.20%-0.25%.

Returning K > N paths

To assess the gains possible through returning a larger number of paths, the search algorithm was modified to allow K > N paths to be returned at the end. The results of experiments using this method are seen in table 1.2,³ where it can be seen that improvements to recall are possible, but at high cost to precision.

 N    K    Precision  Recall  F1     SPW
 5    5    75.54      88.54   81.52  1.172
 5    10   69.23      88.87   77.83  1.284
 5    20   63.44      88.95   74.06  1.402
 20   20   63.07      92.61   75.04  1.468
 20   40   56.68      92.85   70.39  1.638
 20   60   54.19      92.89   68.45  1.714
 20   80   52.84      92.90   67.37  1.758

Table 1.2: Number of supertags per word and performance figures for various values of K ≥ N

³ These results are not based on the baseline performance but on the performance of the system with the supertag dictionary, supertag pruning, unknown word model and word-feature model; see sections 4.3 and 4.4.

4.4 Unknown word models

In the development data, 1,705 of the 45,556 words (3.7%) are unknown. If rarely seen words are also classified as unknown, say those seen fewer than four times (see ‘Adding a supertag dictionary’ below), the number of words categorised as unknown increases to 3,237 (7.1%). This percentage makes the importance of a good unknown word model obvious.


Searching through the complete supertag set

The baseline model uses an overly simple unknown word model, which assigns the category “N” (noun) to any unknown word, as it is the most common supertag in the training set. A simple extension is to search through the entire set of supertags for each unknown word. The results of this addition are seen in table 1.3.

 N    K    Precision  Recall  F1
 5    5    74.96      87.67   80.82
 20   20   62.67      91.74   74.47

Table 1.3: Performance searching through the complete supertag set

This method gives an appreciable increase in recall, around 1.4%, but slows down the supertagger.

Adding a basic unknown word model

Srinivas and Joshi [19] use a method combining a probability estimate for unknown words, Pr(UNK | Ti), with a probability estimate based on word features:

\[
P(W_i \mid T_i) =
\begin{cases}
\dfrac{N(W_i, T_i)}{N(T_i)} & \text{if } N(W_i, T_i) > 0 \\[2mm]
\Pr(\mathit{UNK} \mid T_i) \times \Pr(\mathit{word\ features}(W_i) \mid T_i) & \text{otherwise}
\end{cases}
\tag{1.3}
\]

Thus a token “UNK” is associated with each supertag, and its count is estimated from N1(Tj), the number of words associated with the supertag Tj that appear exactly once in the corpus. A coefficient η is used to ensure that sparse supertags have probabilities less than 1. The authors use the leaving-one-out technique described in [15]. Here, instead, a method using the entire training corpus was used to estimate the probabilities of unknown words, with no use of word features. Experiments show that the effect of η on recall is weak, so results with η set to 5 are given in table 1.4. This method results in an increase in recall of around 0.2%-0.5% for the different beam sizes.

 N    K    Precision  Recall  F1
 5    5    75.36      88.18   81.27
 20   20   62.57      91.97   74.47

Table 1.4: Results using Pr(UNK | Ti)
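A minimal sketch of this emission model, following equation (1.3); the count tables and the estimators for Pr(UNK | Ti) and the feature model are hypothetical stand-ins, since the text does not fix an implementation.

```python
def emission_prob(word, tag, pair_counts, tag_counts, unk_prob, feature_prob):
    """Emission probability per equation (1.3). pair_counts[(word, tag)]
    and tag_counts[tag] are training-corpus counts; unk_prob[tag]
    estimates Pr(UNK | tag) and feature_prob(word, tag) estimates
    Pr(word_features(word) | tag) -- all names are assumptions."""
    n_wt = pair_counts.get((word, tag), 0)
    if n_wt > 0:
        return n_wt / tag_counts[tag]
    return unk_prob[tag] * feature_prob(word, tag)
```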

Adding a supertag dictionary

In the baseline model, an assumption was made about the representativeness of words in the training corpus: for words that had been encountered in the training data, only the tags they had been seen with were searched. This is a rather restrictive version of the “tag dictionary” described by Ratnaparkhi [16]. Words that are not included in the supertag dictionary are classified as unknown. The effects of varying the cut-off value can be seen in table 1.5, in terms of the percentage of the 43,887 words in the training set accepted into the supertag dictionary, the resulting percentage of words classified as unknown in the test data, and the resulting performance.

 Cut-off  Words accepted (%)  Unseen words in test (%)  Unknown word recall  Precision  Recall  F1
 0        100.00              3.74                      81.88                75.36      88.18   81.27
 2        38.97               6.28                      80.04                75.30      88.32   81.29
 3        31.59               7.11                      80.14                75.26      88.38   81.29
 4        26.99               7.74                      79.90                75.19      88.33   81.23
 5        23.60               8.36                      80.08                75.09      88.25   81.14
 6        21.04               8.96                      79.63                74.97      88.14   81.02

Table 1.5: Effects of supertag lexicon cut-off values, N, K = 5

The results suggest that words seen 4 times or more (a cut-off of 3) in the training data should be accepted as representative of their true supertag search space, while those seen fewer than 4 times require searching through the entire set of tags.

Adding word features

Word features are important clues for guessing the categories of unknown words: capitalised words are more likely to be nouns, words ending in -ing are more likely to be verbs, and so on. The word features accounted for are capitalisation, hyphenation, numbers, and the endings -s, -ing and -tion. For these six categories independence assumptions were made for simplicity, although the features are known to be dependent. Building on this, as discussed in [16], it was assumed that sparsely represented words are informative about unknown words. Experiments showed that the choice of cut-off between values of 5 and 50 had little impact on the recall of the system; the best results were obtained when words seen fewer than 10 times were included in the word-feature model. Using the word-feature model with cut-off 10, the results seen in table 1.6 are obtained.


 N    K    Precision  Recall  F1
 5    5    75.54      88.54   81.52
 7    7    73.29      89.94   80.76
 10   10   70.45      90.91   79.38
 15   15   66.35      92.01   77.10
 20   20   63.07      92.61   75.04

Table 1.6: Effects of the word-feature method on performance
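A minimal sketch of the six-feature extraction described above; representing the features as a tuple of booleans is an assumption about a detail the text leaves open.

```python
def word_features(word):
    """The six binary word features: capitalisation, hyphenation,
    digits, and the suffixes -s, -ing and -tion."""
    return (
        word[:1].isupper(),              # capitalised
        "-" in word,                     # hyphenated
        any(ch.isdigit() for ch in word),  # contains a number
        word.endswith("s"),
        word.endswith("ing"),
        word.endswith("tion"),
    )
```

Under the independence assumption noted above, Pr(word_features(W) | T) would be estimated as the product of the per-feature probabilities, each counted over the sparsely seen words (fewer than 10 occurrences) associated with T.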

4.5 Extending the model to quadgrams

The trigram model was extended to a quadgram model using the same principles. With Good-Turing discounting and Katz’s back-off in place, the model produces the results seen in table 1.7.

 N    K    Precision  Recall  F1
 5    5    74.39      87.73   80.51
 15   15   71.26      91.30   80.04

Table 1.7: Performance of the quadgram model using Katz’s back-off

The degradation in performance compared to the simpler trigram model can be explained by the sparseness of the training data, which makes correct estimation of the back-off coefficients for Katz’s model difficult. In an attempt to overcome this problem, a back-off model is used which simply takes the quadgram estimate when available, and acts as a trigram model otherwise:

\[
\tilde{P}(T_4 \mid T_1, T_2, T_3) =
\begin{cases}
P(T_4 \mid T_1, T_2, T_3) & \text{if } P(T_4 \mid T_1, T_2, T_3) > 0 \\
P(T_4 \mid T_2, T_3) & \text{otherwise}
\end{cases}
\tag{1.4}
\]
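A minimal sketch of this simplified back-off, assuming the two smoothed models are available as probability lookups (hypothetical names):

```python
def backoff_quadgram(t1, t2, t3, t4, quad_prob, tri_prob):
    """Simplified back-off of equation (1.4): use the quadgram estimate
    when it is non-zero, otherwise fall back to the trigram model.
    Unlike Katz's back-off, this ad hoc scheme does not renormalise."""
    p = quad_prob(t1, t2, t3, t4)
    return p if p > 0 else tri_prob(t2, t3, t4)
```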

This method produces the results seen in table 1.8.

 N    K    Precision  Recall  F1
 5    5    75.56      88.83   81.66
 20   20   62.61      92.82   74.78

Table 1.8: Performance of the quadgram model using simplified back-off

The improvement provided by this method is around 0.2%-0.3%, which is not remarkable; a similar level of improvement applies to the final test data, as can be seen in tables 1.10 and 1.11. For appreciable gains to be made through n-grams longer than trigrams, more training data is needed, which would also allow a more

theoretically founded back-off model to be used. Even with the limited data available, however, small gains are achievable.

5 Results

The methods discussed so far were applied to the test data (section 23 of the WSJ) in their optimal configurations; the results are seen in table 1.10. The higher level of accuracy can be partially explained by the lower percentage (5.5%) of unknown words in this data.

 N    K    SPW     Unk. word err.  Sent. recall  Precision  Recall  F1
 5    5    1.176   17.32           31.28         75.67      89.02   81.81
 5    15   1.417   16.58           21.12         63.06      89.33   73.93
 7    7    1.233   15.79           36.00         73.22      90.31   80.87
 10   10   1.301   14.13           41.14         70.18      91.33   79.37
 10   40   1.576   13.64           42.00         58.09      91.58   71.09
 15   15   1.402   12.54           45.36         65.95      92.45   76.98
 15   60   1.698   12.05           46.16         54.56      92.67   68.69
 20   20   1.482   11.69           47.55         62.73      92.99   74.92
 20   80   1.793   11.30           48.57         52.00      93.23   66.77
 30   30   1.626   10.89           50.99         57.73      93.78   71.46
 40   40   1.753   10.19           53.52         53.75      94.23   68.46
 50   50   1.849   9.58            55.04         51.13      94.57   66.38
 60   60   1.943   9.08            56.36         48.82      94.85   64.46

Table 1.10: Performance on the test set using the finalised versions of the methods. SPW = supertags per word; ‘Unk. word err.’ is the error rate for unknown words; ‘Sent. recall’ is per-sentence recall; precision and recall are per word.

 N    K    Precision  Recall  F1
 5    5    75.59      89.15   81.81
 10   10   69.94      91.62   79.33
 15   15   65.54      92.67   76.78
 20   20   62.10      93.27   74.56

Table 1.11: Performance on the test set using the quadgram model

Through the information gained in the experiments, it was possible to improve the performance of the system considerably. The recall for N, K = 5 was improved from the initial 86.38% to 88.54%, and for N, K = 20 from 90.17% to 92.60%. On the test data, for N = 5, K = 1, our method achieves 86.45% recall, compared to the 92.2% that Srinivas and Joshi [19] report on similar data; at the limit, they achieve 97.1% using 3-best supertags.


As our method outputs a variable number of supertags per word, an exact comparison is not possible, but at 1.17 supertags per word the performance increases to 89.02%. These results are considerably lower than those achieved by Srinivas and Joshi; comparable performance is only possible by increasing the beam width, so that at N, K = 20 a performance of 92.99% is achieved. The differences may be attributed to the different lexical formalisms used (XTAG vs. CCG) or to differences in the implementation of the methods; only a rather small portion of the difference can be attributed to the different training/test sections used. As the details specified in the paper in question were implemented as far as possible, it is difficult to tell what modifications would be necessary for comparable results.

By analysing the ratio between the increases in supertags per word and in recall for N, K values of 20, 30, 40, 50 and 60, it can be estimated that around 4 supertags per word on average would be needed to achieve recall levels of 99.9%. Taking into consideration that the rate of improvement will most likely lessen at higher levels of N, K, the number of supertags needed will probably lie around 5-6 per word. Considering that the original search space consists of more than 1,200 supertags, of which several hundred may apply to a word, the reduction is appreciable.

6 Conclusion and future work

A CCG-based supertagger built on a trigram statistical model was implemented. The model uses Good-Turing discounting and Katz’s back-off, and achieves 93% recall and 62.7% precision on a typical test.⁴ The optimisation of the different components of the language model was discussed, revealing their influence on the overall performance of the system. The model achieved lower performance than a comparable system using LTAG. Methods that could be useful for further development include:

⁴ These figures are for an experiment running N, K = 20 paths in memory. A computationally cheaper experiment at N, K = 5 gives 89.02% recall and 75.67% precision; a more expensive one at N, K = 60 gives 94.85% recall and 48.82% precision.

• The incorporation of part-of-speech information. As POS taggers reach high levels of accuracy, the model could be augmented such that $\tilde{P}(w_i \mid T_i) = \alpha P(w_i \mid T_i) + (1 - \alpha) P(w_i \mid POS_i)$, with a similar method of interpolation for the contextual probabilities (see the sketch after this list). Successful use of the POS information should give an appreciable performance increase.

• Using category-based models [15], under which a small number of supertags would be assigned to each word, it would be possible to use longer histories than trigrams without requiring more training data.


• A head-trigram model [10] could be used, in which some words are classified as heads and the others as dependents; the supertag for a word then depends on the previous head words in combination with the two immediately preceding words.

• A method for filtering supertags based on parameters other than frequency, such as the span constraints discussed by [19], may prove useful.

• Manual evaluation of some of the training data for accuracy may help account for the difference in performance between the XTAG- and CCG-based approaches.
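As an illustration of the POS interpolation proposed in the first bullet, a hypothetical sketch; the probability lookups and the value of alpha are assumptions, not part of the implemented system.

```python
def interpolated_emission(word, tag, pos, p_word_given_tag, p_word_given_pos,
                          alpha=0.9):
    """Linear interpolation of the supertag-conditioned and the
    POS-conditioned word models, as proposed above; pos is the POS tag
    assigned to the word by an external tagger."""
    return (alpha * p_word_given_tag(word, tag)
            + (1.0 - alpha) * p_word_given_pos(word, pos))
```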

Bibliography

[1] J. Allen. 1995. Natural Language Understanding. Benjamin/Cummings, Redwood City, CA.

[2] Eric Brill. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio.

[3] R. Chandrasekar, C. Doran and B. Srinivas. 1996. Motivations and methods for text simplification. In Proceedings of COLING 96, 1041-1044.

[4] R. Chandrasekar and B. Srinivas. 1997. Gleaning information from the web: Using syntax to filter out irrelevant information. In Proceedings of the AAAI 1997 Spring Symposium on NLP on the World Wide Web.

[5] R. Chandrasekar and B. Srinivas. 1997. Using supertags in document filtering: The effect of increased context on information retrieval effectiveness. In Proceedings of Recent Advances in NLP (RANLP).

[6] R. Chandrasekar and B. Srinivas. 1997. Using syntactic information in document filtering: A comparative study of part-of-speech tagging and supertagging. In Proceedings of RIAO ’97, 531-545, Montreal.

[7] R. Chandrasekar and B. Srinivas. 1998. Knowing a word by the company it keeps: Using local information in a maximum entropy model for word sense disambiguation. Submitted for publication.

[8] E. Charniak, G. Carroll, J. Adcock, A. Cassandra, Y. Gotoh, J. Katz, M. Littman and J. McCann. 1996. Taggers for parsers. Artificial Intelligence, 85(1-2).

[9] E. Charniak. 1993. Statistical Language Learning. MIT Press, Cambridge, MA.

[10] John Chen, Srinivas Bangalore and K. Vijay-Shanker. 1999. New models for improving supertag disambiguation. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, Norway.

[11] I. J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237-264.


[12] Joshua T. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, 15(4):403-434.

[13] Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.

[14] Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401.

[15] T. R. Niesler and P. C. Woodland. 1996. A variable-length category-based n-gram language model. In Proceedings of IEEE ICASSP.

[16] Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

[17] Yves Schabes and Aravind K. Joshi. 1991. Parsing with lexicalized tree-adjoining grammar. In M. Tomita, editor, Current Issues in Parsing Technology. Kluwer Academic Publishers.

[18] B. Srinivas and B. Baldwin. 1996. Exploiting supertag representation for fast coreference resolution. In Proceedings of the International Conference NLP+IA/TAL+AI, Moncton, NB, Canada.

[19] Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-266.

[20] M. Steedman. 1996. A very short introduction to CCG. Unpublished tutorial paper. Available at http://www.cogsci.ed.ac.uk/steedman/papers.html.

[21] M. Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA.
