Proceedings of Information and Communication Technologies International Symposium (ICTIS07): Workshop on Arabic Natural Language Processing, Morocco, 2007.

Machine Translation Oriented Syntactic Normalization of Noun Phrases in Arabic

Khaled Elghamry
[email protected]
Faculty of Al-Alsun (Languages), Ain Shams University, Cairo, EGYPT

Abstract--It has been shown that syntactic normalization boosts text summarization and improves both precision and recall in information retrieval. This paper shows that syntactic normalization can also improve the performance of machine translation systems. It presents a method for identifying and normalizing the structural variations of the nominal ‘Construct State’, also known as iDafa, in Arabic. This type of noun phrase has the structure Noun1 Noun2 and is highly frequent in Arabic. The paper (i) describes the structural variations of this construction, (ii) describes the suggested method for identifying these variations, and (iii) tests the effect of normalizing them on the performance of Arabic-to-English machine translation systems. The results showed a 5-point improvement in MT performance. To the best of the author’s knowledge, this is the first attempt at text normalization in Arabic and at measuring its effect on Arabic MT.

Index Terms--Arabic, Machine Translation, Normalization, Noun Phrases

I. INTRODUCTION

Normalizing the different variations of a given linguistic structure is an important issue in several NLP tasks and applications. It has been shown that text normalization boosts text summarization [1] and paraphrasing [2], and improves both precision and recall in information retrieval [3][4][5][6][7]. To the best of the author’s knowledge, no syntactic normalization methods have been presented for Arabic. This paper is an attempt in this direction: it presents a method for identifying and normalizing the structural variations of the nominal Construct State (CS) in Arabic. CS is a Semitic nominal construction which expresses a genitival relation between a head noun and a noun phrase without the mediation of a preposition (e.g. mSAdr Al$rTp, literally ‘sources the police’, i.e. ‘police sources’). It is shown that syntactic normalization can also improve the performance level in machine translation. The method suggested in this paper only requires a lightly tokenized, un-disambiguated, part-of-speech-tagged corpus.

Manuscript received January 7, 2007. Khaled Elghamry is with The Faculty of Al-Alsun (Languages), Ain Shams University, Abbassia, Cairo, Egypt (phone/fax: 202-639-4122; email: [email protected]).

The rest of the paper is organized as follows. Section 2 reviews the structural properties of the Construct State and its syntactic variants. Section 3 describes the suggested method for identifying these variants. Section 4 presents an evaluation of the identification results. Section 5 shows the effect of normalizing these variants on Arabic-to-English machine translation. Finally, Section 6 discusses the conclusions of the study and proposes possible directions for improvement.

II. NORMALIZABLE NPS IN ARABIC

Arabic (and other Semitic languages) uses a nominal construct state to express a genitival relation between a head noun and a noun phrase without the mediation of a (dummy) preposition [8]. The following noun phrase is an example of CS in Arabic. All Arabic data and examples are transliterated using the Buckwalter transliteration scheme (see footnote 1).

Tags:      N           DET N
Arabic:    معهد        التخطيط
Translit:  mEhd        AltxTyT
Literal:   institute   the planning
Gloss:     the planning institute

The number and gender features of the whole construct come from the head, and the definiteness feature from the modified noun phrase. The head also carries the case of the construction, which depends on the syntactic position of the construction, whereas the modified noun phrase is always in the genitive case. The two elements of the construct cannot be separated from each other. Therefore, if there is an adjective that modifies the head noun, the modifier should be put after the second element, as in the following example.

Tags:      N           DET N          DET ADJ
Arabic:    معهد        التخطيط        القومي
Translit:  mEhd        AltxTyT        Alqwmy
Literal:   institute   the planning   the national
Gloss:     the national planning institute

An alternative is to put the adjective right after the head noun, and in this case the genitival relation between the two elements is mediated by a preposition.

1 http://www.qamus.org/transliteration.htm

Tags:      DET N           DET ADJ        P DET N
Arabic:    المعهد          القومي         للتخطيط
Translit:  AlmEhd          Alqwmy         lltxTyT
Literal:   the institute   the national   for the planning
Gloss:     the national institute for planning

A preposition is obligatory if we want to limit indefiniteness to the head noun only, as in the following example.

Tags:      N           P DET N
Arabic:    معهد        للتخطيط
Translit:  mEhd        lltxTyT
Literal:   institute   for the planning
Gloss:     an institute for planning

The prepositional alternatives of the CS are frequent in Arabic, and almost every CS has a prepositional counterpart. Another alternative structure for the construct state can be obtained by using the relative adjective equivalent of the second element of the CS. Relative adjectives in Arabic are derived by attaching any of the suffixes (y, yp, ywn, yyn) to a noun (see footnote 2). They usually denote that a person or thing is related to, or connected with, this noun, generally with respect to a tribe, location, etc. (e.g. mSr ‘Egypt’, mSry ‘Egyptian’) [8], as in the following example.

Tags:      N      N
Arabic:    دور    مصر
Translit:  dwr    mSr
Literal:   role   Egypt
Gloss:     Egypt’s role

Tags:      DET N      DET ADJ
Arabic:    الدور      المصري
Translit:  Aldwr      AlmSry
Literal:   the role   the Egyptian
Gloss:     the Egyptian role

This means that the following structures are possible syntactic variations of the construct state, where ‘↔’ means ‘can be re-written as’.

N1 N2 ↔ N1 ADJ
N1 N2 ↔ N1 P N2
N1 N2 ADJ ↔ N1 ADJ P N2

Figure 1: Syntactic Variants of CS

In the following section, I will describe a method for identifying instances of noun phrases in the corpus that have any of these syntactic variations.

III. IDENTIFYING EQUIVALENT NPS

The identification process uses a straightforward distributional method. Given each of the structures above, we check whether the corpus contains one or more of its possible alternative structures. Identifying equivalent NPs was done in two steps. The first was to extract all examples of possible NP equivalents. The second was to determine true variants.

The identification rules for the first step were as follows, where parentheses around an element indicate that it is optional:

1. Extract all word sequences W1 W2 in the corpus that have the POS-tag pattern N1 N2 and that are not immediately followed by a word tagged as ADJ.
2. If the corpus contains a word sequence W1 W3 with the POS-tag pattern N1 ADJ, such that ADJ = N2 + [y|yp|ywn|yyn], where y, yp, ywn, and yyn are relative adjective markers, then W1 W2 and W1 (Al) W3 are possible NP structural variations of each other.
3. If the corpus contains a word sequence with the POS-tag pattern N1 P N2, then N1 N2 and N1 P N2 are possible structural variations.
4. If the corpus contains a word sequence that has the tag pattern N1 N2 ADJ and a word sequence that has the tag pattern N1 ADJ P N2, then both structures are possible structural variations.

The second step, identifying true variants, was done using human raters. All the examples identified in the first step were given to three native speakers of Arabic. Rating followed a binary decision: two structures are either Equivalent or Not Equivalent. Two of the identified NPs are equivalent if they refer to the same entity (person, place, etc.). If two out of the three raters agreed on the equivalence, then equivalence was established.
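The two steps above can be sketched in code. The following Python sketch is only an illustration under stated assumptions, not the original Perl implementation: it assumes each sentence is a list of (word, possible-tags) pairs with the article and proclitics already segmented off, treats DET as optional by dropping determiner tokens before matching, and uses hypothetical names (`match`, `candidate_pairs`, `is_true_variant`, `DEMO`) and a toy corpus built from the paper's own examples.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, Set[str]]              # (word, all of its possible POS tags)
Words = Tuple[str, ...]
REL_SUFFIXES = ("y", "yp", "ywn", "yyn")  # relative adjective markers


def match(tokens: List[Token], pattern: Tuple[str, ...]) -> List[Tuple[int, Words]]:
    """Return (start index, words) for every window that can carry `pattern`."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(tag in tags for (_, tags), tag in zip(window, pattern)):
            hits.append((i, tuple(word for word, _ in window)))
    return hits


def candidate_pairs(sentences: List[List[Token]]) -> List[Tuple[str, Words, Words]]:
    """Step 1: collect possible variant pairs with the four rules."""
    cs: Set[Words] = set()
    n_adj: Set[Words] = set()
    n_p_n: Dict[Words, Words] = {}
    cs_adj: Set[Words] = set()
    n_adj_p_n: Dict[Words, Words] = {}
    for sentence in sentences:
        # Assumption: DET is optional everywhere, so determiners are dropped.
        toks = [t for t in sentence if t[1] != {"DET"}]
        for i, words in match(toks, ("N", "N")):
            following = toks[i + 2][1] if i + 2 < len(toks) else set()
            if "ADJ" not in following:      # rule 1: no adjective right after
                cs.add(words)
        n_adj.update(words for _, words in match(toks, ("N", "ADJ")))
        for _, (n1, p, n2) in match(toks, ("N", "P", "N")):
            n_p_n[(n1, n2)] = (n1, p, n2)
        cs_adj.update(words for _, words in match(toks, ("N", "N", "ADJ")))
        for _, (n1, adj, p, n2) in match(toks, ("N", "ADJ", "P", "N")):
            n_adj_p_n[(n1, n2, adj)] = (n1, adj, p, n2)

    pairs = []
    for n1, n2 in sorted(cs):
        for suffix in REL_SUFFIXES:         # rule 2: ADJ = N2 + suffix
            if (n1, n2 + suffix) in n_adj:
                pairs.append(("N1 N2 <-> N1 ADJ", (n1, n2), (n1, n2 + suffix)))
        if (n1, n2) in n_p_n:               # rule 3
            pairs.append(("N1 N2 <-> N1 P N2", (n1, n2), n_p_n[(n1, n2)]))
    for key in sorted(cs_adj):              # rule 4
        if key in n_adj_p_n:
            pairs.append(("N1 N2 ADJ <-> N1 ADJ P N2", key, n_adj_p_n[key]))
    return pairs


def is_true_variant(votes: List[bool]) -> bool:
    """Step 2: equivalence holds if at least two of the three raters agree."""
    return sum(votes) >= 2


DET, N, ADJ, P = {"DET"}, {"N"}, {"ADJ"}, {"P"}
DEMO = [  # toy corpus, Buckwalter transliteration
    [("dwr", N), ("mSr", N)],
    [("Al", DET), ("dwr", N), ("Al", DET), ("mSry", ADJ)],
    [("mEhd", N), ("Al", DET), ("txTyT", N), ("Al", DET), ("qwmy", ADJ)],
    [("Al", DET), ("mEhd", N), ("Al", DET), ("qwmy", ADJ),
     ("l", P), ("Al", DET), ("txTyT", N)],
    [("mEhd", N), ("l", P), ("Al", DET), ("txTyT", N)],
    [("mEhd", N), ("Al", DET), ("txTyT", N)],
]

if __name__ == "__main__":
    for pair in candidate_pairs(DEMO):
        print(pair)
```

Because tags are kept as sets, a word with several possible POS tags can take part in more than one pattern, which mirrors the un-disambiguated input the method relies on and also explains why the candidates still need manual filtering.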

IV. IDENTIFICATION EVALUATION

These identification rules were implemented in Perl. The performance of these rules was evaluated on a corpus extracted from Al-Ahram Arabic Newspaper (in 2001; see footnote 3). The corpus contained 3,318,774 tokens and 146,289 unique words. The corpus was first analyzed using Buckwalter’s Arabic Morphological Analyzer (BAMA; see footnote 4). As a result, every word in the corpus was given all its possible part-of-speech (POS) tags and morphological analyses. The words analyzed as NOT-FOUND were excluded. Only light tokenization was then performed, limited to segmenting off the definite article (Al), proclitic prepositions (b, l, k), and conjunctions (w, f). Tokenization was done using the analyses provided by BAMA. No POS disambiguation was necessary. Spelling normalization is already incorporated in the analyzer.

Table 1 shows the results of the identification rules for each pair of possible syntactic variants. The second column in the table shows the number of candidate pairs of syntactic variants, and the third column shows the number of noun phrase tokens in the corpus for each pair type. The manual analysis of a 10,000-token sample of the corpus showed that there are almost 7 normalizable structures per 1,000 tokens. The rules identified a total of 2331 possible normalizable pairs totaling 26684 tokens. Not all pairs were true syntactic variants of each other. The next step was then to determine true variants. This was performed by human raters as explained above. Table 2 shows the results after filtering out ‘spurious’ cases of equivalence.

2 These adjectives usually correspond to English adjectives ending in ‘-an’, ‘-ian’, ‘-ese’, or ‘-al’.
3 http://www.ahram.org.eg
4 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02
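The light tokenization described in this section can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the paper relies on BAMA's analyses to decide where to segment, whereas this sketch stands in for them with a hypothetical set of known stems (`stems`), and the function name `segment` is likewise illustrative. It also assumes the usual spelling in which the article is written as a single l after the preposition l (as in lltxTyT).

```python
from typing import List, Set

CONJ = ("w", "f")        # proclitic conjunctions
PREP = ("b", "l", "k")   # proclitic prepositions


def segment(word: str, stems: Set[str]) -> List[str]:
    """Split off conjunction, preposition and article if a known stem remains.

    `stems` is a stand-in for the morphological analyzer: a segmentation is
    accepted only when what is left over is a known stem.
    """
    for conj in ("",) + CONJ:
        if not word.startswith(conj):
            continue
        rest = word[len(conj):]
        for prep in ("",) + PREP:
            if not rest.startswith(prep):
                continue
            tail = rest[len(prep):]
            # After the preposition l the article surfaces as a single l.
            articles = ("", "l") if prep == "l" else ("", "Al")
            for article in articles:
                stem = tail[len(article):]
                if tail.startswith(article) and stem in stems:
                    parts = [conj, prep, "Al" if article else "", stem]
                    return [part for part in parts if part]
    return [word]  # no analysis found: leave the word untouched


STEMS = {"mEhd", "txTyT", "dwr"}  # toy stem list, illustrative only

if __name__ == "__main__":
    for w in ("lltxTyT", "AlmEhd", "wAldwr", "ktAb"):
        print(w, "->", segment(w, STEMS))
```

Requiring a known stem keeps the sketch from splitting words that merely begin with one of these letters, which is the role the analyzer's output plays in the procedure described above.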

Table 1: Results of NP variants identification

Normalizable Pairs           # of Pairs   Tokens
N1 N2 ↔ N1 ADJ               627          7961
N1 N2 ↔ N1 P N2              1636         18127
N1 N2 ADJ ↔ N1 ADJ P N2      69           593
TOTAL                        2331         26684

Table 2: Identification Type and Token Precision

Normalizable Pairs           # True Pairs   Type Prec.   Token Prec.
N1 N2 ↔ N1 ADJ               484            77%          70%
N1 N2 ↔ N1 P N2              1384           85%          93%
N1 N2 ADJ ↔ N1 ADJ P N2      69             100%         100%
TOTAL                        1936           83%          84%

A total of 1936 pairs out of the possible 2331 variant pairs were identified as true variants. The third and fourth columns of Table 2 show the type and token precision of the identification rules for each pair of structures. The rules achieved almost equal average type and token precision rates, 83% and 84%, respectively. The lowest precision rate was in the N1 N2 ↔ N1 ADJ pair, and the highest was in the N1 N2 ADJ ↔ N1 ADJ P N2 pair. The low precision of the first pair was largely due to the high frequency of construct states that have the word ‘Aldwlp’ (meaning state or country) as their second member, leading to false equivalences such as the following (see footnote 5):

Tags:      N         DET N
Arabic:    جهود      الدولة
Translit:  jhwd      Aldwlp
Literal:   efforts   the state
Gloss:     the state’s efforts

Tags:      DET N         DET ADJ
Arabic:    الجهود        الدولية
Translit:  Aljhwd        Aldwlyp
Literal:   the efforts   the international
Gloss:     the international efforts

This word alone accounted for 55 incorrect possible variants, almost 9% of the 627 candidate pairs for this structure (55 of its 143 false equivalences). On the other hand, the perfect precision of the N1 N2 ADJ ↔ N1 ADJ P N2 pair makes it a fully reliable normalizable pair: any pair of word sequences corresponding to these POS-tag sequences can safely be considered variants of each other. The noun phrases ‘mEhd AltxTyT Alqwmy’, meaning ‘The National Planning Institute’, and ‘AlmEhd Alqwmy lltxTyT’, meaning ‘The National Institute for Planning’, illustrate this type of pair. The results in Table 2 are the first results to be reported for NP normalization in Arabic. Future research is still required to find fully automated methods for the identification of true variants. The results obtained using manual filtering are used below to show that normalizing these types of nominal structures can significantly improve the performance of an Arabic-to-English machine translation system.

5 This is due to the fact that the relative adjective derived from this noun is confusable with the relative adjective derived from its plural form (i.e. Aldwl).

V. THE NORMALIZATION EFFECT ON MT

Once true NP variants had been identified, the next step was to check whether normalizing these variants would have a significant effect on the performance of an Arabic-English MT system in translating these NP structures. For this purpose, the Arabic-to-English translation tool in Google’s Language Tools was used (accessed January 2nd and 3rd, 2007; see footnote 6). The normalization effect was tested according to the following procedure:

1. All pairs of Arabic NP variants were translated using the Google tool. Each structure was given to the MT system in a +/-3-word context. Some experiments on the context effect showed that a smaller window affected the quality of the translation, whereas a larger window had no effect on the translation.
2. The translations of these pairs were then given to a professional human translator for evaluation. Four translation judgments were used in evaluation: (a) BOTH CORRECT, if the translations of the two variants were correct; (b) VAR1-CORRECT, if only the translation of the first variant was correct; (c) VAR2-CORRECT, if only the translation of the second variant in the pair was correct; and (d) BOTH WRONG, if the translations of the two variants in a pair were wrong.

Table 3 shows the phrase token correctness results for the three pairs after human evaluation and prior to normalization.

Table 3: Translation Correctness Rates for all Pairs

Pair                         Correctness rate
N1 N2 ↔ N1 ADJ               0.841
N1 N2 ↔ N1 P N2              0.912
N1 N2 ADJ ↔ N1 ADJ P N2      0.803
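One way to turn the four judgments into a phrase-token correctness rate is sketched below in Python. This is an illustrative reading of the evaluation, not the author's code: the class and function names (`Judgment`, `PairResult`, `correctness_rate`) are hypothetical, each pair is assumed to carry separate token counts for its two variants, and the sample data is invented purely to exercise the function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Judgment(Enum):
    BOTH_CORRECT = "both correct"
    VAR1_CORRECT = "only variant 1 correct"
    VAR2_CORRECT = "only variant 2 correct"
    BOTH_WRONG = "both wrong"


@dataclass
class PairResult:
    var1_tokens: int      # corpus occurrences of the first variant
    var2_tokens: int      # corpus occurrences of the second variant
    judgment: Judgment    # the human translator's decision for the pair


def correctness_rate(results: Iterable[PairResult]) -> float:
    """Share of phrase tokens whose translation was judged correct."""
    correct = total = 0
    for r in results:
        total += r.var1_tokens + r.var2_tokens
        if r.judgment in (Judgment.BOTH_CORRECT, Judgment.VAR1_CORRECT):
            correct += r.var1_tokens
        if r.judgment in (Judgment.BOTH_CORRECT, Judgment.VAR2_CORRECT):
            correct += r.var2_tokens
    return correct / total if total else 0.0


# Invented sample data, for illustration only.
SAMPLE = [
    PairResult(10, 5, Judgment.BOTH_CORRECT),
    PairResult(3, 2, Judgment.VAR1_CORRECT),
    PairResult(1, 4, Judgment.VAR2_CORRECT),
    PairResult(2, 3, Judgment.BOTH_WRONG),
]

if __name__ == "__main__":
    print(f"correctness rate: {correctness_rate(SAMPLE):.3f}")
```

Counting tokens rather than pair types weights frequent phrases more heavily, which matches the phrase-token rates reported in Table 3.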

Below I discuss these results for each pair in detail and show the effect of normalization on the performance for each pair, then on the overall performance. Table 4 shows the results for the N1 N2 ↔ N1 ADJ pair. The system yielded correct translations for both variants in 267 of the 484 pairs, totaling 3549 phrase tokens. In 46 pairs only the N1 N2 translation was correct, totaling 390 phrase tokens. In 118 pairs only the translation of the N1 ADJ variant was correct, totaling 679 phrase tokens. Finally, in 53 pairs the system yielded the wrong translation for both syntactic variants, totaling 360 phrase tokens. This results in wrong translations for a total of 217 pairs and 871 phrase tokens: the 360 tokens of the both-wrong pairs, plus the 377 N1 N2 tokens and the 134 N1 ADJ tokens that were mistranslated in the pairs where only one variant was correct.

6 http://www.google.com

Table 4: Translation Results for N1 N2 ↔ N1 ADJ

          Both correct   Only N1 N2 correct   Only N1 ADJ correct   Both wrong
Types     267            46                   118                   53
Tokens    3549           390                  679                   360

If we take the N1 ADJ variant to be the normal form for the N1 N2 ↔ N1 ADJ pairs where only the translation of the N1 ADJ variant was correct, then the 118 N1 N2 variants are translated correctly by the system after normalization. These 118 variants total 377 phrase tokens, which means a reduction of 377 in the 871 erroneous translations and an improvement in performance of 6.9 points. If, on the other hand, we take the N1 N2 variant to be the normal form for the 46 pairs where only the N1 N2 translation was correct, their corresponding N1 ADJ variants are also correctly translated once normalized as N1 N2. These 46 N1 ADJ variants total 134 phrase tokens. This means a reduction of 134 in errors and a corresponding improvement in performance of 2.4 points. This is a total of 9.3 points of improvement as a result of normalization. The same procedures were also applied to the second and the third syntactic pairs. Tables 5 and 6 show the results of the system’s performance on the two structural pairs.

Table 5: Translation Results for N1 N2 ↔ N1 P N2

          Both correct   Only N1 N2 correct   Only N1 P N2 correct   Both wrong
Types     1069           85                   83                     147
Tokens    14330          621                  373                    1131

Table 6: Translation Results for N1 N2 ADJ ↔ N1 ADJ P N2

          Both correct   Only N1 N2 ADJ correct   Only N1 ADJ P N2 correct   Both wrong
Types     48             6                        10                         5
Tokens    374            21                       81                         85

Normalization was carried out in the same manner as for the first pair. The result of this normalization was 2 points of improvement in performance for the N1 N2 ↔ N1 P N2 pair, and 3.7 points for N1 N2 ADJ ↔ N1 ADJ P N2. Put together, the three improvements by normalization lead to 5 points of improvement in the overall performance of the system in translating the three pairs of structural variants. Table 7 shows the performance before and after normalization.

Table 7: Performance Rates before and after normalization

Pair                         Before   After
N1 N2 ↔ N1 ADJ               0.841    0.934
N1 N2 ↔ N1 P N2              0.912    0.932
N1 N2 ADJ ↔ N1 ADJ P N2      0.803    0.840
Average                      0.852    0.902

Though this improvement-by-normalization method has the disadvantage of requiring human intervention in evaluation and normalization, the significant improvement in performance encourages future research into more automated methods for evaluation and normalization.
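The arithmetic behind the reported improvements for the N1 N2 / N1 ADJ pair and the averages in Table 7 can be checked with a few lines of Python. This is a worked check rather than part of the method. One value is an assumption: the total number of phrase tokens for the first pair is not stated directly, so it is inferred here from the reported 871 erroneous tokens and the 0.841 baseline rate. All variable names are illustrative.

```python
def points(tokens_fixed: int, total_tokens: int) -> float:
    """Improvement in percentage points when `tokens_fixed` wrong phrase
    tokens become correct out of `total_tokens`."""
    return 100.0 * tokens_fixed / total_tokens


# Figures reported in the text for the N1 N2 / N1 ADJ pair.
ERROR_TOKENS = 871       # mistranslated phrase tokens before normalization
BASELINE = 0.841         # correctness rate before normalization (Table 3)
FIXED_AS_N1_ADJ = 377    # N1 N2 tokens fixed by rewriting them as N1 ADJ
FIXED_AS_N1_N2 = 134     # N1 ADJ tokens fixed by rewriting them as N1 N2

# ASSUMPTION: total phrase tokens inferred from the error count and baseline.
TOTAL_TOKENS = round(ERROR_TOKENS / (1.0 - BASELINE))

gain_adj = points(FIXED_AS_N1_ADJ, TOTAL_TOKENS)
gain_cs = points(FIXED_AS_N1_N2, TOTAL_TOKENS)
after_pair1 = BASELINE + (gain_adj + gain_cs) / 100.0

# Per-pair rates from Table 7.
BEFORE = [0.841, 0.912, 0.803]
AFTER = [0.934, 0.932, 0.840]
mean_before = sum(BEFORE) / len(BEFORE)
mean_after = sum(AFTER) / len(AFTER)

if __name__ == "__main__":
    print(f"inferred total tokens : {TOTAL_TOKENS}")
    print(f"gain via N1 ADJ       : {gain_adj:.1f} points")
    print(f"gain via N1 N2        : {gain_cs:.1f} points")
    print(f"pair 1 after          : {after_pair1:.3f}")
    print(f"average before/after  : {mean_before:.3f} / {mean_after:.3f}")
```

Under that inferred total, the two gains round to the 6.9 and 2.4 points given in the text, and the averages of the Table 7 columns differ by the 5 points reported for the overall improvement.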

VI. CONCLUSION AND FUTURE DIRECTIONS

This paper presented a method for identifying and normalizing noun phrase variations in Arabic. It showed that syntactic normalization results in a significant improvement in the performance of a machine translation system. This positive effect of normalization on performance motivates future research in two related directions: finding fully or partially automated methods for identifying and normalizing syntactic variants, and identifying more normalizable structures in Arabic. The preliminary results of some experiments on other normalizable structures in Arabic showed that the same approach could be used with more complicated noun phrase structures as well as compound adjectives. Consequently, the author plans to extend the present method to incorporate these new structures into the normalization scheme and to study their effect on MT performance.

REFERENCES

[1] Regina Barzilay, Kathleen McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL'99, pages 550-557, University of Maryland.
[2] Caroline Brun and Caroline Hagege. 2003. Normalization and Paraphrasing Using Symbolic Methods. In Proceedings of IWP2003, pages 41-48.
[3] David A. Evans and Chengxiang Zhai. 1996. Noun-Phrase Analysis in Unrestricted Text for Information Retrieval. 34th Annual Meeting of ACL (ACL-96).
[4] Christian Jacquemin and Evelyne Tzoukermann. 1999. NLP for term variant extraction: A synergy of morphology, lexicon, and syntax. In Tomek Strzalkowski, editor, Natural Language Information Retrieval, pages 25-74. Kluwer, Boston.
[5] Cecile Fabre and Christian Jacquemin. 2000. Boosting Variant Recognition with Light Semantics. COLING 2000: 264-270.
[6] A. T. Arampatzis, T. Tsoris, C. H. A. Koster, and Th. P. van der Weide. 1998. Phrase-based information retrieval. Information Processing & Management, 34(6): 693-707.
[7] C.H.A. Koster, C. Derksen, D. van de Ende, and J. Potjer. 1999. Normalization and Matching in the DORO System.
[8] G.W. Thatcher. 1982. Arabic Grammar of the Written Language. Frederick Ungar Publishing Co., New York.
