Improving SMT by learning translation direction

Cyril Goutte, David Kurokawa, Pierre Isabelle
Interactive Language Technologies group
Institute for Information Technology
National Research Council

April 2008

SMART workshop, Barcelona 2009


Motivation

We address two questions:
1. Is there a difference between original and (human-)translated text, and can we detect it reliably?
2. If so, can we use that to improve Machine Translation quality?

Our answers:
1. Yes: on the Canadian Hansard, we get 90+% accuracy.
2. Yes: on French-English, we obtain up to a 0.6 BLEU point increase.


Problem setting

Translations often have a “feel” of the original language: Translationese.
If translationese is real, it may be possible to detect it!

Earlier studies:
- Baroni & Bernardini (2006): detect original vs. translation in a monolingual Italian corpus, with accuracy up to 87%.
- van Halteren (2008): detect source language in a multi-parallel corpus and identify source language markers.

Both show that various aspects of translationese are detectable.
We experiment on a large bilingual corpus (Hansard) and investigate how detecting translation direction may impact Machine Translation quality.


Index
1. Motivation and setting
2. Data
3. Detecting Translation Direction
4. Exploiting Translation Direction in SMT
5. Discussion


Data: The Hansard corpus

Bilingual (En-Fr) transcripts of the sessions of the Canadian parliament.
Most of the 35th to 39th parliaments, covering 1996-2007.
1. Tagged with information on original language (French or English).
2. High quality translation: Reference material in Canada.
3. Large amount of data: 4.5M sentences, 165M words.

      words (fr)   words (en)   sentences   blocks
fo    14,648K      13,002K        902,349   40,538
eo    72,054K      64,899K      3,668,389   42,750
mx    86,702K      77,901K      4,570,738   83,288

fo: originally French; eo: originally English; mx: both combined.


Data: The Hansard corpus (II)

Corpus issues:
- Slightly inconsistent tagging, e.g. both sides claim to be original: puts overall tagging reliability into question.
- Missing text/alignment, e.g. valid English but no translation: seems to be a retrieval issue.
- Imbalance at the word/sentence level: 80% originally English.
- There may be lexical/contextual hints: Quebec MPs tend to speak French, western Canada MPs are almost exclusively anglophone.


Corpus (pre)processing

- Tokenized (NRC in-house tokenizer)
- Lowercased
- Sentence-aligned (NRC implementation of Gale & Church, 1991)

We consider two levels of granularity:
- Sentence-level: individual sentences;
- Block-level: maximal consecutive sequence of sentences with the same original language.
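The block-level granularity can be sketched in a few lines. This is only an illustration, assuming each sentence carries an original-language tag ("eo" or "fo"); the function name `split_into_blocks` and the toy sentences are hypothetical and not taken from the authors' pipeline.

```python
from itertools import groupby


def split_into_blocks(tagged_sentences):
    """Group consecutive sentences that share the same original-language tag.

    tagged_sentences: list of (tag, sentence) pairs, tag being "eo" or "fo".
    Returns a list of (tag, [sentences]) blocks; each block is maximal
    because groupby only starts a new group when the tag changes.
    """
    blocks = []
    for tag, group in groupby(tagged_sentences, key=lambda pair: pair[0]):
        blocks.append((tag, [sentence for _, sentence in group]))
    return blocks


# Toy transcript (hypothetical data, already tokenized and lowercased).
transcript = [
    ("eo", "mr . speaker , i rise today ."),
    ("eo", "i have a couple of points ."),
    ("fo", "mr . speaker , i thank the hon . member ."),
    ("eo", "let me put that on the record ."),
    ("eo", "we look forward to it ."),
]

blocks = split_into_blocks(transcript)
for tag, sentences in blocks:
    print(tag, len(sentences))
```

Because a new block starts at every change of original language, the number of fo and eo blocks can differ by at most one per transcript, which is consistent with the block level being balanced while the sentence level is not.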

Block-level is balanced, sentence-level is imbalanced 4:1 (eo:fo).
Tagged using the freely available “Tree Tagger” (Schmid, 1994).
⇒ 4 representations: 1) word, 2) lemma, 3) POS and 4) mixed n-grams.
“Mixed”: POS for content words, surface form for grammatical words.
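A minimal sketch of the “mixed” representation, under stated assumptions: the coarse tags (NOUN, VERB, ADJ, ADV) and the choice of which tags count as content words are hypothetical placeholders, not the actual Tree Tagger tagset or the authors' exact rule.

```python
# Assumption: these coarse tags mark content words; everything else is
# treated as a grammatical word and keeps its surface form.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def to_mixed(tagged_tokens):
    """Replace content words by their POS tag; keep grammatical words as-is."""
    return [pos if pos in CONTENT_POS else word for word, pos in tagged_tokens]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tagger output for one lowercased sentence.
tagged = [
    ("the", "DET"), ("member", "NOUN"), ("spoke", "VERB"), ("of", "ADP"),
    ("a", "DET"), ("new", "ADJ"), ("bill", "NOUN"),
]

mixed = to_mixed(tagged)
print(mixed)
print(ngrams(mixed, 3))
```

The mixed sequence abstracts away topic-specific vocabulary while keeping the function words, so n-grams over it capture grammatical patterns rather than subject matter.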


Detecting translation direction

Support Vector Machines trained with T. Joachims’ SVM-Perf.
Test various conditions:
1. Block-level (83K examples) or sentence-level (1.8M examples, balanced).
2. Features: word, lemma, POS, mixed... n-gram frequencies.
3. N-gram length: 1...3 for word/lemma, 1...5 for POS/mixed.
4. Monolingual (English or French) or bilingual text.

Sentence-level: test fewer feature/n-gram combinations (because of computational cost).
All results obtained from 10-fold cross-validation.
Results reported in F-score (≈ accuracy in this case).
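As an illustration of the n-gram frequency features and of the F-score remark, here is a minimal sketch. It assumes features are relative frequencies of n-grams of one fixed length; whether the experiments pooled several lengths is not stated on the slides. SVM-Perf is not called here, and all function names are hypothetical.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Relative frequency of each n-gram of length n in a token list."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def f_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified examples."""
    return (tp + tn) / (tp + fp + fn + tn)


# Toy example: bigram frequencies of a short tokenized, lowercased text.
tokens = "mr . speaker , mr . speaker".split()
features = ngram_frequencies(tokens, 2)
print(features[("mr", ".")])

# With balanced classes and symmetric errors, F-score and accuracy coincide.
print(f_score(tp=45, fp=5, fn=5), accuracy(tp=45, fp=5, fn=5, tn=45))
```

Each block or sentence would be turned into one such sparse feature dictionary before being passed to the classifier.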


Block-level Performance

[Figure: Detection performance (en). F-score (%) vs. n-gram size (1-5) for word, lemma, mixed and POS features, plus tf-idf.]

- tf-idf: small but consistent improvement.
- Similar performance on French, +1-2% for bilingual, same general shape.
- Optimal: word/lemma bigram, POS/mixed trigram.
- Word bigram: F = 90%. Mixed trigram: F = 86%.


Influence of block length

[Figure: Perf vs. length (en). Accuracy vs. block length in words (equal bins) for word, lemma, POS and mixed features with 1-gram, 2-gram and 3-gram.]

- Up to 99% accuracy for large blocks.
- Large range in block length (3-73887 words!).
- Much better than random for short blocks.
- word > lemma > mixed


Sentence-level Performance

[Figure: Sentence-level detection. F-score vs. n-gram size (1-5) for word, lemma, mixed and POS features, French and English.]

- F = 77%
- 1.8M examples (balanced)
- Some missing conditions (computational cost)


Analysis of

Most important bigrams in English (eo = original, fo = translation).
“Most important” = relatively more frequent.
- “A couple of”: no equivalent in French.
- Canadian Alliance, CPC, NDP: mostly western, mostly anglophone parties.
- BQ (Bloc Quebecois): French-speaking.
- Text translated from French overuses articles, prepositions (because French does), and “Mr. Speaker”!

eo                   fo
couple of            of the
alliance )           mr .
a couple             , the
do that              in the
, canadian           to the
the record           , i
forward to           . the
, cpc                ) :
cpc )                speaker ,
of us                . i
this country         : mr
this particular      , and
many of              . speaker
canadian alliance    bq )
across the           , bq
out there            hon .
the things           that the
for that             on the
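One way to obtain such a ranking is sketched below. The slides only say that important bigrams are “relatively more frequent”; the smoothed frequency ratio used here is an assumed measure, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count word bigrams over a list of tokenized, space-separated sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def rank_by_ratio(counts_a, counts_b, smoothing=1.0):
    """Rank bigrams by how much more frequent they are in A than in B.

    Score = ratio of additively smoothed relative frequencies (assumption).
    Ties are broken alphabetically so the output is deterministic.
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {
        bigram: ((counts_a[bigram] + smoothing) / total_a)
        / ((counts_b[bigram] + smoothing) / total_b)
        for bigram in vocab
    }
    return sorted(scores, key=lambda bigram: (-scores[bigram], bigram))


# Toy corpora (hypothetical): originally English vs. translated from French.
eo_sentences = [
    "a couple of us look forward to that",
    "a couple of things for the record",
]
fo_sentences = [
    "mr . speaker , the hon . member",
    "mr . speaker , i thank the hon . member",
]

eo_counts = bigram_counts(eo_sentences)
fo_counts = bigram_counts(fo_sentences)
top_eo = rank_by_ratio(eo_counts, fo_counts)
top_fo = rank_by_ratio(fo_counts, eo_counts)
print(top_eo[:2])
print(top_fo[:3])
```

Swapping the two arguments gives the markers of the other class, which is how the eo and fo columns above could be produced from the same function.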


Impact on Statistical Machine Translation

Typical SMT system training:
- Gather as many English-French aligned sentences as possible.
- Preprocess + split data.
- Estimate parameters in either direction (en→fr and fr→en).
- Original translation direction is not considered at all!

⇒ We use French originals and English translations to train an en→fr system (“reverse” translation??)

We know SMT is very sensitive to genre/topic...
Does the difference between original and translation matter? If so, by how much?


Impact on Statistical Machine Translation

We analyze the impact of translation direction on MT by investigating:
1. Do we get better performance by sending original text to an MT system trained only on original text?
2. Detecting translation direction and sending text to the “right” MT system.

[Figure: English input goes to a classifier; text predicted as original (orig.) is sent to the (eo) en→fr system, text predicted as translation (trans.) to the (fo) en→fr system; both output French.]


Impact of Original Language

System trained on eo, fo, or mx, tested on eo/fo part of test set, or all (mx).

         mx test set      fo test set      eo test set
Train    fr→en   en→fr    fr→en   en→fr    fr→en   en→fr
mx       36.2    37.1     36.1    37.3     36.1    36.9
fo       31.2    30.8     36.2    36.5     30.5    30.1
eo       36.6    37.8     33.7    36.0     36.8    38.0

- eo system does (much) better on eo test, with 80% of training data.
- eo system also does better on mx data (test is 88% eo data vs. 80% in train).
- fo system does much worse on mx and eo data, but about the same as mx on the fo data, with only 20% of the training data!

⇒ Idea: detect source language using classifier, then use the right MT system (“Mixture of Experts”)
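The routing idea can be sketched as follows. This is a hedged illustration only: the keyword rule stands in for the SVM, the two “systems” are dummy callables rather than real SMT decoders, and every name here is hypothetical.

```python
def translate_with_experts(text, classify, systems):
    """Route text to the MT system matching its predicted original language.

    classify: callable returning "eo" (originally English) or
              "fo" (translated from French).
    systems:  dict mapping each label to a translation callable.
    Returns the predicted label and the output of the chosen system.
    """
    label = classify(text)
    return label, systems[label](text)


# Stand-in classifier (assumption): a keyword rule instead of the SVM.
def toy_classifier(text):
    return "fo" if "mr . speaker" in text else "eo"


# Stand-in en->fr systems (assumption): they only tag the input.
toy_systems = {
    "eo": lambda text: "[eo-trained en->fr] " + text,
    "fo": lambda text: "[fo-trained en->fr] " + text,
}

for sentence in ["mr . speaker , i thank the hon . member .",
                 "a couple of us look forward to that ."]:
    print(translate_with_experts(sentence, toy_classifier, toy_systems))
```

Keeping the classifier and the systems as plain callables means the same routing function works with reference labels in place of predictions, which is the comparison the slides draw between “SVM” and “ref”.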


Impact of Automatic Detection

- Top part is more or less identical to previous table.
- ref: using reference source language information, gain a consistent ∼0.6 BLEU points.
- SVM: using SVM prediction, gain is similar.

        Full test set
        fr→en   en→fr
mx      36.86   37.78
fo      32.00   31.85
eo      37.20   38.23
SVM     37.44   38.35
ref     37.46   38.35

- Smaller gain over the eo system (due to having 88% eo data in test set).

⇒ Detecting original vs. translation provides a small-ish but consistent improvement in translation performance.
⇒ Not worth looking for a better classifier (for that task).
Other uses of translation direction detection?
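The gains quoted on this slide can be checked directly from the table. The short sketch below only does the subtraction on the reported scores; the dictionary layout and the `gain` helper are illustrative choices, not part of the original work.

```python
# BLEU scores on the full test set, copied from the table on this slide.
bleu = {
    "mx":  {"fr->en": 36.86, "en->fr": 37.78},
    "fo":  {"fr->en": 32.00, "en->fr": 31.85},
    "eo":  {"fr->en": 37.20, "en->fr": 38.23},
    "SVM": {"fr->en": 37.44, "en->fr": 38.35},
    "ref": {"fr->en": 37.46, "en->fr": 38.35},
}


def gain(system, baseline, direction):
    """BLEU difference between two systems, rounded to two decimals."""
    return round(bleu[system][direction] - bleu[baseline][direction], 2)


for direction in ("fr->en", "en->fr"):
    print(direction,
          "ref vs mx:", gain("ref", "mx", direction),
          "SVM vs mx:", gain("SVM", "mx", direction),
          "SVM vs eo:", gain("SVM", "eo", direction))
```

The differences against the mx system are close to 0.6 in both directions, while those against the eo system are noticeably smaller, matching the two remarks on the slide.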


Discussion

How general are these results? Will it generalize to:
1. Detection on other English-French data?
2. Training a classifier on another corpus?
3. Another language pair?
4. Other settings: source vs. translations from different languages.

Mixture of experts: could use additional input-specific information.
- Mother tongue?
- Gender?


To Conclude...

Can we tell the difference between an original and translated document? → Yes.
To what level of accuracy? → Up to 90+% accuracy on blocks, 77% on single sentences.
Is translation direction useful for machine translation? → Yes!
Is the classification performance sufficient? → Indistinguishable from reference labels...
