Confusion Network Based System Combination for Chinese Translation Output: Word-Level or Character-Level?

Introduction  Recently, confusion network based system combination has applied successfully to various machine translation tasks.  Confusion network based system combination picks one hypothesis as the skeleton and aligns the other hypotheses against the skeleton to form a confusion network.  The path with the highest score represents the consensus translation.  Previous work on system combination most focus on combining translation outputs in Latin alphabet-based languages, in which sentences are already segmented into words sequences with white space before constructing the confusion network. 2

Introduction  When combining Chinese translation outputs  The first step is to segment the translation output into a sequence of words,  An alternative is to split the translation output into characters,  Both approach is possible.  In this woks, we compare the translation performance of confusion network based system combination when the Chinese translation output is segmented into words versus characters.


Related work  It is a long debating issue that which one, word or character, is the appropriate unit for Chinese NLP.  J. Xu, et al. investigated CWS for Chinese-English phrasebased SMT,  R. Zhang, et al. reported that the most accurate word segmentation is not the best word segmentation for SMT,  P-C Chang, et al. optimized CWS granularity with respect to the SMT task,  M. Li, et al. compared word-level metrics with characterlevel metrics,  J. Du utilized a character-level strategy to improve translation quality for spoken language translation. 4

Confusion network based system combination for Chinese translation output  IHMM monolingual hypothesis alignment approach is utilized to align the hypothesis to the skeleton.  IHMM approach uses a similarity model and a distortion model to calculate the conditional probability that the hypothesis is generated by the skeleton. p(e'j | ei )  a  psem (e'j | ei )  (1 a)  psur (e'j | ei )  Given a source sentence:  Pakistan cleric says would rather die than surrender  And three translation hypotheses:  巴基斯坦称死不投诚  巴基斯坦说死不投诚  巴基斯坦说死于投诚 5

Confusion network based system combination for Chinese translation output  We can construction a word-level and a characterlevel confusion network given the example.


Experimental Data  We conducted experiments on two datasets  The NIST'08 English-to-Chinese translation task.  Contains 127 documents with 1,830 segments;  4 human reference translations;  The best 7 submitted system outputs are chose to participate in system combination;  3-fold cross-validation.  The IWSLT'08 English-to-Chinese CRR challenge task.  The development set contained 757 segments and the test set contained 300 segments;  4 human reference translations; 7

Experimental Setting  It has been reported that character-level automatic metrics correlate with human judgment better than word-level automatic metrics for Chinese translation evaluation.  The system performance of Chinese translation output are measured with character-level metrics.  Character-level BLEU,  Character-level NIST,  Character-level METEOR,  Character-level GTM,  Character-level TER 8

Experimental Setting  Because better automatic evaluation metrics leading to better translation performance for parameters optimization.  The feature weights of confusion network based combination system are tuned based on character-level BLEU score.  We experimented with three different CWS tools  ICTCLAS,  Stanford Chinese word segmenter (STANFORD),  Urheen.


Results on NIST’08 EC Tasks  The submitted outputs of 7 systems are combined:  System 01, system 03, system 17, system 18, system 24, system 28, and system 31.  Words are not demarcated in the system outputs, we divide the output into words by different CWS tools or characters to facilitate hypothesis alignment before combining the outputs.


Results on NIST’08 EC Tasks  The "Character" row shows the translation performance after the system outputs are split into characters.  The "ICTCLAS", "STANFORD", and "Urheen" rows show the scores when the system outputs are segmented into words by the respective CWS tools.  Experimental results given in Table 1 show that the characterlevel combination system significantly improves the translation performance (p < 0.01).



Results on IWSLT’08 EC CRR challenge Tasks  We segment the Chinese sentences in bilingual training data into word sequences, and train several English-to-Chinese SMT systems to decode the development set and test set.  JoshuaICTCLAS represent the Joshua system that Chinese sentences in the training data have been segmented into words by ICTCLAS tools, thus the outputs to be combined can be seemed to have been segmented into words by ICTCLAS tools.  JoshuaSTANFORD represent the Joshua system that Chinese sentences in the training data have been segmented into words by STANFORD tool.


Results on IWSLT’08 EC CRR challenge Tasks  Because the outputs to be combined have been segmented into words with different granularity, we must consistently resegment the outputs into words or characters before system combination.  The "ICTCLAS", and "STANFORD" rows show the scores when the system outputs are re-segmented into words by the respective Chinese word segmenters.  The experimental results in Table 2 show when translation outputs to be combined are with different word granularity:  The character-level combination system significantly improves the translation performance. 14


Results on IWSLT’08 EC CRR challenge Tasks  When the outputs to be combined have been segmented into words by the same CWS tool ICTCLAS, we combined the output generated by two SMT systems:  MosesICTCLAS,  JoshuaICTCLAS.  Table 3 shows the character-level combination system still consistently outperforms the word-level combination system, “ICTCLAS”, even though the translation outputs to be combined are with the same word granularity.



Conclusion and discussion  We conducted a study of character-level versus word-level confusion network based system combination for Chinese translation output.  The experimental results show that character-level combination system significantly outperforms word-level combination systems.


Conclusion and discussion  Reasons:  Chinese sentences can be split into characters with perfect accuracy; however, there is not a CWS tool to perform 100% yet. Therefore, outputs can be segmented into characters more consistently. which lead to generate high quality monolingual hypothesis alignment to help construct confusion network.  Chinese character is a smaller unit than Chinese word (containing at least one character) for constructing confusion network. Thus, character-level approach has more choice to produce better consensus translation. 19

Thanks!
