Learning Translation Consensus with Structured Label Propagation

†Shujie Liu, ‡Chi-Ho Li, ‡Mu Li and ‡Ming Zhou

†Harbin Institute of Technology   ‡Microsoft Research Asia

Outline

- Translation Consensus and Related Work
- Structured Label Propagation
- Graph-based Translation Consensus Model
- Graph Construction for Re-ranking and Decoding
- Experiment

Translation Consensus

Translation Consensus Principle: a translation candidate is deemed more plausible if it is supported by other translation candidates.

Different formulations vary in:
- whether the candidate is a complete sentence or just a span of it
- whether the candidate is the same as or merely similar to the supporting candidates
- whether the supporting candidates come from the same or from different MT systems

Related Work

MBR (Minimum Bayes Risk) approaches:
- The candidate with minimal Bayes risk is the one most similar to the other candidates.
- MBR re-ranking and decoding: Kumar and Byrne (2004), Tromble et al. (2008, 2009)

Consensus decoding with different systems:
- Collaborative decoding (Li et al., 2009)
- Hypothesis mixture decoding (Duan et al., 2011)

All these works collect consensus information from the translation candidates of the same source sentence.

Related Work

Motivating example: the correct translation for the first sentence is in the N-best list, but is not ranked best. The translation of the second, similar sentence can help the first one select a good translation candidate.

Collect consensus from similar sentences: if two source sentences are similar, their translation results should be similar.

- Re-ranking the N-best list using a classifier with features of consensus from similar sentences (Ma et al., 2011).
- Graph-based semi-supervised method for SMT re-ranking (Alexandrescu and Kirchhoff, 2009):
  - A node represents a pair of a source sentence and its candidate translation.
  - There are only two possible labels for each node: 1 for a good pair and 0 for a bad one.
- We extend this method with structured label propagation, collecting consensus information for spans as well.

Outline

- Translation Consensus and Related Work
- Structured Label Propagation
- Graph-based Translation Consensus Model
- Graph Construction for Re-ranking and Decoding
- Experiment

Graph-based Model

- A graph-based model assigns labels to instances by considering the labels of similar instances.
- Principle: if two instances are very similar, their labels tend to be the same.

Label Propagation

The probability of label l for node i, p_{i,l}, is updated with respect to the corresponding probabilities of i's neighboring nodes N(i):

p_{i,l}^{t+1} = \sum_{j \in N(i)} T(i,j) \, p_{j,l}^{t},   where   T(i,j) = \frac{w_{i,j}}{\sum_{j' \in N(i)} w_{i,j'}}

[Figure: node i with neighbors 0-4; the probabilities p_{0,l}, ..., p_{4,l} are propagated to p_{i,l} along edges weighted by T(i,j).]

Label Propagation

- With a suitable measure of instance similarity, it is expected that an unlabeled instance will find the most suitable label from similar labeled nodes.
- Problem when applying this to SMT: different instances (source sentences) would not have the same correct label (translation result), so the original updating rule is no longer valid, as the value of p_{i,l} should not be calculated based on p_{j,l}.
- We need a new updating rule so that p_{i,l} can be updated with respect to p_{j,l'}, where in general l ≠ l'.
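
For concreteness, here is a minimal sketch of the classic propagation step, assuming a dense weight matrix and labeled (training) nodes clamped to their known labels; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def propagate_labels(W, P0, labeled, iters=20):
    """Classic label propagation: W is an (n, n) matrix of edge weights
    w_ij, P0 an (n, L) matrix of initial label probabilities p_{i,l},
    and `labeled` a boolean mask over the nodes with known labels."""
    # Row-normalize the weights: T(i, j) = w_ij / sum_{j'} w_ij'
    T = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    P = P0.copy()
    for _ in range(iters):
        # p_{i,l}^{t+1} = sum_{j in N(i)} T(i, j) * p_{j,l}^t
        P = T @ P
        # Clamp the labeled nodes back to their known distributions.
        P[labeled] = P0[labeled]
    return P
```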

Structured Label Propagation

Our structured label propagation rule:

p_{f,e}^{t+1} = \sum_{f' \in N(f)} T_s(f, f') \sum_{e' \in H(f')} T_l(e, e') \, p_{f',e'}^{t}

The probability of a translation e of a source sentence f is updated with the probabilities of similar translations e' of similar source sentences f'. Here T_s(f, f') is the normalized source-side similarity (the propagating probability), and the label similarity is normalized over the hypothesis list H(f'):

T_l(e, e') = \frac{sim(e, e')}{\sum_{e'' \in H(f')} sim(e, e'')}

The original rule is a special case of our new rule: when sim(e, e') is defined as 1 if e = e' and 0 otherwise, the update reduces to

p_{f,e}^{t+1} = \sum_{f' \in N(f)} T(f, f') \, p_{f',e}^{t}
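
A sketch of the structured update with a hypothetical dictionary-based layout; the data structures and the per-iteration sweep are assumptions, not the authors' implementation (nodes absent from `neighbors`, such as training nodes, simply keep their fixed probabilities):

```python
def structured_propagate(neighbors, T_s, hyps, sim, P, iters=10):
    """neighbors[f] lists the similar source sentences f'; T_s[(f, f')]
    is the source-side propagating probability; hyps[f] is the
    hypothesis list H(f); sim(e, e') is the translation similarity;
    P[(f, e)] holds p_{f,e}."""
    for _ in range(iters):
        new_P = {}
        for f in neighbors:
            for e in hyps[f]:
                total = 0.0
                for fp in neighbors[f]:
                    # T_l(e, e') = sim(e, e') / sum_{e''} sim(e, e'')
                    z = sum(sim(e, epp) for epp in hyps[fp]) or 1.0
                    total += T_s[(f, fp)] * sum(
                        sim(e, ep) / z * P[(fp, ep)] for ep in hyps[fp])
                new_P[(f, e)] = total
        P = {**P, **new_P}  # non-updated (e.g. training) nodes stay fixed
    return P
```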

Outline

- Translation Consensus and Related Work
- Structured Label Propagation
- Graph-based Translation Consensus Model
- Graph Construction for Re-ranking and Decoding
- Experiment

Graph-based Translation Consensus Model

Two consensus features are added to the conventional log-linear model:

p(e|f) = \frac{\exp(\sum_i \lambda_i \psi_i(e, f))}{\sum_{e' \in H(f)} \exp(\sum_i \lambda_i \psi_i(e', f))}

- Graph-based consensus features: consensus among the translations of similar sentences.
- Local consensus features: consensus among the translations of the identical sentence.

Graph-based Consensus Feature

The graph-based consensus feature is the log of the graph-based consensus confidence calculated by structured label propagation:

GC(e, f) = \log \sum_{f' \in N(f)} T_s(f, f') \sum_{e' \in H(f')} T_l(e, e') \, p_{f',e'}

Label (translation) similarity: the Dice coefficient over n-gram sets,

sim(e, e') = Dice(NGr_n(e), NGr_n(e'))

Instance (source) similarity: symmetrized sentence-level BLEU,

w_{f,f'} = \frac{1}{2} (BLEU_{sent}(f, f') + BLEU_{sent}(f', f))
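
A sketch of the Dice-based translation similarity; the maximum n-gram order is not stated on the slides, so n_max=4 below is an assumption:

```python
def ngrams(tokens, n_max=4):
    """All n-grams of the token sequence up to order n_max, as a set."""
    return {tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def dice_sim(e, e_prime, n_max=4):
    """Dice coefficient over the two n-gram sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = ngrams(e, n_max), ngrams(e_prime, n_max)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```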

Local Consensus Feature

The local consensus feature is defined over the n-best translation candidates as an MBR-style score:

LC(e, f) = \log \sum_{e' \in H(f)} p(e'|f) \, T_l(e, e')

- Local consensus features collect consensus information from the translation candidates of the same source sentence.
- Other fundamental features include translation probabilities, lexical weights, distortion probability, word penalty, and language model probability.
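
Under the same assumed layout, the local feature might be computed as below (a sketch; `sim` could be the dice_sim function above, and the data structures are hypothetical):

```python
import math

def local_consensus(e, hyps, post, sim):
    """LC(e, f): e and the entries of hyps (the n-best list H(f)) are
    token tuples, post[e'] is the posterior p(e'|f), and sim is the
    translation similarity."""
    # T_l(e, e') = sim(e, e') / sum_{e'' in H(f)} sim(e, e'')
    z = sum(sim(e, epp) for epp in hyps) or 1.0
    score = sum(post[ep] * sim(e, ep) / z for ep in hyps)
    return math.log(max(score, 1e-12))  # guard against log(0)
```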

Outline

- Translation Consensus and Related Work
- Structured Label Propagation
- Graph-based Translation Consensus Model
- Graph Construction for Re-ranking and Decoding
- Experiment

Graph Construction for Re-Ranking

- A separate node is created for each source sentence in the training, development, and test data.
- Every node from the training data is labeled with its correct translation; since we think it is pointless to re-estimate the confidence of those sentence pairs, there is no edge between training nodes.
- Each node from the development/test data is given an n-best list of translation candidates as possible labels, produced by an MT decoder.
- A dev/test node can be connected to training nodes and to other dev/test nodes.

[Figure: an example graph constructed for re-ranking, with nodes for dev/test sentences, nodes for training sentences, and edges weighted by source sentence similarity.]

Graph Construction for Decoding

- Graph-based consensus can also be used in the decoding algorithm, by re-ranking the translation candidates of not only the entire source sentence but also every source span.
- Forced alignment is used to extract candidate labels and spans for the training sentences.
- The cells in the search space of the decoder are directly mapped to dev/test nodes in the graph for the development and test sentences.
- Two nodes are always connected if they correspond to a span and one of its subspans.

[Figure: an example graph constructed for decoding, with training nodes created by forced alignment and edges linking spans to their subspans.]

Semi-supervised Training

There is a mutual dependence between the consensus graph and the decoder:
- The MT decoder depends on the graph for the graph-based consensus features.
- The graph needs the decoder to provide the translation candidates as possible labels, and their posterior probabilities as the initial labeling probabilities.

Training therefore alternates between the two: train λ^0 → train GC^0 → train λ^1 → train GC^1 → ...
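
The loop below is one plausible reading of the training diagram; every name (tune, decode_nbest, propagate) is hypothetical, and the number of rounds is an assumption:

```python
def semi_supervised_train(decoder, graph, rounds=2):
    """Alternate between tuning the log-linear weights and re-running
    structured label propagation with the refreshed decoder posteriors."""
    weights = decoder.tune(gc_feature=None)        # train lambda^0
    gc_feature = None
    for _ in range(rounds):
        nbest = decoder.decode_nbest(weights)      # candidates + posteriors
        gc_feature = graph.propagate(nbest)        # train GC^t
        weights = decoder.tune(gc_feature)         # train lambda^{t+1}
    return weights, gc_feature
```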

Outline

- Translation Consensus and Related Work
- Structured Label Propagation
- Graph-based Translation Consensus Model
- Graph Construction for Re-ranking and Decoding
- Experiment

Experiment

- We test our method in two data settings: one on the IWSLT data set, the other on the NIST data set.
- Our baseline decoder is an in-house implementation of a BTG decoder with a maximum-entropy lexical reordering model.

Experiment Setting-1

Data setting (IWSLT):
- Training data: 81K sentence pairs, 655K Chinese words and 806K English words
- Development data: devset8+dialog
- Test data: devset9

Experiment Result-1 (BLEU)

System           devset8+dialog   devset9
Baseline         48.79            44.73
Struct-LP        49.86            45.54
Rerank-GC&LC     50.66            46.52
Rerank-GConly    50.23            45.96
Rerank-LConly    49.87            45.84
Decode-GC&LC     51.20            47.31
Decode-GConly    50.46            46.21
Decode-LConly    50.11            46.17

Experiment Setting-2

Data setting (NIST):
- Training data: 354K sentence pairs, 8M Chinese words and 10M English words
- Development data: NIST 2003 data set
- Test data: NIST 2005 and NIST 2008 data sets

Experiment Result-2 (BLEU)

System           NIST'03   NIST'05   NIST'08
Baseline         38.57     38.21     27.52
Struct-LP        38.79     38.52     28.06
Rerank-GC&LC     39.21     38.93     28.18
Rerank-GConly    38.92     38.76     28.21
Rerank-LConly    38.90     38.65     27.88
Decode-GC&LC     39.62     39.17     28.76
Decode-GConly    39.42     39.02     28.51
Decode-LConly    39.17     38.70     28.20

Summary

- Focus on consensus among similar source sentences
- Developed a structured label propagation method
- Integrated it into the conventional log-linear model
- Proved useful empirically

Thanks
