INTERSPEECH 2013: Show & Tell Contribution

ReFr: An Open-Source Reranker Framework
Daniel M. Bikel, Keith B. Hall
Google Research, New York, NY
{dbikel,kbhall}@google.com

Abstract

ReFr (http://refr.googlecode.com) is a software architecture for specifying, training and using reranking models, which take the n-best output of some existing system and produce new scores for each of the n hypotheses, potentially inducing a different ranking and ideally yielding better results than the original system. The Reranker Framework has special support for building discriminative language models, but can be applied to any reranking problem. The framework is designed with parallelism and scalability in mind, and runs on any Hadoop cluster out of the box. While extremely efficient, ReFr is also quite flexible, allowing researchers to explore a wide variety of features and learning methods. ReFr has been used to build state-of-the-art discriminative LMs for both speech recognition and machine translation systems.

Index Terms: language modeling, discriminative language modeling, reranking, structured prediction

1. Introduction

Creating effective software tools for research is a tricky business. The classic tension between flexibility and efficiency arises with greater urgency. We want researchers to be able to try out many different ideas easily, but we also want them to have a quick code-test-evaluate cycle. ReFr grew out of the 2011 Johns Hopkins Summer Workshop, from the team using automatically generated confusions to synthesize training data for discriminative language models for speech and machine translation, led by Prof. Brian Roark of OHSU. That approach required tools that would scale up to training data sizes orders of magnitude larger than had previously been used to build discriminative language models, so we not only needed our training and inference to be inherently fast, but we also needed to design tools with distributed computing in mind from the outset. This paper describes the tools we have developed to solve not only the immediate research problem of exploring confusions for discriminative language modeling, but also the more general problem of reranking approaches to speech and language processing, including structured prediction. We designed ReFr to have the following properties:

• “library quality” code
• industrial strength
• academic flexibility
• easy exploration of different types of features, different update methods (e.g., MIRA-style, direct loss minimization, loss-sensitive) and different learning methods (e.g., perceptron-style, log-linear, kernel methods)
• modern, object-oriented design, complete with dynamic factories and dynamic composition for flexibility
• parallelizable, especially for distributed-computing environments

2. Data Format for I/O

There are two main choices when building discriminative reranking models for speech or machine translation: (a) rescore a lattice or hypergraph, or (b) use a strict reranking approach applied to n-best lists. For ReFr, we decided early on to use (b), reranking n-best lists, primarily because of the flexibility it affords in designing features and tools. N-best lists readily allow for sentence-level features in a way that, say, lattices do not. Additionally, it is far easier to define generic schemes for passing around n-best lists than to design schemes that accept speech lattices as well as machine translation hypergraphs or other problem-specific data types. ReFr is meant to be flexible enough to allow for a variety of data sources. To avoid the need for overly complex data formats, we have chosen a formalism that lets one augment the input format, allowing for flexible feature extraction and data manipulation and analysis. We opted to use a data format that mirrors the data structures used internally for training. Google protocol buffers [1] provide a programming-language-independent specification framework for defining data formats. The protocol buffer specification language is used by the protocol buffer tools to generate source code for serializing and deserializing the data stored in the format. The generated code allows for native programming-language encapsulation of the data; for example, in C++ each item of data is stored in an object based on an object-oriented data specification (a C++ class), allowing for access to the data.¹

¹ For the 2011 Johns Hopkins Workshop, we were targeting multiple tasks (ASR and MT), and so our toolkit provides a means to convert from two types of text-based n-best formats, one the output of an ASR system, the other the output of an MT system. These conversion tools are not only useful in their own right, but also serve as example implementations for any developer converting from their own, proprietary format to the Google protocol buffer format used by ReFr.
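To make the I/O path concrete, the following is a minimal sketch, not ReFr's actual reader, of how the classes generated by the protocol buffer compiler might be used to load one serialized candidate set from disk. The message type CandidateSetMessage and its header file are illustrative stand-ins for whatever ReFr's own .proto files define, and ReFr's real readers additionally handle compressed, multi-record streams.

// Illustrative only: reads a single serialized candidate-set message.
#include <fstream>
#include <iostream>

#include "candidate_set.pb.h"  // hypothetical header generated by protoc

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <candidate_set_file>" << std::endl;
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  CandidateSetMessage candidate_set;  // hypothetical message type
  if (!candidate_set.ParseFromIstream(&in)) {
    std::cerr << "could not parse " << argv[1] << std::endl;
    return 1;
  }
  // Print a human-readable rendering of the candidates and their features.
  std::cout << candidate_set.ShortDebugString() << std::endl;
  return 0;
}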

3. Core learning framework

Consider Algorithm 1, which describes the training procedure for a generic online-learning algorithm. Each training example e_i comprises a set of candidate hypotheses, each of which is projected via some function Φ into a feature space R^F. We typically think of Φ as being a suite of feature functions, one per dimension. The model itself is defined as a weight vector w in this space. Decoding, or inference, is carried out simply by taking the dot product of the model and a test instance. More generally, any kernel function K may be used. The training procedure iterates over the training data T (each iteration is called an epoch) until the NeedToKeepTraining() predicate returns false. Often, such a predicate is based on the average loss of the current model on some held-out development data D, which is the purpose of the Evaluate(D) line in the Train(T) procedure.

Algorithm 1 Training algorithm for online-learning reranking models. Let e_i = {c_1, ..., c_k} be a training example, where each c_j is a candidate hypothesis. Similarly, let d_i = {c_1, ..., c_k} be a held-out development data example, also consisting of k candidate hypotheses. Finally, let K be a kernel function.

procedure Train(T = {e_1, ..., e_n}, D = {d_1, ..., d_m})
    while NeedToKeepTraining() do
        TrainOneEpoch(T)
        Evaluate(D)
    end while
end procedure


procedure TrainOneEpoch(T)
    foreach training example e_i do
        ScoreCandidates(e_i)
        if NeedToUpdate() then
            Update()
        end if
    end for
end procedure

procedure ScoreCandidates(e_i)
    foreach candidate hypothesis c_j ∈ e_i do
        c_j.score ← K(w_t, c_j)
    end for
end procedure


For the basic perceptron, the model starts out at time step 0 as the zero vector; that is, w_0 = 0. The update is

    w_{t+1} = w_t + R_t [ Φ(y_oracle(e_i)) − Φ(ŷ(e_i)) ],    (1)

where y_oracle is a function that picks out the hypothesis towards which we want to bias our model, ŷ is a function that picks out the candidate hypothesis we want to bias our model against, and R_t is a learning rate or step size. Most often, y_oracle is defined to pick the hypothesis with the lowest loss relative to some gold-standard truth, and ŷ is defined to pick the candidate hypothesis that scores highest under the current model w_t. Most of the variations of this basic learning method involve finding different ways of defining R_t, Φ, y_oracle and ŷ, along with the various procedures and predicates shown in Algorithm 1.
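As a concrete illustration of Algorithm 1 and Equation (1), the sketch below scores one training example and applies a single perceptron step, representing Φ(c) and w as sparse maps from feature names to values. It is illustrative only; the names Candidate, Dot and PerceptronStep are assumptions for this sketch and do not reflect ReFr's actual class structure.

// One perceptron step for a single training example (illustrative).
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using FeatureVector = std::map<std::string, double>;

// K(w, c) as a plain dot product w . Phi(c) over sparse feature vectors.
double Dot(const FeatureVector& w, const FeatureVector& phi) {
  double sum = 0.0;
  for (const auto& [name, value] : phi) {
    auto it = w.find(name);
    if (it != w.end()) sum += it->second * value;
  }
  return sum;
}

struct Candidate {
  FeatureVector phi;   // Phi(c): features of this hypothesis
  double loss = 0.0;   // loss w.r.t. the gold-standard truth
  double score = 0.0;  // model score, filled in below
};

// Move w toward the oracle (lowest-loss) candidate and away from the
// model-best candidate, scaled by the learning rate R_t (Equation 1).
void PerceptronStep(std::vector<Candidate>& e_i, double learning_rate,
                    FeatureVector& w) {
  std::size_t oracle = 0, best = 0;
  for (std::size_t j = 0; j < e_i.size(); ++j) {
    e_i[j].score = Dot(w, e_i[j].phi);
    if (e_i[j].loss < e_i[oracle].loss) oracle = j;
    if (e_i[j].score > e_i[best].score) best = j;
  }
  if (oracle == best) return;  // NeedToUpdate() is false
  for (const auto& [name, value] : e_i[oracle].phi) w[name] += learning_rate * value;
  for (const auto& [name, value] : e_i[best].phi) w[name] -= learning_rate * value;
}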

Therefore, we would like our Reranker Framework to make it easy for the researcher to define these various functions, as well as to specify which ones to use at run-time. ReFr defines a Model interface with virtual methods for all of the functions shown in Algorithm 1. To avoid the exponential blow-up of overriding different combinations of these methods, ReFr also employs dynamic composition. That is, we keep the idea of a Model interface, but additionally have each Model instance wrap a set of predicate/manipulator objects, each of which itself conforms to an interface. Figure 1 shows a pictorial representation of this scheme.

Figure 1: A pictorial view of how a Model wraps instances of other interfaces (Candidate Scorer, Update Predicate, Updater) that specify the predicates and functions needed to carry out model training.
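The following sketch illustrates the dynamic-composition idea behind Figure 1. The interface names follow the figure's boxes, but the method signatures and the simplified Model class are assumptions made for illustration rather than ReFr's actual declarations.

// Illustrative sketch of dynamic composition, not ReFr's real interfaces.
#include <memory>
#include <utility>

class Model;
class CandidateSet;  // one training example e_i, as in Algorithm 1

class CandidateScorer {   // computes K(w_t, c_j) for each candidate
 public:
  virtual ~CandidateScorer() = default;
  virtual void Score(const Model& model, CandidateSet& example) = 0;
};

class UpdatePredicate {   // the NeedToUpdate() test from Algorithm 1
 public:
  virtual ~UpdatePredicate() = default;
  virtual bool NeedToUpdate(const Model& model, const CandidateSet& example) = 0;
};

class Updater {           // applies an update such as Equation (1)
 public:
  virtual ~Updater() = default;
  virtual void Update(Model& model, const CandidateSet& example) = 0;
};

// A Model wraps one instance of each interface; new training behavior comes
// from composing different objects rather than subclassing Model itself.
class Model {
 public:
  Model(std::unique_ptr<CandidateScorer> scorer,
        std::unique_ptr<UpdatePredicate> predicate,
        std::unique_ptr<Updater> updater)
      : scorer_(std::move(scorer)),
        predicate_(std::move(predicate)),
        updater_(std::move(updater)) {}

  void TrainOnExample(CandidateSet& example) {
    scorer_->Score(*this, example);
    if (predicate_->NeedToUpdate(*this, example)) {
      updater_->Update(*this, example);
    }
  }

 private:
  std::unique_ptr<CandidateScorer> scorer_;
  std::unique_ptr<UpdatePredicate> predicate_;
  std::unique_ptr<Updater> updater_;
};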

As we discussed above, we employ dynamic composition to avoid defining a new subclass of Model every time we wish to explore a new combination of learning method functions. To do this, ReFr includes a very lightweight and yet powerful interpreter for a language that allows assignment statements for primitives, vectors of primitives, Factory-constructible objects and vectors of Factory-constructible objects. Figure 2 shows an example ReFr configuration file. The syntax is intentionally very similar to that of C++. This lightweight language provides a flexible mechanism by which to specify how feature extraction, training and inference shall occur.

model_file = "my_model_file";  // model output file
model = PerceptronModel(
    name("my model"),
    score_comparator(DirectLossScoreComparator()));
exec_feature_extractor = ExecutiveFeatureExtractorImpl(
    feature_extractors({NgramFeatureExtractor(n(2)),
                        RankFeatureExtractor()}));
training_efe = exec_feature_extractor;
dev_efe = exec_feature_extractor;
training_files = {"training1.gz", "training2.gz"};
devtest_files = {"dev1.gz", "dev2.gz"};

Figure 2: An example ReFr configuration file, read by its Interpreter class.

4. Cluster-based distributed training

As Algorithm 1 shows, the basic perceptron algorithm involves "online" updating, and thus it is possible to read in each training example from file each time it is needed, keeping only the model's parameters persistently in memory. The Reranker Framework allows both the memory-intensive way of training and this "streaming mode" version of training, which is essential for distributed learning. The structured perceptron [2] and its variants have proven to be effective in supervised, discriminative language modeling work [3]. We have centered the development of our open-source discriminative learning toolkit around perceptron-style algorithms, which are, by definition, online learning algorithms. Identifying the optimal solution for a distributed online optimization algorithm is still an open research question. We borrow from our previous work on distributed perceptron training [4, 5] and use the iterative parameter mixing algorithm for distributed computation. The Reranker Framework makes it easy to switch between single-processor and distributed training, the latter using the Hadoop implementation of MapReduce [6].
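The core of the distributed scheme is the mixing step performed after each epoch: every map shard trains a perceptron on its slice of the data, and the per-shard weight vectors are then combined into a single model for the next epoch. The sketch below shows that step with uniform mixture coefficients; it is illustrative only and is not ReFr's Hadoop implementation, whose reducer and data structures differ.

// Parameter-mixing (averaging) step of iterative parameter mixing [4],
// shown here as plain C++ over sparse weight vectors (illustrative).
#include <map>
#include <string>
#include <vector>

using WeightVector = std::map<std::string, double>;

WeightVector MixParameters(const std::vector<WeightVector>& shard_models) {
  WeightVector mixed;
  if (shard_models.empty()) return mixed;
  // Sum the per-shard weights feature by feature ...
  for (const WeightVector& w : shard_models) {
    for (const auto& [feature, weight] : w) mixed[feature] += weight;
  }
  // ... then divide by the number of shards (uniform mixture coefficients).
  for (auto& [feature, weight] : mixed) weight /= shard_models.size();
  return mixed;
}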


5. Demo Plan

Our demo will consist of a walk-through of all of ReFr's features, followed by a hands-on demonstration of how easy it is to implement a new class of features for the reranker based on the rank of each candidate hypothesis. We will also show how easy it is to integrate that new class of features into training and inference. We will then demonstrate the ease with which one can use the API and the interpreted configuration language to alter the training algorithm. Finally, we will demonstrate the simple way that a user can switch from single-processor training to large-scale distributed training.
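As a preview of the hands-on portion, the sketch below shows roughly what such a rank-based feature class might look like; the Candidate and FeatureExtractor types here are simplified stand-ins for ReFr's actual classes, not its real API.

// Illustrative rank-based feature extractor (types are stand-ins).
#include <map>
#include <string>

struct Candidate {
  int rank = 0;                            // position in the n-best list
  std::map<std::string, double> features;  // Phi(c), filled by extractors
};

class FeatureExtractor {
 public:
  virtual ~FeatureExtractor() = default;
  virtual void Extract(Candidate& candidate) = 0;  // add features for one hypothesis
};

// The new feature class from the demo: a single indicator feature that
// records each candidate's rank under the baseline system.
class RankFeatureExtractor : public FeatureExtractor {
 public:
  void Extract(Candidate& candidate) override {
    candidate.features["rank_" + std::to_string(candidate.rank)] = 1.0;
  }
};

In ReFr itself, such a class would also be registered with the dynamic factory so that it can be named directly in a configuration file like the one in Figure 2 (e.g., RankFeatureExtractor()).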

6. Acknowledgements

The authors would like to thank Prof. Brian Roark of Oregon Health and Science University for leading a fantastic team at the 2011 Johns Hopkins Workshop, and we would also like to thank all of our teammates, especially Prof. Izhak Shafran of OHSU and Ph.D. candidate Maider Lehr, who are actively working with and helping us improve ReFr.


7. References

[1] Google, "Protocol buffers," http://code.google.com/apis/protocolbuffers/.
[2] M. Collins, "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms," in Proc. EMNLP, 2002, pp. 1–8.
[3] B. Roark, M. Saraçlar, and M. Collins, "Discriminative n-gram language modeling," Computer Speech and Language, vol. 21, no. 2, pp. 373–392, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0885230806000271
[4] R. McDonald, K. Hall, and G. Mann, "Distributed training strategies for the structured perceptron," in Proc. HLT-NAACL, 2010.
[5] K. Hall, S. Gilpin, and G. Mann, "MapReduce/Bigtable for distributed optimization," in NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010.
[6] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, 2008.
