IMPACT OF WEB BASED LANGUAGE MODELING ON SPEECH UNDERSTANDING

Ruhi Sarikaya, Hong-Kwang Jeff Kuo and Yuqing Gao

IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
{sarikaya,hkuo,yuqing}@us.ibm.com

ABSTRACT

Data sparseness is a critical problem in building statistical language models for spoken dialog systems. In a previous paper we addressed this issue by exploiting the World Wide Web (WWW) and other external data sources in a financial transaction domain. In this paper, we evaluate the impact of the improved speech recognition provided by a Web-based language model (WebLM) on speech understanding performance in a new domain. As the speech understanding system, we use a natural language call-routing system. Experimental results show that the WebLM improves speech recognition performance by 1.7% to 2.7% across varying amounts of in-domain data. The improvements in action classification (AC) performance are modest yet consistent, ranging from 0.3% to 0.8%.

1. INTRODUCTION

In recent years, spoken dialog systems (SDS) have matured to the point where they are widely used in many applications. Nevertheless, rapid deployment of SDS for new domains remains a challenge. One of the main problems facing us today is data sparseness for building robust statistical language models. Acoustic modeling is fairly insensitive to the application domain as long as there is no significant environmental or channel mismatch between training and test conditions. Language modeling, on the other hand, has a significant impact on speech recognition performance. Using LMs built for fairly generic applications such as dictation or Switchboard [12] does not improve system performance to a usable level. In such circumstances it is essential to improve the LM itself.

The bulk of LM research has concentrated on improving language model probability estimation, while obtaining additional training material from external resources has received little attention [2, 1]. In [8], we took advantage of the WWW and available external text resources previously collected for other applications to improve LM performance. As the largest available data source, the WWW has previously been utilized for numerous natural language processing (NLP) applications [3], and it has been used for language modeling as well [1, 4, 5]. The contribution of our study was to propose a framework for using the Web for limited-domain spoken dialog systems. To this end, we formulated a query generation scheme for data retrieval from the Web and developed a data filtering and selection mechanism to extract useful utterances from the retrieved pages [8]. The proposed mechanism makes good use of limited in-domain data to sift through a large external text inventory and identify "similar" sentences. In much, if not all, of the previous work,



documents were used as the unit for accepting or rejecting training material. We believe that going one step further and sifting for relevant information within a document is essential. Therefore, we do not take the returned documents as a whole, but rather find the relevant utterances within each page. By doing so, we filter out irrelevant text and keep only relevant data for language modeling.

In reality, when we start to build an SDS for a new domain, the amount of in-domain data for the target domain is usually small. In cases where there is no in-domain data, we generate artificial data. Using the available data, we build a pilot system that is mainly used to collect real in-domain conversational data. It is essential that the pilot system operate with reasonable accuracy at both the speech recognition and the speech understanding level, so that users are not frustrated in communicating with the SDS.

In this paper, we apply the previously proposed mechanism for Web-based language modeling to a new domain: a large company's call center customer hotline for technical assistance. Furthermore, we evaluate the impact of the speech recognition performance improvement on speech understanding performance. Here, as the speech understanding application, we use a natural language call-routing system [9].

The rest of the paper is organized as follows. In Section 2 we present the framework for collecting relevant data from the Web and other available resources for language modeling. In Section 3, we explain how search queries are generated and how sentences are selected from the retrieved pages. A brief description of the natural language call-routing application is given in Section 4. Section 5 presents the experimental results, and Section 6 summarizes the findings.

2. THE MECHANISM FOR WEB DATA COLLECTION

The proposed framework utilizes not only the Web, which we refer to as a dynamic resource, but also other, static resources. Static resources include any corpus previously collected for limited-domain or domain-independent applications; a list of the static corpora used in our experiments is given in [8]. The flow chart for the proposed approach is depicted in Fig. 1. We assume that we are given some limited data belonging to the target domain; this data can also be generated manually. We generate queries from these sentences and search the Web. The retrieved documents are filtered and further processed to extract relevant utterances using the limited in-domain data.


[Figure 1: Flow diagram for collecting relevant data. Queries (Q1, ..., QN) generated from the (limited) in-domain text data are used to search the WWW; retrieved documents are filtered and passed through similarity-based sentence selection (S); the selected sentences, together with sentences selected from a large static text inventory, yield a domain-specific LM that is interpolated with the in-domain LM.]


We employ a similarity metric to identify sentences that are likely to belong to the target domain. The same process is applied to the static data sources; however, for the static corpora, the in-domain data is used directly with the similarity metric to identify relevant utterances, without query generation and retrieval. Finally, we build a domain-specific language model using the relevant sentences obtained from the static and dynamic sources. An effective method for combining a small amount of in-domain data with a large amount of out-of-domain data is to build separate language models and interpolate them [7].

3. SEARCH QUERY GENERATION AND SENTENCE SELECTION

A set of initial experiments with several search engines convinced us that Google was the most useful for our application. Google indexes web pages (including some URLs it has not fully indexed) and many additional file types. Since we did not want to increase computation by converting non-text documents into plain text, we downloaded only those files from which text can be retrieved efficiently.

Search query generation from a sentence is a key issue. The queries should be sufficiently specific, since the more specific the query is, the more relevant the retrieved pages are. On the other hand, if the query is too specific, there may not be enough retrieved pages, if any. In reality, we do not have infinite resources; as such, one needs to avoid sending too many failed requests to the server just to get documents for a single sentence. Therefore, our approach to query generation takes these concerns into account by generating queries that start from the most relevant case and gracefully degrade to the least relevant case. We define the most relevant query as the one that ANDs maximal n-grams with context; the least relevant query is the ORed unigrams obtained from an utterance. An example of query generation is given in Table 1.

The first step in forming queries is to define a set of frequently occurring words as stop words (e.g., the, a, is). The remaining text is chunked into n-gram islands consisting only of content words. Then, we add context to these islands by including their left and right neighbors. The purpose of adding context around the content words is to incorporate conversational style into the queries


to some degree. In the example, we start with the sentence "what is the balance of my stock fund portfolio". Next, we identify the stop words: is, the, of, my. The remaining word and phrase islands form the basis of the queries. Then, we add context to these islands; the amount of context can be increased by adding more neighboring words from the right and left of each content chunk, at the expense of an increased likelihood of failed requests. We form queries starting with the most optimistic one (Q1), which combines the n-gram chunks using AND. The next best query (Q2) is formed by splitting the trigram content-word island [stock fund portfolio] into two bigram islands, [stock fund] and [fund portfolio], and then adding context again. This is repeated until unigram islands are obtained. The queries [Q1, Q2, ..., QN] are then repeated with AND replaced by OR and appended to the end of the query list. Note that AND is implicit in Google, so we do not actually insert AND between chunks when forming a query. During retrieval, queries from this list are submitted to the server in order until a pre-specified number of documents is retrieved; in this paper, the stopping point was 100 pages per in-domain utterance.

what is the balance of my stock fund portfolio
   ⇓ STOP-WORDS (is, the, of, my)
what balance stock fund portfolio
   ⇓ N-GRAM ISLANDS
[what] [balance] [stock fund portfolio]
   ⇓ ADD CONTEXT
[what is the] [the balance of] [my stock fund portfolio]
   ⇓ RELAX N-GRAMS
[what] [balance] [stock fund] [fund portfolio]
[what is the] [the balance of] [my stock fund] [fund portfolio]
[what] [balance] [stock] [fund] [portfolio]

Table 1: Query generation.

The retrieved documents are filtered by stripping off the HTML tags, punctuation marks, and HTML-specific information that is not part of the content of the page. Based on initial experimental results, we decided that using the sentence, rather than the whole document retrieved from the Web, as the selection unit is the better choice. We adopted BLEU (BiLingual Evaluation Understudy) [6] as the similarity measure for utterance selection. In simple terms, BLEU is an n-gram similarity measure between two sentences:

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)    (1)

where N is the maximum n-gram length, w_n and p_n are the corresponding weight and precision, respectively, and BP is the brevity penalty:

BP = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \le r \end{cases}    (2)

where r and c are the lengths of the reference and candidate sentences, respectively. The ranking behavior becomes more apparent in the log domain [6]:

\log(\mathrm{BLEU}) = \min\left( 1 - \frac{r}{c},\, 0 \right) + \sum_{n=1}^{N} w_n \log p_n    (3)

Here, we used N = 4 and w_n = 1/N. We tailored the way BLEU is applied to our needs: for each sentence in the in-domain data, we select all sentences from the retrieved Web data, as well as from the static corpora, whose similarity score is above an empirically determined threshold. This threshold was tuned to word error rate on held-out data and set to 0.08.
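For concreteness, the following is a minimal sketch of the query ladder of Table 1. The stop-word list, the single word of left/right context, and the quoting conventions are illustrative assumptions, not the exact settings of our system.

STOP_WORDS = {"is", "the", "of", "my", "a", "an", "to", "and"}  # illustrative list

def island_spans(words):
    """(start, end) spans of maximal content-word runs (n-gram islands)."""
    spans, start = [], None
    for i in range(len(words) + 1):
        is_content = i < len(words) and words[i] not in STOP_WORDS
        if is_content and start is None:
            start = i
        if not is_content and start is not None:
            spans.append((start, i))
            start = None
    return spans

def relax(spans, n):
    """Split islands longer than n words into overlapping n-word islands."""
    out = []
    for s, e in spans:
        out += [(s, e)] if e - s <= n else [(i, i + n) for i in range(s, e - n + 1)]
    return out

def to_query(words, spans, context):
    """Quote each island padded with `context` neighbors; AND is implicit."""
    chunks = ['"%s"' % " ".join(words[max(0, s - context):min(len(words), e + context)])
              for s, e in spans]
    return " ".join(chunks)

def query_ladder(sentence):
    """Queries from most specific (Q1) to least specific (QN), followed by
    the same ladder with OR between the chunks."""
    words = sentence.lower().split()
    base = island_spans(words)
    longest = max(e - s for s, e in base)
    ladder = [to_query(words, relax(base, n), context=1 if n > 1 else 0)
              for n in range(longest, 0, -1)]
    return ladder + [q.replace('" "', '" OR "') for q in ladder]

print(query_ladder("what is the balance of my stock fund portfolio"))

In a retrieval loop, these queries would be submitted in order until the 100-page budget per in-domain utterance is filled.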

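A similarly minimal sketch of the BLEU-based selection of Eqs. (1)-(3) follows. Whitespace tokenization and the zero-match shortcut are our simplifications; here the in-domain sentence plays the role of the reference.

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/N and the brevity
    penalty of Eq. (2), computed in the log domain as in Eq. (3)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = sum(c_ngrams.values())
        if matches == 0 or total == 0:
            return 0.0            # real implementations smooth instead
        log_prec += math.log(matches / total) / max_n
    bp = min(1.0 - len(ref) / len(cand), 0.0)   # log-domain brevity penalty
    return math.exp(bp + log_prec)

def select_sentences(external, in_domain, threshold=0.08):
    """Keep external (Web or static) sentences scoring above the threshold
    against at least one in-domain sentence."""
    return [s for s in external
            if any(bleu(s, d) >= threshold for d in in_domain)]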

4. NATURAL LANGUAGE CALL-ROUTING

The aim of call routing is to understand the speaker's request and take the appropriate action. Typically, natural language call routing requires two statistical models. The first performs speech recognition, transcribing the spoken utterance. The second is the action classification (AC) model, which takes the utterance hypothesized by the speech recognizer and predicts the correct action to fulfill the speaker's request. Actions are the categories into which each request a caller makes can be mapped.

In this work, we use the MaxEnt method to build the statistical classifier [9]. The MaxEnt method is a flexible modeling framework that allows the combination of multiple overlapping information sources [10, 11]. MaxEnt modeling matches the feature expectations exactly while making as few assumptions as possible in the model. The multiple information sources are combined as follows:

P(C|W) = \frac{\exp\left( \sum_i \lambda_i f_i(C, W) \right)}{\sum_{C'} \exp\left( \sum_j \lambda_j f_j(C', W) \right)}    (4)

which describes the probability of a particular class C (e.g., an action class) given the word sequence W spoken by the caller. Notice that the denominator includes a sum over all classes C', which is essentially a normalization factor for the probabilities to sum to 1. The f_i are indicator functions, or features, which are "activated" based on computable properties of the word sequence, for example whether a particular word or word pair appears, or whether the parse tree contains a particular tag. For simplicity, we use only unigram (single word) features in this paper, commonly known as a "bag of words" model. The MaxEnt models are trained using the improved iterative scaling algorithm [10] with Gaussian prior smoothing [11], using a single universal variance parameter of 2.0.
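As a rough illustration of Eq. (4) with bag-of-words features, the sketch below scores an utterance against each action class. The (word, class) weight layout is an assumption, and training (improved iterative scaling with a Gaussian prior) is omitted.

import math

def maxent_posteriors(words, weights, classes):
    """Eq. (4): each unigram feature f_i(C, W) fires when its word occurs
    in W and the class is C; `weights` maps (word, class) to lambda."""
    scores = {c: sum(weights.get((w, c), 0.0) for w in set(words)) for c in classes}
    z = sum(math.exp(s) for s in scores.values())   # normalizer over all classes
    return {c: math.exp(s) / z for c, s in scores.items()}

# Hypothetical usage: route the call to the most probable action.
# probs = maxent_posteriors(asr_hypothesis.split(), trained_weights, action_classes)
# action = max(probs, key=probs.get)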

5. EXPERIMENTAL RESULTS AND DISCUSSION

The domain of the call-routing application in this paper is a technical support hotline for a Fortune 500 company's call center [9]. There are 35 predetermined call types. The full training set has 27K utterances, amounting to 177K words. This data is split into ten chunks by uniformly sampling from the full set; the first chunk is further split into two 1.3K chunks. The vocabulary has 3667 words. The acoustic models are first trained on more than 1000 hours of generic telephony data and then MAP-adapted to this application using the in-domain training data. Decision-tree-clustered context-dependent models with continuous-density HMMs are used to model the acoustic space; there are 2198 context-dependent states with 222K Gaussians. A separate test set of 5644 utterances is used for evaluation. The trigram language models are built with deleted interpolation. In all cases the data is split into 90% and 10% chunks, the former used for training and the latter as a held-out set for smoothing.
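The LM interpolation used throughout this section can be sketched as a simple linear mixture. The function interface and the example weights are illustrative; in practice the mixture weights would be tuned to minimize perplexity on the held-out set.

def interpolate(lms, lams):
    """Linear interpolation: P(w|h) = sum_k lam_k * P_k(w|h), where each
    element of lms is a function (word, history) -> probability."""
    assert abs(sum(lams) - 1.0) < 1e-6, "mixture weights must sum to one"
    def p(word, history):
        return sum(lam * lm(word, history) for lam, lm in zip(lams, lms))
    return p

# Hypothetical three-way mixture of the models compared below:
# mixed_lm = interpolate([base_lm, web_lm, stat_lm], [0.6, 0.25, 0.15])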




[Figure 2: Data size versus WER. WER(%) against the number of in-domain sentences (x1000) for BASE-LM, WEB-LM, BASE-LM + WEB-LM, WEB-LM + STAT-LM, and BASE-LM + WEB-LM + STAT-LM.]

Using a language model built for domain-independent dictation or large-vocabulary speech recognition tasks resulted in fairly high error rates (>45%). In Fig. 2, we plot the word error rates (WER) for five language models with respect to the amount of in-domain data. In the figure, "BASE-LM" stands for the baseline LM built using only the in-domain data, "WEB-LM" for the LM built using only the Web data, and "STAT-LM" for the LM built using only the data obtained from static resources. We also present interpolations of these LMs. The very first points on the curves correspond to using only 1.3K in-domain sentences. The WER for the BASE-LM is 30.9%, and the corresponding figure for the WEB-LM is 35.1%. Interpolating the WEB-LM with the STAT-LM reduces the WER to 32.5%, while interpolating the WEB-LM with the BASE-LM results in 28.9%. Three-way interpolation of BASE-LM, WEB-LM and STAT-LM reduces the figure to 28.2%, a 2.7% reduction compared to the BASE-LM WER. Similar improvements are observed up to 8K in-domain sentences. As we include more in-domain data, and all the Web data collected for each chunk is treated as static data for the next chunk, the amount of data to be processed easily reached more than 10 GB. Static resources account for much of the computation (about 80% for this application). We observe that at 8.1K in-domain sentences, using static resources in addition to Web data contributes only marginally to performance.

An alternative way to read Fig. 2 is in terms of the reduction in in-domain data needed to achieve a given performance level. We observe a 3-to-4 fold reduction in the in-domain data required to match the in-domain LM performance; for example, we can match the performance of the in-domain LM that uses the full 27K set by combining an 8.1K in-domain LM with the Web and static LMs.

We did not plot the WEB-LM, STAT-LM, and their interpolations with the BASE-LM beyond 8.1K in-domain utterances. Nevertheless, we wanted to see the impact of interpolating the WEB-LM and STAT-LM with the full 27K BASE-LM. In Fig. 3, we plot the WER against the amount of in-domain data used to retrieve data for the WEB-LM and STAT-LM, together with their interpolation with the full 27K in-domain LM (27K-BASE-LM). Note that the first points (for 1.3K) for WEB-LM and "WEB-LM + STAT-LM" are the same as those in Fig. 2. When these LMs are interpolated with the 27K-BASE-LM, the WER

reduces from 25.7% to 25.0%. At 8.1K sentences, the improvement over the 27K-BASE-LM is 1%. The slope of the lowest curve suggests that further reductions in WER are possible.

[Figure 3: Interpolation of Web and Static LMs with the full 27K in-domain LM. WER(%) against the number of sentences (x1000) for 27K-BASE-LM, WEB-LM, WEB-LM + STAT-LM, and 27K-BASE-LM + WEB-LM + STAT-LM.]

Next, we evaluated the impact of the improved speech recognition results on call-routing performance. Note that the call-routing AC model is trained using only the in-domain data. In fact, we experimented with using a subset of the retrieved external LM data as part of the AC model training data, without success: the external data was simply too noisy to improve AC accuracy. Moreover, we believe this application is very specific in its domain; unlike financial domains, it was very difficult to acquire relevant external data. In Fig. 4, we plot AC performance against varying amounts of in-domain data. The difference between these curves lies only in the LM used to generate the speech recognition hypotheses. Even though we use only in-domain data to build the AC model, using only the WEB-LM or "WEB-LM + STAT-LM" for speech recognition results in fairly high AC accuracy; however, they do not match the performance obtained by using in-domain data for speech recognition. Interpolating the in-domain LM with the WEB-LM and STAT-LM improves AC accuracy modestly, especially beyond 1.3K. The improvements are 0.3%, 0.6%, 0.8% and 0.7% for 1.3K, 2.7K, 5.1K and 8.1K utterances, respectively.

[Figure 4: Data size vs. AC accuracy. AC accuracy (%) against the number of utterances (x1000) for BASE-LM, WEB-LM, WEB-LM + STAT-LM, BASE-LM + WEB-LM, and BASE-LM + WEB-LM + STAT-LM.]

Lastly, in Table 2 we present the AC accuracy figures corresponding to the two lowest curves of Fig. 3. The improvements in speech recognition do not translate into improvements of the same magnitude in AC accuracy; however, there is again a small, consistent improvement.

LM                                        AC Accuracy
27K-BASE-LM                               83.5
27K-BASE-LM + 1.3K-[WEB-LM+STAT-LM]       83.7
27K-BASE-LM + 2.7K-[WEB-LM+STAT-LM]       83.6
27K-BASE-LM + 5.1K-[WEB-LM+STAT-LM]       84.0
27K-BASE-LM + 8.1K-[WEB-LM+STAT-LM]       83.8

Table 2: AC performance (%) for various language models.

6. CONCLUSIONS

We considered the WWW and available external static corpora as sources for building robust statistical language models. We proposed a framework to retrieve, filter, and select utterances from these sources by diligently exploiting the limited amount of available in-domain data. The amount of in-domain data required to achieve a given performance level is reduced by a factor of 3 to 4. The improvements in speech recognition accuracy range from 1.7% to 2.7%, and the resulting improvements in call-routing accuracy range from 0.3% to 0.8%.


References

[1] X. Zhu and R. Rosenfeld, "Improving trigram language modeling with the world wide web", ICASSP-2001, pp. I:533-536, Salt Lake City, UT, 2001.
[2] R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?", Proceedings of the IEEE, vol. 88, no. 8, 2000.
[3] M. Lapata and F. Keller, "The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks", HLT/NAACL, pp. 121-128, Boston, MA, 2004.
[4] A. Berger and R. Miller, "Just-in-time language modeling", ICASSP-98, pp. II:705-708, Seattle, WA, 1998.
[5] I. Bulyko, M. Ostendorf and A. Stolcke, "Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures", HLT-2003, 2003.
[6] K. Papineni, S. Roukos, T. Ward and W. Zhu, "Bleu: A Method for Automatic Evaluation of Machine Translation", Proc. ACL, Philadelphia, PA, 2002.
[7] A. Rudnicky, "Language modeling with limited domain data", Proc. ARPA Spoken Language Technology Workshop, pp. 66-69, 1995.
[8] R. Sarikaya, A. Gravano and Y. Gao, "Rapid language model development using external resources for new spoken dialog domains", ICASSP-2005, Philadelphia, PA, 2005.
[9] V. Goel, H-K. J. Kuo, S. Deligne and C. Wu, "Language model estimation for optimizing end-to-end performance of a natural language call routing system", ICASSP-2005, Philadelphia, PA, 2005.
[10] S. Della Pietra, V. Della Pietra and J. Lafferty, "Inducing features of random fields", IEEE Trans. Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[11] S. Chen and R. Rosenfeld, "A survey of smoothing techniques for ME models", IEEE Trans. Speech and Audio Processing, 8(1):37-50, 2000.
[12] B. Kingsbury, et al., "Toward domain-independent conversational speech recognition", EUROSPEECH-2003, Geneva, Switzerland, 2003.
