INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 7, 93–99, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Designing Language Models for Voice Portal Applications

PHIL SHINN, MATTHEW SHOMPHE, MOLLY LEWIS, KATHY CAREY AND DAVID KIM
HeyAnita Inc., 303 N. Glenoaks Blvd., 5th Floor, Burbank, CA 91502, USA
[email protected]

Abstract. At HeyAnita we use statistical language models to improve speech recognition performance in a number of our portal applications, including driving directions, traffic, weather, stocks, sports, movies and restaurants. In this paper, language modeling implementations in different recognition environments and some real-world data are reviewed.

Keywords: language modeling, speech recognition, voice portal

Introduction


There are a number of components common to any automatic speech recognition system, including acoustic models, grammars and dictionaries. In addition to grammatical representations of what words can follow what words, some systems also support statistical language modeling (henceforth LM), which allows users to probabilistically weight the likelihood of different words or phrases. There are common elements to these representations, and in the section below we present a representative set of LM formats. This paper does not discuss actual methods of creating LMs, nor their theoretical underpinnings. For a review of these, see the workshop put on by the Institute for Mathematics and its Applications entitled "Mathematical Foundations of Natural Language Processing" [http://www.ima.umn.edu/talks/workshops/10-30-11-3.2000], in particular Rosenfeld (2000). Another resource is the Workshop on Language Modeling and Information Retrieval, at http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/WorkshopProcs/OtherPages/TableOfContents.html. See also Manning and Schütze (2000) and Jelinek (1999). What we do show is some data from the real world, which we use to create practical language models on our commercial voice portals for customers like Sprint PCS and Verizon Wireless.

LM Formats

In this section we discuss language model weights in the grammar formats we at HeyAnita are familiar with and support in our FreeSpeech™ Voice Browser. There may be other grammar formats that employ LM weighting schemes, but these are not discussed here. The grammar formats discussed below include: Nuance's Grammar Specification Language (GSL); SpeechWorks'1 Backus-Naur Format (BNF) and OpenSpeech Recognizer (OSR); Microsoft's Speech Application Programming Interface (SAPI); the Java Speech Grammar Format (JSGF); and VoiceXML 2.0.

GSL

The following section from the Nuance manual describes their methodology for assigning probabilities to words or phrases:

"You can assign a probability to any GSL expression by using the tilde (~). The probabilities are used during the recognition search to weight or favor certain phrases. Typically, you assign a high probability to phrases that are expected to be spoken more frequently. So, in a city grammar, the larger cities might get a higher probability. In a stock quote grammar, the most commonly traded companies


would get the highest probabilities, and so on. . . . For example, the following city grammar assigns different probabilities to various city names:

City [boston~.2 (new york)~.4 dallas~.3 topeka~.1]

Not specifying any probabilities is the same as specifying a probability of 1.0 for each item, meaning that no phrase is favored over any other. Note: Adding probabilities can potentially increase both recognition accuracy and speed of your grammar. However, assigning bad probability values can actually hurt recognition performance, so be careful when using this feature of GSL. Use real data to determine the probabilities whenever possible. Nuance recommends that, if you choose to use grammar probabilities, you test your application's performance both with and without probabilities." (Nuance, 1998).

The weights are compiled into the grammars upon invocation of the grammar compiler.
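To make the recommendation about real data concrete, here is a minimal Python sketch that derives tilde probabilities from raw request counts and emits a GSL alternatives list. The city names and counts are illustrative, not actual usage data:

```python
def gsl_city_grammar(counts):
    """Emit a GSL alternatives list with tilde probabilities from raw counts.

    GSL syntax follows the Nuance manual: [item~p1 (multi word)~p2 ...],
    where the tilde values are relative weights over the alternatives.
    """
    total = sum(counts.values())
    items = []
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        # Multi-word phrases are parenthesized in GSL.
        token = f"({name})" if " " in name else name
        items.append(f"{token}~{n / total:.2f}")
    return "City [" + " ".join(items) + "]"

# Illustrative counts (not real usage data):
print(gsl_city_grammar({"new york": 4, "dallas": 3, "boston": 2, "topeka": 1}))
# -> City [(new york)~0.40 dallas~0.30 boston~0.20 topeka~0.10]
```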

BNF and OSR

SpeechWorks' support for language modeling in versions 6.5 and 6.5 SE uses a stand-alone program named dln_set_within_class_probs, which applies a file of terminal-node strings matched with log probabilities (called a unigram file) to modify one of the outputs of the grammar compiler (the .dln file). The range of log probabilities goes from −700 (least likely) to 0 (most likely). In addition to unigram modeling, bigram modeling is also available (SpeechWorks, 2001).

In their newer release, OSR, the methodology is somewhat different, although the two methods of applying language models are still available. For bigram modeling, SpeechWorks uses the meta element as follows: "The format of the n-gram is specified in the W3C January 3, 2001 Working Draft (work in progress) 'Stochastic Language Models (N-gram) Specification for the W3C Speech Interface Framework': http://www.w3.org/TR/ngram-spec."

The other method, SWI_scoreDelta, ". . . lets you maximize the accuracy of the recognizer by allowing you to attach weights based on application knowledge. Its value is applied to the recognition raw score before confidence scores are computed. So, for example, if you have a grammar consisting of many cities, you could put higher values of SWI_scoreDelta on cities you expect to be uttered more often: new york . . . Newark . . . Note that bigrams are an alternative to SWI_scoreDelta . . . for boosting scores for frequently uttered words and phrases. Especially for large vocabularies, bigrams are a better choice since they get applied earlier in the recognition process (before parsing) and so improve both efficiency and, to a lesser extent, accuracy. However, they are more complex to apply and are not useful for boosting scores based on dynamic criteria (say, today's date or the caller's area code)." (SpeechWorks, 2002).
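A unigram file of this sort is easy to generate from counts. The sketch below assumes natural-log probabilities clamped to the documented −700 (least likely) to 0 (most likely) range, with one terminal string and log probability per line; the exact file syntax expected by dln_set_within_class_probs is not shown in the excerpt above, so treat the layout as an assumption for illustration only:

```python
import math

def unigram_lines(counts, floor=-700.0):
    """Map raw counts to log probabilities in the documented [-700, 0] range.

    -700 is the 'least likely' bound; 0 corresponds to certainty.
    Yields one 'terminal-string <tab> log-prob' pair per line (layout assumed).
    """
    total = sum(counts.values())
    for word, n in counts.items():
        logp = math.log(n / total) if n else floor
        yield f"{word}\t{max(logp, floor):.3f}"

# Illustrative counts, not real trading data:
for line in unigram_lines({"cisco": 900, "ibm": 90, "obscure co": 10}):
    print(line)
```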

SAPI

Microsoft's SAPI 5.1 also includes LM capabilities:

[Figure: a recognition example contrasting "recognize speech" with "wreck a nice beach"]
“. . . Recognition proceeds by means of transitions from one rule state to another. Each rule state has a collection of transitions, which represent the possible recognition paths to subsequent rule states. In the absence of weighting, each transition is considered equally probable. For example, in a rule state with five transitions, each has a 20 percent probability of being followed. The Weight property enables grammar designers to specify a transition as more probable, or less probable, than its sibling transitions. The weight property for a transition is a fractional number with a range of zero to one, and


the sum of the weights of a rule state's transitions should be one." (Microsoft, 2001).

JSGF

The Java Speech API grammar format (JSGF) supports weights for grammar elements as follows:

"Not all ways of speaking a grammar are equally likely. Weights may be attached to the elements of a set of alternatives to indicate the likelihood of each alternative being spoken. A weight is a floating point number surrounded by forward slashes, e.g. /3.14/. The higher the weight, the more likely it is that an entry will be spoken. The weight is placed before each item in a set of alternatives. For example:

<size> = /10/ small | /2/ medium | /1/ large;
<color> = /0.5/ red | /0.1/ navy blue | /0.2/ sea green;
<command> = please (/20/ save files | /1/ delete all files);
<...> = /20/ <...> | /5/ <...>;

The weights should reflect the occurrence patterns of the elements of a set of alternatives. In the first example, the grammar writer is indicating that 'small' is 10 times more likely to be spoken than 'large' and 5 times more likely than 'medium.'" (Sun Microsystems, 2002).

VoiceXML 2.0

VoiceXML version 2.0 also includes weights:

"A weight is surrounded by forward slashes and placed before each item in the alternatives list.

// Weight above 1.0 is a positive bias
// Weight below 1.0 is a negative bias
// Default is 1.0 which does not affect recognition
/10/ small | /2/ medium | large
/3.1415/ pie | /1.414/ root beer | /.25/ cola
...

Grammar authors and speech recognizer developers should be aware of the following limitations upon the definition and application of weights. . .

• The application of weights to a speech recognition search is under the internal control of the recognizer.
• There is no Normative or Informative algorithm for applying weights. Furthermore, speech recognition is a statistical process so consistent behavior cannot be guaranteed.


• Appropriate weights are difficult to determine for any specific grammar and recognizer.
• Guessing weights does not always improve speech recognition performance.
• Effective weights are best obtained by study of real speech input to a grammar. For example, a reasonable technique for developing portable weights is to use weights that are correlated with the occurrence counts of a set of alternatives.
• Tuning weights for a particular recognizer does not guarantee improved recognition performance on other speech recognizers." (W3C, 2001).
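The last bullet's suggestion, weights correlated with occurrence counts, is straightforward to mechanize. A minimal Python sketch (the counts are made up for illustration) that renders counts as a JSGF-style weighted alternatives list:

```python
def weighted_alternatives(counts):
    """Render a set of alternatives with /weight/ prefixes, JSGF-style.

    The weights are the raw occurrence counts themselves, since only the
    ratios between sibling alternatives matter.
    """
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return " | ".join(f"/{n}/ {word}" for word, n in ranked)

# Illustrative occurrence counts for a drink-size rule:
print("<size> = " + weighted_alternatives({"small": 10, "medium": 2, "large": 1}) + ";")
# -> <size> = /10/ small | /2/ medium | /1/ large;
```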

Summary of LM Techniques and Findings

While the scaling factors and direction of scoring vary from format to format, each format supports some form of statistical modeling, and the weights can be transformed from one format into another. However, as the contributors to the VoiceXML specification point out, there is no 'cookbook' for how to create weights. And, to further confound our goal of designing interoperable systems, a given transformation may or may not result in the same sort of performance across different recognition engines. So, not only is it unclear how one should go about building a language model; grammar designers are equally in the dark as to how to transform one language model into another.

At HeyAnita we have found that adding probabilities definitely helps performance. As most vendors recommend, it is better to use real data to determine actual usage probabilities whenever possible. The problem, and the next topic of this paper, is what to do before you have real usage data. And how do you know that the data you have reflects actual or future usage?
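As an illustration of the transformation problem, the sketch below (ours, with the Nuance manual's illustrative numbers) maps GSL tilde probabilities onto JSGF-style weights by simple rescaling. Both schemes are relative weights over sibling alternatives, so the ratios can be preserved mechanically; whether two engines then behave identically is exactly the open question raised above:

```python
def gsl_to_jsgf(gsl_probs):
    """Convert GSL tilde probabilities to JSGF /weight/ alternatives.

    Rescale so the least likely alternative gets weight 1; proportional
    rescaling preserves the intended likelihood ratios.
    """
    smallest = min(gsl_probs.values())
    parts = [f"/{p / smallest:g}/ {w}" for w, p in
             sorted(gsl_probs.items(), key=lambda kv: -kv[1])]
    return " | ".join(parts)

# The Nuance city example, re-expressed as JSGF weights:
print(gsl_to_jsgf({"new york": 0.4, "dallas": 0.3, "boston": 0.2, "topeka": 0.1}))
# -> /4/ new york | /3/ dallas | /2/ boston | /1/ topeka
```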

A Priori Models and Zipf's Law

Wentian Li of Rockefeller University maintains an interesting website with numerous references to Zipf's law (Zipf, 1932, 1935, 1949), which is named after the Harvard linguistics professor George Kingsley Zipf (1902–1950). The 'law' is the observation that the frequency of occurrence of some event, as a function of its rank, is a power-law function. The most famous example of Zipf's law is the frequency of English words. The graph below shows word frequency data from the Brown Corpus (Kucera and Francis, 1967).
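A quick way to check how 'Zipfian' a frequency list is: fit the slope of log frequency against log rank by least squares; a slope near −1 is classic Zipf. A minimal stdlib-only sketch (the input here is a synthetic ideal curve, not the Brown Corpus itself):

```python
import math

def zipf_exponent(freqs):
    """Least-squares slope of log(frequency) vs. log(rank).

    A value near -1 indicates a classic Zipf distribution.
    """
    freqs = sorted(freqs, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Frequencies drawn from an ideal Zipf curve recover a slope of ~ -1:
print(zipf_exponent([1000 / r for r in range(1, 101)]))  # -> ~ -1.0
```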


In addition to frequencies of words in text, letter frequencies also show a power law distribution, as shown below:

Populations of States and Cities

Zipf noticed that population data also appeared to follow his law. The second example Zipf gave was the population of cities: the population of a city, plotted as a function of its rank (the most populous city ranked number one), was a power-law function with an exponent close to 1. In the intervening years, it appears that this characteristic has not changed much. The following two plots come from the year 2000 U.S. Census. The first shows that the population-by-rank plot of the 50 United States is 'Zipfian':


The next shows that within a state, the city populations are Zipfian as well.


We use this data to construct language models for a number of applications, including weather, movies, traffic reports and driving directions.

Mandelbrot (1953) proposed an extension of Zipf's law, P(r) = C(r + V)^(−B), where C is a normalizing constant and the constants B and V can be computed if the other probabilities are known. If V is very small and B is nearly equal to 1, this is almost the same as Zipf's original law.

A Priori Data Used by HeyAnita

In the following sections, we show plots of data from various sources, all of which appear to display power-law relations.
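Given nothing but a rank ordering of the items in a grammar, the Zipf-Mandelbrot form above already yields a usable a priori model. A minimal sketch, with B = 1 and V = 0 chosen purely for illustration:

```python
def zipf_mandelbrot(n_items, B=1.0, V=0.0):
    """A priori probabilities P(r) = C * (r + V)**(-B) over ranks 1..n.

    C is the normalizing constant; V ~ 0 and B ~ 1 recover plain Zipf.
    """
    raw = [(r + V) ** (-B) for r in range(1, n_items + 1)]
    c = 1.0 / sum(raw)
    return [c * w for w in raw]

# An a priori model for a 5-city grammar, most populous city first:
for rank, p in enumerate(zipf_mandelbrot(5), start=1):
    print(rank, round(p, 3))
# ranks 1..5 get roughly 0.438, 0.219, 0.146, 0.109, 0.088
```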



Airport Passengers

Our Flight Tracker application allows a user to get commercial flight arrival and departure times. One part of the application asks the user to say the name of an arrival or departure airport. The Federal Aviation Administration keeps track of the number of passengers that travel through each airport, as shown below:

Since the most traveled-through airport, Chicago's O'Hare, is far more likely to be requested than a smaller regional airport, it makes sense to weight it strongly.

Movie Screenings

HeyAnita's movie application gives callers information about which movies are playing where and when, in addition to reviews and other information. The grammars contain all movies playing around the US, every week. Since there are many movies playing at only one or two 'revival' houses, it doesn't make sense to weight these as heavily as the 'blockbuster' new releases playing on thousands of screens.

Restaurant Names

As the data below show, the frequency of occurrence of a restaurant name in our database of restaurants appears to have a power-law distribution. The utility of a language model in this application is not as clear as in some of the others, since it is not obvious that callers actually go looking for the most commonly occurring restaurant names.

Equity Trading Volumes

Our stock quote application recognizes upwards of 37,000 equity names and ticker symbols. Our research shows that without a language model, recognition performance on this application would be unacceptable.

Sports Teams

Our sports application provides scores and audio feeds for both college and professional teams, among other information. While the number of professional teams is not that large, there are a considerable number of colleges and universities, and there are a large number of ways a school or team can be named, for example "Berkeley," "Berkeley Bears," "The CAL Bears," "The University of California at Berkeley," etc. Not only are there a large number of schools and a large number of ways to ask for them; it turns out there is also a great discrepancy in the popularity, or 'request-ability,' of a team. Clearly, a large university's team in the Final Four of the national playoffs will be asked for more often than a small school's team stuck in the cellar of a minor league. One of the inputs we use to construct school weights is the number of seats available in the school's outdoor stadium and indoor arena.

In summary, our data reflect power-law distributions across a wide variety of application domains.


Estimating the Effect of Language Modeling

There are two ways to measure the effect of LM on performance: calculate actual empirical results based on collected data, or calculate theoretical expected results.

Empirical Error Reduction

In terms of empirical results, Microsoft notes: "In general, the better the language model, the lower the error rate of the speech recognizer. By putting together the best results available on language modeling, we have created a language model that outperforms a standard baseline by 45%, leading to a 10% reduction in error rate for our speech recognizer." [http://research.microsoft.com/srg/language-modeling.asp]

A series of tests was run on stock grammars using different ASR engines. About a quarter of the errors were eliminated by using trading volume alone. A simple weighting scheme was used, based on the assumption that some of the requests for stocks were driven by trading volumes and some were completely random. Two endpoint models were created: one where the probability was based only on the trading volume of the stock, and one where the probability was just one over the number of stocks (equi-probable). A series of models was then created by interpolating between the two endpoints, and the series of models was tested on the same data.
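A minimal sketch of the interpolation just described, where the mixing weight lam sweeps from the equi-probable endpoint at 0 to the volume-based endpoint at 1 (the ticker symbols and volumes are illustrative, not our production data):

```python
def interpolated_model(volumes, lam):
    """Mix a trading-volume distribution with an equi-probable one.

    lam = 1.0 gives the pure volume-based model, lam = 0.0 the uniform
    model; values in between interpolate linearly between the endpoints.
    """
    total = float(sum(volumes.values()))
    uniform = 1.0 / len(volumes)
    return {sym: lam * (v / total) + (1.0 - lam) * uniform
            for sym, v in volumes.items()}

# Illustrative volumes; sweep a few interpolation points for testing:
vols = {"CSCO": 70_000_000, "IBM": 8_000_000, "BRK.A": 300}
for lam in (0.0, 0.5, 1.0):
    print(lam, interpolated_model(vols, lam))
```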

In another round of testing, we created a series of models based on the combination of trading volume and price, based on the intuition that a very few shares of Berkshire Hathaway, trading at $73,000 a share, are somehow as important as a great many shares of a penny stock. Error reductions similar to, though slightly better than, those of the trading-volume-only models were obtained.

In addition to these sorts of global models, we also apply probabilities in specific places in many grammars in order to either amplify or down-weight items, based either on observations of specific performance problems or on expected issues. For example, we have a common list-oriented selector prompt phrase, "When you hear the item you want, say 'That one.'" We increase the likelihood of 'that one' in all of these contexts, since it is explicitly prompted for. This was particularly useful when a movie came out entitled "The One" and the grammar at that state accepted either the name of a movie or the phrase 'that one.' A similar situation occurred in our Horoscope application, where the global command 'cancel' was too often confused with the sign 'cancer.' It turned out that no one was ever observed asking to cancel, but we didn't want to remove the word from the grammar because it is in the global command set and someone may someday say 'cancel.' The solution is to down-weight it.

Theoretical Reduction

Error-reduction accuracy measures include calculating false accept and false reject rates, out-of-vocabulary rates, etc. These entail choosing a set of confidence thresholds and an operating point. A choice of one operating point versus another can have a large impact on system accuracy numbers, so quantifying the effect of language modeling by a words-correct (or words-incorrect) measure will change from operating threshold to operating threshold. A different approach is to consider information theory (Rosenfeld, 2000). Shannon's equation (Shannon, 1948) is:

H = −Σᵢ₌₁ⁿ Pᵢ log₂ Pᵢ

This measures the amount of information in a message in bits, and one can view the operation of a speech recognition grammar on an utterance as a message-decoding scheme. The probability of occurrence Pᵢ represents the likelihood that the ith word is spoken, and it is multiplied by the log (base 2) of Pᵢ.
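The bit calculations behind the table below can be reproduced in a few lines. A sketch with toy probabilities, not our production distributions:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p), in bits per utterance."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely words: a one-bit grammar.
print(entropy_bits([0.5, 0.5]))            # -> 1.0
# Four equally likely words: two bits; a skewed model needs fewer.
print(entropy_bits([0.25] * 4))            # -> 2.0
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))  # -> ~1.36
```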


Suppose there are two different messages (words) in a grammar, and the words are equally likely to be spoken. The equation above calculates this to be a one-bit-per-utterance grammar. With four words possible, it is a two-bit grammar; eight words results in a three-bit grammar, and so on. If words are not equally likely, the same equation is used. The table below shows the results of running the equation in the equally probable case versus using the probabilities shown in the graphs above:

Domain                   Equally probable   Actual   Difference
Film screenings                7.99           4.66       3.33
Domestic enplanements          8.92           6.29       2.63
Restaurant names               6.85           4.29       2.55
Equity volumes                11.71           9.35       2.36
CA state population            8.89           7.18       1.72
US population                  5.67           5.02       0.65
Sports arenas                  7.88           7.46       0.42
Sports stadiums                7.83           7.50       0.33
Total bits                                              13.98

The last column shows the difference, in bits, between the equally probable and actual cases. We labeled the column 'actual,' although in fact we do not know that our models are actual at all. It may be that our users are not distributed like the US Census data, that the films showing in the towns they live in are not like those we get from our theatrical data provider, or that the stocks they own are not traded the way our aggregate data shows.

Conclusion

We have found significant improvement in accuracy by employing language models in many of our grammars. Since we do not have a tremendous amount of usage data compared with the a priori data we can obtain from other sources, we created initial language models based on these other sources. Testing against the data we did have indicates significant error reduction.


Going forward, a challenge will be to determine the best way to integrate new usage information into existing models.

Note

1. SpeechWorks is now part of ScanSoft.

References

Jelinek, F. (1999). Statistical Methods for Speech Recognition. Cambridge: MIT Press.
Kucera, H. and Francis, N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.
Mandelbrot, B. (1953). An informational theory of the statistical structure of languages. In W. Jackson (Ed.), Communication Theory. London: Butterworths, pp. 486–502.
Manning, C.D. and Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Microsoft (2001). SAPI 5.1 Documentation, Grammar Format Tags. Redmond: Microsoft Corporation.
Nuance (1998). NGB—Nuance Grammar Builder Help, Product Version 1.0, File Version 132. Menlo Park: Nuance Communications.
Pierce, J. (1961). An Introduction to Information Theory: Symbols, Signals and Noise. New York: Dover Publications Inc.
Rosenfeld, R. (2000). Two Decades of Statistical Language Modeling: Where Do We Go from Here? http://www.ima.umn.edu/talks/workshops/10-30-11-3.2000/rosenfeld/rosenfeld.pdf.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, July and October.
SpeechWorks (2001). dln_set_within_class_probs Usage Notes. Cambridge: SpeechWorks International.
SpeechWorks (2002). SpeechWorks Developer's Guide, OpenSpeech Recognizer 1.0. Cambridge: SpeechWorks International.
Sun Microsystems (2002). Java Speech Grammar Format Specification, Version 1.0, Oct. 1998. Santa Clara: Sun Microsystems. (http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html).
W3C (2001). Speech Recognition Grammar Specification for the W3C Speech Interface Framework, W3C Working Draft 20 August 2001, A. Hunt and S. McGlashan (Eds.). (http://www.w3.org/TR/2001/WD-speech-grammar-20010820).
Zipf, G. (1932). Selected Studies of the Principle of Relative Frequency in Language. Cambridge: Harvard University Press.
Zipf, G. (1935). The Psycho-Biology of Language. Boston: Houghton Mifflin.
Zipf, G. (1949). Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley.
