HMM Based Event Detection in Audio Conversation


Shajith Ikbal, Tanveer Faruquie

IBM India Research Lab, New Delhi, India. {shajmoha, ftanveer}@in.ibm.com

ABSTRACT

In this paper, we address the problem of detecting sensitive events in speech signals, such as the exchange of credit card information. Although close in nature to the word spotting problem, variability in the linguistic content constituting an event and in its composition makes event detection a harder task, especially in the contexts where it is applied, such as call center interactions. In this work we extend the hidden Markov model (HMM) based framework used in word spotting to event detection, by constructing a network composed of HMM based acoustic models for event and garbage (non-event). Vocabularies specific to the event and non-event are used to build the event and garbage models respectively, along with length constraints based on prior knowledge. The effectiveness of this approach is demonstrated by applying it to the problem of detecting credit card transaction events in real-life conversations between agents and customers in a call center. Our approach yields a false alarm rate of 17.0% and a false miss rate of 12.5%.

Index Terms: Event detection, word spotting, HMM, transcripts, garbage.

1. INTRODUCTION

Many organizations are increasingly providing customer support over voice channels to perform critical tasks such as bank transactions, purchases using credit cards, and medical consultations. Conversations to complete these tasks often require the customer to share private, sensitive, and confidential information. These conversations are usually recorded for purposes such as quality analysis, auditing, mining, training, and learning. In this paper we refer to the presence of such sensitive information in audio recordings as an event, and we aim to detect such events automatically so that they can be protected from unauthorised access. Event detection closely resembles word spotting [4, 5], where the objective is to detect the occurrence of a word or a set of words.
In event detection the aim is to detect long continuous segments of conversation consisting of several possible relevant phrases and words, often interspersed with several possible irrelevant phrases and words. Taking the example of a credit card transaction event, credit card information can be conveyed in several possible ways, and very often the conversation can drift into topics other than the credit card. This relatively larger set of relevant phrases and words, along with their syntactic variations and an almost infinite set of irrelevant phrases and words, makes event detection a more complex problem than word spotting.

In a closely related work, Jeanrenaud et al. [2] proposed a posterior probability scoring method for detecting basic data types such as digits, dates, and times. The distinctive factor of our work is that event detection can be seen as the detection of a sequence of different relevant data types mixed with irrelevant data types. Topic segmentation [3] is another related line of work; however, event detection is more focused, since several events can occur within a single topic segment.

A standard approach to event detection is to post-process the text from transcripts of audio conversations [6], similar to the approaches to named entity detection in audio recordings [8]. However, in practice transcripts are very noisy. For unconstrained telephone speech, such as conversational recordings in call centers, the best reported word error rates are around 40% [1, 6]. In this paper, we extend the HMM based framework for word spotting to event detection by composing HMM based models for event and garbage (non-event). We consider a specific case of event detection, namely credit card transaction event detection in call center conversations, to demonstrate our approach.

This paper is organized as follows. Section 2 describes the notion of an event. Section 3 explains the HMM based approach to event detection. Section 4 describes the experimental setup used to validate the approach. Section 5 discusses the results.

2. EVENT DESCRIPTION

An event typically consists of a sequence of sub-events related to its topic, mixed with sequences of non-event phrases and words.
Considering the case of a credit card transaction event in a call center conversation, examples of sub-events are credit card name, credit card number, expiration date, verification (such as the CVN number), and re-verification of the card number. These sub-events can be represented by grammatical units made of representative words and phrases. Some of the sub-events are required to define the event while others are optional. For example, the credit card number must occur, while re-verification of the credit card number is optional. Different possible sequences of sub-events can make up the event. For example, the agent may ask for the card name before the number or vice versa. Non-event words and phrases may occur between sub-events as a result of confusion and skepticism on the part of the customer, or of small talk on non-event topics. The large amount of variability introduced by the occurrence, length, composition, and content of the event makes event detection a harder task than word spotting [4] or data type spotting [2].

3. EVENT DETECTION

In this section, we first explain the HMM based approach, and then a method involving post-processing of transcripts obtained from a large vocabulary speech recognizer, which serves as a baseline.

3.1. HMM-based Event Detection

The basic idea behind HMM based event detection (an extension of HMM based word spotting) is illustrated in Figure 1, which shows a network composed of an event model and a garbage (non-event) model. Presence or absence of the event can be found by Viterbi alignment of the speech signal against this network. The alignment effectively chooses either the event or the garbage model based on a zero threshold applied to the alignment scores, as given by the following equation:

$$S = \sum_{t=1}^{T} P(x_t \mid q_t^e)\, p(q_t^e \mid q_{t-1}^e) - \sum_{t=1}^{T} P(x_t \mid q_t^g)\, p(q_t^g \mid q_{t-1}^g)$$

where $x_t$ corresponds to the $t$-th feature vector, $T$ is the total number of feature vectors extracted from the speech signal, and $q_t^e$ and $q_t^g$ are respectively the states of the event and garbage models aligned to the feature vector at time $t$.

Fig. 1. Illustration of HMM based event detection. (Network nodes: Start, Event model, Garbage model, End.)

The event and garbage models are composed from acoustic models of basic context-dependent phonemes that were trained for use in a large vocabulary continuous speech recognition system. Unlike the construction of a word model for word spotting, which is done simply from the phonetic transcriptions, the construction of an event model is more involved because of the large variance in the composition and content of an event, as explained in Section 2. Instead of constructing an event model that incorporates all possible variabilities of the event to full perfection, we construct an approximate model that detects sub-events in regions where the evidence for them is more prominent.

For a credit card transaction event, Figure 2 shows a network illustrating the composition of the event and garbage models. The event and garbage models are built using sub-event and sub-garbage models. In this case the prominent sub-events taken into consideration are: card name, card number, card expiration, and card verification. The two parallel paths in the network restrict the sequences of sub-events to one of the following two: {card name, card number, card expiration, card verification} and {card name, card number, card verification, card expiration}. Incorporating more parallel paths with different orderings of the sub-event models would enable more sub-event sequences.

Having composed a network, the result of credit card event detection can be obtained simply by performing Viterbi alignment of the speech signal against this network. An alignment path from the start node to the end node that passes only through the sub-garbage nodes is equivalent to passing through the garbage model in Figure 1. On the other hand, a path passing through sub-event nodes is possibly equivalent to passing through the event model in Figure 1. In our experiments we assume that the credit card event has occurred if at least two of its prominent sub-events are detected during the alignment, as this gives the best results.

Figure 2 shows 3 different sub-garbage models, namely general garbage, inter-event garbage, and bypass garbage, which respectively represent 3 different scenarios of non-events: 1) non-event regions at the start and end of speech, 2) non-event regions in between the sub-events, and 3) a non-event model against which any sub-event model should score better in order to classify a particular region as a sub-event. The general structure of the models used for sub-events and sub-garbages is shown in Figure 3. This is a word network with

length-constrained paths. For the word networks of the sub-events card name, card expiration, and card verification, the lengths of the paths are either 1 or 2. An example of a word network for the card name sub-event is shown in Figure 4.

Fig. 3. Word network for sub-event and sub-garbage models. (Parallel paths of word nodes.)

Fig. 4. Word network for card name sub-event model. (Word nodes: AMEX, Credit, American, Card, Express, Visa.)

For the word network of the card number sub-event, lengths are between 4 and 6. The vocabulary of each sub-event model is restricted to a list of words specific to it, as given by experts from call centers. For general and inter-event garbage models, lengths of

parallel paths range from 0 to any length. For the bypass garbage model, the lengths of all paths are 3. The vocabulary used for the garbage model is a restricted list of domain-specific words, obtained after discarding the words used in the sub-event vocabularies.

Fig. 2. Network used for credit card event detection. In the figure, GG = general garbage model, BpG = bypass garbage model, IeG = inter-event garbage model, CNa = credit card name model, CNu = credit card number model, CE = credit card expiration model, CV = credit card verification model.

The length constraints applied to the sub-event and garbage models are expected to ensure that prominent portions of sub-event regions are aligned to their corresponding models. For example, if a credit card number is present in the speech, then because the length of the card number sub-event model is restricted to between 4 and 6, the model is expected to align to the portion of the credit card number where the evidence for it is prominent. A minimum length constraint of 3 is imposed on all paths of the bypass garbage model to make sure that the sub-event models get preference over garbage in the regions where sub-events actually occur.

3.2. Post-processing Recognizer Output

In this approach we first use a large vocabulary recognizer to generate a complete transcript along with alignments. The word error rate of the recognizer is 43%. The post-processing step is then equivalent to aligning the word network shown in Figure 2 (without the acoustic models) against the transcripts. However, some alterations are made to the word network to achieve the best performance, such as relaxing the length constraint to 1 for sub-event models and excluding the garbage model from the final score. Simple string matching is used to perform the alignment, hence the final score is proportional to the effective length of the matching regions. An optimal threshold is then applied to the final score to decide whether the event is present.

4. EXPERIMENTAL SETUP

4.1. Corpora

We chose to use recordings of real-life telephone conversations between agents and customers in a call center to evaluate our approach, because 1) the system is developed for final use in call centers, and 2) call center speech data is harder to deal with: it is collected in a real-life scenario and contains a lot of noise in the form of key strokes, cross talk, microphone breathing sounds, long silences, hold music, automated announcements, and laughter. The customer's location and mood, such as agitation, frustration, pleasure, and satisfaction, bring additional variability to the recordings. The recordings are at 8 kHz. The conversations are level 2 support calls for troubleshooting IT systems such as laptops, desktops, modems, applications, and anti-virus software. In some of these conversations cross-selling happens and customers have to provide credit card information.

4.2. Training

The acoustic models of context-dependent phonemes, used to compose the event and garbage models, are bootstrapped from US English telephony speech recognition models with an independent speech database consisting of 25 hours of call center conversation. The final system consists of 54 phonemes, approximately 1k context-dependent phonemes, context length 3, tri-state, and a total of approximately 38k Gaussians. 60-dimensional linear discriminant analysis (LDA) features are derived from 9 consecutive frames of 24-dimensional mel-frequency cepstral coefficients (MFCC).

5. EXPERIMENTAL RESULTS

The database used to evaluate the two approaches described in Section 3 consists of a total of 140 call center conversations, of which 40 contain a credit card transaction event and 100 do not. These conversations are about 8 to 12 minutes long, altogether resulting in approximately 22 hours of speech data. Performance is measured by the false alarm rate and the false miss rate. A false alarm occurs when a detected event is not present in the recording. A false miss occurs when an event present in the recording is not detected. Table 1 compares the performance of the two approaches. For the HMM based approach, the network structure shown in Figure 2, with two parallel paths corresponding to the two sub-event sequences explained in Section 3, is used. The credit card event is declared present if at least two of its sub-events are detected, which yielded the best results. As can be seen, the HMM-based approach performs better. The main difference between the two approaches comes from the fact that the HMM-based approach is biased toward the words and phrases representative of the event.
The length and vocabulary size constraints for the garbage model are responsible for this bias. For the post-processing approach, although such a bias is imposed at a later stage, it is difficult to correct the mistakes of the recognizer at that stage. The HMM based approach, instead of recognizing the event literally (as the recognizer tries to do), tries to detect it from regions where the evidence for sub-events is prominent.

Table 1. Comparison of HMM based and post-processing approaches for credit card event detection.

Approach              False alarm, %    False miss, %
HMM-based approach    17.0              12.5
Post-processing       19.0              23.1
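The two error measures defined above can be sketched in a few lines of Python. This is only an illustration under an assumption the paper does not state explicitly: the false alarm rate is normalized by the number of conversations without the event, and the false miss rate by the number of conversations with the event. The function name `error_rates` and the boolean-list input format are hypothetical.

```python
def error_rates(labels, detections):
    """Return (false alarm %, false miss %) for per-conversation decisions.

    labels[i]     -- True if conversation i really contains the event
    detections[i] -- True if the detector declared the event present
    Assumption: each rate is normalized by the size of its own class.
    """
    n_event = sum(labels)
    n_non_event = len(labels) - n_event
    false_alarms = sum(1 for y, d in zip(labels, detections) if d and not y)
    false_misses = sum(1 for y, d in zip(labels, detections) if y and not d)
    return (100.0 * false_alarms / n_non_event,
            100.0 * false_misses / n_event)


# 40 conversations with the event and 100 without, as in the test database.
labels = [True] * 40 + [False] * 100
# Hypothetical detector output: 5 events missed, 17 false detections.
detections = [True] * 35 + [False] * 5 + [True] * 17 + [False] * 83
print(error_rates(labels, detections))  # (17.0, 12.5)
```

Under this normalization, the HMM-based figures of 17.0% and 12.5% would correspond to 17 false detections among the 100 non-event conversations and 5 missed events among the 40 event conversations.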

The network in Figure 2, used for obtaining the above results, has two parallel paths corresponding to the sub-event sequences {card name, card number, card expiration, card verification} and {card name, card number, card verification, card expiration}. This network yielded the best performance among the different network structures considered. Table 2 compares the performance of this network with a network containing 5 parallel paths. In the 5 parallel path network, apart from the two sub-event sequences of the 2 path network, the 3 remaining paths correspond to the sub-event sequences {card name, card expiration, card number, card verification}, {card number, card name, card expiration, card verification}, and {card number, card name, card verification, card expiration}.

Table 2. Comparison of different network structures in HMM based approach for credit card event detection.

Number of parallel paths    False alarm, %    False miss, %
2                           17.0              12.5
5                           39.0              5.0
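The role of the parallel paths and of the "at least two sub-events" decision rule can be illustrated with a small sketch. It does not perform Viterbi alignment; it only assumes that an alignment has already produced a sequence of model labels using the abbreviations of Figure 2. The label-list format, the function names, and the treatment of a skipped sub-event as a pass through bypass garbage are assumptions made for illustration, not the authors' implementation.

```python
SUB_EVENTS = {"CNa", "CNu", "CE", "CV"}  # sub-event labels from Fig. 2

TWO_PATH_NETWORK = [
    ("CNa", "CNu", "CE", "CV"),
    ("CNa", "CNu", "CV", "CE"),
]
FIVE_PATH_NETWORK = TWO_PATH_NETWORK + [
    ("CNa", "CE", "CNu", "CV"),
    ("CNu", "CNa", "CE", "CV"),
    ("CNu", "CNa", "CV", "CE"),
]


def detected_sub_events(alignment):
    """Keep only the sub-event labels of an aligned label sequence."""
    return [label for label in alignment if label in SUB_EVENTS]


def is_subsequence(short, long):
    """True if `short` occurs in `long` in order (gaps allowed)."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def allowed_by(network, alignment):
    """True if the detected sub-event order fits some parallel path.

    Assumption: a sub-event missing from the alignment was replaced
    by bypass garbage, so a subsequence match is sufficient.
    """
    found = detected_sub_events(alignment)
    return any(is_subsequence(found, path) for path in network)


def event_present(alignment, min_sub_events=2):
    """Decision rule: at least `min_sub_events` distinct sub-events."""
    return len(set(detected_sub_events(alignment))) >= min_sub_events


# Hypothetical alignment in which the card number precedes the card name.
alignment = ["GG", "CNu", "IeG", "CNa", "BpG", "BpG", "GG"]
print(allowed_by(TWO_PATH_NETWORK, alignment))   # False
print(allowed_by(FIVE_PATH_NETWORK, alignment))  # True
print(event_present(alignment))                  # True
```

The example shows why adding paths changes the trade-off in Table 2: an ordering that the two-path network cannot align becomes admissible in the five-path network, which can recover missed events but also gives non-event speech more ways to match.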

Instead of deciding the presence of events based on Viterbi alignments alone, applying a further threshold to the confidences of event detection offers an opportunity to find a better trade-off between false alarm and false miss rates. We compute event confidences as a simple average of the time-normalized confidences of those words recognized from speech that represent the credit card event. Word confidences are in turn computed from frame-level rank likelihoods [7]. Figure 5 shows Detection Error Trade-off (DET) curves for the two parallel path network of Figure 2. In the figure, 'Curve 1' corresponds to the case where the credit card event is declared present when at least one of its sub-events is detected, and 'Curve 2' corresponds to the case where the credit card event is declared present when at least two of its sub-events are detected. As can be seen from the figure, the best performance is a false alarm rate of 14.0% and a false miss rate of 12.5% (indicated as the best operating point in the figure).

Fig. 5. DET curves for HMM based credit card event detection using network with two parallel paths. (Axes: false alarm rate in % versus false miss rate in %; the plot marks Curve 1, Curve 2, and the best operating point.)

6. CONCLUSION

In this paper we have addressed the problem of detecting complex events that have a large amount of variability in content and composition. We have extended the HMM based word spotting method to detect events, by composing an event model from sub-event models and a garbage model from sub-garbage models. We have compared the performance of this method with event detection based on noisy ASR transcripts and have found that the HMM-based approach performs better at detecting such events. This work can be extended by automatically learning the HMM based event grammar from human-generated transcripts annotated with sub-event boundaries. This can further strengthen the sub-event models and the garbage models.

7. REFERENCES

[1] T. K. Chia, H. Li, and H. T. Ng, "A Statistical Language Modeling Approach to Lattice-Based Spoken Document Retrieval," in Proc. of EMNLP and CoNLL, Prague, pp. 810-818, June 2007.

[2] P. Jeanrenaud, M. Siu, J. R. Rohlicek, M. Meteer, and H. Gish, "Spotting events in continuous speech," in Proc. of ICASSP, Vol. 1, pp. 381-384, April 1994.

[3] I. Malioutov, A. Park, R. Barzilay, and J. Glass, "Making sense of sound: Unsupervised topic segmentation over acoustic input," in Proc. of ACL, Prague, pp. 504-511, June 2007.

[4] R. C. Rose and D. B. Paul, "A Hidden Markov Model based Keyword Recognition System," in Proc. of ICASSP, 1990.

[5] M. Weintraub, "LVCSR Log-likelihood Ratio Scoring for Keyword Spotting," in Proc. of ICASSP, 1995.

[6] G. Mishne, D. Carmel, R. Hoory, A. Roytman, and A. Soffer, "Automatic Analysis of Call-Center Conversations," in Proc. of ACM CIKM, NY, USA, 2005.

[7] L. R. Bahl et al., "Robust Methods for Using Context-Dependent Features and Models in A Continuous Speech Recognizer," in Proc. of ICASSP, Adelaide, Australia, April 1994.

[8] D. M. Bikel, R. Schwartz, and R. M. Weischedel, "An Algorithm that Learns What's in a Name," in Machine Learning Journal, Special Issue on Natural Language Learning, 1999.
