J.H.L. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, and B. Pellom, "Audio Stream Phrase Recognition for a National Gallery of the Spoken Word: “One Small Step”," ICSLP-2000: International Conference on Spoken Language Processing, vol. 3, pp. 1089-1092, Beijing, China, Oct. 2000.


AUDIO STREAM PHRASE RECOGNITION FOR A NATIONAL GALLERY OF THE SPOKEN WORD: “One Small Step”
John H.L. Hansen, Bowen Zhou, Murat Akbacak, Ruhi Sarikaya, and Bryan Pellom
The Center for Spoken Language Research; Robust Speech Processing Laboratory
University of Colorado at Boulder, Boulder, Colorado 80309-0594, USA
http://cslr.colorado.edu

ABSTRACT
In this paper, we introduce the problem of audio stream phrase recognition for information retrieval for a new National Gallery of the Spoken Word (NGSW). This will be the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings of historical content from the 20th Century. We propose a system diagram and discuss critical processing tasks such as an environment classifier; recognizer model adaptation for acoustic background noise, restricted channels, and speaker variability; a natural language processor; and speech enhancement/feature processing. A probe NGSW data set is used to perform experiments using SPHINX-III LVCSR and a previously formulated RSPL keyword spotting system. Results are reported for WSJ, BN, and NGSW corpora. Results from sub-system evaluations are reported for (i) model adaptation based on mixture weight adjustment with MLLR (which reduces WER by 2.6% over a baseline BN-trained model), (ii) speaker and environmental turn detection using a Bayesian Information Criterion (BIC), and (iii) statistical analysis of phrase recognition performance for confidence measure scoring. Finally, we discuss a number of research challenges that must be addressed for the overall task of robust phrase searching in unrestricted corpora.

1. INTRODUCTION
The problem of reliable speech recognition for information retrieval is challenging when data is recorded across different media and equipment. In this paper, we address the problem of audio stream phrase recognition for a new National Gallery of the Spoken Word (NGSW)[1]. This will be the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings of significant historical content. An NSF initiative was recently established to provide better transition of library services to digital format. As part of this Phase-II Digital Libraries Initiative, researchers from Michigan State Univ. (MSU) and the Univ. of Colorado Boulder (RSPL-CSLR) have teamed to establish a fully searchable, online WWW database of spoken word collections that span the 20th Century. The database draws primarily from holdings of MSU's Vincent Voice Library, which include more than 60,000 hours of recordings (from T. Edison's first cylinder recordings, to famous speeches such as man's first steps on the moon, “One Small Step for Man”, to American presidents over the past 100 years). In this partnership, MSU will house the NGSW collection, as well as digitize (with assistance from LDC), catalog, organize, and provide meta-tagging information. MSU is also responsible for a number of engineering challenges such as digital watermarking and effective compression strategies. The team at CSLR is responsible for developing the robust audio-stream search engine and user-selectable speech enhancement plug-ins. In the field of robust speech recognition, there is a variety of challenging, persistent problems, such as reliable speech recognition across wireless communications channels, recognition of speech across changing speaker conditions (emotion, stress, accent)[6], and recognition of speech from unknown or changing acoustic environments.
The ability to achieve effective performance under changing speaker conditions for large vocabulary continuous speech recognition (LVCSR) remains a challenge, as demonstrated in recent DARPA evaluations focused on Broadcast News (BN) versus previous results from the Wall Street Journal (WSJ) corpus. The range and extent of acoustic distortion, speaker variability, and audio quality in the NGSW corpus far exceeds that of any database available from the DARPA community (including BN). This paper is organized as follows. First, we discuss the research challenges faced in developing an audio stream search engine, then present the proposed system and discuss its components. Sec. 4 presents a series of experiments with model adaptation, followed by results from sub-system task evaluations for phrase recognition, speaker/environmental turn detection, and speech enhancement approaches. Sec. 5 concludes with a discussion of the research directions for system development.

2. RESEARCH ISSUES
Although the problem of audio stream search is relatively new, it is related to a number of previous research problems. Previous systems developed for streaming video search based on audio[2] or closed-captioning can be effective, but often assume either an associated text stream or a clean audio stream. Information retrieval via audio and audio mining have recently produced several commercial approaches[12,13]; however, these methods generally focus on relatively clean, single-speaker recording conditions. Alternative methods have considered ways to time-compress or modify speech in order to allow human listeners to more quickly skim through recorded audio data[14]. While keyword spotting systems can generally be used for topic or gisting applications, for phrase search the system must be able to recover from errors in both the user-requested text sequence and the rank-ordered detected phrase sites within the stream. Phrase search focuses more on locating a single requested occurrence, whereas keyword/topic spotting systems assume a number of possible searched outcomes. Great strides have also been made in LVCSR for BN[3], which reflects a wider range of acoustic environments. However, the recognition of speech in BN reflects a homogeneous data corpus (i.e., recordings from TV and radio news broadcasts, organized into 7 classes from F0: clean, to FX: low fidelity with cross-talk). One natural solution to audio stream search is to perform forced transcription for the entire dataset, and simply search the synchronized text stream. While this may be a manageable task for BN (consisting of about 100 hours), the initial offering for NGSW will be 5,000 hours (with a potential of more than 60,000 total hours), and it will simply not be possible to achieve accurate forced transcription, even if the text data were available. To illustrate the range of NGSW recording conditions, three example spectrograms are shown in Fig. 1.
The recordings are: (a) Thomas Edison, “my work as an electrician” [talking about contributions of 19th century scientists; original Edison cylinder disk recording, 1908], (b) Thomas Watson, “as Bell was about to speak into the new instrument” [talking about the first telephone message from A.G. Bell on March 10, 1876; recorded in 1926], (c) President Bill Clinton, “tonight I stand before you” [State of the Union Address on economic expansion, Jan. 19, 1999]. These examples indicate the wide range of distortions present in the speech corpus. Some of these include: severe bandwidth restrictions (e.g., Edison style cylinder disks), poor audio from scratchy, used, or aging recording media, differences in microphone type and placement, reverberation for speeches from many public figures, recordings made from telephone, radio, or TV broadcasts, background noise including audience and multiple speakers or interviewers, a wide range of speaking styles and accents, etc. Clearly, the ability to achieve reliable phrase recognition search for such data is an unparalleled challenge in the speech recognition community.

Fig. 1. Example audio stream (8 kHz) spectrograms from NGSW. (A) Thomas Edison, 1908, (B) Thomas Watson, 1926, (C) Pres. Bill Clinton, 1999, (D) enhanced Edison recording.

[Fig. 2 block diagram: a user search request (e.g., "I have a dream") passes through a natural language parser to an N-best audio stream rank-order test; the input audio stream passes through an environment classifier, then acoustic background noise adaptation, speaker adaptation, and restricted channel adaptation, with speech enhancement (feature enhancement, perceptual enhancement) feeding the HMM recognition search, which outputs spotted prompt phrase sets.]
Fig. 2. Flow Diagram of Audio Stream Recognition Search

3. SYSTEM COMPONENTS
Having identified the range of distortion in the corpus, we now propose a system for performing audio stream phrase search for NGSW. Fig. 2 presents the proposed phrase recognition system. The system will consist of the following processing tasks:
• N-Best Processor & Natural Language Parser: Since the user submits a requested text search sequence, an N-best parser will be used to rank-order potential audio streams for search. An NLP processor will also be used to correct ill-formed requests, so that each search is in a standard format.
• Environmental Characterization: An environment processor will be used in conjunction with meta-tag information for the input stream under test. This processor will identify the distortion type (acoustic background or recording media noise, restricted channel, reverberation, multiple speakers, etc.).
• Acoustic Background Noise Adaptation: One of three adaptation processing tasks to be employed, applied based on the Environmental Classifier output. We will focus on rapid methods for parallel model combination[4].
• Speaker Adaptation: For repeated searching of the same speaker, HMM adaptation is performed. Here, the challenge is the limited adaptation data set size. Several methods are currently being considered, including MLLR-based schemes, selective training[7], and decision bias correction[8].
• Restricted Channel Adaptation: Clearly, for audio streams from Edison cylinder disks, the frequency bandwidth is small (2 kHz). As a result, we are investigating methods to neutralize feature sets for recognition models trained with 8 and 16 kHz speech. This processing stage will also provide confidence assessment for the requested phoneme phrase set, given a truncated frequency structure for the audio stream under test.
• Speech Enhancement: A set of speech enhancement algorithms will be available to the user for quality improvement, and for feature enhancement of the audio prior to recognition stream search.
• HMM Recognition Search: With model adaptation tasks completed, an HMM phrase search is performed. If the Environment Classifier determines that front-end speech feature enhancement could be effective, the input parameter set is enhanced before performing the search. Detected phrase sets are rank-ordered using confidence measures and an NLP processor.
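As a rough illustration, the classifier-driven selection of adaptation stages described above can be sketched as follows. All class, function, and tag names here are hypothetical assumptions for illustration, not the actual NGSW implementation:

```python
# Hypothetical sketch of the Fig. 2 control flow; names are illustrative only.
from dataclasses import dataclass

@dataclass
class EnvProfile:
    noise_type: str       # e.g., "media_noise" or "clean"
    bandwidth_hz: int     # e.g., 2000 for Edison cylinder recordings
    multi_speaker: bool

def classify_environment(meta_tags):
    # Placeholder: a real classifier would combine meta-tags with
    # acoustic evidence (see the turn/environment detection of Sec. 4.4).
    if meta_tags.get("source") == "cylinder":
        return EnvProfile("media_noise", 2000, False)
    return EnvProfile("clean", 8000, False)

def select_models(env):
    # Choose adaptation steps based on the environment classifier output.
    steps = []
    if env.noise_type != "clean":
        steps.append("background_noise_adaptation")   # e.g., PMC-style methods
    if env.bandwidth_hz < 8000:
        steps.append("restricted_channel_adaptation")
    return steps

env = classify_environment({"source": "cylinder"})
print(select_models(env))  # which adaptation stages would be applied
```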

4. PHRASE RECOGNITION EVALUATION
4.1 Probe NGSW Data Corpus
The primary NGSW challenge is to formulate a robust search engine without the use of audio stream transcripts. Here, we consider a number of limited LVCSR and keyword recognition experiments. The following data sets were used for system evaluations: (i) a set of 100 short (each ~6 sec.) audio streams reflecting a cross-section of the entire audio corpus, and (ii) 10 audio streams (each 5-8 min.) which focus on audio data from the 1930-40s (mostly speech from President F.D. Roosevelt).

4.2 LVCSR Evaluations: WSJ, BN, NGSW
In order to investigate speech recognition performance using NGSW data, a series of experiments was performed using the CMU SPHINX-III LVCSR system. The recognizer was trained using 16 kHz sampled speech with a 39-dimensional feature vector (MFCCs and energy, plus deltas and delta-deltas). The acoustic model units are within-word and cross-word triphones, plus context-independent monophones and noise models (e.g., click sounds, etc.). A word-based language model is used that includes unigrams, bigrams, and trigrams, along with a back-off mechanism (N-best rescoring using an A* search was disabled for these tests). A 64,000-word vocabulary set is used. Two acoustic model sets were used, one trained from WSJ SI284 data, the second trained from the BN corpus. The PE section of the 1996 BN development test data was used for testing. Table 1 shows results using this LVCSR setup for BN and WSJ. Clearly, BN-trained models produce lower word error rates (WERs) than WSJ-trained models. A range of performance occurs across the focus conditions. Relative to clean speech (F0), non-native speakers (F5) show a slight WER increase, while narrow bandwidth (F2) and other conditions (FX) show significant increases. In a manner similar to BN, we clustered NGSW data into 4 focus conditions (C0-C3), with C0 reflecting seriously degraded recording environments. WERs for these conditions ranged from 42-85%, with an average of 67.9% (much higher than for BN). Baseline HMM mean MLLR model adaptation for NGSW data was considered, as well as a new transformation based on mixture weights[15]. The results show a slight reduction in WER for the C0 focus condition.
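The 39-dimensional feature construction above (13 static coefficients per frame, extended with regression-based delta and delta-delta coefficients) can be sketched as follows; the static MFCC-plus-energy matrix is assumed to be given, and the regression window of 2 frames is an assumption of this sketch:

```python
import numpy as np

def deltas(feat, window=2):
    # feat: (T, D) frame-by-frame features. Standard regression formula:
    # d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), with edge padding.
    T = feat.shape[0]
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    d = np.zeros_like(feat, dtype=float)
    for k in range(1, window + 1):
        d += k * (padded[window + k:window + k + T]
                  - padded[window - k:window - k + T])
    return d / denom

def full_feature_vector(static):
    # static: (T, 13) = 12 MFCCs + energy -> (T, 39) with deltas + delta-deltas.
    d = deltas(static)
    return np.concatenate([static, d, deltas(d)], axis=1)
```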

[Fig. 3 spectrogram residue: 0-4 kHz, time axis 0:00-3:40, with hand-marked and BIC speaker/environment change markings. Marked segments: F.D. Roosevelt, 0:08.4-0:28.13, "...only thing we have to fear is fear itself..."; H.S. Truman, 0:34.9-0:54.4, "...human dignity..."; D.D. Eisenhower, 1:01.7-1:26.8, "...freedom is pitted against slavery..."; J.F. Kennedy, 1:48.7-2:06.7, "...ask not what your country can do for you..."; L.B. Johnson, 2:15.8-2:53.9, "...great society..."; R.M. Nixon, 3:02.9-3:55.0, "...simple things...".]
Fig. 3. Phrase Search Example: Historian discussing U.S. history with audio examples from six U.S. Presidents included. Hand marked speaker change locations are shown, along with BIC speaker/environmental detected changes.

WER (%) for each BN focus condition:

HMM Training Corpus    F0    F1    F2    F3    F4    F5    FX    Avg.
WSJ (16 mixtures)     19.3  41.9  61.3  44.4  37.7  32.1  69.2   43.1
BN (32 mixtures)      14.1  31.3  38.3  31.5  22.9  25.5  55.4   31.1

WER (%) for the NGSW C0 focus condition:

Model Information     Original   MLLR µ   MLLR µ+T-I
BN (32 mixtures)        85.6      83.3       83.0

Table 1. SPHINX-III LVCSR (i) baseline performance using BN and WSJ corpora, and (ii) NGSW performance using mean MLLR and the new T-I mixture weight model adaptation scheme[15].

4.3 Baseline Phrase Recognition Search: NGSW
Next, a baseline keyword spotting system developed by RSPL for audio stream search in television broadcasts[11] was employed; its structure is similar to previously reported methods[2]. The RSPL spotter uses context-dependent HMMs with filler models based on context-independent HMMs and a silence model. A front-end channel characterization is performed with binary decisions based on male/female, music/noise, and high quality/telephone channel quality models. Using NGSW short streams, two sets of phrase searches were established for testing. Since phrase search is a new area in speech recognition, there exists no standard performance criterion. Therefore, while there may be only a single occurrence of a requested search phrase per audio stream, we borrow the performance criteria used for keyword spotting systems. The three performance quantities used are (i) the percentage of requested phrases recognized in the audio stream correctly (% correct), (ii) the average number of false alarms per keyword per hour (FA/K/H), and (iii) from NIST[5], a single figure of merit (F.O.M.) that reflects the average of detection scores up to 10 false alarms per keyword per hour:

FOM = (p_1 + p_2 + ... + p_N + a p_{N+1}) / (10T)     (2)

where p_i is the percentage of true keyword hits found before the i-th false alarm, N is the first integer ≥ 10T − (1/2), T is the fraction of an hour of test talkers, and "a" is a scale factor that interpolates to 10 false alarms per hour. The FOM yields a more stable overall performance score. Fig. 4 shows phrase spotting results with the number of false alarms ranging from 0 to 9. Since the number of potential "keywords" for the requested phrase search is small for each audio stream, the FOM is flat after 2-4 hits. The phrase complexity can also influence performance (this explains the differences between Lists #1 and #2). Performance here is reasonable, given the wide range of distortions within the corpus.
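Given these definitions, the FOM can be computed directly from a rank-ordered detection list. The sketch below assumes detections are sorted by descending confidence and takes the interpolation factor as a = 10T − N (the standard NIST convention, an assumption here since the text gives "a" only descriptively):

```python
import math

def figure_of_merit(ranked_is_hit, n_true, hours):
    # ranked_is_hit: detections sorted by descending confidence; True = true hit.
    # Implements FOM = (p_1 + ... + p_N + a*p_{N+1}) / (10T), Eq. (2).
    T = hours
    N = math.ceil(10 * T - 0.5)   # first integer >= 10T - 1/2
    p = []                        # p[i] = % of true hits before (i+1)-th false alarm
    hits = 0
    for is_hit in ranked_is_hit:
        if is_hit:
            hits += 1
        else:
            p.append(100.0 * hits / n_true)
            if len(p) > N:        # we only need p_1 .. p_{N+1}
                break
    while len(p) <= N:            # fewer than N+1 false alarms observed
        p.append(100.0 * hits / n_true)
    a = 10 * T - N                # assumed NIST interpolation factor
    return (sum(p[:N]) + a * p[N]) / (10 * T)
```

For one hour of test data (T = 1, N = 10, a = 0) this reduces to the average detection rate at 1 through 10 false alarms.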

4.4 Speaker/Environmental Turn Detection
As part of HMM recognition adaptation, a method is needed to detect speaker turns and/or environmental changes. A method was developed based on the BIC (Bayesian Information Criterion), which is described in greater detail in [16]. This method employs Hotelling's T2 statistic to achieve a 100-fold increase in computing performance with better speaker turn segmentation and clustering. Sample audio segmentation and clustering was performed using BIC for long-duration audio streams. An extracted sample is shown in Fig. 3, where a radio announcer gives an historical overview of U.S. history, with injected audio clips from six U.S. presidents. On average, speaker changes were identified within 90.7 msec, with some additional environmental changes detected in presidential speeches due to audience applause.
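A minimal form of the underlying BIC change test (without the Hotelling T2 pre-test of [16], which is what provides the speedup) can be sketched as follows: a window of feature frames is modeled either as one full-covariance Gaussian or as two, and a positive penalized likelihood-ratio score favors a change point:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    # X: (T, d) feature frames; t: candidate change frame.
    # Positive return value favors a change (two Gaussians) at t.
    T, d = X.shape
    def logdet_cov(Z):
        c = np.cov(Z, rowvar=False, bias=True)
        return np.linalg.slogdet(c)[1]
    R = 0.5 * (T * logdet_cov(X)
               - t * logdet_cov(X[:t])
               - (T - t) * logdet_cov(X[t:]))
    # Penalty for the extra mean + full covariance parameters.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(T)
    return R - penalty

def best_change_point(X, margin=20):
    # Scan candidate change points, keeping a margin so each side is estimable.
    T = X.shape[0]
    scores = {t: delta_bic(X, t) for t in range(margin, T - margin)}
    t_best = max(scores, key=scores.get)
    return (t_best, scores[t_best]) if scores[t_best] > 0 else (None, None)
```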

4.5 Isolated Phrase Searching
While keyword search results reflect the performance of our baseline search engine, for audio stream information retrieval the user will likely request isolated occurrences of individual phrases. As such, further investigation using the baseline phrase recognition system was performed. The purpose was to obtain a statistical analysis of keyword recognition performance for future confidence measure scoring. There are a number of probability calculations used to determine the occurrence of a keyword, including: overall word probability, left and right half probability scores, and the balance between left/right scores. Thresholds for these scores were set to an extremely low minimum in order to obtain a high number of false hits. Also, a threshold is used in the spotting system for the detection of speech vs. music/noise conditions. We disabled music/noise model selection so that all input data would be evaluated. The goal was to determine the average number of false hits before the correct requested key-phrase was found. We selected 5 audio streams of duration 5-8 min. that focused primarily on President F.D. Roosevelt during the 1930-40s. Each had significantly different noise/recording/speaker conditions: Files 1 & 2: FDR with LF noise, speech bandwidth <2 kHz, recorded at different times; Files 3 & 4: two speakers talking, with File 4 including one accented speaker; File 5: historian speaking with numerous presidential speeches inserted (a portion is shown in Fig. 3). Here, only the male telephone HMM model set recognizer was used. Fig. 5 summarizes results showing that the probability of finding the correct phrase in the so-called "hit-list" varies depending on recording condition (probability of success ranged from 27-74%). When the correct phrase was found (we searched for 75 phrases), the average number of hits was 22.4, with 5.6 false hits before the correct phrase was identified. When speaker/noise variability is high (File 5), phrase detection rates are low.
Reasonable performance is shown for low-bandwidth speech (Files 1 & 2) and multiple-speaker conditions (File 3), which suggests that the system is capable of locating the requested phrase. This performance represents a worst-case scenario for audio stream search, since no model adaptation or confidence measurement processing was applied, and more appropriate model sets (e.g., HMMs retrained with BN data) should be included. Since insufficient NGSW data has been transcribed, it was not yet possible to compare performance with a system retrained on NGSW data.
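The Fig. 5 quantities (probability of success, average hits per search, and average false hits before the correct phrase) can be computed from per-phrase hit lists. The hit-list representation below is an assumption for illustration: each search yields a rank-ordered list of candidate sites, flagged True at the correct phrase site:

```python
def hitlist_stats(searches):
    # searches: one rank-ordered hit list per requested phrase;
    # each entry is a list of booleans, True = the correct phrase site.
    found = [h for h in searches if any(h)]
    success_rate = len(found) / len(searches)
    avg_hits = sum(len(h) for h in searches) / len(searches)
    # False hits returned above the correct one, averaged over successful searches.
    avg_false_before = (sum(h.index(True) for h in found) / len(found)
                        if found else None)
    return success_rate, avg_hits, avg_false_before
```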

[Fig. 4 plot residue: F.O.M. phrase recognition performance, correct detection rate vs. corresponding number of false alarms (0-9), for Phrase List #1 and Phrase List #2. Avg. correct detection rate (with 0-9 false alarms): Phrase List #1: 16.18%, Phrase List #2: 12.64%; false alarms/keyword/hour: 10.51 and 8.69; % correct: 21.65% and 15.38%.]

Fig. 4. RSPL keyword spotting results. Two NGSW phrase lists were tested, with the NIST FOM (figure of merit) shown.

Fig. 5 data (male telephone channel HMM models):

                                   File 1  File 2  File 3  File 4  File 5
Prob. of finding success in list     67%     74%     67%     47%     27%
Avg. # hits in search                 36      21      20      31       4
Avg. # hits before success            15       3       1       7       2

Fig. 5. Single phrase recognition results for 5 different noisy 1930s audio streams using the male telephone HMM model set.

4.6 Speech Enhancement Processing
Finally, a number of speech enhancement methods[10] will be considered for the proposed system. One recent method which has shown promise is based on perceptual enhancement using an Auditory Masking Threshold (AMT)[9]. This new method is shown to produce more accurate AMT estimates, and therefore to contribute more effectively to speech enhancement. To illustrate performance, we processed a sequence of NGSW audio streams. Fig. 1(D) shows the resulting speech spectrogram for the 1908 Edison recording, which sounds extremely clean. This and other methods will be employed for NGSW data, with alternatives addressing broadband, tone, and impulsive noise sources.

5. DISCUSSION
In this study, we have introduced the problem of audio stream phrase recognition for a new National Gallery of the Spoken Word. This will be the first large-scale repository of its kind. We discussed the enormous series of challenges that must be met to achieve effective audio stream search on data recorded from 1900s Edison cylinder disks to audio examples from presidential speeches today. A proposed system diagram was discussed, which identified critical processing tasks such as an environment classifier, and recognizer model adaptation for acoustic background noise, restricted channels, and speaker variability. We identified approaches under consideration to improve model characterization based on selective training[7], bias-normalization[8], MLLR, and PCA-PMC[4]. Probe NGSW data was used for several experiments. Results were presented for LVCSR (SPHINX-III) tests using WSJ, BN, and NGSW corpora. Results were also presented from sub-task evaluations based on keyword spotting and single phrase search. Finally, speech enhancement methods were discussed for front-end feature processing and user-selected sub-systems for listening. The ability to achieve fast and reliable NGSW phrase recognition search performance is an enormous challenge. The statistical analysis of single phrase retrieval performance will provide useful data for confidence measure scoring. Clearly, improved recognition models based on BN and NGSW corpora, as well as model adaptation based on PCA-PMC, will be necessary for effective search. The range and source of environmental distortion and speaker variability in NGSW will require that the best contributions in robust speech recognition be summoned together in an effective, integrated manner. The performance of the LVCSR and RSPL keyword spotting systems here has served to identify areas where NGSW data will be most challenging, and therefore represents "one small step" in addressing the overall task of robust phrase searching in unrestricted corpora.

Acknowledgements
This work was supported by NSF Cooperative Agreement No. IIS-9817485. Any opinions, findings, and conclusions expressed are those of the authors and do not necessarily reflect the views of NSF. We would like to acknowledge a number of NGSW participants: Michigan State Univ.: M. Seadle (Library Services), M. Kornbluh (History), J. Grant (Education), J.R. Deller (Electrical Engineering). We also extend thanks to Rita Singh from CMU for assistance in answering questions regarding the SPHINX-III recognizer.

REFERENCES
[1] http://www.ngsw.org
[2] J. Foote, et al., "Talker-Independent Keyword Spotting for Information Retrieval," Eurospeech, vol. 3, pp. 2145-2149, 1995.
[3] P.C. Woodland, et al., "Experiments in Broadcast News Transcription," IEEE ICASSP-98, pp. 909-912, 1998.
[4] R. Sarikaya, J.H.L. Hansen, "PCA-PMC: A Novel Use of A Priori Knowledge for Fast Parallel Model Combination," IEEE ICASSP-2000.
[5] Road Rally Word-Spotting Corpora, RDRALLY1 CD-ROM, LDC, NIST Speech Disc 6-1.1, Sept. 1991.
[6] J.H.L. Hansen, "Analysis and Compensation of Speech under Stress and Noise for Environmental Robustness in Speech Recognition," Speech Communication, 20(2):151-170, Nov. 1996.
[7] L.M. Arslan, J.H.L. Hansen, "Selective Training in Hidden Markov Model Recognition," IEEE Trans. Speech & Audio Proc., 7(1):46-54, Jan. 1999.
[8] L.M. Arslan, J.H.L. Hansen, "Likelihood Decision Boundary Estimation between HMM Pairs in Speech Recognition," IEEE Trans. Speech & Audio Proc., 6(4):410-414, July 1998.
[9] R. Sarikaya, J.H.L. Hansen, "Auditory Masking Threshold Estimation with Application to Speech Enhancement," EUROSPEECH-99, (6):2571-2574, Budapest, Hungary, Sept. 1999.
[10] J.H.L. Hansen, "Speech Enhancement," Encyclopedia of Electrical and Electronics Engineering, John Wiley & Sons, 20:159-175, 1999.
[11] J.H.L. Hansen, B.L. Pellom, D. Chappell, S. Bou-Ghazale, "Analysis & Normalization of Speech and Speaker Characteristics," RSPL Final Tech. Report RSPL-97-3, SPAWAR Systems, Feb. 1997.
[12] http://speechbot.research.compaq.com/
[13] http://www.dragonsys.com/news/pr/audiomine.html
[14] B. Arons, "A System for Interactively Skimming Recorded Speech," ACM Trans. Computer-Human Interaction, 4(1):3-38, 1997.
[15] G. Zhou, "Nonlinear Speech Analysis & Acoustic Model Adaptation with Applications to Stress Classification and Speech Recognition," Ph.D. Thesis, Robust Speech Processing Lab, Nov. 1999.
[16] B. Zhou, J.H.L. Hansen, "Unsupervised Audio Stream Segmentation & Clustering via the Bayesian Information Criterion," ICSLP-2000.
