

Audio-based video genre identification
Mickael Rouvier, Stanislas Oger, Georges Linarès, Driss Matrouf, Bernard Merialdo, Yingbo Li

Abstract—This paper presents investigations into the automatic identification of video genre by audio channel analysis. Genre refers to editorial styles such as commercials, movies, sports, etc. We propose and evaluate several methods based on both low- and high-level descriptors, in the cepstral or time domains, but also on the global structure of the document and its linguistic contents. The proposed features are then combined and their complementarity is evaluated. On a database composed of single-story web videos, the best audio-only system reaches a 9% Classification Error Rate (CER). Finally, we evaluate the complementarity of the proposed audio features and of video features that are classically used for Video Genre Identification (VGI). Results demonstrate the complementarity of the two modalities for genre recognition, the final audio-video system reaching 6% CER.
Keywords—video genre classification, automatic classification, linguistic feature extraction.

I. INTRODUCTION

The amount of videos available on the internet and on digital TV has grown rapidly over the last decade. Usually, these databases are not organized into a clear and uniform structure, especially when they are composed of user-generated records. This lack of structure limits the accessibility of contents, since most search tools rely on uncertain and imprecise metadata provided by the users. Automatic structuring of such collections requires high-level video categorization by descriptors that are related not only to the semantic contents, but also to the document form. The genre is one of the metadata that could help organize videos into large categories. Genre refers to the editorial style of a video. Here, we focus on seven of the main categories that can be found in TV video streams and on video sharing platforms: commercials (and advertisements), news, music, sports, cartoons, documentaries and movies, which correspond to a subset of the full taxonomy of genres as defined in [1].
The automatic identification of video genre is a challenging task that has motivated many recent studies and some contests, such as the Google Challenge (http://comminfo.rutgers.edu/conferences/mmchallenge/2010/02/10/google-challenge/) and the TRECVid evaluation campaigns [2]. Most of the proposed methods rely on features that are obtained through image or cinematic analysis. This seems to be a natural way for video classification.

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
M. Rouvier is with the Laboratoire Fondamental d'Informatique (LIF), Marseille, France.
S. Oger, G. Linarès and D. Matrouf are with the Laboratoire d'Informatique d'Avignon (LIA), Avignon, France.
B. Merialdo and Y. Li are with EURECOM, Sophia Antipolis, France.

Various features were tested, corresponding to various levels of representation and different classification strategies. The reported results clearly depend on the evaluation framework, but error rates are typically lower than 10%: in a 5-class Video Genre Identification (VGI) task, [3] obtains 8% by using color histograms and a Support Vector Machine (SVM) classifier.
Text-based classification methods are mostly used for classifying videos according to their semantic or linguistic contents, such as the genre of movies (horror, comedy, etc.), rather than the editorial style [4]. Some authors proposed to apply text-based VGI methods to closed captions or to viewable text extracted from the image. The results confirm that textual information provides some complementary information to classical video features [5], but such methods depend on the availability of textual information attached to the document.
Audio-based identification has been developed in two different ways. Some high-level approaches consist in tracking audio events, such as jingles [6]. Others apply text-categorization techniques to the outputs of a speech recognizer [5]. The major drawbacks of these methods are due to the use of a priori knowledge about the audio contents (e.g. about the kind of audio markers) and to the CPU resources required by Automatic Speech Recognition (ASR). Moreover, in the context of TV or Web data, high word error rates can be expected, and any linguistic analysis would be negatively affected by recognition errors. Low-level approaches use classifiers that operate in the cepstral domain [1], [7], [8] or in the temporal domain, such as [9], which focuses on acoustic features representing the document structure along the temporal axis by using Zero-Crossing Rates (ZCR) and energy variances. In [10], the authors use a neuromimetic classifier that operates on a short-term cepstral analysis (Mel-Frequency Cepstral Coefficients, MFCC). On a 5-genre classification task, this method obtains a 49% Classification Error Rate (CER), in line with the results reported in other papers using similar classification schemes.
A general conclusion about these various approaches to VGI could be that audio seems significantly less accurate than video for genre identification. Nevertheless, speech processing is continuously progressing, and recent techniques, such as variability reduction by factor analysis or system combination, have significantly improved speaker and language identification systems. Moreover, genres differ in many aspects: cepstral distribution and linguistics, but also speech style and quality, speaker interactivity, relative distribution of speech and music, etc.
This paper focuses on audio-based VGI. Various systems are presented, based on low-level and high-level features extracted from the audio channel. The relevance and complementarity of the audio features are evaluated, and we propose a combination scheme that benefits from the feature diversity.



Finally, we combine an audio-only approach with a simple image-based method in an audio-video VGI system, and we discuss the complementarity of the two modalities.
This paper is organized as follows. The next section presents the task and the data on which the proposed methods are evaluated. Then, we present our contributions to audio-based video genre identification. We first focus on classification in the cepstral domain, by presenting a method based on factor analysis that aims at improving the system's robustness. Section 4 introduces high-level audio features, with the purpose of capturing information related to the audio context and the temporal structure of the videos. Section 5 presents our investigation of the linguistics of genre, by text categorization methods applied to automatic transcriptions. In Section 6, we evaluate classical video features for VGI. Section 7 focuses on feature combination: we evaluate the complementarity of all the audio descriptors and we combine them to obtain our final audio-only VGI system. Then, video features are added to the system, and we estimate the complementarity of the audio and video features. The last section concludes on the interest of audio-based approaches for genre identification and proposes some perspectives.

II. TASK AND CORPUS

Experiments are conducted on a corpus of videos belonging to one of the seven classes commonly used for the evaluation of VGI methods: commercials, sports, news, cartoons, documentaries, music, and movies (trailers). The database contains 1,610 videos collected from video-sharing web platforms and manually annotated by humans. The documents are relatively short, from 1 to 5 minutes long, with a mean duration of 2 min 15 s. Spoken contents are mainly in French; however, the music class contains both French and English songs.
Some previous works deal with the relevance of metadata for VGI [11]. In [12], [13], it is shown that people tend to upload the same kind of videos, meaning that a system could use uploader profiles to contribute to uploaded-video genre identification. Since we focus on content extraction and analysis, the metadata attached to the videos, such as titles or tags added by the uploaders, will not be used. This choice is motivated by the uncertain availability or poor relevance of this kind of user-produced meta-information.
Each video is supposed to belong to a unique class and to be a single story, i.e. its narrative structure remains relatively simple and homogeneous. The corpus is divided into two parts: 1,050 videos are used for training and tuning, and 560 for the test set. Since the videos were selected for an even distribution among the classes, we finally obtained about 150 videos per class for training, and 80 per class for testing.

III. SYSTEM OVERVIEW

The proposed system has a two-level architecture (Figure 1).


Fig. 1. Overall scheme of the genre classifier: features related to acoustic space characterization, speaker interactivity, speech quality, linguistics and video are extracted and combined by a classifier.

The first level extracts features that are combined at the second level. We identify the following five feature groups, which are described in depth in the following sections:
• acoustic space: this is the most frequently used descriptor for audio-only categorization. The general idea is to distinguish genres by statistical modeling of their cepstral patterns.
• speaker interactivity: video genres may display significantly different interactivity levels. For example, speaker turns and speaking times probably differ between cartoons and news.
• speech quality: most speech-quality related features rely on speech recognition methods; we estimate the quality of the spoken contents by acoustic analysis and by an a posteriori evaluation of a speech recognition process applied to the speech segments.
• linguistics: video genres may be classified by relevant words extracted from the available textual data. We propose to extract linguistic information by using ASR to produce a transcription of the audio channel of the videos.
• video: in order to classify video genres, we extract 4 visual features that are commonly used for video indexing (color moments, wavelet features, edge histograms, and local binary patterns).
Each feature group offers a particular view of the targeted document that may be useful to the upper-level genre identification system.

IV. GENRE IDENTIFICATION IN THE CEPSTRAL DOMAIN

A. Introduction

Short-term cepstral coefficients are the most widely used features for speech and acoustic processing.



The general idea is that the continuous acoustic stream may be represented as a sequence of acoustic features, each estimated on a temporal window short enough that the signal within it may be considered stationary. Statistical classifiers estimate the probability of the classes given each acoustic observation. The resulting frame-level scores (probabilities or log-probabilities) are then accumulated to evaluate the classification hypothesis on the whole observation sequence.
Classifying documents by analyzing the acoustic vector sequence presents two major difficulties. First, the variability within each genre (intervideo variability) may be high. For example, commercials may be composed exclusively of music or of speech; short movies may be filmed in a highly moving and noisy environment, or in a very quiet room. The second difficulty is that genres are potentially poorly separable, because a document may belong to classes that are very close. For example, a commercial may be viewed as a movie designed for marketing; similarly, music and cartoons may be hard to distinguish from the audio channel alone.
We propose to improve the classical cepstral classification scheme by using a variability reduction method based on Factor Analysis, which has been successfully used for speaker and language identification.
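To make the frame-level scheme concrete, the following sketch shows a plain per-genre GMM classifier that accumulates frame log-likelihoods. It is only an illustration of the general principle, not the system described in this paper (which uses MAP-adapted UBMs and factor analysis); the genre list and the 256-component setting follow the text, all function names and the data layout are hypothetical.

import numpy as np
from sklearn.mixture import GaussianMixture

GENRES = ["commercial", "news", "music", "sport", "cartoon", "documentary", "movie"]

def train_genre_gmms(train_frames, n_components=256):
    """train_frames: dict mapping genre -> (T, D) array of MFCC frames."""
    gmms = {}
    for genre in GENRES:
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(train_frames[genre])
        gmms[genre] = gmm
    return gmms

def classify(frames, gmms):
    """Average the per-frame log-likelihood under each genre GMM and pick the best."""
    scores = {g: gmm.score_samples(frames).mean() for g, gmm in gmms.items()}
    return max(scores, key=scores.get)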


B. Factor analysis for video genre identification

1) Overview: Gaussian Mixture Model adaptation from a Universal Background Model (GMM-UBM) is the predominant approach in the speaker verification field [14]. In this work, we use it for the by-genre classification of videos: each genre (news, movies, cartoons, music, sports, documentaries or commercials) is modeled by a specific GMM. A world model (UBM) represents the whole acoustic space, while genre-specific GMMs are obtained by adapting the generic UBM. The adaptation technique used is the standard Maximum A Posteriori (MAP) [15], [16]. As in speaker verification, only the mean vectors are adapted; the weights and variances remain the same as in the UBM.
Factor analysis aims to decompose a genre-specific model into three components: a genre- and intervideo-variability-independent component, a genre-dependent component, and an intervideo-variability-dependent component. A GMM mean supervector is defined as the concatenation of the GMM component means. Let D be the dimension of the feature space (39 in our case); the dimension of a mean supervector is MD, where M is the number of Gaussians in the UBM. A genre- and intervideo-variability-independent model is usually estimated to represent the inverse hypothesis: the UBM model. Let this model be parameterized by a mean vector (m), a covariance matrix (Σ) and mixture weights (α). In the following, (h, GE) indicates that the genre of the recording is GE and that its intervideo variability is h. Two recordings with different intervideo-variability values corresponding to the same genre constitute different observations, due to different speakers, different recording materials, different acoustic environments, different kinds of music, etc. As previously explained, such variability must be located and modeled. The factor analysis model can be written as:

m_{(h,GE)} = m + D\,y_{GE} + U\,x_{(h,GE)},     (1)

where m_{(h,GE)} is the random intervideo-variability- and genre-dependent mean supervector (an MD vector), D is an MD x MD diagonal matrix, y_{GE} is the random genre vector (an MD vector), U is the intervideo-variability matrix of low rank R (an MD x R matrix), and x_{(h,GE)} are the channel factors (a random R vector; theoretically x_{(h,GE)} does not depend on GE). Both y_{GE} and x_{(h,GE)} are normally distributed according to N(0, I). D satisfies I = \tau D^t \Sigma^{-1} D, where \tau is the relevance factor required in standard MAP adaptation, DD^t represents the a priori covariance matrix of y_{GE}, and \Sigma is the covariance matrix of the GMM.

2) Classification task: This section details the strategy employed to compensate for the useless variability. The classification task is defined as follows. A genre GE_tar is enrolled in the system with its training data Y_{GE_tar}. The model for the genre GE_tar is:

m_{(h_{tar},GE_{tar})} = m + D\,y_{GE_{tar}}.     (2)

The genre classification task consists in determining if the test frames Y belong to GE_tar or not. Using the factor analysis decomposition on the testing data, one can write:

m_{(h_{test},GE_{test})} = m + D\,y_{GE_{test}} + U\,x_{h_{test}}.     (3)

The genres GE_tar in the training data and GE_test in the testing data have been distinguished. In this work, a hybrid-domain normalization strategy is used, which aims to withdraw the useless component from the test data at the frame level. A frame t is modified as follows:

\hat{t} = t - \sum_{g=1}^{M} \gamma_g(t)\,\{U x_{h_{test}}\}_{[g]},     (4)

where M is the number of Gaussian components in the UBM and γ_g(t) is the a posteriori probability of Gaussian g given the frame t. These probabilities are estimated using the UBM. U x_{h_test} is the supervector with M x D components, and {U x_{h_test}}_[g] is its g-th D-component block.

3) Scoring: The score function is given by:

\mathrm{LLK}(Y \mid m + D\,y_{GE_{tar}}) - \mathrm{LLK}(Y \mid m),     (5)

where LLK(·|·) denotes the average of the log-likelihood function over all frames. Here, the GMMs share their covariance matrix as well as their mixture weights (both dropped from the equation for clarity). The intervideo-variability subtraction on the testing data is performed at the frame level (feature domain).
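The frame-domain compensation of Eq. (4) can be sketched as follows. This is a minimal numpy illustration assuming a diagonal-covariance UBM and an already estimated channel-factor vector x for the test video; the estimation of x and the scoring of Eq. (5), computed on the compensated frames, are not shown, and all names are hypothetical.

import numpy as np

def compensate_frames(frames, ubm_weights, ubm_means, ubm_vars, U, x):
    """Feature-domain nuisance compensation (Eq. 4).
    frames: (T, D) cepstral vectors; ubm_means/ubm_vars: (M, D);
    U: (M*D, R) intervideo-variability matrix; x: (R,) channel factors."""
    M, D = ubm_means.shape
    offset = (U @ x).reshape(M, D)            # {U x}[g] for each Gaussian g

    # log N(t | m_g, Sigma_g) for every frame and Gaussian (diagonal covariance)
    diff = frames[:, None, :] - ubm_means[None, :, :]            # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                   # (T, M)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                                     # posteriors gamma_g(t)

    return frames - gamma @ offset                               # Eq. (4)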



4) Kernel-based scoring and SVM modeling: Using equation (5), the factor analysis model estimates supervectors that contain only genre information, normalized with respect to the useless variability. In [17], the authors proposed a probabilistic distance kernel that computes a distance between GMMs and is well suited to an SVM classifier. Let X_{GE} and X_{GE'} be two sequences of audio data corresponding to genres GE and GE'; the kernel formulation is given below.

K(X_{GE}, X_{GE'}) = \sum_{g=1}^{M} \left( \sqrt{\alpha_g}\,\Sigma_g^{-1/2} m_g^{GE} \right)^t \left( \sqrt{\alpha_g}\,\Sigma_g^{-1/2} m_g^{GE'} \right).     (6)

This kernel is valid when only the means of the GMM models vary (the weights and covariances are taken from the world model). Here m_{GE} is taken from the model in Eq. (2), i.e. m_{GE} = m + D y_{GE}.

5) Protocol and results: All experiments were performed using the ALIZE and LIA SpkDet toolkit (an open-source software package available at http://www.lia.univ-avignon.fr/heberges/ALIZE/) and the LaRank SVM. We used MFCC features extracted with a 25 ms Hamming window. Each frame is composed of 39 coefficients (13 MFCC, 13 ∆MFCC and 13 ∆∆MFCC) computed every 10 ms. A cepstral mean normalization process is applied to each audio recording. The next subsections describe the various systems that we tested in our experiments.

6) GMM-UBM-FA: The Universal Background Model (UBM) is trained with the Expectation-Maximization (EM) algorithm. Given the utterances of a genre, GMM-UBM training is performed by MAP adaptation of the means with a relevance factor of 14. Given the UBM and the utterances of a genre, the Factor Analysis decomposition is performed (equation (1)). The model for the genre GE_tar is given by m_{GE_tar} = m + D y_{GE_tar}. The classification scores are estimated as explained in Section IV-B2.

7) SVM-UBM and FA: A Support Vector Machine (SVM) is a two-class classifier constructed from sums of a kernel function. In order to use an SVM on a multiclass problem, we use the LaRank SVM. The LaRank algorithm is a dual coordinate ascent algorithm relying on a randomized exploration inspired by the perceptron algorithm [Bordes 2005 and 2007]. This approach is competitive with gradient-based optimizers on simple binary and multiclass problems.

8) Results: In the following experiments, we study the impact of Factor Analysis. We use a 256-component GMM, and the rank of the U matrix (the nuisance variability subspace) is set to 40. Previous work has shown that a GMM with 256 Gaussians may capture all the variability of the "video genre" space [18]. The first row of Table I shows the results obtained by the GMM-UBM approach without intervideo-variability compensation. The second row shows the results obtained with the GMM-UBM-FA system (using 256 Gaussians). Performance is strongly improved by FA in comparison to the baseline GMM-UBM system, with a relative error rate reduction of about 70%.
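For illustration, the supervector kernel of Eq. (6) reduces to a weighted inner product between adapted mean supervectors. The following is a minimal sketch under the assumption of diagonal covariances shared with the UBM; it is not the ALIZE/LaRank implementation, and all names are hypothetical.

import numpy as np

def supervector_kernel(means_a, means_b, ubm_weights, ubm_vars):
    """GMM-supervector linear kernel (Eq. 6).
    means_a, means_b: (M, D) adapted means m_g for the two documents/genres;
    ubm_weights: (M,) mixture weights; ubm_vars: (M, D) shared diagonal covariances."""
    scale = np.sqrt(ubm_weights)[:, None] / np.sqrt(ubm_vars)   # sqrt(alpha_g) * Sigma_g^{-1/2}
    return float(np.sum((scale * means_a) * (scale * means_b)))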


TABLE I. Cepstral features for genre classification: confusion rates (%) by genre, with an SVM classifier.

System       Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
GMM-UBM       66    64    66     71     58       76    28     61
GMM-UBM-FA     4    14    16     18     24       55    13     18
SVM-UBM-FA     4    18    14      2     14        9    17     11

TABLE II. Confusion matrix (%) for cepstral coefficients with the SVM-UBM and factor analysis method (SVM-UBM-FA).

System    Doc.  News  Movie  Cartoon  Music  Com.  Sport
Doc.       96    2     0      0        1      1     0
News       13   82     0      1        0      4     0
Movie       2    0    86      0        1     11     0
Cartoon     0    0    12     86        0      2     0
Music       0    0     0      2       98      0     0
Com.        2    0     5      0        2     91     0
Sport       2    5     0      2        4      4    83

For the SVM-UBM-FA system, we observe a CER decrease of about 80% compared to the GMM-UBM system.
Table II reports the confusion matrix of the SVM-UBM-FA system. We observe that the system correctly classifies the documentary, music and commercial classes, with 4%, 2% and 9% CER respectively. However, news, movies, cartoons and sports obtain worse results. The most frequent confusions concern classes that are naturally similar, like documentaries and news, or cartoons and movies. Nevertheless, the gap between the classes is significantly smaller than the one observed with the baseline GMM-UBM approach, all the scores lying in the range of 82% to 98%.

9) Conclusion: Factor analysis for speaker identification has recently become a standard in state-of-the-art systems. Nevertheless, the task of VGI is very different: contrary to classical speaker identification, the number of classes is very small (typically 5 to 10) while the intra-class variability is very high. Our experiments demonstrated that the FA decomposition model matches the VGI problem: the CER is reduced by about 80% with respect to the standard approach based on GMM-UBM and MFCC features. These results confirm and extend previous results on FA for categorization: FA performs not only a reduction of the supervector-space dimensionality, but also an extraction of the relevant subspace. This extraction is supervised and it contributes to the categorization. In comparison to standard data analysis techniques for dimensionality reduction (such as PCA- or SVD-based approaches), it improves the data representation by taking benefit from prior labels related to known variabilities.

V. HIGH-LEVEL ACOUSTIC FEATURES

The first part of this paper demonstrated that cepstral descriptors carry relevant information about the video genre. Nevertheless, this remains very low-level information, and higher-level descriptors could provide other views of the documents. These features could rely on the document structure or on its contents. The next two subsections investigate features representing the structure of the documents, namely structures related to the interactivity level and to the quality of the spoken contents. For each of these two categories, a feature group is extracted and evaluated on our test set.



TABLE III. Interactivity features for genre classification: Classification Error Rates (CER) (%) by genre with an SVM classifier.

System   Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
Int.      28    28    23     31     72       14    87     38

TABLE IV. Confusion matrix (%) for interactivity features and an SVM classifier.

System    Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.       72    5     1      1      1        0     0
News        3   72     0      3     11        5     6
Movie       0    3    77      0      9       11     0
Music      22    0     0     69      9        0     0
Cartoon     0    3    46      0     28       23     0
Com.        0    4     9      0      0       86     1
Sport       0   31    20      0      7       29    13

A. Interactivity Features

The number of speakers and the way they interact may differ according to the genre. For example, there is usually only one main speaker in news, whereas cartoons and movies generally contain many speakers, with highly variable speaking times and speaker turns. The interactivity features aim to represent these speaker-related profiles. The feature vector is composed of the following three parameters: the density of speaker turns, the number of speakers, and the speaking time of the main speaker.
These data are extracted by a speaker diarization system based on a three-stage segmentation and clustering process. The first stage performs a Viterbi segmentation based on the three following classes: speech, speech over music, and music. Each of them is modeled by a 64-component GMM. Acoustic vectors are composed of 12 MFCC coefficients, their first and second order derivatives, and the first and second derivatives of the energy. This system is fully described and evaluated in [19]. The last two stages perform speaker turn detection and clustering; we used the system described in [20], based on the Bayesian Information Criterion. This technique allows the estimation of the number of speakers and of the speaker turns for each document.
The three interactivity features compose a vector that is submitted to an SVM classifier for genre identification. The results reported in Table III show that interactivity is clearly less accurate than cepstral-level information: the cepstral-feature system obtains 11% CER, whereas the interactivity-feature system obtains 38% CER. We observe, in Table IV, that the error distribution is quite different from the one obtained by cepstral classification: the most frequent confusions concern cartoons (28% accuracy) and sports (13% accuracy), whereas news is the most confusable class in the cepstral domain (82% accuracy). These qualitative differences match our intuitive expectations: structural information is related to the global organization of the document, which is clearly specific to news but probably irrelevant for editorial styles that are loosely constrained.
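As a minimal sketch, the three interactivity descriptors could be derived from a diarization output as follows; the exact normalizations (turns per second, main-speaker time ratio) are assumptions of this illustration, not necessarily those of the system described above.

from collections import Counter

def interactivity_features(segments, total_duration):
    """segments: diarization output as a list of (speaker_id, start_sec, end_sec).
    Returns [speaker-turn density, number of speakers, main-speaker time ratio]."""
    n_turns = len(segments)
    speaking_time = Counter()
    for speaker, start, end in segments:
        speaking_time[speaker] += end - start

    n_speakers = len(speaking_time)
    turn_density = n_turns / total_duration                      # turns per second
    main_ratio = (max(speaking_time.values()) / total_duration
                  if speaking_time else 0.0)
    return [turn_density, n_speakers, main_ratio]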

TABLE V. Speech quality for genre classification: Classification Error Rates (CER) (%) by genre with an SVM classifier.

System   Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
Q         22    21    52     30     50       71    24     39

TABLE VI. Confusion matrix (%) for speech quality features.

System    Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.       78    8     1      1      9        2     1
News       12   79     1      0      1        7     0
Movie       4    0    48     21     21        6     0
Music       0    0     7     70      0       21     2
Cartoon    17    3    14     16     50        0     0
Com.        7    5     5     41      8       29     5
Sport       0    0     2     11      0       11    76

B. Speech quality

The basic idea is that speech quality could provide relevant information about genres. For example, speech is usually clean in news, whose linguistic domain is well covered by speech recognition systems, whereas the linguistic domain of commercials may be unexpected, due to the different speaking styles that reflect the wide variety of products being advertised.
We use three features in this group, all based on the LIA broadcast news transcription system SPEERAL [21]. This system is based on an A* decoder using state-dependent hidden Markov models for acoustic modeling. The baseline Language Model (LM) is a 65k-word broadcast news 3-gram, estimated on 200M words from the French newspaper "Le Monde" and on the ESTER broadcast news training corpus of about 1M words. On a sample set composed of 100 randomly chosen videos, the ASR system produces transcriptions with a WER of 63.7%. Such a high WER is expected on web videos, and it may significantly impact any approach based on in-depth analysis of the semantic contents. Consequently, we chose descriptors that should be relatively robust to ASR errors.
The first descriptor is the posterior probability of the best hypothesis. We use posteriors as a confidence measure that integrates not only the acoustic and linguistic scores, but also some information related to the part of the decoding graph that was effectively developed by the search algorithm. The second descriptor is the linguistic probability of the best hypothesis, with language models trained on material extracted from gold transcriptions of French broadcast news and newspapers. The last feature is based on phonetic entropy. This descriptor was introduced by [22] for speech/music separation. The entropy H(n) is computed as the entropy of the acoustic probabilities:

H(n) = -\frac{1}{N}\sum_{m=1}^{N}\sum_{k=1}^{K} P(q_k \mid x_m)\,\log_2 P(q_k \mid x_m),     (7)

where the frame values x_m are averaged over a temporal window of size N, K is the number of phonetic models, and q_k denotes the phonetic units resulting from the ASR run. This measure is expected to be high on low-quality speech and to decrease on clean speech.
Error rates of the speech quality features are close to those of the interactivity features: speech quality yields about 39% CER, whereas interactivity features yield an error rate of about 38%. The distribution of errors is slightly different, as shown in Table VI: the best classes are news and documentaries, both of which usually contain speech that corresponds to the training conditions of the ASR system.
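The phonetic entropy of Eq. (7) is straightforward to compute once per-frame phonetic posteriors are available from the decoder; a small numpy sketch (the posterior extraction from SPEERAL itself is not shown, and the function name is hypothetical):

import numpy as np

def phonetic_entropy(posteriors):
    """Phonetic entropy of Eq. (7). `posteriors` is an (N, K) array of
    P(q_k | x_m) over K phonetic models for the N frames of a window;
    rows are assumed to be normalized probability distributions."""
    p = np.clip(posteriors, 1e-12, 1.0)               # avoid log2(0)
    frame_entropy = -np.sum(p * np.log2(p), axis=1)   # entropy of each frame
    return frame_entropy.mean()                       # average over the window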



VI. LINGUISTICS OF GENRE

A. Overview

In the field of automatic text categorization, the proposed approaches usually rely on extracting meaningful words from the documents to classify. For video genre classification, most of the linguistic-level methods rely on this principle: relevant words are extracted from the available textual data (video metadata such as closed captions, tags, etc.), either by removing stopwords (usually the N most frequent words) or by using a word scoring metric like term frequency-inverse document frequency (TF-IDF) ranking. The approaches proposed in [23], [24], [25] follow this usual method and will be our baseline.
The TF-IDF-based approaches capture the topics present in each video genre by implicitly modeling its semantic field. Classifying genres like news or documentaries with this approach is not robust, since the topics are frequently unexpected. What we call video genre refers more to the editorial style of the videos than to the topics they cover or, more generally, to their semantic contents. We think that the editorial style can be characterized by analyzing the stopwords used. The use of stopword statistics for extracting relevant information is the exact opposite of the classical approaches, where stopwords are removed from the text features [26].
The videos that we propose to classify do not have any associated metadata (closed captions, etc.). To extract linguistic information, we use an ASR system to produce a transcription from the audio channel of the videos. This ASR system uses a closed lexicon and a trigram LM that is classically estimated on a huge amount of textual data. We should ideally build such a model for each video genre in order to ensure the best recognition performance; however, we do not have enough relevant data for each video genre. Thus we use a standard language model to transcribe all the videos. The lexical coverage is therefore weak and the transcription contains errors, especially on infrequent words, which are also often meaningful and highly rated by TF-IDF-like metrics.
Document modeling relies on the classical bag-of-words model [27]. In this approach, each dimension of the feature space represents a word; documents are then represented as word frequency vectors. We use the classifier architecture proposed in [28], which is composed of two levels of classifiers: low-level genre-dependent classifiers, which pre-process specific groups of features, and a top-level classifier that makes the final decision. This meta-classifier operates on the outputs of the low-level classifiers.

B. The Baseline TF-IDF Features

Our baseline consists in extracting keywords from the videos by using the TF-IDF metric:

\mathrm{TFIDF}(d, w) = \mathrm{TF}(d, w) \times \mathrm{IDF}(w),     (8)

where TF(d, w) is the frequency of the word w in the document d and IDF(w) represents the discriminative strength of the word w. TF(d, w) is defined as follows:

\mathrm{TF}(d, w) = \frac{n_{d,w}}{\sum_{k \in d} n_{d,k}},     (9)

where n_{d,w} is the number of occurrences of the word w in the document d. IDF(w) is defined as follows:

\mathrm{IDF}(w) = \log\left(\frac{N}{\mathrm{DF}(w)}\right),     (10)

where N is the total number of documents in the database and DF(w) is the number of documents that contain the word w. The higher the TF-IDF value, the more representative the word is of the document. Words with a high TF-IDF value are generally meaningful, topic-bearing words. The feature space is made of the n best TF-IDF-ranked words of each genre, found in the transcriptions of the training documents. A document is thus represented in this space by a word frequency vector Vg. This vector is used in the low-level classifier as the only representation of the document.

C. The Stopword Features

Unlike in the classical TF-IDF approach, we propose to use the stopwords as features. These words are characterized by a high frequency and by the fact that they are not meaningful. They generally exhibit a low TF-IDF value, because they are present in all the documents. These words include function words, articles, pronouns, etc. In our experiments, we defined the stopwords as the N most frequent words found in the automatic transcriptions of the training corpus. The nine most frequent words found in the subsequent experiments are shown in Table VII. We think that the frequency of these stopwords is characteristic of the video genre. Unlike the classical TF-IDF approach, the proposed method is topic-independent, since the stopwords are not topic-related. The feature space is made of the n most frequent words in the transcriptions of the training corpus. A document is thus represented in this space by a word frequency vector Vs. This vector is used in the low-level classifier as the only representation of the document.

D. The Linguistic Features Evaluation

Because the transcripts of the videos are obtained with an automatic speech recognition engine, they contain an additional "word" that represents silences in the speech signal. By convention, it is represented by a dedicated silence token. This special word gives us information related to the dynamics of the speech. We included it as a regular word in the features presented in Sections VI-B and VI-C. It is important to note that, from the ASR system's point of view, this token is a standard word representing a "long" silence, which is modeled similarly to other words and may be freely hypothesized by the system, without any specific constraints. As shown in Table VII, this special word is the most frequent word of the training corpus.
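A minimal sketch of the stopword feature extraction described above (vocabulary selection on the training transcriptions, then a per-document frequency vector); variable names and the whitespace tokenization are assumptions of this illustration.

from collections import Counter

def stopword_vocabulary(transcripts, n_words=100):
    """Pick the n most frequent tokens of the training transcriptions."""
    counts = Counter(tok for doc in transcripts for tok in doc.split())
    return [w for w, _ in counts.most_common(n_words)]

def stopword_vector(transcript, vocabulary, normalize=False):
    """Frequency vector of the selected stopwords for one document
    (raw counts or counts normalized by the document length)."""
    counts = Counter(transcript.split())
    vec = [counts[w] for w in vocabulary]
    if normalize:
        total = sum(counts.values()) or 1
        vec = [c / total for c in vec]
    return vec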




Concerning the TF-IDF feature extraction, the number of best TF-IDF-ranked words to be used as input features for the low-level classifier was set by trying almost all possible values, from 1 to the total number of different words in the whole corpus. We found that the optimal value in our case is about 6,000. We tested two types of low-level classifiers: Boosting [29] and Artificial Neural Networks (ANN) [30]. The word frequencies in the feature vectors of each document are normalized with respect to the size of the document, and the total number of words in the document is added as a feature. The Boosting classifier obtains the best results: it achieves a 27.9% CER. This result constitutes our baseline and is represented by the horizontal line in Figure 2.
For the stopword features, the classification error rates of the two best classifier types are presented in Figure 2. These error rates depend on the number of most frequent words used as input features, from 1 to 1000. Beyond 1000 words, the performance of the classifiers slightly decreases, because of the noise. Note that the ANN is a multi-layer perceptron with one hidden layer; the size of the hidden layer is optimized on the training data. The presented results of the ANN stopword classifier are obtained by using raw frequencies in the feature vectors, which proved to work better than normalized frequencies.
It is worth noting that classifier performance improves quickly with the number of stopword features. Results comparable to those of the baseline (6,000 feature words) are obtained with only 23 feature words. The best CER is 19.6%, obtained with the ANN classifier using the 100 most frequent words. With only the most frequent token (the silence token) as input, the best classifier achieves a 51.4% CER; adding the second word gives a 46.1% CER. The performance gain of adding words follows an inverse logarithmic law: the less frequent the added words are, the lower the gain. We can conclude that the more frequent a word is, the more important it is for classifying genres. Table VII lists the nine most frequent words in the training corpus with their frequencies.
We also tried to model n-grams instead of words in the proposed approach. Supposedly, this would lead to better results by improving the modeling precision of the linguistic structures. However, we actually observed a systematic decrease of performance when increasing the n-gram order. This is probably due to the fact that high-order n-grams are more discriminant than low-order ones and capture more meaningful words, which makes them more context- and content-dependent, whereas the strength of the proposed approach is precisely its independence from the context.
These results validate our initial hypothesis: the stopword frequencies contain information that is characteristic of the video genre. Moreover, the proposed approach yields a CER gain of about 8 points with respect to the TF-IDF baseline, while the feature space is reduced by 98%. In the next section we present our experiments on combining these features with several other audio features.
The results reported in Table VIII are obtained with the stopword features described in Section VI-C, using the 100 most frequent words and an ANN classifier. The linguistic features are relatively accurate for genre identification, and some classes are especially well recognized:



Fig. 2. Classification error rates (CER) (%) of the ANN and Boosting classifiers using the stopword features, according to the number of most frequent words used (log scale). The horizontal line is the baseline: the best TF-IDF Boosting classifier (6,000 words).

TABLE VII. Frequency of the nine most frequent tokens found in the automatic transcriptions of the training corpus.

Word              Frequency
(silence token)    146100
de                  20093
les                 12526
et                  12236
le                  10961
la                  10819
est                  9385
des                  8682
il                   7628

the system scores 93% on documentaries, 85% on news and 93% on cartoons (3%, 15% and 7% CER respectively). These features outperform the high-level audio descriptors presented in Section V, but remain less accurate than the cepstral model presented in Section IV.

VII. VIDEO FEATURES

In order to evaluate the audio and video complementarity, we implemented a standard video-based VGI system, which is evaluated alone in this section; the next section focuses on audio-video combination.
Video analysis is an important component of multimedia content classification systems [31]. The various approaches and techniques that have been proposed can be classified into four generic categories.
1) Shot segmentation [32]: a shot is the sequence of contiguous frames that originate from the same camera take. The boundary between two shots may be a "hard cut" (which is generally easy to detect) or a "gradual transition" (which requires more sophisticated processing).
2) Visual feature computation [32]: color, texture and shape are visual characteristics that can describe properties of a visual content. The color histogram is a simple, yet popular, representation, which is sometimes combined with a grid decomposition of the image. Texture can be analyzed through Gabor filters, wavelets, or edges.
3) Background/foreground separation [33]: optical flow is a popular method to describe the movement and to separate the foreground objects from the background.



TABLE VIII. Classification Error Rates (CER) (%) with the stopword linguistic features and the ANN classifier.

System   Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
L         7     15    33     30     7        25    30     20

TABLE IX. Confusion matrix (%) for linguistic features.

System    Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.       93    1     1      1      1        3     0
News        8   85     4      1      0        2     0
Movie       0    1    67     14      1       15     2
Music       0    0     5     70      2       17     6
Cartoon     1    0     1      5     93        0     0
Com.        0    0    14      5      0       75     6
Sport       0    4     4      4      2       16    70

TABLE X. Video features for genre classification: Classification Error Rates (CER) (%) by genre with an SVM classifier.

System   Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
Video     25    55    27     35     42       48    15     36

While this process is easy when the camera is fixed, it is more difficult when the camera is moving, since its movements have to be estimated first.
4) Specific detectors [34]: some approaches focus on the detection and recognition of specific objects, such as faces, people, signs, etc. Specific techniques such as eigenfaces can be used efficiently, sometimes involving large reference databases.
In addition, various types of classifiers and models can be used to combine this information for classification tasks. For example, in [35] the authors exploit the background and foreground motion to classify videos into 3 genres: sports, cartoons, and news. In [33], static features (color histograms) and dynamic features (motion information) are used to classify videos into cartoons, sports, commercials and news with a Hidden Markov Model (HMM).
This paper focuses on audio features, and our objective in this section is to estimate the complementarity of the audio and video modalities. We therefore extracted standard (and relatively simple) video features, which are described in the next section.

A. Our visual features

Visual features are usually classified into global features and local features. Global features consider the characteristics of a whole video frame by computing statistics of image pixels or patches, such as color histograms, edge histograms, etc. Several such features have been standardized in the MPEG-7 standard [?]. The GIST descriptor [?] has been shown to be efficient for large-scale image retrieval. More recently, local features have been proposed and have shown good performance. They rely on a specific analysis of image patches, such as SIFT [?], SURF [?] or HOG, to compute local descriptors which are pooled over the image to build a feature vector. A common pooling technique is the bag-of-visual-words model [?], which computes a histogram of quantized descriptors, while more advanced techniques such as Fisher vectors [?] use higher-order statistics of the descriptor distribution. The latest research results suggest that Deep Networks [?] are capable of even better performance on recognition tasks. In this paper, since video is not the main focus, we rely on simple global features and use the 4 following ones: color moments, wavelet features, edge histograms and local binary patterns.

TABLE XI. Confusion matrix (%) of video features.

System    Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.       75    4     0     11      9        1     0
News       23   45     3      4      8       12     5
Movie       2    0    73     20      1        4     0
Music      20    2     9     65      0        4     0
Cartoon    25    1     0      7     58        9     0
Com.       18    3     7     18      1       52     1
Sport       0    6     2      6      0        2    85

Color moments provide a measurement of color similarity among images, assuming the color in the image follows a probability distribution. In [36], the authors use three central moments for each channel of the color image: mean, standard deviation and skewness. We split each image into a 5x5 grid and, for each region, compute these moments for the three Lab channels, producing a 225-dimensional vector.
Wavelet features [37] provide a representation of the image texture from the energies of wavelet subbands. We split each image into a 3x3 grid and use the Haar wavelet to get 9 wavelet coefficients for each region, producing an 81-dimensional vector.
The MPEG-7 Edge Descriptor uses 4 directional filters (vertical, horizontal, 45-degree and 135-degree edges) and one non-directional filter [38]. We compute the histograms of these edges on a 4x4 grid decomposition of the image, leading to an 80-dimensional vector.
Local Binary Patterns (LBP) [39], [40] are recognized as a powerful feature in the domain of texture classification. LBP describes the surroundings of a pixel by building a bit code from the binary derivatives of the pixel, which is a gray-scale invariant texture measure. LBP can describe the formation of a texture at both macroscopic and microscopic scales. We concatenate the LBP computed at three different levels to obtain a 54-dimensional vector.
The above features are computed for each video frame and provide a diverse representation of the visual content, to be combined with the audio features. A small sketch of the simplest of these descriptors, the color moments, is given at the end of this section.

B. Results

Following the experimental scheme used for the audio feature evaluation, the video descriptors are grouped into feature vectors that are submitted to an SVM classifier. Table X reports results that are globally close to the ones obtained with the high-level audio features, even if, as expected, the confusion matrix is quite different: the best classes are sports, documentaries and movies, all of which have clear visual specificities. On the other hand, some classes seem difficult to distinguish: news is frequently confused with documentaries and commercials, which confirms our intuition about their visual similarities.
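As announced above, here is a minimal sketch of the color-moment descriptor (5x5 grid, three moments per Lab channel, 225 dimensions); the Lab conversion and the frame decoding are assumed to be done elsewhere, and the function name is hypothetical.

import numpy as np

def color_moments(lab_image, grid=5):
    """Mean, standard deviation and skewness of each Lab channel over a
    grid x grid decomposition of the frame. `lab_image` is an (H, W, 3)
    array already converted to the Lab color space; for grid=5 the result
    is a grid*grid*3*3 = 225-dimensional vector."""
    h, w, _ = lab_image.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = lab_image[i * h // grid:(i + 1) * h // grid,
                             j * w // grid:(j + 1) * w // grid, :]
            for c in range(3):
                vals = cell[:, :, c].ravel().astype(float)
                mu = vals.mean()
                sigma = vals.std()
                skew = (((vals - mu) ** 3).mean() / sigma ** 3) if sigma > 0 else 0.0
                feats.extend([mu, sigma, skew])
    return np.asarray(feats)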

VIII. COMBINING AUDIO AND VIDEO FEATURES

This section first presents the full audio-only system that integrates all audio features. In the second part, video features are integrated into the overall system, and particularities and complementarity are discussed.



TABLE XII. Classification Error Rates (CER) (%) for the combined audio features: cepstral features (AS), linguistic (L), interactivity (Int), speech quality (Q).

System        Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
AS             4     18    14     14     2        9     17     11
AS+L           3     13    11      7     2        8     19      9
AS+L+Int       3     12    11      7     3        8     19      9
AS+L+Int+Q     4     14     7      2     5        6     18      8

In order to combine all the previously described audio features, we group the corresponding scores into a single vector of 20 coefficients. The first seven coefficients are the outputs of the 7 binary SVM classifiers on cepstral features. The next 7 coefficients are the outputs of the linguistic classifier (the seven output cells of the neural network that obtained the best results for linguistic-based categorization). The last 6 coefficients are the 3 interactivity features and the 3 speech-quality features. We then train a linear-kernel SVM model on these vectors.
Due to the lack of training data, the SVM models are trained with a leave-one-fold-out strategy: we use a 6-fold scheme on the training corpus, where 5 parts are used to train the genre-dependent models and the remaining part is used to train the meta-classifier operating on the combined vectors. This strategy allows us to train the final classifier with a significant amount of data, at the cost of a 17% reduction of the training set size for the single feature-group models.
In order to estimate the feature complementarities, the combination is performed step by step: starting from the best individual feature group (cepstral descriptors), we successively add the best remaining groups: linguistic, interactivity, and speech quality. We finally add the video features to perform a final evaluation, comparison and combination of the audio-only and video descriptors.

A. Results

The results of the global combination are reported in Table XII. We observe an accuracy gain of 3 points compared to the best single-feature system, which is based on cepstral descriptors. These results show that all the proposed features are globally complementary and relevant for genre classification, except for interactivity, which does not provide any significant additional gain over cepstral and linguistic features. The system correctly classifies the documentary, movie, cartoon, music and commercial classes, but the news and sports classes obtain worse results. Table XIII shows that the news class is frequently confused with documentaries. Results for the sports class are probably affected by a large intra-class variability, this class grouping various sources (car racing, football, etc.); a larger training set could improve recognition rates for this class.
In spite of the limited performance of our video features, their integration into the overall system significantly improves performance: the gain is about 25% (from 8% to 6% CER), confirming the assumption of audio-video complementarity. Sports and news remain the most difficult classes: news is frequently recognized as documentaries, and sports videos seem confusable with all classes except documentaries, probably due to a weakness of the sports model.
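A minimal sketch of the score-level fusion described at the beginning of this section: the 20-coefficient vector and a linear-kernel SVM meta-classifier. Scikit-learn's SVC is used here as a stand-in (the paper's first-level classifiers use LaRank), and all names and shapes are assumptions.

import numpy as np
from sklearn.svm import SVC

def fused_vector(cepstral_scores, linguistic_scores, interactivity, quality):
    """Concatenate the first-level outputs into the 20-coefficient vector:
    7 cepstral SVM scores, 7 linguistic ANN outputs, 3 interactivity
    features and 3 speech-quality features."""
    return np.concatenate([cepstral_scores, linguistic_scores,
                           interactivity, quality])

def train_meta_classifier(X_meta, y_meta):
    """X_meta: (n_videos, 20) fused vectors, y_meta: genre labels."""
    clf = SVC(kernel="linear")
    clf.fit(X_meta, y_meta)
    return clf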


TABLE XIII. Confusion matrix (%) for the combined audio features.

System     Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.        96    0     0      2      1        0     0
News        10   86     0      0      0        4     0
Movies       0    0    93      4      0        3     0
Music        0    0     2     98      0        0     0
Cartoons     0    0     2      3     95        0     0
Com.         0    0     2      4      0       94     0
Sports       0    5     0      3      0       10    82

TABLE XIV. Classification Error Rates (CER) (%) of the audio, video and combined (audio+video) systems.

System                Doc.  News  Movie  Music  Cartoon  Com.  Sport  Total
Audio (AS+L+Int+Q)     4     14    7      2      5        6     18     8
Video                 25     55   27     35     42       48     15    36
Audio+Video            3     15    4      2      2        6     17     6

IX. CONCLUSION

We presented our recent experiments on audio-based genre identification of videos. The first contribution concerns categorization in the cepstral domain, which is the most popular approach for audio-only VGI. We demonstrated that variability reduction by factor analysis dramatically improves the classifier accuracy: by integrating various cepstral features in a factor-analyzed classifier, the CER decreases from 48% (baseline MFCC/GMM performance) to 11%.
Then, we evaluated higher-level descriptors based on speech analysis and proposed to integrate them into a multi-view classifier. Although these features alone are significantly less accurate than the cepstral descriptors, they carry some complementary information which enhances the fully featured audio-only system, reducing the CER to 8%.
Automatic extraction of linguistic features usually depends strongly on ASR performance, especially on the lexical coverage, which may be critical in such an open-domain task. We proposed to characterize the linguistics of genre by using statistics on the most frequent words of the targeted language. These words are supposed to be more specific to the editorial style than to the topics or the semantic contents. Experiments confirm this assumption.
We finally obtained a system that combines cepstral, structural and linguistic features. On our 7-genre task, it reaches 8% CER, significantly outperforming previous audio-only VGI systems evaluated on similar tasks. These experiments validate the idea that, although most VGI methods rely on picture and motion analysis, the audio channel carries information which is highly specific to the video genre. One of the main difficulties in exploiting this source for VGI is the distribution of the useful information over various levels of representation, from short-term acoustic patterns to linguistics. Each level requires a specific strategy to be efficiently integrated into an overall classifier.
Perspectives for improvement are related to experimental developments, to training strategies, and to the combination of state-of-the-art video-based VGI systems with our audio-based one. We now plan to apply the proposed method to a larger variety of genres, in order to handle highly heterogeneous video streams.
Concerning the training strategies, this paper proposed to extract complementary views of documents.



TABLE XV. Confusion matrix (%) for the combined audio-video features.

System     Doc.  News  Movie  Music  Cartoon  Com.  Sport
Doc.        97    0     0      3      0        0     0
News        10   85     0      1      0        4     0
Movies       0    0    96      3      0        1     0
Music        1    0     0     98      1        0     0
Cartoons     0    0     2      0     98        0     0
Com.         1    0     4      0      1       94     0
Sports       0    4     2      0      4        7    83

This complementarity could be exploited during the training process to improve each feature-group classifier. Co-training methods could be highly profitable in such a multi-view context. Such collaborative training methods could rely on the proposed audio-based techniques and on improved video features, the ones we used in this work being only simple global features. Finally, the World Wide Web offers opportunities for collecting data of various genres, with metadata and textual information attached to the videos, such as comments, which could be helpful in identifying the video genre.

REFERENCES

[1] L.-Q. Xu, M. Roach, and J. Mason, "Classification of non-edited broadcast video using holistic low-level features," in IWDC'2002, 2002.
[2] A. F. Smeaton, P. Wilkins, M. Worring, O. de Rooij, T.-S. Chua, and H. Luan, "Content-based video retrieval: Three example systems from TRECVid," International Journal of Imaging Systems and Technology, vol. 18, no. 2-3, pp. 195-201, 2008.
[3] V. Suresh, C. K. Mohan, R. K. Swamy, and B. Yegnanarayana, "Content-based video classification using support vector machines," in Neural Information Processing. Springer, 2004, pp. 726-731.
[4] Z. Rasheed and M. Shah, "Movie genre classification by exploiting audio-visual features of previews," vol. 2. Los Alamitos, CA, USA: IEEE Computer Society, 2002, p. 21086.
[5] W. Zhu, C. Toklu, and S.-P. Liou, "Automatic news video segmentation and categorization based on closed-captioned text," in Multimedia and Expo, ICME, 2001.
[6] C. Zieger, "An HMM based system for acoustic event detection," in Multimodal Technologies for Perception of Humans, 2008, vol. 4625, pp. 338-344.
[7] R. Jasinchi and J. Louie, "Automatic TV program genre classification based on audio patterns," in Euromicro Conference, 2001.
[8] L.-Q. Xu and Y. Li, "Video classification using spatial-temporal features and PCA," in Multimedia and Expo, ICME, 2003.
[9] S. Moncrieff, S. Venkatesh, and C. Dorai, "Horror film genre typing and scene labeling via audio analysis," in Multimedia and Expo, ICME, 2003.
[10] P. Roach, S. Arnfield, W. Barry, J. Baltova, M. Boldea, K. Marasek, A. Marchal, E. Meister, and K. Vicsi, "BABEL: An eastern European multi-language database," in International Conference on Spoken Language Processing, Interspeech, vol. 3, Philadelphia, PA, Oct. 1996, pp. 1892-1893.
[11] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. Jones, "Overview of MediaEval 2011 rich speech retrieval task and genre tagging task," in MediaEval 2011, 1-2 Sept. 2011, Pisa, Italy, 2011.
[12] M. Rouvier and G. Linarès, "LIA @ MediaEval 2011: Compact representation of heterogeneous descriptors for video genre classification," in MediaEval 2011, 1-2 Sept. 2011, Pisa, Italy, 2011.
[13] B. Merialdo and U. Niaz, "Uploader models for video concept detection," in Content-Based Multimedia Indexing (CBMI), June 2014, pp. 1-4.


[14] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, special issue on biometric signal processing, 2004.
[15] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291–298, 1994.
[16] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, 2000.
[17] W. Campbell, D. Sturim, and D. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, 2006.
[18] M. Rouvier, D. Matrouf, and G. Linarès, "Factor analysis for audio-based video genre classification," in International Conference on Speech Communication and Technology, Interspeech, 2009.
[19] N. Scheffer, C. Fredouille, J.-F. Bonastre, and D. Istrate, "Broadcast news speaker tracking for ESTER 2005 campaign," in International Conference on Speech Communication and Technology, Interspeech, 2005.
[20] X. Zhu, C. Barras, S. Meignier, and J.-L. Gauvain, "Combining speaker identification and BIC for speaker diarization," in International Conference on Speech Communication and Technology, Interspeech, 2005.
[21] G. Linarès, P. Nocéra, D. Massonie, and D. Matrouf, "The LIA speech recognition system: From 10xRT to 1xRT," in Lecture Notes in Computer Science, 2007.
[22] G. Williams and D. P. W. Ellis, "Speech/music discrimination based on posterior probability features," in European Conference on Speech Communication and Technology, Interspeech, 1999.
[23] D. Brezeale and D. Cook, "Using closed captions and visual features to classify movies by genre," in Proc. MDM/KDD, 2006.
[24] W. Lin and A. Hauptmann, "News video classification using SVM-based multimodal classifiers and combination strategies," in Proc. ICM, 2002, pp. 323–326.
[25] T. Tokunaga and I. Makoto, "Text categorization based on weighted inverse document frequency," SIG-IPSJ, pp. 33–39, 1994.
[26] C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[27] G. Forman, "An extensive empirical study of feature selection metrics for text classification," The Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[28] M. Rouvier, G. Linarès, and D. Matrouf, "Robust audio-based classification of video genre," in International Conference on Speech Communication and Technology, Interspeech, 2009.
[29] B. Favre, D. Hakkani-Tür, and S. Cuendet, "Icsiboost," http://code.google.com/p/icsiboost, 2007.
[30] S. Nissen, "Implementation of a Fast Artificial Neural Network Library," http://leenissen.dk/fann/wp/, 2003.
[31] D. Brezeale and D. J. Cook, "Automatic video classification: A survey of the literature," in Systems, Man, and Cybernetics, 2008.
[32] H. K. Ekenel, T. Semela, and R. Stiefelhagen, "Content-based video genre classification using multiple cues," in Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production (AIEMPro '10). New York, NY, USA: ACM, 2010, pp. 21–26.
[33] M. K. Geetha and S. Palanivel, "HMM based automatic video classification using static and dynamic features," in Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), vol. 3. Washington, DC, USA: IEEE Computer Society, 2007, pp. 277–281.


[34] N. Dimitrova, L. Agnihotri, and G. Wei, "Video classification based on HMM using text and faces," in European Signal Processing Conference, 2000.
[35] M. J. Roach, J. D. Mason, and M. Pawlewski, "Video genre classification using dynamics," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2001, pp. 1557–1560.
[36] M. A. Stricker and M. Orengo, "Similarity of color images," in Proc. SPIE Conference Series, W. Niblack and R. C. Jain, Eds., vol. 2420, Mar. 1995, pp. 381–392.
[37] K. Huang and S. Aviyente, "Wavelet feature selection for image classification," IEEE Transactions on Image Processing, vol. 17, pp. 1709–1720, Sep. 2008.
[38] D. K. Park, Y. S. Jeon, and C. S. Won, "Efficient use of local edge histogram descriptor," in Proceedings of the 2000 ACM Workshops on Multimedia, MULTIMEDIA '00, 2000, pp. 51–54.
[39] T. Mäenpää, "The local binary pattern approach to texture analysis: extensions and applications," 2003.
[40] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.


Georges Linarès graduated in Mathematics in 1995. He defended a Ph.D. in the field of neural networks for acoustic processing in 1998 and joined the Speech Processing group of the LIA, first as an associate professor and, since 2011, as a full professor. His main research interests are related to speech recognition and audio indexing: acoustic modeling, search strategies, language models, multimedia categorization, etc. He was involved in many industrial and collaborative projects, at both European and national levels. He has published about 120 articles in the major conferences and journals of the field and has supervised 11 Ph.D. students. He participated in the organization of conferences such as Acoustics'08 (ASA, Paris), EACL 2012 (ACL, Avignon), and Interspeech 2013 (ISCA, Lyon). He leads the Neurocomputation and Language Processing department of the Labex (Laboratoire d'excellence) Brain and Language Research Institute, and has headed the computer science laboratory of the University of Avignon (LIA) since 2010.


Mickael Rouvier received the M.S. and Ph.D. degrees in computer science from the University of Avignon, Avignon, France, in 2008 and 2012, respectively. He is currently a post-doctoral associate with the Traitement Automatique du Langage Écrit et Parlé group at LIF, University of Marseille, France. His research interests are in machine learning approaches applied to speech and speaker modeling.

Stanislas Oger holds a Ph.D. in computer science from the University of Avignon, defended in 2011. He is mainly interested in spoken language processing, natural language processing in general, and language modeling in particular. During his Ph.D. he designed a new kind of language model, based on Possibility Theory, which is particularly suited to exploiting the huge amount of textual data available on the Web for improving language modeling.

Driss Matrouf received the Ph.D. degree in noisy speech recognition from the LIMSI laboratory, Paris IX University, Paris, France, in 1997. He then joined the University of Avignon (LIA), Avignon, France, as an associate professor. His research interests include speech recognition, language recognition, and speaker recognition, and currently concentrate on session and channel compensation for speech and speaker recognition. In parallel with these research activities, he teaches at the LIA in fields covering computer science, speech coding, and information theory.

Bernard Merialdo is a professor in the Multimedia Department of EURECOM, France, and the current head of the department. A former student of the École Normale Supérieure, Paris, he received a Ph.D. from Paris 6 University and an Habilitation à Diriger des Recherches from Paris 7 University. For more than 10 years, he was a research staff member, then project manager, at the IBM France Scientific Center, working on probabilistic techniques for large-vocabulary speech recognition. He later joined EURECOM to set up the Multimedia Department. His research interests are the analysis, processing, indexing, and filtering of multimedia information to solve user-related tasks. His research covers a whole range of problems, from content extraction based on recognition techniques, content understanding based on parsing, and multimedia content description languages (MPEG-7), to similarity computation for applications such as information retrieval, user personalization, and user interaction for the design of innovative applications. He participates in numerous conference program committees and is part of the organizing committee for the CBMI workshop series. He was an editor for the IEEE Transactions on Multimedia and general chair of the ACM Multimedia conference in 2002. He often acts as an expert and reviewer for French and European research programs. He is a Senior Member of the IEEE and a member of the ACM.


Yingbo Li received his B.Eng. degree from Xi'an Jiaotong University, China, in 2005. He then obtained his M.S. degree in image processing from Pohang University of Science & Technology, South Korea, in 2008. In the same year he began his Ph.D. studies at EURECOM, France, and he received his Ph.D. degree from Telecom ParisTech, France, in February 2012. During his Ph.D., his research interests included multimedia retrieval, multimedia indexing, and content-based video analysis, especially video summarization. He is now a post-doctoral researcher at the Clarity Centre, Dublin City University, working on eye and gaze tracking.


The web is a vast repository of knowledge, but automatically extracting that ... Early work on the problem of jointly identifying a best latent KB from a collec- ... limitations, and we build on and improve the model of Jiang et al. by including ....