Content-Based Video Summarization using Spectral Clustering

Kadir A. Peker¹, Faisal I. Bashir²

¹ Mitsubishi Electric Research Laboratories, Cambridge, MA. [email protected]
² University of Illinois at Chicago, Chicago, IL. [email protected]

ABSTRACT

This paper presents a novel video summarization and browsing system based on face detection results in consumer video. We address the problem of generating content-adaptive, meaningful summaries from various video genres in an unsupervised way. We perform face detection on the video stream and use simple features such as the number of faces, their locations, and their sizes for clustering. Since each frame can contain multiple, sparsely placed faces, our distance computation takes this into account. A novel spectral clustering approach, in which the optimal number of clusters is adaptively determined from the content, is then employed to generate summaries. The results are reported in terms of generating semantically meaningful segmentations of various types of consumer video programs that are rich in face content. A summary player demo application has been developed to qualitatively test the results of video summarization in user studies.

1. INTRODUCTION

Content-based summarization and browsing of video has gained significant attention from the research community lately. A huge amount of new video data is produced and recorded every day in the form of commercial news and TV broadcast programs. Personal video recorders (PVRs) enable digital recording of several days' worth of broadcast video on a hard disk. Several user and market studies confirm that this technology has the potential to profoundly change TV viewing habits. Effective content-based video summarization and browsing technologies are deemed crucial to realize the full potential of these systems. Recently, domain-specific content segmentation such as news video story segmentation has been studied and has produced impressive results [1], but the field of content-based unsupervised generation of video summaries is still in its infancy. In this paper, we address the issue of unsupervised video content summarization across video genres to aid effective browsing. We base our approach on face detection results, noting that humans are the primary subjects of most consumer video programs. For example, in broadcast news videos, the more interesting segments tend to be the anchor introducing a story, the meteorologist presenting the weather forecast, and so on. We underline the fact that the desired high-level task of generating semantic summaries
requires a significant amount of robust face recognition and supervised learning. We avoid this approach for two reasons. First, consumer video platforms such as PVRs provide a resource-constrained development environment. Due to the limited processing capabilities of these devices, it is not feasible to build systems that work in high-dimensional feature spaces or use complex, non-real-time algorithms. Second, any supervised approach based on face recognition ultimately requires training data, which results in a domain-specific solution. Our goal, on the other hand, is to build a generic end-to-end system that works on various genres from multiple content providers. We also note that face recognition generally does not work well in typical news or TV program settings due to large variations in pose and illumination. Thus, we rely mainly on face detection results ("Is this a face?" as opposed to "Whose face is this?") and simple derivatives of this process for unsupervised video clustering. We use a distance measure that takes into account the number, locations, and sizes of faces in each frame while computing inter-frame distances. In a truly unsupervised manner, we establish the optimal number of clusters from the data using spectral clustering. We have developed a demo application with an intuitive user interface to play and browse through the generated summaries for user studies.

The rest of the paper is organized as follows. Section 2 describes face detection and the sampling of frames from video sequences. Section 3 presents our spectral clustering approach, which finds the optimal number of clusters and performs the clustering; this section also briefly outlines the faces-arrangement-based distance computation. Section 4 presents the user interface developed for user studies. Section 5 details simulation results, where we compare the results of spectral clustering with a modified version of K-Means clustering. Finally, conclusions are drawn in Section 6.

2. FACE DETECTION AND PRE-PROCESSING

We use the Viola-Jones face detector, which is based on boosted rectangular image features [2]. We first subsample the video stream to 360x240 and perform the detection with 1-pixel shifts. The speed is about 15 fps at these settings on a Pentium IV 3.0 GHz PC, including decoding and display overheads. About one false alarm per 30-60 frames occurs with the frontal face detector. Using DC images increases the speed dramatically, both through the detector (whose speed is proportional to the number of pixels) and through savings in decoding. The minimum detected face size increases in this case, but the target faces in news video are mostly within range. The detector can be run on only the I-frames, or at a temporally subsampled rate appropriate for the available processing power.
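For illustration only, the following is a minimal sketch of per-frame frontal-face detection in the spirit of the Viola-Jones detector described above, using OpenCV's pretrained Haar cascade; the cascade file, frame size, and detector parameters are assumptions rather than the authors' exact configuration.

```python
import cv2

# Assumption: OpenCV's bundled frontal-face Haar cascade stands in for the
# boosted-rectangle-feature detector of [2]; parameters are illustrative.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return a list of (x, y, w, h) face boxes for one decoded frame."""
    # The paper subsamples the stream to 360x240 before detection.
    small = cv2.resize(frame_bgr, (360, 240))
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
    return [tuple(int(v) for v in box) for box in boxes]
```

In practice the detector would be applied only to I-frames or to a temporally subsampled set of frames, as noted above.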

In the preprocessing stage, we first temporally subsample the video sequence with a window of 90 frames (~3 sec.). Next, we look for a representative frame for this subsampled window. Towards this end, we sort the window's frames by the number of detected faces and locate the 70th-percentile frame in the sorted list. This frame is taken as the representative frame of the sampled window. If more than one frame has the same number of faces as the 70th-percentile point, we choose the one with the larger biggest-face size. Third-level ties are broken by comparing the confidence scores of the most confident face in each of the two frames. The reason for picking the 70th-percentile point is that, due to pose variations, the face detector has a higher miss rate than its very low false-alarm rate. If the number of frames with no faces is more than 80% of the number of frames with faces, we mark the segment as "NoFace" and exclude it from the subsequent clustering process. The subsampling window is then shifted by 30 frames (~1 sec.) and the same process is repeated. Once the representative frame for each subsampled window is picked, we store the face count as well as the x- and y-locations and widths of the face windows. These derived features are then used in the clustering process, as explained in the next section.
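As an illustration of the representative-frame selection just described, here is a minimal sketch assuming each frame's detections are already available as a list of (x, y, w, h, confidence) tuples; the data layout and helper names are assumptions for illustration.

```python
def pick_representative(window_detections):
    """window_detections: one entry per frame in the ~90-frame window,
    each a list of (x, y, w, h, conf) face tuples. Returns the index of
    the representative frame, or None if the window is marked "NoFace"."""
    n_no_face = sum(1 for faces in window_detections if len(faces) == 0)
    n_face = len(window_detections) - n_no_face
    # Mark the window "NoFace" if frames without faces dominate (see text).
    if n_face == 0 or n_no_face > 0.8 * n_face:
        return None

    def key(idx):
        faces = window_detections[idx]
        biggest = max((w * h for (x, y, w, h, c) in faces), default=0)
        best_conf = max((c for (x, y, w, h, c) in faces), default=0.0)
        # Primary: face count; tie: biggest face size; third: top confidence.
        return (len(faces), biggest, best_conf)

    order = sorted(range(len(window_detections)), key=key)
    # 70th-percentile frame in the list sorted by face count (ties broken above).
    return order[int(0.7 * (len(order) - 1))]
```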

3. SPECTRAL CLUSTERING BASED ON FACES ARRANGEMENT

Once face detection and subsampling have been done, we compute the pair-wise faces-arrangement distances between the representative frames of all subsampled windows in the video sequence. We then use our spectral clustering algorithm, which computes the optimal number of clusters and clusters the data based on the faces-arrangement distance.

3.1. Faces Arrangement for Distance Computation

We modify the distance measure based on Faces Arrangement proposed in [3]. This measure computes distances between frames based on the number of faces, their spatial locations, and their sizes. Since the number of faces can differ across the frames to be matched, we first establish a correspondence between the faces present in the two frames. This is done by minimizing the relative spatial-location distance between each face of one frame and all faces of the other. This distance $T_D$ is given by:

$$T_D = \frac{1}{M}\left( \sum_{j=1}^{M}\frac{\left|L_1^j - L_2^j\right|}{W} + \sum_{j=1}^{M}\frac{\left|T_1^j - T_2^j\right|}{H} + \sum_{j=1}^{M}\frac{\left|W_1^j - W_2^j\right|}{W} + \sum_{j=1}^{M}\frac{\left|H_1^j - H_2^j\right|}{H} \right) \qquad (1)$$

Once the correspondence between faces has been established based on spatial locations, the total distance between the two frames is computed as:

$$\mathrm{Dist}(F_1, F_2) = \alpha\, T_D + \beta\, T_{OV} + \gamma\, T_A + (1 - \alpha - \beta - \gamma)\, T_N \qquad (2)$$

Here, the component distance measures are given by:

$$T_A = 1 - \frac{1}{M}\sum_{j=1}^{M}\frac{\min\!\left(A_1^j, A_2^j\right)}{\max\!\left(A_1^j, A_2^j\right)}, \qquad T_{OV} = 1 - \frac{1}{M}\sum_{j=1}^{M}\frac{\mathrm{OverlappedSize}\!\left(A_1^j, A_2^j\right)}{\min\!\left(A_1^j, A_2^j\right)}, \qquad T_N = \frac{\left|NF_1 - NF_2\right|}{M} \qquad (3)$$

where $NF_1$ and $NF_2$ are the numbers of faces in the two frames $F_1$ and $F_2$, and $M$ is the minimum of the numbers of faces in the two frames. $\left(L_1^j, T_1^j\right)$ are the coordinates of the top-left corner of the rectangle for face $j$ in the first frame, $W_1^j$ and $H_1^j$ are the width and height of the rectangle for the $j$th face in the first frame, and $A_1^j$ is the area of the rectangle for the $j$th face in the first frame. $W$ and $H$ are the width and height of the video frame.
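As a concrete illustration of equations (1)-(3), here is a sketch that computes the combined distance for two frames; a simple greedy matching stands in for the correspondence step described above, and the weights and helper names are assumptions rather than the paper's tuned values.

```python
def faces_arrangement_distance(faces1, faces2, frame_w, frame_h,
                               alpha=0.4, beta=0.2, gamma=0.2):
    """faces1, faces2: lists of (L, T, w, h) boxes for frames F1 and F2.
    Weights alpha/beta/gamma are illustrative; the paper does not list them."""
    NF1, NF2 = len(faces1), len(faces2)
    M = min(NF1, NF2)
    if M == 0:
        return 1.0  # no faces to match; treat as maximally distant (assumption)

    # Greedy correspondence by relative spatial location (stand-in for the
    # correspondence procedure described in the text).
    remaining = list(faces2)
    pairs = []
    for f1 in sorted(faces1, key=lambda b: -(b[2] * b[3]))[:M]:
        f2 = min(remaining, key=lambda b: abs(b[0] - f1[0]) / frame_w
                                          + abs(b[1] - f1[1]) / frame_h)
        remaining.remove(f2)
        pairs.append((f1, f2))

    TD = TA = TOV = 0.0
    for (L1, T1, w1, h1), (L2, T2, w2, h2) in pairs:
        TD += (abs(L1 - L2) / frame_w + abs(T1 - T2) / frame_h
               + abs(w1 - w2) / frame_w + abs(h1 - h2) / frame_h)
        A1, A2 = w1 * h1, w2 * h2
        TA += min(A1, A2) / max(A1, A2)
        ox = max(0, min(L1 + w1, L2 + w2) - max(L1, L2))
        oy = max(0, min(T1 + h1, T2 + h2) - max(T1, T2))
        TOV += (ox * oy) / min(A1, A2)  # overlapped area, normalized as in (3)
    TD /= M
    TA = 1.0 - TA / M
    TOV = 1.0 - TOV / M
    TN = abs(NF1 - NF2) / M
    return alpha * TD + beta * TOV + gamma * TA + (1 - alpha - beta - gamma) * TN
```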

We use equation (2) to compute the pair-wise distances between the representative frames of all the subsampled windows. The resulting symmetric distance matrix is then used in the spectral clustering described next.

3.2. Spectral Clustering

Spectral clustering is a relatively recent technique that employs the eigenspace decomposition of a symmetric similarity matrix between the items to be clustered. Ding et al. [4] prove that when the K-means objective function is optimized for a specific value of k, the continuous solutions for the discrete cluster indicator vectors are given by the first k-1 principal components of the similarity matrix. In this approach, a proximity or affinity matrix is computed from the original items of the data set using a suitable distance measure. The eigenspace decomposition of the affinity matrix is then used to group the data-set items into clusters. This approach has been shown to outperform K-Means clustering, especially in the case of non-convex clusters resulting from non-linear cluster boundaries [5]. Given the n x n symmetric affinity matrix generated from the faces-arrangement distances between frames, we look for the optimal number of clusters k and put the n subsampled windows into k clusters. The proposed algorithm uses k eigenvectors simultaneously, as in [5], to perform a k-way partitioning of the data space into k clusters. To decide the number of clusters k, we compute a cluster validity score similar in spirit to the one proposed in [6]:

$$\alpha_k = \sum_{c=1}^{k}\frac{1}{N_c}\sum_{i, j \in Z_c} W_{ij} \qquad (4)$$

where $Z_c$ denotes cluster c, $N_c$ the number of items in cluster c, and W is the matrix formed from Y, the normalized eigenvector matrix explained below. We use the following algorithm to find the number of clusters k and to perform the clustering:
1. Form the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by $A_{ij} = \exp\!\left(-\mathrm{Dist}(F_i, F_j) / 2\sigma^2\right)$ if $i \neq j$, and $A_{ii} = 0$.
2. Define D to be the diagonal matrix whose (i,i)-element is the sum of A's ith row, and construct the matrix $L = D^{-1/2} A D^{-1/2}$.
3. Find the n principal eigenvectors $x_1, x_2, \ldots, x_n$ of L.
4. Using the matrix formed by stacking the k largest eigenvectors, $X = [x_1, x_2, \ldots, x_k] \in \mathbb{R}^{n \times k}$, form the normalized eigenvector matrix Y by renormalizing each of X's rows to unit length, $Y_{ij} = X_{ij} / \left(\sum_j X_{ij}^2\right)^{1/2}$. Also compute the $n \times n$ matrix $W = Y Y'$.
5. Use K-Means clustering on the rows of Y to form k clusters.
6. Calculate $\alpha_k$.
7. Iterate steps 4 through 6 for k = 1, 2, ..., K, and select the k that maximizes $\alpha_k$.
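A compact sketch of this procedure is given below, using numpy and scikit-learn's KMeans; the value of σ, the maximum K, the starting value k = 2, and the use of scikit-learn are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_spectral_clustering(dist, sigma=0.5, K_max=10):
    """dist: symmetric n x n faces-arrangement distance matrix.
    Returns (best_k, labels) chosen by the validity score of eq. (4)."""
    n = dist.shape[0]
    # Step 1: affinity matrix with zero diagonal.
    A = np.exp(-dist / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: normalized affinity L = D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = D_inv_sqrt @ A @ D_inv_sqrt
    # Step 3: eigenvectors of L, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(L)
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

    best_k, best_score, best_labels = 1, -np.inf, np.zeros(n, dtype=int)
    for k in range(2, min(K_max, n) + 1):
        # Step 4: stack the top-k eigenvectors and row-normalize to get Y.
        X = eigvecs[:, :k]
        Y = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        W = Y @ Y.T
        # Step 5: K-Means on the rows of Y.
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Y)
        # Step 6: validity score alpha_k of eq. (4).
        score = sum(W[np.ix_(labels == c, labels == c)].sum() / np.sum(labels == c)
                    for c in range(k))
        # Step 7: keep the k maximizing the score.
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```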

Here we note that although the algorithm internally uses K-Means clustering, both the behavior and the results of the algorithm differ from running K-Means on the data directly. This is because the clusters in the original data space often correspond to non-convex regions, in which case K-Means run directly on the data finds unsatisfactory clusters. Our approach not only finds the clusters in this situation, but also computes the optimal number of clusters from the given data. Figure 1 shows a sample run of our spectral clustering algorithm on the Court TV program, which results in two clusters.

Figure 1: Distance matrix before (a) and after (b) spectral clustering on the Court TV video. The optimal k was found to be 2.

Once the subsampled windows are clustered, we generate the summaries from the clustering results. The first step is a temporal smoothing of the clustering results to remedy the fragmented and sparse nature of the raw summaries. We use morphological filtering to clean up the noisy summaries and to fill in the gaps. Once the summaries are generated, we use the summary player application to qualitatively analyze the results, as explained in the next section.
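For illustration, the temporal smoothing step could look like the following sketch, which applies binary morphological closing and opening to each cluster's frame-level membership sequence; the structuring-element length is an assumption, since the paper does not specify the filter parameters.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def smooth_cluster_labels(labels, k, width=5):
    """labels: per-window cluster indices in temporal order.
    Returns one boolean mask per cluster after morphological clean-up."""
    structure = np.ones(width, dtype=bool)
    masks = []
    for c in range(k):
        mask = (np.asarray(labels) == c)
        mask = binary_closing(mask, structure=structure)  # fill short gaps
        mask = binary_opening(mask, structure=structure)  # drop isolated hits
        masks.append(mask)
    return masks
```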

4. BROWSING VIDEOS USING SUMMARIES

We have developed a user interface both to qualitatively assess the performance of the video summary generation process and to port the system to PVR platforms. A screenshot of the interface is provided in Figure 2. The user can skip to alternative summaries using the Up and Down controls, or skip to segments within the current summary using the Left and Right controls. The current summary and segment information is also overlaid on the frame being rendered for visual feedback. In the skip-between-segments mode, the system starts playing a segment of the summary (yellow box) and then skips to the start of the next segment, not playing the uninteresting parts between segments (light blue box). A red bar shows the current time mark. In the fast-forward-between-segments mode, the uninteresting segments are played at a fast playback rate instead of being skipped over. The user can at any time hit the play button to return to normal playback mode, or skip over the rest of the uninteresting segment.

Figure 2: Screenshot of the Summary Player application developed for qualitative assessment of summary generation and for user studies.

5. SIMULATION RESULTS

We have conducted our experiments on two consumer videos of around one hour combined duration, from two different mainstream broadcast TV channels in the United States. One of the videos is from a news program (News), with anchor shots, outside reports, weather reports, commercials, etc. The other is a reality-TV public hearing program (Court), where people contest their cases in front of a judge. Both videos are rich in human faces. The resulting summaries are displayed in Figure 3 for the News program and in Figure 4 for the Court program. The figures also show a few representative frames for each summary. As shown in Figure 3, the various summaries correspond

to semantically meaningful classes based on the production syntax of the program. The first summary corresponds to two-face shots, which typically result from the anchor interacting with a reporter in the field, for example. The second summary results from a small face (small face area) moving across the screen (large variation in the x-coordinate of the face), which corresponds to the meteorologist presenting the weather forecast. The last summary corresponds to a large face in the middle, or at the right or left side, of the screen; this usually corresponds to anchor-person shots.

Figure 3: Summaries generated for the News program. The bottom figure shows three summaries (rows) for 727 subsampled windows (columns). One representative frame per summary is also shown.

The second example, in Figure 4, shows summaries and representative frames from the Court TV program. In this case, since the optimal number of clusters was found to be two, we have two summaries. The first summary has frames with several small faces; these correspond to shots of the defendants with the audience in the background. The second summary corresponds to a large face in the middle of the screen, which corresponds to shots of the judge.

Figure 4: Summaries generated for the Court program. The bottom figure shows two summaries (rows) for 877 subsampled windows (columns). Three representative frames are also shown.

For the purpose of comparison, we have implemented a modified K-Means clustering algorithm for generating summaries in the same setting. Our modified K-Means clustering algorithm takes the symmetric distance matrix as input and forms clusters by looking up distances between frames instead of computing a new distance value at each iteration. In standard K-Means, the centroid is computed as a new data point by taking the mean of the data points assigned to a cluster; in our modified version, we use the median instead of the mean and take the frame with the median distance value as the centroid. The number of clusters k in this case was set to the value found by the spectral clustering algorithm, for the sake of a fair comparison. A limited user study revealed a higher level of user satisfaction with the summaries generated by our spectral-clustering-based algorithm.
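For illustration, a distance-matrix-only, medoid-style variant of K-Means in the spirit of this baseline might look like the following sketch; the initialization, iteration count, and the medoid update (minimum total within-cluster distance, a common stand-in for the paper's median-distance rule) are assumptions.

```python
import numpy as np

def kmedoid_from_distances(dist, k, n_iter=20, seed=0):
    """dist: symmetric n x n distance matrix; returns per-item cluster labels.
    Centroids are actual frames (medoids), never newly computed points."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assign each frame to its nearest medoid using looked-up distances.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # New medoid: the member frame minimizing total distance to the
            # other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels
```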

The results of these experiments show that our unsupervised method of generating summaries performs well across video program genres. It also yields meaningful clusters corresponding to semantic concepts.

6. CONCLUSIONS

In this paper, we have presented a novel content-based video summarization system based on near-real-time face detection and robust spectral clustering. The framework handles the presence of multiple faces in each frame robustly and computes distances based on the number of faces, their spatial locations, and their sizes. A novel spectral clustering algorithm is then used to cluster the frames while adaptively computing the optimal number of clusters from the data. Temporally smoothed summaries are then generated to aid effective browsing of video content. We report results on two commercial TV programs from different content providers. Our results are very encouraging and yield summaries corresponding to semantic concepts.

REFERENCES

[1] Chua T.S., Chang S.F., Chaisorn L., Hsu W., "Story Boundary Detection in Large Broadcast Video Archives – Techniques, Experience and Trends", ACM Multimedia Conference, 2004.
[2] Viola P., Jones M., "Robust Real-Time Object Detection", IEEE Workshop on Statistical and Computational Theories of Vision, 2001.
[3] Abdel-Mottaleb M., et al., "Content-based Album Management using Faces' Arrangement", ICME 2004.
[4] Ding C., He X., "K-means Clustering via Principal Component Analysis", Proceedings of the 21st International Conference on Machine Learning, ICML 2004.
[5] Ng A.Y., Jordan M.I., Weiss Y., "On Spectral Clustering: Analysis and an Algorithm", Advances in Neural Information Processing Systems, Vol. 14, 2001.
[6] Porikli F., Haga T., "Event Detection by Eigenvector Decomposition using Object and Frame Features", International Conference on Computer Vision and Pattern Recognition, CVPR 2004.
