A NOVEL VIDEO SUMMARIZATION METHOD FOR MULTI-INTENSITY ILLUMINATED INFRARED VIDEOS

Jen-Hui Chuang, Wen-Jing Tsai, Chia-Hsin Chan, Wen-Chih Teng, and I-Chun Lu

Dept. of Computer Science, National Chiao Tung Univ., HsinChu, Taiwan
{jchuang, wjtsai}@cs.nctu.edu.tw, {terry0201, tenvunchi, latotai.dreaming}@gmail.com

ABSTRACT

In nighttime video surveillance, proper illumination plays a key role in image quality. With ordinary IR-illuminators of fixed intensity, faraway objects are often hard to identify due to insufficient illumination, while nearby objects may suffer from over-exposure, resulting in image foreground/background of poor quality. In this paper we propose a novel video summarization method which utilizes a novel multi-intensity IR-illuminator to generate images of human activities at different illumination levels. By adopting a GMM-based foreground extraction procedure for the images acquired at each illumination level, foreground objects of the most plausible quality can be selected and merged with a preselected representation of the still background. The result is a reasonable video summary of the moving foreground, which is generally unachievable for nighttime surveillance videos.

Index Terms— Nighttime video surveillance, multi-intensity infrared illuminator, video summary

1. INTRODUCTION

In recent years, video surveillance systems have been deployed in many office buildings, communities, and public spaces for safety reasons. Since criminal activities in such places often happen at night, infrared cameras equipped with infrared illuminators (IR-illuminators) are widely used to achieve nighttime surveillance capability. In order to obtain a clear view for effective surveillance, a proper illumination condition plays a key role. However, an ordinary IR-illuminator operates at a fixed power level. If the power is too low, faraway objects will be under-exposed, unclear, and hard to identify. On the other hand, if the power is too high, objects near the camera will suffer from over-exposure and a significant loss of detail. To solve both the under-exposure and over-exposure problems simultaneously, a multi-intensity IR-illuminator which periodically emits infrared light at different power levels was proposed by Teng [1]. Videos recorded using the multi-intensity IR-illuminator can capture rich details of both close and distant objects under different illumination intensities. This technique was adopted in [2] and [3] for foreground object detection and license plate detection, respectively.

The goal of video summarization [4] is to extract and arrange the important information of a video into a compact visual representation. With video summarization, a long video can be turned into a short summary that a human viewer can interpret in a short period of time. Automatic video summarization is thus quite useful for surveillance applications such as action recognition, human detection, or the creation of a visual diary.

In this paper, a novel video summarization method is proposed for multi-intensity illuminated infrared videos. The proposed method summarizes the image frames of a video by selecting the foreground of the best quality among the images captured at different intensity levels in each period of intensity variation. The result is a high-quality video summary that fully exploits the characteristics of multi-intensity illumination.

The remainder of this paper is organized as follows: Section 2 introduces the acquisition and properties of multi-intensity illuminated infrared videos; Section 3 describes the proposed method; Section 4 demonstrates the experimental results; and Section 5 concludes this work.

2. MULTI-INTENSITY ILLUMINATED INFRARED VIDEOS

Unlike an ordinary IR-illuminator, which uses a fixed power to generate a constant level of illumination intensity, the power of a multi-intensity IR-illuminator varies periodically in order to establish various illumination conditions.

Fig. 1. Periodic changes of the illumination intensity of the multi-intensity IR-illuminator.

Fig. 2. Example of images of a sidewalk scene, which are captured with six different illumination levels.

Fig. 3. Example of a series of snapshots (a)-(f) captured for channel 4.

Fig. 4. Manual selection of the clearest channels for the snapshots of Fig. 3. The channel IDs from (a) to (f) are 6, 6, 5, 4, 3, and 2, respectively.

In this paper, the videos are assumed to be captured by a common infrared camera equipped with such a multi-intensity IR-illuminator. We adopt the IR-illumination pattern proposed in [1], wherein the illumination intensity decays exponentially from the brightest level to the darkest one in each period, as shown in Fig. 1. For the multi-intensity illuminated infrared videos considered in this paper, six frames (channels) are captured in each period of intensity variation, one at each of six illumination levels. The channels are labeled from 1 (the darkest) to 6 (the brightest), and the image frames of the same channel have a constant illumination condition. Fig. 2 shows an example of images of a sidewalk scene thus captured.

To demonstrate the advantage of the multi-intensity illuminated infrared video, a series of snapshots from channel 4 is provided in Fig. 3, wherein a person walks from far away towards the camera. One can see that the illumination power is not strong enough when the person is far from the camera, so he is hardly noticeable. On the other hand, the illumination power is too high when he comes closer, resulting in over-exposure and loss of detail. However, a clear view of the person can actually be obtained from the fourth snapshot (Fig. 3(d)). In particular, Fig. 4 shows the most suitable channels manually selected for the corresponding snapshots of Fig. 3 to best maintain the clearness of the person. Although channels with different intensities do have the potential of providing a relatively clear view of objects at different distances, manual selection is infeasible for large amounts of video data.

There are many image processing techniques which can help improve image quality, such as image enhancement, histogram equalization, and high dynamic range (HDR) synthesis. In consideration of the direct usage of original, unprocessed images, as well as a major reduction of the amount of data for surveillance purposes, a novel video summarization technique is developed in this paper to extract information across the channels.

3. PROPOSED METHOD

The flowchart of the proposed video summarization method is provided in Fig. 5. Existing image processing techniques for daytime scenarios are applied and modified to suit the above multi-channel videos for nighttime surveillance.

Fig. 5. The flowchart of the proposed method.

First, a separately trained Gaussian mixture model (GMM) for each channel takes the original video as input and performs background/foreground segmentation. Then, the foregrounds are further refined, and the images are fed back to update the GMMs, separately for each channel. Finally, the selected foregrounds from all channels are merged with the background of a preselected channel to generate the video summary output. Details of the algorithm are described in the following subsections.

Fig. 6. A snapshot from (a) the original image, (b) the foreground image, and (c) the background image after the proposed GMM-based segmentation.

3.1. Foreground/background segmentation

In order to model the background scene for foreground/background segmentation, the Gaussian mixture model (GMM) is a popular choice for its capability of adapting to background variation. For estimating the foreground probability $P(\vec{x}_t)$ at time $t$ for a pixel with intensity value $\vec{x}_t$ at a given location, a mixture model consisting of $N$ Gaussian distributions can be denoted by

$$P(\vec{x}_t) = \sum_{n=1}^{N} \omega_{n,t}\,\eta(\vec{x}_t;\,\vec{\mu}_{n,t},\sigma_{n,t}^2), \tag{1}$$

where $\eta$ symbolizes a Gaussian probability density function

$$\eta(\vec{x};\,\vec{\mu},\sigma^2) = \frac{1}{(2\pi)^{d/2}\sigma^{d}}\exp\!\left(-\frac{\lVert\vec{x}-\vec{\mu}\rVert^2}{2\sigma^2}\right), \tag{2}$$

and $\vec{\mu}_{n,t}$, $\sigma_{n,t}^2$, and $\omega_{n,t}$ represent the mean, variance, and mixture weight of the $n$-th Gaussian model, respectively. In this paper the GMM is implemented according to the algorithm proposed in [5]. Two different learning rates, $\alpha$ and $\rho$, are used to update the model parameters through the following equations:

$$\omega_{n,t} = (1-\alpha)\,\omega_{n,t-1} + \alpha\,M_{n,t},$$
$$\vec{\mu}_{n,t} = (1-\rho)\,\vec{\mu}_{n,t-1} + \rho\,\vec{x}_t,$$
$$\sigma_{n,t}^2 = (1-\rho)\,\sigma_{n,t-1}^2 + \rho\,\lVert\vec{x}_t-\vec{\mu}_{n,t}\rVert^2, \tag{3}$$

where $M_{n,t}$ equals 1 for the matched distribution and 0 otherwise.

Since the intensity level of each channel differs from the others', a GMM needs to be applied separately to each channel; the density estimated for a pixel of channel $i$ is therefore denoted as $P_i(\vec{x}_t)$. Moreover, the parameter that governs the GMM's sensitivity to pixel intensity changes also needs to be modified per channel. With a higher sensitivity, even a small intensity change will be regarded as the appearance of a foreground pixel, and vice versa. Because intensity changes tend to be larger under stronger illumination, lower sensitivities are set for the brighter channels to prevent foreground noise, while higher sensitivities are set for the darker channels to increase the detection rate of foreground objects.
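Conceptually, the per-channel modeling amounts to maintaining one independent background model per illumination level and routing each incoming frame to the model of its channel. The sketch below illustrates this, substituting OpenCV's built-in MOG2 background subtractor for the regularized GMM of [5]; the history length follows the 450-frame training phase described below, while the per-channel varThreshold values are purely illustrative assumptions.

```python
# A minimal sketch of per-channel GMM background subtraction. OpenCV's MOG2
# model stands in for the regularized GMM of [5].
import cv2

NUM_CHANNELS = 6
# One independent GMM per channel, indexed 0 (darkest) .. 5 (brightest).
# Brighter channels get a larger varThreshold, i.e., a lower sensitivity,
# to suppress illumination noise (values assumed for illustration).
subtractors = [
    cv2.createBackgroundSubtractorMOG2(history=450,
                                       varThreshold=8 + 6 * c,
                                       detectShadows=False)
    for c in range(NUM_CHANNELS)
]

def segment(channel, gray_frame):
    """channel: 0-based illumination level of this frame (frames cycle
    through all six levels once per period of intensity variation)."""
    fg_mask = subtractors[channel].apply(gray_frame)  # also updates the model
    return fg_mask
```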

(a) (b) Fig. 7. Example of foreground refinement after applying (a) thresholding plus erosion and dilation and (b) channel intersection.

Fig. 6 shows the segmentation results for the image shown in Fig. 3(d).

At the beginning, video clips of 450 frames (for each channel, with no foreground) are provided for the training phase of the GMMs. After that, normal frames with foregrounds are analyzed by the GMMs. Pixels with foreground probability $P_i(\vec{x}_t)$ larger (smaller) than a pre-defined threshold are regarded as foreground (background) pixels. Note that the GMMs update continuously via the feedback of foreground and background images; therefore, foreground objects with no movement for some time will be recognized as background thereafter.

3.2. Foreground refinement

As described above, the threshold on the foreground probability separates the pixels into foreground and background pixels. However, if we examine the foreground through a binary image generated by marking the foreground/background pixels white/black, as shown in Fig. 7(a), one can see that the foreground regions are fragmented and contain a lot of noise, including reflections and background clutter. In order to improve the quality of the image foreground, morphological erosion and dilation [6] are first applied to the original binary image to repair the fragmentation and eliminate small foreground regions, which are usually due to background noise. Then, we apply the rule of "channel intersection": if a pixel location is marked as foreground in fewer than two channels, that location is marked as background in all channels. The channel intersection utilizes the relationship between channels and can effectively eliminate the noise caused by reflections due to bright illumination, as shown in Fig. 7(b).
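As an illustration of this refinement, the following sketch applies OpenCV morphology and the two-channel voting rule to the six raw masks of one intensity period; the 3x3 kernel and single-pass erosion/dilation are assumed choices, not values from the paper.

```python
# A minimal sketch of the foreground refinement step (Section 3.2).
import cv2
import numpy as np

KERNEL = np.ones((3, 3), np.uint8)  # assumed structuring element

def refine(masks):
    """masks: list of six binary foreground masks (uint8, 0 or 255)."""
    # Erosion removes small noisy blobs; dilation repairs fragmented regions.
    cleaned = [cv2.dilate(cv2.erode(m, KERNEL), KERNEL) for m in masks]
    # Channel intersection: keep a pixel only if it is foreground in at
    # least two channels; otherwise mark it background in all channels.
    votes = sum((m > 0).astype(np.uint8) for m in cleaned)
    support = (votes >= 2).astype(np.uint8)
    return [m * support for m in cleaned]
```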

3.3. Selecting the channel with the best foreground

After the foregrounds are refined, their quality is evaluated for every channel. Smoothness constraints for channel selection are then applied to achieve a comfortable viewing experience.

3.3.1. Evaluating the foreground quality

Regarding the example snapshots from the multi-intensity illuminated video shown in Fig. 4, it is observed that the most recognizable foregrounds may come from different channels at different times. In order to automatically summarize a video into a new data form of consistently plausible quality, the visual appearance of the foreground objects needs to be evaluated. Because it is desirable to show as much detail as possible for the foreground objects of interest in nighttime surveillance, their quality is measured by the number of edge pixels they contain. We first calculate the pixel gradient $G_i(x,y,t) = |G_x| + |G_y|$ for each foreground pixel $(x,y)$ obtained for channel $i$ at time $t$, where $G_x$ and $G_y$ denote the approximations of the horizontal and vertical derivatives using the well-known Sobel operator. Pixels with gradient larger than a pre-defined threshold $T_e$ are regarded as edge pixels, or

$$E_i(x,y,t) = \begin{cases} 1, & \text{if } G_i(x,y,t) > T_e \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

Then, the channel initially selected at time period $t$ is the one whose refined foreground has the largest number of edge pixels, or

$$c_t = \arg\max_i \sum_{(x,y)} E_i(x,y,t). \tag{5}$$
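A sketch of this quality measure, using OpenCV's Sobel operator, is given below; the threshold value of 20 follows the experimental setting in Section 4, while the function and variable names are hypothetical.

```python
# A minimal sketch of the edge-based quality measure of Eq. (4).
import cv2
import numpy as np

T_E = 20  # gradient threshold for edge pixels (Section 4)

def count_edge_pixels(gray_frame, fg_mask):
    """Count foreground pixels whose |Gx| + |Gy| exceeds T_E."""
    gx = cv2.Sobel(gray_frame, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray_frame, cv2.CV_32F, 0, 1, ksize=3)
    grad = np.abs(gx) + np.abs(gy)
    return int(np.count_nonzero((grad > T_E) & (fg_mask > 0)))
```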

Fig. 8. The effect of smoothness constraints. (a) Initial channel selection with no constraint. (b) Channel selection with limited range. (c) Channel selection with limited range plus median filtering.

3.3.2. Preserving the smoothness of channel selections

Although the initially selected channel contains the largest number of edge pixels, the selection is based only on information from the current frames (of the six channels). Therefore, the selected channel may change abruptly because the previous selection is not considered, as shown in Fig. 8(a). As a result, the foreground may flicker due to the quick variation of its intensity. It is thus very important to ensure the smoothness of the channel selection across consecutive output frames.

In this paper, several constraints are applied to cope with the above problem. First, the range of the current channel selection is limited to $[s_{t-1}-1,\,s_{t-1}+1]$, where $s_{t-1}$ denotes the channel selected for the previous frame. Fig. 8(b) shows the resulting selection curve obtained for Fig. 8(a), which is much smoother and trembles less. To further reduce the trembling, a one-dimensional median filter is applied to smooth the channel selection. The filtered selection is the median of the current initial selection and the $n-1$ previous unfiltered selections, or

$$s_t = \mathrm{median}(c_t, c_{t-1}, \ldots, c_{t-n+1}). \tag{6}$$

The reason for taking the median of the unfiltered results is that, although an initial selection may not be used directly, it does indicate the possibility of a channel change, and the channel should change eventually when more and more recent unfiltered selections indicate such a trend. Finally, the filtered selection is limited to the above range again to enforce the smoothness constraint. Fig. 8(c) shows the final result for the initial selection shown in Fig. 8(a) with n = 5.
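The complete selection rule, combining Eqs. (5) and (6) with the range limit, can be sketched as follows; the clamping to plus/minus one channel per frame reflects the range constraint as reconstructed above, and all names are illustrative.

```python
# A minimal sketch of the smoothed channel selection (Section 3.3).
from collections import deque
import statistics

N_CHANNELS = 6
n = 5                      # median filter length (n = 5 in Section 4)
history = deque(maxlen=n)  # recent unfiltered (initial) selections
prev = None                # previously output channel selection

def select_channel(edge_counts):
    """edge_counts: list of six edge-pixel counts, one per channel (Eq. 5)."""
    global prev
    initial = max(range(N_CHANNELS), key=lambda i: edge_counts[i])
    history.append(initial)
    # Median of the current and previous unfiltered selections (Eq. 6).
    smoothed = int(statistics.median(history))
    # Limit the change to +/-1 channel per frame for smooth viewing.
    if prev is not None:
        smoothed = max(prev - 1, min(prev + 1, smoothed))
    prev = smoothed
    return smoothed
```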

3.4. Synthesis and summarization

As described in Section 3.1, the background images of each channel are fed back to the GMM. Since the GMMs are maintained separately for each channel, this yields a self-updated background video for each channel. Here we simply select a fixed channel, one which generally gives the clearest background view, to provide the background video. The reason for selecting the background from a fixed channel is to keep the intensity level nearly constant for comfortable viewing. Finally, the video summarization is accomplished by replacing the corresponding background pixels with the pixels of the selected, refined foregrounds described in Sections 3.2 and 3.3.

The proposed video summarization is based on channel selection for better foreground quality, and the video is skimmed by a factor equal to the number of channels. For example, with the six channels used in this paper, the summarized video is six times shorter than the original one. Existing video summarization algorithms such as key frame extraction [7][8] can be applied afterwards to further skim the video.
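A minimal sketch of this compositing step is shown below; the raw frame of the fixed channel stands in for its GMM background image, and the choice of the brightest channel as the background channel is an assumption for illustration.

```python
# A minimal sketch of the final synthesis step (Section 3.4).
import numpy as np

BG_CHANNEL = 5  # fixed background channel (0-based; assumed here)

def synthesize(frames, masks, selected):
    """frames, masks: per-channel images and refined foreground masks of one
    intensity period; selected: the smoothed channel index from Sec. 3.3."""
    summary = frames[BG_CHANNEL].copy()  # stand-in for the GMM background
    fg = masks[selected] > 0             # boolean foreground region
    summary[fg] = frames[selected][fg]   # overlay the selected foreground
    return summary                       # one summary frame per period
```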

4. EXPERIMENT AND DISCUSSION

Table 1. Tested video clips

Name               Format           Channels   # of frames
Outdoor_Pathway1   320x240, 30fps   6          200
Outdoor_Pathway2   320x240, 30fps   6          400
Outdoor_Pathway3   320x240, 30fps   6          150

The proposed video summarization method is tested on three video clips from two multi-intensity illuminated infrared videos with six channels, as listed in Table 1. The original videos can be found in [9]. The threshold values for the foreground probability and edge detection are set to 0.5 and 20, respectively, and a 5-tap median filter is used to smooth the channel selection.

Snapshots of the resulting video summaries are shown in Fig. 9. The scene shown in Fig. 3 is tested first, in Fig. 9(a). One can see that the image quality is not only comparable to that of the manual channel selection shown in Fig. 4 in terms of good-quality foreground objects, but even better in terms of its clearer background, which has a nearly constant illumination condition for a pleasant viewing experience. Fig. 9(b) starts with a lady walking away from the camera and then coming back towards it. Fig. 9(c) illustrates the action of a walking person who finds and picks up a bag on the ground. The bag is initially treated as part of the background; the GMM then successfully recognizes it as foreground after it is picked up.

In Fig. 10, the channel selection curves are plotted for the corresponding video summaries in Fig. 9. In Fig. 10(a), the selection successfully adapts to a darker channel as the person comes near the camera. In Fig. 10(b), the channel selection achieves a smooth transition as the lady vanishes into the background and then emerges into the foreground again. However, in Fig. 10(c), after picking up the bag, the person quickly moves toward the camera and soon leaves the screen; the adaptation towards darker levels is not fast enough due to the 5-tap median filter used in our experiments. Moreover, it should be noted that while the 1st channel is never selected throughout the experiments, it is still useful for foreground objects very close to the camera, for example when a criminal is about to tamper with the camera.

Overall, for a wide range of distances between a person and the camera, the proposed method adapts to the illumination conditions well and therefore generates video summarization results with consistently clear foregrounds. However, when the channel of the selected foreground differs from that of the background, some noise may appear along the foreground boundary, making the image look inharmonious. This problem might be resolved by adopting a more sophisticated foreground/background segmentation algorithm or by applying other image smoothing techniques.

For more complex situations wherein the foreground may contain several objects at different distances, it is possible to extend the proposed method by treating each object separately. For example, connected component analysis can be applied after the foreground refinement (Section 3.2) to identify individual foreground objects. Then, the proposed channel selection (Section 3.3), together with an object tracking algorithm, can be applied separately to each object. As a result, each object can be assigned its best channel and merged with the background to generate the summary.

5. CONCLUSION

A novel video summarization method is proposed in this paper for multi-intensity illuminated infrared videos. The characteristics of the different intensity channels are utilized to produce a reasonable video summary wherein foreground objects of the most plausible quality, possibly illuminated differently, are selected and merged with a preselected representation of the still background for comfortable viewing. Such summarization results are generally unachievable for ordinary nighttime surveillance videos and reveal a promising application of multi-intensity illuminated infrared video. To the best of our knowledge, this is the first attempt to summarize a multi-intensity illuminated infrared video for comfortable viewing, and more research directions around this topic remain to be explored.

6. REFERENCES

[1] W.-C. Teng, "A New Design of IR Illuminator for Nighttime Surveillance," MS Thesis, National Chiao Tung Univ., 2010.
[2] P. J. Lu, J.-H. Chuang, and H.-H. Lin, "Intelligent Nighttime Video Surveillance Using Multi-Intensity Infrared Illuminator," Proceedings of the World Congress on Engineering and Computer Science, vol. 1, San Francisco, USA, Oct. 2011.
[3] Y.-T. Chen, J.-H. Chuang, W.-C. Teng, H.-H. Lin, and H.-T. Chen, "Robust License Plate Detection in Nighttime Scenes Using Multiple Intensity IR-Illuminator," IEEE International Symposium on Industrial Electronics, May 2012.
[4] A. G. Money and H. Agius, "Video Summarisation: A Conceptual Framework and Survey of the State of the Art," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 121-143, Feb. 2008.
[5] H.-H. Lin, J.-H. Chuang, and T.-L. Liu, "Regularized Background Adaptation: A Novel Learning Rate Control Scheme for Gaussian Mixture Modeling," IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 822-836, Mar. 2011.
[6] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed., Pearson/Prentice Hall, 2008.
[7] W. Wolf, "Key Frame Selection by Motion Analysis," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996.
[8] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar, "An Integrated System for Content-Based Video Retrieval and Browsing," Pattern Recognition, vol. 30, no. 4, pp. 643-658, April 1997.
[9] MI3: Multi-Intensity Infrared Illumination Dataset. Available at: http://140.113.241.203/dataset/

ACKNOWLEDGEMENT

This research is supported in part by NSC-101-2220-E-009-054 and the "Aim for the Top University Plan" of the Ministry of Education, Taiwan, R.O.C.

Fig. 9. Snapshots from the video summarization results for video clips (a) Outdoor_Pathway1, (b) Outdoor_Pathway2, and (c) Outdoor_Pathway3.

Fig. 10. Foreground channel selection curves for video clips (a) Outdoor_Pathway1, (b) Outdoor_Pathway2, and (c) Outdoor_Pathway3.
