Pattern Anal Applic (2004) 7: 176–189 DOI 10.1007/s10044-004-0215-4

THEORETICAL ADVANCES

Aneesh Chauhan · Sameer Singh · Dave Grosvenor

Episode detection in videos captured using a head-mounted camera

Received: 1 October 2003 / Accepted: 15 April 2004 / Published online: 19 June 2004
© Springer-Verlag London Limited 2004

Abstract With the advent of wearable computing, personal imaging, photojournalism and personal video diaries, the need for automated archiving of the videos captured by such devices has become quite pressing. The principal device used to capture the human-environment interaction is a wearable camera (usually a head-mounted camera). The videos obtained from such a camera are raw, unedited records of the visual interaction of the wearer (the user of the camera) with the surroundings. The focus of our research is to develop post-processing techniques that can automatically abstract videos based on episode detection. An episode is defined as a part of the video that was captured when the user was interested in an external event and paid attention to record it. Our research is based on the assumption that head movements have distinguishable patterns during an episode and that these patterns can be exploited to differentiate between an episode and a non-episode. Here we present a novel algorithm that exploits head and body behaviour to detect episodes. The algorithm's performance is measured by comparing the ground truth (user-declared episodes) with the detected episodes. The experiments show the high degree of success achieved with our proposed method on several hours of head-mounted video captured in varying locations.

Keywords Dominant motion · Episode detection · Head-mounted video

A. Chauhan · S. Singh (corresponding author)
Autonomous Technologies Research, Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
E-mail: [email protected]

D. Grosvenor
Digital Media Department, Hewlett Packard Research Labs, Frenchay, Bristol, UK

Video abstraction

Video abstracts are helpful in a number of contexts [1], including the development of multimedia archives, movie marketing and home entertainment. Traditionally, video archives are indexed and searched by text, which leads to loss of information. An audio-visual abstract is semantically much richer than text and is defined as a sequence of moving images, extracted from a longer video, much shorter than the original, and preserving the essential message of the original. Manual abstraction techniques are very time consuming, and the search is on for efficient semi-automatic abstraction methods (a computer produces a draft summary that is further edited by a human expert) and fully automatic abstraction methods (the summary is produced solely by the computer). In all cases, abstracts can be a collection of only key frames (still images) [2–8], a variable number of frames extracted depending on the content as in video skimming [9], or true video content [1, 10–12]. In this paper we are interested in the latter and in fully automatic abstraction methods. It is important that video abstraction provides good-quality abstracts. Lienhart [13] recommends the following criteria for judging the quality of abstracts: (a) balanced coverage of the material, (b) optimally shortened shots, (c) a more detailed coverage of selected clusters rather than a totally balanced but too short coverage, and (d) a proper choice of editing pattern. It is quite obvious that the quality of an abstract can only be judged with a specific target audience in mind [10]. For example, the aim of viewers of documentaries is to receive information, whereas the aim of feature film viewers is entertainment. For this reason it is important that the abstracts differ. A documentary abstract should give an overview of the contents of the entire video, whereas a feature film abstract should be entertaining and not reveal the end of the story. Also, in movies, actors are important, but not so much in documentaries. However, audio content is extremely important for documentaries.


In several applications, people are more important than other objects in the video, and skin detection methods are often applied to video frames to find faces and hands.

Properties of video captured using head-mounted cameras

Increasingly, people use their camcorders to record hours of footage but never find the time to edit it. This is for several reasons [14]. Unedited video is long and lacks visually appealing effects, so it is hard to edit in a short amount of time. Editing video is also a fairly tedious task: since there are no special effects, the user finds it hard to focus on the editing for a long period of time. Home entertainment videos are not as well structured as movie videos, and video abstraction techniques are therefore not so well developed for such material. In this study, we have replaced the camcorder with a head-mounted camera. The video captured is for home entertainment; however, it is considerably different from camcorder-based video, as described below, and as such necessitates far more complex video abstraction algorithms. Most contemporary devices used for recording videos (video cameras, camcorders etc.) have the basic drawback that the person recording the video is left out of the activity. In other words, the involvement of the user is restricted to recording an event rather than being a part of it. A wearcam (wearable camera) is a solution to this problem, as the user can be personally involved in the activity and taping it at the same time. The major difference amongst various wearcams is the location on the human body where they are mounted. Of special interest is the headcam (head-mounted camera), since our head and iris movements usually track the objects of interest, so such a camera records and indexes the visual environment of the user from the "first person perspective". This single property of a headcam makes it suitable not only for recording videos but also for various other emerging technologies that are meant to process the immediate and intimate user environment (e.g. wearable computers, personal imaging systems, augmented reality etc.). Mann [15] describes the operational principles, inner workings, optical arrangements and hardware specifications of a headcam. Mann [15] suggests that the ideal wearcam would be one that can trace the iris movements of the eye; however, most available video material is taken from a static head-mounted camera, which can follow only the head movements and not the iris movements. There are several important differences in video quality between head-mounted video and video captured by a hand-held or mounted device. Head-mounted video is extremely jittery and noisy. Since the head movements are large in number, often small in size and highly erratic, a traditional cut detection algorithm will immediately fail on such data. In addition, the sequences are fairly long and gradually changing, so there are, strictly speaking, hardly any scene changes.

Also, most of the video contains irrelevant scenes in which the user is simply walking and not paying attention. Finally, there are many more illumination changes and haphazard camera motions in such video, which can completely confuse traditional image processing algorithms.

Defining episodes

The concept behind existing systems of video abstraction, e.g. VAbstract [10] or MoCA [1], that are used on commercially captured video is to determine scene boundaries, extract scenes that are of interest based on contrast and motion information, and then edit these to shorten them before assembling them back into a coherent structure. Defining scenes of interest is a non-trivial task. Lienhart et al. [11] suggest the use of colour histograms and colour coherence vectors for capturing colour information, the use of edge information for detecting shot boundaries, and the use of a frontal face detector [16] for finding people in videos. Unfortunately, such a scheme is not useful for head-mounted video captured for personal entertainment. Archiving such video requires us to automatically find episodes within the video that are of interest to the user, which must be separated from several hours of video in which the user did not find anything interesting even though the events were still recorded. Psychological evidence shows that important events are stored in human episodic memory. It is believed that our memory of past events (those which engrossed a sufficient amount of the user's attention) is naturally organised into sequences of episodes. An episode can thus be defined as an incidental event that gripped one's attention long enough to be considered an important activity for that person. The ability to faithfully capture the immediate environment and the user's involvement in it makes the visual information available from headcam videos a strong contender for detecting episodic events for that user. Episodes can last from a few frames to many frames and are distributed irregularly in the video, as shown in Fig. 1. We are motivated by the approach of Nakamura et al. [17, 18], which describes the concepts of active and passive attention that can be used to understand episodes in head-mounted video. These concepts are discussed below.

Active and passive attention

We can broadly classify attention as active or passive. When the user is paying attention to a target, the attention is termed "active"; if the user is not paying attention but just looking around, it is termed "passive". We are interested in finding video clips that correspond to "active" attention.

Fig. 1 General episode occurrence structure in a general video sequence: E_i (i = 1, 2, 3) is the i-th independent episode, and E_ij (ij = 21, 22, 31) are episodes present within a container episode

We can further define these two types of attention in the context of video analysis as follows:

– Active attention (episodes)
  – When a person is following an object of interest (whether the person or the object of interest is moving or stationary).
– Passive attention (redundant data)
  – When a person is looking vaguely and continuously at something (usually thinking).
  – When a person is vaguely looking around in a random manner without being interested in any particular object.
  – When a person is non-stationary (moving), with no objects of interest in the vicinity.

Figure 2 gives details of how the head moves during "active" and "passive" attention. We quote here the definitions provided by Nakamura et al. [17, 18]:

Active attention We often gaze at something and visually track it when it attracts our interest. If the target stays still, head motion will be as in Fig. 2a or b. If the person is moving, it will be as in Fig. 2c or d. This type of behaviour lasts a relatively short time, e.g. a few seconds.

Passive attention We often look vaguely and continuously at something around ourselves while working at a desk, engaged in conversation or resting. This type of behaviour does not always express a person's attention; however, this kind of scene can be a very good cue to remember where the person was. Head motions tend to be still, as shown in Fig. 2a, often with small movements such as nodding. The duration of such scenes is usually long, for example, 10 min.

Fig. 2 Head motion in paying attention (Nakamura et al. 2000a, b)

For our head-mounted video data, we find that a number of cues help us to detect episodes. These are shown in Fig. 3. Images a1/a2/a3 come from HP_video7 (school_sports1a_day.avi) and show an example of tracking an object of interest in a continuous frame sequence; images b1/b2 come from HP_video6 (car_boot_down2.avi) and image b3 from HP_video1 (black_boy_hill_down1.avi), which are examples of hand emergence as an important cue for finding an episode; images c1/c2/c3 come from HP_video2 (black_boy_hill_down2.avi) and are an example of an episode showing user focus on a single object in a continuous frame sequence; images d1/d2/d3 come from HP_video3 (bluepeter_down1.avi), where the frame sequence shows head movement to the right because of a shift in the user's attention. The frames shown in images a, b, c and d of Fig. 3 were taken arbitrarily from their respective videos from a single sequence, where each of them was a real episode. We can summarise the important cues for episode detection, as observed through Fig. 3, below:

– Tracking an object of interest (i.e. moving the head in parallel with the object motion, Fig. 3 a1/a2/a3).
– Emergence of the user's hands (indicating that something or someone important is present, Fig. 3 b1/b2/b3).
– Staying in focus (no movements shown by the head, Fig. 3 c1/c2/c3).
– A sudden body movement towards an interesting event (Fig. 3 d1/d2/d3).


Fig. 3 Different behavioural patterns during episodic events


– While on the move, continuously and regularly looking (moving the head) in the same direction (indicating the presence of an object of interest in that direction, Fig. 3 d1/d2/d3).
– A decision to move or to stop is another strong indicator of a change in attention (either the end of one episode or the start of a new one).

On the basis of the above observations, we define three key characteristics that help us identify an episode:

1. Head motion
   – No head movement (detecting the user's focus).
   – Movement of the head to the left and right (detecting object tracking, the presence of an object of interest in either direction, or a sudden body movement).
2. User moving or stationary (detecting the start or end of different episodes).
3. Hand emergence detection (detecting an important event that requires the user to extend his hands, which in turn brings them within the camera's view).

For the final cue, we applied a face detector developed at HP Research Laboratory but found that it generates an unacceptable number of false alarms. Hence, for our final system, we use the first two cues alone.

Methodology for episode detection

Our methodology for episode detection is shown in Fig. 4. There are two major components in the proposed system. First, a classification scheme separates frames in which the user is stationary from those in which she is moving (non-stationary). Second, a classification scheme identifies the head movement, i.e. whether the user is looking left, right or straight. These components are described in greater detail below. We can fuse the information from these two classification schemes and use it for final episode detection as follows:

Fig. 4 Our methodology for episode detection in head-mounted video

Small head or body movement in one direction (without returning) This is an episodic event in the stationary class, as it indicates either a change in the user's attention to a new event of interest or the following of a target (object of interest). On the other hand, it is a non-episodic event in the non-stationary class, since such a movement at most indicates a change in the direction of the user's motion and is not an indicator of the presence of an object of interest (a moving user who has been tracking a target returns to her original position).

Head showing patterns of left-right-left (LRL) or right-left-right (RLR) movements This is an episode in the non-stationary class, indicating the presence of a target demanding the user's attention (while moving, the user has to keep returning to the straight position so as to keep focusing on the path, resulting in RLR or LRL head patterns). However, it is just noise in the case of the stationary class, because in such situations the stationary class itself will be an episode and momentary head movements are only distractions.

Considerable head movement This is an episode in both the stationary and non-stationary classes, where considerable movement is defined as a sizeable shift of the head or body in comparison with the small head movements in one direction described above.
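The decision logic implied by these three rules can be summarised as a small look-up. The sketch below is our own illustration in Python, with hypothetical function and label names; it is not the authors' implementation.

```python
from typing import Optional

def is_episode(body_state: str, head_pattern: Optional[str]) -> bool:
    """Fuse the body-state and head-movement cues into an episode decision.

    body_state is 'stationary' or 'non-stationary'; head_pattern is one of
    'small_one_direction', 'lrl_rlr', 'considerable', or None when no pattern
    has been detected for the current stretch of frames (labels are ours).
    """
    if head_pattern == "considerable":
        # A sizeable head/body shift is an episode in both classes.
        return True
    if head_pattern == "small_one_direction":
        # A small movement in one direction (without returning) is an episode
        # only when the user is stationary.
        return body_state == "stationary"
    if head_pattern == "lrl_rlr":
        # LRL/RLR oscillation indicates an episode only while moving;
        # for a stationary user it is treated as noise.
        return body_state == "non-stationary"
    return False
```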


The episode is detected as a small sequence of frames in the video. Our final output appends 30 frames (past and future) to these frames so that we do not leave out any frames of the real episode. We now discuss the two classification schemes that distinguish frames in which the user is stationary from those in which the user is non-stationary, and that estimate the direction of head movement.

Stationary vs. non-stationary classification

The classification of a stationary vs. non-stationary user is performed using features extracted from motion analysis [19]. Figure 5 illustrates the three main stages of the motion estimation approach, which is in structure a fairly standard feature-based method. We first detect features in one image, then search for correspondences in the second image, and finally estimate the transformation from these correspondences. We chose a corner feature detector with non-maxima suppression to create, very efficiently, a number of high-texture features uniformly spread over the image. We then used a multiresolution sum-of-squared-differences method as the feature matcher, which has the desirable property of being robust to large image displacements while being comparatively computationally efficient. Finally, we used the robust fitting method of [20], which has proven resilient to the inevitable outliers produced by the previous stage.

Fig. 5 Overview of feature-based frame-to-frame motion estimation method

A transition matrix is calculated using affine-based dominant motion for each pair of consecutive frames of these videos. For the N videos, the resulting data contain one column of transition parameters for each pair of consecutive frames of each video. The transition matrix essentially estimates the transition (projection) of a pixel in the previous frame onto the next frame. Thus, for a pixel at position (x, y), the estimated position (x', y') in the next frame is given by

\[
\begin{pmatrix} x' \\ y' \\ w \end{pmatrix} =
\begin{pmatrix} a_x & a_y & t_x \\ b_x & b_y & t_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\]

and therefore

\[
x' = x\,a_x + y\,a_y + t_x, \qquad y' = x\,b_x + y\,b_y + t_y
\]

where a_x, a_y, b_x and b_y are the dominant motion transition factors that depend on the pixel position in the previous frame, and t_x and t_y are independent of the pixel positions in the previous frame and give the overall estimated translation in the x and y coordinates of the projected frame relative to the previous frame. The data so generated give the dominant motion per consecutive frame pair. We theorise that movement of the head will cause significant dominant motion in the video captured through a headcam. However, there will also be events where the movement of a large object in the headcam's field of view might be mistaken for head movement. The duration of such object movement, however, will be extremely short compared with head movement. Thus, the duration for which we observe dominant motion helps us decide whether we are observing a head movement or noise caused by the motion of a large object in the camera's field of view. We use two important features derived from this motion information, as described below.

Inverse region of interest (IROI)

Since the dominant motion data are relative (movement of pixels into the next frame with respect to the previous frame), we expect this relativity to contain important information. We use the dominant-motion-based translation matrices to project the current frame into the next frame. Every frame in this particular form of forward projection is obtained from all the translation matrices preceding it; in other words, we project each frame forward from the very first frame of that video. Figure 6 shows an example of this forward projection. The frames were taken from HP_video3 (bluepeter_down1.avi); the figure shows frame f(i), the next frame f(i+1) and the projected frame f'(i+1), such that the projected frame is a mosaic constructed from the pixels carried forward to the actual frame f(i+1). The black region in the projected image shows the pixels that have moved out, and it is this feature of the forward-projected mosaic that we use for stationary vs. non-stationary classification.
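As an illustration of the forward projection just described, the following sketch (ours, with assumed array shapes and names) applies the affine transition matrix to every pixel of a frame and measures the fraction of pixels that project outside the frame, i.e. the "moved out" pixels that appear black in Fig. 6 and that feed the IROI feature defined next.

```python
import numpy as np

def moved_out_fraction(T: np.ndarray, height: int, width: int) -> float:
    """Fraction of pixels that leave the frame under the affine transition T.

    T is the 3x3 matrix [[ax, ay, tx], [bx, by, ty], [0, 0, 1]] estimated for a
    pair of consecutive frames (a hypothetical input; shapes are assumed).
    """
    ys, xs = np.mgrid[0:height, 0:width]                        # pixel grid
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])
    projected = T @ coords                                      # x' = ax*x + ay*y + tx, etc.
    x_new, y_new = projected[0], projected[1]
    outside = (x_new < 0) | (x_new >= width) | (y_new < 0) | (y_new >= height)
    return float(outside.mean())
```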

Fig. 6 Example of how frames are projected using motion information

If we define the ROI (region of interest) as the portion of a frame that is carried forward to the next frame, the IROI can then be defined as the region of a frame that moved out during translation onto the next frame. It is computed as the ratio of the number of pixels that moved out of a forward-projected frame (with respect to all of the previous frames) to the total number of pixels in that frame:

\[
\mathrm{IROI} = \frac{\text{Number of pixels moved out during forward projection}}{\text{Total number of pixels in the frame}}
\]

IROI effectively combines the information contained in the translation matrices such that it can be taken as an overall representative of motion information, irrespective of the direction of the transition. This loss of directional knowledge, however, benefits us: the IROI feature retrieves only information about the motion of the head, and the direction does not matter. Figure 7 plots the IROI feature for sequences labelled stationary and non-stationary by a human expert (based on the ground truth for HP_video1), using 300 consecutive frames from each sequence. Clearly, high IROI values result when the subject is non-stationary, and separating the two classes is relatively easy using a simple threshold on the IROI feature. Head movements when a person is moving and when a person is not moving give different patterns. While moving, a user's head shows continuous oscillations (these oscillations depend on the terrain the person is moving on), whereas when stationary the head movements are not regular (there are periods where the head is in a static position, which is not possible when one is non-stationary).

Exploiting this characteristic difference, we divide the set of frames into the corresponding classes. We first need to determine the threshold on the IROI feature for classifying stationary and non-stationary users. We have found an optimal threshold value of 0.009 to provide an initial classification, such that frames with IROI greater than 0.009 represent a non-stationary user and are otherwise stationary:

\[
F'(n) =
\begin{cases}
1 \;(\text{Non-stationary}) & \text{if } \mathrm{IROI}(n) > 0.009 \\
0 \;(\text{Stationary}) & \text{if } \mathrm{IROI}(n) \le 0.009
\end{cases}
\]

where F'(n) is the new value corresponding to the threshold sensitivity for frame n. This gives us another set of data, which is used for further processing, as discussed later. Figure 8 shows that the threshold value used is a good indicator of a person moving or not moving; however, the amount of noise is an important consideration for good classification. The noise (circled in the figure) can be accounted for by the following reasons:

– A surge in IROI values crossing the threshold while in the stationary state: even though the amount of head movement is mostly below the threshold, values above the threshold result from unpredictable head movements (which could be due to an episodic event or simply occur during passive attention).


Fig. 7 A plot of the inverse region of interest for the first 300 frames of HP_video1

– A lowering of IROI values below the threshold while in the non-stationary state: this is the result of head oscillations, which depend either on the type of movement (walking, running, biking, driving etc.) or on the terrain the person is moving on.

We use a modified scheme to counter the effect of this noise on our classification. In order to class a frame as truly stationary, we look ahead roughly 50 frames (roughly 2 s of video). If the sum of the F'(n) values for the next 50 frames is less than a threshold (in our case 20, i.e. fewer than 40% of the look-ahead frames are non-stationary), then the current frame is classed as stationary; otherwise, it is classed as non-stationary. The results of this modified scheme are shown in Fig. 9 for 400 frames of the video sequence HP_video5 (car_boot_down1.avi).
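A minimal sketch of this two-stage labelling is given below, assuming a per-frame array of IROI values; the array and constant names are ours, not the paper's, and we adopt the convention that F'(n) = 1 marks a non-stationary frame.

```python
import numpy as np

IROI_THRESHOLD = 0.009   # stage-1 threshold on the IROI feature
LOOKAHEAD = 50           # look-ahead window, roughly 2 s of video
MAX_NONSTATIONARY = 20   # threshold on the look-ahead sum

def label_frames(iroi: np.ndarray) -> list:
    """Return a per-frame label, 'stationary' or 'non-stationary'."""
    # Stage 1: per-frame indicator F'(n); here 1 marks a non-stationary frame.
    f_prime = (iroi > IROI_THRESHOLD).astype(int)
    labels = []
    for n in range(len(iroi)):
        # Stage 2: look-ahead smoothing (the window simply shortens near the
        # end of the sequence; the paper does not specify this detail).
        window = f_prime[n:n + LOOKAHEAD]
        labels.append("stationary" if window.sum() < MAX_NONSTATIONARY
                      else "non-stationary")
    return labels
```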

Determining the direction of head movement

The classification of a stationary vs. non-stationary user is only the first step towards recognising episodes. In order to evaluate whether the user is gazing at something or tracking an object, it is important to estimate the direction of head movement. We can divide the video frames into three classes, depending on the direction of movement of the user's head in the current frame compared with the previous frame. These classes are left, straight and right. Our algorithm uses the translation value t_x in the x direction (horizontal head movement with respect to the person's torso) to detect the direction of head movement. A simple threshold-based filter working on the t_x values faithfully detects the three classes of head direction (left, straight or right) for a given frame n:

\[
G(n) =
\begin{cases}
(1, 0, 0)\ \text{Left} & \text{if } t_x > 10 \\
(0, 0, 1)\ \text{Right} & \text{if } t_x < -10 \\
(0, 1, 0)\ \text{Straight} & \text{otherwise}
\end{cases}
\]

The information from the function G(n) can be used further to understand head behaviour, as described below.

Small head or body movement in one direction (without going back) This is calculated by counting continuous head movements in one direction for up to 8 frames and checking the next 30 frames for an indication of whether or not the head moves back in the opposite direction.

Considerable head movement This is calculated by detecting continuous head movement in a single direction for up to 15 frames.


Fig. 8 IROI values for 400 frames of HP_video5 (car_boot_down1.avi), from frame 3600 to frame 4000. The straight line is the threshold value of 0.009, and the noise from this first phase of filtering is circled in the diagram. Note the higher frequency of noise occurrence in the stationary state compared with the non-stationary state

Fig. 9 Stationary vs. non-stationary classification results with the modified scheme. The graph compares per-frame ground-truth data with the predicted output

LRL or RLR motion This is detected by finding the trend in a plot of G(n) where the head direction indicates one direction (left/right) continuously, then the opposite direction (right/left), and later the first direction again. This is again done simply by counting and identifying such trends in the graph. Figure 10 shows the results of this threshold-based head direction detection on 3000 frames from HP_video7 (school_sports_day1a_down.avi); the white lines demarcate the three directional classes.


Fig. 10 a Output of our head motion direction indicator on HP_video8 (school_sports_day1b_down.avi). b Sample results on the frames

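To make the direction classifier G(n) and the counting rules above concrete, here is a hedged Python sketch; the run-length bookkeeping and the exact use of the 8-, 15- and 30-frame counts are our interpretation of the text, not the authors' code.

```python
from typing import List

def head_direction(tx: float) -> str:
    """G(n): classify a frame as left ('L'), right ('R') or straight ('S')."""
    if tx > 10:
        return "L"
    if tx < -10:
        return "R"
    return "S"

def classify_run(directions: List[str], start: int) -> str:
    """Classify the head-movement pattern beginning at frame index `start`."""
    d = directions[start]
    if d == "S":
        return "none"
    # Length of the continuous run in direction d.
    run = 0
    while start + run < len(directions) and directions[start + run] == d:
        run += 1
    if run >= 15:
        return "considerable"            # sizeable continuous movement
    if run >= 8:
        # Small movement in one direction: inspect the following 30 frames to
        # see whether the head comes back in the opposite direction.
        opposite = "R" if d == "L" else "L"
        lookahead = directions[start + run:start + run + 30]
        if opposite in lookahead:
            return "lrl_rlr"             # L-R-L / R-L-R style oscillation
        return "small_one_direction"
    return "none"
```

Frames labelled in this way, together with the stationary/non-stationary labels, drive the fusion rules sketched earlier.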

Data description

The raw video data used were provided by Hewlett Packard Labs UK. The videos were captured using a headcam. In total we have 11 videos (HP_video1 to HP_video11), but only 8 videos were analysed, owing to the time required to generate ground-truth information on a per-frame basis. The content of the videos is varied, as shown in Table 1. The variety of environments and the adequate length of the videos make this a general data set, well suited to evaluating the performance of the algorithm in varied real-life applications.

Table 1 A list of head-mounted videos used for analysis

Video name         File name                   Number of frames   Description of content
HP_video1 (V1)     black_boy_hill_down1.avi    28165              User walks down and shops for kitchen groceries
HP_video2 (V2)     black_boy_hill_down2.avi    28561              Shopping inside a kitchen grocery shop
HP_video3 (V3)     bluepeter_down1.avi         32910              User taking part in a bike rally
HP_video5 (V5)     car_boot_down1.avi          41386              User biking up to the car boot sale point
HP_video6 (V6)     car_boot_down2.avi          43655              Shopping and browsing in the car boot sale
HP_video8 (V8)     sports_day1b_down.avi       18294              User attending and watching a school sports day
HP_video9 (V9)     sports_day2_down.avi        18648              Continuation of V8
HP_video11 (V11)   wearable3.avi               35625              User walking around the HP Labs Bristol campus

For training and performance evaluation of the algorithms developed for the different stages, ground truth was generated manually. Episodes were marked by the same person who had recorded the videos. The total numbers of episodes identified in the ground-truth labelling process are as follows:

– 13 episodes for HP_video1 (black_boy_hill_down1.avi)
– 10 episodes for HP_video2 (black_boy_hill_down2.avi)
– 22 episodes for HP_video3 (bluepeter_down1.avi)
– 25 episodes for HP_video5 (car_boot_down1.avi)
– 16 episodes for HP_video6 (car_boot_down2.avi)
– 5 episodes for HP_video7 (school_sports_day1a_down.avi)
– 8 episodes for HP_video9 (school_sports_day2_down.avi)
– 11 episodes for HP_video11 (wearable3.avi)

Experimental details and results

Here we give our experimental results for the three stages of analysis: (a) determining whether the user is stationary or not, (b) estimating the head movement direction, and (c) predicting which frames indicate an episode.

Results on stationary vs. non-stationary classification

Our results show an average classification accuracy of 88.3%. The confusion matrices for all videos analysed are shown in Appendix 1.

Results on head direction classification

The full results are shown in Appendix 2. The average classification accuracy is 95.33% across all eight videos.

Results on episode detection

The full results for this analysis are shown in Appendix 3.

In this detection accuracy evaluation we have ignored the false positives, because we cannot obtain a count of non-episodes that were correctly detected as non-episodes (true negatives). The average accuracy of detecting episodes is 89.5%. We found that in some cases our episode detection algorithm does not work as well as we would like. We suggest that further work be conducted to resolve the following problems.

Multiple detection of episodes Our results suggest that, in most cases, we end up with more than one link to a single episodic memory (i.e. an episode is falsely split by our method into several episodes). Without further contextual knowledge in addition to motion analysis, it is hard to bridge the spurious gaps.

Noise caused by frantic motion Frantic motion can be described as continuous movement of the head in random directions without any regular pattern. It leads us to falsely classify a stationary event as a non-stationary event and impacts the accuracy of episode detection.

False positive detection We detect a number of false positives as a result of passive attention, where the user unknowingly repeats behavioural patterns that are generally performed only during an episodic event.

We propose a few solutions to these problems, which we expect can lead to the removal of most of the false positives:

– Colour histogram analysis can be used alongside other image features to check whether the spurious gaps that break episodes have the same properties as the frames before and after them.
– Frantic motion of the head/body can be treated as a separate class of motion (distinct from stationary/non-stationary), so that its random behaviour can be used to decide whether motion is frantic or falls under one of the other two classes.
– Other contextual cues, such as detecting people if the face/skin detection algorithm gives robust results, can be exploited.
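For concreteness, the per-episode detection rate used in this evaluation could be computed along the following lines; this is an illustrative sketch in which the interval representation and the overlap criterion are our assumptions, consistent with the fact that true negatives are undefined for this task.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_frame, end_frame), inclusive

def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two frame intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]

def detection_rate(ground_truth: List[Interval], detected: List[Interval]) -> float:
    """Fraction of ground-truth episodes that overlap some detected segment."""
    hit = sum(1 for gt in ground_truth if any(overlaps(gt, d) for d in detected))
    return hit / len(ground_truth) if ground_truth else 0.0
```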


Conclusions

We have proposed a novel approach to detecting episodes by understanding and exploiting the distinctive patterns of head and body movement during an episodic event. We have been able to recognise the relationship between different types of head motion and human behavioural patterns while paying attention to an object of interest. One of the most important findings has been that head movements during an episodic event differ between the stationary and non-stationary states. Using affine-based dominant motion to acquire the features for detecting the user's head direction and stationary/non-stationary state, we have been able to translate the behavioural information into two basic features, IROI and t_x. The cumulative information provided by these features has proven, in our research, sufficient and accurate enough for detecting episodes. Although the final algorithm gives us high reliability (89.5% accuracy) in episode detection, several problems remain open to investigation. In the short term, work in this area should therefore focus on using various image- and video-based methods to resolve these problems. Having established the feasibility of this method, further research should also concentrate on content- and semantic-based analyses of the episodes, so as to ultimately achieve a complete understanding of the information contained within them. We strongly believe that our approach, if combined with other relevant techniques (e.g. audio-based processing), will give us a better understanding, and thus better detection, of episodes. In addition to the application area tackled in this paper, the techniques proposed here for head direction detection and stationary/non-stationary classification can also be used in other fields (e.g. wearable computing) where users have to interact with their immediate environment using head-mounted devices.

Originality and contributions

This study presents a methodology for the reliable detection of episodes in head-mounted video. Episode detection is extremely important for video summarisation, and we are not aware of any other study that addresses the issue of reliable episode detection. This work was performed in collaboration with Hewlett Packard Research Labs in the UK on a large collection of video data that was manually indexed for episodes and motion analysis. The paper shows that the concept of active and passive attention can be exploited, through motion features, for episode detection. This work should inspire further research on head-mounted video analysis and on episode-detection-based video summarisation.

Appendix 1. Confusion matrices for stationary and non-stationary classification


Appendix 2. Confusion matrices for head movement direction classification

Appendix 3. Confusion matrices for episode detection (Note: "not applicable" has been inserted because we cannot have the value for non-episodes that were detected as non-episodes.)

References

1. Lienhart R, Pfeiffer S, Effelsberg W (1997a) Video abstracting. Commun ACM 40(12):55–62
2. Arman F, Depommier R, Hsu A, Chiu MY (1994) Content-based browsing of video sequences. In: Proceedings of the ACM international conference on multimedia, pp 97–103
3. Rorvig ME (1993) A method for automatically abstracting visual documents. J Am Soc Inf Sci 44(1):40–56
4. Taniguchi Y, Akutsu A, Tonomura Y, Hamada H (1995) An intuitive and efficient access interface to real-time incoming video based on automatic indexing. In: Proceedings of the ACM international conference on multimedia, San Francisco, pp 25–33
5. Tonomura Y, Akutsu A, Taniguchi Y, Suzuki G (1994) Structured video computing. IEEE Multimedia Mag 1(3):34–43
6. Yeung MM, Yeo BL, Wolf W, Liu B (1995) Video browsing using clustering and scene transitions on compressed sequences. In: Rodriguez AA, Maitan J (eds) Proceedings of SPIE, Multimedia Computing and Networking, San Jose, 2417:399–414
7. Zhang H, Low CY, Smoliar SW, Wu JH (1995) Video parsing, retrieval and browsing: an integrated and content-based solution. In: Proceedings of the ACM international conference on multimedia, San Francisco, pp 15–24

8. Zhang H, Smoliar SW, Wu JH (1995) Content-based video browsing tools. In: Rodriguez AA, Maitan J (eds) Proceedings of SPIE, Multimedia Computing and Networking, San Jose, 2417:389–398
9. Smith M, Kanade T (1995) Video skimming for quick browsing based on audio and image characterisation. Computer Science Technical Report, Carnegie Mellon University, Pittsburgh
10. Pfeiffer S, Lienhart R, Fischer S, Effelsberg W (1996) Abstracting digital movies automatically. J Vis Commun Image Represent 7(4):345–353
11. Lienhart R, Pfeiffer S, Fischer S (1997b) Automatic movie abstracting and its presentation on an HTML page. Technical Report TR-97-003, University of Mannheim, Germany
12. Saarela J, Merialdo B (1999) Using content models to build audio-video summaries. In: Proceedings of SPIE 3656, Storage and Retrieval for Image and Video Databases VII, pp 338–347
13. Lienhart R (1999) Abstracting home video automatically. In: Proceedings of ACM Multimedia 99 (Part 2), Orlando, FL, pp 37–40
14. Lienhart R (2000) Dynamic video summarization of home video. In: Proceedings of SPIE 3972, Storage and Retrieval for Media Databases 2000, January 2000, pp 378–389 (also Technical Report MRL-VIG99020, April 1999)

15. Mann S (1998) WearCam (the wearable camera): personal imaging for long-term use in wearable tetherless computer-mediated reality and personal photo/videographic memory prosthesis. In: Proceedings of the 2nd international symposium on wearable computers, pp 124–131
16. Rowley HA, Baluja S, Kanade T (1995) Human face detection in visual scenes. Technical Report CMU-CS-95-158R, School of Computer Science, Carnegie Mellon University, Pittsburgh
17. Nakamura Y, Ohde J, Ohta Y (2000a) Structuring personal experiences: analyzing views from a head-mounted camera. In: Proceedings of the IEEE international conference on multimedia and expo, New York, pp 1137–1140

18. Nakamura Y, Ohde J, Ohta Y (2000b) Structuring personal activity records based on attention: analyzing videos from a head-mounted camera. In: Proceedings of the international conference on pattern recognition, Barcelona, pp 222–225
19. Pilu M (2003) A method for real-time, robust frame-to-frame global motion estimation. HP Labs Technical Report HPL-2003-65, April 2003
20. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24:381–395
