UNIVERSITY OF MALTA
Faculty of Engineering
Department of Systems and Control Engineering

FINAL YEAR PROJECT B. ENG. (Hons)

Head Motion Tracking and Pose Estimation in the Six Degrees of Freedom

by Michael Sapienza

A dissertation submitted in partial fulfilment of the requirements for the award of Bachelor of Engineering (Hons.) of the University of Malta

COPYRIGHT NOTICE

Copyright in text of this dissertation rests with the Author. Copies (by any process) either in full, or of extracts may be made only in accordance with regulations held by the Library of the University of Malta. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) made in accordance with such instructions may only be made with the permission (in writing) of the Author. Ownership of the right over any original intellectual property which may be contained in or derived from this dissertation is vested in the University of Malta and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any agreement.


Abstract

Estimating where someone is looking may be of little difficulty to human beings. In fact, a glance at a person's head is enough to give an immediate indication of where their attention is being directed. Solving this complex problem for computers is an important step for human-computer interaction (HCI), which opens up new ways to control machines and to examine human behaviour. This work aims to track a human head and estimate its orientation in the six degrees of freedom, which is a fundamental step towards estimating a person's gaze direction. The only sensor is an un-calibrated, monocular web camera, which keeps the user completely free of any devices or wires. The system is designed around a feature-based geometrical technique which utilises correspondences between the eyes, nose and mouth to estimate the head pose. The principal processing steps include face and facial feature detection, in order to start automatically; tracking of the eye, nose and mouth regions using template matching; and estimating the 3D vector normal to the facial plane from the position of the features in the face. This non-intrusive system runs in real-time, starts automatically, and recovers from failure automatically, without any previous knowledge of the user's appearance or location. Global mean absolute errors of 3.03, 5.27 and 3.91 degrees were achieved for the roll, yaw and pitch angles respectively. The experiments and tests carried out showed that the system satisfies the stated objectives, and provided insight into areas for future development.


Acknowledgements

First and foremost, I would like to express gratitude towards my supervisor, Dr. Ing. Kenneth P. Camilleri, for his invaluable advice, enthusiasm, and for giving me the opportunity to work on such an inspiring research project. I would also like to thank my parents, my brother, my sister, and Sophie, for their unlimited patience, support and encouragement.


Contents
COPYRIGHT ........ II
ABSTRACT ........ III
ACKNOWLEDGEMENTS ........ IV
CONTENTS ........ V
LIST OF FIGURES ........ VII
LIST OF TABLES ........ XV
CHAPTER 1: INTRODUCTION ........ 1
CHAPTER 2: HEAD POSE ESTIMATION BACKGROUND ........ 4
2.1 FEATURE BASED MODELLING METHODS ........ 5
2.2 APPEARANCE BASED MODELLING METHODS ........ 7
2.2.1 Template Matching ........ 7
2.2.2 Linear Subspace Analysis ........ 7
2.2.3 Non-Linear Subspace Analysis ........ 8
2.2.4 Neural Networks ........ 9
2.3 3D HEAD MODEL BASED METHODS ........ 10
2.4 DISCUSSION ........ 12
2.5 CONCLUSION ........ 13
CHAPTER 3: HEAD POSE TRACKING ALGORITHM ........ 14
3.1 FACE & FEATURE DETECTION ........ 14
3.1.1 The Viola-Jones face detector ........ 16
3.1.2 Implementation of the Viola-Jones Algorithm ........ 19
3.2 FEATURE TRACKING ........ 21
3.2.1 Algorithm for feature tracking with NSSD template matching ........ 22
3.3 HEAD POSE ESTIMATION ........ 24
3.3.1 Preliminary Processing Steps ........ 25
3.3.2 The Facial Model ........ 27
3.3.3 Estimating the Facial Normal ........ 28
3.4 RECOVERY FROM FAILURE ........ 30
3.5 SOFTWARE IMPLEMENTATION ........ 31
3.5.1 Structured Programming ........ 31
3.5.2 System Architecture ........ 32
CHAPTER 4: RESULTS AND DISCUSSION ........ 33
4.1 PERFORMANCE MEASURES ........ 33
4.1.1 Mean absolute error ........ 34
4.1.2 Root mean squared error ........ 34
4.1.3 Standard Deviation ........ 34
4.2 GROUND TRUTH DATASET ........ 35
4.3 QUANTITATIVE PERFORMANCE EVALUATION ........ 36
4.3.1 Presentation of Results ........ 37
4.3.2 Global Results ........ 40
4.3.3 Roll ........ 41
4.3.4 Yaw ........ 44
4.3.5 Pitch ........ 49
4.3.6 X, Y, and Z translation ........ 52
4.3.7 Discussion ........ 54
4.4 QUALITATIVE EMPIRICAL EVALUATION ........ 55
4.4.1 Video 'my1' ........ 55
4.4.2 Video 'my2' ........ 57
4.4.3 Video 'my3' ........ 59
4.4.4 Simple Application ........ 61
4.4.5 Discussion ........ 62
4.5 SUMMARY ........ 66
CHAPTER 5: CONCLUSION ........ 67
5.1 PROJECT OBJECTIVES AND ACHIEVEMENTS ........ 67
5.2 FUTURE WORK ........ 68
REFERENCES ........ 70
APPENDIX A: QUANTITATIVE RESULTS ........ 74
APPENDIX B: INITIALISATION PARAMETERS ........ 93


List of Figures
Figure 1.1: The six degrees of freedom are made up of the 'x', 'y' and 'z' translations as well as the 'roll', 'yaw' and 'pitch' rotations. These parameters are to be extracted from a single camera that views the head. ........ 2
Figure 2.1: Head pose estimation using a feature based geometric model which finds correspondences between the eyes, nose and mouth features. ........ 5
Figure 2.2: The pose eigenspace of a head rotating from profile to profile [9]. ........ 8
Figure 2.3: Illustration of the training stage of a neural network. After training it would be able to classify an unknown head pose in an image. ........ 10
Figure 2.4: Tracking pose change measurements between video frames to estimate the rotation angles of the head using a 3D model based approach. ........ 11
Figure 3.1: The main processing steps include image acquisition, face detection, facial feature detection, feature tracking and subsequently pose estimation. ........ 14
Figure 3.2: Haar-like features. Their value is determined by subtracting the sum of pixels enclosed in the white region from the sum of pixels in the gray region. ........ 16
Figure 3.3: The value of the integral image at point (x,y) is the sum of all the pixels above and to the left of that point. ........ 17
Figure 3.4: The value of the integral image at 1 is the sum of the pixels in rectangle A. The value at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within D can be computed as 4+1-(2+3). ........ 18
Figure 3.5: The detection cascade. A series of classifiers are applied to each subwindow. If they are classified as 'face' they move on to a more complex classifier and if they are classified as 'non-face' they are immediately rejected. ........ 19
Figure 3.6: The detected head region is depicted using a red box. The feature ROIs are extracted directly from the face ROI, and the detection of a feature location is only performed in its respective ROI. Finally the results are combined and displayed. The blue, green and yellow circles represent the detected eyes, nose and mouth respectively. ........ 20
Figure 3.7: (a) Template image of left eye obtained by taking a snapshot around the detected location found from the Viola-Jones detector in the previous processing stage. (b) Region of interest image around last known location of left eye. (c) NSSD image result from matching the template image of the left eye to the region of interest around the last known location of the left eye. ........ 22
Figure 3.8: The Viola-Jones face and feature detection algorithm checks whether the person is in a frontal pose before moving on to the tracking algorithm. (a) Frontal head pose, the nose tip is within a region defined from the centroid of the eye and mouth features and therefore the system will go on to initialise the tracking stage. (b) Head pose is not frontal and the nose tip is out of the targeted region, therefore the system will not move on to initialise the tracking. ........ 24
Figure 3.9: Points 'Pt1' and 'Pt2' are two points in the image which can represent the eye locations. The distance 'd' between the points can be found by using Pythagoras' theorem. The angle between the two points can be found from the arctangent of the slope between the two points. ........ 25
Figure 3.10: Angle between two lines L1 and L2 with gradients m1 and m2 is given by tan θ = (m1 − m2) / (1 + m1·m2). ........ 26
Figure 3.11: The face model is composed of the distance ratios between the eyes, nose, and mouth locations of a typical face. The eyes and mouth form three points which define the facial plane. The symmetry axis is found by joining the midpoint between the eyes to the mouth. ........ 27
Figure 3.12: Finding the facial normal from the image by joining the nose base to the nose tip. The angles θ and τ can easily be extracted from the 2D image. ........ 28
Figure 3.13: The face centred coordinate system used to calculate the slant. The facial slant σ is the angle between the vector normal to the image plane, and the facial normal located along the 'z' axis of the facial plane. ........ 28
Figure 3.14: The left eye can be considered as lost if the distance between the left eye and the right eye, and the distance between the left eye and the nose is greater than a certain threshold. In that event, the feature can be recovered by exploiting the geometrical properties of the face in frontal pose. In this case, the face symmetry is used to recover the position of the left eye. ........ 30
Figure 3.15: The main processing algorithms are separated into distinct sub-programs which are brought together in the main program. ........ 31
Figure 3.16: The computational flow chart for the integrated "closed loop" system. Only the main processing steps of the integrated process are shown. ........ 32
Figure 4.1: Illustration of the "Flock of birds" 3D magnetic tracker setup. ........ 35


Figure 4.2: Image of 'jam' subject. ........ 36
Figure 4.3: Image of 'ssm' subject. ........ 36
Figure 4.4: Video frames from the sequence 'jam8'. The automatic initialisation stage can be seen in 'Frame 1', where the detected face region is represented by a red box, whereas the eyes, nose, and mouth are represented by blue, green, and yellow circles respectively. In the remaining frames, the feature locations are made known by drawing red boxes around them, and the pose is depicted using a drawing pin representation at the top left hand corner. ........ 37
Figure 4.5: Results for video 'jam8'. Graphs (a), (b), and (c) show the variation of the rotation angles with respect to the video frame number. The red solid line, which represents the estimated pose, should ideally follow the black solid line which represents the ground truth. In the second row of graphs, labelled (d), (e), and (f), the ground truth was plotted against the estimated pose angles in order to better visualise the difference between the two data sets. In the ideal case, when the estimated data is exactly equal to the ground truth, the data points should all lie on the line y = x. ........ 38
Figure 4.6: Illustration of the roll rotation. ........ 41
Figure 4.7: Graphical results from video 'jam1'. In graph (a), the roll value starts 'jumping' above and below the ground truth value as it exceeds 22 degrees and decreases below -30 degrees. This occurs since the fixed template image of the eye is not a very good representation of the eye under large roll rotations. ........ 42
Figure 4.8: Graphical results from video 'jam1'. In graph (a) the roll rotations were successfully tracked, even though large rotations were present. The maximum and minimum values of roll rotation are of +28 degrees and -24 degrees. At the positions where large rotations were present, the yaw and pitch angles were not tracked properly. Looking at the video frames carefully, one will notice that at large rotations the eye positions were incorrectly located at the eye corners, thus causing errors in yaw estimation, as seen in graph (b). ........ 43
Figure 4.9: Graphical results from video 'jam9'. In graph (a) the roll rotations were successfully tracked under clockwise rotations (negative values), and slightly underestimated for anticlockwise rotations (positive values). The maximum and minimum values of roll rotation are of +21 degrees and -31.5 degrees respectively. In the last frames of the sequence, the eye positions were incorrectly located at the eye corners, resulting in inaccurate yaw and pitch values. ........ 43
Figure 4.10: Frame 200 of the 'jam1' video. The eye centres are incorrectly located at the eye corners. This does not cause significant errors for the roll values, but it greatly affects the yaw and pitch estimation, especially when one of the eyes is incorrectly located. ........ 44
Figure 4.11: Illustration of the yaw rotation. ........ 44
Figure 4.12: Feature location confidence drops as the head rotates away from the image plane. Since the template image is fixed and extracted from a frontal face orientation, the minimum value of the NSSD result increases as the head rotates away from frontal pose. ........ 45
Figure 4.13: Graphical results of video 'jam5'. In graph (b) the yaw estimation can be seen to saturate at approximately ±20 degrees. This is also seen in graph (e), where the fitted line clearly shows an underestimation for the negative and positive yaw angles, especially at higher angles. ........ 46
Figure 4.14: Graphical results from video 'ssm8'. It can be seen that in graph (b), the yaw angle is greatly underestimated between frames 50 and 100. The yaw is only underestimated for the positive angles since these were larger than the negative angles and harder to track. Graph (e) shows that the negative rotations were tracked, whilst the higher positive angles were underestimated. From frame 165 onwards the tracking is lost, causing outliers which distorted the results, as seen in graph (d). Despite this setback, the track was recovered in the last frame. ........ 47
Figure 4.15: Frame 125 from the 'jam5' sequence. The head model assumes that the nose tip can be accurately located in the face image for accurate pose estimation. The yaw angle is underestimated since, under large yaw rotations, the nose centre tends to be a better match than the nose tip for the NSSD tracker. ........ 47
Figure 4.16: Graphical results from video 'ssm9'. The maximum and minimum values of successfully tracked yaw angles were registered in this video sequence. The maximum value is of +24 degrees, and the minimum value is of -29 degrees. ........ 48
Figure 4.17: Illustration of the pitch orientation. ........ 49
Figure 4.18: Graphical results for video 'jam6'. In graph (c) the pitch is correctly tracked but suffers from underestimation at positive pitch rotations between frames 75-115 and temporary overestimation at negative pitch rotations between frames 148-160. ........ 50
Figure 4.19: Graphical results for video 'ssm5'. In graphs (c) and (f) it is very clear that the pitch has been tracked at a constant offset value. It can also be seen that in frames 45-65 the system temporarily lost track and recovered in subsequent frames. ........ 51
Figure 4.20: Graphical results for video 'ssm6'. Similarly to the figure above, the pitch values were offset by a constant value, which can be seen clearly in graphs (c) and (f). ........ 51
Figure 4.21: (a) Frame 3 from the video 'ssm5'. (b) Frame 3 from the video 'ssm9'. The difference in the initial location of the mouth in video 'ssm5' (a) caused a significant offset in the pitch rotation angles. The location of the mouth was more accurate in video 'ssm9' (b), and this can be seen from the drawing pin representation depicted in the top left corner of the frames. In the ideal case there should be a perfectly spherical circle and a single point in the centre. ........ 52
Figure 4.22: Frames from video 'ssm2' where the subject displays translation in the 'z' direction. Even though the templates are of constant size, they are still matched correctly to the intended feature locations. This is important for the system to be of practical use. ........ 53
Figure 4.23: Graphical results for video 'ssm2' in which the subject undergoes scale variation. The system does not lose track of the features and still manages to estimate the pose within acceptable accuracy. ........ 53
Figure 4.24: Graphical results for 'x', 'y', and 'z' translations for video 'ssm2'. The 'x' and 'y' coordinates of the head are measured in image pixels, whilst the 'z' translation is measured as a scale factor relative to the initial distance from the camera. ........ 53
Figure 4.25: Video frames taken from the first sequence aimed at tracking head rotation about the 'x', 'y', and 'z' axes. ........ 56
Figure 4.26: Graphical results from video 'my1'. The red solid line represents the estimated angle in time. The black square boxes were drawn over the results in order to emphasise key regions of the graph where particular rotations are taking place. The blue square boxes represent regions where the tracker has failed, thus causing jump discontinuities in the estimated angles. In this case the failures are mainly due to excessive rotations. ........ 56
Figure 4.27: Video frames from the 'my2' sequence. This sequence was aimed at tracking head-shake, single feature occlusions and full re-initialisation capability when the tracker fails dramatically, for example due to a full face occlusion, shown in Frame 500. ........ 57
Figure 4.28: Graphical results for video 'my2'. The black boxes in graphs (b) and (c) were drawn around the head shaking sequences. The blue boxes represent regions where the system temporarily lost track of the pose due to occlusions. The green line towards the end marks the time after which the face has become completely occluded and the system automatically reinitialises. ........ 58
Figure 4.29: Video frames from sequence 'my3'. This video is 900 frames or 30 seconds long, and the subject takes the rotation in yaw and pitch to its limits to identify the angles at which the tracker fails to estimate the pose further. ........ 59
Figure 4.30: Graphical results from video 'my3'. The black boxes mark regions where the system has lost track, and the blue circles mark the points at which the maximum rotation angles were read. The green line just before frame 900 marks the time when the system was reinitialised, and fresh template images were taken. ........ 60
Figure 4.31: Typical screenshots whilst trying to shoot the blue balls with the black crosshair. ........ 61
Figure 4.32: Frame 800 from video 'my3'. This image illustrates the reason why downward pitch rotation is difficult to track for the NSSD template matching algorithm. ........ 63
Figure 4.33: Correcting pose measurements to cope with eyeball rotation. ........ 64
Figure 4.34: Graphical results from video 'my3', tracking the pose using the eye corners instead of the eye pupils. The black boxes mark regions where the system has lost track, and the blue circles mark the points at which the maximum rotation angles were read. ........ 65
Figure A.1: Video frames from 'jam1' sequence. ........ 75
Figure A.2: Graphical results from 'jam1' sequence. ........ 75
Figure A.3: Video frames from 'jam2' sequence. ........ 76
Figure A.4: Graphical results from 'jam2' sequence. ........ 76
Figure A.5: Video frames from 'jam3' sequence. ........ 77
Figure A.6: Graphical results from 'jam3' sequence. ........ 77
Figure A.7: Video frames from 'jam4' sequence. ........ 78
Figure A.8: Graphical results from 'jam4' sequence. ........ 78
Figure A.9: Video frames from 'jam5' sequence. ........ 79
Figure A.10: Graphical results from 'jam5' sequence. ........ 79
Figure A.11: Video frames from 'jam6' sequence. ........ 80
Figure A.12: Graphical results from 'jam6' sequence. ........ 80
Figure A.13: Video frames from 'jam7' sequence. ........ 81
Figure A.14: Graphical results from 'jam7' sequence. ........ 81
Figure A.15: Video frames from 'jam8' sequence. ........ 82
Figure A.16: Graphical results from 'jam8' sequence. ........ 82
Figure A.17: Video frames from 'jam9' sequence. ........ 83
Figure A.18: Graphical results from 'jam9' sequence. ........ 83
Figure A.19: Video frames from 'ssm1' sequence. ........ 84
Figure A.20: Graphical results from 'ssm1' sequence. ........ 84
Figure A.21: Video frames from 'ssm2' sequence. ........ 85
Figure A.22: Graphical results from 'ssm2' sequence. ........ 85
Figure A.23: Video frames from 'ssm3' sequence. ........ 86
Figure A.24: Graphical results from 'ssm3' sequence. ........ 86
Figure A.25: Video frames from 'ssm4' sequence. ........ 87
Figure A.26: Graphical results from 'ssm4' sequence. ........ 87
Figure A.27: Video frames from 'ssm5' sequence. ........ 88
Figure A.28: Graphical results from 'ssm5' sequence. ........ 88
Figure A.29: Video frames from 'ssm6' sequence. ........ 89
Figure A.30: Graphical results from 'ssm6' sequence. ........ 89
Figure A.31: Video frames from 'ssm7' sequence. ........ 90
Figure A.32: Graphical results from 'ssm7' sequence. ........ 90
Figure A.33: Video frames from 'ssm8' sequence. ........ 91
Figure A.34: Graphical results from 'ssm8' sequence. ........ 91
Figure A.35: Video frames from 'ssm9' sequence. ........ 92
Figure A.36: Graphical results from 'ssm9' sequence. ........ 92


List of Tables
Table 4.1: Mean absolute error, root mean square error, ground truth standard deviation and estimated pose standard deviation for video 'jam8'. ........ 39
Table 4.2: Average errors and standard deviations over all 'jam' videos. ........ 40
Table 4.3: Average errors and standard deviations over all 'ssm' videos. ........ 40
Table 4.4: Largest rotation angles registered in Boston University videos. ........ 40
Table 4.5: Roll results from 'jam' and 'ssm' videos. ........ 41
Table 4.6: Yaw results from 'jam' and 'ssm' videos. ........ 45
Table 4.7: Pitch results from 'jam' and 'ssm' videos. ........ 49
Table 4.8: Largest rotation angles registered in video 'my1'. ........ 57
Table 4.9: Maximum and minimum rotation angles for yaw and pitch. ........ 60
Table 4.10: Maximum and minimum rotation angles for yaw and pitch when tracking the eye-corners instead of the eye pupils. The reduction in rotation angles was only marginal, which suggests that it could be a more accurate method for tracking the head pose. ........ 64
Table A.1: Numerical results from 'jam1' sequence. ........ 75
Table A.2: Numerical results from 'jam2' sequence. ........ 76
Table A.3: Numerical results from 'jam3' sequence. ........ 77
Table A.4: Numerical results from 'jam4' sequence. ........ 78
Table A.5: Numerical results from 'jam5' sequence. ........ 79
Table A.6: Numerical results from 'jam6' sequence. ........ 80
Table A.7: Numerical results from 'jam7' sequence. ........ 81
Table A.8: Numerical results from 'jam8' sequence. ........ 82
Table A.9: Numerical results from 'jam9' sequence. ........ 83
Table A.10: Numerical results from 'ssm1' sequence. ........ 84
Table A.11: Numerical results from 'ssm2' sequence. ........ 85
Table A.12: Numerical results from 'ssm3' sequence. ........ 86
Table A.13: Numerical results from 'ssm4' sequence. ........ 87
Table A.14: Numerical results from 'ssm5' sequence. ........ 88
Table A.15: Numerical results from 'ssm6' sequence. ........ 89
Table A.16: Numerical results from 'ssm7' sequence. ........ 90
Table A.17: Numerical results from 'ssm8' sequence. ........ 91
Table A.18: Numerical results from 'ssm9' sequence. ........ 92
Table B.1: Model ratios for 'jam' and 'ssm' videos. ........ 93
Table B.2: Template initialisation parameters for 'jam' and 'ssm' videos. ........ 93


Chapter 1: Introduction

Computer vision is a field of study which aims to transform data acquired from images into new informative representations. It is easy to be deluded into thinking that this is a trivial task, since our intricate vision system effortlessly filters and classifies incoming signals into a perception of the immediate environment. An artificial vision system, however, receives nothing more than a stream of numbers from the camera or image file. The task of machine vision is therefore to construct an artificial system to analyse the noisy data from an image sensor, and to look for patterns, features, and qualities, with the intent of capturing useful information.

Estimating where somebody is looking may be of little difficulty to humans. In fact, a glance at a person's head is enough to give an immediate indication of where the attention is being directed. Solving this complex problem for computers is an important step for human-computer interaction (HCI), which opens up new ways to control machines and to examine human behaviour. The two major components of gaze direction are found from the orientation of the subject's head, and the orientation of the subject's eyes within their sockets [1]. Previous work [2] in the Department of Systems and Control Engineering focused on single eye gaze tracking under stationary conditions. In order to enable free head movement and enhance the accuracy of the user's gaze direction, a head pose tracking algorithm was required.

This work aims to track a human head and estimate its orientation in the six degrees of freedom, which is fundamental for estimating a person's visual focus of attention. This research also aims to investigate whether the head pose information alone is a sufficient indicator of a person's gaze direction. The six pose parameters hold all the information about the position of the head and its orientation. The freedom of movement includes the three dimensions of space ('x', 'y' & 'z' translation), as well as the rotation about each axis (roll, yaw & pitch). The main objective of this project is to estimate these six pose parameters from a single camera that views the head, as shown in Figure 1.1.

Figure 1.1: The six degrees of freedom are made up of the 'x', 'y' and 'z' translations as well as the 'roll', 'yaw' and 'pitch' rotations. These parameters are to be extracted from a single camera that views the head.
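As a purely illustrative aside, the six pose parameters described above can be collected in one small data structure. The type and field names below are assumptions made for this sketch, not identifiers from the dissertation's software; the units follow the conventions used later in the results chapter ('x' and 'y' in image pixels, 'z' as a scale factor, rotations in degrees).

```python
from dataclasses import dataclass


@dataclass
class HeadPose:
    """Six-degree-of-freedom head pose: translation plus rotation about each axis."""
    x: float      # horizontal translation (image pixels)
    y: float      # vertical translation (image pixels)
    z: float      # depth, e.g. a scale factor relative to the initial camera distance
    roll: float   # rotation about the camera's optical axis (degrees)
    yaw: float    # rotation about the vertical axis (degrees)
    pitch: float  # rotation about the horizontal axis (degrees)


# Example: a roughly frontal pose with a slight head turn to the left.
pose = HeadPose(x=320.0, y=240.0, z=1.0, roll=0.0, yaw=-15.0, pitch=0.0)
```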

Since the only sensor to be used will be a single webcam, the user will be completely free of any devices or wires. Furthermore, there should not be any constraints on the movement of the user, and no need to use cosmetics to provide enhanced facial features. This non-intrusive system would preferably run in real-time, start automatically, and recover from failure automatically, without any previous knowledge of the user's appearance or location. In order for the software to be of widespread use, it should be cheap to implement and capable of running on old hardware. Moreover, it should not require any camera calibration and should need no expert or skilled operators. In the case of stereo vision, precise alignment and calibration of the cameras are necessary, and these would need to be re-inspected regularly by competent persons. Another important consideration is the accuracy of the tracker and its robustness towards different lighting conditions. Finally, the system should be able to track a person's pose at a reasonable distance, without any camera calibration, in much the same manner as humans do naturally.


The range of prospective applications greatly justifies the pursuit of this challenging research. Consider the scenario of a pilot in a cockpit. The dials and critical information such as altitude and air pressure are projected on a head-up display (HUD). The pilot would always want that information in view, even when looking in another direction. If the pilot's head pose is known, it would always be possible to keep the projected information in sight. Consider also the scenario of a businessman launching a new product. How is this product performing on the market? Picture these products in a shop window whilst people are passing by in the street. Knowing which products people spend most time looking at, or studying patterns of how people look at a shop window, would enable a businessman to get feedback on which products are attracting most attention, and to strategically place the items in the limited space of the window. This system can also be applied in face recognition, security and operator alertness monitoring. In addition, it is a starting point for head gesture recognition, facial expression analysis, and the study of human behavioural patterns. Other areas of applicability are found in virtual reality and in creating new ways of interacting with games. Ultimately, if a robot is to keep eye contact with its human friend, it must first find the user's head pose.

The outline of the dissertation is as follows. In the next chapter, the research related to head pose estimation will be discussed.

Apart from containing the necessary background knowledge associated with this work, a clear research direction fulfilling the desired objectives will emerge. Chapter 3 will contain the algorithms necessary to construct the head pose estimation system, together with a practical framework for its implementation. The fourth chapter details the performance measures and experiments carried out in order to determine the performance of the developed system. The quantitative and qualitative results are discussed in this chapter and used to evaluate the head tracking and head orientation estimation. The final chapter will conclude the dissertation and discuss ideas and pathways for future work.


Chapter 2: Head Pose Estimation Background

As a consequence of the limited processing capability of the brain, not all sensory inputs reach the level of consciousness required by awareness. The filtered set of sensory inputs usually stems from a source of interest, and in most cases, this can be determined by the gaze direction of a person. Although the eyes are the primary source for a person's direction of attention, they cannot be easily tracked in low resolution images, or without using invasive hardware. In the absence of this information, the head orientation is typically a very reliable indicator of attention direction.

There are various active areas of research in head tracking and head pose estimation. Some systems rely on wearable hardware devices which usually give very accurate results. The downside is that they are often cumbersome to wear and very expensive. In this project, we are only interested in non-contact vision based systems made up of one or more cameras, which will provide a practical and economical solution for estimating head pose information.

Head pose estimation is a difficult task because, in real life, head appearance as seen from an image capture device is subject to various changes. For example, when a person moves closer to or away from the camera, the head size changes. Additionally, when a person changes orientation with respect to the camera, some parts of the head become visible, whilst others become occluded. The head appearance also changes when the level of illumination varies, since different parts of the head do not reflect light in the same way. One must also consider the head appearance differences from person to person. Features such as hairline, moustache, eyebrows and facial hair all vary considerably in different people. Even less variant features such as the eyes, nose and mouth display differences. Moreover, some features may become occluded by hats, glasses, or cosmetics.

Vision based head orientation estimation is usually classified into three general categories. These are feature based, appearance based, and 3D head model based approaches. Some methods also combine multiple techniques in order to achieve a more robust solution [3].


2.1 Feature Based Modelling Methods

Feature based methods estimate the head pose by modelling relationships between a number of uniquely identifiable features on the face image, such as the eyes, nose and mouth, as seen in Figure 2.1. By finding geometrical correspondences between the facial landmark points in the image and their respective locations in a head model, the 3D rotation and translation of the head can be estimated. By locating the corners of the eyes, the corners of the mouth, and the tip of the nose, the facial symmetry axis can be found by drawing a line from the midpoint between the eyes to the midpoint between the mouth corners. Assuming the face to be a plane in 3D space, the position of the nose tip in the image can be used to estimate the vector which is normal to the facial plane, and from which the pose parameters can be estimated [1]. This technique also assumes fixed ratios between facial features, and a weak perspective camera model. The facial normal can also be estimated from planar skew symmetry, and a coarse estimate of the nose position [1]. Although simple in nature, this system achieves credible head orientation estimates at a very low computational cost.

Figure 2.1: Head pose estimation using a feature based geometric model which finds correspondences between the eyes, nose and mouth features.


Head pose can also be estimated by using statistical analysis of the face structure, together with the relative distances between the inner and outer corners of the eyes, the nose tip, and the focal length of the camera [4]. The yaw is determined from the difference in size between the left and right eye, the roll can be found by calculating the angle of the line joining the four eye corners, and the pitch is determined by comparing the length of the nasal bridge to anthropometric data. This technique, however, suffers from large errors in estimated pose for near-frontal views, called degenerative angles, where a very high precision of feature location is needed to accurately estimate the head pose.

The head pose estimation problem can also be solved by identifying the direction of the face as the average of the plane created by three features on the face. The position and orientation of the head can be computed from the 3D coordinates of three feature points seen in a 2D image, provided that the inter-point distances are known [5]. This correspondence problem between an image and a 3D target is known as the Perspective-3-Points problem, and can be used to provide a measurement of the head rotation with respect to the camera.

In other systems, a coarse-to-fine approach is used, where an initial rough estimate of the pose is used to initialise a more accurate model. The initial estimate could be taken from analysing how feature appearances change as the head rotates [6], or by analysing the feature appearances and separating them into different pose classes [7]. A more precise estimate is then found by using geometric calculations on the extracted features.

All the above mentioned techniques fulfil the six degrees of freedom prerequisite within various degrees of accuracy, depending on their respective underlying techniques, assumptions and approximations. A challenging task in salient feature based methods is the actual detection and tracking of the facial features. Since the pose estimation relies on the location of the features in the image, the accuracy of the feature detection and tracking steps is critical. This criterion is usually attained by increasing the image resolution, even though this adds to the computational load. The main disadvantage of this method is that the tracking process fails when facial features move out of view because of occlusion or large rotations of the head. Moreover, these approaches are sensitive to background interference, and care needs to be taken since the features change appearance as the face changes pose. On the other hand, feature based models can potentially yield accurate pose measurements by tracking only a few facial features. Furthermore, real-time performance is easily reached in the majority of implementations, even on modest hardware.
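To make two of the geometric cues mentioned above concrete, the short sketch below computes the roll angle from the line joining the eye corners, and a crude, unitless yaw indicator from the apparent left/right eye widths. It is only an illustration of the underlying idea, not the actual procedure of [4]; all function names and the example coordinates are invented here.

```python
import math


def roll_from_eye_corners(left_outer, left_inner, right_inner, right_outer):
    """Roll: angle of the line joining the two eyes (image y grows downwards)."""
    # Average the corner pairs to get one point per eye, then take the line angle.
    lx = (left_outer[0] + left_inner[0]) / 2.0
    ly = (left_outer[1] + left_inner[1]) / 2.0
    rx = (right_inner[0] + right_outer[0]) / 2.0
    ry = (right_inner[1] + right_outer[1]) / 2.0
    return math.degrees(math.atan2(ry - ly, rx - lx))


def yaw_cue_from_eye_widths(left_outer, left_inner, right_inner, right_outer):
    """Crude yaw indicator: the eye nearer to the camera appears wider."""
    left_width = math.dist(left_outer, left_inner)
    right_width = math.dist(right_inner, right_outer)
    return (left_width - right_width) / (left_width + right_width)


# Hypothetical eye-corner coordinates in pixels.
corners = ((100, 122), (130, 120), (170, 118), (200, 116))
print(roll_from_eye_corners(*corners))
print(yaw_cue_from_eye_widths(*corners))
```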

2.2 Appearance Based Modelling Methods Appearance based modelling techniques represent an object as a collection of images from multiple view-points. Contrary to feature based approaches where facial features need to be located in the image, appearance based models need no facial landmark detection. Instead, head pose is estimated using the entire detected facial region. Classification of head pose is usually carried out by training the model on a set of labelled images of known pose. The current unknown head pose is then compared to the training images using some similarity criteria to find the most faithful match. 2.2.1 Template Matching In this technique, templates representing the head from various orientations are used to determine the head pose. The queried image is compared to various labelled templates to determine which one is its closest match. Niyogi et al. [8] developed a real time head pose tracking system that made use of head templates of different people from various orientations, organised in a tree structure. By finding the image in the tree which most closely matches the current image, the head pose can be determined. Its main advantage is its simplicity; however it is memory intensive and matching has an ever increasing computational cost as more templates are added to the exemplar set. Moreover, without the use of some interpolation method, they are only capable of estimating discrete pose locations. 2.2.2 Linear Subspace Analysis Subspace analysis aims to find a low-dimensional representation of high-dimensional observations such as images.

An example of a linear dimensionality reduction technique is Principal Component Analysis (PCA), which represents the face as a linear combination of weighted eigenvectors, also known as eigenfaces. Pose changes from a continuous face rotation in depth form a smooth curve in the pose eigenspace, as can be seen in Figure 2.2.

Figure 2.2: The pose eigenspace of a head rotating from profile to profile [9].

The pose of a novel face can therefore be estimated by projecting it into the pre-calculated subspace and finding the most similar eigenvector [9], [10]. In this way, computation can be made simple by avoiding the need to build 3D models or to perform explicit 3D reconstruction.
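A minimal sketch of this project-and-match idea is given below, with the eigenspace computed by a plain SVD over a set of training face images labelled with their yaw angle. The array shapes, variable names and the nearest-neighbour matching rule are assumptions made for illustration only, not the procedures of [9]-[11].

```python
import numpy as np


def fit_pose_eigenspace(train_images, n_components=10):
    """train_images: (N, H*W) array of flattened, cropped face images of known pose."""
    mean = train_images.mean(axis=0)
    centred = train_images - mean
    # Rows of vt are the principal directions ("eigenfaces") of the training set.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components]
    train_coords = centred @ basis.T          # training images expressed in the eigenspace
    return mean, basis, train_coords


def estimate_yaw(query, mean, basis, train_coords, train_yaw):
    """Project a query face into the eigenspace and return the yaw of its nearest neighbour."""
    coords = (query - mean) @ basis.T
    nearest = np.argmin(np.linalg.norm(train_coords - coords, axis=1))
    return train_yaw[nearest]


# Toy example with random arrays standing in for cropped face images.
rng = np.random.default_rng(0)
train = rng.random((50, 24 * 24))
yaw_labels = np.linspace(-45, 45, 50)
mean, basis, coords = fit_pose_eigenspace(train)
print(estimate_yaw(rng.random(24 * 24), mean, basis, coords, yaw_labels))
```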

Furthermore, view-based methods do not directly encode prior knowledge of 3D shape, and can be learned from a labelled set of images. This approach is used by Yucheng et al. [11], but the images are pre-processed using a Gabor filter before being transferred into the Gabor-eigenspace, which leads to more compact pose clustering. This attempts to enhance pose information and to eliminate distractive information like variable face appearance or changing environmental illumination.

2.2.3 Non-Linear Subspace Analysis

Recent developments in non-linear manifold analysis provide more flexibility and modelling power to analyse face pose manifolds. Face appearance under varying pose and illumination causes a highly non-linear and complex manifold in the image space. Since linear subspace analysis is an approximation of this non-linear manifold, it does not have sufficient modelling capacity to preserve the variations of the face manifold, and thus achieves less discriminating power to distinguish between poses. A kernel machine learning based approach (KPCA) can be used to extract non-linear features of face images for multi-view face detection and pose estimation [12]. This provides a lower dimensional space in which the distribution is better suited to modelling faces. Another non-linear alternative to the linear subspace method for manifold representation is proposed by Raytchev et al. [13].

This Isomap-based approach aims to provide a lower dimensional and more faithful subspace embedding for view representation. Pose estimation based on LLE (locally linear embedding) was proposed by Yun et al. [14], where an appearance based strategy for head pose estimation was implemented using supervised Graph Embedding (GE) analysis. Experimental results showed that, even using small training sets, GE achieved higher head pose estimation accuracy with more efficient dimensionality reduction than the existing methods.

Criticism of the subspace technique includes the large amount of machine training and learning procedures required to produce good reconstructions; moreover, since an input head image has to be projected on all eigenfaces, the computational cost of the projection becomes important. Furthermore, it is highly sensitive to small changes in face position and image scaling, and its accuracy depends strongly on the amount of test data. On the other hand, these techniques are inherently immune to accuracy drift, but also ignore highly useful temporal information between image frames that could improve estimation accuracy [3]. Most of the subspace techniques presented neglect the full 6 DOF, and concentrate mainly on cropped face images which are already centred, thus overlooking important issues for practical implementations.

2.2.4 Neural Networks

A neural network approach for real time head orientation estimation was proposed by Liang et al. [15]. Two neural networks were trained to approximate the functions that map an image of a head to the orientation of the head. An illustration of the training stage can be seen in Figure 2.3, where Xr, Yr and Zr are the rotations about the X, Y and Z axes respectively. Head pose can be estimated from cropped images of a head using a Multi Layer Perceptron (MLP) in different configurations, where output nodes can correspond to discrete pose classes.

Figure 2.3: Illustration of the training stage of a neural network. After training it would be able to classify an unknown head pose in an image.

Although this approach is very fast, and can work on low resolution imagery, it only provides a coarse estimate of pose at discrete orientations. Another disadvantage is that since it relies on cropped labelled faces for training, it is prone to error from poor head localization.
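As a toy illustration of such an image-to-orientation mapping (and not the networks of [15]), the snippet below trains a small multi-layer perceptron to regress the three rotation angles from flattened, cropped head images. The data here is random stand-in data and every name is made up; it only shows the shape of the learning problem.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data: 200 flattened 32x32 head crops with known rotation angles.
rng = np.random.default_rng(1)
images = rng.random((200, 32 * 32))
angles = rng.uniform(-30, 30, size=(200, 3))   # columns: roll, yaw, pitch (degrees)

# A small MLP approximating the image -> orientation mapping.
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
net.fit(images, angles)

# Predicted (roll, yaw, pitch) for a new cropped head image.
print(net.predict(images[:1]))
```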

2.3 3D Head Model Based Methods This modelling approach uses pre-computed 3D parameters of the head, and is based on the tracking of salient points, features, or 2D image patches. Prior knowledge of the human head is used in the model construction. To estimate the head pose, one would need to track the relative movement of the head between consecutive frames of a video sequence. Head translation and rotation are usually found by calculating the small pose shifts between each frame [16], as shown in Figure 2.4.


Figure 2.4: Tracking pose change measurements between video frames to estimate the rotation angles of the head using a 3D model based approach.

The head can be modelled as a texture mapped rigid 3D cylinder model, and is tracked by using image registration in the texture map image [17]. This is done to increase the tracking robustness of out-of-plane rotations, which simpler planar models have difficulty tracking. A more elaborate model was used by Zivkovic et al. [18], where a triangular mesh was generated to approximate the 3D geometry of the human head. An individualised 3D model was also used by Ruigang et al. [19], where a stereovision camera was used to improve the robustness of the tracking. The 3D model based technique was taken a step further by making use of a generic deformable 3D head model [20]. This enables the tracker to determine, for each frame of video, the model parameters which best match the position of the face and its shape in that particular frame. The primary advantage of this approach is that the head can be tracked with high accuracy, by calculating the small pose shifts between each frame. The downside is that it requires an accurate initialisation of position and pose, and this is usually done manually. The majority of these models work using differential registration algorithms, which are known for their user-independence and high precision for short time-scale estimates of pose change. However, they are typically susceptible to long time-scale accuracy drift due to accumulated uncertainty over time [3].
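The drift problem can be seen with a few lines of simulation: if each frame-to-frame increment carries a small random error, integrating the increments makes the accumulated estimate wander away from the true angle over time. The numbers below are arbitrary and serve only to illustrate the effect described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames = 900                                    # e.g. 30 s of video at 30 fps
true_increments = np.zeros(n_frames)              # head actually held still
measured = true_increments + rng.normal(0.0, 0.2, n_frames)  # 0.2 deg error per frame

true_yaw = np.cumsum(true_increments)             # stays at 0 degrees
estimated_yaw = np.cumsum(measured)               # random-walk drift grows over time

print(f"drift after {n_frames} frames: {estimated_yaw[-1] - true_yaw[-1]:+.1f} degrees")
```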

2.4 Discussion The various techniques used to extract head pose information from images all have their strong and weak points. Feature based methods are fast, but lose track when the facial features disappear from view. Feature detection and tracking are critical to the accuracy of the pose estimation results. On the other hand, feature based models can easily run in real time, and yield accurate pose measurements by tracking only a few facial features. Appearance based models are popular since they can easily work on low-resolution images, and work on the whole image of the detected facial region. However, a large amount of training data is required for good pose estimation on unseen images. Additionally, such approaches tend to be sensitive to different illumination conditions, since these affect the appearance of the facial images. The 3D model based approach has the potential to provide accurate results, at the expense of a complex head model and a significant amount of computer resources. This technique also requires accurate initialisation and suffers from error accumulation over time. Certain models such as user specific and 3D morphable models have very precise pose estimation potential. However, these approaches require a relatively high resolution that could slow down real time performance. In most present head pose estimation systems a trade-off exists between the model complexity, the accuracy of pose estimation and the computation time. It can clearly be seen that current head modelling techniques do not satisfy all desirable criteria on their own. Despite this, a feature based modelling approach would allow the 6 DOF to be tracked at a relatively good accuracy whilst achieving real time performance, thus fulfilling the primary objectives of this project.

By using a simple head model composed of a few face features, other key objectives such as the use of an uncalibrated camera, automatic system initialisation and user invariance can also be achieved.


2.5 Conclusion The primary objectives of this project demand the full 6 DOF, real-time performance, good accuracy, and user invariance at an economical cost. A full 3D head model will be too taxing on the processor, and would not allow other software to run simultaneously. Appearance based techniques tend to depend heavily on the training data to acquire sufficient prior knowledge of the model and to make the system capable of handling faces in general. Keeping the primary objectives in mind, it can be seen that a simple feature based modelling approach composed of a few face features will most likely satisfy the requirements.


Chapter 3: Head Pose Tracking Algorithm

The chosen pose estimation technique is based on work done by Gee and Cipolla in 1994 [1]. By locating the eyes, the corners of the mouth, and the tip of the nose, the facial symmetry axis can be found by drawing a line from the midpoint between the eyes to the midpoint between the mouth corners. Assuming the face to be a plane in 3D space, the position of the nose tip in the image can be used to estimate the vector which is normal to the facial plane, and from which the pose parameters can be estimated. This technique also assumes fixed ratios between facial features, and a weak perspective camera model. Although simple in nature, this system achieves credible head orientation estimates at a very low computational cost. Due to the nature of the feature-based geometrical method, several processing steps are required, as can be seen in Figure 3.1.

Figure 3.1: The main processing steps include image acquisition, face detection, facial feature detection, feature tracking and subsequently pose estimation.

3.1 Face & Feature Detection
Face detection is the capability of a computer to determine the location of one or more faces in an image, and forms the first step towards automatic tracking initialisation. It is performed prior to facial feature detection since knowing the location of the face limits the area in which the features can be found. The chosen approach must be invariant to race and ethnicity, and cater for various appearances of faces. No prior information about the user's appearance should be needed, as this would significantly limit the applicability of the system.
Various methods exist for extracting the location of a face in an image. These include skin colour methods that look for the largest connected region of skin coloured pixels [21]. The facial features are then extracted by assuming they are the darkest regions that satisfy certain geometrical constraints. Feris et al. [22] also used a skin colour model and included template matching to aid in finding the feature locations. Colour alone is not a reliable source of information to detect a face, since noise in the colour images and varying illumination conditions can lead to inaccurate detections. Therefore skin colour face detection can be combined with edge detection, motion detection, background subtraction and ellipse fitting in order to achieve the desired robustness [23], [24]. Processing colour is much faster than processing other facial characteristics, with the added advantage that colour is orientation invariant under certain lighting conditions and robust to geometric variations of the face pattern [22]. However, the possible presence of skin coloured non-face objects in the scene can confuse the detector and produce unwanted results. Furthermore, although colour is good for detecting and tracking faces, it is not so useful for detecting the facial features such as the eyes, nose and mouth reliably, so other techniques must be used, such as thresholding under the assumption that the facial features are the darkest patches in the face image. Since in our application detection precision is critical, a more reliable approach is necessary.
The chosen face and facial feature detection algorithm is based on the well known Viola-Jones algorithm [25]. This technique enables very fast object detection, and although it is typically used for face detection, it can be extended to other facial features [26] and to most rigid objects that have distinguishing views. The algorithm has been shown to process images very quickly, whilst at the same time achieving high detection rates and low false positive rates. (A false positive detection occurs when a feature is located in a place where it should not have been.) It involves preparing positive images (approx. 4,000) and negative images (approx. 10,000) of the desired object, and training a classifier on them using a boosting technique. The purpose of the next section is to introduce the concepts used in the algorithm. Since this work is not directly focused on face detection, only a brief description of the ideas will be given.


3.1.1 The Viola-Jones face detector
The paper by Viola and Jones [25], published in 2004, introduced a novel technique to detect faces in real time from grey scale images with a very high detection rate. The paper contributed three main techniques. The first is a new image representation called the integral image, which allows very fast feature evaluation. The second is the use of AdaBoost to select a small number of important features from a large number of potential features, and the third is a method of combining successively more complex classifiers in a cascade structure, which dramatically increases the speed of the detector by focusing on promising regions of the image [25].
The face detection algorithm classifies images based on the value of simple Haar-like features, as shown in Figure 3.2. These form the tools which will be used by the boosted classifiers. Basing the system on these simple features makes the learning process easier and enables much faster operation when compared to a pixel-based system. The value of these features is the difference between the sums of the pixels lying in the white and gray regions. The total set of possible rectangular features is very large and would require a simple yet excessive amount of computation.

Figure 3.2: Haar-like features. Their value is determined by subtracting the sum of pixels enclosed in the white region from the sum of pixels in the gray region.


Figure 3.3: The value of the integral image at point (x,y) is the sum of all the pixels above and to the left of that point.

To be able to compute the Haar-like features very rapidly and at many scales, Viola and Jones used an intermediate image representation called the integral image. This is found by making each pixel equal to the sum of all pixels above and to the left of it, as can be seen in Figure 3.3. The integral image is calculated using:

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')    (3.1)

Where ii(x,y) is the integral image and i(x,y) is the original image.

This allows for the calculation of the sum of all pixels inside any given rectangle using only four values. These values are the pixels in the integral image that coincide with the corners of the rectangle in the input image. Once computed, it allows the calculation of the Haar-like features to be carried out at any scale in a constant amount of time.


Figure 3.4: The value of the integral image at 1 is the sum of the pixels in rectangle A. The value at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within D can be computed as 4+1-(2+3).
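As an illustration of how the integral image of Equation 3.1 and the four-corner rule of Figure 3.4 can be implemented, the following C sketch builds the integral image of an 8-bit, single-channel image and evaluates a rectangle sum with four look-ups. It is a minimal reconstruction for illustration, not the OpenCV or project code.

/* Sketch only: builds the integral image of Equation 3.1 for an 8-bit,
 * single-channel image stored row by row, and evaluates a rectangle sum
 * with the four-corner rule of Figure 3.4 (D = 4 + 1 - (2 + 3)). */
void build_integral(const unsigned char *img, long *ii, int w, int h)
{
    for (int y = 0; y < h; y++) {
        long row_sum = 0;
        for (int x = 0; x < w; x++) {
            row_sum += img[y * w + x];                 /* cumulative sum of row y */
            ii[y * w + x] = row_sum + (y > 0 ? ii[(y - 1) * w + x] : 0);
        }
    }
}

/* Sum of all pixels in the rectangle with top-left (x0, y0) and
 * bottom-right (x1, y1), both inclusive. */
long rect_sum(const long *ii, int w, int x0, int y0, int x1, int y1)
{
    long A = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;  /* corner 1 */
    long B = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;                  /* corner 2 */
    long C = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;                  /* corner 3 */
    long D = ii[y1 * w + x1];                                       /* corner 4 */
    return D + A - B - C;
}

The value of a Haar-like feature is then simply the difference between two or three such rectangle sums.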

Despite the advantages of the integral image and the optimised feature selection, a large number of features still need to be computed. When the program is running, search regions, or sub-windows, of different sizes are swept over the original image. Within any image sub-window the total number of Haar-like features is far greater than the number of pixels. In order to ensure fast classification, an efficient learning process must be employed to exclude a large majority of the available features and focus on a small set of critical features. A variant of AdaBoost is used to select the features and to train the classifier. AdaBoost is a machine learning boosting algorithm capable of constructing a strong classifier through a weighted combination of weak classifiers. A weak classifier can be seen as a 'rule of thumb' that raises the classification performance to just above random guessing.
The Viola-Jones algorithm uses a cascade of classifiers shaped in a tree structure, which achieves increased detection performance while reducing computation time. The cascaded classifier is composed of stages, each containing a strong classifier. At each stage, a given sub-window is classified as either non-face or face. When a sub-window is classified as non-face, it is immediately discarded; if it is classified as face, it passes on to the next stage. If a region in an image gets a positive classification, it is evaluated by a second, more complex classifier. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent stages eliminate further negatives but require additional computation, since more complex classifiers are used. A simple block diagram of this process is illustrated in Figure 3.5. The cascade aims to reject as many negative sub-windows as possible in the early stages; in practice around 70-80% of non-faces are rejected in the first two nodes of the rejection cascade [25]. This early rejection greatly increases the performance of the face detection.

Figure 3.5: The detection cascade. A series of classifiers is applied to each sub-window. If a sub-window is classified as 'face' it moves on to a more complex classifier, and if it is classified as 'non-face' it is immediately rejected.

3.1.2 Implementation of the Viola-Jones Algorithm
We have used the Viola-Jones face-detection algorithm made available in the open source computer vision library (OpenCV). We used two trained cascades that are provided with Intel's OpenCV library [27]: one for detecting the frontal face, and another for detecting the eyes. The nose and mouth were detected using trained cascades developed by Castrillón et al. [26]. The result of the classifier depends on the way it is trained and also on the image in which the search is conducted. We cannot change the way the features were trained, since no additional training was carried out in this project.

In order to improve the classification performance we can, however, change the search region over which the classifier looks for each feature. Once the classifier has found the face, greater computational efficiency is achieved by looking for the facial features only in a specific region of interest (ROI) where they are most likely to be found. This improves speed and reduces false positive detections. A flow diagram of the face and feature detection process can be seen in Figure 3.6.
The speed and success with which the algorithm finds faces in images might lead one to think that it could be used as a tracking technique in itself, by running the face detector at every frame. However, since we would also like to locate the eyes, nose and mouth in the detected face region with the same technique, the processing time would be too costly (approximately 240 ms on our system, which reduces the frame rate to 4-5 fps); it is therefore preferable to perform dedicated tracking of the facial features once the feature detector has found their positions. Furthermore, since the images from the webcam are noisy, slight variations in the position and scale of the detected features occur even if the position of the face remains still, so a suitable tracking technique increases the stability of the detections. Another key point is that the Haar classifiers used have been trained on frontal faces and features, and therefore will not detect faces that are rotated by certain angles, which would hinder subsequent pose estimation. The ability to keep track of the face feature locations in every frame is addressed in the next section.

Figure 3.6: The detected head region is depicted using a red box. The feature ROIs are extracted directly from the face ROI, and the detection of a feature location is only performed in its respective ROI. Finally the results are combined and displayed. The blue, green and yellow circles represent the detected eyes, nose and mouth respectively.
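For illustration, the sketch below shows how a face can be detected with the OpenCV 1.x C interface used in this project. The particular cascade file name and the detection parameters (scale factor, minimum neighbours, minimum size) are assumptions chosen for the example, not values taken from the project.

#include <cv.h>

/* Sketch of face detection with the OpenCV 1.x C interface.  Which cascade
 * files and parameter values the project actually used is an assumption of
 * this example. */
CvRect detect_face(IplImage *frame, CvHaarClassifierCascade *cascade,
                   CvMemStorage *storage)
{
    CvRect face = cvRect(0, 0, 0, 0);
    cvClearMemStorage(storage);

    /* Scale factor 1.1, at least 3 neighbouring detections, Canny pruning on. */
    CvSeq *faces = cvHaarDetectObjects(frame, cascade, storage, 1.1, 3,
                                       CV_HAAR_DO_CANNY_PRUNING, cvSize(40, 40));
    if (faces != NULL && faces->total > 0)
        face = *(CvRect *)cvGetSeqElem(faces, 0);   /* keep the first detection */
    return face;
}

/* Typical set-up before calling detect_face():
 *   CvHaarClassifierCascade *cascade = (CvHaarClassifierCascade *)
 *       cvLoad("haarcascade_frontalface_alt.xml", 0, 0, 0);
 *   CvMemStorage *storage = cvCreateMemStorage(0);
 * The eye, nose and mouth cascades are applied in the same way, but only
 * inside the detected face rectangle (for example via cvSetImageROI). */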


3.2 Feature Tracking
Tracking is the ability to continuously update the location of a targeted feature of interest over time. Once the position of the features is known, their coordinates can be passed to a tracking algorithm, which assumes that the features were correctly located in the detection process. Popular techniques include skin colour tracking, optical flow and template matching. Since we are interested in tracking the facial features rather than the head itself, we cannot use skin colour methods. Although optical flow techniques are very fast and show good tracking performance, they suffer from accuracy drift over time. This problem is often overcome by employing the epipolar constraint in stereovision systems [21], [28]. Simple template matching, or correlation based, techniques have proved to be suitable for practical and real-time systems, and are very widely applied to compute image motion [29].
In this type of tracking framework, a similarity metric is used to compare a template image to an image patch. If the template image contains an eye, then by correlating it with the input image, strong matches can be found that indicate where an eye is located [30]. In order to reduce the number of computations required, the search region can be limited to a specified window around the location where the feature is most likely to be found. This approach was considered the most attractive since it can easily be initialised automatically from the previous detection stage, thereby avoiding tedious manual initialisation. The template image is created by taking a small snapshot of the area around the feature point detected in the previous stage, and matching it to a region of interest (ROI) in the current image. The current implementation employs template matching using the Normalised Sum of Squared Differences (NSSD), which is a simple way to compare two images. One could also use correlation based matching, which would produce more accurate results at a higher processing cost. Template matching is performed on each of the localised features obtained from the Viola-Jones detector. Normalisation is useful because it helps to reduce the effects of lighting differences between the template image and the ROI image.


3.2.1 Algorithm for feature tracking with NSSD template matching:
1. Match the template image of the left eye pupil, shown in Figure 3.7 (a), to the region of interest image around the last known location of the feature, shown in Figure 3.7 (b).
2. Obtain the NSSD values for all positions in the ROI, as shown in Figure 3.7 (c), by using:

S(x, y) = \frac{\sum_{x'=0}^{w-1} \sum_{y'=0}^{h-1} \left[ T(x', y') - I(x + x',\, y + y') \right]^2}{\sqrt{\sum_{x', y'} T(x', y')^2 \cdot \sum_{x', y'} I(x + x',\, y + y')^2}}    (3.2)

Where T is the template image, I is the ROI image, and w, h represent the width and height of the template image respectively.
3. Find the minimum value of S(x,y) and the respective (x,y) coordinates.
4. If the minimum is smaller than a pre-established threshold continue; otherwise go back to the Viola-Jones face and facial feature detection.
5. Update the eye location and the region of interest based on the matching result.
6. Repeat Steps 1-5 for the right eye, nose and mouth template images.
7. Repeat the whole process in an infinite loop that can be broken if the distances between the face features signal that the system has failed.


Figure 3.7: (a) Template image of left eye obtained by taking a snapshot around the detected location found from the Viola-Jones detector in the previous processing stage. (b) Region of interest image around last known location of left eye. (c) NSSD image result from matching the template image of the left eye to the region of interest around the last known location of the left eye.
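A minimal C sketch of Steps 1 to 5 for a single feature is given below, using cvMatchTemplate with the CV_TM_SQDIFF_NORMED method, which computes the NSSD surface of Equation 3.2. The threshold value is a placeholder for illustration; the actual value was fixed experimentally as described below.

#include <cv.h>

#define MATCH_THRESH 0.25   /* placeholder value, for illustration only */

/* Sketch of Steps 1-5 for one feature: match the stored template against the
 * ROI around the last known location, take the minimum of the NSSD surface
 * (Equation 3.2) and accept it only if it is below the threshold. */
int track_feature(IplImage *roi, IplImage *templ, CvPoint *best)
{
    int rw = roi->width  - templ->width  + 1;
    int rh = roi->height - templ->height + 1;
    IplImage *result = cvCreateImage(cvSize(rw, rh), IPL_DEPTH_32F, 1);
    double min_val, max_val;
    CvPoint min_loc, max_loc;

    cvMatchTemplate(roi, templ, result, CV_TM_SQDIFF_NORMED); /* S(x,y) surface */
    cvMinMaxLoc(result, &min_val, &max_val, &min_loc, &max_loc, NULL);
    cvReleaseImage(&result);

    if (min_val > MATCH_THRESH)
        return 0;        /* poor match: caller reverts to Viola-Jones detection */

    *best = min_loc;     /* top-left corner of the best match inside the ROI */
    return 1;
}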


There are two instances where the template matching algorithm will revert to the Viola-Jones face and feature detection stage. The first occurs when the minimum value of S(x,y) is greater than a pre-established threshold. The threshold was determined from preliminary tests and fixed at a value such that small lighting variations and minor occlusions, such as eye blinking, are ignored by the system. This threshold represents the point at which the minimum error between the template image and the ROI image is considered too large for a good match. For example, the minimum value of S(x,y) is likely to exceed the threshold if the lighting conditions change significantly. This causes the system to re-initialise and take new template images of the features to match the current lighting conditions. This case is checked for in Step 4 of the template matching algorithm.
The second instance where the template matching algorithm will revert to the Viola-Jones detection stage occurs when the geometrical distances between the face features are not consistent with the initial face model distances recorded at initialisation. Before this happens, the face geometry is exploited to recover from single feature tracking failures without having to re-initialise the tracking system; this recovery framework is discussed further in Section 3.4. If the single feature tracking failure is not corrected, or more than one feature has been lost, then the system reverts to the Viola-Jones face and feature detection stage in order to re-initialise.
If the feature tracking fails while the person's head is under rotation, the system will revert to the Viola-Jones face and feature detection stage, but the tracking algorithm will not be restarted until the face has returned to a frontal pose. This is important because the template images from a frontal pose are the most representative of the desired features to track. To make sure that the system does not initialise when the user's face is not facing the camera, the centroid of the triangle formed by the eyes and mouth is compared to the position of the nose tip. If the nose tip is within a certain range of this centroid, then the person can be considered to be in frontal pose and the tracking system can be initialised. This is illustrated in Figure 3.8. The range is set manually and allows for variations in face appearance between people.


Figure 3.8: The Viola-Jones face and feature detection algorithm checks whether the person is in a frontal pose before moving on to the tracking algorithm. (a) Frontal head pose: the nose tip is within a region defined from the centroid of the eye and mouth features, and therefore the system will proceed to initialise the tracking stage. (b) The head pose is not frontal and the nose tip is outside the targeted region, therefore the system will not proceed to initialise the tracking.
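The frontal-pose check described above can be sketched in C as follows; the range parameter is the manually set tolerance mentioned in the text, and the function is an illustrative reconstruction rather than the project's exact code.

#include <math.h>
#include <cv.h>

/* Illustrative frontal-pose test: the nose tip must lie within 'range' pixels
 * of the centroid of the eye and mouth locations. */
int is_frontal(CvPoint left_eye, CvPoint right_eye, CvPoint mouth,
               CvPoint nose_tip, double range)
{
    double cx = (left_eye.x + right_eye.x + mouth.x) / 3.0;   /* triangle centroid */
    double cy = (left_eye.y + right_eye.y + mouth.y) / 3.0;
    double dx = nose_tip.x - cx;
    double dy = nose_tip.y - cy;
    return sqrt(dx * dx + dy * dy) <= range;   /* 1 = near-frontal, 0 = not frontal */
}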

3.3 Head Pose Estimation
The feature tracking on its own allows the estimation of four degrees of freedom. The 'x' and 'y' translation can be taken directly from the position of the features in the image. The 'z' translation can be estimated by measuring the distance between the features, and the in-plane rotation (roll) can be found from the angle of the line that joins the two eyes. The last two degrees of freedom are both out-of-plane rotations, namely the yaw and pitch, which are estimated in the final algorithmic step. This last step is the recovery of the facial normal, which is calculated from the position of the features in the face. This is done by assuming constant model ratios between the eyes and mouth, and estimating the length from the nose base (located on the symmetry axis of the face) to the nose tip. These model ratios are then compared to the ratios seen in the 2D image and, assuming a weak perspective camera model, the 3D vector normal to the facial plane is calculated [1].


3.3.1 Preliminary Processing steps
The distance between the feature points can be used to estimate the scale of the head. By recording the initial distances between the facial features and comparing them to the current distances, the movement along the 'z' axis can be estimated. The distance between the two points seen in Figure 3.9 can be found using Pythagoras' theorem:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}    (3.3)

If the two image points in Figure 3.9 correspond to the eyes, then the rotation angle in the image plane (roll) can be calculated. The angle between two points in a 2D image can be found by finding the arctangent of the slope between the two points:

\theta = \arctan\!\left( \frac{y_2 - y_1}{x_2 - x_1} \right)    (3.4)

Figure 3.9: Points „Pt1‟ and „Pt2‟ are two points in the image which can represent the eye locations. The distance „d‟ between the points can be found by using Pythagoras‟ theorem. The angle between the two points can be found from the arctangent of the slope between the two points.
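A small C helper corresponding to Equations 3.3 and 3.4 is sketched below; it is illustrative only (atan2 is used instead of a plain arctangent of the slope so that a vertical line does not cause a division by zero).

#include <math.h>
#include <cv.h>

/* Distance (Equation 3.3) and in-plane roll angle (Equation 3.4) between two
 * feature points in image coordinates.  Note that image 'y' grows downwards,
 * so the sign convention of the angle may need to be flipped depending on how
 * the roll is defined. */
double point_distance(CvPoint p1, CvPoint p2)
{
    double dx = p2.x - p1.x, dy = p2.y - p1.y;
    return sqrt(dx * dx + dy * dy);
}

double roll_angle(CvPoint left_eye, CvPoint right_eye)
{
    return atan2((double)(right_eye.y - left_eye.y),
                 (double)(right_eye.x - left_eye.x));   /* radians */
}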


The angle between two lines can be found by first finding the angle each line makes with respect to the „x‟ axis, and subsequently subtracting the two angles as seen in Figure 3.10. Alternatively the angle may be found by finding the dot product of the two lines. This calculation will be necessary when dealing with the lines joining the located feature points.

Figure 3.10: The angle between two lines L1 and L2 is the difference between the angles each line makes with the 'x' axis, i.e. θ = θ1 − θ2.


3.3.2 The Facial Model

Figure 3.11: The face model is composed of the distance ratios between the eyes, nose, and mouth locations of a typical face. The eyes and mouth form three points which define the facial plane. The symmetry axis is found by joining the midpoint between the eyes to the mouth.

The facial model is based on the ratios of four parameters of the subject's head, shown in Figure 3.11. An upper case 'L' is used when referring to the model distances, whereas a lower case 'l' is used to represent the current distances measured from the image. These are:

Lf → model distance between the eyes and the mouth.
Ln → model distance between the base of the nose and the nose tip.
Lm → model distance between the nose and the mouth.
Le → model distance between the eyes.

The symmetry axis can be located by finding the mid-point between the eyes and joining it to the mouth. Using the model ratio Rm = Lm / Lf, the location of the nose base along this axis can be estimated.


3.3.3 Estimating the Facial Normal

Figure 3.12: Finding the facial normal from the image by joining the nose base to the nose tip. The angles Ө and τ can easily be extracted from the 2D image.

The facial normal can be drawn by joining the nose tip to the nose base, as shown in Figure 3.12. The tilt τ is the angle in the image plane between the 'x' axis and the imaged facial normal. Ө is the angle in the image between the symmetry axis and the imaged facial normal. The angle σ which the facial normal makes with the normal to the image plane in 3D is called the slant. This angle can be seen in Figure 3.13, and is found using the image measurements Ө, ln, lf, and the model ratio Rn = Ln / Lf [1].

Figure 3.13: The face centred coordinate system used to calculate the slant. The facial slant σ is the angle between the vector normal to the image plane, and the facial normal located along the „z‟ axis of the facial plane.


Consider a unit vector \hat{d} = [d_x, d_y, d_z] along the facial normal, expressed in the face centred coordinate system of Figure 3.13. The slant angle of the facial plane σ can be found since d_z = \cos\sigma. If a weak perspective camera model is assumed, it can be shown that:

R_n^2 (1 - m_2)\, d_z^4 + \left[ m_1 - R_n^2 (1 - 2 m_2) \right] d_z^2 - m_2 R_n^2 = 0    (3.5)

Where m_1 = (l_n / l_f)^2, m_2 = \cos^2\theta, and R_n = L_n / L_f. This quadratic equation in d_z^2 allows the recovery of d_z^2 and hence the slant angle:

\sigma = \arccos\!\left( \sqrt{d_z^2} \right)    (3.6)

For the case where \theta = 0 (m_2 = 1), using Equation 3.5:

d_z^2 = \frac{R_n^2}{m_1 + R_n^2}    (3.7)

For the case where \theta = \pi/2 (m_2 = 0), using Equation 3.5:

d_z^2 = 1 - \frac{m_1}{R_n^2}    (3.8)

Equation 3.5 will always give a single root in the range 0 to 1. The facial normal is then given by:

\hat{n} = \left[ \sin\sigma \cos\tau,\; \sin\sigma \sin\tau,\; -\cos\sigma \right]    (3.9)

For a complete proof refer to the paper by Gee and Cipolla [1].
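For illustration, the following C sketch implements Equations 3.5 to 3.9 as reconstructed above; it is not the project's source code, and the handling of the degenerate case m2 = 1 is an assumption of this sketch.

#include <math.h>

/* Sketch of Equations 3.5-3.9: recover the slant and the facial normal from
 * the image measurements theta (angle between the symmetry axis and the
 * imaged nose, radians), tau (tilt, radians), the image lengths ln and lf,
 * and the model ratio Rn = Ln/Lf. */
void facial_normal(double theta, double tau, double ln, double lf, double Rn,
                   double n[3])
{
    double m1 = (lf > 0.0) ? (ln / lf) * (ln / lf) : 0.0;  /* (ln/lf)^2         */
    double m2 = cos(theta) * cos(theta);                   /* cos^2(theta)      */
    double dz2, slant;

    if (fabs(1.0 - m2) < 1e-6) {
        dz2 = Rn * Rn / (m1 + Rn * Rn);                    /* Equation 3.7      */
    } else {
        double a = Rn * Rn * (1.0 - m2);                   /* quadratic in dz^2 */
        double b = m1 - Rn * Rn * (1.0 - 2.0 * m2);
        double c = -m2 * Rn * Rn;
        dz2 = (-b + sqrt(b * b - 4.0 * a * c)) / (2.0 * a);/* positive root (3.5) */
    }
    if (dz2 < 0.0) dz2 = 0.0;                              /* numerical safety  */
    if (dz2 > 1.0) dz2 = 1.0;

    slant = acos(sqrt(dz2));                               /* Equation 3.6      */
    n[0] = sin(slant) * cos(tau);                          /* Equation 3.9      */
    n[1] = sin(slant) * sin(tau);
    n[2] = -cos(slant);
}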


3.4 Recovery from failure
At some stage, due to severe occlusions, fast movements or rapid lighting variations, the tracking system may fail, so a mechanism is needed to allow the algorithm to recover from failure. When the minimum NSSD value between a template image and the current region of interest rises above the matching threshold, the algorithm returns to the Viola-Jones face and facial feature detection stage to allow re-initialisation.
In the case where one feature has been lost but the distance errors of the other facial features remain below the threshold, the algorithm will try to correct the position of the lost feature. By exploiting the geometry and symmetry of the face, the ROI of the lost facial feature can be corrected, as shown in Figure 3.14. If three features are being tracked correctly, it is straightforward to estimate where the fourth feature should be, since the face structure is known. The algorithm can recover from single feature tracking failures provided that the person has returned to a near frontal pose; to recover while the head is in an arbitrary orientation, the head pose information would be needed, and this is not available when a feature is lost. If one or more features are tracked incorrectly, then the head pose estimates will also be incorrect.

Figure 3.14: The left eye can be considered as lost if the distance between the left eye and the right eye, and the distance between the left eye and the nose is greater than a certain threshold. In that event, the feature can be recovered by exploiting the geometrical properties of the face in frontal pose. In this case, the face symmetry is used to recover the position of the left eye.
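One way the geometric correction of Figure 3.14 could be realised is sketched below: the lost left eye is re-estimated by reflecting the right eye about the symmetry axis, taken here as the line through the tracked nose and mouth points. This is an illustrative reconstruction under a near-frontal-pose assumption, not the exact rule used in the project.

#include <math.h>
#include <cv.h>

/* Illustrative single-feature recovery: mirror the right eye about the line
 * through the nose and mouth to obtain a corrected left-eye estimate. */
CvPoint2D32f recover_left_eye(CvPoint2D32f right_eye,
                              CvPoint2D32f nose, CvPoint2D32f mouth)
{
    double ax = nose.x - mouth.x, ay = nose.y - mouth.y;   /* symmetry axis     */
    double len = sqrt(ax * ax + ay * ay);
    CvPoint2D32f left_eye = right_eye;

    if (len > 0.0) {
        double ux = ax / len, uy = ay / len;               /* unit direction    */
        double vx = right_eye.x - mouth.x, vy = right_eye.y - mouth.y;
        double along = vx * ux + vy * uy;                  /* component on axis */
        double fx = mouth.x + along * ux;                  /* foot of the       */
        double fy = mouth.y + along * uy;                  /* perpendicular     */
        left_eye.x = (float)(2.0 * fx - right_eye.x);      /* mirror the point  */
        left_eye.y = (float)(2.0 * fy - right_eye.y);
    }
    return left_eye;
}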


3.5 Software Implementation
The main programming language used in this project is C, since it is fast, powerful and well suited to real-time applications. The main library used is OpenCV (Open Source Computer Vision) version 1.1 on a Windows platform, which contains a collection of C functions and a few C++ classes that implement many popular image processing and computer vision algorithms. OpenCV is free for academic and commercial use, in contrast to expensive tool kits such as Halcon and MATLAB & Simulink. The library is actively used by a large number of companies (such as Intel, IBM, Microsoft, SONY, Siemens and Google) and research centres (Stanford, MIT, CMU, Cambridge, INRIA).

3.5.1 Structured Programming
The main program was split into multiple source files, which facilitates software reusability and good software engineering. The aim was to split a complex problem into smaller building blocks, each containing a number of related functions. The software framework can be seen in Figure 3.15.

Figure 3.15: The main processing algorithms (image capture, feature detection, feature tracking, pose estimation and drawing) are separated into distinct sub-programs which are brought together in the main program.


3.5.2 System Architecture
The chosen detection, tracking and pose estimation steps were integrated to form one complete system capable of starting automatically, tracking the detected features over time, and simultaneously estimating the head pose of the user. The system flow can be seen in Figure 3.16.

[Figure 3.16 flow chart: Initialise system → Capture video frame → Detect face and features → if not found, capture the next frame; otherwise initialise the tracker and take image templates → Capture video frame → Track features → if the squared error of the matches exceeds the threshold, return to the detection stage; otherwise estimate the pose and continue.]

Figure 3.16: The computational flow chart for the integrated “closed loop” system. Only the main processing steps of the integrated process are shown.


Chapter 4: Results and Discussion
In order to test the performance of the head pose estimation system, proper evaluation and testing must be carried out. An accurate method is required to measure the ground truth pose information against which to evaluate the accuracy of our system and enable comparisons with other algorithms. The system was tested on a number of high quality videos provided by Boston University [17]. These images were captured with a Sony Handycam on a tripod, at 30 frames per second and 320×240 resolution, and ground truth information was collected via a "Flock of Birds" 3D magnetic tracker attached to the subject's head. The system was also tested on videos created with a low quality Creative webcam attached to the top of the computer screen, which captures images at 30 fps and a resolution of 320×240. Ground truth was not available in this latter experiment, therefore the performance was evaluated using empirical techniques.

4.1 Performance measures
In order to compare this system with others, the most common informative metrics are measured to obtain a quantitative assessment of the system's performance. The mean absolute angular error for roll, pitch and yaw is a common statistic used in results obtained from coarse and fine pose estimators, and gives insight into the accuracy of competing methods [16]. The main interest is in the accuracy of the three rotation angles (roll, yaw and pitch), since these angles define the orientation of the head. The precision of the tracker can be defined as the mean absolute angular error over sequences in which all features were tracked. It is important not to include results from sequences in which some features were completely lost, as these would make the accuracy results meaningless. In the tests carried out, the system successfully tracked all the video sequences. Instances where the feature locations were tracked inaccurately were not removed from the results, so that the accuracy also reflects the tracking reliability.


4.1.1 Mean absolute error
The mean absolute error (MAE) is defined as the average of the absolute differences |\hat{y}_i - y_i|, where \hat{y}_i is the estimated value and y_i is the ground truth value. The MAE is given in Equation 4.1, where N is the number of video frames in the sequence. It is used in statistics to measure how close predictions are to the eventual outcomes. In the MAE all individual differences are weighted equally, irrespective of their distance from the true value.

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|    (4.1)

4.1.2 Root mean squared error
The root mean squared error (RMSE) is obtained by squaring the differences between the estimated and ground truth values, averaging them over the number of samples, and finally taking the square root of the average. It is also used to measure how close predictions are to the eventual outcomes, but it gives a higher weight to large differences. This makes the RMSE a useful metric when large errors are particularly undesirable. The RMSE is shown in Equation 4.2.

\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2 }    (4.2)

When compared, the MAE and RMSE measures give insight into the variation of the errors in the estimations. The RMSE will always be larger than or equal to the MAE, and a large difference indicates a larger variance in the calculated errors; all errors are of the same magnitude if the MAE and RMSE are equal. Both measures are negatively oriented scores, where lower values indicate superior performance.

4.1.3 Standard Deviation
Equally important as the mean error, the standard deviation is a measure of the dispersion of a data set. In these results, the standard deviation will be used to compare the spread of angular displacement between the estimated angles and the ground truth angles over a video sequence. Furthermore, it will be used as a measure of confidence accompanying the mean error. The standard deviation is defined in Equation 4.3, where \bar{x} is the mean value of the dataset and x_i is the data point in a particular frame i.

s = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2 }    (4.3)
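A straightforward C sketch of the three measures defined in Equations 4.1 to 4.3 is given below; the array and function names are illustrative only.

#include <math.h>

/* Sketch of Equations 4.1-4.3 over a sequence of N frames; est[] holds the
 * estimated angles and gt[] the ground truth values. */
double mean_abs_error(const double *est, const double *gt, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += fabs(est[i] - gt[i]);
    return sum / N;                                        /* Equation 4.1 */
}

double root_mean_square_error(const double *est, const double *gt, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += (est[i] - gt[i]) * (est[i] - gt[i]);
    return sqrt(sum / N);                                  /* Equation 4.2 */
}

double std_deviation(const double *x, int N)
{
    double mean = 0.0, sum = 0.0;
    for (int i = 0; i < N; i++)
        mean += x[i];
    mean /= N;
    for (int i = 0; i < N; i++)
        sum += (x[i] - mean) * (x[i] - mean);
    return sqrt(sum / N);                                  /* Equation 4.3 */
}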

4.2 Ground Truth Dataset
The Boston University dataset is composed of 72 ground truth annotated videos, and contains subjects with various head shapes, skin colours and facial hair. The first set consists of 45 sequences (5 subjects in 9 different video clips each) taken under uniform illumination, in which the subjects perform free head motion including translations and both in-plane and out-of-plane rotations. The second set consists of 27 sequences (3 subjects in 9 different video clips each) taken under varying illumination. Each sequence is 200 frames long (approx. 7 seconds). The "Flock of Birds" system shown in Figure 4.1 measures the relative position of the transmitter with respect to the receiver and the orientation of the receiver. In theory, the accuracy of the system is 1.8 mm in translation and 0.5 degrees in rotation; however, lower accuracies were experienced due to the typical operating environment, in which computers and some metal furniture can be found. Despite this, the captured measurements were still accurate enough to evaluate the pose estimation system [17].

Figure 4.1: Illustration of the “Flock of birds” 3D magnetic tracker setup.


4.3 Quantitative Performance Evaluation
In this experiment the system is compared to the ground truth data obtained from the Boston University dataset. Two sets of videos were chosen (two subjects in nine different video clips each) that are representative of the whole range of movements in the 6 DOF. Moreover, the lighting conditions, skin colour, facial hair and the distance of the user from the camera differ between the two sets. A total of 18 video sequences were used for this experiment. The first set, called the 'jam' videos, features a light skinned subject with no facial hair in a typical office with poor lighting conditions, as shown in Figure 4.2; the face pixels amount to approximately 15% of the total frame pixels. The second set, known as the 'ssm' videos, features a dark skinned subject with a moustache in the same setting but with good lighting conditions, as shown in Figure 4.3; in this case the subject's face takes up approximately 10% of the total frame. The complete set of results can be seen in Appendix A.
The model parameters and the template image sizes were estimated manually from the subject's face appearance and size. These initialisation parameters were determined prior to the experiments and kept constant throughout; the values can be seen in Appendix B. The system was allowed to start automatically as soon as the face features were detected in each video sequence, and inaccurate detections were not corrected in any way.

Figure 4.2: Image of „jam‟ subject

Figure 4.3: Image of „ssm‟ subject


4.3.1 Presentation of Results
In this section, the presentation of the screenshots, graphs and tables obtained from a typical video is described. The results acquired from the video 'jam8' are displayed as a typical example, and provide insight into interpreting the results in the following sections as well as the complete set of results found in Appendix A.
The video frames obtained when testing the system on the 'jam8' sequence from the Boston University dataset can be seen in Figure 4.4. The two rows of images show specific frames of the video sequence. In each screenshot, the original video frame is overlaid with the drawings displayed by the pose estimation algorithm. At system initialisation, the face and facial features are found using the Viola-Jones algorithm. This processing step can be seen in Frame 1 of Figure 4.4 and is identified by the square red box surrounding the located face and the coloured circles which represent the located facial features. Frame 3 is displayed to show the transition between the Viola-Jones face and feature detection stage and the tracking and pose estimation display outputs; it also shows the initial tracking positions for the eyes, nose and mouth that have been located from the detection stage. If Frame 3 displays the same red box and coloured circles as seen in Frame 1, then the face features were not all detected in the first frame, and therefore the tracker was not initialised immediately. In subsequent frames, the feature locations are identified by small red boxes surrounding them, and a drawing pin representation of the head orientation is depicted in the top left hand corner of each frame.

[Figure 4.4 panels: Frames 1, 3, 25, 50, 75, 100, 125, 150, 175 and 200 of the 'jam8' sequence.]

Figure 4.4: Video frames from the sequence „jam8‟. The automatic initialisation stage can be seen in „Frame 1‟ where the detected face region is represented by a red box, whereas the eyes, nose, and mouth are represented by blue, green, and yellow circles respectively. In the remaining frames, the feature locations are made known by drawing red boxes around them, and the pose is depicted using a drawing pin representation at the top left hand corner.


The graphical results for the video 'jam8' can be seen in the six graphs of Figure 4.5. The first row of graphs in Figure 4.5, labelled (a), (b) and (c), shows the variation of the rotation angles with respect to the video frame number. The videos were processed at 30 frames per second, and hence the video frame numbers also represent instances in time. The red solid line, which represents the estimated pose, should ideally follow the black solid line, which represents the ground truth. In the second row of graphs, labelled (d), (e) and (f), the ground truth is plotted against the estimated pose angles in order to better visualise the difference between the two data sets. The solid black line represents the ideal case where the ground truth values are exactly equal to the estimated values; the data points are plotted with red crosses, and a straight red line is fitted to the data. In this particular video, the subject undergoes variations in the 'x', 'y' and 'z' dimensions, as well as yaw, pitch and roll rotations.

Figure 4.5: Results for video 'jam8'. Graphs (a), (b), and (c) show the variation of the rotation angles with respect to the video frame number. The red solid line, which represents the estimated pose, should ideally follow the black solid line which represents the ground truth. In the second row of graphs, labelled (d), (e), and (f), the ground truth was plotted against the estimated pose angles in order to better visualise the difference between the two data sets. In the ideal case, when the estimated data is exactly equal to the ground truth, the data points should all lie on the line y = x.


The MAE, RMSE and standard deviations of the ground truth and estimated pose angles for the 'jam8' video are tabulated in Table 4.1. In this case the purpose of the standard deviation is to give an indication of the range of angles undergone throughout the video sequence. From the graphs of Figure 4.5, it can be seen that the maximum and minimum yaw rotation angles in graph (b) were greater than those of the roll and pitch rotations in graphs (a) and (c) respectively; this can be seen quantitatively in the results of Table 4.1. The comparison between the standard deviations of the ground truth and the estimated pose values gives insight into the range of tracked rotation angles. If the standard deviation of the estimated pose values is similar to that of the ground truth, then one can say that the system did not overestimate or underestimate the rotation angles.

Table 4.1: Mean absolute error, root mean square error, ground truth standard deviation and estimated pose standard deviation for video 'jam8'.

        MAE (degrees)   RMSE (degrees)   Ground truth std. dev.   Estimated pose std. dev.
Roll    2.4025          3.3538           7.1570                   6.7187
Yaw     4.3232          5.6096           11.8967                  11.5014
Pitch   3.6923          4.6010           6.8172                   6.9322


4.3.2 Global Results
The global average results from all the video sequences can be seen in Table 4.2 for the 'jam' videos and in Table 4.3 for the 'ssm' videos. Both the MAE and the RMSE are expressed in degrees. In these results, the standard deviation accompanying each performance metric expresses how the average errors varied over the collection of tested videos, and acts as a confidence measure for the average errors.

Table 4.2: Average errors and standard deviations over all 'jam' videos.

JAM     MAE (degrees)   MAE std. dev.   RMSE (degrees)   RMSE std. dev.
ROLL    2.7395          1.0170          3.2990           1.2166
YAW     5.3352          2.3770          6.6412           2.3563
PITCH   3.3362          2.3563          4.0389           1.3460

Table 4.3: Average errors and standard deviations over all 'ssm' videos.

SSM     MAE (degrees)   MAE std. dev.   RMSE (degrees)   RMSE std. dev.
ROLL    3.3128          1.9982          4.0635           2.5774
YAW     5.2009          2.5681          6.3971           3.0693
PITCH   4.4933          3.8348          5.3448           4.0524

From the 18 videos tested, the head pose estimation system managed to track all the video sequences, and never lost track to the point where re-initialisation was required. The maximum and minimum rotations registered in the Boston University videos are shown in Table 4.4.

Table 4.4: Largest rotation angles registered in the Boston University videos.

        Minimum (degrees)   Maximum (degrees)
Yaw     -30                 +24
Pitch   -22.5               +21.5
Roll    -31.5               +28

In the following sections, the results for the estimated roll, yaw and pitch rotation angles are discussed separately. In each section, the results from certain videos have been chosen and displayed in order to illustrate the specific issues being discussed.

40

4.3.3 Roll

Figure 4.6: Illustration of the roll rotation.

In this section, the results of the roll rotation estimation are analysed and discussed. The roll is the rotation about the 'z' axis, which is normal to the image plane as shown in Figure 4.6; therefore the whole face remains in view even under large rotation angles. The MAE, RMSE and standard deviations of the roll rotation angles for each video can be seen in Table 4.5. The roll rotation estimation achieved the lowest error when compared to yaw and pitch, as can be seen in Table 4.1, and this was expected since it is extracted directly from the 2D image plane: when the head rotates about the z-axis it remains facing the camera, and none of the tracked face features become occluded.

Table 4.5: Roll results from the 'jam' and 'ssm' videos.

ROLL    MAE (deg)   RMSE (deg)   GT std. dev.   Est. std. dev.
jam1    2.6895      3.4423       14.4918        13.8435
jam2    1.6747      1.9779       3.625          4.5986
jam3    1.4109      1.8308       5.465          5.6018
jam4    2.8532      3.2855       3.3381         2.2143
jam5    2.4244      2.6886       3.3810         3.2859
jam6    3.1111      3.6744       3.4277         1.9537
jam7    3.1490      3.4211       5.4186         5.5026
jam8    2.4025      3.3538       7.1570         6.7187
jam9    4.9400      6.0170       20.8261        17.2058
ssm1    1.5556      1.9205       4.1743         4.4390
ssm2    2.3413      2.4970       1.0833         1.1591
ssm3    1.0614      1.3870       3.8329         4.2349
ssm4    5.1681      6.5360       18.3951        20.0387
ssm5    3.9413      4.7095       2.1541         3.7925
ssm6    2.7709      3.1554       1.4352         1.8188
ssm7    4.6246      5.5559       14.6605        16.6752
ssm8    7.2980      9.4296       1.3086         5.2381
ssm9    1.0537      1.3803       4.5588         4.9336


Figure 4.7: Graphical results from video 'ssm4'. In graph (a), the roll value starts 'jumping' above and below the ground truth value as it exceeds 22 degrees and decreases below -30 degrees. This occurs since the fixed template image of the eye is not a very good representation of the eye under large roll rotations.

Large MAE values for the roll are usually caused when a large roll rotation makes the template image a very poor match to the eye region. This can be seen in the graphical results for video 'ssm4' in Figure 4.7 (a), where in frames 40-80 and in frames 120-150 the roll rotation becomes uncertain and starts 'jumping' above and below the ground truth value. This occurs since the fixed template images of the eyes, taken from the subject's face in frontal pose, are not good representations of the eyes under large roll rotations. Furthermore, in the 'ssm' videos the eye region is of very low resolution, adding to the uncertainty of the eye position under rotation. The uncertain roll rotation angles in the graph of Figure 4.7 (a) directly corrupted the corresponding estimated angles for the yaw and pitch rotations, as can be seen in the same frame ranges of graphs (b) and (c).
The videos 'jam1' and 'jam9' demonstrated that large roll rotations can be tracked correctly without 'jumping' if the eye resolution is higher. The eyes are of higher resolution because the face of the 'jam' subject was closer to the camera than the face of the 'ssm' subject and therefore appears larger in the image frame. The graphical results for these videos can be seen in Figure 4.8 and Figure 4.9. These two results show that even though large roll rotations were present, they can still be successfully tracked if the eye regions are of sufficient resolution.


Figure 4.8: Graphical results from video 'jam1'. In graph (a) the roll rotations were successfully tracked, even though large rotations were present. The maximum and minimum values of roll rotation are +28 degrees and -24 degrees. At the positions where large rotations were present, the yaw and pitch angles were not tracked properly. Looking at the video frames carefully, one will notice that at large rotations the eye positions were incorrectly located at the eye corners, thus causing errors in yaw estimation, as seen in graph (b).


Figure 4.9: Graphical results from video 'jam9'. In graph (a) the roll rotations were successfully tracked under clockwise rotations (negative values), and slightly underestimated for anticlockwise rotations (positive values). The maximum and minimum values of roll rotation are +21 degrees and -31.5 degrees respectively. In the last frames of the sequence, the eye positions were incorrectly located at the eye corners, resulting in inaccurate yaw and pitch values.


Figure 4.10: Frame 200 of the „jam1‟ video. The eye centres are incorrectly located at the eye corners. This does not cause significant errors for the roll values but it greatly affects the yaw and pitch estimation, especially when one of the eyes is incorrectly located.

This behaviour can be attributed to the fact that even though the template images are fixed, and do not resemble the eye regions under large rotations, they are still more similar to the eye regions than to the other regions around the eyes. In both sequences it can consistently be seen that the yaw and pitch values were very uncertain when a large simultaneous roll rotation was present. One common type of error under large roll rotations was to incorrectly locate the eye centres at the eye corners, as can be seen in Figure 4.10. This type of error does not affect the roll rotation estimation significantly, since the eye corners lie on the same line as the eye pupils.

4.3.4 Yaw

consistently be seen that the yaw and pitch values were very uncertain when a large simultaneous roll rotation was present. One common type of error under large roll rotations was to incorrectly locate the eye centres at the eye corners as can be seen in Figure 4.10. This type of error does not affect the roll rotation estimation significantly since the eye corners are located on the same line as the eye pupils. 4.3.4 Yaw

Figure 4.11: Illustration of the yaw rotation.

The yaw is the rotation about the „y‟ axis which corresponds to side-to-side head movement as shown in Figure 4.11; therefore under large yaw rotations, the eyes can become self-occluded. The results for yaw rotation can be found in Table 4.6 for the „jam‟ and „ssm‟ videos. 44

Table 4.6: Yaw results from the 'jam' and 'ssm' videos.

YAW     MAE (deg)   RMSE (deg)   GT std. dev.   Est. std. dev.
jam1    5.2468      6.8113       6.1753         3.0193
jam2    2.6357      4.0806       8.8127         6.7202
jam3    2.6342      3.7517       2.6566         2.5893
jam4    10.7012     11.8587      1.9510         4.9851
jam5    5.9931      7.6036       17.8341        11.7124
jam6    5.0611      6.3400       2.5994         6.6671
jam7    5.7368      7.0615       13.7342        7.8323
jam8    4.3232      5.6096       11.8967        11.5014
jam9    5.6843      6.6543       6.6478         6.2765
ssm1    5.3563      6.2192       4.3138         7.0049
ssm2    1.8804      2.3846       3.1402         1.6121
ssm3    2.6407      3.4838       3.9152         5.8087
ssm4    9.5319      11.8003      2.8554         9.7870
ssm5    2.8864      3.7800       1.6754         3.3596
ssm6    6.7943      7.8153       1.6599         3.8186
ssm7    7.0114      8.6549       3.7108         10.9169
ssm8    7.7576      9.8596       23.0452        18.1841
ssm9    2.9231      3.8193       16.9459        17.4105

Figure 4.12: Feature location confidence drops as head rotates away from the image plane. Since the template image is fixed and extracted from a frontal face orientation, the minimum value of the NSSD value increases as the head rotates away from frontal pose.


The yaw rotation attained the largest MAE compared to the roll and pitch, and this can be attributed to the fact that the appearance of the eye features changes most in this rotation direction, causing the NSSD template matching to be increasingly uncertain of the eye positions. This is illustrated in Figure 4.12, where the subject's head undergoes a yaw rotation and the left eye ROI is seen to change in appearance as the head rotates away from the image plane. Another reason for the large MAE emerges when the graphical results for videos in which the subject underwent yaw rotations are analysed. In video 'jam5', the yaw angles are not correctly tracked beyond ±20 degrees, as shown in Figure 4.13 (b) and (e). Similar behaviour is also seen in Figure 4.14 (b) and (e), where the yaw rotations were not tracked accurately beyond ±25 degrees. This appears to be the value at which the yaw estimation saturates; any rotation larger than this is not accurately tracked, causing a large yaw MAE for these videos.


Figure 4.13: Graphical results of video „jam5‟. In graph (b) the yaw estimation can be seen to saturate at approximately ±20 degrees. This is also seen in graph (e), where the fitted line clearly shows an underestimation for the negative and positive yaw angles, especially at higher angles.



Figure 4.14: Graphical results from video 'ssm8'. It can be seen in graph (b) that the yaw angle is greatly underestimated between frames 50 and 100. The yaw is only underestimated for the positive angles, since these were larger than the negative angles and harder to track. Graph (e) shows that the negative rotations were tracked, whilst the larger positive angles were underestimated. From frame 165 onwards the tracking was lost, causing outliers which distorted the results, as seen in graph (d). Despite this setback, the track was recovered in the last frame.

The main reason for this saturation lies in the model assumption that the nose tip can be accurately located in the face image, which is valid for frontal pose. However at large yaw rotations, the tracker does not follow the nose tip, but rather finds a closer match at the nose centre, as shown in Figure 4.15.

Figure 4.15: Frame 125 from the „jam5‟ sequence. Head model assumes that the nose tip can be accurately located in the face image for accurate pose estimation. The yaw angle is underestimated since under large yaw rotations, the nose centre tends to be a better match than the nose tip for the NSSD tracker.


Another source of inaccuracy arises from the model ratios assumed at the beginning of the experiment. Since no profile views of the subjects were available, the ratio of the length between the nose tip and nose base to the length between the mouth and the eyes (Rn = Ln / Lf), defined in the initial face model of Figure 3.11, could only be roughly estimated. Moreover, since variations in yaw at small angles cause a larger change in the imaged face normal than at larger angles, the accuracy of the yaw values is greater for smaller angles and decreases for larger yaw rotations. This can be seen in Figure 4.16 (b), where although the yaw is precisely tracked, the accuracy of the values diminishes as the rotation increases.


Figure 4.16: Graphical results from video 'ssm9'. The maximum and minimum successfully tracked yaw angles were registered in this video sequence: the maximum value is +24 degrees and the minimum value is -29 degrees.


4.3.5 Pitch

Figure 4.17: Illustration of the pitch orientation.

The pitch is the rotation about the „x‟ axis which corresponds to up-down head movement as shown in Figure 4.17; therefore under large pitch rotations, the face features can also become self-occluded. The results for the pitch rotation can be seen in Table 4.7 for the „jam‟ and „ssm‟ dataset.

Table 4.7: Pitch results from the 'jam' and 'ssm' videos.

PITCH   MAE (deg)   RMSE (deg)   GT std. dev.   Est. std. dev.
jam1    4.1833      4.9945       1.7933         2.8541
jam2    2.8787      3.3008       1.3469         1.5941
jam3    2.1114      2.6209       0.9990         1.9922
jam4    1.7998      2.1629       2.5440         1.6141
jam5    2.1578      2.8024       2.1206         2.1902
jam6    4.6748      5.9939       16.5889        14.5513
jam7    4.1783      4.6003       1.4645         1.7007
jam8    3.6923      4.6010       6.8172         6.9322
jam9    4.3488      5.2730       3.2582         2.9984
ssm1    1.7773      2.2279       1.1843         1.2141
ssm2    2.2737      2.8009       1.9445         3.0353
ssm3    1.9928      2.3994       1.1701         2.2167
ssm4    4.4455      6.5093       4.3160         4.0425
ssm5    12.7329     14.0140      17.0536        17.1382
ssm6    10.0363     10.8905      9.2786         6.3282
ssm7    2.1435      2.8097       2.8742         1.7476
ssm8    2.0220      2.8028       1.1242         2.7342
ssm9    2.8741      3.5857       1.8108         1.3336



Figure 4.18: Graphical results for video „jam6‟. In graph (c) the pitch is correctly tracked but suffers from underestimation at positive pitch rotations between frames 75-115 and temporary overestimation at negative pitch rotations between frames 148-160.

The accuracy of the pitch estimation, just like the yaw, suffered at large rotations, as can be seen in Figure 4.18 (c) and (f). However, the model ratio that predominantly affects the pitch could be estimated more accurately, since the ratio Rm is taken directly from a frontal view of the face. Another kind of error showed up in two of the 'ssm' videos and directly affected the MAE of the pitch rotation angles: in Figure 4.19 (c) and Figure 4.20 (c), the pitch rotation was precisely tracked, but at a constant offset. This is mainly caused by an inaccuracy in the detection stage, where an imprecise initialisation causes the pose estimation to be offset by a certain angle. For the 'ssm' subject, the source of this error may be attributed to his moustache causing a slight inaccuracy in the mouth detection, as shown in Figure 4.21 (a) and (b). Although the MAE is large in these cases, the standard deviations of both datasets are similar, confirming that the system is tracking the pitch rotation. This is further confirmed when the datasets are plotted against each other and a line is fitted to the data, as shown in Figure 4.19 (f) and Figure 4.20 (f).

[Six panels: (a)-(c) roll, yaw and pitch rotation angles (degrees) against video frame number for ground truth and estimated pose; (d)-(f) estimated pose against ground truth with ideal and fitted lines.]
Figure 4.19: Graphical results for video 'ssm5'. In graphs (c) and (f) it is very clear that the pitch has been tracked at a constant offset value. It can also be seen that in frames 45-65 the system temporarily lost track and recovered in subsequent frames.

[Six panels: (a)-(c) roll, yaw and pitch rotation angles (degrees) against video frame number for ground truth and estimated pose; (d)-(f) estimated pose against ground truth with ideal and fitted lines.]
Figure 4.20: Graphical results for video 'ssm6'. Similarly to the figure above, the pitch values were offset by a constant value, which can be seen clearly in graphs (c) and (f).

Figure 4.21: (a) Frame 3 from video 'ssm5'. (b) Frame 3 from video 'ssm9'. The difference in the initial location of the mouth in video 'ssm5' (a) caused a significant offset in the pitch rotation angles. The location of the mouth was more accurate in video 'ssm9' (b), as can be seen from the drawing-pin representation depicted in the top left corner of the frames. In the ideal case it should be a perfect circle with a single point at its centre.

4.3.6 X, Y, and Z translation

Since a calibrated camera was not used, the 'x' and 'y' translations are measured in image pixels rather than real-world units, and the 'z' translation, or scale, is measured as a scale factor relative to the initial position of the head. Because the camera is un-calibrated, the accuracy of the translation cannot be compared to the 3D magnetic tracker. The 'x' and 'y' coordinates of the head are assumed to be equal to the 'x' and 'y' coordinates of the nose base. This means that the centre of the head is assumed to lie on the facial plane, so small errors in 'x' and 'y' translation will result when the head is oriented away from a frontal pose; these minor errors can be ignored without significant loss in accuracy. The 'x' and 'y' translation of the head is limited to the size of the camera frame. The scale of the head is extracted by comparing the initial distances between the feature points with the distances in the current frame. If the head is closer to the camera than it was initially, the average distance between the features increases, and the opposite is true when the head moves away from the camera.
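As an illustration, the following Python sketch shows one way such a scale factor could be computed from the tracked feature points. The use of the average pairwise distance and the example coordinates are assumptions made for this sketch, not the exact implementation used in this work.

import numpy as np

def scale_factor(initial_pts, current_pts):
    # Relative scale of the head from the average pairwise distance between
    # tracked feature points (eyes, nose, mouth).  A value > 1 suggests the
    # head has moved closer to the camera, a value < 1 that it moved away.
    initial_pts = np.asarray(initial_pts, dtype=float)
    current_pts = np.asarray(current_pts, dtype=float)

    def mean_pairwise_distance(pts):
        n = len(pts)
        dists = [np.linalg.norm(pts[i] - pts[j])
                 for i in range(n) for j in range(i + 1, n)]
        return np.mean(dists)

    return mean_pairwise_distance(current_pts) / mean_pairwise_distance(initial_pts)

# Hypothetical (x, y) pixel positions of left eye, right eye, nose tip and mouth
initial = [(120, 100), (160, 100), (140, 125), (140, 150)]
current = [(115, 102), (165, 102), (140, 133), (140, 164)]  # head slightly closer
print(scale_factor(initial, current))  # prints a value greater than 1.0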

What is interesting to know is how a change in scale affects the system performance, and this can be seen in video 'ssm2', frames of which are shown in Figure 4.22, where the subject first moves closer to the camera and then moves further away.

[Frames 1, 50, 75, 125 and 175 from video 'ssm2'.]
Figure 4.22: Frames from video 'ssm2' where the subject displays translation in the 'z' direction. Even though the templates are of constant size, they are still matched correctly to the intended feature locations. This is important for the system to be of practical use.

[Six panels: (a)-(c) roll, yaw and pitch rotation angles (degrees) against video frame number for ground truth and estimated pose; (d)-(f) estimated pose against ground truth with ideal and fitted lines.]
Figure 4.23: Graphical results for video 'ssm2', in which the subject undergoes scale variation. The system does not lose track of the features and still manages to estimate the pose within acceptable accuracy.

[Three panels: (a) 'x' translation and (b) 'y' translation in frame pixels, and (c) 'z' translation as a scale factor, each against video frame number.]
Figure 4.24: Graphical results for the 'x', 'y', and 'z' translations for video 'ssm2'. The 'x' and 'y' coordinates of the head are measured in image pixels, whilst the 'z' translation is measured as a scale factor relative to the initial distance from the camera.

The plotted results for the video 'ssm2' roll, yaw and pitch rotation angles can be seen in Figure 4.23 (a), (b) and (c) respectively, and the results for the 'x', 'y', and 'z' translations can be seen in Figure 4.24 (a), (b) and (c). As the subject in Figure 4.22 moves towards the camera, he also moves downwards with respect to the camera. This downward 'y' translation is clearly captured in graph (b) of Figure 4.24. As the face of the subject moves closer to the camera, the size of the face increases in the image and, therefore, the distances between the face features also increase. This gradual translation in the 'z' direction is shown in Figure 4.24 (c), where a scale factor greater than unity indicates that the subject has moved closer to the camera, and a scale factor smaller than unity indicates that the subject has moved further away relative to the initial position. There was hardly any movement in the 'x' direction in this video, as can be seen in Figure 4.24 (a). The 'x', 'y' and 'z' translations of the subject are all clearly captured in the graphs of Figure 4.24, and match the movements seen in video 'ssm2'. Since the template images do not change in scale, poorer pose estimation results are to be expected when the head is at a different scale than it was initially. Despite this, the results of Figure 4.23 demonstrate that the head pose is still tracked when the face moves closer to or further away from the camera. The main reason why the system is still able to track the pose when the distances between the features change is that the head model is made up of ratios between lengths, which do not change as the head moves closer to or away from the camera.
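To see why ratios of inter-feature distances make the model insensitive to scale, the toy example below compares one such ratio before and after a uniform scaling of the feature points. The chosen points and the particular ratio are illustrative only and do not correspond to the specific model ratios defined earlier in this work.

import numpy as np

def ratio(p, q, r, s):
    # Ratio of two inter-feature distances |pq| / |rs|
    p, q, r, s = (np.asarray(v, dtype=float) for v in (p, q, r, s))
    return np.linalg.norm(p - q) / np.linalg.norm(r - s)

left_eye, right_eye, nose, mouth = (120, 100), (160, 100), (140, 125), (140, 150)
scaled = [tuple(1.4 * np.asarray(f, dtype=float)) for f in (left_eye, right_eye, nose, mouth)]

print(ratio(left_eye, right_eye, nose, mouth))  # 1.6 (eye separation / nose-mouth distance)
print(ratio(*scaled))                           # 1.6 (unchanged after uniform scaling)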

4.3.7 Discussion

From the 18 videos tested, the head pose estimation system managed to track all the sequences, and never lost track to the point where re-initialisation was required. This shows a particular robustness to various movements in the six degrees of freedom, and also towards subjects with different skin colours in two different lighting conditions. Although the pose was not always accurate when compared to the ground truth data, the general head orientation was always tracked. The results from the 'jam' videos consistently achieved a lower MAE than the 'ssm' videos, as can be seen from Table 4.2 and Table 4.3, which indicates that a higher face resolution and proper feature localisation improve the accuracy and robustness of the tracking. The pose estimation relies directly on the tracking of the salient features in the face, and any tracking error will immediately corrupt the pose estimates. Any inaccuracy in the face detection stage will result in a constant offset error in the pose estimation. Errors in pitch and yaw are expected since the exact face model ratios of the subject are unknown and had to be estimated.

4.4 Qualitative Empirical Evaluation

The purpose of testing the head pose estimation system on videos created for this project is to test its capabilities on low-quality web camera images and its widespread applicability for use on a personal computer or laptop. In particular, its empirical limits in terms of rotation angles need to be tested, as well as its ability to recover from tracking failures and to track sequences for long periods of time. Furthermore, it is an opportunity to test the system's performance when single-feature or full-face occlusions occur, and its capability of tracking fast movements such as head shaking and nodding. The subject was located in a room with a cluttered background, and the lighting conditions were non-uniform but constant. In all three sequences, the tracker was initialised automatically and allowed to run for the whole period of the video. The image capture device was a standard 'CREATIVE' web camera that captures frames at 30 fps with a resolution of 320×240. The camera was mounted on top of a computer monitor, in the same position as one would mount it for use in a video call.

4.4.1 Video 'my1'

The video 'my1' was intended to track consecutive rotations in yaw, pitch and roll. Selected video frames can be seen in Figure 4.25, and the corresponding orientation estimates in time are shown in Figure 4.26. In Figure 4.26 (a), the dotted box from frames 50-330 shows the slight estimated roll rotation resulting from the head yaw rotation from frames 50-330 seen in Figure 4.26 (b). Immediately after the yaw rotation, a pitch rotation can be seen in graph (c), where the track was temporarily lost from frames 480-540. This occurs because the left eye is mistakenly located at the left eyebrow, as seen in Frame 500 of Figure 4.25. Following the pitch rotation, a roll rotation is seen in Figure 4.26 (a), starting from approximately frame 560. A slight tracking error also occurs when the left eye is again located incorrectly at the position of the left eyebrow. The maximum and minimum rotation angles can be seen in Table 4.8.

[Frames 1, 3, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750 and 790 from video 'my1'.]
Figure 4.25: Video frames taken from the first sequence, aimed at tracking head rotation about the 'x', 'y', and 'z' axes.

[Three panels: (a) roll, (b) yaw and (c) pitch rotation angles (degrees) against video frame number.]
Figure 4.26: Graphical results from video 'my1'. The red solid line represents the estimated angle in time. The black square boxes were drawn over the results in order to emphasise key regions of the graph where particular rotations are taking place. The blue square boxes represent regions where the tracker has failed, thus causing jump discontinuities in the estimated angles. In this case the failures are mainly due to excessive rotations.

Table 4.8: Largest rotation angles registered in video 'my1'.

        Minimum (degrees)   Maximum (degrees)
Yaw     -30.8               +31.0
Pitch   -31.8               +18.5
Roll    -38.8               +30.8

4.4.2 Video 'my2'

In the video 'my2', the system's robustness to head-shake in both yaw and pitch orientations, and to single-feature occlusions, was tested. In Frame 150, seen in the video snapshots of Figure 4.27, one can see the significant motion blur that the system has to cope with during fast movements of the subject. This is mainly due to the low quality and frame rate of the webcam, and is the source of another type of error that can cause the system to lose track of the facial features.

[Frames 1, 3, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500 and 550 from video 'my2'.]
Figure 4.27: Video frames from the 'my2' sequence. This sequence was aimed at tracking head-shake, single-feature occlusions and full re-initialisation capability when the tracker fails dramatically, for example due to the full face occlusion shown in Frame 500.

[Three panels: (a) roll, (b) yaw and (c) pitch rotation angles (degrees) against video frame number, annotated with the occluded feature (mouth, nose, left eye, right eye, total face) and the head-shake regions.]
Figure 4.28: Graphical results for video 'my2'. The black boxes in graphs (b) and (c) were drawn around the head-shaking sequences. The blue boxes represent regions where the system temporarily lost track of the pose due to occlusions. The green line towards the end marks the time after which the face has become completely occluded and the system automatically re-initialises.

The first blue box from the left in Figure 4.28 represents the region where the subject was occluding his mouth, just as one would do whilst yawning, as shown in Frame 200 of Figure 4.27. The system loses track of the mouth but regains its position once the hand is no longer occluding the mouth. The second blue box marks the point where the right eye becomes occluded and the eye is temporarily located at the eyebrow, as shown in Frame 300 of Figure 4.27. In the following frames the subject occludes the nose with his finger; however, the system does not lose track of the nose, since the colour of the finger and the nose are quite similar. In Frame 400 the left eye becomes occluded and both eye positions are lost, but the system quickly recovers. In Frame 500 the subject's hand occludes his whole face, causing the system to lose track of all the features. The system quickly detects that it has lost the face features and re-initialises.


4.4.3 Video 'my3'

The purpose of video 'my3' is to test the system at the limits of yaw and pitch, and to capture the maximum angles at which the system is capable of tracking the head orientation. It also tests the system's capability to recover from failure when some features have been lost due to self-occlusion (features which move out of view due to large rotations rather than physical obstructions). Selected frames taken from the video sequence can be seen in Figure 4.29, and the corresponding orientation estimates in time are shown in Figure 4.30. It can clearly be observed that when the system loses track of all the face features in Frame 850, the system detects this and the re-initialisation process is started, as shown in Frame 900. The maximum angles estimated before the tracker failed are listed in Table 4.9.

[Frames 1, 3, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850 and 900 from video 'my3'.]
Figure 4.29: Video frames from sequence 'my3'. This video is 900 frames, or 30 seconds, long, and the subject takes the rotation in yaw and pitch to its limits in order to identify the angles at which the tracker fails to estimate the pose.

[Three panels: (a) roll, (b) yaw and (c) pitch rotation angles (degrees) against video frame number.]
Figure 4.30: Graphical results from video 'my3'. The black boxes mark regions where the system has lost track, and the blue circles mark the points at which the maximum rotation angles were read. The green line just before frame 900 marks the time when the system was re-initialised and fresh template images were taken.

It is important to note that the face tracker will not be initialised until the face returns to a frontal pose. This is because the template images taken from a frontal pose are the most representative of the face features in different orientations. The initialisation is constrained to a frontal pose by comparing the centroid of the triangle formed by joining the points of the eyes and mouth with the location of the nose tip in the image.
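A minimal sketch of such a frontal-pose test is given below. The tolerance value and its normalisation by the inter-ocular distance are assumptions made for the example, not the exact criterion used by the system.

import numpy as np

def is_frontal(left_eye, right_eye, mouth, nose_tip, tol_ratio=0.08):
    # Compare the centroid of the eyes-mouth triangle with the nose-tip
    # location; in a frontal pose the two should nearly coincide.
    # tol_ratio is a hypothetical tolerance expressed as a fraction of the
    # inter-ocular distance.
    left_eye, right_eye, mouth, nose_tip = (
        np.asarray(p, dtype=float) for p in (left_eye, right_eye, mouth, nose_tip))
    centroid = (left_eye + right_eye + mouth) / 3.0
    inter_ocular = np.linalg.norm(right_eye - left_eye)
    return np.linalg.norm(centroid - nose_tip) <= tol_ratio * inter_ocular

# Example with (x, y) pixel coordinates
print(is_frontal((120, 100), (160, 100), (140, 150), (140, 117)))  # True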

Table 4.9: Maximum and minimum rotation angles for yaw and pitch.

        Minimum (degrees)   Maximum (degrees)
Yaw     -32.5               +32.4
Pitch   -36.2               +16


4.4.4 Simple Application

In order to demonstrate the applicability of the pose estimation system as a hands-free pointing device controlled by the user's head orientation, a simple game was created. The aim of the game is to shoot down blue circles that appear in random positions on the screen with a black crosshair-style pointer controlled by the user. The position of the black crosshair is determined from the yaw and pitch angles of the user's head, and can be moved about in real time. Once the user has positioned the black crosshair over the blue circle, the circle disappears and another one appears at another random position on the screen. A counter then informs the user how many blue circles he has managed to hit; if the system re-initialises, the counter resets to zero. This pose estimation game runs at 30 fps, which is the maximum frame rate of the web camera. The pose estimation system takes approximately 15 ms to provide an estimate at each frame, which means that the system could potentially run at double the frame rate of the camera (60 fps). Typical screenshots of the game can be seen in Figure 4.31. This simple application supports the initial claim that a head pose estimation system opens up new ways to control machines and interact with computers. This simple 'zero-cost' application was created to show how the system could potentially be converted into one which controls a smart head-up display (HUD) or any other system that would benefit from head pose control.
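A sketch of how the crosshair position could be derived from the estimated yaw and pitch is shown below. The linear mapping, the screen size and the ±30 degree working range are assumptions for illustration, not the game's actual parameters.

def angles_to_screen(yaw_deg, pitch_deg, screen_w=1024, screen_h=768, max_angle=30.0):
    # Map head yaw/pitch (degrees) linearly onto the screen: +/-max_angle
    # spans the full width/height.  Positive yaw moves the crosshair right
    # and positive pitch (looking down) moves it down, following the sign
    # convention used in this work.
    x = (yaw_deg / max_angle + 1.0) * 0.5 * screen_w
    y = (pitch_deg / max_angle + 1.0) * 0.5 * screen_h
    # Clamp the crosshair to the screen
    return (min(max(x, 0), screen_w - 1), min(max(y, 0), screen_h - 1))

def hit(crosshair, target_centre, target_radius=25):
    # True when the crosshair lies inside the blue circle
    dx = crosshair[0] - target_centre[0]
    dy = crosshair[1] - target_centre[1]
    return dx * dx + dy * dy <= target_radius * target_radius

print(angles_to_screen(10.0, -5.0))   # crosshair right of centre and above centre
print(hit((683, 320), (690, 330)))    # True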

Figure 4.31: Typical screenshots whilst trying to shoot the blue balls with the black crosshair.


When the game was initially tested, the cursor, represented by a black crosshair, successfully moved to points on the screen controlled by the user's head orientation. However, it was noted that the motion of the black crosshair was 'jumpy' and unnatural when moving across the screen, and it seemed as if there were some 'dead' areas of the screen to which the cursor could not be directed. The angles generated by the head pose estimation are discrete by virtue of the face feature locations, which reside at discrete pixel locations. The possible pointing positions on the screen are therefore also discrete, and jump between locations when the user's head undergoes a rotation. This has the advantage that slight head movements will not be registered; however, if a slight movement causes the nose position to shift by just one pixel, the cursor position will jump to the next discrete angle. One way of increasing the resolution of the pose angles would be to increase the image resolution; however, this is undesirable. A significant reduction in 'jumping' was obtained by re-configuring the pose estimation system to average the current pose estimates with the previous values. This had the effect of adding a more natural motion to the cursor's movements, at the expense of adding a slight response delay.
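The smoothing described above can be sketched as follows. An exponential moving average is used here as one possible way of averaging the current estimate with previous values; the value of alpha is illustrative.

class PoseSmoother:
    # Average the current pose estimate with previous values to reduce the
    # 'jumpy' cursor motion.  alpha close to 1 favours the newest estimate
    # (low latency); alpha close to 0 favours the history (smoother, laggier).
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.state = None  # (roll, yaw, pitch)

    def update(self, roll, yaw, pitch):
        current = (roll, yaw, pitch)
        if self.state is None:
            self.state = current
        else:
            self.state = tuple(self.alpha * c + (1.0 - self.alpha) * s
                               for c, s in zip(current, self.state))
        return self.state

smoother = PoseSmoother(alpha=0.5)
for measurement in [(0, 10, 0), (0, 14, 0), (0, 13, 0)]:
    print(smoother.update(*measurement))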

4.4.5 Discussion

In the experiments using 'my1', 'my2' and 'my3', no ground truth data was available, and therefore the accuracy of the tracker could not be estimated on these video sequences. The three video sequences taken in the second experiment exposed certain system qualities in a way that was not possible with the videos from Boston University. The results obtained clearly exposed the weakness of the system to facial feature occlusions, and its potential to recover from failure. One key advantage of having a static template is that the system does not suffer from diminishing accuracy over time and could run for very long periods with the same performance. This is because the system relocates the template images in each and every frame, so the accuracy at each frame depends on the matching result between the template image and the ROI, irrespective of the number of frames in the video.

Figure 4.32: Frame 800 from video 'my3'. This image illustrates the reason why downward pitch rotation is difficult to track for the NSSD template matching algorithm.

An experiment was carried out in which the template images were continuously updated to better track the head movements under rotation. Although this works well for the first few frames, small location errors each time the template image is updated cause the feature location to drift. Moreover, if the user blinks and the template is updated at that moment, the system will lose track of the eyes immediately. In the case of a fixed template image, when the user blinks the system loses track momentarily and recovers as soon as the user re-opens his eyes. The extreme rotation angles that the system can track are directly related to the angles at which the features become occluded in the image plane. This is expected, since the tracker assumes that all features are present in each frame and requires them to be there in order to find a match. The single rotation direction that suffers more than the rest is the downward pitch direction, which only achieved angles of +18.5 and +16 degrees in videos 'my1' and 'my3' respectively. This occurs because, as the head rotates downwards, the eye pupil and the eyebrow merge and the tracker finds a better match at the eyebrow, as clearly seen in Figure 4.32. A larger pitch angle of +21.5 degrees was achieved in video 'jam6' from the Boston University dataset; however, this increase can be attributed to the fact that the scale of the head is smaller in the 'jam' videos and the field of view of the camera is not exactly the same as the one used for the 'my' videos. In another test it was shown that when the user moves the eyeballs, the pose estimation can be fooled into thinking that the user has changed his head pose.

Figure 4.33: Correcting pose measurements to cope with eyeball rotation: (a) head frontal, eyes frontal; (b) head frontal, eyes right; (c) head frontal, eyes right.

This is illustrated in Figure 4.33, where in image (a) the subject has a frontal pose and the eyes are looking forward. Since the pose estimation bases its measurements on the pupil locations, errors may arise when the pupils are not in the centre of the eyes, as shown in image (b). In image (c) the eye pupils are still being tracked, but the pose estimation is now based on the eye corners. Since the eye corners cannot change position when the user looks in different directions, the pose measurement is now more accurate. The disadvantage of tracking the eye corners is that they are less salient than the eye pupils, and their location can only be estimated from the position of the detected eye region. Moreover, tracking the eye corners limits the yaw rotation further, since the eye corners become self-occluded before the eye pupils; however, when comparing the results from Table 4.9 and Table 4.10, the difference was only marginal. The results of the test on video 'my3' tracking the eye corners instead of the eye pupils are shown in Table 4.10 and Figure 4.34. In the case of the Boston University videos, tracking the eye corners instead of the eye pupils would not have increased the accuracy significantly, because the eyes were of a very low resolution compared to the 'my' videos.

Table 4.10: Maximum and minimum rotation angles for yaw and pitch when tracking the eye corners instead of the eye pupils. The reduction in rotation angles was only marginal, which suggests that this could be a more accurate method for tracking the head pose.

        Minimum (degrees)   Maximum (degrees)
Yaw     -31.5               +32.2
Pitch   -31.5               +16

[Three panels: (a) roll, (b) yaw and (c) pitch rotation angles (degrees) against video frame number.]
Figure 4.34: Graphical results from video 'my3' tracking the pose using the eye corners instead of the eye pupils. The black boxes mark regions where the system has lost track, and the blue circles mark the points at which the maximum rotation angles were read.

The system can only recover from single-feature occlusions when the face is in a frontal pose. This is because, for the tracker to estimate where the feature locations are in any face orientation, the pose information is required. If one or more features are tracked incorrectly, then the head pose estimates will also be incorrect. The ability of the system to recover from single-feature tracking failures without having to re-initialise greatly improves the system's usability. If severe occlusions or drastic lighting variations are present, the system will re-initialise and restart automatically from the detection stage.


4.5 Summary

In order to test the quantitative performance of the head pose estimation system, tests were carried out on ground truth videos provided by Boston University. The MAE, RMSE and standard deviation were used as performance measures to evaluate the system. All 18 test videos were tracked successfully, and were representative of the whole range of movements in the 6 DOF. Global mean absolute errors of 3.03, 5.27 and 3.91 degrees were achieved for the roll, yaw and pitch angles respectively. The 'x', 'y' and 'z' translations were also tracked; however, their performance was not measured quantitatively, since an un-calibrated camera was used. Qualitative empirical evaluation was carried out on videos created to test the limits of head rotation, fast head movements, recovery from single-feature occlusions and tracking failures, and tracking of video sequences for long periods of time. Finally, a simple game was created to test the practical usability of the pose estimation system as a hands-free, head-controlled pointing device. The positive results demonstrate that the system is indeed capable of being used practically by any unskilled user, and show that the system can be used for other applications such as the control of a smart head-up display (HUD).
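For reference, the performance measures reported in the tables can be computed per rotation angle over one video sequence as in the following sketch.

import numpy as np

def pose_error_metrics(ground_truth, estimated):
    # ground_truth and estimated are per-frame angles in degrees
    gt = np.asarray(ground_truth, dtype=float)
    est = np.asarray(estimated, dtype=float)
    err = est - gt
    mae = np.mean(np.abs(err))          # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))   # root mean square error
    return mae, rmse, np.std(gt), np.std(est)

# Toy example with five frames
print(pose_error_metrics([0, 5, 10, 5, 0], [1, 4, 12, 6, -1]))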


Chapter 5: Conclusion

5.1 Project Objectives and Achievements

The results obtained from the head tracking and pose estimation system demonstrate that the key objective, to design a cheap, non-intrusive system able to track the human head pose in the six degrees of freedom, has been successfully fulfilled. All 6 DOF are extracted from the image without the help of any intrusive hardware attached to the user. The rotation angles are all constrained to approximately ±35 degrees with respect to the image normal, as this is the angle at which the features start becoming self-occluded. These restrictions, however, do not limit the effectiveness of the system, because when one is seated in front of a laptop these rotations are rarely exceeded. The image capture device is monocular, cheap and does not require calibration. The system achieves real-time performance (approx. 60 fps), starts automatically and recovers from failure automatically, without any previous knowledge of the user's appearance or location. It can operate in a wide range of constant lighting conditions, but handling drastic changes in lighting remains a problem, since template images change considerably with varying illumination. In the current system this problem is overcome when the system detects failure and automatically re-initialises to capture new template images under the changed illumination, but this is undesirable. Despite the inherent inaccuracies of the system, the results look promising and are well within the error margin for several practical applications. One can also conclude that the head pose is a good indicator of a person's general gaze direction. This is because, after a person identifies the source of interest by rotating the eyes, the head usually follows suit and moves in the same direction in order to reposition the pupils to their most comfortable central position. The simple game that was created to demonstrate the practical applicability of the system also proved to be entertaining, and provided insight into the weaknesses of the system and areas for future development.


5.2 Future Work

Tests on the simple pointing game showed that the movement of the cursor is rather 'jumpy' and unnatural due to the low resolution of the camera. One quick solution to increase the resolution of the pose angles would be to increase the image resolution; however, this could slow down real-time performance. In future work, a head motion model could be used to enhance the pose estimation system by predicting the next pose from past measurements. This would have the effect of smoothing the pose measurements and removing outliers, adding a more natural movement to the cursor. In order to improve the feature tracker, the template images could be matched against the regions of interest at various scales and orientations in order to improve the chance of finding a good match. This would increase the freedom of movement, especially in the 'z' direction and in roll rotation. The approach could be extended to any affine transformation, which would further improve the matching results and thus enhance the robustness of the tracker. Motion detection could be used to better estimate where the region-of-interest window should move in the next frame, and hence better locate the ROI for matching. One could also use a Kalman filter to dynamically update the width and height of the region of interest depending on the speed with which the user is moving in the image; it would also predict where the next feature location should be in the next frame, thus excluding outlier values when the template matching fails. If the recovery of feature locations from a frontal pose is to be extended to all possible face orientations, then a redundant pose estimation algorithm could be implemented that is based on a different approach, so that when one fails the other will not. A Kalman filter could also be used to estimate the most likely pose from the two values obtained from the pose estimation algorithms running in parallel. A second algorithm could be developed to cope with larger out-of-plane rotations; however, care must be taken to ensure that the complexity of the system does not compromise the real-time performance. In future work, gaze direction improvements could be investigated by coupling the eyeball movement to the head orientation. This can be done by tracking one of the eye corners and finding the relative position of the pupil to it. In this way, the pose estimate can be corrected by the eye position in order to better estimate the person's visual focus of attention. Difficulties were encountered when tests were carried out to locate the pupil position with respect to the eyeball in low-resolution images; however, investigations using higher-resolution images or multiple cameras may prove to be a solution. Further work includes combining the eye pupil tracker with the head orientation estimator and improving the current tracking performance by coupling it with another technique. In addition, future research calls for the integration of other head pose estimation algorithms to accurately track greater head rotations and to exploit the advantages of multiple techniques.
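As an indication of the kind of predictor proposed above, the sketch below implements a constant-velocity Kalman filter for a single feature location. The noise parameters and initial covariance are illustrative values, not tuned ones, and the outlier handling is only hinted at in the comments.

import numpy as np

class FeaturePredictor:
    # Constant-velocity Kalman filter for predicting the next (x, y)
    # location of a tracked facial feature.
    def __init__(self, x, y, q=1.0, r=4.0):
        self.state = np.array([x, y, 0.0, 0.0])   # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                  # state covariance
        self.F = np.array([[1, 0, 1, 0],           # constant-velocity model
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],           # only position is measured
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * q                     # process noise
        self.R = np.eye(2) * r                     # measurement noise

    def predict(self):
        # Predicted feature position for the next frame (centre the ROI here)
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def correct(self, x, y):
        # Fuse the template-matching measurement; a gross outlier could be
        # rejected before calling this, keeping the prediction instead.
        z = np.array([x, y], dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ (z - self.H @ self.state)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]

kf = FeaturePredictor(140, 125)
for measurement in [(142, 125), (144, 126), (147, 126)]:   # nose tip drifting right
    print("predicted:", kf.predict(), "corrected:", kf.correct(*measurement))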



Appendix A: Quantitative Results

This appendix shows all the results obtained from the Boston University videos that were used in the quantitative performance evaluation. Each page contains the results obtained from one video sequence, and each result contains two figures and one table. The first figure contains a set of ten snapshots taken at spaced intervals throughout the video sequence. The automatic initialisation stage can be seen in Frame 1, where the detected face region is represented by a red box, whereas the eyes, nose and mouth are represented by blue, green and yellow circles respectively. In the remaining frames, the feature locations are indicated by red boxes drawn around them, and the pose is depicted using a drawing-pin representation. In some instances, Frame 3 shows the detection-stage boxes, indicating that the features were not correctly located in the first frame and therefore the tracker was not initialised immediately. In the videos where Frame 3 shows the small red squares representing the feature locations, the tracker was initialised immediately, since the feature locations were found and satisfied certain geometrical criteria. The second figure shows six graphical results. The first row of graphs, labelled (a), (b) and (c), displays the ground truth and estimated pose against time. The red solid line, which represents the estimated pose, should ideally follow the black solid line, which represents the ground truth. A positive roll value indicates that the person is rotating anticlockwise in the image plane, and the opposite holds for a negative roll. A positive yaw rotation indicates that the person is looking towards the right, and vice versa for a negative yaw. A positive pitch value signifies that the person is looking downwards, and an upward motion is indicated by a negative pitch value. The second row of graphs, labelled (d), (e) and (f), displays the ground truth data plotted against the estimated pose data in order to better visualise the difference between the two data sets. In the ideal case, when the estimated data is exactly equal to the ground truth, the data points should all lie on the line y = x. The table of results on each page shows the numerical values of the mean absolute error, the root mean square error, the standard deviation of the ground truth data, and the standard deviation of the estimated pose data.

Figure A.1: Video frames from the 'jam1' sequence.

Figure A.2: Graphical results from the 'jam1' sequence.

Table A.1: Numerical results from the 'jam1' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    2.6895   3.4423   14.4555            13.8088
Yaw     5.2468   6.8113   6.1599             3.0118
Pitch   4.1833   4.9945   1.7888             2.8470

Figure A.3: Video frames from the 'jam2' sequence.

Figure A.4: Graphical results from the 'jam2' sequence.

Table A.2: Numerical results from the 'jam2' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    1.6747   1.9779   3.6159             4.5870
Yaw     2.6357   4.0806   8.7906             6.7033
Pitch   2.8787   3.3008   1.3435             1.5901

Figure A.5: Video frames from the 'jam3' sequence.

Figure A.6: Graphical results from the 'jam3' sequence.

Table A.3: Numerical results from the 'jam3' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    1.4109   1.8308   5.4513             5.5878
Yaw     2.6342   3.7517   2.6499             2.5828
Pitch   2.1114   2.6209   0.9965             1.9872

Figure A.7: Video frames from the 'jam4' sequence.

Figure A.8: Graphical results from the 'jam4' sequence.

Table A.4: Numerical results from the 'jam4' sequence.

        MAE       RMSE      Ground truth std   Estimated pose std
Roll    2.8532    3.2855    3.3298             2.2088
Yaw     10.7012   11.8587   1.9461             4.9726
Pitch   1.7998    2.1629    2.5376             1.6101

Figure A.9: Video frames from the 'jam5' sequence.

Figure A.10: Graphical results from the 'jam5' sequence.

Table A.5: Numerical results from the 'jam5' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    2.4244   2.6886   3.3726             3.2777
Yaw     5.9931   7.6036   17.7894            11.6831
Pitch   2.1578   2.8024   2.1153             2.1847

Figure A.11: Video frames from the 'jam6' sequence.

Figure A.12: Graphical results from the 'jam6' sequence.

Table A.6: Numerical results from the 'jam6' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    3.1111   3.6744   3.4191             1.9488
Yaw     5.0611   6.3400   2.5929             6.6504
Pitch   4.6748   5.9939   16.5474            14.5149

Figure A.13: Video frames from the 'jam7' sequence.

Figure A.14: Graphical results from the 'jam7' sequence.

Table A.7: Numerical results from the 'jam7' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    3.1490   3.4211   5.4051             5.4889
Yaw     5.7368   7.0615   13.6998            7.8127
Pitch   4.1783   4.6003   1.4608             1.6965

Figure A.15: Video frames from the 'jam8' sequence.

Figure A.16: Graphical results from the 'jam8' sequence.

Table A.8: Numerical results from the 'jam8' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    2.4025   2.3538   7.1391             6.7019
Yaw     4.3232   5.6096   11.8669            11.4726
Pitch   3.6923   4.6010   6.8001             6.9149

Figure A.17: Video frames from the 'jam9' sequence.

Figure A.18: Graphical results from the 'jam9' sequence.

Table A.9: Numerical results from the 'jam9' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    4.9400   6.0170   20.7739            17.1628
Yaw     5.6843   6.6543   6.6312             6.2608
Pitch   4.3488   5.2730   3.2500             2.9909

Figure A.19: Video frames from the 'ssm1' sequence.

Figure A.20: Graphical results from the 'ssm1' sequence.

Table A.10: Numerical results from the 'ssm1' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    1.5556   1.9205   4.1639             4.4279
Yaw     5.3563   6.2192   4.3030             6.9873
Pitch   1.7773   2.2279   1.1813             1.2111

Figure A.21: Video frames from the 'ssm2' sequence.

Figure A.22: Graphical results from the 'ssm2' sequence.

Table A.11: Numerical results from the 'ssm2' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    2.3413   2.4970   1.0806             1.1562
Yaw     1.8804   2.3846   3.1324             1.6080
Pitch   2.2737   2.8009   1.9396             3.0277

Figure A.23: Video frames from the 'ssm3' sequence.

Figure A.24: Graphical results from the 'ssm3' sequence.

Table A.12: Numerical results from the 'ssm3' sequence.

        MAE      RMSE     Ground truth std   Estimated pose std
Roll    1.0614   1.3870   3.8233             4.2243
Yaw     2.6407   3.4838   3.9054             5.7941
Pitch   1.9928   2.3994   1.1672             2.2111

Figure A.25: Video frames from the 'ssm4' sequence.

Figure A.26: Graphical results from the 'ssm4' sequence.

Table A.13: Numerical results from the 'ssm4' sequence.

        MAE      RMSE      Ground truth std   Estimated pose std
Roll    5.1681   6.5360    18.3490            19.9885
Yaw     9.5319   11.8003   2.8483             9.7625
Pitch   4.4455   6.5093    4.3052             4.0324

Figure A.27: Video frames from the 'ssm5' sequence.

Figure A.28: Graphical results from the 'ssm5' sequence.

Table A.14: Numerical results from the 'ssm5' sequence.

        MAE       RMSE      Ground truth std   Estimated pose std
Roll    3.9413    4.7095    2.1487             3.7830
Yaw     2.8864    3.7800    1.6712             3.3512
Pitch   12.7329   14.0140   17.0109            17.0953

Frame 1

Frame 3

Frame 25

Frame 50

Frame 75

Frame 100

Frame 125

Frame 150

Frame 175

Frame 200

Figure A.29: video frames from „ssm6‟ sequence. (b) Yaw

Figure A.30: Graphical results from the 'ssm6' sequence. Panels (a) Roll, (b) Yaw and (c) Pitch plot the rotation angle (degrees) against video frame number for the ground truth and the estimated pose; panels (d) Roll, (e) Yaw and (f) Pitch plot the estimated pose (degrees) against the ground truth (degrees), together with the ideal and fitted lines.

Table A.15: Numerical results from the 'ssm6' sequence (all values in degrees).

         MAE       RMSE      Ground truth std    Estimated pose std
Roll     2.7709    3.1554    1.4316              1.8143
Yaw      6.7943    7.8153    1.6557              3.8090
Pitch    10.1363   10.8905   9.2554              6.3124


Figure A.31: Video frames 1, 3, 25, 50, 75, 100, 125, 150, 175 and 200 from the 'ssm7' sequence.

Figure A.32: Graphical results from the 'ssm7' sequence. Panels (a) Roll, (b) Yaw and (c) Pitch plot the rotation angle (degrees) against video frame number for the ground truth and the estimated pose; panels (d) Roll, (e) Yaw and (f) Pitch plot the estimated pose (degrees) against the ground truth (degrees), together with the ideal and fitted lines.

Table A.16: Numerical results from the 'ssm7' sequence (all values in degrees).

         MAE       RMSE      Ground truth std    Estimated pose std
Roll     4.6246    5.5559    14.6238             16.6312
Yaw      7.0380    8.6870    3.7015              10.9169
Pitch    2.1849    2.8726    2.8670              1.8990


Figure A.33: Video frames 1, 3, 25, 50, 75, 100, 125, 150, 175 and 200 from the 'ssm8' sequence.

Figure A.34: Graphical results from the 'ssm8' sequence. Panels (a) Roll, (b) Yaw and (c) Pitch plot the rotation angle (degrees) against video frame number for the ground truth and the estimated pose; panels (d) Roll, (e) Yaw and (f) Pitch plot the estimated pose (degrees) against the ground truth (degrees), together with the ideal and fitted lines.

Table A.17: Numerical results from the 'ssm8' sequence (all values in degrees).

         MAE       RMSE      Ground truth std    Estimated pose std
Roll     7.2980    9.4296    1.3054              5.2250
Yaw      7.7576    9.5896    22.9875             18.1385
Pitch    2.0220    2.8028    1.1211              2.7274


Figure A.35: Video frames 1, 3, 25, 50, 75, 100, 125, 150, 175 and 200 from the 'ssm9' sequence.

Figure A.36: Graphical results from the 'ssm9' sequence. Panels (a) Roll, (b) Yaw and (c) Pitch plot the rotation angle (degrees) against video frame number for the ground truth and the estimated pose; panels (d) Roll, (e) Yaw and (f) Pitch plot the estimated pose (degrees) against the ground truth (degrees), together with the ideal and fitted lines.

Table A.18: Numerical results from the 'ssm9' sequence (all values in degrees).

         MAE       RMSE      Ground truth std    Estimated pose std
Roll     1.0537    1.3803    4.5474              4.9213
Yaw      2.9231    3.8139    16.9035             17.3669
Pitch    2.8741    3.5857    1.8062              1.3303


Appendix B:

Initialisation Parameters

Table B.1: Model ratios for the 'jam' and 'ssm' videos.

       JAM     SSM
Rm     0.5     0.58
Rn     0.55    0.50

Table B.2: Template initialisation parameters for the 'jam' and 'ssm' videos.

                                 Size (pixels)
eye template width               10
eye template height              10
eye search window width          30
eye search window height         20
nose template width              20
nose template height             20
nose search window width         40
nose search window height        40
mouth template width             16
mouth template height            8
mouth search window width        32
mouth search window height       16
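The sizes in Table B.2 parameterise the feature-tracking stage: a small template is cut out around each detected feature and matched, by normalised sum of squared differences (NSSD), against candidate positions inside a larger search window in the next frame. The sketch below only illustrates how such sizes could be used; the function names, the cropping convention and the particular NSSD normalisation are assumptions rather than the implementation used in this work.

    import numpy as np

    # Sizes taken from Table B.2, given as (width, height) in pixels.
    EYE_TEMPLATE_SIZE = (10, 10)
    EYE_SEARCH_SIZE = (30, 20)

    def crop_centred(image, x, y, width, height):
        """Cut a width-by-height patch out of a greyscale image, centred on (x, y)."""
        x0 = int(round(x)) - width // 2
        y0 = int(round(y)) - height // 2
        return image[y0:y0 + height, x0:x0 + width]

    def nssd(template, patch):
        """One common normalisation of the sum of squared differences:
        both patches are scaled to unit energy before the SSD is taken."""
        t = template.astype(float)
        p = patch.astype(float)
        t /= np.linalg.norm(t) + 1e-12
        p /= np.linalg.norm(p) + 1e-12
        return float(np.sum((t - p) ** 2))

Sliding the 10 x 10 pixel eye template over its 30 x 20 pixel search window and keeping the position with the minimum NSSD would give the new eye location; the nose and mouth regions would be tracked in the same way using their own sizes from Table B.2.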
