SAMUEL FELIX DE SOUSA JUNIOR

UMA ABORDAGEM PARA ESTIMAÇÃO REMOTA DO OLHAR SOB LUZ VISÍVEL EM TEMPO REAL

Dissertação apresentada ao Programa de Pós-Graduação em Ciência da Computação do Instituto de Ciências Exatas da Universidade Federal de Minas Gerais como requisito parcial para a obtenção do grau de Mestre em Ciência da Computação.

ORIENTADOR: MARIO FERNANDO MONTENEGRO CAMPOS

Belo Horizonte
16 de fevereiro de 2012

SAMUEL FELIX DE SOUSA JUNIOR

AN APPROACH FOR REAL TIME REMOTE GAZE ESTIMATION UNDER VISIBLE LIGHTING

Thesis presented to the Graduate Program in Computer Science of the Universidade Federal de Minas Gerais in partial fulfillment of the requirements for the degree of Master in Computer Science.

ADVISOR: MARIO FERNANDO MONTENEGRO CAMPOS

Belo Horizonte
February 16, 2012

© 2012, Samuel Felix de Sousa Junior. Todos os direitos reservados.

Sousa Junior, Samuel Felix de
S725a   Uma Abordagem para Estimação Remota do Olhar sob Luz Visível em Tempo Real / Samuel Felix de Sousa Junior. — Belo Horizonte, 2012
        xxviii, 73 f. : il. ; 29cm
        Dissertação (mestrado) — Universidade Federal de Minas Gerais — Departamento de Ciência da Computação.
        Orientador: Mario Fernando Montenegro Campos
        1. Computação – Teses. 2. Visão Computacional – Teses. I. Orientador. II. Título.
        519.6*84(043)

To the sources of my strength and happiness: my parents, Samuel and Silsa.

Acknowledgments

I thank God for all the love that has been given to me and for the strength to conclude my work. It has also been a privilege to have the great support of my family during this journey. My deepest gratitude goes to my parents, Silsa Andrade and Samuel Felix, and to my sister, Samantha Vale.

I sincerely thank my supervisor, Prof. Mario Campos, for kindly welcoming me at the Laboratório de Visão e Robótica (VeRLab), for introducing me to this research field, and for his continuous support over the last two years. I also thank all the staff of PPGCC for helping me every time I needed it.

I thank all the friends in my lab for the discussions, the exchange of ideas, and the incredible help during the experimental phase. Many thanks to Antônio Wilson, Erickson Nascimento, and Cláudio dos Santos.

My special thanks to Dafne Bastos, whose love, friendship, and support guided me during this stage of my life. Special thanks to Marina Oikawa for always cheering me up. I would also like to thank Elizabeth Duane, Alberto Pimentel, and Yuri Tavares. Talking and spending time with these people has been one of the best parts of my study at UFMG, and I hope that some of them will remain friends for life.

Finally, I would like to thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for the scholarship that allowed this work to be finished. Portions of the research in this work use the FERET database of facial images collected under the FERET program, sponsored by the US Department of Defense Counterdrug Technology Development Program Office.

"We can dream the impossible and make it happen" (Eugene Cernan)

Resumo

Esta dissertação propõe uma abordagem para o problema da estimação remota do olhar utilizando uma câmera localizada à frente do usuário. Uma solução também é proposta para o problema da estimação da pose facial, que permite melhorias na determinação do olhar. Dentre as inúmeras aplicações estão incluídas a interação humano-computador, a monitoração de fadiga de motoristas e a recuperação de informação visual baseada em conteúdo.
Um dos desafios relacionados aos problemas da detecção do olhar é a estimação da pose facial. Uma solução para esse problema pode permitir maior naturalidade na utilização do sistema e, grosso modo, indicar a possível orientação do olhar. Várias outras aplicações têm motivado o aumento crescente da investigação do rastreamento do olhar nos últimos anos. Diversas abordagens foram propostas na literatura, sendo que a grande maioria requer dispositivos específicos para o seu funcionamento correto. A solução proposta nesta dissertação baseia-se na análise de imagens adquiridas por câmeras comuns no espectro visível. Ou seja, nenhum dispositivo de hardware especializado ou mesmo requisitos rígidos sobre a iluminação são impostos. A metodologia foi implementada e avaliada por meio de experimentos reais com diversos indivíduos. Os resultados obtidos demonstraram que o sistema opera adequadamente em tempo real, obtendo a localização dos olhos com exatidão. O sistema foi também aplicado no controle remoto de uma cabeça robótica e na produção de mapas de calor das regiões focadas pelos usuários.

Palavras-chave: Visão Computacional, Rastreamento de Olhar, Estimação da Pose Facial.

Abstract

This thesis proposes an approach to the remote gaze estimation problem using a single camera. A solution to the head pose estimation problem is also proposed, indicating the coarse gaze direction. The approach has applications in Human Computer Interaction, in monitoring driver drowsiness to avoid road accidents, and in the Content-Based Image Retrieval field.

Many applications have motivated the increasing investigation of gaze tracking over the past few years. Several approaches have been proposed in the literature, many of them requiring specific hardware devices to work properly. The solution proposed in this thesis is based on the analysis of images acquired by ordinary cameras under visible lighting. In other words, no specific hardware device or illumination constraint is imposed.

The methodology was implemented and evaluated in real experiments with different individuals. The results demonstrate that the system works properly in real time, locating the eyes accurately. Our system was also applied to the remote control of a robotic head, and we built heatmaps of the screen regions on which users focused.

Keywords: Computer Vision, Gaze Tracking, Head Pose Estimation.

List of Figures

1.1 Eye image acquired in visible light
2.1 Anatomy of Human Eye
2.2 Sclera Search Coils
2.3 Electrooculography
2.4 EyeWriter Glasses
2.5 Active Infrared Illumination
2.6 Omega Group
2.7 K-Means Clustering
2.8 Yaw, Pitch, and Roll
2.9 Shape
2.10 ASM Start and Final shapes
3.1 Overview of the two main modules
3.2 Head Pose Methodology
3.3 Eye Tracking Methodology
3.4 LK Tracking
3.5 Mesh Projection
3.6 Face Proportion
3.7 ROI and Mask
3.8 Enhanced Iris and Binarized Image
3.9 Results of Iris Detection
3.10 Template Matching
3.11 Calibration Pattern
3.12 Calibration Procedure
3.13 Example of Good Calibration
4.1 Robotic Head
4.2 Tele-immersion robot used in experiments
4.3 Robotic Head
4.4 Overview of Users Eyes
4.5 Calibration Results
4.6 Gaze results of Individual 1
4.7 Gaze results of Individual 2
4.8 Gaze results of Individual 3
4.9 Gaze results of Individual 4
4.10 Gaze results of Individual 5
4.11 Gaze results of Individual 6
4.12 Gaze results of Individual 7
4.13 Good iris exposure vs. bad iris exposure
A.1 Mathematical Morphology
A.2 Structuring Element
A.3 Erosion
A.4 Dilation
B.1 Conics

List of Tables

2.1 Dataset Use for Eyes and Face Detection
4.1 Quantitative Results for Individual 1
4.2 Quantitative Results for Individual 2
4.3 Quantitative Results for Individual 3
4.4 Quantitative Results for Individual 4
4.5 Quantitative Results for Individual 5
4.6 Quantitative Results for Individual 6
4.7 Quantitative Results for Individual 7
B.1 Quadric Surfaces
B.2 Conic Sections
List of Acronyms

AAM Active Appearance Model
ALS Amyotrophic Lateral Sclerosis
ASEF Average of Synthetic Exact Filters
ASM Active Shape Model
EOG Electrooculography
FERET The Facial Recognition Technology Database
GSL GNU Scientific Library
HCI Human Computer Interaction
IC Isocenters
IMU Inertial Measurement Unit
IR Infrared
LED Light Emitting Diodes
LK Lucas and Kanade [1981]
MM Mathematical Morphology
PCA Principal Component Analysis
PoR Point of Regard
POS Pose from Orthography and Scaling
POSIT POS with ITerations
RANSAC RANdom SAmple Consensus
RGB Additive Color Space
ROI Region of Interest
SSC Scleral Search Coils
V4L2 Video for Linux 2
YUV Luminance and Chrominance Color Space

List of Algorithms

1 Algorithm for Remote Gaze Estimation
2 Iris Detection and Tracking Algorithm
3 Circle Detection Algorithm

Contents

Acknowledgments
Resumo
Abstract
List of Figures
List of Tables
List of Acronyms
List of Algorithms
1 Introduction
  1.1 Motivation
  1.2 Overview of the Problem
  1.3 Related Problems
  1.4 Contributions
  1.5 Roadmap of this Thesis
2 Gaze Tracking: A Literature Review
  2.1 Eye Segmentation and Gaze Estimation
    2.1.1 Anatomy of the Human Eye
    2.1.2 Non Video-based Approaches
    2.1.3 Video-based Approaches
    2.1.4 Eye Segmentation Approaches
  2.2 Head Pose Estimation
    2.2.1 Human Head Behavior
    2.2.2 Active Shape Model
3 Methodology
  3.1 Methodology Overview
  3.2 Head Pose Estimation Algorithm
    3.2.1 Tracking of Facial Landmarks
    3.2.2 Mesh Projection and Pose Estimation
  3.3 Eye Segmentation Algorithm
    3.3.1 Iris Compensation
    3.3.2 Circle Detection
    3.3.3 Consensus of Circle Fitting
    3.3.4 Iris Tracking
  3.4 Gaze Estimation Algorithm
4 Experimental Analysis
  4.1 Experimental Setup
    4.1.1 Software
    4.1.2 Hardware
  4.2 Qualitative Results
    4.2.1 Heatmap of Gaze Location
    4.2.2 Remote Control of a Robotic Head
  4.3 Quantitative Analysis
    4.3.1 Gaze Measurements
  4.4 Limitations of Our Work
5 Conclusions and Future Work
Bibliography
Appendix A Mathematical Morphology
  A.1 Structuring Element
  A.2 Erosion
  A.3 Dilation
Appendix B Quadrics and Conics
  B.1 Algebraic Curves and Surfaces
  B.2 Quadrics and Conics

Chapter 1
Introduction

The advance of technology changes the way people interact with computers. The field of Human Computer Interaction (HCI) has put great effort into planning and designing better interfaces for interaction with computational devices.

Recently, computer vision has brought many unusual and challenging opportunities for interaction with human beings using cameras. For instance, Gaze Tracking is the problem of estimating where a person is looking using head-mounted or remote devices. An effective gaze tracking system would definitely enhance HCI for most individuals. However, for those not capable of interacting with computational systems through conventional devices such as mice or tablets, gaze tracking becomes more than just an innovative technique: it is an essential means of connecting to the world.

1.1 Motivation

Many different problems can be solved using computer vision. Indeed, several different fields take advantage of cameras. Some involve reconstruction problems (i.e., when a three-dimensional model of an object needs to be estimated); others have applied computer vision to surveillance, mobile robot localization, object detection and recognition, etc. There is a whole field called Digital Image Processing that deals specifically with image-related problems such as image enhancement, segmentation, and so forth.

When the cost of acquiring a sensor is reduced, the sensor tends to become more popular. Digital cameras are thus inexpensive and useful sensors available to common users. This has provided a perfect scenario for applying computer vision to problems that were unlikely to be solved in the past, due to the difficulty of obtaining a powerful and precise sensor.

Until the advent of tablets and mobile devices, the most popular way to interact with computers was by keyboard and mouse. Nevertheless, many users are incapable of operating computers with their hands due to disability. For instance, individuals suffering from Amyotrophic Lateral Sclerosis (ALS) need gaze trackers to assist them in communication. Hence, this thesis is motivated by the need to handle the problem of estimating the Point of Regard (PoR) by proposing a simple, fast, and inexpensive remote gaze tracker. The Point of Regard, or Point of Gaze, is the point in the scene imaged on the fovea of the eye [Guestrin and Eizenman, 2006].

1.2 Overview of the Problem

Roughly, the problem of estimating the PoR consists in determining where a person is looking. In our configuration, the user is looking towards a plane, which can be the computer screen, and the input of our system is the data gathered by a single webcam positioned in front of the user.
The system output is the two-dimensional location on the screen representing the gaze.

Many approaches address the gaze problem using images acquired in the Infrared (IR) spectrum from one or more cameras with active lighting. Active lighting is a technique in which a light is projected onto the eye surface, generating a glint [Mulvey et al., 2008]. The difference between the pupil and the glint is computed to estimate the gaze. Also, many commercial gaze trackers use chin rests and helmets to restrict head movements, making the task easier when compared with head-free estimators. Although this head movement constraint helps the estimation, it is intrusive, since it restricts the natural behavior of the user.

As aforementioned, several techniques use IR [Morimoto et al., 2000; Ji and Yang, 2002]. Nevertheless, a more interesting approach would operate under visible wavelength lighting, since ordinary cameras can be used instead of specific hardware devices. Unfortunately, there are some drawbacks to using visible wavelength lighting instead of IR. For instance, lack of illumination jeopardizes image quality, which may lead to failure during the tracking process. Also, illumination produces specular reflections on the iris, cornea, and sclera surfaces. The pupil might not be detected, due to poor illumination and the low contrast of dark-colored eyes; hence, tracking is generally based on the iris region. However, in some individuals the iris might be blocked by the eyelids, turning gaze estimation under visible lighting into a hard and complex problem.

Figure 1.1. Human female eye image acquired in visible light. There are specular reflections and glints generated on the eye's surface due to light sources in the scene.

Figure 1.1 shows the human eye under visible light. The iris is the circular colored part of the eye. The pupil is the aperture in the center of the iris. The sclera is the white region surrounding the iris. Section 2.1.1 provides a better explanation of the eye's anatomy.

Problem (remote non-intrusive gaze tracker). Given a user and a single camera capable of capturing images under visible lighting, estimate and track the user's gaze in real time.

This is the problem we address in this thesis, and to handle it we propose approaches to two other problems: the Head Pose Estimation Problem and the Iris Segmentation Problem. In this thesis, we revise these three problems and propose solutions for them.

1.3 Related Problems

There are several problems related to gaze estimation which are beyond the scope of this work. However, they are important to its context, and therefore we list them here. The following problems are not considered in this thesis:

• Face recognition: the problem of distinguishing an individual by identifying his face against others. This is a very interesting problem, and there is a vast number of promising works in the literature. Recently, Ni and Chellappa [2010] presented an overview of remote face recognition algorithms.

• Multiple users: in case multiple faces are detected, we track the largest one, assuming that it is the face closest to the camera.

• Illumination invariance: a poorly illuminated face or abrupt light changes might disturb the system's behavior. Illumination invariance is not the focus of this work.
1.4 Contributions

The main contributions sought in the development of this thesis are:

• An overview of the Head Pose, Eye Tracking, and Gaze Estimation problems, with special attention to remote techniques that use a single camera.

• The development of a simple Head Pose Estimator based on image features that allows a reliable pose estimation. Our contribution consists of a mesh projection based on facial proportions.

• The development of an Iris Segmentation Algorithm for extracting the information needed for gaze estimation. Our contribution consists of a mask that decreases glints and reflections on the eye surface, enhancing the limbus detection.

• The development of a Gaze Tracker and the demonstration of its capabilities through heatmap computation and a quantitative analysis of seven individuals. Our motivation for the heatmap is based on the focus of attention: regions that grabbed more of the users' attention are highlighted with reddish colors.

• The application of the proposed approach to the remote control of a Robotic Head. Our motivation was a tele-immersion simulation, where the head movements of the human controller are reproduced remotely by the robot's head.

1.5 Roadmap of this Thesis

Chapter 2 provides an overview of recent results on head pose and gaze estimation, focusing on approaches that use images acquired under visible wavelength. That chapter also describes relevant information regarding eye anatomy and head behavior. Furthermore, it introduces some techniques for iris and sclera detection based on different approaches: isophotes, correlation filters, clustering, and so on. Finally, it presents some databases that are commonly used as ground truth in eye and face detection.

Chapter 3 introduces the methodology applied in this work and the background supporting our design decisions. It describes the two modules proposed in this thesis. The first module defines the steps applied for head detection and pose estimation. The second one describes our approach to iris segmentation, tracking, and gaze estimation. It also discusses the calibration procedure used: considering that we do not have three-dimensional information, we need a calibration that maps points in the image to points on the screen in order to estimate the user's gaze.

Chapter 4 explains how we structured and conducted the experiments (both quantitative and qualitative). Some applications have been developed in robotics whose interaction is built on the head pose and gaze estimation proposed in this thesis. All applications are introduced and discussed in that chapter.

Chapter 5 presents our conclusions on those three problems and the constraints and limitations of the current approach, and it points out some directions for future work. It also discusses the importance of gaze trackers to computer-related research in different areas.

Chapter 2
Gaze Tracking: A Literature Review

Due to their vast applicability in HCI and human behavior studies, several researchers have investigated eye tracking techniques over the past years. Many different approaches have been proposed, some of them requiring specific hardware devices either to enhance the eye detection or to decrease the impact of head movement.
Along with eye and gaze estimation, the head pose estimation problem has also played an important role in the eye tracking community, not only because of the more natural interaction it enables, but also because head pose provides a coarse gaze direction when the eyes are not visible [Murphy-Chutorian and Trivedi, 2009]. A pose can be estimated using, for instance, deformable models or nonlinear regression methods. In this chapter, we aim to provide a literature review of head pose estimation, eye tracking, and gaze tracking solutions. Murphy-Chutorian and Trivedi [2009] organized head pose estimation approaches into eight categories according to their operating domain, proposing a head pose taxonomy that shall be discussed later.

2.1 Eye Segmentation and Gaze Estimation

The eye taxonomy described by Hansen and Ji [2010] categorizes eye detection into shape-based, appearance-based, and hybrid methods. Shape-based methods are those based on features, edges, and other structures that are useful for constructing a fixed or deformable model. Appearance-based methods, on the other hand, focus on template matching or statistical (holistic) approaches for analysing the object's appearance. An advantage of the latter is the possibility of conducting the detection either in the spatial or in a transformed domain. Hybrid methods combine those approaches to improve the results.

In this chapter, we focus not on explaining many different approaches, but on the theoretical foundation for the problems we address in this thesis and on recent results regarding those three problems. For general understanding, we first provide a simplistic model of head movement behavior and eye anatomy, explaining both the properties and the constraints associated with the relevant gaze topics; later in this chapter we discuss each technique in deeper analysis.

2.1.1 Anatomy of the Human Eye

We now move to the eye tracking problem. In order to fully understand how an eye tracker system should work, it is essential to comprehend the anatomy of the human eye. Figure 2.1 shows a simplistic view of the human eyeball.

Figure 2.1. Anatomy of Human Eye (sclera, iris, fovea, pupil, cornea, visual axis, and optic nerve).

According to Hansen and Ji [2010], the human eyeball is relatively spherical, with a radius varying in the range of 12-13 mm. The sclera (also known as the white part of the eye) is one of the most easily distinguishable properties. The iris is commonly defined as the colored part of the eye, which contains an aperture called the pupil. The pupil is responsible for controlling the amount of light that enters the eye by contracting and expanding itself. The cornea is a membrane that lies on the eye's surface. Finally, the fovea is a region in the center of the retina that contains many sensitive cells. The line which connects the fovea and the center of the cornea is known as the Line of Gaze (LoG) or visual axis.

2.1.1.1 Eye Movements

The human eye performs different kinds of movements. According to Duchowski [2007], some eye movements are:

• Saccades: rapid eye movements (lasting from 10 to 100 ms) which are voluntary and reflexive. The goal of this movement is to reposition the fovea with respect to a new location in the environment. Microsaccades, on the other hand, are involuntary and correspond to small spatial random movements that generally occur during fixation.
• Smooth Pursuit: this kind of movement occurs when an individual is tracking a moving object on the screen. The human eye is capable of synchronizing with the velocity of the moving object.

• Fixation: this movement occurs when an individual is focusing on a stationary object, so the image on the retina is stabilized. During fixation, other movements occur, such as microsaccades.

2.1.2 Non Video-based Approaches

Some techniques are not based on video processing, but on measuring signals obtained by different sensors.

Scleral Search Coils (SSC) is a technique that uses a contact lens with a coil of wire attached to the subject's eye. It is based on the principle that a magnetic field induces a voltage in the coil it intersects. In this way, the position of the eye can be accurately retrieved, but the solution is intrusive, due to the use of a contact lens, and uncomfortable for users. Figure 2.2 displays the use of the SSC technique.

Another way to measure the eye position is the Electrooculography (EOG) method. The signal is obtained by placing electrodes around the eye, and the eye position is inferred relative to the head. Figure 2.3 shows an example of an electrooculography system.

2.1.3 Video-based Approaches

Video-based approaches (video-oculography) represent the most common way to track the gaze, by processing images delivered by one or more cameras. There are several different implementations of video-oculography, which may use either visible wavelength light or infrared light. Moreover, the solution might require a wearable device, such as glasses or a helmet, or it can determine the gaze remotely, reducing intrusiveness. This section describes those techniques and classifies current work according to these differences.

Figure 2.2. Sclera Search Coils developed by Chronos Vision. Image extracted from: http://www.chronos-vision.de/eye-tracking-produkte.html

Figure 2.3. EOG system by PHYWE Biology. Image extracted from: http://www.phywe.com/461/pid/26780

2.1.3.1 Head-Mounted Gaze Trackers

Free head movement implies a less intrusive and more natural system for the user, but it makes gaze tracking harder, since head movements affect the accuracy of the estimation.

Among the many HCI applications that could take advantage of a gaze tracker, the EyeWriter initiative [Lieberman et al., 2011] developed open source software along with cheap hardware to allow disabled people to write and draw art using only their eyes. The system is built upon a head-mounted device, as shown in Figure 2.4. Potential users are artists disabled by Amyotrophic Lateral Sclerosis (ALS) and others who cannot move their hands or heads, but only their eyes. The EyeWriter group uses an infrared-sensitive camera attached to the glasses, and the eye is illuminated by IR Light Emitting Diodes (LED) to create a high contrast between the iris and the pupil, thus enhancing the pupil region. Roughly, the software part of the system is divided into eye tracking and drawing modules for performing the art activity. The eye tracking software has an ellipse detection algorithm based on the iris pixels and a calibration procedure that interpolates the gaze location from the segmented iris location.

Figure 2.4.
EyeWriter: an initiative for artists and writers disabled by Amyotrophic Lateral Sclerosis to produce art using the movement of the eyes. (a) Artist using the EyeWriter system; (b) system representation. Images extracted from http://www.eyewriter.org/

2.1.3.2 Remote Gaze Trackers

Remote gaze estimation is usually performed with video cameras and does not require the user to wear glasses, although some remote trackers use chin rests to stabilize the subject's head.

A common approach for commercial eye trackers is to use active near-infrared light with wavelengths in the 780-880 nm range, as described by Hansen and Ji [2010]. Such systems project an IR light onto the individual's eyes, generating a glint, and are calibrated to identify the pupil ellipse and the glint.

Figure 2.5. Active Infrared Illumination. Pupil reflection in infrared mode with one glint projection: (a) bright pupil; (b) dark pupil. Picture obtained in [Mulvey et al., 2008].

Besides infrared illumination, there are many other remote gaze trackers based on visible light, and they are carefully discussed in the next section.

2.1.4 Eye Segmentation Approaches

Parker and Duong [2009] propose a technique for gaze tracking based primarily on sclera recognition. The color of the sclera is approximately stable across individuals of different genders and ethnicities, which makes it an easily detectable feature of the eye.

A two-phase process is performed to detect the sclera region. First, all pixels inside the face bounding box are classified into four categories: A = {skin}, B = {sclera}, C = {hair, iris, and eye shadow}, D = {noise}. Pixels belonging to classes C and D are removed by thresholding, and a histogram of the R channel is generated. This histogram is expected to be bimodal: the larger peak represents skin pixels, while the smaller peak contains sclera pixels. For some people, this pattern is more evident than for others. The second phase uses the Mahalanobis distance in the U and V components (now considering the Luminance and Chrominance Color Space, YUV) to classify the pixels that passed the first phase as either skin or sclera. Finally, iris pixels are detected using thresholding and least squares approximation.

As aforementioned, many IR approaches use the point-reference technique. According to Proença [2011], IR wavelengths can be dangerous, considering that the eye does not respond naturally to the illumination exposure (with pupil contraction, blinking, etc.). Hence, in order to compare the iris to a stable fixed point, they proposed a novel feature called the eye-region point. The eye bounding box is estimated using the previously obtained iris and sclera, and the sclera shape is generalized as an ellipse. Using the eye region boundary and the major axis of the estimated ellipse, they compute the eye-region point by calculating the center of mass. They proposed the eye-iris vector $V = \{(e_{row}, e_{col}), (i_{row}, i_{col})\}$, which combines the rows and columns of the stable reference point (the eye-region point) and of the iris center, respectively. This vector is created during a calibration procedure while the user focuses his gaze on the corners of the screen.

Experiments were conducted with subjects of different ethnicities and genders, under different lighting conditions. A frame is only considered successfully detected if both eye regions are identified and fully bound.
The results of those experiments seem to support the reliability of the proposed method.

Proença [2011] proposed a two-phase technique for segmenting the iris in degraded visible wavelength images. First, a deterministic linear-time algorithm discriminates noise-free iris pixels from other pixels. After that, the iris parametrization is performed using constrained polynomial fitting.

The proposed technique assumes that the sclera is the most distinguishable property of the eye. They empirically chose the hue, blue chroma, and red chroma (hcbcr) color components and extracted a 20-dimensional feature set for the image pixels:

\[
\left\{\, x,\; y,\; h^{\mu,\sigma}_{0,3,7}(x, y),\; cb^{\mu,\sigma}_{0,3,7}(x, y),\; cr^{\mu,\sigma}_{0,3,7}(x, y) \,\right\}, \qquad (2.1)
\]

where x and y define the pixel location and h, cb, cr define the color regions centered at that location. Superscripts denote the average (µ) and the standard deviation (σ) of each region defined by the radii values in the subscripts.

They introduce a novel feature called proportion of sclera, which measures the proportion of sclera pixels in a direction d(↑, ↓, ←, →) with respect to a reference point (x, y). There are also some steps using neural network classification (back-propagation learning algorithm). The approach ends by parametrizing the ellipse using constrained polynomial fitting.

They conducted several experiments on well-known datasets. For instance, their algorithm was compared against three techniques: the integrodifferential operator, active contour approaches, and the algorithm proposed by Tan et al. [2010]. They conclude that their results were usually similar to those of [Tan et al., 2010] on visible wavelength (VW) images, while the integrodifferential and active contour methods had problems on degraded VW images, although they are effective on IR datasets.
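To make the feature set of Equation 2.1 concrete, the sketch below computes it for a single pixel. It is a minimal illustration, not Proença's implementation: the choice of HSV for hue, of YCrCb for the chroma channels, and of square windows standing in for the regions of radii 0, 3, and 7 are our own assumptions.

```python
import numpy as np
import cv2

def hcbcr_features(img_bgr, x, y, radii=(0, 3, 7)):
    """Sketch of the 20-D feature vector of Eq. 2.1 for pixel (x, y).

    Assumptions (not from the original paper): hue comes from HSV,
    cb/cr come from YCrCb, and each 'region' is the square window of
    the given radius centered at (x, y)."""
    h = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)[:, :, 0].astype(np.float32)
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]

    feats = [float(x), float(y)]
    for chan in (h, cb, cr):
        for r in radii:
            win = chan[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1]
            feats.extend([float(win.mean()), float(win.std())])
    # 2 coordinates + 3 channels * 3 radii * 2 statistics = 20 values.
    return np.array(feats)
```

Note that for radius 0 the window is the pixel itself, so its standard deviation degenerates to zero; the paper's exact handling of that case is not specified here.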
Valenti and Gevers [2008] proposed an accurate algorithm for eye location and tracking using isophote curvature. Isophotes are curves connecting points of the same intensity, and they can be used to estimate the centers of semicircular patterns. That center estimation is reached by a voting step based on isophotes. However, the authors point out that in the real world there is no guarantee that the boundaries of an object will have the same intensity values, which means that meaningless results may be produced if all isophotes are allowed to vote for the curvature center. To overcome this, they state that only the meaningful parts of the isophotes should be used: the ones that follow the edges of an object. They call Isocenters (IC) the high responses on isocentric isophote patterns that are near edges. Finally, they discriminate between dark and bright centers by checking the sign of the curvature; considering that the sclera is brighter than the iris, they ignore the positive curvature. The estimated eye center is then represented by the maximum isocenter (MIC).

The approach proposed by Bolme et al. [2009] introduces a class of correlation filters called Average of Synthetic Exact Filters (ASEF). The authors note that eye detection can be performed easily when prior knowledge of the eye's whereabouts is available, e.g., when a face has been successfully detected; an accurate eye localization without prior constraints is a harder and more challenging task. That is the category of localization in which ASEF obtains better performance.

A relevant difference between ASEF and other correlation filters is that ASEF filters are over-constrained, which means that the training process of ASEF filters considers images which specify a desired response (generally a bright peak at the center of the desired object) at every location in the training images. To avoid over-fitting, the filters of those images are averaged. The authors show benefits of ASEF such as more freedom and flexibility in the training step, since training images are not required to be centered on the target: the peak may be placed wherever the target appears.

In their experiments, they conducted both a localization restricted to eye regions and a localization without constraints. The ASEF filters were compared with other correlation filters, with a Cascade Classifier, and with Gabor Wavelet algorithms. The results of the restricted experiment indicate that all algorithms perform well when searching for the eye in a restricted region, but ASEF and UMACE filters produce more interesting results than the two other well-known approaches. In the second experiment, the whole image was taken into account, but the Gabor filter and the Cascade Classifier were removed from the experiment due to problems (such as the high false alarm rate of the Cascade Classifier) that would not produce good results. ASEF showed better results than the other methods. An interesting outcome is that ASEF produces high responses for a correct eye, but rarely outputs a wrong eye.

Nguyen et al. [2010] noticed that Additive Color Space (RGB) images contain information which can make iris detection easier by creating a compensated red channel. They first define Ψ_{R−G}(x, y) and Ψ_{R−B}(x, y) as the differences between the red (Ψ_R(x, y)), green (Ψ_G(x, y)), and blue (Ψ_B(x, y)) channels:

\[
\Psi_{R-G}(x, y) = \Psi_R(x, y) - \Psi_G(x, y), \qquad \Psi_{R-B}(x, y) = \Psi_R(x, y) - \Psi_B(x, y). \qquad (2.2)
\]

Nguyen et al. [2010] also noticed that, for green and blue irises, the values of these differences in the region approximately over the iris tend to be near zero or smaller than zero. Hence, they defined the set Ω of the pixels whose value is smaller than zero in the Ψ_{R−G}(x, y) channel or smaller than five in Ψ_{R−B}(x, y):

\[
\Omega = \left\{ (x, y) \mid \left(\Psi_{R-G}(x, y) < 0\right) \cup \left(\Psi_{R-B}(x, y) < 5\right) \right\}. \qquad (2.3)
\]

Figure 2.6 displays the Ω sets for three European eyes: the first row shows the original RGB images, and the second row shows the pixels inside the Ω set (the white ones).

Figure 2.6. The first row shows European eyes and the second one shows the pixels inside the group Ω.

As can be noticed, the group lies partially over the iris region, giving a good clue about where the iris is. Hence, they propose a compensation factor (λ) that takes into account the values of the Ω pixels in the Ψ_{R−G}(x, y) channel:

\[
\lambda = \ln \left| \frac{\sum_{(x,y) \in \Omega} \Psi_{R-G}(x, y)}{\|\Omega\|} \right|. \qquad (2.4)
\]

The compensated red channel (Ψ_{cR}(x, y)) is obtained by convolving Ψ_{R−G}(x, y) with a Gaussian kernel and adding the result, weighted by the compensation factor (λ), to the red channel:

\[
\Psi_{cR}(x, y) = \Psi_R(x, y) + \lambda \ast G_\sigma \ast \Psi_{R-G}(x, y). \qquad (2.5)
\]

We will explain the impact of compensating an image during the clustering step.
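The following sketch summarizes Equations 2.2 through 2.5 in code. It is our reading of the method, not the authors' implementation; in particular, the Gaussian σ is an arbitrary illustrative choice.

```python
import numpy as np
import cv2

def compensated_red_channel(img_bgr, sigma=2.0):
    """Iris compensation in the spirit of Nguyen et al. [2010].

    The Gaussian smoothing parameter is an illustrative assumption."""
    b, g, r = [c.astype(np.float32) for c in cv2.split(img_bgr)]
    psi_rg = r - g                         # Eq. 2.2
    psi_rb = r - b
    omega = (psi_rg < 0) | (psi_rb < 5)    # Eq. 2.3, as a boolean mask

    # Eq. 2.4: log of the absolute mean of psi_rg over the Omega pixels.
    lam = np.log(np.abs(psi_rg[omega].sum() / max(int(omega.sum()), 1)))

    # Eq. 2.5: add the lambda-weighted, Gaussian-smoothed difference
    # back into the red channel, darkening the iris region.
    smoothed = cv2.GaussianBlur(psi_rg, (0, 0), sigma)
    return r + lam * smoothed
```

Since the Ω pixels of Equation 2.3 are mainly negative, the added term lowers the intensities over the iris, which is exactly the effect discussed next.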
Considering that the image has been compensated, the pixels need to be separated into their classes. In other words, pixels belonging to the iris should be grouped into the same cluster, in the same way that sclera pixels should remain together. For that task, they applied k-Means with four clusters (k = 4), an unsupervised learning algorithm that has been applied to eye segmentation before [Li et al., 2010]. After the clustering process, there are four groups of pixels, one of which may represent the iris that the algorithm is trying to find.

We have implemented the algorithm of [Nguyen et al., 2010], and Figure 2.7 demonstrates the difference between applying the clustering algorithm to the original red channel of an image and applying it to an image compensated by Equation 2.5. The first row shows the result for the first case, where no compensation has been used. As one might notice, the iris is segmented into several classes. This undesired effect happens due to glints generated by light sources, but also due to the color difference between the iris and the pupil region.

Figure 2.7. Clustering using k-Means. The first row shows the clustering without the compensation and the second one shows the groups in the compensated channel.

The pixels of the Ω set (generated by Equation 2.3) are mainly negative. So, by increasing their magnitudes and integrating them into the original image, the iris region becomes darker. Hence, the difference generated by glints and by the pupil region is reduced, increasing the iris separability.

After applying k-Means, Nguyen et al. [2010] suggested that the class with the minimum mean in Ψ_{cR}(x, y) should be selected as the iris. They proposed the following equation for estimating the iris class:

\[
\Omega_{iris} = \underset{k}{\arg\min} \; \frac{\sum_{(x,y) \in \Omega_{I_k}} \Psi_{cR}(x, y)}{\|\Omega_{I_k}\|}, \qquad k = \{1, 2, 3, 4\}. \qquad (2.6)
\]

In Equation 2.6, Ω_{I_k} is the set of pixels grouped into cluster k. After taking the intensity values of group k in the compensated channel and dividing their sum by the size of the group, we have the mean of that group in Ψ_{cR}(x, y). Assuming that the iris has a small mean, the class which minimizes the equation is chosen as the iris class (Ω_{iris}).
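A sketch of the clustering step and of the class selection of Equation 2.6 follows, using OpenCV's k-means. The termination criteria and number of attempts are our own assumptions, and `psi_cr` is assumed to be the output of the `compensated_red_channel` sketch above.

```python
import numpy as np
import cv2

def iris_mask(psi_cr, k=4):
    """Cluster the compensated channel with k-Means and pick the iris
    class as the one with minimum mean intensity (Eq. 2.6)."""
    samples = psi_cr.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, k, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    labels = labels.reshape(psi_cr.shape)
    # Each 1-D cluster center is the mean of its pixels in psi_cr, so
    # taking the argmin over the centers implements Eq. 2.6 directly.
    iris_class = int(np.argmin(centers))
    return labels == iris_class  # boolean mask of candidate iris pixels
```

Selecting the cluster with the smallest center is equivalent to minimizing the per-cluster mean in Equation 2.6, because the k-means center of a one-dimensional cluster is exactly that mean.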
The techniques presented in this section use different datasets, all containing images acquired in visible wavelength. Table 2.1 summarizes those methods¹, the evaluated datasets, and the main strategy applied.

Table 2.1. Dataset Use for Eyes and Face Detection

  Method                      Datasets                       Strategies
  Valenti and Gevers [2008]   BioID, YALE B                  Isophotes Curvature
  Bolme et al. [2009]         FERET                          Correlation Filter (ASEF)
  Parker and Duong [2009]     (None)                         Thresholding + Mahalanobis
  Proença [2011]              FERET, ICE, FRGC, UBIRIS.v2    Feature Extraction + NN + Constrained Polynomial Fitting
  Milborrow [2007]            XM2VTS, BioID, AR              Active Shape Model (ASM)
  Nguyen et al. [2010]        FERET                          K-Means + Hough

¹ The work of Milborrow [2007] has not been introduced yet; it will be explained later.

The BioID [2001] database offers 1521 grayscale images in pgm format with a resolution of 384 × 286 pixels. Images of 23 people were taken under different illumination conditions, and for each image there is a file describing the eye locations. The owners of the dataset proposed the relative eye distance as a measure for comparing the quality of different detection algorithms. There are also several additional points that were manually labeled to allow facial analysis and gesture recognition studies.

The Facial Recognition Technology Database (FERET) [Phillips et al., 2000] was a program sponsored by the Department of Defense through DARPA, whose mission was to develop automatic facial recognition technologies for security, intelligence, and law enforcement personnel in the performance of their duties [Phillips et al., 2000]. It contains 3,368 images of 1,204 people.

The University of Surrey XM2VTS database [Messer et al., 1999] is a multimodal face database available for purchase, containing different information from 295 people acquired during four months. A great advantage of this database is that it has been manually labelled with 68 points across images. This is the dataset chosen by Milborrow [2007] for training his face detector based on deformable templates.

Many eye segmentation approaches have been proposed with interesting and promising ideas. For instance, the colored iris compensation proposed by [Nguyen et al., 2010] helps to reduce glints in the image. We will extend this compensation idea to any iris color, since it reduces reflections and enhances the segmentation approach. Valenti and Gevers [2008] proposed a voting procedure based on isophotes for detecting the eye center. However, Hansen and Ji [2010] point out that the isophote voting method might produce incorrect detections by finding eye corners and eyebrows instead of eyes, because it relies on maxima in feature space. To avoid such problems in our algorithm, we create an image ROI based on a prior face model, and we apply an algorithm to both constrain the search space and remove non-relevant pixels in the image.

2.2 Head Pose Estimation

Human expressions and intentions are easily recognizable in the face. It is straightforward for a human being to perceive and distinguish the orientation and displacements of someone's head. In the computer vision sense, head pose estimation might be understood as the determination of the orientation of a person's head relative to a coordinate system (e.g., the camera's).

Despite all the meaningful roles of head orientation in human behavior, such as expressing intention or gesture, we are interested in the relationship between head orientation and gaze location. Murphy-Chutorian and Trivedi [2009] point out that the head pose problem is intrinsically related to the gaze problem, noting that the head pose presents itself as a coarse indication of the gaze. Moreover, they emphasize that accurate eye tracking requires knowledge of the head pose, and they argue that using only a passive camera, without knowledge of the light conditions, it is not possible to accurately estimate the human gaze.

Before estimating the pose of a given object, it is necessary to determine whether such an object exists in the image, and if so, where it is located.

Problem 2.1 (head detection and localization). Let I be an image; determine if there is a head H in I. If so, retrieve the tuple {x, y, width, height} associated with H, where x, y stand for the upper left corner (in pixel coordinates) of the Region of Interest (ROI) that contains H, and width and height stand for the dimensions of this ROI.

For solving Problem 2.1, two of the most relevant solutions are the Viola and Jones [2004] and Rowley et al. [1998] detectors. In his study of the Active Shape Model, Milborrow [2007] applied both algorithms and discussed the pros and cons of each. He concluded that, although Rowley et al. [1998] presented better results in some of his experiments, it is slower than Viola and Jones [2004]. Hence, in this work, we have chosen Viola and Jones [2004] as our face detector.
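As an illustration of this detection step (Problem 2.1), the sketch below uses OpenCV's stock Haar cascade, which implements a Viola-Jones-style detector, and keeps the largest face, following the multiple-users policy stated in Section 1.3. The cascade file and the detector parameters are assumptions, not the exact configuration used in this work.

```python
import cv2

# Assumed path: recent OpenCV Python packages ship this cascade file.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_head(gray):
    """Solve Problem 2.1: return (x, y, width, height) of the largest
    detected face ROI, or None if no face is found."""
    faces = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest face, assumed to be the one closest to the camera.
    return max(faces, key=lambda f: f[2] * f[3])
```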
2.2.1 Human Head Behavior

Figure 2.8 displays a human head along with its three rotation angles. We now state a related problem that we also address in this thesis.

Figure 2.8. The three degrees of freedom: yaw, pitch, and roll.

Problem 2.2 (head pose estimation). Let I be an image of a head H; determine the three angles (yaw, pitch, and roll) that compose the orientation of H with respect to the camera that acquired I².

² We found it convenient to represent orientation using Euler angles; however, one might choose quaternions or other representations. Any object orientation format is applicable.

Problem 2.2 is closely related to Problem 2.1, which consists of face detection. A common solution for the Head Pose Estimation Problem starts with face detection, although some approaches do not need this detection step. Many solutions have been created for solving Problem 2.2, and they were carefully categorized by Murphy-Chutorian and Trivedi [2009] into the following taxonomy:

(i) appearance template: an image containing a head is compared against a set of previously labeled images with known head poses. The goal is to find the most similar pose to the queried image. This approach is suitable for both high and low resolution images.

(ii) detector array: an image is analyzed by different face detectors trained for specific poses. An advantage of this approach is that a face localization step is not required; once trained, the algorithm is capable of distinguishing whether or not a face is present in the image.

(iii) nonlinear regression: the solution is based on learning a nonlinear functional that maps the image space to the desired poses. Examples are the Multilayer Perceptron and Locally Linear Mapping.

(iv) manifold embedding: a technique that applies a dimensionality reduction algorithm (e.g., Principal Component Analysis (PCA)) to create low-dimensional manifolds for modeling different face poses.

(v) flexible models: this method tries to adjust a generic model to the input face by deforming the features of the model. The pose is estimated using the adjusted features.

(vi) geometric methods: these methods use a set of features, such as the nose and mouth, to infer the head pose. The estimation is based on the geometric configuration of those features.

(vii) tracking methods: methods that estimate the head pose based on the movement between frames.

(viii) hybrid methods: methods that combine the previous approaches.

Murphy-Chutorian and Trivedi [2009] provide a thorough discussion of the advantages and drawbacks of those methods. For instance, they point out that detector array methods do not need an additional step for detection and localization, because the trained detectors are capable of distinguishing between face and non-face; one clear disadvantage, however, is the difficult training step, which needs to handle each discrete pose. Geometric methods exploit the relations between facial features, such as angles, distances, etc.
Our idea for estimating the pose also exploits facial features, but uses deformable models in order to estimate a good model for each specific user. Deformable models have been used before for estimating the head pose. Martins and Batista [2008] propose the use of the Active Appearance Model (AAM) and POS with ITerations (POSIT). They use a three-dimensional scanner to obtain the points of a face model: they first adjust the AAM to the user's face and, using the model created with the scanner, they estimate the pose using POSIT. Deja [2010] proposes a head pose and gaze estimator; for solving the head estimation problem, he also applies AAM and POSIT.

2.2.2 Active Shape Model

In this work, we take advantage of deformable models for finding human facial markers in the image (e.g., eyebrows, nose, etc.). We briefly describe what models in general are and the specific behavior of the Active Shape Model (ASM) as defined by Milborrow [2007].

Definition 2.1 (Model or Shape). A shape is a set of (x, y) points (landmarks) located in an image that defines or represents a geometric figure.

Figure 2.9 shows two different representations of a simple shape. In this shape, there are six points whose Delaunay triangulation generates a pentagon, or five triangles. The edges do not exist in the original shape, just the vertices; however, edges (or even triangles) help humans visualize shapes (Figure 2.9.a). The shape in Figure 2.9.b shows a typical representation for computer-related tasks: an n × 2 matrix.

Figure 2.9. Six-point shape. The left image (a) shows the Delaunay triangulation of the two-dimensional vertices and the right image (b) shows a common matrix representation.

Similarly to the shape presented above, there are much more complex shapes describing facial features. For instance, the BioID database presented earlier in this chapter has been manually marked with 20 points, while XM2VTS has been labelled with 68 points.

In his thesis, Milborrow [2007] first explains the concept of template alignment: an iterative method for producing the minimum distance between shapes. It is a set of transformations (scaling, rotation, and linear translation) applied to a template to match a reference shape. For instance, the following equation transforms P(x, y) into P′(x, y) by applying a rotation (α) and a translation (δx, δy):

\[
P' \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \delta_x \\ \delta_y \end{bmatrix} + \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
\]

As with many other deformable models, ASM requires an initialization close to the real face; hence, ASM is initialized over the face ROI. Milborrow [2007] created two submodels:

(i) Profile Model: acts locally, trying to find the best location for positioning a feature.

(ii) Shape Model: acts globally, trying to correct the feature locations suggested by the profile model in order to produce a plausible face shape.

A profile is based on training and searching. The training phase consists in averaging the area around each landmark to build a profile (a specific profile for each landmark across all training images). The algorithm of Milborrow [2007] samples the area around the landmark (±3 pixels) and tries to find the best match using the Mahalanobis distance (d) between a profile g and the model mean profile ḡ:

\[
d = (g - \bar{g})^T S_g^{-1} (g - \bar{g}), \qquad (2.7)
\]

where $S_g^{-1}$ is the inverse of the profile covariance matrix.

The search phase consists in positioning the landmark at the place that best matches the built profile. The shape model then adjusts the proposed shape for conformity with allowable face shapes. The shape model (x̂) consists of the average face plus distortions:

\[
\hat{x} = \bar{x} + \Phi b, \qquad (2.8)
\]

in which x̄ stands for the average of the aligned training shapes and Φ holds the eigenvectors of the covariance matrix of the training shape points.
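The two submodels can be expressed compactly in code. The sketch below is a schematic rendering of Equations 2.7 and 2.8, not Stasm's implementation; the array shapes and the precomputed statistics (mean profile, inverse covariance, eigenvector matrix) are assumed to be available from a training stage.

```python
import numpy as np

def profile_distance(g, g_bar, s_inv):
    """Eq. 2.7: Mahalanobis distance between a sampled profile g and
    the mean training profile g_bar, with s_inv the inverse of the
    profile covariance matrix."""
    diff = g - g_bar
    return float(diff @ s_inv @ diff)

def shape_from_params(x_bar, phi, b):
    """Eq. 2.8: reconstruct a shape as the mean aligned shape x_bar
    (a flattened 2n-vector) plus the linear combination phi @ b of the
    leading eigenvectors of the training shape covariance."""
    return x_bar + phi @ b
```

In the search loop, the profile model proposes the landmark position with the smallest `profile_distance`, and the shape model projects the result back onto the space spanned by `phi`, keeping the suggested shape face-like.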
Milborrow [2007] trains his algorithm using the XM2VTS database [Messer et al., 1999], evaluates it on the AR dataset [Martinez and Benavente, 1998], and tests it on BioID. He also extends the original 68 landmarks of XM2VTS to 76 and 84 points, providing better detection accuracy. Figure 2.10 illustrates the result of an ASM adjustment using Stasm [Milborrow, 2007]: on the left, the initial model applied over the region containing the face, which is really important for a successful match; on the right, the final adjusted face.

Figure 2.10. Results of applying ASM for adjusting the user's face with the Stasm software [Milborrow, 2007]. On the left, the initial model; on the right, the final adjusted model.

Chapter 3
Methodology

The gaze tracking problem is usually decomposed into several subproblems. Our methodology for this problem is based on two main modules: (i) the head pose estimation module, and (ii) the iris segmentation module, which leads us to gaze estimation. Those modules are described in Figure 3.1.

Usually, before one directs one's gaze in a certain direction, the face moves towards that direction for better accommodation, although it is possible to move the eyes while keeping the head fixed. Hence, pose estimation becomes an important step in the gaze tracking task.

Figure 3.1. Overview of our methodology. The input is an image acquired by a camera. After pose estimation, we perform the iris segmentation and finally the system outputs the gaze location.

The input is the image acquired by a camera. After capturing the image, we run the first module to estimate the head pose, and then the second module detects the user's iris. Finally, the system outputs the gaze location. It is also important to mention that before estimating the gaze, there is a calibration step that maps points in the image to points on the screen. Those modules are closely related: a failure in the first module will affect the second one. For instance, the gaze is estimated based on the position of the iris, which is, by definition, inside a ROI determined by the head pose. Thus, a failure in the head pose module will cause the other modules to fail in cascade.

Algorithm 1 describes, at a high level, all the steps applied for the gaze estimation. In this chapter, we explore each one of those methods, explaining both the ideas and the procedures needed to accomplish each task.
1   begin
2       Points model2D ← ∅, model3D ← ∅;
3       Template eyeTplt ← ∅;
4       Point2D iris ← ∅, gaze ← ∅;
5       while true do
6           Image I ← grabFrame();
7           if model2D is ∅ then
8               model2D ← ASM(I);
9               model3D ← adjust3DModel(model2D);
10          else
11              model2D ← trackModel(I, model2D);
12          end
13          pose ← POSIT(model2D, model3D);
14          ROI ← findEyeROI(I, pose);
15          if eyeTplt is ∅ then
16              iris ← conicFitting(ROI);
17              eyeTplt ← createTemplate(iris);
18          else
19              iris ← templateMatching(ROI, eyeTplt);
20          end
21          gaze ← estimateGaze(iris);
22      end
23  end
Algorithm 1: Algorithm for Remote Gaze Estimation.

In lines 2 to 4, we initialize some sets. During the main loop, we process each frame by first estimating the pose (lines 7 to 13). After that, we define the ROI of the eye (line 14) and estimate the eye location by either detecting or tracking it over the frames (lines 15 to 20). Finally, we estimate the gaze (line 21). For the gaze estimation step, however, a calibration procedure is required, which is discussed later.

3.1 Methodology Overview

Figure 3.2 introduces the stages that we apply to the first problem: head pose estimation. First, it is required to detect a face in the image, as stated in Problem 2.2. For the detection step, we assume a single frontal face looking towards the camera. In case of multiple faces, we consider only the biggest face in the image, which is also the closest one to the camera. The algorithm of Viola and Jones [2004] was chosen for the detection stage. After detecting the face, we crop the ROI and apply the ASM [Milborrow, 2007] to adjust a generic face model to the specific user's face.

Figure 3.2. Methodology for head pose estimation, divided into three main stages: (i) the face is detected using Viola and Jones [2004], (ii) the landmarks representing the face features are obtained using ASM [Milborrow, 2007], and (iii) the pose is restored using POSIT [Dementhon and Davis, 1995].

Now, considering we have specific features detected (mouth, eyebrows, chin), we no longer detect the face in the consecutive frames, but track those features over the next images. This decision was taken considering that face detection and ASM adjustment are too slow to be performed for every new frame; by tracking those features, we obtain a faster solution.

ASM fits a two-dimensional model; however, for estimating the pose, we need a three-dimensional model. Hence, we performed an offline stage (Section 3.2.2) called Mesh Projection for estimating the depth of the two-dimensional model adjusted by ASM.

Considering that so far we have obtained the points of the head model (deformed by some transformations), we can restore the pose of that object by using POSIT [Dementhon and Davis, 1995]. We use POSIT for restoring the rotation matrix and translation vector of the user's head. The head pose can be estimated as long as there is a successful tracking. By successful, we mean a correct consensus in the movement of the image features, e.g., two neighboring features in image t should remain neighbors in image t + 1.

Our second module contains the eye segmentation step for estimating the gaze. Figure 3.3 shows an overview of the module. The input for the eye segmentation step is the estimated face pose. For every new image acquired, we track the head features, whose pose provides the location of both eyes.

The first step of this module consists of applying a compensation to the iris for decreasing the brightness of the eye, in a similar way to the one proposed by
Nguyen et al. [2010]. After being processed by compensation and equalization, the image is binarized. An implicit conic fitting is applied along with RANdom SAmple Consensus (RANSAC) for estimating the iris. The whole process for iris detection is carefully discussed in Section 3.3.

The eye detection process occurs only once per eye, due to real time requirements. After detecting the user's iris, a template is created based on the specific properties of the eye. Hence, the eyes displayed in the subsequent images do not need to be compensated, considering we know what the appearance of the iris is. The only required step is to find the best match for the template. This process is called template matching.

Taking into account that so far we know the current head pose and we have successfully obtained the iris location, we are ready to proceed to the final part of the problem: gaze estimation. The surface of the eye is not planar but curved; however, in this work, we can approximate the eye surface as a plane based on weak perspective. Thus, to estimate the gaze, we perform a calibration that maps points in the image into points on the screen (gaze locations). This calibration procedure pairs two point sets, $P_{eye} = \{P_{eye}^1, P_{eye}^2, \ldots, P_{eye}^n\}$ and $P_{screen} = \{P_{screen}^1, P_{screen}^2, \ldots, P_{screen}^n\}$, that represent the locations of the iris in the image and the locations of the gaze on the screen, respectively. Considering that we are trying to estimate the gaze based on the iris location, we calculate the homography between those two point sets.

Figure 3.3. Methodology for gaze estimation: the ROI that contains the eye is obtained based on the estimated head pose. The iris is detected using implicit fitting and a template is created for the specific eye. After the template creation, the subsequent frames are tracked using template matching. Finally, the gaze is estimated using homography.

3.2 Head Pose Estimation Algorithm

Our algorithm estimates the face pose based on facial landmarks. The landmark localization is performed with the ASM introduced by Cootes et al. [1995]; we apply the ASM developed by Milborrow and Nicolls [2008] to find those points. After the successful initialization of the model, the points are tracked using Lucas and Kanade [1981] (LK) pyramidal optical flow.

A three-dimensional model is required for the head pose estimation. Hence, the two-dimensional points adjusted by ASM were manually projected onto a generic three-dimensional mesh (Section 3.2.2). Considering that we have a model for the possible head, we apply an iterative method called POSIT [Dementhon and Davis, 1995] for estimating the pose of the points related to the model we built. POSIT gives us the transformation of the pose (rotation and translation matrices).

3.2.1 Tracking of Facial Landmarks

Due to real time constraints, it is expensive to perform another ASM fitting for every new frame in order to find the landmarks. Hence, we decided to track those points over the sequence using Optical Flow. The tracking task is based on both spatial (x, y) and temporal (t) information.
Thus, considering that the brightness B is a function of (x, y, t), the optical flow assumes that B is constant and differentiable over time. Hence:

$$\frac{dB(x(t), y(t), t)}{dt} = \frac{\partial B}{\partial x}\frac{dx}{dt} + \frac{\partial B}{\partial y}\frac{dy}{dt} + \frac{\partial B}{\partial t} = 0, \qquad (3.1)$$

where the partial derivative components $(\frac{\partial B}{\partial x}, \frac{\partial B}{\partial y})$ correspond to the spatial gradient $\nabla B$, and the total derivatives $(\frac{dx}{dt}, \frac{dy}{dt})$ correspond to the vector field v. Then, the Image Brightness Constancy Equation, as shown by Trucco and Verri [1998], is:

$$(\nabla B)^T v + B_t = 0, \qquad (3.2)$$

which relates the image brightness B(x, y, t) with the motion field v.

We apply the pyramidal algorithm (LK) of Lucas and Kanade [1981] for tracking the 76 markers presented in the previous section. Roughly, LK relies on three important assumptions: the brightness of the tracked pixels should remain constant; the points should move slowly from frame to frame; and, finally, a spatial relationship between points is assumed, considering that they belong to the same surface. Figure 3.4 shows two different frames whose landmarks were tracked using the LK algorithm.

Figure 3.4. Lucas and Kanade [1981] algorithm tracking, between frame t and frame t + 1, the blue dots that are used for pose estimation.

3.2.2 Mesh Projection and Pose Estimation

Before explaining the pose estimation, we discuss the mesh projection, which is an offline stage where we try to obtain a coarse depth estimation of the 76 features adjusted by the ASM model.

We apply POSIT [Dementhon and Davis, 1995] to retrieve the transformation of the head. This algorithm requires a set of points describing an object under some perspective (the points we adjust with the deformable model). Moreover, we need to provide the model of the object whose pose we are trying to estimate. To build this model, we obtained a three-dimensional head using the makehuman software (http://www.makehuman.org/) and adjusted the ASM model to the frontal projection of the face. After the adjustment, we manually extracted the depth for each one of those points to create the template in Figure 3.5.

Figure 3.5. Mesh Projection: merging three-dimensional information.

When ASM fits the generic model to the user's face, we no longer have a generic model, but an adjusted model which is unique for each user. In order to extend this specific model to the three-dimensional model, we assume that the maximum depth of the face (nose to ear distance in profile view) is proportional to the distance from the chin to the eyebrow (in frontal view), as described in Figure 3.6. Hence, we interpolate the depth values of the mesh based on this proportion. This idea is based on the study of divine proportion in human anatomy described in [Jefferson, 2004]. Our experiments seem to support that this chin-eyebrow proportion provides good results on pose estimation.

Figure 3.6. We consider that the chin to eyebrow distance is proportional to the nose to ear distance. Image extracted from [Jefferson, 2004].

POSIT is divided into two algorithms:

1. Pose from Orthography and Scaling (POS): the goal of this algorithm is to approximate the perspective projection using a scaled orthographic projection. It also estimates the rotation matrix and the translation vector.

2. POS with ITerations (POSIT): the goal of this algorithm is to iteratively compute a better scaled orthographic projection, using the proposed pose found by POS instead of the original image.
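To make the 2D-3D step concrete, the sketch below recovers the head rotation and translation from the tracked landmarks and the mesh-projected model. It is a minimal sketch, not the thesis implementation: we use OpenCV's cv::solvePnP as a stand-in for POSIT (both recover pose from 2D-3D correspondences; POSIT itself is only available in OpenCV's legacy C API), and the camera intrinsics matrix K is assumed to be known:

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Minimal sketch: recover head pose from the 76 tracked landmarks and
    // their mesh-projected 3D counterparts. cv::solvePnP is used here as a
    // stand-in for POSIT; 'K' is an assumed pinhole intrinsics matrix.
    void estimateHeadPose(const std::vector<cv::Point3f>& model3D,
                          const std::vector<cv::Point2f>& model2D,
                          const cv::Mat& K,
                          cv::Mat& R, cv::Mat& t)
    {
        cv::Mat rvec, tvec;
        cv::solvePnP(model3D, model2D, K, cv::noArray(), rvec, tvec);
        cv::Rodrigues(rvec, R);   // rotation vector -> 3x3 rotation matrix
        t = tvec;                 // translation of the head w.r.t. the camera
    }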
Finally, after the initial adjustment, we can track the points and estimate the pose for every new frame acquired by the camera. As soon as the pose is obtained, we set a ROI for the eyes and start the eye segmentation algorithm.

3.3 Eye Segmentation Algorithm

The face pose allows us to crop a Region of Interest (ROI) around each eye. For the initial iris detection, we first start by compensating the iris region, which aims to enhance some properties of the eye. The image is then binarized and sent to a circle detector algorithm. After that, we track the known iris pattern. Algorithm 2 describes how we segment the iris.

input : Eye ROI
output: Iris x, y, r
1   begin
2       Template ← ∅;
3       if not Template then
4           ROIb ← binarize(ROI);
5           C ← compensate(ROIb);
6           x, y, r ← conic_ransac(C);
7           Template ← obtain_template(ROI, x, y, r);
8       else
9           x, y, r ← track(Template, ROI);
10      end
11  end
Algorithm 2: Iris Detection and Tracking Algorithm.

3.3.1 Iris Compensation

We are dealing with images acquired in visible light. The eye's surface reflects light that comes from many different sources. Specifically, the lights in the scene generate glints on the iris and sclera surfaces. Those reflections make reliable eye segmentation difficult. Figure 1.1 shows an example of an eye with specular reflections.

Problem 3.1 (Iris Compensation). Let I be an image that contains an eye. Find the region w ∈ I which most likely represents the iris. Reduce the brightness of w by decreasing the light reflections on the cornea surface and increasing the contrast between iris and sclera (limbus).

When we apply ASM for detecting the head features, we obtain several features delimiting the eye region (such as eye corners, etc.). So, based on the location of those features, we define the ROI and create a mask that will be applied to the image for enhancing the iris region. The first enhancement that we perform is histogram equalization. We choose to work with the red channel of the RGB space, because the sclera is better visualized in this channel [Parker and Duong, 2009].

Figure 3.7 shows the original image, the histogram equalization of the red channel (E), and the mask (ℜ) that is going to be applied to the image. Our algorithm works as follows: we should erase pixels that do not belong to the sclera (S), iris (I), and pupil (P), since they have no relevance to the problem. Pixels belonging to I and P should have their brightness reduced, so:

$$E(x, y) = E(x, y) - \Re(x, y).$$

Figure 3.7. Each row shows the original image ROI, the image after histogram equalization, and the mask applied for decreasing the brightness of the iris. In this example, we see the results for a black and a blue eye.

Considering that each channel is composed of unsigned 8 bits, the subtraction will set pixels outside the sclera to zero, due to the fact that $\Re(x, y) = \{255 \mid \forall (x, y) \notin S(x, y) \cup I(x, y) \cup P(x, y)\}$. As one might notice, the region $\Re(x, y) = \{0 \mid \forall (x, y) \in S(x, y)\}$; thus, the difference presented in the previous equation will not interfere with the original region of the sclera. Finally, the region $\Re(x, y) = \{\tau \mid \forall (x, y) \in I(x, y) \cup P(x, y)\}$ has a value τ that is responsible for decreasing the brightness. Figure 3.8 shows three images describing the enhancing process.
The first image shows the subtraction of the equalized image by the mask introduced before. By doing that, the brightness of the iris region is decreased. After that, the whole image is binarized. However, there are still holes generated by glints and reflections. We then apply an algorithm for filling those holes and prune the irises using Mathematical Morphology (MM), which is discussed in Appendix A. The idea behind this is that dilation might connect noisy pixels to the iris, so we first apply the erosion operation to the image. However, if there are many holes generated by glints, the iris will break apart, destroying its circular shape. Thus, we first fill all possible holes and then start applying erosion for removing noisy pixels without losing the circular shape of the iris.

Figure 3.8. Each row shows the image ROI for the black iris and the blue iris shown in Figure 3.7. The first column shows the ROI after the mask subtraction, the second column shows the binarization result, and the last column shows the final result after hole filling and morphological operations.

3.3.2 Circle Detection

Once we obtain the binary image, we need to estimate the center and radius of the iris. For reaching that goal, we apply implicit conic fitting. Appendix B provides the background for understanding conics and quadrics. In this section, we demonstrate how to perform a conic fitting using Singular Value Decomposition (SVD). The technique for estimating quadrics is very similar; however, we focus on the implicit estimation of a conic (circle). Given the equation of a circle with center (a, b) and radius r:

$$r^2 = (x - a)^2 + (y - b)^2,$$

we can expand it:

$$\underbrace{1}_{A} x^2 + \underbrace{1}_{B} y^2 + \underbrace{(-2a)}_{C} x + \underbrace{(-2b)}_{D} y + \underbrace{(a^2 + b^2 - r^2)}_{E} = 0.$$

If we take advantage of the notation $P = Q^T S$ (see Appendix B), we have $Q^T = [x^2, y^2, x, y, 1]$ and $S^T = [A, B, C, D, E]$. Thus, our problem is now the estimation of S. We solve this problem using Singular Value Decomposition. Once we obtain S, we can retrieve a, b, and r.
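The sketch below shows this estimation as we understand it: each boundary pixel contributes one monomial row, and the coefficient vector S is the right singular vector associated with the smallest singular value. This is a minimal sketch assuming OpenCV's SVD; the struct and function names are ours, not the thesis code:

    #include <cmath>
    #include <opencv2/core.hpp>
    #include <vector>

    // Hypothetical sketch of the implicit circle fit of Section 3.3.2:
    // stack one monomial row Q = [x^2, y^2, x, y, 1] per contour point
    // and take the right singular vector of the smallest singular value
    // as the coefficient vector S = [A, B, C, D, E].
    struct Circle { double a, b, r; };

    Circle fitCircleSVD(const std::vector<cv::Point2d>& pts)
    {
        cv::Mat Q((int)pts.size(), 5, CV_64F);
        for (int i = 0; i < (int)pts.size(); ++i) {
            const cv::Point2d& p = pts[i];
            Q.at<double>(i, 0) = p.x * p.x;
            Q.at<double>(i, 1) = p.y * p.y;
            Q.at<double>(i, 2) = p.x;
            Q.at<double>(i, 3) = p.y;
            Q.at<double>(i, 4) = 1.0;
        }
        cv::Mat w, u, vt;
        cv::SVD::compute(Q, w, u, vt, cv::SVD::FULL_UV);
        cv::Mat S = vt.row(4);           // null vector: smallest singular value
        double A = S.at<double>(0, 0);   // for an exact circle, A == B
        double C = S.at<double>(0, 2), D = S.at<double>(0, 3), E = S.at<double>(0, 4);
        Circle c;
        c.a = -C / (2.0 * A);            // center from the expanded equation
        c.b = -D / (2.0 * A);
        c.r = std::sqrt(c.a * c.a + c.b * c.b - E / A);
        return c;
    }

The overall scale of S is arbitrary, so the center and radius recovered this way are invariant to the sign of the singular vector; a degenerate fit (A near zero) would indicate that the points do not describe a circle.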
3.3.3 Consensus of Circle Fitting

In order to provide robustness to our approach, the implicit conic fitting is applied along with RANSAC [Fischler and Bolles, 1981]. RANSAC is a technique for fitting (experimental) data to a model when there are outliers. We made that decision due to the fact that, even after the morphological operations, we still have some noisy pixels that might disturb the circle estimation process. Roughly, instead of using all data points to fit the model, RANSAC starts with a small random subset.

Algorithm 3 demonstrates the usage of RANSAC along with SVD for estimating the best circle in the image. The input of the algorithm contains: (i) the points that we want to fit (P); (ii) the number of iterations we want to perform (k); (iii) the minimum number of points (m) inside the inliers group needed to consider a model as a good one; and, finally, (iv) the threshold (t) for allowing a point to enter the inliers group. Line 2 initializes the model, inliers, and error sets. In lines 4 to 6, we randomly pick a subset of the original points, and we estimate the circle (Mc) for those points (lines 7 to 10). When the initial model is created, we test the other points that are not in our initial set. If those other points are close enough to the circle (i.e., the distance is smaller than t), we add them to the inliers group (Ic). If the number of inliers is greater than m, it means we found a good model; finally, if our new model has a smaller error than the current model, we reassign the previous data (model, inliers, and error).

Data: P = {p1, . . . , pn}, pi is the ith point in P
Result: the best model (center (a, b) and radius r) for the circle
input : iterations k, threshold t, min points m
output: best model M, inliers I, and best error ε
1   begin
2       I ← ∅; M ← ∅; ε ← ∞;
3       for i ← 1 to k do
4           for z ← 1 to 5 do
5               Ic ← Ic ∪ {Pr};  // r is a random index
6           end
7           Q ← Ic × Ic^T;
8           UΣD^T ← SVD(Q);
9           εc ← Σ5,5;
10          Mc ← U1:5,5;
11          for pj ∈ P and pj ∉ Ic do
12              if distance(pj, Mc) < t then
13                  Ic ← Ic ∪ {pj};
14              end
15          end
16          if ‖Ic‖ ≥ m then
17              if εc < ε then
18                  I ← Ic;
19                  M ← Mc;
20                  ε ← εc;
21              end
22          end
23      end
24      return I, M, ε;
25  end
Algorithm 3: Circle Detection Algorithm.

Figure 3.9 shows the results of applying RANSAC along with implicit fitting for estimating the best iris in the image. The result represents the circle (green) describing the best iris found.

Figure 3.9. Results of Iris Detection.

3.3.4 Iris Tracking

Once we have found the user's iris, we track it to avoid the computational cost of compensating the image again, running RANSAC, and all those previous steps. Our assumption is based on the fact that the user's iris does not change too much during the process: the iris in frame t + 1 has a strong relation with the iris in frame t. Therefore, it is suitable to create an image patch with the discovered iris for tracking it along the frames.

The technique known as template matching computes the difference between an image patch and another image (the one in which we are trying to find that patch) by sliding the patch over it. Figure 3.10 shows how the matching is performed.

Figure 3.10. Template Matching: the patch is slid over the image, searching for the best match.

There are many different ways of calculating the difference between the two images: Squared Difference, Correlation Difference, Correlation Coefficient Difference, and so on. We decided to use the squared difference. Let F be the final image, P the iris patch used, and I the eye ROI. The squared difference is computed as follows:

$$F(x, y) = \sum_{x', y'} [P(x', y') - I(x + x', y + y')]^2 \qquad (3.3)$$

Hence, when the iris is detected, a template is created using the region over the iris. Thus, in the subsequent images, instead of performing all those steps, we compute the squared difference between the patch and the image to find the best match for the user's iris.
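This step maps directly onto OpenCV's template matching with the squared-difference measure; the sketch below is a minimal illustration under that assumption (the function name trackIris is ours):

    #include <opencv2/imgproc.hpp>

    // Minimal sketch of the iris tracking step: slide the iris patch over
    // the eye ROI with the squared-difference measure (Equation 3.3) and
    // return the top-left corner of the best match.
    cv::Point trackIris(const cv::Mat& eyeROI, const cv::Mat& irisPatch)
    {
        cv::Mat result;
        cv::matchTemplate(eyeROI, irisPatch, result, cv::TM_SQDIFF);
        double minVal;
        cv::Point minLoc;
        cv::minMaxLoc(result, &minVal, nullptr, &minLoc, nullptr);
        return minLoc;   // lowest squared difference = best match
    }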
3.4 Gaze Estimation Algorithm

We apply a homography to estimate the gaze, mapping points in the image to points on the screen. Deja [2010] has also applied homography for gaze estimation. It requires a calibration procedure to pair up those two planes (image and screen). This calibration displays a pattern (Figure 3.11) that guides the user during the procedure. This idea is similar to the EyeWriter software described in Chapter 2.

Figure 3.11. Calibration Pattern: while calibrating, the user has to look at the corners of this grid. A black circle with changeable radius appears showing the next location for fixation and, when the red circle appears, the location of the iris in the image is grabbed.

At each grid intersection, a black circle with a changeable radius appears, indicating the next calibration location. When the black circle becomes a red circle with fixed size, the two points are being paired up and the user cannot blink; otherwise, wrong iris values will be paired. The grabbing time is about three seconds, which means that the user cannot blink during three seconds for each one of those nine points. This process is really important for the estimation, considering that a poor calibration will lead to a poor gaze estimation.

The homography procedure becomes easier to understand when looking at the system configuration displayed in Figure 3.12. The user stands in front of the screen and the camera. The camera can be placed either at the top of the screen or at the bottom. During our experimental step, we found that placing the camera at the bottom presents better results, because there is less occlusion by the eyelids.

Figure 3.12. Calibration Procedure: the user stands in front of the camera and the screen that contains the grid. A homography is created mapping the eye location in the image to the gaze on the screen.

Figure 3.13 shows a good calibration example, in which all nine points were successfully obtained.

Figure 3.13. Example of a good calibration. All nine points were successfully detected. As one can notice, the displacement between those points is really subtle.
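A minimal sketch of this mapping, assuming OpenCV: the nine calibration pairs give the homography H, and each newly detected iris center is then projected to a screen coordinate. The function name is ours, not the thesis code:

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Minimal sketch of the calibration mapping: estimate the homography H
    // from the paired sets P_eye -> P_screen collected during calibration,
    // then map a newly detected iris center to a gaze point on the screen.
    cv::Point2f irisToScreen(const std::vector<cv::Point2f>& Peye,
                             const std::vector<cv::Point2f>& Pscreen,
                             const cv::Point2f& iris)
    {
        cv::Mat H = cv::findHomography(Peye, Pscreen);   // needs >= 4 pairs
        std::vector<cv::Point2f> in(1, iris), out(1);
        cv::perspectiveTransform(in, out, H);
        return out[0];   // estimated gaze location in screen coordinates
    }

In practice, H would be computed once, right after calibration, and reused for every subsequent frame.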
Chapter 4 Experimental Analysis

In this chapter, we describe the results obtained by applying the methodology discussed in this work: the head pose estimation and the gaze tracker. The chapter is organized as follows: we first present the qualitative results, where we focus on the development of different applications that could take advantage of each technique. After that, we explain how we conducted our quantitative experiments and the uncertainty associated with the gaze estimation. Finally, we discuss the limitations of our approach.

4.1 Experimental Setup

This section clarifies the setup of our system. Hence, we describe the libraries (software) and devices (hardware) that were used during the development of the main module, as well as additional hardware devices (such as robots and servomotors) that were used during the experimental phase of this work.

4.1.1 Software

Our software was developed under the Linux Operating System (but it is not restricted to it) using the C++ language with the help of the following libraries:

• OpenCV (http://opencv.willowgarage.com): a widely used computer vision library which contains many algorithms for handling image processing and vision problems.

• OpenGL (www.opengl.org): a library for rendering three-dimensional models, such as the axes of the head direction and the face mesh. It allows one to define the location of a virtual camera pointing towards a specific direction where many models (e.g. triangles, meshes, points, etc.) can be visualized.

• Boost (www.boost.org): a set of C++ libraries (e.g. Thread, System, Filesystem) containing many helper classes and functions for speeding up the software development process. For instance, our eye tracker works with the threads available in Boost.

• Golld (www.verlab.dcc.ufmg.br): OpenCV retrieves images from a camera; however, it is currently not possible to change specific hardware controls, such as enabling or disabling white balance, gain, exposure, and so on. Hence, Golld is a library developed at VeRLab at UFMG that allows one to set a specific hardware configuration.

• Video for Linux 2 (V4L2): a video capture programming interface for the Linux operating system. Golld is based on V4L2.

• Stasm [Milborrow and Nicolls, 2008]: the Active Shape Model (ASM) library for shrinking the generic model into a specific user's face.

• GSL (www.gnu.org/s/gsl): the GNU Scientific Library, a numerical library that was used in this work for handling linear algebra problems.

• QT (qt.nokia.com): the graphical user interface library applied for rendering windows, buttons, and so on. For instance, the calibration procedure required the drawing of lines, circles, and animations on the screen; those were rendered entirely in QT windows.

• QWT (qwt.sourceforge.net): a QT library extension for rendering graphics (line and histogram plots). In this work, QWT was used to provide the scatter plot after completing the calibration procedure, in order to know the quality of the results obtained.

• ImageMagick (www.imagemagick.org): a library that contains many algorithms for image manipulation and drawing, which was used for drawing information inside images, overlapping images, and so on.

4.1.2 Hardware

The approach presented in this dissertation used only one camera. The chosen camera was the Logitech 9000, with a resolution of 800 × 600 pixels, standing about 50 cm from the user. We have disabled most automatic controls, such as gain and auto white balance. This camera is presented in Figure 4.1 along with the robotic head developed. The computer processor was an Intel Core 2 Duo 2.1 GHz.

Later in this chapter, we introduce the remote control of a robotic head. For this robotic head, we have used two Logitech 9000 cameras and two servomotors. The images grabbed by the camera were delivered to the user through the UDP network protocol. During the experiments, the camera was standing on a Pioneer robot. The calibration procedure requires the user to look at a computer screen; in our experiments, it was a 17-inch screen with resolution 1600 × 900.

4.2 Qualitative Results

In this section, we present two applications of the techniques developed in this work. One application uses gaze tracking, while the other takes advantage of pose estimation only.

4.2.1 Heatmap of Gaze Location

The first application we developed for validating the proposed approach was the generation of a heatmap. A heatmap is a visualization technique in which some regions are highlighted by overlapping color intensity maps. In our case, we have tracked the user's gaze and highlighted the regions where he focused his attention. Our algorithm displays the gaze locations computed while the user was observing the image content. Regions where the user spent more time are more likely to be red; on the other hand, regions where the user spent less time will be green.

The main idea behind this application is the fact that, for some studies, it might be extremely relevant to understand what is attracting the human attention, either to study the basis of image understanding or to indicate which properties of an image affect the interaction with a computational system. Hence, the heatmap is an adequate visualization technique to combine both pieces of information: spatial location and fixation duration. The spatial location is represented by the image locations where the user focused his gaze. The fixation duration is associated with the color of those locations in the map.
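The sketch below shows one way such a heatmap can be produced with OpenCV. It is a minimal sketch under our own assumptions (Gaussian spreading of fixation counts and the JET colormap; the thesis does not specify its exact rendering), and it assumes a 3-channel input image:

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Hypothetical sketch of the heatmap rendering: accumulate gaze samples
    // (fixation duration ~ sample count), spread them, colorize, and blend.
    cv::Mat renderHeatmap(const cv::Mat& image, const std::vector<cv::Point>& gazes)
    {
        cv::Mat acc = cv::Mat::zeros(image.size(), CV_32F);
        for (const cv::Point& g : gazes)
            if (g.inside(cv::Rect(0, 0, acc.cols, acc.rows)))
                acc.at<float>(g) += 1.0f;
        cv::GaussianBlur(acc, acc, cv::Size(0, 0), 25.0);  // spread each fixation
        cv::normalize(acc, acc, 0, 255, cv::NORM_MINMAX);
        cv::Mat acc8u, color, out;
        acc.convertTo(acc8u, CV_8U);
        cv::applyColorMap(acc8u, color, cv::COLORMAP_JET); // red = long fixation
        cv::addWeighted(image, 0.6, color, 0.4, 0.0, out); // overlay on content
        return out;
    }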
4.2.2 Remote Control of a Robotic Head

Tele-immersion is one of the several research fields that could take advantage of the results of the work presented here. There has been a lot of effort on the remote control of robots, e.g. remote surgeries, rescue robotics, and surveillance. For instance, Fernandes et al. [2011] have controlled a robotic head using an Inertial Measurement Unit (IMU), as displayed in Figure 4.1.

Figure 4.1. Robotic Head.

Our robotic head is composed of two servomotors and two Logitech 9000 cameras, and it is mounted on a Pioneer robot (Figure 4.2). It has two degrees of freedom (pan and tilt) and can be remotely controlled using our head pose estimator. When the head pose of the user is obtained, it is transmitted to the robot, which reproduces the behaviour using the robotic head. Figure 4.3 shows the user moving the robotic head using his pose.

One improvement would be to calibrate that stereo system, allowing the reconstruction of a remote environment that prioritizes the region the user is looking at. Besides that, by combining both images, one could use three-dimensional stereo vision to simulate the environment in which the robot is inserted.

4.3 Quantitative Analysis

Our quantitative analysis approach is based on the results obtained by an application where some points are displayed on the screen and the gaze is estimated during three seconds for each point. By using this approach, we can obtain the accuracy and precision of the system. We present the results we obtained, as well as the uncertainty associated with each user in both axes for all points.

Figure 4.2. Tele-immersion robot used in experiments.

Figure 4.3. Robotic Head: two cameras were attached to two servomotors, providing the yaw and pitch degrees of freedom.

The process occurs as follows: users use a chin rest for stabilizing the head position. The system is initiated, obtaining the face pose. When the user feels comfortable with the system, the calibration procedure starts. The user looks at ten points on the screen for three seconds each. Also, three seconds were given for resting the eyes. During this resting time, the user is allowed to blink; however, during the capturing time, the user can neither blink nor move his head. We have tested the system with seven users whose skin colors vary from Caucasian to a darker skin (Latin), with light brown and dark black eyes.

Although we obtain the head pose, we were unable to estimate the gaze and allow free head motion at the same time. Thus, it is really important to mention that the results were obtained using a chin rest for holding the user's head; otherwise, the results presented here would not reflect the uncertainty associated with the algorithm, but the uncertainty associated with how long the user can stay still without moving his head.

Figure 4.4. Overview of Individuals. Seven individuals were chosen and they are using a chin rest during the experiments.

Figure 4.4 shows the users during the experimental evaluation. Users are looking at the top of the screen. Most individuals have dark colored eyes, which makes pupil detection in visible light a really hard task.
Hence, using visible light, we need to identify the boundaries of the irises. However, some users have their irises slightly blocked by eyelids, leading to more complexity.

Table 4.1. Quantitative Results for Individual 1

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (131.614,216.055)    22.062    30.872    68.386    16.055
2   (200,400)   (169.977,372.211)    10.290    12.998    30.023    27.789
3   (200,600)   (180.032,595.909)    22.423    18.855    19.968    4.091
4   (600,200)   (606.913,224.814)    12.075    33.563    6.913     24.814
5   (600,400)   (599.360,395.442)    21.097    14.242    0.640     4.558
6   (600,600)   (606.452,584.591)    17.548    33.829    6.452     15.409
7   (1000,200)  (993.268,194.902)    36.651    25.197    6.732     5.098
8   (1000,400)  (973.932,398.385)    18.739    17.815    26.068    1.615
9   (1000,600)  (966.821,595.219)    27.150    22.523    33.179    4.781
10  (1400,200)  (1425.927,209.254)   35.029    23.189    25.927    9.254
11  (1400,400)  (1389.441,411.427)   39.486    23.797    10.559    11.427
12  (1400,600)  (1380.071,591.811)   30.406    26.405    19.929    8.189

Table 4.2. Quantitative Results for Individual 2

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (-60.178,144.838)    10.993    10.535    260.178   55.162
2   (200,400)   (78.533,402.470)     26.996    17.809    121.467   2.470
3   (200,600)   (-129.165,642.806)   63.276    22.047    329.165   42.806
4   (600,200)   (512.367,172.542)    19.985    54.221    87.633    27.458
5   (600,400)   (550.538,319.736)    22.002    29.502    49.462    80.264
6   (600,600)   (603.151,587.753)    52.059    10.839    3.151     12.247
7   (1000,200)  (965.478,202.038)    18.628    11.336    34.522    2.038
8   (1000,400)  (953.926,377.338)    18.653    18.514    46.074    22.662
9   (1000,600)  (1094.450,623.324)   32.929    35.501    94.450    23.324
10  (1400,200)  (1486.091,132.511)   40.263    16.690    86.091    67.489
11  (1400,400)  (1510.403,385.137)   60.059    20.260    110.403   14.863
12  (1400,600)  (1560.118,508.448)   68.624    37.147    160.118   91.552

Calibration is one of the most important steps during the gaze estimation. Based on the calibration, we can estimate the user's gaze with either poor or good accuracy. In Figure 4.5, we show the calibration results for the seven users introduced here.

4.3.1 Gaze Measurements

Figure 4.5 shows the center of the iris in the image coordinate system when users were looking at the calibration points on the screen. The horizontal and vertical axes correspond to the rows and columns of the image in pixels.

Figure 4.5. Calibration Results (one scatter plot per individual, Individuals 1 to 7). All results use pixels as units of measurement.

Figure 4.6. Gaze results of Individual 1.
Table 4.3. Quantitative Results for Individual 3

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (142.239,190.214)    22.989    6.201     57.761    9.786
2   (200,400)   (-3.164,416.743)     17.680    10.152    203.164   16.743
3   (200,600)   (129.689,549.464)    28.927    29.837    70.311    50.536
4   (600,200)   (611.784,165.926)    47.375    17.082    11.784    34.074
5   (600,400)   (408.927,466.807)    0.983     9.338     191.073   66.807
6   (600,600)   (610.500,594.624)    28.238    19.509    10.500    5.376
7   (1000,200)  (1019.238,115.202)   27.787    23.119    19.238    84.798
8   (1000,400)  (866.498,480.524)    29.135    18.960    133.502   80.524
9   (1000,600)  (1031.223,594.839)   21.916    9.569     31.223    5.161
10  (1400,200)  (1530.578,114.990)   41.904    17.720    130.578   85.010
11  (1400,400)  (1485.639,334.262)   34.329    11.091    85.639    65.738
12  (1400,600)  (1676.977,617.480)   43.176    17.053    276.977   17.480

Figure 4.7. Gaze results of Individual 2.

Table 4.4. Quantitative Results for Individual 4

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (110.113,147.994)    0.387     0.019     89.887    52.006
2   (200,400)   (101.208,383.863)    21.568    3.416     98.792    16.137
3   (200,600)   (0.000,683.429)      0.000     16.716    200.000   83.429
4   (600,200)   (598.112,154.785)    8.735     9.676     1.888     45.215
5   (600,400)   (525.159,340.260)    28.122    10.105    74.841    59.740
6   (600,600)   (404.204,671.818)    17.925    37.546    195.796   71.818
7   (1000,200)  (912.656,174.348)    22.232    9.530     87.344    25.652
8   (1000,400)  (838.786,361.253)    20.475    17.337    161.214   38.747
9   (1000,600)  (896.948,674.977)    8.222     6.548     103.052   74.977
10  (1400,200)  (1376.000,175.000)   0.000     0.000     24.000    25.000
11  (1400,400)  (1290.507,370.913)   6.195     17.383    109.493   29.087
12  (1400,600)  (1287.062,598.537)   13.105    18.880    112.938   1.463

Figure 4.8. Gaze results of Individual 3.

Table 4.5. Quantitative Results for Individual 5

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (16.826,89.022)      10.885    1.360     183.174   110.978
2   (200,400)   (27.627,582.831)     30.057    22.094    172.373   182.831
3   (200,600)   (135.945,1019.860)   30.503    26.658    64.055    419.860
4   (600,200)   (410.257,129.236)    28.026    31.917    189.743   70.764
5   (600,400)   (578.579,648.087)    44.898    33.452    21.421    248.087
6   (600,600)   (653.728,873.690)    50.316    10.974    53.728    273.690
7   (1000,200)  (813.115,131.465)    19.835    28.711    186.885   68.535
8   (1000,400)  (1050.247,652.203)   34.890    38.219    50.247    252.203
9   (1000,600)  (1032.386,852.798)   50.434    38.659    32.386    252.798
10  (1400,200)  (1372.943,233.429)   65.999    39.433    27.057    33.429
11  (1400,400)  (1573.102,659.920)   71.647    45.054    173.102   259.920
12  (1400,600)  (1629.571,904.635)   36.556    44.835    229.571   304.635

Figure 4.9. Gaze results of Individual 4.
Table 4.6. Quantitative Results for Individual 6

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (138.022,503.411)    17.224    13.138    61.978    303.411
2   (200,400)   (181.784,454.205)    19.318    41.067    18.216    54.205
3   (200,600)   (251.834,670.298)    40.114    62.801    51.834    70.298
4   (600,200)   (395.697,457.423)    97.269    22.257    204.303   257.423
5   (600,400)   (597.841,413.933)    24.290    28.840    2.159     13.933
6   (600,600)   (551.275,584.272)    32.964    15.885    48.725    15.728
7   (1000,200)  (801.184,272.191)    51.068    20.063    198.816   72.191
8   (1000,400)  (973.739,427.979)    32.502    31.729    26.261    27.979
9   (1000,600)  (973.356,530.904)    57.845    28.093    26.644    69.096
10  (1400,200)  (1538.885,93.665)    61.040    25.307    138.885   106.335
11  (1400,400)  (1565.189,428.568)   114.562   46.047    165.189   28.568
12  (1400,600)  (1490.541,631.526)   87.952    50.073    90.541    31.526

Figure 4.10. Gaze results of Individual 5.

Table 4.7. Quantitative Results for Individual 7

Id  Point       Mean                 SD Row    SD Col    Err Row   Err Col
1   (200,200)   (160.632,-504.749)   1418.648  921.570   39.368    704.749
2   (200,400)   (-864.896,308.313)   439.424   139.571   1064.896  91.687
3   (200,600)   (-420.925,619.291)   87.518    77.279    620.925   19.291
4   (600,200)   (573.726,-57.811)    122.444   53.830    26.274    257.811
5   (600,400)   (497.753,400.380)    79.186    57.727    102.247   0.380
6   (600,600)   (556.243,607.184)    72.797    45.754    43.757    7.184
7   (1000,200)  (1230.198,191.090)   40.206    26.339    230.198   8.910
8   (1000,400)  (1116.814,398.094)   35.781    32.805    116.814   1.906
9   (1000,600)  (1319.060,851.367)   9.325     33.174    319.060   251.367
10  (1400,200)  (1538.532,287.852)   10.449    31.342    138.532   87.852
11  (1400,400)  (1532.482,497.620)   6.929     15.730    132.482   97.620
12  (1400,600)  (1570.934,768.745)   4.528     10.951    170.934   168.745

Figure 4.11. Gaze results of Individual 6.

The calibration chart shows, for instance, the difference between the highest and lowest values on the y axis, and the accuracy and precision of the calibration. We computed those values and present them in Tables 4.1 to 4.7. The first column (Id) shows the index of the point (in left-right, top-down order). The second column (Point) shows the real location of the point on the screen. The third column (Mean) shows the estimated point location. Columns SD Row and SD Col show the precision of the estimation, and columns Err Row and Err Col describe how accurate the estimation was.
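For clarity, the sketch below shows how these per-point statistics can be computed from the raw gaze samples. It is our own minimal illustration (the struct, function name, and printout are ours, not the code used in the experiments):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical sketch of the statistics in Tables 4.1 to 4.7: the
    // samples are the gaze estimates collected while the user fixates a
    // known target; the standard deviation measures precision and the
    // distance between the sample mean and the target measures accuracy.
    struct Sample { double row, col; };

    void pointStats(const std::vector<Sample>& s, double targetRow, double targetCol)
    {
        double mr = 0, mc = 0;
        for (const Sample& p : s) { mr += p.row; mc += p.col; }
        mr /= s.size(); mc /= s.size();

        double vr = 0, vc = 0;
        for (const Sample& p : s) {
            vr += (p.row - mr) * (p.row - mr);
            vc += (p.col - mc) * (p.col - mc);
        }
        double sdRow = std::sqrt(vr / s.size());    // precision (SD Row)
        double sdCol = std::sqrt(vc / s.size());    // precision (SD Col)
        double errRow = std::fabs(mr - targetRow);  // accuracy (Err Row)
        double errCol = std::fabs(mc - targetCol);  // accuracy (Err Col)
        std::printf("Mean (%.3f,%.3f) SD (%.3f,%.3f) Err (%.3f,%.3f)\n",
                    mr, mc, sdRow, sdCol, errRow, errCol);
    }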
The iris of Individual 1 is mostly visible. The uncertainty associated with the iris detection during the calibration procedure was really low, which indicates that the gaze estimation might present good results as well. Indeed, the iris tracking was really reliable for that user, as shown in Figure 4.6, which contains the plot of the 12 points on the screen (colored squares with black borders) and the estimated locations for Individual 1. As can be seen, the dispersion of the estimated gaze was low, showing good precision, and the distance between the estimation and the real location shows good accuracy.

Before discussing Individual 2, we need to observe the calibration results in Figure 4.5. The dispersion of the calibration for this user was also really small. It is also possible to see that this user was not as close to the camera as Individual 1: for Individual 1, the difference between the highest and lowest locations on the y axis is about 8 pixels, while for Individual 2 it is only around 4 pixels. Results for Individual 2 indicate that the hard cases for estimation are at the borders, as shown in Figure 4.7.

The calibration procedure for Individual 3 also shows good precision, which means that when the user looks in a certain direction, the estimation of the eye tends to produce a small deviation, even though it might not be accurate. Figure 4.8 shows good precision in the gaze estimation, but not so good accuracy. Individual 4 has really good precision in almost all mapped locations, as displayed by the calibration chart. However, there was an outlier in the corner, which indicates that for this location the estimation might produce wrong results.

Individuals 5, 6, and 7 have really poor calibration results. Individual 5 does not have a good precision, and the results displayed in Figure 4.10 show that the poor estimation for this individual is mostly related to the y axis. The calibration for Individual 6 also lacks precision, meaning that, for each point, the tracking produced many different values. This high dispersion around the mean produced a bad estimation, as shown in Figure 4.11.

Figure 4.12. Gaze results of Individual 7.

One behaviour that can be noticed for all individuals is that central points have less uncertainty. This gives us a clue that it is easier to track the iris when it is directed towards the center of the screen, but when individuals look towards the corners, the uncertainty may increase.

4.4 Limitations of Our Work

As many works, our approach contains limitations that will be investigated. Some smaller issues are discussed in the conclusions; for instance, one limitation of our approach is performing gaze estimation along with head movements. We now state those limitations, from the most serious issues to the simplest ones.

Although we have successfully obtained the head pose, the gaze estimation is only reliable when the head is static. It is not straightforward to merge both pieces of information. For instance, a simple translation of the user along the X axis would invalidate the mapping obtained by the homography. Currently, the user can move his head freely while the pose and iris location are estimated, but as soon as the system is calibrated for gaze estimation, the user cannot move his head; otherwise, the gaze will not be established. We intend to solve this problem in the future, considering that we have both the rotation matrix and the translation vector of the head.

The current approach works well for eyes that show a good iris exposure. For many eyes, this is not true, as shown in Figure 4.13. As can be seen, the major part of the iris in the left eye is visible; on the other hand, the right eye has the iris partly blocked by eyelids.
Although the system might eventually find the iris, it is very likely that it will fail to accurately track it. The accuracy for this latter eye is really low compared with the one in the left image. Hence, a limitation of our approach is the low accuracy of the eye location when the iris is partially occluded by eyelids. A straightforward solution for users with this characteristic is the use of infrared light for detecting the pupil instead of the iris.

Figure 4.13. The left eye demonstrates good iris exposure (i.e. a high percentage of the iris is visible); on the other hand, the right eye has the iris blocked by the eyelids.

The last problem is the robustness of the head tracking. If the user is being tracked by the system and the face region is blocked by the hand, or a moving object crosses the background in such a way that it disturbs the features, the pose estimation fails. Hence, our system needs to obtain the feature movement consensus in order to correct wrongly tracked values and increase stability.

Chapter 5 Conclusions and Future Work

The advance of technology changes the way people interact with computers. Computer Vision has brought many unusual and challenging opportunities for interacting with human beings. For instance, gaze tracking is the problem of estimating where a person is looking. To be more specific, the problem discussed in this thesis is the remote estimation of human gaze using a single camera in visible lighting. This problem can be addressed in very different ways. Our solution was to split it into two others: (i) head pose estimation and (ii) eye segmentation.

A single camera system consists of a scenario where there is no explicit three-dimensional information in the acquired image, i.e. we cannot know, without prior information, whether the distance from a certain object to the camera is larger than the distance from another object to the same camera when this camera is the only one responsible for acquiring images. This is our scenario; nevertheless, we found a way to estimate the missing information: by using ASM and the three-dimensional face model, we are able to notice, for instance, that the nose is closer to the camera when compared to the eye. Hence, the first challenge addressed in this thesis was to overcome the lack of depth information in the image by solving the Head Pose Estimation Problem.

Our experiments showed that the algorithm implemented in this thesis is powerful and reliable when the user performs smooth movements. However, when fast movements occur, or the face is blocked by the user's hand, or even when the user turns his head completely in such a way that most features are not visible, the pose cannot be estimated and the tracking is lost. Our tracking system, thus, lacks robustness. A possible solution for the lack of robustness could be implemented by obtaining the movement consensus: when one feature is lost during tracking, we could infer where it should be, supporting our assumption by using the consensus of the others. Afterwards, we could lock this wrong feature into its correct location.

If this proposed consensus method can be successfully built, then partial occlusion of the face would not disturb the tracking activity, providing, in this way, a powerful tracking. However, it would not be able to solve the complete facial occlusion problem, when all features are lost and therefore there is no consensus.
Hence, an automatic restabilization procedure could be developed to restore the face tracking when all features are lost. This might happen when the user changes his location in such a way that he is not visible to the camera anymore, e.g. the user leaves the room. This could also happen when an abrupt illumination change occurs, e.g. turning the lights off. A possible direction for solving this latter problem could consist of pairing up images and poses to create an online learning approach that would feed an algorithm with both face poses and images. The correspondence would be created using the output of the estimator. Therefore, when the system completely loses its tracking, the learning technique could restore the system to the best match between the detected face (with wrong or no pose) and its dictionary of correct poses. The challenge would be to perform this learning task in a non-supervised way; otherwise, this solution would be cumbersome to most users.

Considering that our estimation has identified the pose correctly, we move on to the second phase of our problem: eye segmentation. Segmentation means that we are trying to decompose the image into its different parts. Our case consists of information acquired in visible lighting only. So, we face many problems, such as glints and specular reflections generated on the user's eye, as shown in Figure 1.1. However, the biggest problem is that we cannot separate the user's pupil from his iris when the subject's eye is dark: the boundaries between pupil and iris are not easily distinguishable, even for human beings. This is the first big difference between working in the NIR spectrum and in visible lighting; the iris-pupil boundary is easily found in NIR images. Thus, our information relies solely on the limbus, i.e. the boundary between iris and sclera, which shows great contrast for most eyes.

The eye segmentation algorithm proposed in this work assumes that the eye is centered in the image (ROI), and for achieving that, we use the information of the head pose estimated before. Thus, we are able to see that the pose estimation has an important impact on the rest of the system.

There are many different ways of segmenting an eye. For some studies, it might be relevant to identify not only the iris, but also the pupil, eye corners, and eyelids. For our purposes, we need only the successful identification of the iris. We first presented a way of reducing the impact of specular reflections on the eye surface by decreasing the brightness of the region which is more likely to contain the iris. Hence, a mask was proposed for enhancing the iris region and eliminating undesired content, such as skin pixels.

Experiments showed that our iris compensation had a good impact on reducing the noise generated by glints. Without compensation, the iris might be divided into several parts due to the high intensity of the reflections. However, the segmentation process may sometimes mingle skin pixels with iris pixels, leading to a failure in the iris estimation; this especially happens in a poorly illuminated scene or when the individual has a small sclera exposure.

Considering that we proposed a solution for the iris segmentation problems, it is important to clarify our decision of tracking the iris instead of detecting it again. Particularly, our decision for using template matching was based on the strong relationship between the iris pixels. Our algorithm relies on the assumption that irises from different individuals vary strongly.
On the other hand, the iris of one individual does not vary too much when compared across different frames under the same illumination condition. Hence, template matching acts like a snapshot of the identified iris, whose tracking corresponds to a very lightweight solution compared to the initial iris detection algorithm.

For the gaze estimation problem, our decision consisted of applying a homography to map the points in the image to gaze locations on the screen. This technique provides a simple solution for finding the relationship between the image and gaze planes. Results indicate that the quality of the estimation using homography is related to how reliable and accurate the iris detection and calibration procedures were. One drawback of our approach is that, once the head starts moving, the relationship between plane and eye changes. We need to integrate the previously obtained head pose with the angles in order to turn this solution into a head-invariant gaze tracker. So far, we have solved those problems separately and applied them in different applications, such as tele-immersion with remote control of a robotic head and content analysis with the heatmap.

Bibliography

BioID (2001). The BioID face database, available at https://support.bioid.com/downloads/. Last accessed February 2012.

Bolme, D., Draper, B., and Beveridge, J. (2009). Average of synthetic exact filters. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 2105–2112.

Cootes, T. F., Taylor, C. J., Cooper, D. H., and Graham, J. (1995). Active shape models – their training and application. Computer Vision and Image Understanding, 61(1):38–59.

Deja, S. (2010). System rozpoznawania i aktywnego śledzenia oczu użytkownika komputera za pośrednictwem kamery w czasie rzeczywistym. Master's thesis, Akademia Wydział Elektrotechniki, Automatyki, Informatyki i Elektroniki, Poland.

Dementhon, D. F. and Davis, L. S. (1995). Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15:123–141.

Duchowski, A. T. (2007). Eye Tracking Methodology: Theory and Practice. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Fernandes, C., Alves Neto, A., and Campos, M. F. M. (2011). Um sistema de rastreamento de pose para aplicações em teleimersão. In X Simpósio Brasileiro de Automação Inteligente (SBAI'11), São João del-Rey, Brazil.

Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.

Gonzalez, R. C. and Woods, R. E. (2007). Digital Image Processing (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Guestrin, E. D. and Eizenman, M. (2006). General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133.

Hansen, D. W. and Ji, Q. (2010). In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(1).

Jefferson, Y. (2004). Facial beauty – establishing a universal standard. International Journal of Orthodontics – IJO, 15(1):9–22.

Ji, Q. and Yang, X. (2002). Real-time eye, gaze, and face pose tracking for monitoring driver vigilance. Real-Time Imaging, 8:357–377.

Li, P., Liu, X., Xiao, L., and Song, Q. (2010). Robust and accurate iris segmentation in very noisy iris images. Image and Vision Computing, 28:246–253.
Lieberman, Z., Powderly, J., Roth, E., Sugrue, C., Tempt1, and Watson, T. (2011). EyeWriter, http://www.eyewriter.org.

Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In IJCAI'81: Proceedings of the 7th International Joint Conference on Artificial Intelligence, pages 674–679, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Martinez, A. and Benavente, R. (1998). The AR face database. Technical Report #24, CVC. http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html.

Martins, P. and Batista, J. (2008). Monocular head pose estimation. In Proceedings of the 5th International Conference on Image Analysis and Recognition, ICIAR '08, pages 357–368, Berlin, Heidelberg. Springer-Verlag.

Messer, K., Matas, J., Kittler, J., and Luettin, J. (1999). XM2VTSDB: The extended M2VTS database. In Proceedings of the 2nd Conference on Audio- and Video-based Biometric Personal Verification, AVBPA '99. Springer.

Milborrow, S. (2007). Locating facial features with active shape models. Master's thesis, University of Cape Town.

Milborrow, S. and Nicolls, F. (2008). Locating facial features with an extended active shape model. In European Conference on Computer Vision. http://www.milbo.users.sonic.net/stasm.

Morimoto, C., Koons, D., Amir, A., and Flickner, M. (2000). Pupil detection and tracking using multiple light sources. Image and Vision Computing, 18(4):331–335.

Mulvey, F., Villanueva, A., Sliney, D., Lange, R., Cotmore, S., and Donegan, M. (2008). Exploration of safety issues in eyetracking. D5.4. Technical report, Communication by Gaze Interaction (COGAIN). IST-2003-511598.

Murphy-Chutorian, E. and Trivedi, M. M. (2009). Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607–626.

Nguyen, V. H., Nguyen, T. H. B., and Kim, H. (2010). Eye feature extraction using k-means clustering for low illumination and iris color variety. In International Conference on Control, Automation, Robotics and Vision, ICARCV '10, pages 633–637. IEEE.

Ni, J. and Chellappa, R. (2010). Evaluation of state-of-the-art algorithms for remote face recognition. In International Conference on Image Processing, ICIP '10, pages 1581–1584. IEEE.

Parker, J. R. and Duong, A. (2009). Gaze tracking – a sclera recognition approach. In IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, 11-14 October 2009, pages 3836–3841.

Phillips, J. P., Moon, H., Rizvi, S. A., and Rauss, P. J. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090–1104.

Proença, H. (2011). Quality assessment of degraded iris images acquired in the visible wavelength. IEEE Transactions on Information Forensics and Security, 6(1):82–95.

Rowley, H. A., Baluja, S., and Kanade, T. (1998). Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:23–38.

Tan, T., He, Z., and Sun, Z. (2010). Efficient and robust segmentation of noisy iris images for non-cooperative iris recognition. Image and Vision Computing, 28:223–230.

Tasdizen, T., Tarel, J.-P., and Cooper, D. (1999). Algebraic curves that work better. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2.

Trucco, E. and Verri, A. (1998). Introductory Techniques for 3-D Computer Vision.
Valenti, R. and Gevers, T. (2008). Accurate eye center location and tracking using isophote curvature. In Proceedings of Computer Vision and Pattern Recognition (CVPR).

Viola, P. and Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154.

Wang, J. Z., Li, J., and Wiederhold, G. (2000). SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. In Proceedings of the 4th International Conference on Advances in Visual Information Systems (VISUAL '00), pages 360–371, London, UK. Springer-Verlag.

Appendix A

Mathematical Morphology

Mathematical Morphology (MM) can be used to extract components of an image that are meaningful for representing and describing a shape [Gonzalez and Woods, 2007]. We apply a hole filling algorithm followed by morphological operators to remove the outside ridges. The key idea is to approximate the original form of the iris. In this work, we apply erosion and dilation to eliminate holes and ridges. Before defining those operations, we first present the concept of a structuring element, and then we introduce the basics and demonstrate the use of MM operations in the iris detection problem.

Figure A.1 shows the results of applying morphological operations to binary images. As one might notice, the holes were filled and "bad" pixels were removed by the application of morphological operations (erosion and dilation) using a circular structuring element.

Figure A.1. Morphological operations: the first row shows some binary images and the second row shows the same images after hole filling and MM operations.

A.1 Structuring Element

A structuring element is a shape used in MM whose goal is to interact with an image. In this work, we choose a circular (disk-shaped) structuring element, displayed in Figure A.2, to dilate and to erode the image. Gonzalez and Woods [2007] introduce MM as operations in set theory: given a set I (the binary image) and sliding a set E (the structuring element) over it, one takes into account only those pixels that lie in the intersection of both sets.

Figure A.2. Structuring element: a 7 × 7 mask representing a circular shape.

We now move on to a formal definition of the two morphological operations applied in this work: erosion and dilation.

A.2 Erosion

Given a binary image I and a structuring element E, Gonzalez and Woods [2007] define the erosion I ⊖ E of I by the element E as

$$I \ominus E = \{\, z \mid (E)_z \subseteq I \,\}. \quad (A.1)$$

In other words, when sliding the structuring element E over the image I, we retain those pixels of I only where E is completely contained in I. Figure A.3 demonstrates the application of five erosion operations on a square using the structuring element of Figure A.2. After applying an erosion, the original image is shrunk. In this case, the result is a smaller square; however, the shape of the image after erosion depends mainly on the shape of the structuring element (including its size) and the number of times the operation is applied.

The impact of erosion on an image is significant. Elements of the image that are smaller than the structuring element are removed (filtered). In this way, it is possible to reduce and hopefully eliminate ridges and isolated pixels.

Figure A.3. Erosion operation: (a) binary image and (b) image eroded five times by the structuring element of Figure A.2.

A.3 Dilation

Differently from erosion, dilation focuses on expanding the image using a structuring element. Again, Gonzalez and Woods [2007] define dilation as

$$I \oplus S = \{\, z \mid (\hat{S})_z \cap I \neq \emptyset \,\}. \quad (A.2)$$

That is, we retain those positions z at which the reflected structuring element $\hat{S}$ and I overlap by at least one element. Figure A.4 demonstrates the result of applying dilation to the same square presented before. Due to the circular nature of the structuring element, the result is now a rounded square.

Figure A.4. Dilation operation: (a) binary image and (b) image dilated five times by the structuring element of Figure A.2.

According to Gonzalez and Woods [2007], erosion and dilation have a dual relationship with respect to the set operations of complementation and reflection. The following equation (duality) shows that the erosion of the image I by S can be obtained by dilating the image background (if the structuring element is the same):

$$(I \ominus S)^C = I^C \oplus \hat{S}. \quad (A.3)$$

Having explained the behavior of mathematical morphology, one might wonder why a separate hole filling step is necessary, since morphological operations can also fill holes. Dilation may indeed fill holes, but it may also connect outliers to the iris. We therefore apply erosion first to erase most outliers; however, erosion can cause an even bigger problem: the iris might be split into two or more parts. Hence, filling the holes inside the iris before eroding showed good results for restoring the iris shape.
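The pipeline described in this appendix (fill holes, then erode, then dilate with a disk-shaped structuring element) can be sketched in a few lines of Python with OpenCV. The input file name is a hypothetical placeholder, and the hole filling assumes the pixel at (0, 0) belongs to the background; this is an illustration of the technique, not the exact implementation used in this work.

```python
import cv2
import numpy as np

def fill_holes(binary):
    """Fill interior holes: flood-fill the background from a corner, then
    mark every pixel the fill could not reach as foreground."""
    h, w = binary.shape
    mask = np.zeros((h + 2, w + 2), np.uint8)  # floodFill requires a 2-pixel-larger mask
    flooded = binary.copy()
    cv2.floodFill(flooded, mask, (0, 0), 255)  # assumes (0, 0) is background
    return binary | cv2.bitwise_not(flooded)   # unreached pixels are the holes

# Disk-shaped 7x7 structuring element, as in Figure A.2.
disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))

# "iris_mask.png" is a hypothetical binarized iris image.
img = cv2.imread("iris_mask.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

restored = fill_holes(binary)                        # 1. fill holes inside the iris
restored = cv2.erode(restored, disk, iterations=1)   # 2. erase small outliers and ridges
restored = cv2.dilate(restored, disk, iterations=1)  # 3. recover the approximate iris size
```

Note that the order matches the discussion above: filling holes before eroding prevents the iris from being split, and dilating afterwards restores the area lost to erosion without reconnecting the removed outliers.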
Appendix B

Quadrics and Conics

This appendix provides the background for understanding quadrics and conics.

B.1 Algebraic Curves and Surfaces

An algebraic curve is represented as the zero set of a polynomial in two variables [Tasdizen et al., 1999]. A polynomial P of degree n is a function $P_n : \mathbb{R}^2 \to \mathbb{R}$ given by the following expression:

$$P_n(x, y) = \sum_{0 \le i+j \le n} a_{i,j}\, x^i y^j = 0. \quad (B.1)$$

An algebraic surface has the same properties and definition; however, the polynomial now has three variables to take the third dimension into account:

$$P_n(x, y, z) = \sum_{0 \le i+j+k \le n} a_{i,j,k}\, x^i y^j z^k = 0. \quad (B.2)$$

Letting S be the vector of coefficients of the polynomial and Q the vector of its monomials, it is convenient to use the following notation for the polynomial:

$$P_n(x, y, z) = Q^t \cdot S. \quad (B.3)$$

For instance, the polynomial

$$P_2(x, y, z) = 3x^2 + 2y^2 - 4z^2 - xy + 2yz + 4x - 5y + 8$$

can be described as the dot product of $Q^t$ and $S$, with

$$Q^t = [x^2, y^2, z^2, xy, xz, yz, x, y, z, 1], \qquad S^t = [3, 2, -4, -1, 0, 2, 4, -5, 0, 8].$$
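To make the $Q^t \cdot S$ notation concrete, the short NumPy sketch below (an illustration added here, not part of the original text) evaluates the example polynomial above at an arbitrary point by taking the dot product of the monomial vector and the coefficient vector.

```python
import numpy as np

def monomials(x, y, z):
    """Monomial vector Q for a degree-2 polynomial in three variables."""
    return np.array([x*x, y*y, z*z, x*y, x*z, y*z, x, y, z, 1.0])

# Coefficients S of P2(x, y, z) = 3x^2 + 2y^2 - 4z^2 - xy + 2yz + 4x - 5y + 8.
S = np.array([3.0, 2.0, -4.0, -1.0, 0.0, 2.0, 4.0, -5.0, 0.0, 8.0])

x, y, z = 1.0, 2.0, -1.0
P = monomials(x, y, z) @ S   # Q^t . S; equals 3.0 at this point
```

The same factorization is what makes algebraic surface fitting convenient: the unknown coefficients S appear linearly, so they can be estimated from sampled points by linear least squares.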
B.2 Quadrics and Conics

A quadric surface is a special case of an algebraic surface in which

$$P_2(x, y, z) = Ax^2 + By^2 + Cz^2 + Dxy + Exz + Fyz + Gx + Hy + Iz + J, \quad (B.4)$$

and it can be rewritten as $P = Q^t \cdot S$ with

• $Q^t = [x^2, y^2, z^2, xy, xz, yz, x, y, z, 1]$, and
• $S^t = [A, B, C, D, E, F, G, H, I, J]$.

By using quadrics, many surfaces can take advantage of the implicit formulation $L = \{(x, y, z) \in \mathbb{R}^3 ;\ P(x, y, z) = Q^t \cdot S = 0\}$. Table B.1 describes some of those quadric surfaces and their equations. It also classifies the quadrics as degenerate or non-degenerate.

Table B.1. Quadric surfaces.

  Quadric                 Equation                              Degenerate
  Ellipsoid               $x^2/a^2 + y^2/b^2 + z^2/c^2 = 1$     No
  Elliptic paraboloid     $x^2/a^2 + y^2/b^2 - z = 0$           No
  Hyperbolic paraboloid   $x^2/a^2 - y^2/b^2 - z = 0$           No
  Hyperboloid             $x^2/a^2 - y^2/b^2 - z^2/c^2 = 1$     No
  Cone                    $x^2/a^2 - y^2/b^2 - z^2/c^2 = 0$     Yes

In Euclidean space, quadrics have dimension d = 2. However, in the Euclidean plane they have dimension d = 1 and are known as conics. Let c be a cone and l a hyperplane. Figure B.1 shows some conics generated by the intersection of l and c.

Figure B.1. Conic sections generated by the intersection of a hyperplane with a cone: (a) parabola, (b) circle, (c) ellipse, and (d) hyperbola.

Table B.2. Conic sections.

  Conic       Equation
  Circle      $x^2 + y^2 = r^2$
  Ellipse     $x^2/a^2 + y^2/b^2 = 1$
  Hyperbola   $x^2/a^2 - y^2/b^2 = 1$
  Parabola    $y^2 = 4ax$
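As a quick numerical check of the implicit equations in Table B.2 (an illustrative addition, in the same spirit as the implicit formulation $P = Q^t \cdot S = 0$ above), the sketch below samples points from standard parametrizations of each conic and verifies that they satisfy the corresponding equation; the parameter values a, b, and r are arbitrary choices.

```python
import numpy as np

a, b, r = 2.0, 1.5, 3.0
t = np.linspace(0.1, 1.0, 50)  # parameter values along each curve

# Standard parametrizations paired with the implicit equations of Table B.2,
# each written as a residual that should vanish on the curve.
conics = {
    "circle":    ((r * np.cos(t), r * np.sin(t)),
                  lambda x, y: x**2 + y**2 - r**2),
    "ellipse":   ((a * np.cos(t), b * np.sin(t)),
                  lambda x, y: x**2 / a**2 + y**2 / b**2 - 1),
    "hyperbola": ((a * np.cosh(t), b * np.sinh(t)),
                  lambda x, y: x**2 / a**2 - y**2 / b**2 - 1),
    "parabola":  ((a * t**2, 2 * a * t),
                  lambda x, y: y**2 - 4 * a * x),
}

for name, ((x, y), implicit) in conics.items():
    # All residuals vanish up to floating-point error.
    assert np.allclose(implicit(x, y), 0.0), name
```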