International usage of, and interest in, Closed-Circuit Television (CCTV) for surveillance of public spaces is growing at an unprecedented pace in response to global security concerns. For crimes such as assault and robbery, manual analysis of surveillance video has proved effective in helping to find and successfully prosecute perpetrators. However, it would be far more useful to detect incidents and persons of interest in real time, or close to real time, in order to mitigate possible harm. One solution is to apply an automated face-based identity inference system to shortlist persons of interest from CCTV video streams.
Over the last two decades, a vast variety of algorithms has been proposed for reliable inference of identity from face images. However, much of this literature can be criticised for making assumptions that are quite unrealistic for faces extracted from surveillance footage: such images often suffer from imprecise face localisation, uncontrolled head pose and illumination, and low image resolution (say, 10 pixels between the eyes). While existing commercial systems achieve good performance on passport-style photographs, there is presently no algorithm that addresses all of these problems concurrently. All in all, automated identity inference in the surveillance environment remains an unsolved problem for the computer vision research community.
In this thesis, we aim to achieve robust and efficient face-based identity inference through three approaches: (1) robust face representation for images captured in unconstrained environments, (2) face synthesis for addressing large pose mismatches, and (3) face image quality assessment for image subset selection. The first and third approaches are framed as computer vision and machine learning problems, whereas the second is closer to a computer graphics problem. All three approaches employ patch-based facial analysis, where each local patch covers a relatively small area of the face.
The first approach is motivated by the benefits of local-feature-based face representation and is inspired by the well-established bag-of-words literature. To address the aforementioned problems with faces extracted from surveillance footage, local patches are encoded as sparse descriptors, with the spatial relationships between patches deliberately ignored. These sparse descriptors are then pooled to form a set of regional descriptors. The overall descriptor is robust to a range of degradations (e.g., alignment errors and small pose mismatches). Evaluations on multiple still-image and video datasets show robust and efficient performance compared with a number of established face descriptors, as well as with the state-of-the-art sparse-representation-based classifier.
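The pipeline described above can be sketched in a few lines of numpy. The sketch below is illustrative only: the abstract does not specify the encoder or pooling scheme, so the helper names (`extract_patches`, `sparse_encode`, `regional_descriptor`), the hard-thresholding used as a crude sparse-coding proxy, and the horizontal-band pooling are all assumptions made for illustration, not the thesis method.

```python
import numpy as np

def extract_patches(face, patch_size=8, step=4):
    """Densely sample small overlapping patches from a face image."""
    h, w = face.shape
    patches = []
    for y in range(0, h - patch_size + 1, step):
        for x in range(0, w - patch_size + 1, step):
            p = face[y:y + patch_size, x:x + patch_size].ravel().astype(float)
            p -= p.mean()  # remove local illumination offset
            patches.append(p)
    return np.array(patches)

def sparse_encode(patches, dictionary, k=5):
    """Encode each patch as a sparse vector: keep only the k largest
    correlations with dictionary atoms (a crude sparse-coding stand-in)."""
    coeffs = patches @ dictionary.T                # (n_patches, n_atoms)
    order = np.argsort(-np.abs(coeffs), axis=1)    # descending by magnitude
    sparse = np.zeros_like(coeffs)
    rows = np.arange(coeffs.shape[0])[:, None]
    sparse[rows, order[:, :k]] = coeffs[rows, order[:, :k]]
    return sparse

def regional_descriptor(face, dictionary, regions=2):
    """Pool sparse patch codes within each region, discarding the exact
    spatial position of every patch inside a region."""
    codes = sparse_encode(extract_patches(face), dictionary)
    # patches arrive in row-major order, so splitting along axis 0
    # yields roughly horizontal bands of the face
    bands = np.array_split(codes, regions, axis=0)
    return np.concatenate([b.mean(axis=0) for b in bands])
```

Average-pooling within each band is what buys the robustness to small alignment errors: shifting a patch by a few pixels changes which band it falls in only rarely, so the regional descriptor is largely unchanged.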
The second contribution of this work is to improve identity inference under large pose mismatches. Under typical surveillance conditions, an identity inference system is often required to match a frontal-view watch list against probe faces captured at various poses. We therefore propose a non-iterative, patch-based face synthesis algorithm that transforms a given frontal-view image into views at specific poses, without recourse to computationally intensive 3D analysis or model-fitting techniques that may fail to converge. The algorithm synthesises a non-frontal representation by applying multivariate linear regression to a low-dimensional representation of each patch. It is designed for low-resolution images and requires only the simplest form of face alignment (i.e., the locations of the eye coordinates). Evaluations across varying degrees of pose mismatch show that the proposed approach yields marked improvements under large pose mismatches (i.e., matching frontal-view images with ±60° side-view images).
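The non-iterative character of such a synthesis step comes from the fact that multivariate linear regression has a closed-form solution. The sketch below illustrates this under stated assumptions: the function names, the ridge regulariser, and the use of a generic projection matrix as the "low-dimensional representation" are hypothetical stand-ins, since the abstract does not specify these details.

```python
import numpy as np

def learn_patch_regression(frontal_feats, posed_patches, reg=1e-3):
    """Closed-form (non-iterative) multivariate linear map from
    low-dimensional frontal patch features to the corresponding raw
    patches at a target pose, via ridge-regularised least squares.
    frontal_feats: (n, d_low), posed_patches: (n, d_patch)."""
    X, Y = frontal_feats, posed_patches
    # W = argmin ||XW - Y||^2 + reg ||W||^2, solved in one linear solve
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W  # (d_low, d_patch)

def synthesise_patch(frontal_patch, basis, W):
    """Project a frontal patch onto a low-dimensional basis (e.g. a PCA
    basis, (d_low, d_patch)), then regress to the target-pose patch."""
    z = basis @ frontal_patch  # low-dimensional representation
    return z @ W
```

One such map would be learned per patch location and per target pose from frontal/non-frontal training pairs; at run time, synthesis is just two matrix products per patch, which is what keeps the method cheap relative to 3D model fitting.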
The third approach is a face quality assessment algorithm that selects the faces captured under the most suitable conditions, thereby improving identity inference on CCTV video streams. It analyses a given video stream and ranks face images by their suitability for robust identity inference. A set of “ideal” training images, with controlled alignment and environmental conditions, is used to generate a set of location-specific probabilistic models. In contrast to quality assessment algorithms in the literature, the proposed quality metric simultaneously handles issues such as pose variations, cast shadows and blurriness, as well as alignment errors caused by automatic face localisation. Evaluations on surveillance videos show significant improvements in verification accuracy and reductions in computational cost.
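A minimal sketch of location-specific probabilistic quality models, assuming one diagonal Gaussian per patch location fitted on the "ideal" training faces; the function names and the raw-pixel features are illustrative assumptions (the thesis method may use different features and model details). A face that is well aligned, well lit, and sharp should score a high average log-likelihood under these models, while pose changes, shadows, blur, or misalignment push patches away from their expected appearance and lower the score.

```python
import numpy as np

def fit_location_models(aligned_faces, patch_size=8):
    """Fit one diagonal Gaussian per patch location over 'ideal' aligned
    training faces. Assumes face dimensions divide evenly by patch_size."""
    models = {}
    h, w = aligned_faces[0].shape
    for y in range(0, h, patch_size):
        for x in range(0, w, patch_size):
            feats = np.array([f[y:y + patch_size, x:x + patch_size].ravel()
                              for f in aligned_faces])
            # mean and per-pixel variance (floored for numerical safety)
            models[(y, x)] = (feats.mean(axis=0), feats.var(axis=0) + 1e-6)
    return models

def quality_score(face, models, patch_size=8):
    """Average per-location Gaussian log-likelihood. Low scores flag
    pose changes, shadows, blur, or misalignment relative to the models."""
    score = 0.0
    for (y, x), (mu, var) in models.items():
        p = face[y:y + patch_size, x:x + patch_size].ravel()
        score += -0.5 * np.sum(np.log(2 * np.pi * var) + (p - mu) ** 2 / var)
    return score / len(models)
```

Ranking the faces of a video stream by `quality_score` and keeping only the top few for matching is what yields both the accuracy gain (fewer degraded probes) and the computational saving (fewer comparisons) reported above.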
Finally, this work also presents a new camera configuration for the surveillance environment. This configuration allows a surveillance system to capture frontal-view faces without explicitly controlling the movements of pedestrians. A new video dataset, termed the ChokePoint dataset, was recorded under this configuration. The ChokePoint dataset can be used to evaluate face-based identity inference algorithms under realistic conditions, as well as for applications such as 3D face reconstruction.