Human action recognition has been widely studied in recent years due to high demand in a large number of real-world applications, such as ageing-care monitoring, video surveillance, and human-computer interaction (HCI). Meanwhile, the amount of video data containing human actions is increasing exponentially, which makes the management of these resources a challenging task. Given a database with huge volumes of unlabeled videos, it is prohibitive to manually assign specific action types to them. Considering that it is much easier to obtain a small number of labeled videos, a practical solution for organizing them is to build a mechanism that conducts action annotation automatically by leveraging the limited labeled videos.
To address this issue, in the first work we present a method for categorizing human actions using multiple visual features in a semi-supervised manner. The proposed algorithm simultaneously learns multiple features from a small number of labeled videos and automatically exploits the data distributions of labeled and unlabeled data to boost recognition performance. Shared structural analysis is applied in our approach to discover a common subspace shared by each type of feature. In this subspace, the proposed algorithm characterizes more discriminative information for each feature type while preserving the data distribution of each feature type. These attributes make our algorithm robust for action recognition, especially when only limited labeled training samples are provided. The main contribution of this work is incorporating shared structural analysis into a novel semi-supervised learning framework for human action recognition. In addition, an l2,1-norm on the loss function helps suppress noise in the representation. Extensive experiments have been conducted on both choreographed and realistic video datasets. Experimental results show that our method outperforms several state-of-the-art algorithms; most notably, considerably better performance is achieved when there are only a few labeled training samples.
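The key property of the l2,1-norm used in the loss function is that each sample (matrix row) contributes linearly rather than quadratically, so noisy samples are penalized less aggressively than under a squared Frobenius norm. The sketch below illustrates this norm in NumPy; the toy residual matrix is hypothetical, not taken from the thesis.

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm: sum of the l2 norms of the rows of M.

    Each row (one sample's residual) contributes linearly, so an
    outlying noisy sample adds its norm rather than its squared
    norm, which is the source of the robustness to noise.
    """
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

# Toy residual matrix (illustrative values only):
residual = np.array([[0.1, 0.2],
                     [0.0, 0.1],
                     [3.0, 4.0]])  # third sample is an outlier
# Under the l2,1-norm the outlier row contributes 5.0;
# under a squared Frobenius norm it would contribute 25.0.
print(l21_norm(residual))
```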
In the second work, the framework is extended to learn multiple features simultaneously rather than only a single type of feature. For each type of feature, a separate shared structural analysis is conducted to uncover a low-dimensional subspace. The proposed framework not only considers the local structural consistency reflected by each type of feature, but also takes global consistency into account by reducing the differences between the global and local virtual label predictions. Noise handling via the l2,1-norm is incorporated as well. The main contribution of this work is finding a trade-off that balances the local consistency of each type of feature against the global consistency that considers all feature types jointly. Recognition performance is maximized through this joint framework. Extensive experiments demonstrate that the proposed algorithm outperforms the compared algorithms for action recognition; most notably, our method holds a distinct advantage over the compared algorithms when only a few labeled samples are available.
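The global/local consistency trade-off can be sketched as a penalty that shrinks each feature type's virtual label prediction towards a shared global prediction. This is a hypothetical simplification, not the thesis's actual objective: here the global prediction is taken as the mean of the local ones, and `mu` is an assumed trade-off weight.

```python
import numpy as np

def consistency_penalty(local_preds, mu=1.0):
    """Hypothetical global/local consistency term.

    local_preds: list of (n_samples, n_classes) soft-label matrices,
    one per feature type. The global prediction is their mean; the
    penalty measures how far each local prediction deviates from it.
    Returns (global prediction, weighted penalty).
    """
    F_global = np.mean(local_preds, axis=0)
    penalty = sum(np.sum((F - F_global) ** 2) for F in local_preds)
    return F_global, mu * penalty

# Two feature types, 3 samples, 2 classes (toy values):
F1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
F2 = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])
F_g, pen = consistency_penalty([F1, F2])
```

A small `mu` lets each feature type follow its own local structure; a large `mu` forces all feature types towards one consensus labeling.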
In the last work, we observe that action recognition for monitored elderly people can save substantial labor costs in ageing care. Some existing systems are based on wearable sensors that continuously collect a subject's motion data, including velocity and acceleration, using a gyroscope and an accelerometer; in the background, these data are further processed and categorized into different action types. Compared to sensor-based systems, video-based ones are more intuitive and reliable. Unfortunately, current video-based systems perform poorly when occlusions exist. Meanwhile, very limited data related to ageing-care monitoring are available due to privacy issues. For example, falls are regarded as the most dangerous incidents for the elderly, yet videos of this type of action are very difficult to find, which poses a barrier to related academic research. To address this problem, we present our study on fall detection for ageing-care monitoring. As one of the major contributions of this work, we collect a choreographed multi-camera dataset that contains fall actions as well as other actions such as walking, standing up, and sitting down. All videos in the dataset are recorded independently by two wall-mounted cameras positioned at a 90-degree angle to each other. After extracting MoSIFT features from the videos, two video representations, Bag-of-Words (BoW) and spatial Bag-of-Words (spBoW), combined with three multi-modal fusion schemes, are evaluated using a non-linear SVM with a χ2 kernel. In addition, an explicit feature map that approximates the χ2 kernel is included in the comparisons to address scalability. Our extensive experimental results show that late fusion of Bag-of-Words with a 1000-center codebook obtains the best performance, reaching an average precision of 90.46, which may contribute to a more independent and safer living environment for elderly people.
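The BoW pipeline above can be sketched in two steps: quantize local descriptors against a visual codebook into a normalized histogram, then compare histograms with the exponential χ2 kernel commonly paired with BoW. This is a minimal NumPy sketch with a hypothetical 2-word codebook and toy descriptors; the thesis uses MoSIFT descriptors and a 1000-center codebook.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (e.g. MoSIFT) against a codebook
    of k visual words; return a normalized k-bin histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                 # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel for histogram comparison:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    eps = 1e-10                               # guard against empty bins
    d = np.sum((x - y) ** 2 / (x + y + eps))
    return np.exp(-gamma * d)

# Toy 2-word codebook and four 2-D descriptors (illustrative values):
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.0],
                        [1.1, 1.0], [0.0, 0.2]])
h = bow_histogram(descriptors, codebook)      # -> [0.5, 0.5]
```

The resulting kernel matrix over training histograms would then feed a non-linear SVM; the explicit feature map mentioned in the text instead approximates this kernel so a linear SVM can be used at scale.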