In recent years, the rapid development and wide availability of mass storage devices, video cameras, fast networks, and social media sharing sites have led to an explosion of user-generated multimedia data (e.g., images and videos) on the web. To facilitate effective multimedia management, indexing, and retrieval, it is critical to associate such multimedia data with semantic information such as tags, annotations, and captions.
Semantic multimedia indexing, also known as multimedia annotation, multimedia tagging, or multimedia concept detection, refers to the task of associating semantic information with multimedia data. Existing approaches fall roughly into two categories: a) manual labeling and b) automatic annotation. Manual labeling can yield high-quality semantic information, but it is usually impractical because of its prohibitive cost. Recently, automatic semantic multimedia indexing, as an alternative to manual labeling, has attracted significant research interest in the multimedia community. However, owing to the scarcity of labeled training data, most existing approaches rarely achieve satisfactory performance. In this dissertation, we aim to 1) identify the problems and challenges that remain in semantic multimedia indexing, 2) develop effective and efficient solutions to these problems, and 3) evaluate the proposed approaches on real-world multimedia benchmarks.
First, we propose a data-driven image tagging approach that automatically assigns accurate and complete tags to images. We first identify all the near-duplicate clusters in a set of user-tagged web images. The tags of all the images in each near-duplicate cluster are then aggregated to form a relatively complete “document”. To tag an image, we collect a set of initial candidate tags from its near-duplicate neighbors and expand this candidate set using the multi-tag associations mined from the documents of all the near-duplicate clusters. In addition, a denoising model is devised to alleviate the influence of noisy tags by taking the relevance between tags and images into consideration. Extensive experiments on a real-world web image corpus demonstrate the effectiveness of the proposed approach.
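The aggregation and expansion steps can be illustrated with a minimal sketch. It assumes that near-duplicate cluster assignments are already available, and it substitutes simple pairwise co-occurrence confidence for the multi-tag association mining, whose exact form is not specified here. The function names, the `min_confidence` threshold, and the toy data are all hypothetical, and the denoising model is not covered.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cluster_documents(image_tags, image_cluster):
    """Merge the tags of all images in the same near-duplicate cluster."""
    docs = defaultdict(set)
    for image_id, tags in image_tags.items():
        docs[image_cluster[image_id]].update(tags)
    return dict(docs)


def mine_pair_associations(docs, min_confidence=0.6):
    """Return {tag_a: {tag_b: confidence}}, confidence = P(tag_b | tag_a).

    Pairwise rules are a simplification (assumption); the text refers to
    multi-tag associations.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in docs.values():
        tag_count.update(tags)
        for a, b in combinations(sorted(tags), 2):
            pair_count[(a, b)] += 1
    rules = defaultdict(dict)
    for (a, b), n in pair_count.items():
        for src, dst in ((a, b), (b, a)):
            conf = n / tag_count[src]
            if conf >= min_confidence:
                rules[src][dst] = conf
    return dict(rules)


def expand_tags(candidates, rules):
    """Add every tag implied by a candidate tag through a mined rule."""
    expanded = set(candidates)
    for tag in candidates:
        expanded.update(rules.get(tag, {}))
    return expanded


# Toy data (hypothetical): five user-tagged images in three clusters.
image_tags = {
    "img1": {"beach", "sea"},
    "img2": {"sea", "sunset"},
    "img3": {"beach", "sea", "sand"},
    "img4": {"city"},
    "img5": {"beach", "sea"},
}
image_cluster = {"img1": 0, "img2": 0, "img3": 1, "img4": 2, "img5": 1}

docs = build_cluster_documents(image_tags, image_cluster)
rules = mine_pair_associations(docs)
expanded = expand_tags({"beach"}, rules)
print(sorted(expanded))
```

In this toy corpus, "beach" and "sea" co-occur in every cluster document that contains either of them, so a candidate set holding only "beach" is expanded with "sea", while weaker associations such as "sunset" fall below the threshold.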
We also study the problem of localizing tags to image regions, so as to index images more precisely. Given a test image, we first segment it into several test regions. We then jointly reconstruct the test regions from a set of training image regions using a novel reconstruction model that simultaneously takes into consideration the robust encoding ability of group sparse coding, the spatial correlations among training image regions, and the intrinsic correlations among test image regions. Finally, the tags of the sparsely selected training image regions are propagated to the test image regions according to the reconstruction coefficients. Extensive experiments on three public image collections demonstrate the superiority of the proposed approach.
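The final propagation step can be sketched as follows. The sketch assumes the sparse reconstruction coefficients have already been computed; the group sparse coding optimization itself is not shown. The scoring rule (summing absolute coefficient values per tag and taking the largest) is one plausible choice rather than the rule used in this work, and the function name and toy values are hypothetical.

```python
def propagate_tags(coefficients, training_region_tags):
    """Assign each test region the tag with the largest coefficient mass.

    coefficients[i][j] is the weight of training region j in the
    reconstruction of test region i (assumed to be precomputed).
    """
    predicted = []
    for row in coefficients:
        scores = {}
        for weight, tag in zip(row, training_region_tags):
            if weight != 0.0:  # sparse codes leave most regions unused
                scores[tag] = scores.get(tag, 0.0) + abs(weight)
        # A region reconstructed from nothing receives no tag.
        predicted.append(max(scores, key=scores.get) if scores else None)
    return predicted


# Toy data (hypothetical): four tagged training regions, three test regions.
training_region_tags = ["sky", "sky", "grass", "water"]
coefficients = [
    [0.7, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
predicted = propagate_tags(coefficients, training_region_tags)
print(predicted)
```

Because only a few training regions receive nonzero coefficients, each test region inherits tags from a small, selective set of training regions rather than from the whole training pool.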
Further, we investigate the problem of video tagging. Observing that abundant well-tagged images are available, we propose a novel video tagging framework, termed Cross-Media Tag Transfer (CMTT), which uses these images to facilitate video tagging. Specifically, we build a “cross-media tunnel” to transfer knowledge from images to videos. To tackle the domain-shift problem, we find an optimal kernel space in which the distribution distance between images and videos is minimized. A novel cross-media semi-supervised video tagging model is proposed to infer tags by exploring the intrinsic local structures of both labeled and unlabeled data and to learn reliable video classifiers. An efficient algorithm is designed to optimize the proposed model in an alternating manner. Extensive experiments demonstrate the superiority of the proposed framework over state-of-the-art algorithms.
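To make the notion of a distribution distance in a kernel space concrete, the sketch below computes the empirical squared maximum mean discrepancy (MMD) between two feature samples under a Gaussian kernel. MMD is a common choice for this purpose, but the abstract does not name the measure used in CMTT, so it is an assumption here; the feature vectors, the `gamma` value, and the function names are hypothetical, and the search for the optimal kernel is not shown.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def squared_mmd(source, target, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


# Toy features (hypothetical): one image sample and two video samples.
image_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
near_videos = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
far_videos = [[3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

mmd_near = squared_mmd(image_features, near_videos)
mmd_far = squared_mmd(image_features, far_videos)
print(mmd_near < mmd_far)
```

A small value indicates that the two samples are hard to tell apart in the induced feature space, which is the property a kernel suited to image-to-video transfer should have.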
Finally, we move one step further and propose a robust semantic video indexing framework, which exploits user-tagged web images to assist in learning robust semantic video indexing classifiers. Unlike the previous work, this work confronts a more challenging problem: web images are often associated with noisy tags. Specifically, we first estimate the probability that each image is correctly tagged, use it as a confidence score, and filter out the images with low confidence scores. We then develop a robust image-to-video indexing approach to learn reliable classifiers from a limited number of training videos together with abundant user-tagged images. A robust loss function weighted by the confidence scores of the images is used to further alleviate the influence of noisy samples. To tackle the domain difference problem, the approach automatically discovers an optimal kernel space in which the domain difference between images and videos is minimal. Experiments on the NUS-WIDE web image dataset and the Kodak consumer video corpus demonstrate the effectiveness of the proposed robust semantic video indexing framework.
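The two uses of the confidence scores, filtering and loss weighting, can be sketched as follows. The sketch assumes the confidence scores are already estimated and uses a confidence-weighted hinge loss on a linear classifier as a stand-in, since the exact robust loss is not given here. The threshold, the sample layout, and the function names are hypothetical.

```python
def filter_by_confidence(samples, threshold=0.5):
    """Keep only samples whose tag-correctness confidence reaches threshold."""
    return [s for s in samples if s["confidence"] >= threshold]


def weighted_hinge_loss(samples, weights, bias=0.0):
    """Confidence-weighted average hinge loss of a linear classifier.

    The hinge loss is a stand-in (assumption) for the robust loss.
    """
    total, norm = 0.0, 0.0
    for s in samples:
        score = sum(w * x for w, x in zip(weights, s["features"])) + bias
        total += s["confidence"] * max(0.0, 1.0 - s["label"] * score)
        norm += s["confidence"]
    return total / norm if norm > 0 else 0.0


# Toy data (hypothetical): the third image is probably mis-tagged.
samples = [
    {"features": [2.0, 0.0], "label": 1, "confidence": 0.9},
    {"features": [-2.0, 0.0], "label": -1, "confidence": 0.8},
    {"features": [-2.0, 0.0], "label": 1, "confidence": 0.1},
]
weights = [1.0, 0.0]

kept = filter_by_confidence(samples)
loss_all = weighted_hinge_loss(samples, weights)
loss_kept = weighted_hinge_loss(kept, weights)
print(len(kept), loss_all, loss_kept)
```

Even before filtering, the low-confidence sample contributes only a small share of the loss, so a classifier trained on this objective is pulled less strongly toward fitting a probably incorrect tag.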