ABSTRACT
Automatic video annotation is an important ingredient for semantic-level video browsing, search, and navigation, and it has received much attention in recent years. Research on the topic has evolved through two paradigms. In the first paradigm, each concept is annotated individually by a pre-trained binary classifier. This approach ignores the rich correlations among video concepts and achieves only limited success. The second paradigm evolved from the first: it adds an extra step on top of the individual classifiers to fuse the multiple concept detections. However, errors incurred in the first step propagate to the fusion step and can degrade performance. In this article, a third paradigm is proposed to address these problems. The proposed Correlative Multilabel (CML) method annotates the concepts and models the correlations between them simultaneously, in a single step, and so benefits from the complementary information shared among different labels. Furthermore, since video clips are composed of temporally ordered frame sequences, we extend the proposed method to exploit the rich temporal information in videos. Specifically, a temporal kernel, based on the discriminative information between Hidden Markov Models (HMMs) learned from the videos, is incorporated into the CML method. We compare the proposed approach with state-of-the-art approaches from the first and second paradigms on the widely used TRECVID data set. The proposed method achieves superior performance.
CITED BY 2
Jinhui Tang , Xian-Sheng Hua , Meng Wang , Zhiwei Gu , Guo-Jun Qi , Xiuqing Wu, Correlative linear neighborhood propagation for video annotation, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.2, p.409-416, April 2009
REVIEW
Sebastien Lefevre: Reviewer
Annotation of multimedia data is a very topical yet very challenging problem. Indeed, Web sites such as YouTube store terabytes or even petabytes of video data. To successfully enable user navigation or retrieval in these huge databases, some auto