ABSTRACT
In this paper, we address the problem of multi-modality video representation and semantic concept detection. The interaction and integration of multiple modalities in video, such as visual, audio and textual data, are essential to video semantic analysis. Traditionally, videos are represented as vectors in Euclidean space. Learning algorithms are then applied to these vectors in a high-dimensional space for dimension reduction, classification, clustering and so on. However, the multiple modalities in video not only have their own properties but are also correlated with one another, whereas a simple vector representation weakens the power of these relatively independent modalities and even ignores their relations to some extent. We therefore introduce a higher-order tensor framework for video analysis, in which the image, audio and text modalities of each video shot are represented as a data point in the form of a 3rd-order tensor called a tensorshot. We propose a novel dimension reduction method that explicitly considers the manifold structure of the tensor space arising from the temporal associated co-occurrence (TAC) of multimodal media data, and then detect video semantic concepts with classifiers that take tensors as input. Our algorithm preserves the intrinsic structure of the submanifold from which tensorshots are sampled, and can also map out-of-sample data points directly. Moreover, we apply an active-learning-based contextual and temporal post-refining strategy to enhance detection accuracy. Experimental results show that our method improves the performance of video semantic concept detection.
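To make the idea of a 3rd-order tensor and its dimension reduction concrete, the following is a minimal sketch of a truncated higher-order SVD (HOSVD, listed among the keywords) in NumPy. It is not the method proposed in the paper: the abstract does not specify how a tensorshot is built or how the manifold structure enters the reduction, so the random stand-in tensor, its shape `(4, 3, 5)`, the target ranks, and the helper names `unfold`, `mode_product`, `truncated_hosvd` and `reconstruct` are all assumptions made for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the chosen mode to the end
    result = moved @ matrix.T               # (..., I_mode) @ (I_mode, J)
    return np.moveaxis(result, -1, mode)    # put the new axis back in place


def truncated_hosvd(tensor, ranks):
    """Return a reduced core tensor and one orthonormal factor per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])         # leading left singular vectors
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project each mode onto its basis
    return core, factors


def reconstruct(core, factors):
    """Map a core tensor back to the original space."""
    tensor = core
    for mode, u in enumerate(factors):
        tensor = mode_product(tensor, u, mode)
    return tensor


# Hypothetical stand-in for one tensorshot: the three axes play the role of
# image, audio and text feature dimensions (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
tensorshot = rng.standard_normal((4, 3, 5))

core, factors = truncated_hosvd(tensorshot, ranks=(2, 2, 2))
full_core, full_factors = truncated_hosvd(tensorshot, ranks=tensorshot.shape)
print(core.shape)
```

With full ranks the decomposition is exact, so `reconstruct(full_core, full_factors)` recovers the original tensor up to floating-point error; with smaller ranks the core is a lower-dimensional representation that keeps the three modes separate instead of flattening them into one vector.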
INDEX TERMS
Primary Classification: H. Information Systems; H.3 INFORMATION STORAGE AND RETRIEVAL; H.3.3 Information Search and Retrieval
Subjects: Retrieval models
Additional Classification: I. Computing Methodologies; I.2 ARTIFICIAL INTELLIGENCE; I.2.10 Vision and Scene Understanding
Subjects: Video analysis
General Terms: Algorithms, Design, Measurement
Keywords: active learning, contextual correlation, dimension reduction, HOSVD, multi-modality video semantic concept detection, support tensor machines (STM), temporal associated cooccurrence (TAC), temporal dependency, tensorshot