ABSTRACT
Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.