ABSTRACT
In this paper we describe a novel approach for jointly modeling the text and visual components of multimedia documents for the purpose of information retrieval (IR). We propose a framework in which individual components are developed to model different relationships between documents and queries and are then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems: one analyzes only the text part of such documents, and the other analyzes the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e., text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in a significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results show an improvement of over 14% in IR performance over the best reported text-only baseline and rank among the best results reported on this corpus.