| Large-scale multimodal semantic concept detection for consumer video |
Source
International Multimedia Conference
Proceedings of the international workshop on Workshop on multimedia information retrieval
Augsburg, Bavaria, Germany
SESSION: Semantic indexing of consumer and web videos
Pages: 255 - 264
Year of Publication: 2007
ISBN: 978-1-59593-778-0
Authors
Shih-Fu Chang, Columbia University, New York, NY
Dan Ellis, Columbia University, New York, NY
Wei Jiang, Columbia University, New York, NY
Keansub Lee, Columbia University, New York, NY
Akira Yanagawa, Columbia University, New York, NY
Alexander C. Loui, Eastman Kodak Company, Rochester, NY
Jiebo Luo, Eastman Kodak Company, Rochester, NY
ABSTRACT
In this paper we present a systematic study of automatic classification of consumer videos into a large set of diverse semantic concept classes, which have been carefully selected based on user studies and extensively annotated across more than 1,300 videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multimodal fusion frameworks (ensemble, context fusion, and joint boosting) are also evaluated. Experimental results show that visual and audio models perform best for different sets of concepts. Both provide significant contributions to multimodal fusion, via expansion of the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models are shown to significantly reduce the detection errors (compared to single-modality models), resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.
CITED BY 6
Jiebo Luo , Jie Yu , Dhiraj Joshi , Wei Hao, Event recognition: viewing the world with a third eye, Proceeding of the 16th ACM international conference on Multimedia, October 26-31, 2008, Vancouver, British Columbia, Canada
Lei Wu , Xian-Sheng Hua , Nenghai Yu , Wei-Ying Ma , Shipeng Li, Flickr distance, Proceeding of the 16th ACM international conference on Multimedia, October 26-31, 2008, Vancouver, British Columbia, Canada