Large-scale multimodal semantic concept detection for consumer video
Source
International Multimedia Conference
Proceedings of the international workshop on multimedia information retrieval
Augsburg, Bavaria, Germany
SESSION: Semantic indexing of consumer and web videos
Pages: 255 - 264  
Year of Publication: 2007
ISBN: 978-1-59593-778-0
Authors
Shih-Fu Chang  Columbia University, New York, NY
Dan Ellis  Columbia University, New York, NY
Wei Jiang  Columbia University, New York, NY
Keansub Lee  Columbia University, New York, NY
Akira Yanagawa  Columbia University, New York, NY
Alexander C. Loui  Eastman Kodak Company, Rochester, NY
Jiebo Luo  Eastman Kodak Company, Rochester, NY
Sponsors
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
ACM: Association for Computing Machinery
SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1290082.1290118

ABSTRACT

In this paper we present a systematic study of the automatic classification of consumer videos into a large set of diverse semantic concept classes. The classes were carefully selected based on user studies and extensively annotated on more than 1,300 videos from real users. Our goals are to assess the state of the art of multimedia analytics (both audio and visual analysis) in consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. We also evaluated three multimodal fusion frameworks (ensemble, context fusion, and joint boosting). Experimental results show that visual and audio models perform best for different sets of concepts. Both contribute significantly to multimodal fusion, by expanding the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models significantly reduce detection errors compared to single-modality models, resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.
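As a rough illustration of the simplest of the three fusion ideas named in the abstract, the sketch below combines per-concept scores from a visual model and an audio model by weighted averaging. It is not the authors' implementation: the function name `fuse_scores`, the `visual_weight` parameter, the concept labels, and the score values are all hypothetical, and the abstract does not state how the ensemble is actually formed.

```python
from typing import Dict


def fuse_scores(
    visual: Dict[str, float],
    audio: Dict[str, float],
    visual_weight: float = 0.5,
) -> Dict[str, float]:
    """Late fusion by weighted averaging of per-concept scores.

    Assumption: both inputs map a concept name to a detection score on a
    comparable scale (for example, a probability in [0, 1]).
    """
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")

    fused: Dict[str, float] = {}
    for concept in visual.keys() | audio.keys():
        if concept in visual and concept in audio:
            # Both modalities scored this concept: take the weighted mean.
            fused[concept] = (
                visual_weight * visual[concept]
                + (1.0 - visual_weight) * audio[concept]
            )
        elif concept in visual:
            # Only the visual model covers this concept.
            fused[concept] = visual[concept]
        else:
            # Only the audio model covers this concept.
            fused[concept] = audio[concept]
    return fused


# Hypothetical scores for one video clip (illustrative values only).
visual_scores = {"beach": 0.8, "music": 0.2, "sunset": 0.7}
audio_scores = {"beach": 0.4, "music": 0.9}
fused_scores = fuse_scores(visual_scores, audio_scores)
```

With equal weights, a concept that one modality detects strongly and the other weakly lands in between, which is the intuition behind the abstract's observation that visual and audio models perform best on different sets of concepts. Context fusion and joint boosting are more involved and are not covered by this sketch.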



