Short-term audio-visual atoms for generic video concept classification
Source
International Multimedia Conference
Proceedings of the 17th ACM International Conference on Multimedia
Beijing, China
SESSION: Best Paper Session
Pages 5-14
Year of Publication: 2009
ISBN:978-1-60558-608-3
Authors
Wei Jiang  Columbia University, New York, NY, USA
Courtenay Cotton  Columbia University, New York, NY, USA
Shih-Fu Chang  Columbia University, New York, NY, USA
Dan Ellis  Columbia University, New York, NY, USA
Alexander Loui  Eastman Kodak, Rochester, NY, USA
Sponsor
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1631272.1631277

ABSTRACT

We investigate the challenging problem of joint audio-visual analysis of generic videos for semantic concept detection. We propose a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm on Kodak's consumer benchmark video set, collected from real users. Experimental results confirm significant performance improvements: over 120% gain in mean average precision (MAP) compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% in average precision (AP) over 21 concepts, with many concepts improving by more than 20%.
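
To make the codebook step more concrete, the following is a minimal sketch of one common way to turn a bag of atoms into a fixed-length feature vector. It is not taken from the paper. The `Atom` class, the concatenation of visual and audio features, the Gaussian similarity, and the max-over-atoms rule are all assumptions chosen for illustration, and the codebook here is written by hand rather than learned with Multiple Instance Learning.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Atom:
    """Hypothetical stand-in for one S-AVA: a short-term region track
    carrying a regional visual feature and a background audio feature."""
    visual: Sequence[float]
    audio: Sequence[float]

    def vector(self) -> List[float]:
        # Assumption: the two modalities are simply concatenated.
        return list(self.visual) + list(self.audio)


def similarity(x: Sequence[float], codeword: Sequence[float],
               sigma: float = 1.0) -> float:
    """Gaussian similarity between an atom vector and a codeword."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, codeword))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def codebook_feature(bag: List[Atom], codebook: List[Sequence[float]],
                     sigma: float = 1.0) -> List[float]:
    """Map a bag of atoms (one video) to one value per codeword:
    the similarity of the best-matching atom in the bag."""
    if not bag:
        return [0.0] * len(codebook)
    vectors = [atom.vector() for atom in bag]
    return [max(similarity(v, c, sigma) for v in vectors) for c in codebook]


# One video represented as a bag of two illustrative atoms.
video = [
    Atom(visual=[0.0, 1.0], audio=[0.5]),
    Atom(visual=[1.0, 0.0], audio=[0.5]),
]
# Hand-written codewords; a real system would learn these from labeled videos.
codebook = [[0.0, 1.0, 0.5], [2.0, 2.0, 2.0]]
feature = codebook_feature(video, codebook)
```

In this sketch the first entry of `feature` is high because one atom coincides with the first codeword, while the second entry is close to zero because no atom lies near the second codeword. A vector of this kind could then be passed to any standard classifier, one per concept.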

