Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor
Source
International Multimedia Conference archive
Proceedings of the 17th ACM international conference on Multimedia
Beijing, China
SESSION: Content track C4: video analysis
Pages 165-174  
Year of Publication: 2009
ISBN:978-1-60558-608-3
Authors
Guangyu Zhu  Institute of Automation, CAS, Beijing, China
Ming Yang  NEC Laboratories America, Cupertino, CA, USA
Kai Yu  NEC Laboratories America, Cupertino, CA, USA
Wei Xu  NEC Laboratories America, Cupertino, CA, USA
Yihong Gong  NEC Laboratories America, Cupertino, CA, USA
Sponsor
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1631272.1631297

ABSTRACT

Event detection plays an essential role in video content analysis and remains a challenging open problem. In particular, research on detecting human-related video events in complex scenes containing both crowds of people and dynamic motion is still limited. In this paper, we investigate detecting video events that involve elementary human actions, e.g. making a cellphone call, putting an object down, and pointing to something, in complex scenes using a novel approach based on a spatio-temporal descriptor. The new descriptor, which temporally integrates the statistics of a set of response maps of low-level features, e.g. image gradients and optical flows, in a space-time cube, is proposed to capture the characteristics of actions in terms of their appearance and motion patterns. Based on these descriptors, the bag-of-words method is used to describe a human figure as a concise feature vector. These features are then employed to train SVM classifiers at multiple spatial pyramid levels to distinguish different actions. Finally, Gaussian-kernel-based temporal filtering is applied to segment the sequences of events from a video stream, taking into account the temporal consistency of actions. The proposed approach tolerates spatial layout variations and local deformations of human actions caused by diverse view angles and rough human figure alignment in complex scenes. Extensive experiments on the 50-hour video dataset of the TRECVid 2008 event detection task demonstrate that our approach outperforms the well-known SIFT descriptor based methods and effectively detects video events in challenging real-world conditions.
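
The final step of the pipeline, Gaussian kernel based temporal filtering, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameter values, so the kernel width `sigma`, the decision `threshold`, the minimum event length `min_length`, and all function names below are assumptions chosen for illustration. The sketch smooths per-frame classifier scores (such as calibrated SVM outputs) with a normalized Gaussian kernel, then reports contiguous runs of frames whose smoothed score stays above the threshold.

```python
import numpy as np


def gaussian_kernel(sigma, radius=None):
    """Return a 1-D Gaussian kernel normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)  # assumption: truncate at 3 sigma
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()


def smooth_scores(scores, sigma=2.0):
    """Temporally smooth per-frame scores; output length equals input length."""
    scores = np.asarray(scores, dtype=float)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    # Repeat the border values so the ends are not pulled toward zero.
    padded = np.pad(scores, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def segment_events(smoothed, threshold=0.5, min_length=3):
    """Return (start, end) frame pairs, end exclusive, for runs above threshold."""
    active = np.asarray(smoothed) >= threshold
    # Pad with False on both sides so every run has a rising and falling edge.
    edges = np.diff(np.concatenate(([False], active, [False])).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]


if __name__ == "__main__":
    # Hypothetical per-frame scores: one true event (frames 10-19) with a
    # single missed frame inside it, plus one isolated false alarm at frame 3.
    raw = np.zeros(30)
    raw[10:20] = 1.0
    raw[14] = 0.0
    raw[3] = 1.0
    print(segment_events(smooth_scores(raw)))
```

Smoothing before thresholding is what enforces temporal consistency here: an isolated one-frame detection is averaged down below the threshold, while a one-frame gap inside a sustained action is filled in, so the action is reported as a single event rather than as fragments.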

