Visual localization of non-stationary sound sources
Source
International Multimedia Conference archive
Proceedings of the seventeenth ACM international conference on Multimedia
Beijing, China
SESSION: Short papers session 1: content analysis
Pages 513-516  
Year of Publication: 2009
ISBN:978-1-60558-608-3
Authors
Yuyu Liu  The University of Tokyo, Tokyo, Japan
Yoichi Sato  The University of Tokyo, Tokyo, Japan
Sponsor
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI
http://doi.acm.org/10.1145/1631272.1631344

ABSTRACT

A sound source can be visually localized by analyzing the correlation between audio and visual data. To date, analyzing this correlation correctly has required the sound source to remain stationary in the scene. We introduce a technique that overcomes this limitation by localizing non-stationary sound sources. The problem is formulated as finding the optimal visual trajectories that best represent the movement of the sound source over the pixels in a spatio-temporal volume. We search for these trajectories with a beam search that maximizes the correlation between newly introduced audiovisual features of inconsistency. We also develop an incremental correlation evaluation based on mutual information, which significantly reduces the computational cost. Finally, the correlations computed along the optimal trajectories are incorporated into a segmentation technique to localize the sound source region in the first visual frame of the current time window. Experimental results demonstrate the effectiveness of our method.
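The trajectory search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the audiovisual features or the mutual information estimator, so the sketch assumes one scalar audio feature per frame, one scalar visual feature per pixel per frame, and a joint-Gaussian mutual information estimate. The names `gaussian_mi`, `extend`, `beam_search`, `beam_width`, and `radius` are hypothetical. Running sums make each one-frame extension of a trajectory an O(1) update, which is one plausible reading of "incremental correlation evaluation".

```python
import math
import numpy as np

def extend(stats, a, v):
    """Add one (audio, visual) sample to the running sums in O(1)."""
    n, sa, sv, saa, svv, sav = stats
    return (n + 1, sa + a, sv + v, saa + a * a, svv + v * v, sav + a * v)

def gaussian_mi(stats):
    """Mutual information in nats, ASSUMING the two features are jointly Gaussian."""
    n, sa, sv, saa, svv, sav = stats
    var_a = saa / n - (sa / n) ** 2
    var_v = svv / n - (sv / n) ** 2
    if var_a <= 1e-12 or var_v <= 1e-12:
        return 0.0  # a constant signal carries no information
    cov = sav / n - (sa / n) * (sv / n)
    # Cap the squared correlation so the logarithm stays finite.
    rho2 = min(cov * cov / (var_a * var_v), 1.0 - 1e-9)
    return -0.5 * math.log(1.0 - rho2)

def beam_search(audio, video, start, beam_width=16, radius=1):
    """Search for a pixel trajectory whose visual feature best matches the audio.

    audio: shape (T,), one audio feature per frame (assumed scalar).
    video: shape (T, H, W), one visual feature per pixel per frame.
    start: (y, x) pixel in frame 0 where the trajectory begins.
    Returns (score, trajectory), where trajectory is a list of (y, x).
    """
    T, H, W = video.shape
    y0, x0 = start
    stats0 = extend((0, 0.0, 0.0, 0.0, 0.0, 0.0), audio[0], video[0, y0, x0])
    beam = [(0.0, stats0, [(y0, x0)])]
    for t in range(1, T):
        candidates = []
        for _, stats, path in beam:
            y, x = path[-1]
            # Allow the source to move at most `radius` pixels per frame.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        new_stats = extend(stats, audio[t], video[t, ny, nx])
                        candidates.append(
                            (gaussian_mi(new_stats), new_stats, path + [(ny, nx)])
                        )
        # Keep only the highest-scoring partial trajectories.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    best_score, _, best_path = beam[0]
    return best_score, best_path
```

With only two or three samples the correlation estimate is degenerate, since any two points are perfectly correlated, so a narrow beam can discard the right trajectory in the first frames. The default `beam_width` here is deliberately larger than the 3x3 neighborhood for that reason.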

