Visual speaker localization aided by acoustic models
Source
International Multimedia Conference
Proceedings of the 17th ACM International Conference on Multimedia
Beijing, China
SESSION: Content track C5: audio and music
Pages 195-202  
Year of Publication: 2009
ISBN:978-1-60558-608-3
Authors
Gerald Friedland  International Computer Science Institute, Berkeley, CA, USA
Chuohao Yeo  University of California, Berkeley, CA, USA
Hayley Hung  IDIAP Research Institute, Martigny, Switzerland
Sponsor
SIGMULTIMEDIA: ACM Special Interest Group on Multimedia
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1631272.1631301

ABSTRACT

This paper presents a novel audio-visual approach to unsupervised speaker localization. Using recordings from a single low-resolution room overview camera and a single far-field microphone, a state-of-the-art audio-only speaker localization system (traditionally called speaker diarization) is extended so that both acoustic and visual models are estimated as part of a joint unsupervised optimization problem. The speaker diarization system first automatically determines the number of speakers and estimates "who spoke when"; then, in a second step, the visual models are used to infer the location of the speakers in the video. The experiments were performed on real-world meetings using 4.5 hours of the publicly available AMI meeting corpus. The proposed system exploits audio-visual integration not only to improve the accuracy of a state-of-the-art (audio-only) speaker diarization system, but also to add visual speaker localization at little incremental engineering and computational cost.
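The second step described in the abstract (using "who spoke when" labels to infer where each speaker sits) can be illustrated with a small sketch. This is not the paper's method: it assumes a hypothetical per-frame motion-activity score for a fixed set of image regions, and it assigns each speaker the region whose activity rises most above its overall average while that speaker is talking. The function name `locate_speakers` and both input layouts are illustrative assumptions.

```python
def locate_speakers(speaker_per_frame, activity_per_frame):
    """Map each speaker label to the index of its most likely image region.

    speaker_per_frame: one diarization label per frame (None = non-speech).
    activity_per_frame: activity_per_frame[t][r] is a hypothetical
        motion-activity score for region r at frame t (an assumption,
        not a feature defined in the abstract).
    """
    n_frames = len(activity_per_frame)
    n_regions = len(activity_per_frame[0])

    # Baseline: average activity of each region over the whole recording,
    # so regions that move all the time are not favoured.
    overall = [
        sum(frame[r] for frame in activity_per_frame) / n_frames
        for r in range(n_regions)
    ]

    locations = {}
    speakers = sorted({s for s in speaker_per_frame if s is not None})
    for spk in speakers:
        # Frames in which the diarization output says this speaker is talking.
        frames = [
            act for lab, act in zip(speaker_per_frame, activity_per_frame)
            if lab == spk
        ]
        # Activity gain of each region while this speaker talks.
        gain = [
            sum(frame[r] for frame in frames) / len(frames) - overall[r]
            for r in range(n_regions)
        ]
        locations[spk] = max(range(n_regions), key=gain.__getitem__)
    return locations


# Toy input: two speakers, three regions, six frames.
labels = ["A", "A", None, "B", "B", "A"]
activity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.1, 0.1, 0.1],
    [0.2, 0.1, 0.9],
    [0.1, 0.2, 0.8],
    [0.7, 0.1, 0.2],
]
print(locate_speakers(labels, activity))  # {'A': 0, 'B': 2}
```

Subtracting the per-region baseline is a design choice made for this sketch, so that a region with constant background motion is not attributed to every speaker. The joint estimation of acoustic and visual models mentioned in the abstract is not modelled here.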

