A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization
Source
International Conference on Multimodal Interfaces
Proceedings of the 10th International Conference on Multimodal Interfaces
Chania, Crete, Greece
Poster session: Multimodal systems II
Pages 257-264
Year of Publication: 2008
ISBN:978-1-60558-198-9
Authors
Kazuhiro Otsuka  NTT Communication Science Labs, Atsugi, Japan
Shoko Araki  NTT Communication Science Labs, Kyoto, Japan
Kentaro Ishizuka  NTT Communication Science Labs, Kyoto, Japan
Masakiyo Fujimoto  NTT Communication Science Labs, Kyoto, Japan
Martin Heinrich  NTT Communication Science Labs, Atsugi, Japan
Junji Yamato  NTT Communication Science Labs, Atsugi, Japan
Sponsors
SIGCHI: ACM Special Interest Group on Computer-Human Interaction
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1452392.1452446

ABSTRACT

This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., "who is looking at whom", in addition to performing speaker diarization, i.e., "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from the high-resolution omnidirectional images captured by the cameras, the position and pose of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position and pose data output by the face tracker are used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by voice activity detection (VAD) and direction of arrival (DOA) estimation, followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision processing and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
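The abstract does not say how the focus of attention is derived from the face position and pose data, so the following is only a minimal geometric sketch of the idea, not the authors' method. It assumes 2-D seat positions on the table plane and a single head yaw angle per person, and it picks as the gaze target the other participant whose bearing is closest to that yaw. The function name `estimate_vfoa` and the threshold `max_angle_deg` are hypothetical and chosen for illustration.

```python
import math


def estimate_vfoa(positions, yaws, max_angle_deg=30.0):
    """Guess "who is looking at whom" from head positions and yaw angles.

    positions: list of (x, y) head positions on the table plane (assumed input).
    yaws: list of head yaw angles in radians, measured like math.atan2.
    max_angle_deg: hypothetical tolerance; beyond it no target is assigned.
    Returns a list where entry i is the index of the person that person i
    appears to look at, or None if nobody lies within the tolerance.
    """
    limit = math.radians(max_angle_deg)
    targets = []
    for i, (xi, yi) in enumerate(positions):
        best, best_diff = None, limit
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            # Direction from person i to person j.
            bearing = math.atan2(yj - yi, xj - xi)
            # Absolute angular difference, wrapped to [0, pi].
            delta = bearing - yaws[i]
            diff = abs(math.atan2(math.sin(delta), math.cos(delta)))
            if diff < best_diff:
                best, best_diff = j, diff
        targets.append(best)
    return targets


if __name__ == "__main__":
    seats = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    head_yaws = [0.0, math.pi, math.pi / 4]
    print(estimate_vfoa(seats, head_yaws))  # [1, 0, None]
```

In this toy layout the first two people face each other and the third looks away from both, so no target is assigned to the third. A real system would also need to handle noisy pose estimates and temporal smoothing, which this sketch omits.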


