ABSTRACT
This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., "who is looking at whom", in addition to performing speaker diarization, i.e., "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with fisheye lenses and a triangular microphone array. Second, from the high-resolution omnidirectional images captured by the cameras, the positions and poses of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position and pose data output by the tracker are used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by voice activity detection (VAD) and direction-of-arrival (DOA) estimation, followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
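To make the idea of deriving "who is looking at whom" from face position and pose concrete, the following is a minimal sketch and not the method used in the paper. It assumes each participant has a 2-D position on the table plane and a head yaw angle in the same world frame, and it picks as gaze target the participant whose bearing is closest to the head direction. The function name `estimate_vfoa`, the input layout, and the 30-degree threshold are all hypothetical choices made for illustration.

```python
import math


def estimate_vfoa(positions, head_yaws, max_offset_deg=30.0):
    """Guess each person's gaze target from head pose (illustrative only).

    positions: dict mapping name -> (x, y) on the table plane (assumed input).
    head_yaws: dict mapping name -> head yaw in degrees, same world frame.
    max_offset_deg: hypothetical tolerance between head direction and bearing.
    Returns a dict mapping name -> name of the person looked at, or None.
    """
    targets = {}
    for viewer, (vx, vy) in positions.items():
        best, best_offset = None, max_offset_deg
        for other, (ox, oy) in positions.items():
            if other == viewer:
                continue
            # Bearing from the viewer to the other participant, in degrees.
            bearing = math.degrees(math.atan2(oy - vy, ox - vx))
            # Absolute angular difference wrapped into [0, 180].
            offset = abs((head_yaws[viewer] - bearing + 180.0) % 360.0 - 180.0)
            if offset <= best_offset:
                best, best_offset = other, offset
        targets[viewer] = best
    return targets


if __name__ == "__main__":
    positions = {"A": (1.0, 0.0), "B": (-1.0, 0.0), "C": (0.0, 1.0)}
    head_yaws = {"A": 180.0, "B": 0.0, "C": 90.0}
    print(estimate_vfoa(positions, head_yaws))
```

In this toy layout A and B face each other, while C faces away from the table and is assigned no target. Head direction is only a proxy for gaze, so a real system would likely need a probabilistic model over time rather than a fixed angular threshold.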