ABSTRACT
This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
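To make the clustering view concrete, the following is a minimal sketch of expectation-maximization for a mixture of isotropic Gaussians over 3D points. It is a generic textbook illustration, not the paper's model: the coupled audio and visual mixtures, the mappings from 3D positions to binaural and binocular observations, and the estimation of auditory activity are not reproduced here. The function name `em_spherical_gmm`, the synthetic data, and the initial means are all assumptions made for this example.

```python
import math
import random


def em_spherical_gmm(points, init_means, n_iter=50):
    """Fit a mixture of isotropic Gaussians with EM (generic illustration).

    points:     list of equal-length coordinate tuples (3D here)
    init_means: one starting mean per cluster (hypothetical initial guesses)
    Returns (weights, means, variances, responsibilities).
    """
    k = len(init_means)
    dim = len(points[0])
    weights = [1.0 / k] * k
    variances = [1.0] * k
    means = [list(m) for m in init_means]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each cluster.
        resp = []
        for x in points:
            logp = []
            for j in range(k):
                d2 = sum((a - b) ** 2 for a, b in zip(x, means[j]))
                logp.append(
                    math.log(weights[j])
                    - 0.5 * dim * math.log(2.0 * math.pi * variances[j])
                    - 0.5 * d2 / variances[j]
                )
            top = max(logp)  # subtract the max for numerical stability
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an (almost) empty cluster unchanged
            weights[j] = nj / len(points)
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                for d in range(dim)
            ]
            d2sum = sum(
                r[j] * sum((a - b) ** 2 for a, b in zip(x, means[j]))
                for r, x in zip(resp, points)
            )
            variances[j] = max(d2sum / (nj * dim), 1e-6)
    return weights, means, variances, resp


# Synthetic stand-in for 3D observations of two objects (made-up positions).
rng = random.Random(0)
true_centers = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
points = [
    tuple(c + rng.gauss(0.0, 0.1) for c in center)
    for center in true_centers
    for _ in range(100)
]

weights, means, variances, resp = em_spherical_gmm(
    points, init_means=[(0.5, 0.0, 1.5), (1.5, 0.0, 2.5)]
)
print("estimated weights:", weights)
print("estimated means:", means)
```

In this sketch the responsibilities play the role of soft assignments of observations to objects, and the cluster means stand in for object positions. The number of clusters is fixed by hand here; the sketch does not attempt model selection.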