ABSTRACT
In this paper, we present a novel approach for tracking a lecturer throughout a talk. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, which keeps the complexity of the proposed algorithm low. We evaluated the system on data recorded during actual lectures. In our experiments, the average error was 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
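To illustrate the audio feature mentioned above, the following is a minimal sketch of estimating the time delay of arrival between one microphone pair with a generalized cross-correlation. The abstract does not state which weighting is used; the phase transform (PHAT) weighting here is an assumption, as are the function name `gcc_phat_delay`, its parameters, and the synthetic test signal. It is not the authors' implementation.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`.

    Hypothetical helper: generalized cross-correlation with PHAT
    weighting (the weighting is an assumption, not taken from the paper).
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    spec_sig = np.fft.rfft(sig, n=n)
    spec_ref = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalized to unit magnitude (PHAT).
    cross = spec_sig * np.conj(spec_ref)
    cross /= np.abs(cross) + 1e-12

    # Back to the lag domain.
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index `max_shift` corresponds to lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


if __name__ == "__main__":
    # Synthetic example: white noise delayed by 5 samples at 16 kHz.
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(1024)
    sig = np.concatenate((np.zeros(5), ref[:-5]))
    print(gcc_phat_delay(sig, ref, fs=16000) * 16000)
```

In a tracker of the kind described, a delay estimated this way for each microphone pair could be compared with the delay implied by a particle's 3D position hypothesis. How the authors turn that comparison into a particle score is not given in the abstract.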