A joint particle filter for audio-visual speaker tracking
Source: International Conference on Multimodal Interfaces
Proceedings of the 7th International Conference on Multimodal Interfaces
Trento, Italy
POSTER SESSION: Posters
Pages: 61 - 68  
Year of Publication: 2005
ISBN: 1-59593-028-0
Authors
Kai Nickel  Universität Karlsruhe (TH), Germany
Tobias Gehrig  Universität Karlsruhe (TH), Germany
Rainer Stiefelhagen  Universität Karlsruhe (TH), Germany
John McDonough  Universität Karlsruhe (TH), Germany
Sponsors
SIGCHI: ACM Special Interest Group on Computer-Human Interaction
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1088463.1088477

ABSTRACT

In this paper, we present a novel approach for tracking a lecturer during the course of a talk. We use features from multiple cameras and microphones and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view face detection, and upper-body detection. On the audio side, the time delays of arrival between pairs of microphones are estimated with a generalized cross-correlation function. Computationally expensive features are evaluated only at the particles' projected positions in the respective camera images, so the complexity of the proposed algorithm is low. We evaluated the system on data recorded during actual lectures. The average tracking error was 36 cm for video-only tracking, 46 cm for audio-only tracking, and 31 cm for the combined audio-video system.
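
As an illustration of the audio feature mentioned in the abstract, the following is a minimal sketch of estimating the time delay of arrival between two microphone signals with a generalized cross-correlation. The abstract does not state which weighting the authors use; the PHAT weighting here is an assumption, and the function name `gcc_phat_delay`, its parameters, and the synthetic test signal are hypothetical and not taken from the paper.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds.

    Uses a generalized cross-correlation with PHAT weighting
    (the weighting is an assumption, not taken from the paper).
    """
    # Zero-pad to the combined length so the correlation is linear, not circular.
    n = len(sig) + len(ref)
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, normalized to unit magnitude (PHAT weighting).
    cross = sig_f * np.conj(ref_f)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


# Synthetic example: the second "microphone" hears the same noise 5 samples later.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref))
tau = gcc_phat_delay(sig, ref, fs)
print(f"estimated delay: {tau * fs:.0f} samples ({tau * 1e3:.4f} ms)")
```

In a tracker of the kind the abstract describes, a delay like this would be computed for each microphone pair and compared against the delay implied by each particle's 3D position. How the paper turns that comparison into a particle score is not specified in the abstract.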

