ABSTRACT
Visual information has been shown to improve the performance of speech recognition systems in noisy acoustic environments. However, most audio-visual speech recognizers rely on a clean visual signal. In this paper, we explore a novel approach to visual speech modeling, based on articulatory features, which has potential benefits under visually challenging conditions. The idea is to use a set of parallel classifiers to extract different articulatory attributes from the input images, and then combine their decisions to obtain higher-level units, such as visemes or words. We evaluate our approach in a preliminary experiment on a small audio-visual database, using several image noise conditions, and compare it to the standard viseme-based modeling approach.
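The combination step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the feature names, the viseme table, and the product rule (which assumes the per-feature classifiers are independent) are all hypothetical choices made for the example, and the posteriors are hand-written stand-ins for the outputs of real image classifiers.

```python
from math import prod

# Hypothetical viseme table: each viseme is defined by one value per
# articulatory feature. Names and values are illustrative assumptions.
VISEMES = {
    "bilabial": {"lip_opening": "closed", "lip_rounding": "unrounded"},
    "rounded_vowel": {"lip_opening": "narrow", "lip_rounding": "rounded"},
    "open_vowel": {"lip_opening": "wide", "lip_rounding": "unrounded"},
}


def score_visemes(posteriors):
    """Combine per-feature posteriors into normalized viseme scores.

    `posteriors` maps feature name -> {feature value -> probability}, as one
    parallel classifier per feature might produce. Under the (assumed)
    independence of the classifiers, a viseme's raw score is the product of
    the probabilities of the feature values that define it.
    """
    raw = {
        viseme: prod(posteriors[feat][value] for feat, value in spec.items())
        for viseme, spec in VISEMES.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all viseme scores are zero; cannot normalize")
    return {viseme: score / total for viseme, score in raw.items()}


def classify(posteriors):
    """Return the viseme with the highest combined score."""
    scores = score_visemes(posteriors)
    return max(scores, key=scores.get)


# Stand-in classifier outputs for a single image frame (made-up numbers).
example = {
    "lip_opening": {"closed": 0.7, "narrow": 0.2, "wide": 0.1},
    "lip_rounding": {"rounded": 0.3, "unrounded": 0.7},
}

if __name__ == "__main__":
    print(classify(example))
```

One reason such a decomposition might help under image noise is that a corrupted frame may degrade only some feature classifiers, leaving the others informative; whether that holds in practice is what the experiment in the paper is meant to test.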
CITED BY 4
Timothy J. Hazen, Kate Saenko, Chia-Hao La, James R. Glass, "A segment-based audio-visual speech recognizer: data collection, development, and initial experiments," Proceedings of the 6th International Conference on Multimodal Interfaces, October 13-15, 2004, State College, PA, USA.