A low-latency lip-synchronized videoconferencing system
Source: Conference on Human Factors in Computing Systems
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Ft. Lauderdale, Florida, USA
DEMONSTRATION SESSION: Camera-based input and video techniques
Pages: 465-471
Year of Publication: 2003
ISBN: 1-58113-630-7
Author
Milton Chen, Stanford University, Stanford, CA
Sponsors
SIGCHI: ACM Special Interest Group on Computer-Human Interaction
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/642611.642692

ABSTRACT

Audio is presented ahead of video in some videoconferencing systems since audio requires less time to process. Audio could be delayed to synchronize with video to achieve lip synchronization; however, the overall audio latency might then become unacceptable. We built a videoconferencing system to achieve lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 msec and still be perceived as synchronized; (2) audio could lead video by 300 msec and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
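
As a rough illustration of the time-stretching idea described in the abstract, the sketch below estimates how long stretched playback needs to absorb a given audio lead. It is not the paper's algorithm: the function names, the 20 ms frame size, and the constant stretch factor are assumptions made for illustration, since the abstract does not say how the stretch rate is chosen.

```python
def catch_up_time_ms(lead_ms: float, stretch: float) -> float:
    """Wall-clock time (ms) of stretched playback needed to absorb `lead_ms`.

    `stretch` is a hypothetical constant factor > 1: one ms of source audio
    takes `stretch` ms to play, so each source ms adds (stretch - 1) ms of delay.
    """
    if stretch <= 1.0:
        raise ValueError("stretch must be greater than 1 to add delay")
    source_ms = lead_ms / (stretch - 1.0)  # source audio consumed while stretching
    return source_ms * stretch             # time that audio takes to play


def catch_up_schedule(lead_ms: float, stretch: float, frame_ms: float = 20.0) -> list[float]:
    """Accumulated audio delay (ms) after each stretched frame of an utterance.

    Stretching stops once the accumulated delay equals the audio lead,
    i.e. once audio and video are synchronized.
    """
    if stretch <= 1.0:
        raise ValueError("stretch must be greater than 1 to add delay")
    gain = frame_ms * (stretch - 1.0)  # delay added by one stretched frame
    delays: list[float] = []
    delay = 0.0
    while delay < lead_ms:
        delay = min(lead_ms, delay + gain)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    schedule = catch_up_schedule(lead_ms=300.0, stretch=1.25)
    print(len(schedule), "frames stretched")
    print(catch_up_time_ms(300.0, 1.25), "ms of playback to reach synchronization")
```

Under these assumed values, a 300 ms audio lead is absorbed over the first 1.2 s of speech, which plays back in 1.5 s. A larger stretch factor shortens that window but distorts the audio more.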

