ABSTRACT
Audio is presented ahead of video in some videoconferencing systems since audio requires less time to process. Audio could be delayed to synchronize with video to achieve lip synchronization; however, the overall audio latency might then become unacceptable. We built a videoconferencing system to achieve lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 msec and still be perceived as synchronized; (2) audio could lead video by 300 msec and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
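The catch-up idea can be illustrated with a little arithmetic. The sketch below is not the authors' algorithm. It only models how quickly an initial audio lead is absorbed when audio is played back slower than real time. The stretch factor of 1.5, the 20 ms frame size, and the function name `catch_up_schedule` are assumptions made for illustration; the abstract states only that a 300 msec lead was stretched to synchronization "within a short period". The signal processing that performs the time-stretching itself is not shown.

```python
from typing import List, Tuple


def catch_up_schedule(
    lead_ms: float, stretch: float, frame_ms: float = 20.0
) -> List[Tuple[float, float]]:
    """Model how an audio lead shrinks while audio is time-stretched.

    lead_ms  -- how far audio initially leads video (milliseconds)
    stretch  -- hypothetical stretch factor; 1.5 means each source frame
                takes 1.5x its normal duration to play
    frame_ms -- hypothetical source frame length (milliseconds)

    Returns one (elapsed_playback_ms, remaining_lead_ms) pair per frame.
    """
    if lead_ms < 0:
        raise ValueError("lead_ms must be non-negative")
    if stretch <= 1.0:
        raise ValueError("stretch must be greater than 1 to absorb a lead")

    schedule = []
    elapsed = 0.0
    remaining = lead_ms
    while remaining > 1e-9:
        # A stretched frame plays for frame_ms * stretch, so it adds
        # frame_ms * (stretch - 1) of delay; never absorb more than is left.
        absorbed = min(frame_ms * (stretch - 1.0), remaining)
        elapsed += frame_ms + absorbed
        remaining -= absorbed
        schedule.append((elapsed, remaining))
    return schedule


if __name__ == "__main__":
    steps = catch_up_schedule(lead_ms=300.0, stretch=1.5)
    print(f"frames stretched: {len(steps)}")
    print(f"playback time until synchronized: {steps[-1][0]:.0f} ms")
```

Under these assumed values, each 20 ms frame absorbs 10 ms of lead, so a 300 msec lead is absorbed over 30 frames, or 900 ms of playback. A larger stretch factor shortens the catch-up period at the cost of more audible slowing.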