ABSTRACT
The relationship between gaze and speech is explored for the simple task of moving an object from one location to another on a computer screen. The subject moves a designated object from a group of objects to a new location on the screen by stating, "Move it there". Gaze and speech data are captured to determine if we can robustly predict the selected object and destination position. We have found that the source fixation closest to the desired object begins, with high probability, before the beginning of the word "Move". An analysis of all fixations before and after speech onset time shows that the fixation that best identifies the object to be moved occurs, on average, 630 milliseconds before speech onset with a range of 150 to 1200 milliseconds for individual subjects. The variance in these times for individuals is relatively small although the variance across subjects is large. Selecting a fixation closest to the onset of the word "Move" as the designator of the object to be moved gives a system accuracy close to 95% for all subjects. Thus, although significant differences exist between subjects, we believe that the speech and gaze integration patterns can be modeled reliably for individual users and therefore be used to improve the performance of multimodal systems.
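The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "closest to the onset" means closest in time, measured from the start of each fixation, and that the designated object is the one nearest the chosen fixation point. The names `Fixation`, `fixation_nearest_onset`, `nearest_object`, and the optional per-user `lead_ms` offset are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Fixation:
    start_ms: float  # fixation start time (hypothetical field)
    end_ms: float    # fixation end time (hypothetical field)
    x: float         # screen x coordinate of the fixation
    y: float         # screen y coordinate of the fixation


def fixation_nearest_onset(fixations, onset_ms, lead_ms=0.0):
    """Return the fixation whose start is closest in time to onset_ms - lead_ms.

    lead_ms is an assumed per-user offset; 0 means "closest to speech onset".
    """
    if not fixations:
        raise ValueError("no fixations recorded")
    target = onset_ms - lead_ms
    return min(fixations, key=lambda f: abs(f.start_ms - target))


def nearest_object(fixation, objects):
    """Return the name of the object whose centre is nearest the fixation."""
    return min(
        objects,
        key=lambda name: hypot(objects[name][0] - fixation.x,
                               objects[name][1] - fixation.y),
    )


# Made-up data: three fixations and two objects on screen.
fixations = [
    Fixation(0, 300, 100, 100),
    Fixation(350, 700, 400, 120),
    Fixation(750, 1100, 405, 118),
]
objects = {"circle": (100, 100), "square": (400, 120)}
onset_of_move_ms = 800  # assumed onset time of the word "Move"

chosen = fixation_nearest_onset(fixations, onset_of_move_ms)
selected = nearest_object(chosen, objects)
```

The `lead_ms` parameter reflects the abstract's observation that the best fixation precedes speech onset by a subject-dependent interval. Calibrating it per user is one possible use of that finding, not a procedure stated in the source.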
INDEX TERMS
Primary Classification:
H. Information Systems
H.5 INFORMATION INTERFACES AND PRESENTATION (I.7)
H.5.2 User Interfaces (D.2.2, H.1.2, I.3.6)
Subjects: Input devices and strategies (e.g., mouse, touchscreen)
General Terms: Design, Experimentation, Human Factors, Measurement
Keywords: eye-tracking, gaze-speech co-occurrence, multimodal fusion, multimodal interfaces