A multimodal learning interface for grounding spoken language in sensory perceptions
Source: International Conference on Multimodal Interfaces
Proceedings of the 5th International Conference on Multimodal Interfaces
Vancouver, British Columbia, Canada
SESSION: Speech and gaze
Pages: 164-171
Year of Publication: 2003
ISBN: 1-58113-621-8
Authors
Chen Yu, University of Rochester, Rochester, NY
Dana H. Ballard, University of Rochester, Rochester, NY
Sponsors
ACM: Association for Computing Machinery
SIGCHI: ACM Special Interest Group on Computer-Human Interaction
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 2,   Downloads (12 Months): 34,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/958432.958465

ABSTRACT

Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language as humans do in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in an unsupervised mode in which users perform everyday tasks while describing their behaviors in natural language. We collect acoustic signals together with multisensory information from non-speech modalities, such as video from the user's perspective, gaze positions, head directions, and hand movements. The system first estimates users' focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used to spot the target object of user interest. Attention switches are computed and used to segment an action sequence into action units, which are then categorized by mixture hidden Markov models. A multimodal learning algorithm spots words in continuous speech and then associates them with perceptually grounded meanings extracted from visual perception and action. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar", "stapling a letter", and "pouring water".
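
The segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that gaze fixations have already been mapped to one object label per video frame (None when no object is fixated), and that an attention switch is simply a change in that label. The names ActionUnit, segment_by_attention, and min_len are hypothetical and introduced only for illustration.

```python
from itertools import groupby
from typing import List, NamedTuple, Optional


class ActionUnit(NamedTuple):
    target: str  # object label fixated throughout the unit
    start: int   # index of the first frame (inclusive)
    end: int     # index of the last frame (inclusive)


def segment_by_attention(
    fixations: List[Optional[str]], min_len: int = 2
) -> List[ActionUnit]:
    """Split a frame sequence wherever the fixated object changes.

    Assumption: each run of identical labels is one candidate action unit.
    Runs shorter than min_len frames, and runs with no fixated object,
    are dropped as noise. Neighbouring runs are not merged afterwards.
    """
    units: List[ActionUnit] = []
    idx = 0
    for target, run in groupby(fixations):
        n = len(list(run))
        if target is not None and n >= min_len:
            units.append(ActionUnit(target, idx, idx + n - 1))
        idx += n
    return units


# Hypothetical per-frame fixation labels for part of an "unscrewing a jar" task.
frames = ["jar", "jar", "jar", None, "lid", "lid", "jar", "lid", "lid", "lid"]
units = segment_by_attention(frames)
for unit in units:
    print(unit)
```

In a fuller pipeline, each resulting unit would carry the hand-movement features recorded between its start and end frames, and those feature sequences would then be passed to a sequence classifier such as the mixture hidden Markov models mentioned above.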

