ABSTRACT
Most speech interfaces are based on natural language processing techniques that use pre-defined symbolic representations of word meanings and process only linguistic information. To understand and use language as their human counterparts do in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The interface is trained in an unsupervised mode: users perform everyday tasks while describing their actions in natural language. We collect acoustic signals together with multisensory information from non-speech modalities, such as user-perspective video, gaze positions, head directions and hand movements. The system first estimates the user's focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used to spot the target object of user interest. Attention switches are computed and used to segment an action sequence into action units, which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words in continuous speech and then associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar", "stapling a letter" and "pouring water".
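The segmentation step described above (cutting an action sequence into action units at attention switches) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that gaze fixation has already been reduced to one attended-object label per video frame, and the function name `segment_by_attention`, the `min_frames` threshold and the example labels are all hypothetical.

```python
from itertools import groupby


def segment_by_attention(attended, min_frames=3):
    """Split per-frame attended-object labels into candidate action units.

    `attended` holds one label per frame (None when no object is fixated).
    A new unit starts whenever the label changes, i.e. at an attention
    switch. Runs shorter than `min_frames` are treated as brief glances
    and dropped. Returns (label, start, end) tuples, with `end` exclusive.
    Both the input format and the threshold are illustrative assumptions.
    """
    segments = []
    start = 0
    for label, run in groupby(attended):
        length = sum(1 for _ in run)
        if label is not None and length >= min_frames:
            segments.append((label, start, start + length))
        start += length
    return segments


if __name__ == "__main__":
    # Hypothetical frame labels for an "unscrewing a jar" style task.
    frames = ["jar"] * 4 + [None] * 2 + ["lid"] * 5 + ["jar"] + ["lid"] * 3
    for label, begin, end in segment_by_attention(frames):
        print(label, begin, end)
```

In the pipeline the abstract describes, each resulting unit would then be passed to a classifier such as the mixture hidden Markov models; that stage is outside this sketch.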