ABSTRACT
Robust joint visual attention is necessary for achieving a common frame of reference between humans and robots interacting multimodally in order to work together on real-world spatial tasks involving objects. We make a comprehensive examination of one component of this process that is often otherwise implemented in an ad hoc fashion: the ability to correctly determine the object referent from deictic reference including pointing gestures and speech. From this we describe the development of a modular spatial reasoning framework based around decomposition and resynthesis of speech and gesture into a language of pointing and object labeling. This framework supports multimodal and unimodal access in both real-world and mixed-reality workspaces, accounts for the need to discriminate and sequence identical and proximate objects, assists in overcoming inherent precision limitations in deictic gesture, and assists in the extraction of those gestures. We further discuss an implementation of the framework that has been deployed on two humanoid robot platforms to date.