ABSTRACT
New language constantly emerges from complex, collaborative human-human interactions such as meetings, for instance when a presenter handwrites a new term on a whiteboard while saying it. Fixed-vocabulary recognizers fail on such new terms, which are often critical to dialogue understanding. We present a proof-of-concept multimodal system that combines information from handwriting and speech recognition to learn the spelling, pronunciation, and semantics of out-of-vocabulary terms from single instances of redundant multimodal presentation (e.g., saying a term while handwriting it). For the task of recognizing the spelling and semantics of abbreviated Gantt chart labels across a held-out test series of five scheduling meetings, we show a significant relative error rate reduction of 37% when our learning methods are used and allowed to persist across the meeting series, compared with when they are not used.
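As a reading aid for the "relative error rate reduction" figure quoted above, the following minimal sketch shows how such a figure is conventionally computed from a baseline error rate and an improved error rate. The function name and the two example rates are hypothetical and chosen only for illustration; they are not the error rates measured in the paper, which the abstract does not state.

```python
def relative_error_reduction(baseline_error: float, improved_error: float) -> float:
    """Return the drop in error rate as a fraction of the baseline error rate."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical rates for illustration only (NOT taken from the paper):
# error without learning vs. error with learning persisted across meetings.
baseline = 0.50
with_learning = 0.40

reduction = relative_error_reduction(baseline, with_learning)
print(f"relative error rate reduction: {reduction:.0%}")
```

A relative reduction is expressed against the baseline, so it differs from the absolute difference between the two rates: in the illustration the absolute drop is 0.10, while the relative drop is one fifth of the baseline.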
INDEX TERMS
Primary Classification:
H. Information Systems
H.5 INFORMATION INTERFACES AND PRESENTATION (I.7)
H.5.2 User Interfaces (D.2.2, H.1.2, I.3.6)
Subjects: Natural language
Additional Classification:
H. Information Systems
H.5 INFORMATION INTERFACES AND PRESENTATION (I.7)
H.5.2 User Interfaces (D.2.2, H.1.2, I.3.6)
Subjects: Input devices and strategies (e.g., mouse, touchscreen)
I. Computing Methodologies
I.2 ARTIFICIAL INTELLIGENCE
I.2.6 Learning
Subjects: Language acquisition
General Terms: Algorithms, Design, Measurement
Keywords: handwriting, multimodal, speech