ABSTRACT
In this paper, we adopt a direct modeling approach that uses conversational gesture cues to detect sentence boundaries, called SUs, in videotaped conversations. We treat SU detection as a classification task: at each inter-word boundary, the classifier decides whether an SU boundary is present. In addition to gesture cues, we use prosodic and lexical knowledge sources. In this first investigation, we find that gesture features complement the prosodic and lexical knowledge sources for this task. The model that combines all of the knowledge sources achieves the lowest overall SU detection error rate.
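The per-boundary classification framing described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the three features (`pause_sec`, `hand_hold`, `next_is_starter`), the hand-set weights, the logistic scoring function, and the error-rate definition (misses plus false alarms, divided by the number of reference SU boundaries). The paper's actual features and classifier may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class Boundary:
    """Cues observed at one inter-word boundary (all names hypothetical)."""
    pause_sec: float       # prosodic cue: silence following the word
    hand_hold: bool        # gesture cue: hands held still across the boundary
    next_is_starter: bool  # lexical cue: next word often begins a sentence


# Illustrative hand-set weights; a real system would learn these from data.
WEIGHTS = {"bias": -2.0, "pause_sec": 3.0, "hand_hold": 1.0, "next_is_starter": 1.5}


def su_posterior(b: Boundary) -> float:
    """Combine the three knowledge sources into one SU-boundary probability."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["pause_sec"] * b.pause_sec
         + WEIGHTS["hand_hold"] * b.hand_hold
         + WEIGHTS["next_is_starter"] * b.next_is_starter)
    return 1.0 / (1.0 + math.exp(-z))


def detect_sus(boundaries, threshold=0.5):
    """Make one binary decision per inter-word boundary."""
    return [su_posterior(b) >= threshold for b in boundaries]


def su_error_rate(predicted, reference):
    """(misses + false alarms) / number of reference SU boundaries (assumed metric)."""
    n_ref = sum(reference)
    if n_ref == 0:
        raise ValueError("reference contains no SU boundaries")
    misses = sum(r and not p for p, r in zip(predicted, reference))
    false_alarms = sum(p and not r for p, r in zip(predicted, reference))
    return (misses + false_alarms) / n_ref


# Toy data: four inter-word boundaries with made-up cue values.
boundaries = [
    Boundary(0.05, False, False),
    Boundary(0.60, True, True),
    Boundary(0.10, True, False),
    Boundary(0.40, False, True),
]
reference = [False, True, True, True]
predicted = detect_sus(boundaries)
print(predicted, su_error_rate(predicted, reference))
```

Setting a weight to zero removes that knowledge source from the decision, which gives a simple way to compare, say, a prosody-plus-lexical model against one that also uses the gesture cue.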
INDEX TERMS
Primary Classification:
H.
Information Systems
H.5
INFORMATION INTERFACES AND PRESENTATION (I.7)
H.5.1
Multimedia Information Systems
Subjects:
Audio input/output
Additional Classification:
H.
Information Systems
H.5
INFORMATION INTERFACES AND PRESENTATION (I.7)
H.5.1
Multimedia Information Systems
Subjects:
Video (e.g., tape, disk, DVI)
H.5.5
Sound and Music Computing
Subjects:
Modeling;
Signal analysis, synthesis, and processing
I.
Computing Methodologies
I.2
ARTIFICIAL INTELLIGENCE
I.2.7
Natural Language Processing
General Terms:
Algorithms, Experimentation, Languages, Performance
Keywords:
dialog, gesture, language models, multimodal fusion, prosody, sentence boundary detection