ABSTRACT
In this paper, we present a robust algorithm for audio classification that segments and classifies an audio stream into speech, music, environment sound, and silence. Classification proceeds in two steps, which makes it suitable for different applications. The first step discriminates speech from non-speech, using a novel algorithm based on k-nearest neighbor (KNN) classification and line spectrum pair vector quantization (LSP VQ). The second step further divides the non-speech class into music, environment sound, and silence with a rule-based classification scheme. New features, such as the noise frame ratio and band periodicity, are introduced and discussed in detail. Our experiments in the context of video structure parsing show that the algorithms produce very satisfactory results.
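The two-step structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` fields, the function names, the feature scales, and all three thresholds are hypothetical placeholders, and the first step uses a plain k-nearest-neighbor vote on generic feature vectors, leaving out the LSP VQ part entirely.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist


@dataclass
class Segment:
    """Per-segment features; names and scales are illustrative assumptions."""
    stage1_features: tuple    # vector fed to the speech/non-speech step
    energy: float             # short-time energy, arbitrary units
    noise_frame_ratio: float  # fraction of frames judged noise-like, 0..1
    band_periodicity: float   # periodicity strength in sub-bands, 0..1


def knn_is_speech(features, training_set, k=3):
    """Majority vote among the k nearest labelled training vectors."""
    nearest = sorted(training_set, key=lambda item: dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0] == "speech"


def classify(segment, training_set,
             silence_energy=0.01, max_music_nfr=0.3, min_music_bp=0.6):
    """Two-step cascade: speech vs non-speech, then rules for the rest.

    The three thresholds are made-up placeholders, not values from the paper.
    """
    # Step 1: speech / non-speech discrimination.
    if knn_is_speech(segment.stage1_features, training_set):
        return "speech"
    # Step 2: rule-based split of the non-speech class.
    if segment.energy < silence_energy:
        return "silence"
    if (segment.noise_frame_ratio < max_music_nfr
            and segment.band_periodicity > min_music_bp):
        return "music"
    return "environment sound"
```

Here `training_set` is assumed to be a list of `(feature_vector, label)` pairs with labels `"speech"` or `"non-speech"`. Because the cascade separates the two decisions, the first step can be used on its own when an application only needs speech/non-speech discrimination.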
REFERENCES
1. J. Foote. Content-based retrieval of music and audio. In C. C. J. Kuo et al., editors, Multimedia Storage and Archiving Systems II, Proc. of SPIE, volume 3229, pages 138-147, 1997.
2. Erling Wold, Thom Blum, Douglas Keislar, James Wheaton. Content-Based Classification, Search, and Retrieval of Audio. IEEE MultiMedia, vol. 3, no. 3, pp. 27-36, September 1996. doi: 10.1109/93.556537
3. Silvia Pfeiffer, Stephan Fischer, Wolfgang Effelsberg. Automatic audio content analysis. Proceedings of the fourth ACM international conference on Multimedia, pp. 21-30, November 18-22, 1996, Boston, Massachusetts, United States. doi: 10.1145/244130.244139
4. J. Saunders. Real-time Discrimination of Broadcast Speech/Music. Proc. ICASSP'96, vol. II, pp. 993-996, Atlanta, May 1996.
5.
6. D. Kimber and L. Wilcox. Acoustic Segmentation for Audio Browsers. Proc. Interface Conference, Sydney, Australia, July 1996.
7. T. Zhang and C.-C. J. Kuo. Video Content Parsing Based on Combined Audio and Visual Information. SPIE 1999, Vol. IV, pp. 78-89, 1999.
8. J. P. Campbell, Jr. Speaker Recognition: A Tutorial. Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, 1997.
9. A. V. McCree and T. P. Barnwell. Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242-250, July 1995.
10. K. El-Maleh, M. Klein, G. Petrucci and P. Kabal. Speech/music discrimination for multimedia application. ICASSP'00, 2000.
11. Y. Linde, A. Buzo, and R. M. Gray. An Algorithm for Vector Quantizer Design. IEEE Trans. on Commun., vol. COM-28, no. 1, pp. 84-95, 1980.
12. Savitha Srinivasan, Dragutin Petkovic, Dulce Ponceleon. Towards robust features for classifying audio in the CueVideo system. Proceedings of the seventh ACM international conference on Multimedia (Part 1), pp. 393-400, October 30-November 05, 1999, Orlando, Florida, United States. doi: 10.1145/319463.319658
13.
14. J. S. Boreczky and L. D. Wilcox. A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features. Proceedings of ICASSP'98, pp. 3741-3744, Seattle, May 1998.
CITED BY 19
Maria da Graça Pimentel, Cassio Prazeres, Helder Ribas, Daniel Lobato, Cesar Teixeira, Documenting the pen-based interaction, Proceedings of the 11th Brazilian Symposium on Multimedia and the web, p.1-8, December 05-07, 2005, Pocos de Caldas - Minas Gerais, Brazil
Jan C. van Gemert, Cees G. M. Snoek, Cor J. Veenman, Arnold W. M. Smeulders, The influence of cross-validation on video classification performance, Proceedings of the 14th annual ACM international conference on Multimedia, October 23-27, 2006, Santa Barbara, CA, USA
Jim Kleban, Anindya Sarkar, Emily Moxley, Stephen Mangiat, Swapna Joshi, Thomas Kuo, B. S. Manjunath, Feature fusion and redundancy pruning for rush video summarization, Proceedings of the international workshop on TRECVID video summarization, p.84-88, September 28-28, 2007, Augsburg, Bavaria, Germany
Mohammad Soleymani, Guillaume Chanel, Joep J.M. Kierkels, Thierry Pun, Affective ranking of movie scenes using physiological signals and content analysis, Proceeding of the 2nd ACM workshop on Multimedia semantics, October 31-31, 2008, Vancouver, British Columbia, Canada
REVIEW
Reviewer: Hadi Harb
The authors present a technique for the classification of audio into speech, music, environment sounds, and silence classes. Such a classification is useful for audio indexing and retrieval, and for video structure extraction. The technique ...