ABSTRACT
Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is long, so there are many potential areas for improvement. The question is: which stage should be changed to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study in which human participants act as an automatic speaker clustering system. Our findings are twofold: the modeling stage has the highest potential for improvement, and information about the temporal succession of frames is crucially missing. Experimental results on TIMIT data, obtained with our implementation of a speaker clustering system that incorporates these findings, show the validity of our approach.
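One way to read the second finding is that a model treating an utterance as an unordered bag of frames cannot see temporal succession. The following Python sketch is only an illustration under stated assumptions, not the system described in this paper: the toy two-dimensional frames, the function names `frame_stats` and `mean_abs_delta`, and the interleaving reorder are all hypothetical. It shows that per-dimension mean and variance stay the same when the frames are reordered, whereas a simple frame-to-frame difference statistic changes.

```python
import math


def frame_stats(frames):
    """Per-dimension mean and variance of a list of feature frames.

    These are order-free statistics: they ignore where in time a frame occurs.
    """
    dims = list(zip(*frames))
    means = [math.fsum(d) / len(d) for d in dims]
    variances = [
        math.fsum((x - m) ** 2 for x in d) / len(d)
        for d, m in zip(dims, means)
    ]
    return means, variances


def mean_abs_delta(frames):
    """Average absolute difference between consecutive frames, per dimension.

    This statistic depends on the temporal succession of frames.
    """
    pairs = list(zip(frames, frames[1:]))
    n_dims = len(frames[0])
    return [
        math.fsum(abs(b[i] - a[i]) for a, b in pairs) / len(pairs)
        for i in range(n_dims)
    ]


# Hypothetical toy "utterance": 50 frames following a smooth trajectory.
ordered = [[float(t), 2.0 * t] for t in range(50)]
# Same frames in a different order (even-indexed frames first, then odd).
reordered = ordered[::2] + ordered[1::2]

stats_ordered = frame_stats(ordered)
stats_reordered = frame_stats(reordered)
delta_ordered = mean_abs_delta(ordered)
delta_reordered = mean_abs_delta(reordered)

print("order-free statistics match:", stats_ordered == stats_reordered)
print("mean |delta| ordered:  ", delta_ordered)
print("mean |delta| reordered:", delta_reordered)
```

In this toy setting, any model built only from the order-free statistics would treat the two sequences as identical, while the difference statistic separates them. Real systems would use acoustic features and richer models than this sketch assumes.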