ACM Home Page
Please provide us with feedback. Feedback
Generative model-based clustering of directional data
Full text PdfPdf (275 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Washington, D.C.
SESSION: Research track table of contents
Pages: 19 - 28  
Year of Publication: 2003
ISBN:1-58113-737-0
Authors
Arindam Banerjee  University of Texas, Austin, TX
Inderjit Dhillon  University of Texas, Austin, TX
Joydeep Ghosh  University of Texas, Austin, TX
Suvrit Sra  University of Texas, Austin, TX
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 70,   Citation Count: 11
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/956750.956757
What is a DOI?

ABSTRACT

High dimensional directional data is becoming increasingly important in contemporary applications such as analysis of text and gene-expression data. A natural model for multi-variate directional data is provided by the von Mises-Fisher (vMF) distribution on the unit hypersphere that is analogous to the multi-variate Gaussian distribution in Rd. In this paper, we propose modeling complex directional data as a mixture of vMF distributions. We derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the parameters of this mixture. We also propose two clustering algorithms corresponding to these variants. An interesting aspect of our methodology is that the spherical kmeans algorithm (kmeans with cosine similarity) can be shown to be a special case of both our algorithms. Thus, modeling text data by vMF distributions lends theoretical validity to the use of cosine similarity which has been widely used by the information retrieval community. As part of experimental validation, we present results on modeling high-dimensional text and gene-expression data as a mixture of vMF distributions. The results indicate that our approach yields superior clusterings especially for difficult clustering tasks in high-dimensional spaces.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on hyperspheres using Expectation Maximization. Technical Report TR-03-07, Department of Computer Sciences, University of Texas, February 2003.
 
3
A. Banerjee and J. Ghosh. Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres. In Proceedings International Joint Conference on Neural Networks, pages 1590--1595, May 2002.
 
4
P. S. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, Microsoft Research, May 2000.
 
5
M. Collins. The EM algorithm. In fulfillment of Written Preliminary Exam II requirement, September 1997.
 
6
 
7
 
8
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1--38, 1977.
 
9
I. Dhillon and J. Kogan, editors. Proc. Workshop on Clustering High Dimensional Data and its Applications. SIAM, 2002.
 
10
 
11
 
12
I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-06-03, University of Texas, Dept. of Coputer Sciences, February 2003.
 
13
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95:14863--14868, 1998.
 
14
T. R. H. et. al Functional discovery via a compendium of expression profiles. Cell, 102:109--126, 2000.
 
15
J. Ghosh. Scalable clustering methods for data mining. In N. Ye, editor, Handbook of Data Mining, pages 247--277. Lawrence Erlbaum, 2003.
 
16
G. K. Gupta and J. Ghosh. Detecting seasonal and divergent trends and visualization for very high dimensional transactional data. In Proc. 1st SIAM Intl. Conf. on Data Mining, April 2001.
 
17
 
18
 
19
 
20
M. Kearns, Y. Mansour, and A. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In 13th Annual Conf. on Uncertainty in Artificial Intelligence (UA197), 1997.
 
21
K. V. Mardia. Statistical Distributions in Scienctific Work, volume 3, chapter Characteristics of directional distributions, pages 365--385. Reidel, Dordrecht, 1975.
 
22
K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000.
 
23
 
24
 
25
C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 2nd edition, 1973.
26
 
27
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
 
28
 
29
P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing, volume 9, pages 648--654. MIT Press, 1997.
 
30
A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In Proc 7th Natl Conf on Artificial Intelligence: Workshop of AI for Web Search (AAAI 2000), pages 58--64. AAAI, July 2000.

CITED BY  11

Collaborative Colleagues:
Arindam Banerjee: colleagues
Inderjit Dhillon: colleagues
Joydeep Ghosh: colleagues
Suvrit Sra: colleagues