ACM Home Page
Please provide us with feedback. Feedback
Hierarchical mixture models: a probabilistic analysis
Full text PdfPdf (937 KB)
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Jose, California, USA
SESSION: Research track papers table of contents
Pages: 580 - 589  
Year of Publication: 2007
ISBN:978-1-59593-609-7
Author
Mark Sandler  Google Inc.
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 118,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1281192.1281255
What is a DOI?

ABSTRACT

Mixture models form one of the most widely used classes of generative models for describing structured and clustered data. In this paper we develop a new approach for the analysis of hierarchical mixture models. More specifically, using a text clustering problem as a motivation, we describe a natural generative process that creates a hierarchical mixture model for the data. In this process, an adversary starts with an arbitrary base distribution and then builds a topic hierarchy via some evolutionary process, where he controls the parameters of the process. We prove that under our assumptions, given a subset of topics that represent generalizations of one another (such as baseballsportsbase), for any document which was produced via some topic in this hierarchy, we can efficiently determine the most specialized topic in this subset, it still belongs to. The quality of the classification is independent of the total number of topics in the hierarchy and our algorithm does not need to know the total number of topics in advance. Our approach also yields an algorithm for clustering and unsupervised topical tree reconstruction. We validate our model by showing that properties predicted by our theoretical results carry over to real data. We then apply our clustering algorithm to two different datasets: (i) "20 newsgroups"[19] and (ii) a snapshot of abstracts of arXiv {2} (15 categories, ~240,000 abstracts). In both cases our algorithm performs extremely well.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Arxiv. http://www.arxiv.org.
 
3
D. Blei, T. Gri, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested chinese restaurant process. In NIPS, 2004.
 
4
 
5
 
6
7
8
 
9
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:301:13--30, March 1963.
 
10
 
11
 
12
 
13
A. Jepson and M. Black. Mixture models for optical flow computation. Partitioning Data Sets I, 1993.
 
14
L. B.-T. Jones, R. Bean, G. McLachlan, and J. Zhu. Intelligent Data Engineering and Automated Learning -- I, volume 3578, chapter Application of Mixture Models to Detect Differentially Expressed Genes, pages 422----431. Springer, 2005.
15
 
16
A. McCallum and K. Nigam. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
 
17
A. K. McCallum. Multi-label text classification with a mixture model trained by em. In Proc. of AAAI'99 workshop on text learning, 1999.
 
18
G. McLachlan and K. Basford. Mixture Models, inference and applications to clustering. Marcel Dekker, 1987.
 
19
T. Mitchell. 20 newsgroups.
 
20
 
21
L. A. Salter. Algorithms for phylogenetic tree reconstruction, 2000.
22
 
23
 
24
C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, volume II, pages 246--252, 1999.
 
25
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical dirichlet processes. J. of the American Statistical Association, 2006.
26
 
27