|
ABSTRACT
We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information with time and space complexity typically linear in the data size. information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, the sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading cluster's recall to gain higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Ron Bekkerman , Ran El-Yaniv , Naftali Tishby , Yoad Winter, On feature distributional clustering for text categorization, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.146-153, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383976]
|
| |
2
|
|
 |
3
|
|
| |
4
|
R. El-Yaniv and O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. In Proc. of Neural Information Processing Systems (NIPS-14), 2002, to appear.
|
 |
5
|
|
| |
6
|
|
| |
7
|
K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, 1995.
|
| |
8
|
J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information theory, 37(1), 1991.
|
| |
9
|
|
| |
10
|
|
| |
11
|
G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974--980, 1990.
|
| |
12
|
A. Schriebman, R. El-Yaniv, S. Fine and N. Tishby. On the two-sample problem and the Jensen-Shannon divergence for Markov sources. In preparation.
|
| |
13
|
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-12), 1999.
|
 |
14
|
|
| |
15
|
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
|
| |
16
|
N. Slonim, N. Friedman and N. Tishby. Multivariate Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-14), 2002, to appear.
|
| |
17
|
N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation. 1999.
|
| |
18
|
|
 |
19
|
|
CITED BY 24
|
|
Jaakko Peltonen , Janne Sinkkonen , Samuel Kaski, Sequential information bottleneck for finite data, Proceedings of the twenty-first international conference on Machine learning, p.82, July 04-08, 2004, Banff, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Samuel Kaski , Janne Nikkila , Janne Sinkkonen , Leo Lahti , Juha E. A. Knuuttila , Christophe Roos, Associative Clustering for Exploring Dependencies between Functional Genomics Data Sets, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), v.2 n.3, p.203-216, July 2005
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jinxiu Chen , Donghong Ji , Chew Lim Tan , Zhengyu Niu, Relation extraction using label propagation based semi-supervised learning, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, p.129-136, July 17-18, 2006, Sydney, Australia
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Young-Min Kim , Jean-François Pessiot , Massih Reza Amini , Patrick Gallinari, An extension of PLSA for document clustering, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
Wei Zhang , Akshat Surve , Xiaoli Fern , Thomas Dietterich, Learning non-redundant codebooks for classifying complex objects, Proceedings of the 26th Annual International Conference on Machine Learning, p.1241-1248, June 14-18, 2009, Montreal, Quebec, Canada
|
|
|
|
|
|
|
|