ACM Home Page
Please provide us with feedback. Feedback
Unsupervised document classification using sequential information maximization
Full text PdfPdf (237 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Tampere, Finland
SESSION: Text Categorization table of contents
Pages: 129 - 136  
Year of Publication: 2002
ISBN:1-58113-561-0
Authors
Noam Slonim  The Hebrew University, Jerusalem, Israel
Nir Friedman  The Hebrew University, Jerusalem, Israel
Naftali Tishby  The Hebrew University, Jerusalem, Israel
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 21,   Downloads (12 Months): 130,   Citation Count: 24
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/564376.564401
What is a DOI?

ABSTRACT

We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information with time and space complexity typically linear in the data size. information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, the sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading cluster's recall to gain higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
 
4
R. El-Yaniv and O. Souroujon. Iterative Double Clustering for Unsupervised and Semi-supervised Learning. In Proc. of Neural Information Processing Systems (NIPS-14), 2002, to appear.
5
 
6
 
7
K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, 1995.
 
8
J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information theory, 37(1), 1991.
 
9
 
10
 
11
G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974--980, 1990.
 
12
A. Schriebman, R. El-Yaniv, S. Fine and N. Tishby. On the two-sample problem and the Jensen-Shannon divergence for Markov sources. In preparation.
 
13
N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-12), 1999.
14
 
15
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
 
16
N. Slonim, N. Friedman and N. Tishby. Multivariate Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-14), 2002, to appear.
 
17
N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck method. In Proc. 37th Allerton Conference on Communication and Computation. 1999.
 
18
19

CITED BY  24

Collaborative Colleagues:
Noam Slonim: colleagues
Nir Friedman: colleagues
Naftali Tishby: colleagues