Evaluation of hierarchical clustering algorithms for document datasets
Source: Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Web clustering
Pages: 515 - 524
Year of Publication: 2002
ISBN: 1-58113-492-4
Authors
Ying Zhao, University of Minnesota, Minneapolis, MN
George Karypis, University of Minnesota, Minneapolis, MN
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/584792.584877

ABSTRACT

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
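
The abstract does not spell out how a constrained agglomerative algorithm combines the two families, so the following is only an illustrative sketch of one plausible reading: a partitional step first splits the collection into k groups, agglomeration then builds a tree inside each group, and a final agglomerative pass joins the k subtrees. Every name here (`partition`, `agglomerate`, `constrained_agglomerative`), the cosine/average-link choices, and the first-k seeding are assumptions made for illustration, not the authors' implementation.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    """Cosine similarity; assumes both vectors are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Unit-length mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return normalize([sum(v[i] for v in vectors) for i in range(dim)])

def partition(docs, k, iters=20):
    """Partitional step (k-means style). Seeding with the first k documents
    is an arbitrary, hypothetical choice made to keep the sketch deterministic."""
    centers = [docs[i] for i in range(k)]
    labels = [0] * len(docs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: cosine(d, centers[c])) for d in docs]
        for c in range(k):
            members = [d for d, l in zip(docs, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels

def agglomerate(nodes, docs):
    """Agglomerative step: repeatedly merge the pair of nodes with the highest
    average pairwise similarity. Each node is a (tree, member_ids) pair."""
    nodes = list(nodes)

    def sim(a, b):
        total = sum(cosine(docs[i], docs[j]) for i in a[1] for j in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(nodes) > 1:
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: sim(nodes[p[0]], nodes[p[1]]))
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)] + [merged]
    return nodes[0]

def constrained_agglomerative(docs, k):
    """Build a hierarchy in which merges are first confined to the k
    partitional clusters, then the k subtrees are merged with each other."""
    docs = [normalize(d) for d in docs]
    labels = partition(docs, k)
    subtrees = []
    for c in range(k):
        leaves = [(i, [i]) for i, l in enumerate(labels) if l == c]
        if leaves:  # skip clusters that ended up empty
            subtrees.append(agglomerate(leaves, docs))
    return agglomerate(subtrees, docs)[0]

# Toy term-weight vectors standing in for four documents on two topics.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = constrained_agglomerative(docs, k=2)
print(tree)  # nested tuples of document indices
```

The returned tree is a nested tuple of document indices, so each level of nesting corresponds to one level of granularity in the hierarchy. The pairwise average-link search in this sketch is quadratic per merge and is meant only for small illustrative inputs.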


