ABSTRACT
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper, we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms. This suggests that partitional clustering algorithms are well suited for clustering large document datasets, not only because of their relatively low computational requirements but also because of their comparable or even better clustering performance. We present a new class of clustering algorithms, called constrained agglomerative algorithms, that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
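The abstract does not spell out how the constrained agglomerative algorithms work, so the following is only a minimal sketch of one plausible reading: a partitional pass first splits the documents into k groups, agglomeration is then restricted to run inside each group, and the resulting subtrees are finally agglomerated into a single hierarchy. Everything specific here is an assumption made for illustration, not the authors' method: the function names, the small spherical k-means with farthest-first seeding used as the partitional step, cosine similarity, and group-average linkage.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length so that a dot product is a cosine."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Tiny spherical k-means (assumed partitional step); needs k <= len(vectors)."""
    # Farthest-first seeding: start at document 0, then repeatedly add the
    # document that is least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds)))
    centers = [vectors[s] for s in seeds]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = normalize([sum(col) for col in zip(*members)])
    return assign

def agglomerate(items, vectors):
    """Group-average agglomeration over (tree, member_indices) pairs."""
    items = list(items)

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a[1] for j in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(items) > 1:
        # Merge the pair of clusters with the highest average pairwise similarity.
        i, j = max(combinations(range(len(items)), 2),
                   key=lambda p: avg_sim(items[p[0]], items[p[1]]))
        merged = ((items[i][0], items[j][0]), items[i][1] + items[j][1])
        items = [x for n, x in enumerate(items) if n not in (i, j)] + [merged]
    return items[0]

def constrained_agglomerative(docs, k):
    """Build a binary tree whose early merges stay inside the k partitions."""
    vectors = [normalize(d) for d in docs]
    assign = kmeans(vectors, k)
    subtrees = []
    for c in range(k):
        cluster_leaves = [(i, [i]) for i, a in enumerate(assign) if a == c]
        if cluster_leaves:
            subtrees.append(agglomerate(cluster_leaves, vectors))
    tree, _ = agglomerate(subtrees, vectors)  # join the per-partition subtrees
    return tree, assign

def leaves(tree):
    """Flatten a nested-tuple tree into its list of document indices."""
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])

# Toy term-frequency vectors: documents 0, 1, 4 share one topic; 2, 3, 5 another.
docs = [[5, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 1],
        [0, 0, 1, 4], [5, 0, 1, 0], [0, 1, 4, 4]]
tree, assign = constrained_agglomerative(docs, k=2)
print(assign, tree)
```

In this sketch the partition acts as a hard constraint: two documents placed in different partitions can only meet near the top of the tree, which is the sense in which the partitional pass guides the agglomerative one. For real collections, the quadratic pairwise search in `agglomerate` would need to be replaced by something more efficient.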