Frequent term-based text clustering
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
POSTER SESSION: Poster papers
Pages: 436-442
Year of Publication: 2002
ISBN: 1-58113-567-X
ABSTRACT
Text clustering methods can be used to structure large sets of text or hypertext documents. The well-known methods of text clustering, however, do not address the special problems of text clustering: the very high dimensionality of the data, the very large size of the databases, and the understandability of the cluster descriptions. In this paper, we introduce a novel approach which uses frequent item (term) sets for text clustering. Such frequent sets can be efficiently discovered using algorithms for association rule mining. To cluster based on frequent term sets, we measure the mutual overlap of frequent sets with respect to the sets of supporting documents. We present two algorithms for frequent term-based text clustering: FTC, which creates flat clusterings, and HFTC, for hierarchical clustering. An experimental evaluation on classical text documents as well as on web documents demonstrates that the proposed algorithms obtain clusterings of comparable quality significantly more efficiently than state-of-the-art text clustering algorithms. Furthermore, our methods provide an understandable description of the discovered clusters by their frequent term sets.
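The idea in the abstract can be illustrated with a small sketch. This is not the paper's FTC algorithm, only a toy version under stated assumptions: frequent term sets are enumerated by brute force rather than with an association rule mining algorithm, and the overlap score (an entropy-style measure over how many candidate sets each document supports) is an assumption, since the abstract does not define the measure. The function names `frequent_term_sets` and `greedy_term_set_clustering` and the sample documents are hypothetical.

```python
from itertools import combinations
from math import log


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover (indices of supporting documents).

    Brute-force enumeration, suitable for toy inputs only; a real system
    would use an association rule mining algorithm such as Apriori.
    """
    vocab = sorted(set().union(*docs))
    covers = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocab, size):
            cover = frozenset(
                i for i, d in enumerate(docs) if d.issuperset(terms)
            )
            if len(cover) >= min_support:
                covers[frozenset(terms)] = cover
    return covers


def greedy_term_set_clustering(docs, min_support, max_size=2):
    """Greedily pick term sets whose covers overlap least with the others."""
    remaining = dict(frequent_term_sets(docs, min_support, max_size))
    clusters = []
    while remaining:
        # f[j] = number of remaining candidate covers that contain document j.
        f = {}
        for cover in remaining.values():
            for j in cover:
                f[j] = f.get(j, 0) + 1

        def overlap(item):
            terms, cover = item
            # Assumed entropy-style overlap: zero when every document in the
            # cover supports this candidate only.
            score = sum(-(1 / f[j]) * log(1 / f[j]) for j in cover)
            return (score, sorted(terms))  # term list breaks ties deterministically

        terms, cover = min(remaining.items(), key=overlap)
        clusters.append((set(terms), set(cover)))
        # Remove the chosen documents from all other covers; drop empty ones.
        remaining = {
            t: c - cover
            for t, c in remaining.items()
            if t != terms and c - cover
        }
    return clusters


if __name__ == "__main__":
    sample = [
        {"apple", "fruit"},
        {"apple", "fruit", "pie"},
        {"car", "engine"},
        {"car", "engine", "oil"},
        {"fruit", "pie"},
    ]
    for terms, members in greedy_term_set_clustering(sample, min_support=2):
        print(sorted(terms), sorted(members))
```

Each resulting cluster is labeled by the term set that produced it, which is what makes the description understandable: every document in a cluster contains all terms of the label. Documents that support no frequent term set stay unassigned in this sketch.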
CITED BY 21
Ling Ma, Nazli Goharian, Abdur Chowdhury, Misun Chung, Extracting unstructured data from template generated web documents, Proceedings of the twelfth international conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA
Vandana P. Janeja, Vijayalakshmi Atluri, Ahmed Gomaa, Nabil Adam, Christof Bornhoevd, Tao Lin, DM-AMS: employing data mining techniques for alert management, Proceedings of the 2005 national conference on Digital government research, May 15-18, 2005, Atlanta, Georgia
Shengnan Cong, Jiawei Han, Jay Hoeflinger, David Padua, A sampling-based framework for parallel data mining, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, June 15-17, 2005, Chicago, IL, USA
Chi-Hoon Lee, Osmar R. Zaïane, Ho-Hyun Park, Jiayuan Huang, Russell Greiner, Clustering high dimensional data: A graph-based relaxed optimization approach, Information Sciences: an International Journal, v.178 n.23, p.4501-4511, December, 2008