ABSTRACT
In this paper, we propose a new similarity measure for computing the pairwise similarity of text documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is very effective. Compared with the traditional word term weight tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% in average F-measure score. Furthermore, we apply the new clustering algorithm to the analysis of Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify, and search the Web documents in a large forum community.