A new suffix tree similarity measure for document clustering
Source
International World Wide Web Conference archive
Proceedings of the 16th international conference on World Wide Web
Banff, Alberta, Canada
SESSION: Similarity search
Pages: 121 - 130  
Year of Publication: 2007
ISBN: 978-1-59593-654-7
Authors
Hung Chim  City University of Hong Kong, Hong Kong
Xiaotie Deng  City University of Hong Kong, Hong Kong
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30,   Downloads (12 Months): 244,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1242572.1242590

ABSTRACT

In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is very effective. Compared with the results of the traditional word term weight tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% in the average F-measure score. Furthermore, we apply the new clustering algorithm to analyzing the Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify, and search the Web documents in a large forum community.
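
For orientation, the baseline that the abstract compares against (a tf-idf similarity measure used inside GAHC) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the smoothed idf formula, the cosine similarity, the between-cluster definition of the group average, the stopping rule (a fixed number of clusters `k`), and the toy documents are all assumptions made for the example. The suffix tree similarity measure itself is not specified in the abstract, so it is not reproduced here; under this sketch it would simply replace the `sim` matrix.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document.

    Assumption: idf = log(1 + N / df), a smoothed variant chosen for the sketch.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gahc(sim, k):
    """Group-average agglomerative clustering down to k clusters.

    sim is a symmetric matrix of pairwise document similarities. The group
    average here is the mean similarity over all cross-cluster document pairs.
    """
    clusters = [[i] for i in range(len(sim))]

    def group_average(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Pick the pair of clusters with the highest group-average similarity.
        x, y = max(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: group_average(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy documents (hypothetical, not taken from OHSUMED or RCV1).
docs = [
    "suffix tree clustering of web documents".split(),
    "suffix tree document clustering".split(),
    "football match results today".split(),
    "football match score today".split(),
]
vectors = tfidf_vectors(docs)
sim = [[cosine(u, v) for v in vectors] for u in vectors]
clusters = gahc(sim, 2)
print(clusters)
```

Because the clustering step only reads the pairwise similarity matrix, swapping one similarity measure for another leaves `gahc` unchanged, which is what makes a like-for-like comparison of measures within the same GAHC algorithm possible.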

