An investigation of linguistic features and clustering algorithms for topical document clustering
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Athens, Greece
Pages: 224-231
Year of Publication: 2000
ISBN: 1-58113-226-3
Authors
Vasileios Hatzivassiloglou, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
Luis Gravano, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
Ankineedu Maganti, Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY
Sponsors
Athens U of Econ & Business : Athens University of Economics and Business
Greek Com Soc : Greek Computer Society
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 89,   Citation Count: 15
DOI: http://doi.acm.org/10.1145/345508.345582

ABSTRACT

We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.
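
The ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (combine, cluster_similarity, agglomerate), the weights, the stopping threshold, and the toy similarity matrix are all assumptions made for illustration. The sketch shows (a) a logistic combination of several per-feature similarity scores, whose weights would in practice be fit by regression on labelled document pairs rather than hand-picked, and (b) how single-link, complete-link, and group-average criteria differ when merging clusters. Group-average is taken here to mean the mean similarity over cross-cluster pairs, and the single-pass method is omitted.

```python
import math

def combine(sims, weights, bias):
    """Logistic combination of per-feature similarities into one score in (0, 1).

    The weights and bias are hypothetical; a fitted regression would supply them.
    """
    z = bias + sum(w * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-z))

def cluster_similarity(a, b, sim, linkage):
    """Similarity between clusters a and b (lists of document indices)."""
    pair_sims = [sim[i][j] for i in a for j in b]
    if linkage == "single":      # most similar cross-cluster pair
        return max(pair_sims)
    if linkage == "complete":    # least similar cross-cluster pair
        return min(pair_sims)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(sim, linkage, threshold):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y], sim, linkage)
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        if s < threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)

# Toy symmetric similarity matrix for four documents (made-up values):
# documents 0-1 and 2-3 are close, and 1-2 are moderately similar.
TOY_SIM = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.2],
    [0.2, 0.6, 1.0, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]
```

With a threshold of 0.5 on this toy matrix, agglomerate(TOY_SIM, "single", 0.5) chains all four documents into one cluster through the moderately similar pair, while the "complete" and "average" criteria keep the two pairs apart. This is the kind of algorithm-dependent difference the abstract refers to.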

