Improving text categorization methods for event tracking
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Athens, Greece
Pages: 65-72
Year of Publication: 2000
ISBN: 1-58113-226-3
Authors
Yiming Yang  Language Technologies Institute and Computer Science Department, Newell Simon Hall 3612D, Carnegie Mellon University, Pittsburgh, PA
Tom Ault  Language Technologies Institute and Computer Science Department, Newell Simon Hall 3612D, Carnegie Mellon University, Pittsburgh, PA
Thomas Pierce  Language Technologies Institute and Computer Science Department, Newell Simon Hall 3612D, Carnegie Mellon University, Pittsburgh, PA
Charles W. Lattimer  Language Technologies Institute and Computer Science Department, Newell Simon Hall 3612D, Carnegie Mellon University, Pittsburgh, PA
Sponsors
Athens University of Economics and Business
Greek Computer Society
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/345508.345550

ABSTRACT

Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.
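
The abstract names two families of methods, kNN variants and a Rocchio approach, but does not spell out their formulas. The following is a minimal sketch of how each could score a document for a tracked event, assuming bag-of-words vectors, cosine similarity, and a handful of labelled training documents. The function names, the whitespace tokenizer, the values of `k` and `gamma`, and the unweighted average used to combine the two scores are all illustrative assumptions, not the authors' configuration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector; lowercase whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_score(doc, train, k=3):
    """Among the k most similar training documents, add the similarity of
    positive (on-event) neighbours and subtract that of negative ones."""
    ranked = sorted(((cosine(doc, v), label) for v, label in train),
                    key=lambda pair: pair[0], reverse=True)[:k]
    return sum(s if label else -s for s, label in ranked)


def rocchio_prototype(train, gamma=0.5):
    """Event prototype: mean of positive vectors minus gamma times the mean
    of negative vectors, keeping only terms with positive weight."""
    pos = [v for v, label in train if label]
    neg = [v for v, label in train if not label]
    proto = Counter()
    for v in pos:
        for t, w in v.items():
            proto[t] += w / len(pos)
    for v in neg:
        for t, w in v.items():
            proto[t] -= gamma * w / len(neg)
    return {t: w for t, w in proto.items() if w > 0}


def rocchio_score(doc, proto):
    """Similarity of a document to the event prototype."""
    return cosine(doc, proto)


def combined_score(doc, train, proto, k=3):
    """Unweighted average of the two scores (a hypothetical combination rule)."""
    return 0.5 * (knn_score(doc, train, k) + rocchio_score(doc, proto))


# Toy data: two positive examples for one event, three off-event documents.
train = [
    (tf_vector("earthquake hits city rescue teams"), True),
    (tf_vector("rescue teams search earthquake rubble"), True),
    (tf_vector("football team wins cup final"), False),
    (tf_vector("stock market falls sharply today"), False),
    (tf_vector("new phone released today"), False),
]
proto = rocchio_prototype(train)
on_event = tf_vector("earthquake rescue continues in city")
off_event = tf_vector("football cup final today")
```

A document would be flagged as on-event when its score exceeds a threshold tuned on held-out data. Averaging the scores of several classifiers is one simple way to approximate the combination the abstract credits with reducing variance across collections.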


