ABSTRACT
Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved to handle difficult situations in which the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most events have a short duration. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, placing kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.
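To make the two families of methods named in the abstract concrete, the following is a minimal sketch of a generic kNN scorer and a generic Rocchio scorer for deciding whether a new document belongs to a tracked event. It is not the authors' implementation: the abstract does not describe the new kNN variants, so the term weighting (plain normalised term frequency), the function names, and the parameters `k` and `gamma` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector, L2-normalised (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_score(doc, positives, negatives, k=3):
    """Among the k nearest training documents, add the similarities of
    positive neighbours and subtract those of negative neighbours."""
    x = tf_vector(doc)
    scored = [(cosine(x, tf_vector(d)), +1) for d in positives]
    scored += [(cosine(x, tf_vector(d)), -1) for d in negatives]
    top = sorted(scored, key=lambda p: p[0], reverse=True)[:k]
    return sum(sim * label for sim, label in top)


def centroid(docs):
    """Average of the normalised vectors of a list of documents."""
    c = Counter()
    for d in docs:
        for t, w in tf_vector(d).items():
            c[t] += w / len(docs)
    return c


def rocchio_score(doc, positives, negatives, gamma=0.5):
    """Dot product with a prototype: positive centroid minus gamma times
    the negative centroid (gamma is a hypothetical default)."""
    pos, neg = centroid(positives), centroid(negatives)
    terms = set(pos) | set(neg)
    proto = {t: pos.get(t, 0.0) - gamma * neg.get(t, 0.0) for t in terms}
    x = tf_vector(doc)
    return sum(w * proto.get(t, 0.0) for t, w in x.items())
```

In a tracking setting, a document would be flagged as on-event when its score exceeds a threshold tuned on held-out data. With only a handful of positive examples per event, as the abstract notes, the choice of `k` and of the threshold is likely to matter a great deal, which is one plausible reason for combining several scorers.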