ABSTRACT
In this paper, we address the problem of skewed data in topic tracking, namely that far fewer stories are labeled positive than negative, and we propose a method for estimating effective training stories for the topic-tracking task. To compensate for the small number of labeled positive stories, we use bilingual comparable corpora, i.e., English and Japanese corpora, together with the EDR bilingual dictionary, and extract story pairs consisting of positive stories and their associated stories. To cope with the large number of labeled negative stories, we classify them into clusters using a semisupervised clustering algorithm that combines k-means with EM. The method was tested on the TDT English corpus, and the results showed that the system works well when the topic being tracked concerns an event originating in the source-language country, even with a small number of initial positive training stories.
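The abstract does not spell out how k-means and EM are combined, so the following is only a minimal sketch of one common arrangement, not the paper's algorithm. It assumes that a few labeled seed stories fix the initial centroids, that k-means produces a hard partition, and that EM then refines it with soft assignments under a spherical Gaussian mixture with a fixed variance. The function names (`kmeans`, `em_refine`), the toy two-dimensional vectors, and the `var` parameter are all hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points, dim):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(dim)]

def kmeans(points, centroids, iters=10):
    """Hard-assignment k-means starting from the given centroids."""
    dim = len(points[0])
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g, dim) if g else c for g, c in zip(groups, centroids)]
    return centroids

def em_refine(points, centroids, var=1.0, iters=20):
    """Soft-assignment EM for a spherical Gaussian mixture (assumed fixed variance)."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    k, dim = len(centroids), len(points[0])
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for p in points:
            logs = [math.log(weights[j]) - sq_dist(p, centroids[j]) / (2 * var)
                    for j in range(k)]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: update mixing weights and centroids from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj > 0:
                centroids[j] = [sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                                for d in range(dim)]
            weights[j] = max(nj / len(points), 1e-12)
    return centroids, resp

# Toy stand-ins for story vectors; one labeled seed per cluster (hypothetical data).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
seeds = {0: [[0.0, 0.1]], 1: [[5.0, 5.1]]}

init = [mean(seeds[j], 2) for j in sorted(seeds)]
centroids = kmeans(points, init)
centroids, resp = em_refine(points, centroids)
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
print(labels)
```

Seeding the centroids from labeled stories is what makes this sketch semisupervised; how the number of clusters is chosen and how stories are represented are left open here.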