Topic tracking based on bilingual comparable corpora and semisupervised clustering
Source
ACM Transactions on Asian Language Information Processing (TALIP)
Volume 6, Issue 3 (November 2007)
Article No. 11
Year of Publication: 2007
ISSN: 1530-0226
Authors
Fumiyo Fukumoto, University of Yamanashi, Kofu, Japan
Yoshimi Suzuki, University of Yamanashi, Kofu, Japan
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1290002.1290005

ABSTRACT

In this paper, we address the problem of skewed data in topic tracking, namely that far fewer stories are labeled positive than negative, and propose a method for estimating effective training stories for the topic-tracking task. To compensate for the small number of labeled positive stories, we use bilingual comparable corpora (English and Japanese) together with the EDR bilingual dictionary, and extract story pairs consisting of positive stories and associated stories. To cope with the large number of labeled negative stories, we classify them into clusters using a semisupervised clustering algorithm that combines k-means with EM. The method was tested on the TDT English corpus, and the results showed that the system works well when the tracked topic concerns an event originating in the source-language country, even with a small number of initial positive training stories.
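
The abstract does not say how k-means and EM are combined, so the following is only a minimal sketch of one common arrangement, not the authors' method. It assumes that a few labeled points ("seeds") stay clamped to their clusters, that k-means with hard assignments supplies the initial centroids, and that EM then refines them as a mixture of unit-variance spherical Gaussians. The function names `seeded_kmeans_em`, `sq_dist`, and `weighted_mean` are hypothetical, and the toy 2-D points stand in for story vectors.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_mean(points, weights):
    """Weighted mean of a list of vectors."""
    total = sum(weights)
    return [sum(w * p[d] for p, w in zip(points, weights)) / total
            for d in range(len(points[0]))]

def seeded_kmeans_em(points, seeds, k, kmeans_iters=10, em_iters=10):
    """points: list of vectors; seeds: {point index: cluster id} (assumed labels)."""
    if set(seeds.values()) != set(range(k)):
        raise ValueError("each cluster needs at least one seed")
    n = len(points)

    # Initialise each centroid from the mean of its seed points.
    centroids = []
    for c in range(k):
        members = [points[i] for i, lab in seeds.items() if lab == c]
        centroids.append(weighted_mean(members, [1.0] * len(members)))

    # Stage 1: k-means with hard assignments; seeds keep their given label.
    for _ in range(kmeans_iters):
        labels = []
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            labels.append(seeds.get(i, nearest))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            centroids[c] = weighted_mean(members, [1.0] * len(members))

    # Stage 2: EM with soft assignments (unit-variance spherical Gaussians).
    priors = [1.0 / k] * k
    for _ in range(em_iters):
        resp = []
        for i, p in enumerate(points):
            if i in seeds:  # E-step for a seed: responsibility is one-hot
                resp.append([1.0 if c == seeds[i] else 0.0 for c in range(k)])
                continue
            logs = [math.log(priors[c]) - 0.5 * sq_dist(p, centroids[c])
                    for c in range(k)]
            top = max(logs)  # subtract the max for numerical stability
            exps = [math.exp(v - top) for v in logs]
            z = sum(exps)
            resp.append([e / z for e in exps])
        for c in range(k):  # M-step: update mixing weights and centroids
            col = [r[c] for r in resp]
            priors[c] = sum(col) / n
            centroids[c] = weighted_mean(points, col)

    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return centroids, labels

if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(seeded_kmeans_em(toy, seeds={0: 0, 3: 1}, k=2))
```

Clamping the seeds is what makes the sketch semisupervised: unlabeled points can move between clusters, while labeled points anchor each cluster to a known class.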


