Co-clustering based classification for out-of-domain documents
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
San Jose, California, USA
Session: Research track papers
Pages: 210-219
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Wenyuan Dai, Shanghai Jiao Tong University
Gui-Rong Xue, Shanghai Jiao Tong University
Qiang Yang, Hong Kong University of Science and Technology
Yong Yu, Shanghai Jiao Tong University
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281192.1281218

ABSTRACT

In many real-world applications, labeled data are in short supply. Obtaining labeled data in a new domain is often expensive and time-consuming, while plenty of labeled data may be available from a related but different domain. Traditional machine learning does not cope well with learning across different domains. In this paper, we address this problem for a text-mining task in which the labeled data follow one distribution in one domain and are known as in-domain data, while the unlabeled data come from a related but different domain and are known as out-of-domain data. Our general goal is to learn from the in-domain data and apply the learned knowledge to the out-of-domain data. We propose a co-clustering based classification (CoCC) algorithm to tackle this problem. Co-clustering is used as a bridge to propagate the class structure and knowledge from the in-domain data to the out-of-domain data. We present theoretical and empirical analysis to show that our algorithm is able to produce high-quality classification results, even when the two data sets follow different distributions. The experimental results show that our algorithm greatly improves classification performance over traditional learning algorithms.
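
The bridging idea in the abstract can be illustrated with a small sketch. This is not the CoCC algorithm itself: the abstract does not state its objective function or update rules, so the code below substitutes a much simpler hard-vote alternation between word clusters and document labels. The function name `cocc_sketch`, its parameters, and the toy documents are all hypothetical and chosen only for illustration. Words shared between the two domains carry class information from the labeled documents to the unlabeled ones, and words seen only in the out-of-domain data pick up a class from the documents they appear in.

```python
from collections import Counter, defaultdict


def cocc_sketch(in_docs, in_labels, out_docs, classes, n_iter=5):
    """Toy label propagation through shared word clusters.

    Assumption: a simplified stand-in for illustration, not the paper's CoCC.
    Documents are lists of tokens; returns one label (or None) per out_docs entry.
    """
    # Word-by-class counts from the labeled in-domain documents (kept fixed).
    in_counts = defaultdict(Counter)
    for doc, label in zip(in_docs, in_labels):
        for word in doc:
            in_counts[word][label] += 1

    out_labels = [None] * len(out_docs)
    for _ in range(n_iter):
        # Step 1: cluster words by class, pooling in-domain counts with counts
        # from out-of-domain documents under their current labels.
        counts = defaultdict(Counter)
        for word, per_class in in_counts.items():
            counts[word].update(per_class)
        for doc, label in zip(out_docs, out_labels):
            if label is not None:
                for word in doc:
                    counts[word][label] += 1
        word_cluster = {
            word: max(classes, key=lambda c: counts[word][c]) for word in counts
        }

        # Step 2: relabel each out-of-domain document by a majority vote
        # over the clusters of the words it contains.
        new_labels = []
        for doc, old in zip(out_docs, out_labels):
            votes = Counter(word_cluster[w] for w in doc if w in word_cluster)
            if votes:
                new_labels.append(max(classes, key=lambda c: votes[c]))
            else:
                new_labels.append(old)  # no known words yet; keep previous label

        if new_labels == out_labels:
            break  # assignments stopped changing
        out_labels = new_labels
    return out_labels


# Hypothetical toy data: the two domains share only "engine" and "orbit".
in_docs = [["engine", "car", "wheel"], ["car", "road", "engine"],
           ["orbit", "rocket", "launch"], ["rocket", "moon", "orbit"]]
in_labels = ["auto", "auto", "space", "space"]
out_docs = [["engine", "truck", "diesel"], ["truck", "diesel", "cargo"],
            ["orbit", "satellite", "probe"], ["satellite", "probe", "telescope"]]

print(cocc_sketch(in_docs, in_labels, out_docs, ["auto", "space"]))
```

In this toy run, the second and fourth out-of-domain documents share no words with the in-domain data, so they can only be labeled after words such as "truck" and "satellite" have been clustered through the documents that do overlap.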

