Intra-document structural frequency features for semi-supervised domain adaptation
Source
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: KM: information extraction
Pages 1291-1300  
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
Andrew Arnold  Carnegie Mellon University, Pittsburgh, PA, USA
William W. Cohen  Carnegie Mellon University, Pittsburgh, PA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458253

ABSTRACT

In this work we try to bridge a gap often encountered by researchers who have few or no labeled examples from their desired target domain, yet have access to large amounts of labeled data from related but distinct source domains, and seemingly no way to transfer knowledge from one to the other. Experimentally, we focus on the problem of extracting protein mentions from academic publications in biology, where the source domain data are abstracts labeled with protein mentions and the target domain data are wholly unlabeled captions. To supplement the limited amount of annotated data, we mine the large number of full-text articles freely available on the Internet. By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we generate robust features that are insensitive to changes in the marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Finally, lacking labeled target testing data, we employ comparative user preference studies to evaluate the proposed methods relative to existing baselines.
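
The two ideas in the abstract (features drawn from how often a token occurs in each section of the same document, and high-confidence predictions on the target domain used as extra training signal) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the features precisely, so the function names `section_frequency_features` and `select_confident`, the whitespace tokenization, the section names, and the 0.1 and 0.9 thresholds are all assumptions made for illustration.

```python
from collections import Counter


def section_frequency_features(token, sections):
    """Relative frequency of `token` in each section of one document.

    `sections` maps a section name (hypothetical: "abstract", "caption",
    "full_text") to its raw text. Tokenization is a plain whitespace
    split, which is an assumption of this sketch.
    """
    features = {}
    target = token.lower()
    for name, text in sections.items():
        tokens = text.lower().split()
        counts = Counter(tokens)
        total = len(tokens)
        # Fraction of this section's tokens equal to the target token.
        features[f"freq_in_{name}"] = counts[target] / total if total else 0.0
    return features


def select_confident(scored, low=0.1, high=0.9):
    """Split (token, probability) pairs into confident positives and negatives.

    The thresholds are illustrative, not values taken from the paper.
    """
    positives = [tok for tok, p in scored if p >= high]
    negatives = [tok for tok, p in scored if p <= low]
    return positives, negatives


# A toy document with three sections (invented example text).
doc = {
    "abstract": "we study the protein kinase pkc in yeast",
    "caption": "localization of pkc in dividing cells",
    "full_text": "pkc binds actin and pkc is phosphorylated",
}
feats = section_frequency_features("PKC", doc)

# Hypothetical extractor scores on target-domain (caption) tokens.
positives, negatives = select_confident(
    [("pkc", 0.95), ("cells", 0.05), ("in", 0.5)]
)
```

In this sketch the per-section frequencies depend only on the document itself, not on which section is labeled, which is the sense in which such features might carry over from abstracts to captions. The confident positives and negatives could then be added to the training set for a target-domain extractor, while uncertain tokens such as "in" are left out.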


