News article extraction with template-independent wrapper
Source
International World Wide Web Conference
Proceedings of the 18th international conference on World wide web
Madrid, Spain
POSTER SESSION: Wednesday, April 22, 2009
Pages 1085-1086
Year of Publication: 2009
ISBN:978-1-60558-487-4
Authors
Junfeng Wang  Zhejiang University, Hangzhou, China
Xiaofei He  Zhejiang University, Hangzhou, China
Can Wang  Zhejiang University, Hangzhou, China
Jian Pei  Simon Fraser University, Central City, Canada
Jiajun Bu  Zhejiang University, Hangzhou, China
Chun Chen  Zhejiang University, Hangzhou, China
Ziyu Guan  Zhejiang University, Hangzhou, China
Gang Lu  College of Information, Zhejiang University of Finance and Economics, Hangzhou, China
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1526709.1526868

ABSTRACT

We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
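
The idea of learning from one site and extracting from another can be illustrated with a toy sketch. This is not the authors' method: the abstract does not list the paper's features or learner, so everything here is an assumption made for illustration, including the `Block` representation, the three features (capped length, link density, punctuation density), the nearest-centroid selection, and the word-overlap heuristic standing in for the title-body correlation. The only point it shares with the abstract is that no feature refers to tag names, CSS classes, or DOM paths, so nothing ties the learned model to one template.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One visible text block of a page (hypothetical representation)."""
    text: str
    link_words: int  # number of words in the block that sit inside hyperlinks


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def features(block):
    """Template-free features: nothing here depends on markup structure."""
    n = max(len(tokens(block.text)), 1)
    punct = len(re.findall(r"[.,;:!?]", block.text))
    return (
        min(n, 100) / 100.0,             # capped length
        min(block.link_words / n, 1.0),  # link density
        min(punct / n, 1.0),             # punctuation density
    )


def learn_body_centroid(body_blocks):
    """Average the feature vectors of labeled body blocks from one site."""
    vectors = [features(b) for b in body_blocks]
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pick_body(blocks, centroid):
    """Choose the block whose features lie closest to the learned centroid."""
    return min(blocks, key=lambda b: sq_dist(features(b), centroid))


def pick_title(blocks, body):
    """Choose the short, link-free block sharing the most vocabulary with the body."""
    body_vocab = set(tokens(body.text))
    candidates = [b for b in blocks
                  if b is not body and b.link_words == 0 and len(tokens(b.text)) <= 20]

    def overlap(b):
        vocab = set(tokens(b.text))
        return len(vocab & body_vocab) / max(len(vocab), 1)

    return max(candidates, key=overlap, default=None)


# Labeled body blocks from a single (made-up) training site.
training_bodies = [
    Block("The city council voted on Tuesday to approve the new transit plan, "
          "which adds three bus lines. Officials said construction will begin "
          "next spring, and the first line should open within two years.", 0),
    Block("Heavy rain flooded several streets on Monday, and schools in the east "
          "district closed early. Forecasters expect the storm to weaken "
          "overnight, but drivers were told to avoid low roads.", 0),
]
centroid = learn_body_centroid(training_bodies)

# Blocks from a page on a different (made-up) site with an unseen layout.
page = [
    Block("Home World Science Sports Weather", 5),
    Block("Coral reef shows signs of recovery", 0),
    Block("Scientists surveying the reef this spring found young coral growing "
          "on sites that bleached five years ago. The team said the recovery is "
          "fragile, and warmer water could reverse the gains.", 0),
    Block("Copyright 2009 Example Daily. All rights reserved.", 0),
    Block("Related: Ocean heat hits record", 4),
]

body = pick_body(page, centroid)
title = pick_title(page, body)
print("title:", title.text)
print("body:", body.text)
```

Selecting the single closest block per page, rather than classifying every block independently, keeps the toy stable with only two training examples; a real system would need far richer features and evaluation than this sketch provides.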


