Can we learn a template-independent wrapper for news article extraction from a single training site?
Source
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
Session: Industrial track papers
Pages 1345-1354
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Junfeng Wang, Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China
Chun Chen, Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China
Can Wang, Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China
Jian Pei, School of Computer Science, Simon Fraser University, Vancouver, Canada
Jiajun Bu, Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China
Ziyu Guan, Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China
Wei Vivian Zhang, Microsoft Research, Redmond, WA, USA
ABSTRACT
Automatic news extraction from news pages is important in many Web applications such as news aggregation. However, the existing news extraction methods based on template-level wrapper induction have three serious limitations. First, the existing methods cannot correctly extract pages belonging to an unseen template. Second, it is costly to maintain up-to-date wrappers for a large number of news websites, because any change to a template may invalidate the corresponding wrapper. Last, the existing methods can extract only unformatted plain text, and thus are not user friendly. In this paper, we tackle the problem of template-independent Web news extraction in a user-friendly way. We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed. Correlations between news titles and news bodies are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. Moreover, our approach can extract not only text but also images and animations within the news bodies, and the extracted news articles keep the same visual style as in the original pages. In our experiments, a wrapper learned from 40 pages from a single news site achieved an accuracy of 98.1% on 3,973 news pages from 12 news sites.
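To make the notion of a template-independent signal concrete, the following is a minimal sketch and not the method of the paper. It assumes two illustrative features (text length and link density of each top-level `<div>`) and replaces the learned model with a hand-written score. The names `BlockStats`, `body_score`, and `pick_body` are hypothetical. The point is only that such features do not depend on a site's tag names, ids, or class names.

```python
from html.parser import HTMLParser


class BlockStats(HTMLParser):
    """Collect text length and link-text length for each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # one {"text": int, "link": int} dict per top-level <div>
        self.depth = 0     # current <div> nesting depth
        self.in_link = 0   # current <a> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.blocks.append({"text": 0, "link": 0})
            self.depth += 1
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        if self.depth == 0:
            return  # ignore text outside any <div>
        n = len(data.strip())
        self.blocks[-1]["text"] += n
        if self.in_link:
            self.blocks[-1]["link"] += n


def body_score(block):
    """Hand-set stand-in for a learned model: long text with few links scores high."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    return block["text"] * (1.0 - link_density)


def pick_body(html):
    """Return the index of the top-level <div> most likely to hold the article body."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    scores = [body_score(b) for b in parser.blocks]
    return max(range(len(scores)), key=scores.__getitem__)


# Two pages with unrelated markup conventions (both invented for this sketch).
page_a = ('<div class="nav"><a href="/">Home</a> <a href="/w">World</a></div>'
          '<div class="story">Long article text about an event that happened '
          'today in the city.</div>')
page_b = ('<div id="c1">Another long article body with several sentences of '
          'plain text. <a href="/x">source</a></div>'
          '<div id="c2"><a href="/a">Sports</a><a href="/b">Tech</a></div>')

print(pick_body(page_a), pick_body(page_b))
```

In a learned setting, a classifier trained on labeled pages from one site would take the place of `body_score`, with feature vectors of this kind as its input.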