Web article extraction for web printing: a DOM+visual based approach
Source
Proceedings of the 9th ACM symposium on Document engineering
Munich, Germany
Session: Document analysis (I)
Pages 66-69
Year of Publication: 2009
ISBN: 978-1-60558-575-8
Authors
Ping Luo, HP Labs, Beijing, China
Jian Fan, HP Labs, Palo Alto, CA, USA
Sam Liu, HP Labs, Palo Alto, CA, USA
Fen Lin, HP Labs, Beijing, China
Yuhong Xiong, HP Labs, Beijing, China
Jerry Liu, HP Labs, Palo Alto, CA, USA
Sponsors
SIGDOC: ACM Special Interest Group for Design of Communications
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1600193.1600208

ABSTRACT

This work studies the problem of extracting articles from Web pages for better printing. Compared with existing applications of article extraction, Web printing poses several unique requirements: 1) Identifying just the boundary surrounding the text body is not an ideal solution; it is highly desirable to also filter out uninformative links and advertisements within this boundary. 2) Paragraphs, which may not be readily separated as DOM nodes, must be identified so that the article can be laid out well. 3) Extraction performance should be independent of content domain, written language, and Web page template. Toward these goals we propose a novel method of article extraction that uses both DOM (Document Object Model) and visual features. The main components of our method are: 1) a text segment/paragraph identification algorithm based on line-breaking features, 2) a global optimization method, Maximum Scoring Subsequence, applied to text segments to identify the boundary of the article body, and 3) an outlier elimination step based on left or right alignment of text segments with the article body. Our experiments showed that the proposed method is effective in terms of precision and recall at the level of text segments.
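
As an illustration only, components 2) and 3) of the abstract can be sketched in Python. This is not the authors' implementation: the abstract does not say how segments are scored or how alignment is measured, so the `Segment` fields, the per-segment `score`, the pixel `tolerance`, and the use of the most common edge as the "body" edge are all assumptions made for this sketch. The boundary step uses the standard linear-time scan for the single contiguous run with the highest total score.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical text segment: a score plus rendered left/right x-coordinates."""
    text: str
    score: float  # assumed: positive for article-like text, negative otherwise
    left: int
    right: int


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run with the highest sum.

    `end` is exclusive. Returns (0, 0, 0.0) when no run has a positive sum.
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, s in enumerate(scores):
        if run_total <= 0:
            # A non-positive prefix can never help; restart the run here.
            run_start, run_total = i, s
        else:
            run_total += s
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


def drop_misaligned(body: List[Segment], tolerance: int = 2) -> List[Segment]:
    """Keep segments whose left OR right edge matches the body's dominant edge.

    Assumption: the dominant edge is the most frequent coordinate in the body.
    """
    if not body:
        return []
    body_left = Counter(s.left for s in body).most_common(1)[0][0]
    body_right = Counter(s.right for s in body).most_common(1)[0][0]
    return [
        s for s in body
        if abs(s.left - body_left) <= tolerance
        or abs(s.right - body_right) <= tolerance
    ]


if __name__ == "__main__":
    page = [
        Segment("nav", -1.0, 0, 100),
        Segment("p1", 2.0, 200, 800),
        Segment("p2", 1.5, 200, 800),
        Segment("ad", -0.5, 450, 650),
        Segment("p3", 2.5, 200, 790),
        Segment("footer", -2.0, 0, 1000),
    ]
    start, end, _ = max_scoring_subsequence([s.score for s in page])
    article = drop_misaligned(page[start:end])
    print([s.text for s in article])
```

In this toy page the highest-scoring run spans "p1" through "p3" and still contains the embedded "ad" segment, which is the situation requirement 1) describes. The alignment filter then removes it because neither of its edges lines up with the rest of the body.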


REFERENCES

1. J. Pasternack and D. Roth. Extracting article text from the web with maximum subsequence segmentation. In Proceedings of the 18th WWW, 2009.
2. W. Ruzzo and M. Tompa. A linear time algorithm for finding all maximal scoring subsequences. In Proceedings of ISMB, 1999.
3. J. Wang, X. He, C. Wang, J. Pei, J. Bu, C. Chen, Z. Guan, and W. V. Zhang. Can we learn a template-independent wrapper for news article extraction from a single training site? In Proceedings of the 15th SIGKDD, 2009.