CoreEx: Content extraction from online news articles
Source
Conference on Information and Knowledge Management archive
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
POSTER SESSION: Poster session 1/knowledge management
Pages 1391-1392  
Year of Publication: 2008
ISBN:978-1-59593-991-3
Authors
Jyotika Prasad  Stanford University, Stanford, CA, USA
Andreas Paepcke  Stanford University, Stanford, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 9,   Downloads (12 Months): 91,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1458082.1458295

ABSTRACT

We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text, the number of links it contains, and additional heuristics. The method is site-independent and does not use any language-based features. We tested our algorithm on a set of 1120 news article pages from 27 domains. Our algorithm achieved over 97% precision and 98% recall, and an average processing speed of under 15 ms per page.
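
The idea of scoring DOM nodes by text and link counts can be illustrated with a minimal Python sketch. The abstract does not state the scoring formula or the additional heuristics, so everything specific here is an assumption made for illustration: the score (words outside links minus a fixed penalty per link), the `link_penalty` value, and the names `Node`, `TreeBuilder`, and `extract_main` are hypothetical and are not taken from CoreEx.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.words = 0  # words outside <a>; subtree total after aggregation
        self.links = 0  # number of <a> elements; subtree total after aggregation


class TreeBuilder(HTMLParser):
    """Builds a Node tree and records per-node word and link counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag == "a":
            node.links = 1
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in ("script", "style"):
            return
        node = self.current
        while node is not None:
            if node.tag == "a":  # link text does not count as article text
                return
            node = node.parent
        self.current.words += len(data.split())


def aggregate(node):
    """Turn per-node counts into subtree totals (post-order)."""
    for child in node.children:
        aggregate(child)
        node.words += child.words
        node.links += child.links


def extract_main(html, link_penalty=10.0):
    """Return the element with the highest assumed score, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    aggregate(builder.root)
    best, best_score = None, float("-inf")
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        # Assumed score: reward text, penalize links. Not the paper's formula.
        score = node.words - link_penalty * node.links
        if score > best_score:
            best, best_score = node, score
        stack.extend(node.children)
    return best
```

Under this assumed score, a container holding both the story and a navigation bar is penalized for the navigation links, so a text-heavy, link-free block tends to win over its ancestors. A real extractor would need further heuristics, which the abstract mentions but does not describe.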


