ABSTRACT
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing the font size or removing HTML and data components such as images, which takes away from a web page's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire web page in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques that incorporate the advantages of previous work on content extraction. Our key insight is to work with DOM trees rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.
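To illustrate what working with a DOM tree rather than raw markup can look like, the following is a minimal sketch and not the framework described above. It builds a small tree with Python's standard `html.parser`, then prunes subtrees before collecting text. The `Node` class, the `CLUTTER_TAGS` set, the list of container tags, and the 0.5 link-text threshold are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}
CLUTTER_TAGS = {"script", "style", "iframe", "img"}   # assumed filter set
CONTAINER_TAGS = {"ul", "ol", "table", "div"}         # assumed link-list candidates


class Node:
    """Hypothetical minimal DOM node; tag is '#text' for text nodes."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never have children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            node = Node("#text", parent=self.current)
            node.text = data.strip()
            self.current.children.append(node)


def text_length(node, only_links=False, inside_link=False):
    """Count text characters under node, optionally only those inside <a>."""
    inside_link = inside_link or node.tag == "a"
    if node.tag == "#text":
        return len(node.text) if (inside_link or not only_links) else 0
    return sum(text_length(c, only_links, inside_link) for c in node.children)


def prune(node, max_link_ratio=0.5):
    """Drop clutter elements and containers whose text is mostly link text."""
    kept = []
    for child in node.children:
        if child.tag in CLUTTER_TAGS:
            continue
        if child.tag in CONTAINER_TAGS:
            total = text_length(child)
            if total and text_length(child, only_links=True) / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    if node.tag == "#text":
        return [node.text]
    return [t for c in node.children for t in collect_text(c)]


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return " ".join(collect_text(builder.root))
```

Because each filter operates on whole subtrees, a further heuristic can be added as another pass over the tree without touching the parsing step, which is the kind of extensibility that tree-based processing makes straightforward.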