Extracting context to improve accuracy for HTML content extraction
Source International World Wide Web Conference archive
Special interest tracks and posters of the 14th international conference on World Wide Web
Chiba, Japan
POSTER SESSION: Posters
Pages: 1114 - 1115  
Year of Publication: 2005
ISBN:1-59593-051-5
Authors
Suhit Gupta  Columbia University, New York, NY
Gail Kaiser  Columbia University, New York, NY
Salvatore Stolfo  Columbia University, New York, NY
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 62,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1062745.1062895

ABSTRACT

Previous work on content extraction used various heuristics, such as link-to-text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings, such as removing table cells with high ratios of linked to non-linked text and removing all apparent advertising, might work very well for a news website but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, to associate a newly requested website with a particular "genre", and then employ settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
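
The idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the clustering algorithm, similarity measure, or setting values, so everything below is an assumption made for illustration: the bag-of-words cosine similarity, the `GenreClusters` class, the `GENRE_SETTINGS` table and its thresholds, and the definition of the ratio as linked characters divided by non-linked characters are all hypothetical and are not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical per-genre settings; names and values are illustrative only.
GENRE_SETTINGS = {
    "news": {"remove_ads": True, "max_link_text_ratio": 0.35},
    "shopping": {"remove_ads": False, "max_link_text_ratio": 0.90},
}
# Assumed fallback for genres with no previously determined settings.
DEFAULT_SETTINGS = {"remove_ads": True, "max_link_text_ratio": 0.50}


def bag_of_words(snippet):
    """Lowercased word counts for a search engine snippet."""
    return Counter(re.findall(r"[a-z]+", snippet.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class GenreClusters:
    """Incremental clustering: each cluster keeps a summed word-count centroid."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.centroids = {}         # genre label -> Counter

    def add(self, snippet, genre=None):
        """Assign a snippet to the closest cluster, or start a new one.

        Passing `genre` seeds or extends a labelled cluster directly.
        """
        vec = bag_of_words(snippet)
        if genre is None:
            best = max(
                self.centroids,
                key=lambda g: cosine(vec, self.centroids[g]),
                default=None,
            )
            if best is not None and cosine(vec, self.centroids[best]) >= self.threshold:
                genre = best
            else:
                genre = f"cluster-{len(self.centroids)}"
        self.centroids.setdefault(genre, Counter()).update(vec)
        return genre


def settings_for(genre):
    """Look up the settings previously chosen for a genre."""
    return GENRE_SETTINGS.get(genre, DEFAULT_SETTINGS)


def keep_cell(linked_chars, plain_chars, settings):
    """Keep a table cell unless its linked to non-linked text ratio is too high."""
    ratio = linked_chars / max(plain_chars, 1)
    return ratio <= settings["max_link_text_ratio"]


clusters = GenreClusters()
clusters.add("breaking news headlines world politics report", genre="news")
clusters.add("buy cheap shoes online shop cart checkout deals", genre="shopping")

# A newly requested site is matched to a genre through its snippet.
genre = clusters.add("latest world news and politics headlines")
print(genre, keep_cell(30, 50, settings_for(genre)))
```

In this sketch the same table cell (30 linked characters, 50 non-linked) is removed under the hypothetical "news" settings but kept under the "shopping" settings, which mirrors the abstract's point that one collection of settings does not suit every kind of site.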



