Extracting context to improve accuracy for HTML content extraction
Source
International World Wide Web Conference
Special interest tracks and posters of the 14th international conference on World Wide Web
Chiba, Japan
POSTER SESSION: Posters
Pages: 1114 - 1115
Year of Publication: 2005
ISBN: 1-59593-051-5
ABSTRACT
Previous work on content extraction used various heuristics, such as link-to-text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings, such as removing table cells with high linked to non-linked text ratios and removing all apparent advertising, might work very well for a news website but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, to associate a newly requested website with a particular "genre" and then employ settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
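The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the genre names, the threshold values, the `Block` and `Settings` types, and the use of Jaccard word overlap as the snippet similarity measure are all assumptions made for illustration. A snippet is assigned to the most similar existing cluster (or starts a new one), the cluster's genre selects a link-ratio threshold, and blocks whose linked-text ratio exceeds that threshold are dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_link_ratio: float  # drop a block whose linked/total text ratio exceeds this


@dataclass(frozen=True)
class Block:
    text: str
    linked_chars: int  # number of characters of `text` that sit inside links


# Hypothetical genres and thresholds, for illustration only.
GENRE_SETTINGS = {
    "news": Settings(max_link_ratio=0.5),
    "shopping": Settings(max_link_ratio=0.95),
}
DEFAULT_SETTINGS = Settings(max_link_ratio=0.8)


def jaccard(a: set, b: set) -> float:
    """Word-set overlap; an assumed stand-in for the paper's similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_cluster(snippet: str, clusters: list, threshold: float = 0.2) -> int:
    """Add the snippet to the most similar cluster, or start a new cluster."""
    words = set(snippet.lower().split())
    best, best_sim = -1, 0.0
    for i, vocab in enumerate(clusters):
        sim = jaccard(words, vocab)
        if sim > best_sim:
            best, best_sim = i, sim
    if best >= 0 and best_sim >= threshold:
        clusters[best] |= words  # grow the cluster vocabulary incrementally
        return best
    clusters.append(words)
    return len(clusters) - 1


def extract(blocks: list, settings: Settings) -> list:
    """Keep the text of blocks whose linked-text ratio is within the threshold."""
    kept = []
    for block in blocks:
        ratio = block.linked_chars / len(block.text) if block.text else 1.0
        if ratio <= settings.max_link_ratio:
            kept.append(block.text)
    return kept
```

In use, a caller would keep a mapping from cluster index to genre label, look up `GENRE_SETTINGS` for the assigned cluster (falling back to `DEFAULT_SETTINGS` for an unlabeled cluster), and pass the result to `extract`. With the illustrative thresholds above, a navigation bar made almost entirely of links is dropped under the "news" settings but kept under the "shopping" settings.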