Discovering informative content blocks from Web documents
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
POSTER SESSION: Poster papers
Pages: 588 - 593  
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Shian-Hua Lin  Academia Sinica, Nankang, Taipei 115, Taiwan
Jan-Ming Ho  Academia Sinica, Nankang, Taipei 115, Taiwan
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775047.775134

ABSTRACT

In this paper, we propose a new approach to discovering informative content in a set of tabular documents (or Web pages) from a Web site. Our system, InfoDiscoverer, first partitions each page into content blocks according to the HTML <TABLE> tags in the page. Based on the occurrence of features (terms) across the set of pages, it calculates an entropy value for each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, we propose a method that dynamically selects the entropy threshold used to classify blocks as either informative or redundant. Informative content blocks are the distinguishing parts of a page, whereas redundant content blocks are the parts common across pages. On an answer set built from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, with this approach the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. With InfoDiscoverer adopted as a preprocessor for information retrieval and extraction applications, retrieval and extraction precision will increase, and index size and extraction complexity will be reduced.
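
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the following are assumptions made for illustration: feature entropy is taken as Shannon entropy of a term's occurrence distribution across pages, normalized by using the number of pages as the log base; block entropy is taken as the mean entropy of the block's distinct terms; and the threshold is passed in as a fixed value rather than selected dynamically. The function names and the input layout (pages already split into blocks of terms) are hypothetical and are not taken from InfoDiscoverer.

```python
import math


def feature_entropy(counts):
    """Normalized Shannon entropy of one term's per-page occurrence counts.

    Assumption: the log base is the number of pages, so the value lies in
    [0, 1]. A term spread evenly over all pages scores near 1; a term that
    occurs on a single page scores 0.
    """
    total = sum(counts)
    n_pages = len(counts)
    if total == 0 or n_pages < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, n_pages)
    return h


def block_entropy(block_terms, term_entropy):
    """Assumed definition: mean entropy of the distinct terms in a block."""
    terms = set(block_terms)
    if not terms:
        return 1.0  # illustrative choice: treat empty blocks as redundant
    return sum(term_entropy[t] for t in terms) / len(terms)


def classify_blocks(pages, threshold):
    """Label every block as 'informative' or 'redundant'.

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms (hypothetical layout, e.g. one block per <TABLE>).
    threshold: fixed cut-off here; the paper selects it dynamically.
    """
    n = len(pages)
    counts = {}  # term -> list of occurrence counts, one per page
    for j, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[j] += 1
    term_entropy = {t: feature_entropy(c) for t, c in counts.items()}
    return [
        [
            "informative"
            if block_entropy(block, term_entropy) <= threshold
            else "redundant"
            for block in page
        ]
        for page in pages
    ]


# Toy site: the same navigation block on every page, one unique article each.
pages = [
    [["home", "news", "sports"], ["election", "vote", "results"]],
    [["home", "news", "sports"], ["storm", "rain", "flood"]],
    [["home", "news", "sports"], ["match", "goal", "score"]],
]
labels = classify_blocks(pages, threshold=0.5)
```

In this toy input the navigation terms occur once on every page, so their entropy is close to 1 and the block is labeled redundant, while each article's terms occur on one page only and score 0. The fixed threshold of 0.5 is arbitrary and only serves the example.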


