ABSTRACT
In this paper, we propose a new approach to discovering informative content in a set of tabular documents (Web pages) from a Web site. Our system, InfoDiscoverer, first partitions each page into several content blocks according to the HTML <TABLE> tags it contains. Based on the occurrence of features (terms) across the set of pages, it calculates the entropy value of each feature. The entropy value of a content block is then defined from the entropy values of the features in that block. By analyzing this information measure, we propose a method to dynamically select the entropy threshold that classifies blocks as either informative or redundant. Informative content blocks are the distinctive parts of a page, whereas redundant content blocks are the parts it shares with other pages. Based on an answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision are greater than 0.956. That is, using this approach, the informative blocks (news articles) of these sites can be automatically separated from semantically redundant content such as advertisements, banners, navigation panels, and news categories. Adopting InfoDiscoverer as a preprocessor for information retrieval and extraction applications increases retrieval and extraction precision and reduces index size and extraction complexity.
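The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the sketch makes several assumptions: a feature's entropy is the Shannon entropy of its occurrence distribution over pages, normalized by the logarithm of the number of pages; a block's entropy is the mean entropy of its terms; and the threshold is a fixed value passed in by the caller rather than selected dynamically. The function names and the input layout (pages as lists of blocks, blocks as lists of terms) are hypothetical, and the <TABLE>-based partitioning step is assumed to have been done already.

```python
import math

def feature_entropies(pages):
    """Map each term to a normalized entropy in [0, 1].

    pages: list of pages; each page is a list of blocks; each block is a
    list of terms (assumed layout, not taken from the paper).
    """
    n = len(pages)
    counts = {}  # term -> occurrence count per page
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[i] += 1
    entropies = {}
    for term, per_page in counts.items():
        total = sum(per_page)
        # Shannon entropy of the term's distribution over pages.
        h = -sum((c / total) * math.log(c / total) for c in per_page if c > 0)
        # Normalize by log(n) so that 1.0 means "evenly spread over all pages".
        entropies[term] = h / math.log(n) if n > 1 else 0.0
    return entropies

def block_entropy(block, entropies):
    """Mean entropy of the block's terms (assumed definition).

    An empty block is treated as redundant, which is also an assumption.
    """
    if not block:
        return 1.0
    return sum(entropies[t] for t in block) / len(block)

def informative_blocks(pages, threshold):
    """Return (page_index, block_index) pairs with entropy <= threshold."""
    entropies = feature_entropies(pages)
    return [
        (i, j)
        for i, page in enumerate(pages)
        for j, block in enumerate(page)
        if block_entropy(block, entropies) <= threshold
    ]

# Toy site: every page repeats a navigation block and has one unique article.
pages = [
    [["home", "news", "sports"], ["election", "results", "announced"]],
    [["home", "news", "sports"], ["storm", "hits", "coast"]],
    [["home", "news", "sports"], ["team", "wins", "final"]],
]
print(informative_blocks(pages, threshold=0.5))
```

In this toy input, the navigation terms occur evenly on every page and so receive an entropy near 1, while each article term occurs on a single page and receives an entropy of 0. Any threshold between those values therefore keeps only the article blocks.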
CITED BY 31
Ruihua Song, Haifeng Liu, Ji-Rong Wen, Wei-Ying Ma, Learning block importance models for web pages, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
Shen Huang, Yong Yu, Shengping Li, Gui-Rong Xue, Lei Zhang, A study on combination of block importance and relevance to estimate page relevance, Special interest tracks and posters of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
Jie Tang, Hang Li, Yunbo Cao, Zhaohui Tang, Email data cleaning, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
Rakesh Agrawal, Howard Ho, François Jacquenet, Marielle Jacquenet, Mining information extraction rules from datasheets without linguistic parsing, Proceedings of the 18th international conference on Innovations in Applied Artificial Intelligence, p.510-520, June 22-24, 2005, Bari, Italy
Jie Han, Dingyi Han, Chenxi Lin, Hua-Jun Zeng, Zheng Chen, Yong Yu, Homepage live: automatic block tracing for web personalization, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
K. Selçuk Candan, Mehmet E. Dönderler, Terri Hedgpeth, Jong Wook Kim, Qing Li, Maria Luisa Sapino, SEA: Segment-enrich-annotate paradigm for adapting dialog-based content for improved accessibility, ACM Transactions on Information Systems (TOIS), v.27 n.3, p.1-45, May 2009
David Fernandes, Edleno S. de Moura, Berthier Ribeiro-Neto, Altigran S. da Silva, Marcos André Gonçalves, Computing block importance for searching on web sites, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, November 06-10, 2007, Lisbon, Portugal
Eunyee Koh, Daniel Caruso, Andruid Kerne, Ricardo Gutierrez-Osuna, Elimination of junk document surrogate candidates through pattern recognition, Proceedings of the 2007 ACM symposium on Document engineering, August 28-31, 2007, Winnipeg, Manitoba, Canada