ABSTRACT
Previous work shows that a web page can be partitioned into multiple segments or blocks, and that these blocks are often not equally important. It has also been shown that distinguishing noisy or unimportant blocks from the rest of a page can facilitate web mining, search and accessibility. However, no uniform approach or model has been presented to measure the importance of different segments in web pages. Through a user study, we found that people do have a consistent view of the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to the blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments of the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
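The feature-construction step can be illustrated with a minimal sketch. It assumes a segmenter has already produced a bounding box and simple counts for each block. The `Block` fields, the normalization by page size, and the function name `block_features` are hypothetical choices made for illustration; the abstract names only position, size, and the numbers of images and links as example features.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A page block as a segmenter might report it (hypothetical fields)."""
    x: float        # left edge of the block, in pixels
    y: float        # top edge of the block, in pixels
    width: float    # block width, in pixels
    height: float   # block height, in pixels
    num_images: int
    num_links: int


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Build one feature vector from spatial and content properties.

    Normalizing position and size by the page dimensions is an assumption
    of this sketch, made so that blocks from pages of different sizes
    are comparable.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    # Spatial features: relative block center and relative area.
    center_x = (block.x + block.width / 2) / page_width
    center_y = (block.y + block.height / 2) / page_height
    rel_area = (block.width * block.height) / (page_width * page_height)
    # Content features: raw counts of images and links.
    return [center_x, center_y, rel_area,
            float(block.num_images), float(block.num_links)]


# Example: a large central block on a 1000 x 1000 pixel page.
main_block = Block(x=200, y=100, width=600, height=800, num_images=2, num_links=5)
print(block_features(main_block, page_width=1000, page_height=1000))
```

Vectors of this kind, paired with importance labels from human assessors, would form the training set for whichever learning algorithm is chosen.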