Learning block importance models for web pages
Source International World Wide Web Conference archive
Proceedings of the 13th international conference on World Wide Web
New York, NY, USA
SESSION: Learning classifiers
Pages: 203 - 211  
Year of Publication: 2004
ISBN:1-58113-844-X
Authors
Ruihua Song  Microsoft Research Asia, Beijing, P.R. China
Haifeng Liu  University of Toronto, Toronto, ON, Canada
Ji-Rong Wen  Microsoft Research Asia, Beijing, P.R. China
Wei-Ying Ma  Microsoft Research Asia, Beijing, P.R. China
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/988672.988700

ABSTRACT

Previous work shows that a web page can be partitioned into multiple segments or blocks, and that the blocks in a page are often not equally important. It has also been shown that distinguishing noisy or unimportant blocks from the rest of a page can facilitate web mining, search and accessibility. However, no uniform approach or model has been presented for measuring the importance of different segments in web pages. Through a user study, we found that people do have a consistent view of the importance of blocks in web pages. In this paper, we investigate how to find a model that automatically assigns importance values to blocks in a web page. We define block importance estimation as a learning problem. First, we use a vision-based page segmentation algorithm to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms are used to train a model that assigns importance to the different segments of the web page. In our experiments, the best model achieves a Micro-F1 of 79% and a Micro-Accuracy of 85.9%, which is quite close to a person's view.
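
As an illustration of the feature-vector and evaluation steps described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The `Block` record, its field names, the choice of normalized features, and the pooled one-vs-rest definition of the micro-averaged scores are all assumptions made here; the abstract does not specify the exact features or metric definitions used in the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    """Hypothetical block record; the field names are illustrative assumptions."""
    x: float          # left edge of the block, in pixels
    y: float          # top edge of the block, in pixels
    width: float
    height: float
    num_images: int
    num_links: int


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Build a feature vector from spatial and content features of one block."""
    # Spatial features: block center and area, normalized by the page size
    # (the normalization is an assumption of this sketch).
    center_x = (block.x + block.width / 2) / page_width
    center_y = (block.y + block.height / 2) / page_height
    rel_area = (block.width * block.height) / (page_width * page_height)
    # Content features: raw counts of images and links.
    return [center_x, center_y, rel_area,
            float(block.num_images), float(block.num_links)]


def micro_scores(y_true: Sequence[int], y_pred: Sequence[int],
                 labels: Sequence[int]) -> Tuple[float, float]:
    """Micro-averaged F1 and accuracy from pooled one-vs-rest counts.

    This is one common definition, assumed here for illustration.
    """
    tp = fp = fn = tn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return f1, accuracy
```

In this sketch, a trained classifier would map each vector returned by `block_features` to an importance level, and `micro_scores` would compare the predicted levels with human labels. Because true negatives are pooled across classes, the micro-averaged accuracy is typically higher than the micro-averaged F1 on the same predictions.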

